How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

The prediction of the numbers of distinct word n-grams and their frequency distributions in text corpora is important in domains like information processing and language modelling. With big data corpora, there is an increased application complexity due to the large volume of data. Traditional studies have been confined to small or moderate size corpora leading to statistical laws on word frequency distributions. However, when going to very large corpora, some of the assumptions underlying those laws need to be revised, related to the corpus vocabulary and numbers of word occurrences. So, although it becomes critical to know how the corpus size influences those distributions, there is a lack of models that characterise such influence. This paper aims at filling this gap, enabling the prediction of the impact of corpus growth upon application time and space complexities. It presents a fully principled model, which, distinctively, considers words and multiwords in very large corpora, predicting the cumulative numbers of distinct n-grams above or equal to a given frequency in a corpus, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams, as a function of corpus size, in a language, assuming a finite n-gram vocabulary. The model applies to low occurrence frequencies, encompassing the larger populations of n-grams. Practical assessment with real corpora shows relative errors around 3%, stable over the considered ranges of n-gram frequencies, n-gram sizes and corpora sizes from million to billion words, for English and French.
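
The two quantities the abstract refers to can be measured empirically on a small text. The following is a minimal sketch, not the model presented in the paper: it only counts, for a toy token list, the cumulative number of distinct n-grams with frequency at or above a threshold and the sizes of equal-frequency n-gram groups. The function names and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the occurrences of each word n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def equal_frequency_group_sizes(counts):
    """Map each frequency k to the number of distinct n-grams occurring exactly k times."""
    return Counter(counts.values())


def cumulative_distinct(counts, k):
    """Number of distinct n-grams whose frequency is greater than or equal to k."""
    return sum(1 for c in counts.values() if c >= k)


# Toy corpus (hypothetical); real studies use millions to billions of words.
tokens = "the cat sat on the mat and the cat ran".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
bigram_groups = equal_frequency_group_sizes(bigrams)

print(cumulative_distinct(bigrams, 1))  # distinct bigrams overall
print(cumulative_distinct(bigrams, 2))  # distinct bigrams seen at least twice
print(dict(bigram_groups))              # sizes of equal-frequency groups
```

Repeating these measurements on corpora of increasing size gives the empirical curves that a model of this kind would aim to predict.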

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, Proceedings
Editors: De-Nian Yang, Xing Xie, Vincent S. Tseng, Jian Pei, Jen-Wei Huang, Jerry Chun-Wei Lin
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 210-222
Number of pages: 13
ISBN (Print): 9789819722617
Publication status: Published - 25 Apr 2024
Event: 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024 - Taipei, Taiwan, Province of China
Duration: 7 May 2024 to 10 May 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14647 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024
Country/Territory: Taiwan, Province of China
City: Taipei
Period: 7/05/24 to 10/05/24

Keywords

  • large corpora
  • low-frequency n-grams
  • n-gram distribution
