TY - GEN
T1 - How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams
AU - Silva, Joaquim F.
AU - Cunha, Jose C.
N1 - info:eu-repo/grantAgreement/FCT/Concurso de avaliação no âmbito do Programa Plurianual de Financiamento de Unidades de I&D (2017%2F2018) - Financiamento Base/UIDB%2F04516%2F2020/PT#
N1 - Funding Information: This work is supported by NOVA LINCS (UIDB/04516/2020) with the financial support of FCT.IP.
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024/4/25
Y1 - 2024/4/25
AB - The prediction of the numbers of distinct word n-grams and their frequency distributions in text corpora is important in domains like information processing and language modelling. With big data corpora, there is an increased application complexity due to the large volume of data. Traditional studies have been confined to small or moderate size corpora leading to statistical laws on word frequency distributions. However, when going to very large corpora, some of the assumptions underlying those laws need to be revised, related to the corpus vocabulary and numbers of word occurrences. So, although it becomes critical to know how the corpus size influences those distributions, there is a lack of models that characterise such influence. This paper aims at filling this gap, enabling the prediction of the impact of corpus growth upon application time and space complexities. It presents a fully principled model, which, distinctively, considers words and multiwords in very large corpora, predicting the cumulative numbers of distinct n-grams above or equal to a given frequency in a corpus, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams, as a function of corpus size, in a language, assuming a finite n-gram vocabulary. The model applies to low occurrence frequencies, encompassing the larger populations of n-grams. Practical assessment with real corpora shows relative errors around 3%, stable over the considered ranges of n-gram frequencies, n-gram sizes and corpora sizes from million to billion words, for English and French.
KW - large corpora
KW - low-frequency n-grams
KW - n-gram distribution
UR - http://www.scopus.com/inward/record.url?scp=85192809352&partnerID=8YFLogxK
U2 - 10.1007/978-981-97-2259-4_16
DO - 10.1007/978-981-97-2259-4_16
M3 - Conference contribution
AN - SCOPUS:85192809352
SN - 9789819722617
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 210
EP - 222
BT - Advances in Knowledge Discovery and Data Mining - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, Proceedings
A2 - Yang, De-Nian
A2 - Xie, Xing
A2 - Tseng, Vincent S.
A2 - Pei, Jian
A2 - Huang, Jen-Wei
A2 - Lin, Jerry Chun-Wei
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024
Y2 - 7 May 2024 through 10 May 2024
ER -