TY - GEN

T1 - An Empirical Model for n-gram Frequency Distribution in Large Corpora

AU - Silva, Joaquim F.

AU - Cunha, José C.

N1 - Acknowledgements to FCT MCTES, NOVA LINCS UID/CEC/04516/2019 and Carlos Gonçalves.

PY - 2020

Y1 - 2020

AB - Statistical multiword extraction methods can benefit from the knowledge on the n-gram (n ≥ 1) frequency distribution in natural language corpora, for indexing and time/space optimization purposes. The appearance of increasingly large corpora raises new challenges on the investigation of the large scale behavior of the n-gram frequency distributions, not typically emerging on small scale corpora. We propose an empirical model, based on the assumption of finite n-gram language vocabularies, to estimate the number of distinct n-grams in large corpora, as well as the sizes of the equal-frequency n-gram groups, which occur in the lower frequencies starting from 1. The model was validated for n-grams with 1 ≤ n ≤ 6, by a wide range of real corpora in English and French, from 60 million up to 8 billion words. These are full non-truncated corpora data, that is, their associated frequency data include the entire range of observed n-gram frequencies, from 1 up to the maximum. The model predicts the monotonic growth of the numbers of distinct n-grams until reaching asymptotic plateaux when the corpus size grows to infinity. It also predicts the non-monotonicity of the sizes of the equal-frequency n-gram groups as a function of the corpus size.

KW - Large text corpora

KW - n-gram frequency distribution

UR - http://www.scopus.com/inward/record.url?scp=85085727081&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-47436-2_63

DO - 10.1007/978-3-030-47436-2_63

M3 - Conference contribution

AN - SCOPUS:85085727081

SN - 978-3-030-47435-5

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 840

EP - 851

BT - Advances in Knowledge Discovery and Data Mining - 24th Pacific-Asia Conference, PAKDD 2020, Proceedings

A2 - Lauw, Hady W.

A2 - Lim, Ee-Peng

A2 - Wong, Raymond Chi-Wing

A2 - Ntoulas, Alexandros

A2 - Ng, See-Kiong

A2 - Pan, Sinno Jialin

PB - Springer

CY - Cham

T2 - 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2020

Y2 - 11 May 2020 through 14 May 2020

ER -