TY - GEN
T1 - A Model for Predicting n-gram Frequency Distribution in Large Corpora
AU - Silva, Joaquim F.
AU - Cunha, José C.
N1 - info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F04516%2F2020/PT#
PY - 2021
Y1 - 2021
N2 - The statistical extraction of multiwords (n-grams) from natural language corpora is challenged by computationally heavy searching and indexing, which can be improved by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (n ≥ 1), we model the sizes of groups of equal-frequency n-grams, for the low frequencies, k = 1, 2, …, by predicting the influence of the corpus size upon the Zipf’s law exponent and the n-gram group size. The average relative errors of the model predictions, from 1-grams up to 6-grams, are near 4%, for English and French corpora from 62 million to 8.6 billion words.
AB - The statistical extraction of multiwords (n-grams) from natural language corpora is challenged by computationally heavy searching and indexing, which can be improved by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (n ≥ 1), we model the sizes of groups of equal-frequency n-grams, for the low frequencies, k = 1, 2, …, by predicting the influence of the corpus size upon the Zipf’s law exponent and the n-gram group size. The average relative errors of the model predictions, from 1-grams up to 6-grams, are near 4%, for English and French corpora from 62 million to 8.6 billion words.
KW - Large corpora
KW - n-gram frequency distribution
UR - http://www.scopus.com/inward/record.url?scp=85111401717&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-77961-0_55
DO - 10.1007/978-3-030-77961-0_55
M3 - Conference contribution
SN - 978-3-030-77960-3
VL - 1
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 699
EP - 706
BT - Computational Science – ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part I
A2 - Paszynski, Maciej
A2 - Kranzlmüller, Dieter
A2 - Krzhizhanovskaya, Valeria V.
A2 - Dongarra, Jack J.
A2 - Sloot, Peter M. A.
PB - Springer
CY - Cham
T2 - 21st International Conference on Computational Science, ICCS 2021
Y2 - 16 June 2021 through 18 June 2021
ER -