An Empirical Model for n-gram Frequency Distribution in Large Corpora

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Statistical multiword extraction methods can benefit from knowledge of the n-gram (n ≥ 1) frequency distribution in natural language corpora, for indexing and for time and space optimization. Increasingly large corpora raise new challenges for investigating the large-scale behavior of n-gram frequency distributions, behavior that typically does not emerge in small corpora. We propose an empirical model, based on the assumption that n-gram language vocabularies are finite, to estimate the number of distinct n-grams in large corpora, as well as the sizes of the equal-frequency n-gram groups at the lower frequencies, starting from 1. The model was validated for n-grams with 1 ≤ n ≤ 6 against a wide range of real corpora in English and French, from 60 million up to 8 billion words. These are full, non-truncated corpus data: their frequency data cover the entire range of observed n-gram frequencies, from 1 up to the maximum. The model predicts that the number of distinct n-grams grows monotonically, approaching asymptotic plateaux as the corpus size tends to infinity. It also predicts that the sizes of the equal-frequency n-gram groups are non-monotonic functions of the corpus size.
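
The two quantities the abstract refers to can be made concrete with a small sketch. The Python below does not implement the proposed model, whose equations are not given in this record; it only measures, on a toy sentence, the number of distinct n-grams and the sizes of the equal-frequency groups (how many distinct n-grams occur exactly k times). The function names, the whitespace tokenization, and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (stored as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def equal_frequency_group_sizes(counts):
    """Map each frequency k to the number of distinct n-grams seen exactly k times."""
    return Counter(counts.values())


# Toy corpus (assumption: plain whitespace tokenization, no normalization).
tokens = "the cat sat on the mat and the cat slept".split()

for n in (1, 2):
    counts = ngram_counts(tokens, n)
    groups = equal_frequency_group_sizes(counts)
    print(f"n={n}: distinct={len(counts)}, group sizes={dict(sorted(groups.items()))}")
```

Repeating the same measurement on growing prefixes of a real corpus would give the empirical curves (distinct n-grams and group sizes versus corpus size) against which a model of this kind could be compared.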

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 24th Pacific-Asia Conference, PAKDD 2020, Proceedings
Editors: Hady W. Lauw, Ee-Peng Lim, Raymond Chi-Wing Wong, Alexandros Ntoulas, See-Kiong Ng, Sinno Jialin Pan
Place of Publication: Cham
Publisher: Springer
Pages: 840-851
Number of pages: 12
ISBN (Electronic): 978-3-030-47436-2
ISBN (Print): 978-3-030-47435-5
Publication status: Published - 2020
Event: 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2020 - Singapore, Singapore
Duration: 11 May 2020 → 14 May 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer
Volume: 12085 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2020
Country: Singapore
City: Singapore
Period: 11/05/20 → 14/05/20

Keywords

  • Large text corpora
  • n-gram frequency distribution
