A theoretical model for n-gram distribution in big data corpora

Joaquim F. Silva, Carlos Gonçalves, José C. Cunha

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

There is a wide diversity of applications relying on the identification of sequences of n consecutive words (n-grams) occurring in corpora. Many studies follow an empirical approach for determining the statistical distribution of n-grams, but they are usually constrained by corpus sizes, which for practical reasons remain far below Big Data scale. However, Big Data sizes can expose behaviour that stays hidden at smaller scales in applications such as the extraction of relevant information from Web-scale sources. In this paper we propose a theoretical approach for estimating the number of distinct n-grams in each corpus. It is based on the Zipf-Mandelbrot Law and the Poisson distribution, and it allows an efficient estimation of the number of distinct 1-grams, 2-grams, ..., 6-grams, for any corpus size. The proposed model was validated for English and French corpora. We illustrate a practical application of this approach to the extraction of relevant expressions from natural language corpora, and predict its asymptotic behaviour for increasingly large sizes.
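As a rough illustration of how a Zipf-Mandelbrot rank distribution and a Poisson occurrence model can be combined to estimate a count of distinct n-grams, consider the minimal sketch below. It is not the model from the paper: the function names, the finite vocabulary size, and the parameter values `alpha` and `beta` are all hypothetical choices made for illustration. The sketch assumes each n-gram type of rank r has probability proportional to 1 / (r + beta)^alpha, and that its count in a corpus of N tokens is Poisson with mean N * p(r), so it occurs at least once with probability 1 - exp(-N * p(r)).

```python
import math


def zipf_mandelbrot_probs(vocab_size, alpha=1.0, beta=2.7):
    """Normalized probabilities p(r) proportional to 1 / (r + beta) ** alpha.

    vocab_size, alpha and beta are illustrative assumptions, not values
    taken from the paper.
    """
    weights = [1.0 / (r + beta) ** alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct(corpus_size, probs):
    """Expected number of distinct types seen in a corpus of corpus_size tokens.

    Under the Poisson assumption, a type with probability p is observed at
    least once with probability 1 - exp(-corpus_size * p). math.expm1 is used
    to keep precision when corpus_size * p is very small.
    """
    return sum(-math.expm1(-corpus_size * p) for p in probs)


if __name__ == "__main__":
    probs = zipf_mandelbrot_probs(vocab_size=1000)
    # The estimate grows with corpus size and saturates at the vocabulary size.
    for size in (100, 10_000, 1_000_000):
        print(size, expected_distinct(size, probs))
```

Under these assumptions the estimate can never exceed either the corpus size or the assumed vocabulary size, which gives a simple picture of the asymptotic saturation that the abstract refers to for increasingly large corpora.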

Original language: English
Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
Editors: Ronay Ak, George Karypis, Yinglong Xia, Xiaohua Tony Hu, Philip S. Yu, James Joshi, Lyle Ungar, Ling Liu, Aki-Hiro Sato, Toyotaro Suzumura, Sudarsan Rachuri, Rama Govindaraju, Weijia Xu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 134-141
Number of pages: 8
ISBN (Electronic): 9781467390040
Publication status: Published - 1 Jan 2016
Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: 5 Dec 2016 to 8 Dec 2016

Conference

Conference: 4th IEEE International Conference on Big Data, Big Data 2016
Country: United States
City: Washington
Period: 5/12/16 to 8/12/16

Keywords

  • Big Data
  • Extraction of Relevant Expressions
  • n-gram Models
  • Poisson Distribution
  • Zipf-Mandelbrot Law
