A Model for Predicting n-gram Frequency Distribution in Large Corpora

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

The statistical extraction of multiwords (n-grams) from natural language corpora is challenged by computationally heavy searching and indexing, which can be improved by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (n ≥ 1), we model the sizes of groups of equal-frequency n-grams, for the low frequencies k = 1, 2, …, by predicting the influence of the corpus size upon the Zipf’s law exponent and the n-gram group size. The average relative errors of the model predictions, from 1-grams up to 6-grams, are near 4% for English and French corpora from 62 million to 8.6 billion words.
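
As an illustration of the quantity the abstract refers to, the sketch below counts, for a toy token list, how many distinct n-grams occur exactly k times (the "group size" for frequency k). It is not the paper's predictive model, which is not reproduced here; it only shows the empirical measurement that such a model would aim to predict. The function names `ngram_counts` and `group_sizes` and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def group_sizes(tokens, n, max_k=3):
    """Return {k: number of distinct n-grams occurring exactly k times}, k = 1..max_k."""
    # Frequency of frequencies: how many distinct n-grams share each count.
    freq_of_freq = Counter(ngram_counts(tokens, n).values())
    return {k: freq_of_freq.get(k, 0) for k in range(1, max_k + 1)}


# Hypothetical toy corpus; a real corpus would have millions to billions of words.
tokens = "the cat sat on the mat and the cat ran".split()

print(group_sizes(tokens, 1))  # {1: 5, 2: 1, 3: 1}
print(group_sizes(tokens, 2))  # {1: 7, 2: 1, 3: 0}
```

On a large corpus, building these counts exactly is the expensive searching and indexing step the abstract mentions, which is the motivation for predicting the group sizes instead.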

Original language: English
Title of host publication: Computational Science – ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part I
Editors: Maciej Paszynski, Dieter Kranzlmüller, Valeria V. Krzhizhanovskaya, Jack J. Dongarra, Peter M. A. Sloot
Place of Publication: Cham
Publisher: Springer
Pages: 699-706
Number of pages: 8
Volume: 1
Edition: 1st
ISBN (Electronic): 978-3-030-77961-0
ISBN (Print): 978-3-030-77960-3
DOIs
Publication status: Published - 2021
Event: 21st International Conference on Computational Science, ICCS 2021 - Virtual, Online
Duration: 16 Jun 2021 – 18 Jun 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer
Volume: 12742 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Computational Science, ICCS 2021
City: Virtual, Online
Period: 16/06/21 – 18/06/21

Keywords

  • Large corpora
  • n-gram frequency distribution
