n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora

Carlos Gonçalves, Joaquim F. Silva, Jose C. Cunha

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Statistical extraction of relevant n-grams in natural language corpora is important for text indexing and classification since it can be language independent. We show how a theoretical model identifies the distribution properties of the distinct n-grams and singletons appearing in large corpora and how this knowledge contributes to understanding the performance of an n-gram cache system used for extraction of relevant terms. We show how this approach allowed us to evaluate the benefits from using Bloom filters for excluding singletons and from using static prefetching of nonsingletons in an n-gram cache. In the context of the distributed and parallel implementation of the LocalMaxs extraction method, we analyze the performance of the cache miss ratio and size, and the efficiency of n-gram cohesion calculation with LocalMaxs.
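As a rough illustration of the idea of excluding singletons with a Bloom filter, the sketch below admits an n-gram to a count table only on its second sighting, so that n-grams occurring once never occupy a table entry. This is one common way to use a Bloom filter for this purpose and is not necessarily the design used by the authors; the names `BloomFilter`, `ngrams`, and `count_nonsingletons`, the filter size, and the hashing scheme are all assumptions made for this example.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Minimal Bloom filter over strings (hypothetical helper, not from the paper)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit
        # slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_nonsingletons(tokens, n, seen_once):
    """Count n-grams, storing an entry only from the second sighting on.

    Assumption: `seen_once` is an initially empty BloomFilter. A false
    positive in the filter can let a true singleton into the table with
    an overestimated count of 2.
    """
    counts = Counter()
    for gram in ngrams(tokens, n):
        if gram in counts:
            counts[gram] += 1
        elif gram in seen_once:
            counts[gram] = 2  # second sighting, or a Bloom false positive
        else:
            seen_once.add(gram)  # first sighting: remember it in the filter only
    return counts


tokens = "the cat sat on the mat and the cat slept".split()
print(count_nonsingletons(tokens, 2, BloomFilter()))
```

In this toy input only the bigram "the cat" occurs more than once, so it is the only one expected to reach the count table. The trade-off is the usual one for Bloom filters: memory per n-gram is small and fixed, at the cost of occasionally admitting a singleton.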

Original language: English
Title of host publication: Computational Science – ICCS 2019 - 19th International Conference, Proceedings
Editors: João M. F. Rodrigues, Pedro J. S. Cardoso, Jânio Monteiro, Roberto Lam, Valeria V. Krzhizhanovskaya, Michael H. Lees, Peter M. A. Sloot, Jack J. Dongarra
Place of Publication: Cham
Publisher: Springer
Pages: 75-88
Number of pages: 14
ISBN (Electronic): 978-3-030-22741-8
ISBN (Print): 978-3-030-22740-1
DOI: https://doi.org/10.1007/978-3-030-22741-8_6
Publication status: Published - 2019
Event: 19th International Conference on Computational Science, ICCS 2019 - Faro, Portugal
Duration: 12 Jun 2019 to 14 Jun 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer
Volume: 11537 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th International Conference on Computational Science, ICCS 2019
Country: Portugal
City: Faro
Period: 12/06/19 to 14/06/19

Keywords

  • Cloud computing
  • Large corpora
  • Multiword terms
  • n-gram cache performance
  • Parallel processing
  • Statistical extraction

Cite this

Gonçalves, C., Silva, J. F., & Cunha, J. C. (2019). n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora. In J. M. F. Rodrigues, P. J. S. Cardoso, J. Monteiro, R. Lam, V. V. Krzhizhanovskaya, M. H. Lees, P. M. A. Sloot, ... J. J. Dongarra (Eds.), Computational Science – ICCS 2019 - 19th International Conference, Proceedings (pp. 75-88). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11537 LNCS). Cham: Springer. https://doi.org/10.1007/978-3-030-22741-8_6