An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

LocalMaxs extracts relevant multiword terms based on their cohesion but is computationally intensive, a critical issue for very large natural language corpora. The corpus properties concerning n-gram distribution determine the algorithm complexity and were empirically analyzed for corpora up to 982 million words. A parallel LocalMaxs implementation exhibits almost linear relative efficiency, speedup, and sizeup, when executed with up to 48 cloud virtual machines and a distributed key-value store. To reduce the remote data communication, we present a novel n-gram cache with cooperative-based warm-up, leading to reduced miss ratio and time penalty. A cache analytical model is used to estimate the performance of cohesion calculation of n-gram expressions, based on corpus empirical data. The model estimates agree with the real execution results.
Original language: English
Title of host publication: Proceedings of the 2016 IEEE 12th International Conference on eScience
Publisher: IEEE - Institute of Electrical and Electronic Engineers Inc
Pages: 120-129
Number of pages: 10
Publication status: Published - 3 Mar 2017

Keywords

  • Cloud Computing
  • Large Corpora
  • Multiword Terms
  • n-gram cache
  • Parallel Processing
  • Performance Evaluation
  • Statistical Extraction

Cite this

Gonçalves, C., Silva, J. F. F., & Cunha, J. A. C. E. (2017). An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs. In Proceedings of the 2016 IEEE 12th International Conference on eScience (pp. 120-129). [7870892] IEEE - Institute of Electrical and Electronic Engineers Inc.