A Parallel Algorithm for Statistical Multiword Term Extraction from Very Large Corpora

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Multi-word Relevant Expressions (REs) can be defined as sequences of words (n-grams) with strong semantic meaning, such as “ice melting” and “Ministère des Affaires Étrangères”. They are useful in Information Retrieval, Document Clustering or Classification, and Indexing of Documents. The need to extract REs in several languages has led research toward statistical approaches rather than symbolic methods, since the former allow language independence. Based on the assumption that REs have strong cohesion between their consecutive n-grams, the LocalMaxs algorithm is a language-independent approach that extracts REs. Despite its good precision, this extractor is time-consuming and becomes inoperable for Big Data if implemented sequentially. This paper presents the first parallel and distributed version of this algorithm, achieving almost linear speedup and sizeup when processing corpora of up to 1 billion words, using up to 54 virtual machines in a public cloud.
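
To make the cohesion idea in the abstract concrete, the following is a minimal sequential sketch, not the parallel algorithm the paper presents. The abstract does not specify the cohesion measure or the exact selection rule, so everything here is an assumption: the SCP-style glue in `scp_glue`, the local-maximum test in `local_maxs` (an n-gram is kept when its glue is at least that of the (n-1)-grams it contains and strictly greater than that of the (n+1)-grams containing it), the `min_freq` filter, and all function names are illustrative choices.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n in the token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp_glue(ngram, counts, total):
    """Assumed cohesion measure: p(ngram)^2 divided by the average, over all
    split points, of p(left part) * p(right part). Probabilities are
    approximated as count / number of tokens."""
    n = len(ngram)

    def p(gram):
        return counts[gram] / total

    avg = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avg


def local_maxs(tokens, max_n=4, min_freq=2):
    """Return n-grams (2 <= n <= max_n) whose glue is a local maximum
    relative to their contained (n-1)-grams and containing (n+1)-grams."""
    # (n+1)-grams are needed to test candidates of length max_n.
    counts = ngram_counts(tokens, max_n + 1)
    total = len(tokens)
    glue = {g: scp_glue(g, counts, total) for g in counts if len(g) >= 2}

    # Index each n-gram's one-word extensions (it is their prefix or suffix).
    supers = {}
    for g in glue:
        if len(g) >= 3:
            for sub in (g[:-1], g[1:]):
                supers.setdefault(sub, []).append(g)

    selected = []
    for g, value in glue.items():
        if len(g) > max_n or counts[g] < min_freq:
            continue
        # Bigrams have no contained n-grams with a defined glue.
        subs = [g[:-1], g[1:]] if len(g) > 2 else []
        if all(value >= glue[s] for s in subs) and all(
            value > glue[y] for y in supers.get(g, [])
        ):
            selected.append(g)
    return selected
```

Calling `local_maxs("x ice melting y ice melting z".split(), max_n=2)` on this toy token list selects only the bigram `("ice", "melting")`. Because every candidate needs counts for its sub-n-grams and its one-word extensions, a sequential pass of this kind grows expensive on large corpora, which is the cost the abstract refers to.
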
Original language: English
Title of host publication: Proceedings - 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems (HPCC-CSS-ICESS 2015)
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 219-224
Number of pages: 6
ISBN (Electronic): 978-1-4799-8937-9
ISBN (Print): 978-1-4799-8938-6
Publication status: Published - 2015
Event: 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems (HPCC-ICESS-CSS 2015) - New York, United States
Duration: 24 Aug 2015 to 26 Aug 2015

Conference

Conference: 17th IEEE International Conference on High Performance Computing and Communications, IEEE 7th International Symposium on Cyberspace Safety and Security and IEEE 12th International Conference on Embedded Software and Systems (HPCC-ICESS-CSS 2015)
Abbreviated title: HPCC-ICESS-CSS 2015
Country: United States
City: New York
Period: 24/08/15 to 26/08/15

Keywords

  • Cloud
  • Large Corpora
  • Multiword Terms
  • Parallel Processing
  • Statistical Extraction
  • Text Mining
