LocalMaxs extracts relevant multiword terms based on their cohesion but is computationally intensive, a critical issue for very large natural language corpora. The corpus properties concerning n-gram distribution determine the algorithm's complexity and were empirically analyzed for corpora of up to 982 million words. A parallel LocalMaxs implementation exhibits almost linear relative efficiency, speedup, and sizeup when executed with up to 48 cloud virtual machines and a distributed key-value store. To reduce remote data communication, we present a novel n-gram cache with cooperative-based warm-up, which reduces the miss ratio and the time penalty. An analytical cache model is used to estimate the performance of the cohesion calculation for n-gram expressions, based on empirical corpus data. The model estimates agree with the real execution results.
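To make the cache idea concrete, the following is a minimal sketch of an n-gram frequency cache placed in front of a slower remote lookup, with hit and miss counters. It is not the authors' implementation: the class `NGramCache`, the LRU eviction policy, the simple preloading `warm_up` method (which does not model the cooperative scheme), and the textbook estimate in `mean_access_time` are all assumptions chosen for illustration.

```python
from collections import OrderedDict
from typing import Callable, Dict, Tuple

NGram = Tuple[str, ...]


class NGramCache:
    """Hypothetical LRU cache of n-gram counts in front of a remote store."""

    def __init__(self, capacity: int, remote_lookup: Callable[[NGram], int]) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.remote_lookup = remote_lookup  # stands in for a key-value store call
        self.entries = OrderedDict()        # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def warm_up(self, counts: Dict[NGram, int]) -> None:
        """Preload known counts so early lookups are not all misses."""
        for ngram, count in counts.items():
            self._store(ngram, count)

    def get(self, ngram: NGram) -> int:
        """Return the count for an n-gram, fetching it remotely on a miss."""
        if ngram in self.entries:
            self.hits += 1
            self.entries.move_to_end(ngram)  # mark as most recently used
            return self.entries[ngram]
        self.misses += 1
        count = self.remote_lookup(ngram)
        self._store(ngram, count)
        return count

    def _store(self, ngram: NGram, count: int) -> None:
        self.entries[ngram] = count
        self.entries.move_to_end(ngram)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

    def miss_ratio(self) -> float:
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def mean_access_time(miss_ratio: float, hit_time: float, miss_penalty: float) -> float:
    """Textbook estimate: every access pays hit_time, misses also pay the penalty."""
    return hit_time + miss_ratio * miss_penalty
```

In this sketch, a lower miss ratio translates directly into a lower estimated mean access time, which is the kind of relationship an analytical cache model would quantify from measured corpus data.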
|Title of host publication|Proceedings of the 2016 IEEE 12th International Conference on eScience|
|Publisher|IEEE - Institute of Electrical and Electronic Engineers Inc|
|Number of pages|10|
|Publication status|Published - 3 Mar 2017|
- Cloud Computing
- Large Corpora
- Multiword Terms
- n-gram cache
- Parallel Processing
- Performance Evaluation
- Statistical Extraction
Gonçalves, C., Silva, J. F. F., & Cunha, J. A. C. E. (2017). An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs. In Proceedings of the 2016 IEEE 12th International Conference on eScience (pp. 120-129). IEEE - Institute of Electrical and Electronic Engineers Inc.