TY - GEN
T1 - Improving LocalMaxs Multiword Expression Statistical Extractor
AU - Silva, Joaquim F.
AU - Cunha, José C.
N1 - Funding Information:
info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F04516%2F2020/PT
Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - The LocalMaxs algorithm extracts relevant Multiword Expressions from text corpora using a statistical approach. However, statistical extractors find it harder to obtain good practical results than linguistic approaches, which benefit from language-specific syntactic and/or semantic knowledge. First, this paper contributes an improvement to the LocalMaxs algorithm, based on a more selective evaluation of the cohesion of each Multiword Expression candidate with respect to its neighbourhood, and on a filtering criterion guided by the location of stopwords within each candidate. Second, a new language-independent method is presented for the automatic self-identification of stopwords in corpora, requiring no external stopword lists or linguistic tools. The results obtained for LocalMaxs reach Precision values of about 80% for English, French, German and Portuguese, an increase of around 12-13% over the previous LocalMaxs version. The self-identification of stopwords reaches high Precision for top-ranked stopword candidates.
AB - The LocalMaxs algorithm extracts relevant Multiword Expressions from text corpora using a statistical approach. However, statistical extractors find it harder to obtain good practical results than linguistic approaches, which benefit from language-specific syntactic and/or semantic knowledge. First, this paper contributes an improvement to the LocalMaxs algorithm, based on a more selective evaluation of the cohesion of each Multiword Expression candidate with respect to its neighbourhood, and on a filtering criterion guided by the location of stopwords within each candidate. Second, a new language-independent method is presented for the automatic self-identification of stopwords in corpora, requiring no external stopword lists or linguistic tools. The results obtained for LocalMaxs reach Precision values of about 80% for English, French, German and Portuguese, an increase of around 12-13% over the previous LocalMaxs version. The self-identification of stopwords reaches high Precision for top-ranked stopword candidates.
KW - LocalMaxs algorithm
KW - Multiword Expressions
KW - Statistical Extractor
KW - Stopwords
UR - http://www.scopus.com/inward/record.url?scp=85169691248&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-36021-3_13
DO - 10.1007/978-3-031-36021-3_13
M3 - Conference contribution
AN - SCOPUS:85169691248
SN - 978-3-031-36020-6
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 154
EP - 162
BT - Computational Science – ICCS 2023
A2 - Mikyška, Jiří
A2 - de Mulatier, Clélia
A2 - Paszynski, Maciej
A2 - Krzhizhanovskaya, Valeria V.
A2 - Dongarra, Jack J.
A2 - Sloot, Peter M. A.
PB - Springer
CY - Cham
T2 - 23rd International Conference on Computational Science, ICCS 2023
Y2 - 3 July 2023 through 5 July 2023
ER -