TY - JOUR
T1 - A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity
AU - Pichel, Jose Ramom
AU - Gamallo, Pablo
AU - Alegria, Inaki
AU - Neves, Marco
N1 - UIDB/04097/2020; UIDP/04097/2020; PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE; RTI2018-093336-B-C21
PY - 2020/3/1
Y1 - 2020/3/1
N2 - The aim of this paper is to apply a corpus-based methodology, based on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three languages. The three historical corpora have been constructed and collected with the closest spelling to the original on a balanced basis of fiction and non-fiction. This methodology has been applied to measure the historical distance of Galician with respect to Portuguese and Spanish, from the Middle Ages to the end of the 20th century, both in original spelling and automatically transcribed spelling. The quantitative results are contrasted with hypotheses extracted from experts in historical linguistics. Results show that Galician and Portuguese are varieties of the same language in the Middle Ages and that Galician converges and diverges with Portuguese and Spanish since the last period of the 19th century. In this process, orthography plays a relevant role. It should be pointed out that the method is unsupervised and can be applied to other languages.
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85080989880&doi=10.1080%2f09296174.2020.1732177&origin=inward&txGid=851371e8b9e73f127e536835a0949fe0#
U2 - 10.1080/09296174.2020.1732177
DO - 10.1080/09296174.2020.1732177
M3 - Article
SN - 0929-6174
VL - 27
SP - 1
EP - 32
JO - Journal of Quantitative Linguistics
JF - Journal of Quantitative Linguistics
ER -