TY - GEN
T1 - Using Covariance as a Similarity Measure for Document Language Identification in Hard Contexts
AU - Lopes, José Gabriel Pereira
AU - Silva, Joaquim Francisco Ferreira
PY - 2007/3/19
Y1 - 2007/3/19
AB - Existing Language Identification (LID) approaches achieve 100% precision in the most common situations, namely sufficiently large documents written in a single language. However, there are many situations where the language of a text is hard to identify and where current LID approaches do not provide a reliable solution. One such situation occurs when it is necessary to discriminate the correct variant of the language used in a text. In this paper, we present a fully statistics-based LID approach that is shown to classify common texts correctly and to remain robust when classifying hard LID documents. To that end, character sequences are used as base features. The Discriminant Ability of each sequence, in each training situation, is measured and used to filter out less important character sequences. A document similarity measure, based on the concept of covariance, is defined. In the training phase, document clusters are built in a reduced space of $k$ uncorrelated dimensions. In the classification phase, the Quadratic Discriminant Score determines which cluster (language) is assigned to each document to be classified.
KW - Statistical applications
M3 - Conference contribution
T3 - Pliska Studia Mathematica Bulgarica, Bulgaria
SP - 341
EP - 360
BT - Proceedings of the XII International Summer Conference on Probability and Statistics and Seminar on Statistical Data Analysis, SDA, 2006
A2 - Yanev, N.
PB - Institute of Mathematics and Informatics, Bulgarian Academy of Sciences
CY - Sofia
T2 - XII International Summer Conference on Probability and Statistics and Seminar on Statistical Data Analysis, SDA
Y2 - 1 January 2016
ER -