TY - GEN
T1 - The BioVisualSpeech European Portuguese sibilants corpus
AU - Grilo, Margarida
AU - Guimarães, Isabel
AU - Ascensão, Mariana
AU - Abad, Alberto
AU - Anjos, Ivo
AU - Magalhães, João
AU - Cavaco, Sofia
N1 - Funding Information:
Acknowledgements. This work was supported by the Portuguese Foundation for Science and Technology under projects BioVisualSpeech (CMUP-ERI/TIC/0033/2014), NOVA-LINCS (PEest/UID/CEC/04516/2019) and INESC-ID (UIDB/50021/2020).
We thank Cátia Pedrosa and Diogo Carrasco for the segmentation and annotation of our corpus, all postgraduate SLP students who collaborated in the data collection task, and the schools and children who participated in the study.
PY - 2020
Y1 - 2020
N2 - The development of reliable speech therapy computer tools that automatically classify speech productions depends on the quality of the speech data set used to train the classification algorithms. The data set should characterize the population in terms of age, gender and native language, but it should also have other important properties that characterize the population that is going to use the tool. Thus, apart from including samples from correct speech productions, it should also have samples from people with speech disorders. Also, the annotation of the data should include information on whether the phonemes are correctly or wrongly pronounced. Here, we present a corpus of European Portuguese children’s speech data that we are using in the development of speech classifiers for speech therapy tools for Portuguese children. The corpus includes data from children with speech disorders, and its labelling includes information about the speech production errors. This corpus, which has data from 356 children from 5 to 9 years of age, focuses on the European Portuguese sibilant consonants and can be used to train speech recognition models for tools to assist the detection and therapy of sigmatism.
KW - European Portuguese corpus
KW - Sibilants
KW - Speech sound disorders
UR - http://www.scopus.com/inward/record.url?scp=85081550541&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-41505-1_3
DO - 10.1007/978-3-030-41505-1_3
M3 - Conference contribution
AN - SCOPUS:85081550541
SN - 978-3-030-41504-4
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 23
EP - 33
BT - Computational Processing of the Portuguese Language - 14th International Conference, PROPOR 2020, Proceedings
A2 - Quaresma, Paulo
A2 - Vieira, Renata
A2 - Gonçalves, Teresa
A2 - Aluísio, Sandra
A2 - Moniz, Helena
A2 - Batista, Fernando
PB - Springer
CY - Cham
T2 - 14th International Conference on Computational Processing of the Portuguese Language, PROPOR 2020
Y2 - 2 March 2020 through 4 March 2020
ER -