TY - JOUR
T1 - Enhancing Automatic Speech Recognition
T2 - Effects of Semantic Audio Filtering on Models Performance
AU - Perezhohin, Yuriy
AU - Santos, Tiago
AU - Costa, Victor
AU - Peres, Fernando
AU - Castelli, Mauro
N1 - info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F04152%2F2020/PT#
https://doi.org/10.54499/UIDB/04152/2020#
Perezhohin, Y., Santos, T., Costa, V., Peres, F., & Castelli, M. (2024). Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance. IEEE Access, 12, 155136-155150. https://doi.org/10.1109/ACCESS.2024.3482970 --- This work was supported by MyNorth AI Research. This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia) under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS.
PY - 2024/12/31
Y1 - 2024/12/31
AB - This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method utilizes a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples that do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, like Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering.
KW - Automatic Speech Recognition
KW - Contrastive Learning
KW - Data Augmentation
KW - Embeddings
KW - Synthetic Data Filtering
KW - Text-to-Speech
UR - https://github.com/my-north-ai/semantic_audio_filtering
UR - http://www.scopus.com/inward/record.url?scp=85207761371&partnerID=8YFLogxK
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:001346098400001
U2 - 10.1109/ACCESS.2024.3482970
DO - 10.1109/ACCESS.2024.3482970
M3 - Article
SN - 2169-3536
VL - 12
SP - 155136
EP - 155150
JO - IEEE Access
JF - IEEE Access
ER -