Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance

Research output: Contribution to journal › Article › peer-review


Abstract

This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by using contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method uses a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples that do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, such as Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering.
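The filtering step described in the abstract can be illustrated with a minimal sketch: given paired audio and transcript embeddings from a contrastive model, keep only the synthetic samples whose audio representation is semantically close to its transcript. The function names, the cosine-similarity criterion, and the threshold value below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two matrices of embeddings."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_norm * b_norm, axis=1)

def filter_by_semantic_alignment(audio_embs, text_embs, threshold=0.9):
    """Return a boolean mask keeping samples whose audio embedding
    aligns with its transcript embedding above the threshold.
    (Threshold is a hypothetical tuning parameter.)"""
    sims = cosine_similarity(audio_embs, text_embs)
    return sims >= threshold, sims

# Toy example: three well-aligned pairs and one deliberately misaligned one.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
audio = text.copy()
audio[3] = -text[3]  # simulate a semantically mismatched synthetic sample
keep, sims = filter_by_semantic_alignment(audio, text)
# keep → [True, True, True, False]: the mismatched sample is filtered out
```

In practice the embeddings would come from the jointly trained audio and text encoders of the contrastive model, and the threshold would be chosen per model size and dataset, which is consistent with the abstract's finding that larger models tolerate more aggressive filtering.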
Original language: English
Pages (from-to): 155136-155150
Number of pages: 15
Journal: IEEE Access
Volume: 12
Early online date: 17 Oct 2024
DOIs
Publication status: Published - 31 Dec 2024

Keywords

  • Automatic Speech Recognition
  • Contrastive Learning
  • Data Augmentation
  • Embeddings
  • Synthetic Data Filtering
  • Text-to-Speech
