Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech

Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add...

Bibliographic Details
Main Author: Houston, Charles
Other Authors: Britz, Stefan S
Format: Thesis
Language: English
Published: Department of Statistical Sciences 2023
Subjects: Statistical Sciences
_version_ 1867613249859485696
access_status_str Open Access
author Houston, Charles
author2 Britz, Stefan S
collection Thesis
description Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech, an open-source Deep Learning-based speech recognition system developed by Mozilla, was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to fine-tuning, layer freezing, data augmentation, and re-initialization were also investigated. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network consisting of far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, thus demonstrating the ability of fine-tuning to improve upon the performance of a model initially trained on standard speech. While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best-performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model.
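The baseline WER of 141.53% reported in the description may look surprising, since it exceeds 100%. A minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words) shows how this can happen when a model inserts extra words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the thesis's own scoring script may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; assumes whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# A single-word reference recognized as two wrong words costs one
# substitution plus one insertion, so the WER is 2 / 1 = 200%.
print(f"{word_error_rate('yes', 'yet is'):.0%}")
```

Because insertions count as errors but do not lengthen the reference, a test set dominated by isolated words (such as commands and digits) can yield a WER above 100% whenever the model outputs several words per utterance.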
format Thesis
id oai:open.uct.ac.za:11427/37267
institution University of Cape Town (South Africa)
language eng
last_indexed 2026-06-10T12:33:08.525Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2023
publishDateRange 2023
publishDateSort 2023
publisher Department of Statistical Sciences
publisherStr Department of Statistical Sciences
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/37267 Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech Houston, Charles Britz, Stefan S Durbach, Ian Statistical Sciences 2023-03-06T10:16:35Z 2023-03-06T10:16:35Z 2022 2023-02-20T12:56:38Z Master Thesis Masters MSc http://hdl.handle.net/11427/37267 eng application/pdf Department of Statistical Sciences Faculty of Science
spellingShingle Statistical Sciences
Houston, Charles
Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
thesis_degree_str Master's
title Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
topic Statistical Sciences
url http://hdl.handle.net/11427/37267