Full Text Available

Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech

Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add...

Bibliographic Details
Main Author:	Houston, Charles
Other Authors:	Britz, Stefan S
Format:	Thesis
Language:	English
Published:	Department of Statistical Sciences 2023
Subjects:	Statistical Sciences

_version_	1867613249859485696
access_status_str	Open Access
author	Houston, Charles
author2	Britz, Stefan S
author_browse	Britz, Stefan S Houston, Charles
author_facet	Britz, Stefan S Houston, Charles
author_sort	Houston, Charles
collection	Thesis
description	Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning-based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to investigating fine-tuning, layer freezing, data augmentation and re-initialization were also investigated. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network consisting of far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, thus demonstrating the ability of fine-tuning to improve upon the performance of a model initially trained on standard speech. While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best-performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model.
format	Thesis
id	oai:open.uct.ac.za:11427/37267
institution	University of Cape Town (South Africa)
language	eng
last_indexed	2026-06-10T12:33:08.525Z
license_str	Not specified — see source repository
provenance_str_mv	Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate	2023
publishDateRange	2023
publishDateSort	2023
publisher	Department of Statistical Sciences
publisherStr	Department of Statistical Sciences
record_format	dspace
source_str	UCTD — University of Cape Town Open Access Repository
spelling	oai:open.uct.ac.za:11427/37267 Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech Houston, Charles Britz, Stefan S Durbach, Ian Statistical Sciences Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning-based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to investigating fine-tuning, layer freezing, data augmentation and re-initialization were also investigated. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network consisting of far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, thus demonstrating the ability of fine-tuning to improve upon the performance of a model initially trained on standard speech. While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best-performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model. 2023-03-06T10:16:35Z 2023-03-06T10:16:35Z 2022 2023-02-20T12:56:38Z Master Thesis Masters MSc http://hdl.handle.net/11427/37267 eng application/pdf Department of Statistical Sciences Faculty of Science
spellingShingle	Statistical Sciences Houston, Charles Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
thesis_degree_str	Master's
title	Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
title_full	Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
title_fullStr	Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
title_full_unstemmed	Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
title_short	Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
title_sort	adapting large scale speaker independent automatic speech recognition to dysarthric speech
topic	Statistical Sciences
url	http://hdl.handle.net/11427/37267
work_keys_str_mv	AT houstoncharles adaptinglargescalespeakerindependentautomaticspeechrecognitiontodysarthricspeech
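
Note on the reported metric: the description quotes a baseline word error rate (WER) of 141.53%, which may look surprising since it exceeds 100%. This record does not state how the thesis computed WER, so the sketch below assumes the conventional definition: the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. Under that assumption, a model that outputs several wrong words for a single-word prompt accumulates insertions and can score above 100%. The function name `word_error_rate` and the example transcripts are hypothetical and are not taken from the thesis or from UASpeech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, assuming WER = edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# Hypothetical single-word prompt transcribed as three unrelated words:
# one substitution plus two insertions against a one-word reference.
print(word_error_rate("alpha", "i have a"))
# One dropped word out of three reference words.
print(word_error_rate("open the door", "open door"))
```

Because the denominator is the reference length, short prompts such as isolated digits or command words make each inserted word costly, which is one plausible reading of how an unadapted model can exceed 100% on a test set of isolated words.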
