National University of Kaohsiung Library and Information Center

Guo, Jinxi.

Neural Network Based Representation Learning and Modeling for Speech and Speaker Recognition.

Record type:	Bibliographic record - electronic resource : Monograph/item
Title proper/Author:	Neural Network Based Representation Learning and Modeling for Speech and Speaker Recognition.
Author:	Guo, Jinxi.
Publisher:	Ann Arbor : ProQuest Dissertations & Theses, 2019
Extent:	128 p.
Note:	Source: Dissertations Abstracts International, Volume: 81-04, Section: B.
Note:	Advisor: Alwan, Abeer A. H.
Contained By:	Dissertations Abstracts International 81-04B.
Subject:	Artificial intelligence.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=13900095
ISBN:	9781085676014

Neural Network Based Representation Learning and Modeling for Speech and Speaker Recognition. - Ann Arbor : ProQuest Dissertations & Theses, 2019 - 128 p.

Source: Dissertations Abstracts International, Volume: 81-04, Section: B.

Thesis (Ph.D.)--University of California, Los Angeles, 2019.

This item must not be sold to any third party vendors.

Deep learning and neural network research has grown significantly in the fields of automatic speech recognition (ASR) and speaker recognition. Compared to traditional methods, deep learning-based approaches are more powerful in learning representations from data and building complex models. In this dissertation, we focus on representation learning and modeling using neural network-based approaches for speech and speaker recognition.

In the first part of the dissertation, we present two novel neural network-based methods to learn speaker-specific and phoneme-invariant features for short-utterance speaker verification. We first propose to learn a spectral feature mapping from each speech signal to the corresponding subglottal acoustic signal, which has less phoneme variation, using deep neural networks (DNNs). The estimated subglottal features show better speaker-separation ability and provide complementary information when combined with traditional speech features on speaker verification tasks. Additionally, we propose another DNN-based mapping model, which maps the speaker representation extracted from short utterances to the speaker representation extracted from long utterances of the same speaker. Two non-linear regression models using an autoencoder are proposed to learn this mapping, and they both improve speaker verification performance significantly.

In the second part of the dissertation, we design several new neural network models which take raw speech features (either complex Discrete Fourier Transform (DFT) features or raw waveforms) as input, and perform feature extraction and phone classification jointly. We first propose a unified deep Highway (HW) network with a time-delayed bottleneck (TDB) layer in the middle for feature extraction. The TDB-HW networks with complex DFT features as input provide significantly lower error rates compared with hand-designed spectrum features on large-scale keyword spotting tasks. Next, we present a 1-D Convolutional Neural Network (CNN) model, which takes raw waveforms as input and uses convolutional layers to do hierarchical feature extraction. The proposed 1-D CNN model outperforms standard systems with hand-designed features. To further reduce the redundancy of the 1-D CNN model, we propose a filter sampling and combination (FSC) technique, which can reduce the model size by 70% and still improve performance on ASR tasks.

In the third part of the dissertation, we propose two novel neural network models for sequence modeling. We first propose an attention mechanism for acoustic sequence modeling. The attention mechanism can automatically predict the importance of each time step and select the most important information from sequences. Second, we present a sequence-to-sequence based spelling correction model for end-to-end ASR. The proposed correction model can effectively correct errors made by the ASR systems.
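The abstract says the attention mechanism predicts the importance of each time step and selects the most important information from a sequence. The record gives no formula, so the following is only a minimal sketch of generic softmax attention pooling, not necessarily the formulation used in the dissertation. The function name `attention_pool`, the scoring vector `w`, and the toy frames are all assumptions made for illustration.

```python
import math

def attention_pool(frames, w):
    """Pool T frame vectors into one vector using softmax attention.

    frames: list of T feature vectors, each a list of D floats.
    w: hypothetical learned scoring vector of length D (an assumption).
    """
    # One scalar importance score per time step (dot product with w).
    scores = [sum(wi * xi for wi, xi in zip(w, frame)) for frame in frames]
    # Numerically stable softmax: weights are positive and sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over time, so high-scoring frames dominate the result.
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights

# Toy sequence: the middle frame scores highest under w.
frames = [[0.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(frames, w=[1.0, 0.0])
```

In this toy input the middle frame receives the largest weight, so the pooled vector is pulled toward it. In a trained model the scoring parameters would be learned jointly with the rest of the network.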

ISBN: 9781085676014

Subjects--Topical Terms:

194058
Artificial intelligence.

Neural Network Based Representation Learning and Modeling for Speech and Speaker Recognition.
LDR:03953nmm a2200313 4500
001 570787
005 20200514111959.5
008 200901s2019 ||||||||||||||||| ||eng d
020 $a 9781085676014
035 $a (MiAaPQ)AAI13900095
035 $a AAI13900095
040 $a MiAaPQ $c MiAaPQ
100 1 $a Guo, Jinxi. $3 857492
245 1 0 $a Neural Network Based Representation Learning and Modeling for Speech and Speaker Recognition.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2019
300 $a 128 p.
500 $a Source: Dissertations Abstracts International, Volume: 81-04, Section: B.
500 $a Advisor: Alwan, Abeer A. H.
502 $a Thesis (Ph.D.)--University of California, Los Angeles, 2019.
506 $a This item must not be sold to any third party vendors.
520 $a Deep learning and neural network research has grown significantly in the fields of automatic speech recognition (ASR) and speaker recognition. Compared to traditional methods, deep learning-based approaches are more powerful in learning representations from data and building complex models. In this dissertation, we focus on representation learning and modeling using neural network-based approaches for speech and speaker recognition. In the first part of the dissertation, we present two novel neural network-based methods to learn speaker-specific and phoneme-invariant features for short-utterance speaker verification. We first propose to learn a spectral feature mapping from each speech signal to the corresponding subglottal acoustic signal, which has less phoneme variation, using deep neural networks (DNNs). The estimated subglottal features show better speaker-separation ability and provide complementary information when combined with traditional speech features on speaker verification tasks. Additionally, we propose another DNN-based mapping model, which maps the speaker representation extracted from short utterances to the speaker representation extracted from long utterances of the same speaker. Two non-linear regression models using an autoencoder are proposed to learn this mapping, and they both improve speaker verification performance significantly. In the second part of the dissertation, we design several new neural network models which take raw speech features (either complex Discrete Fourier Transform (DFT) features or raw waveforms) as input, and perform feature extraction and phone classification jointly. We first propose a unified deep Highway (HW) network with a time-delayed bottleneck (TDB) layer in the middle for feature extraction. The TDB-HW networks with complex DFT features as input provide significantly lower error rates compared with hand-designed spectrum features on large-scale keyword spotting tasks. Next, we present a 1-D Convolutional Neural Network (CNN) model, which takes raw waveforms as input and uses convolutional layers to do hierarchical feature extraction. The proposed 1-D CNN model outperforms standard systems with hand-designed features. To further reduce the redundancy of the 1-D CNN model, we propose a filter sampling and combination (FSC) technique, which can reduce the model size by 70% and still improve performance on ASR tasks. In the third part of the dissertation, we propose two novel neural network models for sequence modeling. We first propose an attention mechanism for acoustic sequence modeling. The attention mechanism can automatically predict the importance of each time step and select the most important information from sequences. Second, we present a sequence-to-sequence based spelling correction model for end-to-end ASR. The proposed correction model can effectively correct errors made by the ASR systems.
590 $a School code: 0031.
650 4 $a Artificial intelligence. $3 194058
650 4 $a Computer science. $3 199325
650 4 $a Electrical engineering. $3 454503
690 $a 0800
690 $a 0984
690 $a 0544
710 2 $a University of California, Los Angeles. $b Electrical and Computer Engineering 0333. $3 857493
773 0 $t Dissertations Abstracts International $g 81-04B.
790 $a 0031
791 $a Ph.D.
792 $a 2019
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=13900095
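The MARC view above prints each field as a tag, optional indicators, and `$`-coded subfields. As a rough sketch of how such display lines could be read programmatically, the snippet below splits one line into its parts. It assumes this catalog's text rendering (not the binary ISO 2709 exchange format), and the helper name `parse_marc_line` and the returned dictionary layout are hypothetical.

```python
import re

# A subfield is "$" plus a one-character code, then its value, running up
# to the next subfield marker or the end of the line.
SUBFIELD = re.compile(r"\$([a-z0-9])\s+(.*?)\s*(?=\$[a-z0-9]\s|$)")

def parse_marc_line(line):
    """Split one display-format MARC line into tag, indicators, subfields."""
    line = line.strip()
    if "$" not in line:
        # Control-style fields (e.g. 001, 005, 008) carry a bare value.
        tag, _, value = line.partition(" ")
        return {"tag": tag, "indicators": [], "value": value.strip(),
                "subfields": []}
    head, _, rest = line.partition("$")
    tag, *indicators = head.split()
    subfields = SUBFIELD.findall("$" + rest)
    return {"tag": tag, "indicators": indicators, "value": None,
            "subfields": subfields}

# Lines copied from the record above.
subject = parse_marc_line("650 4 $a Artificial intelligence. $3 194058")
imprint = parse_marc_line(
    "260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2019")
stamp = parse_marc_line("005 20200514111959.5")
```

Here `subject["subfields"]` holds the topical term and its authority record number as `(code, value)` pairs. A subfield value that itself contains a `$` followed by a letter or digit and a space would be split incorrectly, so a production system should rely on a dedicated MARC library instead.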