Internationalization of Task-Oriented Dialogue Systems.
Record type: Bibliographic, electronic resource : Monograph/item
Title/Author: Internationalization of Task-Oriented Dialogue Systems.
Author: Moradshahi, Mehrad.
Publisher: Ann Arbor : ProQuest Dissertations & Theses, 2023
Description: 115 p.
Note: Source: Dissertations Abstracts International, Volume: 85-06, Section: B.
Note: Advisor: Lam, Monica; Boneh, Dan; Sadigh, Dorsa.
Contained by: Dissertations Abstracts International, 85-06B.
Subject: Multilingualism.
Electronic resource: http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30726801
ISBN: 9798381019070
Note: Thesis (Ph.D.)--Stanford University, 2023.
Note: This item must not be sold to any third party vendors.
Abstract: Virtual assistants and Task-oriented Dialogue (ToD) agents are increasingly prevalent due to their utility in daily tasks. Despite the linguistic diversity worldwide, only a few dominant languages are supported by these digital assistants. This restriction is due to the high cost and manual effort required to produce the large, hand-annotated datasets needed to train these agents. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice considerable accuracy for data efficiency and largely fail to produce a usable dialogue agent.

This thesis introduces a novel solution for automatically creating ToD agents in new languages by leveraging dialogue data in the source language and neural machine translation. The approach is based on automatic entity-aware translation of training data, a concise dialogue data representation that enables effective zero-shot training, and a scalable, robust approach for creating high-quality end-to-end few-shot, validation, and test data with minimal manual effort.

To address data scarcity, we use neural machine translation to translate the training dataset from the source to the target language. We show that naive application of this approach does not yield good performance, because entities in the input can be mistranslated, transliterated, or omitted and then no longer match those in the annotation. We propose a series of techniques to improve dataset quality by (1) leveraging word alignments derived from the neural translation model's cross-attention weights to preserve entities, and (2) applying automatic data filtering based on textual semantic similarity to exclude poor translations. Using this approach, we create multilingual versions of Schema2QA, a single-turn question-answering dataset, in 10 different languages. Agents trained on our automatically translated data improve upon the previous state of the art by 30-40% and come within 5-8% of the original English agent.

Translation is inherently noisy and poses a special challenge in the end-to-end dialogue setting, where the amount of natural language encoded grows with each turn. The accumulation of errors can prevent a correct parse for the rest of the dialogue. To address this, we introduce a new distilled dialogue data representation that significantly reduces the amount of natural language encoded and decoded by the model. On the BiToD dataset, using our representation, we found a 14% improvement in Dialogue Success Rate (DSR) in the few-shot setting.

The lack of a high-quality, realistic testbed for multilingual ToD evaluation has impeded accurate measurement of research progress on the topic. Prior work deployed human translators to either translate or post-edit an automatically translated dataset, but did so for only one or two subtasks of a dialogue agent, so training an end-to-end agent was not tractable. To address this, we initiated a global effort to extend a large-scale multi-domain dataset, RiSAWOZ (initially in Chinese), to several new languages: English, Korean, French, Hindi, and code-mixed English-Hindi. To ensure the best quality and fluency, we used human post-editing only for the few-shot, validation, and test data. The challenges encountered in creating this dataset at scale led us to create a toolset that makes post-editing for a new language much faster and cheaper. Experiments show that few-shot training achieves 63-88% of the original full-shot performance. The remaining gap motivates further research on multilingual ToD.
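The two data-quality techniques described in the abstract can be illustrated with short, self-contained sketches. The first shows entity preservation: given word alignments between a source utterance and its machine translation (the thesis derives such alignments from the translation model's cross-attention weights), an annotated entity span is projected onto the target sentence so the annotation still matches the translated text. The function, tokens, and alignment below are hypothetical illustrations, not the thesis implementation.

```python
# Toy illustration of projecting an annotated entity span from a source
# utterance onto its translation via word alignments. The thesis obtains
# alignments from the NMT model's cross-attention weights; here they are
# simply passed in. All names and data are hypothetical.

def project_entity(src_tokens, tgt_tokens, alignment, entity_span):
    """Return the target-side text aligned to a source entity span.

    alignment   -- list of (source_index, target_index) pairs
    entity_span -- (start, end) indices into src_tokens, end exclusive
    """
    start, end = entity_span
    tgt_indices = sorted(j for i, j in alignment if start <= i < end)
    if not tgt_indices:
        return None  # entity dropped or mistranslated; candidate for filtering
    return " ".join(tgt_tokens[tgt_indices[0]:tgt_indices[-1] + 1])


src = "book a table at cafe luna for tonight".split()
tgt = "réserve une table au cafe luna pour ce soir".split()
alignment = [(0, 0), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 8)]
print(project_entity(src, tgt, alignment, (4, 6)))  # -> "cafe luna"
```

The second shows automatic filtering of poor translations by cross-lingual semantic similarity. It is a minimal sketch assuming the sentence-transformers library; the model name and threshold are illustrative choices, not those used in the thesis.

```python
# Minimal sketch of filtering machine-translated utterances by cross-lingual
# semantic similarity. Model choice and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def filter_translations(pairs, threshold=0.8):
    """Keep (source, translation) pairs whose embeddings are similar enough."""
    src_emb = model.encode([s for s, _ in pairs], convert_to_tensor=True)
    tgt_emb = model.encode([t for _, t in pairs], convert_to_tensor=True)
    scores = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]

pairs = [("find me a cheap hotel downtown",
          "trouve-moi un hôtel pas cher au centre-ville")]
print(filter_translations(pairs))
```

In a pipeline of this kind, pairs that fall below the similarity threshold, or whose entities cannot be projected onto the translation, would be discarded before training.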
MARC record:
LDR    04686nmm a2200373 4500
001    655864
005    20240414211951.5
006    m o d
007    cr#unu||||||||
008    240620s2023 ||||||||||||||||| ||eng d
020    $a 9798381019070
035    $a (MiAaPQ)AAI30726801
035    $a (MiAaPQ)STANFORDkg582jk7231
035    $a AAI30726801
040    $a MiAaPQ $c MiAaPQ
100 1  $a Moradshahi, Mehrad. $3 967044
245 10 $a Internationalization of Task-Oriented Dialogue Systems.
260  1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2023
300    $a 115 p.
500    $a Source: Dissertations Abstracts International, Volume: 85-06, Section: B.
500    $a Advisor: Lam, Monica; Boneh, Dan; Sadigh, Dorsa.
502    $a Thesis (Ph.D.)--Stanford University, 2023.
506    $a This item must not be sold to any third party vendors.
590    $a School code: 0212.
650  4 $a Multilingualism. $3 214330
650  4 $a Error analysis. $3 967045
650  4 $a Semantics. $3 177636
650  4 $a Bilingual education. $3 766158
650  4 $a Education. $3 177995
650  4 $a Language. $3 180769
650  4 $a Logic. $3 180785
650  4 $a Mathematics. $3 184409
690    $a 0282
690    $a 0515
690    $a 0679
690    $a 0395
690    $a 0405
710 2  $a Stanford University. $3 212607
773 0  $t Dissertations Abstracts International $g 85-06B.
790    $a 0212
791    $a Ph.D.
792    $a 2023
793    $a English
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30726801
Holdings (1 item):
Barcode: 000000236879
Location: Electronic collection
Circulation category: Book
Material type: Dissertation
Call number: TH 2023
Use type: Normal
Loan status: On shelf
Holds: 0