Qazanfari, Kazem.
Methods of Enriching Domain Knowledge with Universal Semantics for Higher Text Mining Performance.
Record type:
Bibliographic record - Electronic resource : Monograph/item
Title/Author:
Methods of Enriching Domain Knowledge with Universal Semantics for Higher Text Mining Performance.
Author:
Qazanfari, Kazem.
Publisher:
Ann Arbor : ProQuest Dissertations & Theses, 2020
Extent:
136 p.
Note:
Source: Dissertations Abstracts International, Volume: 81-06, Section: B.
Note:
Advisor: Youssef, Abdou.
Contained by:
Dissertations Abstracts International, 81-06B.
Subject:
Computer science.
Electronic resource:
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27665575
ISBN:
9781392787984
Thesis note:
Thesis (D.Sc.)--The George Washington University, 2020.
Restriction note:
This item is not available from ProQuest Dissertations & Theses.
Index terms:
Data mining
Abstract:
Language models are either trained only on repository data or post-trained on the repository after having been trained on a huge dataset such as Wikipedia. Either way, since the distribution of the repository data (usually a domain-specific corpus) and the real-world distribution of concepts (such as classes in a classification application) are rarely equivalent, the accuracy of the language model suffers. This is usually due to the Inadequacy of Knowledge (IoK) of the domain-specific corpus (referred to as the local knowledge) relative to the real-world, universal knowledge. To address this IoK issue, this dissertation proposes different methods, depending on whether the language modeling technique is traditional or recent, to efficiently combine the local knowledge with universal semantics and improve the performance of various text mining tasks.

For traditional language modeling such as bag-of-words, two novel techniques are proposed to combine the two sources of knowledge: one for document classification and one for document clustering. For classification, a novel feature weighting function is proposed that calculates the weight of each feature from its discriminating power as derived from the local and the universal sources of knowledge. For document clustering, where no labels are available, a different technique is introduced that combines the similarities of each pair of documents, where the similarities are derived from the local and the universal knowledge. The performance of the proposed methods is evaluated on several widely used classification and clustering algorithms and on a number of standard datasets. The evaluation results show that the performance of both document classification and clustering is significantly improved by the proposed methods.

Recent language modeling (or rather word embedding) techniques such as Word2Vec, GloVe, USE, and BERT introduce new kinds of feature vectors and exhibit a new, additional kind of IoK, namely the out-of-vocabulary (OOV) problem: certain words do not appear in the repository data but may appear in test data. In this thesis, the OOV issue in GloVe is addressed by changing the form of the training data into character n-grams. This version of GloVe, called C-GloVe, addresses the OOV problem quite effectively and generally outperforms GloVe and FastText, especially on smaller training datasets. The IoK of the local knowledge (i.e., the domain-specific training corpus) relative to the universal knowledge is also addressed where the feature vectors are embedding vectors: a method is proposed to integrate local and universal sources of knowledge and to combine different word embedding algorithms. Experimental results on three text mining tasks show that the proposed methods yield higher performance than a standalone source of knowledge or a standalone word embedding algorithm, especially when one embedder is trained on the local source and a different embedder on the universal source. On the classification task, the proposed integrated method achieves the same or a higher F1-score than the state of the art (i.e., BERT) in nearly all experiments performed.
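The abstract does not spell out how the clustering technique combines the two similarity signals. As a rough illustration only, the sketch below blends local and universal pairwise cosine similarities with a hypothetical mixing weight alpha and hands the result to an off-the-shelf clustering algorithm; the convex combination and the alpha value are assumptions, not the dissertation's actual method.

```python
# A minimal sketch of blending local and universal document similarities
# for clustering, in the spirit of the abstract. The convex combination
# and the mixing weight `alpha` are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cosine_sim_matrix(X):
    """Pairwise cosine similarities between the row vectors of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return Xn @ Xn.T

def combined_similarity(X_local, X_universal, alpha=0.5):
    """Blend similarities from two feature spaces of the same documents.

    X_local:     document vectors built from the domain corpus.
    X_universal: document vectors built from universal knowledge
                 (e.g., embeddings pretrained on Wikipedia).
    alpha:       hypothetical mixing weight in [0, 1].
    """
    S_local = cosine_sim_matrix(X_local)
    S_universal = cosine_sim_matrix(X_universal)
    return alpha * S_local + (1.0 - alpha) * S_universal

# Toy usage: six documents, two different vector representations.
rng = np.random.default_rng(0)
X_local = rng.normal(size=(6, 50))       # e.g., bag-of-words features
X_universal = rng.normal(size=(6, 300))  # e.g., pretrained embeddings

S = combined_similarity(X_local, X_universal, alpha=0.6)
D = 1.0 - S  # AgglomerativeClustering expects precomputed *distances*
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(D)
print(labels)
```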
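For the OOV discussion in the abstract, the sketch below illustrates the character n-gram idea behind C-GloVe: if vectors are learned for character n-grams rather than whole words, an unseen word can still be given a vector composed from its n-grams. The n-gram range, boundary markers, and FastText-style averaging here are assumptions for illustration, not the dissertation's exact recipe.

```python
# Illustration of character n-gram composition for OOV words. A trained
# C-GloVe model would supply the n-gram vectors; here a random table
# stands in for it. All hyperparameters below are assumed.
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word, with '<' and '>' boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word, ngram_vectors, dim=100):
    """Compose a word vector by averaging its known n-gram vectors."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return np.zeros(dim)  # nothing known about this word
    return np.mean([ngram_vectors[g] for g in grams], axis=0)

# Toy n-gram table standing in for a trained character-level model.
rng = np.random.default_rng(1)
vocab_grams = char_ngrams("mining") + char_ngrams("miner")
ngram_vectors = {g: rng.normal(size=100) for g in vocab_grams}

# "minings" never appeared in training, but shares n-grams with
# "mining", so it still receives a vector instead of failing as OOV.
v = word_vector("minings", ngram_vectors)
print(v.shape, np.linalg.norm(v) > 0)
```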
MARC record:
LDR 04724nmm a2200385 4500
001 594524
005 20210521101645.5
008 210917s2020 ||||||||||||||||| ||eng d
020    $a 9781392787984
035    $a (MiAaPQ)AAI27665575
035    $a AAI27665575
040    $a MiAaPQ $c MiAaPQ
100 1  $a Qazanfari, Kazem. $3 886513
245 10 $a Methods of Enriching Domain Knowledge with Universal Semantics for Higher Text Mining Performance.
260 1  $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2020
300    $a 136 p.
500    $a Source: Dissertations Abstracts International, Volume: 81-06, Section: B.
500    $a Advisor: Youssef, Abdou.
502    $a Thesis (D.Sc.)--The George Washington University, 2020.
506    $a This item is not available from ProQuest Dissertations & Theses.
506    $a This item must not be sold to any third party vendors.
520    $a Language models are either trained only on repository data or post-trained on the repository after having been trained on a huge dataset such as Wikipedia. Either way, since the distribution of the repository data (usually a domain-specific corpus) and the real-world distribution of concepts (such as classes in a classification application) are rarely equivalent, the accuracy of the language model suffers. This is usually due to the Inadequacy of Knowledge (IoK) of the domain-specific corpus (referred to as the local knowledge) relative to the real-world, universal knowledge. To address this IoK issue, this dissertation proposes different methods, depending on whether the language modeling technique is traditional or recent, to efficiently combine the local knowledge with universal semantics and improve the performance of various text mining tasks. For traditional language modeling such as bag-of-words, two novel techniques are proposed to combine the two sources of knowledge: one for document classification and one for document clustering. For classification, a novel feature weighting function is proposed that calculates the weight of each feature from its discriminating power as derived from the local and the universal sources of knowledge. For document clustering, where no labels are available, a different technique is introduced that combines the similarities of each pair of documents, where the similarities are derived from the local and the universal knowledge. The performance of the proposed methods is evaluated on several widely used classification and clustering algorithms and on a number of standard datasets. The evaluation results show that the performance of both document classification and clustering is significantly improved by the proposed methods. Recent language modeling (or rather word embedding) techniques such as Word2Vec, GloVe, USE, and BERT introduce new kinds of feature vectors and exhibit a new, additional kind of IoK, namely the out-of-vocabulary (OOV) problem: certain words do not appear in the repository data but may appear in test data. In this thesis, the OOV issue in GloVe is addressed by changing the form of the training data into character n-grams. This version of GloVe, called C-GloVe, addresses the OOV problem quite effectively and generally outperforms GloVe and FastText, especially on smaller training datasets. The IoK of the local knowledge (i.e., the domain-specific training corpus) relative to the universal knowledge is also addressed where the feature vectors are embedding vectors: a method is proposed to integrate local and universal sources of knowledge and to combine different word embedding algorithms. Experimental results on three text mining tasks show that the proposed methods yield higher performance than a standalone source of knowledge or a standalone word embedding algorithm, especially when one embedder is trained on the local source and a different embedder on the universal source. On the classification task, the proposed integrated method achieves the same or a higher F1-score than the state of the art (i.e., BERT) in nearly all experiments performed.
590    $a School code: 0075.
650  4 $a Computer science. $3 199325
650  4 $a Artificial intelligence. $3 194058
653    $a Data mining
653    $a Deep learning
653    $a Language modeling
653    $a Machine learning
653    $a Natural language processing
653    $a Text mining
690    $a 0984
690    $a 0800
710 2  $a The George Washington University. $b Computer Science. $3 492889
773 0  $t Dissertations Abstracts International $g 81-06B.
790    $a 0075
791    $a D.Sc.
792    $a 2020
793    $a English
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=27665575
Holdings:
Barcode:
000000193484
Location:
Electronic collection
Circulation category:
1 Book
Material type:
E-book
Call number:
EB 2020
Use type:
Normal
Loan status:
In cataloging process
Attachments:
0