National University of Kaohsiung Library and Information Center


Term selection for information retrieval applications.

Record Type:	Electronic resources : Monograph/item
Title/Author:	Term selection for information retrieval applications.
Author:	Schultz, J. Michael.
Description:	136 p.
Notes:	Source: Dissertation Abstracts International, Volume: 64-10, Section: A, page: 3667.
Notes:	Supervisor: Mark Y. Liberman.
Contained By:	Dissertation Abstracts International 64-10A.
Subject:	Language, Linguistics.
Online resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3109218
ISBN:	0496567616


Thesis (Ph.D.)--University of Pennsylvania, 2003.

In this dissertation we investigate methods for selecting terms in the context of a number of specific tasks. As a practical test case for some of the approaches developed here, we participate in the formal evaluations of Topic Detection and Tracking. In the spirit of residual-idf, a metric which measures deviation from Poisson, we develop a sum-log-ratios metric which improves upon residual-idf in two significant ways: it incorporates document length normalization, and it is a function of the entire within-document term count distribution. Also developed here is the idea of a "universal dictionary" as a basis for translingual information retrieval tasks. In the methods section, we describe a suffix-array-based indexing scheme ideally suited to efficiently calculating within-document term counts for n-grams in very large corpora.
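The abstract names residual-idf but does not define it, and it gives no formula for the sum-log-ratios metric. As a rough illustration only, the sketch below uses the commonly cited formulation of residual IDF: the observed inverse document frequency of a term minus the inverse document frequency that a Poisson model would predict from the term's total count. The function and parameter names are hypothetical, the code is not taken from the dissertation, and it does not attempt the sum-log-ratios metric.

```python
import math


def residual_idf(doc_freq: int, coll_freq: int, n_docs: int) -> float:
    """Observed IDF minus the IDF expected under a Poisson model.

    doc_freq  -- number of documents containing the term
    coll_freq -- total occurrences of the term in the collection
    n_docs    -- number of documents in the collection
    (All three names are illustrative assumptions.)
    """
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be in the range 1..n_docs")
    if coll_freq < doc_freq:
        raise ValueError("coll_freq cannot be smaller than doc_freq")

    # Observed IDF from the actual document frequency.
    observed_idf = -math.log2(doc_freq / n_docs)

    # Under a Poisson model with rate lam = coll_freq / n_docs, the
    # probability that a document contains the term at least once
    # is 1 - exp(-lam).
    lam = coll_freq / n_docs
    expected_idf = -math.log2(1.0 - math.exp(-lam))

    # A large positive value means the term is concentrated in fewer
    # documents than Poisson predicts, i.e. it is "bursty".
    return observed_idf - expected_idf


if __name__ == "__main__":
    # Same total count, different spread across 1,000 documents.
    print(residual_idf(doc_freq=100, coll_freq=100, n_docs=1000))
    print(residual_idf(doc_freq=10, coll_freq=100, n_docs=1000))
```

In this formulation a term whose occurrences are spread evenly scores near zero, while a term with the same total count packed into a few documents scores well above zero. Note that the inputs are only a document frequency and a collection frequency; the abstract says the sum-log-ratios metric instead depends on the entire within-document term count distribution and on document length normalization.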

ISBN: 0496567616
Subjects--Topical Terms: Language, Linguistics.

Term selection for information retrieval applications.
LDR:03220nmm _2200277 _450 001 162235
005 20051017073425.5
008 230606s2003 eng d
020 $a 0496567616
035 $a 00148736
035 $a 162235
040 $a UnM $c UnM
100 0 $a Schultz, J. Michael. $3 227361
245 1 0 $a Term selection for information retrieval applications. $h [electronic resource]
300 $a 136 p.
500 $a Source: Dissertation Abstracts International, Volume: 64-10, Section: A, page: 3667.
500 $a Supervisor: Mark Y. Liberman.
502 $a Thesis (Ph.D.)--University of Pennsylvania, 2003.
520 # $a In this dissertation we investigate methods for selecting terms in the context of a number of specific tasks. As a practical test case for some of the approaches developed here, we participate in the formal evaluations of Topic Detection and Tracking. In the spirit of residual-idf, a metric which measures deviation from Poisson, we develop a sum-log-ratios metric which improves upon residual-idf in two significant ways: it incorporates document length normalization, and it is a function of the entire within-document term count distribution. Also developed here is the idea of a "universal dictionary" as a basis for translingual information retrieval tasks. In the methods section, we describe a suffix-array-based indexing scheme ideally suited to efficiently calculating within-document term counts for n-grams in very large corpora.
520 # $a The selection and identification of terms is an important part of many natural language applications. In the information retrieval domain, documents are often abbreviated to their most salient terms in order to reduce storage requirements and processing time and to make algorithms more efficient. The quality of search results is a direct reflection of the quality of these representative features. In translingual applications, translation dictionaries must be built in order to bridge the gap between source and target languages. With limited time and resources, the most effective terms for translation must somehow be chosen. Techniques for term selection are also fundamental to a number of other tasks, including the automatic generation of indices, concordances, and abstracts, and the extraction of terminology.
520 # $a We test our methods in a number of real-world applications. In the formal evaluations of TDT2, we show that the simple vector space model performs as well as much more complicated models. In the context of building a "universal dictionary", we use our method of term selection to choose a vocabulary of fewer than 10,000 terms which is essentially as effective for topic tracking as an unlimited vocabulary of over 300,000 terms. We demonstrate that this same method extends well to other applications, employing it as a novel approach to multi-word terminology and collocation extraction.
590 $a School code: 0175.
650 # 0 $a Language, Linguistics. $3 212724
650 # 0 $a Computer Science. $3 212513
650 # 0 $a Information Science. $3 212402
710 0 # $a University of Pennsylvania. $3 212781
773 0 # $g 64-10A. $t Dissertation Abstracts International
790 $a 0175
790 1 0 $a Liberman, Mark Y., $e advisor
791 $a Ph.D.
792 $a 2003
856 4 0 $u http://libsw.nuk.edu.tw/login?url=http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3109218 $z http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3109218