A Theory of Indexing by Gerard Salton

By Gerard Salton

Presents a theory of indexing capable of ranking index terms, or subject identifiers, in decreasing order of importance. This leads to the choice of good document representations, and also accounts for the role of phrases and of thesaurus classes in the indexing process.

This study is typical of theoretical work in automatic information organization and retrieval, in that concepts are used from mathematics, computer science, and linguistics. A complete theory of information retrieval may emerge from an appropriate combination of these three disciplines.

For the three sample collections of about 450 documents, the document frequency ranges applicable to the majority of the terms for the three classes of discrimination values are 1-5, 5-30, and 30-160, respectively. If the discrimination value of a term furnishes an accurate picture of its value for indexing purposes, the situation may then be summarized, as shown schematically in Fig. 11. When the terms are arranged in increasing order according to their document frequencies in a collection, the first set of terms with very low document frequency Bk exhibits a discrimination value near zero.
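The relationship between document frequency and discrimination value can be made concrete with a small sketch. The excerpt does not spell out how the discrimination value is computed, so the code below assumes one common formulation: the space density is the average cosine similarity between each document and the collection centroid, and the discrimination value of a term is the density with that term removed minus the density with all terms present. The function names (cosine, density, discrimination_value) and the four toy documents are illustrative assumptions, not taken from the book.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density(docs):
    """Average similarity of each document to the centroid (assumed density measure)."""
    totals = Counter()
    for doc in docs:
        totals.update(doc)  # Counter.update adds the weights term by term
    n = len(docs)
    centroid = {t: w / n for t, w in totals.items()}
    return sum(cosine(doc, centroid) for doc in docs) / n


def discrimination_value(docs, term):
    """Density without `term` minus density with it.

    Positive: removing the term packs the documents closer together,
    so the term helps tell documents apart.
    Negative: the term makes documents look more alike.
    """
    without = [{t: w for t, w in doc.items() if t != term} for doc in docs]
    return density(without) - density(docs)


# Toy collection (hypothetical): "the" occurs everywhere, "wing" in half the documents.
docs = [
    {"the": 1, "wing": 1},
    {"the": 1, "wing": 1},
    {"the": 1, "blood": 1},
    {"the": 1, "blood": 1},
]
dv_the = discrimination_value(docs, "the")
dv_wing = discrimination_value(docs, "wing")
```

In this toy collection the term that occurs in every document gets a negative value, while the term of medium document frequency gets a positive one, which matches the qualitative pattern the passage describes.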

In the present instance, the information value test had to be abandoned for the MED collection because a sufficient number of user queries could not be found. The second problem is the relatively small number of co-occurring terms between documents and user queries, and thus the limited scope of the term value modifications. For the CRAN collection only about 20 terms in all were subjected to positive term modifications and only about 50 were modified negatively. The corresponding figures for Time are even smaller: about 10 positive modifications and about 30 negative ones.

Such terms do not provide much matching power between documents and queries; in fact, when they occur in a query, they may help in the retrieval of one document at most. Additional deletions are carried out by removing terms with a large document frequency, standard common words, terms with negative discrimination values, and terms that differ from existing ones only by addition of a terminal 's'. Recall-precision results averaged for 1,033 document abstracts and 35 user queries are shown for the system in Fig.
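The deletion steps listed above can be sketched as a simple vocabulary filter. This is a minimal illustration under stated assumptions, not the procedure used in the book: "such terms" is read as terms with a document frequency of 1, the high-frequency cutoff max_df is a free parameter, and a term counts as a terminal-'s' variant when the same string without the final 's' survives the other filters. The function name prune_vocabulary and all sample data are hypothetical.

```python
def prune_vocabulary(doc_freq, disc_value, stop_words, max_df):
    """Return the set of index terms that survive the deletion rules.

    doc_freq:   term -> number of documents containing the term
    disc_value: term -> discrimination value (missing terms are treated as 0.0)
    stop_words: set of standard common words
    max_df:     assumed cutoff above which a term counts as too frequent
    """
    kept = set()
    for term, df in doc_freq.items():
        if df <= 1:                           # assumed reading of "such terms"
            continue
        if df > max_df:                       # large document frequency
            continue
        if term in stop_words:                # standard common words
            continue
        if disc_value.get(term, 0.0) < 0:     # negative discrimination value
            continue
        kept.add(term)
    # Drop terms that differ from a kept term only by a terminal 's'.
    return {t for t in kept if not (t.endswith("s") and t[:-1] in kept)}


# Hypothetical sample data for illustration only.
doc_freq = {"wing": 10, "wings": 8, "the": 400, "of": 20,
            "zeta": 1, "flow": 15, "gas": 12}
disc_value = {"flow": -0.01}
pruned = prune_vocabulary(doc_freq, disc_value, stop_words={"of"}, max_df=100)
```

Checking the terminal 's' rule against the already filtered set keeps a word such as "gas", whose shortened form is not itself a term, while removing "wings" in favor of "wing".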

