Text Processing and Indexed-Vocabulary Size

Question

Free-text indexing requires making a few text-processing decisions that determine what goes in the index. For each of the following text-processing decisions (a)-(e), specify whether it is likely to increase or decrease the size of the indexed vocabulary (i.e., the set of unique terms or inverted lists represented in the index). Provide a short explanation (one sentence is enough).

(a) Down-casing (making all the text lower-case) [4 points]:
(b) Stemming [4 points]:
(c) Removing stopwords [4 points]:

Explanation / Answer

a) Down-casing

Down-casing would reduce the size of the indexed vocabulary. All variants of a word that differ only in capitalization (e.g., SMART vs. Smart vs. smart) would collapse into a single index entry (e.g., smart).
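A minimal sketch of this effect, assuming a made-up one-sentence corpus and plain whitespace tokenization (both are illustrative choices, not part of the original question):

```python
# Hypothetical toy corpus, tokenized on whitespace only.
tokens = "SMART people make Smart choices and smart plans".split()

# Vocabulary = set of unique terms, before and after down-casing.
vocab_cased = set(tokens)
vocab_lower = {t.lower() for t in tokens}

print(len(vocab_cased), len(vocab_lower))
```

The three capitalization variants count as three terms in the cased vocabulary and as one term after down-casing.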

b) Stemming

Stemming would reduce the size of the indexed vocabulary. Morphological variants (e.g., computer, computing, and computation) would collapse into a single index entry (e.g., comput).
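The collapse can be illustrated with a deliberately crude suffix stripper. The function `toy_stem` below is a hypothetical stand-in written for this example; it is not the Porter stemmer or any other real algorithm, and its suffix list is an assumption chosen to cover the three example words.

```python
def toy_stem(term):
    """Strip one of a few hard-coded suffixes (illustrative only)."""
    for suffix in ("ation", "ing", "er"):
        # Keep at least three characters so short words are left alone.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

terms = ["computer", "computing", "computation", "index"]

vocab_raw = set(terms)
vocab_stemmed = {toy_stem(t) for t in terms}

print(sorted(vocab_raw))
print(sorted(vocab_stemmed))
```

Four surface forms go in, and only two index entries (`comput` and `index`) come out.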

c) Removing stopwords

Stopword removal would reduce the size of the indexed vocabulary, but only by the number of stopwords removed (typically ~50-250 terms). Some answers incorrectly stated that removing stopwords would reduce the vocabulary size by 50%. That figure applies to the size of the index: because stopwords occur so frequently, removing them shrinks the index by about 50%, while the indexed vocabulary loses only those few entries.
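The difference between vocabulary size and index size can be sketched as follows. The stopword list and the sentence are hypothetical, and the token count is used as a rough proxy for the number of postings. In a corpus this small the vocabulary also shrinks by a large fraction; in a real collection with a very large vocabulary, a few hundred removed terms would be negligible.

```python
from collections import Counter

# Hypothetical four-word stopword list and toy corpus.
STOPWORDS = {"the", "of", "a", "in"}
tokens = "the jaguar of the forest and the jaguar of a city in the index".split()

counts = Counter(tokens)
kept = [t for t in tokens if t not in STOPWORDS]

vocab_before = set(tokens)
vocab_after = set(kept)

# Vocabulary shrinks by the number of distinct stopwords present.
removed_terms = len(vocab_before) - len(vocab_after)
# Token occurrences (a proxy for index size) shrink by how often they occur.
removed_tokens = len(tokens) - len(kept)

print(removed_terms, removed_tokens, counts["the"])
```

Here four vocabulary entries disappear, but they account for eight of the fourteen token occurrences.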

d) Distinguishing whether a specific word occurs in the TITLE field or the BODY field of a document

Distinguishing between terms in different fields would increase the size of the indexed vocabulary, potentially doubling it. Any term that occurs in both the TITLE field and the BODY field would need separate entries in the index: term.TITLE and term.BODY (and potentially term.ANY).
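A small sketch of field-qualified terms, assuming two made-up documents and the `term.FIELD` naming used above (the optional `term.ANY` entry is left out to keep the example short):

```python
# Hypothetical documents with a TITLE and a BODY field each.
docs = [
    {"TITLE": "jaguar speed", "BODY": "the jaguar is a fast cat"},
    {"TITLE": "index size", "BODY": "the index size depends on the vocabulary"},
]

vocab_plain = set()    # field-agnostic terms
vocab_fielded = set()  # terms qualified by field, e.g. "jaguar.TITLE"

for doc in docs:
    for field, text in doc.items():
        for term in text.split():
            vocab_plain.add(term)
            vocab_fielded.add(f"{term}.{field}")

print(len(vocab_plain), len(vocab_fielded))
```

Each term that appears in both fields (here `jaguar`, `index`, and `size`) contributes one extra entry to the fielded vocabulary.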

e) Distinguishing between different senses of the same word (for example, “jaguar” the car and “jaguar” the animal)

Distinguishing between different senses of the same word would increase the size of the indexed vocabulary. A term with n senses or meanings would have n entries in the index (one per sense).
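As a rough illustration, suppose a word-sense disambiguation step has already labeled each token. The sense labels and the `term#sense` naming below are assumptions made for this sketch; how the senses are actually assigned is outside its scope.

```python
# Hypothetical output of a sense tagger: (term, sense) pairs.
# A sense of None means the term was treated as unambiguous.
tagged = [
    ("jaguar", "car"),
    ("jaguar", "animal"),
    ("jaguar", "car"),
    ("bank", "river"),
    ("bank", "finance"),
    ("index", None),
]

vocab_plain = {term for term, _ in tagged}
vocab_sense = {f"{term}#{sense}" if sense else term for term, sense in tagged}

print(sorted(vocab_plain))
print(sorted(vocab_sense))
```

The two ambiguous terms each split into two entries, so three plain terms become five sense-specific ones.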