1. What is the primary problem in Information Retrieval (IR) and what is the obj

Question

1. What is the primary problem in Information Retrieval (IR) and what is the objective of an IR system? For each of the Ingest steps, discuss how decisions on it can affect the primary problem (e.g., can it reduce the problem or have no effect) and the primary objective, if it affects that. The Ingest steps are: locating and getting an item to ingest (discuss crawling web pages); duplicate detection; normalization; zoning; stemming; entity identification; and categorization.

2. Consider the use of entity identification and categorization.

a. What problems are they trying to solve, and what are they trying to do?

b. How do they affect the list of processing tokens in an item?

c. How do they affect the Document Vector for the item?

3. a. Describe how the statement that language is the largest inhibitor to good communications applies to Information Retrieval Systems.

b. Relate this to the challenges in information retrieval that make it difficult to find the information a user is looking for.

c. What are some of the techniques to overcome this problem?

4. A. Why are inverted files the most typical structure used for search systems? How do n-grams modify the inversion list approach?

B. Discuss the similarities and differences between the N-gram and PATRICIA Tree approaches to indexing.

5. Understanding of precision and recall:

a. Given the following table of relevant documents found in the top 100 documents returned (e.g., 9 rel docs in the first 10, 17 rel docs in the first 20) calculate the Precision/recall graph for the search. Roughly draw a precision recall graph with precision on the Y axis and recall on the X-axis. Assume that all relevant documents are returned by the query. Compare it to the best possible precision/recall graph (draw optimum graph)

Rank range | 1-10 | 11-20 | 21-30 | 31-40 | 41-50 | 51-60 | 61-70 | 71-80 | 81-90 | 91-100
Relevant docs | 8 | 9 | 7 | 6 | 4 | 2 | 4 | 2 | 0 | 1
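A minimal sketch in Python of how the points for the graph in 5a could be computed. It assumes the table gives the number of relevant documents found in each block of ten ranks (not a running total) and, as the question states, that every relevant document is returned in the top 100. The names relevant_per_bucket and precision_recall_points are illustrative only.

```python
from itertools import accumulate

# Relevant documents found in ranks 1-10, 11-20, ..., 91-100 (from the table).
relevant_per_bucket = [8, 9, 7, 6, 4, 2, 4, 2, 0, 1]
bucket_size = 10


def precision_recall_points(counts, bucket_size):
    """Return (recall, precision) at each rank cutoff 10, 20, ..., 100."""
    # Assumption: all relevant documents are in the result list,
    # so the total number of relevant documents is the sum of the counts.
    total_relevant = sum(counts)
    points = []
    for i, found_so_far in enumerate(accumulate(counts), start=1):
        retrieved = i * bucket_size
        recall = found_so_far / total_relevant
        precision = found_so_far / retrieved
        points.append((recall, precision))
    return points


points = precision_recall_points(relevant_per_bucket, bucket_size)
for recall, precision in points:
    print(f"recall={recall:.3f}  precision={precision:.3f}")
```

For comparison, an optimum ranking would place every relevant document ahead of every non-relevant one, so its precision would stay at 1.0 until recall reaches 1.0.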

b. What is typically more important to a user, precision or recall, and why?

6. Stemming:

a. Compare the advantages and disadvantages of the Porter stemming algorithm, the Dictionary stemming algorithm, and the Successor Variety stemming algorithm.

b. Create the symbol tree for the following words: canopy, cars, cabony, cabossy, cabort, cabins, cabity, cabiry. Using successor variety and the Peak and Plateau algorithm, determine whether there are any stems for this set of words.

c. Does this method for locating a stem make any sense from a user's perspective? Looking at the stems, discuss your answer.

(HINT: one trick to represent the symbol tree is to use an Excel spreadsheet or an MS Word table, with each row being another level down. Make sure there are enough empty cells between entries to clearly indicate branches in the tree, or draw it and scan in the result.)
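A rough sketch in Python of the successor variety count and the peak rule, for checking a hand-drawn symbol tree. It assumes that the successor variety of a prefix is the number of distinct letters that follow that prefix anywhere in the word set, that end-of-word is not counted as a successor, and that a break is placed after a prefix whose variety is strictly greater than that of both its neighbours. The function names are hypothetical, and a course text may define the plateau case differently.

```python
WORDS = ["canopy", "cars", "cabony", "cabossy",
         "cabort", "cabins", "cabity", "cabiry"]


def successor_variety(prefix, words):
    """Number of distinct letters that follow `prefix` in the word set."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})


def peak_breaks(word, words):
    """Prefixes of `word` after which the peak rule suggests a break."""
    # Successor variety for every prefix of length 1 .. len(word).
    varieties = [successor_variety(word[:i], words)
                 for i in range(1, len(word) + 1)]
    breaks = []
    # A peak needs both a preceding and a following value to compare with.
    for i in range(1, len(varieties) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            breaks.append(word[:i + 1])
    return breaks


for word in WORDS:
    print(word, peak_breaks(word, WORDS))
```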

7. PATRICIA TREE:

a. Create the PATRICIA Tree and the Reduced PATRICIA Tree for the input string 01010101010110 to 8 levels of sistrings (level 1 is the original string).

b. Assuming a search term of 01010101111, calculate the number of comparisons needed for the PAT tree and the reduced PAT tree.

8. Weighted Indexing technique

a. Given the following weighted term-document matrix, calculate the new document vectors using Normalized TF (using the maximum value per row to normalize), Inverse Document Frequency, and Signal. Show how you came up with the weights for the three different algorithms for each term (T1-T6). You don't have to show how you calculate each entry in the table. Treat the table as the complete database (N=5 documents); a zero term frequency means that the word does not occur in that document.

Doc | T1 | T2 | T3 | T4 | T5 | T6
D1 | 4 | 2 | 2 | 1 | 6 | 0
D2 | 3 | 2 | 1 | 12 | 0 | 0
D3 | 0 | 2 | 0 | 4 | 2 | 0
D4 | 2 | 2 | 0 | 1 | 1 | 0
D5 | 4 | 2 | 0 | 8 | 4 | 2
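A hedged sketch in Python of the three weighting schemes applied to the matrix above. The exact formulas are assumptions, since the question does not state them: normalized TF divides each frequency by the largest frequency in its document row; the IDF factor is taken as log2(N) - log2(DF) + 1; and the Signal factor is taken as log2(TOTF) - AVE_INFO, where AVE_INFO is the entropy of the term's distribution across documents. In the latter two schemes the final weight would be TF multiplied by the factor. A course text may use different variants, and all function names here are illustrative.

```python
import math

# Term frequencies for T1..T6 in each document (from the table).
TF = {
    "D1": [4, 2, 2, 1, 6, 0],
    "D2": [3, 2, 1, 12, 0, 0],
    "D3": [0, 2, 0, 4, 2, 0],
    "D4": [2, 2, 0, 1, 1, 0],
    "D5": [4, 2, 0, 8, 4, 2],
}
NUM_TERMS = 6


def normalized_tf(tf):
    """Divide each frequency by the maximum frequency in its document row."""
    return {doc: [f / max(row) for f in row] for doc, row in tf.items()}


def idf_factors(tf):
    """Assumed form: log2(N) - log2(DF_j) + 1 for each term j."""
    n = len(tf)
    factors = []
    for j in range(NUM_TERMS):
        df = sum(1 for row in tf.values() if row[j] > 0)
        factors.append(math.log2(n) - math.log2(df) + 1)
    return factors


def signal_factors(tf):
    """Assumed form: log2(TOTF_j) - AVE_INFO_j (entropy over documents)."""
    factors = []
    for j in range(NUM_TERMS):
        column = [row[j] for row in tf.values() if row[j] > 0]
        total = sum(column)
        avg_info = -sum((f / total) * math.log2(f / total) for f in column)
        factors.append(math.log2(total) - avg_info)
    return factors


def weighted(tf, factors):
    """Multiply every term frequency by its per-term factor."""
    return {doc: [f * w for f, w in zip(row, factors)]
            for doc, row in tf.items()}


ntf = normalized_tf(TF)
idf = idf_factors(TF)
signal = signal_factors(TF)
idf_vectors = weighted(TF, idf)
signal_vectors = weighted(TF, signal)
```

Under these assumptions, a term that occurs in every document (T2) gets the smallest IDF factor, while a term confined to one document (T6) gets the largest.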

b. Discuss the advantages of each approach. What idea does each one use to improve the term weight?

c. What is the goal of a good index, and what is the importance of the weights in an index?

9. Given the following documents, determine the weights for a Naive Bayesian category for a document to be about Trees. Given the new document listed, determine whether it should be given the category or not. (14 points)

Doc1 Oak, Plum, rose, Oak, Oak, Plum, ash, ash, ash is member of trees

Doc2 Plum, strawberry, OAK, Ash, Ash, Ash, Oak, Ash is member of trees

Doc3 Ash, Plum, Apple, Apple, Apple, Oak, Ash, Plum is member of trees

Doc4 Ash, Ash, rose, rose, plum, plum, plum, Oak is member of trees

Doc5 rose, rose, rose, plum, tulip, tulip is NOT member of trees

Doc6 rose, tulip, tulip, rose, plum, tulip, strawberry is NOT member of trees
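One way the category weights could be estimated is sketched below in Python. This assumes a multinomial Naive Bayes model with add-one (Laplace) smoothing and case-insensitive words; the question does not say which variant is intended, so treat the formula, the label names, and the function names as assumptions.

```python
import math
from collections import Counter

# Training documents from the question, lower-cased.
TRAINING = [
    ("trees", "oak plum rose oak oak plum ash ash ash"),
    ("trees", "plum strawberry oak ash ash ash oak ash"),
    ("trees", "ash plum apple apple apple oak ash plum"),
    ("trees", "ash ash rose rose plum plum plum oak"),
    ("not_trees", "rose rose rose plum tulip tulip"),
    ("not_trees", "rose tulip tulip rose plum tulip strawberry"),
]


def train(examples):
    """Estimate class priors and smoothed word likelihoods."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    priors = {label: n / len(examples) for label, n in doc_counts.items()}
    likelihood = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one smoothing: (count + 1) / (class word total + vocabulary size)
        likelihood[label] = {w: (counts[w] + 1) / (total + len(vocabulary))
                             for w in vocabulary}
    return priors, likelihood


def classify(words, priors, likelihood):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label in priors:
        score = math.log(priors[label])
        for w in words:
            if w in likelihood[label]:  # skip words never seen in training
                score += math.log(likelihood[label][w])
        scores[label] = score
    return max(scores, key=scores.get)


priors, likelihood = train(TRAINING)
for label in likelihood:
    print(label, priors[label], sorted(likelihood[label].items()))
```

The new document mentioned in the question does not appear in the text above, so classify is left to be called with whatever words that document contains.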

10. Given the following three items, where w2, w4, etc. all represent different words, compare how well the shingle process determines which items are near duplicates, using shingles composed of 3 words versus shingles composed of 6 words. Use the rolling definition of shingles: for 3-word shingles, words 1-3 are shingle 1, words 2-4 are shingle 2, words 3-5 are shingle 3, and so on until the last 3 words form the last shingle. To determine the numeric value of each shingle, concatenate the word numbers. Thus for shingle w1w1w4 the numeric value would be 114, and for shingle w1w1w4w2w2w1 it would be 114221. Use Broder's formula to calculate the resemblance between each item and the other items for the 3-word shingles and the 6-word shingles. Discuss the results and the impact of going to 6-word shingles.

Item 1: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3

Item 2: w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4

Item 3: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3
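The rolling shingle process described in the question can be sketched in Python as follows. It assumes that Broder's resemblance is the size of the intersection of the two shingle sets divided by the size of their union, and that repeated shingles within an item count once. The function names are illustrative.

```python
from itertools import combinations

ITEMS = {
    "Item 1": "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3",
    "Item 2": "w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4",
    "Item 3": "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3",
}


def shingles(text, k):
    """Set of numeric values for every rolling k-word shingle."""
    words = text.split()
    values = set()
    for i in range(len(words) - k + 1):
        # Drop the leading 'w' and concatenate the word numbers,
        # e.g. w1 w1 w4 -> 114.
        values.add(int("".join(w[1:] for w in words[i:i + k])))
    return values


def resemblance(a, b):
    """Assumed Broder resemblance: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def compare_all(items, k):
    """Resemblance for every pair of items using k-word shingles."""
    sets = {name: shingles(text, k) for name, text in items.items()}
    return {(x, y): resemblance(sets[x], sets[y])
            for x, y in combinations(sets, 2)}


for k in (3, 6):
    for pair, score in compare_all(ITEMS, k).items():
        print(k, pair, round(score, 3))
```

Running the comparison for both shingle sizes shows how a single changed word affects more of the shingles as the shingle length grows.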

Explanation / Answer

An information retrieval (IR) system is a combination of hardware and software that eases the process of finding the information a user needs. The information can be in various forms, such as text, audio, video, or images. The design of an IR system involves a set of users and a set of documents, for the purpose of information collection and retrieval. The main challenges involved in designing an effective IR system are:

The queries received from the users as well as their structure are diverse.

The complexity of the computation required to retrieve the relevant information might be high.

A clear understanding and analysis of the user's problem is vital.

The designer must determine which portion of the solution the system can perform and which portion must be left to the user.

Judging the user's level of understanding and knowledge is also a major challenge.

The system must determine the relevant information needed to move the user from their present state of understanding to the required state of knowledge.

Main objectives of an IR system are:

Minimise the time and effort spent by the user to retrieve relevant information. This includes the time for query generation, query execution and all the other steps involved before information finally reaches the user.

Aid the process of data retrieval and provide the most relevant data to the user in the least amount of time.

Impact of the ingest steps:

Crawling web pages: Increases efficiency

Duplicate detection: Increases efficiency by around 30%

Normalization: Increases efficiency

Zoning: Increases efficiency

Stemming: Increases efficiency