


Question

6. Stemming:

a. Compare the advantages and disadvantages of the Porter stemming algorithm, the dictionary stemming algorithm, and the successor variety stemming algorithm.

b. Create the symbol tree for the following words (canopy, cars, cabony, cabossy, cabort, cabins, cabity, cabiry). Using successor variety and the peak and plateau algorithm, determine whether there are any stems for this set of words.

c. Does this method for locating a stem make any sense from a user's perspective? Looking at the stems, discuss your answer.

(HINT: One trick to represent the symbol tree is to use an Excel spreadsheet or an MS Word table, with each row being another level down. Make sure there are enough empty cells between entries to clearly indicate branches in the tree, or draw it and scan in the result.)
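
As a starting point for part b, the successor variety of each prefix can be counted mechanically. The sketch below is only one reading of the method: it assumes an end-of-word marker ("$") counts as a successor, and it treats a "peak" as a value strictly greater than both neighbours. Plateau handling and the choice of which segment becomes the stem are left out. The function names are illustrative, not taken from the question.

```python
WORDS = ["canopy", "cars", "cabony", "cabossy", "cabort", "cabins", "cabity", "cabiry"]

def successor_varieties(word, corpus):
    """Successor variety of every prefix of `word` within `corpus`."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Distinct symbols that follow this prefix; "$" (an assumed marker)
        # stands for a corpus word that ends exactly at the prefix.
        followers = {w[i] if len(w) > i else "$"
                     for w in corpus if w.startswith(prefix)}
        varieties.append(len(followers))
    return varieties

def peak_prefixes(word, corpus):
    """Prefixes whose successor variety is strictly above both neighbours."""
    sv = successor_varieties(word, corpus)
    return [word[:i + 1] for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]

for w in WORDS:
    print(w, successor_varieties(w, WORDS), peak_prefixes(w, WORDS))
```

Each printed row pairs a word with its per-prefix counts and the candidate break points, which can then be checked against the hand-drawn symbol tree.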

7. PATRICIA TREE:

a. Create the PATRICIA tree and the reduced PATRICIA tree for the following string, 01010101010110, to 8 levels of sistrings (level 1 is the original string).

b. Assuming a search term of 01010101111, calculate the number of comparisons needed for the PAT tree and the reduced PAT tree.
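
Before building either tree, it can help to list the sistrings themselves. The sketch below assumes a sistring is simply the suffix of the input starting at each successive position, with no padding added at the end; some treatments pad with nulls or zeros, which is not done here. The helper name is illustrative.

```python
TEXT = "01010101010110"

def sistrings(text, count):
    """First `count` sistrings: the suffixes starting at positions 1..count."""
    return [text[start:] for start in range(count)]

# Level 1 is the original string, level 2 starts at the second bit, and so on.
for level, s in enumerate(sistrings(TEXT, 8), start=1):
    print(level, s)
```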

8. Weighted Indexing technique

a. Given the following weighted term-document matrix, calculate the new document vectors using Normalized TF (using the maximum value per row to normalize), Inverse Document Frequency, and Signal. Show how you came up with the weights for the three different algorithms for each term (T1-T6). You don't have to show how you calculate each entry in the table. Treat the table as the complete database (N=5 documents); a zero term frequency means that the word does not occur in that document.

      T1   T2   T3   T4   T5   T6
D1     4    2    2    1    6    0
D2     3    2    1   12    0    0
D3     0    2    0    4    2    0
D4     2    2    0    1    1    0
D5     4    2    0    8    4    2

b. Discuss the advantages of each approach and the idea each one uses to improve the term weight.

c. What is the goal of a good index, and what is the importance of the weights in an index?
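
The three weighting schemes in part a can be sketched as follows. The question does not state the formulas, so the ones used here are assumptions: the IDF factor is taken as log2(N / df) + 1 multiplied by the raw term frequency, and the signal value as log2(total TF) minus the term's average information, also multiplied by the raw term frequency. If the course text defines them differently, only those lines need to change.

```python
import math

# Rows are documents D1..D5, columns are terms T1..T6 (from the matrix in part a).
TF = [
    [4, 2, 2, 1, 6, 0],
    [3, 2, 1, 12, 0, 0],
    [0, 2, 0, 4, 2, 0],
    [2, 2, 0, 1, 1, 0],
    [4, 2, 0, 8, 4, 2],
]
N = len(TF)
TERMS = range(len(TF[0]))

# Normalized TF: divide each document row by its largest term frequency.
norm_tf = [[tf / max(row) for tf in row] for row in TF]

# Document frequency: number of documents in which each term occurs.
doc_freq = [sum(1 for row in TF if row[j] > 0) for j in TERMS]

# Assumed IDF factor: log2(N / df) + 1, multiplied by the raw TF.
idf = [math.log2(N / df) + 1 for df in doc_freq]
idf_weights = [[row[j] * idf[j] for j in TERMS] for row in TF]

def signal(j):
    """Assumed signal: log2(total TF) minus the term's average information."""
    total = sum(row[j] for row in TF)
    ave_info = sum((row[j] / total) * math.log2(total / row[j])
                   for row in TF if row[j] > 0)
    return math.log2(total) - ave_info

signals = [signal(j) for j in TERMS]
signal_weights = [[row[j] * signals[j] for j in TERMS] for row in TF]

print("document frequencies:", doc_freq)
print("IDF factors:", [round(x, 3) for x in idf])
print("signal values:", [round(x, 3) for x in signals])
```

A term such as T2, which occurs equally often in every document, ends up with the smallest IDF factor under this formula, which is the behaviour part b asks you to discuss.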

9. Given the following documents, determine the weights for the Naive Bayesian category for a document to be about Trees. Given the new document listed, determine whether it should be given the category or not. (14 points)

Doc1: Oak, Plum, rose, Oak, Oak, Plum, ash, ash, ash (is a member of Trees)

Doc2: Plum, strawberry, OAK, Ash, Ash, Ash, Oak, Ash (is a member of Trees)

Doc3: Ash, Plum, Apple, Apple, Apple, Oak, Ash, Plum (is a member of Trees)

Doc4: Ash, Ash, rose, rose, plum, plum, plum, Oak (is a member of Trees)

Doc5: rose, rose, rose, plum, tulip, tulip (is NOT a member of Trees)

Doc6: rose, tulip, tulip, rose, plum, tulip, strawberry (is NOT a member of Trees)
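
One way to organize the counts for this question is a small multinomial Naive Bayes sketch. Several choices below are assumptions rather than part of the question: words are lowercased so that OAK, Oak, and oak count as one term, "Plum strawberry" in Doc2 is read as two words, and add-one (Laplace) smoothing is used so that unseen words do not give a zero probability. The new document to classify is not reproduced in the question text, so the classifier takes any word list.

```python
import math
from collections import Counter

# Training data transcribed from the question, lowercased; True = member of Trees.
TRAINING = [
    ("oak plum rose oak oak plum ash ash ash", True),
    ("plum strawberry oak ash ash ash oak ash", True),
    ("ash plum apple apple apple oak ash plum", True),
    ("ash ash rose rose plum plum plum oak", True),
    ("rose rose rose plum tulip tulip", False),
    ("rose tulip tulip rose plum tulip strawberry", False),
]

counts = {True: Counter(), False: Counter()}
docs_per_class = Counter()
for text, is_tree in TRAINING:
    counts[is_tree].update(text.split())
    docs_per_class[is_tree] += 1

VOCAB = set(counts[True]) | set(counts[False])

def word_prob(word, is_tree):
    """P(word | class) with add-one (Laplace) smoothing, an assumed choice."""
    total = sum(counts[is_tree].values())
    return (counts[is_tree][word] + 1) / (total + len(VOCAB))

def log_score(words, is_tree):
    """log P(class) plus the sum of log P(word | class) over the words."""
    prior = docs_per_class[is_tree] / len(TRAINING)
    return math.log(prior) + sum(math.log(word_prob(w, is_tree)) for w in words)

def is_about_trees(words):
    """True when the Trees class scores higher than the non-Trees class."""
    return log_score(words, True) > log_score(words, False)

for word in sorted(VOCAB):
    print(word, round(word_prob(word, True), 3), round(word_prob(word, False), 3))
```

The printed table gives the per-word weights for both classes; passing the new document's words to `is_about_trees` then yields the category decision under these assumptions.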

10. Given the following three items, where w2, w4, etc. all represent different words, compare how well the shingle process works in determining which items are near duplicates by looking at shingles composed of 3 words versus shingles composed of 6 words. Use the rolling definition of shingles: for the 3-word process, words 1-3 are shingle 1, words 2-4 are shingle 2, words 3-5 are shingle 3, and so on until the last 3 words form the last shingle. To determine the numeric value for each shingle, concatenate the word numbers. Thus for shingle w1w1w4 the numeric value would be 114, and for shingle w1w1w4w2w2w1 it would be 114221. Use Broder's formula to calculate the resemblance between each item and the other items for the 3-word shingles and the 6-word shingles. Discuss the results and the impact of going to 6-word shingles.

Item 1: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3

Item 2: w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4

Item 3: w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3
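
The rolling shingle process described above can be sketched in a few lines. This assumes the resemblance formula is the usual set ratio, the size of the intersection of the two shingle sets divided by the size of their union, and that repeated shingles within an item count once. The function names are illustrative.

```python
ITEM1 = "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w3 w4 w3"
ITEM2 = "w1 w4 w2 w4 w1 w1 w4 w2 w2 w1 w2 w3 w3 w2 w2 w4"
ITEM3 = "w1 w1 w4 w2 w2 w1 w4 w2 w3 w1 w1 w4 w2 w5 w4 w3"

def shingles(item, k):
    """Set of numeric values of the rolling k-word shingles of `item`."""
    digits = [word[1:] for word in item.split()]  # "w4" -> "4"
    return {int("".join(digits[i:i + k])) for i in range(len(digits) - k + 1)}

def resemblance(a, b, k):
    """Shared shingles divided by all distinct shingles of the two items."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

PAIRS = {"1-2": (ITEM1, ITEM2), "1-3": (ITEM1, ITEM3), "2-3": (ITEM2, ITEM3)}
for k in (3, 6):
    for name, (a, b) in PAIRS.items():
        print(f"{k}-word shingles, items {name}: {resemblance(a, b, k):.3f}")
```

Running it for both shingle sizes puts the six resemblance values side by side, which is the comparison the discussion part of the question asks for.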

Explanation / Answer

Advantages of the dictionary (table lookup) stemming algorithm:-

The table lookup approach is simple and fast, and it handles exceptions easily because each irregular form is listed directly with its stem.

Disadvantages of the dictionary (table lookup) stemming algorithm:-

Every inflected form must be explicitly present in the table, so new or unfamiliar words cannot be handled even if they are perfectly regular (such as iPads and iPad), and the table can become large.
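
The table-based behaviour described above can be illustrated with a tiny lookup stemmer. The table contents here are hypothetical and chosen only to show one irregular form that is handled and one regular plural that is missed because it has no entry.

```python
# Hypothetical lookup table; a real one would have to list every inflected form.
STEM_TABLE = {
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "ipad": "ipad",
}

def lookup_stem(word):
    """Return the stem listed for `word`, or None if the table has no entry."""
    return STEM_TABLE.get(word.lower())

print(lookup_stem("mice"))   # irregular form, handled by a direct table entry
print(lookup_stem("iPads"))  # regular plural, but absent from the table
```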
