Question

This homework will refer to concepts that we discussed in class regarding how to use pre-tagged corpora.

1. Load the tagged words version of the treebank corpus (nltk.corpus.treebank). Write code to find the most common tags, first using the pre-defined treebank tagset, and then again with the universal tagset.

2. Again using the treebank tagged words corpus, use bigrams to determine the most common part of speech that occurs after the part of speech ‘DET.’

3. Using a method similar to what was presented in class to find V-“to”-V sequences, search the treebank tagged corpus to find “to”-ADVERB-VERB sequences. You can do this using just the treebank tagged words corpus or (more similar to what we did in class) the treebank tagged sentence corpus.

4. This question does not involve tagging, but it does involve corpus searching with bigrams. Load the text of Moby Dick from the gutenberg corpus (nltk.corpus.gutenberg.words(‘melville-moby_dick.txt’)). Using bigrams, find all words that precede ‘whale’ in the text.

Explanation / Answer

from nltk.corpus import treebank

import nltk

# Question 1: most common tags with the default (Penn Treebank) tagset
tree_tagged = treebank.tagged_words()
tag_fd = nltk.FreqDist(tag for (word, tag) in tree_tagged)
print(tag_fd.most_common())

# Same count with the universal tagset; tree_tagged now holds universal tags
tree_tagged = treebank.tagged_words(tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in tree_tagged)
print(tag_fd.most_common())

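# Question 2: most common part of speech that occurs after 'DET'.
# 'DET' is a universal-tagset label, so the universal-tagged words are loaded explicitly.
universal_tagged = treebank.tagged_words(tagset='universal')
word_tag_pairs = nltk.bigrams(universal_tagged)
det_suc = [b[1] for (a, b) in word_tag_pairs if a[1] == 'DET']
fdist = nltk.FreqDist(det_suc)
# Tags are listed from most to least frequent; the first entry answers the question
print([tag for (tag, _) in fdist.most_common()])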
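The bigram step above can be tried without downloading any corpus. The sketch below is only an illustration: the sample list is hand-made (not treebank data), `tags_after` is a hypothetical helper name, and `zip` stands in for `nltk.bigrams` to produce the same adjacent pairs.

```python
from collections import Counter

# Hypothetical hand-tagged sample (universal tagset), for illustration only
sample = [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("chased", "VERB"),
          ("a", "DET"), ("cat", "NOUN"), ("and", "CONJ"),
          ("the", "DET"), ("bird", "NOUN")]

def tags_after(tagged_words, target_tag):
    """Count the tags that immediately follow target_tag."""
    # Pair each (word, tag) item with its successor, as nltk.bigrams would
    pairs = zip(tagged_words, tagged_words[1:])
    return Counter(next_tag for (_, tag), (_, next_tag) in pairs
                   if tag == target_tag)

counts = tags_after(sample, "DET")
print(counts.most_common(1))  # the single most frequent tag after DET
```

On the real corpus the same function could be called with the universal-tagged treebank words converted to a list.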

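# Question 3: "to"-ADVERB-VERB sequences using trigrams over tagged sentences.
# treebank.tagged_sents() uses the Penn Treebank tagset by default, where
# "to" is tagged TO, adverbs are RB/RBR/RBS, and verbs are VB/VBD/VBG/VBN/VBP/VBZ.
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1 == 'TO' and t2.startswith('RB') and t3.startswith('VB'):
            print(w1, w2, w3)

for tagged_sent in treebank.tagged_sents():
    process(tagged_sent)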
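To check the trigram pattern on something small, here is a minimal sketch that collects matches instead of printing them. It assumes Penn Treebank-style tags (TO, RB*, VB*); the sample sentence is hand-tagged for illustration and `to_adv_verb` is a hypothetical helper name.

```python
def to_adv_verb(tagged_sent):
    """Return (to, adverb, verb) word triples from one tagged sentence."""
    # Slide a window of three (word, tag) items across the sentence
    triples = zip(tagged_sent, tagged_sent[1:], tagged_sent[2:])
    return [(w1, w2, w3)
            for (w1, t1), (w2, t2), (w3, t3) in triples
            if t1 == "TO" and t2.startswith("RB") and t3.startswith("VB")]

# Hypothetical hand-tagged sentence, not taken from the treebank corpus
sample_sent = [("They", "PRP"), ("plan", "VBP"), ("to", "TO"),
               ("quickly", "RB"), ("expand", "VB"), ("to", "TO"),
               ("Europe", "NNP"), (".", ".")]

matches = to_adv_verb(sample_sent)
print(matches)
```

The second "to" is followed by a proper noun, so only the first one should produce a match.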

# Question 4: use bigrams to find all words that precede 'whale' in Moby Dick
moby_words = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
word_pairs = nltk.bigrams(moby_words)
# The comparison is case-sensitive, so 'Whale' and 'WHALE' are not matched
whale_preceders = [a for (a, b) in word_pairs if b == 'whale']
print(whale_preceders)

# Optional: list each distinct preceding word once, most frequent first
#fdist = nltk.FreqDist(whale_preceders)
#print([word for (word, _) in fdist.most_common()])
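The search above matches only the lowercase token 'whale' and keeps punctuation among the preceding tokens. If a case-insensitive count of real words is wanted, one possible variant is sketched below. The word list is a made-up sample and `preceders` is a hypothetical helper name, not part of NLTK.

```python
from collections import Counter

def preceders(words, target):
    """Count alphabetic words that come directly before target, ignoring case."""
    lowered = [w.lower() for w in words]
    return Counter(prev for prev, cur in zip(lowered, lowered[1:])
                   if cur == target and prev.isalpha())

# Hypothetical sample tokens, standing in for gutenberg.words(...)
sample_words = ("The whale surfaced . A white whale ! "
                "Whale after whale passed the white whale").split()

whale_counts = preceders(sample_words, "whale")
print(whale_counts.most_common())
```

Passing the full Moby Dick word list in place of `sample_words` should give the same kind of frequency table for the whole novel.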
