
Python I am trying to build a bigram model and to calculate the probability of w


Question

Python

I am trying to build a bigram model and to calculate the probability of word occurrence. I should: select an appropriate data structure to store bigrams; increment counts for each combination of word and previous word, which means I need to keep track of what the previous word was; and compute the probability of the current word based on the previous word count.

P(curr word | prev word) = count(prev word, curr word) / count(prev word)
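The steps and the formula above can be sketched as follows. This is only one possible layout, not necessarily the one the assignment expects: it assumes a nested mapping from previous word to a counter of following words, and the names `count_bigrams`, `bigram_prob`, `followers`, and `prev_counts`, as well as the sample sentence, are hypothetical.

```python
from collections import Counter, defaultdict


def count_bigrams(tokens, start="START"):
    """Count bigrams and previous-word totals for a flat list of tokens."""
    followers = defaultdict(Counter)  # followers[prev][curr] = count(prev, curr)
    prev_counts = Counter()           # prev_counts[prev] = count(prev)
    prev = start
    prev_counts[prev] += 1            # the start marker is seen once
    for token in tokens:
        token = token.lower()
        followers[prev][token] += 1
        prev_counts[token] += 1
        prev = token                  # remember the previous word
    return followers, prev_counts


def bigram_prob(followers, prev_counts, prev, curr):
    """Return count(prev, curr) / count(prev), or 0.0 if prev was never seen."""
    if prev_counts[prev] == 0:
        return 0.0
    return followers[prev][curr] / prev_counts[prev]


# Hypothetical sample sentence, not taken from the question.
followers, prev_counts = count_bigrams("the cat sat on the mat".split())
print(bigram_prob(followers, prev_counts, "the", "cat"))  # 1 / 2 = 0.5
```

Keying the outer dictionary by the previous word keeps every follower of that word in one place, so the denominator and all numerators for a given previous word can be looked up directly.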

Consider we observed the following word sequences:

finger remarked

finger on

finger on

finger in

finger .

Notice that "finger on" was observed twice. Also, notice that the period is treated as a separate word. Given the information in this data structure, we can compute the probability P(on | finger) as 2/5 = 0.4.
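As a quick check of the arithmetic, the five observations can be counted directly. This is a minimal sketch that assumes the pairs are stored as (previous word, current word) tuples; the names `observed`, `pair_counts`, `prev_counts`, and `prob` are hypothetical.

```python
from collections import Counter
import math

# The five (previous word, current word) observations listed above.
observed = [
    ("finger", "remarked"),
    ("finger", "on"),
    ("finger", "on"),
    ("finger", "in"),
    ("finger", "."),
]

pair_counts = Counter(observed)                       # count(prev, curr)
prev_counts = Counter(prev for prev, _ in observed)   # count(prev)


def prob(prev, curr):
    """count(prev, curr) / count(prev); prev must have been observed."""
    return pair_counts[(prev, curr)] / prev_counts[prev]


print(prob("finger", "on"))  # 2 / 5 = 0.4

# The probabilities of everything that followed "finger" should add up to 1.
distribution = {curr: prob(prev, curr) for (prev, curr) in pair_counts}
assert math.isclose(sum(distribution.values()), 1.0)
```

The final check is a useful sanity test: when every occurrence of a previous word has a follower, the conditional probabilities for that word sum to 1.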

Here is what I got so far:

filename = 'blah-blah.txt'
bigrams = {}
unigrams = {}
prev_word = "START"

# opening the filename in read mode

for line in fp:
    words = line.split()
    for word in words:
        word = word.lower()
        bigram = prev_word + ' ' + word
        #print(bigram)
        if word in unigrams:
            unigrams[word] += 1
        else:
            unigrams[word] = 1
        #print(unigrams[word])

        if bigram in bigrams:
            bigrams[bigram] += 1
        else:
            bigrams[bigram] = 1
        prev_word = word

output_file = 'bigram_probs.txt'
with open(output_file, "w") as fs:
    for key, value in sorted(bigrams.items()):
        prob = value / unigrams[word]
        fs.write(key + ": " + str(prob) + " ")

My program works, but I am not sure if it does what it should do. I appreciate any help!

Explanation / Answer

Code:

filename = 'blah-blah.txt'
bigrams = {}
# "START" is counted once so that the first bigram has a denominator.
unigrams = {"START": 1}
prev_word = "START"

with open(filename, "r") as fp:
    for line in fp:
        words = line.split()
        for word in words:
            word = word.lower()
            bigram = prev_word + ' ' + word

            # count(word)
            if word in unigrams:
                unigrams[word] += 1
            else:
                unigrams[word] = 1

            # count(prev_word, word)
            if bigram in bigrams:
                bigrams[bigram] += 1
            else:
                bigrams[bigram] = 1
            prev_word = word

output_file = 'bigram_probs.txt'
with open(output_file, "w") as fs:
    for key, value in sorted(bigrams.items()):
        # Divide by the count of the bigram's first word, not by
        # unigrams[word], which is only the count of the last word read.
        first_word = key.split(' ')[0]
        prob = float(value) / unigrams[first_word]
        fs.write(key + ": " + str(prob) + "\n")

Input File:

finger remarked
finger on
finger on
finger in
finger

Output File:

START finger: 1.0
finger in: 0.2
finger on: 0.4
finger remarked: 0.2
in finger: 1.0
on finger: 1.0
remarked finger: 1.0
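The code above carries prev_word across line breaks, which is why bigrams such as "remarked finger" appear in the output. If each line of the input is meant to be an independent word sequence (an assumption; the question does not say), a variant that resets the previous word to "START" on every line might look like the sketch below. The function name `bigram_probs_per_line` and the tuple-keyed counters are hypothetical choices.

```python
from collections import Counter


def bigram_probs_per_line(lines, start="START"):
    """Bigram probabilities, treating every line as its own sequence."""
    pair_counts = Counter()   # count(prev, curr)
    prev_counts = Counter()   # count(prev)
    for line in lines:
        words = line.lower().split()
        if not words:
            continue          # skip blank lines
        prev = start          # assumption: each line starts a new sequence
        prev_counts[start] += 1
        for word in words:
            pair_counts[(prev, word)] += 1
            prev_counts[word] += 1
            prev = word
    return {
        prev + " " + word: count / prev_counts[prev]
        for (prev, word), count in pair_counts.items()
    }


# Same lines as the input file above.
lines = ["finger remarked", "finger on", "finger on", "finger in", "finger"]
probs = bigram_probs_per_line(lines)
for key in sorted(probs):
    print(key + ": " + str(probs[key]))
```

On the input file above this yields only four bigrams: "START finger" with probability 1.0, and "finger in", "finger on", and "finger remarked" with 0.2, 0.4, and 0.2, matching the 2/5 worked out in the question.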