
Question

Extensive Vocabulary

For this assignment, we would like to study the representative works of two authors and ask ourselves the question "Whose vocabulary is more extensive?"

The authors that we will be looking at are Charles Dickens and Thomas Hardy, and the representative novels that we will analyze are A Tale of Two Cities and The Return of the Native, respectively. Here are the steps in our analysis.

Step I: Go to Project Gutenberg and download the plain-text versions of both books. Call the first book Tale.txt and the second book Return.txt. Open the books in a text editor and delete the preamble at the beginning of each book, as well as the license agreement and the closing blurbs at the end. The first book, Tale.txt, should begin and end with these lines:

The second book in Return.txt should begin and end with these lines:

Step II: The program that you will be writing will be called Books.py. The following is the suggested structure; you do not have to adhere to it. However, we will be looking for good documentation, design, and adherence to the coding conventions discussed in class. Use meaningful variable names in your program.

Step III: Declare a global dictionary variable word_dict. In the function create_word_dict() open the file words.txt and populate the dictionary. Each word in the file will be a key in the dictionary and the value will be 1. Hard code the name of the file, words.txt, in your program. When we test your program we will be using a file with the same name.
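Step III can be sketched as follows. This is only a sketch, not the required solution: it assumes words.txt holds one word per line, and it uses a `with` block for the file, which the suggested structure may or may not use.

```python
# Global dictionary of valid words from the comprehensive word list.
word_dict = {}

def create_word_dict():
    """Populate word_dict from words.txt: each word is a key, each value is 1."""
    with open('words.txt', 'r') as infile:
        for line in infile:
            word = line.strip().lower()
            if word:                 # skip blank lines
                word_dict[word] = 1
```

Because the file name words.txt is hard-coded, the grader's own word list with that name will be picked up automatically.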

Step IV: Start by getting the frequency of words in each text. Here is the additional processing that you will have to do:

Open the file for reading

Create an empty dictionary

Read line by line through the file

For each line, strip the end-of-line character and replace punctuation marks with spaces, except apostrophes ('). Replace all hyphens with spaces; the hyphens at the ends of lines do not denote word continuation, so there is no need to check for dangling words on the next line. Remove a trailing apostrophe (') or a trailing ('s); if an apostrophe is followed by a character other than 's', keep the apostrophe. In the function parseString() create a new blank string, go through the input string character by character, and accept only letters (using isalpha()), spaces (using isspace()), and apostrophes in the special cases mentioned above. Return the new string.

After the dictionary is created close the input file.

Remove all words that start with a capital letter.

Create an empty list for words starting with a capital letter.

Go through the words in the frequency dictionary. Check if the word starts with a capital letter.

If it does, add it to the list of capital words.

Once that list is complete, go through the list of capitalized words and check if the lower case version of that word exists in the frequency dictionary. If it exists, then add the upper case word's frequency to the lower case word's frequency.

If the lower case version does not exist in the dictionary, check if it exists in a comprehensive word list of the kind used by crossword and Scrabble players. If it does, create an entry in the frequency dictionary with the lower case version of the word and the word frequency computed.

After you have checked for all capitalized words, remove all those words in the list of capitalized words and their frequencies from the word frequency dictionary.

You should now have a dictionary of words in lower case and their frequencies. You will have removed all proper names of people and places. You will have added the names of people and places that are common words like Green and Bath. But such words should be few compared to the total number of words that we are dealing with in those novels. You can always write the list of words beginning with a capital letter in a file and examine that file to check if the deletions were properly made.
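The per-word cleanup rules in Step IV can be sketched like this. It is a sketch under the stated rules, not the required implementation: punctuation and hyphens become spaces, interior apostrophes survive, and a trailing ('s) or bare trailing apostrophe is dropped.

```python
def parseString(st):
    """Keep letters, spaces, and interior apostrophes; turn everything
    else into a space; strip trailing 's and trailing apostrophes."""
    cleaned = ''
    for ch in st:
        if ch.isalpha() or ch.isspace() or ch == "'":
            cleaned += ch
        else:
            cleaned += ' '          # punctuation and hyphens become spaces
    words = []
    for word in cleaned.split():
        if word.endswith("'s"):
            word = word[:-2]        # drop a possessive 's
        elif word.endswith("'"):
            word = word[:-1]        # drop a bare trailing apostrophe
        if word:
            words.append(word)
    return ' '.join(words)
```

For example, `parseString("rose-bush's thorn!")` yields `"rose bush thorn"`, while a word like `o'clock` keeps its interior apostrophe.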

Step V: In this step you will be working on the function wordComparison(). First you will get some statistics of the two novels separately and then you will compare the two together. For each novel compute and print the following pieces of information:

Print the number of distinct words used, i.e. the number of words after removing duplicates. Note that this is just the length of the list of keys of the word frequency dictionary.

Compute and print the total number of words used by adding all the frequencies together.

Calculate and print the percentage of distinct words to the total number of words used.

You will create two sets with the list of keys from the two word frequency dictionaries. Let us call these sets D and H for the two authors respectively. The set difference D - H represents all the words that Dickens used that Hardy did not. The set difference H - D represents all the words that Hardy used that Dickens did not. For each of these set differences print the following pieces of information:

The number of words in that set difference.

Compute the total frequency of the words in the set difference (D - H or H - D) and express it as a percentage of the total number of words in the novel that you found earlier. [In the example below, we computed the sum of the frequencies of the 50 words that Dickens used that Hardy did not (the sum should be 76) and expressed that as a percentage of the 119 words that Dickens used in all.]
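The set arithmetic described above can be sketched as follows. The function name and the sample dictionaries are illustrative, not part of the assignment; the inputs are the two word frequency dictionaries.

```python
def unique_word_stats(freq_a, freq_b):
    """Return (count, percent) for the words author A used that author B
    did not: the size of the set difference, and the summed frequency of
    those words as a percentage of all words A used (with duplicates)."""
    only_a = set(freq_a) - set(freq_b)              # words unique to author A
    unique_total = sum(freq_a[w] for w in only_a)   # their combined frequency
    grand_total = sum(freq_a.values())              # every word A used
    return len(only_a), unique_total / grand_total * 100
```

With `freq_a = {'tumbril': 3, 'heath': 1}` and `freq_b = {'heath': 2}`, the call returns `(1, 75.0)`: one word unique to author A, accounting for 75% of A's total word count.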

Here are two sample files - dickens.txt and hardy.txt taken from the two novels. Your output for these two sample files should be of the following form:

Here is the output of the frequencies of the words in the two excerpts - dickens.out.txt and hardy.out.txt. These outputs have now been checked against the comprehensive word list.

Explanation / Answer

# Create the word dictionary from the comprehensive word list.
# Each word in words.txt becomes a key with value 1.
word_dict = {}
def create_word_dict():
    book = open('words.txt', 'r')
    for line in book:
        line = line.strip()
        line = line.lower()
        word_list = line.split()
        for word in word_list:
            word_dict[word] = 1
    book.close()
# Removes punctuation marks from a string, keeping apostrophes
# except for a trailing ('s) or a bare trailing apostrophe
def parseString(st):
    symbolString = '~@#$%^&*()_+=`<>?/,.;:!{}[]|"'
    for char in symbolString:
        if char in st:
            st = st.replace(char, " ")
    if "-" in st:
        st = st.replace("-", " ")
    if st.endswith("'s"):
        st = st[:-2]
    if len(st) > 1 and st[0] == "'":
        st = st[1:]
    if len(st) > 1 and st[-1] == "'":
        st = st[:-1]
    return st

# Takes the name of an input file
# Returns a dictionary of words and their frequencies
def getWordFreq(file):
    # Open the book and build the raw frequency dictionary
    bookFile = open(file, "r")
    firstBook = {}
    capitalWords = []
    book1List = bookFile.read().split()
    bookFile.close()
    for elt in book1List:
        st = parseString(elt)
        for word in st.split():
            if word.isdigit():
                continue
            if word in firstBook:
                firstBook[word] += 1
            else:
                firstBook[word] = 1
    # Collect the words that start with a capital letter
    for key in firstBook:
        if key[0].isupper():
            capitalWords.append(key)
    # Fold capitalized words into their lower case versions; if the lower
    # case version is absent but is a valid dictionary word, create it
    for key in capitalWords:
        if key.lower() in firstBook:
            firstBook[key.lower()] += firstBook[key]
        elif key.lower() in word_dict:
            firstBook[key.lower()] = firstBook[key]
    # Remove the capitalized entries
    for key in capitalWords:
        del firstBook[key]
    return firstBook

# Compares the distinct words in two frequency dictionaries
def wordComparison(author1, freq1, author2, freq2):

    '''Compute and print vocabulary statistics for the two authors.'''

    # Count total and distinct words for each author
    totalCount1 = 0
    wordCount1 = 0
    for word in freq1:
        totalCount1 += freq1[word]
        wordCount1 += 1
    totalCount2 = 0
    wordCount2 = 0
    for word in freq2:
        totalCount2 += freq2[word]
        wordCount2 += 1

    # Create the set differences between the two vocabularies
    set1 = set(freq1)
    set2 = set(freq2)
    diff1 = set1 - set2
    diff2 = set2 - set1
    count1 = 0
    for x in diff1:
        count1 += freq1[x]
    count2 = 0
    for x in diff2:
        count2 += freq2[x]

    # Print results
    print()
    print(author1)
    print('Total distinct words =', wordCount1)
    print('Total words (including duplicates) =', totalCount1)
    print('Ratio (% of distinct words to total words) =',
          wordCount1 / totalCount1 * 100)
    print()
    print(author2)
    print('Total distinct words =', wordCount2)
    print('Total words (including duplicates) =', totalCount2)
    print('Ratio (% of distinct words to total words) =',
          wordCount2 / totalCount2 * 100)
    print()
    print('%s used %d words that %s did not use.' % (author1, len(diff1), author2))
    print('Relative frequency of words used by %s not in common with %s =' % (author1, author2),
          count1 / totalCount1 * 100)
    print('%s used %d words that %s did not use.' % (author2, len(diff2), author1))
    print('Relative frequency of words used by %s not in common with %s =' % (author2, author1),
          count2 / totalCount2 * 100)

def main():

    # Enter names of the two books in electronic form
    book1 = input("Enter name of first book: ")
    book2 = input("Enter name of second book: ")
    print()

    # Enter last names of the two authors
    author1 = input("Enter last name of first author: ")
    author2 = input("Enter last name of second author: ")
    print()

    # Build the comprehensive word dictionary, then get the
    # frequency of words used by the two authors
    create_word_dict()
    wordFreq1 = getWordFreq(book1)
    wordFreq2 = getWordFreq(book2)

    # Compare the relative frequency of uncommon words used
    # by the two authors
    wordComparison(author1, wordFreq1, author2, wordFreq2)

main()
