Write a program, analyzeSMSes(inputFilename), that analyzes word frequencies in SMS messages

Question

a. Write a program, analyzeSMSes(inputFilename), that analyzes word frequencies in text messages. The data you are to analyze is in the text file SMSSpamCollection.txt, which is also available online. There is additional information about the contents of the file in the associated "readme" file.

Each line of the file represents one SMS/text message. The first item on every line is a label indicating whether that line's SMS is considered spam or not. The rest of the line contains the text of the SMS/message. For example:

... call 09061 209465 now

At the end, your program must print summary information and information about the most frequent words in spam messages and the most frequent words in non-spam ("ham") messages. I will not specify exactly what your output should be (but I will demonstrate sample output during the next lecture or two). I will also provide organizational hints and help for each of the parts.

To accomplish this, analyzeSMSes should:

- read all the data from the input file;
- extract individual words from the messages. This should include an effort to get rid of extras such as periods, question marks, exclamation marks, commas, and other characters that aren't part of a word. You should probably also ignore capitalization. Thus, in the sample spam message above, you probably want to treat "Congrats" as "congrats" in your frequency analysis;
- build two dictionaries, one for frequencies of words appearing in spam messages and one for frequencies of words from ham messages (note: using dictionaries is required for full credit on this assignment);
- print summary information and some word frequency information about the data. It is up to you to decide exactly what to print.

Summary information might include the number of spam and non-spam messages, the total number of different words in spam and non-spam messages, the total number of words in each, and anything else that might be interesting (does spam or non-spam have longer average word length?). Frequency information might be in the form of the top ten most used words in spam and in non-spam, along with a measure of their frequency. Is absolute count of occurrences a good measure? Or might it be better as a fraction/percentage of all occurrences in that type of message? Possibly also consider printing information about the most frequent words with more than one, two, or three letters; the results might be more enlightening. You could also, though it is not required, compare the results with the provided list of the 5000 most common English words (most common word first).

b. Write a couple of sentences or a short paragraph saying something about the results. Can you conclude something about spam vs. non-spam? Did you learn something? Put this answer as a comment at the top of your .py file. Thus your file should begin with the part (b) comment, followed by the definition of analyzeSMSes(inputFilename).
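The word-extraction step described in part (a) can be sketched as follows. This is only an illustration, not the instructor's intended solution: the helper name split_message is hypothetical, and the sample line is made up around the fragment quoted in the question. Splitting on any whitespace avoids assuming a particular separator between the label and the message text.

```python
import string

def split_message(line):
    """Return (label, words) for one line of the SMS file.

    Hypothetical helper: the label is taken to be the first
    whitespace-separated item, as the assignment describes.
    """
    parts = line.split(None, 1)          # label, then the rest of the line
    if not parts:
        return None, []                  # blank line: nothing to extract
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    words = []
    for token in text.split():
        # Strip punctuation from both ends and ignore capitalization.
        word = token.strip(string.punctuation).lower()
        if word:                         # skip tokens that were only punctuation
            words.append(word)
    return label, words

# Made-up sample line, for illustration only.
label, words = split_message("spam\tCongrats! Call 09061 209465 now.")
print(label, words)
```

Note that strip() only removes punctuation at the ends of a token, so an apostrophe inside a word such as "don't" is kept.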

Explanation / Answer

{
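import string

def analyzeSMSes(filename):
    # Build a histogram mapping each word to its number of occurrences.
    hist = dict()
    # Open in text mode (UTF-8 is assumed); the with-block closes the file.
    with open(filename, encoding='utf-8') as f:
        for line in f:
            process_line(line, hist)
    return hist

def process_line(line, hist):
    # Treat hyphens as separators so hyphenated words are split apart.
    line = line.replace('-', ' ')

    for word in line.split():
        # Strip surrounding punctuation and ignore capitalization.
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        if word:
            hist[word] = hist.get(word, 0) + 1

# Note: the leading spam/ham label on each line is counted as a word here.
hist = analyzeSMSes('SMSSpamCollection.txt')
print(hist)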
}
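The answer above builds a single histogram, whereas the question asks for two dictionaries (spam and ham) plus summary output. A minimal sketch of that extension follows. The function names count_by_label, top_words, and report are hypothetical, the labels are assumed to be the literal strings "spam" and "ham", and the sample lines are made up for illustration. What to print is left open by the assignment, so the report format here is just one possibility.

```python
import string

def count_by_label(lines):
    """Return (word_counts, message_counts), each keyed by label."""
    word_counts = {"spam": {}, "ham": {}}     # one dictionary per message type
    message_counts = {"spam": 0, "ham": 0}
    for line in lines:
        parts = line.split(None, 1)           # label, then message text
        if len(parts) < 2 or parts[0] not in word_counts:
            continue                          # skip blank or unlabeled lines
        label, text = parts
        message_counts[label] += 1
        hist = word_counts[label]
        for token in text.split():
            word = token.strip(string.punctuation).lower()
            if word:
                hist[word] = hist.get(word, 0) + 1
    return word_counts, message_counts

def top_words(hist, n=10, min_length=1):
    """Return up to n (word, count, fraction) tuples, most frequent first."""
    total = sum(hist.values())
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(hist.items(), key=lambda item: (-item[1], item[0]))
    return [(w, c, c / total) for w, c in ranked if len(w) >= min_length][:n]

def report(lines):
    """Print per-label summary information and the most frequent words."""
    word_counts, message_counts = count_by_label(lines)
    for label in ("spam", "ham"):
        hist = word_counts[label]
        print(label, "messages:", message_counts[label])
        print("  total words:", sum(hist.values()),
              " distinct words:", len(hist))
        for word, count, fraction in top_words(hist, n=10, min_length=4):
            print(f"  {word:<12} {count:>6} {fraction:.2%}")

# Made-up sample lines, for illustration only.
sample = [
    "spam\tFree prize! Call now, call free.",
    "ham\tSee you at lunch, call me later.",
    "ham\tOk see you",
]
report(sample)
```

To run it on the real data, pass an open file instead of the sample list, for example: with open("SMSSpamCollection.txt", encoding="utf-8") as f: report(f). The UTF-8 encoding is an assumption about the data file. Reporting each count as a fraction of all words in that message type makes spam and ham comparable even though the two classes contain different numbers of messages.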