For this assignment, you need to use Python to compute the term-frequency matrix
Question
For this assignment, you need to use Python to compute the term-frequency matrix for a set of documents.
A term-frequency matrix is a table in which rows represent documents and columns represent terms (words). The value in cell (i,j) is the number of times that word j occurs in document i.
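As a small illustration of the definition above, here is a sketch that builds the matrix for two toy documents held in memory. The documents are made up for the example, and the column order (sorted terms) is an assumption; the assignment does not prescribe one.

```python
from collections import Counter

# Two toy documents (made up for illustration, not from the assignment data).
docs = ["good movie good plot", "bad plot"]

# Fixed column order: the sorted set of all words across the documents.
terms = sorted({w for d in docs for w in d.split()})

# matrix[i][j] = number of times terms[j] occurs in docs[i].
matrix = [[Counter(d.split())[t] for t in terms] for d in docs]

print(terms)   # ['bad', 'good', 'movie', 'plot']
print(matrix)  # [[0, 2, 1, 1], [1, 0, 0, 1]]
```

Row 0 shows that "good" occurs twice in the first document, and the zeros mark terms that a document does not contain.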
To do this, your Python program first needs to go through the files in the input folder, where each file is a separate document (thus, the number of documents is the number of files), and build a set of all unique terms across all the documents.
Let's call this list of terms T, which contains n terms.
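One possible way to build T is sketched below. The function name build_vocabulary, the lowercasing, the whitespace tokenization and the UTF-8 decoding are all assumptions made for the example; the assignment does not specify them. The demo folder and its two files are created only so the sketch can run on its own.

```python
import os
import tempfile

def build_vocabulary(folder):
    """Return the sorted list of unique lowercased words found in the files of `folder`."""
    terms = set()
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)  # os.listdir returns bare names
        if not os.path.isfile(full_path):
            continue  # skip sub-folders
        with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
            terms.update(f.read().lower().split())
    return sorted(terms)

# Demo on a throwaway folder holding two tiny made-up reviews.
demo_folder = tempfile.mkdtemp()
for name, text in [("r1.txt", "Great plot"), ("r2.txt", "great cast")]:
    with open(os.path.join(demo_folder, name), "w", encoding="utf-8") as f:
        f.write(text)

T = build_vocabulary(demo_folder)
print(T)  # ['cast', 'great', 'plot']
```

Sorting T gives every document the same column order, which matters once the rows are written out without a header.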
Then you'll need to go through each file/document and compute the number of times that each of the n words occurs in that document. Doing this for every document produces the term-frequency matrix, one row per document.
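The per-document counting step could look like the following sketch. The helper name document_row is hypothetical, and the lowercasing and whitespace splitting are assumptions that must match however T was built.

```python
from collections import Counter

def document_row(text, terms):
    """Return the count of each term in `terms` (in order) for one document's text."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words it has never seen, so absent terms need no special case.
    return [counts[t] for t in terms]

terms = ["bad", "good", "plot"]
row = document_row("Good plot good cast", terms)
print(row)  # [0, 2, 1]
```

Words that are in the document but not in `terms` (here "cast") are simply ignored, so the row always has exactly n entries.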
The program should save this matrix in a file, where each row of the matrix appears on a separate line and the term occurrence frequencies in a row are separated by commas.
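A minimal sketch of the saving step, using the standard csv module. The function name save_matrix and the output file name matrix.csv are placeholders chosen for the example, and the temporary folder exists only so the sketch can run anywhere.

```python
import csv
import os
import tempfile

def save_matrix(matrix, out_path):
    """Write one matrix row per line, with values separated by commas."""
    with open(out_path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(matrix)

matrix = [[0, 2, 1], [1, 0, 1]]
out_path = os.path.join(tempfile.mkdtemp(), "matrix.csv")  # placeholder location
save_matrix(matrix, out_path)

with open(out_path) as f:
    saved = f.read()
print(saved)
```

For this input the file contains the two lines 0,2,1 and 1,0,1.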
The folder with the documents, representing movie reviews, is included in the assignment.
Here are the stop words:
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'
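The question lists the stop words without saying how to use them. A common reading, assumed here, is that they are dropped before the set of terms is built. The sketch below uses an abbreviated subset of the list, and the helper name remove_stop_words is hypothetical.

```python
# Abbreviated for the example; in practice use the full list given above.
STOP_WORDS = {"i", "me", "my", "the", "a", "is", "was"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not stop words, comparing case-insensitively."""
    return [w for w in words if w.lower() not in stop_words]

filtered = remove_stop_words("The movie was a delight".split())
print(filtered)  # ['movie', 'delight']
```

Storing the stop words in a set keeps each membership test constant-time, which helps when the reviews are long.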
Explanation / Answer
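import os
from collections import Counter

def wordCount(fileName):
    # Count how many times each whitespace-separated word occurs in one file.
    with open(fileName, "r") as file:
        return Counter(file.read().split())

# give folder path here
path = "folderpath"
# give output file path here (placeholder name, not specified in the question)
outputPath = "matrix.csv"
# fill with the stop-word list from the question to exclude those words
stopWords = set()

# One Counter per document, in sorted file-name order so the rows are reproducible.
counts = []
for name in sorted(os.listdir(path)):
    fullName = os.path.join(path, name)  # os.listdir returns bare names, not paths
    if os.path.isfile(fullName):
        counts.append(wordCount(fullName))

# T: all unique terms across the documents, in a fixed (sorted) column order.
allKeys = sorted(set().union(*counts) - stopWords)

# Row i holds the count of every term in document i; a missing term counts as 0.
m = [[c[key] for key in allKeys] for c in counts]

# Save the matrix: one document per line, frequencies separated by commas.
with open(outputPath, "w") as out:
    for row in m:
        out.write(",".join(str(v) for v in row) + "\n")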