
Question

Hello, I need to finish a program that compares multiple documents and returns the two that are most similar.  Below are the instructions / .java files.


***********************************Directions*******************************************************************************************

When indexing a large library of documents to permit later searching, it's pretty obvious how to support searching by title or author name. That information can be copied and pasted directly out of the document.

Indexing by subject area is considerably harder, because it depends on an appreciation of the overall content of the document. Traditionally, such information was provided by the authors (disadvantage: they may not be familiar with the various subject headings used by the library) or by human library staff who would read part of the document to assess its subject (disadvantage: inaccurate, particularly if the librarian is not well-versed in the subject area of a document).

Some systems attempt to automatically classify the subject of a document by comparing its vocabulary "signature" to that of documents already known to belong to various subject areas.

The "signature" can be defined in different ways. A common is to make a list of all words in the document, counting how often each word occurs. Throw out all words that are so common in the English language that they don't tell us anything useful about the subject area (e.g., "a", "the", "and") - such words can be found in various published "stop lists". Then throw out all words that are only used a few times (the exact number used for this threshold varies). The remaining set of words is the "signature".

The similarity of two documents can be measured as the size of the intersection of the two documents' signatures divided by the size of the union of their two signatures. This yields a similarity score between 0.0 and 1.0. (This is a somewhat oversimplified measure. In more realistic systems, words are weighted according to how commonly they are used in the language. For example, two documents whose signatures share the word "entanglement" would get a higher boost to their similarity score than two documents that share the word "green".)
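
In code, this measure is just the Jaccard index of the two signature sets. A minimal sketch of the idea (the class and method names here are illustrative, not part of the assignment files):

import java.util.HashSet;
import java.util.Set;

public class SignatureSimilarity {
    /** Jaccard-style measure: |sig1 intersect sig2| / |sig1 union sig2|. */
    public static double similarity(Set<String> sig1, Set<String> sig2) {
        Set<String> union = new HashSet<String>(sig1);
        union.addAll(sig2);
        if (union.isEmpty())
            return 0.0;                               // neither document has any signature words
        Set<String> intersection = new HashSet<String>(sig1);
        intersection.retainAll(sig2);                 // keep only words present in both signatures
        return (double) intersection.size() / union.size();
    }
}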

With this scheme for measuring the similarity of two documents, subjects can be assessed in a couple of ways. First, when a new document comes in, its similarity can be compared to the documents already in the library. The new document can be assigned the subject terms previously associated with the closest-matching (highest-similarity) documents already in the system. Second, one can look for "clusters" of similar documents that span existing subject areas as a suggestion that a new subject area designation is needed. A human librarian can examine a sample of documents in that cluster to determine if a new subject term is really needed and, if so, what that term might be.

You will be working on a program that reads several plain-text documents. The first such document is presumed to be a "new" document. Its similarity will be computed against each of the remaining documents, and the best-matching document from among the remainder will be selected.

Input to this program will be a collection of two or more text files. The file names will be supplied on the command line.

The program should list, on a single line, the name of the document most similar to the first one given, and the similarity score for those two documents.

Given the input files supplied with the assignment, running the program would produce output like the sample run shown at the end of this post.

Explanation / Answer

//Document.java

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Scanner;


/**
 * This document class tracks the words in a document, counting how often they occur,
 * but does not preserve or record the order in which the words occur. Extremely
 * common words (in the stoplist) are ignored and not counted.
 *
 * @author zeil
 *
 */
public class Document implements Iterable<String> {

    private HashMap<String, Integer> wordCounts;
    private String docName;

    private static String[] stopListWords = {
        "i", "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
        "he", "her", "his", "if", "in", "is", "it", "of", "on", "or", "she",
        "that", "the", "this", "to", "was", "what", "when", "where", "who",
        "will", "with"
    };

    private static HashSet<String> stoplist = new HashSet<String>(Arrays.asList(stopListWords));

    /**
     * Builds a document from its name and full text, counting every
     * significant (non-stop-list) word.
     *
     * @param documentName name used to identify this document
     * @param documentText full text of the document
     */
    public Document(String documentName, String documentText) {
        docName = documentName;
        wordCounts = new HashMap<String, Integer>();
        Scanner scanner = new Scanner(documentText);
        while (scanner.hasNext()) {
            String word = scanner.next();
            word = trimWord(word);
            if (word.length() > 0)
                addSignificantWord(word);
        }
        scanner.close();
    }

    /**
     * If this word is not extremely common, add it to the document.
     *
     * @param word the (already trimmed, lower-case) word to record
     */
    private void addSignificantWord(String word) {
        if (stoplist.contains(word)) return;
        int count = 1;
        if (wordCounts.containsKey(word))
            count += wordCounts.get(word);
        wordCounts.put(word, count);
    }

    /**
     * Removes any leading and trailing non-alphabetics and converts the
     * remainder to lower case.
     *
     * @param word raw token as read from the input
     * @return the trimmed, lower-case word (possibly empty)
     */
    private String trimWord(String word) {
        String trimmed1 = word.replaceAll("^[^A-Za-z]*", "");
        String trimmed2 = trimmed1.replaceAll("[^A-Za-z]*$", "");
        return trimmed2.toLowerCase();
    }

    /**
     * Returns the document name.
     *
     * @return the document name
     */
    public String getDocumentName() {
        return docName;
    }

    /**
     * Returns the number of times this word occurs in
     * the document (0 if the word does not occur in the
     * document at all).
     *
     * @param word word to look up
     * @return number of occurrences of the word
     */
    public int getWordCount(String word) {
        if (wordCounts.containsKey(word))
            return wordCounts.get(word);
        return 0;
    }

    /**
     * Provides access to all of the (significant) words found in the document.
     *
     * @return an iterator over the distinct words counted in this document
     */
    @Override
    public Iterator<String> iterator() {
        return wordCounts.keySet().iterator();
    }
}
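
As a quick sanity check, the class can be exercised as shown below. This is a hypothetical demo, not part of the assignment files; note that stop-list words such as "the" and "on" are never counted.

//DocumentDemo.java (illustrative only)

public class DocumentDemo {
    public static void main(String[] args) {
        Document doc = new Document("sample", "The cat sat on the mat; the cat purred.");
        System.out.println(doc.getWordCount("cat"));   // 2
        System.out.println(doc.getWordCount("the"));   // 0 -- stop-list word, never counted
        for (String word : doc)                        // remaining distinct words, in hash order
            System.out.println(word);
    }
}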



//Similarity.java

import java.util.HashSet;
import java.util.Set;

public class Similarity {

    private Document doc1;
    private Document doc2;
    private Set<String> indexTerms;

    /**
     * Prepares a similarity measure between the two given documents.
     * The index terms are computed lazily, on first use.
     */
    public Similarity(Document doc1, Document doc2) {
        this.doc1 = doc1;
        this.doc2 = doc2;
        indexTerms = null;
    }

    /**
     * The index terms for a similarity measure are the set of all words
     * that occur t or more times in one or both documents, where t is
     * the threshold supplied in an earlier getSimilarity call.
     *
     * If no such call has occurred, a value of t=1 is used.
     *
     * @return collection of sufficiently common words in the two documents
     */
    public Set<String> getIndexTerms() {
        if (indexTerms == null)
            computeIndexTerms(1);
        return indexTerms;
    }

    /**
     * Compute the set of index terms: the collection of words that occur
     * threshold or more times in one or both documents.
     *
     * @param threshold minimum number of occurrences for a word to count
     */
    private void computeIndexTerms(int threshold) {
        indexTerms = new HashSet<String>();
        for (String word : doc1)
            if (doc1.getWordCount(word) >= threshold)
                indexTerms.add(word);
        for (String word : doc2)
            if (doc2.getWordCount(word) >= threshold)
                indexTerms.add(word);
    }

    /**
     * Computes the similarity between two documents. The similarity is computed by
     * 1) Ignoring all words that occur fewer than threshold times in both documents
     * 2) Counting the number of words occurring at least threshold times in both documents
     * 3) Dividing that count by the number of words occurring at least threshold
     *    times in either document
     * For example, if document 1 is "A penny saved is a penny earned." and document 2 is
     * "What is that? It is a penny.", then (ignoring the stop list for the sake of the example)
     * the similarity at threshold 2 would be 0.0 (the index terms are "a", "penny", and "is").
     * At threshold 1, the similarity is 3/8 = 0.375.
     *
     * @param threshold minimum number of occurrences for a word to count as an index term
     * @return similarity measure, a number between 0.0 and 1.0, or 0.0 if neither document
     *         contains any words occurring at least threshold times.
     */
    public double getSimilarity(int threshold) {
        computeIndexTerms(threshold);
        int totalCount = indexTerms.size();
        if (totalCount > 0) {
            int similarWordCounts = 0;
            for (String word : indexTerms)
                if (doc1.getWordCount(word) >= threshold && doc2.getWordCount(word) >= threshold)
                    ++similarWordCounts;
            return (double) similarWordCounts / totalCount;
        }
        else
            return 0.0;
    }
}
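
Here is a quick way to exercise the two classes together, using the example from the getSimilarity comment. This is a hypothetical demo class, not part of the assignment files; because Document applies the stop list, words like "a" and "is" are never counted, so the actual scores differ from the no-stop-list numbers used in that comment.

//SimilarityDemo.java (illustrative only)

public class SimilarityDemo {
    public static void main(String[] args) {
        Document d1 = new Document("doc1", "A penny saved is a penny earned.");
        Document d2 = new Document("doc2", "What is that? It is a penny.");
        Similarity sim = new Similarity(d1, d2);
        // Index terms at threshold 1 are {penny, saved, earned}; only "penny" appears in both.
        System.out.println(sim.getSimilarity(1));   // 0.3333...
        // At threshold 2, only "penny" (in doc1) qualifies, and it occurs just once in doc2.
        System.out.println(sim.getSimilarity(2));   // 0.0
    }
}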


// DocumentComparisons.java is supplied with the assignment and should be left untouched.
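
For context only, a driver along these general lines would read each file named on the command line, build a Document for it, and keep the best-scoring comparison against the first document. This is a hypothetical sketch, not the supplied DocumentComparisons.java; in particular, the threshold of 2 passed to getSimilarity is an assumption.

//DocumentComparisonsSketch.java (hypothetical, for illustration only)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DocumentComparisonsSketch {
    public static void main(String[] args) throws IOException {
        // The first file named on the command line is the "new" document.
        Document newDoc = new Document(args[0],
                new String(Files.readAllBytes(Paths.get(args[0]))));

        String bestName = null;
        double bestScore = -1.0;
        // Compare the new document against every remaining document, keeping the best score.
        for (int i = 1; i < args.length; ++i) {
            Document other = new Document(args[i],
                    new String(Files.readAllBytes(Paths.get(args[i]))));
            double score = new Similarity(newDoc, other).getSimilarity(2); // threshold is an assumption
            if (score > bestScore) {
                bestScore = score;
                bestName = args[i];
            }
        }
        System.out.printf("Best match is %s with a similarity of %.2f%n", bestName, bestScore);
    }
}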


Checked against the supplied sample files:

java DocumentComparisons pg10520.txt pg14735.txt pg26209.txt pg5669.txt
Best match is pg26209.txt with a similarity of 0.30
