Create a program in C++ that compares text files and calculates their similarity
Question
Create a program in C++ that compares text files and calculates their similarity by word. The user will be prompted to list as many text files as they like (3, 18, 2345, etc.), all separated by spaces. These files will then be compared against each other by the words they use, ignoring how many times each word appears.
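The prompt step described above can be sketched as follows. This is a minimal sketch, assuming the whole list is typed on one line and that file names contain no spaces; split_file_names is a hypothetical helper name, not something the assignment prescribes.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one line of user input into whitespace-separated file names.
// Runs of several spaces are treated as a single separator.
std::vector<std::string> split_file_names(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> names;
    std::string name;
    while (in >> name) {
        names.push_back(name);
    }
    return names;
}
```

The line itself could be read with std::getline(std::cin, line) before being passed to this function.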
Every file should be compared against every other file exactly once, with an arbitrary number of files.
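One way to compare every file against every other file exactly once is to visit only index pairs (i, j) with i < j, which gives n(n-1)/2 comparisons for n files. The sketch below is illustrative; all_pairs is a hypothetical name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Enumerate every unordered pair of indices (i, j) with i < j.
// Starting the inner loop at i + 1 skips self-comparisons and
// prevents the pair (j, i) from being visited a second time.
std::vector<std::pair<std::size_t, std::size_t>> all_pairs(std::size_t n) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}
```

In a full program the loop body would run the comparison directly instead of storing the pairs; the vector is used here only to make the pattern easy to inspect.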
When two files are compared, all the words they had in common (intersection) should be documented. All words that were present in either file (union) should be documented. In documenting the union and intersection, the word frequency does not matter. A word appearing once or a hundred times between two files, for example, should only be documented once.
The similarity rating is the number of words in the intersection (words common to both files) expressed as a percentage of the number of words in the union (words appearing in either file).
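The rating described here is the ratio |intersection| / |union| scaled to a percentage, often called the Jaccard index. A minimal sketch using the standard set algorithms is shown below; the names StringSet, union_of, intersection_of and similarity_percent are hypothetical, and returning 0 for two empty files is an assumption, since the assignment does not cover that case.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

using StringSet = std::set<std::string>;

// Words appearing in either set.
StringSet union_of(const StringSet& a, const StringSet& b) {
    StringSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}

// Words appearing in both sets.
StringSet intersection_of(const StringSet& a, const StringSet& b) {
    StringSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}

// Size of the intersection as a percentage of the size of the union.
double similarity_percent(const StringSet& a, const StringSet& b) {
    const StringSet all = union_of(a, b);
    if (all.empty()) {
        return 0.0;  // assumption: two empty files score 0
    }
    const StringSet common = intersection_of(a, b);
    return 100.0 * static_cast<double>(common.size()) /
           static_cast<double>(all.size());
}
```

For example, the sets {apple, banana, cherry} and {banana, cherry, date} share 2 words out of 4 distinct words, so the score is 50.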
For each pairwise comparison, a file should be created and saved with the name [FileName]_[OtherFileName]_Sim_Score_[PERCENT].txt where [PERCENT] is the similarity score, rounded to the nearest integer, [FileName] is the name of one file without any .’s, and [OtherFileName] is the name of the other file without any .’s.
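The naming rule above can be sketched as follows. This sketch reads "without any .’s" literally, so only the dot characters are removed and "a.txt" becomes "atxt"; that reading is an assumption, and strip_dots and output_file_name are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// Remove every '.' from a file name (erase-remove idiom).
std::string strip_dots(std::string name) {
    name.erase(std::remove(name.begin(), name.end(), '.'), name.end());
    return name;
}

// Build [FileName]_[OtherFileName]_Sim_Score_[PERCENT].txt,
// rounding the score to the nearest integer with std::lround.
std::string output_file_name(const std::string& first,
                             const std::string& second,
                             double percent) {
    const long rounded = std::lround(percent);
    return strip_dots(first) + "_" + strip_dots(second) +
           "_Sim_Score_" + std::to_string(rounded) + ".txt";
}
```

Because the result is a bare file name with no directory part, opening it with std::ofstream writes to the program's current working directory.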
Each saved file should be formatted as follows:
Similarity Score: [SIMILARITY SCORE]
Union: [ALL WORDS BETWEEN TWO FILES LISTED ALPHABETICALLY WITH NUMBERS COMING FIRST, SEPARATED BY SPACES]
Intersection: [ALL WORDS THAT ARE COMMON BETWEEN THE TWO FILES LISTED ALPHABETICALLY, SEPARATED BY SPACES]
The requirements:
1. You must overload at least 1 operator, but you may overload more. Hint: if you do overload operator+, the return type need not be a reference.
2. You may use std::vector and std::string, but you must also use at least one other data structure. There is one that is very natural for this setting.
3. You must assume the user’s files will be in the same folder as the .cpp or .exe file, and the files must be saved to the same folder.
4. You may assume the files will not use digits 0-9 nor will they use any punctuation marks.
5. The words listed in the files you produce must be listed alphabetically.
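Requirements 1, 2 and 5 fit together if the words are kept in a std::set, which stores them in sorted order, and the set operations are exposed as overloaded operators. The following is one possible sketch, not the required design: the WordSet wrapper and the choice of operator+ for union and operator* for intersection are assumptions made for illustration.

```cpp
#include <set>
#include <string>

// Thin wrapper around std::set so that operators can be overloaded
// on a user-defined type.
struct WordSet {
    std::set<std::string> words;
};

// Union. The result is a new object, so it is returned by value,
// which is why the return type need not be a reference.
WordSet operator+(const WordSet& a, const WordSet& b) {
    WordSet out = a;
    out.words.insert(b.words.begin(), b.words.end());
    return out;
}

// Intersection: keep the words of a that also occur in b.
WordSet operator*(const WordSet& a, const WordSet& b) {
    WordSet out;
    for (const std::string& word : a.words) {
        if (b.words.count(word) != 0) {
            out.words.insert(word);
        }
    }
    return out;
}
```

With these in place, the two sets needed for one comparison can be written as a + b and a * b.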
Explanation / Answer
#include <fstream>
#include <iterator>
#include <string>

// Returns the percentage of byte positions at which the two files hold the same byte.
// This is a positional, byte-by-byte comparison; it does not build word sets.
double percentage_similar(const std::string &filename_one, const std::string &filename_two)
{
    std::ifstream file_one(filename_one.c_str(), std::ios_base::binary);
    std::ifstream file_two(filename_two.c_str(), std::ios_base::binary);

    std::istreambuf_iterator<char> file_one_iter(file_one);
    std::istreambuf_iterator<char> file_two_iter(file_two);
    std::istreambuf_iterator<char> file_end;

    unsigned int are_similar = 0;
    unsigned int are_different = 0;

    while (file_one_iter != file_end || file_two_iter != file_end) {
        if (file_one_iter == file_end) {
            // First file is shorter: every remaining byte of the second file counts as different.
            ++are_different;
            ++file_two_iter;
        } else if (file_two_iter == file_end) {
            // Second file is shorter: every remaining byte of the first file counts as different.
            ++are_different;
            ++file_one_iter;
        } else if (*file_one_iter++ == *file_two_iter++) {
            ++are_similar;
        } else {
            ++are_different;
        }
    }

    // Both files empty or unreadable: avoid dividing by zero.
    if (are_similar + are_different == 0) {
        return 0.0;
    }
    return static_cast<double>(are_similar) * 100 / (are_similar + are_different);
}