Database Mangement Question : One criticism of the vector space model and its us
ID: 665091 • Letter: D
Question
Database Mangement Question :
One criticism of the vector space model and its use in sirnilarity checking is that it treats termss as occurring independently of each other. In practice, many words tend to occur together (e.g., ambulance and emergency). Write a program that scans an ASCII file and lists all pairs of words that occur within 5 words of each other. For each pair of words, you now have a frequency, and should be able to create a plot like Figure 27.3 with pairs of words on the X-axis. Run this program on some sample document collections. What do the results suggest about co-occurrences of words?
Q = (0.4, 0.8) Dl (0.8, 0.3) D2- (0.2, 0.7) 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 02 D1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 TERMB STOP WORDS Figure 27.4 Document Similarity Figure 27.3 Zipfian Distribution of Ternm FrequenciesExplanation / Answer
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// structure for each word of duplicate array
struct Word
{
char* str; // to store word itself
int index; // index of the word in the original array
};
// structure to represent duplicate array.
struct DupArray
{
struct Word* array; // Array of words
int size; // Size of array
};
// Create a DupArray object that contains an array of Words
struct DupArray* createDupArray(char* str[], int size)
{
// Allocate memory for dupArray and all members of it
struct DupArray* dupArray =
(struct DupArray*) malloc( sizeof(struct DupArray) );
dupArray->size = size;
dupArray->array =
(struct Word*) malloc( dupArray->size * sizeof(struct Word) );
// One by one copy words from the given wordArray to dupArray
int i;
for (i = 0; i < size; ++i)
{
dupArray->array[i].index = i;
dupArray->array[i].str = (char*) malloc( strlen(str[i]) + 1 );
strcpy(dupArray->array[i].str, str[i] );
}
return dupArray;
}
// Compare two characters. Used in qsort() for sorting an array of characters (Word)
int compChar(const void* a, const void* b)
{
return *(char*)a - *(char*)b;
}
// Compare two words. Used in qsort() for sorting an array of words
int compStr(const void* a, const void* b)
{
struct Word* a1 = (struct Word *)a;
struct Word* b1 = (struct Word *)b;
return strcmp(a1->str, b1->str);
}
// Given a list of words in wordArr[],
void printAnagramsTogether(char* wordArr[], int size)
{
// Step 1: Create a copy of all words present in given wordArr.
// The copy will also have orignal indexes of words
struct DupArray* dupArray = createDupArray(wordArr, size);
// Step 2: Iterate through all words in dupArray and sort individual words.
int i;
for (i = 0; i < size; ++i)
qsort(dupArray->array[i].str,
strlen(dupArray->array[i].str), sizeof(char), compChar);
// Step 3: Now sort the array of words in dupArray
qsort(dupArray->array, size, sizeof(dupArray->array[0]), compStr);
// Step 4: Now all words in dupArray are together, but these words are
// changed. Use the index member of word struct to get the corresponding
// original word
for (i = 0; i < size; ++i)
printf("%s ", wordArr[dupArray->array[i].index]);
}
// Driver program to test above functions
int main()
{
char* wordArr[] = {"cat", "dog", "tac", "god", "act"};
int size = sizeof(wordArr) / sizeof(wordArr[0]);
printAnagramsTogether(wordArr, size);
return 0;
}
Related Questions
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.