
N-grams


Question

An N-gram is a sequence of N consecutive characters from a given word. Example: the word "height" contains four 3-grams: "hei", "eig", "igh", and "ght".

Write a C function with the following specification that reads from a file and determines the most frequent N-grams:

void most_freq_ngram(const char* filename, int ngram_len, char* ngram[], int* ngram_num)

where:

filename is the name of the text file containing the words.
ngram_len is the length of the N-gram; valid values are 2, 3, 4, and 5.
ngram is an array of character strings holding the most frequent N-grams.
ngram_num is the number of N-grams returned.

For a given text and an N-gram length, most_freq_ngram shall find the most frequent N-gram among all the words in the text. A word of the text is any consecutive sequence of alphanumeric characters, i.e., one containing only English letters a-z, A-Z or digits 0-9. If several N-grams share the highest number of occurrences, the function shall return all of them. All comparisons between portions of text shall be case-insensitive, and the returned N-grams shall be lowercase.

Use a data structure, such as a linked list or a binary tree, for storing the N-grams. The data structure code should be implemented as a separate module with a header file. Extra points will be given for implementations that successfully use a sorted data structure.
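To make the definition concrete, here is a minimal C sketch of how the N-grams of a single word could be enumerated. The helper names ngram_count and ngram_at are hypothetical (they are not part of the assignment), and the caller is assumed to supply an output buffer of at least n + 1 bytes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a word of length len contains len - n + 1 N-grams,
   or none at all if it is shorter than n. */
size_t ngram_count(const char *word, size_t n)
{
    assert(word != NULL);
    size_t len = strlen(word);
    return (n > 0 && len >= n) ? len - n + 1 : 0;
}

/* Hypothetical helper: copy the N-gram that starts at index i of word into
   out, lowercased and NUL-terminated. out must hold at least n + 1 bytes.
   Returns 1 on success, 0 if i is out of range. */
int ngram_at(const char *word, size_t n, size_t i, char *out)
{
    assert(word != NULL && out != NULL);
    if (i >= ngram_count(word, n))
        return 0;
    for (size_t k = 0; k < n; k++)
        out[k] = (char)tolower((unsigned char)word[i + k]);
    out[n] = '\0';
    return 1;
}
```

For "height" and n = 3, ngram_count gives 4, and ngram_at with i = 0..3 yields "hei", "eig", "igh", and "ght", matching the example in the question.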

Explanation / Answer

ANSWER:

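// C++ excerpt (MFC-style types BYTE and LPCSTR). It builds a dictionary of
// word patterns in a binary tree and records how often each one occurs.
// NOTE: m_pDes and nDesLen are not declared in this excerpt; they are presumably
// a preprocessed copy of pSrc and its length. lpcsDelimiters is unused here.
void CPatternAlaysis::ConstructPatterns(BYTE *pSrc, int nSrcLen,
                                        LPCSTR lpcsDelimiters /*= NULL*/,
                                        int nMinPatternWords /*= 2*/,
                                        bool bFixedNGram /*= false*/)
{
    // Start from an empty dictionary.
    m_alpDic.RemoveAll();
    CBinaryTreeNode<CPattern, int>* pNode = m_alpDic.Root;
    int nPrevLength;  // presumably filled in by GetPatternLength as an output parameter
    CPattern node(m_pDes, GetPatternLength(
        m_pDes, nPrevLength, nMinPatternWords));
    // Walk the buffer one pattern at a time.
    while (node.m_pBuffer < m_pDes + nDesLen)
    {
        // Insert the pattern; Insert appears to bump Count if the key already exists.
        pNode = m_alpDic.Insert(&node, -1, pNode);
        pNode->Key.m_nFrequency = pNode->Count;
        if (bFixedNGram == false && pNode->Count > 1)
            // Repeated pattern: extend it by one more word and try again.
            node.m_nLength += AddWordToPattern(node.m_pBuffer + node.m_nLength);
        else
        {
            // New pattern (or fixed N-gram mode): advance to the next start position.
            node.m_pBuffer += nPrevLength;
            node.m_nLength = GetPatternLength(node.m_pBuffer,
                                              nPrevLength, nMinPatternWords);
            // Restart the next search from the tree root.
            pNode = m_alpDic.Root;
        }
    }
}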