Question
Task 2 – Building a Markov model of a text sample (MarkovModel)
For this task, you will need to complete the MarkovModel class, which generates a Markov model from an input string, and also write a JUnit test for your model. Markov models are probabilistic models (i.e., they model the chances of particular events occurring) and are used for a broad range of natural language processing tasks (including computer speech recognition). They are widely used to model all sorts of dynamical processes in engineering, mathematics, finance and many other areas.
They can be used to estimate the probability that a symbol will appear in a source of text, given the symbols that have preceded it.
A zero-th order Markov model of a text source is one that estimates the probability that the next character in a sequence is, say, an “a”, based simply on how frequently it occurs in a sample. Higher-order Markov models generalize this idea. Based on a sample text, they estimate the likelihood that a particular symbol will occur in a sequence of symbols drawn from a source of text, where the probability of each symbol occurring can depend upon preceding symbols. In a first order Markov model, the probability of a symbol occurring depends only on the previous symbol. Thus, for English text, the probability of encountering a “u” can depend on whether the previous letter was a “q”. If it was indeed a “q”, then the probability of encountering a “u” should be quite high. For a second order Markov model, the probability of encountering a particular symbol can depend on the previous two symbols; and generally, the probabilities used by a k-th order Markov model can depend on the preceding k symbols.
A Markov model can be used to estimate the probability of a symbol appearing, given its k predecessors, in a simple way, as follows:
For each context of characters of length k, we estimate the probability of that context being followed by each letter c in our alphabet as the number of times the context appears followed by c, divided by the number of times the context appears in total. As with our NgramAnalyser class, we consider our input string to “wrap round” when analysing contexts near its end. Call this way of estimating probabilities “simple estimation”.
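The “wrap round” convention can be sketched as follows. This is a minimal stand-alone illustration, not the required NgramAnalyser implementation; the class and method names (`WrapAroundCounts`, `ngramCounts`) are purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class WrapAroundCounts {
    /** Count the n-grams of length n in s, treating s as circular ("wrap-around"). */
    static Map<String, Integer> ngramCounts(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // Appending the first n-1 characters lets every start position yield a full n-gram.
        String doubled = s + s.substring(0, n - 1);
        for (int i = 0; i < s.length(); i++) {
            counts.merge(doubled.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(ngramCounts("aabcabaacaac", 2).get("aa")); // prints 3
        System.out.println(ngramCounts("aabcabaacaac", 3).get("aac")); // prints 2
    }
}
```

These counts agree with the frequency table given at the end of this question.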
For instance, consider the string “aabcabaacaac”. Its 2-grams and 3-grams (assuming “wrap-around”) are listed in the frequency table at the end of this question.
Given the context "aa", we can simply estimate the probability that the next character is "b" as:
P(next character is “b” | last 2 were “aa”) = (number of occurrences of “aab”) / (number of occurrences of “aa”) = 1/3
You will need to write code in the MarkovModel class that builds a k-th order Markov model from an input string. Complete the MarkovModel class as follows. You may wish to complete sub-task (e) before sub-task (d).
Fill in the @author and @version fields in the header comments with all authors’ names, student numbers and the date.
Complete the code for the constructor, which takes an integer k, and an input string s, and initializes any data structures needed for a k-th order Markov model.
Complete the double simpleEstimate(String sequence) method, which estimates the likelihood of the last character in a string appearing, given its predecessors, using the formula for simple estimation given above. Your code should catch and handle ArithmeticExceptions appropriately. The method should return 0 if the input sequence is not in the n-gram model.
The simpleEstimate method will not provide useful estimates if, for instance, the number of occurrences of the n-gram we are looking for is zero. For the double laplaceEstimate(String sequence) method, we will handle such cases by applying Laplace smoothing, which ensures that frequencies of zero will not cause a problem.
Let n_{p,c} represent the number of times a context p is followed by a character c.
Let n_p represent the number of times a context p appears in total.
The Laplace-smoothed estimate of the probability of c following p is
(n_{p,c} + 1) / (n_p + S),
where S is the size of the alphabet being used.
Implement the double laplaceEstimate(String sequence) method using Laplace smoothing.
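As a quick check of the formula on the sample string, here is a minimal arithmetic sketch; the `LaplaceDemo` class and `laplace` helper are illustrative only, not part of the required MarkovModel API:

```java
public class LaplaceDemo {
    /** Laplace-smoothed estimate (n_pc + 1) / (n_p + S) from the formula above. */
    static double laplace(int npc, int np, int alphabetSize) {
        return (npc + 1.0) / (np + alphabetSize);
    }

    public static void main(String[] args) {
        // For "aabcabaacaac": context "aa" occurs 3 times, and the alphabet {a,b,c} has S = 3.
        System.out.println(laplace(2, 3, 3)); // "aa" followed by 'c' twice: (2+1)/(3+3) = 0.5
        System.out.println(laplace(1, 3, 3)); // "aa" followed by 'b' once:  (1+1)/(3+3) = 1/3
        System.out.println(laplace(0, 3, 3)); // "aaa" never occurs:         (0+1)/(3+3) = 1/6
    }
}
```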
Given the value k = 2, and the input string “aabcabaacaac”, the following code fragment should construct a Markov model and then print estimates of particular length-3 sequences appearing:
The expected output in this case is the text
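Neither the code fragment nor its output is reproduced in this copy of the question, but the expected values can be recomputed from the formulas above. The following self-contained sketch (all names are illustrative; the real fragment would use MarkovModel directly) reproduces the three Laplace estimates that the tests in sub-task (e) check:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the missing fragment: recomputes the Laplace
 *  estimates for "aac", "aaa" and "aab" directly from wrap-around counts. */
public class DriverSketch {
    /** Wrap-around n-gram counts of s, as in Task 1. */
    static Map<String, Integer> counts(String s, int n) {
        Map<String, Integer> m = new HashMap<>();
        String d = s + s.substring(0, n - 1); // wrap-around
        for (int i = 0; i < s.length(); i++) {
            m.merge(d.substring(i, i + n), 1, Integer::sum);
        }
        return m;
    }

    public static void main(String[] args) {
        String s = "aabcabaacaac";
        int k = 2, alphabetSize = 3; // alphabet is {a, b, c}
        Map<String, Integer> grams = counts(s, k);
        Map<String, Integer> grams1 = counts(s, k + 1);
        for (String seq : new String[] {"aac", "aaa", "aab"}) {
            int npc = grams1.getOrDefault(seq, 0);          // times context is followed by c
            int np = grams.getOrDefault(seq.substring(0, k), 0); // times context appears
            // Estimates come out as 0.5, 1/6 and 1/3 respectively.
            System.out.println(seq + ": " + (npc + 1.0) / (np + alphabetSize));
        }
    }
}
```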
In the testSimpleExample and testLaplaceExample methods of the ProjectTest class, implement tests which confirm that, when a Markov model is constructed with k = 2 and an input string of “aabcabaacaac”, the results of calling laplaceEstimate("aac"), laplaceEstimate("aaa") and laplaceEstimate("aab") on the model are within 0.0001 of the values given in the sample output for sub-task (d), and that the values returned by simpleEstimate are also correct.
Complete the String toString() method of the MarkovModel class. The returned string should consist of:
A line containing the parameter k used for the model
A line containing the alphabet size used for the model
The string created by concatenating the results of the toString() methods of the ngram and n1gram fields of the MarkovModel class.
Ensure you complete Javadoc comments for each method and class, including @param, @return and @throws fields as required.
MarkovModel Class
import java.util.Set;

/**
 * Construct a Markov model of order /k/ based on an input string.
 *
 * @author (your name)
 * @version (a version number or a date)
 */
public class MarkovModel
{
    /** Markov model order parameter */
    int k;
    /** ngram model of order k */
    NgramAnalyser ngram;
    /** ngram model of order k+1 */
    NgramAnalyser n1gram;

    /**
     * Construct an order-k Markov model from string s
     * @param k int order of the Markov model
     * @param s String input to be modelled
     */
    public MarkovModel(int k, String s)
    {
        //TODO replace this line with your code
    }

    /**
     * @return order of this Markov model
     */
    public int getK()
    {
        return k;
    }

    /** Estimate the probability of a sequence appearing in the text
     * using simple estimate of freq seq / frequency front(seq).
     * @param sequence String of length k+1
     * @return double probability of the last letter occurring in the
     * context of the first ones, or 0 if front(seq) does not occur.
     */
    public double simpleEstimate(String sequence)
    {
        //TODO replace this line with your code
        return -1.0;
    }

    /**
     * Calculate the Laplace-smoothed probability of a sequence given this Markov model
     * @param sequence String of length k+1
     * @return double the Laplace-smoothed probability estimate
     */
    public double laplaceEstimate(String sequence)
    {
        //TODO replace this line with your code
        return -1.0;
    }

    /**
     * @return String representing this Markov model
     */
    public String toString()
    {
        //TODO replace this line with your code
        return null;
    }
}
Test Cases:
@Test(timeout=1000)
public void testLaplaceExample() {
    assertEquals(0,1); //TODO replace with test code
}

@Test(timeout=1000)
public void testSimpleExample() {
    assertEquals(0,1); //TODO replace with test code
}
Link to Task 1: https://www.chegg.com/homework-help/questions-and-answers/task-1-analysing-n-grams-sample-text-ngramanalyser-task-need-complete-ngramanalyser-class--q22251885
2-gram frequencies:
"aa" 3, "ab" 2, "ac" 2, "ba" 1, "bc" 1, "ca" 3
3-gram frequencies:
"aab" 1, "aac" 2, "aba" 1, "abc" 1, "aca" 2, "baa" 1, "bca" 1, "caa" 2, "cab" 1

Explanation / Answer
import java.util.Set;

public class MarkovModel
{
    /** Markov model order parameter */
    int k;
    /** ngram model of order k */
    NgramAnalyser ngram;
    /** ngram model of order k+1 */
    NgramAnalyser n1gram;

    /**
     * Construct an order-k Markov model from string s
     * @param k int order of the Markov model
     * @param s String input to be modelled
     */
    public MarkovModel(int k, String s)
    {
        this.k = k; // store the order; without this, getK() would return the default 0
        ngram = new NgramAnalyser(k, s);
        n1gram = new NgramAnalyser(k + 1, s);
    }

    /**
     * @return order of this Markov model
     */
    public int getK()
    {
        return k;
    }

    /** Estimate the probability of a sequence appearing in the text
     * using simple estimate of freq seq / frequency front(seq).
     * @param sequence String of length k+1
     * @return double probability of the last letter occurring in the
     * context of the first ones, or 0 if front(seq) does not occur.
     */
    public double simpleEstimate(String sequence)
    {
        String context = sequence.substring(0, sequence.length() - 1);
        if (!ngram.getDistinctNgrams().contains(context)) {
            return 0.0; // front(seq) never appears, so no estimate is possible
        }
        double n1g = n1gram.getNgramFrequency(sequence);
        double ng = ngram.getNgramFrequency(context);
        try {
            return n1g / ng;
        } catch (ArithmeticException e) {
            // Double division never actually throws (it yields NaN/Infinity instead),
            // but the handler is kept because the task asks for it explicitly.
            return 0.0;
        }
    }

    /**
     * Calculate the Laplace-smoothed probability of a sequence given this Markov model
     * @param sequence String of length k+1
     * @return double the Laplace-smoothed probability estimate
     */
    public double laplaceEstimate(String sequence)
    {
        String context = sequence.substring(0, sequence.length() - 1);
        double npc = n1gram.getNgramFrequency(sequence); // times context is followed by c
        double np = ngram.getNgramFrequency(context);    // times context appears in total
        return (npc + 1) / (np + ngram.getAlphabetSize());
    }

    /**
     * @return String representing this Markov model
     */
    public String toString()
    {
        StringBuilder sb = new StringBuilder();
        sb.append(getK()).append("\n");                  // a line containing the order k
        sb.append(ngram.getAlphabetSize()).append("\n"); // a line containing the alphabet size
        sb.append(ngram.toString()).append(n1gram.toString());
        return sb.toString();
    }
}
ProjectTest.java
import static org.junit.Assert.*;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ProjectTest
{
    /**
     * Default constructor for test class ProjectTest
     */
    public ProjectTest()
    {
    }

    /**
     * Sets up the test fixture.
     *
     * Called before every test case method.
     */
    @Before
    public void setUp()
    {
    }

    /**
     * Tears down the test fixture.
     *
     * Called after every test case method.
     */
    @After
    public void tearDown()
    {
    }

    //TODO add new test cases from here; include brief documentation

    @Test(timeout=1000)
    public void testSensibleToStringSize() {
        NgramAnalyser analyser = new NgramAnalyser("aabcabfaacfaaac");
        // Expect at least one line per distinct symbol, plus a header line.
        int minLines = analyser.getAlphabetSize() + 1;
        String[] lines = analyser.toString().split("\n");
        assertTrue(lines.length >= minLines);
    }

    @Test(timeout=1000)
    public void testGetDistinctNgrams() {
        assertEquals(0,1); //TODO replace with test code
    }

    @Test(timeout=1000)
    public void testLaplaceExample() {
        MarkovModel mkMdl = new MarkovModel(2, "aabcabaacaac");
        // Expected values from sub-task (d): (2+1)/(3+3), (1+1)/(3+3), (0+1)/(3+3).
        assertEquals(0.5,       mkMdl.laplaceEstimate("aac"), 0.0001);
        assertEquals(1.0 / 3.0, mkMdl.laplaceEstimate("aab"), 0.0001);
        assertEquals(1.0 / 6.0, mkMdl.laplaceEstimate("aaa"), 0.0001);
    }

    @Test(timeout=1000)
    public void testSimpleExample() {
        MarkovModel mkMdl = new MarkovModel(2, "aabcabaacaac");
        // Compare with a tolerance rather than exact floating-point equality.
        assertEquals(1.0 / 3.0, mkMdl.simpleEstimate("aab"), 0.0001);
    }

    @Test
    public void testTask3example()
    {
        MarkovModel model = new MarkovModel(2, "aabcabaacaac");
        ModelMatcher match = new ModelMatcher(model, "aabbcaac");
        assertEquals(0,1); //TODO replace with test code
    }
}