Worksheet: Document classification using Bernoulli Naive Bayes
Question
Define a dictionary comprising eight words: w1 = goal, w2 = tutor, w3 = variance, w4 = speed, w5 = drink, w6 = defense, w7 = performance, w8 = field. Consider a set of documents, each of which is related either to Sports (S) or to Informatics (I). A dictionary of 8 words implies that each document can be represented as a sequence of 8 binary elements f1, ..., f8 (called a binary vector). Each binary element ft indicates whether the corresponding word wt is present in the document or not. For example, the binary vector (0 0 1 0 1 1 0 0) indicates that the document contains the words w3, w5 and w6, because the binary elements f3, f5 and f6 are 1.

The data through which we estimate probabilities or make inferences is called "training" data. It is presented below as one matrix per category (or "class"), in which each row represents one document. For example, the matrix B_sport has 6 rows, which implies there are 6 documents in the class S(ports). How many documents are there in the class I(nformatics)?

B_sport:
1 0 0 0 1 1 1 1
0 0 1 0 1 1 0 0
0 1 0 1 0 1 1 0
1 0 0 1 0 1 0 1
1 0 0 0 1 0 1 1
0 0 1 1 0 0 1 1

B_informatics:
0 1 1 0 0 0 1 0
1 1 0 1 0 0 1 1
0 1 1 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 1 0 1 0 1 0

Problem: Classify the following documents, represented by binary vectors, into Sports or Informatics:
1. Document 1 = (1 0 0 1 1 1 0 1)
2. Document 2 = (0 1 1 0 1 0 1 0)

Algorithm: Given a training set of documents (each labeled with a class, S or I), we can estimate a Bernoulli document classification model as follows:

1. Count the following in the training set (represented by the matrices B_sport and B_informatics above):
- N, the total number of documents;
- Nk, the number of documents labeled with class C = k, for k = S, I (that is, the number of documents in class S or in class I);
- nk(wt), the number of documents of class C = k, for k = S, I, that contain the word wt (that is, the frequency of occurrence of word wt in sports/informatics documents).
How many wt's do we have? 8 (there are 8 words). For example, ns(w1), the number of documents in class S that contain the word w1, is 3.
2. Estimate p(wt | C = k) = nk(wt) / Nk (that is, estimate p(wt | S) and p(wt | I)).
3. Estimate p(C = k) = Nk / N (that is, estimate p(S) and p(I)).

To classify a new unlabeled document D_new represented by a binary vector b_new = (f1, ..., f8), compute the probability p(C = k | D_new) for each class k = S, I using the equation

p(C = k | D_new) ∝ p(C = k) × Π_{t=1..8} [ p(wt | C = k)^ft × (1 − p(wt | C = k))^(1 − ft) ]

and determine which of p(C = S | D_new) and p(C = I | D_new) is higher.
Explanation / Answer
Total number of documents: N = 6 (Sports) + 5 (Informatics) = 11, so there are 5 documents in the class Informatics.
Nsports = 6, Ninformatics = 5
Now, nk(wt) and p(wt | C = k) = nk(wt)/Nk:
For k = sports and t (words in the dictionary 1 to 8):
t nsports(wt) p (wt | C = sports) = nsports(wt)/Nsports
1 3 3/6
2 1 1/6
3 2 2/6
4 3 3/6
5 3 3/6
6 4 4/6
7 3 3/6
8 4 4/6
For k = informatics and t (words in the dictionary 1 to 8):
t ninformatics(wt) p (wt | C = informatics) = ninformatics(wt)/Ninformatics
1 1 1/5
2 3 3/5
3 3 3/5
4 1 1/5
5 1 1/5
6 1 1/5
7 3 3/5
8 1 1/5
Also, p(C = Sports) = 6/11 and p(C = Informatics) = 5/11
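Steps 1–3 of the algorithm can be sketched in a few lines of Python. This is an illustration, not part of the original worksheet; it assumes the five B_informatics rows as reconstructed from the worksheet's training matrices, and keeps the probabilities exact with the standard library's Fraction type.

```python
from fractions import Fraction

# B_informatics: one row per training document, one column per word w1..w8
B_informatics = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0],
]

# Step 1: count documents and word occurrences
N_inf = len(B_informatics)  # number of informatics documents
# n_inf[t] = number of informatics documents containing word w_{t+1}
n_inf = [sum(row[t] for row in B_informatics) for t in range(8)]

# Step 2: estimate p(wt | C = informatics) = n_inf(wt) / N_inf
p_inf = [Fraction(n, N_inf) for n in n_inf]

print(N_inf)     # 5
print(n_inf)     # [1, 3, 3, 1, 1, 1, 3, 1]
print(p_inf[1])  # 3/5
```

The column sums reproduce the ninformatics(wt) column of the table above; the same loop applied to B_sport (with N = 6) gives the sports table.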
Solution to the classification problem:
1. Document1 = (1 0 0 1 1 1 0 1)
P(Sports| Document1) = 6/11*3/6*5/6*4/6*3/6*3/6*4/6*3/6*4/6 = 0.0084
P(Informatics| Document1) = 5/11*1/5*2/5*2/5*1/5*1/5*1/5*2/5*1/5 = 0.000009
As P(Sports| Document1) > P(Informatics| Document1), this document belongs to Sports.
2. Document2 = (0 1 1 0 1 0 1 0)
P(Sports| Document2) = 6/11*3/6*1/6*2/6*3/6*3/6*2/6*3/6*2/6 = 0.0002
P(Informatics| Document2) = 5/11*4/5*3/5*3/5*4/5*1/5*4/5*3/5*4/5 = 0.0080
As P(Informatics| Document2) > P(Sports| Document2) , this document belongs to Informatics.
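As a check on the arithmetic above, here is a short Python sketch (not part of the original worksheet) that plugs the estimated probabilities and priors into the classification equation p(C = k | D_new) ∝ p(C = k) × Π_t p(wt | C = k)^ft × (1 − p(wt | C = k))^(1 − ft):

```python
from fractions import Fraction
F = Fraction

# Estimated Bernoulli parameters from the tables above
p_sport = [F(3, 6), F(1, 6), F(2, 6), F(3, 6), F(3, 6), F(4, 6), F(3, 6), F(4, 6)]
p_inf   = [F(1, 5), F(3, 5), F(3, 5), F(1, 5), F(1, 5), F(1, 5), F(3, 5), F(1, 5)]
priors  = {"sports": F(6, 11), "informatics": F(5, 11)}
params  = {"sports": p_sport, "informatics": p_inf}

def posterior(b, cls):
    """Unnormalised p(C = cls | b): prior times p(wt|C) for present words
    and (1 - p(wt|C)) for absent words."""
    p = priors[cls]
    for f, pw in zip(b, params[cls]):
        p *= pw if f == 1 else (1 - pw)
    return p

def classify(b):
    # Pick the class with the larger unnormalised posterior
    return max(params, key=lambda cls: posterior(b, cls))

doc1 = [1, 0, 0, 1, 1, 1, 0, 1]
doc2 = [0, 1, 1, 0, 1, 0, 1, 0]
print(classify(doc1), round(float(posterior(doc1, "sports")), 4))       # sports 0.0084
print(classify(doc2), round(float(posterior(doc2, "informatics")), 4))  # informatics 0.008
```

Exact fractions avoid any floating-point rounding in the products; note that for p(w7 | S) = 3/6 the factor is 3/6 whether f7 is 0 or 1.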