


Question

A question on Machine Learning

This question concerns spam filtering. Suppose each email is represented by a vector x = (x_1, ..., x_D) where x_i ∈ {0, 1}. Each entry of the vector indicates whether a particular symbol or word appears in the email. The symbols/words are:
money, cash, !!!, viagra, ..., etc.
so that, for example, x_2 = 1 if the word 'cash' appears in the email. The training dataset consists of a set of vectors along with the class label y, where y = 1 indicates the email is spam, and y = 0 not spam. Hence, the training set consists of a set of N pairs (x_n, y_n), n = 1, ..., N. Now answer the following:
(a) If 'viagra' never appears in the spam training data, discuss what effect this will have on the classification of a new email that contains the word 'viagra'.
(b) From our discussion in class regarding MLE and MAP, explain how you might counter this effect.
(c) Explain how a spammer might try to fool a naive Bayes spam filter. In particular, suppose the appearance of the word 'viagra' is strongly associated with the email being spam. Explain how a spammer can take advantage of this fact to fool a naive Bayes classifier.

Explanation / Answer

Answer a) If 'viagra' never appears in the spam training data, the maximum-likelihood estimate of P('viagra' appears | spam) is zero. Because naive Bayes multiplies the per-word likelihoods together, a new email containing 'viagra' is then assigned zero probability of being spam and gets classified as not spam, no matter how many other spam indicators it contains (excessive exclamation points or other punctuation, a suspicious subject line, words like 'cash'). This brittleness is exactly what you have likely seen in your own email: people figured out such filter rules and got around them, for example by modifying the spelling. And some spam-looking tokens, like the '!' in 'Yahoo!', are authentic, so you don't want to make your rule too simplistic.
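To make the effect concrete, here is a minimal sketch; the word list and probability estimates are invented for illustration and are not taken from the question.

# Sketch of the zero-count effect under MLE estimates for naive Bayes.
words = ["money", "cash", "!!!", "viagra"]

# Hypothetical estimates of P(word appears | class); 'viagra' was never
# seen in the spam training emails, so its spam likelihood is exactly 0.
p_word_given_spam = {"money": 0.6, "cash": 0.5, "!!!": 0.7, "viagra": 0.0}
p_word_given_ham  = {"money": 0.1, "cash": 0.1, "!!!": 0.05, "viagra": 0.01}

def class_score(x, p_word_given_class, prior):
    """Naive Bayes score: prior times the product of P(x_i | class) over words."""
    score = prior
    for word in words:
        p = p_word_given_class[word]
        score *= p if x[word] == 1 else (1.0 - p)
    return score

# An email that looks very spammy but also contains 'viagra'.
email = {"money": 1, "cash": 1, "!!!": 1, "viagra": 1}

spam_score = class_score(email, p_word_given_spam, prior=0.5)
ham_score  = class_score(email, p_word_given_ham,  prior=0.5)
print(spam_score, ham_score)  # spam_score is 0.0: the single zero wipes out all other evidence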

Answer b) The key fact from our discussion of MLE and MAP is that the posterior distribution is proportional to the product of the likelihood and the prior. When the prior takes the same value everywhere, as in the uniform distribution, the posterior is simply proportional to the likelihood, so MAP reduces to MLE and we get the zero estimate from part (a). To counter the effect, use a MAP estimate with a non-uniform prior on the word probabilities, e.g. a Beta/Dirichlet prior, which amounts to adding pseudo-counts (Laplace smoothing). Then a word that never appears in the spam training data still receives a small non-zero probability, and a single unseen word can no longer force the spam probability to zero.
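As a minimal sketch of the difference, here is add-alpha (Laplace) smoothing playing the role of the MAP estimate; the counts below are hypothetical.

def mle_estimate(word_count, class_count):
    """MLE: fraction of training emails of this class containing the word."""
    return word_count / class_count

def map_estimate(word_count, class_count, alpha=1.0):
    """Add-alpha smoothing (a MAP estimate under a Beta prior):
    unseen words get a small but non-zero probability."""
    return (word_count + alpha) / (class_count + 2 * alpha)

n_spam = 100          # hypothetical number of spam training emails
viagra_in_spam = 0    # 'viagra' never appears in the spam training data

print(mle_estimate(viagra_in_spam, n_spam))   # 0.0     -> zeroes the whole product
print(map_estimate(viagra_in_spam, n_spam))   # ~0.0098 -> small but non-zero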

Answer c) Recall how the filter is built: it looks at two large collections of email messages, one containing spam received by a site and the other containing non-spam received by the same site. In essence, the filter picks each message apart into individual words and, based on a comparison of how often a given word appears in spam messages as opposed to non-spam messages, calculates the probability that a message containing that word is spam. A spammer who knows this can attack exactly those word-level statistics.
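A rough sketch of that training step, using a tiny invented pair of collections:

from collections import Counter

spam_emails = ["win cash money !!!", "cash now !!!"]
ham_emails  = ["meeting notes attached", "money transfer for invoice"]

def word_presence_counts(emails):
    """Count, for each word, the number of emails it appears in (presence, not frequency)."""
    counts = Counter()
    for email in emails:
        for word in set(email.split()):
            counts[word] += 1
    return counts

spam_counts = word_presence_counts(spam_emails)
ham_counts  = word_presence_counts(ham_emails)

def p_spam_given_word(word, alpha=1.0):
    """Smoothed per-word estimate of how spam-indicative a word is (equal class priors assumed)."""
    p_w_spam = (spam_counts[word] + alpha) / (len(spam_emails) + 2 * alpha)
    p_w_ham  = (ham_counts[word]  + alpha) / (len(ham_emails)  + 2 * alpha)
    return p_w_spam / (p_w_spam + p_w_ham)

print(p_spam_given_word("cash"))     # 0.75  -> leans spam
print(p_spam_given_word("meeting"))  # ~0.33 -> leans non-spam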

More concretely, each email is represented by a binary vector whose jth entry is 1 or 0 depending on whether the jth word appears (a very long vector given how many words there are, so in practice we would store only the indices of the words that actually show up), and the model multiplies together the per-word probabilities given spam (or given ham) as if the words were independent. So if the appearance of 'viagra' is strongly associated with spam, the spammer can keep that feature from firing by obfuscating the spelling ('v1agra', 'vi@gra'), which a human reader still understands but which does not match the learned word, or by mixing in plenty of words that are common in legitimate mail so that the remaining evidence leans toward non-spam. Naive Bayes still works pretty well and is cheap to train when a pre-labeled dataset is available: given a ton of emails, training is more or less just counting the words in spam and non-spam emails, and more training data simply increments the counts. In practice, there is a global model which is personalized to individuals, and there are lots of hard-coded, cheap rules applied before an email gets put into a fancy and slow model.
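Finally, a rough sketch of the classification step and of the spelling trick; the vocabulary and per-word probabilities are hypothetical.

import math

vocab = ["money", "cash", "!!!", "viagra"]
p_given_spam = {"money": 0.6, "cash": 0.5, "!!!": 0.7, "viagra": 0.8}
p_given_ham  = {"money": 0.1, "cash": 0.1, "!!!": 0.05, "viagra": 0.01}

def log_score(x, p_given_class):
    """Sum of per-word log likelihoods for a binary presence vector x."""
    return sum(math.log(p_given_class[w]) if x[w] else math.log(1 - p_given_class[w])
               for w in vocab)

def to_vector(text):
    """Binary vector over the vocabulary: 1 if the word appears in the text."""
    present = set(text.split())
    return {w: int(w in present) for w in vocab}

original = to_vector("cheap viagra cash !!!")
evasive  = to_vector("cheap v1agra cash !!!")   # obfuscated spelling: the 'viagra' feature stays 0

for name, x in [("original", original), ("evasive", evasive)]:
    # Positive log-likelihood-ratio difference -> classified as spam;
    # the evasive version scores noticeably lower than the original.
    print(name, log_score(x, p_given_spam) - log_score(x, p_given_ham))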