

Question

1- Suppose in our Training set we are using marital status as a predictor, and it has the values

Married

Single

Divorced

and in our Test set, marital status has the values

Married

Single

Divorced

Widowed

What will happen?

A) Depending on the model, either of the other answers may happen. This is why it's best not to have nominal attribute values with a very small number of instances.

B) The model will fail at the test step because there is no way to predict Widowed.

C) The model will treat the data as missing.

2- Cross-validation is

   A) Finding a faster way to model

   B) Developed to deal with anger management issues among data analysts

   C) Somewhat like picking 10 different Training-Test splits, and averaging the error rates

3- If we do the standard 10-fold cross-validation, we get 10 different models.

How do we get the final model?

   A) The results for each of the 10 models are averaged

   B) The 10 folds are used to determine the accuracy of the classifier; the final model is built on the entire dataset

   C) The results for the highest and lowest models are thrown out, and the remaining 8 models are averaged

   D) The results for the model with the lowest error are used

4- The standard cost matrix is

0 1

1 0

If we change this to

0 1

3 0

this means we want

   A) Fewer instances classified in the cell with the 3

   B) More cases classified in the cell with the 3

5- In Weka, if we run a classifier such as J48, the ROC curve can be found under  

a) visualize classifier errors

b) cost-benefit analysis

c) visualize margin curve

d) visualize threshold curve

6- Which of these would be the best AUC for a model?

a) 2

b) 1.5

c) 0.9

d) 0.1

e) 0.5

Explanation / Answer

We are allowed to do one question at a time. Please post again for the second question.

The answer is C) The model will treat it as a missing value.

Missing values affect decision tree construction in three different ways:

– how impurity measures are computed

– how to distribute an instance with a missing value to child nodes

– how a test instance with a missing value is classified
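As a minimal sketch (plain Python with hypothetical data, rather than Weka itself), mapping an attribute value unseen during training to a missing value looks like this:

```python
# Marital-status values seen in the Training set
train_values = {"Married", "Single", "Divorced"}

# The Test set contains a value ("Widowed") never seen during training
test_values = ["Married", "Widowed", "Single", "Divorced"]

# Any unseen value is mapped to None, i.e. treated as missing;
# the classifier then applies its missing-value strategy to it
encoded = [v if v in train_values else None for v in test_values]
print(encoded)  # ['Married', None, 'Single', 'Divorced']
```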

Step 1) Training: Each type of algorithm has its own parameter options (the number of layers in a Neural Network, the number of trees in a Random Forest, etc.). For each of your algorithms, you must pick one option. That's why you have a training set.

Step 2) Validating: You now have a collection of algorithms. You must pick one algorithm. That’s why you have a test set. Most people pick the algorithm that performs best on the validation set (and that's ok). But, if you do not measure your top-performing algorithm’s error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the “best possible scenario” for the “most likely scenario.” That's a recipe for disaster.

Step 3) Testing: I suppose that if your algorithms did not have any parameters, then you would not need a third step. In that case, your validation step would be your test step. Perhaps MATLAB does not ask you for parameters, or you have chosen not to use them, and that is the source of the confusion.
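The three steps above can be sketched as a single train/validation/test split. The 60/20/20 proportions and the toy dataset below are assumptions for illustration only:

```python
import random

random.seed(0)
data = list(range(100))  # hypothetical dataset of 100 instances
random.shuffle(data)

n = len(data)
train = data[:int(0.6 * n)]                   # Step 1: fit each algorithm's parameters
validation = data[int(0.6 * n):int(0.8 * n)]  # Step 2: pick the best algorithm
test = data[int(0.8 * n):]                    # Step 3: unbiased error estimate

print(len(train), len(validation), len(test))  # 60 20 20
```

Each instance lands in exactly one of the three subsets, so the error measured in Step 3 is computed on data the chosen model has never seen.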