

Question

4. Briefly explain why it is usually not objective to evaluate a classifier's performance based on the training data set. Why does poor out-of-sample predictive accuracy sometimes signal the problem of overfitting? [16 points]

5. Briefly explain two ways to limit overfitting in constructing a decision tree. Briefly explain the advantages and the weaknesses of decision trees. [16 points]

6. List four common properties of distance measures such as the Euclidean distance. Why is it often necessary to standardize your data when calculating the Euclidean distance? Briefly explain the four distance measures we often use to calculate the distance between two clusters. [18 points]

Explanation / Answer

4) Answer:

We may not be able to get an accurate, objective estimate of a classifier's performance if the training sample is not sufficiently representative of the population the classifier will be applied to. More fundamentally, the model has already seen the training data: a sufficiently flexible classifier can simply memorize it, so training-set accuracy is an optimistically biased estimate of performance on new cases. Large models overfitted to the training data are usually poor predictors because unneeded predictor variables increase the variance of the prediction error. This is another reason it is usually not objective to evaluate a classifier's performance on the training data. Conversely, when a model that looks accurate in-sample shows low predictive accuracy out-of-sample, that gap can signal the problem of overfitting: the model has learned noise in the training data rather than the underlying pattern.
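To make the in-sample vs. out-of-sample gap concrete, here is a minimal sketch using scikit-learn (the synthetic data, the noise features, and all parameter values are illustrative assumptions, not part of the original answer):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first two features matter; the other eight are
# pure noise that an unconstrained tree can memorize.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An unconstrained tree fits the training set nearly perfectly...
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = deep_tree.score(X_train, y_train)  # optimistically biased
test_acc = deep_tree.score(X_test, y_test)     # the honest, out-of-sample estimate
```

The training accuracy looks excellent precisely because the tree has memorized the sample, while the held-out accuracy reveals how much of that was noise.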

5) Answer:

There are many ways to limit overfitting when constructing a decision tree. One way is to stop generating new split nodes when subsequent splits yield only a slight improvement in the prediction (early stopping).

Another method is pruning the tree back: grow the full tree, then select a simpler subtree than the one the tree-building process produced, in the hope that it does better at predicting or classifying new observations.
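Both strategies can be sketched with scikit-learn (assuming that library is in use; the dataset and the specific thresholds `min_impurity_decrease=0.01` and `ccp_alpha=0.02` are illustrative choices, not from the original answer):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Way 1 (early stopping): refuse splits that improve impurity only slightly.
early_stop = DecisionTreeClassifier(
    min_impurity_decrease=0.01, random_state=0).fit(X, y)

# Way 2 (post-pruning): grow fully, then prune back with
# cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# Fully grown reference tree for comparison.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
```

Both constrained trees end up with no more nodes than the fully grown one; tuning the thresholds (e.g., by cross-validation) controls how aggressively the tree is simplified.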

Advantages:

Decision Tree Function

The tree structure provides a framework for analyzing all possible alternatives for a decision. The visual representation also includes the likelihood and potential reward for each choice. To create a tree, start with the main decision and draw a square. Extend lines out from the edge of the square for each possible solution. If the solution leads to another decision, draw a square and extend new lines to the next possible series of choices. If the outcome of a particular choice is uncertain, draw a circle instead of a square. Assign probabilities to each branch and a dollar amount for the possible payoff. Make sure to subtract any costs involved with executing the decision. Multiply the probability and the net profit for each outcome to get an adjusted expectation value for each branch of the tree.
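The expected-value arithmetic described above can be sketched in a few lines of Python (the branch names, probabilities, payoffs, and costs below are all made-up examples):

```python
# Hypothetical decision node with two choices; each branch is a list of
# (probability, gross payoff, execution cost) outcomes.
branches = {
    "launch_product": [(0.6, 100_000, 20_000),   # success
                       (0.4, -30_000, 20_000)],  # failure
    "do_nothing":     [(1.0, 0, 0)],
}

def expected_value(outcomes):
    # Subtract costs to get net profit, then weight each outcome by its
    # probability and sum: the adjusted expectation value for the branch.
    return sum(p * (payoff - cost) for p, payoff, cost in outcomes)

values = {choice: expected_value(outs) for choice, outs in branches.items()}
best = max(values, key=values.get)  # branch with the highest expectation
```

With these numbers, launching has an expected value of 0.6 × 80,000 + 0.4 × (−50,000) = 28,000, so it beats doing nothing.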

Brainstorming Outcomes

Decision trees help you think of all possible outcomes for an upcoming choice. The consequences of each outcome must be fully explored, so no details are missed. Taking the time to brainstorm prevents overreactions to any one variable. The graphical depiction of various alternatives makes them easier to compare with each other. The decision tree also adds transparency to the process. An independent party can see exactly how a particular decision was made.

Presentation of Information

Complex data can be presented in a decision tree, but the user only needs to be concerned with the path that matches the specific situation at hand. The circle and square format shows you where incomplete information was used so you can assess the strength of your assumptions. Because you must have accurate cost and risk information to create the tree, decisions can be made without personal bias or emotional reactions. No normalization of data is required because the outcomes are based on each decision, not comparisons between multiple variables.

Automatic Prioritization

The important variables for a decision are automatically emphasized through the process of developing the tree. The top nodes of the tree are the most important because they determine the subsequent decisions to be made. The tree also shows the order decisions must be made and eliminates ambiguity related to how each item affects the others.

Decision Tree Versatility

Decision trees can be customized for a variety of situations. The logical form is good for programmers and engineers. Technicians can also use decision trees to diagnose mechanical failures in equipment or troubleshoot auto repairs. Decision trees are also helpful for evaluating business or investment alternatives. Managers can recreate the math used in a particular decision tree to analyze the company's decision-making process.

Disadvantages:


Disadvantages of decision trees include: they may pick up "random noise" in the training data (overfitting to that noise), in which case predictions for test samples will be poor. Independent tree tests can also be costly or time-consuming when they require separate data sets, although this mainly applies if you do not mitigate it with another strategy such as v-fold cross-validation.

6) Answer:

Euclidean distance is numeric, so categorical attributes must first be encoded numerically. The scales of the variables must also be taken into consideration: the distance is dominated by whichever variable has the largest scale, so if one variable is measured in years and another in dollars, the dollar variable will swamp the distance. In cases like this, the data must be standardized (for example, converted to z-scores) before the Euclidean distance is calculated.
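A minimal NumPy sketch of the standardization step (the records and their values are made-up):

```python
import numpy as np

# Three records on very different scales: (age in years, income in dollars).
data = np.array([[25.0, 40_000.0],
                 [60.0, 42_000.0],
                 [30.0, 90_000.0]])

# Raw Euclidean distance between the first two records: the dollar column
# dominates, even though a 35-year age gap is arguably the bigger difference.
raw_dist = np.linalg.norm(data[0] - data[1])

# Standardize each column to mean 0 and standard deviation 1 (z-scores) first,
# so both variables contribute on a comparable scale.
z = (data - data.mean(axis=0)) / data.std(axis=0)
std_dist = np.linalg.norm(z[0] - z[1])
```

After standardization the two variables contribute comparably, and the distance is no longer an artifact of the measurement units.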

The four distance measures we often use to calculate distance between two clusters include:

a. Minimum distance (single linkage, or "nearest neighbor"): the distance between the two closest members, one from each cluster.

b. Maximum distance (complete linkage, or "farthest neighbor"): the distance between the two most distant members, one from each cluster.

c. Average distance (average linkage): the mean of all pairwise distances between members of one cluster and members of the other.

d. Centroid distance: the distance between the centroids (mean vectors) of the two clusters.
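All of these inter-cluster distances derive from the pairwise point distances; a minimal NumPy sketch (the two clusters are made-up examples):

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # made-up cluster A
B = np.array([[4.0, 0.0], [6.0, 0.0]])  # made-up cluster B

# All pairwise Euclidean distances between points of A and points of B.
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single = pair.min()      # minimum distance (single linkage)
complete = pair.max()    # maximum distance (complete linkage)
average = pair.mean()    # average of all pairwise distances (average linkage)
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid distance
```

For these clusters the pairwise distances are 4, 6, 3, and 5, so single linkage gives 3, complete linkage 6, and both average linkage and the centroid distance 4.5.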

Four common properties of distance measures such as the Euclidean distance (used for non-categorical variables):

1. Non-negativity: d(x, y) ≥ 0 for all x and y.

2. Symmetry: d(x, y) = d(y, x).

3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

4. Identity: d(x, y) = 0 if and only if x = y.
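These four metric properties can be checked mechanically for the Euclidean distance on a handful of sample points (the points below are arbitrary illustrative choices):

```python
import itertools
import math

def euclid(p, q):
    # Euclidean (L2) distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = [(0, 0), (3, 4), (-1, 2)]  # arbitrary sample points

for p, q, r in itertools.product(points, repeat=3):
    assert euclid(p, q) >= 0                                   # non-negativity
    assert euclid(p, q) == euclid(q, p)                        # symmetry
    assert (euclid(p, q) == 0) == (p == q)                     # identity
    assert euclid(p, r) <= euclid(p, q) + euclid(q, r) + 1e-9  # triangle inequality
```

The small epsilon in the triangle-inequality check guards against floating-point round-off.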