
Question

Implement both the Tree Classifier and the SVM Classifier using the Breast Cancer Wisconsin data set from the University of California Irvine Machine Learning Data Repository at archive.ics.uci.edu/ml. In the data, attributes 2 through 10 represent each instance. Each instance belongs to one of 2 possible classes: benign or malignant. More precisely, each sample has 10 input attributes (Sample code number: id number; Clump Thickness: 1 - 10; Uniformity of Cell Size: 1 - 10; Uniformity of Cell Shape: 1 - 10; Marginal Adhesion: 1 - 10; Single Epithelial Cell Size: 1 - 10; Bare Nuclei: 1 - 10; Bland Chromatin: 1 - 10; Normal Nucleoli: 1 - 10; Mitoses: 1 - 10) plus a single output class attribute (2 for benign, 4 for malignant).
Partition the data into training (for learning the model) and test sets. For the tree classifier, use the top-down greedy algorithm with either the GINI or the Information Gain/Entropy measure for node splitting. It would be more elegant (but is not required) to avoid model overfitting by using the pessimistic error formula to decide whether or not to prune leaf nodes.
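As an illustration of the GINI splitting criterion mentioned above, here is a minimal sketch (plain Python, hypothetical helper names, not part of the assignment's required code) that scores a candidate threshold split:

```python
from __future__ import print_function

def gini(labels):
    """Gini impurity of a list of class labels (lower is purer)."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = float(len(labels))
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy example: Clump Thickness values with benign (2) / malignant (4) labels.
vals = [1, 2, 3, 8, 9, 10]
labs = [2, 2, 2, 4, 4, 4]
print(gini(labs))                  # 0.5 (maximally mixed two-class set)
print(split_gini(vals, labs, 3))   # 0.0 (a perfect split)
```

The greedy tree algorithm simply evaluates `split_gini` for every attribute/threshold pair and splits on the lowest-impurity one, recursing into each side.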
For the SVM you can use either a linear SVM (at the risk that both the training and the generalization classification errors will be large) or, preferably, a nonlinear SVM using e.g. a polynomial, Gaussian radial basis, or sigmoid kernel. Of course, your output class attribute should be modified: instead of 2 for the benign class use +1, and instead of 4 for the malignant class use -1.
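The Gaussian radial basis kernel and the class remapping described above can be sketched as follows (a minimal illustration; the `gamma` value is an arbitrary placeholder):

```python
from __future__ import print_function
import math

def rbf_kernel(x, z, gamma=0.05):
    """Gaussian radial basis kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def remap_class(c):
    """Map the UCI class codes to SVM targets: 2 (benign) -> +1, 4 (malignant) -> -1."""
    return 1 if c == 2 else -1

print(rbf_kernel([1, 1], [1, 1]))            # 1.0 (identical points)
print([remap_class(c) for c in [2, 4, 2]])   # [1, -1, 1]
```

The kernel value decays from 1 toward 0 as the two points move apart, which is what lets the nonlinear SVM bend its decision boundary around the data.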
For the implementation use Python or R. You can take inspiration from existing code, but you are not allowed to use it; in other words, you write your own programs. You may, however, use standard or other language libraries, including libraries for linear algebra, matrices, and Lagrangian nonlinear optimization with constraints (excluding libraries/software packages for data mining or machine learning with complete algorithms already implemented). Please include both the sources and a sample run of your programs.
Compare the performance of both classifiers: it is sufficient to provide both the training accuracy and the test/generalization accuracy for both of your programs (of course, using the same training and test data). Based on that, state which classifier seems to perform better for your programs and data.
Comment: a more elegant approach would be to test, e.g., the confidence interval for the true accuracy (based on the test accuracy) at the (1 - α) confidence level, or the hypothesis that the performance difference for the stochastic variable d = e1 - e2 (where e1 is the misclassification error for the tree classifier and e2 is the misclassification error for the SVM classifier) is statistically significant at the (1 - α) confidence level.
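The confidence interval mentioned in the comment can be computed with the usual normal approximation, acc ± z·sqrt(acc·(1 - acc)/n). A minimal sketch (the test-set size of 175 is an illustrative assumption, roughly a 25% hold-out of the ~699 rows):

```python
from __future__ import print_function
import math

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation (1 - alpha) interval for the true accuracy,
    given test accuracy `acc` on `n` test instances.
    z = 1.96 corresponds to a 95% confidence level."""
    half = z * math.sqrt(acc * (1.0 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)

lo, hi = accuracy_interval(0.96, 175)  # e.g. 96% test accuracy on 175 test rows
print("95%% CI: [%.3f, %.3f]" % (lo, hi))
```

A wider interval on a small test set is a reminder that a single accuracy number can overstate how well the classifier generalizes.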

Must be in Python 2.7.

Explanation / Answer

import pandas as pd

from sklearn import svm
from sklearn.cross_validation import train_test_split  # sklearn < 0.18 (Python 2.7 era)
from sklearn.metrics import precision_score, recall_score, accuracy_score

'''

   # Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

'''

# read the raw data, treating '?' as a missing value
names=['id', 'Clump Thickness', 'Uniformity of Size', 'Uniformity of Shape', 'Marginal Adhesion', 'Single Epithelial Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
raw_data = pd.read_table('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', delimiter=',', names=names, na_values=['?'])
data = raw_data.dropna()

# these will all be set based on the most accurate train/test split
best_prec = 0
best_acc = 0
best_rec = 0
best_train_acc = 0
best_coef = 0

# run 100 different random train/test splits
for i in range(0, 100):
    train, test = train_test_split(data)
    trainX = train[names[1:-1]]
    testX = test[names[1:-1]]

    # remap the class labels as required: 2 (benign) -> +1, 4 (malignant) -> -1
    trainY = train[names[-1]].map({2: 1, 4: -1})
    testY = test[names[-1]].map({2: 1, 4: -1})

    clf = svm.SVC(kernel='linear')  # gamma is ignored by the linear kernel
    clf.fit(trainX, trainY)

    train_acc = accuracy_score(trainY, clf.predict(trainX))
    predictions = clf.predict(testX)
    # sklearn metrics take (y_true, y_pred) in that order
    prec = precision_score(testY, predictions)
    acc = accuracy_score(testY, predictions)
    rec = recall_score(testY, predictions)
    coef = str(zip(names[1:-1], clf.coef_[0]))

    if acc > best_acc:
        best_prec = prec
        best_acc = acc
        best_rec = rec
        best_train_acc = train_acc
        best_coef = coef


print "Training accuracy: ", best_train_acc
print "Precision: ", best_prec
print "Accuracy: ", best_acc
print "Recall: ", best_rec
print best_coef
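The program above covers only the SVM half of the assignment. A minimal hand-rolled starting point for the tree half (a one-level decision stump chosen by exhaustive Gini search over rows in the same attribute order as the data file; all names here are illustrative, and extending it to a full recursive tree with pruning is left to the student) might look like:

```python
from __future__ import print_function

def gini(labels):
    """Gini impurity of a list of class labels (lower is purer)."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_stump(X, y):
    """Exhaustively pick the (feature, threshold) pair with the
    lowest weighted Gini impurity over all value<=threshold splits."""
    best = (None, None, float('inf'))
    n = float(len(y))
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # degenerate split, skip
            score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            if score < best[2]:
                best = (f, t, score)
    return best

# Toy rows: two attributes, class 2 vs 4; feature 0 separates the classes.
X = [[1, 5], [2, 5], [9, 5], [10, 5]]
y = [2, 2, 4, 4]
print(best_stump(X, y))  # (0, 2, 0.0): split on feature 0 at threshold 2
```

A full tree would recurse on the left and right partitions until a node is pure (Gini 0) or a depth/size limit is hit, which keeps the implementation within the "no complete ML library algorithms" constraint of the question.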


breast-cancer-wisconsin.data

1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4
1056784,3,1,1,1,2,1,2,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1059552,1,1,1,1,2,1,3,1,1,2
1065726,5,2,3,4,2,7,3,6,1,4
1066373,3,2,1,1,1,1,2,1,1,2
1066979,5,1,1,1,2,1,2,1,1,2
1067444,2,1,1,1,2,1,2,1,1,2
1070935,1,1,3,1,2,1,1,1,1,2
1070935,3,1,1,1,1,1,2,1,1,2
1071760,2,1,1,1,2,1,3,1,1,2
1072179,10,7,7,3,8,5,7,4,3
