Implement both the Tree Classifier and SVM Classifier using the Breast Cancer Wisconsin data set
Question
Implement both the Tree Classifier and SVM Classifier using the Breast Cancer Wisconsin data set from the University of California Irvine Machine Learning Data Repository at archive.ics.uci.edu/ml . In the data, attributes 2 through 10 represent each instance, and each instance belongs to one of 2 possible classes: benign or malignant. More precisely, each sample has 10 input attributes (Sample code number id number, Clump Thickness 1 - 10, Uniformity of Cell Size 1 - 10, Uniformity of Cell Shape 1 - 10, Marginal Adhesion 1 - 10, Single Epithelial Cell Size 1 - 10, Bare Nuclei 1 - 10, Bland Chromatin 1 - 10, Normal Nucleoli 1 - 10, Mitoses 1 - 10) plus the single output class attribute (2 for benign, 4 for malignant).
Partition the data into a training (model-learning) set and a test set. For the tree classifier, use a top-down greedy algorithm with either the Gini index or information gain (entropy) as the node-splitting measure. It would be more elegant (but not required) to avoid model overfitting by using the pessimistic error formula to decide whether or not to prune leaf nodes.
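As an illustration of the two splitting measures mentioned above, here is a minimal sketch (the function names `gini`, `entropy`, and `information_gain` are my own, not part of the assignment) of how impurity and the gain of a binary split can be computed for lists of 2/4 class labels:

```python
import math

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k
    n = float(len(labels))
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    # Shannon entropy: -sum_k p_k * log2(p_k)
    n = float(len(labels))
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n, 2)
                for c in set(labels))

def information_gain(parent, left, right):
    # reduction in entropy achieved by splitting parent into left/right children
    n = float(len(parent))
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```

The top-down greedy algorithm would evaluate `information_gain` (or the analogous Gini reduction) for every candidate attribute threshold and split on the best one, recursing until a node is pure or a stopping criterion fires.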
For the SVM you can use either a linear SVM (risking that both the classification (training) error and the generalization error will be large) or, preferably, a nonlinear SVM using e.g. a polynomial, Gaussian radial basis, or sigmoid kernel. Of course, the output class attribute should be modified: use +1 instead of 2 for the benign class, and -1 instead of 4 for the malignant class.
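The label remapping asked for above is a simple dictionary lookup; a minimal sketch (the helper name `to_pm1` is my own) over a small hand-made list of class codes:

```python
def to_pm1(c):
    # map the dataset's class codes to SVM labels: 2 (benign) -> +1, 4 (malignant) -> -1
    return {2: 1, 4: -1}[c]

labels = [2, 4, 2, 4]          # class codes as they appear in the data file
pm = [to_pm1(c) for c in labels]
print(pm)                       # [1, -1, 1, -1]
```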
For implementation use Python or R. You may take inspiration from existing code, but you are not allowed to use it directly; in other words, write your own programs. You may, however, use standard or other language libraries, including libraries for linear algebra, matrices, and constrained Lagrangian nonlinear optimization (but not data-mining or machine-learning libraries/software packages with complete algorithms already implemented). Please include both your source code and sample output from running your programs.
Compare the performance of both classifiers; it is sufficient to provide both training accuracy and test (generalization) accuracy for both of your programs (using, of course, the same training and test data). Based on that, state which classifier seems to perform better for your programs and data.
Comment: a more elegant approach would be to test, e.g., the confidence interval for the true accuracy (based on the test accuracy) at the (1 - a) confidence level, or the hypothesis that the performance difference for the stochastic variable d = e1 - e2 (where e1 is the misclassification error of the tree classifier and e2 is the misclassification error of the SVM classifier) is statistically significant at the (1 - a) confidence level.
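The confidence interval mentioned in this comment is usually computed with the normal approximation acc ± z * sqrt(acc * (1 - acc) / n), where n is the number of test instances. A short sketch (the function name and the 0.96 / 171 figures are illustrative, not results from the program below):

```python
import math

def accuracy_ci(acc, n, z=1.96):
    # normal-approximation confidence interval for the true accuracy;
    # z = 1.96 corresponds to roughly 95% confidence (a = 0.05)
    half = z * math.sqrt(acc * (1.0 - acc) / n)
    return (acc - half, acc + half)

lo, hi = accuracy_ci(0.96, 171)   # e.g. 96% test accuracy on 171 test instances
```

For the significance test on d = e1 - e2, the same normal approximation applies to the difference of the two error rates, with the variances of e1 and e2 added.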
Must be in Python 2.7.
Explanation / Answer
import pandas as pd
from sklearn import svm
# note: sklearn.cross_validation was renamed sklearn.model_selection in later versions
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score
'''
# Attribute Domain
-- -----------------------------------------
1. Sample code number id number
2. Clump Thickness 1 - 10
3. Uniformity of Cell Size 1 - 10
4. Uniformity of Cell Shape 1 - 10
5. Marginal Adhesion 1 - 10
6. Single Epithelial Cell Size 1 - 10
7. Bare Nuclei 1 - 10
8. Bland Chromatin 1 - 10
9. Normal Nucleoli 1 - 10
10. Mitoses 1 - 10
11. Class: (2 for benign, 4 for malignant)
'''
# read the raw data; '?' marks missing values in the Bare Nuclei column
names=['id', 'Clump Thickness', 'Uniformity of Size', 'Uniformity of Shape', 'Marginal Adhesion', 'Single Epithelial Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
raw_data = pd.read_table('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', delimiter=',', names=names, na_values=['?'])
data = raw_data.dropna()  # drop the instances with missing values
#these will all be set based on the most accurate test/train set
best_prec = 0
best_acc = 0
best_rec = 0
best_coef = 0
# run 100 different random train/test splits and keep the best result
for i in range(0, 100):
    train, test = train_test_split(data)
    trainX = train[names[1:len(names)-1]]   # all attributes except id and Class
    trainY = train[names[len(names)-1]]
    testX = test[names[1:len(names)-1]]
    testY = test[names[len(names)-1]]
    testY = testY/2 - 1                     # map class codes 2 -> 0 (benign), 4 -> 1 (malignant)
    clf = svm.SVC(kernel='linear')          # gamma is ignored by the linear kernel
    clf.fit(trainX, trainY)
    predictions = clf.predict(testX)
    predictions = predictions/2 - 1         # same 0/1 mapping as testY
    prec = precision_score(testY, predictions)   # y_true first, then y_pred
    acc = accuracy_score(testY, predictions)
    rec = recall_score(testY, predictions)
    coef = str(zip(names[1:len(names)-1], clf.coef_[0]))
    if acc > best_acc:
        best_prec = prec
        best_acc = acc
        best_rec = rec
        best_coef = coef
print "Precision: ", best_prec
print "Accuracy: ", best_acc
print "Recall: ", best_rec
print best_coef
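The assignment also asks for training accuracy alongside test accuracy; in the program above that would mean scoring `clf.predict(trainX)` against `trainY` inside the loop. A minimal, self-contained sketch of the train-vs-test comparison itself (the label lists here are made-up stand-ins, not output of the program above):

```python
def accuracy(y_true, y_pred):
    # fraction of positions where the predicted label matches the true label
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / float(len(y_true))

# stand-in labels to illustrate reporting both accuracies side by side
train_true = [0, 0, 1, 1, 1]
train_pred = [0, 0, 1, 1, 0]
test_true = [0, 1, 1, 0]
test_pred = [0, 1, 0, 0]

print("train accuracy: %.2f" % accuracy(train_true, train_pred))  # 0.80
print("test accuracy:  %.2f" % accuracy(test_true, test_pred))    # 0.75
```

A training accuracy far above the test accuracy would indicate overfitting, which is exactly the comparison the question asks for when judging the two classifiers.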
breast-cancer-wisconsin.data (excerpt)
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4
1056784,3,1,1,1,2,1,2,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1059552,1,1,1,1,2,1,3,1,1,2
1065726,5,2,3,4,2,7,3,6,1,4
1066373,3,2,1,1,1,1,2,1,1,2
1066979,5,1,1,1,2,1,2,1,1,2
1067444,2,1,1,1,2,1,2,1,1,2
1070935,1,1,3,1,2,1,1,1,1,2
1070935,3,1,1,1,1,1,2,1,1,2
1071760,2,1,1,1,2,1,3,1,1,2
1072179,10,7,7,3,8,5,7,4,3