Question

1. Modify the code to add a second model that uses k neighbors regression with an n_neighbors value of 3.

2. Compare the results between the linear regression and k neighbors regression. Explain why they are different.

3. Measure the time used in the following stages: 1. Loading the data, 2. Training the model, and 3. Making the prediction. How do these measurements differ for linear regression vs. k neighbors regression?

---------Code is Below---------------

-----Part1------

def prepare_country_stats(oecd_bli, gdp_per_capita):
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

-----Part2--------

import os

datapath = os.path.join("datasets", "lifesat", "")

-------Part3--------

# Code example
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv",thousands=',',delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]] # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]

Explanation / Answer

##################((1)) ###############

The code with k nearest neighbors (k = 3) is:

The only modification needed is the model that gets trained. The question asks for k neighbors regression, so a k neighbors regressor (not a classifier) replaces the previous linear regression; the data loading and preparation stay the same.

Modified code:

# -----Part1------

def prepare_country_stats(oecd_bli, gdp_per_capita):
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    gdp_per_capita.set_index("Country", inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

# -----Part2--------

import os

datapath = os.path.join("datasets", "lifesat", "")

# ------Part3--------

# Code example
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Load the data
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv",thousands=',',delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a k neighbors regression model with k = 3
model = KNeighborsRegressor(n_neighbors=3)

# Train the model (y stays a float target; no cast to int is needed for regression)
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]] # Cyprus' GDP per capita
print(model.predict(X_new)) # mean life satisfaction of the 3 countries nearest in GDP per capita

##################((2))####################
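The two models give different predictions for Cyprus because they generalize in different ways:
1. Linear regression fits a single straight line through all of the training countries and reads the prediction off that line. For Cyprus it outputs about 5.96.
2. K neighbors regression with k = 3 does not fit a global trend. It finds the three countries whose GDP per capita is closest to 22587 and returns the average of their life satisfaction values, so the result depends only on those three neighbors.
3. To decide which model performs better, test both on the same held-out dataset and compare a regression metric such as mean squared error or R^2 (accuracy applies to classifiers, not to regressors). The model with the lower error has performed better
in this situation.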
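As a rough illustration of the difference, the sketch below trains both models on the same data and reproduces each prediction by hand. The eight (GDP, life satisfaction) pairs are hypothetical stand-ins, not the OECD/IMF values, so the printed numbers will not match the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in data (NOT the real OECD/IMF values):
# GDP per capita -> life satisfaction
X = np.array([[9000], [12000], [18000], [20000],
              [25000], [37000], [50000], [55000]], dtype=float)
y = np.array([4.9, 5.8, 5.1, 5.7, 6.5, 6.0, 7.3, 7.2])

X_new = np.array([[22587.0]])  # Cyprus' GDP per capita, as in the question

# Train both models on the same data
lin = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

lin_pred = lin.predict(X_new)[0]
knn_pred = knn.predict(X_new)[0]

# Linear regression by hand: one global line, intercept + slope * x
manual_lin = lin.intercept_ + lin.coef_[0] * X_new[0, 0]

# k neighbors regression by hand: average the targets of the 3 closest points
distances = np.abs(X.ravel() - X_new[0, 0])
nearest = np.argsort(distances)[:3]
manual_knn = y[nearest].mean()

print("linear regression:", lin_pred, "by hand:", manual_lin)
print("3 neighbors      :", knn_pred, "by hand:", manual_knn)
```

The linear prediction moves whenever any training point changes, while the k neighbors prediction only changes if one of the three nearest points does.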


#################((3))###################
To measure the time, we can use Python's built-in time module.

The snippets are:
********************

import time
tic = time.perf_counter()  # high-resolution timer intended for measuring intervals
# Load the data
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv",thousands=',',delimiter='\t',
                             encoding='latin1', na_values="n/a")
toc = time.perf_counter()
time_load = (toc - tic)*1000  # time in ms (milliseconds)

********************

********************

tic = time.perf_counter()
# Train the model
model.fit(X, y)
toc = time.perf_counter()
time_train = (toc - tic)*1000  # time in ms (milliseconds)

********************

********************

tic = time.perf_counter()
# Make a prediction for Cyprus
X_new = [[22587]] # Cyprus' GDP per capita
print(model.predict(X_new))
toc = time.perf_counter()
time_predict = (toc - tic)*1000  # time in ms (milliseconds); includes the cost of print()

*********************

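These snippets are general and will work for any scikit-learn estimator, so the same timing code can be used for both linear regression and k neighbors regression.
Each snippet records a timestamp before the stage starts and another after it ends.
The difference between the two is the time taken. The timestamps are in seconds, so
we multiply by 1000 to get the time in ms (milliseconds).

For the comparison: loading the data is the same step for both models, so its time should not depend on the model. Linear regression does most of its work during training (solving a least-squares problem) and then predicts with a single multiply-and-add. K neighbors regression does little during training beyond storing (and possibly indexing) the training data, and does its real work at prediction time, when it searches for the nearest neighbors. On a dataset this small all of these times are tiny and noisy, so repeat each measurement several times before drawing conclusions.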
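One way to make the measurements less noisy is to wrap the timing in a small helper and keep the best of several runs. The sketch below is only an illustration: `time_call` is a hypothetical helper name, and the data is randomly generated rather than loaded from the CSV files, so it covers the training and prediction stages only.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor


def time_call(func, *args, repeats=100):
    """Call func(*args) `repeats` times; return (last result, best time in ms)."""
    best_ms = float("inf")
    result = None
    for _ in range(repeats):
        tic = time.perf_counter()
        result = func(*args)
        toc = time.perf_counter()
        best_ms = min(best_ms, (toc - tic) * 1000)
    return result, best_ms


# Synthetic stand-in data (assumption: arbitrary numbers, not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(9000, 55000, size=(29, 1))
y = 5.0 + 4e-5 * X.ravel() + rng.normal(0, 0.3, size=29)
X_new = [[22587]]

timings = {}
models = [
    ("linear regression", LinearRegression()),
    ("k neighbors (k=3)", KNeighborsRegressor(n_neighbors=3)),
]
for name, model in models:
    _, fit_ms = time_call(model.fit, X, y)            # training stage
    _, predict_ms = time_call(model.predict, X_new)   # prediction stage
    timings[name] = {"fit_ms": fit_ms, "predict_ms": predict_ms}
    print(f"{name}: fit {fit_ms:.4f} ms, predict {predict_ms:.4f} ms")
```

Taking the minimum over repeated runs filters out one-off delays from the operating system. The absolute numbers depend on the machine and library versions, so only the relative pattern between the two models is worth reading.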