We will visualize Genome-Wide Association Studies (GWAS) with the Manhattan plot
Question
We will visualize Genome-Wide Association Studies (GWAS) with a Manhattan plot for psychiatric disorders. There are two data sets, SNPs and phenotypes, containing genome data for two groups: psychiatric disorders (y = 1) and control (y = 0). "SNP.csv" contains 37,853 SNPs for 130 samples. Each value in the file is the number of minor alleles at that SNP (i.e., a value in {0, 1, 2}). In "Phenotype.csv", zero indicates that the sample is a control, while one indicates a psychiatric disorder (one of bipolar disorder, schizophrenia, or major depression). Compute p-values using a t-test (you can use any library for the t-test). Perform the t-test pairwise between each SNP and the phenotype; i.e., you need to perform 37,853 t-tests and compute their p-values. Then make a Manhattan plot with Bonferroni multiple-testing correction (i.e., use the p-value cutoff 0.05/37,853).
Write code in R or Python that reads in the .csv files and generates a Manhattan plot (using any library).
I cannot attach the .csv files mentioned, so any files that you can get to work with your code will be fine. I struggled because there was a lot of data, so please keep in mind that the code should be able to handle a large amount of data from the files.
Figure 1. Manhattan Plot (x-axis: SNPs)
Explanation / Answer
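For the task as posed in the question, the per-SNP t-tests and the Bonferroni cutoff could be sketched roughly as follows. This is only a sketch under stated assumptions: NumPy, SciPy, and Matplotlib are installed; both CSV files are purely numeric with no header row or ID column; and "t-test between a SNP and the phenotype" is read as a two-sample t-test comparing allele counts in cases against controls. The function names load_data, snp_pvalues, and manhattan_plot are illustrative, not part of any library.

```python
import numpy as np
from scipy import stats


def load_data(snp_path="SNP.csv", pheno_path="Phenotype.csv"):
    """Load genotypes and phenotypes.

    ASSUMPTION: both files are purely numeric, comma-separated, and have
    no header row or ID column; adjust the parsing if yours differ.
    """
    genotypes = np.loadtxt(snp_path, delimiter=",")
    phenotype = np.loadtxt(pheno_path, delimiter=",").ravel().astype(int)
    # Orient the matrix as (samples, SNPs) by matching the phenotype length.
    if genotypes.shape[0] != phenotype.size:
        genotypes = genotypes.T
    return genotypes, phenotype


def snp_pvalues(genotypes, phenotype):
    """Return one two-sample t-test p-value per SNP (one per column)."""
    genotypes = np.asarray(genotypes, dtype=float)
    phenotype = np.asarray(phenotype).astype(int)
    cases = genotypes[phenotype == 1]
    controls = genotypes[phenotype == 0]
    # axis=0 runs all per-SNP tests in a single vectorized call,
    # which avoids a slow Python loop over tens of thousands of SNPs.
    pvals = np.asarray(stats.ttest_ind(cases, controls, axis=0).pvalue)
    # SNPs with no variation yield NaN; treat them as non-significant.
    return np.where(np.isnan(pvals), 1.0, pvals)


def manhattan_plot(pvals, alpha=0.05, out_path="manhattan.png"):
    """Plot -log10(p) against SNP index and draw the Bonferroni cutoff."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    cutoff = alpha / len(pvals)  # Bonferroni: alpha / number of tests
    # Clip to avoid log10(0) for extremely small p-values.
    scores = -np.log10(np.clip(pvals, 1e-300, 1.0))
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.scatter(np.arange(len(pvals)), scores, s=3)
    ax.axhline(-np.log10(cutoff), color="red", linestyle="--")
    ax.set_xlabel("SNPs")
    ax.set_ylabel("-log10(p-value)")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return cutoff
```

With the real files, this would be used as `genotypes, phenotype = load_data()` followed by `manhattan_plot(snp_pvalues(genotypes, phenotype))`. Because the files described carry no chromosome positions, the x-axis is simply the SNP index, as in Figure 1.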
We can define a function called loadDataset that loads a CSV file with the provided filename and randomly splits it into training and test datasets using the provided split ratio.
Python
import csv
import random

def loadDataset(filename, split, trainingSet, testSet):
    # Read every non-empty row (text mode, as Python 3's csv module expects).
    with open(filename, 'r', newline='') as csvfile:
        dataset = [row for row in csv.reader(csvfile) if row]
    for row in dataset:
        # Convert the four numeric features; the class label stays a string.
        for y in range(4):
            row[y] = float(row[y])
        # Randomly assign each row to the training or the test set.
        if random.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)
Download the iris flowers dataset CSV file (iris.data) to the local directory, then load and split it:
Python
trainingSet = []
testSet = []
loadDataset('iris.data', 0.66, trainingSet, testSet)
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))
Below is the getNeighbors function, which returns the k most similar neighbors from the training set for a given test instance, using the already-defined euclideanDistance function.
Python
import operator

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    # Number of features to compare, excluding the last element of testInstance.
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, nearest first, and keep the k closest rows.
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
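The getNeighbors function calls euclideanDistance, which is not shown in this answer. A minimal sketch of such a helper, assuming it takes two rows and the number of leading numeric features to compare (the signature is inferred from the call above and is not confirmed by the source):

```python
import math


def euclideanDistance(instance1, instance2, length):
    """Euclidean distance over the first `length` features of two rows."""
    total = 0.0
    for i in range(length):
        # Only the leading numeric features are compared, so a trailing
        # class label (e.g. 'a' or 'b') is never touched.
        total += (instance1[i] - instance2[i]) ** 2
    return math.sqrt(total)
```

For example, euclideanDistance([2, 2, 2, 'a'], [4, 4, 4, 'b'], 3) is the square root of 12.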
Test the program:
Python
trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
testInstance = [5, 5, 5]
k = 1
neighbors = getNeighbors(trainSet, testInstance, k)
print(neighbors)  # prints [[4, 4, 4, 'b']]