
We will visualize Genome-Wide Association Studies (GWAS) with the Manhattan plot


Question

We will visualize Genome-Wide Association Studies (GWAS) with a Manhattan plot for psychiatric disorders. There are two data sets, SNPs and phenotypes, that contain genome data for two groups: psychiatric disorders (y = 1) and control (y = 0). "SNP.csv" contains 37,853 SNPs for 130 samples. Each value in the file is the number of minor alleles at that SNP (i.e., a value in {0, 1, 2}). In "Phenotype.csv", zero indicates that the sample is a control, while one indicates a psychiatric disorder (one of bipolar disorder, schizophrenia, and major depression). Compute p-values using a t-test (you can use any library for the t-test). Perform the t-test pairwise between each SNP and the phenotype, i.e., you need to perform 37,853 t-tests and compute their p-values. Then make a Manhattan plot with Bonferroni multiple-testing correction (i.e., use the p-value cutoff 0.05/37,853).

Write code in R or Python that reads in the .csv files and generates a Manhattan plot (using any library).

I cannot attach the .csv files mentioned, so any files that you can get to work with your code will be fine. I struggled because there was a lot of data, so please make sure the code can handle large input files.

[Figure 1. Manhattan Plot (x-axis: SNPs)]

Explanation / Answer

We can define a function called loadDataset that loads a CSV file with the provided filename and splits it randomly into train and test datasets using the provided split ratio.

Python

import csv
import random

def loadDataset(filename, split, trainingSet, testSet):
    # Read every non-empty row; iris.data ends with a blank line.
    with open(filename, newline='') as csvfile:
        dataset = [row for row in csv.reader(csvfile) if row]
    for row in dataset:
        # The first four columns are numeric features; the fifth is the class label.
        for y in range(4):
            row[y] = float(row[y])
        # Send each row to the training set with probability `split`.
        if random.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)

Download the iris flowers dataset CSV file to the local directory, then load and split it:

Python

trainingSet = []
testSet = []
loadDataset('iris.data', 0.66, trainingSet, testSet)
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))

Below is the getNeighbors function, which returns the k most similar neighbors from the training set for a given test instance. It uses the euclideanDistance function, which must already be defined.

Python

import operator

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    # The last element of testInstance is treated as its label and is not compared.
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, closest first.
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
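The getNeighbors function calls euclideanDistance, which this answer does not show. A minimal sketch of such a helper, assuming it takes two instances and the number of leading numeric coordinates to compare (the signature is inferred from the call above, not confirmed by the original author):

```python
import math

def euclideanDistance(instance1, instance2, length):
    """Euclidean distance over the first `length` coordinates (assumed signature)."""
    total = 0.0
    for i in range(length):
        total += (instance1[i] - instance2[i]) ** 2
    return math.sqrt(total)

# Only the first three values are compared, so the trailing labels are ignored.
print(euclideanDistance([2, 2, 2, 'a'], [4, 4, 4, 'b'], 3))
```

Passing the length explicitly lets the same function skip a trailing class label, which is how getNeighbors uses it.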

Test the program:

Python

trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
testInstance = [5, 5, 5]
k = 1
neighbors = getNeighbors(trainSet, testInstance, k)
print(neighbors)
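Returning to the task stated in the question, the per-SNP t-tests and the Bonferroni cutoff can be sketched as follows. This is only an illustration under several assumptions: "pairwise t-test between a SNP and phenotype" is read as a two-sample t-test comparing minor-allele counts in cases versus controls; the data are synthetic stand-ins for SNP.csv and Phenotype.csv, with an assumed 65/65 split and one artificially planted associated SNP; and the names snp_pvalues and manhattan_plot are hypothetical. It uses NumPy and SciPy, with Matplotlib needed only for the plot.

```python
import numpy as np
from scipy import stats


def snp_pvalues(genotypes, phenotype):
    """Two-sample t-test per SNP: cases (y = 1) versus controls (y = 0).

    genotypes: array of shape (n_samples, n_snps) with minor-allele counts.
    phenotype: array of shape (n_samples,) with 0/1 labels.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    phenotype = np.asarray(phenotype)
    cases = genotypes[phenotype == 1]
    controls = genotypes[phenotype == 0]
    # axis=0 runs all tests in one vectorized call instead of a Python loop.
    _, pvals = stats.ttest_ind(cases, controls, axis=0)
    # A SNP with no variation gives NaN; treat it as non-significant.
    return np.nan_to_num(pvals, nan=1.0)


def manhattan_plot(pvals, alpha=0.05, out_path="manhattan.png"):
    """Scatter -log10(p) by SNP index and draw the Bonferroni cutoff line."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    cutoff = alpha / len(pvals)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.scatter(np.arange(len(pvals)), -np.log10(pvals), s=2)
    ax.axhline(-np.log10(cutoff), color="red", linestyle="--")
    ax.set_xlabel("SNPs")
    ax.set_ylabel("-log10(p-value)")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


# Synthetic stand-in for the real files, which are not available here.
rng = np.random.default_rng(0)
n_snps = 37853
y = np.array([1] * 65 + [0] * 65)  # assumed group sizes
X = rng.binomial(2, 0.3, size=(130, n_snps))
# Plant one associated SNP so the plot has a visible peak.
X[:, 100] = np.where(y == 1, rng.binomial(2, 0.8, 130), rng.binomial(2, 0.1, 130))

pvals = snp_pvalues(X, y)
print("Bonferroni cutoff:", 0.05 / n_snps)
print("SNPs below cutoff:", np.flatnonzero(pvals < 0.05 / n_snps))
# manhattan_plot(pvals)  # uncomment to write manhattan.png
```

With the real files, X and y would instead be read with something like numpy.loadtxt(path, delimiter=","), after checking whether the files have header rows and whether samples are stored as rows or as columns. Running all tests through one vectorized call keeps the 37,853 comparisons fast even on a laptop.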

