Hi, just a quick Java programming question. http://courses.cs.washington.edu/cou
ID: 3673885 • Letter: H
Question
Hi, just a quick Java programming question. http://courses.cs.washington.edu/courses/cse142/16wi/homework/7/spec.pdf please read the spec. I don't know how to convert the input file that we're given (for example, dna.txt) to a random output file (using a PrintStream?). Can someone just take me through the steps of coding the "Region Name:", "Nucleotides:", and the "Codons List:" part? I just don't get how to get a string from the input file and write it into a new output file. The rest I understand. No overcomplicated programming things please, I only have a basic understanding of computer science (using JGRASP).
Explanation / Answer
import java.io.*;
import java.util.*;
import java.util.Scanner;
public class ConvertDna {
// % of characters to randomly make lowercase
public static final int PERCENT_LOWERCASE = 10;
// % of lines to have as proteins (the rest are random DNA)
public static final int PERCENT_PROTEINS = 66;
public static void main(String[] args) throws IOException {
Scanner console = new Scanner(System.in);
System.out.print("Genome (.fna) file? ");
Scanner fnaInput = new Scanner(new File(console.nextLine()));
System.out.println("Reading .fna file...");
StringBuilder sb = new StringBuilder(4000000);
while (fnaInput.hasNextLine()) {
sb.append(fnaInput.nextLine());
}
System.out.print("Protein (.ptt) file? ");
Scanner pttInput = new Scanner(new File(console.nextLine()));
System.out.print("Output file? ");
PrintStream out = new PrintStream(new File(console.nextLine()));
System.out.print("How many proteins (-1 for all)? ");
int proteins = console.nextInt();
System.out.println("Producing protein output...");
readProtein(pttInput, proteins, sb, out);
}
public static void readProtein(Scanner pttInput, int proteins,
StringBuilder sb, PrintStream out) {
pttInput.nextLine(); // skip header lines
pttInput.nextLine();
pttInput.nextLine();
Random rand = new Random(42);
while (proteins != 0 && pttInput.hasNextLine()) {
String line = pttInput.nextLine();
Scanner lineScan = new Scanner(line);
lineScan.useDelimiter("[ :.]+");
int start = lineScan.nextInt();
int end = lineScan.nextInt();
String strand = lineScan.next(); // "+" or "-"
if (strand.equals("+")) {
lineScan.next(); // skip length token
lineScan.next(); // skip PID token
lineScan.next(); // skip gene token
lineScan.next(); // skip synonym token
lineScan.next(); // skip code token
lineScan.next(); // skip COG token
String name = lineScan.next();
while (lineScan.hasNext()) {
name += " " + lineScan.next();
}
if (rand.nextInt(100) > PERCENT_PROTEINS) {
// grab some random dna
start = rand.nextInt(sb.length() - 3) + 1;
end = Math.min(start + 3 + 3 * rand.nextInt(100), sb.length()) - 1;
name = "Non-protein region";
}
int length = end - start + 1;
out.println(name);
StringBuilder range = new StringBuilder(sb.substring(start - 1, end));
// pseudo-randomly change casing of 10% of nucleotides
for (int i = 0; i < length * PERCENT_LOWERCASE / 100; i++) {
int index = rand.nextInt(length);
range.setCharAt(index, Character.toLowerCase(range.charAt(index)));
}
out.println(range);
proteins--;
}
}
}
}
dna.txt
cure for cancer protein
ATGCCACTATGGTAG
captain picard hair growth protein
ATgCCAACATGgATGCCcGATAtGGATTgA
bogus protein
CCATt-AATgATCa-CAGTt
ecoli.txt
thr operon leader peptide
ATGAAACGCATTAGCaCCAcCATtACCACCaCCATCaCcATTACCACAGGTAACGGTGCGGGCTGA
aspartokinase I/homoserine dehydrogenase I
ATGCGAGtGTTGAAGTTcgGCGGTaCATCAgTGGCAAATGCAGAACGTtTTCTGCGGgTTGCCGATAttCTGGAAAGcAATGCCAGGCAGGGGCAGgTGGcCACCGTCCTCtCTGcCCCCGCCAAAATCACCAACCATCtGGTaGCGATGATtGaaAAaACCATtAGCGGTCAGGAtGCtTTaCcCaATATCAGCGATGCCGAACGTATTTTTGCCGAACTtCTGACgGGACTCGCCGCcGCCCAGcCGGGATTTCCGCTGGCACAAtTgAAAAcTTTCGTCGACCAgGAATTTGCCCAAATAAAACATGTcCtGCATGGCatCAGTTTGTTGGGGCAGTGCCCGGaTAGCATcAACGCTGCGCTGATTTGcCGTGgCGAGAAAaTGTcGaTcgCCattaTGGCCGGCGTGTTAGAAGCGCGTGGTCACAACGTTACCGTTATCGATCCGgTCGAAaAAcTGCTgGCAGTGGGTCATTAcCtCgAaTCTACCGTTGATaTtGCTGAATCCACCCGCCGTATTGCGGCAAGCCGCATTCCgGCTGACCACATgGtGCTGATGGCTGGTTTCACTGcCggTAATGAAAAAGgCGaGCTGGtGGTtCTGGGAcGCAACGGTTCCGACTaCTCCGCTGCGGTgCTGGCGGCcTGTTTaCGCGCCGATTGTTGcGAgaTCTGGACGGATGTTGAcGGTGTTTATACCTGCGATCCGCGTCAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTATCAGgAaGCGATGGAGCTTTCTTACTTCGGCGCTAAAgTTCTTCaCCCcCGCACCATTACCCCCATcGCCCAGtTCCAGATcCCTtgCCtGATTAAAAATAcCGgAAAtCCCCAAGCACCAGgTACGCtCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGCATTTCCAATcTGAATaACATGGCAATgTTCAGcGTTTCCGgCCCGGGGAtGAAAGGgATggTTgGCATGGCGGCGCGcgTCTTTGCAGcGaTGTCACGCGCCCGTaTTtCCGTGGTgCtGATTACGCAATCATCTTCCGAATACAGTATCAGTTTCTGCGTTCCGCaAAGCGACTGTGTGCGAGCTgAaCGGGCAaTGcAGGAAGAGtTCTACCTGGAaCTGaAAGAAGGCTTACTGGAGCcGTTGGCgGtGACGGAACGGCTGGCCATTATCTcGGTGgTAGGTGATGGTATGCGcACCTtaCGTGGGAtCTCGgCGAAATtCTtTGCCGCGCTgGCcCGCGCCAATATCAACATTGTCgCCATTGCtCaGGGaTCTTcTGAaCGCTCAAtCTCTGTcGTGGTcAaTAACGATgATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTATCGAAGTGTTTGTGATTGgCGTCGGTGGCGTTGgcGGTGCGCTGCTGgAGCAACTGAAGCGTCAgCAAAGCTGGTTGAAGAATAAaCATATCGaCTTACGTGTCTGCGGTGTTGCTAACTCGAAGgCACtgCTCACCAATGTACATGGCCTTAATCTGGAAAACTGGCAGgAAGAACTGGCGCAAGCcAAAGAGCCGTTTAATCTCGgGCGcTtAATTCGCCTCGTGAAAGAATATCATCTGCtGAaCCCGGTCATTgTTGACTgTACTTCCAgCCAGGCTGTgGCAGaTCAATATgCCGACTtCCTgCGCGAAGGTTTCCAcGTTGTtACGCCGAaCAAAaAGGCCaACACCTCGTcgATGGaTTACTaCCATCAGTtGCGTTATGCGGCGGAAAAATCGCGGCGTAaATTCCTCtATGACACcaACGTtGGGGCTGGATTACCGGTTATTgAGAACCTGCAAAATCTGCTCAATGCtGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCAGGTTCGCTTTCTTAtATCTTCGGCAAGTTAGACGAAGGCaTGAGTtTCTCCGAGgCGACCaCACTGGCGCGGGAAATGgGTTATACCGAACCGGAcCcGCGAGATGATCTTtCtGGTATGgAtGTGGCGCgTAagCTAtTGATtCTCGCTCGTGAAACGGGACGTGAACTGGAGCtGGCGGATATTGAAATTGAACCTgTGCTGCCCGCaGaGTTTAACGCCGAGGGTGATGTCGCcGCTTTTATGGCGAATCTGTCACAGCTCGACGaTCtCTTTGCCGCGCGTGTgGCGAAGGCCCGTGATGAAGGAAAAGTTTTGCGCTATGTTGGCAATAttGATGAAGATgGCgTCTGCCGCGTGAAGaTTGCCGAAGTGGATGgTAATGaTCCGCTGTTCAAAGTGAaAaATGGCGaAAACGCCCTGGCCTTCTATAGCCACTATtATCAGCCGCTGCCGTTGGTACTGCGCGGATATGGTGCGGGCaATgACGTTaCAGCTGCCGGTgTCTTTGCTGATCTGCTACGtACCCTcTCAtGGaAGTTAGGAGTCTGA
homoserine kinase
ATgGTTAAAgTTTAtGCCCCGGCtTCCAGTGCCaATATGaGcGTCGgGTTTGATGTGCTCGGGgCGGCGGTGACACCTGTTGATGGTGCATTGCTCGgAGaTGTagTcaCGGTTGAGGCGGCAGAGACaTTCAgTCTCAACAACCTCGGACGCTTTGCCGAtAAGCTGCCGTCAGAGCCACGgGaaAATAtCGTTtATcAGTGcTGGGAGCGTtTTTGcCaGGAGCTTGGCAAGCAAATTCCAGTGGCGATGaCTCTGGAAAAGAATatGCCGAtCgGTTCGGGcTTAGGCTcCAGCGCCtGTTCAGTGGTCGCGGCgCTgAtGGCGATgAATGAAcACTGCGGCaAGCCGCTTAATGACACTCGTTTGCTGGCTTtGATGGgCGAgTTGGAAGGGcGTATCTCCGGCAGCAtTCATTACGACAACGtGGCACCGTGtTtTCtTGGTGGTAtGCAGTtgATGATCGAAGAaAACGACATCATCAGCCAGCAaGTGCCAGGGTTTGATGAGtGGCTGTGGGTGCTGGCGTATcCGGgGAtTAAAGTCtCGaCGGcAGAAGCCAGGGCTaTTTTACCGGCGCAGTATCGCCGCCAGGATTGCATTGCGCAcGGGCgACATCTgGCAGGCTTCATTCACGCCTGCTATTCCCGTCAGCTTGAGCTTGCCGCGAAGCTGATgAAAGaTGTTATCGCTGAACCCTACcGTGaACgGTTaCTGCCAGGCTTCCGGCAGGCGCGGcAGgCGGTTGCGGAAATCGGCGCGGTAgCGAGCGGTATCTCCGGCTCCGGCCCGAcTtTGTTCGCTCTGTGtGAcAAGCCGGATACCGCCCAGCGCGTTGCCGACTGgTTGGGTAAGAACtAcCTGCAAAATCAGgAAGGTTTTGTTcATATTTGCCGGCTGGATACGGCGGGcGCACGAgTACTGGAAAACTAA
threonine synthase
ATGAAACTCtacaATCTGAAAGATCACAATGAGCAGgTCaGCTTTGCGCAAGCCGTAACCCAGgGgTTAGGCAAAAATCAGGGgCtGTtTTTTCcgCACgaCCTGCCGGaaTTCAGCcTgACTGAAaTTGATGAGATgCTGAAGCtGGATTTTGTCACcCGCAGTGCGAAGATCCTcTCgGCGTTTATTGGTGATGAAATCCCGCAGGAAaTCCTGGAAGAGCGCGTACGTGCGGCGTTTGCCTTCCCGGCTCCGGTCGCCAATGTTGAAaGCGATGTCGGTtGTCTGGAaTTGTTCcACGGGCcAACGCTGGCaTTTAAAGATTTCGGcGGTcGCTTTATGGCACAAATGCTgACCcATATTGCGGGCGATAAGCCAGTGAcCATTCTGACCGCGACATCCGGTgATACTGGaGCGGCAGTGGcTCATGcTTTCtACGGTtTACCGAATGTGAAAGTGGTTATCCTCTATCCACGAGGCAAAATCAGTCCACTGCAAGAAAAACTgTTCTGTACATTGgGCggCAATATCGaAACTGTTGCCATCGAcggCGaTTTCGATGCCTGTCAGGCGCTGGTgAAGCAGGCgTTTGATGATGAAGAACTGAAAGTGgCgCtGGGGCtGAATTCTGCTAAcTCCATCAACaTCAGTCGCTTGCTGGCGcAGATTTGTTaTTAcTTTGaGGCTGTCGCACAGTtGCCGCAAGAAGCACGTAACCAGTTGgTTGTCTCGGTaCCGAGTGgAAACtTcGGCGATtTGACGGcGGGTCTGCTGGCGAaGTcACTCGGTCtGCCGGTAAAACGTtTTATTGCtgCGACCAACGTGAACGAtACCGTACCACGTTTCCTGCaCGaCGGTCAGTGGTCAcCCAAaGCGACTCAGgCGAcgTtaTCCAATGCGATGGATGTTAGCCAGCcAAaCAACTGGCCGCGTGTGGAAGAGTTGtTCcGCCGCAAAATCTGGCAACTGAAAGAGCTGGgTTATGCAGCCGTGgATGATGAAACCACGCAACAGACAATGcGTGAGtTAAaAGAACTGGGCTATACCTCGgAGCCGCACgCTGCCGTAGCTTATCGTGCGCTGCGTGACCAgTTGAAtCCAGGCGAATATGGCTTGTtCCTCGGcACcGCGCATCcGGcGAAatTtAAAgAGAGCGTGGAAGCGATTCTCGGTGAAAcGTTGGatCTGCCAAAAGAGCTGGCAGAACGTGCTgATTTACCCTTGCTTTCGCATAACCTGCCCGCCGATTTTGCTGCGTTGCGTAAatTgaTGATGAaTCATCAGTAA
hypothetical protein
AtGCAGCCcGGCTtTTTTTATGAAGAAAATaTGGAGaAaAACGACagGGAAAAAGGAGAAATTCtCAATAAATGCGGtAACTTAGAgATTaGGATTGCGGAGAATaACAACTGCcGTTCTCaTCGCGTAATCTCCGGATATCGACCCaTAACGGgCAATGATAAAAGgAGTAACCTGTGA
Non-protein region
aAAAACTgCTGGAAACAATGAAAGAcGTACCGGACGACCAAcGTCAGgCGC
transaldolase B
ATGACGGACAAATTGaCCTCcCTTCGTCAGTACACCACCGTAgTGGCCGACACTGGGGACATCGCGGCAATGAAGcTGTaTCAACcGCAGGATGCCACAACCAAcCCTtCTCTCATTCTTAACGCAGCGCAGATTCcGGAATACCGTAAgTTgATTGaTGATGCTGTCGCCTGGGcGAaACaGCAGAGCAAcGATcGCgCgCAGCAgATCGtGGACGCGACCGAcAAACTGGCAGTAaATATTgGTCTgGAAaTCCTGAAACTGgTTCCGgGCCgTATCTCAActGAAGTtGATGCGCGTCTTTCCTATGACaCCGAAGCGTCAATTGCGAAAGCAAAACGCCTGATCAAACTCTACAACGATGcAGGTaTTAGCAACGATCgTaTTCTGATCAAACTGGCTTCTACCTGGCAGGGTATCCGTGCTGcAGAACAGCTGGAAAAAGAaGGTATTAACTGTAAcCTGACCCTGCTgtTCTCctTCGCtCAGGcTCGTGCTTGTGCGGaAGCGGgCGTgTTCCTGaTCTCGcCGTTTgTTGGCcGTATTCTTGACTGGTAcAAaGCGAATACCGaTAAGAAAGAGtACGCTCcGGCAGAAGATcCGGGCGTGGTTTCTGTatCtGAAATCtACCAGtACTACaAAGAGCATGGTTaTgAAACCGTGGTTATGGGCGCAAGCTTCCGTAACATCGGCGAAATTCTGGAAcTGGCAGGCTGCGACCGTCTGACCatCGCACCGgcACTGCTGAAAGAGCTGgCGGAGAGCGAAGGGGCTATCgAACGTAAACTgTCTTACAcTGgTGAAGTgAAAGCgCGTCCGGcGCGTATCACtGAGtCCGAGTTCCTgTGgCAgCACAACCAGGATCCAATGGCAGTaGATAAACTgGcGGaAGgTATCCGTAAGTTTGCTGTTGACCAGGAAAAACTGGAAAAAATGATCGGCGATCTGCtGTAA
molybdopterin biosynthesis mog protein
ATGAATACTTTACGTATTGGCTTaGTtTcCaTCTCTGATCGCGCATCCAGCGGCGTTTAtCAGgaTAAAgGCATCCCTGCGCTGGAagAATGGCTGACAtcGGCGCTAACCACGcCGTTTGAaCTGGAAAcCCgCTTaATCCCCGATGAGCAGGCGATCATCGAGCAaACgTTgTGTGAGCTGGTGGATGAAaTGAGtTGCCaTCTGGTGCTCACCACGGGCGGAAcTGGCCCTGCGCGTCGTGAcgTAACGCcCGATGcGACGCTGGCAGTAGCGGACCGCGAGATgCcAGGCTTTGGTGAACAGATGCGCCAGATCAGCCTGCATTTTGTACcaaCTGCGATCCTTTCGCGTCAGGTggGGGTgATTCGCAAACAGGCGCTGATCCTTAACTTaCcCGGTCAACCGAAGtCTATTAAAGAGACGCtGgAAGGTGtGAAGGACGCTGAGgGTAAcGTTGTGGTGCACGgTATTTTTGCCaGCGTaCcGTaCTGCATTCAGTTGCTGGAAGGGCCATACGTTGAaACGGCaCCgGaAGTGGTTGCAGCATTCAGaCCGAAGAGTGCAaGACGCGAAGtTAGCGAATAA
output_dna.txt
Region Name: cure for cancer protein
Nucleotides: ATGCCACTATGGTAG
Nuc. Counts: [4, 3, 4, 4]
Total Mass%: [27.3, 16.8, 30.6, 25.3] of 1978.8
Codons List: [ATG, CCA, CTA, TGG, TAG]
Is Protein?: YES
Region Name: captain picard hair growth protein
Nucleotides: ATGCCAACATGGATGCCCGATATGGATTGA
Nuc. Counts: [9, 6, 8, 7]
Total Mass%: [30.7, 16.8, 30.5, 22.1] of 3967.5
Codons List: [ATG, CCA, ACA, TGG, ATG, CCC, GAT, ATG, GAT, TGA]
Is Protein?: YES
Region Name: bogus protein
Nucleotides: CCATT-AATGATCA-CAGTT
Nuc. Counts: [6, 4, 2, 6]
Total Mass%: [32.3, 17.7, 12.1, 29.9] of 2508.1
Codons List: [CCA, TTA, ATG, ATC, ACA, GTT]
Is Protein?: NO
Region Name: michael jordan mad hops protein
Nucleotides: ATGAG-ATC-CGTGATGTGGG-AT-CCTA-CT-CATTAA
Nuc. Counts: [9, 6, 8, 10]
Total Mass%: [24.6, 13.5, 24.5, 25.3] of 4942.9
Codons List: [ATG, AGA, TCC, GTG, ATG, TGG, GAT, CCT, ACT, CAT, TAA]
Is Protein?: YES
Region Name: paris hilton phony protein
Nucleotides: ATGC-CAACATGGATGCCCTAAG-ATATGGATTAGTGA
Nuc. Counts: [12, 6, 9, 9]
Total Mass%: [32.6, 13.4, 27.3, 22.6] of 4974.3
Codons List: [ATG, CCA, ACA, TGG, ATG, CCC, TAA, GAT, ATG, GAT, TAG, TGA]
Is Protein?: YES
Region Name: jimi hendrix guitar talent protein
Nucleotides: ATGATAATTAGTTTTAATATCAGA-CTGTAA
Nuc. Counts: [12, 2, 4, 12]
Total Mass%: [40.0, 5.5, 14.9, 37.1] of 4049.5
Codons List: [ATG, ATA, ATT, AGT, TTT, AAT, ATC, AGA, CTG, TAA]
Is Protein?: NO
Region Name: admiral grace murray hopper protein
Nucleotides: ATGC-AATT--GC-----TCGA--------TTAG
Nuc. Counts: [5, 3, 4, 6]
Total Mass%: [17.0, 8.4, 15.2, 18.9] of 3964.1
Codons List: [ATG, CAA, TTG, CTC, GAT, TAG]
Is Protein?: NO
Region Name: tyler durden's brain protein
Nucleotides: ATGATACCTATGAGTAATGTGGACCATATCCAAACTATAGGCATTGTCGGACCAACGATCGATTGGTTATACTGA
Nuc. Counts: [24, 14, 16, 21]
Total Mass%: [32.9, 15.8, 24.6, 26.7] of 9843.8
Codons List: [ATG, ATA, CCT, ATG, AGT, AAT, GTG, GAC, CAT, ATC, CAA, ACT, ATA, GGC, ATT, GTC, GGA, CCA, ACG, ATC, GAT, TGG, TTA, TAC, TGA]
Is Protein?: YES
Region Name: mini me growth hormone
Nucleotides: ATGGGACGCTGA
Nuc. Counts: [3, 2, 5, 2]
Total Mass%: [24.8, 13.6, 46.3, 15.3] of 1633.4
Codons List: [ATG, GGA, CGC, TGA]
Is Protein?: NO
Region Name: Nyan Cat protein
Nucleotides: CAT-CAT-CAT-CAT-CAT-CAT-CAT-CAT-CAT-CAT
Nuc. Counts: [10, 10, 0, 10]
Total Mass%: [29.3, 24.1, 0.0, 27.1] of 4613.4
Codons List: [CAT, CAT, CAT, CAT, CAT, CAT, CAT, CAT, CAT, CAT]
Is Protein?: NO
output_ecoli.txt
Region Name: thr operon leader peptide
Nucleotides: ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA
Nuc. Counts: [21, 22, 12, 11]
Total Mass%: [33.5, 28.9, 21.4, 16.2] of 8471.7
Codons List: [ATG, AAA, CGC, ATT, AGC, ACC, ACC, ATT, ACC, ACC, ACC, ATC, ACC, ATT, ACC, ACA, GGT, AAC, GGT, GCG, GGC, TGA]
Is Protein?: YES
outputut
Genome (.fna) file? dna.txt
Reading .fna file...
Protein (.ptt) file? ecoli.txt
Output file? output_dna.txt
How many proteins (-1 for all)? -1
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.