JDF509 Bioinformatics

From CSBLwiki

(Difference between revisions)

Jump to: navigation, search

Revision as of 10:42, 31 May 2011

2011 Spring

Textbook Introduction to Computational Genomics Textbook website - Software & Data

숙제 단원이 끝나면 단원 맨뒤에 있는 연습문제(Exercise)풀어서 제출

Schedule: 3/15/11 Introduction to Python programming by 노한성; 노한성 발표자료 3-15-2011

Midterm rest (4/19/11) - no exam

No class - 5/10/11 (Budda's birthday)

Chapter	Name	Pages	Presentation	숙제 업로드
1		21	03/22/11	03/18/11
2		16	03/29/11	03/25/11
3		23	04/05/11	03/28/11
4		17	04/12/11	04/08/11
5		18	04/26/11	04/22/11			
6		14	05/03/11	04/29/11
7		18	05/10/11	05/06/11
8		12	05/24/11	05/20/11
9               18      05/31/11	05/27/11
10		21	06/07/11	06/03/11
Final			Project

Software

Python

about Python

Introduction to Programming using Python

파이썬 설치하기 파이썬엑스와이 Python(x,y)-2.6.5.6 (Mar 2011) - Free scientific and engineering development software download & install - including almost every very useful scientific modules (Numpy, Scipy...) 바이오파이썬 Biopython 1.56 (Mar 2011) download & install Biopython Tutorial & Cookbook(참고문서)

Chapter 1

Chapter 1 The first look at a genome: Chaos Game Representation

Exercise#1

Download a genome sequence & do basic statistical analysis
GC-content?: Solution = GC content of NC_01415 is '??? %'; Code

>>> from Bio import Entrez, SeqIO
>>> Entrez.mail = 'your@email.address'
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001416",rettype="fasta")
>>> record = SeqIO.read(handle,"fasta")
>>> print record
ID: gi|9626243|ref|NC_001416.1|
Name: gi|9626243|ref|NC_001416.1|
Description: gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome
Number of features: 0
Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', SingleLetterAlphabet())
>>> print len(record)
48502
>>> record
SeqRecord(seq=Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', SingleLetterAlphabet()), id='gi|9626243|ref|NC_001416.1|', name='gi|9626243|ref|NC_001416.1|', description='gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome', dbxrefs=[])
>>> record.seq
Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', SingleLetterAlphabet())
>>> from Bio.SeqUtils import GC
>>> GC(record.seq)
49.857737825244321

GC-content scanning with window size 500 bps?
- ANS:

GC-values of windowsize 500

Code

>>> x = record.seq
>>> windowsize = 500
>>> gc_values = [ GC(x[i:(i+499)] for i in range(1,len(x)-windowsize+1) ]
>>> import pylab
>>> pylab.plot(gc_values)
>>> pylab.title("GC% 500 bp window size")
>>> pylab.xlabel("Nucleotide positions")
>>> pylab.ylabel("GC%")
>>> pylab.show()

Exercise#2

Basic Statistical Analysis

Comparing human and chimp complete mitochondiral DNA (NC_001807 and NC_001643)
- GC% Human: 44.5, Chimp: 43.7

>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001807",rettype="fasta")
>>> record1 = SeqIO.read(handle,"fasta")
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001643",rettype="fasta")
>>> record2 = SeqIO.read(handle,"fasta")
>>> from Bio.SeqUtils import GC
>>> GC(record1.seq)
44.487357431657713
>>> GC(record2.seq)
43.687326325963511
>>> len(record2.seq)
16554
>>> len(record1.seq)
16571

Exercise#3

Most frequent word

Count frequent dinucleotides in rat Mitochondiral DNA
- NC_001665

>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001665",rettype="fasta")
>>> ratMT = SeqIO.read(handle,"fasta")
>>> base = [ ratMT.seq[i] for i in range(0,len(ratMT.seq))]
>>> a = base.count('A')
>>> g = base.count('G')
>>> c = base.count('C')
>>> t = base.count('T')
>>> di = [ str(ratMT.seq[i:(i+2)]) for i in range(0,len(ratMT.seq)-1) ]
>>> aa = di.count('AA')
>>> aa
1892
>>> a
5544

Chapter 2

related topics GeneSweep result Type I & Type II errors Statitical power

Exercise

Find all ORFs in Human, Chimp and Mouse mtDNA Repeat the ORF search on randomized mtDNA. The longest ORF in the randomized sequence? Find ORFs in H. influenzae

Exercise#1 Finding ORFs

>>> han1 = Entrez.efetch(db="nucleotide",id="NC_001807",rettype="fasta")
>>> hum = SeqIO.read(han1,"fasta")
>>> from Bio.Seq import Seq
>>> orf = hum.seq.translate(table="Vertebrate Mitochondrial")
>>> orf.count("*")
326

From Eric Talevich's presentation

# define function 1 - translate a given sequences in all 6 frames
def translate_six_frames(seq, table=2):
    rev = seq.reverse_complement()
    for i in range(3):
        yield seq[i:].translate(table)
        yield rev[i:].translate(table)


# define function 2 - translate given sequences in 6 reading frames
# & return ORFs, min_prot_len = 'k'
def translate_orfs(sequences, min_prot_len=60):
    for frame in translate_six_frames(seq):
        for prot in frame.split('*'):
            if len(prot) >= min_prot_len:
                yield prot


# actual procedure
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

seq = hum.seq
proteins = translate_orfs(seq) 
seqrecords = (SeqRecord(seq,id='orf'+str(i))
           for i, seq in enumerate(proteins))

for rec in seqrecords:
	print ">%s lenght=%s\n%s"%(rec.id,len(rec.seq),rec.seq)

Exercise#2 Using randomized sequences

producing randomized genome


from Bio import Entrez, SeqIO
Entrez.email = ''
han1 = Entrez.efetch(db="nucleotide",id="NC_001807",rettype="fasta")
hum = SeqIO.read(han1,"fasta")

import random
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
nuc_list = list(hum.seq)

#shuffle sequence
random.shuffle(nuc_list)

seq = Seq(''.join(nuc_list))

def translate_six_frames(seq, table=2):
    rev = seq.reverse_complement()
    for i in range(3):
        yield seq[i:].translate(table)
        yield rev[i:].translate(table)

def translate_orfs(sequences, min_prot_len=60):
    for frame in translate_six_frames(sequences):
        for prot in frame.split('*'):
            if len(prot) >= min_prot_len:
                yield prot
from Bio.SeqRecord import SeqRecord

proteins = translate_orfs(seq)
seqrecords = (SeqRecord(seq,id='orf'+str(i+1)) 
        for i, seq in enumerate(proteins))

for rec in seqrecords:
	print ">%s lenght=%s\n%s"%(rec.id,len(rec.seq),rec.seq)

Exercise#3

see, Excercis #1


import sys
#min = int(sys.argv[1]) #threshold
min = 60

from Bio.SeqRecord import SeqRecord
proteins = translate_orfs(hinf.seq,min)
seqrecords = (SeqRecord(seq,id='orf'+str(i+1)) 
        for i, seq in enumerate(proteins))

Chapter 3

Scoring matrix: PAM matrices

What is dynamic programming? by Sean Eddy

Dynamic Programming Exercises from UCSC BME 205 class homework

Exersize#1

Local & Global alignment of X79493 and AY707088

>sp|O18381|PAX6_DROME Paired box protein Pax-6 OS=Drosophila melanogaster GN=ey PE=2 SV=3
MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGV
NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP
RAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLR
NLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL
NSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSH
PQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSL
SPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDD
QARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSN
RRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVS
TINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPC
PLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAV
TPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAG
VGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIA
QESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHV
APGKQQFFASCFYSPWV

>gi|51872083|gb|AAU12168.1| paired box gene 6 isoform a [Homo sapiens]
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRA
IGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGAD
GMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKL
QRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASN
TPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQ
TSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPR
LQ

Tools - EMBOSS package - packing various sequence analysis programs

Emboss windows version - download site

Needlman-Wunsch (Global alignment) - download result File:Pax drome.needle.txt

Smith-Waterman (Local alignment) - download result File:Pax drome.water.txt

Excercise #2 Use Blast

Run Blast with X79493: Protein Blast (Blastp program)

Tips

The Statistics of Sequence Similarity Scores

P-values

Excercise #3 Alignment comparison

use DNA or protein sequence of Exercise #1

Excercise #4 Multiple sequence alignment

use following sequence set (File:SampleMSA.txt) to make a multiple sequence alignment

ClustalW links: Clustalw@genome.jp; Clustalw@EBI

Chapter 3

Scoring matrix: PAM matrices

What is dynamic programming? by Sean Eddy

Dynamic Programming Exercises from UCSC BME 205 class homework

Exersize#1

Local & Global alignment of X79493 and AY707088

>sp|O18381|PAX6_DROME Paired box protein Pax-6 OS=Drosophila melanogaster GN=ey PE=2 SV=3
MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGV
NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP
RAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLR
NLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL
NSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSH
PQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSL
SPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDD
QARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSN
RRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVS
TINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPC
PLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAV
TPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAG
VGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIA
QESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHV
APGKQQFFASCFYSPWV

>gi|51872083|gb|AAU12168.1| paired box gene 6 isoform a [Homo sapiens]
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRA
IGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGAD
GMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKL
QRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASN
TPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQ
TSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPR
LQ

Tools - EMBOSS package - packing various sequence analysis programs

Emboss windows version - download site

Needlman-Wunsch (Global alignment) - download alignment File:Pax drome.needle.txt

Smith-Waterman (Local alignment) - download alignment File:Pax drome.water.txt

Excercise #2 Use Blast

The Statistics of Sequence Similarity Scores

P-values

Excercise #3 Find homologs

Excercise #4 Multiple sequence alignment

Clustalw@genome.jp

Clustalw@EBI

Chapter 4

What is Hidden Markov Model? by Sean Eddy
What is Bayesian statistics? by Sean Eddy

Exercise#1

Segmenting with 4-state model
- 2-state='AT' and 'GC', 4-state = A,G,C,T

Exercise#2

Draw the topology of a 2-state HMM emitting symbols (1~10) with even & odd states

Excercise#3

Sketch the general architecture of an ORF finding HMM

Chapter 5

Where did the BLOSUM62 alignment score matrix come from? by Sean Eddy

Exercise#1

Which of the modern elephants seems to be more closely related to mammoths? Hint: make a global alignment and calculate the genetic distance between them
12S rRNA sequence of Saber-Tooth Tiger can tell which of modern felines is most closest one.
Genetic distance between blue-whale, hippo and cow

Download sequences & write into a file

8 : from Bio import Entrez, SeqIO
9 : h1 = Entrez.efetch(db="nucleotide",id="NC_001601",rettype="fasta")
10: h2 = Entrez.efetch(db="nucleotide",id="NC_000889",rettype="fasta")
11: h3 = Entrez.efetch(db="nucleotide",id="NC_006853",rettype="fasta")
12: blue = SeqIO.read(h1,"fasta")
13: hipp = SeqIO.read(h2,"fasta")
14: cow = SeqIO.read(h3,"fasta")
19: seqs = [blue, hipp, cow]
20: h4 = open("seq.fasta","w")
21: SeqIO.write(seqs,h4,"fasta")
22: h4.close()

Do alignment with multiple sequence alignment program (e.g. Clustalw)
Calculate genetic distance based on the model (e.g. Juke-Cantor model) by Phylip Package (or any GUI program for phylogenetic analysis - MEGA)
Ans: a pairwise distance matrix

     Whale  Hippo Cow 
Whale 0     
Hippo 0.222 
Cow   0.226 0.226

Download all files

Chapter 6

Exercise 1 Measure Ka/Ks ratio for various mtDNA genes

target gene - human mitochondiral cytochrome C protein (protein sequence)

find homologs of chimp & mouse (or other animals) using BLAST (protein sequence)

find each nucleotide sequence

use [1] Online KaKs Calculator!

Exercise 2 Viral genomes

Exercise 3 Practice with free online software tools for measuring Ka/Ks: [2] Online KaKs Calculator!

2010 Spring

SNP Fact Sheet - FAQs such as What are SNPs? etc.
Nature Milestones (2007) - DNA technologies
Suggested Reading for a term paper (due date: 6/17(Thu))

게놈의 기적 (과학도서관 링크)
신의 언어 : 유전자 지도에서 발견한 신의 존재 (과학도서관 링크)
지루한 사람과 어울리지 마라 : 과학에서 배우는 삶의 교훈 (과학도서관 링크)

2009 Spring

질병유전체학

Human Genome

dbSNP@NCBI(Summary)
- rs#: Reference, ss#: Submitted
HapMap project

Diagnosis test

Diagnostic performance test - Basic statistics

@@ Line 408: / Line 408: @@
 </pre>
 *[[media:sequences.zip|Download all files]]
+===Chapter 6===
+;Exercise 1 Measure Ka/Ks ratio for various mtDNA genes
+:[http://www.ncbi.nlm.nih.gov/protein/CAG47004.1 target gene] - human mitochondiral cytochrome C protein (protein sequence)
+::find homologs of chimp & mouse (or other animals) using BLAST (protein sequence)
+::find each nucleotide sequence
+::use [http://www.cs.gettysburg.edu/~chibfu01/] Online KaKs Calculator!
+;Exercise 2 Viral genomes
+:
+;Exercise 3 Practice with free online software tools for measuring Ka/Ks
+:[http://www.cs.gettysburg.edu/~chibfu01/] Online KaKs Calculator!
 ==2010 Spring==

JDF509 Bioinformatics

From CSBLwiki

Revision as of 10:42, 31 May 2011

Contents

2011 Spring

Software

Chapter 1

Exercise#1

Exercise#2

Exercise#3

Chapter 2

Exercise

Exercise#1 Finding ORFs

Exercise#2 Using randomized sequences

Exercise#3

Chapter 3

Exersize#1

Excercise #2 Use Blast

Excercise #3 Alignment comparison

Excercise #4 Multiple sequence alignment

Chapter 3

Exersize#1

Excercise #2 Use Blast

Excercise #3 Find homologs

Excercise #4 Multiple sequence alignment

Chapter 4

Exercise#1

Exercise#2

Excercise#3

Chapter 5

Exercise#1

Chapter 6

2010 Spring

2009 Spring

Human Genome

Diagnosis test

Personal tools

Namespaces

Variants

Views

Actions

Search

Site

Choi lab

Resources

Toolbox