JDF509 Bioinformatics

From CSBLwiki

Jump to: navigation, search

Contents

2011 Spring

Textbook
숙제
  • 단원이 끝나면 단원 맨뒤에 있는 연습문제(Exercise)풀어서 제출
Schedule
3/15/11 Introduction to Python programming by 노한성
노한성 발표자료 3-15-2011
Midterm rest (4/19/11) - no exam
No class - 5/10/11 (Budda's birthday)
Chapter	Name	Pages	Presentation	숙제 업로드
1		21	03/22/11	03/18/11
2		16	03/29/11	03/25/11
3		23	04/05/11	03/28/11
4		17	04/12/11	04/08/11
5		18	04/26/11	04/22/11			
6		14	05/03/11	04/29/11
7		18	05/10/11	05/06/11
8		12	05/24/11	05/20/11
9               18      05/31/11	05/27/11
10		21	06/07/11	06/03/11
Final			Project

Software

Python
about Python
Introduction to Programming using Python
파이썬 설치하기 (for windows system)
  1. 파이썬엑스와이 Python(x,y)-2.6.5.6 (Mar 2011) - Free scientific and engineering development software download & install - including almost every very useful scientific modules (Numpy, Scipy...)
  2. 바이오파이썬 Biopython 1.56 (Mar 2011) download & install
Biopython Tutorial & Cookbook(참고문서)

Chapter 1

Chapter 1 The first look at a genome
Chaos Game Representation

Exercise#1

Download a genome sequence & do basic statistical analysis
GC-content?
Solution = GC content of NC_01415 is '??? %'
Code
>>> from Bio import Entrez, SeqIO
>>> Entrez.mail = 'your@email.address'
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001416",rettype="fasta")
>>> record = SeqIO.read(handle,"fasta")
>>> print record
ID: gi|9626243|ref|NC_001416.1|
Name: gi|9626243|ref|NC_001416.1|
Description: gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome
Number of features: 0
Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', SingleLetterAlphabet())
>>> print len(record)
48502
>>> record
SeqRecord(seq=Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', SingleLetterAlphabet()), id='gi|9626243|ref|NC_001416.1|', name='gi|9626243|ref|NC_001416.1|', description='gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome', dbxrefs=[])
>>> record.seq
Seq('GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCG...ACG', SingleLetterAlphabet())
>>> from Bio.SeqUtils import GC
>>> GC(record.seq)
49.857737825244321
GC-values of windowsize 500


>>> x = record.seq
>>> windowsize = 500
>>> gc_values = [ GC(x[i:(i+499)] for i in range(1,len(x)-windowsize+1) ]
>>> import pylab
>>> pylab.plot(gc_values)
>>> pylab.title("GC% 500 bp window size")
>>> pylab.xlabel("Nucleotide positions")
>>> pylab.ylabel("GC%")
>>> pylab.show()

Exercise#2

Basic Statistical Analysis
>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001807",rettype="fasta")
>>> record1 = SeqIO.read(handle,"fasta")
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001643",rettype="fasta")
>>> record2 = SeqIO.read(handle,"fasta")
>>> from Bio.SeqUtils import GC
>>> GC(record1.seq)
44.487357431657713
>>> GC(record2.seq)
43.687326325963511
>>> len(record2.seq)
16554
>>> len(record1.seq)
16571

Exercise#3

Most frequent word
>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide",id="NC_001665",rettype="fasta")
>>> ratMT = SeqIO.read(handle,"fasta")
>>> base = [ ratMT.seq[i] for i in range(0,len(ratMT.seq))]
>>> a = base.count('A')
>>> g = base.count('G')
>>> c = base.count('C')
>>> t = base.count('T')
>>> di = [ str(ratMT.seq[i:(i+2)]) for i in range(0,len(ratMT.seq)-1) ]
>>> aa = di.count('AA')
>>> aa
1892
>>> a
5544

Chapter 2

related topics
  1. GeneSweep result
  2. Type I & Type II errors
  3. Statitical power

Exercise

  1. Find all ORFs in Human, Chimp and Mouse mtDNA
  2. Repeat the ORF search on randomized mtDNA. The longest ORF in the randomized sequence?
  3. Find ORFs in H. influenzae
Exercise#1 Finding ORFs
>>> han1 = Entrez.efetch(db="nucleotide",id="NC_001807",rettype="fasta")
>>> hum = SeqIO.read(han1,"fasta")
>>> from Bio.Seq import Seq
>>> orf = hum.seq.translate(table="Vertebrate Mitochondrial")
>>> orf.count("*")
326
From Eric Talevich's presentation
# define function 1 - translate a given sequences in all 6 frames
def translate_six_frames(seq, table=2):
    rev = seq.reverse_complement()
    for i in range(3):
        yield seq[i:].translate(table)
        yield rev[i:].translate(table)


# define function 2 - translate given sequences in 6 reading frames
# & return ORFs, min_prot_len = 'k'
def translate_orfs(sequences, min_prot_len=60):
    for frame in translate_six_frames(seq):
        for prot in frame.split('*'):
            if len(prot) >= min_prot_len:
                yield prot


# actual procedure
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

seq = hum.seq
proteins = translate_orfs(seq) 
seqrecords = (SeqRecord(seq,id='orf'+str(i))
           for i, seq in enumerate(proteins))

for rec in seqrecords:
	print ">%s lenght=%s\n%s"%(rec.id,len(rec.seq),rec.seq)


Exercise#2 Using randomized sequences

from Bio import Entrez, SeqIO
Entrez.email = ''
han1 = Entrez.efetch(db="nucleotide",id="NC_001807",rettype="fasta")
hum = SeqIO.read(han1,"fasta")

import random
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
nuc_list = list(hum.seq)

#shuffle sequence
random.shuffle(nuc_list)

seq = Seq(''.join(nuc_list))

def translate_six_frames(seq, table=2):
    rev = seq.reverse_complement()
    for i in range(3):
        yield seq[i:].translate(table)
        yield rev[i:].translate(table)

def translate_orfs(sequences, min_prot_len=60):
    for frame in translate_six_frames(sequences):
        for prot in frame.split('*'):
            if len(prot) >= min_prot_len:
                yield prot
from Bio.SeqRecord import SeqRecord

proteins = translate_orfs(seq)
seqrecords = (SeqRecord(seq,id='orf'+str(i+1)) 
        for i, seq in enumerate(proteins))

for rec in seqrecords:
	print ">%s lenght=%s\n%s"%(rec.id,len(rec.seq),rec.seq)


Exercise#3
see, Excercis #1

import sys
#min = int(sys.argv[1]) #threshold
min = 60

from Bio.SeqRecord import SeqRecord
proteins = translate_orfs(hinf.seq,min)
seqrecords = (SeqRecord(seq,id='orf'+str(i+1)) 
        for i, seq in enumerate(proteins))

Chapter 3

Scoring matrix: PAM matrices
What is dynamic programming? by Sean Eddy
Dynamic Programming Exercises from UCSC BME 205 class homework


Exersize#1

Local & Global alignment of X79493 and AY707088
>sp|O18381|PAX6_DROME Paired box protein Pax-6 OS=Drosophila melanogaster GN=ey PE=2 SV=3
MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGV
NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP
RAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLR
NLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL
NSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSH
PQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSL
SPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDD
QARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSN
RRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVS
TINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPC
PLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAV
TPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAG
VGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIA
QESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHV
APGKQQFFASCFYSPWV
>gi|51872083|gb|AAU12168.1| paired box gene 6 isoform a [Homo sapiens]
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRA
IGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGAD
GMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKL
QRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASN
TPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQ
TSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPR
LQ
Tools - EMBOSS package - packing various sequence analysis programs
Emboss windows version - download site
Needlman-Wunsch (Global alignment) - download result File:Pax drome.needle.txt
Smith-Waterman (Local alignment) - download result File:Pax drome.water.txt

Excercise #2 Use Blast

Run Blast with X79493
Protein Blast (Blastp program)
Tips
The Statistics of Sequence Similarity Scores
P-values

Excercise #3 Alignment comparison

use DNA or protein sequence of Exercise #1

Excercise #4 Multiple sequence alignment

use following sequence set (File:SampleMSA.txt) to make a multiple sequence alignment
ClustalW links
Clustalw@genome.jp
Clustalw@EBI

Chapter 3

Scoring matrix: PAM matrices
What is dynamic programming? by Sean Eddy
Dynamic Programming Exercises from UCSC BME 205 class homework

Exersize#1

Local & Global alignment of X79493 and AY707088
>sp|O18381|PAX6_DROME Paired box protein Pax-6 OS=Drosophila melanogaster GN=ey PE=2 SV=3
MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGV
NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP
RAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLR
NLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL
NSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSH
PQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSL
SPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDD
QARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSN
RRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVS
TINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPC
PLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAV
TPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAG
VGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIA
QESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHV
APGKQQFFASCFYSPWV
>gi|51872083|gb|AAU12168.1| paired box gene 6 isoform a [Homo sapiens]
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRA
IGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGAD
GMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKL
QRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASN
TPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQ
TSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPR
LQ
Tools - EMBOSS package - packing various sequence analysis programs
Emboss windows version - download site
Needlman-Wunsch (Global alignment) - download alignment File:Pax drome.needle.txt
Smith-Waterman (Local alignment) - download alignment File:Pax drome.water.txt

Excercise #2 Use Blast

The Statistics of Sequence Similarity Scores
P-values

Excercise #3 Find homologs

Excercise #4 Multiple sequence alignment

Clustalw@genome.jp
Clustalw@EBI

Chapter 4

Exercise#1

Exercise#2

Excercise#3

Chapter 5

Exercise#1

  1. Which of the modern elephants seems to be more closely related to mammoths? Hint: make a global alignment and calculate the genetic distance between them
  2. 12S rRNA sequence of Saber-Tooth Tiger can tell which of modern felines is most closest one.
  3. Genetic distance between blue-whale, hippo and cow
8 : from Bio import Entrez, SeqIO
9 : h1 = Entrez.efetch(db="nucleotide",id="NC_001601",rettype="fasta")
10: h2 = Entrez.efetch(db="nucleotide",id="NC_000889",rettype="fasta")
11: h3 = Entrez.efetch(db="nucleotide",id="NC_006853",rettype="fasta")
12: blue = SeqIO.read(h1,"fasta")
13: hipp = SeqIO.read(h2,"fasta")
14: cow = SeqIO.read(h3,"fasta")
19: seqs = [blue, hipp, cow]
20: h4 = open("seq.fasta","w")
21: SeqIO.write(seqs,h4,"fasta")
22: h4.close()
     Whale  Hippo Cow 
Whale 0     
Hippo 0.222 
Cow   0.226 0.226

Chapter 6

Exercise 1 Measure Ka/Ks ratio for various mtDNA genes
target gene - human mitochondiral cytochrome C protein (protein sequence)
>gi|49457409|emb|CR542208.1| Homo sapiens full open reading frame cDNA clone RZPDo834D0125D for gene COX7A2L, cytochrome c oxidase subunit VIIa polypeptide 2 like; complete cds, incl. stopcodon
ATGTACTACAAGTTTAGTGGCTTCACGCAGAAGTTGGCAGGAGCATGGGCTTCGGAGGCCTATAGCCCGC
AGGGATTAAAGCCTGTGGTTTCCACAGAAGCACCACCTATCATATTTGCCACACCAACTAAACTGACCTC
CGATTCCACAGTGTATGATTATGCTGGGAAAAACAAAGTTCCAGAGCTACAAAAGTTTTTCCAGAAAGCT
GATGGTGTGCCCGTCTACCTGAAACGAGGCCTGCCTGACCAAATGCTTTACCGGACCACCATGGCGCTGA
CTGTGGGAGGGACCATCTACTGCCTGATCGCCCTCTACATGGCTTCGCAGCCCAAAAACAAATGA
find homologs of chimp & mouse (or other animals) using BLAST (protein sequence)
find each nucleotide sequence
  • Bos taurus - cow
>gb|BC102552.1|:48-392 Bos taurus cytochrome c oxidase subunit VIIa polypeptide 2 like, mRNA (cDNA clone MGC:127892 IMAGE:30957826), complete cds
ATGTATTATAAGTTCAGTGGCTTCACGCAGAAGTTGGCCGGAGCATGGGCTTCCGACGCCTATAGCCCTC
AGGGATTAAGACCTGTTGTTTCCACAGAAGCACCGCCTATCATATTTGCCACACCAACTAAACTGAGCTC
CGGTCCCACTGCATATGACTATGCTGGGAAAAACACAGTTCCAGAGCTGCAGAAGTTTTTCCAGAAATCT
GACGGTGTGCCCATCCACCTGAAACGAGGCCTGCCTGACCAAATGCTTTACCGGACCACCATGGCGCTAA
CAGTGGGAGGGACCATCTACTGCCTGATCGCCCTCTACATGGCATCACAGCCCAGAAACAAATGA
  • Mus musculus - mouse
>gi|226958510:59-394 Mus musculus cytochrome c oxidase subunit VIIa polypeptide 2-like (Cox7a2l), nuclear gene encoding mitochondrial protein, transcript variant 2, mRNA
ATGTACTACAAGTTTAGCAGTTTCACGCAGAAGTTGGCTGGAGCTTGGGCTTCGGAAGCCTACACCCCGC
AGGGGTTAAAGCCAGTTTCCACAGAAGCACCACCTATCATATTTGCCACACCAACCAAACTGACCTCCAG
TGTGACAGCATATGATTATTCTGGGAAGAACAAAGTTCCAGAGCTGCAGAAGTTTTTCCAGAAGGCTGAT
GGTTTCCACCTGAAACGAGGCCTTCCAGACCAAATGCTTTACCGGACCACCATGGCTCTGACACTGGGAG
GGACCATCTACTGCCTGATCGCCCTCTACATGGCCTCGCAGCCCAGAAACAAATGA
Do alignment - ClustalW Online by EBI
>human-gi|49457409|emb|CR542208/1-345
ATGTACTACAAGTTTAGTGGCTTCACGCAGAAGTTGGCAGGAGCATGGGCTTCGGAGGCCTATAGCCCGCAG
GGATTAAAGCCTGTGGTTTCCACAGAAGCACCACCTATCATATTTGCCACACCAACTAAACTGACCTCCGAT
TCCACAGTGTATGATTATGCTGGGAAAAACAAAGTTCCAGAGCTACAAAAGTTTTTCCAGAAAGCTGATGGT
GTGCCCGTCTACCTGAAACGAGGCCTGCCTGACCAAATGCTTTACCGGACCACCATGGCGCTGACTGTGGGA
GGGACCATCTACTGCCTGATCGCCCTCTACATGGCTTCGCAGCCCAAAAACAAATGA
>cow-gb|BC102552.1|_48-392/1-345
ATGTATTATAAGTTCAGTGGCTTCACGCAGAAGTTGGCCGGAGCATGGGCTTCCGACGCCTATAGCCCTCAG
GGATTAAGACCTGTTGTTTCCACAGAAGCACCGCCTATCATATTTGCCACACCAACTAAACTGAGCTCCGGT
CCCACTGCATATGACTATGCTGGGAAAAACACAGTTCCAGAGCTGCAGAAGTTTTTCCAGAAATCTGACGGT
GTGCCCATCCACCTGAAACGAGGCCTGCCTGACCAAATGCTTTACCGGACCACCATGGCGCTAACAGTGGGA
GGGACCATCTACTGCCTGATCGCCCTCTACATGGCATCACAGCCCAGAAACAAATGA
>mouse-gi|226958510_59-394/1-336
ATGTACTACAAGTTTAGCAGTTTCACGCAGAAGTTGGCTGGAGCTTGGGCTTCGGAAGCCTACACCCCGCAG
GGGTTAAAGCC---AGTTTCCACAGAAGCACCACCTATCATATTTGCCACACCAACCAAACTGACCTCCAGT
GTGACAGCATATGATTATTCTGGGAAGAACAAAGTTCCAGAGCTGCAGAAGTTTTTCCAGAAGGCTGATGGT
TT------CCACCTGAAACGAGGCCTTCCAGACCAAATGCTTTACCGGACCACCATGGCTCTGACACTGGGA
GGGACCATCTACTGCCTGATCGCCCTCTACATGGCCTCGCAGCCCAGAAACAAATGA
use [1] Online KaKs Calculator!


Exercise 2 Viral genomes
Exercise 3 Practice with free online software tools for measuring Ka/Ks
[2] Online KaKs Calculator!

2010 Spring

  1. 게놈의 기적 (과학도서관 링크)
  2. 신의 언어 : 유전자 지도에서 발견한 신의 존재 (과학도서관 링크)
  3. 지루한 사람과 어울리지 마라 : 과학에서 배우는 삶의 교훈 (과학도서관 링크)

2009 Spring

Human Genome

Diagnosis test

Personal tools
Namespaces
Variants
Actions
Site
Choi lab
Resources
Toolbox