From CSBLwiki

(Difference between revisions)


Latest revision as of 09:17, 29 July 2011

Results

Circular view

Coverage Graph

Methods & Procedures

Assembly

Newbler, CABOG, minimus2 (AMOS package)

GC skew

theory
- Made python code(gc_skew.py)
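
The gc_skew.py script itself is not reproduced on this page. As a minimal sketch of the idea only, assuming the usual definition skew = (G - C) / (G + C) computed in fixed windows, something like the following could be used. The function names, window size and step are illustrative assumptions, not the contents of gc_skew.py. The minimum of the cumulative skew is often used as a rough indicator of where the replication origin may lie.

```python
from typing import List, Optional, Tuple


def gc_skew(seq: str, window: int = 1000, step: int = 1000) -> List[Tuple[int, float]]:
    """Return (window_start, skew) pairs with skew = (G - C) / (G + C).

    Window and step sizes are illustrative defaults, not values from the wiki.
    """
    seq = seq.upper()
    result = []
    # max(..., 1) makes sure a sequence shorter than one window still yields one value
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result


def cumulative_skew_minimum(skews: List[Tuple[int, float]]) -> Optional[int]:
    """Return the start of the window at which the running sum of skew is lowest."""
    total = 0.0
    best_start, best_total = None, float("inf")
    for start, skew in skews:
        total += skew
        if total < best_total:
            best_start, best_total = start, total
    return best_start
```

The sketch treats the genome as linear; for a circular chromosome the starting point of the sequence only shifts the cumulative curve, so the position of the minimum is still meaningful.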

Primer

EP Hamilton, Use of HAPPY mapping for the higher order assembly of the Tetrahymena genome, Elsevier, 2006:
- To confirm directly HAPPY links by PCR amplification, primers were designed in unique regions of scaffold sequence nearest to the linked ends, using the Primer3 program
Samuel Assefa, ABACAS: algorithm-based automatic contiguation of assembled sequences, Bioinformatics 2009 25(15):1968-1969; doi:10.1093/bioinformatics/btp347:
- ABACAS automatically extracts gaps on the pseudomolecule and, based on flanking sequences above a base quality threshold, designs primers for gap closure using Primer3
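
The gap-extraction step described above can be illustrated with a short sketch. This is not ABACAS code: it only locates runs of N in a sequence and returns the flanking sequence on each side, which could then be given to Primer3 by hand. The names `Gap`, `find_gaps`, `flank` and `min_gap` are hypothetical, and the base quality threshold mentioned in the paper is left out.

```python
import re
from typing import List, NamedTuple


class Gap(NamedTuple):
    start: int        # 0-based index of the first N in the gap
    end: int          # 0-based index one past the last N
    left_flank: str   # up to `flank` bases immediately before the gap
    right_flank: str  # up to `flank` bases immediately after the gap


def find_gaps(seq: str, flank: int = 200, min_gap: int = 1) -> List[Gap]:
    """Find runs of N/n and return each gap with its flanking sequences.

    `flank` and `min_gap` are illustrative parameters, not ABACAS options.
    """
    gaps = []
    for m in re.finditer(r"[Nn]+", seq):
        if m.end() - m.start() < min_gap:
            continue
        left = seq[max(0, m.start() - flank):m.start()]
        right = seq[m.end():m.end() + flank]
        gaps.append(Gap(m.start(), m.end(), left, right))
    return gaps
```

For example, `find_gaps("ACGTNNNNTTGA", flank=3)` reports one gap spanning positions 4 to 8 with flanks `CGT` and `TTG`.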

Finishing

Sequence Finishing and Gene Mapping for Candida albicans Chromosome 7 and Syntenic Analysis Against the Saccharomyces cerevisiae Genome
- DNA amplification for gap closing: PCR with each primer pair (shown in supplementary data at http://www.genetics.org/supplemental/) was carried out with Ready-To-Go PCR beads (Amersham Biosciences) using genomic DNA of C. albicans SC5314 as a template DNA. PCR was carried out using a hotstart of 3 min at 94° followed by 35 cycles of 94° for 10 sec, 50° for 10 sec, and 68° for 1 min, concluding with 68° for 10 min. Long PCR was carried out with LA PCR kit ver.2.1 (Takara, Tokyo). Conditions used were a hotstart of 3 min at 94° followed by 35 cycles of 98° for 10 sec and 68° for 20 min, concluding with a final extension of 72° for 10 min. Genomic DNA from C. albicans strain SC5314 (Fonzi and Irwin 1993) was used for all sequence analysis in this work.

Complete Genome Sequence of Staphylococcus lugdunensis Strain HKU09-01
- Briefly, gap closures were performed by genomic PCR followed by DNA sequencing of amplification products on an ABI 3130xl sequencer (Applied Biosystems, CA). The finished sequence was validated by genome macrorestriction analysis using multiple rare-cutting enzymes and visualization by pulsed-field gel electrophoresis.

CBCB Finishing Toolbox

Finishing procedures with Dupfinisher

Here is the LANL finishing procedure involving Dupfinisher: 
1) run Dupfinisher on the assembly ace file;
2) put the artificial reads generated by Dupfinisher into the main project;
3) assemble with parallel Phrap;
4) repeat steps 1-3 with new ace file;
5) run Consed autoFinish on the main project and do only primer walks from the main project and those from subprojects of unfinished repeats;
6) repeat step 4;
7) run autoFinish using primer walks for the main project and those from subprojects of unfinished repeats and use PCR to close gaps between scaffolds in main project;
8) repeat step 4;
9) perform manual finishing including closing gaps, resolving low quality and single clone coverage regions and checking repeat resolutions from Dupfinisher.

Cliff S. Han, Patrick Chain, Finishing Repetitive Regions Automatically with Dupfinisher

[1]
Illumina reads -> EULER-SR :4233 contigs
+
454 reads
-> newbler : 270 hybrid contigs
+
paired 454 reads
-> newbler's scaffolder : 3 contigs (A:3.18 Mb, B:5.7 kb and C:524 kb) |  (unscaffolded contigs -> utilized later in the final Finishing phase)

[2]
+
(Hybrid EULER-SR/VELVET contigs, Unscaffolded contigs -> nucmer) & (Illumina reads -> mosaik aligner)
-> finishing by filling in the Ns (degenerate nucleotides) left by the scaffolder  (We developed a Scaffold Bridging and Finishing phase for the purpose of linking the de novo scaffolds and for resolving the intra-scaffold degenerate nucleotide positions that were introduced by the scaffolder)
-> potential repeats/duplications identified by examining the read coverage and also the multiplicity of the vertices in the repeat graph that is part of EULER-SR's output
->scaffold B was indeed duplicated and a BLAST [11] search identified it as an rRNA gene
=> A, B, B, C

[3]
-> ordering with PCR

[4] 
-> correct indel error : mosaik aligner with Illumina reads 

figure
Harish Nagarajan et al, De Novo Assembly of the Complete Genome of an Enhanced Electricity-Producing Variant of Geobacter sulfurreducens Using Only Short Reads
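
The coverage part of the repeat check in step [2] can be sketched roughly as follows. This is only an illustration of the general idea (a scaffold with about twice the typical read depth is probably present in two copies); it is not the procedure from the paper, which also used the repeat graph from EULER-SR. The function name and the use of the median as baseline are assumptions.

```python
from statistics import median
from typing import Dict


def estimate_copy_number(coverage: Dict[str, float]) -> Dict[str, int]:
    """Estimate copy number per scaffold as depth / median depth, rounded.

    Assumes most scaffolds are single copy, so the median depth is a
    reasonable single-copy baseline. That assumption is not from the paper.
    """
    baseline = median(coverage.values())
    if baseline <= 0:
        raise ValueError("median coverage must be positive")
    return {
        name: max(1, round(depth / baseline))
        for name, depth in coverage.items()
    }
```

With made-up depths such as `{"A": 30.0, "B": 61.0, "C": 29.0}` the sketch returns a copy number of 2 for B and 1 for A and C, which corresponds to the A, B, B, C layout noted above. With only a few scaffolds the median is a fragile baseline, so the result should be treated as a hint to confirm by PCR.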

Zhou Yu, Tao Li, Jindong Zhao and Jingchu Luo, PGAAS: a prokaryotic genome assembly assistant system
- Same principle as ABBA

rRNA

The positions of rRNA operons in the genome assembly were confirmed by long-range PCR amplification using primers that annealed to genes flanking the rRNA genes. These PCR fragments were sequenced to high redundancy and the consensus sequences were manually inserted into the assembly. Among the seven rRNA operons, the nucleotide sequences of 16S and 23S genes are at least 99% identical, differing by only one to three nucleotides in pairwise comparisons. (Complete genome sequence of the industrial bacterium Bacillus licheniformis and comparisons with closely related Bacillus species)

SEQanswer

Phage

Tandem repeat

Reads Library

454 SE

643326

454 PE

173864 (291735)

Solexa illumina

According to Andrew D Smith et al, Using quality scores and longer reads improves accuracy of Solexa read mapping, BMC Bioinformatics, mapping improves with a larger mismatch setting (4), a higher quality cutoff (8), and longer reads. The authors also provide mapping software.

set1: original
set2: "." -> "N"
set3: divided into 4 files
set4: divided into 4 files, "." -> "N"
set5: original -> bases masked to 'N' with fastq_masker from the fastx toolkit, using quality 10 as the threshold
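
The two transformations used to build these sets can be sketched in Python. This is an illustration only, not the commands actually run: `dots_to_n` replaces "." with "N" on the sequence line alone, so headers and quality strings are left untouched, and `mask_low_quality` mimics the effect of quality masking at threshold 10. The quality offset of 64 is an assumption about how these Illumina files were encoded; Sanger-style files use 33.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records; assumes sequences are not line-wrapped."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups consecutive lines in fours
    for record in zip(stripped, stripped, stripped, stripped):
        yield record


def dots_to_n(record: Record) -> Record:
    """Replace '.' with 'N' in the sequence line only."""
    header, seq, sep, qual = record
    return header, seq.replace(".", "N"), sep, qual


def mask_low_quality(record: Record, min_quality: int = 10, offset: int = 64) -> Record:
    """Replace bases whose quality is below min_quality with 'N'.

    offset=64 is an assumption (older Illumina encoding); use 33 for Sanger.
    """
    header, seq, sep, qual = record
    masked = "".join(
        base if ord(q) - offset >= min_quality else "N"
        for base, q in zip(seq, qual)
    )
    return header, masked, sep, qual
```

A file could be processed by opening it, passing the handle to `read_fastq`, applying one of the two functions to each record, and writing the four fields back out joined by newlines.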

mapping tool : gsMapper(newbler), mosaik aligner, bwa

http://en.wikipedia.org/wiki/FASTQ_format

Relationship between Q and p using the Sanger (red) and Solexa (black) equations (described above). The vertical dotted line indicates p = 0.05, or equivalently, Q ≈ 13. (http://en.wikipedia.org/wiki/FASTQ_format)

p = 0.01, Q = 20

p = 0.001, Q = 30
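
The values above follow from the Sanger (Phred) relation Q = -10 log10(p); the Solexa variant uses the odds p / (1 - p) instead of p. A small sketch to check them, with illustrative function names:

```python
import math


def sanger_q(p: float) -> float:
    """Phred/Sanger quality: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def solexa_q(p: float) -> float:
    """Solexa quality: Q = -10 * log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))


def sanger_p(q: float) -> float:
    """Inverse of sanger_q: error probability for a given Phred score."""
    return 10 ** (-q / 10)
```

`sanger_q(0.01)` gives 20 and `sanger_q(0.001)` gives 30, matching the two lines above. At p = 0.05 the two scales are both close to 13 and they converge further as p gets smaller.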

Software


Software	Version	Input	Output	Location(machine/folder)
Newbler	2.3(091027_1459)			panflam,panpyro
Phrap	0.990329(Phrap0.990329_patch)			panflam
Phrap	1.090518			panflam
Consed	090206			panflam
CABOG(celera)	6.1	sanger, 454(.sff), illumina(fastq), fastq	CABOG_output	panflam,panpyro
maq	0.7.1	ref:fasta, read:illumina, long read(not good)		panflam,panpyro
abyss	1.2.0	454, illumina		panflam
SOAPdenovo	1.04	illumina		panflam
SOAPaligner		illumina		panflam
Corrector(soap package)	1.00	fasta,fastq		panflam
GapCloser(soap package)	1.10	fasta,fastq		panflam
MIRA		sanger,454,illumina
gapResolution		newbler results	fasta,qual
Dupfinisher		ace file
AutoEditor	1.20	.contig(TIGR)
rnammer	1.2	fasta	gff2	panflam
hmmer	2.3.2(for rnammer), 3			panflam,panpyro
tRNAscan-SE	1.23			panflam,panpyro
BlastViewer				panflam
M-GCAT				panflam,panpyro
bowtie		illumina
Velvet		illumina
MAQ		illumina
Polisher		illumina

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-Read_Sequence_Alignment
(NGS tools)
galaxy web-page(NGS tools)
fastx-toolkit
List of alignment visualization software
Conversion among formats fa_all2std.pl
Inverted repeat finder IRF help download

Manuals

Introduction to Newbler (ppt) : bulletin board

consed manual

about fake reads

phrap_input

phrap_input_v1.090518

phrap diff

phrap_v1.090518_shortread

newbler : flow space assembler
abyss : nucleotide space

create mate file from illumina for bambus

a very good blog about newbler

How to use phrap

Handling 454 sff files

Useful cabog options

Taxonomy

NCBI

   cellular organisms; Bacteria; Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; Eubacterium

Database

Genome_database

References

Pawel Mackiewicz, Where does bacterial replication start? Rules for predicting the oriC region, Nucleic Acids Research 2004 32(13):3781-3791 [2]

@@ Line 1: / Line 1: @@
-==Softwares==
+{|align="right" cellpadding="15"
-{| class="wikitable" style="text-align:center" border="1"
+| __TOC__
-|+
-|-
-|Software || Version || Input || Output || Location(machine/folder)
-|-
-|Newbler || 2.3(091027_1459) || || || panflam,panpyro
-|-
-|Phrap || 0.990329([[Phrap0.990329_patch]]) || ||  || panflam
-|-
-|Phrap || 1.090518 || ||  || panflam
-|-
-|Consed || 090206 || ||  || panflam
-|-
-|CABOG(celera) || 6.1 || sanger, 454(.sff), illumina(fastq), fastq ||  || panflam,panpyro
-|-
-|maq || 0.7.1 || ref:fasta, read:illumina, long read(not good) ||   || panflam,panpyro
-|-
-|abyss || 1.2.0 || illumina ||  || panflam
-|-
-|SOAPdenovo || 1.04 || illumina ||   ||  panflam
-|-
-|Corrector(soap package) || 1.00 || fasta,fastq || || panflam
-|-
-|GapCloser(soap package) || 1.10 || fasta,fastq || || panflam
-|-
-|MIRA || || sanger,454,illumina ||  ||
-|-
-|gapResolution || || || ||
 |}
+=Results=
+==Circular view==
+[[File:NC 014624.png|450px]]
-*Dupfinisher
+==Coverage Graph==
-**Downloaded
-*Polisher
-**Can't find...
-==Logbook==
-'''Solexa assembly using phrap'''
-How should the read names be converted? The manual says "create a script which translates your read names into St. Louis"; is there a script that someone else has already written?
-'''addSolexaReads.perl again'''
+=Methods & Procedures=
+==Assembly==
+Newbler, CABOG, minimus2 (AMOS package),
-gnusnah@panflam:~/works/assembly_2010_7_8/SE_PE/consed/edit_dir$ addSolexaReads.perl 454Contigs.ace.1 solexa_files.fof ref.fa
+==GC skew==
-*Took about 2 hours, failed again
+*[http://www.nature.com/nrmicro/journal/v2/n11/box/nrmicro1024_BX1.html theory]
-*couldn't execute /home/gnusnah/tools/UW/consed/bin/consed -ace 454Contigs.ace.1 -addReads alignmentFiles100711_154311.fof -chem solexa at /home/gnusnah/tools/UW/consed/bin/addSolexaReads.perl line 170.
+**Made python code(gc_skew.py)
-*[[error_at_reading_step]]
-'''100711 Solexa read conversion'''
+==Primer==
+*[http://www.google.co.kr/url?sa=t&source=web&cd=5&ved=0CE0QFjAE&url=http%3A%2F%2Fhomepage.mac.com%2Fjonathan_eisen%2FPDFs%2F88.Hamilton.HAPPY.pdf&ei=1PJTTO6jMoGyvgOtsdAY&usg=AFQjCNFZuzn4b_3pKJX9nt4ne5FCXZKi1Q&sig2=02Xqn5lEoEe98rQMdBiXlg EP Hamilton, Use of HAPPY mapping for the higher order assembly of the Tetrahymena genome, elsevier, 2006] :
+*To confirm directly HAPPY links by PCR amplification, primers were designed in unique regions of scaffold sequence nearest to the linked ends,
+*using the Primer3 program
+**[http://bioinformatics.oxfordjournals.org/cgi/content/full/25/15/1968 Samuel Assefa, ABACAS: algorithm-based automatic contiguation of assembled sequences, Bioinformatics 2009 25(15):1968-1969; doi:10.1093/bioinformatics/btp347] :
+*ABACAS automatically extracts gaps on the pseudomolecule and, based on flanking sequences above a base quality threshold, designs primers for gap closure using Primer3
-Convert "." to N: cat s_3.1.fastq | perl -pi -e 's/\./N/g' > N_s_3.1.fastq
+==Finishing==
+*[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1449773/ Sequence Finishing and Gene Mapping for Candida albicans Chromosome 7 and Syntenic Analysis Against the Saccharomyces cerevisiae Genome]
+**DNA amplification for gap closing:PCR with each primer pair (shown in supplementary data at http://www.genetics.org/supplemental/) was carried out with Ready-To-Go PCR beads (Amersham Biosciences) using genomic DNA of C. albicans SC5314 as a template DNA. PCR was carried out using a hotstart of 3 min at 94° followed by 35 cycles of 94° for 10 sec, 50° for 10 sec, and 68° for 1 min, concluding with 68° for 10 min. Long PCR was carried out with LA PCR kit ver.2.1 (Takara, Tokyo). Conditions used were a hotstart of 3 min at 94° followed by 35 cycles of 98° for 10 sec and 68° for 20 min, concluding with a final extension of 72° for 10 min. Genomic DNA from C. albicans strain SC5314 (Fonzi and Irwin 1993) was used for all sequence analysis in this work.
+*[http://jb.asm.org/cgi/content/full/192/5/1471 Complete Genome Sequence of Staphylococcus lugdunensis Strain HKU09-01]
+**Briefly, gap closures were performed by genomic PCR followed by DNA sequencing of amplification products  on an ABI 3130xl sequencer (Applied Biosystems, CA). The finished sequence was validated by genome macrorestriction  analysis using multiple rare-cutting enzymes and visualization by pulsed-field gel electrophoresis.
-'''Add solexa reads to Newbler result'''
+*[http://cbcb.umd.edu/finishing/ CBCB Finishing Toolbox]
-gnusnah@panflam:~/works/assembly_2010_7_8/SE_PE/consed/edit_dir$ addSolexaReads.perl 454Contigs.ace.1 solexa_files.fof ref.fa
+*Finishing procedures with Dupfinisher
-*Took 33 minutes in total
+ Here is the LANL finishing procedure involving Dupfinisher:
-*error - 454Contigs.ace.2 file: 0 -> the disk had reached 100% usage; cleaned up and ran again
+ 1) run Dupfinisher on the assembly ace file;
-*error again - the "." characters in the reads are the problem - how to solve it? Delete the reads that contain "."? When deleting, also delete the paired read? -> replacing "." with n might work.
+ 2) put the artificial reads generated by Dupfinisher into the main project;
+ 3) assemble with parallel Phrap;
+ 4) repeat steps 1-3 with new ace file;
+ 5) run Consed autoFinish on the main project and do only primer walks from the main project and those from subprojects of unfinished repeats;
+ 6) repeat step 4;
+ 7) run autoFinish using primer walks for the main project and those from subprojects of unfinished repeats and use PCR to close gaps between scaffolds in main project;
+ 8) repeat step 4;
+ 9) perform manual finishing including closing gaps, resolving low quality and single clone coverage regions and checking repeat resolutions from Dupfinisher.
+ Cliff S. Han1, Patrick Chain2, Finishing Repetitive Regions Automatically with Dupfinisher
-'''run Newbler PE'''
+ [1]
+ Illumina reads -> EULER-SR :4233 contigs
+ +
+ 454 reads
+ -> newbler : 270 hybrid contigs
+ +
+ paired 454 reads
+ -> newbler's scaffolder : 3 contigs (A:3.18 Mb, B:5.7 kb and C:524 kb) |  (unscaffolded contigs -> utilized later in the final Finishing phase)
+ [2]
+ +
+ (Hybrid EULER-SR/VELVET contigs, Unscaffolded contigs -> nucmer) & (Illumina reads -> mosaik aligner)
+ -> finishing by filling in the Ns (degenerate nucleotides) left by the scaffolder  (We developed a Scaffold Bridging and Finishing phase for the purpose of linking the de novo scaffolds and for resolving the intra-scaffold degenerate nucleotide positions that were introduced by the scaffolder)
+ '''potential repeats/duplications''' by examining the read coverage and also the multiplicity of the vertices in the repeat graph that is part of EULER-SR's output
+ ->scaffold '''B''' was indeed '''duplicated''' and a BLAST [11] search identified it as an '''rRNA gene'''
+ => A, B, B, C
+ [3]
+ -> ordering with PCR
+ [4]
+ -> correct indel error : mosaik aligner with Illumina reads
+ [http://www.plosone.org/article/showImageLarge.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0010922.g001 figure]
+ [http://www.plosone.org/article/info:doi/10.1371/journal.pone.0010922#pone-0010922-g001 Harish Nagarajan et al, De Novo Assembly of the Complete Genome of an Enhanced Electricity-Producing Variant of Geobacter sulfurreducens Using Only Short Reads]
-runAssembly -o PE -a 50 -l 350 -g -m -ml 20 -cpu 0 -consed ~/db/genome/Eubacteria/APR_2010_PE/GE6FA8204.sff
+*'''Zhou Yu, Tao Li, Jindong Zhao and Jingchu Luo, PGAAS: a prokaryotic genome assembly assistant system'''
-(/home/gnusnah/works/assembly_2010_7_8/)
+** Same principle as ABBA
-'''run Newbler SE'''
+==rRNA==
+The positions of rRNA operons in the genome assembly were confirmed by long-range PCR amplification using primers that annealed to genes flanking the rRNA genes. These PCR fragments were sequenced to high redundancy and the consensus sequences were manually inserted into the assembly. Among the seven rRNA operons, the nucleotide sequences of 16S and 23S genes are at least 99% identical, differing by only one to three nucleotides in pairwise comparisons.[http://genomebiology.com/2004/5/10/r77 Complete genome sequence of the industrial bacterium Bacillus licheniformis and comparisons with closely related Bacillus species]
-runAssembly -o SE -a 50 -l 350 -g -m -ml 20 -cpu 0 -consed ~/db/genome/Eubacteria/NOV_2009_SE/GIST.SE.sff
+[http://seqanswers.com/forums/showthread.php?t=5730&highlight=rrna SEQanswer]
-(/home/gnusnah/works/assembly_2010_7_8/)
-'''add solexa read, doing...'''
+[http://seqanswers.com/forums/showthread.php?t=2543&highlight=rrna SEQanswer]
-under /home/gnusnah/works/assembly_2010_7_8/consed/
+==Phage==
-*make dir : solexa_dir
+Tandem repeat
-**link to fastq (2 paired end file)
-*make file : edit_dir/solexa_files.fof
+=Reads Library=
+==454 SE==
+*643326
+==454 PE==
+*173864 (291735)
+==Solexa illumina==
+[[image:Quality_stats.png|center|1024px]]
+According to [http://www.biomedcentral.com/1471-2105/9/128 Andrew D Smith et al, Using quality scores and longer reads improves accuracy of Solexa read mapping, BMC Bioinformatics], mapping improves with a larger mismatch setting (4), a higher quality cutoff (8), and longer reads. The authors also provide mapping software.
+ set1: original
+ set2: "." -> "N"
+ set3: divided into 4 files
+ set4: divided into 4 files, "." -> "N"
+ set5: original -> bases masked to 'N' with fastq_masker from the fastx toolkit, using quality 10 as the threshold
-'''Consed Customization'''
+mapping tool : gsMapper(newbler), mosaik aligner, bwa
-*file : /home/gnusnah/.consedrc
-*add environment : /home/gnusnah/.bashrc
-'''Consed Install'''
+{{:sequencing library}}
-*[[Consed_Install]]
-While customizing phredPhrap, the location of polyphred should be confirmed. Polyphred is not installed. Sent request e-mail.
-'''run Newbler SE + PE'''
+{{:Assembly software}}
-runAssembly -o SE_PE -a 50 -l 350 -g -m -ml 20 -cpu 0 -consed ~/db/genome/Eubacteria/NOV_2009_SE/GIST.SE.sff ~/db/genome/Eubacteria/APR_2010_PE/GE6FA8204.sff
+=Manuals=
-(/home/gnusnah/works/assembly_2010_7_8/)
+{{:Assembly manual}}
+=Taxonomy=
+[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1736 NCBI]
+    cellular organisms; Bacteria; Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; Eubacterium
+=Database=
+[[Genome_database]]
-'''Try Consed'''
-gnusnah@panflam:~/works/assembly_2010_7_8/SE_PE/consed/edit_dir$ ~/tools/UW/consed/consed_linux64bit
+=References=
+Pawel Mackiewicz, Where does bacterial replication start? Rules for predicting the oriC region, Nucleic Acids Research 2004 32(13):3781-3791 [http://nar.oxfordjournals.org/cgi/content/full/32/13/3781]
-'''phred'''
-add environment : /home/gnusnah/.bashrc
-PHRED_PARAMETER_FILE=/home/gnusnah/tools/UW/phred/phredpar.dat
-export PHRED_PARAMETER_FILE
-==manuals==
-Introduction to Newbler (ppt) : bulletin board
-[[consed manual]]
-[[about fake reads]]
-[[phrap_input]]
-[[phrap_input_v1.090518]]
-[[phrap diff]]
-[[phrap_v1.090518_shortread]]
-*newbler : flow space assembler
-*abyss : nucleotide space

Genome assembly


