Genome assembly




Circular view

NC 014624.png

Coverage Graph

Methods & Procedures


Newbler, CABOG, minimus2 (AMOS package)

GC skew
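
GC skew is usually computed per window as (G - C) / (G + C). In many bacterial genomes the cumulative skew reaches its minimum near the replication origin. The sketch below is only an illustration of that calculation: the function names, the non-overlapping windows and the default window size of 1000 bp are assumptions, not a procedure taken from this page.

```python
from itertools import accumulate


def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for each non-overlapping window of seq."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a skew of 0 to avoid division by zero.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def skew_minimum_position(seq, window=1000):
    """Return the end coordinate of the window where cumulative skew is lowest."""
    cumulative = list(accumulate(gc_skew(seq, window)))
    idx = min(range(len(cumulative)), key=cumulative.__getitem__)
    return min((idx + 1) * window, len(seq))


if __name__ == "__main__":
    # Toy sequence (hypothetical): C-rich first half, G-rich second half.
    toy = "C" * 3000 + "G" * 3000
    print(gc_skew(toy))
    print(skew_minimum_position(toy))
```

The position returned is only as precise as the window size, so a candidate origin found this way still needs checking against other evidence.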



Here is the LANL finishing procedure involving Dupfinisher: 
1) run Dupfinisher on the assembly ace file;
2) put the artificial reads generated by Dupfinisher into the main project;
3) assemble with parallel Phrap;
4) repeat steps 1-3 with new ace file;
5) run Consed autoFinish on the main project and do only primer walks from the main project and those from subprojects of unfinished repeats;
6) repeat step 4;
7) run autoFinish using primer walks for the main project and those from subprojects of unfinished repeats and use PCR to close gaps between scaffolds in main project;
8) repeat step 4;
9) perform manual finishing including closing gaps, resolving low quality and single clone coverage regions and checking repeat resolutions from Dupfinisher.

Cliff S. Han, Patrick Chain, Finishing Repetitive Regions Automatically with Dupfinisher
Illumina reads -> EULER-SR : 4233 contigs
454 reads
-> newbler : 270 hybrid contigs
paired 454 reads
-> newbler's scaffolder : 3 contigs (A:3.18 Mb, B:5.7 kb and C:524 kb) |  (unscaffolded contigs -> utilized later in the final Finishing phase)

(Hybrid EULER-SR/VELVET contigs, unscaffolded contigs -> nucmer) & (Illumina reads -> mosaik aligner)
-> finishing by filling in the Ns (degenerate nucleotides) left by the scaffolder (We developed a Scaffold Bridging and Finishing phase for the purpose of linking the de novo scaffolds and for resolving the intra-scaffold degenerate nucleotide positions that were introduced by the scaffolder)
-> potential repeats/duplications identified by examining the read coverage and also the multiplicity of the vertices in the repeat graph that is part of EULER-SR's output
->scaffold B was indeed duplicated and a BLAST [11] search identified it as an rRNA gene
=> A, B, B, C
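
The read-coverage part of this check can be sketched as follows: a scaffold whose mean coverage is about twice the baseline is probably present in two copies. This is only an illustration. The coverage values are hypothetical (the notes above give no numbers), the function names are made up, and using the median across scaffolds as the baseline is an assumption that becomes fragile when there are only a few scaffolds.

```python
from statistics import median


def estimate_copy_number(coverage, baseline):
    """Round the coverage ratio to the nearest whole copy number (at least 1)."""
    return max(1, round(coverage / baseline))


def copy_numbers(mean_coverage_by_scaffold):
    """Estimate a copy number for every scaffold from its mean read coverage."""
    # Assumption: the median scaffold coverage represents single-copy sequence.
    baseline = median(mean_coverage_by_scaffold.values())
    return {
        name: estimate_copy_number(cov, baseline)
        for name, cov in mean_coverage_by_scaffold.items()
    }


if __name__ == "__main__":
    # Hypothetical mean coverages, chosen only to illustrate the A, B, B, C case.
    toy = {"A": 40.0, "B": 80.0, "C": 41.0}
    counts = copy_numbers(toy)
    # Expand each scaffold name by its estimated copy number.
    print([name for name, n in counts.items() for _ in range(n)])
```

Coverage alone does not say where the extra copy sits, which is why the ordering still has to be settled by PCR.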

-> ordering with PCR

-> correct indel error : mosaik aligner with Illumina reads 

Harish Nagarajan et al, De Novo Assembly of the Complete Genome of an Enhanced Electricity-Producing Variant of Geobacter sulfurreducens Using Only Short Reads


The positions of rRNA operons in the genome assembly were confirmed by long-range PCR amplification using primers that annealed to genes flanking the rRNA genes. These PCR fragments were sequenced to high redundancy and the consensus sequences were manually inserted into the assembly. Among the seven rRNA operons, the nucleotide sequences of 16S and 23S genes are at least 99% identical, differing by only one to three nucleotides in pairwise comparisons.

Complete genome sequence of the industrial bacterium Bacillus licheniformis and comparisons with closely related Bacillus species




Tandem repeat

Reads Library

454 SE

454 PE

Solexa illumina

Quality stats.png

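According to Andrew D Smith et al, Using quality scores and longer reads improves accuracy of Solexa read mapping, BMC Bioinformatics, mapping improves as the number of allowed mismatches increases (4), as the quality cutoff increases (8), and as the reads get longer. The authors also provide mapping software.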

set1: original
set2: "." -> "N"
set3: divided into 4 files
set4: divided into 4 files, "." -> "N"
set5: original -> bases replaced with 'N' using fastq_masker from the FASTX-Toolkit, with a quality threshold of 10
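
The two read transformations used for these sets ("." -> "N", and masking low-quality bases) can be sketched in a few lines. This is an illustration of the idea, not the FASTX-Toolkit code. The function names are hypothetical, and the quality offset is an assumption that must match the actual encoding of the reads (33 for Sanger-style FASTQ, 64 for older Solexa/Illumina files).

```python
def dots_to_n(seq):
    """Replace '.' (no-call) characters in a read with 'N'."""
    return seq.replace(".", "N")


def mask_low_quality(seq, qual, threshold=10, offset=33, mask="N"):
    """Replace every base whose quality score is below threshold with mask.

    qual is the FASTQ quality string; offset is the ASCII offset of the
    encoding (an assumption here, set it to match the input files).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(
        mask if ord(q) - offset < threshold else base
        for base, q in zip(seq, qual)
    )


if __name__ == "__main__":
    print(dots_to_n("AC.GT.."))
    # Qualities for "I!5+" with offset 33 are 40, 0, 20 and 10.
    print(mask_low_quality("ACGT", "I!5+"))
```

A base with quality exactly equal to the threshold is kept in this sketch; only strictly lower scores are masked.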

mapping tool : gsMapper(newbler), mosaik aligner, bwa

Relationship between Q and p
Relationship between Q and p using the Sanger (red) and Solexa (black) equations (described above). The vertical dotted line indicates p = 0.05, or equivalently, Q ≈ 13.

p = 0.01, Q = 20
p = 0.001, Q = 30
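
The values above follow from the standard definitions, Q = -10 log10(p) for Sanger (Phred) scores and Q = -10 log10(p / (1 - p)) for Solexa scores. A minimal sketch for converting in both directions (the function names are illustrative only):

```python
import math


def sanger_q(p):
    """Sanger/Phred quality score for error probability p."""
    return -10 * math.log10(p)


def solexa_q(p):
    """Solexa quality score for error probability p (log-odds based)."""
    return -10 * math.log10(p / (1 - p))


def sanger_p(q):
    """Error probability for a Sanger/Phred quality score."""
    return 10 ** (-q / 10)


def solexa_p(q):
    """Error probability for a Solexa quality score."""
    return 1 / (1 + 10 ** (q / 10))


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(p, round(sanger_q(p), 2), round(solexa_q(p), 2))
```

The two scales nearly coincide at small p and diverge as p grows, because the Solexa score goes negative for p > 0.5 while the Sanger score stays non-negative.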


Software | Version | Input | Output | Location (machine/folder)
Newbler | 2.3 (091027_1459) | - | - | panflam, panpyro
Phrap | 0.990329 (Phrap0.990329_patch) | - | - | panflam
Phrap | 1.090518 | - | - | panflam
Consed | 090206 | - | - | panflam
CABOG (celera) | 6.1 | sanger, 454 (.sff), illumina (fastq), fastq | CABOG_output | panflam, panpyro
maq | 0.7.1 | ref: fasta, read: illumina, long read (not good) | - | panflam, panpyro
abyss | 1.2.0 | 454, illumina | - | panflam
SOAPdenovo | 1.04 | illumina | - | panflam
SOAPaligner | - | illumina | - | panflam
Corrector (soap package) | 1.00 | fasta, fastq | - | panflam
GapCloser (soap package) | 1.10 | fasta, fastq | - | panflam
MIRA | - | sanger, 454, illumina | - | -
gapResolution | - | newbler results | fasta, qual | -
Dupfinisher | - | ace file | - | -
AutoEditor | 1.20 | .contig (TIGR) | - | -
rnammer | 1.2 | fasta | gff2 | panflam
hmmer | 2.3.2 (for rnammer), 3 | - | - | panflam, panpyro
tRNAscan-SE | 1.23 | - | - | panflam, panpyro
BlastViewer | - | - | - | panflam
M-GCAT | - | - | - | panflam, panpyro
bowtie | - | illumina | - | -
Velvet | - | illumina | - | -
MAQ | - | illumina | - | -
Polisher | - | illumina | - | -


Introduction to Newbler (ppt) : bulletin board

consed manual

about fake reads



phrap diff


create mate file from illumina for bambus

a blog very good at newbler


Handling 454 sff files

Useful CABOG options



   cellular organisms; Bacteria; Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; Eubacterium




Pawel Mackiewicz et al, Where does bacterial replication start? Rules for predicting the oriC region, Nucleic Acids Research 2004 32(13):3781-3791
