Genome assembly

From CSBLwiki

Revision as of 09:05, 23 July 2010

Results

Coverage Graph

Using Solexa reads with Mosaik Aligner: mosaik_aligner_result1

Scaffold 8: coverage is 4~5 times that of the other scaffolds.

Scaffold 3: not aligned. (Because of paired-end mode?)

Annotation

annotation_E_limosum

tRNA_E_limosum

scaffolds

rRNA
scaffold00008	RNAmmer-1.2	rRNA	2461	5322	3162.5	+	.	23s_rRNA	
scaffold00008	RNAmmer-1.2	rRNA	5400	5513	58.5	+	.	5s_rRNA	
scaffold00007	RNAmmer-1.2	rRNA	277	392	72.7	-	.	5s_rRNA	
scaffold00008	RNAmmer-1.2	rRNA	109	1620	1936.1	+	.	16s_rRNA

With scaffold 4 split, there are 9 scaffolds in total.
Judging by GC content, the joins 5-4_b and 4_f-6 are expected to be the more natural ones.


scf	length	GC%	GC skew	Descriptions
1	760KB	49.00		DnaB, DnaJ, dnaq, dnaG, dnaK, tRNA(S(3),Q(2),P(2),L(2),M(1),F(1),R(1),H(1),K(1))
2	493KB	49.24		tRNA(V(1),R(1),Y(1),T(1)), dnaG
3	2.2KB	50.46		
4_f	314KB	47.65		tRNA(K(1),T(1),Y(1),L(1),G(1),K(1),W(1),S(1)), DnaA, DnaB, dnan, dnaX_nterm
4_b	649KB	45.69		tRNA(T(1),A(1),L(1),M(2),Q(1),V(2),R(1),P(1),P(1),F(1),G(2),C(1)), dnaK
5	1422KB (1.4M)	46.53		Has terminus of replication, tRNA(L(1),T(1),Y(1),K(1),R(1)), dnaK, dnaX_nterm
6	231KB	47.18		dnaG, dnaq
7	377KB	48.00		Has terminus of replication, 5s rRNA
8	5.5KB	44.89		5s, 23s, 16s rRNA, dnaq
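
The GC% column, and the still-empty GC skew column, could be filled per scaffold with something like the sketch below. It is not the script used for the table: the function names are made up for illustration, skew is taken as (G - C) / (G + C) over the whole scaffold, and N and other ambiguous bases are ignored.

```python
def gc_percent(seq):
    """GC content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def gc_skew(seq):
    """Whole-sequence GC skew, (G - C) / (G + C)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Usage would be along the lines of `for name, seq in parse_fasta(open("manual_align3.fasta")): print(name, len(seq), gc_percent(seq), gc_skew(seq))`.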

Both of the results below appear possible. Scaffold 4 should therefore be split in two and the relationship between the pieces worked out by PCR or by gene order.
5
4_front
4_back
1
2
6
7
8
3

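That is, either 5-4_back, 4_front-6  or  5, 4_front-4_back, 6 is possible.

Aligning the 454 PE reads to the 8 scaffolds with newbler gsMapper showed that scaffold 4 was misassembled; it was split in two and the pieces were joined to scaffolds 5 and 6.
As a result, 7 scaffolds remained.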
5-4_back : 2.07M
1 : 0.76M
2 : 0.49M
4_front-6 : 0.546M
7 : 0.037M
8 : 5.5KB (5s, 23s, 16s rDNA: encoding rRNA | depth is about 3 times that of the others)
3 : 2.2KB

newbler scaffolds
5 : 1.4M
4 : 0.96M
1 : 0.76M
2 : 0.49M
6 : 0.23M
7 : 0.037M
8 : 5.5KB (5s, 23s, 16s rDNA: encoding rRNA | depth is about 3 times that of the others)
3 : 2.2KB

cabog:5,newbler:8
After aligning the two and comparing them, the newbler result after gapResolution appears to be the more accurate.
The cabog scaffolds appear to contain errors.

Procedures

Finishing
CBCB Finishing Toolbox

Finishing procedures with Dupfinisher
Here is the LANL finishing procedure involving Dupfinisher: 
1) run Dupfinisher on the assembly ace file;
2) put the artificial reads generated by Dupfinisher into the main project;
3) assemble with parallel Phrap;
4) repeat steps 1-3 with new ace file;
5) run Consed autoFinish on the main project and do only primer walks from the main project and those from subprojects of unfinished repeats;
6) repeat step 4;
7) run autoFinish using primer walks for the main project and those from subprojects of unfinished repeats and use PCR to close gaps between scaffolds in main project;
8) repeat step 4;
9) perform manual finishing including closing gaps, resolving low quality and single clone coverage regions and checking repeat resolutions from Dupfinisher.

Cliff S. Han, Patrick Chain, Finishing Repetitive Regions Automatically with Dupfinisher

[1]
Illumina reads -> EULER-SR :4233 contigs
+
454 reads
-> newbler : 270 hybrid contigs
+
paired 454 reads
-> newbler's scaffolder : 3 contigs (A:3.18 Mb, B:5.7 kb and C:524 kb) |  (unscaffolded contigs -> utilized later in the final Finishing phase)

[2]
+
(Hybrid EULER-SR/VELVET contigs, Unscaffolded contigs -> nucmer) & (Illumina reads -> mosaik aligner)
-> finishing by filling in the Ns (degenerate nucleotides) left by the scaffolder  (We developed a Scaffold Bridging and Finishing phase for the purpose of linking the de novo scaffolds and for resolving the intra-scaffold degenerate nucleotide positions that were introduced by the scaffolder)
potential repeats/duplications by examining the read coverage and also the multiplicity of the vertices in the repeat graph that is part of EULER-SR's output
->scaffold B was indeed duplicated and a BLAST [11] search identified it as an rRNA gene
=> A, B, B, C

[3]
-> ordering with PCR

[4] 
-> correct indel error : mosaik aligner with Illumina reads 

figure
Harish Nagarajan et al, De Novo Assembly of the Complete Genome of an Enhanced Electricity-Producing Variant of Geobacter sulfurreducens Using Only Short Reads

Zhou Yu, Tao Li, Jindong Zhao and Jingchu Luo, PGAAS: a prokaryotic genome assembly assistant system
=> same principle as ABBA

Logbook

Mosaik

Align Solexa reads to the scaffolds that contain Ns.

/home/gnusnah/works2/assembly_elimosum/mosaik
~/tools/MARTHLAB/UnifiedRelease/bin/MosaikBuild -q solexa/1/ -q2 solexa/2/ -out reads.bin -st illumina
~/tools/MARTHLAB/UnifiedRelease/bin/MosaikBuild -fr ref/manual_align3.fasta -oa scfs.fa.bin
~/tools/MARTHLAB/UnifiedRelease/bin/MosaikJump -ia scfs.fa.bin -out scfs.MosaikJumpDb -hs 15   (hs: hash size -> large vs short = speed vs sensitivity)
~/tools/MARTHLAB/UnifiedRelease/bin/MosaikAligner -in reads.bin -ia scfs.fa.bin -out reads.bin.aligned -hs 15 -mmp 0.1 -act 20 -mhp 100 -m all -a all -p 8 -j scfs.MosaikJumpDb -km -pm
~/tools/MARTHLAB/UnifiedRelease/bin/MosaikSort -in reads.bin.aligned -out reads.bin.aligned.sorted -inu -uo
ace2Fasta.perl reads.bin.aligned.sorted.assembled_scaffold00001.ace
~/tools/MARTHLAB/UnifiedRelease/bin/MosaikCoverage -in reads.bin.aligned -ia scfs.fa.bin -u -od graphs -cg

blast

tblastx against closely related species identified via 16s rRNA
blastall -p tblastx -d Alkaliphilus_metalliredigens_QYMF/NC_009633.fna -i manual_align2.fasta -e 0.01 -m 7 > Alkaliphilus_metalliredigens_QYMF.blastout (4929566)
blastall -p tblastx -d Alkaliphilus_oremlandii_OhILAs/NC_009922.fna -i manual_align2.fasta -e 0.01 -m 7 > Alkaliphilus_oremlandii_OhILAs.blastout (3123558)
blastall -p tblastx -d Bacillus_halodurans/NC_002570.fna -i manual_align2.fasta -e 0.01 -m 7 > Bacillus_halodurans.blastout (4202352)
blastall -p tblastx -d Clostridium_novyi_NT/NC_008593.fna -i manual_align2.fasta -e 0.01 -m 7 > Clostridium_novyi_NT.blastout (2547720)
blastall -p tblastx -d Clostridium_tetani_E88/NC_004557.fna -i manual_align2.fasta -e 0.01 -m 7 > Clostridium_tetani_E88.blastout (2799251)
blastall -p tblastx -d Clostridium_thermocellum_ATCC_27405/NC_009012.fna -i manual_align2.fasta -e 0.01 -m 7 > Clostridium_thermocellum_ATCC_27405.blastout (3843301)
blastall -p tblastx -d Desulfotomaculum_reducens_MI-1/NC_009253.fna -i manual_align2.fasta -e 0.01 -m 7 > Desulfotomaculum_reducens_MI-1.blastout (3608104)
blastall -p tblastx -d Geobacillus_kaustophilus_HTA426/NC_006510.fna -i manual_align2.fasta -e 0.01 -m 7 > Geobacillus_kaustophilus_HTA426.blastout (3544776)
blastall -p tblastx -d Oceanobacillus_iheyensis/NC_004193.fna -i manual_align2.fasta -e 0.01 -m 7 > Oceanobacillus_iheyensis.blastout (3630528)
blastall -p tblastx -d Pelotomaculum_thermopropionicum_SI/NC_009454.fna -i manual_align2.fasta -e 0.01 -m 7 > Pelotomaculum_thermopropionicum_SI.blastout (3025375) : bad

Search for the two proteins below, using the scaffolds as the database
~/works2/assembly_elimosum/blast$ 
formatdb -t scf -i manual_align2.fasta -p F
blastall -p tblastn -d manual_align2.fasta -i scf3_proteins.fasta -m 8 > blastout.txt
The consecutive pair YP_170397.1 (front part) and ZP_03057006 (back part) appears 3 times in scf5_4; in scf7 the pair was found at one site with the two positions swapped. Besides these, the proteins are also found separately at several other sites.
BLAST results for scf03, 2010_07_21

newbler scaffold 3 (2.2kb) : 2 hits

Front part:
>ref|YP_170397.1| UDP-glucose/GDP-mannose dehydrogenase [Francisella tularensis subsp. tularensis SCHU S4]
Score =  608 bits (1568),  Expect = 9e-172
Length: 436

>gi|56708501|ref|YP_170397.1| UDP-glucose/GDP-mannose dehydrogenase [Francisella tularensis subsp. tularensis SCHU S4]
MSLYEDIVAKREKVSLVGLGYVGLPIAIAFAKKIDVLGFDICETKVQHYKDGFDPTKEVGDEAVRNTTMK
FSCDETSLKECKFHIVAVPTPVKADKTPDLTPIIKASETVGRNLVKGAYVVFESTVYPGVTEDVCVPILE
KESGLRSGEDFKVGYSPERINPGDKVHRLETIIKVVSGMDEESLDTIAKVYELVVDAGVYRASSIKVAEA
AKVIENSQRDVNIAFVNELSIIFNQMGIDTLEVLAAAATKWNFLNFKPGLVGGHCIGVDPYYLTYKAAEL
GYHSQVILSGRRINDSMGKFVVENLVKKLISADIPVKRARVAIFGFTFKEDCPDTRNTRVIDMVKELNEY
GIEPYIIDPVADKEEAKHEYGLEFDDLSKMVNLDAIIIAVSHEQFKDITKQQFDRLYAHNSRKIIFDIKG
SLDKSEFEKDYIYWRL

Back part:
>ref|ZP_03057006.1|  NAD dependent epimerase/dehydratase family protein [Francisella tularensis subsp. novicida FTE]
Score =  483 bits (1242),  Expect = 6e-134
Length: 309

>gi|194323222|ref|ZP_03057006.1| NAD dependent epimerase/dehydratase family protein [Francisella tularensis subsp. novicida FTE]
MTGGAGFIGSNLCEVLLSKGYRVRCLDDLSNGHYHNVEPFLTNSNYEFIKGDIRDLDTCMKACEGIDYVL
HQAAWGSVPRSIEMPLVYEDINVKGTLNMLEAARQNNVKKFVYASSSSVYGDEPNLPKKEGREGNILSPY
AFTKKANEEWARLYTKLYGLDTYGLRYFNVFGRRQDPNGAYAAVIPKFIKQLLNDEAPTINGDGKQSRDF
TYIENVIEANLKACLADSKYAGEAFNIAYGGREYLIDLYYNLCDALGKKIEPNFGPDRAGDIKHSNADIS
KARNMLGYNPEYDFELGIKHAVEWYSSEL

tRNAscan-SE

tRNAscan-SE -B -o tRNA.txt manual_align3.fasta (searched across all scaffolds)

newbler scaffold 3 (2.2kb) : not tRNA

RNAMMER

In the newbler result, scaffold 8 (5.5kb) : 5s, 23s, 16s rRNA
Scaffold 3 (2.2kb) : not rRNA

Dupfinisher

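Dupfinisher fix: lines 148 and 222 of NCBI.pm were changed to the following. The parser could not recognize E-values written as e-<number>; rewriting them as 1e-<number> solved the problem.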
--------------------------------------------
 my $tmp_start_word = "e";
 my $tmp = "";

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   if (/Expect = ([+e\d\.-]+)/) {
     $tmp_start_word = substr($1,0,1);
     if ($tmp_start_word eq "e") {
        $tmp = $1;
        $tmp =~ s/^e/1e/g;
        $hsp->insert(Expect => $tmp);
        $expect = $tmp < $expect ? $tmp : $expect;
        }
     else {
        $hsp->insert(Expect => $1);
        $expect = $1 < $expect ? $1 : $expect;
        }
   }
--------------------------------------------
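
For reference, the same normalization can be sketched in Python. This is only an illustration of the idea behind the patch, not part of Dupfinisher; `parse_expect` is a made-up name, and the regular expression simply mirrors the character class used in the Perl above.

```python
import re

# Same character class as the Perl pattern: digits, '.', '+', '-' and 'e'.
EXPECT_RE = re.compile(r"Expect = ([+e\d.-]+)")


def parse_expect(line):
    """Return the E-value on a BLAST 'Expect = ...' line as a float, or None."""
    match = EXPECT_RE.search(line)
    if match is None:
        return None
    text = match.group(1).rstrip(".")  # drop a sentence-ending period, if any
    if text.startswith("e"):
        # Old BLAST reports print very small values as 'e-172'; make it '1e-172'.
        text = "1" + text
    return float(text)
```

Comparing the parsed floats, instead of the raw strings, also keeps the "smallest Expect so far" bookkeeping numeric.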

However, unexplained errors still occur at the grouping step.

454 reads (de novo) + solexa fake reads
The fake reads were converted => afg (toAmos) => frg (amos2frg), in that order, and fed to CABOG
1. Assemble the 454 reads and fake reads together with cabog (seems to be going well; the previous attempt used fastqToCA and failed, with very poor results) -> generate ace file -> Dupfinisher
2. Assemble the 454 reads and fake reads together with newbler -> gapResolution (already done)

~/tools/wgs-6.1/Linux-amd64/bin/runCA -d test -p fake_solexa solexa.frg
~/tools/wgs-6.1/Linux-amd64/bin/runCA -d SE_PE -p SE_PE createACE=1 unitigger=bog doToggle=1 closureOverlaps=0 closurePlacement=2 SE.frg PE.frg solexa.frg

MIRA

Using MIRA
Two approaches to assembly are offered:
1. full de-novo 454 reads + solexa reads (126.9 GB required in total)
2. de-novo with 454 reads only (2.9 GB required), then mapping of the solexa reads (145.6 GB required)

Could the solexa reads be split up and mapped in parts?

Step 1: assemble the 'long' reads (454 or Sanger or both)
First assemble the 454 reads only
sff_extract -l linker.fasta -i "insert_size:3000,insert_stdev:900"  GE6FA8204.sff GIST.SE.sff
mira --project=elimosum --job=denovo,genome,accurate,454 COMMON_SETTINGS -GE:not=4 -OUT:ora=yes 454_SETTINGS -ED:ace=yes >&log_assembly

Step 2: filter the results
convert_project -f caf -t caf -x 500 elimosum_out.caf hybrid_backbone_in.caf

Step 3: map the Solexa data
cat s_3.1.fastq s_3.2.fastq > hybrid_in.solexa.fastq
cat s_3.1.fastq s_3.2.fastq \
  | grep "@" \
  | sed -e 's/@//' \
  | cut -f 1 \
  | cut -f 1 -d ' ' \
  | sed -e 's/$/ hybrid/' \
  > hybrid_straindata_in.txt
mira --project=hybrid --job=mapping,genome,accurate,solexa -AS:nop=1 -SB:bft=caf:lsd=yes:bsn=elimosum COMMON_SETTINGS -GE:not=6 -OUT:ora=yes SOLEXA_SETTINGS -CO:msr=no -GE:uti=no:tismin=350:tismax=400 >&log_assembly.txt
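(If it fails because of memory, the plan is to use the fastq split into four parts)
Increased swap memory
After about 10 hours -> about 95% of the solexa fastq had been loaded into memory
No change in the log file since 5:50 am -> terminated for now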
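
One caveat about the straindata pipeline in Step 3: grep "@" also matches any quality line that happens to contain "@", so stray entries can end up in hybrid_straindata_in.txt. A minimal sketch that takes the read name from every fourth line instead, assuming plain 4-line FASTQ records (`strain_lines` is a made-up name):

```python
def strain_lines(fastq_lines, strain="hybrid"):
    """Yield 'readname strain' for each record of a 4-line-per-read FASTQ."""
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue  # only the first line of each record is a header
        header = line.rstrip("\n")
        if not header:
            continue  # tolerate a trailing blank line
        fields = header[1:].split()
        if not header.startswith("@") or not fields:
            raise ValueError(f"line {index + 1} is not a FASTQ header: {header!r}")
        # The first whitespace-delimited token is the read name.
        yield f"{fields[0]} {strain}"
```

Writing the result out would then be a matter of looping over both fastq files and printing each yielded line to hybrid_straindata_in.txt.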

Fake Reads

fake reads -> newbler and phrap
*Using my own script
454PE-cabog -> fake reads
454SE-cabog -> not used
454SE-newbler -> fake reads
454SE_PE-cabog -> not used

fake reads(454PE-cabog) + fake reads(454SE-newbler) + fake reads(illu-abyss) + fake reads(illu-velvet)
1.phrap  (default) -> phrap memory error
2.newbler (-ace) -> results not very good; with no paired-end information no scaffolds are produced either -> added 454PE reads and obtained scaffolds, 11 -> gapRes -> assorted errors.

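*Write a script that splits into MIRA fragments + a script that handles multiple contigs
Not going well...
Pair information will probably have to be supplied
If the scaffold file is split, how should the Ns be handled? Leaving them as they are would be disastrous...
But then, what would be the point of simply splitting the contig file?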
fake reads(454PE-cabog) + fake reads(454SE-newbler) + fake reads(illu-abyss) + fake reads(illu-velvet)

 Next steps 
*Check the read length of the fastq that goes into cabog -> make fake reads from contigs -> assemble
*Make fake reads from the cabog contigs -> assemble with newbler -> gapRes
*Build a small assembly (ace file etc.) -> debug dupfinisher
*Assemble the fake reads with phrap -> ?
*Adapt the cabog output so that gapRes can use it

CABOG

 cabog with ace output and some options 
~/tools/wgs-6.1/Linux-amd64/bin/runCA -d SE -p SE createACE=1 unitigger=bog doToggle=1 closureOverlaps=0 closurePlacement=2 SE.frg &
~/tools/wgs-6.1/Linux-amd64/bin/runCA -d PE -p PE createACE=1 unitigger=bog doToggle=1 closureOverlaps=0 closurePlacement=2 PE.frg &
~/tools/wgs-6.1/Linux-amd64/bin/runCA -d SE_PE -p SE_PE createACE=1 unitigger=bog doToggle=1 closureOverlaps=0 closurePlacement=2 SE.frg PE.frg &

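Using cabog, reads: 454PE, 454SE, illumina 2
After a full day it was still in the stage 0 overlap; no way to predict when it will finish. CPU usage was 190%; how many cores it uses is unknown. Failed at the 0-overlaptrim-overlap stage because of disk space. The failed stage alone took up 64GB.
Using cabog, reads: 454PE, 454SE, abyss contigs
panpyro
Failed. The fastq-reading code seems to be written for illumina reads; long reads do not seem to be read.

Using cabog, reads: 454PE, 454SE, abyss fake reads
panpyro /home/users/roh329/works/assembly_2010_7_12
Failed. There is an unidentified problem with the abyss fake reads.

Generated fake qual values and combined them with the fasta to make fastq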
/home/gnusnah/p-code/PModule/assembler_modules/make_qual.py
/home/gnusnah/p-code/PModule/assembler_modules/make_fastq.py
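
The two scripts are not reproduced here. The sketch below only illustrates the general idea of pairing each fasta sequence with a made-up quality string; the constant quality of 30 and the ASCII offset of 33 are assumptions for the example, not values taken from make_qual.py or make_fastq.py.

```python
def fasta_to_fastq(records, quality=30, offset=33):
    """Turn (name, sequence) pairs into FASTQ records with a constant fake quality.

    quality and offset are illustrative defaults; the offset has to match
    whatever encoding the downstream tool expects.
    """
    qchar = chr(quality + offset)
    for name, seq in records:
        # One fake quality character per base.
        yield f"@{name}\n{seq}\n+\n{qchar * len(seq)}\n"
```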

Using cabog, reads: 454PE, 454SE, illumina
panflam
~/tools/wgs-6.1/Linux-amd64/bin/fastqToCA -insertsize 375 25 -libraryname JUN_illu -type illumina -fastq /home/gnusnah/db/genome/Eubacteria/JUN_2010_PE/s_3.1.fastq,/home/gnusnah/db/genome/Eubacteria/JUN_2010_PE/s_3.2.fastq > s_3.frg
~/tools/wgs-6.1/Linux-amd64/bin/sffToCA -libraryname PE -insertsize 3000 200 -linker titanium -output PE GE6FA8204.sff
~/tools/wgs-6.1/Linux-amd64/bin/sffToCA -libraryname SE -output SE GIST.SE.sff
~/tools/wgs-6.1/Linux-amd64/bin/runCA -d SE_PE_ILLU -p run1 unitigger=bog doToggle=1 closurePlacement=1 PE.frg SE.frg s_3.frg

gapResolution

Using gapResolution
/home/gnusnah/works/assembly_2010_7_8/gapRes/run1
~/tools/gapResolution-1_2_1/bin/runGapResolution.pl -od run1 -np 8 ../SE_PE_abyss/assembly/consed/edit_dir/454Contigs.ace.1 ../SE_PE_abyss/assembly/454Scaffolds.txt ../SE_PE_abyss/assembly/454NewblerMetrics.txt ../SE_PE_abyss/assembly/454AllContigs.fna ../SE_PE_abyss/assembly/454AllContigs.qual
~/tools/gapResolution-1_2_1/bin/stitchClosedSubProjects.pl ../../SE_PE_abyss/assembly/454Scaffolds.txt ../../SE_PE_abyss/assembly/454AllContigs.fna ../../SE_PE_abyss/assembly/454AllContigs.qual ./fakes/ ./assemInfo/gapdirs.txt my_run1
~/p-code/PModule/assembler_modules/scf2ctg.py my_run1.fasta
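
scf2ctg.py itself is not shown. Assuming it breaks each scaffold into contigs at runs of N, the core step might look like the sketch below; `split_scaffold`, the `_ctg` naming scheme and the `min_gap` parameter are all hypothetical.

```python
import re


def split_scaffold(name, seq, min_gap=1):
    """Split a scaffold at runs of at least min_gap N characters.

    Returns (contig_name, start_offset, contig_sequence) tuples, where
    start_offset is the 0-based position of the contig within the scaffold.
    """
    gap_re = re.compile(r"[Nn]{%d,}" % min_gap)
    pieces = []
    pos = 0
    for gap in gap_re.finditer(seq):
        if gap.start() > pos:
            pieces.append((pos, seq[pos:gap.start()]))
        pos = gap.end()
    if pos < len(seq):
        pieces.append((pos, seq[pos:]))
    return [(f"{name}_ctg{i + 1}", start, piece)
            for i, (start, piece) in enumerate(pieces)]
```

Keeping the start offsets makes it possible to put the contigs back in scaffold order and to recover the gap sizes later.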

On seqanswers there are opinions that mira 3 is quite effective for hybrid assembly
The manual is as long as the consed one.

Phrap/Consed

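 Writing a St. Louis conversion script 
While writing it I looked at the original 454 reads and found that, for reads carrying mate pair information, after the read is split at the linker seq the information is discarded if either end is too short.
So I assembled with newbler with the minimum read length option adjusted: 20 (default) -> 15 (the smallest value allowed)
The result was actually worse, probably because short sequences cause more ambiguity
Paused the script for now because handling the qual information is difficult

Assembling solexa with phrap
How to convert the read names? The manual says "create a script which translates your read names into St. Louis"; is there no script already written by someone else?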

addSolexaReads.perl again
gnusnah@panflam:~/works/assembly_2010_7_8/SE_PE/consed/edit_dir$ addSolexaReads.perl 454Contigs.ace.1 solexa_files.fof ref.fa 
Took about 2 hours, failed again
couldn't execute /home/gnusnah/tools/UW/consed/bin/consed -ace 454Contigs.ace.1 -addReads alignmentFiles100711_154311.fof -chem solexa at /home/gnusnah/tools/UW/consed/bin/addSolexaReads.perl line 170.
error_at_reading_step: while reading the quality values -> out of memory -> loading the solexa reads themselves seems inefficient -> split contigs into fake reads, as in the paper

100711 Solexa read conversion
Convert "." to N: cat s_3.1.fastq | perl -pe 's/\./N/g' > N_s_3.1.fastq

Add solexa reads to Newbler result
gnusnah@panflam:~/works/assembly_2010_7_8/SE_PE/consed/edit_dir$ addSolexaReads.perl 454Contigs.ace.1 solexa_files.fof ref.fa 
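Took 33 minutes in total
error - 454Contigs.ace.2 file: 0 -> the disk had hit 100%; cleaned up and ran again
Error again - the "." characters in the reads are the problem - how to solve? Delete reads containing "."? If deleting, also delete the paired read? -> Replacing "." with n might work.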
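
A blanket substitution of "." over the whole file would also touch header and quality lines that happen to contain a ".". A minimal sketch that restricts the replacement to the sequence line of each record, assuming plain 4-line FASTQ records with unwrapped sequences (`dots_to_n` is a made-up name):

```python
def dots_to_n(fastq_lines):
    """Replace '.' with 'N' on sequence lines only (second line of each record)."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:
            line = line.replace(".", "N")
        yield line
```

Because reads are edited in place and none are dropped, the pairing between s_3.1.fastq and s_3.2.fastq stays intact.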

add solexa read, doing...
under /home/gnusnah/works/assembly_2010_7_8/consed/
make dir : solexa_dir
link to fastq (2 paired end file)
make file : edit_dir/solexa_files.fof

Consed Customization
file : /home/gnusnah/.consedrc
add environment : /home/gnusnah/.bashrc

Consed Install
Consed_Install
While customizing phredPhrap, the location of polyphred should be confirmed. Polyphred is not installed. Sent request e-mail.

Try Consed
gnusnah@panflam:~/works/assembly_2010_7_8/SE_PE/consed/edit_dir$ ~/tools/UW/consed/consed_linux64bit

phred
add environment : /home/gnusnah/.bashrc
PHRED_PARAMETER_FILE=/home/gnusnah/tools/UW/phred/phredpar.dat
export PHRED_PARAMETER_FILE

Newbler

gsMapper
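Mapped the 454PE and 454SE reads with manual_align2.fasta as the reference
scf4 had been split and merged into scf5 and scf6, but on remapping, pair information was found that corresponds to the result before 4 was split and joined, i.e. the 8-scaffold state.
In other words, the relationship among 4_front, 4_back, 5 and 6 is ambiguous. It seems this needs to be confirmed by PCR or gene order.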

gsMapper
Mapped reads:454PE,454SE fakes:abyss,velvet onto the 8 scaffolds from gapRes (reads:454PE,454SE fakes:abyss) -> the fakes are too long to be mapped
Used only 454PE and 454SE as reads :

PyroBayes (MARTHLAB)
Reportedly it can produce better-quality fasta from 454 sff files.

Fake reads from abyss contigs + 454 data
phrap was hard to use, so tried assembling with newbler; could not find the command-line manual, so assembled with the GUI: -consed -a 50 -l 350 -ml 20
scaffolds: 11->8, number of contigs: 64->290, total contig length: 4247430->4284534

Making fake reads from the abyss contigs built from solexa reads
Length is 1.5kb; should all contigs shorter than that be discarded? Probably, in order to assemble with phrap...
How much coverage? 10
/home/gnusnah/p-code/PModule/assembler_modules/make_randomread_4_illu_contig.py
Made a library of 45221 reads, total length 67828507
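
The script above is not reproduced here. As a rough illustration of the approach (fixed-length reads drawn at random positions until the requested coverage is reached), a sketch could look like this. Skipping contigs shorter than one read is an assumption, since the note leaves that question open; `make_fake_reads` and its read-naming scheme are made up.

```python
import random


def make_fake_reads(contigs, read_len=1500, coverage=10, seed=0):
    """Sample fixed-length fake reads at random positions from each contig.

    contigs is an iterable of (name, sequence) pairs. Contigs shorter than
    read_len are skipped (an assumption, not necessarily what the original
    script does).
    """
    rng = random.Random(seed)
    reads = []
    for name, seq in contigs:
        if len(seq) < read_len:
            continue
        # Number of reads needed so that reads * read_len ~ coverage * contig length.
        n_reads = max(1, round(coverage * len(seq) / read_len))
        for i in range(n_reads):
            start = rng.randint(0, len(seq) - read_len)  # inclusive bounds
            reads.append((f"{name}_fake{i}", seq[start:start + read_len]))
    return reads
```

With uniformly random start positions the contig ends receive less than the nominal coverage, which may matter when the fake reads are supposed to join neighbouring contigs.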

run Newbler PE
runAssembly -o PE -a 50 -l 350 -g -m -ml 20 -cpu 0 -consed ~/db/genome/Eubacteria/APR_2010_PE/GE6FA8204.sff
(/home/gnusnah/works/assembly_2010_7_8/)

run Newbler SE
runAssembly -o SE -a 50 -l 350 -g -m -ml 20 -cpu 0 -consed ~/db/genome/Eubacteria/NOV_2009_SE/GIST.SE.sff
(/home/gnusnah/works/assembly_2010_7_8/)

run Newbler SE + PE
runAssembly -o SE_PE -a 50 -l 350 -g -m -ml 20 -cpu 0 -consed ~/db/genome/Eubacteria/NOV_2009_SE/GIST.SE.sff ~/db/genome/Eubacteria/APR_2010_PE/GE6FA8204.sff
(/home/gnusnah/works/assembly_2010_7_8/)

Reads Library

454 SE

454 PE

Solexa illumina

According to Andrew D Smith et al, Using quality scores and longer reads improves accuracy of Solexa read mapping, BMC Bioinformatics, mapping works better with a larger mismatch allowance (4), a higher quality cutoff (8), and longer reads. Mapping software is also provided.

Used fastq_masker from the fastx toolkit to replace bases with 'N', using a quality threshold of 10
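
The masking rule itself is simple: any base whose quality is below the threshold becomes 'N'. A minimal sketch, not fastq_masker's own code; the ASCII offset of 64 is an assumption about how these Illumina files are encoded, and `mask_low_quality` is a made-up name.

```python
def mask_low_quality(seq, qual, threshold=10, offset=64):
    """Replace every base whose quality score is below threshold with 'N'.

    offset is the ASCII offset of the quality encoding (64 is assumed here;
    Sanger-style files use 33).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(
        base if ord(q) - offset >= threshold else "N"
        for base, q in zip(seq, qual)
    )
```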

mapping tools : gsMapper (newbler), mosaik aligner

Softwares


Software	Version	Input	Output	Location(machine/folder)
Newbler	2.3(091027_1459)			panflam,panpyro
Phrap	0.990329(Phrap0.990329_patch)			panflam
Phrap	1.090518			panflam
Consed	090206			panflam
CABOG(celera)	6.1	sanger, 454(.sff), illumina(fastq), fastq	CABOG_output	panflam,panpyro
maq	0.7.1	ref:fasta, read:illumina, long read(not good)		panflam,panpyro
abyss	1.2.0	454, illumina		panflam
SOAPdenovo	1.04	illumina		panflam
Corrector(soap package)	1.00	fasta,fastq		panflam
GapCloser(soap package)	1.10	fasta,fastq		panflam
MIRA		sanger,454,illumina
gapResolution		newbler results	fasta,qual
Dupfinisher		ace file
AutoEditor	1.20	.contig(TIGR)
rnammer	1.2	fasta	gff2	panflam
hmmer	3			panflam,panpyro

Polisher
- Can't find...

galaxy web-page(NGS tools)

fastx-toolkit

manuals

Introduction to Newbler (ppt) : bulletin board

consed manual

about fake reads

phrap_input

phrap_input_v1.090518

phrap diff

phrap_v1.090518_shortread

newbler : flow space assembler
abyss : nucleotide space

create mate file from illumina for bambus

a blog very good at newbler

phrap usage

Handling 454 sff

Useful cabog options

References

Pawel Mackiewicz, Where does bacterial replication start? Rules for predicting the oriC region, Nucleic Acids Research 2004 32(13):3781-3791
