Short read aligner comparison

From CSBLwiki

Latest revision as of 08:58, 7 July 2011

BFAST

http://dna.cs.byu.edu/gnumap/

http://iga-rna.sourceforge.net/

http://samtools.sourceforge.net/swlist.shtml

http://bamview.sourceforge.net/

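SAM protocol: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol#Support_Protocol_1:_Base_Quality_Recalibration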

The pileup output format is explained in the samtools manual page (http://samtools.sourceforge.net/samtools.shtml). Briefly, when you invoke pileup with the -c option, the columns are:

  1. reference sequence name
  2. reference coordinate
  3. reference base, or `*' for an indel line
  4. genotype, where heterozygotes are encoded in the IUB code: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A and +A/*.
  5. Phred-scaled likelihood that the genotype is wrong, also called the `consensus quality'
  6. Phred-scaled likelihood that the genotype is identical to the reference, also called the `SNP quality'. Suppose the reference base is A and in the alignment we see 17 G bases and 3 A bases. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence for a SNP is very strong.
  7. root mean square (RMS) mapping quality
  8. number of reads covering the position
  9. read bases at a SNP line (check the manual page for more information); the 1st indel allele otherwise
  10. base quality at a SNP line; the 2nd indel allele otherwise
  11. indel line only: number of reads directly supporting the 1st indel allele
  12. indel line only: number of reads directly supporting the 2nd indel allele
  13. indel line only: number of reads supporting a third indel allele
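
The column layout above can be sketched as a small Python parser. This is only an illustration: the names `PileupRecord`, `parse_pileup_line` and `expand_genotype` are hypothetical and not part of samtools, the example line is invented, and the sketch assumes tab-separated fields with integer qualities and counts.

```python
from typing import NamedTuple, Optional

# IUB codes for heterozygous genotypes, as listed for column 4.
IUB_HET = {"M": "A/C", "R": "A/G", "W": "A/T", "S": "C/G", "Y": "C/T", "K": "G/T"}


class PileupRecord(NamedTuple):
    """One line of `pileup -c` output (hypothetical container, not a samtools type)."""
    ref_name: str                  # column 1
    position: int                  # column 2
    ref_base: str                  # column 3, '*' on an indel line
    genotype: str                  # column 4
    consensus_quality: int         # column 5
    snp_quality: int               # column 6
    rms_mapping_quality: int       # column 7
    depth: int                     # column 8
    bases_or_allele1: str          # column 9
    quals_or_allele2: str          # column 10
    n_allele1: Optional[int] = None  # column 11, indel lines only
    n_allele2: Optional[int] = None  # column 12, indel lines only
    n_other: Optional[int] = None    # column 13, indel lines only


def parse_pileup_line(line: str) -> PileupRecord:
    """Split one tab-separated consensus pileup line into named fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 10:
        raise ValueError("expected at least 10 columns, got %d" % len(f))
    is_indel = f[2] == "*"
    if is_indel and len(f) >= 13:
        extra = [int(x) for x in f[10:13]]
    else:
        extra = [None, None, None]
    return PileupRecord(f[0], int(f[1]), f[2], f[3], int(f[4]), int(f[5]),
                        int(f[6]), int(f[7]), f[8], f[9], *extra)


def expand_genotype(genotype: str) -> str:
    """Turn a column-4 genotype into an explicit allele pair where possible."""
    if genotype in IUB_HET:
        return IUB_HET[genotype]          # heterozygote, e.g. R -> A/G
    if genotype in ("A", "C", "G", "T"):
        return genotype + "/" + genotype  # homozygote
    return genotype                       # indel genotypes such as */+A are kept as-is


# Invented example: reference A, 17 G and 3 reference-matching reads, called R (A/G).
EXAMPLE = "\t".join(["chr1", "100", "A", "R", "20", "50", "60", "20",
                     "G" * 17 + "...", "I" * 20])

if __name__ == "__main__":
    rec = parse_pileup_line(EXAMPLE)
    print(rec.ref_name, rec.position, expand_genotype(rec.genotype), rec.depth)
```
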

If pileup is invoked without `-c', indel lines and columns 3 through 7 are not output.
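
The Phred-scaled values in columns 5 and 6 and the RMS value in column 7 of the `-c' output follow standard definitions: a Phred score Q corresponds to a probability P = 10^(-Q/10), and the root mean square is the square root of the mean of the squared values. A minimal sketch, with hypothetical helper names that are not part of samtools:

```python
import math


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled score to the probability it encodes."""
    return 10 ** (-q / 10.0)


def prob_to_phred(p: float) -> float:
    """Convert a probability (0 < p <= 1) to a Phred-scaled score."""
    return -10.0 * math.log10(p)


def rms(values) -> float:
    """Root mean square of a non-empty sequence, e.g. per-read mapping qualities."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


if __name__ == "__main__":
    # A consensus quality of 20 means roughly a 1-in-100 chance the genotype is wrong.
    print(phred_to_prob(20))
    # RMS of some made-up mapping qualities.
    print(rms([60, 60, 0]))
```

Read this way, a low consensus quality combined with a high SNP quality means the site is very likely variant even though the exact genotype is uncertain.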


$ fastq_quality_filter -h

	usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]

	version 0.0.6
	   [-h]         = This helpful help screen.
	   [-q N]       = Minimum quality score to keep. 
	   [-p N]       = Minimum percent of bases that must have [-q] quality.
	   [-z]         = Compress output with GZIP.
	   [-i INFILE]  = FASTA/Q input file. default is STDIN.
	   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
	   [-v]         = Verbose - report number of sequences.
			  If [-o] is specified,  report will be printed to STDOUT.
			  If [-o] is not specified (and output goes to STDOUT),
			  report will be printed to STDERR.
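
The effect of the -q and -p options can be sketched in Python. This is not the tool's implementation. It assumes that a read is kept when at least `min_percent` percent of its bases have quality `min_quality` or higher, and that qualities are ASCII-encoded with an offset of 33; both the interpretation and the offset are assumptions, and the function names are hypothetical.

```python
def passes_quality_filter(quality_string, min_quality, min_percent, offset=33):
    """Return True if enough bases reach min_quality (assumed -q / -p semantics)."""
    if not quality_string:
        return False
    scores = [ord(c) - offset for c in quality_string]  # decode ASCII qualities
    good = sum(1 for s in scores if s >= min_quality)
    return 100.0 * good / len(scores) >= min_percent


def filter_fastq(lines, min_quality, min_percent, offset=33):
    """Yield (header, sequence, plus, quality) for 4-line FASTQ records that pass."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator groups the input into 4-line records;
    # a trailing incomplete record is silently dropped.
    for header, seq, plus, qual in zip(it, it, it, it):
        if passes_quality_filter(qual, min_quality, min_percent, offset):
            yield header, seq, plus, qual


if __name__ == "__main__":
    # Invented records: the first is all high quality, the second is half low quality.
    records = ["@r1", "ACGT", "+", "IIII",
               "@r2", "ACGT", "+", "!!II"]
    for rec in filter_fastq(records, min_quality=20, min_percent=80):
        print(rec[0])
```
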