Phrap v1.090518 shortread

6. Short read analyses (e.g. Solexa/Illumina data).

   Recommended parameter changes include:
   Smaller -minscore (e.g. 20 or 25, if the default score matrix/penalty values are being used). The default minscore value of 30 requires at least 30 matching bases, which may be too high to reliably detect matches involving reads only slightly longer than 30 bases. On the other hand, you should expect some false positive matches at the lower settings.

   -vector_bound 0 (for phrap; 0 is already the default for cross_match)
   -gap1_only or -bandwidth 2. Note that large indels in short reads cannot be detected anyway, at least with the default gap penalties; using a larger bandwidth value with -gap_ext 0 may provide some ability to detect larger indels (at the expense of additional 'noise' alignments). -gap1_only is in general significantly faster than -bandwidth 2, and is recommended with short reads. However, it cannot detect indels of size larger than 1.

   -max_group_size 0 (for phrap assemblies where coverage depth is very uneven, e.g. ESTs)

   -globality 1 (for cross_match; requires that -gap1_only, -spliced_word_gapsize2 and/or -spliced_word_gapsize is also set).
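
The settings listed above can be gathered into a single cross_match invocation. The following is a minimal Python sketch that only assembles the argument list and does not run anything. It assumes the usual `cross_match <query file> <subject file> [options]` calling form; the helper name and the file names `reads.fa` and `genome.fa` are hypothetical placeholders, and the option values are simply the examples suggested above rather than tuned recommendations.

```python
import shlex


def short_read_cross_match_args(query, subject, minscore=20, globality=False):
    """Assemble a cross_match argument list using the short-read settings above."""
    args = [
        "cross_match", query, subject,
        "-minscore", str(minscore),  # lower than the default of 30
        "-gap1_only",                # faster than -bandwidth 2; indels of size 1 only
    ]
    if globality:
        # -globality 1 requires -gap1_only (set above) or a spliced-word option
        args += ["-globality", "1"]
    return args


if __name__ == "__main__":
    # Print the command line for inspection (hypothetical file names).
    print(shlex.join(short_read_cross_match_args("reads.fa", "genome.fa")))
```

-vector_bound 0 is left out here because, as noted above, it is already the default for cross_match.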

Note that with cross_match, using -masklevel 101 to report all matches can dramatically increase running times and memory requirements in cases where some queries match high-copy number repeats; it is generally preferable to use lower values (e.g. the default, or -masklevel 0), together with an appropriate value of -minmargin to control the score range of reported matches. Using -minmargin 1 or higher reduces the number of stored alignments, which can reduce running time and memory requirements. Note also that a histogram for each query, indicating all match scores meeting the -minscore threshold, can be obtained by setting -score_hist.

Limiting reported alignments to those with high-confidence placements can be achieved by using -minmargin with positive values.

If it is important to detect regions of highly biassed composition, you may want to turn off complexity-adjustment of scores by setting the command line parameters -raw and/or -word_raw. In general I do not recommend using -raw, since it tends to greatly reduce specificity (increase the number of false positive matches) at a given score level; one can usually do a better job of increasing sensitivity to detect biassed composition regions while maintaining reasonable specificity simply by lowering -minscore and setting -word_raw. Note however that using -word_raw can incur a significant speed penalty. (Reducing -maxmatch, without setting -word_raw, has the effect of 'partially' removing word complexity adjustment, which can provide a useful compromise for the speed/sensitivity tradeoff in some cases.) Since exons are generally less likely to have highly biassed composition regions, it is usually preferable not to use -raw or -word_raw in RNASeq (cDNA or EST) searches.


Reducing the value of -minmatch will also increase sensitivity, as will reducing -gap1_minscore (when -gap1_only is not already set). Note that this may substantially increase running time.

In general, appropriate parameter settings for your analyses will depend on the characteristics of your data (e.g. genome complexity, read error rates) as well as your computer resources. I recommend experimenting on a subset of your data before committing to very long runs. For resequencing applications, a useful parameter for this purpose is -output_nonmatching_queries. For example, you can start by using the default parameter setting for -minmatch (but using adjustments for the other parameters as indicated above), and then try turning up the sensitivity using the nonmatching queries to see whether this recovers substantially more matches.
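
One simple way to follow the advice about experimenting on a subset is to copy the first few thousand reads into a smaller query file. The following is a minimal Python sketch, assuming the reads are in FASTA format with one '>' header line per record; the function name and any file names passed to it are hypothetical.

```python
def write_fasta_subset(in_path, out_path, max_records):
    """Copy the first max_records FASTA records from in_path to out_path.

    Returns the number of records written.
    """
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if written == max_records:
                    break  # enough records copied; stop reading
                written += 1
            if written:  # skip anything that precedes the first header
                dst.write(line)
    return written
```

Taking the first records is the simplest choice; if the read file is ordered in some systematic way, a random sample may be more representative.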

7. "Resequencing" applications.

A useful analysis mode is to run cross_match comparing a large set of reads (e.g. 2 million or so Solexa reads) in a single query file to a genomic sequence comprising the subject file(s), setting the parameters -discrep_lists and -output_nonmatching_queries along with any others that may be appropriate (e.g. those for short reads above). High quality discrepancies in the discrepancy lists will include substitution and small indel differences between the resequenced and original genomes. Larger scale differences, and contaminants, can often be identified by performing a phrap assembly of the nonmatching query reads, and comparing the contig sequences back to the original genome (using more sensitive parameter settings if desired) and/or to the nucleotide databases using the NCBI Blast server. The bcdsites parameters can be used to produce an output file indicating positions in the reference genome that are either confirmed by or have high-quality discrepancies with the input reads.
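
The two-stage analysis described above can be outlined as a pair of command lines. The Python sketch below only builds the argument lists and does not run anything. It assumes the usual `cross_match <query> <subject> [options]` and `phrap <reads> [options]` calling forms. All file names are hypothetical, and in particular the path for the nonmatching reads is a placeholder to be replaced with whatever file your cross_match run actually produces.

```python
def resequencing_commands(reads, genome, nonmatching_reads, extra_options=()):
    """Return (cross_match_args, phrap_args) for the two-stage analysis.

    nonmatching_reads is a placeholder path for the reads that cross_match
    reports as nonmatching; it is an assumption, not a documented file name.
    """
    cross_match_args = [
        "cross_match", reads, genome,
        "-discrep_lists",               # request the discrepancy lists
        "-output_nonmatching_queries",  # keep the reads that found no match
        *extra_options,                 # e.g. the short-read settings
    ]
    # Second stage: assemble the leftover reads so that the contigs can be
    # compared back to the genome or searched against nucleotide databases.
    phrap_args = ["phrap", nonmatching_reads, "-vector_bound", "0"]
    return cross_match_args, phrap_args
```

The -vector_bound 0 setting on the phrap step follows the short-read recommendation given earlier.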

When the option -alignments is used to display the aligned sequences, there is currently a speed advantage to having the subject (genomic) sequences split among multiple subject files (e.g. one file per chromosome) rather than all included in a single file.
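
If the reference is currently a single multi-sequence FASTA file, it can be split into one file per sequence before running with -alignments. The following is a minimal Python sketch, assuming FASTA input whose header lines start with '>' and whose first whitespace-delimited header token is usable as a file name; the function name and the `.fa` suffix are hypothetical choices.

```python
import os


def split_fasta_per_sequence(in_path, out_dir):
    """Write each FASTA record in in_path to its own file in out_dir.

    Returns the list of files written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    dst = None
    try:
        with open(in_path) as src:
            for line in src:
                if line.startswith(">"):
                    if dst:
                        dst.close()
                    name = line[1:].split()[0]  # first token of the header
                    paths.append(os.path.join(out_dir, name + ".fa"))
                    dst = open(paths[-1], "w")
                if dst:  # ignore anything before the first header
                    dst.write(line)
    finally:
        if dst:
            dst.close()
    return paths
```

The returned paths can then be supplied as the subject files in place of the single combined file.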
