Phrap diff

From CSBLwiki

Revision as of 09:17, 9 July 2010 by Csbl (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

--- ../phrap/phrap.txt 2010-07-09 18:06:36.000000000 +0900 +++ phrap.txt 2010-07-09 18:06:04.000000000 +0900 @@ -1,5 +1,5 @@

/*****************************************************************************

-# Copyright (C) 1994-1999 by Phil Green. +# Copyright (C) 1994-2008 by Phil Green.

#   All rights reserved.                           
#                                                                           
#   This software is part of a beta-test version of the swat/cross_match/phrap 

@@ -23,7 +23,7 @@

#*****************************************************************************/

 

-DOCUMENTATION FOR PHRAP AND CROSS_MATCH (VERSION 0.990319) +DOCUMENTATION FOR PHRAP AND CROSS_MATCH (VERSION 1.090518)

phrap ("phragment assembly program", or "phil's revised assembly
program"; a homonym of "frappe" = French for "swat") -- a

@@ -34,7 +34,7 @@

repeats; constructs contig sequence as a mosaic of the highest quality
parts of reads (rather than a consensus); provides extensive information
about assembly (including quality values for contig sequence) to

-assist trouble-shooting; able to handle very large datasets. +assist trouble-shooting; able to handle moderately large datasets.

 N.B. phrap does not provide editing or viewing capabilities; these
are available with consed and phrapview. It is strongly recommended
that phrap be used in conjunction with the base calls and base quality

@@ -44,13 +44,17 @@

cross_match -- a general-purpose utility (based on a "banded" version
of SWAT, an efficient implementation of the Smith-Waterman algorithm)

-for comparing any two sets of (long or short) DNA sequences. For -example, it can be used to compare a set of reads to a set of vector -sequences and produce vector-masked versions of the reads; a set of -cDNA sequences to a set of cosmids; contig sequences found by two -alternative assembly procedures (e.g. phrap and xbap) to each other; -or phrap contigs to the final edited cosmid sequence. It is slower but -more sensitive (because it allows gaps) than BLASTN. +for comparing sets of (long or short) DNA or protein sequences. For +example, it can be used to compare DNA sequencing reads to a set of +vector sequences and produce vector-masked versions of the reads; +reads to contig or genome sequences; cDNA or EST sequences to genome +sequences; contig sequences found by two alternative assembly +procedures (e.g. phrap and gap) to each other; phrap contigs to final +edited clone or genome sequences; 'merged base reads' arising from +mixed signal DNA sequencing traces to genome sequences. Cross_match +has recently been enhanced to allow comparison of large numbers of +reads generated by high throughput methods to a reference genome, for +RNASeq, ChIPSeq, or resequencing applications.


CONTENTS

@@ -86,39 +90,49 @@

VII. CROSS_MATCH OUTPUT

-VIII. EST ASSEMBLIES (TO BE ADDED) +VIII. SPECIAL CONSIDERATIONS/PARAMETER MODIFICATIONS FOR PARTICULAR DATA TYPES + 1. (TO BE ADDED) Shotgun assemblies + 2. (TO BE ADDED) Whole genome assemblies + 3. (TO BE ADDED) Assemblies of polymorphic reads from a single locus + 4. (TO BE ADDED) EST assemblies + 5. Comparisons of ESTs/cDNAs to genome. + 6. Short read analyses + 7. "Resequencing" applications + 8. (TO BE ADDED) Merged base reads

IX. PROBLEMS
 1. Insufficient memory

- 2. Other phrap- or cross_match-generated error messages (TO BE ADDED) - 3. Phrap- or cross_match-generated warning messages (TO BE ADDED) + 2. (TO BE ADDED) Other phrap- or cross_match-generated error messages + 3. (TO BE ADDED) Phrap- or cross_match-generated warning messages

 4. "Crashes" reported by operating system 
 5. Long running time

- 6. Misassemblies, incomplete assemblies, incorrect consensus sequence (TO BE ADDED) - 7. Reporting problems + 6. (TO BE ADDED) Misassemblies, incomplete assemblies, incorrect consensus sequence + 7. How to report problems


-APPENDIX: ALGORITHMS +APPENDIX: ALGORITHMS (TO BE ADDED)


I. INSTALLATION

The source code for the swat/cross_match/phrap package will be sent to

-you in the form an email message containing a uuencoded .tar.Z file; -you will need to have access to a Unix system for the initial -unpacking, but once you've uudecoded it and unpacked the .tar file -(steps i and ii below), you should be able to compile the programs -on computers running other operating systems -- they should be portable -to almost anything with a decent C compiler and adequate memory (64 Mb -RAM or more is desirable). Here are the steps needed to unpack and -install the programs: +you in the form of an email message containing a uuencoded .tar.Z (or +.tar.gz) file; you may need to have access to a Unix system for the +initial unpacking, but once you've uudecoded it and unpacked the .tar +file (steps i and ii below), you should be able to compile the +programs on computers running other operating systems -- they should +be portable to almost anything with a decent C compiler and adequate +memory (512 Mb RAM or more is desirable). + +Here are the steps needed to unpack and install the programs:

i. Save the email message as a file (for example, "temp.mail"). If
possible, do this using the Unix mail command, rather than another
mail program -- some mail programs (e.g. Pine) remove trailing spaces
on each line of incoming messages, which will corrupt a uuencoded

-message. +message. If you cannot use Unix mail, try to avoid opening the message +before you save it.

 Do not attempt to modify the saved mail message in any way. That is
unnecessary and may corrupt the message.

@@ -130,7 +144,9 @@

> zcat distrib.tar.Z | tar xvf -

-If either of these commands results in an error message, it is likely that +[OR > zcat distrib.tar.gz | tar xvf - ] + +If either command results in an error message, it is likely that

the email message was corrupted by your mail program (see step i above).

iii. To produce working versions of the programs, move (if necessary)

@@ -154,21 +170,20 @@

Then remove all files ending in .o produced by the original make, and
recompile.

-iv. If you have datasets with more than 64,000 reads, or that include -sequences longer than 64,000 bp, you will need the .manyreads and/or -.longreads versions of phrap and cross_match. These are created using the -command +Other warning messages (as opposed to error messages!) that may be +produced by the compiler can in general be ignored.

-> make manyreads +(N.B. TO USERS OF PREVIOUS VERSIONS: There are no longer separate +.manyreads OR .longreads variants of the programs -- the default +versions of cross_match and phrap automatically handle large numbers +of reads, and long sequences).

-v. If you are operating a non-commercial (academic or government) +iv. If you are operating a non-commercial (academic or government)

computer facility which provides access to several independent
investigators, you are required by the licensing agreement to set the
permissions on the executables and source code to allow execute but
not read access, so that the programs may not be copied.

- -

II. RUNNING PHRAP & CROSS_MATCH

1.  Phrap. 

@@ -182,7 +197,8 @@

information to consed (e.g. tagging of repeats).  It will accept any
of the phrap command line options described below (section IV).
PhredPhrap and its documentation are included in the consed

-distribution. +distribution (available from David Gordon). +

  Although phredPhrap automates most of the steps in the assembly, you
will still need to do several things: (a) set up the directory
structure that is assumed by phredPhrap and make sure your

@@ -197,30 +213,36 @@

 ii. The following instructions pertain to "standalone" operation of
phrap.   Before the run, you may want to increase the amount of memory
available to phrap: see the instructions for using the Unix

-"limit" command under IX.1, "Insufficient memory" below. +"limit" command below under IX.1, "Insufficient memory".

The command line should be of the form

> phrap seq_file1 [seq_file2 ...] [-option value] [-option value] ...  

where each seq_file is the name of a sequence file in FASTA format as

-described below in III.2. Available options are described in section IV -below. The standard output is extensive and should be redirected to a -file (phrap.out in the following example). +described below in III.2. (The file name may include a directory/path +specification). Available options are described in section IV +below. [The command line format is actually somewhat more flexible +than indicated above, i.e. option/value pairs can be interspersed with +file names]. + +The standard output from this command is extensive and should be +redirected to a file (phrap.out in the following example).

Example: 

> phrap C05D11.reads.screen -minmatch 20 -new_ace > phrap.out 

Note that the quality file is not specified on the command line, but

-instead must be named appropriately (as described below in III.3) to be -recognized by phrap; in this example its name would need to be -C05D11.reads.screen.qual. If more than one seq_file is provided, -then reads in the first file are assembled only against sequences in -the 2d (and later) files; no sequences are compared to other sequences -in the same file. [N.B. THE ABILITY TO USE MULTIPLE SEQUENCE FILES IS -NOT AVAILABLE AT PRESENT -- YOU SHOULD PROVIDE ONLY A SINGLE SEQUENCE -FILE TO PHRAP]. In this case, many features of phrap (e.g. checking -for anomalies, vector, etc.) which rely on comprehensive read-read +instead must be named appropriately (as described below in III.3) to +be recognized by phrap; in this example its name would need to be +C05D11.reads.screen.qual. If more than one seq_file is provided, then +reads in the first file are assembled only against sequences in the 2d +(and later) files; no sequences are compared to other sequences in the +same file. [N.B. THE ABILITY TO USE MULTIPLE SEQUENCE FILES IS NOT +AVAILABLE AT PRESENT WITH PHRAP -- YOU SHOULD PROVIDE ONLY A SINGLE +SEQUENCE FILE. HOWEVER MULTIPLE FILES ARE ALLOWED WITH +CROSS_MATCH]. In this case, many features of phrap (e.g. checking for +anomalies, vector, etc.) which rely on comprehensive read-read

comparisons are turned off, SWAT scores are used in place of
LLR_scores, and implied merges are not performed -- i.e. each read in
the first file is merged with at most one sequence in the 2d

@@ -236,30 +258,40 @@

> cross_match seq_file1 [seq_file2 ...] [-option value] [-option value] ...  

where each seq_file is the name of a sequence file in FASTA format as

-described below in III.2. Available options are described in section -IV below. The standard output should generally be redirected to a -file. If there are at least two files, all sequences in the first -file are compared to those in the second (and any subsequent) files; -if there is only one file, all sequences in this file are compared to -each other (N.B. at present entries are not compared to their own -complements, but are compared to themselves in the same orientation, -and to all other entries in both orientations). Matches meeting -relevant criteria (see section IV) are written to the standard output. +described below in III.2. [The command line format is actually +somewhat more flexible than indicated above, i.e. option/value pairs +can be interspersed with file names. However the relative order of +the sequence files is important inasmuch as the first file (the 'query +file') is treated differently from the others ('subject files')]. (The +file name may include a directory/path specification). Available +options are described in section IV below. If the command line +specifies more than one file, all sequences in the first file are +compared to those in the second (and any subsequent) files. If there +is only one file, all sequences in it are compared to each other +(N.B. in the single-file case, at present entries are not compared to +their own complements, but are compared to themselves in the same +orientation (excluding the trivial identity match), and to all other +entries in both orientations). Matches meeting relevant criteria (see +section IV) are written to the standard output.

+The standard output should generally be redirected to a file.

III. INPUT FILES

Input files to cross_match and phrap are of two types: sequence files
(required), and quality files (optional, but strongly recommended for

-phrap). If you are doing phrap assemblies of reads generated by the +phrap). With cross_match, the query file can optionally be a CALF +formatted file of unaligned reads (see http://www.phrap.org/phredphrap/calf.pdf +for a description of this format), which must have a name ending in +".calf". If you are doing phrap assemblies of reads generated by the

basecalling program phred, then it is recommended that you generate
these input files automatically using the phd2fasta or phredPhrap

-scripts distributed with phred and consed (see II.1.i above). -Even in this case however you should read the following section on the -read naming convention, since you will either need to make sure that -the read names attached to your chromatogram files adhere to the -naming convention, or you will need to modify the phredPhrap script -(or create your own script) so that correct template names and read +scripts distributed with phred and consed (see II.1.i above). Even in +this case however you should read the following section on the read +naming convention, since you will either need to make sure that the +read names attached to your chromatogram files adhere to the naming +convention, or you will need to modify the phredPhrap script (or +create your own script) so that correct template names and read

orientation information is appended to the .phd files after they are
produced by phred.

@@ -268,17 +300,20 @@

 In addition to the sequence and quality data, phrap needs to know
three things for each read: 

+

 (i) the name of the subclone (or other template) from which the read
is derived; this is used, for example, in checking for chimeric
subclones (for which purpose it is necessary to know when two reads
are from the same subclone so that they are not regarded as
independently confirming each other) and for certain other data
anomalies;

+

 (ii) the orientation of the read (forward or reverse) within the
subclone, in cases where data is acquired from each end of a subclone
insert (at present this information is used only for consistency
checking following assembly, and not in the assembly itself, but this
will change in future versions);

+

 (iii) the chemistry used to generate the read (which influences
phrap's decisions as to how to treat discrepancies between potentially
overlapping reads, and as to how to adjust qualities of reads which

@@ -293,6 +328,7 @@

 (i) provide the relevant information in the description field of the
.fasta file instead (see below), or (if you are using the phd2Fasta
script) as tags in the .phd files for each read; 

+

 (ii) see whether appropriate values of the command line options
-subclone_delim and -n_delim (described below) can be used to bring
your own naming convention into conformity with the St. Louis

@@ -301,14 +337,17 @@

subclone name occurs at the first (or n-th for some fixed value of n)
occurrence of a particular character, then you may be able to use
these options without having to change your read names;

+

 (iii) create a script which translates your read names into St. Louis
form (this does not mean that you cannot retain the original names as
well -- for example, for purposes of running phrap you could create a
directory of "links" to your chromatogram files, such that the link
names adhere to the naming convention, and then run the phredPhrap
script on this directory);

+

 (iv) alter the subroutines in the source module "names.c" in order to
parse your read names correctly; or

+

 (v) ignore the issue and hope that it does not cause problems. This
may have unpredictable adverse effects on assembly and is NOT
recommended.

@@ -323,6 +362,7 @@

and/or the option -n_delim may be used to indicate that the subclone
name contains a fixed number of occurrences of this character within
it.   See section IV below for a description of command line options.

+

  The orientation of the read within the subclone, and the chemistry,
are indicated by the first letter following the '.', as follows:

@@ -360,53 +400,87 @@

phrap is used with a single input sequence file containing the reads,
unless one wants to assemble one set of reads "against" another
sequence or set of sequences (see below).  Cross_match is typically

-used with two input sequence files, the first containing the "query" -sequences and the second containing the "subject" sequences, but it -may also be used to compare the sequences in a single input file to -each other. - - Phrap is designed to be able to use all of each read sequence in the -assembly, not just the trimmed (highest quality) part, so the full -sequence of each read should be provided when available. If -phred-generated qualities are not available however then the default -quality parameter -default_qual should be set appropriately (section -IV). +used with two or more input sequence files, the first containing the +"query" sequences and the remaining files containing the "subject" +sequences, but it may also be used to compare the sequences in a +single input file to each other. + + Cross_match may be used with either DNA or protein sequences (mixed +types are not allowed, i.e. all input sequences must be DNA, or all +must be protein). In the case of DNA sequences, the query file (but +not subject files) may include 'merged base' DNA sequences produced by +phred from mixed signal DNA sequencing traces (see below). + + Phrap is used only with ordinary DNA sequences as input. It is +designed to be able to use all of each read sequence in the assembly, +not just the trimmed (highest quality) part, so the full sequence of +each read should be provided when available. If phred-generated +qualities are not available however then the default quality parameter +-default_qual should be set appropriately (section IV).

Sequence files must be in FASTA format: 
  (i) Each sequence entry has a single header line as follows: The
first character is '>'. This is followed immediately (i.e. with no
intervening spaces or other characters) by the sequence name, and then

-(optionally) by a space or spaces, followed by descriptive information -about the sequence. +(optionally) by a space or spaces, followed by (on the same line) +descriptive information about the sequence.

  (ii) The header line is followed by a separate line or lines

-containing the sequence. +containing the sequence. The sequence may all be placed on a single line, +or split among several lines (of arbitrary length). + +Very large amounts of input data are permitted: Individual sequences +must be 2 billion characters or less (the same limit applies to the +name and descriptive information for each sequence). The combined +total size of all query sequences is, by default, limited to about 1 +billion characters; however, this can be increased by changing the +line + + typedef int SEQ_AREA; + +in the header file swat.h to either + + typedef unsigned int SEQ_AREA; + +(which will increase the limit to 2 billion characters), or + + typedef long int SEQ_AREA; + +(which on most computers will increase it to about 10^18 -- at the +cost of increased memory and running time), and recompiling. There is +no combined total length constraint for subject sequences. The 2 +billion residue length constraint for individual sequences (both query +and subject) cannot be increased.

-The sequence names, descriptive information, and sequences can be of -arbitrary length (up to a combined total size for each type of -about 2 billion characters), since space for them is allocated -dynamically. If there are more than 64,000 entries in the first -(query) file, or if any of the sequences are longer than about 64,000 -characters, the .manyreads version of phrap or cross_match must be -used.

Descriptive information on the header line may optionally be used to
indicate template (subclone), read orientation, read chemistry and

-dye. If present, this information overrides whatever would be implied -by the read name. The template name is indicated by including the word -"TEMPLATE:", followed by a space, followed by the name of the -template. Orientation is indicated by the word "DIRECTION:", followed -by the word "fwd" or "rev". Chemistry is indicated by the word -"CHEM:", followed by the word "prim" (for dye primer), "term" for dye -terminator, or "unknown". Dye is indicated by the word "DYE:" followed -by "rhod", "big", "ET", "d-rhod", or "unknown". The chemistry and dye -information is provided automatically by newer versions of phred -(version 0.980904.a or later), which obtain it from the chromatogram -file. (Use the script phd2fasta distributed with phred and consed to -create the .fasta file from the .phd files created by phred). +dye, and repeats. This information is written back out to the .ace +and .singlets files. If present, this information overrides whatever +would be implied by the read name. The template name is indicated by +including the word "TEMPLATE:", followed by a space, followed by the +name of the template. Orientation is indicated by the word +"DIRECTION:", followed by the word "fwd" or "rev". Chemistry is +indicated by the word "CHEM:", followed by the word "prim" (for dye +primer), "term" for dye terminator, or "unknown". Dye is indicated by +the word "DYE:" followed by "rhod", "big", "ET", "d-rhod", or +"unknown". The chemistry and dye information is provided +automatically by newer versions of phred (version 0.980904.a or +later), which obtain it from the chromatogram file. (Use the script +phd2fasta distributed with phred and consed to create the .fasta file +from the .phd files created by phred). + + In the query file, repeated sequence may be indicated by the word +"REPEAT:" followed by a pair of integers indicating the start and end +of the repeat (these should be separated by spaces). There can be +multiple "REPEAT:" tags for a single sequence. If the parameter +-repeat_screen (see section IV.2 below) is set to 1 or 3, the sequence +within any such repeat is ignored for the purpose of finding word +matches between sequences. This can substantially improve the speed +and (for phrap) accuracy of assembly of repeat-rich regions.

Example:

->read#5 DIRECTION: rev CHEM: prim DYE: ET TEMPLATE: a23f1 any_other_information +>read#5 DIRECTION: rev CHEM: prim DYE: ET TEMPLATE: a23f1 REPEAT: 151 237 REPEAT: 305 422 any_other_information

GAAAGATCTCATTGATCACTCTATTCAAGTGGGAGTCTCCGGTCTTT
ATGATCGATTTGTGAATCTTCGTATCAAAGTTGGAGCTGACAAGTATCCA
TTGCTTGCGAAATGGGCTCAAATTTTCACTCAGGGAGTCGTCTTCGATCC

@@ -414,19 +488,57 @@

GGAAGAGAGATGGATGTACTGTGGCATCAACTGCAGTAGCTTCAATGGTT
TATGGAAAGAGTATGTTATTt

+In subject file(s), repeats are instead indicated by using lower-case +letters for the residues.

-All descriptive information in the header line is written back out to -the .ace and .singlets files. - -Non-alphabetic characters (including '*', and digits) in the sequence -lines are automatically stripped out when the file is read in, except -for '-' which is converted to 'N'; '>' must not appear anywhere within -the sequence since it is assumed to start the header of a new sequence -(even if it is not at the beginning of a line). Lower case letters are -converted to upper case on readin, so it is not possible (as it was in -old phrap versions) to use case as a crude indication of quality. - - +Non-alphabetic characters (including '*', and digits) in the +sequence lines are automatically stripped out when the file is read +in, except for '-' and '.' in DNA files, which are both converted to +'N'; '>' must not appear anywhere within the sequence since it is +assumed to start the header of a new sequence (even if it is not at +the beginning of a line). + +Lower case letters are converted to upper case on readin, with +the following important exceptions: + + (i) if -repeat_screen (see section IV.2 below) is set to 2 or 3, +segments of subject file sequences consisting entirely of lower case +letters are assumed to represent repeats, and nucleating perfect matches +falling entirely within them (or spliced matches with either splice +junction falling within them) are ignored; + + (ii) the query file may include 'merged base' DNA sequence reads +produced by phred from mixed signal DNA sequencing traces. Such reads +indicate at each position the two strongest peaks detected in the +signal, and use standard ambiguity codes together with upper and lower +case to indicate these peaks, as follows: + +Merged base read symbols: + +A,C,G,T (single detected peak) + + stronger peak weaker peak +M A C +m C A +K G T +k T G +R A G +r G A +Y C T +y T C +S C G +s G C +W A T +w T A + +N no information + +For effective searches with merged base reads an appropriate score +matrix should be provided using the -matrix option (see IV.1 below); +the matrix "mb_matrix" provided with the distribution has been found +to be useful for this purpose. Note that with this score matrix +the default values of -minscore, -near_minscore, -gap1_dropoff and +-gap1_minscore are too small, and should be changed on the command line.

3. Quality files.  

@@ -438,18 +550,20 @@

tabulating discrepancies and not in the alignment scoring. The name of
this file must consist of the name of the FASTA file, with ".qual"
appended.

+

 The format of the .qual file is similar to that of the corresponding
FASTA file. For each read there should appear a header line identical
(except possibly for the "description" field) to that in the FASTA
file.  This is followed by one or more lines giving the qualities for

-each base. Quality values should be integers between 0 and 99 -(inclusive), and should be separated by spaces. No other (non-digit, -non-white space) characters should appear. The total number of quality -values for each read must match the number of bases for that read in -the FASTA file. Also, the orders of the reads in the FASTA and -corresponding .qual file must be identical. N's are automatically -assigned quality 0, overriding any value present in the .qual file; -however X's retain whatever value is present in the .qual file. +each base. Quality values should be integers >= 0 and <= 99, and +should be separated by spaces. No other (non-digit, non-white space) +characters should appear. The total number of quality values for each +read must match the number of bases for that read in the FASTA file. +Also, the orders of the reads in the FASTA and corresponding .qual +file must be identical. N's are automatically assigned quality 0, +overriding any value present in the .qual file [??]; however X's +retain whatever value is present in the .qual file. +

  If provided, quality information is used by phrap both in the
assembly itself (where discrepancies between high quality bases are
used to discriminate repeats from true overlaps) and in determining

@@ -463,16 +577,17 @@

effects on quality (e.g. compressions) -- in such cases the input
quality information may constitute the only basis for choosing one
strand over the other.

+

  If your sequencing data is from automated sequencing machines it is
strongly recommended that you generate the read sequences using the
basecalling program phred, which will produce an appropriate quality

-file for input to phrap. Phred generally gives more accurate base -calls overall, for a longer distance, than the ABI software does. On -the basis of the trace characteristics, phred computes a probability p -of an error in the base call at each position, and converts this to a -quality value q using the transformation q = -10 log_10(p). Thus a -quality of 30 corresponds to an error probability of 1 / 1000, a -quality of of 40 to an error probability of 1 / 10000, etc. +file for input to phrap. On the basis of the trace characteristics, +phred computes a probability p of an error in the base call at each +position, and converts this to a quality value q using the +transformation q = -10 log_10(p). Thus a quality of 30 corresponds to +an error probability of 1 / 1000, a quality of 40 to an error +probability of 1 / 10000, etc. +

  The quality value 99 is reserved for base calls that have been
visually inspected and verified as "highly accurate" (during editing),
and 98 for bases that have been edited but are not highly accurate

@@ -480,15 +595,16 @@

match but have discrepancies involving bases that have quality 99 in
both reads, phrap does not allow them to be merged during assembly.
At present this is the only way to break false joins made by phrap.

+

  If you wish to generate your own quality values you should be
prepared to experiment. It is particularly important that possible

-insertion errors (which are fairly frequent late in the trace, with ABI -basecalls) receive low quality, because in choosing the "consensus" +insertion errors receive low quality, because in choosing the "consensus"

phrap tends to favor the reads with the highest total quality in a
given region. Also, if at all possible you should use quality values
that are defined in terms of error probabilities in the manner
described above, since to some extent phrap is tuned to expect these
(and its output qualities will then have a similar interpretation).

+

 If all input quality values are relatively small (less than 15),
phrap assumes that they do not correspond to error probabilities and
attempts to rescale them so that the largest quality value is

@@ -545,7 +661,7 @@

IV. COMMAND LINE OPTIONS

- This section describes all command line options available with + This section describes command line options available with

cross_match and phrap, grouped by category. Each option name is
followed immediately by the "default value" assumed by the program,
and a brief description.  A '*' appearing as the default value

@@ -564,13 +680,14 @@

 The following options control the scoring of pairwise sequence
alignments. By default, matching residues receive a reward of +1,
mismatches get a penalty of -2, gap opening residues a penalty of -4

-(= -2 - 2), and gap extension residues a penalty of -3 (= -2 - 1). The -highest scoring word-nucleated local alignment between a pair of -sequences is found, and its score is then complexity-adjusted -(i.e. scores of matches between sequence regions of biassed nucleotide -composition are adjusted downwards). These parameters are chosen such -that regions of two sequences that are about 70% or more identical -will tend to have a positive alignment score. For more stringent +(= -2 - 2), and gap extension residues a penalty of -3 (= -2 - 1). +Word-nucleated high-scoring local alignments between sequences are +found by a modified Smith-Waterman approach (see next section), and +their scores are then complexity-adjusted (so matches between sequence +regions of highly biassed nucleotide composition have their scores +adjusted downwards). The above parameters are chosen such that +regions of two sequences that are about 70% or more identical will +tend to have a positive alignment score. For more stringent

comparisons, -penalty should be made more negative (e.g. a value of -9
will tend to find alignments that are 90% identical), and/or -minscore
should be increased; for more sensitive comparisons, a score matrix

@@ -579,117 +696,358 @@

option name & default value 

 -penalty -2           

-Mismatch (substitution) penalty for SWAT comparisons. +Mismatch (substitution) penalty for scoring alignment

 -gap_init penalty-2   

-Gap initiation penalty for SWAT comparisons. +Gap initiation penalty

 -gap_ext  penalty-1   

-Gap extension penalty for SWAT comparisons. +Gap extension penalty. Ignored when -gap1_only is set, since gaps +in that case may only have size 1.

 -ins_gap_ext gap_ext 

- Insertion gap extension penalty for SWAT comparisons (insertion in + Insertion gap extension penalty (insertion in

subject relative to query).

 -del_gap_ext gap_ext

- Deletion gap extension penalty for SWAT comparisons (deletion in + Deletion gap extension penalty (deletion in

subject relative to query).

 -matrix [None]        

- Score matrix for SWAT comparisons (if present, supersedes -penalty) -Matrix format: (TO BE ADDED) + Score matrix (if present, supersedes -penalty). Note that when a +customized score matrix is used it is usually desirable to change the +default score thresholds for -minscore, -near_minscore, -gap1_dropoff +-gap1_minscore. Matrix format: (Also see examples included +in distribution). + + Leading whitespace on each line is stripped off (& blank lines are ignored). Then + +Lines beginning with '#' are ignored (& so may include comments). + +(optional) A line beginning with 'GAP' is assumed to have the (integer) gap_init +and gap_ext penalties. + +(optional) A line beginning with 'FREQS' is assumed to have (floating +point) residue frequencies (used in complexity adjustment). + +A line beginning 'MERGED BASE DNA' indicates that the matrix is to be +used in analyzing merged base reads (and is required in this case). + +The first line not beginning as above is assumed to contain the +residue column labels for the matrix (as its non-whitespace +characters). + +Any remaining lines are assumed to contain the rows of the score +matrix (optionally preceded by an alphabetic character giving the row +label as the residue name; if absent, it is assumed that the row +labels are the same as the column labels). Score values must be +integers. + + -qual_scores + (cross_match only, & there must be separate query & subject +files). Compute quality-based alignment scores of DNA sequence reads +against a known reference sequence. The score can be interpreted as +(roughly) -10log_10(p), where p is the posterior probability that the +read derives either from an unknown 'contaminant' source (i.e. noise) +or from a different reference sequence location than the indicated +one, given the read's quality values and alignments and assuming a +prior probability of .5 of originating from the contaminant. Apart +from allowing a non-zero contaminant probability, this is essentially +MAQ's mapping quality score (Li, Ruan & Durbin, 2008). Qual_scores are +determined using an (internally created) score matrix that takes into +account query base qualities: briefly, matching bases score 6, and +mismatches score 6 - (q + 5) where q is the base quality. [Here 6 and +5 are approximations to -10log_10(1/4) and -10log_10(1/3) +respectively]. The alignment score computed using this matrix is then +reduced by 10log_10(2G) (where G = reference genome size) to reflect +the prior probability for this specific reference location, and +adjusted further to take into account qualities at matching bases in +the alignment; it may also be reduced if complexity adjustment is on +(i.e. -raw is not set), to take into account any deviation from .25 of +nucleotide frequencies in the matching genomic region. Finally it is +reduced by the score of the highest scoring alternative alignment (if +any) for that read. Alignments with scores > 0 are reported, +consistent with the value of -minmargin. + User-specified matrices are not currently allowed with this +option. The -compact_qual and -score_hist options are turned on, and +the values of -minscore, -gap_init, and -gap_ext, -gap1_minscore, +-gap1_dropoff, and -near_minscore are reset (-minscore is set to +0). -masklevel is set to 0. For statistical validity, ideally +-globality should be set to 3 (global alignments), although this is +currently only possible when -gap1_only is also set. Scores reported +in the score histogram do not reflect adjustment for quality at +matching bases and have not been reduced by the score of an +alternative alignment.

 -raw *                

- Use raw rather than complexity-adjusted Smith-Waterman scores. + Use raw rather than complexity-adjusted Smith-Waterman scores.


-2. Banded search +2. Banded/gap1 search

- The following options control the region of the Smith-Waterman matrix -that is searched. In phrap and cross_match (as opposed to swat, which -performs full Smith-Waterman comparisons), word-nucleated banded -Smith-Waterman comparisons are performed: the set of input sequences -is scanned to find pairs of perfectly matching subsequences ("word -matches") meeting certain criteria, and for each such word match a -band in the Smith-Waterman matrix that is centered on the diagonal -defined by the match is searched to find an optimal-scoring alignment. -If there are multiple matching words for a given pair of sequences, -the union of the corresponding bands is searched. The search is -"recursive" in the sense that if an optimal-scoring alignment is found -whose score exceeds -minscore the remainder of the band is searched -again; as a result multiple alignments may be found within a single -band. - The word matches that are considered are "maximal" in the sense that -they cannot be extended in either direction without introducing a -discrepancy in one of the two sequences. If the length of the word -match is at least -maxmatch, it is accepted; otherwise if the -complexity-adjusted length is less than -minmatch, or if the matching -sequence occurs more than -max_group_size times on the forward strands -of the query file, it is rejected. The latter criterion effectively -downweights frequently occurring words. It can be turned off by -setting -maxmatch equal to -minmatch. - - -minmatch 14 - Minimum length of matching word to nucleate SWAT comparison. if -minmatch = 0, a full (non-banded) comparison is done [N.B. NOT -PERMITTED IN CURRENT VERSION]. Increasing -minmatch can dramatically -decrease the time required for the pairwise sequence comparisons; in -phrap, it also tends to have the effect of increasing assembly -stringency. However it may cause some significant matches to be -missed, and it may increase the risk of incorrect joins in phrap in + The following options control the region of the edit graph (of a +query-subject sequence pair) that is searched. In phrap and +cross_match (as opposed to swat, which performs full Smith-Waterman +comparisons), word-nucleated banded Smith-Waterman comparisons are +performed as follows. First, the set of input sequences is scanned to +find pairs of perfectly matching subsequences ("nucleating perfect matches"), +which are "maximal" in the sense that they cannot be extended in +either direction without introducing a discrepancy in one of the two +sequences, and which meet certain additional filtering criteria +described below. For each such nucleating perfect match, a band in the edit +graph that is centered on the diagonal defined by the match is +searched to find an optimal-scoring alignment. If there are multiple +nucleating perfect matches for a given pair of sequences, the union of the +corresponding bands is searched. The search is "recursive" in the +sense that if an optimal-scoring alignment is found whose score is at +least minscore, the remainder of the band is searched again; +consequently, multiple alignments may be found within a single band. +Note that because an entire band is searched, the nucleating perfect match +itself is not necessarily part of the optimal alignment, although it +usually will be. + +An alternative comparison mode, using "gap1" alignments, is also +available. "Gap1" alignments are alignments that extend the (gapfree) +alignment defined by the nucleating perfect match to the left and to +the right, allowing arbitrarily many residue mismatches (appropriately +penalized), but at most one gap character, in each direction. Such +alignments have a maximum of two gap characters, and are thus more +restrictive than banded Smith-Waterman (bSW) alignments. However they +are significantly faster to find, and for short queries (< 40 bp) will +detect most alignments that are detectable at all by bSW; for longer +queries, the -fuse_gap1 option (see below) will fuse overlapping gap1 +alignments into a longer alignment that may contain more than 2 gaps +(each gap still having a maximum size of one, however). Gap1 +alignments also provide a useful filtering step prior to bSW. In +contrast to banded search, for gap1 alignments the nucleating perfect +match will always be part of the alignment. + + A number of (optional) criteria can be used to filter the set of +nucleating perfect matches that are used. If a maximal nucleating +perfect match falls entirely within an annotated repeat in the query +or repeat sequence it is rejected. Otherwise, if it has length at +least -maxmatch, it is (tentatively) accepted. Otherwise (i.e. if it +is neither accepted nor rejected by those criteria), if the +complexity-adjusted length (or actual length, if -word_raw is +specified) is less than -minmatch, or if the matching sequence occurs +more than -max_group_size times in query file sequences, it is +rejected. (The latter two criteria provide independent mechanisms for +deprecating frequently occurring regions of size less than -maxmatch.) +A final filter is then applied to those nucleating perfect matches not +rejected by the preceding criteria, as follows. The highest scoring +gap1 alignment extending the nucleating perfect match is found, and if +this alignment has (non-complexity-adjusted) score less than +-gap1_minscore, the match is rejected. Nucleating perfect matches not +rejected by any of these criteria are then used to define a banded +Smith-Waterman search as described above (unless the option -gap1_only +is specified). + + -minmatch 14 (for DNA sequences) 4 (for protein sequences) + Minimum length of nucleating perfect match for banded Smith-Waterman or gap1 +comparison. If minmatch = 0, a full (non-banded) comparison is done +[N.B. NOT PERMITTED IN CURRENT VERSION]. Increasing -minmatch can +dramatically decrease the time required for the pairwise sequence +comparisons; in phrap, it also tends to have the effect of increasing +assembly stringency. However it may cause some significant matches to +be missed, and it may increase the risk of incorrect joins in phrap in

certain situations (by causing implied overlaps between reads with
high-quality discrepancies to be missed).

- -maxmatch 30 - Maximum length of matching word. For cross_match, the default value -is equal to minmatch, instead of 30. - - -max_group_size 20 - Group size (query file, forward strand words) + -maxmatch 20 (for DNA sequences) 4 (for protein sequences) + Maximum required length of nucleating perfect match (i.e. non-repeat_screened +nucleating perfect matches at least this long are always used, regardless of +complexity adjustment or max_group_size). Cannot be set larger than +127, or smaller than -minmatch (input value is automatically adjusted +to meet these criteria if necessary). Note that setting maxmatch = +minmatch has the effect of turning off all filtering of nucleating perfect +matches (i.e. both complexity adjustment and max_group_size.) This +will increase sensitivity somewhat, at the expense of significantly +increased running time. + +-word_offset 2 (for ordinary DNA sequences) 1 (for merged base DNA or protein sequences, or if spliced_word_gapsize or spliced_word_gapsize2 is set) + (cross_match only, & there must be separate query & subject +files). Controls which words get indexed. Higher values can greatly +reduce running times, at the cost of (usually slightly) reduced sensitivity. +Phrap/cross_match versions older that 1.080827 implicitly used +-word_offset 1, which provides maximum sensitivity (for a given +-minmatch) but slowest speed. It is not (currently) recommended to +setting -word_offset higher than 1 when looking for spliced alignments +will cause a significant fraction (> 50%) of such alignments to be +missed. + + -max_group_size 10 (for phrap) 0 (for cross_match) + Group size (maximum number of allowed occurrences of the nucleating perfect +region in query file, forward (i.e. uncomplemented) strands). If 0, +group size is ignored -- this is the default for cross_match, and is +the recommended setting for phrap assemblies of very short reads when +coverage depth is expected to be extremely uneven (e.g. Solexa EST +reads).

 -word_raw *           

- Use raw rather than complexity-adjusted word length, in testing -against minmatch (N.B. maxmatch always refer to raw lengths). (The -default is to adjust word length to reflect complexity of matching -sequence). + Use raw rather than complexity-adjusted length for the nucleating perfect +match, in testing against minmatch (N.B. maxmatch always refers to raw +lengths). (Default behavior if -word_raw is not set is to adjust +length to reflect complexity of matching sequence).

 -bandwidth 14        

- 1/2 band width for banded SWAT searches (full width is 2 times -bandwidth + 1). Decreasing bandwidth also decreases running time at -the expense of sensitivity. Phrap assemblies of clones containing long -tandem repeats of a short repeat unit (< 30 bp) may be more accurately -assembled by decreasing -bandwidth; -bandwidth should be set such that -2 bandwidth + 1 is less than the length of a repeat unit. -bandwidth 0 -can be used to find gap-free alignments. + 1/2 band width for banded Smith-Waterman searches (full width is 2 +times bandwidth + 1). Decreasing bandwidth decreases running time, at +the expense of sensitivity to detect large gaps. Phrap assemblies of +clones containing long tandem repeats of a short repeat unit (< 30 bp) +may be more accurately assembled by decreasing -bandwidth; -bandwidth +should be set such that 2 bandwidth + 1 is less than the length of a +repeat unit. Note that -bandwidth 0 can be used to find gap-free +alignments. If read length is short (e.g. with Solexa sequence data) +it is generally pointless to use a large value for -bandwidth because +large indels in these reads cannot be detected anyway (with the usual +gap penalty scoring). -bandwidth is ignored when -gap1_only is specified. + + -repeat_screen 0 + Controls how nucleating perfect matches falling entirely within +repeats are treated. If 0, repeat information is not used in +evaluating nucleating perfect matches; if 1 or 3, nucleating perfect +matches lying entirely within repeat tags in query file are ignored +(i.e. not used to nucleate banded Smith-Waterman or gap1 search); if 2 +or 3, nucleating perfect matches lying entirely within lower case +regions in subject file, or spliced matches for which either splice +junction falls in such a region, are ignored. Note that it is up to +the user to provide the appropriate repeat information in the input +sequence files (see above, section III.2). + + -gap1_minscore 17 + Minimum score (not complexity-adjusted) of 'gap1' extensions of +nucleating perfect match. Can be turned off by setting to 0; however +this is not recommended, because use of -gap1_minscore substantially +lowers running time with very little sacrifice of sensitivity. Should +be adjusted if non-default scoring is used. -gap1_minscore is ignored +when -gap1_only is set. If greater than -minscore, it is reset to +equal -minscore. + +-gap1_only * + (cross_match only). + If set, only gap1 alignments are considered. The parameters -gap_ext, +-bandwidth, -gap1_minscore (which only applies to filtering) and +-near_minscore will then be ignored. If -gap1_only is used with long +queries, it is recommended that -fuse_gap1 also be set. + +-gap1_dropoff -12 + Maximum score dropoff permitted in gap1 alignments, before +terminating search. Should be adjusted if non-default score matrix is used. + +-fuse_gap1 * + Fuse overlapping gap1 alignments. This is only applicable when gap1 +alignments are generated using -gap1_only, -spliced_word_gapsize, and/or +-spliced_word_gapsize2. The fusing is done after -minmargin and +-minscore filtering is applied. This option is generally only useful +with longer sequences. + +-globality 0 + (Cross_match only). Controls extent (with respect to query) of gap1 +alignments. Settings: + 0 local alignments only + 1 semiglobal to left (i.e. alignment forced to extend to left (5') end of query) + 2 semiglobal to right (alignment forced to extend to right (3') end of query) + 3 global (alignment forced to extend to both ends of query) + +This option is currently available only with -gap1_only, +-spliced_word_gapsize, and/or -spliced_word_gapsize2 alignments, and +it does not apply to the gap1 alignments that are used in the +filtering step performed prior to a banded search (which use +-gap1_minscore, and are always local). Only recommended for use with +short reads. -globality 1 increases specificity when the 5' query +bases are expected to be most accurate (as with Solexa data). +-globality 3 (forcing global alignments) is useful for some purposes +(e.g. in analyzing error rates as a function of position or quality in +short reads) but can reduce sensitivity when there is a higher error +rate at the 3' end of the read.

+[DISCUSS NUCLEATING PERFECT MATCHES FOR MERGED BASE READS.]

3. Filtering of matches

- The following options control which matches are displayed (cross_match) + The following options control which matches are reported (cross_match)

or used in assembly (phrap).

 -minscore 30          

- Minimum alignment score. + Minimum alignment score (complexity-adjusted, unless -raw is set (see +IV.1 above)). Should be adjusted if non-default score matrix is used. + + -near_minscore minscore + (cross_match only). Typically used when comparing EST and cDNA query +sequences to a genomic sequence using cross_match, to increase +sensitivity to detect short low-scoring exonic matches that are in the +vicinity of higher-scoring ones. Matches having scores that are at +least -near_minscore but less than -minscore are reported only if they +are within a distance max_intron_length, and in the appropriate order +and orientation, with some match having score >= minscore and +involving the same query and subject sequences. Note that there is no +"-near_minmatch" parameter, so if you want to increase sensitivity at +the nucleating perfect match step you would need to decrease +-minmatch. Should be adjusted if non-default scoring is +used. -near_minscore is turned off when -gap1_only is specified. + +-max_intron_length 10000 + (cross_match only) See description of -near_minscore, above.

- -vector_bound 80 +-vector_bound 80 (for phrap) 0 (for cross_match)

 Number of potential vector bases at beginning of each read.  Matches
that lie entirely within this region are assumed to represent vector

-matches and are ignored. For cross_match, the default value is 0 -instead of 80. - - -masklevel 80 - (cross_match only). A match is reported only if at least (100 - -masklevel)% of the bases in its "domain" (the part of the query that -is aligned) are not contained within the domain of any higher-scoring -match. +matches and are ignored. Note that for assembling very short +(e.g. Solexa) reads, the default value is much too high!! For +cross_match, -vector_bound is only utilized when there are no subject +files, and the default value is 0 instead of 80. + + -masklevel 80 (101 for merged base reads, or when there is no subject file) + (cross_match only). Masklevel controls the grouping of matches +according to their domains (the segment of the query that is aligned), +which is useful when different portions of the query may match +different subject sequence regions (e.g. cDNA queries vs genomic +subject, or chimeric sequencing read queries). Two matches for the +same query are considered to be in the same "query domain group" if at +least masklevel% of the bases in the domain of either one of them is +contained within the domain of the other. Thus for the default value +of 80, matches are assigned to the same group if the domain of either +one is at least 80% contained within the domain of the other.

 Special cases:

- -masklevel 0 report only the single highest scoring match for each query - -masklevel 100 report any match whose domain is not completely contained - within a higher scoring match - -masklevel 101 report all matches - - + -masklevel 0 all matches form a single query domain group + -masklevel 101 all matches form a single query domain group, but +-minmargin is inactivated thus causing every match with score >= +minscore to be reported. + +For merged base reads, -masklevel is always set to 101, because +overlapping matches can correspond to different sets of peaks and thus +be unrelated to each other. + +-minmargin 0.5 + (cross_match only, & there must be separate query & subject +files). The score margin for a match is defined to be the difference s +- best_other_score, where s is the match score, and best_other_score +is the highest score occurring for any other match in the same query +domain group (as defined above under '-masklevel') for that query. +Only matches with score margin at least -minmargin are +reported. Specifically (letting n denote a positive integer): + + -minmargin n report only those highest-scoring matches (for a given query +domain group) for which no other matches have score within n of it + -minmargin 0.5 report a single highest-scoring match for each query +domain group. If there is more than one match with this score, one is +chosen at random; in the case of spliced word matches, this random +choice is made from among those highest-scoring matches having minimal +span in the subject sequence. + -minmargin 0 report all highest-scoring matches in each query domain group + -minmargin -n report any match whose score +is within n of the highest-scoring match in the query domain group + +In particular -minmargin 1 cause the top-scoring match to be reported +only if no other matches have the same score; -minmargin -1 will cause +all matches having a score within 1 of the top scoring match (and at +least minscore) to be reported. + Note that the -minscore filter is also applied. For -minmargin >= +0.5, at most one match per query domain group is reported; for +-minmargin < 0.5, multiple matches per query domain group may be +reported. The value of -minmargin becomes irrelevant when -masklevel is +101.

4. Input data interpretation
  

@@ -724,14 +1082,15 @@

read is not assigned to a group).

 -trim_start 0    

- (phrap only). No. of bases to be removed at beginning of each read. + (phrap only). Number of bases to be removed at beginning of each read.


5. Assembly

- The following options are used to control completeness and stringency -of assembly; note that the options -minmatch, -bandwidth, -penalty, -and -minscore discussed above also affect assembly stringency. + The following phrap-only options are used to control completeness and +stringency of assembly; note that the options -minmatch, -bandwidth, +-penalty, -gap1_minscore, and -minscore discussed above also affect assembly +stringency.


 -forcelevel 0        

@@ -781,11 +1140,11 @@

6. Consensus sequence construction

- The following options affect the weighted directed graph that is used -to find the consensus sequence. Increasing their values will reduce -the size of the graph, which should reduce memory requirements for the -phrap run (substantially in some cases) but may decrease the accuracy -of the consensus sequence that is found. + The following phrap-only options affect the weighted directed graph +that is used to find the consensus sequence. Increasing their values +will reduce the size of the graph, which should reduce memory +requirements for the phrap run (substantially in some cases) but may +decrease the accuracy of the consensus sequence that is found.

 -node_seg 8        
 (phrap only). Minimum segment size (for purposes of traversing

@@ -794,6 +1153,13 @@

 -node_space 4      
 (phrap only). Spacing between nodes (in weighted directed graph) .       

+ -contig_graph_weights 0 + (phrap only). Weighting scheme; currently permitted values are 0 and +1. The value 0 causes the weights to be set equal to the quality +values; 1 causes them to be set to the scaled error probabilities (the +weight attached to a base is then e_0 - e, where e is the error probability +for that base and e_0 is the error probability corresponding to +-trim_qual).

7. Output

@@ -845,24 +1211,148 @@

 (phrap only). Print information about non-local matches between
contigs.

+ -print_word_data * + Print information about nucleating perfect matches. +

 -exp [None]           
 (gcphrap only). Name of a directory in which output experiment files
are to be placed.

 -alignments *       

- (cross_match only). Display the alignment for each match. + (cross_match only). Display the alignment for each reported match. +When the subject sequences are large (e.g. whole chromosomes), there +is currently a speed advantage to having them split among multiple +subject files (e.g. one file per chromosome) rather than all included +in a single file. + +-align_extend 0 + (cross_match only). # residues to display past end of the +SWAT-aligned region, when alignments are displayed (using +-alignments).

 -discrep_lists *    

- (cross_match only). Give list of discrepancies for each match. + (cross_match only). Give list of sequence discrepancies, and +qualities, for each reported match. If -spliced_word_gapsize is set, +putative introns bridged by the spliced word are included in the list, +as large deletion ('D-n' where n is the intron size) discrepancies.

 -discrep_tables *   
 (cross_match only). Give table of discrepancies (by quality) for each
match (default is to display a single table, that combines results for
all matches).

+ -score_hist * + (cross_match only, & there must be separate query & subject +files). Print histogram of match scores for each query domain group +(as defined above under '-masklevel'). Each score histogram appears as +a separate line following all displayed matches for a given query (if +any), with the following format: + query_name domain_start..domain_end all scores (counts): score(count) score(count) score(count) ... +For example, the line + 8-1-237-906_0 2..35 all scores(counts): 18(1) 20(2) 36(1) +indicates that the query sequence named 8-1-237-906_0 had one match +with score 36, two matches with score 20, and one match with score 18, +in the domain (region) starting at base 2 and ending at base 35. Note +that all matches meeting the specified -minscore threshold are +counted, regardless of the settings of -masklevel or -minmargin. +When -spliced_word_gapsize is set, the 1st count after score is no. of +unspliced matches, 2d count is no. of spliced matches. [give example] + +-output_nonmatching_queries * + Create fasta and .qual files (named by appending the suffixes +".nonmatching" and ".nonmatching.qual" to the original query file +name) containing the query sequences which failed to have any matches +in the analysis. + +-output_bcdsites * + (cross_match only, & there must be separate query & subject files). +Create an output file (named by appending the suffix ".bcdsites" to +the original query file name) that indicates subject sequence positions +that are confirmed by and/or have high-quality discrepancies with +respect to the query sequences. + The file format is as follows. There are three header lines, each +beginning with the character '#', which indicate the cross_match +command line, version, and run time for the run which produced the +file. The remainder of the file consists of 'event' lines with the +following format: + +sub pos event,n_f,n_r,max_score,max_marg,max_qual [event,n_f,n_r,max_score,max_marg,max_qual] ... + +where sub = name of the subject sequence (e.g. chromosome name) + pos = position (origin 1) within the subject sequence at which the event starts + event = one of the following: + C-n (n a positive integer) segment of length n in subject confirmed by a query + D-n or d-n segment of length n in subject deleted in a query (putative +introns revealed by EST alignments using spliced_word_gapsize are of this type; lower-case 'd' is reserved +for lower (bottom) strand introns) + S-b (b = A,C,G, or T) site at which a query has high-quality base b instead of + the subject sequence base + I-s (s a nucleotide sequence) site of high-quality inserted sequence s in query sequence with respect + to the subject sequence + L-s site at which a query sequence has a high-quality extension s to the left of position +pos in the subject sequence + R-s (seq a nucleotide sequence) site at which a query sequence has a high-quality extension s +to the right of position pos in the subject sequence + n_f = number of forward (i.e. top-strand) queries supporting the event + n_r = number of reverse (i.e. bottom-strand) queries supporting the event + max_score = maximum score of all query alignments supporting the event + max_margin = maximum score margin (see definition under -minmargin, above) of all query alignments supporting the event + max_qual = maximum quality of query base in alignments +supporting the event + + The events for a given subject sequence are listed in increasing +positional order (pos). For C and D events, the -n is omitted if n = +1. Sequences s for events I, L, and R are always given in top-strand +5' to 3' orientation. max_qual for C (confirmation) events is +currently always 0, because no quality check is performed (this will +be changed in future). Max_score and max_margin together provide a +measure of how confidently a supporting query maps to this +position. Note that the availability of this information in the +bcdsites file means that the initial cross_match run can be performed +with fairly liberal settings for -minscore, -minmargin, and +-bcdsites_qual_threshold; higher stringencies can be used to filter +out lower-confidence events from the bcdsites file later, without +re-running cross_match. + + For example, the line + +CHROMOSOME_X 250526 C,6,1,30,0,0 R-GATGTT,0,1,30,0,40 D-294,2,1,31,0,26 + +in the bcdsites file from a spliced-alignment run (using +-spliced_word_gapsize 2000) of Solexa cDNA reads against the +C. elegans genome indicates that at position 250526 in the X +chromosome, the subject sequence base is confirmed by 6 top strand and +1 bottom strand reads, with a maximum score of 30, while two top +strand and one bottom strand reads instead support the existence of a +294 bp intron beginning at this position, and one bottom strand read +extends to right from this position with additional bases GATGTT (this +is likely an another read supporting the intron, but which failed the +splice alignment because it had too few bases on the other side of the +intron). The last base of the intron would be 250526 + 294 - 1 = +250819. The intron is for a top (upper) strand gene since upper case +'D' is used. Note however that for all of these reads max_margin is +0, implying that the supporting query sequencies all have +equal-scoring matches elsewhere in the genome, so placement of these +queries in the genome is ambiguous and the intron cannot be considered +to be confirmed by these data. + +-bcdsites_qual_threshold 25 + (Only applies with -output_bcdsites). Quality threshold for flagging +discrepancies in .bcdsites file. Quality is ignored for purposes of +identifying confirmed bases, and for 'large' insertions or deletions +(e.g. introns in spliced alignments of ESTs to genome). +

8. Miscellaneous

+ -compact_qual * + (cross_match only, and there must be separate query & subject files; +cannot be used with merged base or protein data.) Store query quality +values in same bytes as nucleotides. Quality values > 60 are set to +60. This option reduces run-time memory requirements for very large +input files, and is set automatically when quality-based scoring of +alignments is used, or when the input query file is in CALF format. +

 -retain_duplicates *  
 (phrap only). Retain exact duplicate reads, rather than eliminating
them.

@@ -882,7 +1372,7 @@

 -trim_qual 13       
 (phrap only). Quality value used in to define the "high-quality" part
of a read, (the part which should overlap; this is used to adjust

-qualities at ends of reads. +qualities at ends of reads).

 -confirm_length 8   
 (phrap only). Minimum size of confirming segment (segment starts at

@@ -899,13 +1389,98 @@

 (phrap only). Minimum alignment score for a read to be allowed to
"confirm" part of another read.

- -indexwordsize 10 - Size of indexing (hashing) words, used in finding word matches -between sequences. The value of this parameter has a generally minor -effect on run time and memory usage. - + -indexwordsize 12 (for DNA sequences) 4 (for protein sequences) + Size of indexing (hashing) words, used in searching for nucleating perfect +matches between sequences. Cannot be set larger than +minmatch. Increasing indexwordsize while holding minmatch constant may +reduce running time somewhat but increase memory requirements, without +affecting sensitivity. + +-indexwordsize2 4 + (Merged base read data only). Size of indexing words used in +searching for nucleating perfect matches involving 2d peaks. Cannot be set +larger than indexwordsize. + +-logfile + Name of .log file. The default name is the name of the query file, with .log appended. + + +The following cross_match-only parameters are used in alignments of +cDNAs or ESTs (as queries) to genomic sequence + +-spliced_word_gapsize 0 + (cross_match only, & there must be separate query & subject +files). Look for nucleating perfect matches that span potential splice +junctions (i.e. where a region in the query matches a 'split' region +in the genomic sequence, flanking a potential intron). Only U2-type +(GY..AG) introns (on either strand), in the size range +-min_intron_length to -spliced_word_gapsize are currently considered; +and the nucleating perfect region must include at least minmatch/2 +perfectly matching bases to the left of the splice junction and +minmatch/2 perfectly matching bases to the right of the +junction). Only gap1 alignments (see above) are considered in +extending the spliced match. Only a single spliced word (a single +intron) can be found per alignment; however, multi-intron alignments +can be obtained by using -fuse_gap1. When spliced_word_gapsize is +set, ordinary (unspliced) nucleating perfect matches are also found, +and used to nucleate banded Smith-Waterman searches (or gap1 searches, +if -gap1_only is set) in the usual way. Spliced alignments are +penalized using an intron-size-dependent scoring function (currently +based on an assumed lognormal size distribution with parameters +derived from analysis of C.elegans introns; this will change in the +future). + Reported matches nucleated by a spliced_word are indicated by +prefixing the word "SPLICED" to the match summary line. + +-spliced_word_gapsize2 0 + (cross_match only, & there must be separate query & subject +files). Like -spliced_word_gapsize, but applies only to potential +splices in which the 'downstream' site (which can be either a 5' or a +3' potential splice site, depending on strand) occurs near the +boundary of a region having a match with (raw) score at least +-minscore. Since there are far fewer such sites, +-spliced_word_gapsize2 can be set much larger than +-spliced_word_gapsize without a significant time penalty. Both +-spliced_word_gapsize and -spliced_word_gapsize2 can be set in the +same run. Note that because the match condition applies only to the +downstream site, use of -spliced_word_gapsize2 can give slightly +different results when run on the complemented genome rather than the +original genome. + +-spliced_match_left 0 + (cross_match only) Only applies when -spliced_word_gapsize2 is +positive. # of positions to left of a left match boundary to consider, +for potential splice site locations. Must be non-negative. + +-spliced_match_right 20 + (cross_match only) Only applies when -spliced_word_gapsize2 is +positive. # of positions to right of a left match boundary to consider, +for potential splice site locations. Must be non-negative. + +-word_intron_margin 2 + (cross_match only) Number of bases at each intron end to display, in +'spliced' alignments found using -spliced_word_gapsize and reported +using -alignments. Must be at least 2. + +-min_intron_length 30 + (cross_match only) + +[THE FOLLOWING ARE CURRENTLY NOT SUPPORTED!!] + +-splice_edge_length 0 + (cross_match only, and there must be separate query & subject files) +Length of region at each end of alignments to scan for potential +splice sites. If 0, no analysis is done. If > 0, then the last +splice_edge_length bases of alignment, and the splice_edge_length +bases beyond end of alignment, are scanned. (Done for both ends). It +is assumed that the subject sequence is genomic, and the query +sequence is EST or cDNA.

+-max_overlap 20 + (cross_match only)

+-min_exon_length 6 + (cross_match only)


V. VIEWING PHRAP ASSEMBLIES WITH OTHER PROGRAMS

@@ -918,10 +1493,12 @@

assembly that is generated by phrap (e.g. phrap's consensus sequence
and quality values, and tags indicating potential assembly problems
and various data anomalies such as chimeras and compressions).

+

 To use consed with phrap, ace files must be generated during the phrap
run using the flag -new_ace or -old_ace on the command line. This is done
automatically if you use the script phredPhrap to carry out all of the
data processing steps.

+

 For additional information, consult consed documentation.


@@ -929,7 +1506,7 @@

  Phrapview is a graphical tool that provides a "global" view of the
phrap assembly, complementary to the "local" view provided by the

-sequence editor CONSED; it will become obsolete when a similar globabl +sequence editor CONSED; it will become obsolete when a similar global

view is added to CONSED.

a. Installation

@@ -996,6 +1573,7 @@

the assembly -- i.e. the "confidently placed" reads. (See phrap.doc
for a description of LLR scores). Misjoins usually occur in places
where the reduced depth is 0 on both strands.

+

 Regions of the contig where the depth is 0 on one or both strands are
indicated in red.

@@ -1011,6 +1589,7 @@

duplicated reads in the same direction on the insert.  Links will only
be indicated properly if the read naming convention assumed by phrap
is used (see section III.1 above).

+

 Each match or link is considered to be in one of three classes,
indicated by color: "problems" (red), which indicate serious
discrepancies or a significant possibility of incorrect, incomplete,

@@ -1041,6 +1620,7 @@

the other region, so it is unlikely that there is a significant
misassembly in such cases, although it is possible that one of the two
matching reads may have been incorrectly placed.

+

 Chimeras generally have one "ok" match (to the site where the read
was assembled) and one or more "problem" matches to the region from
which the other part of the chimera derives (sometimes these matches

@@ -1048,6 +1628,7 @@

LLR cutoff). Some chimeras actually represent subclones with internal
deletions; these tend to be obvious since the two pieces map to nearby
locations in some contig and are in the same orientation there.

+

 Double-headed arrows on matches indicate that the matching regions
are on opposite strands; double-headed arrows on links indicate that
the orientation of the two reads is inconsistent with the read names

@@ -1074,6 +1655,7 @@

remains highlighted until the pointer is moved over another item, or a
contig line.  Textual information about the highlighted item is shown
in a box in the top right region of the display.

+

 Several parameters can be changed by entering new values in
appropriate boxes at the bottom of the display. Four different
magnifications can be changed: horizontal magnification; the "spacing

@@ -1092,6 +1674,7 @@

(or in the wrong orientation) are marked as problems. Min ss and max
ss are the minimum and maximum spacing between same-strand reads (i.e.
size of a walking step).

+

 The display is updated to reflect changes made in the above
parameters only after a relevant button (Show Contig Matches, etc.) is
pressed.

@@ -1103,15 +1686,17 @@

3. Gap

- A version of phrap ("gcphrap", for "gap-compatible phrap") suitable + Versions of phrap and cross_match ("gcphrap", for "gap-compatible phrap", +and "gccross_match" for "gap-compatible cross_match") suitable

for use with the Staden gap4 package may be created as follows: Obtain
relevant source files from the Staden web site
(http://www.mrc-lmb.cam.ac.uk/pubseq/index.html). Put these in the

-same directory as the phrap source files, and enter the command +same directory as the phrap source files, and enter the commands

> make gcphrap

+> make gccross_match

- Consult the Staden web site for instructions on using gcphrap in + Consult the Staden web site for instructions on using gcphrap and gccross_match in

conjunction with gap4, including input and output file formats. In
addition to the usual phrap command line options, gcphrap accepts an
option -exp, which is used to specify the name of a directory in which

@@ -1133,23 +1718,30 @@

but may be redirected to a file).  This contains information
indicating the point in the run that phrap has reached, summary
results for some of the steps, and various warning and error messages.

+

 (ii) The .contigs file.  This is a FASTA file containing the contig
sequences (without pads; bases in this file are upper case if and only
if the quality is >= qual_show).  These include singleton contigs
consisting of single reads with a match to some other contig, but that
couldn't be merged consistently with it.

+

 (iii) The .contigs.qual file. This has the phrap-generated qualities
for the contig bases.

+

 (iv) The .singlets file. A FASTA file containing the singlet reads
(i.e. the reads with no match to any other read).

+

 (v) The .log file. This includes various diagnostic information, and
a summary of aspects of the assembly; it is probably only of interest
to me, for trouble-shooting purposes.

+

 (vi) The .problems file. Again this is probably only of interest to
me.

+

 (vii) The .ace file (produced when the options -new_ace or
-old_ace is used).  This file is required for viewing the assembly
using consed. It's format is described in the consed documentation.

+

 (vii) The .view file (produced when the option -view is
used). Required for viewing the assembly using phrapview.

@@ -1233,6 +1825,7 @@

the apparent size of the deletion, location of the deletion and (in
parentheses) name of the matching (undeleted) read and extent of the
segment that was deleted from in it are given.

+

  The probable deletions are not used in assembly and instead appear
as singleton contigs.

@@ -1401,6 +1994,7 @@

consist entirely of quality 0 are reassigned quality values of -1,
since these usually correspond to the lowest quality data at the
extreme ends of reads.  (Not done in the contig quality histogram.)

+

 An 'N' in either a read or the contig sequence is always counted as a
substitution error.

@@ -1452,6 +2046,7 @@

regions may be flagged as unspanned.  Misassembly errors of omission
should always result in "UNSPANNED" matching regions at the end of one
or both contigs.

+

 Regions of a non-singleton contig matching an anomalous (chimeric or
deleted) read are indicated only in the output for the singleton contig
that contains the anomalous read.

@@ -1487,10 +2082,11 @@

later) input file (the "subject" sequences); or, if a single input
file is provided, the matches between any two sequences in this file.
The matches that are reported are controlled by the command line

-options -minscore and -masklevel, as well as by the options that +options -minscore, -masklevel, and -minmargin, as well as by the options that

control scoring of the alignments and the band search (see section IV
above).  The reported matches are ordered by query, and for each query
by the position of the start of the alignment within the query.

+

  For each reported match, an initial output line gives summary
information:

@@ -1520,18 +2116,21 @@

Following this line there appears (if the flag -discrep_lists is
specified) a listing of the discrepancies between the aligned regions

-of the two sequences, giving for each discrepancy its type, position, -nucleotide, and quality (in the query sequence), and its position in -second sequence. This list is followed (if -discrep_tables is -specified) a table giving, for each quality value, the total number of -bases of that quality in the first sequence; the number which are -included in the SWAT alignment; the cumulative no. of aligned bases -(for this and higher qualities); the number of bases not included in -the alignment (but potentially alignable -- i.e. not lying before the -beginning or after the end of the second sequence); the number of -discrepancies of each type (substitution, deletion, insertion); the -total # of discrepancies (for this quality level) and their percentage -(of the aligned bases of the given quality), and the cumulative # of +of the two sequences, giving for each discrepancy its type (S = +substitution, I = insertion, D = deletion, E = end base (mismatching +base immediately adjacent to aligned region -- these are printed only +if the -output_bcdsites option is specified)), position, nucleotide, +and quality (in the query sequence), and its position in second +sequence. This list is followed (if -discrep_tables is specified) a +table giving, for each quality value, the total number of bases of +that quality in the first sequence; the number which are included in +the SWAT alignment; the cumulative no. of aligned bases (for this and +higher qualities); the number of bases not included in the alignment +(but potentially alignable -- i.e. not lying before the beginning or +after the end of the second sequence); the number of discrepancies of +each type (substitution, deletion, insertion); the total # of +discrepancies (for this quality level) and their percentage (of the +aligned bases of the given quality), and the cumulative # of

discrepancies and their percentage (of the cumulative no. of aligned
bases). This information is useful for computing accuracy as a
function of quality, for automatically generated contig sequences.

@@ -1542,12 +2141,172 @@

 An 'N' in either sequence is always counted as a substitution error.
 If the flag -alignments is specified, a complete alignment for each
reported match is displayed.

+ If the flag -score_hist is specified, a list of the scores of all +matches involving the query is given, along with the number of times +each score occurred.


+VIII. SPECIAL CONSIDERATIONS/PARAMETER MODIFICATIONS FOR PARTICULAR DATA TYPES

-VIII. EST ASSEMBLIES - (TO BE ADDED) + +For all: tag repeats (to reduce incidence of false joins). But don't +need to use most sensitive detection -- only evolutionarily recent +repeats (nearly identical in sequence) tend to cause problems. Repeatmasking. +Need script to process cross_match output, insert matches into fasta file as +tags (or phd files?) + +Also use quality values if possible. + +1. (TO BE ADDED) Shotgun assemblies + Main problem: wide variation in data quality between sequencing labs. + Different depth of coverage. + Increase -forcelevel if contigs not going together. + Reads unaligned to final sequence. + Increase minmatch, decrease maxgap to break bad joins. + +2. (TO BE ADDED) Whole genome assemblies + Can't predict memory needed. Repeat characteristics play crucial role. + Increase minmatch. + If contigs large (=> depth of coverage high), increase -node_seg and -node_space + +3. (TO BE ADDED) Assemblies of polymorphic reads from a single locus + Increase minmatch, set maxmatch = minmatch.

+4. (TO BE ADDED) EST assemblies + Issues: polymorphisms, paralogous genes, alternatively spliced forms, + uneven assembly depth, directional bias, large dataset size, + artifactual clones, data from several sources with & without quality vals. + Identify highly expressed genes, remove reads from assembly. Compare reads + to contig sequences. + depth of coverage high => increase -node_seg and -node_space + Parameters: increase minmatch and set maxmatch = minmatch. + + Deep assemblies (3 and 4): node_seg etc. + +5. Comparisons of ESTs/cDNAs to genome. + + With long reads, use -near_minscore to increase sensitivity to +detect short exonic matches. In this case, each match is reported +separately, and alignments are (currently) not adjusted to reflect +likely splice junctions. + + With long or short reads, use -spliced_word_gapsize and +-spliced_word_gapsize2 to directly detect likely introns (in the case +of long reads, -fuse_gap1 should also be set so that reported +alignments can include more than one intron). Reported alignments and +discrepancy lists (using -alignments and -discrep_lists) will then +indicate precise location & size of the putative intron, and +alignments report the bases at each end of the intron. Note that +there can be a substantial speed penalty for detecting modest size +introns using -spliced_word_gapsize, e.g. -spliced_word_gapsize 300 +(to detect introns of size <= 300) may approximately double running +times. There is also an increased risk of 'noise' alignments due to +the greater total number of potential alignments that are +considered. In contrast, there is relatively little speed penalty with +-spliced_word_gapsize2 because far fewer candidate splices are +considered. + + When using -spliced_word_gapsize2 with short reads you may want to +set -minscore fairly low, and/or spliced_match_left and +spliced_match_right higher than the defaults, to increase the +probability that a match is found in the vicinity of the true splice +site. + +See next section for other parameter settings useful with short reads. + +6. Short read analyses (e.g. Solexa/Illumina data). + + Recommended parameter changes include: + + Smaller -minscore (e.g. 20 or 25, if the default score +matrix/penalty values are being used). The default minscore value of +30 requires at least 30 matching bases, which may be too high to +reliably detect matches involving reads only slightly longer than 30 +bases (on the other hand, you should expect some false positive matches at +the lower settings) + -vector_bound 0 (for phrap; 0 is already the default for cross_match) + -gap1_only or -bandwidth 2. Note that large indels in short reads +cannot be detected anyway, at least with the default gap penalties; +using a larger bandwidth value with -gap_ext 0 may provide some +ability to detect larger indels (at the expense of additional 'noise' +alignments). -gap1_only is in general significantly faster than +-bandwidth 2, and is recommended with short reads. However it cannot +detect indels of size larger than 1. + -max_group_size 0 (for phrap assemblies where coverage depth is +very uneven, e.g. ESTs) + -globality 1 (for cross_match; requires that -gap1_only, +-spliced_word_gapsize2 and/or -spliced_word_gapsize is also set). + + Note that with cross_match, using -masklevel 101 to report all +matches can dramatically increase running times and memory +requirements in cases where some queries match high-copy number +repeats; it is generally preferable to use lower values (e.g. the +default, or -masklevel 0), together with an appropriate value of +-minmargin to control the score range of reported matches. Using +-minmargin 1 or higher reduces the number of stored alignments, which +can reduce running time and memory requirements. Note also that a +histogram for each query, indicating all match scores meeting the +-minscore threshold, can be obtained by setting -score_hist. + + Limiting reported alignments to those with high-confidence placements +can be achieved by using -minmargin with positive values. + + If it is important to detect regions of highly biassed composition +you may want to turn off complexity-adjustment of scores by setting +the command line parameters -raw and/or -word_raw. In general I do +not recommend using -raw, since it tends to greatly reduce specificity +(increase the number of false positive matches) at a given score +level; one can usually do a better job of increasing sensitivity to +detect biassed composition regions while maintaining reasonable +specificity simply by lowering -minscore and setting -word_raw. Note +however that using -word_raw can incur a significant speed penalty. +(Reducing -maxmatch, without setting -word_raw, has the effect of +'partially' removing word complexity adjustment, which can provide a +useful compromise for the speed/sensitivity tradeoff in some +cases). Since exons are generally less likely to have highly biassed +composition regions, it is usually preferable not to use -raw or +-word_raw in RNASeq (cDNA or EST) searches. + + + Reducing the value of -minmatch will also also increase sensitivity, +as will reducing -gap1_minscore (when gap1_only is not already set). +Note that this may substantially increase running time. + +In general, appropriate parameter settings for your analyses will +depend on the characteristics of your data (e.g. genome complexity, +read error rates) as well as your computer resources. I recommend +experimenting on a subset of your data before committing to very long +runs. For resequencing applications, a useful parameter for this +purpose is -output_nonmatching_queries. For example, you can start by +using the default parameter setting for -minmatch (but using +adjustments for the other parameters as indicated above), and then try +turning up the sensitivity using the nonmatching queries to see +whether this recovers substantially more matches. + +7. "Resequencing" applications. + + A useful analysis mode is to run cross_match comparing a large set of +reads (e.g. 2 million or so Solexa reads) in a single query file to a +genomic sequence comprising the subject file(s), setting the +parameters -discrep_lists and -output_nonmatching_queries along with +any others that may be appropriate (e.g. those for short reads above). +High quality discrepancies in the discrepancy lists will include +substitution and small indel differences between the resequenced & +original genomes. Larger scale differences, and contaminants, can +often be identified by performing a phrap assembly of the nonmatching +query reads, and comparing the contig sequences back to the original +genome (using more sensitive parameter settings if desired) and/or to +the nucleotide databases using the NCBI Blast server. The bcdsites +parameters can be used to produce an output file indicating positions +in the reference genome that are either confirmed by or have +high-quality discrepancies with the input reads. + + When the option -alignments is used to display the aligned sequences, +there is currently a speed advantage to having the subject +(genomic) sequences split among multiple subject files (e.g. one file +per chromosome) rather than all included in a single file. + +8. (TO BE ADDED) Merged base reads


IX. PROBLEMS

@@ -1564,7 +2323,8 @@

to have the operating system allocate additional virtual memory to
your process. (However if you are substantially exceeding physical RAM
the run may take an exceptionally long time to complete). On Unix

-systems this can be done using the "limit" command, as follows: +systems you can often get access to additional memory using the +"limit" command, as follows:

i. Run "limit" with no arguments to list the system resources currently
available to you.  For example (this output is for a DEC alpha

@@ -1609,10 +2369,15 @@

which increases datasize to 500 megabytes. Any value smaller than the
value returned in step ii can be used.

-iv. Now rerun phrap. Note that the new datasize value will only apply -to the particular shell (window) in which you execute the limit -command, so you may need to repeat step iii next time you log in or if -you switch to a different window. +iv. Now try your phrap (or cross_match) run again. Note that the new +datasize value will only apply to the particular shell (window) in +which you execute the limit command, so you may need to repeat step +iii next time you log in or if you switch to a different window. + +If you still get a FATAL ERROR: REQUESTED MEMORY UNAVAILABLE message +after doing the above, then the operating system (not phrap!) is still +limiting your access to virtual memory. You will need to consult with +a local system administrator in this case.


2. Other phrap- or cross_match-generated error messages

@@ -1646,7 +2411,7 @@

 (TO BE ADDED)


-7. Reporting problems +7. How to report problems

 If you have a problem that is not addressed by any of the above,
first be sure you have read sections I-IV of the documentation and

@@ -1740,6 +2505,7 @@

somewhat conservative in that it does not take into account
same-strand matches (which clearly should count for something,
although they certainly are not independent).

+

  Contig positions are assigned quality values equal to the highest
adjusted quality of any read at that position, and then adjusted
downward to take into account any discrepancies with other reads. As a

@@ -1748,6 +2514,7 @@

interpretation as (conservative) error probabilities.  Such error
probabilities provide an extremely useful guide to where editing or
additional data collection is needed.

+

 The phrap adjusted qualities are used in computing "LLR scores" for
each pairwise match between two reads. These scores take into account
the qualities of the base calls in the reads, and are (approximate)

@@ -1772,7 +2539,7 @@

single letter is converted to 'N's; such regions are highly likely to
be of poor data quality which if not masked can lead to spurious
matches. Reads are then converted to uppercase (in order to allow

-case-insensitive word matches). All matching words of length at least +case-insensitive nucleating perfect matches). All matching words of length at least

minmatch between any pair of sequences are then found, by (i)
constructing a list of pointers to each position (in each sequence)
that begins a word of at least minmatch letters not containing 'N' or

@@ -1793,6 +2560,7 @@

calculation) detection of multiple matching pieces in different
locations, and will usually find most copies of repeats (since they
generally occur in separate bands).

+

 By default, SWAT scores are complexity-adjusted (so that matches for
which the collection of matching nucleotides has biassed composition
have their scores significantly penalized.)

@@ -1813,7 +2581,7 @@

accurate calls, lower case for less accurate calls) and setting the
penalties appropriately; this would involve using the quality levels
to adjust the symbols used in the reads, which should be done AFTER

-the word matching routine. (At present it does not seem particularly +the nucleating perfect matching routine. (At present it does not seem particularly

useful to use differing positive scores for different nucleotides to
reflect their different frequencies).
Personal tools
Namespaces
Variants
Actions
Site
Choi lab
Resources
Toolbox