Phrap diff

From CSBLwiki

Jump to: navigation, search

<html> <head> </head> <body> --- ../phrap/phrap.txt 2010-07-09 18:06:36.000000000 +0900<br> +++ phrap.txt 2010-07-09 18:06:04.000000000 +0900<br> @@ -1,5 +1,5 @@<br> /*****************************************************************************<br> -# Copyright (C) 1994-1999 by Phil Green. <br> +# Copyright (C) 1994-2008 by Phil Green. <br> # All rights reserved. <br> # <br> # This software is part of a beta-test version of the swat/cross_match/phrap <br> @@ -23,7 +23,7 @@<br> #*****************************************************************************/<br> <br> <br> -DOCUMENTATION FOR PHRAP AND CROSS_MATCH (VERSION 0.990319)<br> +DOCUMENTATION FOR PHRAP AND CROSS_MATCH (VERSION 1.090518)<br> <br> phrap ("phragment assembly program", or "phil's revised assembly<br> program"; a homonym of "frappe" = French for "swat") -- a<br> @@ -34,7 +34,7 @@<br> repeats; constructs contig sequence as a mosaic of the highest quality<br> parts of reads (rather than a consensus); provides extensive information<br> about assembly (including quality values for contig sequence) to<br> -assist trouble-shooting; able to handle very large datasets.<br> +assist trouble-shooting; able to handle moderately large datasets.<br> N.B. phrap does not provide editing or viewing capabilities; these<br> are available with consed and phrapview. It is strongly recommended<br> that phrap be used in conjunction with the base calls and base quality<br> @@ -44,13 +44,17 @@<br> <br> cross_match -- a general-purpose utility (based on a "banded" version<br> of SWAT, an efficient implementation of the Smith-Waterman algorithm)<br> -for comparing any two sets of (long or short) DNA sequences. For<br> -example, it can be used to compare a set of reads to a set of vector<br> -sequences and produce vector-masked versions of the reads; a set of<br> -cDNA sequences to a set of cosmids; contig sequences found by two<br> -alternative assembly procedures (e.g. phrap and xbap) to each other;<br> -or phrap contigs to the final edited cosmid sequence. It is slower but<br> -more sensitive (because it allows gaps) than BLASTN.<br> +for comparing sets of (long or short) DNA or protein sequences. For<br> +example, it can be used to compare DNA sequencing reads to a set of<br> +vector sequences and produce vector-masked versions of the reads;<br> +reads to contig or genome sequences; cDNA or EST sequences to genome<br> +sequences; contig sequences found by two alternative assembly<br> +procedures (e.g. phrap and gap) to each other; phrap contigs to final<br> +edited clone or genome sequences; 'merged base reads' arising from<br> +mixed signal DNA sequencing traces to genome sequences. Cross_match<br> +has recently been enhanced to allow comparison of large numbers of<br> +reads generated by high throughput methods to a reference genome, for<br> +RNASeq, ChIPSeq, or resequencing applications.<br> <br> <br> CONTENTS<br> @@ -86,39 +90,49 @@<br> <br> VII. CROSS_MATCH OUTPUT<br> <br> -VIII. EST ASSEMBLIES (TO BE ADDED)<br> +VIII. SPECIAL CONSIDERATIONS/PARAMETER MODIFICATIONS FOR PARTICULAR DATA TYPES <br> + 1. (TO BE ADDED) Shotgun assemblies<br> + 2. (TO BE ADDED) Whole genome assemblies<br> + 3. (TO BE ADDED) Assemblies of polymorphic reads from a single locus<br> + 4. (TO BE ADDED) EST assemblies<br> + 5. Comparisons of ESTs/cDNAs to genome.<br> + 6. Short read analyses<br> + 7. "Resequencing" applications<br> + 8. (TO BE ADDED) Merged base reads<br> <br> IX. PROBLEMS<br> 1. Insufficient memory<br> - 2. Other phrap- or cross_match-generated error messages (TO BE ADDED)<br> - 3. Phrap- or cross_match-generated warning messages (TO BE ADDED)<br> + 2. (TO BE ADDED) Other phrap- or cross_match-generated error messages <br> + 3. (TO BE ADDED) Phrap- or cross_match-generated warning messages <br> 4. "Crashes" reported by operating system <br> 5. Long running time<br> - 6. Misassemblies, incomplete assemblies, incorrect consensus sequence (TO BE ADDED)<br> - 7. Reporting problems<br> + 6. (TO BE ADDED) Misassemblies, incomplete assemblies, incorrect consensus sequence <br> + 7. How to report problems<br> <br> <br> -APPENDIX: ALGORITHMS <br> +APPENDIX: ALGORITHMS (TO BE ADDED) <br> <br> <br> <br> I. INSTALLATION<br> <br> The source code for the swat/cross_match/phrap package will be sent to<br> -you in the form an email message containing a uuencoded .tar.Z file;<br> -you will need to have access to a Unix system for the initial<br> -unpacking, but once you've uudecoded it and unpacked the .tar file<br> -(steps i and ii below), you should be able to compile the programs<br> -on computers running other operating systems -- they should be portable<br> -to almost anything with a decent C compiler and adequate memory (64 Mb<br> -RAM or more is desirable). Here are the steps needed to unpack and<br> -install the programs:<br> +you in the form of an email message containing a uuencoded .tar.Z (or<br> +.tar.gz) file; you may need to have access to a Unix system for the<br> +initial unpacking, but once you've uudecoded it and unpacked the .tar<br> +file (steps i and ii below), you should be able to compile the<br> +programs on computers running other operating systems -- they should<br> +be portable to almost anything with a decent C compiler and adequate<br> +memory (512 Mb RAM or more is desirable).<br> +<br> +Here are the steps needed to unpack and install the programs:<br> <br> i. Save the email message as a file (for example, "temp.mail"). If<br> possible, do this using the Unix mail command, rather than another<br> mail program -- some mail programs (e.g. Pine) remove trailing spaces<br> on each line of incoming messages, which will corrupt a uuencoded<br> -message.<br> +message. If you cannot use Unix mail, try to avoid opening the message<br> +before you save it.<br> Do not attempt to modify the saved mail message in any way. That is<br> unnecessary and may corrupt the message.<br> <br> @@ -130,7 +144,9 @@<br> <br> > zcat distrib.tar.Z | tar xvf -<br> <br> -If either of these commands results in an error message, it is likely that<br> +[OR > zcat distrib.tar.gz | tar xvf - ]<br> +<br> +If either command results in an error message, it is likely that<br> the email message was corrupted by your mail program (see step i above).<br> <br> iii. To produce working versions of the programs, move (if necessary)<br> @@ -154,21 +170,20 @@<br> Then remove all files ending in .o produced by the original make, and<br> recompile.<br> <br> -iv. If you have datasets with more than 64,000 reads, or that include<br> -sequences longer than 64,000 bp, you will need the .manyreads and/or<br> -.longreads versions of phrap and cross_match. These are created using the<br> -command<br> +Other warning messages (as opposed to error messages!) that may be<br> +produced by the compiler can in general be ignored.<br> <br> -> make manyreads<br> +(N.B. TO USERS OF PREVIOUS VERSIONS: There are no longer separate<br> +.manyreads OR .longreads variants of the programs -- the default<br> +versions of cross_match and phrap automatically handle large numbers<br> +of reads, and long sequences).<br> <br> -v. If you are operating a non-commercial (academic or government)<br> +iv. If you are operating a non-commercial (academic or government)<br> computer facility which provides access to several independent<br> investigators, you are required by the licensing agreement to set the<br> permissions on the executables and source code to allow execute but<br> not read access, so that the programs may not be copied.<br> <br> -<br> -<br> II. RUNNING PHRAP & CROSS_MATCH<br> <br> 1. Phrap. <br> @@ -182,7 +197,8 @@<br> information to consed (e.g. tagging of repeats). It will accept any<br> of the phrap command line options described below (section IV).<br> PhredPhrap and its documentation are included in the consed<br> -distribution.<br> +distribution (available from David Gordon).<br> +<br> Although phredPhrap automates most of the steps in the assembly, you<br> will still need to do several things: (a) set up the directory<br> structure that is assumed by phredPhrap and make sure your<br> @@ -197,30 +213,36 @@<br> ii. The following instructions pertain to "standalone" operation of<br> phrap. Before the run, you may want to increase the amount of memory<br> available to phrap: see the instructions for using the Unix<br> -"limit" command under IX.1, "Insufficient memory" below.<br> +"limit" command below under IX.1, "Insufficient memory".<br> The command line should be of the form<br> <br> > phrap seq_file1 [seq_file2 ...] [-option value] [-option value] ... <br> <br> where each seq_file is the name of a sequence file in FASTA format as<br> -described below in III.2. Available options are described in section IV<br> -below. The standard output is extensive and should be redirected to a<br> -file (phrap.out in the following example).<br> +described below in III.2. (The file name may include a directory/path<br> +specification). Available options are described in section IV<br> +below. [The command line format is actually somewhat more flexible<br> +than indicated above, i.e. option/value pairs can be interspersed with<br> +file names].<br> +<br> +The standard output from this command is extensive and should be<br> +redirected to a file (phrap.out in the following example).<br> <br> Example: <br> <br> > phrap C05D11.reads.screen -minmatch 20 -new_ace > phrap.out <br> <br> Note that the quality file is not specified on the command line, but<br> -instead must be named appropriately (as described below in III.3) to be<br> -recognized by phrap; in this example its name would need to be<br> -C05D11.reads.screen.qual. If more than one seq_file is provided,<br> -then reads in the first file are assembled only against sequences in<br> -the 2d (and later) files; no sequences are compared to other sequences<br> -in the same file. [N.B. THE ABILITY TO USE MULTIPLE SEQUENCE FILES IS<br> -NOT AVAILABLE AT PRESENT -- YOU SHOULD PROVIDE ONLY A SINGLE SEQUENCE<br> -FILE TO PHRAP]. In this case, many features of phrap (e.g. checking<br> -for anomalies, vector, etc.) which rely on comprehensive read-read<br> +instead must be named appropriately (as described below in III.3) to<br> +be recognized by phrap; in this example its name would need to be<br> +C05D11.reads.screen.qual. If more than one seq_file is provided, then<br> +reads in the first file are assembled only against sequences in the 2d<br> +(and later) files; no sequences are compared to other sequences in the<br> +same file. [N.B. THE ABILITY TO USE MULTIPLE SEQUENCE FILES IS NOT<br> +AVAILABLE AT PRESENT WITH PHRAP -- YOU SHOULD PROVIDE ONLY A SINGLE<br> +SEQUENCE FILE. HOWEVER MULTIPLE FILES ARE ALLOWED WITH<br> +CROSS_MATCH]. In this case, many features of phrap (e.g. checking for<br> +anomalies, vector, etc.) which rely on comprehensive read-read<br> comparisons are turned off, SWAT scores are used in place of<br> LLR_scores, and implied merges are not performed -- i.e. each read in<br> the first file is merged with at most one sequence in the 2d<br> @@ -236,30 +258,40 @@<br> > cross_match seq_file1 [seq_file2 ...] [-option value] [-option value] ... <br> <br> where each seq_file is the name of a sequence file in FASTA format as<br> -described below in III.2. Available options are described in section<br> -IV below. The standard output should generally be redirected to a<br> -file. If there are at least two files, all sequences in the first<br> -file are compared to those in the second (and any subsequent) files;<br> -if there is only one file, all sequences in this file are compared to<br> -each other (N.B. at present entries are not compared to their own<br> -complements, but are compared to themselves in the same orientation,<br> -and to all other entries in both orientations). Matches meeting<br> -relevant criteria (see section IV) are written to the standard output.<br> +described below in III.2. [The command line format is actually<br> +somewhat more flexible than indicated above, i.e. option/value pairs<br> +can be interspersed with file names. However the relative order of<br> +the sequence files is important inasmuch as the first file (the 'query<br> +file') is treated differently from the others ('subject files')]. (The<br> +file name may include a directory/path specification). Available<br> +options are described in section IV below. If the command line<br> +specifies more than one file, all sequences in the first file are<br> +compared to those in the second (and any subsequent) files. If there<br> +is only one file, all sequences in it are compared to each other<br> +(N.B. in the single-file case, at present entries are not compared to<br> +their own complements, but are compared to themselves in the same<br> +orientation (excluding the trivial identity match), and to all other<br> +entries in both orientations). Matches meeting relevant criteria (see<br> +section IV) are written to the standard output.<br> <br> +The standard output should generally be redirected to a file.<br> <br> III. INPUT FILES<br> <br> Input files to cross_match and phrap are of two types: sequence files<br> (required), and quality files (optional, but strongly recommended for<br> -phrap). If you are doing phrap assemblies of reads generated by the<br> +phrap). With cross_match, the query file can optionally be a CALF<br> +formatted file of unaligned reads (see http://www.phrap.org/phredphrap/calf.pdf <br> +for a description of this format), which must have a name ending in<br> +".calf". If you are doing phrap assemblies of reads generated by the<br> basecalling program phred, then it is recommended that you generate<br> these input files automatically using the phd2fasta or phredPhrap<br> -scripts distributed with phred and consed (see II.1.i above).<br> -Even in this case however you should read the following section on the<br> -read naming convention, since you will either need to make sure that<br> -the read names attached to your chromatogram files adhere to the<br> -naming convention, or you will need to modify the phredPhrap script<br> -(or create your own script) so that correct template names and read<br> +scripts distributed with phred and consed (see II.1.i above). Even in<br> +this case however you should read the following section on the read<br> +naming convention, since you will either need to make sure that the<br> +read names attached to your chromatogram files adhere to the naming<br> +convention, or you will need to modify the phredPhrap script (or<br> +create your own script) so that correct template names and read<br> orientation information is appended to the .phd files after they are<br> produced by phred.<br> <br> @@ -268,17 +300,20 @@<br> <br> In addition to the sequence and quality data, phrap needs to know<br> three things for each read: <br> +<br> (i) the name of the subclone (or other template) from which the read<br> is derived; this is used, for example, in checking for chimeric<br> subclones (for which purpose it is necessary to know when two reads<br> are from the same subclone so that they are not regarded as<br> independently confirming each other) and for certain other data<br> anomalies;<br> +<br> (ii) the orientation of the read (forward or reverse) within the<br> subclone, in cases where data is acquired from each end of a subclone<br> insert (at present this information is used only for consistency<br> checking following assembly, and not in the assembly itself, but this<br> will change in future versions);<br> +<br> (iii) the chemistry used to generate the read (which influences<br> phrap's decisions as to how to treat discrepancies between potentially<br> overlapping reads, and as to how to adjust qualities of reads which<br> @@ -293,6 +328,7 @@<br> (i) provide the relevant information in the description field of the<br> .fasta file instead (see below), or (if you are using the phd2Fasta<br> script) as tags in the .phd files for each read; <br> +<br> (ii) see whether appropriate values of the command line options<br> -subclone_delim and -n_delim (described below) can be used to bring<br> your own naming convention into conformity with the St. Louis<br> @@ -301,14 +337,17 @@<br> subclone name occurs at the first (or n-th for some fixed value of n)<br> occurrence of a particular character, then you may be able to use<br> these options without having to change your read names;<br> +<br> (iii) create a script which translates your read names into St. Louis<br> form (this does not mean that you cannot retain the original names as<br> well -- for example, for purposes of running phrap you could create a<br> directory of "links" to your chromatogram files, such that the link<br> names adhere to the naming convention, and then run the phredPhrap<br> script on this directory);<br> +<br> (iv) alter the subroutines in the source module "names.c" in order to<br> parse your read names correctly; or<br> +<br> (v) ignore the issue and hope that it does not cause problems. This<br> may have unpredictable adverse effects on assembly and is NOT<br> recommended.<br> @@ -323,6 +362,7 @@<br> and/or the option -n_delim may be used to indicate that the subclone<br> name contains a fixed number of occurrences of this character within<br> it. See section IV below for a description of command line options.<br> +<br> The orientation of the read within the subclone, and the chemistry,<br> are indicated by the first letter following the '.', as follows:<br> <br> @@ -360,53 +400,87 @@<br> phrap is used with a single input sequence file containing the reads,<br> unless one wants to assemble one set of reads "against" another<br> sequence or set of sequences (see below). Cross_match is typically<br> -used with two input sequence files, the first containing the "query"<br> -sequences and the second containing the "subject" sequences, but it<br> -may also be used to compare the sequences in a single input file to<br> -each other.<br> -<br> - Phrap is designed to be able to use all of each read sequence in the<br> -assembly, not just the trimmed (highest quality) part, so the full<br> -sequence of each read should be provided when available. If<br> -phred-generated qualities are not available however then the default<br> -quality parameter -default_qual should be set appropriately (section<br> -IV).<br> +used with two or more input sequence files, the first containing the<br> +"query" sequences and the remaining files containing the "subject"<br> +sequences, but it may also be used to compare the sequences in a<br> +single input file to each other.<br> +<br> + Cross_match may be used with either DNA or protein sequences (mixed<br> +types are not allowed, i.e. all input sequences must be DNA, or all<br> +must be protein). In the case of DNA sequences, the query file (but<br> +not subject files) may include 'merged base' DNA sequences produced by<br> +phred from mixed signal DNA sequencing traces (see below).<br> +<br> + Phrap is used only with ordinary DNA sequences as input. It is<br> +designed to be able to use all of each read sequence in the assembly,<br> +not just the trimmed (highest quality) part, so the full sequence of<br> +each read should be provided when available. If phred-generated<br> +qualities are not available however then the default quality parameter<br> +-default_qual should be set appropriately (section IV).<br> <br> Sequence files must be in FASTA format: <br> (i) Each sequence entry has a single header line as follows: The<br> first character is '>'. This is followed immediately (i.e. with no<br> intervening spaces or other characters) by the sequence name, and then<br> -(optionally) by a space or spaces, followed by descriptive information<br> -about the sequence.<br> +(optionally) by a space or spaces, followed by (on the same line)<br> +descriptive information about the sequence.<br> (ii) The header line is followed by a separate line or lines<br> -containing the sequence.<br> +containing the sequence. The sequence may all be placed on a single line,<br> +or split among several lines (of arbitrary length).<br> +<br> +Very large amounts of input data are permitted: Individual sequences<br> +must be 2 billion characters or less (the same limit applies to the<br> +name and descriptive information for each sequence). The combined<br> +total size of all query sequences is, by default, limited to about 1<br> +billion characters; however, this can be increased by changing the<br> +line<br> +<br> + typedef int SEQ_AREA;<br> +<br> +in the header file swat.h to either<br> +<br> + typedef unsigned int SEQ_AREA;<br> +<br> +(which will increase the limit to 2 billion characters), or <br> +<br> + typedef long int SEQ_AREA; <br> +<br> +(which on most computers will increase it to about 10^18 -- at the<br> +cost of increased memory and running time), and recompiling. There is<br> +no combined total length constraint for subject sequences. The 2<br> +billion residue length constraint for individual sequences (both query<br> +and subject) cannot be increased.<br> <br> -The sequence names, descriptive information, and sequences can be of<br> -arbitrary length (up to a combined total size for each type of<br> -about 2 billion characters), since space for them is allocated<br> -dynamically. If there are more than 64,000 entries in the first<br> -(query) file, or if any of the sequences are longer than about 64,000<br> -characters, the .manyreads version of phrap or cross_match must be<br> -used.<br> <br> Descriptive information on the header line may optionally be used to<br> indicate template (subclone), read orientation, read chemistry and<br> -dye. If present, this information overrides whatever would be implied<br> -by the read name. The template name is indicated by including the word<br> -"TEMPLATE:", followed by a space, followed by the name of the<br> -template. Orientation is indicated by the word "DIRECTION:", followed<br> -by the word "fwd" or "rev". Chemistry is indicated by the word<br> -"CHEM:", followed by the word "prim" (for dye primer), "term" for dye<br> -terminator, or "unknown". Dye is indicated by the word "DYE:" followed<br> -by "rhod", "big", "ET", "d-rhod", or "unknown". The chemistry and dye<br> -information is provided automatically by newer versions of phred<br> -(version 0.980904.a or later), which obtain it from the chromatogram<br> -file. (Use the script phd2fasta distributed with phred and consed to<br> -create the .fasta file from the .phd files created by phred).<br> +dye, and repeats. This information is written back out to the .ace<br> +and .singlets files. If present, this information overrides whatever<br> +would be implied by the read name. The template name is indicated by<br> +including the word "TEMPLATE:", followed by a space, followed by the<br> +name of the template. Orientation is indicated by the word<br> +"DIRECTION:", followed by the word "fwd" or "rev". Chemistry is<br> +indicated by the word "CHEM:", followed by the word "prim" (for dye<br> +primer), "term" for dye terminator, or "unknown". Dye is indicated by<br> +the word "DYE:" followed by "rhod", "big", "ET", "d-rhod", or<br> +"unknown". The chemistry and dye information is provided<br> +automatically by newer versions of phred (version 0.980904.a or<br> +later), which obtain it from the chromatogram file. (Use the script<br> +phd2fasta distributed with phred and consed to create the .fasta file<br> +from the .phd files created by phred).<br> +<br> + In the query file, repeated sequence may be indicated by the word<br> +"REPEAT:" followed by a pair of integers indicating the start and end<br> +of the repeat (these should be separated by spaces). There can be<br> +multiple "REPEAT:" tags for a single sequence. If the parameter<br> +-repeat_screen (see section IV.2 below) is set to 1 or 3, the sequence<br> +within any such repeat is ignored for the purpose of finding word<br> +matches between sequences. This can substantially improve the speed<br> +and (for phrap) accuracy of assembly of repeat-rich regions.<br> <br> Example:<br> <br> ->read#5 DIRECTION: rev CHEM: prim DYE: ET TEMPLATE: a23f1 any_other_information<br> +>read#5 DIRECTION: rev CHEM: prim DYE: ET TEMPLATE: a23f1 REPEAT: 151 237 REPEAT: 305 422 any_other_information<br> GAAAGATCTCATTGATCACTCTATTCAAGTGGGAGTCTCCGGTCTTT<br> ATGATCGATTTGTGAATCTTCGTATCAAAGTTGGAGCTGACAAGTATCCA<br> TTGCTTGCGAAATGGGCTCAAATTTTCACTCAGGGAGTCGTCTTCGATCC<br> @@ -414,19 +488,57 @@<br> GGAAGAGAGATGGATGTACTGTGGCATCAACTGCAGTAGCTTCAATGGTT<br> TATGGAAAGAGTATGTTATTt<br> <br> +In subject file(s), repeats are instead indicated by using lower-case<br> +letters for the residues.<br> <br> -All descriptive information in the header line is written back out to<br> -the .ace and .singlets files.<br> -<br> -Non-alphabetic characters (including '*', and digits) in the sequence<br> -lines are automatically stripped out when the file is read in, except<br> -for '-' which is converted to 'N'; '>' must not appear anywhere within<br> -the sequence since it is assumed to start the header of a new sequence<br> -(even if it is not at the beginning of a line). Lower case letters are<br> -converted to upper case on readin, so it is not possible (as it was in<br> -old phrap versions) to use case as a crude indication of quality.<br> -<br> -<br> +Non-alphabetic characters (including '*', and digits) in the<br> +sequence lines are automatically stripped out when the file is read<br> +in, except for '-' and '.' in DNA files, which are both converted to<br> +'N'; '>' must not appear anywhere within the sequence since it is<br> +assumed to start the header of a new sequence (even if it is not at<br> +the beginning of a line).<br> +<br> +Lower case letters are converted to upper case on readin, with<br> +the following important exceptions:<br> +<br> + (i) if -repeat_screen (see section IV.2 below) is set to 2 or 3,<br> +segments of subject file sequences consisting entirely of lower case<br> +letters are assumed to represent repeats, and nucleating perfect matches<br> +falling entirely within them (or spliced matches with either splice<br> +junction falling within them) are ignored;<br> +<br> + (ii) the query file may include 'merged base' DNA sequence reads<br> +produced by phred from mixed signal DNA sequencing traces. Such reads<br> +indicate at each position the two strongest peaks detected in the<br> +signal, and use standard ambiguity codes together with upper and lower<br> +case to indicate these peaks, as follows:<br> +<br> +Merged base read symbols:<br> +<br> +A,C,G,T (single detected peak)<br> +<br> + stronger peak weaker peak<br> +M A C<br> +m C A <br> +K G T <br> +k T G <br> +R A G <br> +r G A <br> +Y C T <br> +y T C <br> +S C G <br> +s G C <br> +W A T <br> +w T A <br> +<br> +N no information<br> +<br> +For effective searches with merged base reads an appropriate score<br> +matrix should be provided using the -matrix option (see IV.1 below);<br> +the matrix "mb_matrix" provided with the distribution has been found<br> +to be useful for this purpose. Note that with this score matrix <br> +the default values of -minscore, -near_minscore, -gap1_dropoff and<br> +-gap1_minscore are too small, and should be changed on the command line.<br> <br> 3. Quality files. <br> <br> @@ -438,18 +550,20 @@<br> tabulating discrepancies and not in the alignment scoring. The name of<br> this file must consist of the name of the FASTA file, with ".qual"<br> appended.<br> +<br> The format of the .qual file is similar to that of the corresponding<br> FASTA file. For each read there should appear a header line identical<br> (except possibly for the "description" field) to that in the FASTA<br> file. This is followed by one or more lines giving the qualities for<br> -each base. Quality values should be integers between 0 and 99<br> -(inclusive), and should be separated by spaces. No other (non-digit,<br> -non-white space) characters should appear. The total number of quality<br> -values for each read must match the number of bases for that read in<br> -the FASTA file. Also, the orders of the reads in the FASTA and<br> -corresponding .qual file must be identical. N's are automatically<br> -assigned quality 0, overriding any value present in the .qual file;<br> -however X's retain whatever value is present in the .qual file.<br> +each base. Quality values should be integers >= 0 and <= 99, and<br> +should be separated by spaces. No other (non-digit, non-white space)<br> +characters should appear. The total number of quality values for each<br> +read must match the number of bases for that read in the FASTA file.<br> +Also, the orders of the reads in the FASTA and corresponding .qual<br> +file must be identical. N's are automatically assigned quality 0,<br> +overriding any value present in the .qual file [??]; however X's<br> +retain whatever value is present in the .qual file.<br> +<br> If provided, quality information is used by phrap both in the<br> assembly itself (where discrepancies between high quality bases are<br> used to discriminate repeats from true overlaps) and in determining<br> @@ -463,16 +577,17 @@<br> effects on quality (e.g. compressions) -- in such cases the input<br> quality information may constitute the only basis for choosing one<br> strand over the other.<br> +<br> If your sequencing data is from automated sequencing machines it is<br> strongly recommended that you generate the read sequences using the<br> basecalling program phred, which will produce an appropriate quality<br> -file for input to phrap. Phred generally gives more accurate base<br> -calls overall, for a longer distance, than the ABI software does. On<br> -the basis of the trace characteristics, phred computes a probability p<br> -of an error in the base call at each position, and converts this to a<br> -quality value q using the transformation q = -10 log_10(p). Thus a<br> -quality of 30 corresponds to an error probability of 1 / 1000, a<br> -quality of of 40 to an error probability of 1 / 10000, etc.<br> +file for input to phrap. On the basis of the trace characteristics,<br> +phred computes a probability p of an error in the base call at each<br> +position, and converts this to a quality value q using the<br> +transformation q = -10 log_10(p). Thus a quality of 30 corresponds to<br> +an error probability of 1 / 1000, a quality of 40 to an error<br> +probability of 1 / 10000, etc.<br> +<br> The quality value 99 is reserved for base calls that have been<br> visually inspected and verified as "highly accurate" (during editing),<br> and 98 for bases that have been edited but are not highly accurate<br> @@ -480,15 +595,16 @@<br> match but have discrepancies involving bases that have quality 99 in<br> both reads, phrap does not allow them to be merged during assembly.<br> At present this is the only way to break false joins made by phrap.<br> +<br> If you wish to generate your own quality values you should be<br> prepared to experiment. It is particularly important that possible<br> -insertion errors (which are fairly frequent late in the trace, with ABI<br> -basecalls) receive low quality, because in choosing the "consensus"<br> +insertion errors receive low quality, because in choosing the "consensus"<br> phrap tends to favor the reads with the highest total quality in a<br> given region. Also, if at all possible you should use quality values<br> that are defined in terms of error probabilities in the manner<br> described above, since to some extent phrap is tuned to expect these<br> (and its output qualities will then have a similar interpretation).<br> +<br> If all input quality values are relatively small (less than 15),<br> phrap assumes that they do not correspond to error probabilities and<br> attempts to rescale them so that the largest quality value is<br> @@ -545,7 +661,7 @@<br> <br> IV. COMMAND LINE OPTIONS<br> <br> - This section describes all command line options available with<br> + This section describes command line options available with<br> cross_match and phrap, grouped by category. Each option name is<br> followed immediately by the "default value" assumed by the program,<br> and a brief description. A '*' appearing as the default value<br> @@ -564,13 +680,14 @@<br> The following options control the scoring of pairwise sequence<br> alignments. By default, matching residues receive a reward of +1,<br> mismatches get a penalty of -2, gap opening residues a penalty of -4<br> -(= -2 - 2), and gap extension residues a penalty of -3 (= -2 - 1). The<br> -highest scoring word-nucleated local alignment between a pair of<br> -sequences is found, and its score is then complexity-adjusted<br> -(i.e. scores of matches between sequence regions of biassed nucleotide<br> -composition are adjusted downwards). These parameters are chosen such<br> -that regions of two sequences that are about 70% or more identical<br> -will tend to have a positive alignment score. For more stringent<br> +(= -2 - 2), and gap extension residues a penalty of -3 (= -2 - 1).<br> +Word-nucleated high-scoring local alignments between sequences are<br> +found by a modified Smith-Waterman approach (see next section), and<br> +their scores are then complexity-adjusted (so matches between sequence<br> +regions of highly biassed nucleotide composition have their scores<br> +adjusted downwards). The above parameters are chosen such that<br> +regions of two sequences that are about 70% or more identical will<br> +tend to have a positive alignment score. For more stringent<br> comparisons, -penalty should be made more negative (e.g. a value of -9<br> will tend to find alignments that are 90% identical), and/or -minscore<br> should be increased; for more sensitive comparisons, a score matrix<br> @@ -579,117 +696,358 @@<br> option name & default value <br> <br> -penalty -2 <br> -Mismatch (substitution) penalty for SWAT comparisons. <br> +Mismatch (substitution) penalty for scoring alignment<br> <br> -gap_init penalty-2 <br> -Gap initiation penalty for SWAT comparisons. <br> +Gap initiation penalty<br> <br> -gap_ext penalty-1 <br> -Gap extension penalty for SWAT comparisons. <br> +Gap extension penalty. Ignored when -gap1_only is set, since gaps<br> +in that case may only have size 1.<br> <br> -ins_gap_ext gap_ext <br> - Insertion gap extension penalty for SWAT comparisons (insertion in<br> + Insertion gap extension penalty (insertion in<br> subject relative to query).<br> <br> -del_gap_ext gap_ext<br> - Deletion gap extension penalty for SWAT comparisons (deletion in<br> + Deletion gap extension penalty (deletion in<br> subject relative to query).<br> <br> -matrix [None] <br> - Score matrix for SWAT comparisons (if present, supersedes -penalty)<br> -Matrix format: (TO BE ADDED)<br> + Score matrix (if present, supersedes -penalty). Note that when a<br> +customized score matrix is used it is usually desirable to change the<br> +default score thresholds for -minscore, -near_minscore, -gap1_dropoff<br> +-gap1_minscore. Matrix format: (Also see examples included<br> +in distribution). <br> +<br> + Leading whitespace on each line is stripped off (& blank lines are ignored). Then<br> +<br> +Lines beginning with '#' are ignored (& so may include comments).<br> +<br> +(optional) A line beginning with 'GAP' is assumed to have the (integer) gap_init<br> +and gap_ext penalties.<br> +<br> +(optional) A line beginning with 'FREQS' is assumed to have (floating<br> +point) residue frequencies (used in complexity adjustment).<br> +<br> +A line beginning 'MERGED BASE DNA' indicates that the matrix is to be<br> +used in analyzing merged base reads (and is required in this case).<br> +<br> +The first line not beginning as above is assumed to contain the<br> +residue column labels for the matrix (as its non-whitespace<br> +characters).<br> +<br> +Any remaining lines are assumed to contain the rows of the score<br> +matrix (optionally preceded by an alphabetic character giving the row<br> +label as the residue name; if absent, it is assumed that the row<br> +labels are the same as the column labels). Score values must be<br> +integers.<br> +<br> + -qual_scores<br> + (cross_match only, & there must be separate query & subject<br> +files). Compute quality-based alignment scores of DNA sequence reads<br> +against a known reference sequence. The score can be interpreted as<br> +(roughly) -10log_10(p), where p is the posterior probability that the<br> +read derives either from an unknown 'contaminant' source (i.e. noise)<br> +or from a different reference sequence location than the indicated<br> +one, given the read's quality values and alignments and assuming a<br> +prior probability of .5 of originating from the contaminant. Apart<br> +from allowing a non-zero contaminant probability, this is essentially<br> +MAQ's mapping quality score (Li, Ruan & Durbin, 2008). Qual_scores are<br> +determined using an (internally created) score matrix that takes into<br> +account query base qualities: briefly, matching bases score 6, and<br> +mismatches score 6 - (q + 5) where q is the base quality. [Here 6 and<br> +5 are approximations to -10log_10(1/4) and -10log_10(1/3)<br> +respectively]. The alignment score computed using this matrix is then<br> +reduced by 10log_10(2G) (where G = reference genome size) to reflect<br> +the prior probability for this specific reference location, and<br> +adjusted further to take into account qualities at matching bases in<br> +the alignment; it may also be reduced if complexity adjustment is on<br> +(i.e. -raw is not set), to take into account any deviation from .25 of<br> +nucleotide frequencies in the matching genomic region. Finally it is<br> +reduced by the score of the highest scoring alternative alignment (if<br> +any) for that read. Alignments with scores > 0 are reported,<br> +consistent with the value of -minmargin.<br> + User-specified matrices are not currently allowed with this<br> +option. The -compact_qual and -score_hist options are turned on, and<br> +the values of -minscore, -gap_init, and -gap_ext, -gap1_minscore,<br> +-gap1_dropoff, and -near_minscore are reset (-minscore is set to<br> +0). -masklevel is set to 0. For statistical validity, ideally<br> +-globality should be set to 3 (global alignments), although this is<br> +currently only possible when -gap1_only is also set. Scores reported<br> +in the score histogram do not reflect adjustment for quality at<br> +matching bases and have not been reduced by the score of an<br> +alternative alignment.<br> <br> -raw * <br> - Use raw rather than complexity-adjusted Smith-Waterman scores.<br> + Use raw rather than complexity-adjusted Smith-Waterman scores. <br> <br> <br> <br> -2. Banded search<br> +2. Banded/gap1 search<br> <br> - The following options control the region of the Smith-Waterman matrix<br> -that is searched. In phrap and cross_match (as opposed to swat, which<br> -performs full Smith-Waterman comparisons), word-nucleated banded<br> -Smith-Waterman comparisons are performed: the set of input sequences<br> -is scanned to find pairs of perfectly matching subsequences ("word<br> -matches") meeting certain criteria, and for each such word match a<br> -band in the Smith-Waterman matrix that is centered on the diagonal<br> -defined by the match is searched to find an optimal-scoring alignment.<br> -If there are multiple matching words for a given pair of sequences,<br> -the union of the corresponding bands is searched. The search is<br> -"recursive" in the sense that if an optimal-scoring alignment is found<br> -whose score exceeds -minscore the remainder of the band is searched<br> -again; as a result multiple alignments may be found within a single<br> -band.<br> - The word matches that are considered are "maximal" in the sense that<br> -they cannot be extended in either direction without introducing a<br> -discrepancy in one of the two sequences. If the length of the word<br> -match is at least -maxmatch, it is accepted; otherwise if the<br> -complexity-adjusted length is less than -minmatch, or if the matching<br> -sequence occurs more than -max_group_size times on the forward strands<br> -of the query file, it is rejected. The latter criterion effectively<br> -downweights frequently occurring words. It can be turned off by<br> -setting -maxmatch equal to -minmatch.<br> -<br> - -minmatch 14 <br> - Minimum length of matching word to nucleate SWAT comparison. if<br> -minmatch = 0, a full (non-banded) comparison is done [N.B. NOT<br> -PERMITTED IN CURRENT VERSION]. Increasing -minmatch can dramatically<br> -decrease the time required for the pairwise sequence comparisons; in<br> -phrap, it also tends to have the effect of increasing assembly<br> -stringency. However it may cause some significant matches to be<br> -missed, and it may increase the risk of incorrect joins in phrap in<br> + The following options control the region of the edit graph (of a<br> +query-subject sequence pair) that is searched. In phrap and<br> +cross_match (as opposed to swat, which performs full Smith-Waterman<br> +comparisons), word-nucleated banded Smith-Waterman comparisons are<br> +performed as follows. First, the set of input sequences is scanned to<br> +find pairs of perfectly matching subsequences ("nucleating perfect matches"),<br> +which are "maximal" in the sense that they cannot be extended in<br> +either direction without introducing a discrepancy in one of the two<br> +sequences, and which meet certain additional filtering criteria<br> +described below. For each such nucleating perfect match, a band in the edit<br> +graph that is centered on the diagonal defined by the match is<br> +searched to find an optimal-scoring alignment. If there are multiple<br> +nucleating perfect matches for a given pair of sequences, the union of the<br> +corresponding bands is searched. The search is "recursive" in the<br> +sense that if an optimal-scoring alignment is found whose score is at<br> +least minscore, the remainder of the band is searched again;<br> +consequently, multiple alignments may be found within a single band.<br> +Note that because an entire band is searched, the nucleating perfect match<br> +itself is not necessarily part of the optimal alignment, although it<br> +usually will be.<br> +<br> +An alternative comparison mode, using "gap1" alignments, is also<br> +available. "Gap1" alignments are alignments that extend the (gapfree)<br> +alignment defined by the nucleating perfect match to the left and to<br> +the right, allowing arbitrarily many residue mismatches (appropriately<br> +penalized), but at most one gap character, in each direction. Such<br> +alignments have a maximum of two gap characters, and are thus more<br> +restrictive than banded Smith-Waterman (bSW) alignments. However they<br> +are significantly faster to find, and for short queries (< 40 bp) will<br> +detect most alignments that are detectable at all by bSW; for longer<br> +queries, the -fuse_gap1 option (see below) will fuse overlapping gap1<br> +alignments into a longer alignment that may contain more than 2 gaps<br> +(each gap still having a maximum size of one, however). Gap1<br> +alignments also provide a useful filtering step prior to bSW. In<br> +contrast to banded search, for gap1 alignments the nucleating perfect<br> +match will always be part of the alignment.<br> +<br> + A number of (optional) criteria can be used to filter the set of<br> +nucleating perfect matches that are used. If a maximal nucleating<br> +perfect match falls entirely within an annotated repeat in the query<br> +or repeat sequence it is rejected. Otherwise, if it has length at<br> +least -maxmatch, it is (tentatively) accepted. Otherwise (i.e. if it<br> +is neither accepted nor rejected by those criteria), if the<br> +complexity-adjusted length (or actual length, if -word_raw is<br> +specified) is less than -minmatch, or if the matching sequence occurs<br> +more than -max_group_size times in query file sequences, it is<br> +rejected. (The latter two criteria provide independent mechanisms for<br> +deprecating frequently occurring regions of size less than -maxmatch.)<br> +A final filter is then applied to those nucleating perfect matches not<br> +rejected by the preceding criteria, as follows. The highest scoring<br> +gap1 alignment extending the nucleating perfect match is found, and if<br> +this alignment has (non-complexity-adjusted) score less than<br> +-gap1_minscore, the match is rejected. Nucleating perfect matches not<br> +rejected by any of these criteria are then used to define a banded<br> +Smith-Waterman search as described above (unless the option -gap1_only<br> +is specified).<br> +<br> + -minmatch 14 (for DNA sequences) 4 (for protein sequences) <br> + Minimum length of nucleating perfect match for banded Smith-Waterman or gap1<br> +comparison. If minmatch = 0, a full (non-banded) comparison is done<br> +[N.B. NOT PERMITTED IN CURRENT VERSION]. Increasing -minmatch can<br> +dramatically decrease the time required for the pairwise sequence<br> +comparisons; in phrap, it also tends to have the effect of increasing<br> +assembly stringency. However it may cause some significant matches to<br> +be missed, and it may increase the risk of incorrect joins in phrap in<br> certain situations (by causing implied overlaps between reads with<br> high-quality discrepancies to be missed).<br> <br> - -maxmatch 30 <br> - Maximum length of matching word. For cross_match, the default value<br> -is equal to minmatch, instead of 30.<br> -<br> - -max_group_size 20 <br> - Group size (query file, forward strand words) <br> + -maxmatch 20 (for DNA sequences) 4 (for protein sequences) <br> + Maximum required length of nucleating perfect match (i.e. non-repeat_screened<br> +nucleating perfect matches at least this long are always used, regardless of<br> +complexity adjustment or max_group_size). Cannot be set larger than<br> +127, or smaller than -minmatch (input value is automatically adjusted<br> +to meet these criteria if necessary). Note that setting maxmatch =<br> +minmatch has the effect of turning off all filtering of nucleating perfect<br> +matches (i.e. both complexity adjustment and max_group_size.) This<br> +will increase sensitivity somewhat, at the expense of significantly<br> +increased running time.<br> +<br> +-word_offset 2 (for ordinary DNA sequences) 1 (for merged base DNA or protein sequences, or if spliced_word_gapsize or spliced_word_gapsize2 is set)<br> + (cross_match only, & there must be separate query & subject<br> +files). Controls which words get indexed. Higher values can greatly<br> +reduce running times, at the cost of (usually slightly) reduced sensitivity.<br> +Phrap/cross_match versions older that 1.080827 implicitly used<br> +-word_offset 1, which provides maximum sensitivity (for a given<br> +-minmatch) but slowest speed. It is not (currently) recommended to<br> +setting -word_offset higher than 1 when looking for spliced alignments<br> +will cause a significant fraction (> 50%) of such alignments to be<br> +missed.<br> +<br> + -max_group_size 10 (for phrap) 0 (for cross_match) <br> + Group size (maximum number of allowed occurrences of the nucleating perfect<br> +region in query file, forward (i.e. uncomplemented) strands). If 0,<br> +group size is ignored -- this is the default for cross_match, and is<br> +the recommended setting for phrap assemblies of very short reads when<br> +coverage depth is expected to be extremely uneven (e.g. Solexa EST<br> +reads).<br> <br> -word_raw * <br> - Use raw rather than complexity-adjusted word length, in testing<br> -against minmatch (N.B. maxmatch always refer to raw lengths). (The<br> -default is to adjust word length to reflect complexity of matching<br> -sequence).<br> + Use raw rather than complexity-adjusted length for the nucleating perfect<br> +match, in testing against minmatch (N.B. maxmatch always refers to raw<br> +lengths). (Default behavior if -word_raw is not set is to adjust<br> +length to reflect complexity of matching sequence).<br> <br> -bandwidth 14 <br> - 1/2 band width for banded SWAT searches (full width is 2 times<br> -bandwidth + 1). Decreasing bandwidth also decreases running time at<br> -the expense of sensitivity. Phrap assemblies of clones containing long<br> -tandem repeats of a short repeat unit (< 30 bp) may be more accurately<br> -assembled by decreasing -bandwidth; -bandwidth should be set such that<br> -2 bandwidth + 1 is less than the length of a repeat unit. -bandwidth 0<br> -can be used to find gap-free alignments.<br> + 1/2 band width for banded Smith-Waterman searches (full width is 2<br> +times bandwidth + 1). Decreasing bandwidth decreases running time, at<br> +the expense of sensitivity to detect large gaps. Phrap assemblies of<br> +clones containing long tandem repeats of a short repeat unit (< 30 bp)<br> +may be more accurately assembled by decreasing -bandwidth; -bandwidth<br> +should be set such that 2 bandwidth + 1 is less than the length of a<br> +repeat unit. Note that -bandwidth 0 can be used to find gap-free<br> +alignments. If read length is short (e.g. with Solexa sequence data)<br> +it is generally pointless to use a large value for -bandwidth because<br> +large indels in these reads cannot be detected anyway (with the usual<br> +gap penalty scoring). -bandwidth is ignored when -gap1_only is specified.<br> +<br> + -repeat_screen 0 <br> + Controls how nucleating perfect matches falling entirely within<br> +repeats are treated. If 0, repeat information is not used in<br> +evaluating nucleating perfect matches; if 1 or 3, nucleating perfect<br> +matches lying entirely within repeat tags in query file are ignored<br> +(i.e. not used to nucleate banded Smith-Waterman or gap1 search); if 2<br> +or 3, nucleating perfect matches lying entirely within lower case<br> +regions in subject file, or spliced matches for which either splice<br> +junction falls in such a region, are ignored. Note that it is up to<br> +the user to provide the appropriate repeat information in the input<br> +sequence files (see above, section III.2).<br> +<br> + -gap1_minscore 17<br> + Minimum score (not complexity-adjusted) of 'gap1' extensions of<br> +nucleating perfect match. Can be turned off by setting to 0; however<br> +this is not recommended, because use of -gap1_minscore substantially<br> +lowers running time with very little sacrifice of sensitivity. Should<br> +be adjusted if non-default scoring is used. -gap1_minscore is ignored<br> +when -gap1_only is set. If greater than -minscore, it is reset to<br> +equal -minscore.<br> +<br> +-gap1_only *<br> + (cross_match only).<br> + If set, only gap1 alignments are considered. The parameters -gap_ext,<br> +-bandwidth, -gap1_minscore (which only applies to filtering) and<br> +-near_minscore will then be ignored. If -gap1_only is used with long<br> +queries, it is recommended that -fuse_gap1 also be set.<br> +<br> +-gap1_dropoff -12<br> + Maximum score dropoff permitted in gap1 alignments, before<br> +terminating search. Should be adjusted if non-default score matrix is used.<br> +<br> +-fuse_gap1 *<br> + Fuse overlapping gap1 alignments. This is only applicable when gap1<br> +alignments are generated using -gap1_only, -spliced_word_gapsize, and/or<br> +-spliced_word_gapsize2. The fusing is done after -minmargin and<br> +-minscore filtering is applied. This option is generally only useful<br> +with longer sequences.<br> +<br> +-globality 0<br> + (Cross_match only). Controls extent (with respect to query) of gap1<br> +alignments. Settings:<br> + 0 local alignments only<br> + 1 semiglobal to left (i.e. alignment forced to extend to left (5') end of query)<br> + 2 semiglobal to right (alignment forced to extend to right (3') end of query)<br> + 3 global (alignment forced to extend to both ends of query) <br> +<br> +This option is currently available only with -gap1_only,<br> +-spliced_word_gapsize, and/or -spliced_word_gapsize2 alignments, and<br> +it does not apply to the gap1 alignments that are used in the<br> +filtering step performed prior to a banded search (which use<br> +-gap1_minscore, and are always local). Only recommended for use with<br> +short reads. -globality 1 increases specificity when the 5' query<br> +bases are expected to be most accurate (as with Solexa data).<br> +-globality 3 (forcing global alignments) is useful for some purposes<br> +(e.g. in analyzing error rates as a function of position or quality in<br> +short reads) but can reduce sensitivity when there is a higher error<br> +rate at the 3' end of the read.<br> <br> +[DISCUSS NUCLEATING PERFECT MATCHES FOR MERGED BASE READS.]<br> <br> 3. Filtering of matches<br> <br> - The following options control which matches are displayed (cross_match)<br> + The following options control which matches are reported (cross_match)<br> or used in assembly (phrap).<br> <br> -minscore 30 <br> - Minimum alignment score.<br> + Minimum alignment score (complexity-adjusted, unless -raw is set (see<br> +IV.1 above)). Should be adjusted if non-default score matrix is used.<br> +<br> + -near_minscore minscore<br> + (cross_match only). Typically used when comparing EST and cDNA query<br> +sequences to a genomic sequence using cross_match, to increase<br> +sensitivity to detect short low-scoring exonic matches that are in the<br> +vicinity of higher-scoring ones. Matches having scores that are at<br> +least -near_minscore but less than -minscore are reported only if they<br> +are within a distance max_intron_length, and in the appropriate order<br> +and orientation, with some match having score >= minscore and<br> +involving the same query and subject sequences. Note that there is no<br> +"-near_minmatch" parameter, so if you want to increase sensitivity at<br> +the nucleating perfect match step you would need to decrease<br> +-minmatch. Should be adjusted if non-default scoring is<br> +used. -near_minscore is turned off when -gap1_only is specified.<br> +<br> +-max_intron_length 10000<br> + (cross_match only) See description of -near_minscore, above.<br> <br> - -vector_bound 80 <br> +-vector_bound 80 (for phrap) 0 (for cross_match) <br> Number of potential vector bases at beginning of each read. Matches<br> that lie entirely within this region are assumed to represent vector<br> -matches and are ignored. For cross_match, the default value is 0<br> -instead of 80.<br> -<br> - -masklevel 80 <br> - (cross_match only). A match is reported only if at least (100 -<br> -masklevel)% of the bases in its "domain" (the part of the query that<br> -is aligned) are not contained within the domain of any higher-scoring<br> -match.<br> +matches and are ignored. Note that for assembling very short<br> +(e.g. Solexa) reads, the default value is much too high!! For<br> +cross_match, -vector_bound is only utilized when there are no subject<br> +files, and the default value is 0 instead of 80.<br> +<br> + -masklevel 80 (101 for merged base reads, or when there is no subject file)<br> + (cross_match only). Masklevel controls the grouping of matches<br> +according to their domains (the segment of the query that is aligned),<br> +which is useful when different portions of the query may match<br> +different subject sequence regions (e.g. cDNA queries vs genomic<br> +subject, or chimeric sequencing read queries). Two matches for the<br> +same query are considered to be in the same "query domain group" if at<br> +least masklevel% of the bases in the domain of either one of them is<br> +contained within the domain of the other. Thus for the default value<br> +of 80, matches are assigned to the same group if the domain of either<br> +one is at least 80% contained within the domain of the other.<br> Special cases:<br> - -masklevel 0 report only the single highest scoring match for each query <br> - -masklevel 100 report any match whose domain is not completely contained<br> - within a higher scoring match<br> - -masklevel 101 report all matches<br> -<br> -<br> + -masklevel 0 all matches form a single query domain group<br> + -masklevel 101 all matches form a single query domain group, but<br> +-minmargin is inactivated thus causing every match with score >=<br> +minscore to be reported.<br> +<br> +For merged base reads, -masklevel is always set to 101, because<br> +overlapping matches can correspond to different sets of peaks and thus<br> +be unrelated to each other.<br> +<br> +-minmargin 0.5 <br> + (cross_match only, & there must be separate query & subject<br> +files). The score margin for a match is defined to be the difference s<br> +- best_other_score, where s is the match score, and best_other_score<br> +is the highest score occurring for any other match in the same query<br> +domain group (as defined above under '-masklevel') for that query.<br> +Only matches with score margin at least -minmargin are<br> +reported. Specifically (letting n denote a positive integer):<br> +<br> + -minmargin n report only those highest-scoring matches (for a given query<br> +domain group) for which no other matches have score within n of it<br> + -minmargin 0.5 report a single highest-scoring match for each query<br> +domain group. If there is more than one match with this score, one is<br> +chosen at random; in the case of spliced word matches, this random<br> +choice is made from among those highest-scoring matches having minimal<br> +span in the subject sequence.<br> + -minmargin 0 report all highest-scoring matches in each query domain group<br> + -minmargin -n report any match whose score<br> +is within n of the highest-scoring match in the query domain group<br> +<br> +In particular -minmargin 1 cause the top-scoring match to be reported<br> +only if no other matches have the same score; -minmargin -1 will cause<br> +all matches having a score within 1 of the top scoring match (and at<br> +least minscore) to be reported.<br> + Note that the -minscore filter is also applied. For -minmargin >=<br> +0.5, at most one match per query domain group is reported; for<br> +-minmargin < 0.5, multiple matches per query domain group may be<br> +reported. The value of -minmargin becomes irrelevant when -masklevel is<br> +101. <br> <br> 4. Input data interpretation<br> <br> @@ -724,14 +1082,15 @@<br> read is not assigned to a group).<br> <br> -trim_start 0 <br> - (phrap only). No. of bases to be removed at beginning of each read.<br> + (phrap only). Number of bases to be removed at beginning of each read.<br> <br> <br> 5. Assembly<br> <br> - The following options are used to control completeness and stringency<br> -of assembly; note that the options -minmatch, -bandwidth, -penalty,<br> -and -minscore discussed above also affect assembly stringency.<br> + The following phrap-only options are used to control completeness and<br> +stringency of assembly; note that the options -minmatch, -bandwidth,<br> +-penalty, -gap1_minscore, and -minscore discussed above also affect assembly<br> +stringency.<br> <br> <br> -forcelevel 0 <br> @@ -781,11 +1140,11 @@<br> <br> 6. Consensus sequence construction<br> <br> - The following options affect the weighted directed graph that is used<br> -to find the consensus sequence. Increasing their values will reduce<br> -the size of the graph, which should reduce memory requirements for the<br> -phrap run (substantially in some cases) but may decrease the accuracy<br> -of the consensus sequence that is found.<br> + The following phrap-only options affect the weighted directed graph<br> +that is used to find the consensus sequence. Increasing their values<br> +will reduce the size of the graph, which should reduce memory<br> +requirements for the phrap run (substantially in some cases) but may<br> +decrease the accuracy of the consensus sequence that is found.<br> <br> -node_seg 8 <br> (phrap only). Minimum segment size (for purposes of traversing<br> @@ -794,6 +1153,13 @@<br> -node_space 4 <br> (phrap only). Spacing between nodes (in weighted directed graph) . <br> <br> + -contig_graph_weights 0<br> + (phrap only). Weighting scheme; currently permitted values are 0 and<br> +1. The value 0 causes the weights to be set equal to the quality<br> +values; 1 causes them to be set to the scaled error probabilities (the<br> +weight attached to a base is then e_0 - e, where e is the error probability<br> +for that base and e_0 is the error probability corresponding to<br> +-trim_qual).<br> <br> 7. Output<br> <br> @@ -845,24 +1211,148 @@<br> (phrap only). Print information about non-local matches between<br> contigs.<br> <br> + -print_word_data * <br> + Print information about nucleating perfect matches.<br> +<br> -exp [None] <br> (gcphrap only). Name of a directory in which output experiment files<br> are to be placed.<br> <br> -alignments * <br> - (cross_match only). Display the alignment for each match.<br> + (cross_match only). Display the alignment for each reported match.<br> +When the subject sequences are large (e.g. whole chromosomes), there<br> +is currently a speed advantage to having them split among multiple<br> +subject files (e.g. one file per chromosome) rather than all included<br> +in a single file.<br> +<br> +-align_extend 0 <br> + (cross_match only). # residues to display past end of the<br> +SWAT-aligned region, when alignments are displayed (using<br> +-alignments).<br> <br> -discrep_lists * <br> - (cross_match only). Give list of discrepancies for each match.<br> + (cross_match only). Give list of sequence discrepancies, and<br> +qualities, for each reported match. If -spliced_word_gapsize is set,<br> +putative introns bridged by the spliced word are included in the list,<br> +as large deletion ('D-n' where n is the intron size) discrepancies.<br> <br> -discrep_tables * <br> (cross_match only). Give table of discrepancies (by quality) for each<br> match (default is to display a single table, that combines results for<br> all matches).<br> <br> + -score_hist *<br> + (cross_match only, & there must be separate query & subject<br> +files). Print histogram of match scores for each query domain group<br> +(as defined above under '-masklevel'). Each score histogram appears as<br> +a separate line following all displayed matches for a given query (if<br> +any), with the following format:<br> + query_name domain_start..domain_end all scores (counts): score(count) score(count) score(count) ...<br> +For example, the line<br> + 8-1-237-906_0 2..35 all scores(counts): 18(1) 20(2) 36(1) <br> +indicates that the query sequence named 8-1-237-906_0 had one match<br> +with score 36, two matches with score 20, and one match with score 18,<br> +in the domain (region) starting at base 2 and ending at base 35. Note<br> +that all matches meeting the specified -minscore threshold are<br> +counted, regardless of the settings of -masklevel or -minmargin. <br> +When -spliced_word_gapsize is set, the 1st count after score is no. of<br> +unspliced matches, 2d count is no. of spliced matches. [give example]<br> +<br> +-output_nonmatching_queries *<br> + Create fasta and .qual files (named by appending the suffixes<br> +".nonmatching" and ".nonmatching.qual" to the original query file<br> +name) containing the query sequences which failed to have any matches<br> +in the analysis.<br> +<br> +-output_bcdsites *<br> + (cross_match only, & there must be separate query & subject files).<br> +Create an output file (named by appending the suffix ".bcdsites" to<br> +the original query file name) that indicates subject sequence positions<br> +that are confirmed by and/or have high-quality discrepancies with<br> +respect to the query sequences. <br> + The file format is as follows. There are three header lines, each<br> +beginning with the character '#', which indicate the cross_match<br> +command line, version, and run time for the run which produced the<br> +file. The remainder of the file consists of 'event' lines with the<br> +following format:<br> +<br> +sub pos event,n_f,n_r,max_score,max_marg,max_qual [event,n_f,n_r,max_score,max_marg,max_qual] ...<br> +<br> +where sub = name of the subject sequence (e.g. chromosome name)<br> + pos = position (origin 1) within the subject sequence at which the event starts<br> + event = one of the following: <br> + C-n (n a positive integer) segment of length n in subject confirmed by a query<br> + D-n or d-n segment of length n in subject deleted in a query (putative <br> +introns revealed by EST alignments using spliced_word_gapsize are of this type; lower-case 'd' is reserved<br> +for lower (bottom) strand introns)<br> + S-b (b = A,C,G, or T) site at which a query has high-quality base b instead of<br> + the subject sequence base<br> + I-s (s a nucleotide sequence) site of high-quality inserted sequence s in query sequence with respect<br> + to the subject sequence<br> + L-s site at which a query sequence has a high-quality extension s to the left of position <br> +pos in the subject sequence<br> + R-s (seq a nucleotide sequence) site at which a query sequence has a high-quality extension s <br> +to the right of position pos in the subject sequence<br> + n_f = number of forward (i.e. top-strand) queries supporting the event<br> + n_r = number of reverse (i.e. bottom-strand) queries supporting the event<br> + max_score = maximum score of all query alignments supporting the event<br> + max_margin = maximum score margin (see definition under -minmargin, above) of all query alignments supporting the event<br> + max_qual = maximum quality of query base in alignments<br> +supporting the event <br> +<br> + The events for a given subject sequence are listed in increasing<br> +positional order (pos). For C and D events, the -n is omitted if n =<br> +1. Sequences s for events I, L, and R are always given in top-strand<br> +5' to 3' orientation. max_qual for C (confirmation) events is<br> +currently always 0, because no quality check is performed (this will<br> +be changed in future). Max_score and max_margin together provide a<br> +measure of how confidently a supporting query maps to this<br> +position. Note that the availability of this information in the<br> +bcdsites file means that the initial cross_match run can be performed<br> +with fairly liberal settings for -minscore, -minmargin, and<br> +-bcdsites_qual_threshold; higher stringencies can be used to filter<br> +out lower-confidence events from the bcdsites file later, without<br> +re-running cross_match.<br> +<br> + For example, the line<br> +<br> +CHROMOSOME_X 250526 C,6,1,30,0,0 R-GATGTT,0,1,30,0,40 D-294,2,1,31,0,26<br> +<br> +in the bcdsites file from a spliced-alignment run (using<br> +-spliced_word_gapsize 2000) of Solexa cDNA reads against the<br> +C. elegans genome indicates that at position 250526 in the X<br> +chromosome, the subject sequence base is confirmed by 6 top strand and<br> +1 bottom strand reads, with a maximum score of 30, while two top<br> +strand and one bottom strand reads instead support the existence of a<br> +294 bp intron beginning at this position, and one bottom strand read<br> +extends to right from this position with additional bases GATGTT (this<br> +is likely an another read supporting the intron, but which failed the<br> +splice alignment because it had too few bases on the other side of the<br> +intron). The last base of the intron would be 250526 + 294 - 1 =<br> +250819. The intron is for a top (upper) strand gene since upper case<br> +'D' is used. Note however that for all of these reads max_margin is<br> +0, implying that the supporting query sequencies all have<br> +equal-scoring matches elsewhere in the genome, so placement of these<br> +queries in the genome is ambiguous and the intron cannot be considered<br> +to be confirmed by these data.<br> +<br> +-bcdsites_qual_threshold 25<br> + (Only applies with -output_bcdsites). Quality threshold for flagging<br> +discrepancies in .bcdsites file. Quality is ignored for purposes of<br> +identifying confirmed bases, and for 'large' insertions or deletions<br> +(e.g. introns in spliced alignments of ESTs to genome).<br> +<br> <br> 8. Miscellaneous<br> <br> + -compact_qual *<br> + (cross_match only, and there must be separate query & subject files;<br> +cannot be used with merged base or protein data.) Store query quality<br> +values in same bytes as nucleotides. Quality values > 60 are set to<br> +60. This option reduces run-time memory requirements for very large<br> +input files, and is set automatically when quality-based scoring of<br> +alignments is used, or when the input query file is in CALF format.<br> +<br> -retain_duplicates * <br> (phrap only). Retain exact duplicate reads, rather than eliminating<br> them.<br> @@ -882,7 +1372,7 @@<br> -trim_qual 13 <br> (phrap only). Quality value used in to define the "high-quality" part<br> of a read, (the part which should overlap; this is used to adjust<br> -qualities at ends of reads.<br> +qualities at ends of reads).<br> <br> -confirm_length 8 <br> (phrap only). Minimum size of confirming segment (segment starts at<br> @@ -899,13 +1389,98 @@<br> (phrap only). Minimum alignment score for a read to be allowed to<br> "confirm" part of another read.<br> <br> - -indexwordsize 10 <br> - Size of indexing (hashing) words, used in finding word matches<br> -between sequences. The value of this parameter has a generally minor<br> -effect on run time and memory usage.<br> -<br> + -indexwordsize 12 (for DNA sequences) 4 (for protein sequences) <br> + Size of indexing (hashing) words, used in searching for nucleating perfect<br> +matches between sequences. Cannot be set larger than<br> +minmatch. Increasing indexwordsize while holding minmatch constant may<br> +reduce running time somewhat but increase memory requirements, without<br> +affecting sensitivity.<br> + <br> +-indexwordsize2 4<br> + (Merged base read data only). Size of indexing words used in<br> +searching for nucleating perfect matches involving 2d peaks. Cannot be set<br> +larger than indexwordsize.<br> +<br> +-logfile<br> + Name of .log file. The default name is the name of the query file, with .log appended.<br> +<br> +<br> +The following cross_match-only parameters are used in alignments of<br> +cDNAs or ESTs (as queries) to genomic sequence <br> +<br> +-spliced_word_gapsize 0<br> + (cross_match only, & there must be separate query & subject<br> +files). Look for nucleating perfect matches that span potential splice<br> +junctions (i.e. where a region in the query matches a 'split' region<br> +in the genomic sequence, flanking a potential intron). Only U2-type<br> +(GY..AG) introns (on either strand), in the size range<br> +-min_intron_length to -spliced_word_gapsize are currently considered;<br> +and the nucleating perfect region must include at least minmatch/2<br> +perfectly matching bases to the left of the splice junction and<br> +minmatch/2 perfectly matching bases to the right of the<br> +junction). Only gap1 alignments (see above) are considered in<br> +extending the spliced match. Only a single spliced word (a single<br> +intron) can be found per alignment; however, multi-intron alignments<br> +can be obtained by using -fuse_gap1. When spliced_word_gapsize is<br> +set, ordinary (unspliced) nucleating perfect matches are also found,<br> +and used to nucleate banded Smith-Waterman searches (or gap1 searches,<br> +if -gap1_only is set) in the usual way. Spliced alignments are<br> +penalized using an intron-size-dependent scoring function (currently<br> +based on an assumed lognormal size distribution with parameters<br> +derived from analysis of C.elegans introns; this will change in the<br> +future).<br> + Reported matches nucleated by a spliced_word are indicated by<br> +prefixing the word "SPLICED" to the match summary line.<br> +<br> +-spliced_word_gapsize2 0<br> + (cross_match only, & there must be separate query & subject<br> +files). Like -spliced_word_gapsize, but applies only to potential<br> +splices in which the 'downstream' site (which can be either a 5' or a<br> +3' potential splice site, depending on strand) occurs near the<br> +boundary of a region having a match with (raw) score at least<br> +-minscore. Since there are far fewer such sites,<br> +-spliced_word_gapsize2 can be set much larger than<br> +-spliced_word_gapsize without a significant time penalty. Both<br> +-spliced_word_gapsize and -spliced_word_gapsize2 can be set in the<br> +same run. Note that because the match condition applies only to the<br> +downstream site, use of -spliced_word_gapsize2 can give slightly<br> +different results when run on the complemented genome rather than the<br> +original genome.<br> +<br> +-spliced_match_left 0<br> + (cross_match only) Only applies when -spliced_word_gapsize2 is<br> +positive. # of positions to left of a left match boundary to consider,<br> +for potential splice site locations. Must be non-negative.<br> +<br> +-spliced_match_right 20<br> + (cross_match only) Only applies when -spliced_word_gapsize2 is<br> +positive. # of positions to right of a left match boundary to consider,<br> +for potential splice site locations. Must be non-negative.<br> +<br> +-word_intron_margin 2<br> + (cross_match only) Number of bases at each intron end to display, in<br> +'spliced' alignments found using -spliced_word_gapsize and reported<br> +using -alignments. Must be at least 2.<br> +<br> +-min_intron_length 30<br> + (cross_match only) <br> +<br> +[THE FOLLOWING ARE CURRENTLY NOT SUPPORTED!!]<br> +<br> +-splice_edge_length 0<br> + (cross_match only, and there must be separate query & subject files)<br> +Length of region at each end of alignments to scan for potential<br> +splice sites. If 0, no analysis is done. If > 0, then the last<br> +splice_edge_length bases of alignment, and the splice_edge_length<br> +bases beyond end of alignment, are scanned. (Done for both ends). It<br> +is assumed that the subject sequence is genomic, and the query<br> +sequence is EST or cDNA.<br> <br> +-max_overlap 20<br> + (cross_match only) <br> <br> +-min_exon_length 6<br> + (cross_match only) <br> <br> <br> V. VIEWING PHRAP ASSEMBLIES WITH OTHER PROGRAMS<br> @@ -918,10 +1493,12 @@<br> assembly that is generated by phrap (e.g. phrap's consensus sequence<br> and quality values, and tags indicating potential assembly problems<br> and various data anomalies such as chimeras and compressions).<br> +<br> To use consed with phrap, ace files must be generated during the phrap<br> run using the flag -new_ace or -old_ace on the command line. This is done<br> automatically if you use the script phredPhrap to carry out all of the<br> data processing steps.<br> +<br> For additional information, consult consed documentation.<br> <br> <br> @@ -929,7 +1506,7 @@<br> <br> Phrapview is a graphical tool that provides a "global" view of the<br> phrap assembly, complementary to the "local" view provided by the<br> -sequence editor CONSED; it will become obsolete when a similar globabl<br> +sequence editor CONSED; it will become obsolete when a similar global<br> view is added to CONSED.<br> <br> a. Installation<br> @@ -996,6 +1573,7 @@<br> the assembly -- i.e. the "confidently placed" reads. (See phrap.doc<br> for a description of LLR scores). Misjoins usually occur in places<br> where the reduced depth is 0 on both strands.<br> +<br> Regions of the contig where the depth is 0 on one or both strands are<br> indicated in red.<br> <br> @@ -1011,6 +1589,7 @@<br> duplicated reads in the same direction on the insert. Links will only<br> be indicated properly if the read naming convention assumed by phrap<br> is used (see section III.1 above).<br> +<br> Each match or link is considered to be in one of three classes,<br> indicated by color: "problems" (red), which indicate serious<br> discrepancies or a significant possibility of incorrect, incomplete,<br> @@ -1041,6 +1620,7 @@<br> the other region, so it is unlikely that there is a significant<br> misassembly in such cases, although it is possible that one of the two<br> matching reads may have been incorrectly placed.<br> +<br> Chimeras generally have one "ok" match (to the site where the read<br> was assembled) and one or more "problem" matches to the region from<br> which the other part of the chimera derives (sometimes these matches<br> @@ -1048,6 +1628,7 @@<br> LLR cutoff). Some chimeras actually represent subclones with internal<br> deletions; these tend to be obvious since the two pieces map to nearby<br> locations in some contig and are in the same orientation there.<br> +<br> Double-headed arrows on matches indicate that the matching regions<br> are on opposite strands; double-headed arrows on links indicate that<br> the orientation of the two reads is inconsistent with the read names<br> @@ -1074,6 +1655,7 @@<br> remains highlighted until the pointer is moved over another item, or a<br> contig line. Textual information about the highlighted item is shown<br> in a box in the top right region of the display.<br> +<br> Several parameters can be changed by entering new values in<br> appropriate boxes at the bottom of the display. Four different<br> magnifications can be changed: horizontal magnification; the "spacing<br> @@ -1092,6 +1674,7 @@<br> (or in the wrong orientation) are marked as problems. Min ss and max<br> ss are the minimum and maximum spacing between same-strand reads (i.e.<br> size of a walking step).<br> +<br> The display is updated to reflect changes made in the above<br> parameters only after a relevant button (Show Contig Matches, etc.) is<br> pressed.<br> @@ -1103,15 +1686,17 @@<br> <br> 3. Gap<br> <br> - A version of phrap ("gcphrap", for "gap-compatible phrap") suitable<br> + Versions of phrap and cross_match ("gcphrap", for "gap-compatible phrap",<br> +and "gccross_match" for "gap-compatible cross_match") suitable<br> for use with the Staden gap4 package may be created as follows: Obtain<br> relevant source files from the Staden web site<br> (http://www.mrc-lmb.cam.ac.uk/pubseq/index.html). Put these in the<br> -same directory as the phrap source files, and enter the command<br> +same directory as the phrap source files, and enter the commands<br> <br> > make gcphrap<br> +> make gccross_match<br> <br> - Consult the Staden web site for instructions on using gcphrap in<br> + Consult the Staden web site for instructions on using gcphrap and gccross_match in<br> conjunction with gap4, including input and output file formats. In<br> addition to the usual phrap command line options, gcphrap accepts an<br> option -exp, which is used to specify the name of a directory in which<br> @@ -1133,23 +1718,30 @@<br> but may be redirected to a file). This contains information<br> indicating the point in the run that phrap has reached, summary<br> results for some of the steps, and various warning and error messages.<br> +<br> (ii) The .contigs file. This is a FASTA file containing the contig<br> sequences (without pads; bases in this file are upper case if and only<br> if the quality is >= qual_show). These include singleton contigs<br> consisting of single reads with a match to some other contig, but that<br> couldn't be merged consistently with it.<br> +<br> (iii) The .contigs.qual file. This has the phrap-generated qualities<br> for the contig bases.<br> +<br> (iv) The .singlets file. A FASTA file containing the singlet reads<br> (i.e. the reads with no match to any other read).<br> +<br> (v) The .log file. This includes various diagnostic information, and<br> a summary of aspects of the assembly; it is probably only of interest<br> to me, for trouble-shooting purposes.<br> +<br> (vi) The .problems file. Again this is probably only of interest to<br> me.<br> +<br> (vii) The .ace file (produced when the options -new_ace or<br> -old_ace is used). This file is required for viewing the assembly<br> using consed. It's format is described in the consed documentation.<br> +<br> (vii) The .view file (produced when the option -view is<br> used). Required for viewing the assembly using phrapview.<br> <br> @@ -1233,6 +1825,7 @@<br> the apparent size of the deletion, location of the deletion and (in<br> parentheses) name of the matching (undeleted) read and extent of the<br> segment that was deleted from in it are given.<br> +<br> The probable deletions are not used in assembly and instead appear<br> as singleton contigs.<br> <br> @@ -1401,6 +1994,7 @@<br> consist entirely of quality 0 are reassigned quality values of -1,<br> since these usually correspond to the lowest quality data at the<br> extreme ends of reads. (Not done in the contig quality histogram.)<br> +<br> An 'N' in either a read or the contig sequence is always counted as a<br> substitution error.<br> <br> @@ -1452,6 +2046,7 @@<br> regions may be flagged as unspanned. Misassembly errors of omission<br> should always result in "UNSPANNED" matching regions at the end of one<br> or both contigs.<br> +<br> Regions of a non-singleton contig matching an anomalous (chimeric or<br> deleted) read are indicated only in the output for the singleton contig<br> that contains the anomalous read.<br> @@ -1487,10 +2082,11 @@<br> later) input file (the "subject" sequences); or, if a single input<br> file is provided, the matches between any two sequences in this file.<br> The matches that are reported are controlled by the command line<br> -options -minscore and -masklevel, as well as by the options that<br> +options -minscore, -masklevel, and -minmargin, as well as by the options that<br> control scoring of the alignments and the band search (see section IV<br> above). The reported matches are ordered by query, and for each query<br> by the position of the start of the alignment within the query.<br> +<br> For each reported match, an initial output line gives summary<br> information:<br> <br> @@ -1520,18 +2116,21 @@<br> <br> Following this line there appears (if the flag -discrep_lists is<br> specified) a listing of the discrepancies between the aligned regions<br> -of the two sequences, giving for each discrepancy its type, position,<br> -nucleotide, and quality (in the query sequence), and its position in<br> -second sequence. This list is followed (if -discrep_tables is<br> -specified) a table giving, for each quality value, the total number of<br> -bases of that quality in the first sequence; the number which are<br> -included in the SWAT alignment; the cumulative no. of aligned bases<br> -(for this and higher qualities); the number of bases not included in<br> -the alignment (but potentially alignable -- i.e. not lying before the<br> -beginning or after the end of the second sequence); the number of<br> -discrepancies of each type (substitution, deletion, insertion); the<br> -total # of discrepancies (for this quality level) and their percentage<br> -(of the aligned bases of the given quality), and the cumulative # of<br> +of the two sequences, giving for each discrepancy its type (S =<br> +substitution, I = insertion, D = deletion, E = end base (mismatching<br> +base immediately adjacent to aligned region -- these are printed only<br> +if the -output_bcdsites option is specified)), position, nucleotide,<br> +and quality (in the query sequence), and its position in second<br> +sequence. This list is followed (if -discrep_tables is specified) a<br> +table giving, for each quality value, the total number of bases of<br> +that quality in the first sequence; the number which are included in<br> +the SWAT alignment; the cumulative no. of aligned bases (for this and<br> +higher qualities); the number of bases not included in the alignment<br> +(but potentially alignable -- i.e. not lying before the beginning or<br> +after the end of the second sequence); the number of discrepancies of<br> +each type (substitution, deletion, insertion); the total # of<br> +discrepancies (for this quality level) and their percentage (of the<br> +aligned bases of the given quality), and the cumulative # of<br> discrepancies and their percentage (of the cumulative no. of aligned<br> bases). This information is useful for computing accuracy as a<br> function of quality, for automatically generated contig sequences.<br> @@ -1542,12 +2141,172 @@<br> An 'N' in either sequence is always counted as a substitution error.<br> If the flag -alignments is specified, a complete alignment for each<br> reported match is displayed.<br> + If the flag -score_hist is specified, a list of the scores of all<br> +matches involving the query is given, along with the number of times<br> +each score occurred.<br> <br> <br> +VIII. SPECIAL CONSIDERATIONS/PARAMETER MODIFICATIONS FOR PARTICULAR DATA TYPES<br> <br> -VIII. EST ASSEMBLIES<br> - (TO BE ADDED)<br> + <br> +For all: tag repeats (to reduce incidence of false joins). But don't<br> +need to use most sensitive detection -- only evolutionarily recent<br> +repeats (nearly identical in sequence) tend to cause problems. Repeatmasking. <br> +Need script to process cross_match output, insert matches into fasta file as<br> +tags (or phd files?)<br> +<br> +Also use quality values if possible.<br> +<br> +1. (TO BE ADDED) Shotgun assemblies<br> + Main problem: wide variation in data quality between sequencing labs.<br> + Different depth of coverage.<br> + Increase -forcelevel if contigs not going together.<br> + Reads unaligned to final sequence.<br> + Increase minmatch, decrease maxgap to break bad joins.<br> +<br> +2. (TO BE ADDED) Whole genome assemblies<br> + Can't predict memory needed. Repeat characteristics play crucial role.<br> + Increase minmatch.<br> + If contigs large (=> depth of coverage high), increase -node_seg and -node_space<br> + <br> +3. (TO BE ADDED) Assemblies of polymorphic reads from a single locus<br> + Increase minmatch, set maxmatch = minmatch.<br> <br> +4. (TO BE ADDED) EST assemblies<br> + Issues: polymorphisms, paralogous genes, alternatively spliced forms,<br> + uneven assembly depth, directional bias, large dataset size,<br> + artifactual clones, data from several sources with & without quality vals.<br> + Identify highly expressed genes, remove reads from assembly. Compare reads<br> + to contig sequences.<br> + depth of coverage high => increase -node_seg and -node_space<br> + Parameters: increase minmatch and set maxmatch = minmatch.<br> +<br> + Deep assemblies (3 and 4): node_seg etc.<br> +<br> +5. Comparisons of ESTs/cDNAs to genome.<br> +<br> + With long reads, use -near_minscore to increase sensitivity to<br> +detect short exonic matches. In this case, each match is reported<br> +separately, and alignments are (currently) not adjusted to reflect<br> +likely splice junctions.<br> +<br> + With long or short reads, use -spliced_word_gapsize and<br> +-spliced_word_gapsize2 to directly detect likely introns (in the case<br> +of long reads, -fuse_gap1 should also be set so that reported<br> +alignments can include more than one intron). Reported alignments and<br> +discrepancy lists (using -alignments and -discrep_lists) will then<br> +indicate precise location & size of the putative intron, and<br> +alignments report the bases at each end of the intron. Note that<br> +there can be a substantial speed penalty for detecting modest size<br> +introns using -spliced_word_gapsize, e.g. -spliced_word_gapsize 300<br> +(to detect introns of size <= 300) may approximately double running<br> +times. There is also an increased risk of 'noise' alignments due to<br> +the greater total number of potential alignments that are<br> +considered. In contrast, there is relatively little speed penalty with<br> +-spliced_word_gapsize2 because far fewer candidate splices are<br> +considered.<br> +<br> + When using -spliced_word_gapsize2 with short reads you may want to<br> +set -minscore fairly low, and/or spliced_match_left and<br> +spliced_match_right higher than the defaults, to increase the<br> +probability that a match is found in the vicinity of the true splice<br> +site.<br> +<br> +See next section for other parameter settings useful with short reads.<br> +<br> +6. Short read analyses (e.g. Solexa/Illumina data).<br> +<br> + Recommended parameter changes include:<br> +<br> + Smaller -minscore (e.g. 20 or 25, if the default score<br> +matrix/penalty values are being used). The default minscore value of<br> +30 requires at least 30 matching bases, which may be too high to<br> +reliably detect matches involving reads only slightly longer than 30<br> +bases (on the other hand, you should expect some false positive matches at<br> +the lower settings)<br> + -vector_bound 0 (for phrap; 0 is already the default for cross_match)<br> + -gap1_only or -bandwidth 2. Note that large indels in short reads<br> +cannot be detected anyway, at least with the default gap penalties;<br> +using a larger bandwidth value with -gap_ext 0 may provide some<br> +ability to detect larger indels (at the expense of additional 'noise'<br> +alignments). -gap1_only is in general significantly faster than<br> +-bandwidth 2, and is recommended with short reads. However it cannot<br> +detect indels of size larger than 1.<br> + -max_group_size 0 (for phrap assemblies where coverage depth is<br> +very uneven, e.g. ESTs)<br> + -globality 1 (for cross_match; requires that -gap1_only,<br> +-spliced_word_gapsize2 and/or -spliced_word_gapsize is also set).<br> +<br> + Note that with cross_match, using -masklevel 101 to report all<br> +matches can dramatically increase running times and memory<br> +requirements in cases where some queries match high-copy number<br> +repeats; it is generally preferable to use lower values (e.g. the<br> +default, or -masklevel 0), together with an appropriate value of<br> +-minmargin to control the score range of reported matches. Using<br> +-minmargin 1 or higher reduces the number of stored alignments, which<br> +can reduce running time and memory requirements. Note also that a<br> +histogram for each query, indicating all match scores meeting the<br> +-minscore threshold, can be obtained by setting -score_hist.<br> +<br> + Limiting reported alignments to those with high-confidence placements<br> +can be achieved by using -minmargin with positive values.<br> +<br> + If it is important to detect regions of highly biassed composition<br> +you may want to turn off complexity-adjustment of scores by setting<br> +the command line parameters -raw and/or -word_raw. In general I do<br> +not recommend using -raw, since it tends to greatly reduce specificity<br> +(increase the number of false positive matches) at a given score<br> +level; one can usually do a better job of increasing sensitivity to<br> +detect biassed composition regions while maintaining reasonable<br> +specificity simply by lowering -minscore and setting -word_raw. Note<br> +however that using -word_raw can incur a significant speed penalty.<br> +(Reducing -maxmatch, without setting -word_raw, has the effect of<br> +'partially' removing word complexity adjustment, which can provide a<br> +useful compromise for the speed/sensitivity tradeoff in some<br> +cases). Since exons are generally less likely to have highly biassed<br> +composition regions, it is usually preferable not to use -raw or<br> +-word_raw in RNASeq (cDNA or EST) searches.<br> +<br> +<br> + Reducing the value of -minmatch will also also increase sensitivity,<br> +as will reducing -gap1_minscore (when gap1_only is not already set).<br> +Note that this may substantially increase running time. <br> +<br> +In general, appropriate parameter settings for your analyses will<br> +depend on the characteristics of your data (e.g. genome complexity,<br> +read error rates) as well as your computer resources. I recommend<br> +experimenting on a subset of your data before committing to very long<br> +runs. For resequencing applications, a useful parameter for this<br> +purpose is -output_nonmatching_queries. For example, you can start by<br> +using the default parameter setting for -minmatch (but using<br> +adjustments for the other parameters as indicated above), and then try<br> +turning up the sensitivity using the nonmatching queries to see<br> +whether this recovers substantially more matches.<br> +<br> +7. "Resequencing" applications.<br> +<br> + A useful analysis mode is to run cross_match comparing a large set of<br> +reads (e.g. 2 million or so Solexa reads) in a single query file to a<br> +genomic sequence comprising the subject file(s), setting the<br> +parameters -discrep_lists and -output_nonmatching_queries along with<br> +any others that may be appropriate (e.g. those for short reads above).<br> +High quality discrepancies in the discrepancy lists will include<br> +substitution and small indel differences between the resequenced &<br> +original genomes. Larger scale differences, and contaminants, can<br> +often be identified by performing a phrap assembly of the nonmatching<br> +query reads, and comparing the contig sequences back to the original<br> +genome (using more sensitive parameter settings if desired) and/or to<br> +the nucleotide databases using the NCBI Blast server. The bcdsites<br> +parameters can be used to produce an output file indicating positions<br> +in the reference genome that are either confirmed by or have<br> +high-quality discrepancies with the input reads.<br> +<br> + When the option -alignments is used to display the aligned sequences,<br> +there is currently a speed advantage to having the subject<br> +(genomic) sequences split among multiple subject files (e.g. one file<br> +per chromosome) rather than all included in a single file.<br> +<br> +8. (TO BE ADDED) Merged base reads<br> <br> <br> IX. PROBLEMS<br> @@ -1564,7 +2323,8 @@<br> to have the operating system allocate additional virtual memory to<br> your process. (However if you are substantially exceeding physical RAM<br> the run may take an exceptionally long time to complete). On Unix<br> -systems this can be done using the "limit" command, as follows:<br> +systems you can often get access to additional memory using the<br> +"limit" command, as follows:<br> <br> i. Run "limit" with no arguments to list the system resources currently<br> available to you. For example (this output is for a DEC alpha<br> @@ -1609,10 +2369,15 @@<br> which increases datasize to 500 megabytes. Any value smaller than the<br> value returned in step ii can be used.<br> <br> -iv. Now rerun phrap. Note that the new datasize value will only apply<br> -to the particular shell (window) in which you execute the limit<br> -command, so you may need to repeat step iii next time you log in or if<br> -you switch to a different window.<br> +iv. Now try your phrap (or cross_match) run again. Note that the new<br> +datasize value will only apply to the particular shell (window) in<br> +which you execute the limit command, so you may need to repeat step<br> +iii next time you log in or if you switch to a different window.<br> +<br> +If you still get a FATAL ERROR: REQUESTED MEMORY UNAVAILABLE message<br> +after doing the above, then the operating system (not phrap!) is still<br> +limiting your access to virtual memory. You will need to consult with<br> +a local system administrator in this case.<br> <br> <br> 2. Other phrap- or cross_match-generated error messages<br> @@ -1646,7 +2411,7 @@<br> (TO BE ADDED)<br> <br> <br> -7. Reporting problems<br> +7. How to report problems<br> <br> If you have a problem that is not addressed by any of the above,<br> first be sure you have read sections I-IV of the documentation and<br> @@ -1740,6 +2505,7 @@<br> somewhat conservative in that it does not take into account<br> same-strand matches (which clearly should count for something,<br> although they certainly are not independent).<br> +<br> Contig positions are assigned quality values equal to the highest<br> adjusted quality of any read at that position, and then adjusted<br> downward to take into account any discrepancies with other reads. As a<br> @@ -1748,6 +2514,7 @@<br> interpretation as (conservative) error probabilities. Such error<br> probabilities provide an extremely useful guide to where editing or<br> additional data collection is needed.<br> +<br> The phrap adjusted qualities are used in computing "LLR scores" for<br> each pairwise match between two reads. These scores take into account<br> the qualities of the base calls in the reads, and are (approximate)<br> @@ -1772,7 +2539,7 @@<br> single letter is converted to 'N's; such regions are highly likely to<br> be of poor data quality which if not masked can lead to spurious<br> matches. Reads are then converted to uppercase (in order to allow<br> -case-insensitive word matches). All matching words of length at least<br> +case-insensitive nucleating perfect matches). All matching words of length at least<br> minmatch between any pair of sequences are then found, by (i)<br> constructing a list of pointers to each position (in each sequence)<br> that begins a word of at least minmatch letters not containing 'N' or<br> @@ -1793,6 +2560,7 @@<br> calculation) detection of multiple matching pieces in different<br> locations, and will usually find most copies of repeats (since they<br> generally occur in separate bands).<br> +<br> By default, SWAT scores are complexity-adjusted (so that matches for<br> which the collection of matching nucleotides has biassed composition<br> have their scores significantly penalized.)<br> @@ -1813,7 +2581,7 @@<br> accurate calls, lower case for less accurate calls) and setting the<br> penalties appropriately; this would involve using the quality levels<br> to adjust the symbols used in the reads, which should be done AFTER<br> -the word matching routine. (At present it does not seem particularly<br> +the nucleating perfect matching routine. (At present it does not seem particularly<br> useful to use differing positive scores for different nucleotides to<br> reflect their different frequencies).<br> <br> </body> </html>

Personal tools
Namespaces
Variants
Actions
Site
Choi lab
Resources
Toolbox