Phrap input v1.090518

From CSBLwiki

Jump to: navigation, search

III. INPUT FILES

Input files to cross_match and phrap are of two types: sequence files (required), and quality files (optional, but strongly recommended for phrap). With cross_match, the query file can optionally be a CALF formatted file of unaligned reads (see http://www.phrap.org/phredphrap/calf.pdf for a description of this format), which must have a name ending in ".calf". If you are doing phrap assemblies of reads generated by the basecalling program phred, then it is recommended that you generate these input files automatically using the phd2fasta or phredPhrap scripts distributed with phred and consed (see II.1.i above). Even in this case however you should read the following section on the read naming convention, since you will either need to make sure that the read names attached to your chromatogram files adhere to the naming convention, or you will need to modify the phredPhrap script (or create your own script) so that correct template names and read orientation information is appended to the .phd files after they are produced by phred.


1. Read naming convention (phrap only)

In addition to the sequence and quality data, phrap needs to know three things for each read:

(i) the name of the subclone (or other template) from which the read is derived; this is used, for example, in checking for chimeric subclones (for which purpose it is necessary to know when two reads are from the same subclone so that they are not regarded as independently confirming each other) and for certain other data anomalies;

(ii) the orientation of the read (forward or reverse) within the subclone, in cases where data is acquired from each end of a subclone insert (at present this information is used only for consistency checking following assembly, and not in the assembly itself, but this will change in future versions);

(iii) the chemistry used to generate the read (which influences phrap's decisions as to how to treat discrepancies between potentially overlapping reads, and as to how to adjust qualities of reads which confirm each other -- confirmation of a read by another read with different chemistry counts as significantly as confirmation by an opposite-strand read).

The above information is generally conveyed via the read names using the "St. Louis" read naming convention described below. If your laboratory uses a different naming system, you have several options:

(i) provide the relevant information in the description field of the .fasta file instead (see below), or (if you are using the phd2Fasta script) as tags in the .phd files for each read;

(ii) see whether appropriate values of the command line options -subclone_delim and -n_delim (described below) can be used to bring your own naming convention into conformity with the St. Louis convention -- if your naming convention is such that the subclone name always occurs as the first part of the read name, and the end of the subclone name occurs at the first (or n-th for some fixed value of n) occurrence of a particular character, then you may be able to use these options without having to change your read names;

(iii) create a script which translates your read names into St. Louis form (this does not mean that you cannot retain the original names as well -- for example, for purposes of running phrap you could create a directory of "links" to your chromatogram files, such that the link names adhere to the naming convention, and then run the phredPhrap script on this directory);

(iv) alter the subroutines in the source module "names.c" in order to parse your read names correctly; or

(v) ignore the issue and hope that it does not cause problems. This may have unpredictable adverse effects on assembly and is NOT recommended.

The "St. Louis" read naming convention assumed by phrap is as follows:

The portion of the read name up to the first '.', if any, is the name of the subclone from which the read is derived (i.e. the DNA template used in the sequencing reaction). If desired, the command line option -subclone_delim can be used to change the character which delimits the end of the subclone name to something other than '.', and/or the option -n_delim may be used to indicate that the subclone name contains a fixed number of occurrences of this character within it. See section IV below for a description of command line options.

The orientation of the read within the subclone, and the chemistry, are indicated by the first letter following the '.', as follows:

"s" forward direction read on single stranded (SS) template, dye primer chemistry
"f" forward read on double stranded (DS) template, dye primer chemistry
"r" DS reverse read, dye primer chemistry
"x" SS forward read, standard dye terminator chemistry
"z" DS forward read, standard dye terminator chemistry
"y" DS reverse read, standard dye terminator chemistry
"i" SS forward read, big dye terminator chemistry
"b" DS forward read, big dye terminator chemistry
"g" DS reverse read, big dye terminator chemistry
"t" for T7  (cDNAs)
"p" for SP6 (cDNAs)
"e" for T3  (cDNAs)
"d" for special
"c" consensus pieces
"a" assembly pieces

The remainder of the name is ignored. If the read name ends in ".seq" the .seq is removed.

Example: the name BAC112_a11b3.x3_pg indicates a forward read with (old) dye terminator chemistry from the subclone BAC112_a11b3.


2. Sequence files.

With both phrap and cross_match, either a single sequence file or multiple sequence files can be specified (N.B. WITH CURRENT VERSION OF PHRAP ONLY A SINGLE SEQUENCE FILE SHOULD BE SPECIFIED). If a single file is specified, all sequences in it are compared to each other; if more than one file is specified, sequences in the first file are compared to sequences in the second (and subsequent) files. Typically phrap is used with a single input sequence file containing the reads, unless one wants to assemble one set of reads "against" another sequence or set of sequences (see below). Cross_match is typically used with two or more input sequence files, the first containing the "query" sequences and the remaining files containing the "subject" sequences, but it may also be used to compare the sequences in a single input file to each other.

Cross_match may be used with either DNA or protein sequences (mixed types are not allowed, i.e. all input sequences must be DNA, or all must be protein). In the case of DNA sequences, the query file (but not subject files) may include 'merged base' DNA sequences produced by phred from mixed signal DNA sequencing traces (see below).

Phrap is used only with ordinary DNA sequences as input. It is designed to be able to use all of each read sequence in the assembly, not just the trimmed (highest quality) part, so the full sequence of each read should be provided when available. If phred-generated qualities are not available however then the default quality parameter -default_qual should be set appropriately (section IV).

Sequence files must be in FASTA format: (i) Each sequence entry has a single header line as follows: The first character is '>'. This is followed immediately (i.e. with no intervening spaces or other characters) by the sequence name, and then (optionally) by a space or spaces, followed by (on the same line) descriptive information about the sequence. (ii) The header line is followed by a separate line or lines containing the sequence. The sequence may all be placed on a single line, or split among several lines (of arbitrary length).

Very large amounts of input data are permitted: Individual sequences must be 2 billion characters or less (the same limit applies to the name and descriptive information for each sequence). The combined total size of all query sequences is, by default, limited to about 1 billion characters; however, this can be increased by changing the line

typedef int SEQ_AREA;

in the header file swat.h to either

typedef unsigned int SEQ_AREA;

(which will increase the limit to 2 billion characters), or

typedef long int SEQ_AREA;

(which on most computers will increase it to about 10^18 -- at the cost of increased memory and running time), and recompiling. There is no combined total length constraint for subject sequences. The 2 billion residue length constraint for individual sequences (both query and subject) cannot be increased.


Descriptive information on the header line may optionally be used to indicate template (subclone), read orientation, read chemistry and dye, and repeats. This information is written back out to the .ace and .singlets files. If present, this information overrides whatever would be implied by the read name. The template name is indicated by including the word "TEMPLATE:", followed by a space, followed by the name of the template. Orientation is indicated by the word "DIRECTION:", followed by the word "fwd" or "rev". Chemistry is indicated by the word "CHEM:", followed by the word "prim" (for dye primer), "term" for dye terminator, or "unknown". Dye is indicated by the word "DYE:" followed by "rhod", "big", "ET", "d-rhod", or "unknown". The chemistry and dye information is provided automatically by newer versions of phred (version 0.980904.a or later), which obtain it from the chromatogram file. (Use the script phd2fasta distributed with phred and consed to create the .fasta file from the .phd files created by phred).

In the query file, repeated sequence may be indicated by the word "REPEAT:" followed by a pair of integers indicating the start and end of the repeat (these should be separated by spaces). There can be multiple "REPEAT:" tags for a single sequence. If the parameter -repeat_screen (see section IV.2 below) is set to 1 or 3, the sequence within any such repeat is ignored for the purpose of finding word matches between sequences. This can substantially improve the speed and (for phrap) accuracy of assembly of repeat-rich regions.

Example:

>read#5 DIRECTION: rev CHEM: prim DYE: ET TEMPLATE: a23f1 REPEAT: 151 237 REPEAT: 305 422 any_other_information GAAAGATCTCATTGATCACTCTATTCAAGTGGGAGTCTCCGGTCTTT ATGATCGATTTGTGAATCTTCGTATCAAAGTTGGAGCTGACAAGTATCCA TTGCTTGCGAAATGGGCTCAAATTTTCACTCAGGGAGTCGTCTTCGATCC TTCAAGAATTCATCAATGTGCTCAAAAGTTGGCTGGAGAAGCTCGTGATC GGAAGAGAGATGGATGTACTGTGGCATCAACTGCAGTAGCTTCAATGGTT TATGGAAAGAGTATGTTATTt

In subject file(s), repeats are instead indicated by using lower-case letters for the residues.

Non-alphabetic characters (including '*', and digits) in the sequence lines are automatically stripped out when the file is read in, except for '-' and '.' in DNA files, which are both converted to 'N'; '>' must not appear anywhere within the sequence since it is assumed to start the header of a new sequence (even if it is not at the beginning of a line).

Lower case letters are converted to upper case on readin, with the following important exceptions:

(i) if -repeat_screen (see section IV.2 below) is set to 2 or 3, segments of subject file sequences consisting entirely of lower case letters are assumed to represent repeats, and nucleating perfect matches falling entirely within them (or spliced matches with either splice junction falling within them) are ignored;

(ii) the query file may include 'merged base' DNA sequence reads produced by phred from mixed signal DNA sequencing traces. Such reads indicate at each position the two strongest peaks detected in the signal, and use standard ambiguity codes together with upper and lower case to indicate these peaks, as follows:

Merged base read symbols:

A,C,G,T (single detected peak)

   stronger peak    weaker peak

M A C m C A K G T k T G R A G r G A Y C T y T C S C G s G C W A T w T A

N no information

For effective searches with merged base reads an appropriate score matrix should be provided using the -matrix option (see IV.1 below); the matrix "mb_matrix" provided with the distribution has been found to be useful for this purpose. Note that with this score matrix the default values of -minscore, -near_minscore, -gap1_dropoff and -gap1_minscore are too small, and should be changed on the command line.

3. Quality files.

 For each input sequence file in FASTA format, you may optionally

include a corresponding file of data quality information. This is strongly recommended for phrap runs since it greatly improves the accuracy of assembly and of the consensus sequence; with cross_match, the qualities are at present used only for purposes of annotating and tabulating discrepancies and not in the alignment scoring. The name of this file must consist of the name of the FASTA file, with ".qual" appended.

The format of the .qual file is similar to that of the corresponding

FASTA file. For each read there should appear a header line identical (except possibly for the "description" field) to that in the FASTA file. This is followed by one or more lines giving the qualities for each base. Quality values should be integers >= 0 and <= 99, and should be separated by spaces. No other (non-digit, non-white space) characters should appear. The total number of quality values for each read must match the number of bases for that read in the FASTA file. Also, the orders of the reads in the FASTA and corresponding .qual file must be identical. N's are automatically assigned quality 0, overriding any value present in the .qual file [??]; however X's retain whatever value is present in the .qual file.

 If provided, quality information is used by phrap both in the

assembly itself (where discrepancies between high quality bases are used to discriminate repeats from true overlaps) and in determining the contig consensus sequence (which is formed as a mosaic of the highest quality parts of the reads). Because phrap generates its own quality measures (based on read-read confirmation) it performs reasonably well even if no input quality is provided; however when it is available, it is important to provide input data quality information since this can substantially increase the accuracy of the consensus, particularly in regions where there are strand-specific effects on quality (e.g. compressions) -- in such cases the input quality information may constitute the only basis for choosing one strand over the other.

 If your sequencing data is from automated sequencing machines it is

strongly recommended that you generate the read sequences using the basecalling program phred, which will produce an appropriate quality file for input to phrap. On the basis of the trace characteristics, phred computes a probability p of an error in the base call at each position, and converts this to a quality value q using the transformation q = -10 log_10(p). Thus a quality of 30 corresponds to an error probability of 1 / 1000, a quality of 40 to an error probability of 1 / 10000, etc.

 The quality value 99 is reserved for base calls that have been

visually inspected and verified as "highly accurate" (during editing), and 98 for bases that have been edited but are not highly accurate (these are converted to quality 0 in phrap). If two reads appear to match but have discrepancies involving bases that have quality 99 in both reads, phrap does not allow them to be merged during assembly. At present this is the only way to break false joins made by phrap.

 If you wish to generate your own quality values you should be

prepared to experiment. It is particularly important that possible insertion errors receive low quality, because in choosing the "consensus" phrap tends to favor the reads with the highest total quality in a given region. Also, if at all possible you should use quality values that are defined in terms of error probabilities in the manner described above, since to some extent phrap is tuned to expect these (and its output qualities will then have a similar interpretation).

If all input quality values are relatively small (less than 15),

phrap assumes that they do not correspond to error probabilities and attempts to rescale them so that the largest quality value is approximately 30. This allows different input scales to be used.


4. Vector screening

Following creation of a FASTA file with the raw read sequences, it is

important to remove or screen out sequencing (subclone) vector sequence before running phrap. (It is unnecessary to remove the cloning (e.g. cosmid or BAC) vector, unless the inserts from multiple overlapping clones are being assembled simultaneously; however it is generally useful to remove at least the central part of the cloning vector, since this then provides a natural point at which phrap can break the circular sequence into a linear one). Unremoved sequencing vector may cause reads to be identified as "possible chimeras", or otherwise interfere with proper assembly. Some vector removal programs may be unreliable at identifying vector sequences that are rearranged or found in the lower quality part of the read, and I recommend that you screen for sequencing vector using cross_match as described here, whether or not you have already screened them using another program.

To carry out the screening, first create a FASTA file, say "vector", with all of the vector sequences you want to screen for (I use one that contains all vector sequences used in any of the ongoing sequencing projects, in case the vector is misidentified in the current project). If your read file is named "C05D11.reads" (for example), then run the command

> cross_match C05D11.reads vector -minmatch 10 -minscore 20 -screen > screen.out

(This uses somewhat more sensitive parameter settings than the cross_match defaults). The '-screen' option causes a file named "C05D11.reads.screen" to be created, containing "vector masked" versions of the original sequences: i.e. any region that matches any part of a vector sequence is replaced by X's. This ".screen" file is what should be provided as input to phrap. The output from cross_match (which has been redirected to the file screen.out in the above example) lists the matches that were found (see below).

N.B. If you have created a .qual file for your reads (say C05D11.reads.qual), be sure to rename or copy it to C05D11.reads.screen.qual so that it will be recognized by phrap when C05D11.reads.screen is used as the input FASTA file.

The phredPhrap script automatically carries out the above steps (i.e. vector screening using cross_match, and renaming of the .qual file); however it is your responsibility to make sure the vector sequence file includes vector sequences appropriate for your cloning and sequencing vectors.

Personal tools
Namespaces
Variants
Actions
Site
Choi lab
Resources
Toolbox