Evolutionary age
From CSBLwiki
Revision as of 13:43, 16 August 2010
Evolutionary age of protein domains
(Based on this reference: PMID 16959887)
Data
Pfam release 24.0 (October 2009)
- The Pfam database (http://pfam.sanger.ac.uk)
  - FTP: current_release (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/)
- User manual (the Pfam format)
- Tip: use the MySQL dump, which makes the content easy to handle
    # download the whole DB (estimated ~2 days)
    wget -c ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database.tar &
- Loading into MySQL
- Not all tables were imported (only a few key tables for this study)
- pfamseq, genome_seqs (not available), ncbi_taxonomy, genome_species (not available), pfamA, gene_ontology
- Release 24: pfamseq, pfamA, ncbi_taxonomy, taxonomy, pfamA_reg_full_significant
- in_full should be 1
The pfamA_reg_full_significant and pfamA_reg_full_insignificant tables contain, as the names suggest, the significant and insignificant data respectively. Significant hits are those with a bits score above the curated threshold for the family, whilst insignificant matches are those that score below the curated threshold. With respect to the tables that contain significant data (pfamA_reg_full_significant and pfamA_reg_full), there is an extra column called 'in_full'. The matches that are present in the full alignment for a Pfam family have this column set to 1, while those that are not present in the full alignment have the 'in_full' column set to 0. Where there is an overlapping fragment match and a full length match to the same Pfam-A family, only one of the matches will be present in the full alignment for that Pfam-A family.
    mysql -u user -p
    mysql> create DATABASE pfam24;
    mysql> \q
    mysql -u user -p pfam24 < FULL_PATH/pfamseq.sql
    mysql -u user -p
    mysql> use pfam24
    mysql> load data local infile 'pfamseq.txt' into table pfamseq FIELDS ENCLOSED BY "\'";
- When loading into MySQL, create the tables in order; otherwise errors occur (related to key/index links to tables that have not been created yet)
- Loading time: a few hours
- Loading a data file as a background job
mysql -u user -p'passwd' database < loadscript.sql &
- Using R to analyze the taxonomic distribution
- See which fields should be used to link the tables...
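The loading steps above (one bulk load per table, then running the script as a background job) could be scripted along these lines. This is only a sketch: the `<table>.txt` file naming, the table order, and the function name `build_load_script` are assumptions for illustration, not part of the Pfam distribution.

```python
from pathlib import Path

# Tables listed in the notes for release 24. The order is an assumption:
# it should follow the order in which the tables were created.
TABLES = ["pfamA", "pfamseq", "ncbi_taxonomy", "taxonomy", "pfamA_reg_full_significant"]


def build_load_script(tables, data_dir):
    """Return SQL text with one LOAD DATA statement per table, in the given order."""
    statements = []
    for table in tables:
        # Assumes each dump is a text file named <table>.txt inside data_dir.
        data_file = Path(data_dir) / f"{table}.txt"
        statements.append(
            f"LOAD DATA LOCAL INFILE '{data_file.as_posix()}' "
            f"INTO TABLE {table} FIELDS ENCLOSED BY '\\'';"
        )
    return "\n".join(statements) + "\n"


if __name__ == "__main__":
    # Redirect this output to loadscript.sql, then run it in the background:
    #   mysql -u user -p'passwd' pfam24 < loadscript.sql &
    print(build_load_script(TABLES, "FULL_PATH"), end="")
```

Keeping the table list in one place makes it easy to reorder the loads if MySQL reports key-related errors.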
Tree data
- Silva: SSU rRNA database
Results
- Species in Pfam (NCBI taxonomy): 148,925 species
- Genus (taxonomy): 31,949
- taxonomy: does it follow the browsing order shown on the website?
- MySQL query for the taxonomic distribution retrieval
- pfamseq, pfamA, pfamA_reg_full_significant
    # pfamA list
    SELECT DISTINCT pfamA_id, auto_pfamA FROM pfamA;
    # protein sequences of each auto_pfamA (pfamA_id)
    SELECT auto_pfamseq FROM pfamA_reg_full_significant WHERE auto_pfamA = '';
    # taxonomic distribution
    SELECT DISTINCT species, taxonomy, ncbi_code FROM pfamseq WHERE auto_pfamseq = 'auto_pfamseq';
or
    SELECT DISTINCT species, taxonomy, ncbi_code
    FROM pfamseq seq, pfamA pf, pfamA_reg_full_significant sig
    WHERE pf.pfamA_id='PF...'
    AND sig.auto_pfamA=pf.auto_pfamA
    AND seq.auto_pfamseq=sig.auto_pfamseq;
- Now you have a list of Pfam families with their taxonomic distributions (ready to map)
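To check the join logic without a full Pfam installation, the three-table query can be tried against a toy in-memory database. The sketch below uses SQLite as a stand-in for MySQL, keeps only the columns the query touches, and fills them with invented rows (`toy_domain_A`, `species_a`, and the numeric codes are placeholders, not Pfam data). The `in_full = 1` condition is an addition that follows the earlier note on `in_full`.

```python
import sqlite3

# Same join as the three-table query, written with explicit JOINs and a
# bound parameter. The in_full = 1 filter is an added assumption.
QUERY = """
SELECT DISTINCT seq.species, seq.taxonomy, seq.ncbi_code
FROM pfamseq AS seq
JOIN pfamA_reg_full_significant AS sig ON seq.auto_pfamseq = sig.auto_pfamseq
JOIN pfamA AS pf ON sig.auto_pfamA = pf.auto_pfamA
WHERE pf.pfamA_id = ? AND sig.in_full = 1
ORDER BY seq.species
"""


def make_toy_db():
    """Build an in-memory stand-in holding only the columns used by the query."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE pfamA (auto_pfamA INTEGER, pfamA_id TEXT);
        CREATE TABLE pfamseq (auto_pfamseq INTEGER, species TEXT,
                              taxonomy TEXT, ncbi_code INTEGER);
        CREATE TABLE pfamA_reg_full_significant (auto_pfamA INTEGER,
                                                 auto_pfamseq INTEGER,
                                                 in_full INTEGER);
    """)
    # Invented rows, for illustration only.
    con.executemany("INSERT INTO pfamA VALUES (?, ?)",
                    [(1, "toy_domain_A"), (2, "toy_domain_B")])
    con.executemany("INSERT INTO pfamseq VALUES (?, ?, ?, ?)", [
        (10, "species_a", "Bacteria; genus_a", 1001),
        (11, "species_b", "Eukaryota; genus_b", 1002),
        (12, "species_a", "Bacteria; genus_a", 1001),
        (13, "species_c", "Archaea; genus_c", 1003),
    ])
    con.executemany("INSERT INTO pfamA_reg_full_significant VALUES (?, ?, ?)", [
        (1, 10, 1), (1, 11, 1), (1, 12, 1),
        (1, 13, 0),  # significant hit that is not in the full alignment
        (2, 13, 1),
    ])
    return con


def taxonomic_distribution(con, pfam_id):
    """Return distinct (species, taxonomy, ncbi_code) rows for one family."""
    return con.execute(QUERY, (pfam_id,)).fetchall()


if __name__ == "__main__":
    for row in taxonomic_distribution(make_toy_db(), "toy_domain_A"):
        print(row)
```

In the toy data, two sequences from `species_a` collapse into one row because of `DISTINCT`, and the `species_c` hit is dropped for `toy_domain_A` because its `in_full` value is 0.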
Building the universal phylogenetic tree
Procedure
- Extract the taxonomic origins of each protein
- Get the Taxonomy info. of each origin
- Non-redundant (NR) set of taxonomic origins
- Collect all Small Subunit (SSU) rRNA sequences of the NR set
- Build the universal tree of life using SSU sequences
- Map each protein belonging to a given domain onto the universal tree
- Check which node is the most recent common ancestor (MRCA) node
- Calculate the branch length between MRCA and LCA (last common ancestor)
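The last two steps of the procedure can be illustrated with a simplified stand-in. Instead of an SSU rRNA tree, the sketch below uses taxonomy lineage strings as root-to-leaf paths, takes the longest shared prefix as the MRCA, and sums hypothetical branch lengths from the root down to it. It assumes lineages are semicolon-separated strings, reads "LCA" as the root of the universal tree, and uses invented branch lengths; all function names are illustrative.

```python
def parse_lineage(taxonomy):
    """Split a lineage string such as 'Bacteria; Proteobacteria.' into ranks."""
    cleaned = taxonomy.strip().rstrip(".")
    return [rank.strip() for rank in cleaned.split(";") if rank.strip()]


def mrca_lineage(lineages):
    """Return the longest shared prefix of the lineages (root-to-MRCA path)."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        # Stop at the first rank where the lineages disagree.
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def depth_from_root(path, branch_length, root="root"):
    """Sum branch lengths along path; branch_length maps (parent, child) to a length."""
    nodes = [root] + list(path)
    return sum(branch_length[(parent, child)]
               for parent, child in zip(nodes, nodes[1:]))


if __name__ == "__main__":
    lineages = [
        parse_lineage("Bacteria; Proteobacteria; Gammaproteobacteria."),
        parse_lineage("Bacteria; Proteobacteria; Alphaproteobacteria."),
    ]
    # Hypothetical branch lengths; a real analysis would read them from the SSU tree.
    lengths = {("root", "Bacteria"): 0.5, ("Bacteria", "Proteobacteria"): 0.25}
    path = mrca_lineage(lineages)
    print(path, depth_from_root(path, lengths))
```

With a real tree, the prefix comparison would be replaced by an MRCA lookup on the tree itself, but the idea is the same: a domain found only within one clade maps to a deeper (younger) node than a domain shared across all three superkingdoms.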