Evolutionary age
From CSBLwiki
Latest revision as of 10:04, 24 August 2010
Evolutionary age of protein domains
(Based on the reference PMID 16959887)
Data
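Pfam release 24.0 (October 2009)
- The Pfam database (http://pfam.sanger.ac.uk)
  - FTP: current_release (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/)
  - User manual describing the Pfam format (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/userman.txt)
  - Tip: use the MySQL dump, since its content is easier to handle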
    # download the full database dump (estimated ~2 days)
    wget -c ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database.tar &
- Loading into MySQL
  - Not all tables were imported (only a few key tables for this study)
  - pfamseq, genome_seqs (not available), ncbi_taxonomy, genome_species (not available), pfamA, gene_ontology
  - Release 24: pfamseq, pfamA, ncbi_taxonomy, taxonomy, pfamA_reg_full_significant
    - in_full should be "1"
The pfamA_reg_full_significant and pfamA_reg_full_insignificant tables contain, as the names suggest, the significant and insignificant data respectively. Significant hits are those with a bits score above the curated threshold for the family, whilst insignificant matches are those that score below the curated threshold. With respect to the tables that contain significant data (pfamA_reg_full_significant and pfamA_reg_full), there is an extra column called 'in_full'. The matches that are present in the full alignment for a Pfam family have this column set to 1, while those that are not present in the full alignment have the 'in_full' column set to 0. Where there is an overlapping fragment match and a full length match to the same Pfam-A family, only one of the matches will be present in the full alignment for that Pfam-A family.
    mysql -u user -p
    mysql> CREATE DATABASE pfam24; \q
    mysql -u user -p pfam24 < FULL_PATH/pfamseq.sql
    mysql -u user -p
    mysql> use pfam24
    mysql> LOAD DATA LOCAL INFILE 'pfamseq.txt' INTO TABLE pfamseq FIELDS ENCLOSED BY "\'";
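The in_full rule quoted from the Pfam notes above can be tried on a toy table. This is only a minimal sketch: it assumes Python's built-in sqlite3 as a stand-in for the MySQL server, a reduced pfamA_reg_full_significant table with three columns, and made-up rows. The helper name `sequences_in_full` is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the pfam24 MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pfamA_reg_full_significant ("
    "auto_pfamseq INTEGER, auto_pfamA INTEGER, in_full INTEGER)"
)

# Made-up rows: (sequence key, family key, in_full flag).
rows = [(1, 10, 1), (2, 10, 0), (3, 10, 1), (4, 20, 1)]
conn.executemany("INSERT INTO pfamA_reg_full_significant VALUES (?, ?, ?)", rows)


def sequences_in_full(conn, auto_pfamA):
    """Return the sequence keys of one family whose matches have in_full = 1."""
    cur = conn.execute(
        "SELECT DISTINCT auto_pfamseq FROM pfamA_reg_full_significant "
        "WHERE auto_pfamA = ? AND in_full = 1 ORDER BY auto_pfamseq",
        (auto_pfamA,),
    )
    return [row[0] for row in cur.fetchall()]


print(sequences_in_full(conn, 10))
```

Against the real database the same WHERE clause (`in_full = 1`) would presumably be added to the queries in the Results section.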
MySQL
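- When loading into MySQL, create the tables in the order in which they are defined; otherwise errors occur because a foreign key refers to the index of a table that has not been created yet (see the MySQL manual on InnoDB foreign key constraints: http://dev.mysql.com/doc/refman/5.1/en/innodb-foreign-key-constraints.html)
  - Loading time: a few hours
- Load the data file as a background job
- Tune the database parameters; this is critical for processing the calculations
  - innodb_buffer_pool_size
  - query_cache
  - query_buffer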
mysql -u user -p'passwd' database < loadscript.sql &
R
- Use R to analyze the taxonomic distribution
  - Check which fields should be used to link the tables...
Tree data
- Silva (http://www.arb-silva.de/): SSU rRNA database
Results
- Species in Pfam (NCBI taxonomy): 148,925 species
- Genera (taxonomy): 31,949
  - taxonomy: browsing order on the web?
- MySQL queries for retrieving the taxonomic distribution
  - Tables used: pfamseq, pfamA, pfamA_reg_full_significant
    # pfamA list
    SELECT DISTINCT pfamA_id, auto_pfamA FROM pfamA;
    # protein sequences of each auto_pfamA (pfamA_id)
    SELECT auto_pfamseq FROM pfamA_reg_full_significant WHERE auto_pfamA = '';
    # taxonomic distribution
    SELECT DISTINCT species, taxonomy, ncbi_code FROM pfamseq WHERE auto_pfamseq = 'auto_pfamseq';

or, as a single join:

    SELECT DISTINCT species, taxonomy, ncbi_code
    FROM pfamseq seq, pfamA pf, pfamA_reg_full_significant sig
    WHERE pf.pfamA_id = 'PF...'
      AND sig.auto_pfamA = pf.auto_pfamA
      AND seq.auto_pfamseq = sig.auto_pfamseq;
- You now have a list of Pfam families with their taxonomic distributions (ready to map)
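The join above returns one row per distinct (species, taxonomy, ncbi_code). Collecting the ncbi_code values per family gives the distribution that is later mapped onto the tree. The following is a minimal sketch under stated assumptions: sqlite3 stands in for MySQL, the three tables are reduced to the columns used here, the rows and the family identifiers `famA` and `famB` are made up, and the `in_full = 1` filter follows the note in the Data section.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite database with reduced, made-up versions of the Pfam tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pfamA (auto_pfamA INTEGER, pfamA_id TEXT);
CREATE TABLE pfamseq (auto_pfamseq INTEGER, species TEXT, taxonomy TEXT, ncbi_code INTEGER);
CREATE TABLE pfamA_reg_full_significant (auto_pfamseq INTEGER, auto_pfamA INTEGER, in_full INTEGER);
""")
conn.executemany("INSERT INTO pfamA VALUES (?, ?)", [(10, "famA"), (20, "famB")])
conn.executemany("INSERT INTO pfamseq VALUES (?, ?, ?, ?)", [
    (1, "Homo sapiens", "Eukaryota; Metazoa;", 9606),
    (2, "Escherichia coli", "Bacteria; Proteobacteria;", 562),
    (3, "Homo sapiens", "Eukaryota; Metazoa;", 9606),
])
conn.executemany("INSERT INTO pfamA_reg_full_significant VALUES (?, ?, ?)", [
    (1, 10, 1), (2, 10, 1), (3, 10, 1), (3, 20, 1), (2, 20, 0),
])

QUERY = """
SELECT DISTINCT pf.pfamA_id, seq.ncbi_code
FROM pfamseq seq, pfamA pf, pfamA_reg_full_significant sig
WHERE sig.auto_pfamA = pf.auto_pfamA
  AND seq.auto_pfamseq = sig.auto_pfamseq
  AND sig.in_full = 1
"""


def taxonomic_distribution(conn):
    """Map each family identifier to the sorted list of taxids it occurs in."""
    dist = defaultdict(set)
    for pfam_id, ncbi_code in conn.execute(QUERY):
        dist[pfam_id].add(ncbi_code)
    return {pfam_id: sorted(codes) for pfam_id, codes in dist.items()}


print(taxonomic_distribution(conn))
```

Running the join once for all families, rather than once per family, is a design choice of this sketch; the per-family form shown in the notes works the same way.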
Building the universal phylogenetic tree
- ncbitaxa.csv
  - total: 148925 (1 for column name)
  - non-blank: 148203
  - blank: 722
  - Eukaryota 106462
  - other sequences 191
  - Viruses 24643
  - 'unclassified' 1
  - Archaea 764
  - Bacteria 16102
  - unclassified sequences 40
  - max rank (in Archaea, Bacteria, Eukaryota): 22
  - ncbi_rank_taxa
    CREATE TABLE ncbitaxa (SN INT, ncbi_taxid INT, species TEXT, tax_rank TEXT);
    LOAD DATA LOCAL INFILE 'ncbitaxa.csv' INTO TABLE ncbitaxa FIELDS TERMINATED BY ',' ENCLOSED BY '"' IGNORE 1 LINES;
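The per-group counts listed above can be reproduced by tallying the first element of each lineage. This sketch assumes that the tax_rank column holds a semicolon-separated lineage with the top-level group first, and that blank entries have an empty tax_rank; both are assumptions about ncbitaxa.csv, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

# Made-up sample in the assumed layout of ncbitaxa.csv (header line + rows).
SAMPLE_CSV = '''"SN","ncbi_taxid","species","tax_rank"
1,1,"species A","Eukaryota; Metazoa; Chordata"
2,2,"species B","Bacteria; Proteobacteria"
3,3,"species C","Archaea; Crenarchaeota"
4,4,"species D",""
'''


def count_top_level(handle):
    """Count rows per top-level group and return the deepest lineage length."""
    counts = Counter()
    max_depth = 0
    for row in csv.DictReader(handle):
        lineage = [part.strip() for part in row["tax_rank"].split(";") if part.strip()]
        if not lineage:
            counts["(blank)"] += 1
            continue
        counts[lineage[0]] += 1
        max_depth = max(max_depth, len(lineage))
    return counts, max_depth


counts, max_depth = count_top_level(io.StringIO(SAMPLE_CSV))
print(dict(counts), max_depth)
```

For the real file, `open("ncbitaxa.csv", newline="")` would replace the `io.StringIO` handle.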
Silva
- SSURef_102_SILVA_NR_99.header (in /home/gnusnah/db/silva): 262,092 lines
- Mapping the SILVA NR set to NCBI taxids (GenBank): "ACC_NCBI-taxon.txt"
    ## R
    > tmp = scan("ACC_NCBI-taxon.txt",sep="\t",what="character")
    Read 517708 items
    > length(tmp)/2
    [1] 258854
    > tmp1 = matrix(tmp,length(tmp)/2,2,byrow=T)
    > tmp1[1:2,]
         [,1]     [,2]
    [1,] "A16379" "730"
    [2,] "A27627" "480"
    > length(table(tmp1[,2]))
    [1] 63916
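The R session above reads accession/taxid pairs and counts the distinct taxids. The same bookkeeping can be sketched in Python, assuming one tab-separated pair per line. The first two sample rows are the ones shown in the R output; the third accession (`X00001`) is made up to show several accessions sharing one taxid.

```python
import io
from collections import defaultdict

# Two rows from the R output above plus one made-up row (X00001).
SAMPLE = "A16379\t730\nA27627\t480\nX00001\t730\n"


def read_acc_taxid(handle):
    """Return (number of pairs, {taxid: [accessions]}) from a two-column file."""
    by_taxid = defaultdict(list)
    n_pairs = 0
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        acc, taxid = line.split("\t")
        by_taxid[taxid].append(acc)
        n_pairs += 1
    return n_pairs, dict(by_taxid)


n_pairs, by_taxid = read_acc_taxid(io.StringIO(SAMPLE))
print(n_pairs, len(by_taxid))
```

Here `n_pairs` corresponds to `length(tmp)/2` and `len(by_taxid)` to `length(table(tmp1[,2]))` in the R session.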
- Procedure for building the universal tree
  1. ARB: edit the NDS and export the selected fields (accession + taxid) to a file
  2. ARB: export the sequences as GenBank or FASTA with gaps (alignments)
  3. Select the NR taxid set that corresponds to the Pfam taxids
  4. Collect the alignments for the NR taxid set (convert to PHYLIP format)
  5. Build the tree (neighbor joining, maximum parsimony, or maximum likelihood in the PHYLIP package)
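Steps 3 and 4 of the procedure above might look like the following. It is a rough sketch, not the method actually used: keeping the first accession seen per taxid is an assumed way of making the set non-redundant, the aligned sequences are made up, and the output is the simple sequential PHYLIP layout with names padded to 10 characters.

```python
def pick_nr_set(acc_taxid_pairs, pfam_taxids):
    """Keep the first accession seen for each taxid that also occurs in Pfam."""
    chosen = {}
    for acc, taxid in acc_taxid_pairs:
        if taxid in pfam_taxids and taxid not in chosen:
            chosen[taxid] = acc
    return chosen


def to_phylip(alignment):
    """Format a list of (name, aligned sequence) as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f" {len(alignment)} {lengths.pop()}"]
    for name, seq in alignment:
        # PHYLIP names occupy exactly 10 characters.
        lines.append(name[:10].ljust(10) + seq)
    return "\n".join(lines) + "\n"


# Made-up inputs: accession/taxid pairs, Pfam taxids, gapped alignments.
pairs = [("A16379", "730"), ("A27627", "480"), ("X00001", "730")]
pfam_taxids = {"730"}
aligned = {"A16379": "AC-GU", "A27627": "ACGGU", "X00001": "AC-GA"}

nr = pick_nr_set(pairs, pfam_taxids)
print(to_phylip([(taxid, aligned[acc]) for taxid, acc in nr.items()]))
```

Using the taxid as the sequence name keeps the tree tips directly comparable with the ncbi_code values from Pfam.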
Procedure
- Extract the taxonomic origin of each protein
- Get the taxonomy information for each origin
- Build a non-redundant (NR) set of taxonomic origins
- Collect all small subunit (SSU) rRNA sequences of the NR set
- Build the universal tree of life using the SSU sequences
- Map each protein belonging to a given domain onto the universal tree
- Check which node is the most recent common ancestor (MRCA) node
- Calculate the branch length between the MRCA and the LCA (last common ancestor)
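The last two steps can be illustrated on a toy tree. The sketch below assumes a rooted tree stored as a child-to-parent table with branch lengths, and it treats the LCA as the root of the universal tree; both are assumptions, and the tree itself is made up.

```python
# Made-up rooted tree: child -> (parent, branch length to parent).
TREE = {
    "n1": ("root", 0.5),
    "t4": ("root", 0.9),
    "n3": ("n1", 0.2),
    "t3": ("n1", 0.3),
    "t1": ("n3", 0.1),
    "t2": ("n3", 0.1),
}


def path_to_root(tree, node):
    """List the nodes from `node` up to and including the root."""
    path = [node]
    while node in tree:
        node = tree[node][0]
        path.append(node)
    return path


def mrca(tree, leaves):
    """Return the most recent common ancestor of the given leaves."""
    paths = [path_to_root(tree, leaf) for leaf in leaves]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first leaf, the first shared node is the deepest one.
    for node in paths[0]:
        if node in common:
            return node


def distance_to_root(tree, node):
    """Sum the branch lengths between `node` and the root."""
    total = 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        node = parent
    return total


# A domain found in tips t1 and t2 maps to node n3.
node = mrca(TREE, ["t1", "t2"])
print(node, distance_to_root(TREE, node))
```

With this convention a domain whose MRCA is the root has distance 0, and larger distances correspond to more recently arisen domains.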