Evolutionary age

From CSBLwiki

Latest revision as of 10:04, 24 August 2010

Evolutionary age of protein domains

(Based on this reference: PMID 16959887)

Data

Pfam release 24.0 (October 2009)

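The Pfam database (http://pfam.sanger.ac.uk)
- ftp current_release: ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/
- User manual (the pfam format): ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/userman.txt
- Tip: use the MySQL dump; its contents are easier to handle

# download the whole database (estimated ~2 days)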
wget -c ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database.tar &

Loading into MySQL
- Not all tables were imported (only a few key tables for this study)
- pfamseq, genome_seqs (not available), ncbi_taxonomy, genome_species (not available), pfamA, gene_ontology
- Release 24: pfamseq, pfamA, ncbi_taxonomy, taxonomy, pfamA_reg_full_significant
  - in_full should be 1

The pfamA_reg_full_significant and pfamA_reg_full_insignificant tables contain,
as the names suggest, the significant and insignificant data respectively.
Significant hits are those with a bits score above the curated threshold for the family,
whilst insignificant matches are those that score below the curated threshold.
With respect to the tables that contain significant data (pfamA_reg_full_significant and
pfamA_reg_full), there is an extra column called 'in_full'.
The matches that are present in the full alignment for a Pfam family have this column set to 1,
while those that are not present in the full alignment have the 'in_full' column set to 0.
Where there is an overlapping fragment match and a full length match to the same Pfam-A family,
only one of the matches will be present in the full alignment for that Pfam-A family.

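# create the database at the mysql prompt, then quit
mysql -u user -p
mysql> CREATE DATABASE pfam24; \q
# create the pfamseq table from its schema file, then load the data into it
mysql -u user -p pfam24 < FULL_PATH/pfamseq.sql
mysql -u user -p
mysql> USE pfam24
mysql> LOAD DATA LOCAL INFILE 'pfamseq.txt' INTO TABLE pfamseq FIELDS ENCLOSED BY "\'";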

MySQL

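When loading into MySQL, load the tables in the order in which they are created; otherwise errors occur (key/index links to tables that have not been created yet; related keyword: InnoDB foreign key constraints, http://dev.mysql.com/doc/refman/5.1/en/innodb-foreign-key-constraints.html)
- loading time: a few hours
Load the data file as a background job
Tune the database parameters; this is critical for the calculations to run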
- innodb_buffer_pool_size
- query_cache
- query_buffer

mysql -u user -p'passwd' database < loadscript.sql &

R

Using R to analyze the taxonomic distribution
- to do: work out which fields should be used to link the tables...

Tree data

Silva (http://www.arb-silva.de/): SSU rRNA database

Results

Species in Pfam (NCBI taxonomy): 148,925 species
Genus (taxonomy): 31,949
- taxonomy - browsing order in the web?
MySQL queries for retrieving the taxonomic distribution
- tables used: pfamseq, pfamA, pfamA_reg_full_significant

# list of Pfam-A families
SELECT DISTINCT pfamA_id, auto_pfamA FROM pfamA;
# protein sequences of each family; put an auto_pfamA value between the quotes
SELECT auto_pfamseq FROM pfamA_reg_full_significant WHERE auto_pfamA = '';
# taxonomic distribution; replace 'auto_pfamseq' with a value from the previous query
SELECT DISTINCT species, taxonomy, ncbi_code FROM pfamseq WHERE auto_pfamseq = 'auto_pfamseq';

or

SELECT DISTINCT species, taxonomy, ncbi_code
       FROM pfamseq seq, pfamA pf, pfamA_reg_full_significant sig
       WHERE pf.pfamA_id='PF...'
       AND sig.auto_pfamA=pf.auto_pfamA
       AND seq.auto_pfamseq=sig.auto_pfamseq;
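
The three-table join can be tried out without the full Pfam dump. The sketch below is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL, the tables hold just the columns named in the queries above, and every row (including the family name 'toy_domain') is made up. It also adds the in_full = 1 filter mentioned in the loading notes, which the queries above do not apply.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with invented rows (not real Pfam data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pfamA (auto_pfamA INTEGER, pfamA_id TEXT);
        CREATE TABLE pfamseq (auto_pfamseq INTEGER, species TEXT,
                              taxonomy TEXT, ncbi_code INTEGER);
        CREATE TABLE pfamA_reg_full_significant (auto_pfamA INTEGER,
                                                 auto_pfamseq INTEGER,
                                                 in_full INTEGER);
        INSERT INTO pfamA VALUES (1, 'toy_domain');
        INSERT INTO pfamseq VALUES
            (10, 'Escherichia coli', 'Bacteria; Proteobacteria.', 562),
            (11, 'Escherichia coli', 'Bacteria; Proteobacteria.', 562),
            (12, 'Homo sapiens', 'Eukaryota; Metazoa.', 9606),
            (13, 'Saccharomyces cerevisiae', 'Eukaryota; Fungi.', 4932);
        INSERT INTO pfamA_reg_full_significant VALUES
            (1, 10, 1), (1, 11, 1), (1, 12, 0), (1, 13, 1);
    """)
    return conn


def taxonomic_distribution(conn, pfam_id):
    """Distinct (species, taxonomy, ncbi_code) rows for one family,
    restricted to matches that are in the full alignment (in_full = 1)."""
    query = """
        SELECT DISTINCT seq.species, seq.taxonomy, seq.ncbi_code
        FROM pfamA AS pf
        JOIN pfamA_reg_full_significant AS sig
             ON sig.auto_pfamA = pf.auto_pfamA
        JOIN pfamseq AS seq
             ON seq.auto_pfamseq = sig.auto_pfamseq
        WHERE pf.pfamA_id = ? AND sig.in_full = 1
        ORDER BY seq.ncbi_code
    """
    return conn.execute(query, (pfam_id,)).fetchall()


conn = build_toy_db()
rows = taxonomic_distribution(conn, "toy_domain")
for row in rows:
    print(row)
```

In the toy data the two E. coli proteins collapse into one row because of DISTINCT, and the human match is dropped because its in_full value is 0.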

You now have the taxonomic distribution of each Pfam family (ready to map onto the tree)

Building the universal phylogenetic tree

ncbitaxa.csv
- total: 148925 (1 for column name)
- non-blank: 148203
- blank: 722
- Eukaryota 106462
- other sequences 191
- Viruses 24643
- 'unclassified' 1
- Archaea 764
- Bacteria 16102
- unclassified sequences 40
- max rank (in Archaea, Bacteria, Eukaryota): 22
- ncbi_rank_taxa

create TABLE ncbitaxa ( SN INT, ncbi_taxid INT, species text, tax_rank text);
load data local infile 'ncbitaxa.csv' into table ncbitaxa FIELDS terminated BY ',' enclosed by '"' ignore 1 lines;
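
The per-group counts listed above (Eukaryota, Bacteria, blank, and so on) could be tallied along these lines. This is a sketch under an assumption that the source does not confirm: that the tax_rank column holds a semicolon-separated lineage string whose first entry is the top-level group. The lineages below are invented examples, and top_level_counts is a hypothetical helper name.

```python
from collections import Counter


def top_level_counts(lineages):
    """Count lineages by their first (top-level) entry.

    Assumes each lineage looks like 'Bacteria; Proteobacteria.';
    empty strings are counted under '(blank)'.
    """
    counts = Counter()
    for lineage in lineages:
        lineage = lineage.strip()
        if not lineage:
            counts["(blank)"] += 1
            continue
        top = lineage.split(";")[0].strip().rstrip(".")
        counts[top] += 1
    return counts


# invented example lineages, not taken from ncbitaxa.csv
toy_lineages = [
    "Bacteria; Proteobacteria.",
    "Eukaryota; Metazoa.",
    "",
    "Bacteria; Firmicutes.",
    "Viruses.",
]
counts = top_level_counts(toy_lineages)
for group, n in sorted(counts.items()):
    print(group, n)
```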

Silva

/home/gnusnah/db/silva$SSURef_102_SILVA_NR_99.header - 262,092 lines
Parsing the SILVA NR set into NCBI taxids (GenBank): "ACC_NCBI-taxon.txt"

## R
> tmp = scan("ACC_NCBI-taxon.txt",sep="\t",what="character")
Read 517708 items
> length(tmp)/2
[1] 258854
> tmp1 = matrix(tmp,length(tmp)/2,2,byrow=T)
> tmp1[1:2,]
     [,1]     [,2] 
[1,] "A16379" "730"
[2,] "A27627" "480"
> length(table(tmp1[,2]))
[1] 63916
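
For readers who do not use R, the same check (number of accession/taxid pairs, number of distinct taxids) can be sketched in Python. The file layout is taken from the R session above: two tab-separated columns, accession then taxid. The third sample row is invented so that one taxid repeats, and read_acc_taxid is a hypothetical helper name.

```python
import io


def read_acc_taxid(handle):
    """Read tab-separated (accession, taxid) pairs, skipping empty lines."""
    pairs = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        acc, taxid = line.split("\t")
        pairs.append((acc, taxid))
    return pairs


# in-memory stand-in for ACC_NCBI-taxon.txt; the last row is made up
sample = io.StringIO("A16379\t730\nA27627\t480\nX00001\t730\n")
pairs = read_acc_taxid(sample)
unique_taxids = {taxid for _, taxid in pairs}
print(len(pairs), "pairs,", len(unique_taxids), "distinct taxids")
```

With the real file, `open("ACC_NCBI-taxon.txt")` would replace the in-memory sample.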

Procedure for building the universal tree

1. ARB: edit the NDS and export the selected fields to a file (accession + taxid)
2. ARB: export the sequences to GenBank or FASTA format with gaps (i.e. as alignments)
3. Select the NR-taxid set that corresponds to the Pfam taxids
4. Collect the alignments for the NR-taxid set (convert them to PHYLIP format)
5. Build the tree (neighbor, mp, or ml in the PHYLIP package)
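
The step of selecting the NR-taxid set could look roughly like the sketch below. It assumes that "non-redundant" means one SSU sequence per taxid and that only taxids also present in Pfam are kept; keeping the first accession seen for each taxid is an arbitrary choice made for illustration, not necessarily what was done here. The function name and the sample data are hypothetical.

```python
def select_nr_accessions(acc_taxid_pairs, pfam_taxids):
    """Keep one accession per taxid, only for taxids that occur in Pfam.

    The first accession encountered for a taxid wins (arbitrary rule).
    Returns a dict mapping taxid -> accession.
    """
    chosen = {}
    for acc, taxid in acc_taxid_pairs:
        if taxid in pfam_taxids and taxid not in chosen:
            chosen[taxid] = acc
    return chosen


# invented example: three SILVA-style pairs and a small set of Pfam taxids
silva_pairs = [("A16379", "730"), ("A27627", "480"), ("X00001", "730")]
pfam_taxids = {"730", "562"}
nr_set = select_nr_accessions(silva_pairs, pfam_taxids)
print(nr_set)
```

Taxids that occur in Pfam but have no SSU sequence ("562" in the toy data) simply do not appear in the result, so they cannot be placed on the tree.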

Procedure

Extract the taxonomic origin of each protein
Get the taxonomy information for each origin
Build the non-redundant (NR) set of taxonomic origins
Collect the small subunit (SSU) rRNA sequences of the NR set
Build the universal tree of life from the SSU sequences
Map each protein that contains a given domain onto the universal tree
Identify the node that is the most recent common ancestor (MRCA) of those proteins
Calculate the branch length between the MRCA and the LCA (last common ancestor)
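
The last two steps can be illustrated on a toy tree. The sketch below assumes that the LCA is the root of the universal tree, and that the tree is stored as a mapping from each node to its parent and the length of the branch leading to that parent. The topology, node names, and branch lengths are all invented; a real analysis would read the tree produced by PHYLIP.

```python
def path_to_root(tree, node):
    """List of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while tree[node][0] is not None:
        node = tree[node][0]
        path.append(node)
    return path


def mrca(tree, leaves):
    """Most recent common ancestor of a non-empty list of leaves."""
    paths = [path_to_root(tree, leaf) for leaf in leaves]
    common = set(paths[0]).intersection(*paths[1:])
    # walking up from the first leaf, the first shared node is the MRCA
    for node in paths[0]:
        if node in common:
            return node


def distance_to_root(tree, node):
    """Sum of branch lengths from `node` up to the root."""
    total = 0.0
    while tree[node][0] is not None:
        total += tree[node][1]
        node = tree[node][0]
    return total


# node -> (parent, length of the branch to the parent); invented values
toy_tree = {
    "root": (None, 0.0),
    "Bacteria": ("root", 1.0),
    "ArcEuk": ("root", 0.5),
    "Archaea": ("ArcEuk", 0.75),
    "Eukaryota": ("ArcEuk", 0.25),
    "E_coli": ("Bacteria", 0.5),
    "H_sapiens": ("Eukaryota", 0.5),
    "S_cerevisiae": ("Eukaryota", 0.5),
}

# a domain found only in two eukaryotes vs. one found in bacteria and eukaryotes
young = mrca(toy_tree, ["H_sapiens", "S_cerevisiae"])
old = mrca(toy_tree, ["E_coli", "H_sapiens"])
print(young, distance_to_root(toy_tree, young))
print(old, distance_to_root(toy_tree, old))
```

Under this reading, a domain whose MRCA sits at the root has distance 0 and is as old as the tree itself, while a larger distance indicates a more recent origin.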

