Evolutionary age

From CSBLwiki

(Difference between revisions)
(Results)
(Silva)
 
(17 intermediate revisions not shown)
Line 1: Line 1:
 +
{|align="left" cellpadding="25"
 +
| __TOC__
 +
|}
 +
==Evolutionary age of protein domains==
==Evolutionary age of protein domains==
(Based on this reference)
(Based on this reference)
<biblio>Reference pmid=16959887</biblio>
<biblio>Reference pmid=16959887</biblio>
-
===Pfam data (24.0)===
+
 
 +
==Data==
 +
===Pfam release 24.0 (October 2009)===
*The [http://pfam.sanger.ac.uk Pfam] database
*The [http://pfam.sanger.ac.uk Pfam] database
**ftp [ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/ current_release]
**ftp [ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/ current_release]
Line 24: Line 30:
The matches that are present in the full alignment for a Pfam family have this column set to 1,
The matches that are present in the full alignment for a Pfam family have this column set to 1,
while those that are not present in the full alignment have the 'in_full' column set to 0.
while those that are not present in the full alignment have the 'in_full' column set to 0.
-
Where there is an overlapping fragment match and a full length match to the same Pfam-A family, only one of the matches will be present in the full alignment for that Pfam-A family. </pre>
+
Where there is an overlapping fragment match and a full length match to the same Pfam-A family,
 +
only one of the matches will be present in the full alignment for that Pfam-A family. </pre>
<pre>
<pre>
mysql -u user -p
mysql -u user -p
Line 33: Line 40:
mysql>load data local infile 'pfamseq.txt' into table pfamseq FIELDS ENCLOSED BY "\'";
mysql>load data local infile 'pfamseq.txt' into table pfamseq FIELDS ENCLOSED BY "\'";
</pre>
</pre>
-
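*When loading into MySQL, follow the table creation order; otherwise an error occurs (a key index linking to a table that has not been created yet; see [http://dev.mysql.com/doc/refman/5.1/en/innodb-foreign-key-constraints.html related keywords])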
+
====MySQL====
 +
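*When loading into [[MySQL]], follow the table creation order; otherwise an error occurs (a key index linking to a table that has not been created yet; see [http://dev.mysql.com/doc/refman/5.1/en/innodb-foreign-key-constraints.html related keywords])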
**loading time: a few hours
**loading time: a few hours
*Loading data file in background job
*Loading data file in background job
 +
*Tune the database parameters; this is critical for running the calculation
 +
**innodb_buffer_pool_size
 +
**query_cache
 +
**query_buffer
<pre>
<pre>
mysql -u user -p'passwd' database < loadscript.sql &
mysql -u user -p'passwd' database < loadscript.sql &
</pre>
</pre>
 +
====R====
*using [[R]] to analyze taxonomic distribution
*using [[R]] to analyze taxonomic distribution
**see which fields should be used to link the databases...
**see which fields should be used to link the databases...
Line 45: Line 58:
*[http://www.arb-silva.de/ Silva]: SSU rRNA database
*[http://www.arb-silva.de/ Silva]: SSU rRNA database
-
===Results===
+
==Results==
*Species in Pfam (NCBI taxonomy): 148,925 species
*Species in Pfam (NCBI taxonomy): 148,925 species
*Genus (taxonomy): 31,949
*Genus (taxonomy): 31,949
Line 66: Line 79:
</pre>
</pre>
*Now, you have a list of Pfam having taxonomic distribution (ready to map)
*Now, you have a list of Pfam having taxonomic distribution (ready to map)
 +
===Building the universal phylogenetic tree===
 +
*ncbitaxa.csv
 +
**total: 148925 (1 for column name)
 +
**non-blank: 148203
 +
**blank: 722
 +
**Eukaryota 106462
 +
**other sequences 191
 +
**Viruses 24643
 +
**'unclassified' 1
 +
**Archaea 764
 +
**Bacteria 16102
 +
**unclassified sequences 40
 +
**max rank (in Archaea, Bacteria, Eukaryota): 22
 +
**[[ncbi_rank_taxa]]
 +
 +
create TABLE ncbitaxa ( SN INT, ncbi_taxid INT, species text, tax_rank text);
 +
load data local infile 'ncbitaxa.csv' into table ncbitaxa FIELDS terminated BY ',' enclosed by '"' ignore 1 lines;
 +
 +
===Silva===
 +
*SSURef_102_SILVA_NR_99.header (in /home/gnusnah/db/silva): 262,092 lines
 +
*pairing the SILVA NR set with NCBI taxids (GenBank): "ACC_NCBI-taxon.txt"
 +
<pre>
 +
## R
 +
> tmp = scan("ACC_NCBI-taxon.txt",sep="\t",what="character")
 +
Read 517708 items
 +
> length(tmp)/2
 +
[1] 258854
 +
> tmp1 = matrix(tmp,length(tmp)/2,2,byrow=T)
 +
> tmp1[1:2,]
 +
    [,1]    [,2]
 +
[1,] "A16379" "730"
 +
[2,] "A27627" "480"
 +
> length(table(tmp1[,2]))
 +
[1] 63916
 +
</pre>
 +
*Procedure for building the universal tree
 +
#arb - editing NDS and exporting selected fields into a file (acc + taxid)
 +
#arb - exporting sequences into gb or fasta with gaps (alignments)
 +
#sorting out NR-taxid set corresponding to pfam taxid
 +
#collecting alignments for NR-taxid set (converting into phylip)
 +
#building tree (neighbor, mp, ml in phylip package)
===Procedure===
===Procedure===

Latest revision as of 10:04, 24 August 2010


Evolutionary age of protein domains

(Based on this reference)

Reference: PMID 16959887

Data

Pfam release 24.0 (October 2009)

# download the full database (estimated ~2 days)
wget -c ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database.tar &
The pfamA_reg_full_significant and pfamA_reg_full_insignificant tables contain,
as the names suggest, the significant and insignificant data respectively.
Significant hits are those with a bits score above the curated threshold for the family,
whilst insignificant matches are those that score below the curated threshold.
With respect to the tables that contain significant data (pfamA_reg_full_significant and
pfamA_reg_full), there is an extra column called 'in_full'.
The matches that are present in the full alignment for a Pfam family have this column set to 1,
while those that are not present in the full alignment have the 'in_full' column set to 0.
Where there is an overlapping fragment match and a full length match to the same Pfam-A family,
only one of the matches will be present in the full alignment for that Pfam-A family. 
mysql -u user -p
mysql>create DATABASE pfam24; \q
mysql -u user -p pfam24 < FULL_PATH/pfamseq.sql
mysql -u user -p
mysql>use pfam24
mysql>load data local infile 'pfamseq.txt' into table pfamseq FIELDS ENCLOSED BY "\'";

MySQL

mysql -u user -p'passwd' database < loadscript.sql &
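
The background load above reads its statements from loadscript.sql. A minimal sketch of how such a script might be generated is given here, assuming that each table's data sits in a file named after the table (as pfamseq.txt does for pfamseq) and that the table names are passed in creation order, since foreign-key links fail otherwise. The function name make_loadscript is hypothetical, not part of Pfam or MySQL.

```shell
# Print one LOAD DATA statement per table, in the order the names are given.
# Assumption: the data file for table T is T.txt in the current directory.
make_loadscript() {
    for t in "$@"; do
        # Unquoted here-document: \\ becomes a single backslash,
        # so the statement ends with FIELDS ENCLOSED BY "\'";
        cat <<EOF
LOAD DATA LOCAL INFILE '${t}.txt' INTO TABLE ${t} FIELDS ENCLOSED BY "\\'";
EOF
    done
}

# Possible usage (the table order shown here is an assumption):
#   make_loadscript pfamA pfamseq pfamA_reg_full_significant > loadscript.sql
#   mysql -u user -p'passwd' database < loadscript.sql &
```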

R

Tree data

Results

# pfamA list
SELECT DISTINCT pfamA_id, auto_pfamA FROM pfamA;
# protein sequences of each auto_pfamA (pfamA_id); fill in the auto_pfamA value
SELECT auto_pfamseq FROM pfamA_reg_full_significant WHERE auto_pfamA = '';
# taxonomic distribution; replace 'auto_pfamseq' with a value from the previous query
SELECT DISTINCT species, taxonomy, ncbi_code FROM pfamseq WHERE auto_pfamseq = 'auto_pfamseq';
# or, as a single query joining the three tables
SELECT DISTINCT seq.species, seq.taxonomy, seq.ncbi_code
  FROM pfamseq seq, pfamA pf, pfamA_reg_full_significant sig
 WHERE pf.pfamA_id = 'PF...'
   AND sig.auto_pfamA = pf.auto_pfamA
   AND seq.auto_pfamseq = sig.auto_pfamseq;
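
To summarise the taxonomic distribution across many families at once, the query output can be post-processed on the command line. The sketch below assumes tab-separated rows of pfamA_id and ncbi_code on standard input (the format the mysql client produces in batch mode); the function name taxa_per_family is hypothetical.

```shell
# Count distinct ncbi_code values per pfamA_id.
# Input (stdin): tab-separated rows "pfamA_id<TAB>ncbi_code"; duplicates allowed.
# Output: "pfamA_id<TAB>number_of_distinct_ncbi_codes", sorted by pfamA_id.
taxa_per_family() {
    awk -F'\t' '
        # Count each (family, taxon) pair only the first time it is seen.
        !(($1, $2) in seen) { seen[$1, $2] = 1; count[$1]++ }
        END { for (f in count) print f "\t" count[f] }
    ' | sort
}

# Possible usage (an untested assumption; the join may be slow on the full tables):
#   mysql --batch --skip-column-names -u user -p pfam24 -e "
#     SELECT pf.pfamA_id, seq.ncbi_code
#       FROM pfamseq seq, pfamA pf, pfamA_reg_full_significant sig
#      WHERE sig.auto_pfamA = pf.auto_pfamA
#        AND seq.auto_pfamseq = sig.auto_pfamseq" | taxa_per_family
```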

Building the universal phylogenetic tree

create TABLE ncbitaxa ( SN INT, ncbi_taxid INT, species text, tax_rank text);
load data local infile 'ncbitaxa.csv' into table ncbitaxa FIELDS terminated BY ',' enclosed by '"' ignore 1 lines;

Silva

## R
> tmp = scan("ACC_NCBI-taxon.txt",sep="\t",what="character")
Read 517708 items
> length(tmp)/2
[1] 258854
> tmp1 = matrix(tmp,length(tmp)/2,2,byrow=T)
> tmp1[1:2,]
     [,1]     [,2] 
[1,] "A16379" "730"
[2,] "A27627" "480"
> length(table(tmp1[,2]))
[1] 63916
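
As a cross-check on the R session above, the same two counts can be obtained from the shell. This is a sketch that assumes the file holds exactly two tab-separated fields per line (accession, then taxid), which is what the reshaping into a two-column matrix implies; the function name count_pairs is hypothetical.

```shell
# Print "<number of accession/taxid pairs> <number of distinct taxids>"
# for a two-column, tab-separated file. Lines without exactly two fields are skipped.
count_pairs() {
    awk -F'\t' '
        NF == 2 { n++; seen[$2] = 1 }
        END { d = 0; for (k in seen) d++; print n + 0, d }
    ' "$1"
}

# Possible usage:
#   count_pairs ACC_NCBI-taxon.txt
```

If every line has exactly two fields, the result should agree with the 258854 pairs and 63916 distinct taxids reported by R.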
  1. ARB: edit the NDS and export the selected fields (accession + taxid) to a file
  2. ARB: export the sequences as GenBank or FASTA with gaps (i.e. as alignments)
  3. select the NR taxid set that corresponds to the Pfam taxids
  4. collect the alignments for that NR taxid set (convert to PHYLIP format)
  5. build the tree (neighbor, MP, ML in the PHYLIP package)
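
Step 3 of the procedure above amounts to intersecting two taxid sets. A minimal sketch follows, assuming a file with one Pfam ncbi taxid per line and the two-column accession/taxid file exported from ARB. Both function names are hypothetical, and keeping only the first accession per taxid is an assumption about what the tree needs, not something stated in the procedure.

```shell
# Keep accession/taxid rows whose taxid also occurs in the Pfam taxid list.
#   $1: file with one Pfam ncbi taxid per line
#   $2: tab-separated file "accession<TAB>taxid"
filter_by_taxid() {
    awk -F'\t' '
        NR == FNR { keep[$1] = 1; next }   # first file: remember Pfam taxids
        ($2 in keep)                       # second file: print matching rows
    ' "$1" "$2"
}

# Optional: keep only the first accession seen for each taxid (stdin to stdout).
first_per_taxid() {
    awk -F'\t' '!seen[$2]++'
}

# Possible usage (pfam_taxids.txt and nr_taxid_set.txt are placeholder names):
#   filter_by_taxid pfam_taxids.txt ACC_NCBI-taxon.txt | first_per_taxid > nr_taxid_set.txt
```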

Procedure
