TY - JOUR
T1 - EnzML
T2 - multi-label prediction of enzyme classes using InterPro signatures
AU - De Ferrari, Luna
AU - Aitken, Stuart
AU - van Hemert, Jano
AU - Goryanin, Igor
N1 - LDF is funded by ONDEX DTG, BBSRC TPS Grant BB/F529038/1 of the Centre
for Systems Biology at Edinburgh and the University of Newcastle. SA is
supported by by a Wellcome Trust Value In People award and, together with
IG, the Centre for Systems Biology at Edinburgh, a centre funded by the
Biotechnology and Biological Sciences Research Council and the Engineering
and Physical Sciences Research Council (BB/D019621/1). JVH was funded by
various BBSRC and EPSRC grant
PY - 2012/4
Y1 - 2012/4
N2 - Background: Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. In this work we explore the capacity of InterPro sequence signatures to automatically predict enzymatic function. Results: We present EnzML, a multi-label classification method that can efficiently account also for proteins with multiple enzymatic functions: 50,000 in UniProt. EnzML was evaluated using a standard set of 300,747 proteins for which the manually curated Swiss-Prot and KEGG databases have agreeing Enzyme Commission (EC) annotations. EnzML achieved more than 98% subset accuracy (exact match of all correct Enzyme Commission classes of a protein) for the entire dataset and between 87 and 97% subset accuracy in reannotating eight entire proteomes: human, mouse, rat, mouse-ear cress, fruit fly, the S. pombe yeast, the E. coli bacterium and the M. jannaschii archaebacterium. To understand the role played by the dataset size, we compared the cross-evaluation results of smaller datasets, either constructed at random or from specific taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. The results were confirmed even when the redundancy in the dataset was reduced using UniRef100, UniRef90 or UniRef50 clusters. Conclusions: InterPro signatures are a compact and powerful attribute space for the prediction of enzymatic function. This representation makes multi-label machine learning feasible in reasonable time (30 minutes to train on 300,747 instances with 10,852 attributes and 2,201 class values) using the Mulan Binary Relevance Nearest Neighbours algorithm implementation (BR-kNN).
AB - Background: Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. In this work we explore the capacity of InterPro sequence signatures to automatically predict enzymatic function. Results: We present EnzML, a multi-label classification method that can efficiently account also for proteins with multiple enzymatic functions: 50,000 in UniProt. EnzML was evaluated using a standard set of 300,747 proteins for which the manually curated Swiss-Prot and KEGG databases have agreeing Enzyme Commission (EC) annotations. EnzML achieved more than 98% subset accuracy (exact match of all correct Enzyme Commission classes of a protein) for the entire dataset and between 87 and 97% subset accuracy in reannotating eight entire proteomes: human, mouse, rat, mouse-ear cress, fruit fly, the S. pombe yeast, the E. coli bacterium and the M. jannaschii archaebacterium. To understand the role played by the dataset size, we compared the cross-evaluation results of smaller datasets, either constructed at random or from specific taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. The results were confirmed even when the redundancy in the dataset was reduced using UniRef100, UniRef90 or UniRef50 clusters. Conclusions: InterPro signatures are a compact and powerful attribute space for the prediction of enzymatic function. This representation makes multi-label machine learning feasible in reasonable time (30 minutes to train on 300,747 instances with 10,852 attributes and 2,201 class values) using the Mulan Binary Relevance Nearest Neighbours algorithm implementation (BR-kNN).
KW - Enzymatic function
KW - Automatic genome sequencing
KW - InterPro signatures
KW - EnzML
U2 - 10.1186/1471-2105-13-61
DO - 10.1186/1471-2105-13-61
M3 - Article
SN - 1471-2105
VL - 13
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 61
ER -