TY - CONF
T1 - Multi-label prediction of enzyme classes using InterPro signatures
AU - De Ferrari, Luna
AU - Aitken, Stuart
AU - van Hemert, Jano
AU - Goryanin, Igor
PY - 2010
Y1 - 2010
N2 - In this work we use InterPro protein signatures to predict enzymatic function. We evaluate the method over more than 300,000 proteins (55% enzymes, 45% non-enzymes) for which Swiss-Prot and KEGG have agreeing Enzyme Commission annotations. We applied multi-label classification to account for proteins with multiple enzymatic functions (about 3% of UniProt) using Mulan, a library of algorithms based on the Weka framework. We achieved > 97% recall, accuracy and precision in predicting enzymatic classes. To understand the role played by the data set size, we compared smaller data sets, either random or specific to taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. We find that the success of prediction increases with data set size. Limiting the data to a particular taxonomic set, while saving computational time, only covers a reduced set of enzymatic classes and achieves better accuracy than a random set only if the proteins are grouped by high-level taxonomic domains (archaea, bacteria and eukarya).
AB - In this work we use InterPro protein signatures to predict enzymatic function. We evaluate the method over more than 300,000 proteins (55% enzymes, 45% non-enzymes) for which Swiss-Prot and KEGG have agreeing Enzyme Commission annotations. We applied multi-label classification to account for proteins with multiple enzymatic functions (about 3% of UniProt) using Mulan, a library of algorithms based on the Weka framework. We achieved > 97% recall, accuracy and precision in predicting enzymatic classes. To understand the role played by the data set size, we compared smaller data sets, either random or specific to taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. We find that the success of prediction increases with data set size. Limiting the data to a particular taxonomic set, while saving computational time, only covers a reduced set of enzymatic classes and achieves better accuracy than a random set only if the proteins are grouped by high-level taxonomic domains (archaea, bacteria and eukarya).
M3 - Paper
ER -