Machine Learning Approaches to Predict Enzyme Function

Project: Standard

Project Details


The key idea in our work is to identify the reaction mechanism, if any, that a protein of known structure catalyses as an enzyme. Here, the reaction mechanisms are the 260 distinct entries in MACiE. The possible predictions are therefore each of these 260 mechanisms, plus the outcome that the protein catalyses no enzyme reaction in our knowledge base. Our work, including a study of convergently evolved analogous pairs of enzymes, suggests that the full mechanism contains information critical to recognising similarities between enzymes.

Our main machine learning method is Random Forest, an ensemble (a "forest") of many different randomly constructed decision trees. Randomness is introduced in two ways. Firstly, each tree is built from a bootstrap sample of N out of the N known proteins, chosen with replacement, so that some proteins appear more than once and others not at all in the set from which a given tree is built. Secondly, the descriptor used for the split at each node is chosen from a fresh, small random subset of the descriptors. Once grown, the trees are used to predict unseen data. Random Forest can predict either a categorical or a continuous variable. Here, our interest is in classification; the class assigned to a new protein is the one that receives the most votes amongst the trees in the forest.
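The two sources of randomness and the voting step described above can be illustrated with a minimal sketch in plain Python. This is not the project's implementation: the function names, the Gini impurity criterion, the midpoint thresholds, the toy descriptor values and the class labels are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, n_try, rng):
    """Search a fresh random subset of n_try descriptors for the best split."""
    best = None  # (weighted impurity, descriptor index, threshold)
    for f in rng.sample(range(len(X[0])), n_try):
        vals = sorted({row[f] for row in X})
        for t in ((a + b) / 2 for a, b in zip(vals, vals[1:])):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(X, y, n_try, rng):
    """Return a leaf (a class label) or a node (descriptor, threshold, left, right)."""
    if len(set(y)) == 1:
        return y[0]
    split = best_split(X, y, n_try, rng)
    if split is None:  # the sampled descriptors cannot separate the data
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], n_try, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], n_try, rng))

def predict_tree(node, row):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def grow_forest(X, y, n_trees=25, n_try=1, seed=0):
    """Grow each tree on a bootstrap sample of N out of N, drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Assign the class that receives the most votes amongst the trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data: two made-up descriptors per protein and two hypothetical classes.
X = [[1.0, 0.9], [1.2, 1.1], [0.8, 1.4], [1.5, 0.7], [1.1, 1.0],
     [5.0, 5.2], [5.5, 4.8], [6.1, 5.9], [4.9, 6.3], [5.8, 5.1]]
y = ["mechanism_A"] * 5 + ["no_known_mechanism"] * 5

forest = grow_forest(X, y)
print(predict_forest(forest, [1.0, 1.2]))
print(predict_forest(forest, [5.6, 5.4]))
```

In the project itself the classes would be the MACiE mechanisms plus the "no reaction" outcome, and the descriptors would be derived from protein structure; a production analysis would normally use an established Random Forest library rather than a hand-written one.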

After predicting the reaction mechanisms, we will apply chemoinformatics, docking and virtual screening to suggest substrates for the enzyme reactions identified. Docking acts as a computational filter, reducing the number of candidates by more than an order of magnitude. Rescoring will use our novel Random Forest based RF-Score function. We will use fingerprint-based chemoinformatics methods to retain only molecules with the chemical functionalities needed to undergo the reaction mechanisms identified, and Ultrafast Shape Recognition as a scaffold-hopping method to identify molecules of suitable shape.
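As an illustration of the shape-comparison step, the sketch below computes Ultrafast Shape Recognition descriptors in plain Python. It follows the commonly described form of the method: distances from every atom to four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), summarised by three moments each, giving 12 numbers per molecule. The moment convention used here (mean, standard deviation, cube root of the third central moment) is an assumption, as conventions differ between implementations, and the function names and coordinates are made up for illustration.

```python
import math

def _moments(dists):
    """Mean, standard deviation and cube root of the third central moment."""
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1.0 / 3.0), third)]

def usr_descriptor(coords):
    """Return the 12 shape descriptors for a list of (x, y, z) atom positions."""
    n = len(coords)
    ctd = tuple(sum(c[k] for c in coords) / n for k in range(3))  # centroid
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # atom farthest from the centroid
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # atom farthest from fct
    desc = []
    for ref in (ctd, cst, fct, ftf):
        desc += _moments([math.dist(c, ref) for c in coords])
    return desc

def usr_similarity(a, b):
    """Score in (0, 1]: inverse of one plus the mean absolute descriptor difference."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))

# Made-up coordinates for a five-atom "molecule", a shifted copy and a stretched copy.
mol = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.1, 1.3, 0.0), (3.4, 1.2, 0.9), (4.0, 2.5, 1.6)]
shifted = [(x + 10.0, y - 3.0, z + 2.0) for x, y, z in mol]
stretched = [(2 * x, 2 * y, 2 * z) for x, y, z in mol]

d_mol = usr_descriptor(mol)
print(usr_similarity(d_mol, usr_descriptor(shifted)))
print(usr_similarity(d_mol, usr_descriptor(stretched)))
```

Because the descriptors depend only on interatomic distances, the shifted copy scores essentially 1 against the original without any alignment step, while the stretched copy scores lower. Comparing 12 numbers per molecule is what makes the method fast enough to screen large candidate sets.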

Layman's description

Proteins are amongst the most important of all molecules in biological systems. They are crucial to organisms, which use them to carry out a huge variety of essential functions: catalysis, transport, storage, motor functions, signalling, chaperoning of folding, regulation, molecular recognition, structural roles, and DNA

As proteins are so ubiquitous in biology, understanding their properties is essential if we want to know about biological processes. This project is focused on one of the most important of all protein functions: enzyme catalysis. Enzymes catalyse, or facilitate, the chemical reactions that occur in living organisms. Understanding how they work is both interesting in itself and useful in areas as diverse as drug design, diagnostics, biofuels, food science and laundry.

This project is about the relationship between the structure of a protein and the enzyme function it carries out. We aim to predict the catalytic functionality from a knowledge of the protein structure.

To achieve this, we will use machine learning methods, in particular a technique called Random Forest. The forest consists of several hundred "decision trees", each of which is essentially a flow diagram. We will train them to learn patterns in the known properties of existing enzyme structures and in the chemistry of the steps that make up the reactions they catalyse. The trees are generated using computer-simulated dice-rolling, which ensures that they are all different even though they are based on the same underlying information. Each decision tree then makes a prediction of the possible catalytic function of a protein whose function is unknown. These predictions are treated as votes on the function of the protein. This voting process produces a consensus of many decision trees and makes fuller use of the information contained in the underlying data, generating results which are much more accurate than those of any one decision tree.

The prediction of enzyme function is important for a number of reasons. Firstly, being able to predict enzyme function more accurately will improve the functional annotation of genomes and reduce the current risk of misannotations being propagated through bioinformatics databases. Rapid developments in structural genomics, that is, the high-throughput structure determination of diverse proteins from a wide variety of organisms, mean that many structures are available for enzymes whose functions are not yet known. Secondly, this project will allow us to recognise chemical similarities between evolutionarily unrelated enzymes that catalyse similar steps, though not necessarily similar overall reactions. Thirdly, this work will help us to understand the key determinants of the complex relationship between protein structure, function and evolution, particularly in terms of catalysis of reaction steps. Fourthly, the project will facilitate the design of new enzymes with either novel functions or carefully modified versions of existing functions.

This project sits at an interface between disciplines, combining chemistry, biology and computer science. A wide range of skills and expertise is necessary to increase our understanding of catalysis, which has long been an important academic goal. Commercially, this work lays a foundation which is directly useful to the pharmaceutical and biotechnology industries, where enzymes are used both as diagnostics and as therapeutics; to the agrochemical industry, whose products often target enzymes; to the development of biofuels, which needs robust enzymes to improve productivity and reduce costs; to laundry, where enzymes are already used in everyday products; and to the nutrition and food industries. In particular, this project will aid in the design of new and repurposed enzymes.
Acronym: Machine Learning Approaches to Predict
Effective start/end date: 1/09/11 - 31/12/14


  • BBSRC: £247,960.46

