Abstract
In this paper we combine the Winnow algorithm with an advanced feature-generation scheme to build a tool for multiclass classification. The Winnow algorithm, designed in the late 1980s to work well with high-dimensional data, by design ignores most irrelevant features when scoring each training or test case. To enlarge the pool of available molecular features, we use Winnow together with a process that creates additional features from a given set. We adapt a technique from text classification, “orthogonal sparse bigrams”, and extend it to the domain of cheminformatics. Using circular molecular fingerprints as initial features, we create “molecular orthogonal sparse bigrams” (MOSBs) and report their successful application to the classification of bioactive molecules. Additionally, we introduce a memory-efficient way of bagging individual classifiers that avoids the need to hold the complete training data set in memory. To compare the performance of our method with published results, we use the Hert data set of 8293 active molecules in 11 classes. Compared with Random Forest, our method is not only comparable or better in classification accuracy (up to 50% higher in MCC [Matthews correlation coefficient], 98% higher in fraction of correct predictions) but also quicker to train (by a factor between 2 and 18, depending on the feature generation), more memory efficient, and able to cope more easily with large data sets when the actives are seeded into a pool of 94290 inactive molecules. We also show that the method can be used with different fingerprints.
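As a rough illustration of the two ingredients named in the abstract, the following minimal Python sketch pairs a text-style orthogonal sparse bigram generator with a binary Winnow learner. It is not the authors' implementation. The function and class names, the window size, the promotion factor `alpha`, the threshold of 1.0, and the use of a mean-weight score are all assumptions made for this example. The abstract does not say how bigrams are formed from circular fingerprints, so plain token sequences stand in for molecular features, and a multiclass tool would presumably need one such scorer per class.

```python
from collections import defaultdict


def orthogonal_sparse_bigrams(tokens, window=5):
    """Pair each token with each of the up to (window - 1) tokens before it.

    The number of skipped tokens is stored in the feature, so that an
    adjacent pair and a pair with a gap count as different features.
    (Text-style OSBs; the molecular variant is not specified here.)
    """
    features = []
    for i, right in enumerate(tokens):
        for gap in range(1, window):
            j = i - gap
            if j < 0:
                break
            features.append((tokens[j], gap - 1, right))
    return features


class Winnow:
    """Binary Winnow-style learner over sparse, binary features.

    Assumption: the score is the mean weight of the active features and
    is compared against a fixed threshold, so that the number of
    possible features does not have to be known in advance.
    """

    def __init__(self, alpha=2.0, threshold=1.0):
        self.alpha = alpha          # multiplicative promotion factor
        self.threshold = threshold  # decision threshold on the mean weight
        self.weights = defaultdict(lambda: 1.0)  # unseen features start at 1

    def score(self, features):
        active = set(features)
        if not active:
            return 0.0
        return sum(self.weights[f] for f in active) / len(active)

    def predict(self, features):
        return self.score(features) > self.threshold

    def update(self, features, label):
        """Mistake-driven update: only the active features are touched."""
        if self.predict(features) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for f in set(features):
            self.weights[f] *= factor


# Toy usage: "x" marks actives, "y" marks inactives, "n" is noise.
training = [(["x", "n"], True), (["y", "n"], False)]
model = Winnow()
for _ in range(3):
    for tokens, label in training:
        model.update(tokens, label)
```

Because weights change only for features present in a misclassified example, features that never co-occur with mistakes keep their initial weight, which is the sense in which a Winnow-style learner can sit on top of a very large generated feature pool.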
Original language | English |
---|---|
Pages (from-to) | 306-318 |
Number of pages | 13 |
Journal | Journal of Chemical Information and Modeling |
Volume | 48 |
Issue number | 2 |
Publication status | Published - Feb 2008 |
Keywords
- PHYSICOCHEMICAL DESCRIPTORS
- RANDOM FOREST
- CLASSIFICATION
- QSAR
- PREDICTION
- ENRICHMENT
- ENSEMBLES
- MODELS
- KERNEL
- QSPR