TY - UNPB
T1 - cazy_webscraper
T2 - local compilation and interrogation of comprehensive CAZyme datasets
AU - Hobbs, Emma Elizabeth Mary
AU - Gloster, Tracey
AU - Pritchard, Leighton
N1 - Funding: E.E.M.H. is funded by a BBSRC EASTBIO Doctoral Training Partnership award.
PY - 2022/12/4
Y1 - 2022/12/4
N2 - Carbohydrate Active enZymes (CAZymes) are pivotal in biological processes including energy metabolism, cell structure maintenance, signalling and pathogen recognition. Bioinformatic prediction and mining of CAZymes improves our understanding of these activities, and enables discovery of candidates of interest for industrial biotechnology, particularly the processing of organic waste for biofuel production. CAZy (www.cazy.org) is a high-quality, manually-curated and authoritative database of CAZymes that is often the starting point for these analyses. Automated querying, and integration of CAZy data with other public datasets would constitute a powerful resource for mining and exploring CAZyme diversity. However, CAZy does not itself provide methods to automate queries, or integrate annotation data from other sources (except by following hyperlinks) to support further analysis. To overcome these limitations we developed cazy_webscraper, a command-line tool that retrieves data from CAZy and other online resources to build a local, shareable, and reproducible database that augments and extends the authoritative CAZy database. cazy_webscraper’s integration of curated CAZyme annotations with their corresponding protein sequences, up-to-date taxonomy assignments, and protein structure data facilitates automated large-scale and targeted bioinformatic CAZyme family analysis and candidate screening. This tool has found widespread uptake in the community, with over 20,000 downloads. We demonstrate the use and application of cazy_webscraper to: (i) augment, update and correct CAZy database accessions; (ii) explore taxonomic distribution of CAZymes recorded in CAZy, identifying underrepresented taxa and unusual CAZy class distributions; and (iii) investigate three CAZymes having potential biotechnological application for degradation of biomass, but lacking a representative structure in the PDB database. We describe in general how cazy_webscraper facilitates functional, structural and evolutionary studies to aid identification of candidate enzymes for further characterisation, and specifically note that CAZy provides supporting evidence for recent expansion of the Auxiliary Activities (AA) CAZy family in eukaryotes, consistent with functions potentially specific to eukaryotic lifestyles.
AB - Carbohydrate Active enZymes (CAZymes) are pivotal in biological processes including energy metabolism, cell structure maintenance, signalling and pathogen recognition. Bioinformatic prediction and mining of CAZymes improves our understanding of these activities, and enables discovery of candidates of interest for industrial biotechnology, particularly the processing of organic waste for biofuel production. CAZy (www.cazy.org) is a high-quality, manually-curated and authoritative database of CAZymes that is often the starting point for these analyses. Automated querying, and integration of CAZy data with other public datasets would constitute a powerful resource for mining and exploring CAZyme diversity. However, CAZy does not itself provide methods to automate queries, or integrate annotation data from other sources (except by following hyperlinks) to support further analysis. To overcome these limitations we developed cazy_webscraper, a command-line tool that retrieves data from CAZy and other online resources to build a local, shareable, and reproducible database that augments and extends the authoritative CAZy database. cazy_webscraper’s integration of curated CAZyme annotations with their corresponding protein sequences, up-to-date taxonomy assignments, and protein structure data facilitates automated large-scale and targeted bioinformatic CAZyme family analysis and candidate screening. This tool has found widespread uptake in the community, with over 20,000 downloads. We demonstrate the use and application of cazy_webscraper to: (i) augment, update and correct CAZy database accessions; (ii) explore taxonomic distribution of CAZymes recorded in CAZy, identifying underrepresented taxa and unusual CAZy class distributions; and (iii) investigate three CAZymes having potential biotechnological application for degradation of biomass, but lacking a representative structure in the PDB database. We describe in general how cazy_webscraper facilitates functional, structural and evolutionary studies to aid identification of candidate enzymes for further characterisation, and specifically note that CAZy provides supporting evidence for recent expansion of the Auxiliary Activities (AA) CAZy family in eukaryotes, consistent with functions potentially specific to eukaryotic lifestyles.
KW - CAZy
KW - CAZymes
KW - Database
KW - Software
U2 - 10.1101/2022.12.02.518825
DO - 10.1101/2022.12.02.518825
M3 - Preprint
T3 - biorxiv
BT - cazy_webscraper
ER -