Phyl Toukach: Glyco databases

Carbohydrate databases in the recent decade
presentation slides

Download the presentation as high-resolution PDF (6.3 Mb)

Annotation

Today it is impossible to navigate the accumulated volume of glyco-related information without dedicated informatics tools. Therefore, the progress of glycobiology strongly depends on the availability of an information environment that includes data on the structures, properties and functions of carbohydrates, as well as on the taxonomy and properties of their biological sources. The main approach to creating such an environment is the development of carbohydrate databases (DBs). In contrast to genomics and proteomics, the informatization of glycomics is still in the making. Existing projects have a limited scope and address local problems. They are not fully compatible with each other in either coverage or data format.

Glyco-databases that provide wide coverage are in the highest demand, among them Glytoucan (meta-repository), GLYCOSCIENCES (imported Carbbank + selected mammalian glycans + NMR data), GlycoSuiteDB (mammalian O- and N-glycans), CFG Glycan Database (mammalian glycans from Carbbank and Glycominds), KEGG (mainly imported Carbbank), GlycoBase-Dublin (N-glycans + MS data), GlycoBase-Lille (amphibian glycans + NMR data), ECODAB (E. coli O-antigens), and Carbohydrate Structure Database (bacterial, plant and fungal glycans + NMR data).

Historically, the first glyco-database was Carbbank, which claimed complete coverage of the structures published before 1996, when its support ceased. Collecting and digitizing primary data are the most time-consuming stages of database development, and therefore almost all modern projects use the Carbbank data and ideology in some way.

Analysis of the distinctive features of various projects makes it possible to establish criteria for database evaluation: types of data deposited, completeness of coverage, data quality, functions provided to users, user interface (including stability and performance), integration with other projects, and, indirectly, database architecture.

The minimal types of data stored and processed in a glyco-database are a primary molecular structure and its taxonomic and bibliographic annotations. Many databases store experimental analytical data, such as NMR or MS spectra. Storage of biochemical, genetic, medical and other related data is often supported, but their coverage remains poor. Even some major carbohydrate databases lack taxonomic and bibliographic annotations in a considerable share of records. Databases that store NMR spectra provide NMR coverage of 5-35% of the published data.

Higher coverage significantly increases the value of a database, since even a negative answer to a search request then carries valuable scientific information. The limited potential for automating the search for suitable publications restricts the acquisition of primary data and, therefore, the coverage. At present, only Bacterial CSDB claims complete (>80%) structural coverage within a chosen compound class (bacterial glycans). Periodic updates are needed to keep the coverage up to date. We consider a two-year time lag between publication and deposition of data acceptable. A universal solution for keeping the data up to date is to require that every published structure be uploaded to a database prior to publication, with the resulting IDs provided during article submission. This approach has long been implemented in genomics but is still missing in glycomics. One of the reasons for this gap is the insufficient standardization of glycan description languages, which stems from the high chemical diversity of carbohydrates.

The process of data deposition can hardly be automated, not only at the level of publication selection but also at the level of article interpretation. As a result, all chemical and biological databases contain errors. These errors originate from annotators’ failures, are transferred from other databases, are copied from original publications, and arise from inconsistencies in DB architecture and software bugs in importers and auto-annotators (listed in order of decreasing occurrence). According to our investigation, most records in Carbbank contain errors, and more than one third of records contain two or more errors. The most abundant error type is an incorrect taxonomic annotation of structures. Significant gaps in the Carbbank coverage were also discovered. As most modern projects use the Carbbank data, these errors are being reproduced. Some of them can be detected and sometimes corrected automatically. Such control is present in a number of databases; however, only a retrospective expert analysis of publications can provide really high data quality.
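As an illustration of the automatic detection and correction mentioned above, the following is a minimal sketch in Python. It assumes a toy record layout (a dictionary with `residues`, `taxon` and `citation` fields), a toy controlled vocabulary and a toy alias table; none of these reflect the schema or the checks of any real database.

```python
from typing import Dict, List, Tuple

# Toy controlled vocabulary and spelling variants (assumptions for illustration only).
KNOWN_RESIDUES = {"aDGalp", "bDGlcp", "Kdo"}
ALIASES = {"a-D-Galp": "aDGalp"}


def check(record: Dict) -> Tuple[Dict, List[str]]:
    """Return a corrected copy of the record and a list of detected problems."""
    fixed = dict(record)
    problems: List[str] = []
    residues: List[str] = []
    for name in record.get("residues", []):
        if name in ALIASES:
            # A known spelling variant can be corrected automatically.
            residues.append(ALIASES[name])
            problems.append(f"corrected: {name} -> {ALIASES[name]}")
        elif name in KNOWN_RESIDUES:
            residues.append(name)
        else:
            # An unknown name can only be flagged for a human annotator.
            residues.append(name)
            problems.append(f"unknown residue: {name}")
    fixed["residues"] = residues
    # Missing taxonomic or bibliographic annotations are detectable but not fixable.
    for field_name in ("taxon", "citation"):
        if not record.get(field_name):
            problems.append(f"missing {field_name}")
    return fixed, problems


fixed, problems = check(
    {"residues": ["a-D-Galp", "Xyz"], "taxon": "", "citation": "some reference"}
)
for line in problems:
    print(line)
```

A check of this kind catches only formal inconsistencies; an incorrect but well-formed taxonomic annotation would pass it, which is why expert re-analysis of publications remains necessary.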

Database functionality is the capability to process various search requests, combine them using logical operations, and refine the results by queries of other types. For example: "find all structures published from 2001 to 2005 that contain either an α-Gal(1-->3)KDO fragment or a monosaccharide-bound lysine or alanine, except synthetic structures or those found in gamma-proteobacteria, and display their ¹³C NMR spectra". The functionality can be extended by carbohydrate-related services, such as conformation map simulation, spectra prediction, search for structural motifs, etc. In contrast to the search for bibliography, taxonomy, keywords, text fragments and similar data, the search for structures containing a specific fragment (as well as for structures or spectra resembling the specified ones) requires more meticulous programming and more computational power, which makes the internal database architecture critical for the performance of structural queries. In the mid-2000s, developers of GLYCOSCIENCES.de formulated the "Ten golden rules of carbohydrate database development", which summarized the experience of the German and Russian groups. The key points of this document include the use of a connection table for internal structure representation, maximal possible indexing, a minimum of free-text data (which, regrettably, are present in every project), and unambiguous controlled vocabularies for as many data types as possible, monosaccharide names in the first place. An attempt to separate the monosaccharide vocabulary from glyco-databases was made within MonosaccharideDB.
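The combined query quoted above can be sketched in Python as a set of composable predicates over records that store structures as a small connection table. The record layout, the residue names (`aDGalp`, `Kdo`, `Ala`, `Lys`), the taxon labels and the sample data are all hypothetical and chosen only for illustration; spectra display is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    rec_id: str
    residues: Tuple[str, ...]                     # controlled-vocabulary names
    # Connection table: (child index, child position, parent index, parent position)
    bonds: Tuple[Tuple[int, int, int, int], ...]
    year: int
    taxon: str
    synthetic: bool = False


Query = Callable[[Record], bool]


def has_linkage(child: str, child_pos: int, parent_pos: int, parent: str) -> Query:
    """Match records where `child` is linked child_pos -> parent_pos to `parent`."""
    def q(r: Record) -> bool:
        return any(
            r.residues[c] == child and r.residues[p] == parent
            and (cp, pp) == (child_pos, parent_pos)
            for c, cp, p, pp in r.bonds
        )
    return q


def has_residue(name: str) -> Query:
    return lambda r: name in r.residues


def published(first: int, last: int) -> Query:
    return lambda r: first <= r.year <= last


def from_taxon(taxon: str) -> Query:
    return lambda r: r.taxon == taxon


def is_synthetic(r: Record) -> bool:
    return r.synthetic


# Logical operations for combining queries.
def all_of(*qs: Query) -> Query:
    return lambda r: all(q(r) for q in qs)


def any_of(*qs: Query) -> Query:
    return lambda r: any(q(r) for q in qs)


def negate(q: Query) -> Query:
    return lambda r: not q(r)


def search(db: Iterable[Record], q: Query) -> List[Record]:
    return [r for r in db if q(r)]


GAL_KDO = ("aDGalp", "Kdo")
DB = [
    Record("R1", GAL_KDO, ((0, 1, 1, 3),), 2003, "Alphaproteobacteria"),
    Record("R2", GAL_KDO, ((0, 1, 1, 3),), 2004, "Gammaproteobacteria"),
    Record("R3", ("bDGlcpA", "Ala"), ((1, 2, 0, 6),), 2002, "Firmicutes"),
    Record("R4", GAL_KDO, ((0, 1, 1, 3),), 1999, "Alphaproteobacteria"),
    Record("R5", GAL_KDO, ((0, 1, 1, 3),), 2005, "Alphaproteobacteria", synthetic=True),
    Record("R6", GAL_KDO, ((0, 1, 1, 4),), 2003, "Alphaproteobacteria"),
]

query = all_of(
    published(2001, 2005),
    any_of(has_linkage("aDGalp", 1, 3, "Kdo"), has_residue("Lys"), has_residue("Ala")),
    negate(any_of(is_synthetic, from_taxon("Gammaproteobacteria"))),
)

print([r.rec_id for r in search(DB, query)])  # → ['R1', 'R3']
```

The sketch scans every record linearly. A production database would index residues and linkages so that fragment searches avoid a full scan, which is where the internal architecture becomes critical.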

The ability to process structural data correctly is directly related to the format of both internal and user-facing structure descriptions. The limitations and mutual incompatibility of glycan description languages have been limiting the progress of glycoinformatics for decades. The main criteria for evaluating a carbohydrate language are:

unambiguity (strict rules that allow recording of every chemically distinct structure in a unique way);
support of all structural features present in published carbohydrates (polymeric, oligomeric, cyclic and combined glycans, glycolipids, glycoproteins), including those containing non-carbohydrate constituents and various special cases (atypical residues, phospho- and sulfo-linkages, cyclic esters, amide and ether linkages, etc.);
support of incompletely determined structures at the level of monomers and their configurations, substitution positions, chain topology and side-chain stoichiometry;
computer-readability (with no need for complicated parsing, as in the case of Extended IUPAC) and human-readability (required for tracking errors that appear during human processing of data dumps);
compatibility with other formats (availability of converters that help language learning and cross-database work), e.g. a monomer vocabulary widely recognized by glycobiologists.

The CSDB Linear and GlycoCT languages possess most of these features. However, the former does not support some topologies, and the latter is not human-readable. Glycomics still lacks a standard language apart from IUPAC, which is highly imperfect.
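The unambiguity criterion above can be illustrated with a minimal Python sketch that serializes a branched structure so that the order in which branches were entered does not affect the result. The bracket notation used here is a toy format invented for this example; it is neither CSDB Linear nor GlycoCT, and the residue names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Residue:
    name: str                      # residue name from a controlled vocabulary
    link: str = ""                 # linkage to the parent, e.g. "1-3"; empty for the root
    children: List["Residue"] = field(default_factory=list)


def canonical(res: Residue) -> str:
    """Serialize a residue tree into a string independent of child input order."""
    # Serialize each branch first, then sort the strings, so that the ordering
    # depends only on branch content and not on how the branches were entered.
    branches = sorted(canonical(child) for child in res.children)
    inner = "".join(f"[{b}]" for b in branches)
    suffix = f"({res.link})" if res.link else ""
    return f"{inner}{res.name}{suffix}"


# The same branched trisaccharide entered with branches in different order.
first = Residue("bDGalp", children=[
    Residue("bDGlcpNAc", "1-4"),
    Residue("aLFucp", "1-2"),
])
second = Residue("bDGalp", children=[
    Residue("aLFucp", "1-2"),
    Residue("bDGlcpNAc", "1-4"),
])

print(canonical(first))                        # → [aLFucp(1-2)][bDGlcpNAc(1-4)]bDGalp
print(canonical(first) == canonical(second))   # → True
```

This covers branch ordering only. A real language must also give unique encodings for cyclic structures, repeating units and incompletely determined features, which this sketch does not attempt.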

A modern quality standard in scientific software engineering implies that both user and administrative interfaces are intuitive, well documented and freely accessible via the Internet. Intuitiveness extends to structure input and output formats, which users should not have to study. Standalone services for structure input and editing are extremely useful, as they allow users of any database to stay within the interface they are used to. Integration between projects implies not only a common interface for search requests but also automated data interchange via an API. This also concerns interactions with non-carbohydrate databases, at least NCBI Taxonomy and NCBI PubMed. The first two projects that reported protocols for automated data exchange were GLYCOSCIENCES.de and Bacterial CSDB, and since then the development of glyco-related web services has intensified.

Two special projects should be mentioned. EurocarbDB was funded as a design study of a database free of any disadvantages; however, its development was limited to designing approaches without implementing them (implementation is always a bottleneck due to the human factor) and to a Carbbank import. The opposite end of the ideological spectrum is occupied by the Glytoucan repository, which does not provide its own data but integrates other databases and assigns unique IDs to the glycan moieties of published glycans and glycoconjugates. It is a "database of databases" and allows cross-project operation within a single interface.

Within the Carbohydrate Structure Database (CSDB) project, started in 2005, we tried to develop a database free of the disadvantages of other glyco-databases at both the architectural and the content level. Since then CSDB has been regularly updated and upgraded, and it has served as a platform for multiple glycoinformatics services. CSDB has become one of the most recognized sources of data on carbohydrates of microorganisms, and it aims to be the ideological replacement of Carbbank.

Last update: 2017 Oct 29
