Download the reference library for EndNote (ZIP, 1.2 MB)
This collection includes ~670 articles that may be useful to glycoinformaticians. Both in the EndNote library and in this document, the references are divided into groups. Some references appear in more than one group. To search for authors, titles, or text from an abstract, use Ctrl-F. Click a DOI to go to the article.
Includes cutting-edge techniques for glycoinformatics studies; Provides practical detail essential for reproducible results; Contains key notes and implementation advice from the experts.
Protein glycosylation is the most complex and prevalent post-translational modification in terms of the number of proteins modified and the diversity generated. To understand the functional roles of glycoproteins it is important to gain an insight into the repertoire of oligosaccharides present. The comparison and relative quantitation of glycoforms combined with site-specific identification and occupancy are necessary steps in this direction. Computational platforms have continued to mature, assisting researchers with the interpretation of such glycomics and glycoproteomics data sets, but they frequently support only dedicated workflows, and users rely on the manual interpretation of data to gain insights into the glycoproteome. The growth of site-specific knowledge has also led to the implementation of machine-learning algorithms to predict glycosylation, which is now being integrated into glycoproteomics pipelines. This short review describes commercial and open-access databases and software with an emphasis on those that are actively maintained and designed to support current analytical workflows.
With the introduction of intuitive graphical software, structural biologists who are not experts in crystallography are now able to build complete protein or nucleic acid models rapidly. In contrast, carbohydrates are in a wholly different situation: scant automation exists, with manual building attempts being sometimes toppled by incorrect dictionaries or refinement problems. Sugars are the most stereochemically complex family of biomolecules and, as pyranose rings, have clear conformational preferences. Despite this, all refinement programs may produce high-energy conformations at medium to low resolution, without any support from the electron density. This problem renders the affected structures unusable in glyco-chemical terms. Bringing structural glycobiology up to 'protein standards' will require a total overhaul of the methodology. Time is of the essence, as the community is steadily increasing the production rate of glycoproteins, and electron cryo-microscopy has just started to image them in precisely that resolution range where crystallographic methods falter most.
Resource description framework (RDF) and Property Graph databases are emerging technologies that are used for storing graph-structured data. We compare these technologies through a molecular biology use case: glycan substructure search. Glycans are branched tree-like molecules composed of building blocks linked together by chemical bonds. The molecular structure of a glycan can be encoded into a directed acyclic graph where each node represents a building block and each edge serves as a chemical linkage between two building blocks. In this context, Graph databases are possible software solutions for storing glycan structures and Graph query languages, such as SPARQL and Cypher, can be used to perform a substructure search. Glycan substructure searching is an important feature for querying structure and experimental glycan databases and retrieving biologically meaningful data. This applies, for example, to identifying a region of the glycan recognised by a glycan binding protein (GBP). In this study, 19,404 glycan structures were selected from GlycomeDB (www.glycome-db.org) and modelled for storage in an RDF triple store and a Property Graph. We then performed two different sets of searches and compared the query response times and the results from both technologies to assess performance and accuracy. The two implementations produced the same results, but interestingly we noted a difference in the query response times. Qualitative measures such as portability were also used to define further criteria for choosing the technology adapted to solving glycan substructure search and other comparable issues.
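The substructure search described in this abstract can be illustrated without a database. The following is a minimal in-memory sketch in Python, not the SPARQL or Cypher queries used in the study: the `Residue` class, the helper names, and the linkage strings are assumptions made for illustration. Each node is a building block and each edge carries a linkage label, as in the abstract.

```python
from dataclasses import dataclass, field

@dataclass
class Residue:
    """One building block; children are (linkage, Residue) pairs (hypothetical model)."""
    name: str
    children: list = field(default_factory=list)

    def add(self, linkage, child):
        self.children.append((linkage, child))
        return child

def _assign(q_children, t_children, used):
    """Backtracking: map every query child to a distinct, compatible target child."""
    if not q_children:
        return True
    (q_link, q_child), rest = q_children[0], q_children[1:]
    for i, (t_link, t_child) in enumerate(t_children):
        if i in used or t_link != q_link:
            continue
        if matches_at(q_child, t_child) and _assign(rest, t_children, used | {i}):
            return True
    return False

def matches_at(query, target):
    """True if the query tree can be embedded with its root on this target node."""
    if query.name != target.name:
        return False
    return _assign(query.children, target.children, set())

def contains(target, query):
    """True if the query occurs anywhere in the target glycan."""
    if matches_at(query, target):
        return True
    return any(contains(child, query) for _, child in target.children)

# N-glycan core, rooted at the reducing-end GlcNAc.
core = Residue("GlcNAc")
branch_man = core.add("b1-4", Residue("GlcNAc")).add("b1-4", Residue("Man"))
branch_man.add("a1-3", Residue("Man"))
branch_man.add("a1-6", Residue("Man"))

# Query: a mannose carrying both an a1-3 and an a1-6 mannose.
query = Residue("Man")
query.add("a1-6", Residue("Man"))
query.add("a1-3", Residue("Man"))

# A query with a linkage that is absent from the core.
absent = Residue("Man")
absent.add("a1-2", Residue("Man"))

found = contains(core, query)
not_found = contains(core, absent)
```

A graph database performs the same pattern match declaratively; the backtracking step here corresponds to binding distinct child nodes to the branches of the query pattern.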
One aspect of glycome informatics is the analysis of carbohydrate sugar chains, or glycans, whose basic structure is not a sequence, but a tree structure. Although there has been much work in the development of sequence databases and matching algorithms for sequences (for performing queries and analyzing similarity), the more complicated tree structure of glycans does not allow a direct implementation of such a database for glycans, and further, does not allow for the direct application of sequence alignment algorithms for performing searches or analyzing similarity. Therefore, we have utilized a polynomial-time dynamic programming algorithm for solving the maximum common subtree of two trees to implement an accurate and efficient tool for finding and aligning maximally matching glycan trees. The recently released KEGG Glycan database for glycan structures incorporates our tree-structure alignment algorithm with various parameters to adapt to the needs of a variety of users. Because we use similarity scores as opposed to a distance metric, our methods are more readily used to display trees of higher similarity. We present the two methods developed for this purpose and illustrate their validity.
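To make the maximum common subtree idea concrete, here is a simplified Python sketch. It is not the KEGG Glycan implementation: the scoring (one point per matched residue, linkages ignored) and all function names are assumptions, and the child-to-child assignment is found by brute force over permutations, which is only practical because glycan residues have very few branches. A polynomial-time version would replace that step with a bipartite matching.

```python
from functools import lru_cache
from itertools import permutations

def tree(label, *children):
    """Build an immutable, hashable tree: (label, (child, child, ...))."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def rooted_score(a, b):
    """Size of the largest common subtree that maps root a onto root b."""
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    if len(kids_a) > len(kids_b):
        kids_a, kids_b = kids_b, kids_a
    best = 0
    # Try every way of pairing the smaller child set with distinct children
    # of the larger set; memoised sub-scores make this dynamic programming.
    for perm in permutations(kids_b, len(kids_a)):
        total = sum(rooted_score(x, y) for x, y in zip(kids_a, perm))
        best = max(best, total)
    return 1 + best

def nodes(t):
    """Yield every subtree (that is, every node) of t."""
    yield t
    for child in t[1]:
        yield from nodes(child)

def max_common_subtree(t1, t2):
    """Best score over all pairs of nodes taken as matched roots."""
    return max(rooted_score(a, b) for a in nodes(t1) for b in nodes(t2))

# Two illustrative glycans sharing the five-residue N-glycan core.
t1 = tree("GlcNAc", tree("GlcNAc", tree("Man", tree("Man"), tree("Man"))))
t2 = tree("GlcNAc", tree("Fuc"),
          tree("GlcNAc", tree("Man", tree("Man", tree("GlcNAc")), tree("Man"))))

shared = max_common_subtree(t1, t2)  # 5: the whole core is common to both
```

Because the result is a similarity score rather than a distance, larger values mean closer trees, which matches the point made in the abstract about ranking the most similar structures first.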
Because of the many different glycan-related databases now publicly available, a number of different glycan structure representations are used to graphically and textually display glycans structures. This overview provides an easy reference to each representation format as well as links to several major databases and web resources that may be useful for glycobiologists. Many of the major databases, glycan sequence and structure representation formats, and tools that are currently available and used in the glycoscience field will be described. The purpose of this chapter is to provide background knowledge for beginners and to provide any supporting information for the other chapters in this section.
BACKGROUND: The glycomics field has made great advancements in the last decade due to technologies for their synthesis and analysis including carbohydrate microarrays. Accordingly, databases for glycomics research have also emerged and been made publicly available by many major institutions worldwide. OBJECTIVE: This review introduces these and other useful databases on which new methods for drug discovery can be developed. METHODS: The scope of this review covers current documented and accessible databases and resources pertaining to glycomics. These were selected with the expectation that they may be useful for drug discovery research. RESULTS/CONCLUSION: There is a plethora of glycomics databases that have much potential for drug discovery. This may seem daunting at first but this review helps to put some of these resources into perspective. Additionally, some thoughts on how to integrate these resources to allow more efficient research are presented.
Glycans are branched, biosynthetic metabolic products that are commonly encoded by multiple genes. Unique genes may be involved in the biosynthesis of specific glycan classes (glycoprotein, glycolipid, glycosaminoglycans etc.), and at the same time, many glycogenes participate in the biosynthesis of more than one glycan class. These intricacies relating to gene expression, enzyme specificity, endoplasmic reticulum (ER)-Golgi compartment-specific localization of enzymes, the branched nature of glycan structures, and species-specific variation in monosaccharide composition make the analysis of glycosylation processes complicated. To aid this effort, a variety of analytical methods have been developed to identify and quantify the structure of glycans and their conjugates in biological samples. Glycoinformatics tools and software aim to use computers to integrate these experimental data, using our knowledge of glycan biosynthetic pathways as a backbone. Glycoinformatics databases ideally curate experimental data allowing glycan structures to be rigorously defined, archived, organized, searched, and annotated. When linked to other relational databases, glycoscience data may then be integrated with related genomic, transcriptomic, proteomic, lipidomic, and metabolomic information. This chapter describes the current status of glycoinformatics databases and software development, with focus on efforts to bridge the gap between glycan structure and function.
The heterogeneity, mobility and complexity of glycans in glycoproteins have been, and currently remain, significant challenges in structural biology. These aspects present unique problems to the two most prolific techniques: X-ray crystallography and cryo-electron microscopy. At the same time, advances in mass spectrometry have made it possible to get deeper insights on precisely the information that is most difficult to recover by structure solution methods: the full-length glycan composition, including linkage details for the glycosidic bonds. The developments have given rise to glycomics. Thankfully, several large scale glycomics initiatives have stored results in publicly available databases, some of which can be accessed through API interfaces. In the present work, we will describe how the Privateer carbohydrate structure validation software has been extended to harness results from glycomics projects, and its use to greatly improve the validation of 3D glycoprotein structures.
The number of proteins encoded in the human genome has been estimated at between 20,000 and 25,000, despite estimates that the entire proteome contains more than a million proteins. One reason for this difference is due to many post-translational modifications of protein that contribute to proteome complexity. Among these, glycosylation is of particular relevance because it serves to modify a large number of cellular proteins. Glycogenomics, glycoproteomics, glycomics, and glycoinformatics are helping to accelerate our understanding of the cellular events involved in generating the glycoproteome, the variety of glycan structures possible, and the importance of roles that glycans play in therapeutics and disease. Indeed, interest in glycosylation has expanded rapidly over the past decade, as large amounts of experimental 'omics data relevant to glycosylation processing have accumulated. Furthermore, new and more sophisticated glycoinformatics tools and databases are now available for glycan and glycosylation pathway analysis. Here, we summarize some of the recent advances in both experimental profiling and analytical methods involving N- and O-linked glycosylation processing for biotechnological and medically relevant cells together with the unique opportunities and challenges associated with interrogating and assimilating multiple, disparate high-throughput glycosylation data sets. This emerging era of advanced glycomics will lead to the discovery of key glycan biomarkers linked to diseases and help establish a better understanding of physiology and improved control of glycosylation processing in diverse cells and tissues important to disease and production of recombinant therapeutics. Furthermore, methodologies that facilitate the integration of glycomics measurements together with other 'omics data sets will lead to a deeper understanding and greater insights into the nature of glycosylation as a complex cellular process.
Despite some recent successes in deciphering new cancer molecular markers, there is still a clear and continual need to develop new technologies that help characterize existing biomarkers or facilitate discovery of new biomarkers. An important systems biology opportunity in this respect is provided by understanding the glycosylation changes associated with cancer. Indeed, interest in cancer glycosylation has expanded over the past decade and a large amount of data relevant to cancer glycosylation has been accumulating rapidly. Furthermore, new and improved glycoinformatics tools, methods and databases for glycan analysis now offer the opportunity to investigate these data to understand the role that glycans play in cancer glycosylation. Here we summarize developments of glycoinformatics tools to support analysis of cancer glycosylation and experimental glycoproteomics approaches. In addition, we discuss challenges faced by glycoinformatics for the integration and interrogation of disparate high-throughput glycan data sets in order to assimilate technologies and better address cancer glycosylation. We also provide examples of integrative glycoinformatics approaches that lead to a better understanding of cancer glycosylation as a complex cellular process.
Artificial intelligence (AI) methods have been and are now being increasingly integrated in prediction software implemented in bioinformatics and its glycoscience branch known as glycoinformatics. AI techniques have evolved in the past decades, and their applications in glycoscience are not yet widespread. This limited use is partly explained by the peculiarities of glyco-data that are notoriously hard to produce and analyze. Nonetheless, as time goes on, the accumulation of glycomics, glycoproteomics, and glycan-binding data has reached a point where even the most recent deep learning methods can provide predictors with good performance. We discuss the historical development of the application of various AI methods in the broader field of glycoinformatics. A particular focus is placed on shining a light on challenges in glyco-data handling, contextualized by lessons learnt from related disciplines. Ending on the discussion of state-of-the-art deep learning approaches in glycoinformatics, we also envision the future of glycoinformatics, including developments that need to occur in order to truly unleash the capabilities of glycoscience in the systems biology era.
UniCarbKB ( http://unicarbkb.org ) is a comprehensive resource for mammalian glycoprotein and annotation data. In particular, the database provides information on the oligosaccharides characterized from a glycoprotein at either the global or site-specific level. This evidence is accumulated from a peer-reviewed and manually curated collection of information on oligosaccharides derived from membrane and secreted glycoproteins purified from biological fluids and/or tissues. This information is further supplemented with experimental method descriptions that summarize important sample preparation and analytical strategies. A new release of UniCarbKB is published every three months, each includes a collection of curated data and improvements to database functionality. In this Chapter, we outline the objectives of UniCarbKB, and describe a selection of step-by-step workflows for navigating the information available. We also provide a short description of web services available and future plans for improving data access. The information presented in this Chapter supplements content available in our knowledgebase including regular updates on interface improvements, new features, and revisions to the database content ( http://confluence.unicarbkb.org ).
The GlyCosmos Glycoscience Portal (https://glycosmos.org) and PubChem (https://pubchem.ncbi.nlm.nih.gov/) are major portals for glycoscience and chemistry, respectively. GlyCosmos is a portal for glycan-related repositories, including GlyTouCan, GlycoPOST, and UniCarb-DR, as well as for glycan-related data resources that have been integrated from a variety of 'omics databases. Glycogenes, glycoproteins, lectins, pathways, and disease information related to glycans are accessible from GlyCosmos. PubChem, on the other hand, is a chemistry-based portal at the National Center for Biotechnology Information. PubChem provides information not only on chemicals, but also genes, proteins, pathways, as well as patents, bioassays, and more, from hundreds of data resources from around the world. In this work, these 2 portals have made substantial efforts to integrate their complementary data to allow users to cross between these 2 domains. In addition to glycan structures, key information, such as glycan-related genes, relevant diseases, glycoproteins, and pathways, was integrated and cross-linked with one another. The interfaces were designed to enable users to easily find, access, download, and reuse data of interest across these resources. Use cases are described illustrating and highlighting the type of content that can be investigated. In total, these integrations provide life science researchers improved awareness and enhanced access to glycan-related information.
The chemical composition of saccharide complexes underlies their biomedical activities as biomarkers for cardiometabolic disease, various types of cancer, and other conditions. However, because these molecules may undergo major structural modifications, distinguishing between compounds of saccharide and non-saccharide origin becomes a challenging computational problem that hinders the aggregation of information about their bioactive moieties. We have developed an algorithm and software package called "Cheminformatics Tool for Probabilistic Identification of Carbohydrates" (CTPIC) that analyzes the covalent structure of a compound to yield a probabilistic measure for distinguishing saccharides and saccharide-derivatives from non-saccharides. CTPIC analysis of the RCSB Ligand Expo (database of small molecules found to bind proteins in the Protein Data Bank) led to a substantial increase in the number of ligands characterized as saccharides. CTPIC analysis of Protein Data Bank identified 7.7% of the proteins as saccharide-binding. CTPIC is freely available as a webservice at (http://ctpic.nmrfam.wisc.edu).
BACKGROUND: Surface polysaccharides (SPs), such as lipopolysaccharide (O antigen) and capsular polysaccharide (K antigen), play a key role in the pathogenicity of Escherichia coli (E. coli). The gene cluster for polysaccharide antigen biosynthesis encodes various glycosyltransferases (GTs), which drive the process of SP synthesis and determine the serotype. RESULTS: In this study, a total of 7,741 E. coli genomic sequences were chosen for systematic data mining. The monosaccharides in both O and K antigens were dominated by D-hexopyranose, and the SPs in 70-80% of the strains consisted of only the five most common hexoses (or some of them). The linkages between two monosaccharides were mostly alpha-1,3 (23.15%) and beta-1,3 (20.49%) bonds. Uridine diphosphate activated more than 50% of monosaccharides for glycosyltransferase reactions. These results suggest that the most common pathways could be integrated into chassis cells to promote glycan biosynthesis. We constructed a database (EcoSP, http://ecosp.dmicrobe.cn/ ) for browsing this information, such as monosaccharide synthesis pathways. It can also be used for serotype analysis and GT annotation of known or novel E. coli sequences, thus facilitating diagnosis and typing. CONCLUSIONS: Summarizing and analyzing the properties of these polysaccharide antigens and GTs are of great significance for designing glycan-based vaccines and for synthetic glycobiology.
A torsion angle-based Monte Carlo searching routine was developed and applied to several carbohydrate modeling problems. The routine was developed as a Unix shell script that calls several programs, which allows it to be interfaced with multiple potential functions and various utilities for evaluating conformers. In its current form, the program operates with several versions of the MM3 and MM4 molecular mechanics programs and has a module to calculate hydrogen-hydrogen coupling constants. The routine was used to study the low-energy exo-cyclic substituents of beta-D-glucopyranose and the conformers of D-glucaramide, both of which had been previously studied with MM3 by full conformational searches. For these molecules, the program found all previously reported low-energy structures. The routine was also used to find favorable conformers of 2,3,4,5-tetra-O-acetyl-N,N'-dimethyl-D-glucaramide and D-glucitol, the latter of which is believed to have many low-energy forms. Finally, the technique was used to study the inter-ring conformations of beta-gentiobiose, a beta-(1-->6)-linked disaccharide of D-glucopyranose. The program easily found conformers in the 10 previously identified low-energy regions for this disaccharide. In 6 of the 10 local regions, the same previously identified low-energy structures were found. In the remaining four regions, the search identified structures with slightly lower energies than those previously reported. The approach should be useful for extending modeling studies on acyclic monosaccharides and possibly oligosaccharides.
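The general shape of a torsion-angle Monte Carlo search can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the original routine is a Unix shell script driving MM3/MM4, whereas the energy function below is a toy threefold torsional term, and the Metropolis acceptance rule, step size, and `kT` value are choices made here, not details taken from the abstract.

```python
import math
import random

def toy_energy(torsions):
    """Toy stand-in for a force field: minima at 60, 180 and 300 degrees."""
    return sum(1.0 + math.cos(math.radians(3 * t)) for t in torsions)

def monte_carlo_search(n_torsions, steps=2000, max_step=60.0, kT=0.6, seed=0):
    """Randomly perturb one torsion per step and keep the lowest-energy conformer."""
    rng = random.Random(seed)
    current = [rng.uniform(0.0, 360.0) for _ in range(n_torsions)]
    e_current = toy_energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(n_torsions)
        trial[i] = (trial[i] + rng.uniform(-max_step, max_step)) % 360.0
        e_trial = toy_energy(trial)
        # Metropolis criterion (an assumption): always accept downhill moves,
        # accept uphill moves with Boltzmann probability.
        accept = e_trial <= e_current or rng.random() < math.exp(-(e_trial - e_current) / kT)
        if accept:
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best

best_torsions, best_energy = monte_carlo_search(n_torsions=3)
```

In a real study the call to `toy_energy` would be replaced by an external molecular mechanics evaluation, which is the role the abstract assigns to the MM3 and MM4 programs.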
Glycoinformatics is a young scientific branch, which assists the scientific community in carbohydrate research. In spite of fundamental roles of glycans in living organisms, the informatization of the field is significantly hindered by the lack of standards, data indices, protocols and tools widely recognized by researchers and publishers. The connection of the existing glycoinformatics projects with each other and with global life-science projects is far from complete. Moreover, in contrast to genes and proteins, there is no obligatory unique identifier for each natural carbohydrate structure. In the current essay, we assess the basic principles of successful informatization of glycoscience and discuss their implementation in the present glycoinformatics projects.
The field of glycobiology is concerned with the study of the structure, properties, and biological functions of the family of biomolecules called carbohydrates. Bioinformatics for glycobiology is a particularly challenging field, because carbohydrates exhibit a high structural diversity and their chains are often branched. Significant improvements in experimental analytical methods over recent years have led to a tremendous increase in the amount of carbohydrate structure data generated. Consequently, the availability of databases and tools to store, retrieve and analyze these data in an efficient way is of fundamental importance to progress in glycobiology. In this review, the various graphical representations and sequence formats of carbohydrates are introduced, and an overview of newly developed databases, the latest developments in sequence alignment and data mining, and tools to support experimental glycan analysis are presented. Finally, the field of structural glycoinformatics and molecular modeling of carbohydrates, glycoproteins, and protein-carbohydrate interaction are reviewed.
Glycomics researchers have identified the need for integrated database systems for collecting glycomics information in a consistent format. The goal is to create a resource for knowledge discovery and dissemination to wider research communities. This has the potential to extend the research community to include biologists, clinicians, chemists, and computer scientists. This chapter discusses the technology and approach needed to create integrated data resources to empower the broader community to leverage extant glycomics data. The focus is on glycosaminoglycan (GAGs) and proteoglycan research, but the approach can be generalized. The methods described span the development of glycomics standards from CarbBank to Glyco Connection Tables. The existence of integrated data sets provides a foundation for novel methods of analysis such as machine learning for knowledge discovery. The implications of predictive analysis are examined in relation to disease biomarker to expand the target audience of GAG and proteoglycan research.
While abnormalities related to carbohydrates (glycans) are frequent for patients with rare and undiagnosed diseases as well as in many common diseases, these glycan-related phenotypes (glycophenotypes) are not well represented in knowledge bases (KBs). If glycan-related diseases were more robustly represented and curated with glycophenotypes, these could be used for molecular phenotyping to help to realize the goals of precision medicine. Diagnosis of rare diseases by computational cross-species comparison of genotype-phenotype data has been facilitated by leveraging ontological representations of clinical phenotypes, using Human Phenotype Ontology (HPO), and model organism ontologies such as Mammalian Phenotype Ontology (MP) in the context of the Monarch Initiative. In this article, we discuss the importance and complexity of glycobiology and review the structure of glycan-related content from existing KBs and biological ontologies. We show how semantically structuring knowledge about the annotation of glycophenotypes could enhance disease diagnosis, and propose a solution to integrate glycophenotypes and related diseases into the Unified Phenotype Ontology (uPheno), HPO, Monarch and other KBs. We encourage the community to practice good identifier hygiene for glycans in support of semantic analysis, and clinicians to add glycomics to their diagnostic analyses of rare diseases.
Glycoproteins and protein-carbohydrate complexes in the worldwide Protein Data Bank (wwPDB) can be an excellent source of information for glycoscientists. Unfortunately, a rather large number of errors and inconsistencies are found in the glycan moieties of these 3D structures. This review illustrates frequent problems of carbohydrate moieties in wwPDB entries, such as nomenclature issues, incorrect N-glycan core structures, missing or erroneous linkages, or poor glycan geometry, and describes the carbohydrate-specific validation tools that are designed to identify such problems. Recommendations on how to avoid these issues or how to rectify incorrect structures are also given.
SUMMARY: Glycoinformatics plays a major role in glycobiology research, and the development of a comprehensive glycoinformatics knowledgebase is critical. This application note describes the GlyGen data model, processing workflow and the data access interfaces featuring programmatic use case example queries based on specific biological questions. The GlyGen project is a data integration, harmonization and dissemination project for carbohydrate and glycoconjugate-related data retrieved from multiple international data sources including UniProtKB, GlyTouCan, UniCarbKB and other key resources. AVAILABILITY AND IMPLEMENTATION: GlyGen web portal is freely available to access at https://glygen.org. The data portal, web services, SPARQL endpoint and GitHub repository are also freely available at https://data.glygen.org, https://api.glygen.org, https://sparql.glygen.org and https://github.com/glygener, respectively. All code is released under license GNU General Public License version 3 (GNU GPLv3) and is available on GitHub https://github.com/glygener. The datasets are made available under Creative Commons Attribution 4.0 International (CC BY 4.0) license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
The GlySpace Alliance was formed in 2018 among the principal investigators of three major glycoscience portals: Glyco@Expasy, GlyCosmos, and GlyGen, representing Europe, Asia, and the United States, respectively. While each of these portals has its unique user interface, the aim is to provide the same basic data set of glycan-related omics data. These portals will be introduced with the aim to enable users to find their target information in the most efficient manner, in particular, in terms of the chemical structures of glycans and their functions.
The glycome constitutes the entire complement of free carbohydrates and glycoconjugates expressed on whole cells or tissues. 'Systems Glycobiology' is an emerging discipline that aims to quantitatively describe and analyse the glycome. Here, instead of developing a detailed understanding of single biochemical processes, a combination of computational and experimental tools are used to seek an integrated or 'systems-level' view. This can explain how multiple biochemical reactions and transport processes interact with each other to control glycome biosynthesis and function. Computational methods in this field commonly build in silico reaction network models to describe experimental data derived from structural studies that measure cell-surface glycan distribution. While considerable progress has been made, several challenges remain due to the complex and heterogeneous nature of this post-translational modification. First, for the in silico models to be standardized and shared among laboratories, it is necessary to integrate glycan structure information and glycosylation-related enzyme definitions into the mathematical models. Second, as glycoinformatics resources grow, it would be attractive to utilize 'Big Data' stored in these repositories for model construction and validation. Third, while the technology for profiling the glycome at the whole-cell level has been standardized, there is a need to integrate mass spectrometry derived site-specific glycosylation data into the models. The current review discusses progress that is being made to resolve the above bottlenecks. The focus is on how computational models can bridge the gap between 'data' generated in wet-laboratory studies with 'knowledge' that can enhance our understanding of the glycome.
Knowledge of the three-dimensional structures of the carbohydrate molecules is indispensable for a full understanding of the molecular processes in which carbohydrates are involved, such as protein glycosylation or protein-carbohydrate interactions. The Protein Data Bank (PDB) is a valuable resource for three-dimensional structural information on glycoproteins and protein-carbohydrate complexes. Unfortunately, many carbohydrate moieties in the PDB contain inconsistencies or errors. This article gives an overview of the information that can be obtained from individual PDB entries and from statistical analyses of sets of three-dimensional structures, of typical problems that arise during the analysis of carbohydrate three-dimensional structures and of the validation tools that are currently available to scientists to evaluate the quality of these structures.
Glycoinformatics is a small but growing branch of bioinformatics and chemoinformatics. Various resources are now available that can be of use to glycobiologists, but also to chemists who work on the synthesis or analysis of carbohydrates. This article gives an overview of existing glyco-specific databases and tools, with a focus on their application to glycochemistry: Databases can provide information on candidate glycan structures for synthesis, or on glyco-enzymes that can be used to synthesize carbohydrates. Statistical analyses of glycan databases help to plan glycan synthesis experiments. 3D-Structural data of protein-carbohydrate complexes are used in targeted drug design, and tools to support glycan structure analysis aid with quality control. Specific problems of glycoinformatics compared to bioinformatics for genomics or proteomics, especially concerning integration and long-term maintenance of the existing glycan databases, are also discussed.
Knowledge of the 3D structure of glycans is a prerequisite for a complete understanding of the biological processes glycoproteins are involved in. However, due to a lack of standardised nomenclature, carbohydrate compounds are difficult to locate within the Protein Data Bank (PDB). Using an algorithm that detects carbohydrate structures only requiring element types and atom coordinates, we were able to detect 1663 entries containing a total of 5647 carbohydrate chains. The majority of chains are found to be N-glycosidically bound. Noncovalently bound ligands are also frequent, while O-glycans form a minority. About 30% of all carbohydrate containing PDB entries comprise one or several errors. The automatic assignment of carbohydrate structures in PDB entries will improve the cross-linking of glycobiology resources with genomic and proteomic data collections, which will be an important issue of the upcoming glycomics projects. By aiding in detection of erroneous annotations and structures, the algorithm might also help to increase database quality.
Protein glycosylation is an important post-translational modification. It is a feature that enhances the functional diversity of proteins and influences their biological activity. A wide range of functions for glycans have been described, from structural roles to participation in molecular trafficking, self-recognition and clearance. Understanding the basis of these functions is challenging because the biosynthetic machinery that constructs glycans executes sequential and competitive steps that result in a mixture of glycosylated variants (glycoforms) for each glycoprotein. Additionally, naturally occurring glycoproteins are often present at low levels, putting pressure on the sensitivity of the analytical technologies. No universal method for the rapid and reliable identification of glycan structure is currently available; hence, research goals must dictate the best method or combination of methods. To this end, we introduce some of the major technologies routinely used for structural N- and O-glycan analysis, describing the complementary information that each provides.
Glycoproteins fulfill many indispensable biological functions, and changes in protein glycosylation have been observed in various diseases. Improved analytical methods are needed to allow a complete characterization of this complex and common post-translational modification. In this study, we present a workflow for the analysis of the microheterogeneity of N-glycoproteins that couples hydrophilic interaction and nanoreverse-phase C18 chromatography to tandem QTOF mass spectrometric analysis. A glycan database search program, GlycoPeptideSearch, was developed to match N-glycopeptide MS/MS spectra with the glycopeptides comprised of a glycan drawn from the GlycomeDB glycan structure database and a peptide from a user-specified set of potentially glycosylated peptides. Application of the workflow to human haptoglobin and hemopexin, two microheterogeneous N-glycoproteins, identified a total of 57 distinct site-specific glycoforms in the case of haptoglobin and 14 site-specific glycoforms of hemopexin. Using glycan oxonium ions and peptide-characteristic glycopeptide fragment ions and by collapsing topologically redundant glycans, the search software was able to make unique N-glycopeptide assignments for 51% of assigned spectra, with the remaining assignments primarily representing isobaric topological rearrangements. The optimized workflow, coupled with GlycoPeptideSearch, is expected to make high-throughput semiautomated glycopeptide identification feasible for a wide range of users.
Carbohydrate libraries printed in glycan microarray format have had a great impact on the high-throughput analysis of the specificity of a wide range of mammalian, plant, and bacterial lectins. Chemical and chemo-enzymatic synthesis allows the construction of diverse glycan libraries but requires substantial effort and resources. To leverage the synthetic effort, the ideal library would be a minimal subset of all structures that provides optimal diversity. Therefore, a measure of library diversity is needed. To this end, we developed a linear representation of glycans using standard chemoinformatic tools. This representation was applied to measure pairwise similarity and consequently diversity of glycan libraries in a single value. The diversities of four existing sialoside glycan arrays were compared. More diverse arrays that contain fewer glycans are proposed. This algorithm can be applied to diverse aspects of library design from target structure selection to the choice of building blocks for their synthesis.
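The idea of condensing pairwise similarity into a single diversity value can be sketched as follows. The abstract does not specify the fingerprint or the similarity measure, so this example assumes character k-grams of a linear glycan string as features, the Tanimoto coefficient as the similarity, and one minus the mean pairwise similarity as the diversity. All function names are hypothetical.

```python
from itertools import combinations

def kgrams(linear_code, k=3):
    """Feature set for one glycan: all substrings of length k (assumed fingerprint)."""
    return {linear_code[i:i + k] for i in range(len(linear_code) - k + 1)}

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def library_diversity(linear_codes, k=3):
    """Diversity as 1 minus the mean pairwise Tanimoto similarity."""
    fingerprints = [kgrams(code, k) for code in linear_codes]
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0  # a library with fewer than two glycans has no pairwise diversity
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity

# Two sialosides that differ only in the sialic acid linkage.
library = ["Neu5Ac(a2-3)Gal(b1-4)GlcNAc", "Neu5Ac(a2-6)Gal(b1-4)GlcNAc"]
print(library_diversity(library))
```

With a measure of this kind, candidate sub-libraries can be compared by a single number, and a smaller array can be preferred when it scores at least as high as a larger one.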
Glycomics, an integrated approach to study structure-function relationships of complex carbohydrates (or glycans), is an emerging field in this age of post-genomics. Realizing the importance of glycomics, many large scale research initiatives have been established to generate novel resources and technologies to advance glycomics. These initiatives are generating and cataloging diverse data sets necessitating the development of bioinformatic platforms to acquire, integrate, and disseminate these data sets in a meaningful fashion. With the consortium for functional glycomics (CFG) as the model system, this review discusses databases and the bioinformatics platform developed by this consortium to advance glycomics.
A novel four-bead coarse-grained (CG) model for carbohydrates denoted PITOMBA was devised using a bottom-up approach based on the atomistic GROMOS 53A6GLYC force field and on experimental thermodynamical data. The model was developed to be used in conjunction with the SPC CG water model (J. Chem. Phys. 2011, 134, 084110) and the GROMOS force field functional form. Explicit electrostatic interactions are considered by assigning point charges to each CG bead. Validation of the model is presented for a variety of structural and thermodynamic properties of mono- and oligosaccharides in solution. In addition, the model development philosophy allows for prompt extensions to include hexopyranose chains with diverse glycosidic linkages and branches.
Complex carbohydrates are built for high-density biocoding, which is at par with proteins and nucleic acids and their role and importance is widely being recognized. This can be conceptualized as an extended paradigm of molecular biology in which biological information flows from DNA to RNA and protein. This article describes the role of glycoinformatics in the growth of glycobiology and in the area of structural characterization of glycans, within the area of carbohydrate research.
Functional glycomics, the scientific attempt to identify and assign functions to all glycan molecules synthesized by an organism, is an emerging field of science. In recent years, several databases have been started, all aiming to support deciphering the biological function of carbohydrates. However, diverse encoding and storage schemes are in use amongst these databases, significantly hampering the interchange of data. The mutual online access between the Bacterial Carbohydrate Structure DataBase (BCSDB) and the GLYCOSCIENCES.de portal is described as a first reported attempt of a structure-based direct interconnection of two glyco-related databases. In this approach, users have to learn only one interface, will always have access to the latest data of both services, and will have the results of both searches presented in a consistent way. The establishment of this connection helped to find shortcomings and inconsistencies in the database design and functionality related to underlying data concepts and structural representations. For the maintenance of the databases, duplication of work can be easily avoided, which will hopefully lead to a better worldwide acceptance of both services within the community of glycoscientists. BCSDB is available at http://www.glyco.ac.ru/bcsdb/ and the GLYCOSCIENCES.de portal at http://www.glycosciences.de/.
---.
High-throughput methods to identify and quantify glycans in a given sample are rare. We have optimised a robotic platform for analysing biopharmaceuticals at each stage of the manufacturing process. In addition, it can be applied to basic research. The plate format makes it convenient for large sample sets; it is relatively cheap, robust and quantitative. However, the large datasets churned out by this platform require significant time to interpret. Consequently, informatics tools are required to help with this annotation. This article briefly describes our robotic platform and concentrates on a set of software tools for the interpretation of quantitative glycoprofiling data.
---.
The MIRAGE (minimum information required for a glycomics experiment) initiative was founded in Seattle, WA, in November 2011 in order to develop guidelines for reporting the qualitative and quantitative results obtained by diverse types of glycomics analyses, including the conditions and techniques that were applied to prepare the glycans for analysis and generate the primary data along with the tools and parameters that were used to process and annotate this data. These guidelines must address a broad range of issues, as glycomics data are inherently complex and are generated using diverse methods, including mass spectrometry (MS), chromatography, glycan array-binding assays, nuclear magnetic resonance (NMR) and other rapidly developing technologies. The acceptance of these guidelines by scientists conducting research on biological systems in which glycans have a significant role will facilitate the evaluation and reproduction of glycomics experiments and data that is reported in scientific journals and uploaded to glycomics databases. As a first step, MIRAGE guidelines for glycan analysis by MS have been recently published (Kolarich D, Rapp E, Struwe WB, Haslam SM, Zaia J., et al. 2013. The minimum information required for a glycomics experiment (MIRAGE) project - Improving the standards for reporting mass spectrometry-based glycoanalytic data. Mol. Cell Proteomics. 12:991-995), allowing them to be implemented and evaluated in the context of real-world glycobiology research. In this paper, we set out the historical context, organization structure and overarching objectives of the MIRAGE initiative.
The field of Glycomics has emerged as a result of technical advances that allow high-throughput, high-sensitivity analysis of structurally complex molecules. The complexity of glycan biosynthesis makes its analysis difficult, limiting glycomics knowledge of the mechanisms by which glycans carry out their biological functions. This chapter provides an overview of the general problem of integrating glycomics information and structural databases. Development of tools is required to create informatics systems that significantly advance our understanding of glycobiology. The basic infrastructure for semantic integration of glycomics data includes knowledge repositories, data repositories and tools for glycoinformatics that are required for high-throughput data acquisition, integration of the data, and discovery of knowledge that is inferred by the data. This provides an understanding of the structure and biosynthesis of glycans. These powerful new analytical techniques produce large amounts of information-rich raw data, making it necessary to develop equally powerful informatics techniques that process and mine data in order to understand its biological implications in detail. The highest priority is the extension and utilization of curational tools to populate ontologies with trusted knowledge in the glycomics domain. Ontological specification of the fundamental knowledge serves as a springboard for the further development of powerful tools for interpreting glycomics data by providing a basis for the expressive annotation of experimental data, facilitating the realization of its biological relevance.
no abstract.
Structural and functional studies in glycochemistry and glycobiology have been optimized and automated, closing the gap between glycomics and the other life sciences. To this end, models and algorithms that account for carbohydrate specifics were developed at the fundamental level, and standards, databases, and web services were tested and deployed at the applied level. The topics covered include rules for information processing in glycochemistry; a global carbohydrate database and a platform for services; a semantic carbohydrate language and its relation to atomic models; NMR simulation of carbohydrates; structure prediction from NMR spectra; modeling of the molecular geometry of carbohydrates; clustering of biological taxa based on the diversity of their glycomes; and a carbohydrate ontology. The computational tools of glycomics were created and combined into a coherent system, verified on model objects, and used in real research. A worldwide direction for the development of a young science, glycoinformatics, has been set and sustained.
This book provides glycoscientists with a handbook of useful databases that can be applied to glycoscience research. Although many databases are now publicly available, one of the hurdles for their users is the learning curve required to effectively utilize those databases. Therefore, this book not only describes the existing databases, but also provides tips on how to obtain the target data. That is, because many databases provide a variety of data that could be obtained from different perspectives, each chapter provides users with potential biological questions that can be answered by a particular database and step-by-step instructions, with figures, on how to obtain that data. Troubleshooting tips are also provided to aid users encountering problems that can be predicted when using these databases. Moreover, contact information for each database is provided in case unexpected issues arise.
In the bioinformatics field, many computer algorithmic and data mining technologies have been developed for gene prediction, protein-protein interaction analysis, sequence analysis, and protein folding predictions, to name a few. This kind of research has branched off from the genomics field, creating the transcriptomics, proteomics, metabolomics, and glycomics research areas in the postgenomic age. In the glycomics field, given the complexity of glycan structures with their branches of monosaccharides in various conformations, new data mining and algorithmic methods have been developed in an attempt to gain a better understanding of glycans. However, these methods have not all been implemented as tools such that the glycobiology community may utilize them in their research. Thus, we have developed RINGS (Resource for INformatics of Glycomes at Soka) as a freely available Web resource for glycobiologists to analyze their data using the latest data mining and algorithmic techniques. It provides a number of tools including a 2D glycan drawing and querying interface called DrawRINGS, a Glycan Pathway Predictor (GPP) tool for dynamically computing the N-glycan biosynthesis pathway from a given glycan structure, and data mining tools Glycan Miner Tool and Profile PSTMM. These tools and other utilities provided by RINGS will be described. The URL for RINGS is http://rings.t.soka.ac.jp/.
MOTIVATION: In the field of glycomics research, several different techniques are used for structure elucidation. Although multiple techniques are often used to increase confidence in structure assignments, most glycomics databases allow storing of only a single type of experimental data. In addition, the methods used to prepare a sample for analysis are seldom recorded, making it harder to reproduce the analytical data and results. RESULTS: We have extended the freely available EUROCarbDB framework to allow the submission of experimental data and the reporting of several orthogonal experimental datasets. The features aim to increase the understandability and reproducibility of the reported data. AVAILABILITY AND IMPLEMENTATION: The installation with the glycan standards is available at http://glycomics.ccrc.uga.edu/eurocarb/. The source code of the project is available at https://code.google.com/p/ucdb/.
Glycans are known as the third major class of biopolymers, next to DNA and proteins. They cover the surfaces of many cells, serving as the 'face' of cells, whereby other biomolecules and viruses interact. The structure of glycans, however, differs greatly from DNA and proteins in that they are branched, as opposed to linear sequences of amino acids or nucleotides. Therefore, the storage of glycan information in databases, let alone their curation, has been a difficult problem. This has caused many duplicated efforts when integration is attempted between different databases, making an international repository for glycan structures, where unique accession numbers are assigned to every identified glycan structure, necessary. As such, an international team of developers and glycobiologists has collaborated to develop this repository, called GlyTouCan and available at http://glytoucan.org/, to provide a centralized resource for depositing glycan structures, compositions and topologies, and to retrieve accession numbers for each of these registered entries. This will thus enable researchers to reference glycan structures simply by accession number, as opposed to by chemical structure, which has hampered the integration of glycomics databases in the past.
Many databases of carbohydrate structures and related information can be found on the World Wide Web. This review covers the major carbohydrate databases that have potential utility for glycoscientists and researchers entering the glycosciences. The first half provides a brief overview of carbohydrate databases and web resources (including a history of carbohydrate databases and carbohydrate notations used in these databases), and the second half provides a guide that can be used as an index to determine which resources provide the data of most interest to the user.
BACKGROUND: The glycomics field has made great advancements in the last decade due to technologies for their synthesis and analysis including carbohydrate microarrays. Accordingly, databases for glycomics research have also emerged and been made publicly available by many major institutions worldwide. OBJECTIVE: This review introduces these and other useful databases on which new methods for drug discovery can be developed. METHODS: The scope of this review covers current documented and accessible databases and resources pertaining to glycomics. These were selected with the expectation that they may be useful for drug discovery research. RESULTS/CONCLUSION: There is a plethora of glycomics databases that have much potential for drug discovery. This may seem daunting at first but this review helps to put some of these resources into perspective. Additionally, some thoughts on how to integrate these resources to allow more efficient research are presented.
This chapter describes the KEGG GLYCAN database of the KEGG resource, including descriptions of links to the other databases in KEGG. In particular, KEGG GLYCAN consists of glycan structures, with links to glycogenes, orthologs, reactions, pathways, drugs, diseases, and others, all within the KEGG resources. A number of analytical tools are also available, including the composite structure map (CSM), KegDraw, KCam, and GECS. These databases and tools will be described along with simple examples of their usage.
Glycans are crucial to the functioning of multicellular organisms. They may also play a role as mediators between host and parasite or symbiont. As many proteins (>50%) are posttranslationally modified by glycosylation, this mechanism is considered to be the most widespread posttranslational modification in eukaryotes. These surface modifications alter and regulate structure and biological activities/functions of proteins/biomolecules as they are largely involved in the recognition process of the appropriate structure in order to bind to the target cells. Consequently, the recognition of glycans on cellular surfaces plays a crucial role in the promotion or inhibition of various diseases and, therefore, glycosylation itself is considered to be a critical protein quality control attribute for commercial therapeutics, which is one of the fastest growing segments in the pharmaceutical industry. With the development of glycobiology as a separate discipline, a number of databases and tools became available in a similar way to other well-established "omics." Alleviating the recognized shortcomings of the available tools for data storage and retrieval is one of the highest priorities of the international glycoinformatics community. In the last decade, major efforts have been made, by leading scientific groups, towards the integration of a number of major databases and tools into a single portal, which would act as a centralized data repository for glycomics, equipped with a number of comprehensive analytical tools for data systematization, analysis, and comparison. This chapter provides an overview of the most important carbohydrate-related databases and glycoinformatic tools.
SWISS-MODEL Repository (SMR) is a database of annotated 3D protein structure models generated by the automated SWISS-MODEL homology modeling pipeline. It currently holds >400 000 high quality models covering almost 20% of Swiss-Prot/UniProtKB entries. In this manuscript, we provide an update of features and functionalities which have been implemented recently. We address improvements in target coverage, model quality estimates, functional annotations and improved in-page visualization. We also introduce a new update concept which includes regular updates of an expanded set of core organism models and UniProtKB-based targets, complemented by user-driven on-demand update of individual models. With the new release of the modeling pipeline, SMR has implemented a REST-API and adopted an open licencing model for accessing model coordinates, thus enabling bulk download for groups of targets fostering re-use of models in other contexts. SMR can be accessed at https://swissmodel.expasy.org/repository.
The EPS Database (EPS-DB) is a web-based, platform-independent database of bacterial exopolysaccharides (EPSs) providing access to detailed structural, taxonomic, growth conditions, functional properties, genetic, and bibliographic information for EPSs. It is freely available on the Internet as a website at http://www.epsdatabase.com. Several structural data representation schemes are used following the most commonly accepted formats. This guarantees full interoperability with other structural, experimental, and functional databases in the area of glycoscience. The scientific usage of EPS-DB through a user-friendly interface is presented with a subsection of the database exemplified by EPSs from lactic acid bacteria.
Glycosciences.DB, the glycan structure database of the Glycosciences.de portal, collects various kinds of data on glycan structures, including carbohydrate moieties from worldwide Protein Data Bank (wwPDB) structures. This way it forms a bridge between glycomics and proteomics resources. A major update of this database combines a redesigned web interface with a series of new functions. These include separate entry pages not only for glycan structures but also for literature references and wwPDB entries, improved substructure search options, a newly available keyword search covering all types of entries in one query, and new types of information that are added to glycan structures. These new features are described in detail in this article, and options for how users can provide information to the database are discussed as well. Glycosciences.DB is available at http://www.glycosciences.de/database/ and can be freely accessed.
Compared to proteomics, computational platforms for glycoproteomics are at an early stage and many researchers rely on the manual interpretation of large data sets to gain structural insights into the glycoproteome. Over the last few years there has been a steady increase in the availability of bioinformatics tools for processing and annotating glycoproteomics data sets. This mini-review describes advances in the development of algorithms and software applications and their applications. Furthermore, an update on structural and analytical databases is presented with a focus on those resources still actively maintained by the community, and how these resources are now being integrated into glycoproteomics pipelines to improve data interpretation.
Despite the success of several international initiatives the glycosciences still lack a managed infrastructure that contributes to the advancement of research through the provision of comprehensive structural and experimental glycan data collections. UniCarbKB is an initiative that aims to promote the creation of an online information storage and search platform for glycomics and glycobiology research. The knowledgebase will offer a freely accessible and information-rich resource supported by querying interfaces, annotation technologies and the adoption of common standards to integrate structural, experimental and functional data. The UniCarbKB framework endeavors to support the growth of glycobioinformatics and the dissemination of knowledge through the provision of an open and unified portal to encourage the sharing of data. In order to achieve this, the framework is committed to the development of tools and procedures that support data annotation, and expanding interoperability through cross-referencing of existing databases. Database URL: http://www.unicarbkb.org.
BACKGROUND: UniCarbKB aims to provide a resource for the representation of mammalian glycobiology knowledge by providing a curated database of structural and experimental data, supported by a web application that allows users to easily find and view richly annotated information. The database comprises two levels of annotation (i) global-specific data of oligosaccharides released and characterised from single purified glycoproteins and (ii) information pertaining to site-specific glycan heterogeneity. Additional, contextual information is provided including structural, bibliographic, and taxonomic information for each entry. METHODS: Since the launch of UniCarbKB in 2012, we have continued to improve the organisation of our data model. Recently, we have extended our pipeline to collate structural and abundance changes of oligosaccharides in different human disease states and experimental models to extend our coverage of the human glycome. RESULTS: In this manuscript, we demonstrate the capability of UniCarbKB to store and query relative glycan abundance data using a set of published colorectal and prostate cancer cell lines as examples. Furthermore, we outline our strategy for managing large-scale glycoproteomics data, site-specific and glycan compositional data, and how this information is adding value to UniCarbKB. Finally, we summarise our efforts to improve the efficient representation of disease terms and associated changes in glycan heterogeneity by integrating the Disease Ontology. CONCLUSIONS: Updates and improvements to UniCarbKB have introduced unique features for storing and displaying glycosylation features of mammalian glycoproteins. The integration of site-specific glycosylation data obtained from large-scale glycoproteomics and introduction of cell line studies will improve the analysis of glycoproteins and entire glycomes. 
GENERAL SIGNIFICANCE: Continuing advancements in analytical technologies and new data types are advancing disease-related glycomics. It is increasingly necessary to ensure all the data are comprehensively annotated. UniCarbKB was established with the mission of providing a resource for human glycobiology by capturing a wide range of data with corresponding annotations.
The UniCarb KnowledgeBase (UniCarbKB; http://unicarbkb.org) offers public access to a growing, curated database of information on the glycan structures of glycoproteins. UniCarbKB is an international effort that aims to further our understanding of structures, pathways and networks involved in glycosylation and glyco-mediated processes by integrating structural, experimental and functional glycoscience information. This initiative builds upon the success of the glycan structure database GlycoSuiteDB, together with the informatic standards introduced by EUROCarbDB, to provide a high-quality and updated resource to support glycomics and glycoproteomics research. UniCarbKB provides comprehensive information concerning glycan structures, and published glycoprotein information including global and site-specific attachment information. For the first release over 890 references, 3740 glycan structure entries and 400 glycoproteins have been curated. Further, 598 protein glycosylation sites have been annotated with experimentally confirmed glycan structures from the literature. Among these are 35 glycoproteins, 502 structures and 60 publications previously not included in GlycoSuiteDB. This article provides an update on the transformation of GlycoSuiteDB (featured in previous NAR Database issues and hosted by ExPASy since 2009) to UniCarbKB and its integration with UniProtKB and GlycoMod. Here, we introduce a refactored database, supported by substantial new curated data collections and intuitive user-interfaces that improve database searching.
BACKGROUND: Recent progress in method development for characterising the branched structures of complex carbohydrates has now enabled higher throughput technology. Automation of structure analysis then calls for software development since adding meaning to large data collections in reasonable time requires corresponding bioinformatics methods and tools. Current glycobioinformatics resources do cover information on the structure and function of glycans, their interaction with proteins or their enzymatic synthesis. However, this information is partial, scattered and often difficult to find for non-glycobiologists. METHODS: Following our diagnosis of the causes of the slow development of glycobioinformatics, we review the "objective" difficulties encountered in defining adequate formats for representing complex entities and developing efficient analysis software. RESULTS: Various solutions already implemented and strategies defined to bridge glycobiology with different fields and integrate the heterogeneous glyco-related information are presented. CONCLUSIONS: Despite the initial stage of our integrative efforts, this paper highlights the rapid expansion of glycomics, the validity of existing resources and the bright future of glycobioinformatics.
SUMMARY: The development of robust high-performance liquid chromatography (HPLC) technologies continues to improve the detailed analysis and sequencing of glycan structures released from glycoproteins. Here, we present a database (GlycoBase) and analytical tool (autoGU) to assist the interpretation and assignment of HPLC-glycan profiles. GlycoBase is a relational database which contains the HPLC elution positions for over 350 2-AB labelled N-glycan structures together with predicted products of exoglycosidase digestions. AutoGU assigns provisional structures to each integrated HPLC peak and, when used in combination with exoglycosidase digestions, progressively assigns each structure automatically based on the footprint data. These tools are potentially very promising and facilitate basic research as well as the quantitative high-throughput analysis of low concentrations of glycans released from glycoproteins. AVAILABILITY: http://glycobase.ucd.ie.
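The step of assigning provisional structures to an HPLC peak can be pictured as a tolerance lookup of the peak's glucose unit (GU) value against stored reference values. The sketch below is only an illustration of that idea, not the autoGU implementation: the function name, the 0.05 GU tolerance, and the reference entries (placeholder names and made-up GU values) are all assumptions.

```python
def provisional_assignments(peak_gu, reference, tolerance=0.05):
    """Return reference structure names whose GU value lies within
    `tolerance` of the observed peak, closest match first."""
    hits = [
        (abs(gu - peak_gu), name)
        for name, gu in reference.items()
        if abs(gu - peak_gu) <= tolerance
    ]
    return [name for _, name in sorted(hits)]

# Placeholder reference table: names and GU values are invented for the example.
reference_gu = {
    "structure_A": 5.50,
    "structure_B": 5.53,
    "structure_C": 6.20,
}

print(provisional_assignments(5.51, reference_gu))  # ['structure_A', 'structure_B']
print(provisional_assignments(7.00, reference_gu))  # []
```

When several candidates fall within the tolerance, as for the first peak here, exoglycosidase digestions of the kind described in the abstract are what narrow the list down.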
MOTIVATION: Glycan microarrays are capable of illuminating the interactions of glycan-binding proteins (GBPs) against hundreds of defined glycan structures, and have revolutionized the investigations of protein-carbohydrate interactions underlying numerous critical biological activities. However, it is difficult to interpret microarray data and identify structural determinants promoting glycan binding to glycan-binding proteins due to the ambiguity in microarray fluorescence intensity and complexity in branched glycan structures. To facilitate analysis of glycan microarray data alongside protein structure, we have built the Glycan Microarray Database (GlyMDB), a web-based resource including a searchable database of glycan microarray samples and a toolset for data/structure analysis. RESULTS: The current GlyMDB provides data visualization and glycan-binding motif discovery for 5203 glycan microarray samples collected from the Consortium for Functional Glycomics. The unique feature of GlyMDB is to link microarray data to PDB structures. The GlyMDB provides different options for database query, and allows users to upload their microarray data for analysis. After search or upload is complete, users can choose the criterion for binder versus non-binder classification. They can view the signal intensity graph including the binder/non-binder threshold followed by a list of glycan-binding motifs. One can also compare the fluorescence intensity data from two different microarray samples. A protein sequence-based search is performed using BLAST to match microarray data with all available PDB structures containing glycans. The glycan ligand information is displayed, and links are provided for structural visualization and redirection to other modules in GlycanStructure.ORG for further investigation of glycan-binding sites and glycan structures. AVAILABILITY AND IMPLEMENTATION: http://www.glycanstructure.org/glymdb. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
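The binder versus non-binder classification mentioned above can be illustrated with a simple intensity cutoff. This is a hedged sketch only: the abstract does not state which criteria GlyMDB offers, so the fraction-of-maximum rule, the default of 0.2, the glycan labels, and the function name `classify_binders` are assumptions made for illustration.

```python
from typing import Dict, List


def classify_binders(
    intensities: Dict[str, float], fraction_of_max: float = 0.2
) -> Dict[str, List[str]]:
    """Split glycans into binders and non-binders using a relative-intensity cutoff."""
    if not intensities:
        return {"binders": [], "non_binders": []}
    # Assumed criterion: a glycan is a binder if its signal reaches a chosen
    # fraction of the strongest signal on the array.
    threshold = fraction_of_max * max(intensities.values())
    binders = sorted(g for g, v in intensities.items() if v >= threshold)
    non_binders = sorted(g for g, v in intensities.items() if v < threshold)
    return {"binders": binders, "non_binders": non_binders}


if __name__ == "__main__":
    # Hypothetical fluorescence intensities for three array positions.
    sample = {"glycan_1": 12000.0, "glycan_2": 300.0, "glycan_3": 2500.0}
    print(classify_binders(sample))
```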
GlycoSuiteDB is a relational database that curates information from the scientific literature on glycoprotein-derived glycan structures, their biological sources, the references in which the glycan was described and the methods used to determine the glycan structure. To date, the database includes most published O-linked oligosaccharides from the last 50 years and most N-linked oligosaccharides that were published in the 1990s. For each structure, information is available concerning the glycan type, linkage and anomeric configuration, mass and composition. Detailed information is also provided on native and recombinant sources, including tissue and/or cell type, cell line, strain and disease state. Where known, the proteins to which the glycan structures are attached are reported, and cross-references to the SWISS-PROT/TrEMBL protein sequence databases are given if applicable. The GlycoSuiteDB annotations include literature references which are linked to PubMed, and detailed information on the methods used to determine each glycan structure is noted to help the user assess the quality of the structural assignment. GlycoSuiteDB has a user-friendly web interface which allows the researcher to query the database using mono-isotopic or average mass, monosaccharide composition, glycosylation linkages (e.g. N- or O-linked), reducing terminal sugar, attached protein, taxonomy, tissue or cell type and GlycoSuiteDB accession number. Advanced queries using combinations of these parameters are also possible. GlycoSuiteDB can be accessed on the web at http://www.glycosuite.com.
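Queries by monoisotopic mass and by monosaccharide composition are closely related, because the mass of a free glycan follows from its composition: sum the residue masses and add one water. The sketch below illustrates that arithmetic and a tolerance search over a toy table. It is not GlycoSuiteDB code; the entry labels, the tolerance, and the function names are hypothetical, while the residue masses are standard monoisotopic values.

```python
from typing import Dict, List

# Monoisotopic residue masses (Da): the free sugar minus one water,
# which is lost when a glycosidic bond forms.
RESIDUE_MASS: Dict[str, float] = {
    "Hex": 162.05282,
    "HexNAc": 203.07937,
    "dHex": 146.05791,
    "NeuAc": 291.09542,
}
WATER = 18.01056

# Hypothetical entries keyed by placeholder labels, not real accession numbers.
TOY_DATABASE: Dict[str, Dict[str, int]] = {
    "entry_1": {"Hex": 3, "HexNAc": 2},
    "entry_2": {"Hex": 3, "HexNAc": 2, "dHex": 1},
    "entry_3": {"Hex": 5, "HexNAc": 2},
}


def monoisotopic_mass(composition: Dict[str, int]) -> float:
    """Mass of the free, underivatized glycan with the given composition."""
    return sum(RESIDUE_MASS[res] * n for res, n in composition.items()) + WATER


def search_by_mass(query: float, tolerance: float = 0.05) -> List[str]:
    """Return labels of entries whose mass lies within tolerance (Da) of the query."""
    return sorted(
        label
        for label, comp in TOY_DATABASE.items()
        if abs(monoisotopic_mass(comp) - query) <= tolerance
    )


if __name__ == "__main__":
    print(round(monoisotopic_mass({"Hex": 3, "HexNAc": 2}), 4))
    print(search_by_mass(1056.39))
```

Derivatized or reduced glycans and charged adducts shift these masses, so a real query would need the corresponding offsets.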
GlycoSuiteDB is an annotated and curated relational database of glycan structures reported in the literature. It contains information on the glycan type, core type, linkages and anomeric configurations, mass, composition and the analytical methods used by the researchers to determine the glycan structure. Native and recombinant sources are detailed, including species, tissue and/or cell type, cell line, strain, life stage, disease, and if known the protein to which the glycan structures are attached. There are links to SWISS-PROT/TrEMBL and PubMed where applicable. Recent developments include the implementation of searching by 2D structure and substructure, disease and reference. The database is updated twice a year, and now contains over 7650 entries. Access to GlycoSuiteDB is available at http://www.glycosuite.com.
no abstract.
The Complex Carbohydrate Structure Database (CCSD) and CarbBank, an IBM PC/AT (or compatible) database management system, were created to provide an information system to meet the needs of people interested in carbohydrate science. The CCSD, which presently contains more than 2000 citations, is expected to double in size in the next two years and to include, soon thereafter, all of the published structures of carbohydrates larger than disaccharides.
The carbohydrate fraction of most mammalian milks contains a variety of oligosaccharides that encompass a range of structures and monosaccharide compositions. Human milk oligosaccharides have received considerable attention due to their biological roles in neonatal gut microbiota, immunomodulation, and brain development. However, a major challenge in understanding the biology of milk oligosaccharides across other mammals is that reports span more than 5 decades of publications with varying data reporting methods. In the present study, publications on milk oligosaccharide profiles were identified and harmonized into a standardized format to create a comprehensive, machine-readable database of milk oligosaccharides across mammalian species. The resulting database, MilkOligoDB, includes 3193 entries for 783 unique oligosaccharide structures from the milk of 77 different species harvested from 113 publications. Cross-species and cross-publication comparisons of milk oligosaccharide profiles reveal common structural motifs within mammalian orders. Of the species studied, only chimpanzees, bonobos, and Asian elephants share the specific combination of fucosylation, sialylation, and core structures that are characteristic of human milk oligosaccharides. However, agriculturally important species do produce diverse oligosaccharides that may be valuable for human supplementation. Overall, MilkOligoDB facilitates cross-species and cross-publication comparisons of milk oligosaccharide profiles and the generation of new data-driven hypotheses for future research.
Carbohydrate Structure Database (CSDB) is a regularly updated database containing structural, taxonomic, bibliographic, NMR spectroscopic, and other information on carbohydrates and their derivatives obtained from prokaryotes, plants, and fungi. CSDB claims full coverage and high data quality. It serves as a platform for various search strategies, tools for NMR spectrum and structure prediction, and instruments for statistical analysis. CSDB is freely available via the Internet at http://csdb.glycoscience.ru.
The main goals of glycoscience are elucidation of carbohydrate features responsible for cellular processes, pathogenicity of microorganisms, and immunological properties of higher organisms, as well as application of glycans as diagnostic and therapeutic agents and classification of natural glycans and glycoconjugates. These goals are hardly achievable without freely available, regularly updated, and cross-linked databases, which provide data accumulated in glycoscience and allow tracking of their quality. The Carbohydrate Structure Database is a curated data repository developed for provision of structural, bibliographic, taxonomic, NMR spectroscopic, and other related information on published carbohydrates and derivatives. Currently it covers ca. 90% of published primary structures of bacterial and archaeal origin and ca. 30% of published primary structures of plant and fungal origin. The data in the bacterial part of CSDB are regularly updated. The expansion of plant and fungal coverage is expected in the future. The project aims at coverage close to complete in selected taxonomic domains and at high data quality achieved by manual literature analysis, annotation, verification, and data approval. CSDB is freely available on the Internet as a web service at http://csdb.glycoscience.ru. This chapter presents a step-by-step guide to using CSDB for solving everyday tasks typical for carbohydrate research.
Systematization and classification of carbohydrates contribute greatly to development of modern biomedical sciences. CCSD (CarbBank) data constitute the significant part of nearly all existing carbohydrate databases. However, these data have not been verified from their original deposit. During the expansion of Bacterial Carbohydrate Structure Database (BCSDB) project, we checked CCSD data quality and found that about 35% of records contained errors. The CCSD data cannot be used without manual verification, while CCSD errors migrate from database to database.
The Bacterial Carbohydrate Structure Database (BCSDB), which has been maintained since 2005, was expanded to cover glycans from plants and fungi. The current coverage on plant and fungal glycans includes several thousands of the CarbBank records, as well as data published before 1996 but not deposited in CarbBank. Prior to deposition, the data were verified against the original publications and supplemented with additional information, such as NMR spectra. Both the Bacterial and Plant and Fungal Carbohydrate Structure Databases are freely available at http://csdb.glycoscience.ru.
Conformational energy maps of the glycosidic linkages are a valuable resource to gain information about preferred conformations and flexibility of carbohydrates. Here we present GlycoMapsDB, a new database containing more than 2500 calculated conformational maps for a variety of di- to pentasaccharide fragments contained in N- and O-glycans. Oligosaccharides representing branchpoints of N-glycans are included in the set of fragments, thus the influence of neighbouring residues is reflected in the conformational maps. During refinement of new crystal structures, maps contained in GlycoMapsDB can serve as a valuable resource to check whether the torsion values of a glycosidic linkage are located in an 'allowed' region similar to the Ramachandran plot analysis for proteins. This might help to improve the structural quality of the glycan data contained in the Protein Data Bank (PDB). A link between GlycoMapsDB and the PDB has been established so that the glycosidic torsions of all glycans contained in the PDB can be retrieved and compared to calculated data. The service is available at www.glycosciences.de/modeling/glycomapsdb/.
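The Ramachandran-like check described above amounts to looking up the energy at a linkage's (phi, psi) torsions on a gridded map and comparing it with the map minimum. The sketch below shows that lookup on a synthetic map. It assumes a 10-degree grid, an energy surface invented purely for illustration, and an arbitrary cutoff; none of these come from GlycoMapsDB, and the function names are hypothetical.

```python
import math
from typing import Dict, Tuple

Grid = Dict[Tuple[int, int], float]
STEP = 10  # assumed grid spacing in degrees


def toy_energy_map() -> Grid:
    """Synthetic map with a single minimum at (-70, 120); not a GlycoMapsDB map."""
    grid: Grid = {}
    for phi in range(-180, 180, STEP):
        for psi in range(-180, 180, STEP):
            grid[(phi, psi)] = 5.0 * (
                2.0
                - math.cos(math.radians(phi + 70))
                - math.cos(math.radians(psi - 120))
            )
    return grid


def snap(angle: float) -> int:
    """Wrap an angle into [-180, 180) and round it to the nearest grid point."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    snapped = int(round(wrapped / STEP)) * STEP
    return -180 if snapped == 180 else snapped


def is_allowed(grid: Grid, phi: float, psi: float, cutoff: float = 3.0) -> bool:
    """True if the energy at (phi, psi) is within cutoff of the map minimum."""
    e_min = min(grid.values())
    return grid[(snap(phi), snap(psi))] - e_min <= cutoff


if __name__ == "__main__":
    energy_map = toy_energy_map()
    print(is_allowed(energy_map, -68.0, 123.0))  # near the synthetic minimum
    print(is_allowed(energy_map, 60.0, -60.0))   # far from it
```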
Glycans serve important roles in signaling events and cell-cell communication, and they are recognized by lectins, viruses and bacteria, playing a variety of roles in many biological processes. However, there was no system to organize the plethora of glycan-related data in the literature. Thus GlyTouCan (https://glytoucan.org) was developed as the international glycan repository, allowing researchers to assign accession numbers to glycans. This also aided in the integration of glycan data across various databases. GlyTouCan assigns accession numbers to glycans which are defined as sets of monosaccharides, which may or may not be characterized with linkage information. GlyTouCan was developed to be able to recognize any level of ambiguity in glycans and uniquely assign accession numbers to each of them, regardless of the input text format. In this manuscript, we describe the latest update to GlyTouCan in version 3.0, its usage, and plans for future development.
Bioinformatics approaches to carbohydrate research have recently begun using large amounts of protein and carbohydrate data. In this field called glycome informatics, the foremost necessity is a comprehensive resource for genome-scale bioinformatics analysis of glycan data. Although the accumulation of experimental data may be useful as a reference of biological and biochemical information on carbohydrates, this is insufficient for bioinformatics analysis. Thus, we have developed a glycome informatics resource (http://www.genome.jp/kegg/glycan/) in KEGG (Kyoto Encyclopedia of Genes and Genomes), an integrated knowledge base of protein networks, genomic information, and chemical information. This review describes three noteworthy features: (1) GLYCAN, a database of carbohydrate structures; (2) glycan-related pathways; and (3) Composite Structure Map (CSM), a map illustrating all possible variations of carbohydrate structures within organisms. GLYCAN includes two useful tools: an intuitive drawing tool called KegDraw, and an efficient glycan search and alignment tool called KEGG Carbohydrate Matcher (KCaM). KEGG's glycan biosynthesis and metabolism pathways, integrating carbohydrate structures, proteins, and reactions, are also a pivotal resource. CSM is constructed as a bridge between carbohydrate functions and structures. CSM is able to display, for example, expression data of glycosyltransferases in a compact manner. In all the KEGG resources, various objects including KEGG pathways, chemical compounds, as well as carbohydrate structures are commonly represented as graphs, which are widely studied and utilized in the computer science field.
BACKGROUND: Glycans are involved in a wide range of biological processes, and they play an essential role in functions such as cell differentiation, cell adhesion, pathogen-host recognition, toxin-receptor interactions, signal transduction, cancer metastasis, and immune responses. Elucidating pathways related to post-translational modifications (PTMs) such as glycosylation is of growing importance in post-genome science and technology. Graphical networks describing the relationships among glycan-related molecules, including genes, proteins, lipids and various biological events, are considered extremely valuable and convenient tools for the systematic investigation of PTMs. However, there is no database which dynamically draws functional networks related to glycans. DESCRIPTION: We have created a database called Glyco-Net (http://www.glycoconjugate.jp/functions/), with many binary relationships among glycan-related molecules. Using search results, we can dynamically draw figures of the functional relationships among these components with nodes and arrows. A certain molecule or event corresponds to a node in the network figures, and the relationship between the molecule and the event is indicated by arrows. Since all components are treated equally, an arrow is also a node. CONCLUSIONS: In this paper, we describe our new database, Glyco-Net, which is the first database to dynamically show networks of the functional profiles of glycan-related molecules. The graphical networks will assist in the understanding of the role of the PTMs. In addition, since various kinds of bio-objects such as genes, proteins, and inhibitors are equally treated in Glyco-Net, we can obtain a large amount of information on the PTMs.
Glycosylation is one of the most important post-translational modifications of proteins, known to be involved in pathogen recognition, innate immune response and protection of epithelial membranes. However, when compared to the tools and databases available for the processing of high-throughput proteomic data, the glycomic domain is severely lacking. While tools to assist the analysis of mass spectrometry (MS) and HPLC are continuously improving, there are few resources available to support liquid chromatography (LC)-MS/MS techniques for glycan structure profiling. Here, we present a platform for presenting oligosaccharide structures and fragment data characterized by LC-MS/MS strategies. The database is annotated with high-quality datasets and is designed to extend and reinforce those standards and ontologies developed by existing glycomics databases.
BACKGROUND: There are considerable differences between bacterial and mammalian glycans. In contrast to most eukaryotic carbohydrates, bacterial glycans are often composed of repeating units with diverse functions ranging from structural reinforcement to adhesion, colonization and camouflage. Since bacterial glycans are typically displayed at the cell surface, they can interact with the environment and, therefore, have significant biomedical importance. RESULTS: The sequence characteristics of glycans (monosaccharide composition, modifications, and linkage patterns) for the higher bacterial taxonomic classes have been examined and compared with the data for mammals, with both similarities and unique features becoming evident. Compared to mammalian glycans, the bacterial glycans deposited in the current databases have a more than ten-fold greater diversity at the monosaccharide level, and the disaccharide pattern space is approximately nine times larger. Specific bacterial subclasses exhibit characteristic glycans which can be distinguished on the basis of distinctive structural features or sequence properties. CONCLUSION: For the first time a systematic database analysis of the bacterial glycome has been performed. This study summarizes the current knowledge of bacterial glycan architecture and diversity and reveals putative targets for the rational design and development of therapeutic intervention strategies by comparing bacterial and mammalian glycans.
Protein glycosylation serves critical roles in the cellular and biological processes of many organisms. Aberrant glycosylation has been associated with many illnesses such as hereditary and chronic diseases like cancer, cardiovascular diseases, neurological disorders, and immunological disorders. Emerging mass spectrometry (MS) technologies that enable the high-throughput identification of glycoproteins and glycans have accelerated the analysis and made possible the creation of dynamic and expanding databases. Although glycosylation-related databases have been established by many laboratories and institutions, they are not yet widely known in the community. Our study reviews 15 different publicly available databases and identifies their key elements so that users can identify the most applicable platform for their analytical needs. These databases include biological information on the experimentally identified glycans and glycopeptides from various cells and organisms such as human, rat, mouse, fly and zebrafish. The features of these databases (7 for glycoproteomic data, 6 for glycomic data, and 2 for glycan-binding proteins) are summarized, including the enrichment techniques that are used for glycoproteome and glycan identification. Furthermore, databases such as Unipep, GlycoFly, and GlycoFish, recently established by our group, are introduced. The unique features of each database, such as the analytical methods used and bioinformatical tools available, are summarized. This information will be a valuable resource for the glycobiology community as it presents the analytical methods and glycosylation-related databases together in one compendium. It will also represent a step towards the desired long-term goal of integrating the different databases of glycosylation in order to characterize and categorize glycoproteins and glycans better for biomedical research.
Carbohydrate chains occupy truly significant positions in various fields of life sciences and biotechnology. A large number of polyclonal or monoclonal antibodies have been used as very important tools for analyzing expression of carbohydrate chains and their functions. "GlycoEpitope" is an integrated database that consists of useful information on carbohydrate antigens and their antibodies. It has been developed with the cooperation of top class researchers in the field of glycobiology and maintained by the Research Center for Glycobiotechnology in Ritsumeikan University. The GlycoEpitope Database provides a fund of information, e.g. lists of 1) glycoproteins that express carbohydrate antigens (epitopes), 2) glycolipids of which the partial structure is a carbohydrate epitope, 3) enzymes that take part in synthesis and degradation of glycoepitopes, 4) time and site of expression of carbohydrate epitopes, 5) diseases to which carbohydrate epitopes relate, and 6) suppliers where carbohydrate recognition antibodies can be obtained. This database is not limited to glycobiologists, but open to a wide range of life science researchers. Its search criteria are made flexible, so the database is very user-friendly. Here, we would like to introduce a general outline of the database and how it works.
Recently, the involvement of carbohydrate chains in life sciences has been extended to diverse functions such as cell-to-cell recognition, communication in neuronal tissues and immune systems, pathogen recognition, sperm-egg recognition, fertilization, regulation of hormonal half-lives in the blood, direction of embryonic development and differentiation, and direction of the distribution of various cells and proteins throughout the body. A large number of polyclonal and monoclonal antibodies have been used as very important tools for analyzing the expression of various carbohydrate chains and their functions. However, a large amount of important information on carbohydrate-recognizing antibodies is spread throughout a wide range of literature. In this database, useful information on such carbohydrate antigens, i.e., glyco-epitopes and antibodies, has been assembled as a compact encyclopedia (Kawasaki et al. 2006). It has been developed with the cooperation of foremost researchers in the field of glycobiology and is maintained by the Ritsumeikan University Research Center for Glycobiotechnology. GlycoEpitope provides a wealth of information including lists of glycoproteins that express carbohydrate antigens, glycolipids of which part of the structure is a carbohydrate antigen, enzymes that take part in the synthesis and degradation of epitopes, the times and sites of expression of carbohydrate antigens, diseases to which carbohydrate epitopes are related, and suppliers from which carbohydrate-recognizing antibodies can be obtained. This database is useful not only for glycobiologists but also for a wide range of life science researchers. The search criteria are very flexible, so users can easily find the information they need. Here, we would like to give a general outline of the database and how to use it.
SWISS-MODEL Repository (http://swissmodel.expasy.org/repository/) is a database of 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated 3D protein models generated by automated homology modelling for all sequences in Swiss-Prot and for relevant model organisms. Regular updates ensure that target coverage is complete, that models are built using the most recent sequence and template structure databases, and that improvements in the underlying modelling pipeline are fully utilised. As of September 2008, the database contains 3.4 million entries for 2.7 million different protein sequences from the UniProt database. SWISS-MODEL Repository allows users to assess the quality of the models in the database, search for alternative template structures, and build models interactively via SWISS-MODEL Workspace (http://swissmodel.expasy.org/workspace/). Annotation of models with functional information and cross-linking with other databases such as the Protein Model Portal (http://www.proteinmodelportal.org) of the PSI Structural Genomics Knowledge Base facilitates the navigation between protein sequence and structure resources.
SUMMARY: The open-access, comprehensive GlycoCD database application is for representation and retrieval of carbohydrate-related clusters of differentiation (CDs). The main objective of this database platform is to provide information about interactions of carbohydrate moieties with proteins that are important for identification of specific cell surface molecules, with a focus on the integration of data from carbohydrate microarray databases. The GlycoCD database comprises two sections: the carbohydrate recognition CD and glycan CD. It allows easy access through a user-friendly web interface to all carbohydrate-defined CDs and those that interact with carbohydrates, along with other relevant information. AVAILABILITY: The database is freely available at http://glycosciences.de/glycocd/index.php. CONTACT: r.s-albiez@dkfz.de.
A very high rate of multidrug resistance (MDR) seen among Gram-negative bacteria such as Escherichia, Klebsiella, Salmonella, Shigella, etc. is a major threat to public health and safety. One of the major virulence determinants of Gram-negative bacteria is the capsular polysaccharide, or K antigen, located on the bacterial outer membrane surface, which is a potential drug and vaccine target. It plays a key role in host-pathogen interactions as well as host immune evasion and thus mandates detailed structural information. Nonetheless, acquiring structural information on K antigens is not straightforward due to their innate enormous conformational flexibility. Here, we have developed a manually curated database of K antigens corresponding to various E. coli serotypes, which differ from each other in their monosaccharide composition, linkage between the monosaccharides and their stereoisomeric forms. Subsequently, we have modeled their 3D structures and developed an organized repository, namely EK3D, that can be accessed through www.iith.ac.in/EK3D/. Such a database would facilitate the development of antibacterial drugs to combat E. coli infections, as it has evolved resistance against 2 major drugs, namely third-generation cephalosporins and fluoroquinolones. EK3D also enables the generation of polymeric K antigens of varying lengths and thus provides comprehensive information about E. coli K antigens.
Glycosylation plays critical roles in various biological processes and is closely related to diseases. Deciphering the glycocode in diverse cells and tissues offers opportunities to develop new disease biomarkers and more effective recombinant therapeutics. In the past few decades, with the development of glycobiology, glycomics, and glycoproteomics technologies, a large amount of glycoscience data has been generated. Subsequently, a number of glycobiology databases covering glycan structure, the glycosylation sites, the protein scaffolds, and related glycogenes have been developed to store, analyze, and integrate these data. However, these databases and tools are not well known or widely used by the public, including clinicians and other researchers who are not in the field of glycobiology, but are interested in glycoproteins. In this study, the representative databases of glycan structure, glycoprotein, glycan-protein interactions, glycogenes, and the newly developed bioinformatic tools and integrated portal for glycoproteomics are reviewed. We hope this overview could assist readers in searching for information on glycoproteins of interest, and promote further clinical application of glycobiology.
The access to biodatabases for glycomics and glycoproteomics has proven to be essential for current glycobiological research. This chapter presents available databases that are devoted to different aspects of glycobioinformatics. This includes oligosaccharide sequence databases, experimental databases, 3D structure databases (of both glycans and glycorelated proteins) and association of glycans with tissue, disease, and proteins. Specific search protocols are also provided using tools associated with experimental databases for converting primary glycoanalytical data to glycan structural information. In particular, researchers using glycoanalysis methods by U/HPLC (GlycoBase), MS (GlycoWorkbench, UniCarb-DB, GlycoDigest), and NMR (CASPER) will benefit from this chapter. In addition, we also include information on how to utilize glycan structural information to query databases that associate glycans with proteins (UniCarbKB) and with interactions with pathogens (SugarBind).
Complex carbohydrates are known as mediators of complex cellular events. Concerning their structural diversity, their potential of information content is several orders of magnitude higher in a short sequence than any other biological macromolecule. SWEET-DB (http://www.dkfz.de/spec2/sweetdb/) is an attempt to use modern web techniques to annotate and/or cross-reference carbohydrate-related data collections which allow glycoscientists to find important data for compounds of interest in a compact and well-structured representation. Currently, reference data taken from three data sources can be retrieved for a given carbohydrate (sub)structure. The sources are CarbBank structures and literature references (linked to NCBI PubMed service), NMR data taken from SugaBase and 3D co-ordinates generated with SWEET-II. The main purpose of SWEET-DB is to enable an easy access to all data stored for one carbohydrate structure entering a complete sequence or parts thereof. Access to SWEET-DB contents is provided with the help of separate input spreadsheets for (sub)structures, bibliographic data, general structural data like molecular weight, NMR spectra and biological data. A detailed online tutorial is available at http://www.dkfz.de/spec2/sweetdb/nar/.
The development of glycan-related databases and bioinformatics applications is considerably lagging behind compared with the wealth of available data and software tools in genomics and proteomics. Because the encoding of glycan structures is more complex, most of the bioinformatics approaches cannot be applied to glycan structures. No standard procedures exist where glycan structures found in various species, organs, tissues or cells can be routinely deposited. In this article the concepts of the GLYCOSCIENCES.de portal are described. It is demonstrated how an efficient structure-based cross-linking of various glycan-related data originating from different resources can be accomplished using a single user interface. The structure-oriented retrieval options (exact structure, substructure, motif, composition and sugar components) are discussed. The types of available data (references, composition, spatial structures, nuclear magnetic resonance (NMR) shifts (experimental and estimated), theoretically calculated fragments and Protein Data Bank (PDB) entries) are exemplified for Man(3). The free availability and unrestricted use of glycan-related data is an absolute prerequisite to efficiently share distributed resources. Additionally, there is an urgent need to agree to a generally accepted exchange format as well as to a common software interface. An open access repository for glyco-related experimental data will ensure that the loss of primary data is considerably reduced.
no abstract.
The biological significance of glycans has been widely studied and reported in the past. However, most achievements of our predecessors are not readily available in existing databases. JCGGDB is a meta-database involving 15 original databases in AIST and 5 cooperative databases in alliance with JCGG: Japan Consortium for Glycobiology and Glycotechnology. It centers on a glycan structure database and accumulates information such as glycan preferences of lectins, glycosylation sites in proteins, and genes related to glycan syntheses from glycoscience and related fields. This chapter illustrates how to use three major search interfaces (Keyword Search, Structure Search, and GlycoChem Explorer) available in JCGGDB to search across multiple databases.
Glycobiology has been brought to public attention as a frontier in the post-genomic era. Structural information about glycans has been accumulating in the Protein Data Bank (PDB) for years. It has been recognized, however, that there are many questionable glycan models in the PDB. A tool for verifying the primary structures of glycan 3D structures is evidently required, yet there have been no such publicly available tools. The Glycoconjugate Data Bank:Structures (GDB:Structures, http://www.glycostructures.jp) is an annotated glycan structure database, which also provides an N-glycan primary structure (or glycoform) verification service. All the glycan 3D structures are detected and annotated by an in-house program named 'getCARBO'. When an N-glycan is detected in a query coordinate by getCARBO, the primary structure of the glycan is compared with the most similar entry in the glycan primary structure database (KEGG GLYCAN), and unmatched substructure(s) are indicated if observed. The results of getCARBO are stored and presented in GDB:Structures.
SUMMARY: In recent years, the improvement of mass spectrometry-based glycomics techniques (i.e. highly sensitive, quantitative and high-throughput analytical tools) has enabled us to obtain large datasets of glycans. Here we present a database named Xeno-glycomics database (XDB) that contains cell- or tissue-specific pig glycomes analyzed with mass spectrometry-based techniques, including comprehensive pig glycan information on chemical structures, mass values, types and relative quantities. It was designed with a user-friendly web-based interface that allows users to query the database according to pig tissue/cell types or glycan masses. This database will contribute qualitative and quantitative information on glycomes characterized from various pig cells/organs in xenotransplantation and might eventually provide new targets in the era of alpha1,3-galactosyltransferase gene-knockout pigs. AVAILABILITY: The database can be accessed on the web at http://bioinformatics.snu.ac.kr/xdb.
The present work describes in detail a family of databases covering the three-dimensional features of monosaccharides, disaccharides, oligosaccharides, polysaccharides, glycosyltransferases, lectins, monoclonal antibodies against carbohydrates, and glycosaminoglycan-binding proteins. These databases have been developed with non-proprietary software, and they are freely open to the scientific community. They are accessible through the common portal called "Glyco3D" (http://www.glyco3d.cermav.cnrs.fr). The databases are accompanied by a user-friendly graphical user interface (GUI) which offers several search options. All three-dimensional structures are available for visual consultation (with basic measurement options) and can be downloaded in commonly used formats for further use.
Glyco3D is a portal for structural glycobiology of several interlinked databases that is covering the three-dimensional features of monosaccharides, disaccharides, oligosaccharides, polysaccharides, glycosyltransferases, lectins, monoclonal antibodies, and glycosaminoglycan-binding proteins. Collection of annotated NMR data of bioactive oligosaccharides is also available. A common nomenclature has been adopted for the structural encoding of the carbohydrates. Each individual database stands by itself as it covers a particular family of either complex carbohydrates or carbohydrate-binding proteins. A unique search engine is available that scans the full content of all the databases for queries related to sequential information of the carbohydrates. The interconnection of these databases provides a unique opportunity to characterize the three-dimensional features that a given oligosaccharide molecule can take in different environments, i.e., vacuum, crystalline state, or interacting with different proteins having different biological function. The databases, which have been manually curated, were developed with nonproprietary software. They are web-based platform and are freely available to the scientific community at http://glyco3d.cermav.cnrs.fr.
Despite ongoing harmonization efforts, the major carbohydrate sequence databases following the first initiative in this field, CarbBank, are still isolated islands, with mechanisms for automatic structure exchange and comparison largely missing. This unfavorable situation has been overcome with a systematic data integration effort, resulting in the GlycomeDB, a meta-database for public carbohydrate sequences. It contains at present 35,056 unique structures in GlycoCT encoding, referencing more than 100,000 external records from 1845 different taxonomic sources. We have created a user-friendly, web-based graphical interface which allows taxonomic and structural data to be entered and searched for. The structural search possibilities include substructure search, similarity search, and maximum common substructure. A novel search refinement mechanism allows the assembly of complex queries. With GlycomeDB (www.glycome-db.org), it is now possible to use a single portal to access all digitally encoded, public structural data in glycomics and to perform complex queries with the help of a web-based user interface.
The availability of databases and tools to store, retrieve, and analyze data in an efficient way is of fundamental importance to progress in glycomics. This chapter describes major, well-established databases and introduces two new database initiatives. Each database uses a particular sequence format to encode carbohydrate structures. Therefore, there is hardly any cross-linking between the established databases. There are nine major database projects dedicated to the storage of carbohydrate structures: seven of these follow an open access policy, while two more are commercial and thus follow a more restricted access model. The open access databases have different capabilities to perform queries. Most of the projects offer structural searches to some extent, while the additional query functions are highly diverse, indicating a specialization of the individual databases. Future database initiatives are undertaken owing to the lack of a comprehensive database for storage and retrieval of carbohydrate structures in glycomics and glycobiology research. EUROCarbDB project aims to support the analytical process of carbohydrate structure determination from the spectrometer to database storage. Its fundamental ethics involve freely accessible data and open source tools. GlycomeDB is another initiative involving the translation of freely available databases to the GlycoCT sequence format. This format is stored in a new database to overcome the isolation of the carbohydrate structure databases and to create a comprehensive index of all available structures with references back to the original databases. Two case studies demonstrating how existing database resources can be used to answer specific scientific questions are analyzed.
GlycomeDB integrates the structural and taxonomic data of all major public carbohydrate databases, as well as carbohydrates contained in the Protein Data Bank, which renders the database currently the most comprehensive and unified resource for carbohydrate structures worldwide. GlycomeDB retains the links to the original databases and is updated at weekly intervals with the newest structures available from the source databases. The complete database can be downloaded freely or accessed through a Web-interface (www.glycome-db.org) that provides flexible and powerful search functionalities.
BACKGROUND: Although carbohydrates are the third major class of biological macromolecules, after proteins and DNA, there is neither a comprehensive database for carbohydrate structures nor an established universal structure encoding scheme for computational purposes. Funding for further development of the Complex Carbohydrate Structure Database (CCSD or CarbBank) ceased in 1997, and since then several initiatives have developed independent databases with partially overlapping foci. For each database, different encoding schemes for residues and sequence topology were designed. Therefore, it is virtually impossible to obtain an overview of all deposited structures or to compare the contents of the various databases. RESULTS: We have implemented procedures which download the structures contained in the seven major databases, e.g. GLYCOSCIENCES.de, the Consortium for Functional Glycomics (CFG), the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Bacterial Carbohydrate Structure Database (BCSDB). We have created a new database called GlycomeDB, containing all structures, their taxonomic annotations and references (IDs) for the original databases. More than 100000 datasets were imported, resulting in more than 33000 unique sequences now encoded in GlycomeDB using the universal format GlycoCT. Inconsistencies were found in all public databases, which were discussed and corrected in multiple feedback rounds with the responsible curators. CONCLUSION: GlycomeDB is a new, publicly available database for carbohydrate sequences with a unified, all-encompassing structure encoding format and NCBI taxonomic referencing. The database is updated weekly and can be downloaded free of charge. The JAVA application GlycoUpdateDB is also available for establishing and updating a local installation of GlycomeDB. With the advent of GlycomeDB, the distributed islands of knowledge in glycomics are now bridged to form a single resource.
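The integration step described in this abstract, in which records imported from several source databases collapse into unique sequences that keep references back to their origins, can be illustrated with a minimal sketch. The record fields, the sample IDs, and the `canonical_key` function are hypothetical. In particular, whitespace and case normalization is only a stand-in for real translation into GlycoCT, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical imported records: (source database, source ID, sequence string).
records = [
    ("KEGG", "G-1", "Gal(b1-4)GlcNAc"),
    ("CFG", "c-7", "gal(b1-4)glcnac "),
    ("BCSDB", "b-3", "Man(a1-3)Man"),
]

def canonical_key(sequence: str) -> str:
    """Stand-in for conversion to a unified encoding (NOT real GlycoCT)."""
    return "".join(sequence.split()).lower()

def integrate(rows):
    """Group source records by canonical key, keeping back-references."""
    unique = defaultdict(list)
    for source, source_id, sequence in rows:
        unique[canonical_key(sequence)].append((source, source_id))
    return dict(unique)

index = integrate(records)
print(len(records), "records ->", len(index), "unique sequences")
```

Keeping the list of `(source, source_id)` pairs per key mirrors the abstract's point that the unified resource retains references to the original databases rather than replacing them.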
Bioinformatics for glycobiology is still considered to be in its infancy. Nevertheless, there are various applications and databases available for glycoscientists by now. This article summarizes the problems that glycoinformatics is facing and gives an overview of the existing resources, including web portals, databases and tools. Software for structure input and display, for processing of analytical data, for prediction and analysis of glycosylation sites, and applications related to carbohydrate 3D structures are described. Special emphasis is put on GlycomeDB, a project that aims to integrate all freely available carbohydrate structure data already stored in databases, and the taxonomic annotation of these structures, into one resource. By this means it allows researchers to locate data in many databases without having to learn the different query types and carbohydrate notations used in the individual resources. Keywords: Bioinformatics; Carbohydrate database; GlycomeDB; Glycan; Glycosylation sites; Automatic annotation; 3D structure; Analytical software; Carbohydrate software tools.
Escherichia coli O-antigen database (ECODAB) is a web-based application to support the collection of E. coli O-antigen structures, polymerase and flippase amino acid sequences, NMR chemical shift data of O-antigens as well as information on glycosyltransferases (GTs) involved in the assembly of O-antigen polysaccharides. The database content has been compiled from scientific literature. Furthermore, the system has evolved from being a repository to one that can be used for generating novel data on its own. GT specificity is suggested through sequence comparison with GTs whose function is known. The migration of ECODAB to a relational database has allowed the automation of all processes to update, retrieve and present information, thereby, endowing the system with greater flexibility and improved overall performance. ECODAB is freely available at http://www.casper.organ.su.se/ECODAB/. Currently, data on 169 E. coli unique O-antigen entries and 338 GTs is covered. Moreover, the scope of the database has been extended so that polysaccharide structure and related information from other bacteria subsequently can be added, for example, from Streptococcus pneumoniae.
BACKGROUND: Polysaccharides are ubiquitously present in the living world. Their structural versatility makes them important and interesting components in numerous biological and technological processes ranging from structural stabilization to a variety of immunologically important molecular recognition events. The knowledge of polysaccharide three-dimensional (3D) structure is important in studying carbohydrate-mediated host-pathogen interactions, interactions with other bio-macromolecules, drug design and vaccine development as well as material science applications or production of bio-ethanol. DESCRIPTION: PolySac3DB is an annotated database that contains the 3D structural information of 157 polysaccharide entries that have been collected from an extensive screening of scientific literature. They have been systematically organized using standard names in the field of carbohydrate research into 18 categories representing polysaccharide families. Structure-related information includes the saccharides making up the repeat unit(s) and their glycosidic linkages, the expanded 3D representation of the repeat unit, unit cell dimensions and space group, helix type, diffraction diagram(s) (when applicable), experimental and/or simulation methods used for structure description, link to the abstract of the publication, reference and the atomic coordinate files for visualization and download. The database is accompanied by a user-friendly graphical user interface (GUI). It features interactive displays of polysaccharide structures and customized search options for beginners and experts, respectively. The site also serves as an information portal for polysaccharide structure determination techniques. The web-interface also references external links where other carbohydrate-related resources are available. CONCLUSION: PolySac3DB is established to maintain information on the detailed 3D structures of polysaccharides. All the data and features are available via the web-interface utilizing the search engine and can be accessed at http://polysac3db.cermav.cnrs.fr.
Glycans are known as the third major class of biopolymers next to DNA and proteins and have many biological roles owing to their structural properties. The structure of glycans differs greatly from DNA and proteins in that they are branched structures of monosaccharides, as opposed to linear sequences of amino acids or nucleotides. Therefore, the assignment of glycan structure information has been a difficult problem. In order to solve this problem, an international team of glyco-scientists has collaborated to develop this repository, called GlyTouCan, to provide a centralized resource to deposit glycan structures and obtain unique accession numbers. GlyTouCan can accept glycan structures in any form, including ambiguous structures consisting of compositions and topologies. Users can register new glycan structures and additionally search for glycan structures that have been registered into this repository. All of these tools are freely available at https://glytoucan.org/. This will enable glycomics researchers to easily identify glycan structures by accession number. This chapter describes the procedures for the registration and search methods of glycan structures and provides an overview of the entry pages. Furthermore, troubleshooting tips and cautionary notes for using GlyTouCan are also included.
The glycosylation of proteins and lipids is known to be closely related to the mechanisms of various diseases such as influenza, cancer, and muscular dystrophy. Therefore, it has become clear that the analysis of post-translational modifications of proteins, including glycosylation, is important to accurately understand the functions of each protein molecule and the interactions among them. In order to conduct large-scale analyses more efficiently, it is essential to promote the accumulation, sharing, and reuse of experimental and analytical data in accordance with the FAIR (Findability, Accessibility, Interoperability, and Re-usability) data principles. However, a FAIR data repository for storing and sharing glycoconjugate information, including glycopeptides and glycoproteins, in a standardized format did not exist. Therefore, we have developed GlyComb (https://glycomb.glycosmos.org) as a new standardized data repository for glycoconjugate data. Currently, GlyComb can assign a unique identifier to a set of glycosylation information associated with a specific peptide sequence or UniProt ID. By standardizing glycoconjugate data via GlyComb identifiers and coordinating with existing web resources such as GlyTouCan and GlycoPOST, a comprehensive system for data submission and data sharing among researchers can be established. Here we introduce how GlyComb is able to integrate the variety of glycoconjugate data already registered in existing data repositories to obtain a better understanding of the available glycopeptides and glycoproteins, and their glycosylation patterns. We also explain how this system can serve as a foundation for a better understanding of glycan function.
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a bioinformatics resource for understanding the functions and utilities of cells and organisms from both high-level and genomic perspectives. It is a self-sufficient, integrated resource consisting of genomic, chemical, and network information, with cross-references to numerous outside databases. The genomic and chemical information is a complete set of building blocks (genes and molecules) and the network information includes molecular wiring diagrams (interaction/reaction networks) and hierarchical classifications (relation networks) to represent high-level functions. This unit describes protocols for using KEGG, focusing on molecular network information in KEGG PATHWAY, KEGG BRITE, and KEGG MODULE, perturbed molecular networks in KEGG DISEASE and KEGG DRUG, molecular building block information in KEGG GENES and KEGG LIGAND, and a mechanism for linking genomes to molecular networks in KEGG ORTHOLOGY (KO). All of these many protocols enable the user to take advantage of the full breadth of the functionality provided by KEGG.
Rapid and continued growth in the generation of glycomic data has revealed the need for enhanced development of basic infrastructure for presenting and interpreting these datasets in a manner that engages the broader biomedical research community. Early in their growth, the genomic and proteomic fields implemented mechanisms for assigning unique gene and protein identifiers that were essential for organizing data presentation and for enhancing bioinformatic approaches to extracting knowledge. Similar unique identifiers are currently absent from glycomic data. In order to facilitate continued growth and expanded accessibility of glycomic data, the authors strongly encourage the glycomics community to coordinate the submission of their glycan structures to the GlyTouCan Repository and to make use of GlyTouCan identifiers in their communications and publications. The authors also deeply encourage journals to recommend a submission workflow in which submitted publications utilize GlyTouCan identifiers as a standard reference for explicitly describing glycan structures cited in manuscripts.
Bacterial carbohydrate structure database (BCSDB) is an open-access project that collects primary publication data on carbohydrate structures originating from bacteria, their biological properties, bibliographic and taxonomic annotations, NMR spectra, etc. Almost complete coverage and outstanding data consistency are achieved. BCSDB version 3 and the principles lying behind it, including glycan description language, are reported.
no abstract.
Carbohydrate structures in the Carbohydrate Structure Database have been referenced to glycoepitopes from the Immune Epitope Database allowing users to explore the glycan structures and contained epitopes. Starting with an epitope, one can figure out the glycans from other organisms that share the same structural determinant, and retrieve the associated taxonomical, medical, and other data. This database mapping demonstrates the advantages of the integration of immunological and glycomic databases.
Carbohydrate Structure Database (CSDB) is a regularly updated database on structures, taxonomy, bibliography, NMR spectra, and other data published for prokaryotic, plant, and fungal carbohydrates and their derivatives, including those containing noncarbohydrate moieties. Key features of this project are high data quality and aiming at full coverage. CSDB has multiple services, such as NMR prediction, NMR-based structure ranking, and tools for statistical analysis. It is freely available at http://csdb.glycoscience.ru.
Natural carbohydrates play important roles in living systems and therefore are used as diagnostic and therapeutic targets. The main goal of glycomics is systematization of carbohydrates and elucidation of their role in human health and disease. The amount of information on natural carbohydrates accumulates rapidly, but scientists still lack databases and computer-assisted tools needed for orientation in the glycomic information space. Therefore, freely available, regularly updated, and cross-linked databases are in demand. Bacterial Carbohydrate Structure Database (Bacterial CSDB) was developed for provision of structural, bibliographic, taxonomic, NMR spectroscopic, and other related information on bacterial and archaeal carbohydrate structures. Its main features are (1) coverage above 90%, (2) high data consistency (above 90% of error-free records), and (3) presence of manually verified bibliographic, NMR spectroscopic, and taxonomic annotations. Recently, CSDB has been expanded to cover carbohydrates of plant and fungal origin. The achievement of full coverage in the plant and fungal domains is expected in the future. CSDB is freely available on the Internet as a web service at http://csdb.glycoscience.ru. This chapter aims at showing how to use CSDB in your daily scientific practice.
The Carbohydrate Structure Databases (CSDBs, http://csdb.glycoscience.ru) store structural, bibliographic, taxonomic, NMR spectroscopic, and other data on natural carbohydrates and their derivatives published in the scientific literature. The CSDB project was launched in 2005 for bacterial saccharides (as BCSDB). Currently, it includes two parts, the Bacterial CSDB and the Plant&Fungal CSDB. In March 2015, these databases were merged into a single CSDB. The combined CSDB includes information on bacterial and archaeal glycans and derivatives (the coverage is close to complete), as well as on plant and fungal glycans and glycoconjugates (almost all structures published up to 1998). CSDB is regularly updated via manual expert annotation of original publications. Both newly annotated data and data imported from other databases are manually curated. The CSDB data are exportable in a number of modern formats, such as GlycoRDF. CSDB provides additional services for simulation of ¹H, ¹³C and 2D NMR spectra of saccharides, NMR-based structure prediction, glycan-based taxon clustering and others.
The Carbohydrate Structure Database (CSDB, http://csdb.glycoscience.ru/ ) is a free curated repository storing various data on glycans of bacterial, fungal and plant origins. Currently, it maintains a close-to-full coverage on bacterial and fungal carbohydrates up to the year 2020. The CSDB web-interface provides free access to the database content and dedicated tools. Still, the number of these tools and the types of the corresponding analyses is limited, whereas the database itself contains data that can be used in a broader scope of analytical studies. In this paper, we present CSDB source data files and a self-contained SQL dump, and exemplify their possible application in glycan-related studies. By using CSDB in an SQL format, the user can gain access to the chain length distribution or charge distribution (as an example) in a given set of glycans defined according to specific structural, taxonomic, or other parameters, whereas the source text dump files can be imported to any dedicated database with a specific internal architecture differing from that of CSDB.
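The kind of query this abstract mentions, a chain length distribution over a taxonomically defined set of glycans, can be sketched against a small in-memory database. The `glycans` table, its columns, and the sample rows are hypothetical and chosen only for illustration; the actual CSDB dump has its own schema, which is not reproduced here, and SQLite is used simply because it ships with Python.

```python
import sqlite3

# Hypothetical schema and rows; the real CSDB dump has its own tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glycans ("
    "id INTEGER PRIMARY KEY, domain TEXT, chain_length INTEGER)"
)
conn.executemany(
    "INSERT INTO glycans (domain, chain_length) VALUES (?, ?)",
    [("bacteria", 3), ("bacteria", 5), ("bacteria", 3), ("fungi", 2)],
)

def chain_length_distribution(connection, domain):
    """Count glycans per chain length within one taxonomic domain."""
    rows = connection.execute(
        "SELECT chain_length, COUNT(*) FROM glycans "
        "WHERE domain = ? GROUP BY chain_length ORDER BY chain_length",
        (domain,),
    )
    return dict(rows.fetchall())

distribution = chain_length_distribution(conn, "bacteria")
print(distribution)
```

The same `GROUP BY` pattern would apply to a charge distribution by swapping the grouped column, assuming such a column exists in the imported data.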
no abstract.
Carbohydrates are one of the most chemically diverse classes of biomolecules. The amount of accumulated information on carbohydrates is far beyond the level allowing navigation in this data ocean without special tools, which are glycomic databases and prognostic services built on top of these data. Existing databases, focused on solving the particular challenges in glycoscience, are not fully compatible with each other in coverage, data formats, and features served to users. Major problems in the modern glyco-databases include data quality, gaps in coverage, and absence of a widely accepted carbohydrate notation. Most demanded are databases with broad coverage, which can provide a universal dataspace on structures, properties, and functions of carbohydrates, associated with taxonomy and other features of their natural sources. In the framework of the Carbohydrate Structure Database (CSDB) project, we created a database architecture aimed at development of the extensible glycoinformatic portal with continuous maintenance and regular content updates. This architecture was implemented in software free of drawbacks typical for glycomic databases. For the 15 years of existence, CSDB has become the main source of data on glycans of microorganisms, and a platform for multiple carbohydrate-related services. This project includes a global-scale database of natural carbohydrates; among its key features are free access, annual data deposition and updates, search and correction of errors (including those in publications), and regular announcement of new services.
The inherent flexibility and lack of strong intramolecular interactions of oligosaccharides demand the use of theoretical methods for their structural elucidation. In spite of the developments of theoretical methods, not much research on glycoinformatics has been done so far when compared to bioinformatics research on proteins and nucleic acids. We have developed a three-dimensional structural database for sialic acid-containing carbohydrates (3DSDSCAR). This is an open-access database that provides 3D structural models of a given sialic acid-containing carbohydrate. At present, 3DSDSCAR contains 60 conformational models, belonging to 14 different sialic acid-containing carbohydrates, deduced through 10 ns molecular dynamics (MD) simulations. The database is available at the URL: http://www.3dsdscar.org.
no abstract.
The EUROCarbDB project is a design study for a technical framework, which provides sophisticated, freely accessible, open-source informatics tools and databases to support glycobiology and glycomic research. EUROCarbDB is a relational database containing glycan structures, their biological context and, when available, primary and interpreted analytical data from high-performance liquid chromatography, mass spectrometry and nuclear magnetic resonance experiments. Database content can be accessed via a web-based user interface. The database is complemented by a suite of glycoinformatics tools, specifically designed to assist the elucidation and submission of glycan structure and experimental data when used in conjunction with contemporary carbohydrate research workflows. All software tools and source code are licensed under the terms of the Lesser General Public License, and publicly contributed structures and data are freely accessible. The public test version of the web interface to the EUROCarbDB can be found at http://www.ebi.ac.uk/eurocarb.
In this sixth installment of the series, we would like to describe databases and repositories for glycolipids and glycoproteins, which are collectively known as glycoconjugates. Many proteins are glycosylated and correspondingly exhibit a wide variety of functions, and these protein-modifying glycans comprise a wide variety of structures. It is also known that the proteins are modified with different glycan structures depending on diseases and other factors. Here is an introduction to databases for glycoconjugates and mass spectrometry data repositories for glycans and glycoproteins.
no abstract.
In this study, we selected 181 nematode glycogenes that are orthologous to human glycogenes and examined their RNAi phenotypes. The results are deposited in the Caenorhabditis elegans Glycogene Database (CGGDB) at AIST, Tsukuba, Japan. The most prominent RNAi phenotypes observed are disruptions of cell cycle progression in germline mitosis/meiosis and in early embryonic cell mitosis. Along with the previously reported roles of chondroitin proteoglycans, glycosphingolipids and GPI-anchored proteins in cell cycle progression, we show for the first time that the inhibition of the functions of N-glycan synthesis genes (cytoplasmic alg genes) resulted in abnormal germline formation, ER stress and small body size phenotypes. The results provide additional information on the roles of glycoconjugates in the cell cycle progression mechanisms of germline and embryonic cells.
MOTIVATION: The rapid increase in the number of structures in the Protein Databank (PDB) makes it difficult to find all structures in a given protein class. Automatically-maintained web-based summaries are one solution to this problem. RESULTS: Summary of Antibody Crystal Structures (SACS), a self-maintaining web-site containing summary information on antibody structures in the PDB, is described. Mirrored PDB data are processed automatically using a Make-based system to identify new antibody structures. The PDB header records and sequence data are then parsed to identify a number of features of the structure and the data are stored using eXtensible Markup Language (XML). eXtensible Stylesheet Language: Transformations (XSLT), a new style sheet language for XML, is used to generate Hypertext Markup Language (HTML) pages containing either a one-line summary of every structure or a more detailed page describing a single antibody.
Discusses the development of the International Classification of Diseases, Tenth Revision (ICD-10) as compared with the previous version and the Diagnostic and Statistical Manual of Mental Disorders-III (DSM-III) and the DSM-III-R. The Schedules for Clinical Assessment in Neuropsychiatry (SCAN) is also outlined. SCAN provides the clinical researcher with a detailed coverage of the mental state in terms of well-described, well-defined, and properly elicited symptoms with the facility of computer derived diagnoses according to DSM-III-R and ICD-10 criteria.
GenBank® (http://www.ncbi.nlm.nih.gov) is a comprehensive database that contains publicly available nucleotide sequences for almost 260 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assigns accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI home page: www.ncbi.nlm.nih.gov.
The Protein Data Bank began as a grassroots effort in 1971. It has grown from a small archive containing a dozen structures to a major international resource for structural biology containing more than 40000 entries. The interplay of science, technology and attitudes about data sharing have all played a role in the growth of this resource.
ProGlycProt (http://www.proglycprot.org/) is an open access, manually curated, comprehensive repository of bacterial and archaeal glycoproteins with at least one experimentally validated glycosite (glycosylated residue). To facilitate maximum information at one point, the database is arranged under two sections: (i) ProCGP-the main data section consisting of 95 entries with experimentally characterized glycosites and (ii) ProUGP-a supplementary data section containing 245 entries with experimentally identified glycosylation but uncharacterized glycosites. Every entry in the database is fully cross-referenced and enriched with available published information about source organism, coding gene, protein, glycosites, glycosylation type, attached glycan, associated oligosaccharyl/glycosyl transferases (OSTs/GTs), supporting references, and applicable additional information. Interestingly, ProGlycProt contains as many as 174 entries for which information is unavailable or the characterized glycosites are unannotated in Swiss-Prot release 2011_07. The website supports a dedicated structure gallery of homology models and crystal structures of characterized glycoproteins in addition to two new tools developed in view of emerging information about prokaryotic sequons (conserved sequences of amino acids around glycosites) that are never or rarely seen in eukaryotic glycoproteins. ProGlycProt provides an extensive compilation of experimentally identified glycosites (334) and glycoproteins (340) of prokaryotes that could serve as an information resource for research and technology applications in glycobiology.
Lectins, and related receptors such as adhesins and toxins, are glycan-binding proteins from all origins that decipher the glycocode, i.e. the structural information encoded in the conformation of complex carbohydrates present on the surface of all cells. Lectins are still poorly classified and annotated, but since their functions are based on ligand recognition, their 3D-structures provide a solid foundation for characterization. UniLectin3D is a curated database that classifies lectins on origin and fold, with cross-links to literature, other databases in glycosciences and functional data such as known specificity. The database provides detailed information on lectins, their bound glycan ligands, and features their interactions using the Protein-Ligand Interaction Profiler (PLIP) server. Special care was devoted to the description of the bound glycan ligands with the use of simple graphical representation and numerical format for cross-linking to other databases in glycoscience. We conceived the design of the database architecture and the navigation tools to account for all organisms, as well as to search for oligosaccharide epitopes complexed within specified binding sites. UniLectin3D is accessible at https://www.unilectin.eu/unilectin3D.
The search for new biomolecules requires a clear understanding of biosynthesis and degradation pathways. This view applies to most metabolites as well as other molecule types such as glycans whose repertoire is still poorly characterized. Lectins are proteins that specifically recognize glycans and interact with them noncovalently. This particular class of proteins is considered to play a major role in biology. Glycan-binding is based on multivalence, which gives lectins a unique capacity to interact with surface glycans and significantly contribute to cell-cell recognition and interactions. Lectins have been studied for many years using multiple technologies and part of the resulting information is available online in databases. Unfortunately, the connectivity of these databases with the most popular omics databases (genomics, proteomics, and glycomics) remains limited. Moreover, lectin diversity is extensive and requires a flexible classification that remains compatible with new sequences and 3D structures that are continuously released. We have designed UniLectin to provide new insight into the knowledge of lectins, their classification, and their biological role. This platform encompasses UniLectin3D, a curated database of lectin 3D structures that follows a periodically updated classification, a set of comparative and visualizing tools and gradually released modules dedicated to specific lectins predicted in sequence databases. The second module is PropLec, focused on β-propeller lectin prediction in all species based on five distinct family profiles. This chapter describes how UniLectin can be used to explore the diversity of lectins, their 3D structures, and associated functional information as well as to perform reliable predictions of β-propeller lectins.
The Scopus database provides access to STM journal articles and the references included in those articles, allowing the searcher to search both forward and backward in time. The database can be used for collection development as well as for research. This review provides information on the key points of the database and compares it to Web of Science. Neither database is all-inclusive, but the two complement each other. If a library can afford only one, the choice must be based on institutional needs.
The Carbohydrate-Active Enzyme (CAZy) database is a knowledge-based resource specialized in the enzymes that build and break down complex carbohydrates and glycoconjugates. As of September 2008, the database describes the present knowledge on 113 glycoside hydrolase, 91 glycosyltransferase, 19 polysaccharide lyase, 15 carbohydrate esterase and 52 carbohydrate-binding module families. These families are created based on experimentally characterized proteins and are populated by sequences from public databases with significant similarity. Protein biochemical information is continuously curated based on the available literature and structural information. Over 6400 proteins have assigned EC numbers and 700 proteins have a PDB structure. The classification (i) reflects the structural features of these enzymes better than their sole substrate specificity, (ii) helps to reveal the evolutionary relationships between these enzymes and (iii) provides a convenient framework to understand mechanistic properties. This resource has been available for over 10 years to the scientific community, contributing to information dissemination and providing a transversal nomenclature to glycobiologists. More recently, this resource has been used to improve the quality of functional predictions of a number of genome projects by providing expert annotation. The CAZy resource resides at URL: http://www.cazy.org/.
Glycosyltransferases (GTs; EC 2.4.x.y) constitute a large group of enzymes that form glycosidic bonds through transfer of sugars from activated donor molecules to acceptor molecules. GTs are critical to the biosynthesis of plant cell walls, among other diverse functions. Based on the Carbohydrate-Active enZymes (CAZy) database and sequence similarity searches, we have identified 609 potential GT genes (loci) corresponding to 769 transcripts (gene models) in rice (Oryza sativa), the reference monocotyledonous species. Using domain composition and sequence similarity, these rice GTs were classified into 40 CAZy families plus an additional unknown class. We found that two Pfam domains of unknown function, PF04577 and PF04646, are associated with GT families GT61 and GT31, respectively. To facilitate functional analysis of this important and large gene family, we created a phylogenomic Rice GT Database (http://ricephylogenomics.ucdavis.edu/cellwalls/gt/). Through the database, several classes of functional genomic data, including mutant lines and gene expression data, can be displayed for each rice GT in the context of a phylogenetic tree, allowing for comparative analysis both within and between GT families. Comprehensive digital expression analysis of public gene expression data revealed that most (approximately 80%) rice GTs are expressed. Based on analysis with Inparanoid, we identified 282 'rice-diverged' GTs that lack orthologs in sequenced dicots (Arabidopsis thaliana, Populus trichocarpa, Medicago truncatula, and Ricinus communis). Combining these analyses, we identified 33 rice-diverged GT genes (45 gene models) that are highly expressed in above-ground, vegetative tissues. From the literature and this analysis, 21 of these loci are excellent targets for functional examination toward understanding and manipulating grass cell wall qualities. Study of the remainder may reveal aspects of hormone and protein metabolism that are critical for rice biology.
This list of 33 genes and the Rice GT Database will facilitate the study of GTs and cell wall synthesis in rice and other plants.
Lectins, a class of carbohydrate-binding proteins, are now widely recognized to play a range of crucial roles in many cell-cell recognition events triggering several important cellular processes. They encompass different members that are diverse in their sequences, structures, binding site architectures, quaternary structures, carbohydrate affinities, and specificities as well as their larger biological roles and potential applications. It is not surprising, therefore, that the vast amount of experimental data on lectins available in the literature is so diverse that it becomes difficult and time-consuming, if not impossible, to comprehend the advances in various areas and obtain the maximum benefit. To achieve an effective use of all the data toward understanding the function of lectins and their possible applications, an organization of these seemingly independent data into a common framework is essential. An integrated knowledge base (Lectindb, http://nscdb.bic.physics.iisc.ernet.in) together with appropriate analytical tools has therefore been developed, initially for plant lectins, by collating and integrating diverse data. The database has been implemented using MySQL on a Linux platform and web-enabled using PERL-CGI and Java tools. Data for each lectin pertain to taxonomic, biochemical, domain architecture, molecular sequence, and structural details as well as carbohydrate and hence blood group specificities. Extensive links have also been provided for relevant bioinformatics resources and analytical tools. Availability of diverse data integrated into a common framework is expected to be of high value not only for basic studies in lectin biology but also for basic studies in pursuing several applications in biotechnology, immunology, and clinical practice, using these molecules.
The BRENDA enzyme information system (http://www.brenda-enzymes.org/) has developed into an elaborate system of enzyme and enzyme-ligand information obtained from different sources, combined with flexible query systems and evaluation tools. The information is obtained by manual extraction from primary literature, text and data mining, data integration, and prediction algorithms. Approximately 300 million data include enzyme function and molecular data from more than 30,000 organisms. The manually derived core contains 3 million data from 77,000 enzymes annotated from 135,000 literature references. Each entry is connected to the literature reference and the source organism. They are complemented by information on occurrence, enzyme/disease relationships from text mining, sequences and 3D structures from other databases, and predicted enzyme location and genome annotation. Functional and structural data of more than 190,000 enzyme ligands are stored in BRENDA. New features improving the functionality and analysis tools were implemented. The human anatomy atlas CAVEman is linked to the BRENDA Tissue Ontology terms providing a connection between anatomical and functional enzyme data. Word Maps for enzymes obtained from PubMed abstracts highlight application and scientific relevance of enzymes. The EnzymeDetector genome annotation tool and the reaction database BKM-react including reactions from BRENDA, KEGG and MetaCyc were improved. The website was redesigned providing new query options.
MatrixDB (http://matrixdb.ibcp.fr) is a freely available database focused on interactions established by extracellular proteins and polysaccharides. Only a few databases report protein-polysaccharide interactions and, to the best of our knowledge, there is no other database of extracellular interactions. MatrixDB takes into account the multimeric nature of several extracellular protein families for the curation of interactions, and reports interactions with individual polypeptide chains or with multimers, considered as permanent complexes, when appropriate. MatrixDB is a member of the International Molecular Exchange consortium (IMEx) and has adopted the PSI-MI standards for the curation and the exchange of interaction data. MatrixDB stores experimental data from our laboratory, data from literature curation, data imported from IMEx databases, and data from the Human Protein Reference Database. MatrixDB is focused on mammalian interactions, but aims to integrate interaction datasets of model organisms when available. MatrixDB provides direct links to databases recapitulating mutations in genes encoding extracellular proteins, to UniGene and to the Human Protein Atlas that shows expression and localization of proteins in a large variety of normal human tissues and cells. MatrixDB allows researchers to perform customized queries and to build tissue- and disease-specific interaction networks that can be visualized and analyzed with Cytoscape or Medusa.
Many publicly available data repositories and resources have been developed to support protein-related information management, data-driven hypothesis generation, and biological knowledge discovery. To help researchers quickly find the appropriate protein-related informatics resources, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases in this chapter. We also discuss the challenges and opportunities for developing next-generation protein bioinformatics databases and resources to support data integration and data analytics in the Big Data era.
Knowledge of the glycosylation status and glycan pattern of proteins is of considerable medical, academic and application interest. ProGlycProt V2.0 (www.proglycprot.org) is therefore conceived and maintained as an exclusive web-resource providing comprehensive information on experimentally validated glycoproteins and protein glycosyltransferases (GTs) of prokaryotic origin. The second release of ProGlycProt features a major update with a 191% increase in the total number of entries, manually collected and curated from 607 peer-reviewed publications on the subject. Protein GTs from prokaryotes that catalyze a varied range of glycan linkages are amenable glycoengineering tools. Therefore, the second release presents content that is greatly expanded and reorganized in two sub-databases: ProGPdb and ProGTdb. While ProGPdb provides information about validated glycoproteins (222 entries), ProGTdb catalogs enzymes/proteins that are instrumental in protein glycosylation, directly (122) or as accessory proteins (182). ProGlycProt V2.0 remains highly cross-referenced yet exclusive and complementary in content to other related databases. The second release further features enhanced search capability, a "compare" entries option and an innovative geoanalytical tool (MapView) facilitating location-assisted search-cum-filtering of the entries using geo-positioning information of researchers/groups cited in the ProGlycProt V2.0 databases. Thus, ProGlycProt V2.0 continues to serve as a useful one-point web-resource for evidence-based information on protein glycosylation in prokaryotes.
MatrixDB (http://matrixdb.univ-lyon1.fr/) is an interaction database focused on biomolecular interactions established by extracellular matrix (ECM) proteins and glycosaminoglycans (GAGs). It is an active member of the International Molecular Exchange (IMEx) consortium (https://www.imexconsortium.org/). It has adopted the HUPO Proteomics Standards Initiative standards for annotating and exchanging interaction data, either at the MIMIx (The Minimum Information about a Molecular Interaction eXperiment) or IMEx level. The following items related to GAGs have been added in the updated version of MatrixDB: (i) cross-references of GAG sequences to the GlyTouCan database, (ii) representation of GAG sequences in different formats (IUPAC and GlycoCT) and as SNFG (Symbol Nomenclature For Glycans) images and (iii) the GAG Builder online tool to build 3D models of GAG sequences from GlycoCT codes. The database schema has been improved to represent n-ary experiments. Gene expression data, imported from Expression Atlas (https://www.ebi.ac.uk/gxa/home), quantitative ECM proteomic datasets (http://matrisomeproject.mit.edu/ecm-atlas), and a new visualization tool of the 3D structures of biomolecules, based on the PDB Component Library and LiteMol, have also been added. A new advanced query interface now allows users to mine MatrixDB data using combinations of criteria, in order to build specific interaction networks related to diseases, biological processes, molecular functions or publications.
The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life. Detailed annotations extracted from the literature by expert curators have been collected for over half a million of these proteins. These annotations are supplemented by annotations provided by rule based automated systems, and those imported from other resources. In this article we describe significant updates that we have made over the last 2 years to the resource. We have greatly expanded the number of Reference Proteomes that we provide and in particular we have focussed on improving the number of viral Reference Proteomes. The UniProt website has been augmented with new data visualizations for the subcellular localization of proteins as well as their structure and interactions. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
The UniProt knowledgebase is a large resource of protein sequences and associated detailed annotation. The database contains over 60 million sequences, of which over half a million sequences have been curated by experts who critically review experimental and predicted data for each protein. The remainder are automatically annotated based on rule systems that rely on the expert curated knowledge. Since our last update in 2014, we have more than doubled the number of reference proteomes to 5631, giving a greater coverage of taxonomic diversity. We implemented a pipeline to remove redundant highly similar proteomes that were causing excessive redundancy in UniProt. The initial run of this pipeline reduced the number of sequences in UniProt by 47 million. For our users interested in the accessory proteomes, we have made available sets of pan proteome sequences that cover the diversity of sequences for each species that is found in its strains and sub-strains. To help interpretation of genomic variants, we provide tracks of detailed protein information for the major genome browsers. We provide a SPARQL endpoint that allows complex queries of the more than 22 billion triples of data in UniProt (http://sparql.uniprot.org/). UniProt resources can be accessed via the website at http://www.uniprot.org/.
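The SPARQL endpoint mentioned in the UniProt abstract above can be queried over plain HTTP. The following is a minimal sketch in Python that only builds a query and the request URL, without sending anything. The `/sparql` path, the `up:` and `taxon:` prefixes as used here, and the helper names `build_query` and `build_request_url` are assumptions made for illustration and should be checked against the current UniProt documentation.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed query path on the endpoint host named in the abstract.
SPARQL_ENDPOINT = "https://sparql.uniprot.org/sparql"


def build_query(taxon_id, limit=10):
    """Return a SPARQL query listing reviewed proteins for one organism.

    The vocabulary (up:Protein, up:organism, up:reviewed) is assumed to
    follow the UniProt core ontology; verify before relying on it.
    """
    return (
        "PREFIX up: <http://purl.uniprot.org/core/>\n"
        "PREFIX taxon: <http://purl.uniprot.org/taxonomy/>\n"
        "SELECT ?protein WHERE {\n"
        "  ?protein a up:Protein ;\n"
        f"           up:organism taxon:{taxon_id} ;\n"
        "           up:reviewed true .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query, endpoint=SPARQL_ENDPOINT):
    """Encode the query as the standard SPARQL-protocol `query` parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


if __name__ == "__main__":
    # 9606 is the NCBI Taxonomy identifier for Homo sapiens.
    print(build_request_url(build_query(9606, limit=5)))
```

The resulting URL could be fetched with any HTTP client, typically with an `Accept` header that requests a SPARQL results format. Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.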
Carbohydrate-binding proteins play crucial roles across all organisms and viruses. The complexity of carbohydrate structures, together with inconsistencies in how their 3D structures are reported, has led to difficulties in characterizing the protein-carbohydrate interfaces. In order to better understand protein-carbohydrate interactions, we have developed an open-access database, ProCarbDB, which, unlike the Protein Data Bank (PDB), clearly distinguishes between the complete carbohydrate ligands and their monomeric units. ProCarbDB is a comprehensive database containing over 5200 3D X-ray crystal structures of protein-carbohydrate complexes. In ProCarbDB, the complete carbohydrate ligands are annotated and all their interactions are displayed. Users can also select any protein residue in the proximity of the ligand to inspect its interactions with the carbohydrate ligand and with other neighbouring protein residues. Where available, additional curated information on the binding affinity of the complex and the effects of mutations on the binding have also been provided in the database. We believe that ProCarbDB will be an invaluable resource for understanding protein-carbohydrate interfaces. The ProCarbDB web server is freely available at http://www.procarbdb.science/procarb.
Carbohydrates are well known for their physicochemical, biological, functional, and therapeutic characteristics. Unfortunately, their chemical nature imposes severe challenges for the structural elucidation of these phenomena, impairing not only the depth of our understanding of carbohydrates but also the development of new biotechnological and therapeutic applications based on these molecules. In the recent past, the amount of structural information, obtained mainly from X-ray crystallography, has increased progressively, as well as its quality. In this context, the current work presents a global analysis of the carbohydrate information available in the Protein Data Bank (PDB). From high quality structures, it is clear that most of the data are highly concentrated on a few sets of residue types, on their monosaccharidic forms, and connected by a small diversity of glycosidic linkages. The geometries of these linkages can be mostly associated with the types of linkages instead of residues, while the level of puckering distortion was characterized, quantified, and located in a pseudorotational equilibrium landscape, not only to local minima but also to transitional states. These qualitative and quantitative analyses offer a global picture of the carbohydrate structural content in the PDB, potentially supporting the building of new models for carbohydrate-related biological phenomena at the atomistic level, including new developments on force field parameters.
Thirty years have elapsed since the emergence of the classification of carbohydrate-active enzymes in sequence-based families that became the CAZy database over 20 years ago, freely available for browsing and download at www.cazy.org. In the era of large scale sequencing and high-throughput biology, it is important to examine the position of this specialist database that is deeply rooted in human curation. The three primary tasks of the CAZy curators are (i) to maintain and update the family classification of this class of enzymes, (ii) to classify sequences newly released by GenBank and the Protein Data Bank and (iii) to capture and present functional information for each family. The CAZy website is updated once a month. Here we briefly summarize the increase in novel families and the annotations conducted during the last 8 years. We present several important changes that facilitate taxonomic navigation and allow the entirety of the annotations to be downloaded. Most importantly, we highlight the considerable amount of work that accompanies the analysis and reporting of biochemical data from the literature.
The Structural Antibody Database (SAbDab; http://opig.stats.ox.ac.uk/webapps/sabdab) is an online resource containing all the publicly available antibody structures, annotated and presented in a consistent fashion. The data are annotated with several properties including experimental information, gene details, correct heavy and light chain pairings, antigen details and, where available, antibody-antigen binding affinity. The user can select structures according to these attributes as well as structural properties such as complementarity determining region loop conformation and variable domain orientation. Individual structures, datasets and the complete database can be downloaded.
In 2017, we reported a new database on glycosyltransferase (GT) activities, CSDB_GT (http://csdb.glycoscience.ru/gt.html), which was built on the platform of the Carbohydrate Structure Database (CSDB, http://csdb.glycoscience.ru/database/index.html) and contained data on experimentally confirmed GT activities from Arabidopsis thaliana. All entries in CSDB_GT are curated manually upon the analysis of scientific publications, and the key features of the database are accurate structural, genetic, protein and bibliographic references and close-to-complete coverage of experimentally proven GT activities in selected species. In 2018, CSDB_GT was supplemented with data on Escherichia coli GT activities. Now it contains ca. 800 entries on E. coli GTs, including ca. 550 with functions predicted in silico. This information was extracted from research papers published up to the year 2018 or was obtained by the authors' efforts on GT annotation. Thus, CSDB_GT was extended to provide not only experimentally confirmed GT activities, but also those predicted on the basis of gene or protein sequence homology that could carry valuable information. Accordingly, a new confirmation status, "predicted in silico", was introduced. In addition, the coverage of A. thaliana was extended up to ca. 900 entries, all of which had experimental confirmation. Currently, CSDB_GT provides close-to-complete coverage of experimentally confirmed GT activities from A. thaliana and E. coli presented up to the year 2018.
We report the completion of the first stage of the development of a novel manually curated database on glycosyltransferase (GT) activities, CSDB_GT. CSDB_GT (http://csdb.glycoscience.ru/gt.html) has been supplemented with GT activities from Saccharomyces cerevisiae. It now provides close-to-complete coverage of experimentally confirmed GTs from the three most studied model organisms from three kingdoms: plantae (Arabidopsis thaliana, ca. 930 activities), bacteria (Escherichia coli, ca. 820 activities) and fungi (S. cerevisiae, ca. 270 activities).
Glycosyltransferases (GTs) are carbohydrate-active enzymes (CAZy) involved in the synthesis of natural glycan structures. The application of CAZy is in high demand in biotechnology and pharmaceutics. However, it is being hindered by the lack of high-quality and comprehensive repositories of the research data accumulated so far. In this paper, we describe a new curated Carbohydrate Structure Glycosyltransferase Database (CSDB_GT). Currently, CSDB_GT provides ca. 780 activities exhibited by GTs, as well as several other CAZy, found in Arabidopsis thaliana and described in ca. 180 publications. It covers most published data on A. thaliana GTs with evidenced functions. CSDB_GT is linked to the Carbohydrate Structure Database (CSDB), which stores data on archaeal, bacterial, fungal and plant glycans. The CSDB_GT data are supported by experimental evidence and can be traced to original publications. CSDB_GT is freely available at http://csdb.glycoscience.ru/gt.html.
The NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank, ENA (EMBL) and DDBJ databases. It includes organism names and taxonomic lineages for each of the sequences represented in the INSDC's nucleotide and protein sequence databases. The taxonomy database is manually curated by a small group of scientists at the NCBI who use the current taxonomic literature to maintain a phylogenetic taxonomy for the source organisms represented in the sequence databases. The taxonomy database is a central organizing hub for many of the resources at the NCBI, and provides a means for clustering elements within other domains of NCBI web site, for internal linking between domains of the Entrez system and for linking out to taxon-specific external resources on the web. Our primary purpose is to index the domain of sequences as conveniently as possible for our user community.
The Cambridge Structural Database (CSD) contains a complete record of all published organic and metal-organic small-molecule crystal structures. The database has been in operation for over 50 years and continues to be the primary means of sharing structural chemistry data and knowledge across disciplines. As well as structures that are made public to support scientific articles, it includes many structures published directly as CSD Communications. All structures are processed both computationally and by expert structural chemistry editors prior to entering the database. A key component of this processing is the reliable association of the chemical identity of the structure studied with the experimental data. This important step helps ensure that data is widely discoverable and readily reusable. Content is further enriched through selective inclusion of additional experimental data. Entries are available to anyone through free CSD community web services. Linking services developed and maintained by the CCDC, combined with the use of standard identifiers, facilitate discovery from other resources. Data can also be accessed through CCDC and third party software applications and through an application programming interface.
O-GLYCBASE is a database of glycoproteins with O-linked glycosylation sites. Entries with at least one experimentally verified O-glycosylation site have been compiled from protein sequence databases and literature. Each entry contains information about the glycan involved, the species, sequence, a literature reference and http-linked cross-references to other databases. Version 4.0 contains 179 protein entries, an approximate 15% increase over the last version. Sequence logos representing the acceptor specificity patterns for GalNAc, GlcNAc, mannosyl and xylosyl transferases are shown. The O-GLYCBASE database is available through the WWW at http://www.cbs.dtu.dk/databases/OGLYCBASE/.
BACKGROUND: PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure. RESULTS: The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in substance to 45,808,881 in compound as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form). 
CONCLUSIONS: Standardization of chemical structures is complicated by the diversity of chemical information and of the approaches used to represent it. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid/incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize), and via programmatic interfaces.
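The size of the deduplication effect reported in the PubChem abstract above can be worked out directly from the two counts it gives. The short Python sketch below does only that arithmetic; the function name `reduction` is hypothetical and nothing here reproduces the standardization procedure itself.

```python
# Unique structure counts quoted in the abstract (de-aromatized
# canonical isomeric SMILES, before and after standardization).
SUBSTANCE_UNIQUE = 53_574_724
COMPOUND_UNIQUE = 45_808_881


def reduction(before, after):
    """Return (absolute drop, percentage drop) between two counts."""
    merged = before - after
    return merged, 100.0 * merged / before


merged, percent = reduction(SUBSTANCE_UNIQUE, COMPOUND_UNIQUE)
print(f"{merged:,} fewer unique structures ({percent:.1f}% reduction)")
```

In other words, standardization collapses roughly one in seven of the unique Substance structures onto a structure that is already represented in Compound.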
BACKGROUND: The International Classification of Diseases (ICD) has long been the main basis for comparability of statistics on causes of mortality and morbidity between places and over time. This paper provides an overview of the recently completed 11th revision of the ICD, focusing on the main innovations and their implications. MAIN TEXT: Changes in content reflect knowledge and perspectives on diseases and their causes that have emerged since ICD-10 was developed about 30 years ago. Changes in design and structure reflect the arrival of the networked digital era, for which ICD-11 has been prepared. ICD-11's information framework comprises a semantic knowledge base (the Foundation), a biomedical ontology linked to the Foundation and classifications derived from the Foundation. ICD-11 for Mortality and Morbidity Statistics (ICD-11-MMS) is the primary derived classification and the main successor to ICD-10. Innovations enabled by the new architecture include an online coding tool (replacing the index and providing additional functions), an application program interface to enable remote access to ICD-11 content and services, enhanced capability to capture and combine clinically relevant characteristics of cases and integrated support for multiple languages. CONCLUSIONS: ICD-11 was adopted by the World Health Assembly in May 2019. Transition to implementation is in progress. ICD-11 can be accessed at icd.who.int.
Taking advantage of the known planarity of the N-acetyl group of N-acetylglucosamine, an analysis of the quality of carbohydrate structures found in the Protein Data Bank was performed. Few obvious defects of the local geometry of the carbonyl group were observed. However, the N-acetyl group was often found in the less favorable cis conformation (12% of the cases). It was also found severely twisted in numerous instances, especially in structures with a resolution poorer than 1.9 Å determined between 2000 and 2015. Though the automated PDB-REDO procedure has proved able to improve nearly 85% of the structural models deposited to the PDB, and does prove able to cure most severely twisted conformations of the N-acetyl group, it fails to correct its high rate of cis conformations. More generally, for structures with a resolution poorer than 1.6 Å, it produces N-acetylglucosamine models in slightly poorer agreement with experimental data, as measured using real-space correlation coefficients. Significant improvements are thus still needed, at least as far as this carbohydrate structure is concerned.
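The cis, trans and twisted categories discussed in the abstract above come down to a torsion angle around the amide bond. The following Python sketch shows one way such a classification could be set up. It is not the authors' procedure: the choice of the four atoms (for example C2, N2, C7 and C8 of the residue), the 30° tolerance, and the function names `dihedral_deg` and `classify_amide` are all assumptions made for illustration.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the axis.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def classify_amide(omega_deg, tolerance_deg=30.0):
    """Label a torsion as cis (near 0), trans (near 180) or twisted.

    The tolerance is an assumed cut-off, not a value from the paper.
    """
    magnitude = abs(omega_deg)
    if magnitude <= tolerance_deg:
        return "cis"
    if magnitude >= 180.0 - tolerance_deg:
        return "trans"
    return "twisted"


if __name__ == "__main__":
    # Idealized planar coordinates, not taken from any PDB entry.
    omega = dihedral_deg((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                         (0.0, 1.0, 0.0), (-1.0, 1.0, 0.0))
    print(classify_amide(omega))
```

Which label corresponds to the favorable conformation depends on which four atoms define the torsion, so the atom selection should be fixed before any statistics are compared with published values.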
Lectins are a large group of carbohydrate-binding proteins, having been shown to comprise at least 48 protein scaffolds or protein family entries. They occur ubiquitously in living organisms (from humans to microorganisms, including viruses) and, while their functions are yet to be fully elucidated, their main underlying actions are thought to mediate cell-cell and cell-glycoconjugate interactions, which play important roles in an extensive range of biological processes. The basic feature of each lectin's function resides in its specific sugar-binding properties. In this regard, it is beneficial for researchers to have access to fundamental information about the detailed oligosaccharide specificities of diverse lectins. In this review, the authors describe a publicly available lectin database named "Lectin frontier DataBase (LfDB)", which undertakes the continuous publication and updating of comprehensive data for lectin-standard oligosaccharide interactions in terms of dissociation constants (Kd's). For Kd determination, an advanced system of frontal affinity chromatography (FAC) is used, with which quantitative datasets of interactions between immobilized lectins and >100 fluorescently labeled standard glycans have been generated. The FAC system is unique in its clear principle, simple procedure and high sensitivity, with an increasing number (>67) of associated publications that attest to its reliability. Thus, LfDB is expected to play an essential role in lectin research, not only in basic but also in applied fields of glycoscience.
Owing to the importance of the post-translational modifications (PTMs) of proteins in regulating biological processes, the dbPTM (http://dbPTM.mbc.nctu.edu.tw/) was developed as a comprehensive database of experimentally verified PTMs from several databases with annotations of potential PTMs for all UniProtKB protein entries. For this 10th anniversary of dbPTM, the updated resource provides not only a comprehensive dataset of experimentally verified PTMs, supported by the literature, but also an integrative interface for accessing all available databases and tools that are associated with PTM analysis. As well as collecting experimental PTM data from 14 public databases, this update manually curates over 12 000 modified peptides, including the emerging S-nitrosylation, S-glutathionylation and succinylation, from approximately 500 research articles, which were retrieved by text mining. As the number of available PTM prediction methods increases, this work compiles a non-homologous benchmark dataset to evaluate the predictive power of online PTM prediction tools. An increasing interest in the structural investigation of PTM substrate sites motivated the mapping of all experimental PTM peptides to protein entries of Protein Data Bank (PDB) based on database identifier and sequence identity, which enables users to examine spatially neighboring amino acids, solvent-accessible surface area and side-chain orientations for PTM substrate sites on tertiary structures. Since drug binding in PDB is annotated, this update identified over 1100 PTM sites that are associated with drug binding. The update also integrates metabolic pathways and protein-protein interactions to support the PTM network analysis for a group of proteins. Finally, the web interface is redesigned and enhanced to facilitate access to this resource.
The structural registration of chemically modified macromolecules is vital for the development of biopharmaceuticals. However, registration and search of such complex molecules has so far posed formidable challenges performance-wise, since today's chemistry-oriented databases do not scale well to macromolecules. As a practical consequence, macromolecules tend to be stored in protein databases with a focus on protein sequence only, and salient chemistry details are therefore lost. This article describes protein format extensions and the use of pseudoatoms for representing natural amino acids in chemical structures to allow high-performance registration and retrieval of large macromolecules. The representations include exact chemical modifications and enable lossless conversion between chemistry and sequence formats. Registration is done in parallel in both sequence and chemistry formats, and users can register and retrieve molecules in either format as they choose, resulting in what we call a BioChemformatics database. Having both sequence and chemistry formats available on-demand allows for the construction of protein SAR tables with mixed sequence and chemistry information. Likewise, searching may combine sequence and chemistry terms and be performed in standard vendor applications like MDL's ISIS/Base or in-house applications using standard SQL queries.
Molecular dynamics simulations of membrane proteins have provided deeper insights into their functions and interactions with surrounding environments at the atomic level. However, compared to solvation of globular proteins, building a realistic protein/membrane complex is still challenging and requires considerable experience with simulation software. Membrane Builder in the CHARMM-GUI website (http://www.charmm-gui.org) helps users to build such a complex system using a web browser with a graphical user interface. Through a generalized and automated building process including system size determination as well as generation of lipid bilayer, pore water, bulk water, and ions, a realistic membrane system with virtually any kinds and shapes of membrane proteins can be generated in 5 minutes to 2 hours depending on the system size. Default values that were elaborated and tested extensively are given in each step to provide reasonable options and starting points for both non-expert and expert users. The efficacy of Membrane Builder is illustrated by its applications to 12 transmembrane and 3 interfacial membrane proteins, whose fully equilibrated systems with three different types of lipid molecules (DMPC, DPPC, and POPC) and two types of system shapes (rectangular and hexagonal) are freely available on the CHARMM-GUI website. One of the most significant advantages of using the web environment is that, if a problem is found, users can go back and re-generate the whole system again before quitting the browser. Therefore, Membrane Builder provides the intuitive and easy way to build and simulate the biologically important membrane system.
Protein glycosylation is a common post-translational modification that plays important roles in terms of protein function. However, analyzing the relationship between glycosylation and protein function remains technically challenging. This problem arises from the fact that the attached glycans possess diverse and heterogeneous structures. We believe that the first step to elucidate glycan function is to systematically determine the status of protein glycosylation under physiological conditions. Such studies involve analyzing differences in glycan structure on cell type (tissue), sex, and age, as well as changes associated with perturbations as a result of gene knockout of glycan biosynthesis-related enzyme, disease and drug treatment. Therefore, we analyzed a series of glycoproteomes in several mouse tissues to identify glycosylated proteins and their glycosylation sites. Comprehensive analysis was performed by lectin- or HILIC-capture of glycopeptide subsets followed by enzymatic deglycosylation in stable isotope-labeled water (H₂¹⁸O, IGOT) and finally LC-MS analyses. In total, 5060 peptides derived from 2556 glycoproteins were identified. We then constructed a glycoprotein database, GlycoProtDB, using our experimental-based information to facilitate future studies in glycobiology.
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource that serves the scientific community as well as the general public, with millions of unique users per month. In the past two years, PubChem made substantial improvements. Data from more than 100 new data sources were added to PubChem, including chemical-literature links from Thieme Chemistry, chemical and physical property links from SpringerMaterials, and patent links from the World Intellectual Properties Organization (WIPO). PubChem's homepage and individual record pages were updated to help users find desired information faster. This update involved a data model change for the data objects used by these pages as well as by programmatic users. Several new services were introduced, including the PubChem Periodic Table and Element pages, Pathway pages, and Knowledge panels. Additionally, in response to the coronavirus disease 2019 (COVID-19) outbreak, PubChem created a special data collection that contains PubChem data related to COVID-19 and the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public repository for information on chemical substances and their biological activities, launched in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the US National Institutes of Health (NIH). For the past 11 years, PubChem has grown to a sizable system, serving as a chemical information resource for the scientific research community. PubChem consists of three inter-linked databases, Substance, Compound and BioAssay. The Substance database contains chemical information deposited by individual data contributors to PubChem, and the Compound database stores unique chemical structures extracted from the Substance database. Biological activity data of chemical substances tested in assay experiments are contained in the BioAssay database. This paper provides an overview of the PubChem Substance and Compound databases, including data sources and contents, data organization, data submission using PubChem Upload, chemical structure standardization, web-based interfaces for textual and non-textual searches, and programmatic access. It also gives a brief description of PubChem3D, a resource derived from theoretical three-dimensional structures of compounds in PubChem, as well as PubChemRDF, Resource Description Framework (RDF)-formatted PubChem data for data sharing, analysis and integration with information contained in other databases.
The SWISS-MODEL Repository is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The Repository currently contains about 300,000 three-dimensional models for sequences from the Swiss-Prot and TrEMBL databases. The content of the Repository is updated on a regular basis incorporating new sequences, taking advantage of new template structures becoming available and reflecting improvements in the underlying modelling algorithms. Each entry consists of one or more three-dimensional protein models, the superposed template structures, the alignments on which the models are based, a summary of the modelling process and a force field based quality assessment. The SWISS-MODEL Repository can be queried via an interactive website at http://swissmodel.expasy.org/repository/. Annotation and cross-linking of the models with other databases, e.g. Swiss-Prot on the ExPASy server, allow for seamless navigation between protein sequence and structure information. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated three-dimensional protein models generated by automated homology modelling, bridging the gap between sequence and structure databases.
MatrixDB (http://matrixdb.ibcp.fr) is a freely available database focused on interactions established by extracellular proteins and polysaccharides. It is an active member of the International Molecular Exchange (IMEx) consortium and has adopted the PSI-MI standards for annotating and exchanging interaction data, either at the MIMIx or IMEx level. MatrixDB content has been updated by curation and by importing extracellular interaction data from other IMEx databases. Other major changes include the creation of a new website and the development of a novel graphical navigator, iNavigator, to build and expand interaction networks. Filters may be applied to build sub-networks based on a list of biomolecules, a specified interaction detection method and/or an expression level by tissue, developmental stage, and health state (UniGene data). Any molecule of the network may be selected and its partners added to the network at any time. Networks may be exported under Cytoscape and tabular formats and as images, and may be saved for subsequent re-use.
dbPTM is a database that compiles information on protein post-translational modifications (PTMs), such as the catalytic sites, solvent accessibility of amino acid residues, protein secondary and tertiary structures, protein domains and protein variations. The database includes all of the experimentally validated PTM sites from Swiss-Prot, PhosphoELM and O-GLYCBASE. Only a small fraction of Swiss-Prot proteins are annotated with experimentally verified PTM. Although the Swiss-Prot provides rich information about the PTM, other structural properties and functional information of proteins are also essential for elucidating protein mechanisms. The dbPTM systematically identifies three major types of protein PTM (phosphorylation, glycosylation and sulfation) sites against Swiss-Prot proteins by refining our previously developed prediction tool, KinasePhos (http://kinasephos.mbc.nctu.edu.tw/). Solvent accessibility and secondary structure of residues are also computationally predicted and are mapped to the PTM sites. The resource is now freely available at http://dbPTM.mbc.nctu.edu.tw/.
The Carbohydrate-Active Enzymes database (CAZy; http://www.cazy.org) provides online and continuously updated access to a sequence-based family classification linking the sequence to the specificity and 3D structure of the enzymes that assemble, modify and breakdown oligo- and polysaccharides. Functional and 3D structural information is added and curated on a regular basis based on the available literature. In addition to the use of the database by enzymologists seeking curated information on CAZymes, the dissemination of a stable nomenclature for these enzymes is probably a major contribution of CAZy. The past few years have seen the expansion of the CAZy classification scheme to new families, the development of subfamilies in several families and the power of CAZy for the analysis of genomes and metagenomes. This article outlines the changes that have occurred in CAZy during the past 5 years and presents our novel effort to display the resolution and the carbohydrate ligands in crystallographic complexes of CAZymes.
The past decade has witnessed the modern advances of high-throughput technology and rapid growth of research capacity in producing large-scale biological data, both of which were concomitant with an exponential growth of biomedical literature. This wealth of scholarly knowledge is of significant importance for researchers in making scientific discoveries and healthcare professionals in managing health-related matters. However, the acquisition of such information is becoming increasingly difficult due to its large volume and rapid growth. In response, the National Center for Biotechnology Information (NCBI) is continuously making changes to its PubMed Web service for improvement. Meanwhile, different entities have devoted themselves to developing Web tools for helping users quickly and efficiently search and retrieve relevant publications. These practices, together with maturity in the field of text mining, have led to an increase in the number and quality of various Web tools that provide comparable literature search service to PubMed. In this study, we review 28 such tools, highlight their respective innovations, compare them to the PubMed system and one another, and discuss directions for future development. Furthermore, we have built a website dedicated to tracking existing systems and future advances in the field of biomedical literature search. Taken together, our work serves information seekers in choosing tools for their needs and service providers and developers in keeping current in the field. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/search.
Knowledge of the 3D structure of glycans is a prerequisite for a complete understanding of the biological processes glycoproteins are involved in. However, due to a lack of standardised nomenclature, carbohydrate compounds are difficult to locate within the Protein Data Bank (PDB). Using an algorithm that detects carbohydrate structures only requiring element types and atom coordinates, we were able to detect 1663 entries containing a total of 5647 carbohydrate chains. The majority of chains are found to be N-glycosidically bound. Noncovalently bound ligands are also frequent, while O-glycans form a minority. About 30% of all carbohydrate containing PDB entries comprise one or several errors. The automatic assignment of carbohydrate structures in PDB entries will improve the cross-linking of glycobiology resources with genomic and proteomic data collections, which will be an important issue of the upcoming glycomics projects. By aiding in detection of erroneous annotations and structures, the algorithm might also help to increase database quality.
Interactions between proteins are essential to any cellular process and constitute the basis for molecular networks that determine the functional state of a cell. With the technical advances in recent years, an astonishingly high number of protein-protein interactions has been revealed. However, the interactome of O-linked N-acetylglucosamine transferase (OGT), the sole enzyme adding the O-linked beta-N-acetylglucosamine (O-GlcNAc) onto its target proteins, has been largely undefined. To that end, we collated OGT interaction proteins experimentally identified in the past several decades. Rigorous curation of datasets from public repositories and O-GlcNAc-focused publications led to the identification of up to 929 high-stringency OGT interactors from multiple species studied (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, and others). Among them, 784 human proteins were found to be interactors of human OGT. Moreover, these proteins spanned a very diverse range of functional classes (e.g., DNA repair, RNA metabolism, translational regulation, and cell cycle), with significant enrichment in regulating transcription and (co)translation. Our dataset demonstrates that OGT is likely a hub protein in cells. A webserver OGT-Protein Interaction Network (OGT-PIN) has also been created, which is freely accessible.
Understanding of the three-dimensional structures of proteins that interact with carbohydrates covalently (glycoproteins) as well as noncovalently (protein-carbohydrate complexes) is essential to many biological processes and plays a significant role in normal and disease-associated functions. It is important to have a central repository of knowledge available about these protein-carbohydrate complexes as well as preprocessed data of predicted structures. This can be significantly enhanced by tools de novo which can predict carbohydrate-binding sites for proteins in the absence of structure of experimentally known binding site. PROCARB is an open-access database comprising three independently working components, namely, (i) Core PROCARB module, consisting of three-dimensional structures of protein-carbohydrate complexes taken from Protein Data Bank (PDB), (ii) Homology Models module, consisting of manually developed three-dimensional models of N-linked and O-linked glycoproteins of unknown three-dimensional structure, and (iii) CBS-Pred prediction module, consisting of web servers to predict carbohydrate-binding sites using single sequence or server-generated PSSM. Several precomputed structural and functional properties of complexes are also included in the database for quick analysis. In particular, information about function, secondary structure, solvent accessibility, hydrogen bonds and literature reference, and so forth, is included. In addition, each protein in the database is mapped to Uniprot, Pfam, PDB, and so forth.
The SugarBind Database (SugarBindDB) covers knowledge of glycan binding of human pathogen lectins and adhesins. It is a curated database; each glycan-protein binding pair is associated with at least one published reference. The core data element of SugarBindDB is a set of three inseparable components: the pathogenic agent, a lectin/adhesin and a glycan ligand. Each entity (agent, lectin or ligand) is described by a range of properties that are summarized in an entity-dedicated page. Several search, navigation and visualisation tools are implemented to investigate the functional role of glycans in pathogen binding. The database is cross-linked to protein and glycan-related resources such as UniProtKB and UniCarbKB. It is tightly bound to the latter via a substructure search tool that maps each ligand to full structures where it occurs. Thus, a glycan-lectin binding pair of SugarBindDB can lead to the identification of a glycan-mediated protein-protein interaction, that is, a lectin-glycoprotein interaction, via substructure search and the knowledge of site-specific glycosylation stored in UniCarbKB. SugarBindDB is accessible at: http://sugarbind.expasy.org.
The Immune Epitope Database and Analysis Resource (IEDB) contains information related to antibodies and T cells across an expansive scope of research fields (infectious diseases, allergy, autoimmunity, and transplantation). Capture and representation of the data to reflect growing scientific standards and techniques have required continual refinement of our rigorous curation and query and reporting processes beginning with the automated classification of over 28 million PubMed abstracts, and resulting in easily searchable data from over 20,000 published manuscripts. Data related to MHC binding and elution, nonpeptidics, natural processing, receptors, and 3D structure is first captured through manual curation and subsequently maintained through recuration to reflect evolving scientific standards. Upon promotion to the free, public database, users can query and export records of specific relevance via the online web portal which undergoes iterative development to best enable efficient data access. In parallel, the companion Analysis Resource site hosts a variety of tools that assist in the bioinformatic analyses of epitopes and related structures, which can be applied to IEDB-derived and independent datasets alike. Available tools are classified into two categories: analysis and prediction. Analysis tools include epitope clustering, sequence conservancy, and more, while prediction tools cover T and B cell epitope binding, immunogenicity, and TCR/BCR structures. In addition to these tools, benchmarking servers which allow for unbiased performance comparison are also offered. In order to expand and support the user-base of both the database and Analysis Resource, the research team actively engages in community outreach through publication of ongoing work, conference attendance and presentations, hosting of user workshops, and the provision of online help. This review provides a description of the IEDB database infrastructure, curation and recuration processes, query and reporting capabilities, the Analysis Resource, and our Community Outreach efforts, including assessment of the impact of the IEDB across the research community.
AraCyc is a database containing biochemical pathways of Arabidopsis, developed at The Arabidopsis Information Resource (http://www.arabidopsis.org). The aim of AraCyc is to represent Arabidopsis metabolism as completely as possible with a user-friendly Web-based interface. It presently features more than 170 pathways that include information on compounds, intermediates, cofactors, reactions, genes, proteins, and protein subcellular locations. The database uses Pathway Tools software, which allows the users to visualize a bird's eye view of all pathways in the database down to the individual chemical structures of the compounds. The database was built using Pathway Tools' Pathologic module with MetaCyc, a collection of pathways from more than 150 species, as a reference database. This initial build was manually refined and annotated. More than 20 plant-specific pathways, including carotenoid, brassinosteroid, and gibberellin biosyntheses have been added from the literature. A list of more than 40 plant pathways will be added in the coming months. The quality of the initial, automatic build of the database was compared with the manually improved version, and with EcoCyc, an Escherichia coli database using the same software system that has been manually annotated for many years. In addition, a Perl interface, PerlCyc, was developed that allows programmers to access Pathway Tools databases from the popular Perl language. AraCyc is available at the tools section of The Arabidopsis Information Resource Web site (http://www.arabidopsis.org/tools/aracyc).
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Additional NCBI resources focus on literature (Bookshelf, PubMed Central (PMC) and PubReader); medical genetics (ClinVar, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen); genes and genomics (BioProject, BioSample, dbSNP, dbVar, Epigenomics, Gene, Gene Expression Omnibus (GEO), Genome, HomoloGene, the Map Viewer, Nucleotide, PopSet, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser, Trace Archive and UniGene); and proteins and chemicals (Biosystems, COBALT, the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB), Protein Clusters, Protein and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for many of these databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at http://www.ncbi.nlm.nih.gov.
Databases and data repositories provide essential functions for the research community by integrating, curating, archiving and otherwise packaging data to facilitate discovery and reuse. Despite their importance, funding for maintenance of these resources is increasingly hard to obtain. Fueled by a desire to find long term, sustainable solutions to database funding, staff from the Arabidopsis Information Resource (TAIR), founded the nonprofit organization, Phoenix Bioinformatics, using TAIR as a test case for user-based funding. Subscription-based funding has been proposed as an alternative to grant funding but its application has been very limited within the nonprofit sector. Our testing of this model indicates that it is a viable option, at least for some databases, and that it is possible to strike a balance that maximizes access while still incentivizing subscriptions. One year after transitioning to subscription support, TAIR is self-sustaining and Phoenix is poised to expand and support additional resources that wish to incorporate user-based funding strategies. Database URL: www.arabidopsis.org.
The 2016 Database Issue of Nucleic Acids Research starts with overviews of the resources provided by three major bioinformatics centers, the U.S. National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EMBL-EBI) and Swiss Institute for Bioinformatics (SIB). Also included are descriptions of 62 new databases and updates on 95 databases that have been previously featured in NAR plus 17 previously described elsewhere. A number of papers in this issue deal with resources on nucleic acids, including various kinds of non-coding RNAs and their interactions, molecular dynamics simulations of nucleic acid structure, and two databases of super-enhancers. The protein database section features important updates on the EBI's Pfam, PDBe and PRIDE databases, as well as a variety of resources on pathways, metabolomics and metabolic modeling. This issue also includes updates on popular metagenomics resources, such as MG-RAST, EBI Metagenomics, and probeBASE, as well as a newly compiled Human Pan-Microbe Communities database. A significant fraction of the new and updated databases are dedicated to the genetic basis of disease, primarily cancer, and various aspects of drug research, including resources for patented drugs, their side effects, withdrawn drugs, and potential drug targets. A further six papers present updated databases of various antimicrobial and anticancer peptides. The entire Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/). The NAR online Molecular Biology Database Collection, http://www.oxfordjournals.org/nar/database/c/, has been updated with the addition of 88 new resources and removal of 23 obsolete websites, which brought the current listing to 1685 databases.
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 25 trillion base pairs from over 3.7 billion nucleotide sequences for 557 000 formally described species. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. Recent updates include policies for including spatio-temporal metadata, clarified documentation for GenBank data processing, enhanced foreign contamination screening tools, new processes in the Submission Portal, migration of Entrez Genome and Assembly displays into NCBI Datasets, and the impending retirement of tbl2asn, replaced by table2asn.
The BRENDA (BRaunschweig ENzyme Database, http://www.brenda-enzymes.org) enzyme information system is the main collection of enzyme functional and property data for the scientific community. The majority of the data are manually extracted from the primary literature. The content covers information on function, structure, occurrence, preparation and application of enzymes as well as properties of mutants and engineered variants. The number of manually annotated references increased by 30% to more than 100,000, the number of ligand structures by 45% to almost 100,000. New query, analysis and data management tools were implemented to improve data processing, data presentation, data input and data access. BRENDA now provides new viewing options such as the display of the statistics of functional parameters and the 3D view of protein sequence and structure features. Furthermore a ligand summary shows comprehensive information on the BRENDA ligands. The enzymes are linked to their respective pathways and can be viewed in pathway maps. The disease text mining part is strongly enhanced. It is possible to submit new, not yet classified enzymes to BRENDA, which then are reviewed and classified by the International Union of Biochemistry and Molecular Biology. A new SBML output format of BRENDA kinetic data allows the construction of organism-specific metabolic models.
The number of times an article is acknowledged as a reference in another article reflects its scientific impact. Citation analysis is one of the parameters for assessing the quality of research published in scientific, technology and social science journals. Web of Science enables users to search current and retrospective multidisciplinary information. Parameters and practical applications evaluating journal and article citation characteristics available through the Science Citation Index are summarized.
SugarBindDB lists pathogen and biotoxin lectins and their carbohydrate ligands in a searchable format. Information is collected from articles published in peer-reviewed scientific journals. Help files guide the user through the search process and provide a review of structures and names of sugars that appear in human oligosaccharides. Glycans are written in the condensed form of the carbohydrate nomenclature system developed by the International Union of Pure and Applied Chemistry (IUPAC). Since its online publication by The MITRE Corporation in 2005, the database has served as a resource for research on the glycobiology of infectious disease. SugarBindDB is currently hosted by the Swiss Institute of Bioinformatics on the ExPASy server and will be enhanced and linked to related resources as part of the wider UniCarbKB initiative. Enhancements will include the option to display glycans in a variety of formats, including modified 2D condensed IUPAC and symbolic nomenclature.
MOTIVATION: Protein-carbohydrate interactions perform several cellular and biological functions and their structure and function are mainly dictated by their binding affinity. Although plenty of experimental data on binding affinity are available, there is no reliable and comprehensive database in the literature. RESULTS: We have developed a database on binding affinity of protein-carbohydrate complexes, ProCaff, which contains 3122 entries on dissociation constant (Kd), Gibbs free energy change (ΔG), experimental conditions, sequence, structure and literature information. Additional features include options to search, display, visualize, download and upload the data. AVAILABILITY AND IMPLEMENTATION: The database is freely available at http://web.iitm.ac.in/bioinfo2/procaff/. The website is implemented using HTML and PHP and supports recent versions of major browsers such as Chrome, Firefox, IE10 and Opera. CONTACT: gromiha@iitm.ac.in. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
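The two affinity measures named in this abstract are linked by the standard thermodynamic relation ΔG = RT ln(Kd), with Kd expressed in mol/L relative to a 1 M standard state. The following is a minimal Python sketch of that conversion. The function names and the default temperature of 298.15 K are illustrative assumptions; nothing here is taken from ProCaff itself.

```python
import math

# Molar gas constant in kcal/(mol*K).
R_KCAL = 1.98720425864083e-3


def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L.

    Assumes a 1 M standard state; the default temperature is an assumption.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def kd_from_delta_g(delta_g_kcal, temperature_k=298.15):
    """Inverse conversion: dissociation constant (mol/L) from ΔG (kcal/mol)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


if __name__ == "__main__":
    # A micromolar binder, typical of many lectin-glycan interactions.
    print(round(delta_g_from_kd(1e-6), 2))
```

Because experimental temperatures differ between entries, the temperature should be passed explicitly whenever it is recorded alongside the measurement.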
The process of designing and implementing NMRShiftDB, an open-source, open-content database for chemical structures and their NMR data based solely on free software is described. NMRShiftDB is available to the community on http://www.nmrshiftdb.org. It allows for open submission and retrieval of data sets by its user community. The software and the content itself is freely distributable, allowing for the establishment of a highly available mirror system of databases in collaborating laboratories.
Antibodies are used extensively for a wide range of basic research and clinical applications. While an abundant and diverse collection of antibodies to protein antigens have been developed, good monoclonal antibodies to carbohydrates are much less common. Moreover, it can be difficult to determine if a particular antibody has the appropriate specificity, which antibody is best suited for a given application, and where to obtain that antibody. Herein, we provide an overview of the current state of the field, discuss challenges for selecting and using antiglycan antibodies, and summarize deficiencies in the existing repertoire of antiglycan antibodies. This perspective was enabled by collecting information from publications, databases, and commercial entities and assembling it into a single database, referred to as the Database of Anti-Glycan Reagents (DAGR). DAGR is a publicly available, comprehensive resource for anticarbohydrate antibodies, their applications, availability, and quality.
The LIPID MAPS Structure Database (LMSD) is a relational database encompassing structures and annotations of biologically relevant lipids. Structures of lipids in the database come from four sources: (i) LIPID MAPS Consortium's core laboratories and partners; (ii) lipids identified by LIPID MAPS experiments; (iii) computationally generated structures for appropriate lipid classes; (iv) biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public sources. All the lipid structures in LMSD are drawn in a consistent fashion. In addition to a classification-based retrieval of lipids, users can search LMSD using either text-based or structure-based search options. The text-based search implementation supports data retrieval by any combination of these data fields: LIPID MAPS ID, systematic or common name, mass, formula, category, main class, and subclass data fields. The structure-based search, in conjunction with optional data fields, provides the capability to perform a substructure search or exact match for the structure drawn by the user. Search results, in addition to structure and annotations, also include relevant links to external databases. The LMSD is publicly available at www.lipidmaps.org/data/structure/.
Carbohydrate-Active enZymes (CAZymes) assemble, breakdown, and modify glycans and glycoconjugates using their catalytic and binding modules (functional protein domains). The CAZy database offers since 1998 an online and continuously updated classification of CAZyme modules (Lombard et al. 2014). Each module family in the CAZy classification has been created based on experimentally characterized protein modules from the literature, and the families are populated by related module sequences from public protein sequence databases. Since no universal threshold allows the systematic classification of the various CAZyme families, CAZy annotations result from an expert combination of module modeling/calibration and human curation. CAZy annotations are made publicly available for all proteins released by GenBank, Swiss-Prot and the Protein Data Bank. Further, functional and 3-D structural information, curated from the literature on a regular basis, constitute essential added values to the CAZy annotation. In this spirit, the display of ligand information from crystallographic complexes has been recently developed. This chapter will guide the reader through the usage of CAZy to search enzyme annotations. It will also answer frequent questions such as (i) how to obtain CAZy annotations for a specific protein, a genome, or a metagenome, (ii) how to have a newly characterized family included in the CAZy classification scheme, (iii) why CAZy does not cover all protein families related to glycans/glycoconjugates, and (iv) why CAZy does not transfer functional annotation to similar sequences. Finally, we present here a recent CAZy-associated tool, namely, the Polysaccharide Utilization Loci (PUL) predictor and database in Bacteroidetes species.
Glycogenes include genes of glycosyltransferase, sugar-nucleotide synthase, sugar-nucleotide transporter, sulfotransferase, etc. A total of 184 glycogenes are known to exist, and the authors collected the data of human glycogenes including the newly found ones in the “Construction of Glycogene Library” project (April 2001–March 2004) (Narimatsu 2004). There has been no database which comprehensively stores the information on human glycogenes. At present, over 184 genes of human glycosyltransferases and sulfotransferases are identified, cloned and expressed in various expression systems to analyze the activity for carbohydrate synthesis and its biological function. Therefore, in order to enable one-stop search of information necessary for the analysis of glycogenes, we constructed GlycoGene Database (GGDB) (Kwon 2004) which comprehensively includes all information of glycogenes obtained to date, and equips functions for analysis of homology and other purposes.
The BioMagResBank (BMRB: www.bmrb.wisc.edu) is a repository for experimental and derived data gathered from nuclear magnetic resonance (NMR) spectroscopic studies of biological molecules. BMRB is a partner in the Worldwide Protein Data Bank (wwPDB). The BMRB archive consists of four main data depositories: (i) quantitative NMR spectral parameters for proteins, peptides, nucleic acids, carbohydrates and ligands or cofactors (assigned chemical shifts, coupling constants and peak lists) and derived data (relaxation parameters, residual dipolar couplings, hydrogen exchange rates, pKa values, etc.), (ii) databases for NMR restraints processed from original author depositions available from the Protein Data Bank, (iii) time-domain (raw) spectral data from NMR experiments used to assign spectral resonances and determine the structures of biological macromolecules and (iv) a database of one- and two-dimensional ¹H and ¹³C NMR spectra for over 250 metabolites. The BMRB website provides free access to all of these data. BMRB has tools for querying the archive and retrieving information and an ftp site (ftp.bmrb.wisc.edu) where data in the archive can be downloaded in bulk. Two BMRB mirror sites exist: one at the PDBj, Protein Research Institute, Osaka University, Osaka, Japan (bmrb.protein.osaka-u.ac.jp) and the other at CERM, University of Florence, Florence, Italy (bmrb.postgenomicnmr.net/). The site at Osaka also accepts and processes data depositions.
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of sequences in UniProtKB has risen to over 227 million and we are working towards including a reference proteome for each taxonomic group. We continue to extract detailed annotations from the literature to update or create reviewed entries, while unreviewed entries are supplemented with annotations provided by automated systems using a variety of machine-learning techniques. In addition, the scientific community continues their contributions of publications and annotations to UniProt entries of their interest. Finally, we describe our new website (https://www.uniprot.org/), designed to enhance our users' experience and make our data easily accessible to the research community. This interface includes access to AlphaFold structures for more than 85% of all entries as well as improved visualisations for subcellular localisation of proteins.
After years of strategic planning, the National Library of Medicine has introduced an updated and redesigned version of its PubMed health sciences research website. The new website features a more modern and responsive interface, especially on mobile devices. Tools and features have been relocated to make them more intuitive for new users. While not without some turbulence and slight discomfort for long-time users adjusting to the modernized interface and search engine, the new version of the PubMed website introduced in 2020 succeeds in the website's time-honored task of collecting and making freely accessible high-quality health sciences information and resources.
When the National Library of Medicine acquired a computer to augment its publication program, the intent was to present in one medium an index to journal articles and a catalog of books and new serial titles. The computer programs designed for indexing were unsatisfactory for cataloging, however; so two publications were issued, the Index Medicus and the NLM Current Catalog. The Current Catalog features separate name and subject sections, added volumes, and technical reports. The Express Cataloging Service was one of the first attempts to increase the speed and coverage of the Catalog. Shared cataloging with the Library of Congress, the Countway Library at Harvard, and the Upstate Medical Library in Syracuse, New York, have also contributed to the efforts toward improving this library service. An additional shared cataloging program, this time with the National Medical Audiovisual Center, is expected to be implemented shortly.
Summary: GlycoStore is a curated chromatographic, electrophoretic and mass-spectrometry composition database of N-, O-, glycosphingolipid (GSL) glycans and free oligosaccharides associated with a range of glycoproteins, glycolipids and biotherapeutics. The database is built on publicly available experimental datasets from GlycoBase developed in the Oxford Glycobiology Institute and then the National Institute for Bioprocessing Research and Training (NIBRT). It has now been extended to include recently published and in-house data collections from the Bioprocessing Technology Institute (BTI) A*STAR, Macquarie University and Ludger Ltd. GlycoStore provides access to approximately 850 unique glycan structure entries supported by over 8500 retention positions determined by: (i) hydrophilic interaction chromatography (HILIC) ultra-high performance liquid chromatography (U/HPLC) and reversed phase (RP)-U/HPLC with fluorescent detection; (ii) porous graphitized carbon (PGC) chromatography in combination with ESI-MS/MS detection; and (iii) capillary electrophoresis with laser induced fluorescence detection (CE-LIF). GlycoStore enhances many features previously available in GlycoBase while addressing the limitations of the data collections and model of this popular resource. GlycoStore aims to support detailed glycan analysis by providing a resource that underpins current workflows. It will be regularly updated by expert annotation of published data and data obtained from the project partners. Availability and implementation: http://www.glycostore.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Databases play an increasingly important role in biology. They archive, store, maintain, and share information on genes, genomes, expression data, protein sequences and structures, metabolites and reactions, interactions, and pathways. All these data are critically important to microbiologists. Furthermore, microbiology has its own databases that deal with model microorganisms, microbial diversity, physiology, and pathogenesis. Thousands of biological databases are currently available, and it becomes increasingly difficult to keep up with their development. The purpose of this minireview is to provide a brief survey of current databases that are of interest to microbiologists.
no abstract.
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/) is the single worldwide archive of structural data of biological macromolecules. This paper describes the data uniformity project that is underway to address the inconsistency in PDB data.
A new web tool, PDB2MultiGIF (http://www.dkfz-heidelberg.de/spec/pdb2mgif/), which converts the topological information (atom types, 3D coordinates, molecular connectivity) of molecules (given in PDB format [1]) to a series of animated images (in GIF format) [2], is described. The molecular visualisation program RASMOL [3] is used to generate the images.
The Protein Data Bank (PDB), the single global repository of experimentally determined 3D structures of biological macromolecules and their complexes, was established in 1971, becoming the first open-access digital resource in the biological sciences. The PDB archive currently houses ~130,000 entries (May 2017). It is managed by the Worldwide Protein Data Bank organization (wwPDB; wwpdb.org), which includes the RCSB Protein Data Bank (RCSB PDB; rcsb.org), the Protein Data Bank Japan (PDBj; pdbj.org), the Protein Data Bank in Europe (PDBe; pdbe.org), and BioMagResBank (BMRB; www.bmrb.wisc.edu). The four wwPDB partners operate a unified global software system that enforces community-agreed data standards and supports data Deposition, Biocuration, and Validation of ~11,000 new PDB entries annually (deposit.wwpdb.org). The RCSB PDB currently acts as the archive keeper, ensuring disaster recovery of PDB data and coordinating weekly updates. wwPDB partners disseminate the same archival data from multiple FTP sites, while operating complementary websites that provide their own views of PDB data with selected value-added information and links to related data resources. At present, the PDB archives experimental data, associated metadata, and 3D-atomic level structural models derived from three well-established methods: crystallography, nuclear magnetic resonance spectroscopy (NMR), and electron microscopy (3DEM). wwPDB partners are working closely with experts in related experimental areas (small-angle scattering, chemical cross-linking/mass spectrometry, Förster resonance energy transfer or FRET, etc.) to establish a federation of data resources that will support sustainable archiving and validation of 3D structural models and experimental data derived from integrative or hybrid methods.
Protein-carbohydrate interactions underlie essential biological processes. Elucidating the mechanism of protein-carbohydrate recognition is a prerequisite for modeling and optimizing protein-carbohydrate interactions, which will help in discovery of carbohydrate-derived therapeutics. In this work, we present a survey of a curated database consisting of 6,402 protein-carbohydrate complexes in the Protein Data Bank (PDB). We performed an all-against-all comparison of a subset of nonredundant binding sites, and the result indicates that the interaction pattern similarity is not completely relevant to the binding site structural similarity. Investigation of both binding site and ligand promiscuities reveals that the geometry of chemical feature points is more important than local backbone structure in determining protein-carbohydrate interactions. A further analysis on the frequency and geometry of atomic interactions shows that carbohydrate functional groups are not equally involved in binding interactions. Finally, we discuss the usefulness of protein-carbohydrate complexes in the PDB with acknowledgement that the carbohydrates in many structures are incomplete.
Carbohydrates are well known for their physicochemical, biological, functional, and therapeutic characteristics. Unfortunately, their chemical nature imposes severe challenges for the structural elucidation of these phenomena, impairing not only the depth of our understanding of carbohydrates but also the development of new biotechnological and therapeutic applications based on these molecules. In the recent past, the amount of structural information, obtained mainly from X-ray crystallography, has increased progressively, as well as its quality. In this context, the current work presents a global analysis of the carbohydrate information available in the Protein Data Bank (PDB). From high quality structures, it is clear that most of the data are highly concentrated on a few sets of residue types, on their monosaccharidic forms, and connected by a small diversity of glycosidic linkages. The geometries of these linkages can be mostly associated with the types of linkages instead of residues, while the level of puckering distortion was characterized, quantified, and located in a pseudorotational equilibrium landscape, not only to local minima but also to transitional states. These qualitative and quantitative analyses offer a global picture of the carbohydrate structural content in the PDB, potentially supporting the building of new models for carbohydrate-related biological phenomena at the atomistic level, including new developments on force field parameters.
The glycan fragment database (GFDB), freely available at http://www.glycanstructure.org, is a database of the glycosidic torsion angles derived from the glycan structures in the Protein Data Bank (PDB). Analogous to protein structure, the structure of an oligosaccharide chain in a glycoprotein, referred to as a glycan, can be characterized by the torsion angles of glycosidic linkages between relatively rigid carbohydrate monomeric units. Knowledge of accessible conformations of biologically relevant glycans is essential in understanding their biological roles. The GFDB provides an intuitive glycan sequence search tool that allows the user to search complex glycan structures. After a glycan search is complete, each glycosidic torsion angle distribution is displayed in terms of the exact match and the fragment match. The exact match results are from the PDB entries that contain the glycan sequence identical to the query sequence. The fragment match results are from the entries with the glycan sequence whose substructure (fragment) or entire sequence is matched to the query sequence, such that the fragment results implicitly include the influences from the nearby carbohydrate residues. In addition, clustering analysis based on the torsion angle distribution can be performed to obtain the representative structures among the searched glycan structures.
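A glycosidic torsion angle of the kind tabulated in such a database is an ordinary dihedral angle over four atoms. The sketch below shows one common way to compute it from Cartesian coordinates using only the Python standard library. The function name is hypothetical, and which four atoms define phi or psi (for example O5-C1-O-Cx for phi) is an assumption that varies between conventions, so the caller chooses the atoms.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for atoms p0-p1-p2-p3.

    The sign is positive when, looking along the p1->p2 bond, the p0 atom
    must be rotated clockwise to eclipse the p3 atom.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane p0-p1-p2
    n2 = _cross(b2, b3)  # normal of the plane p1-p2-p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Four points arranged with a right-angle twist about the central bond.
    print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The atan2 form is used instead of an arccosine of the normal vectors because it returns the sign of the angle directly and stays numerically stable near 0 and 180 degrees.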
Knowledge of the 3D structure of glycans is a prerequisite for a complete understanding of the biological processes glycoproteins are involved in. However, due to a lack of standardised nomenclature, carbohydrate compounds are difficult to locate within the Protein Data Bank (PDB). Using an algorithm that detects carbohydrate structures only requiring element types and atom coordinates, we were able to detect 1663 entries containing a total of 5647 carbohydrate chains. The majority of chains are found to be N-glycosidically bound. Noncovalently bound ligands are also frequent, while O-glycans form a minority. About 30% of all carbohydrate containing PDB entries comprise one or several errors. The automatic assignment of carbohydrate structures in PDB entries will improve the cross-linking of glycobiology resources with genomic and proteomic data collections, which will be an important issue of the upcoming glycomics projects. By aiding in detection of erroneous annotations and structures, the algorithm might also help to increase database quality.
Knowledge of the 3D structure of glycoproteins and protein-carbohydrate complexes is indispensable to fully understand the biological processes they are involved in. Carbohydrate Structure Suite is an attempt to automatically analyse carbohydrate structures contained in the PDB and make the results publicly available on the internet. Characteristic torsion angles, glycoprotein sequences and carbohydrate-protein interactions are analysed. Furthermore, tools to crosslink the PDB and carbohydrate databases and to check the integrity of carbohydrate 3D structures are included. The service is available at (www.dkfz.de/spec/css/).
The compilation of data collections for carbohydrates has only recently gained momentum. The availability of such comprehensive databases, however, will be a prerequisite to successfully perform large-scale glycomics projects aiming to decipher new biological functions of glycans. With the carbohydrate structure suite (CSS), the carbohydrate-related data contained in the protein data bank (PDB) are now accessible through the Internet. It turned out that the PDB is a versatile resource for structural aspects in glycobiology. It provides reliable data about glycosylation sites, the conformational preferences of glycans and the specificity of protein-carbohydrate recognition. A detailed comparison between the carbohydrate assignment reported in the PDB files and the nomenclature derived from atom coordinates guarantees that only consistent data will be evaluated. The automatic assignment of a unique structural description (LINUCS notation) enables easy cross-linking and referencing with other carbohydrate-related resources like NMR and MS-spectra. Exemplified for α- and β-N-acetylglucosamine it is shown that a particular distribution of amino acids is required to establish specific recognition for both anomers. The unrestricted use of primary data enables an online linkage of carbohydrate-related databases with other bioinformatics and biomedical resources and will thus provide maximal synergism.
The 3D structural data of glycoprotein or protein-carbohydrate complexes that are found in the Protein Data Bank (PDB) are an interesting data source for glycobiologists. Unfortunately, carbohydrate components are difficult to find with the means provided by the PDB. The GLYCOSCIENCES.de internet portal offers a variety of tools and databases to locate and analyze these structures. This chapter describes how to find PDB entries that feature a specific carbohydrate structure and how to locate carbohydrate residues in a 3D structure file and to check their consistency. In addition to this, methods to statistically analyze torsion angles and the abundance of amino acids both in the neighborhood of glycosylation sites and in the spatial vicinity of non-covalently bound carbohydrate chains are summarized.
BACKGROUND: Carbohydrates are involved in a variety of fundamental biological processes and pathological situations. They therefore have a large pharmaceutical and diagnostic potential. Knowledge of the 3D structure of glycans is a prerequisite for a complete understanding of their biological functions. The largest source of biomolecular 3D structures is the Protein Data Bank. However, about 30% of all 1663 PDB entries (version September 2003) containing carbohydrates comprise errors in glycan description. Unfortunately, no software is currently available which aligns the 3D information with the reported assignments. It is the aim of this work to fill this gap. RESULTS: The pdb-care program (http://www.glycosciences.de/tools/pdb-care/) is able to identify and assign carbohydrate structures using only atom types and their 3D atom coordinates given in PDB files. Looking up a translation table where systematic names and the respective PDB residue codes are listed, both assignments are compared and inconsistencies are reported. Additionally, the reliability of reported and calculated connectivities for molecules listed within the HETATM records is checked and unusual values are reported. CONCLUSION: Frequent use of pdb-care will help to improve the quality of carbohydrate data contained in the PDB. Automatic assignment of carbohydrate structures contained in PDB entries will enable the cross-linking of glycobiology resources with genomic and proteomic data collections.
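The comparison step described in this abstract (reported residue code versus the name derived from coordinates, checked through a translation table) can be illustrated roughly as follows. This is only a sketch and not the pdb-care implementation: the tiny translation table, the function names and the `detected` input are all hypothetical, and the coordinate-based detection itself is assumed to happen elsewhere. The column positions follow the fixed-width PDB format for HETATM records.

```python
# Hypothetical, deliberately tiny translation table:
# PDB residue code -> short systematic name.
TRANSLATION = {
    "NAG": "b-D-GlcpNAc",
    "NDG": "a-D-GlcpNAc",
    "MAN": "a-D-Manp",
    "BMA": "b-D-Manp",
}


def reported_residues(pdb_lines):
    """Map (chain ID, residue number) -> residue code for HETATM records."""
    residues = {}
    for line in pdb_lines:
        if line.startswith("HETATM"):
            chain = line[21]               # column 22
            number = int(line[22:26])      # columns 23-26
            residues[(chain, number)] = line[17:20].strip()  # columns 18-20
    return residues


def find_inconsistencies(pdb_lines, detected, table=TRANSLATION):
    """Compare reported codes with names from a separate detection step.

    `detected` maps (chain ID, residue number) -> systematic name and is
    assumed to come from a coordinate-based analysis not implemented here.
    Returns tuples of (residue key, reported code, expected name, found name).
    """
    problems = []
    for key, code in sorted(reported_residues(pdb_lines).items()):
        expected = table.get(code)
        found = detected.get(key)
        if found is not None and found != expected:
            problems.append((key, code, expected, found))
    return problems
```

A residue deposited as MAN whose coordinates actually describe the beta anomer would be reported by `find_inconsistencies`, which mirrors the anomer mix-ups that such validation tools are meant to catch.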
MOTIVATION: Glycans play a central role in many essential biological processes. Glycan Reader was originally developed to simplify the reading of Protein Data Bank (PDB) files containing glycans through the automatic detection and annotation of sugars and glycosidic linkages between sugar units and to proteins, all based on atomic coordinates and connectivity information. Carbohydrates can have various chemical modifications at different positions, making their chemical space very diverse. Unfortunately, current PDB files do not provide exact annotations for most carbohydrate derivatives and more than 50% of PDB glycan chains have at least one carbohydrate derivative that could not be correctly recognized by the original Glycan Reader. RESULTS: Glycan Reader has been improved and now identifies most sugar types and chemical modifications (including various glycolipids) in the PDB, and both PDB and PDBx/mmCIF formats are supported. CHARMM-GUI Glycan Reader is updated to generate the simulation system and input of various glycoconjugates with most sugar types and chemical modifications. It also offers a new functionality to edit the glycan structures through addition/deletion/modification of glycosylation types, sugar types, chemical modifications, glycosidic linkages, and anomeric states. The simulation system and input files can be used for CHARMM, NAMD, GROMACS, AMBER, GENESIS, LAMMPS, Desmond, OpenMM, and CHARMM/OpenMM. Glycan Fragment Database in GlycanStructure.Org is also updated to provide an intuitive glycan sequence search tool for complex glycan structures with various chemical modifications in the PDB. AVAILABILITY AND IMPLEMENTATION: http://www.charmm-gui.org/input/glycan and http://www.glycanstructure.org. CONTACT: wonpil@lehigh.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Following the discovery of serious errors in the structure of biomacromolecules, structure validation has become a key topic of research, especially for ligands and non-standard residues. ValidatorDB (freely available at http://ncbr.muni.cz/ValidatorDB) offers a new step in this direction, in the form of a database of validation results for all ligands and non-standard residues from the Protein Data Bank (all molecules with seven or more heavy atoms). Model molecules from the wwPDB Chemical Component Dictionary are used as reference during validation. ValidatorDB covers the main aspects of validation of annotation, and additionally introduces several useful validation analyses. The most significant is the classification of chirality errors, allowing the user to distinguish between serious issues and minor inconsistencies. Other such analyses are able to report, for example, completely erroneous ligands, alternate conformations or complete identity with the model molecules. All results are systematically classified into categories, and statistical evaluations are performed. In addition to detailed validation reports for each molecule, ValidatorDB provides summaries of the validation results for the entire PDB, for sets of molecules sharing the same annotation (three-letter code) or the same PDB entry, and for user-defined selections of annotations or PDB entries.
Since 1971, the Protein Data Bank (PDB) has served as the single global archive for experimentally determined 3D structures of biological macromolecules made freely available to the global community according to the FAIR principles of Findability-Accessibility-Interoperability-Reusability. During the first 50 years of continuous PDB operations, standards for data representation have evolved to better represent rich and complex biological phenomena. Carbohydrate molecules present in more than 14,000 PDB structures have recently been reviewed and remediated to conform to a new standardized format. This machine-readable data representation for carbohydrates occurring in the PDB structures and the corresponding reference data improves the findability, accessibility, interoperability and reusability of structural information pertaining to these molecules. The PDB Exchange MacroMolecular Crystallographic Information File data dictionary now supports (i) standardized atom nomenclature that conforms to International Union of Pure and Applied Chemistry-International Union of Biochemistry and Molecular Biology (IUPAC-IUBMB) recommendations for carbohydrates, (ii) uniform representation of branched entities for oligosaccharides, (iii) commonly used linear descriptors of carbohydrates developed by the glycoscience community and (iv) annotation of glycosylation sites in proteins. For the first time, carbohydrates in PDB structures are consistently represented as collections of standardized monosaccharides, which precisely describe oligosaccharide structures and enable improved carbohydrate visualization, structure validation, robust quantitative and qualitative analyses, search for dendritic structures and classification. The uniform representation of carbohydrate molecules in the PDB described herein will facilitate broader usage of the resource by the glycoscience community and researchers studying glycoproteins.
The geometries of the contacts between monosaccharides and aromatic rings of amino acids found in X-ray crystallography structures, in the Protein Data Bank (PDB), were analyzed, while the energies of the interactions were calculated using quantum chemical method. We found 1913 sugar/aromatic ring contacts, 1054 of them (55%) with CH/pi interactions and 859 of them (45%) without CH/pi interactions. We showed that only the carbohydrate/aromatic contacts with CH/pi interactions are preferentially parallel and enable sliding in the plane parallel to aromatic ring. The calculated interaction energies in systems with CH/pi interactions are in the range from -1.7 kcal/mol to -6.8 kcal/mol, while in the systems without CH/pi interactions are in the range -0.2 to -3.2 kcal/mol. Hence, the binding that does not include CH/pi interactions, can also be important for aromatic amino acid and carbohydrate binding processes, since some of these interactions can be as strong as the CH/pi interactions. At the same time, these interactions can be weak enough to enable releasing of small carbohydrate fragments after the enzymatic reaction. The analysis of the protein-substrate patterns showed that every second or third carbohydrate unit in long substrates stacks with protein aromatic amino acids.
Glycosylation is one of the most common forms of protein post-translational modification, but is also the most complex. Dealing with glycoproteins in structure model building, refinement, validation and PDB deposition is more error-prone than dealing with nonglycosylated proteins owing to limitations of the experimental data and available software tools. Also, experimentalists are typically less experienced in dealing with carbohydrate residues than with amino-acid residues. The results of the reannotation and re-refinement by PDB-REDO of 8114 glycoprotein structure models from the Protein Data Bank are analyzed. The positive aspects of 3620 reannotations and subsequent refinement, as well as the remaining challenges to obtaining consistently high-quality carbohydrate models, are discussed.
With over 150,000 entries, the worldwide Protein Data Bank (PDB) is the primary repository for 3D macromolecular structure. Unfortunately, structural, annotational, and ambiguity errors exist for carbohydrates throughout the database due in part to the lack of carbohydrate‐specific tools for checking the quality of structures prior to deposition. Our group has partnered with the PDB Biocuration team to assist in the identification and remediation of carbohydrates in their database and we have developed a user‐friendly web interface called GlyFinder to accurately find, retrieve, and assess these glycans and glycoproteins. Using GlyFinder, we have found that 45,852 of the PDB entries contain carbohydrates (30.1% of the PDB). Nearly 6,000 glycoproteins have been identified, with an average of five N‐linked glycans per glycoprotein. Only 415 glycoproteins contained O‐linked glycans, with an average of three O‐linked glycans each. A surprisingly high number of glycoprotein PDBs (500, 7.45%) contain one or more N‐linked glycans that are alpha‐linked to the asparagine, illustrating the unfortunate errors that sometimes exist in the data. Because of such errors, and because the glycans in crystal structures are often truncated, we have also developed tools (Glycoprotein Builder) that can build realistic models of the glycoprotein with intact glycans employing the crystal structure of the protein core. These models allow us to predict the impact of glycosylation on protein function, antigenicity, immunogenicity and stability. We illustrate these capabilities for several proteins, including human Erythropoietin, HIV gp120, and influenza A hemagglutinin.
The Protein Data Bank (PDB) is the single global repository for experimentally‐determined 3D structures of biological macromolecules and their complexes with ligands. The Worldwide Protein Data Bank (wwPDB) is the international collaboration that manages the PDB archive according to the FAIR Principles: Findability, Accessibility, Interoperability, and Reusability. The PDB archive now holds more than 150,000 structures, which are all publicly accessible without restriction. A major focus of the wwPDB is maintaining consistency and accuracy across the archive. As the PDB grows, developments in structure determination methods and technologies can challenge how all structures are represented. The wwPDB addresses these challenges with “remediation” efforts to improve data representation in support of making PDB data FAIR. Understanding the structure and organization of carbohydrates is critical to comprehending their biological roles in health and disease. Approximately 10% of PDB structures contain carbohydrates. However, there is a lack of uniform representation for carbohydrates due to their complex nature: they exhibit stereo‐isomers, anomeric configurations, and branched chains. As a result, glycoscientists and other experts are unable to fully utilize the rich structural information available for these structures. Working with the glycoscience community, carbohydrate‐appropriate annotation tools are being developed and implemented within the OneDep system for deposition, validation, and biocuration of PDB structures. These software tools will provide standard nomenclature and consistent oligosaccharide representation that can be easily translated to other representations commonly used by glycobiologists.
wwPDB carbohydrate remediation efforts involve: (1) standardizing sugar nomenclature following IUPAC/IUBMB; (2) providing a uniform polymer representation for polysaccharides with appropriate descriptor(s); (3) adopting community software for reliable carbohydrate identification, assignment of standard nomenclature, and detection of intra/inter‐molecular connectivity between monosaccharides and other molecules/proteins; and (4) providing intra‐ and inter‐molecular connectivity at atom level explicitly.
SugarSketcher is an intuitive and fast JavaScript interface module for online drawing of glycan structures in the popular Symbol Nomenclature for Glycans (SNFG) notation and exporting them to various commonly used formats encoding carbohydrate sequences (e.g., GlycoCT) or quality images (e.g., svg). It does not require a backend server or any specific browser plugins and can be integrated in any web glycoinformatics project. SugarSketcher allows both glycobiologists and non-expert users to draw glycans. The “quick mode” allows a newcomer to build up a glycan structure with only limited knowledge of carbohydrate chemistry. The “normal mode” integrates advanced options which enable glycobiologists to tailor complex carbohydrate structures. The source code is freely available on GitHub and glycoinformaticians are encouraged to participate in the development process, while users are invited to test a prototype available on the ExPASy website and send feedback.
Glycans are key molecules in many physiological and pathological processes. As with other molecules, like proteins, visualization of the 3D structures of glycans adds valuable information for understanding their biological function. Hence, here we introduce Azahar, a computing environment for the creation, visualization and analysis of glycan molecules. Azahar is implemented in Python and works as a plugin for the well-known PyMOL package (Schrodinger in The PyMOL molecular graphics system, version 1.3r1, 2010). Besides the visualization and analysis options already provided by PyMOL, Azahar includes three cartoon-like representations and tools for 3D structure characterization, such as a conformational search using a Monte Carlo with minimization routine, as well as tools to analyse single glycans or trajectories/ensembles, including the calculation of radius of gyration, Ramachandran plots and hydrogen bonds. Azahar is freely available to download from http://www.pymolwiki.org/index.php/Azahar and the source code is available at https://github.com/agustinaarroyuelo/Azahar .
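Of the analyses listed in this abstract, the radius of gyration is simple enough to illustrate without PyMOL. The following is a minimal sketch of the standard mass-weighted formula, not Azahar's own code; the function name and the toy coordinates are hypothetical.

```python
import math

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of a set of 3D points."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: equal weights when no masses are given
    total = sum(masses)
    # Centre of mass, one coordinate at a time
    center = [sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)]
    # Mass-weighted mean squared distance from the centre of mass
    msd = sum(
        m * sum((c[i] - center[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    ) / total
    return math.sqrt(msd)

# Toy input (hypothetical): four equal-mass points on the corners of a 2 x 2 square
square = [(1, 1, 0), (1, -1, 0), (-1, 1, 0), (-1, -1, 0)]
print(radius_of_gyration(square))
```

For a trajectory or ensemble, the same function would simply be applied to each frame in turn.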
The Linear Code is a new syntax for representing glycoconjugates and their associated molecules in a simple linear fashion. Similar to the straightforward single letter nomenclature of DNA and proteins, Linear Code presents glycoconjugates in a canonic, compact and practical form while accounting for all relevant stereochemical and structural configurations. It uses a single letter code to represent each monosaccharide and includes a condensed description of the connections between monosaccharides and their modifications, allowing a simple linear representation of these compounds. The new linear syntax enables the implementation of bioinformatics tools for investigation and analysis of glyco-molecules and their biology.
This article describes features, usage, and application of a CSDB/SNFG Structure Editor, a new online tool for quick and intuitive input of carbohydrate and derivative structures using Symbol Nomenclature for Glycans (SNFG). The Editor is built on a platform of the Carbohydrate Structure Database (CSDB) and relies on its online services via the dedicated web-API. The Editor allows building of oligo- and polymeric glycan structures and supports most features of natural glycans, such as underdetermined structures, alternative branches, repeating subunits, SMILES specification of atypical monomers, and others. The vocabulary of building blocks contains 600+ monomeric residues, including 327 monosaccharides. Support for SMILES allows input and visualization of chemical structures of virtually unlimited complexity. On the other hand, the interface follows the recognized GlycanBuilder style, which is easy for novice users. The export feature includes support for CSDB Linear, GlycoCT, WURCS, SweetDB, and Glycam notations, SMILES codes, MOL/PDB atomic coordinate formats, raster and vector SNFG images, and on-the-fly visualization as 2D structural formulas and 3D molecular models. Integration of the Editor into any web-based glycoinformatics project is straightforward and simple, similarly to any other modern JavaScript application.
The use of proteomics databases has become indispensable for daily work of molecular biologists, but this situation has not yet been achieved for carbohydrate applications. One obvious reason is that existing data collections are only rarely annotated and no cross-linking to other resources exists. The existence of a generally accepted linear, canonical description for carbohydrates which can be readily processed by computers will enable efficient automatic cross-linking of distributed carbohydrate data collections by serving as a unique and unambiguous database access key. Various possibilities to derive a canonical notation are discussed. They can be divided into attempts that require structure description alone and alternatives that profit from the fact that a preferred graph direction (non-reducing to reducing end) exists within the structure. To open a fruitful discussion among glycoscientists a possible solution is presented where the reducing monosaccharide unit is selected as graph root and linkage information is used to define the priority of the various branches. A Web interface (http://www.dkfz.de/spec/linucs/) has been created that directly converts the commonly used extended representation of complex carbohydrates into the preferred canonical description or into its inverted form.
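The proposal in this abstract (reducing-end residue as graph root, branch priority taken from linkage information) can be illustrated with a short sketch. This is a hedged illustration of the idea only: the `Residue` class, the `canonical` function and the bracket layout are assumptions made for this example and are not guaranteed to match the output of the LINUCS web interface.

```python
from dataclasses import dataclass, field

@dataclass
class Residue:
    name: str                                     # e.g. "b-D-GlcpNAc"
    children: list = field(default_factory=list)  # (parent_pos, child_pos, Residue)

    def attach(self, parent_pos, child, child_pos=1):
        """Link `child` (via its child_pos) to position parent_pos of this residue."""
        self.children.append((parent_pos, child_pos, child))
        return child

def canonical(residue, linkage=""):
    """Serialise the tree from the root; branches are ordered by linkage position."""
    ordered = sorted(residue.children, key=lambda link: (link[0], link[1], link[2].name))
    inner = "".join(canonical(child, f"({p}+{c})") for p, c, child in ordered)
    return f"[{linkage}][{residue.name}]{{{inner}}}"

def core(branch_order):
    """Hypothetical test input: an N-glycan core with two mannose branches."""
    root = Residue("b-D-GlcpNAc")                 # reducing end is the graph root
    second = root.attach(4, Residue("b-D-GlcpNAc"))
    man = second.attach(4, Residue("b-D-Manp"))
    for pos in branch_order:
        man.attach(pos, Residue("a-D-Manp"))
    return root

# The order in which branches were entered does not change the key
print(canonical(core([6, 3])) == canonical(core([3, 6])))
print(canonical(core([6, 3])))
```

Because the branches are sorted before serialisation, two users who draw the same glycan with the branches in a different order obtain the same string, which is what makes such a notation usable as a database access key.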
A novel system of substructure codes has been developed to characterize the spherical environment of single atoms and complete ring systems. The codes are generated automatically from topologically represented chemical structures and serve to describe structural entities corresponding to spectral parameters uniquely. Their hierarchical order permits desired substructures and the corresponding chemical shifts to be sought in inverted files generated from a larger data base, thereby facilitating the estimation of unknown spectra.
SUMMARY: This manuscript describes an open-source program, DrawGlycan-SNFG (version 2), that accepts IUPAC (International Union of Pure and Applied Chemistry)-condensed inputs to render Symbol Nomenclature For Glycans (SNFG) drawings. A wide range of local and global options enable display of various glycan/peptide modifications including bond breakages, adducts, repeat structures, ambiguous identifications, etc. These facilities make DrawGlycan-SNFG ideal for integration into various glycoinformatics software, including glycomics and glycoproteomics mass spectrometry applications. As a demonstration of such usage, we incorporated DrawGlycan-SNFG into gpAnnotate, a standalone application to score and annotate individual MS/MS glycopeptide spectra in different fragmentation modes. AVAILABILITY AND IMPLEMENTATION: DrawGlycan-SNFG and gpAnnotate are platform independent. While originally coded using MATLAB, compiled packages are also provided to enable DrawGlycan-SNFG implementation in Python and Java. All programs are available from https://virtualglycome.org/drawglycan; https://virtualglycome.org/gpAnnotate. SUPPLEMENTARY INFORMATION: Supplementary Material is available at Bioinformatics online.
Glycan or carbohydrate structures can be pictorially represented using symbolic nomenclatures. The symbol nomenclature for glycans (SNFG) contains 67 different monosaccharides represented using various colors and geometric shapes. A simple tool to convert International Union of Pure and Applied Chemistry (IUPAC) format text to SNFG will be useful for sketching glycans and glycopeptides. Such code can also enable the development of more sophisticated applications, where the visual representation of carbohydrate structures is necessary. To address this need, the current manuscript describes DrawGlycan-SNFG, a freely available, platform-independent, open-source tool. It allows: i. the display of glycans and glycopeptides from IUPAC-condensed text inputs and ii. the depiction of glycan and glycopeptide fragments. The online version of this program is provided with a user-friendly web interface at www.virtualglycome.org/DrawGlycan. Downloadable, stand-alone GUI (Graphical User Interface) version and the program source code are also available from this repository. DrawGlycan-SNFG will be useful for experimentalists looking for a ready to use, simple program for sketching carbohydrates and for software developers interested in incorporating SNFG into their program suite.
We report the addition of two visualisation algorithms, termed PaperChain and Twister, to the freely available Visual Molecular Dynamics (VMD) package. These algorithms produce visualisations of complex cyclic molecules and multi-branched polysaccharides and are a generalization and optimization of those we previously developed in a standalone package for carbohydrates. PaperChain highlights each ring in a molecular structure with a polygon, which is coloured according to the ring pucker. Twister traces glycosidic bonds with a ribbon that twists according to the relative orientation of successive sugar residues. Combination of these novel algorithms and new ring selection statements with the large set of visualisations already available in VMD allows for unprecedented flexibility in the level of detail displayed for glycoconjugate, glycoprotein and carbohydrate-binding protein structures, as well as other cyclic structures. We highlight the efficacy of these algorithms with selected illustrative examples, clearly demonstrating the value of the new visualisations, not only for structure validation, but also for facilitating insights into molecular structure and mechanism.
During the EUROCarbDB project our group developed the GlycanBuilder and GlycoWorkbench glycoinformatics tools. This short communication summarizes the capabilities of these two tools and updates which have been made since the original publications in 2007 and 2008. GlycanBuilder is a tool that allows for the fast and intuitive drawing of glycan structures; this tool can be used standalone, embedded in web pages and can also be integrated into other programs. GlycoWorkbench has been designed to semi-automatically annotate glycomics data. This tool can be used to annotate mass spectrometry (MS) and MS/MS spectra of free oligosaccharides, N and O-linked glycans, GAGs (glycosaminoglycans) and glycolipids, as well as MS spectra of glycoproteins.
Symbolic diagrams are commonly used to depict N- and O-linked glycans but there is no general consensus as to how individual constituent monosaccharides or linkages are shown. This article proposes a system that avoids ambiguities inherent in most other systems and is appropriate for both hand drawing and computer applications. Constituent monosaccharides are depicted by shapes modified to show OAc, deoxy, etc. Linkage is indicated by the bond angle and anomericity by solid (beta) or dashed (alpha) lines.
Since its public introduction in 2005 the IUPAC InChI chemical structure identifier standard has become the international, worldwide standard for defined chemical structures. This article will describe the extensive use and dissemination of the InChI and InChIKey structure representations by and for the world-wide chemistry community, the chemical information community, and major publishers and disseminators of chemical and related scientific offerings in manuscripts and databases.
As part of the EUROCarbDB project (www.eurocarbdb.org) we have carefully analyzed the encoding capabilities of all existing carbohydrate sequence formats and the content of publicly available structure databases. We have found that none of the existing structural encoding schemata are capable of coping with the full complexity to be expected for experimentally derived structural carbohydrate sequence data across all taxonomic sources. This gap motivated us to define an encoding scheme for complex carbohydrates, named GlycoCT, to overcome the current limitations. This new format is based on a connection table approach, instead of a linear encoding scheme, to describe the carbohydrate sequences, with a controlled vocabulary to name monosaccharides, adopting IUPAC rules to generate a consistent, machine-readable nomenclature. The format uses a block concept to describe frequently occurring special features of carbohydrate sequences like repeating units. It exists in two variants, a condensed form and a more verbose XML syntax. Sorting rules assure the uniqueness of the condensed form, thus making it suitable as a direct primary key for database applications, which rely on unique identifiers. GlycoCT encompasses the capabilities of the heterogeneous landscape of digital encoding schemata in glycomics and is thus a step forward on the way to a unified and broadly accepted sequence format in glycobioinformatics.
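To make the connection table approach concrete, here is a minimal sketch that reads a small condensed GlycoCT record into a residue table and a linkage list. The input string is a hand-written example intended to encode Gal(b1-4)GlcNAc, and the parser is an assumption-laden illustration: it handles only plain `RES` and `LIN` lines and ignores repeat blocks, alternative positions and other features of the format.

```python
import re

# Hand-written example (assumed to be valid condensed GlycoCT for Gal(b1-4)GlcNAc)
GLYCOCT = """\
RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dgal-HEX-1:5
LIN
1:1d(2+1)2n
2:1o(4+1)3d
"""

RES_LINE = re.compile(r"^(\d+)([a-z]):(.+)$")
LIN_LINE = re.compile(r"^(\d+):(\d+)([a-z])\((-?\d+)\+(-?\d+)\)(\d+)([a-z])$")

def parse_glycoct(text):
    """Return (residues, linkages) for a simple condensed GlycoCT record."""
    residues, linkages, section = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("RES", "LIN"):
            section = line               # switch between the two blocks
            continue
        if section == "RES":
            match = RES_LINE.match(line)
            if not match:
                raise ValueError(f"unsupported RES line: {line}")
            index, kind, descriptor = match.groups()
            residues[int(index)] = {"kind": kind, "descriptor": descriptor}
        elif section == "LIN":
            match = LIN_LINE.match(line)
            if not match:
                raise ValueError(f"unsupported LIN line: {line}")
            _, parent, _, parent_pos, child_pos, child, _ = match.groups()
            linkages.append({"parent": int(parent), "parent_pos": int(parent_pos),
                             "child": int(child), "child_pos": int(child_pos)})
        else:
            raise ValueError(f"line outside RES/LIN block: {line}")
    return residues, linkages

residues, linkages = parse_glycoct(GLYCOCT)
print(residues)
print(linkages)
```

Keeping residues and linkages in separate tables is what distinguishes this style from linear notations: branching needs no nested brackets, only additional rows in the linkage list.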
This chapter provides an introduction to the different ways of storing carbohydrate sequence data for oligo- and polysaccharides in electronic repositories. It starts with a short historical outline and the major capabilities of the existing formats and continues with a more detailed description of the individual sequence formats for carbohydrates. It covers the principles of the established formats of Carbbank, LINUCS, KCF, BCSDB, LinearCode, CabosML and Glyde, while other symbolic descriptions and small molecule formats are also briefly described. The chapter is concluded with an outlook on recent developments in the area of sequence formats and monosaccharide standardisation for oligo- and polysaccharides.
VMD is a molecular graphics program designed for the display and analysis of molecular assemblies, in particular biopolymers such as proteins and nucleic acids. VMD can simultaneously display any number of structures using a wide variety of rendering styles and coloring methods. Molecules are displayed as one or more "representations," in which each representation embodies a particular rendering method and coloring scheme for a selected subset of atoms. The atoms displayed in each representation are chosen using an extensive atom selection syntax, which includes Boolean operators and regular expressions. VMD provides a complete graphical user interface for program control, as well as a text interface using the Tcl embeddable parser to allow for complex scripts with variable substitution, control loops, and function calls. Full session logging is supported, which produces a VMD command script for later playback. High-resolution raster images of displayed molecules may be produced by generating input scripts for use by a number of photorealistic image-rendering applications. VMD has also been expressly designed with the ability to animate molecular dynamics (MD) simulation trajectories, imported either from files or from a direct connection to a running MD simulation. VMD is the visualization component of MDScope, a set of tools for interactive problem solving in structural biology, which also includes the parallel MD program NAMD, and the MDCOMM software used to connect the visualization and simulation programs. VMD is written in C++, using an object-oriented design; the program, including source code and extensive documentation, is freely available via anonymous ftp and through the World Wide Web.
The GlycoViewer (http://www.systemsbiology.org.au/glycoviewer) is a web-based tool that can visualize, summarize and compare sets of glycan structures. Its input is a group of glycan structures; these can be entered as a list in IUPAC format or via a sugar structure builder. Its output is a detailed graphic, which summarizes all salient features of the glycans according to the shapes of the core structures, the nature and length of any chains, and the types of terminal epitopes. The tool can summarize up to hundreds of structures in a single figure. This allows unique, high-level views to be generated of glycans from one protein, from a cell, a tissue or a whole organism. Use of the tool is illustrated in the analysis of normal and disease-associated glycans from the human glycoproteome.
Bioinformatics resources for glycomics are very poor as compared with those for genomics and proteomics. The complexity of carbohydrate sequences makes it difficult to define a common language to represent them, and the development of bioinformatics tools for glycomics has not progressed. In this study, we developed a carbohydrate sequence markup language (CabosML), an XML description of carbohydrate structures. AVAILABILITY: The language definition (XML Schema) and an experimental database of carbohydrate structures using an XML database management system are available at http://www.phoenix.hydra.mki.co.jp/CabosDemo.html CONTACT: kikuchi@hydra.mki.co.jp.
Standard molecular visualizations, such as the classic ball-and-stick model, are not suitable for large, complex molecules because the overall molecular structure is obscured by the atomic detail. For proteins, the more abstract ribbon and cartoon representations are instead used to reveal large scale molecular conformation and connectivity. However, there is currently no accepted convention for simplifying oligo- and polysaccharide structures. We introduce two novel visualization algorithms for carbohydrates, incorporated into a visualization package, CarboHydra. Both algorithms highlight the sugar rings and backbone conformation of the carbohydrate chain, ignoring ring substituents. The first algorithm, termed PaperChain, emphasizes the type and conformation of the carbohydrate rings. The second, Twister, emphasizes the relative orientation of the rings. We further include two rendering enhancements to augment these visualizations: silhouette edges and a translucent overlay of the ball-and-stick atomic representation. To demonstrate their utility, the algorithms and visualization enhancements are here applied to a variety of carbohydrate molecules. User evaluations indicate that they present a more useful view of carbohydrate structure than the standard ball-and-stick representation. The algorithms were found to be complementary, with PaperChain particularly effective for smaller carbohydrates and Twister useful at larger scales for highlighting the backbone twist of polysaccharides.
Drawing and visualisation of molecular structures are some of the most common tasks carried out in structural glycobiology, typically using various software. In this perspective article, we outline developments in the computational tools for the sketching, visualisation and modelling of glycans. The article also provides details on the standard representation of glycans, and glycoconjugates, which helps the communication of structure details within the scientific community. We highlight the comparative analysis of the available tools which could help researchers to perform various tasks related to structure representation and model building of glycans. These tools can be useful for glycobiologists or any researcher looking for a ready to use, simple program for the sketching or building of glycans.
Glycans are essential to all scales of biology, with their intricate structures being crucial for their biological functions. The structural complexity of glycans is communicated through simplified and unified visual representations according to the Symbol Nomenclature for Glycans (SNFG) guidelines adopted by the community. Here, we introduce GlycoDraw, a Python-native implementation for high-throughput generation of high-quality, SNFG-compliant glycan figures with flexible display options. GlycoDraw is released as part of our glycan analysis ecosystem, glycowork, facilitating integration into existing workflows by enabling fully automated annotation of glycan-related figures and thus assisting the analysis of e.g. differential abundance data or glycomics mass spectra.
Various glycobioinformatics resources have developed individual carbohydrate sequence formats to store and handle glycan data. This diversity of sequence formats is one of the major reasons for a rather low interoperability of glycobioinformatics resources. The formats have often been optimized to serve special requirements of the individual resources and are thus not fully compatible, but in many cases translation from one format to another is possible. This chapter summarizes some of the major glycan sequence formats and demonstrates the use of tools for translation between these formats. Some pitfalls that users of sequence conversion tools need to pay attention to are also illustrated.
A variety of different notation formats is used by current glycoinformatics resources, which includes a diversity of residue names to encode carbohydrate building blocks. Not only residue nomenclature but even the delimitation of individual residues can differ between these formats. Within individual formats, multiple names for particular monosaccharides may be in use, caused by inconsistent use of trivial names, varying handling of substituents, and the inherent peculiarities of monosaccharide notation in combination with specific modifications. Such inconsistencies do not only hamper data exchange between different resources but also complicate queries in single databases. The problems can be addressed by using dictionaries of standard residue names. However, it is virtually impossible to pre-compile a complete dictionary of all feasible monosaccharides, let alone one of all potential synonyms in the various notation formats. Instead, routines to parse monosaccharide names into residue properties and to generate unique names from these properties can be used to solve this issue.
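The closing suggestion of this abstract (parse a name into residue properties, then regenerate one unique name from those properties) can be sketched as follows. This is a deliberately tiny illustration and not the MonosaccharideDB routines: the regular expression, the synonym table, the `?` placeholder for unknown properties and the output layout are all assumptions chosen for this example.

```python
import re

# Synonyms for the anomeric descriptor (illustrative, far from exhaustive)
ANOMER = {"a": "a", "alpha": "a", "α": "a", "b": "b", "beta": "b", "β": "b"}

NAME = re.compile(
    r"^(?:(?P<anomer>alpha|beta|a|b|α|β)-)?"  # optional anomer prefix
    r"(?P<config>[DL])-"                       # absolute configuration
    r"(?P<stem>[A-Z][a-z]{2})"                 # three-letter stem, e.g. Glc
    r"(?P<ring1>[pf])?"                        # ring form written before the substituent
    r"(?P<subst>NAc|N|A)?"                     # a few common modifications only
    r"(?P<ring2>[pf])?$"                       # or ring form written after it
)

def parse_name(name):
    """Split a monosaccharide name into a dictionary of residue properties."""
    match = NAME.match(name)
    if not match:
        raise ValueError(f"cannot parse residue name: {name}")
    parts = match.groupdict()
    return {
        "anomer": ANOMER.get(parts["anomer"]),       # None when not specified
        "config": parts["config"],
        "stem": parts["stem"],
        "ring": parts["ring1"] or parts["ring2"],    # None when not specified
        "substituent": parts["subst"] or "",
    }

def unique_name(props):
    """Generate one name per property set; '?' marks an unknown property."""
    return (f"{props['anomer'] or '?'}-{props['config']}-{props['stem']}"
            f"{props['ring'] or '?'}{props['substituent']}")

for synonym in ("b-D-GlcpNAc", "beta-D-GlcNAcp", "β-D-GlcpNAc"):
    print(synonym, "->", unique_name(parse_name(synonym)))
```

The point of the design is that no dictionary of complete names is needed: any spelling that parses to the same properties collapses onto the same generated name.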
Various glycoinformatics resources developed their individual notations for encoding of glycan structure information. Therefore, translation of glycan structures is required for an efficient use of multiple resources and for data exchange. A major problem for translations is residue notation, because individual notations use different names to encode the same residues, and the number of different monosaccharides is too large to quickly build a translation table manually. MonosaccharideDB offers various means to perform translation of carbohydrate residue names. This chapter illustrates the usage of the MonosaccharideDB web interface for both manual and automated conversion and validation of glycan residues.
We report a new classification method for pyranose ring conformations called Best-fit, Four-Membered Plane (BFMP), which describes pyranose ring conformations based on reference planes defined by four atoms. The method is able to characterize all asymmetrical and symmetrical shapes of a pyran ring, is readily automated, easy to interpret, and maps trivially to IUPAC definitions. It also provides a qualitative measurement of the distortion of the ring. Example applications include the analysis of data from crystal structures and molecular dynamics simulations.
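The core idea of a four-atom reference plane can be illustrated with a short sketch. This is not the published BFMP algorithm, which covers all canonical ring shapes; it is a simplified assumption-based example that only considers the three planes whose two excluded atoms sit opposite each other in the ring, picks the most planar one, and reports whether the excluded atoms lie on opposite sides (chair-like) or the same side (boat-like). All function names and the idealised coordinates are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def signed_distance(p, a, b, c):
    """Signed distance of point p from the plane through a, b and c."""
    normal = cross(sub(b, a), sub(c, a))
    return dot(normal, sub(p, a)) / math.sqrt(dot(normal, normal))

def classify(ring, tol=1e-6):
    """Return (shape, excluded_atoms, deviation) for six ring atoms given in ring order."""
    best = None
    for i in range(3):
        out = (i, i + 3)                              # two atoms opposite each other
        plane = [k for k in range(6) if k not in out]
        a, b, c, d = (ring[k] for k in plane)
        deviation = abs(signed_distance(d, a, b, c))  # how far the 4th atom leaves the plane
        if best is None or deviation < best[0]:
            best = (deviation, out, plane)
    deviation, out, plane = best
    a, b, c = (ring[k] for k in plane[:3])
    d1 = signed_distance(ring[out[0]], a, b, c)
    d2 = signed_distance(ring[out[1]], a, b, c)
    if abs(d1) < tol and abs(d2) < tol:
        shape = "planar"
    else:
        shape = "chair-like" if d1 * d2 < 0 else "boat-like"
    return shape, out, deviation

def ideal_ring(heights, radius=1.0):
    """Six atoms on a circle with the given out-of-plane heights (idealised geometry)."""
    return [(radius * math.cos(math.radians(60 * k)),
             radius * math.sin(math.radians(60 * k)), z) for k, z in enumerate(heights)]

chair = ideal_ring([0.25, -0.25, 0.25, -0.25, 0.25, -0.25])
boat = ideal_ring([0.25, 0.0, 0.0, 0.25, 0.0, 0.0])
print(classify(chair)[0], classify(boat)[0])
```

The size of the two out-of-plane distances gives a rough, qualitative measure of how strongly the ring is puckered, in the spirit of the distortion measure mentioned in the abstract.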
BACKGROUND: High-throughput technologies have become common tools to decipher genome-wide changes of gene expression (GE) patterns. Functional analysis of GE patterns is a daunting task as it often requires recourse to the public repositories of biological knowledge. On the other hand, in many cases a researcher's inquiry can be served by a comprehensive glimpse. The KEGG PATHWAY database is a compilation of manually verified maps of biological interactions represented by the complete set of pathways related to signal transduction and other cellular processes. Rapid mapping of the differentially expressed genes to the KEGG pathways may provide an idea about the functional relevance of the gene lists corresponding to the high-throughput expression data. RESULTS: Here we present a web-based graphic tool, KEGG Pathway Painter (KPP). KPP paints pathways from the KEGG database using large sets of candidate genes accompanied by "overexpressed" or "underexpressed" marks, for example, those generated by microarrays or miRNA profiling. CONCLUSION: KPP provides fast and comprehensive visualization of the global GE changes by consolidating a list of the color-coded candidate genes into the KEGG pathways. KPP is freely available and can be accessed at http://web.cos.gmu.edu/~gmanyam/kegg/.
Accurate representation of structural ambiguity is important for storing carbohydrate structures containing varying levels of ambiguity in the literature and databases. Although many representations for carbohydrates have been developed in the past, a generalized but discrete representation format did not exist. We had previously developed the Web3 Unique Representation of Carbohydrate Structures (WURCS) in an attempt to define a generalizable and unique linear representation for carbohydrate structures. However, it lacked sufficient rules to uniquely describe ambiguous structures. In this work, we updated WURCS to handle such ambiguous monosaccharide structures. In particular, to handle structural ambiguity around (potential) carbonyl groups incidental to the carbohydrate analysis, we defined a representation of backbone carbons containing atomic-level ambiguity. As a result, we show that WURCS 2.0 can represent a wider variety of carbohydrate structures containing ambiguous monosaccharides, such as those whose ring closure is undefined or whose anomeric information is only known. This new format provides a representation of carbohydrates that was not possible before, and it is currently being used by the International Glycan Structure Repository GlyTouCan.
These Recommendations expand and replace the Tentative Rules for Carbohydrate Nomenclature [1] issued in 1969 jointly by the IUPAC Commission on the Nomenclature of Organic Chemistry and the IUB-IUPAC Commission on Biochemical Nomenclature (CBN) and reprinted in [2]. They also replace other published JCBN Recommendations [3-7] that deal with specialized areas of carbohydrate terminology; however, these documents can be consulted for further examples. Of relevance to the field, though not incorporated into the present document, are the following recommendations: Nomenclature of cyclitols, 1973; Numbering of atoms in myo-inositol, 1988; Symbols for specifying the conformation of polysaccharide chains, 1981; Nomenclature of glycoproteins, glycopeptides and peptidoglycans, 1985; Nomenclature of glycolipids, in preparation. The present Recommendations deal with the acyclic and cyclic forms of monosaccharides and their simple derivatives, as well as with the nomenclature of oligosaccharides and polysaccharides. They are additional to the Definitive Rules for the Nomenclature of Organic Chemistry [13,14] and are intended to govern those aspects of the nomenclature of carbohydrates not covered by those rules.
The close-range interactions provided by covalently linked glycans are essential for the correct folding of glycoproteins and also play a pivotal role in recognition processes. Being able to visualise protein-glycan and glycan-glycan contacts in a clear way is thus of great importance for the understanding of these biological processes. In structural terms, glycosylation sugars glue the protein together via hydrogen bonds, whereas non-covalently bound glycans frequently harness additional stacking interactions. Finding an unobscured molecular view of these multipartite scenarios is usually far from trivial; in addition to the need to show the interacting protein residues, glycans may contain many branched sugars, each composed of more than ten non-H atoms and offering more than three potential bonding partners. With structural glycoscience finally gaining popularity and steadily increasing the deposition rate of three-dimensional structures of glycoproteins, the need for a clear way of depicting these interactions is more pressing than ever. Here a schematic representation, named Glycoblocks, is introduced which combines a simplified bonding-network depiction (covering hydrogen bonds and stacking interactions) with the familiar two-dimensional glycan notation used by the glycobiology community, brought into three dimensions by the CCP4 molecular graphics project (CCP4mg).
MOTIVATION: Glycan structures are commonly represented using symbols or linear nomenclature such as that from the Consortium for Functional Glycomics (also known as modified IUPAC-condensed nomenclature). No current tool allows for writing the name in such format using a graphical user interface (GUI); thus, names are prone to errors or non-standardized representations. RESULTS: Here we present GlycoGlyph, a web application built using JavaScript, which is capable of drawing glycan structures using a GUI and providing the linear nomenclature as an output or using it as an input in a dynamic manner. GlycoGlyph also allows users to save the structures as an SVG vector graphic, and allows users to export the structure as condensed GlycoCT. AVAILABILITY AND IMPLEMENTATION: The application can be used at: https://glycotoolkit.com/Tools/GlycoGlyph/. The application is tested to work in modern web browsers such as Firefox or Chrome. CONTACT: aymehta@bidmc.harvard.edu or rcummin1@bidmc.harvard.edu.
The Symbol Nomenclature for Glycans (SNFG) is a community-curated standard for the depiction of monosaccharides and complex glycans using various color-coded geometric shapes, along with defined text additions. It is hosted by the National Center for Biotechnology Information (NCBI) at the NCBI-Glycans Page (www.ncbi.nlm.nih.gov/glycans/snfg.html). Several changes have been made to the SNFG page in the past year to update the rules for depicting glycans using the SNFG, to include more examples of use, particularly for non-mammalian organisms, and to provide guidelines for the depiction of ambiguous glycan structures. This Glycoforum article summarizes these recent changes.
This chapter offers a general background for researchers embarking on glycoscience to grasp the evolution and present status of the nomenclature(s) and representation(s) of glycans and complex carbohydrates. The availability of high-performance computing and the application of data mining are opening new paths to discovery. The field of structural glycobiology has benefited from such advances with the development of tools and databases for the structural and functional analysis of carbohydrates. There is a need to conform to the recommendations of nomenclatures of carbohydrates while meeting the constraints imposed by the developing field of glycobiology in terms of visualization and encoding. The present chapter describes the nomenclatures, symbols, and presentations that form part of the “language” used to communicate more effectively and used in different databases. In addition, some issues related to the interoperability of glycan databases are addressed. The semantic web approach further promotes the description and integration of structural and experimental metadata through the development of ontologies for domain knowledge representation.
A molecular visualization program tailored to deal with the range of 3D structures of complex carbohydrates and polysaccharides, either alone or in their interactions with other biomacromolecules, has been developed using advanced technologies elaborated by the video games industry. All the specific structural features displayed by the simplest to the most complex carbohydrate molecules have been considered and can be depicted. This concerns the monosaccharide identification and classification, conformations, location in single or multiple branched chains, depiction of secondary structural elements and the essential constituting elements in very complex structures. Particular attention was given to cope with the accepted nomenclature and pictorial representation used in glycoscience. This achievement provides a continuum between the most popular ways to depict the primary structures of complex carbohydrates to visualizing their 3D structures while giving the users many options to select the most appropriate modes of representations including new features such as those provided by the use of textures to depict some molecular properties. These developments are incorporated in a stand-alone viewer capable of displaying molecular structures, biomacromolecule surfaces and complex interactions of biomacromolecules, with powerful, artistic and illustrative rendering methods. They result in an open source software compatible with multiple platforms, i.e., Windows, MacOS and Linux operating systems, web pages, and producing publication-quality figures. The algorithms and visualization enhancements are demonstrated using a variety of carbohydrate molecules, from glycan determinants to glycoproteins and complex protein-carbohydrate interactions, as well as very complex mega-oligosaccharides and bacterial polysaccharides and multi-stranded polysaccharide architectures.
The GLYcan Data Exchange (GLYDE) standard has been developed for the representation of the chemical structures of monosaccharides, glycans and glycoconjugates using a connection table formalism formatted in XML. This format allows structures, including those that do not exist in any database, to be unambiguously represented and shared by diverse computational tools. GLYDE implements a partonomy model based on human language along with rules that provide consistent structural representations, including a robust namespace for specifying monosaccharides. This approach facilitates the reuse of data processing software at the level of granularity that is most appropriate for extraction of the desired information. GLYDE-II has already been used as a key element of several glycoinformatics tools. The philosophical and technical underpinnings of GLYDE-II and recent implementation of its enhanced features are described.
MOTIVATION: The interactive visualization of very large macromolecular complexes on the web is becoming a challenging problem as experimental techniques advance at an unprecedented rate and deliver structures of increasing size. RESULTS: We have tackled this problem by developing highly memory-efficient and scalable extensions for the NGL WebGL-based molecular viewer and by using MMTF, a binary and compressed Macromolecular Transmission Format. These enable NGL to download and render molecular complexes with millions of atoms interactively on desktop computers and smartphones alike, making it a tool of choice for web-based molecular visualization in research and education. AVAILABILITY AND IMPLEMENTATION: The source code is freely available under the MIT license at github.com/arose/ngl and distributed on NPM (npmjs.com/package/ngl). MMTF-JavaScript encoders and decoders are available at github.com/rcsb/mmtf-javascript.
The NGL Viewer (http://proteinformatics.charite.de/ngl) is a web application for the visualization of macromolecular structures. By fully adopting capabilities of modern web browsers, such as WebGL, for molecular graphics, the viewer can interactively display large molecular complexes and is also unaffected by the retirement of third-party plug-ins like Flash and Java Applets. Generally, the web application offers comprehensive molecular visualization through a graphical user interface so that life scientists can easily access and profit from available structural data. It supports common structural file-formats (e.g. PDB, mmCIF) and a variety of molecular representations (e.g. 'cartoon, spacefill, licorice'). Moreover, the viewer can be embedded in other web sites to provide specialized visualizations of entries in structural databases or results of structure-related calculations.
The amount of glycomics data being generated is rapidly increasing as a result of improvements in analytical and computational methods. Correlation and analysis of this large, distributed data set requires an extensible and flexible representational standard that is also 'understood' by a wide range of software applications. An XML-based data representation standard that faithfully captures essential structural details of a glycan moiety along with additional information (such as data provenance) to aid the interpretation and usage of glycan data, will facilitate the exchange of glycomics data across the scientific community. To meet this need, we introduce GLYcan Data Exchange (GLYDE) standard as an XML-based representation format to enable interoperability and exchange of glycomics data. An online tool for the conversion of other representations to GLYDE format has been developed.
We present the LiteMol suite, a tool for visualizing large macromolecular structure data sets that is freely available at https://www.litemol.org.
The representation of carbohydrates in 3D space using symbols is a powerful visualization method, but such representations are lacking in currently available visualization software. The work presented here allows researchers to display carbohydrate 3D structures as 3D-SNFG symbols using LiteMol from a web browser (e.g., v.litemol.org/?loadFromCS=5T3X). Any PDB ID can be substituted at the end of the URL. Alternatively, the user may enter a PDB ID or upload a structure. LiteMol is available at https://v.litemol.org and automatically depicts any carbohydrate residues as 3D-SNFG symbols. To embed LiteMol in a webpage, visit https://github.com/dsehnal/LiteMol.
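The URL pattern quoted in the abstract above can be sketched in a few lines of Python. This is only an illustration: the function name `litemol_url` and the check for classic four-character PDB IDs are assumptions made here, not part of LiteMol.

```python
import re

# Base address and query parameter as quoted in the abstract.
LITEMOL_VIEWER = "https://v.litemol.org/"


def litemol_url(pdb_id: str) -> str:
    """Return a LiteMol viewer URL that loads the given PDB entry."""
    pdb_id = pdb_id.strip().upper()
    # Assumption: classic PDB IDs are a digit followed by three alphanumerics.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return f"{LITEMOL_VIEWER}?loadFromCS={pdb_id}"


if __name__ == "__main__":
    print(litemol_url("5t3x"))
```

Calling `litemol_url("5t3x")` reproduces the example address from the abstract with the `https://` scheme added.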
In recent years, the Semantic Web has become the focus of life science database development as a means to link life science data in an effective and efficient manner. In order for carbohydrate data to be applied to this new technology, there are two requirements for carbohydrate data representations: (1) a linear notation which can be used as a URI (Uniform Resource Identifier) if needed and (2) a unique notation such that any published glycan structure can be represented distinctively. This latter requirement includes the possible representation of nonstandard monosaccharide units as a part of the glycan structure, as well as compositions, repeating units, and ambiguous structures where linkages/linkage positions are unidentified. Therefore, we have developed the Web3 Unique Representation of Carbohydrate Structures (WURCS) as a new linear notation for representing carbohydrates for the Semantic Web.
no abstract.
The CSDB Linear notation for carbohydrate sequences utilized in the Carbohydrate Structure Database (CSDB) has been improved to meet modern requirements in glycoinformatics. The new features include: the possibility to combine repeating and nonrepeating moieties in one structure; support of carbon-carbon bonds; and usage of SMILES encodings for unambiguous chemical description of glycan structures, including aglycons and atypical components. The new capabilities of CSDB Linear, together with the older ones, allow efficient detection of errors in CSDB and, at the same time, ensure the absence of informatic problems common for human-readable notations. The CSDB Linear implementation provides translation to other carbohydrate notations and multiple procedures for content error checking.
GlyTouCan version 1.0 was released in 2015 as the international glycan structure repository, and a new sequence format called WURCS (Web3 Unique Representation of Carbohydrate Structures) was proposed during the early stages of the GlyTouCan project. GlyTouCan uses WURCS as its base representation for glycans because existing formats were insufficient in their flexibility to represent any and all glycans universally. Therefore, in order to obtain WURCS strings for existing or new glycan structures, conversion tools or glycan structure editors that can export WURCS became necessary. GlycanBuilder was an obvious choice to extend due to its wide usage by the community. However, GlycanBuilder was limited because it was originally developed to support mammalian glycans. It also did not support the newly proposed monosaccharide symbol standard called Symbol Nomenclature for Glycans (SNFG). Therefore in this work, we implemented a new version of GlycanBuilder to greatly increase its usability. The glycan rendering system was refactored so that cyclic glycans, nested repeating units, monosaccharide compositions and cross-linked glycan structures can be represented. Both import and export utilities for WURCS were also implemented and SNFG symbols were incorporated to allow glycans to be exported as graphics using the latest glycan symbol nomenclature. This new version of GlycanBuilder called "GlycanBuilder2", is able to support a wide variety of ambiguous glycans, including structures containing monosaccharides from bacteria and plants. These glycans can also be displayed using the new SNFG symbols. This tool can aid researchers in communicating about the complex, diverse, and ambiguous structures of glycans more rapidly. Moreover, the new GlycanBuilder can now easily output WURCS sequences from glycans drawn on the canvas. 
Most importantly, because GlyTouCan employs WURCS as the basic format for registration and searching of glycan information, a wider variety of glycans can now be readily registered and queried in GlyTouCan.
MOTIVATION: Glycans are biomolecules that play an important role in the biological processes of living organisms. They form diverse, complicated structures such as branched and cyclic forms. Web3 Unique Representation of Carbohydrate Structures (WURCS) was proposed as a new linear notation for uniquely representing glycans during the GlyTouCan project. WURCS defines rules for complex glycan structures that other text formats did not support, and so it is possible to represent a wide variety of glycans. However, WURCS uses a complicated nomenclature, so it is not human-readable. Therefore, we aimed to support the interpretation of WURCS by converting WURCS to the most basic and widely used format, IUPAC. RESULTS: In this study, we developed GlycanFormatConverter and succeeded in converting WURCS to the three kinds of IUPAC formats (IUPAC-Extended, IUPAC-Condensed and IUPAC-Short). Furthermore, we have implemented functionality to import IUPAC-Extended, KEGG Chemical Function (KCF) and LinearCode formats and to export WURCS. We have thoroughly tested our GlycanFormatConverter and were able to show that it was possible to convert all the glycans registered in the GlyTouCan repository, with exceptions owing only to the limitations of the original format. The source code for this conversion tool has been released as an open source tool. AVAILABILITY AND IMPLEMENTATION: https://github.com/glycoinfo/GlycanFormatConverter.git.
no abstract.
The glycan symbol nomenclature proposed by Harvey et al. in these pages has relative advantages and disadvantages. The use of symbols to depict glycans originated from Kornfeld in 1978, was systematized in the First Edition of "Essentials of Glycobiology" and updated for the second edition, with input from relevant organizations such as the Consortium for Functional Glycomics. We also note that >200 illustrations in the second edition have already been published using our nomenclature and are available for download at PubMed.
SMILES (Simplified Molecular Input Line Entry System) is a chemical notation system designed for modern chemical information processing. Based on principles of molecular graph theory, SMILES allows rigorous structure specification by use of a very small and natural grammar. The SMILES notation system is also well suited for high-speed machine processing. The resulting ease of usage by the chemist and machine compatibility allow many highly efficient chemical computer applications to be designed including generation of a unique notation, constant-speed (zeroth order) database retrieval, flexible substructure searching, and property prediction models.
This chapter contains sections titled: - Introduction - Fundamental Concepts - Atom Specification - Bond Specification - Branch Specification - Ring Specification - Disconnections - Reaction Specification - Isomerism Beyond the Valence Model.
When biological macromolecules are used as therapeutic agents, it is often necessary to introduce non-natural chemical modifications to improve their pharmaceutical properties. The final products are complex structures where entities such as proteins, peptides, oligonucleotides, and small molecule drugs may be covalently linked to each other, or may include chemically modified biological moieties. An accurate in silico representation of these complex structures is essential, as it forms the basis for their electronic registration, storage, analysis, and visualization. The size of these molecules (henceforth referred to as "biomolecules") often makes them too unwieldy and impractical to represent at the atomic level, while the presence of non-natural chemical modifications makes it impossible to represent them by sequence alone. Here we describe the Hierarchical Editing Language for Macromolecules ("HELM") and demonstrate its utility in the representation of structures such as antisense oligonucleotides, short interference RNAs, peptides, proteins, and antibody drug conjugates.
Recent years have seen an increase in both the development and use of informatics tools and databases in glycobiology-based research. Mass spectrometric methods, which are capable of detecting oligosaccharides in the low pico- to femtomole range, are fundamental technologies used in glycan analysis. The availability of robust and reliable algorithms to automatically interpret MS spectra is critical to many glycomic projects. Unfortunately, the current state-of-the-art in glycoinformatics is characterized by the existence of disconnected and incompatible islands of experimental data, resources, and proprietary applications. The development of tools for the robust automatic assignment of glycans on the basis of MS measurements is often hampered by the paucity of available MS data. Here, we review the methodologies for semi‐automatic interpretation of MS spectra of glycans, based upon current technology. Three promising approaches are highlighted: (a) combinatorial approaches to the automatic assignment of possible monosaccharide superclass composition—Glyco‐Peakfinder, (b) the scoring of a set of identified structures with theoretically calculated fragments–GlycoWorkbench and (c) the correlation of experimental masses to a database of theoretical fragment masses, in a technique known as Glycofragment Mass Fingerprinting.
Mass spectrometry is the main analytical technique currently used to address the challenges of glycomics as it offers unrivalled levels of sensitivity and the ability to handle complex mixtures of different glycan variations. Determination of glycan structures from analysis of MS data is a major bottleneck in high-throughput glycomics projects, and robust solutions to this problem are of critical importance. However, all the approaches currently available have inherent restrictions to the type of glycans they can identify, and none of them have proved to be a definitive tool for glycomics. GlycoWorkbench is a software tool developed by the EUROCarbDB initiative to assist the manual interpretation of MS data. The main task of GlycoWorkbench is to evaluate a set of structures proposed by the user by matching the corresponding theoretical list of fragment masses against the list of peaks derived from the spectrum. The tool provides an easy to use graphical interface, a comprehensive and increasing set of structural constituents, an exhaustive collection of fragmentation types, and a broad list of annotation options. The aim of GlycoWorkbench is to offer complete support for the routine interpretation of MS data. The software is available for download from: http://www.eurocarbdb.org/applications/ms-tools.
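The matching step described in the abstract above, comparing a theoretical list of fragment masses with the peak list of a spectrum, can be sketched as follows. This is a minimal sketch and not GlycoWorkbench code: the function names, the fixed m/z tolerance, and the fragment labels are assumptions for illustration, and generating the theoretical fragments from a proposed structure is left out.

```python
from bisect import bisect_left


def annotate_peaks(peaks, fragments, tol=0.5):
    """Match observed peak m/z values against theoretical fragment m/z values.

    peaks:     list of observed m/z values
    fragments: dict mapping a fragment label to its theoretical m/z
    tol:       absolute m/z tolerance (assumed; real tools may use ppm)
    Returns a list of (peak, label, error) tuples.
    """
    ordered = sorted(fragments.items(), key=lambda item: item[1])
    mzs = [mz for _, mz in ordered]
    matches = []
    for peak in peaks:
        # Jump to the first fragment that can still lie inside the window.
        start = bisect_left(mzs, peak - tol)
        for label, mz in ordered[start:]:
            if mz > peak + tol:
                break
            matches.append((peak, label, peak - mz))
    return matches


def coverage(peaks, fragments, tol=0.5):
    """Fraction of observed peaks explained by at least one fragment."""
    if not peaks:
        return 0.0
    matched = {peak for peak, _, _ in annotate_peaks(peaks, fragments, tol)}
    return len(matched) / len(peaks)


if __name__ == "__main__":
    # Hypothetical labels and values, not computed from a real structure.
    theoretical = {"frag_a": 204.09, "frag_b": 366.14}
    observed = [204.10, 300.00]
    print(annotate_peaks(observed, theoretical))
    print(coverage(observed, theoretical))
```

A coverage value computed this way is one simple means of ranking several user-proposed structures against the same spectrum; other scoring schemes are equally possible.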
SUMMARY: This manuscript describes an open-source program, DrawGlycan-SNFG (version 2), that accepts IUPAC (International Union of Pure & Applied Chemistry)-condensed inputs to render Symbol Nomenclature For Glycans (SNFG) drawings. A wide range of local and global options enable display of various glycan/peptide modifications including bond breakages, adducts, repeat structures, ambiguous identifications, etc. These facilities make DrawGlycan-SNFG ideal for integration into various glycoinformatics software, including glycomics and glycoproteomics mass spectrometry applications. As a demonstration of such usage, we incorporated DrawGlycan-SNFG into gpAnnotate, a standalone application to score and annotate individual MS/MS glycopeptide spectrum in different fragmentation modes. AVAILABILITY AND IMPLEMENTATION: DrawGlycan-SNFG and gpAnnotate are platform independent. While originally coded using MATLAB, compiled packages are also provided to enable DrawGlycan-SNFG implementation in Python and Java. All programs are available from https://virtualglycome.org/drawglycan; https://virtualglycome.org/gpAnnotate. SUPPLEMENTARY INFORMATION: Supplementary Material is available at Bioinformatics online.
GlycoMod (http://www.expasy.ch/tools/glycomod/) is a software tool designed to find all possible compositions of a glycan structure from its experimentally determined mass. The program can be used to predict the composition of any glycoprotein-derived oligosaccharide comprised of either underivatised, methylated or acetylated monosaccharides, or with a derivatised reducing terminus. The composition of a glycan attached to a peptide can be computed if the sequence or mass of the peptide is known. In addition, if the protein is known and is contained in the SWISS-PROT or TrEMBL databases, the program will match the experimentally determined masses against all the predicted protease-produced peptides (including any post-translational modifications annotated in these databases) which have the potential to be glycosylated with either N- or O-linked oligosaccharides. Since many possible glycan compositions can be generated from the same mass, the program can apply compositional constraints to the output if the user supplies either known or suspected monosaccharide constituents. Furthermore, known oligosaccharide structural constraints on monosaccharide composition are also incorporated into the program to limit the output.
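The core idea in the abstract above, listing every monosaccharide composition consistent with a measured mass, can be illustrated with a brute-force search. This is a sketch under stated assumptions and not GlycoMod's algorithm: it covers only four underivatised residue types, assumes a free reducing end (one added water), uses neutral monoisotopic masses, and applies no structural constraints. The function name and parameters are hypothetical.

```python
from itertools import product

# Monoisotopic residue masses (Da) of underivatised monosaccharides.
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}
WATER = 18.0106  # added once for the free reducing end


def compositions(target_mass, tolerance=0.01, max_count=12):
    """Return all residue compositions whose mass is within tolerance (Da)."""
    names = list(RESIDUE_MASS)
    hits = []
    # Exhaustive enumeration: fine for four residue types and small counts.
    for counts in product(range(max_count + 1), repeat=len(names)):
        if not any(counts):
            continue  # skip the empty composition
        mass = WATER + sum(n * RESIDUE_MASS[name]
                           for n, name in zip(counts, names))
        if abs(mass - target_mass) <= tolerance:
            hits.append(dict(zip(names, counts)))
    return hits


if __name__ == "__main__":
    # Neutral mass of Hex5HexNAc2 built from the table above.
    for hit in compositions(1234.4334):
        print(hit)
```

Because several compositions can share nearly the same mass, the user-supplied and structural constraints mentioned in the abstract would be applied as filters on the returned list.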
This chapter covers various aspects of the analysis of glycoprotein and released glycans by mass spectrometry. After a short introduction on the occurrence and structure of these compounds, a method is described for presenting the structures pictorially. Methods for the isolation and purification of glycoproteins are covered next followed by methods for determining the position of attachment of the glycans to the protein (known as site analysis). For the detailed analysis of glycan structure, it is normal practice to remove them from the protein, and methods based on hydrazinolysis, reductive amination, and the use of enzymes are described. Preparation of these glycans for mass spectrometric analysis involves the removal of contaminants, and suitable methods are briefly covered. Depending on the glycan and the mass spectrometric method employed, derivatization can be helpful although it is not always necessary. Such methods include permethylation, reducing terminal derivatization and stabilization of sialic acids by methyl ester or amide formation. The use of exoglycosidase digestions is included in the chapter because mass spectrometry of intact glycans does not provide information on the nature of the constituent monosaccharides, several of which are isobaric. Such information can be obtained by the use of these enzymes or, alternatively techniques based on derivatization, hydrolysis, and gas‐chromatography/mass spectrometry (GC/MS) can be used. The remainder of the review covers the use of the various types of mass spectrometry that are available for glycan and glycoprotein analysis. Several ionization techniques such as electrospray (ESI), matrix‐assisted laser desorption (MALDI), electron ionization (EI), and fast atom bombardment (FAB) are covered, followed by a description of fragmentation methods such as collision‐induced dissociation (CID), in‐source decay (ISD), postsource decay (PSD) and electron‐transfer dissociation (ETD). 
This section ends with examples on the use of ion mobility for possible isomer differentiation and sample clean‐up. Many of these techniques produce a large amount of data requiring interpretation by various computing techniques, and a number of popular methods are discussed. The chapter ends with a list of references.
no abstract.
Glycosylation is the most widespread posttranslational modification in eukaryotes; however, the role of oligosaccharides attached to proteins has been little studied because of the lack of a sensitive and easy analytical method for oligosaccharide structures. Recently, tandem mass spectrometric techniques have been revealing that oligosaccharides might have characteristic signal intensity profiles. We describe here a strategy for the rapid and accurate identification of the oligosaccharide structures on glycoproteins using only mass spectrometry. It is based on a comparison of the signal intensity profiles of multistage tandem mass (MSn) spectra between the analyte and a library of observational mass spectra acquired from structurally defined oligosaccharides prepared using glycosyltransferases. To smartly identify the oligosaccharides released from biological materials, a computer suggests which ion among the fragment ions in the MS/MS spectrum should yield the most informative MS3 spectrum to distinguish similar oligosaccharides. Using this strategy, we were able to identify the structure of N-linked oligosaccharides in immunoglobulin G as an example.
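The library comparison described in the abstract above can be illustrated by scoring a query spectrum against reference spectra by the similarity of their intensity profiles. The abstract does not state which similarity measure is used; cosine similarity over binned m/z values is swapped in here as a common choice, and all names and numbers below are hypothetical.

```python
import math


def binned(spectrum, bin_width=1.0):
    """Sum intensities of (mz, intensity) pairs into integer m/z bins."""
    vector = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        vector[key] = vector.get(key, 0.0) + intensity
    return vector


def cosine(a, b):
    """Cosine similarity of two sparse intensity vectors (dicts)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_match(query, library, bin_width=1.0):
    """Return (score, name) of the library spectrum closest to the query."""
    q = binned(query, bin_width)
    scored = [(cosine(q, binned(ref, bin_width)), name)
              for name, ref in library.items()]
    return max(scored) if scored else None


if __name__ == "__main__":
    # Two hypothetical isomers with the same fragments but different intensities.
    library = {
        "isomer_A": [(500.2, 90.0), (662.3, 25.0)],
        "isomer_B": [(500.2, 10.0), (662.3, 100.0)],
    }
    query = [(500.2, 100.0), (662.3, 20.0)]
    print(best_match(query, library))
```

The example shows why intensity profiles matter: both library entries contain the same fragment masses, so only the relative intensities separate them.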
Glycosyl groups are an essential mediator of molecular interactions in cells and on cellular surfaces. There are very few methods that directly relate sugar-containing molecules to their biosynthetic machineries. Here, we introduce glycogenomics as an experiment-guided genome-mining approach for fast characterization of glycosylated natural products (GNPs) and their biosynthetic pathways from genome-sequenced microbes by targeting glycosyl groups in microbial metabolomes. Microbial GNPs consist of aglycone and glycosyl structure groups in which the sugar unit(s) are often critical for the GNP's bioactivity, e.g., by promoting binding to a target biomolecule. GNPs are a structurally diverse class of molecules with important pharmaceutical and agrochemical applications. Herein, O- and N-glycosyl groups are characterized in their sugar monomers by tandem mass spectrometry (MS) and matched to corresponding glycosylation genes in secondary metabolic pathways by a MS-glycogenetic code. The associated aglycone biosynthetic genes of the GNP genotype then classify the natural product to further guide structure elucidation. We highlight the glycogenomic strategy by the characterization of several bioactive glycosylated molecules and their gene clusters, including the anticancer agent cinerubin B from Streptomyces sp. SPB74 and an antibiotic, arenimycin B, from Salinispora arenicola CNB-527.
Glycobioinformatics is a rapidly developing field providing a vital support for MS-based glycoproteomics research. Recent advances in MS greatly increased technological capabilities for high throughput glycopeptide analysis. However, interpreting MS output, in terms of identifying glycan structures, attachment sites and glycosylation linkages still presents multiple challenges. Here, we discuss current strategies used in MS-based glycoproteomics and bioinformatics tools available for MS-based glycopeptide and glycan analysis. We also provide a brief overview of recent efforts in glycobioinformatics such as the new initiative UniCarbKB directed toward developing more comprehensive and unified glycobioinformatics platforms. With regards to glycobioinformatics tools and applications, we do not express our personal preferences or biases, but rather focus on providing a concise description of main features and functionalities of each application with the goal of assisting readers in making their own choices and identifying and locating glycobioinformatics tools most suitable for achieving their experimental objectives.
Protein N-glycosylation plays a crucial role in a considerable number of important biological processes. Research studies on glycoproteomes and glycomes have already characterized many glycoproteins and glycans associated with cell development, life cycle, and disease progression. Mass-spectrometry (MS) is the most powerful tool for identifying biomolecules including glycoproteins and glycans, however, utilizing MS-based approaches to identify glycoproteomes and glycomes is challenging due to the technical difficulties associated with glycosylation analysis. In this review, we summarize the most recent developments in MS-based glycoproteomics and glycomics, including a discussion on the development of analytical methodologies and strategies used to explore the glycoproteome and glycome, as well as noteworthy biological discoveries made in glycoproteome and glycome research. This review places special emphasis on China, where scientists have made sizeable contributions to the literature, as advancements in glycoproteomics and glycomics are occurring quite rapidly.
Mass spectrometric techniques are the key technology for rapid and reliable glycan analysis. However, the lack of robust, dependable, and freely available software for the (semi-) automatic annotation of mass spectra is still a severe bottleneck that hampers their rapid interpretation. In this article the "Glyco-Peakfinder" web-service is described allowing de novo determination of glycan compositions from their mass signals. Starting from a basic set of mandatory masses of glycan components, the calculation can be performed without any knowledge concerning the biological background of the sample or the fragmentation technique used. "Glyco-Peakfinder" assigns all types of fragment ions including monosaccharide cross-ring cleavage products and multiply charged ions. It provides full user control to handle modified glycans (persubstituted molecules, reducing-end modifications, glycoconjugates) and ion types. The formula applied to calculate the fragment masses and an outline of the implemented algorithm are discussed. A systematic evaluation of the dependence of all factors influencing the computation time revealed strikingly different impact of the individual calculation steps. To provide access to known carbohydrate structures a "composition search" in the open access database GLYCOSCIENCES.de can be performed. The service is available at the URL: www.eurocarbdb.org/applications/ms-tools.
New computer software, GlycoMiner, has been developed to automatically identify tandem (MS/MS) spectra obtained in liquid chromatography/mass spectrometry (LC/MS) runs which correspond to N-glycopeptides. The program complements conventional proteomics analysis, and can be used in a high-throughput environment. The program interprets the spectra and determines the structure of the corresponding glycopeptides. GlycoMiner runs under Windows, can process spectra obtained on various instruments, and can be downloaded from our website (w3.chemres.hu/ms/glycominer). The algorithm works similarly to a human expert; evaluates the low mass oxonium ions; deduces oligosaccharide losses from the protonated molecule; and identifies the mass of the peptide residue. The program has been tested on tryptic digests of two glycopeptides: AGP (which has five different N-glycosylation sites) and transferrin (with two N-glycosylation sites). Results have been evaluated both manually and by GlycoMiner. Out of 3132 MS/MS spectra 338 were found to correspond to glycopeptides; identification by GlycoMiner showed a 0.1% false positive and 0.1% false negative rate. From these it was possible to identify 196 glycan structures manually; GlycoMiner correctly identified all of these, with no false positives. The rest were low quality spectra, not suitable for structure assignment.
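The first step described above, checking for low-mass oxonium ions, can be sketched as a simple peak filter. This is an illustrative assumption about how such a check might look, not GlycoMiner's implementation; the ion list, the thresholds, and the function names are hypothetical.

```python
# Commonly cited singly protonated oxonium ions (monoisotopic m/z).
OXONIUM_IONS = {
    "Hex": 163.0601,
    "HexNAc": 204.0867,
    "NeuAc-H2O": 274.0921,
    "NeuAc": 292.1027,
    "HexHexNAc": 366.1395,
}


def matched_oxonium_ions(peaks, tolerance=0.02, min_rel_intensity=0.05):
    """Return names of oxonium ions found in peaks, a list of (mz, intensity)."""
    if not peaks:
        return []
    base_peak = max(intensity for _, intensity in peaks)
    found = []
    for name, ion_mz in OXONIUM_IONS.items():
        for mz, intensity in peaks:
            close_enough = abs(mz - ion_mz) <= tolerance
            strong_enough = intensity >= min_rel_intensity * base_peak
            if close_enough and strong_enough:
                found.append(name)
                break
    return found


def looks_like_glycopeptide(peaks, min_ions=2, **kwargs):
    """Flag a spectrum as a glycopeptide candidate if enough oxonium ions match."""
    return len(matched_oxonium_ions(peaks, **kwargs)) >= min_ions
```

Spectra that pass this filter would then go on to the later steps (neutral-loss series and peptide mass assignment), which this sketch does not attempt.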
Since the 2-D/3-D HPLC mapping technique was proposed for structural analyses of N-glycans, approximately 500 different structures have been elucidated. Based on the accumulated data, we developed a web application, GALAXY (Glycoanalysis by the three axes of MS and chromatography), to utilize the 2-D/3-D maps more effectively. This application will facilitate search of candidate structures satisfying 2-D/3-D HPLC and/or mass spectrometric data and enable us to predict coordinates of putative PA-glycans and to trace the effects of glycosidase treatments in a graphical manner.
This paper reviews the current status of bioinformatics applications and databases in glycobiology, which are based on bioinformatics approaches as well as informatics for glycobiology where an explicit encoding of glycan structures is required. The availability of the complete sequence of the human genome has accelerated the systematic identification of so far unidentified glycogenes considerably in many areas of glycobiology using well-established bioinformatics tools. Although there has been an immense development of new glyco-related data collections as well as informatics tools and several efforts have been started to cross-link and reference the various data deposited in distributed databases, informatics for glycobiology and glycomics is still poorly developed compared to the genomics and proteomics area. The development of algorithms for the automatic interpretation of MS spectra - currently, a severe bottleneck, which hampers the rapid and reliable interpretation of MS data in high-throughput glycomics projects - is reviewed. A comprehensive list of web resources is given. Several lines of progression are discussed. There is an urgent need for the development of decentralised input facilities of experimentally determined glycan structures. Simultaneously, agreements of standards for the structural description of glycans as well as formats for the related data have to be established. The integration of glycomics with genomics/proteomics has to increase.
This chapter is dedicated to the presentation of different examples of the application of solution NMR to the study of conformation, dynamics of sugar molecules (oligo and polysaccharides, glycopeptides and glycomimetics) and to the investigation of glycan-related molecular recognition events. Selected examples since 2012 are presented depending on the chemical nature of the sugar molecule, on the environment (free or bound) and on the nature of the receptor.
Carbohydrates (glycans, saccharides, sugars) are everywhere. In fact, glycan-protein interactions are involved in many essential processes of life and disease. The understanding of the key structural details at the atomic and molecular level is of paramount importance to effectively design molecules for therapeutic purposes. Different approximations may be employed to decipher these molecular recognition processes with high resolution. Advances in cryo-electron microscopy are providing exquisite details on different biological mechanisms involving sugars, while better and better protocols for structural refinement in the application of X-ray methods for protein-sugar complexes and glycoproteins are also permitting fantastic advances in the glycoscience arena. Alternatively, NMR spectroscopy remains as one of the most rewarding techniques to explore protein-carbohydrate interactions. In fact, given the intrinsic dynamic nature of saccharides, NMR can afford exquisite structural information at the atomic detail, not accessible by other techniques. However, the access to this information is sometimes intricate, and requires careful analysis and well-defined strategies. In this review, we have highlighted these issues and presented an overview of different modern NMR approaches with a focus on the latest developments and challenges.
The online version of the SpecInfo system consists of a factual database with spectroscopic information and a set of special tools for the spectral data analysis and interpretation. It has been designed to support experts in laboratories in their analytical work. The database contains high-quality spectral data of various types, experimental conditions, chemical substance information including the connection tables, and the bibliographic source data. In the first release, the database comprises approximately 70 000 13C NMR, 17 000 infrared, and 6000 NMR spectra of other nuclei (19F, 15N, 17O, 31P). Several other types of spectra are planned to be included in SpecInfo for future releases. The online database offers many different access capabilities to the factual data, and some illustrative examples are discussed in this paper. In addition, there are four special programs available in the SpecInfo system for dealing with 13C NMR data: CHESS for a search of identical or similar chemical structures, COUPCAL for the calculation of coupling constants, GETSPEC for a similarity search of 13C NMR spectra, and SPECAL for the estimation of a 13C NMR spectrum from a query structure. In addition, there is a spectrum editor (EDSPEC) available to support the building of spectral queries. At the online host STN International, the SpecInfo system is embedded in a cluster of chemical databases, and it is possible to navigate through this system using the various substance registry numbers connecting the different databases. Finally, some future development plans for SpecInfo are presented.
The diversity in molecular arrangements and dynamics displayed by glycans renders traditional NMR strategies, employed for proteins and nucleic acids, insufficient. Because of the unique properties of glycans, structural studies often require the adoption of a different repertoire of tailor-made experiments and protocols. We present an account of recent developments in NMR techniques that will deepen our understanding of structure–function relations in glycans. We open with a survey and comparison of methods utilized to determine the structure of proteins, nucleic acids and carbohydrates. Next, we discuss the structural information obtained from traditional NMR techniques like chemical shifts, NOEs/ROEs, and coupling-constants, along with the limitations imposed by the unique intrinsic characteristics of glycan structure on these approaches: flexibility, range of conformers, signal overlap, and non-first-order scalar (strong) coupling. Novel experiments taking advantage of isotopic labeling are presented as an option for overcoming spectral overlap and raising sensitivity. Computational tools used to explore conformational averaging in conjunction with NMR parameters are described. In addition, recent developments in hydroxyl detection and hydrogen bond detection in protonated solvents, in contrast to traditional sample preparations in D2O for carbohydrates, further increase the tools available for both structure information and chemical shift assignments. We also include previously unpublished data in this context. Accurate determination of couplings in carbohydrates has been historically challenging due to the common presence of strong-couplings. We present new strategies proposed for dealing with their influence on NMR signals. 
We close with a discussion of residual dipolar couplings (RDCs) and the advantages of using 13C isotope labeling that allows gathering one-bond 13C–13C couplings with a recently improved constant-time COSY technique, in addition to the commonly measured 1H–13C RDCs.
This chapter provides an overview of the 13Carbon-nuclear magnetic resonance (13C-NMR) spectroscopy of monosaccharides. The 13C-NMR spectroscopy has become increasingly important as a tool for the characterization and structural elucidation of sugars and their derivatives. Although 13C-NMR is closely related to 1H-NMR spectroscopy, especially when both types of spectra are recorded with Fourier-transform instruments, the two techniques are sufficiently different to be valuable complements to each other. In many cases, in particular when dealing with complex molecules such as polysaccharides, the amount of information obtainable from 1H-NMR spectra is limited as compared to that revealed by 13C-NMR spectra. This chapter provides an almost complete collection of 13C-NMR chemical shifts of monosaccharides, their methyl glycosides, and acetates, along with examples of shift data for as many different types of monosaccharide derivative as possible. It also provides details on sampling techniques and assignment techniques, and discusses the identity of monosaccharides, their structure determination, and conformational analysis.
The combination of structural diversity at several levels and limited chemical shift dispersion ensures that NMR spectra of carbohydrates are relatively difficult to interpret. This introduction to applications of NMR spectroscopy for the study of carbohydrates provides guidelines for interpretation of their 1- and 2-D spectra against a background of their tautomeric, configurational, and conformational equilibria in solution and consideration of their biosynthetic diversity. The influence of structural features on chemical shifts and coupling constants is illustrated by the consequences for both homo- and heteronuclear 2-D NMR spectra. Some applications of NMR spectroscopy for studies of carbohydrate metabolism are briefly considered.
Contributions to nuclear screening (chemical shifts) arising from molecular interactions with solvent molecules (excluding hydrogen bonding) are discussed in terms of appropriate theoretical models. These include contributions from van der Waals interactions σw, from the magnetic anisotropy of the solvent molecule σa, and from polar effects σE. By a suitable choice of solute‐solvent systems it has been possible to demonstrate each of these effects experimentally for proton resonances. For CH4 as a solute, σw was in all cases negative, its magnitude varying with the nature of the solvent and amounting to as much as 0.6 ppm for high molecular weight solvents. In agreement with the theoretical models, σa was found to be positive for disk‐shaped solvent molecules and negative for cylindrically symmetrical rod‐shaped molecules, its magnitude in extreme cases reaching 0.75 ppm. For CH3CN as a solute, σE was negative and showed the expected dependence on the dielectric constant of the solvent.
no abstract.
This chapter introduces the most common 1D NMR techniques used in the chemical laboratory. It initially considers the simplest single-pulse excitation method and describes how this should be executed to obtain optimum sensitivity or accurate quantification of samples. It then introduces the concept of spin decoupling and describes its application for homonuclear and heteronuclear decoupling. The use of spin-echoes for spectrum editing is described followed by consideration of methods that also employ polarisation transfer with spectrum editing for enhanced performance. The final section describes techniques that are specifically tailored to the observation of rapidly relaxing quadrupolar nuclei.
Structural biology plays a key role in understanding how networks of protein interactions with their partners are organized at the atomic level. In this review, we show that NMR is a very efficient method to solve 3D structures of protein–RNA and protein–carbohydrate complexes of high quality. We explain the importance of studying such interactions and describe the main steps that are required to determine structures of these types of complexes by NMR. Finally, we show that X-ray crystallography and NMR are complementary methods and briefly report on advantages and disadvantages of each approach.
Following the development and publication of the JCAMP-DX protocol 4.24 and its successful implementation in the field of infrared spectroscopy, data exchange without loss of information, between systems of different origin and internal format, has become a reality. The benefits of this system-independent data transfer standard have been recognized by workers in other areas who have expressed a wish for an equivalent, compatible standard in their own fields. This publication details a protocol for the exchange of Nuclear Magnetic Resonance (NMR) spectral data without any loss of information and in a format that is compatible with all storage media and computer systems. The protocol detailed below is designed for spectral data transfer, and its use for NMR imaging data transfer has not as yet been investigated.
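JCAMP-DX files are plain text built from labeled data records of the form `##LABEL= value`. The sketch below reads only single-line header records into a dictionary; it is a simplified illustration under stated assumptions (no label normalization beyond upper-casing, no multi-line values, no decoding of data tables), and the function name is hypothetical.

```python
def parse_jcamp_header(text):
    """Collect single-line ##LABEL= value records from JCAMP-DX text.

    Simplifications: '$$' comments are stripped, data-table lines are
    skipped, multi-line values are not joined, and parsing stops at the
    first ##END= record.
    """
    records = {}
    for raw_line in text.splitlines():
        line = raw_line.split("$$", 1)[0].strip()
        if not line.startswith("##") or "=" not in line:
            continue  # not a labeled data record (e.g. a data line)
        label, value = line[2:].split("=", 1)
        label = label.strip().upper()
        if label == "END":
            break
        records[label] = value.strip()
    return records
```

Decoding the compressed data tables themselves is the larger part of a real reader and is deliberately left out here.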
no abstract.
A new approach for structure determination of native and O-desulfated fucoidans by the analysis of their 13C NMR spectra by artificial neural networks (ANNs) is described. Two ANN models were studied: the simple three-layer feed-forward network, which employs supervised learning, and the adaptive resonance theory (ART) network with unsupervised learning. Training sets for the networks were constructed using chemical shifts of synthetic oligofucosides. The results obtained demonstrate that both models worked better in the case of desulfated fucoidans, while the ART-type networks gave better results in sulfated (native) fucoidan structure elucidation.
Carbohydrate molecules are essential actors in key biological events, being involved as recognition points for cell–cell and cell–matrix interactions related to health and disease. Despite outstanding advances in cryoEM, X-ray crystallography and NMR still remain the most employed techniques to unravel their conformational features and to describe the structural details of their interactions with biomolecular receptors. Given the intrinsic flexibility of saccharides, NMR methods are of paramount importance to deduce the extent of motion around their glycosidic linkages and to explore their receptor-bound conformations. We herein present our particular view on the latest advances in NMR methodologies that are broadening their applications for deducing glycan conformation and dynamics and understanding the recognition events in which they are involved.
A. Loss, T. Lutteke.
Using NMR data on GLYCOSCIENCES.de.
Methods in Molecular Biology, 2015. 1273: 87-95.
DOI: 10.1007/978-1-4939-2343-4_6
NMR spectroscopy is frequently used in structural characterization of carbohydrates. The GLYCOSCIENCES.de database contains more than 3,000 NMR spectra stored as lists of chemical shifts, which can be searched online by atom and residue names and by chemical shift values. This chapter describes how to use the different interfaces to get access to these data. The atom search allows querying the database for NMR spectra that contain a specific carbohydrate residue with an NMR shift in a given range assigned to a particular atom, whereas the peak search enables queries to find spectra with NMR shifts most similar to a list of given shifts. The shift estimation feature facilitates prediction of NMR shifts of glycans, for which no experimental data are available.
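A peak search of the kind described, ranking stored spectra by similarity to a query shift list, might be sketched as follows. The scoring scheme (greedy one-to-one matching and mean absolute deviation) is an assumption chosen for simplicity and is not the method used by GLYCOSCIENCES.de; all names are hypothetical.

```python
def shift_list_distance(query, reference):
    """Mean absolute deviation (ppm) after greedy one-to-one shift matching.

    Returns None when the query is empty or the reference spectrum has
    fewer peaks than the query.
    """
    if not query or len(reference) < len(query):
        return None
    remaining = sorted(reference)
    total = 0.0
    for shift in sorted(query):
        # Pair each query shift with the closest reference shift still unused.
        nearest = min(remaining, key=lambda ref: abs(ref - shift))
        total += abs(nearest - shift)
        remaining.remove(nearest)
    return total / len(query)


def rank_spectra(query, library):
    """Rank library spectra (name -> list of shifts), most similar first."""
    scored = []
    for name, shifts in library.items():
        distance = shift_list_distance(query, shifts)
        if distance is not None:
            scored.append((name, distance))
    return sorted(scored, key=lambda item: item[1])
```

Greedy matching is not guaranteed to find the optimal pairing; an assignment algorithm would be more rigorous for crowded spectra, at higher cost.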
The aim of this report is to collect the most important results in sugar studies by NMR methods in the last two years. This review covers determination of new and previously known structures of sugars isolated from natural sources, as well as the structures of carbohydrates obtained by chemical or enzymatic synthesis. Moreover, we have included herein the papers describing non‐covalent interactions between carbohydrates and other sugars, peptides, proteins and DNA fragments, as well as the application of NMR techniques to identification and quantification of sugars. The development in rare and unusual NMR methods used to study the sugar structures is also included. The last section focuses on the computational methods used to calculate NMR parameters, and on the carbohydrate databases.
Structural determination of N- and O-linked glycans as well as polysaccharides is hampered by the limited spectral dispersion. The computerized approach CASPER, an acronym for computer assisted spectrum evaluation of regular polysaccharides, uses liquid state NMR data to elucidate carbohydrate structure based on agreement with predicted (1)H and (13)C chemical shifts. We here demonstrate developments based on multiple through-bond J-based correlations that significantly enhance the credence to the sequence connectivities proposed in the analysis exemplified by an oligosaccharide and a bacterial polysaccharide. The approach is also suitable for predicting (1)H and (13)C NMR chemical shifts of synthesized oligosaccharides and glycoconjugates, thereby corroborating a proposed structure.
Understanding the dynamics of protein-ligand interactions, which lie at the heart of host-pathogen recognition, represents a crucial step to clarify the molecular determinants implicated in binding events, as well as to optimize the design of new molecules with therapeutic aims. Over the last decade, advances in complementary biophysical and spectroscopic methods permitted us to deeply dissect the fine structural details of biologically relevant molecular recognition processes with high resolution. This Review focuses on the development and use of modern nuclear magnetic resonance (NMR) techniques to dissect binding events. These spectroscopic methods, complementing X-ray crystallography and molecular modeling methodologies, will be taken into account as indispensable tools to provide a complete picture of protein-glycoconjugate binding mechanisms related to biomedicine applications against infectious diseases.
Computer-based procedures are developed for simulating the 13C NMR spectra of carbohydrates. By use of data from five sources, models are derived that relate observed chemical shifts to numerical parameters encoding aspects of the chemical environments of the corresponding carbons. Molecular mechanics techniques are used to compute parameters encoding the effects of multiple oxygen atoms on the carbon atom environments. A calibration procedure is introduced for adjusting experimental spectra to the computed models, thereby allowing valid comparisons to be made between the spectrum of an unknown and the simulated spectra of possible candidate structures. The derived models are tested by simulating the spectra of 15 compounds not included in the modeling study. Experimental spectra are used to evaluate the simulations.
F. Pereira.
1D 13C-NMR data as molecular descriptors in spectra-structure relationship analysis of oligosaccharides.
Molecules, 2012. 17(4): 3818-3833.
DOI: 10.3390/molecules17043818
Spectra-structure relationships were investigated for estimating the anomeric configuration, residues and type of linkages of linear and branched trisaccharides using 13C-NMR chemical shifts. For this study, 119 pyranosyl trisaccharides were used that are trimers of the alpha or beta anomers of D-glucose, D-galactose, D-mannose, L-fucose or L-rhamnose residues bonded through α or β glycosidic linkages of types 1-->2, 1-->3, 1-->4, or 1-->6, as well as methoxylated and/or N-acetylated amino trisaccharides. Machine learning experiments were performed for: (1) classification of the anomeric configuration of the first unit, second unit and reducing end; (2) classification of the type of first and second linkages; (3) classification of the three residues: reducing end, middle and first residue; and (4) classification of the chain type. Our previous model for predicting the structure of disaccharides was incorporated in this new model with an improvement of the predictive power. The best results were achieved using Random Forests with 204 di- and trisaccharides for the training set; it could correctly classify 83%, 90%, 88%, 85%, 85%, 75%, 79%, 68% and 94% of the test set (69 compounds) for the nine tasks, respectively, on the basis of unassigned chemical shifts.
Assignments for the 13C chemical shifts of methyl aldopento- and hexofuranosides and of cyclopentane polyols are reported, and these chemical shifts are evaluated with respect to all combinations of configurational arrangements of the ring substituents. A cis configuration of vicinal substituents is generally associated with a substantial increase in shielding, as compared with the trans analog. Among the furanosides, however, changes in configuration involving O-3 and C-5 are accompanied by a complex shielding pattern: e.g., effects on C-3 are minor, C-4 is more shielded in cis than in trans isomers, and C-4 and C-5 of hexosides are more shielded than those of pentosides. Variations in shielding attributable to cis or trans arrangements of substituents in a 1,3-relationship are encountered in only a few instances; however, it is noteworthy that C-4 is less shielded when O-1 and O-3 are cis than when trans. These diverse effects are considered also in relation to steric interactions and to conformational influences on 13C chemical shifts in the five-membered ring series.
A novel method for the determination of the three-dimensional (3D) structure of oligosaccharides in the solid state using experimental 13C NMR data is presented. The approach employs this information, combined with 13C chemical shift surfaces (CSSs) for the glycosidic bond carbons in the generation of NMR pseudopotential energy functions suitable for use as constraints in molecular modeling simulations. Application of the method to trehalose, cellobiose, and cellotetraose produces 3D models that agree remarkably well with the reported X-ray structures, with phi and psi dihedral angles that are within 10 degrees of those observed in the crystals. The usefulness of the approach is further demonstrated in the determination of the 3D structure of cellohexaose, a hexasaccharide for which no X-ray data have been reported, as well as in the generation of accurate structural models for cellulose II and amylose V6.
Carbohydrates are ubiquitous components in nature involved in a range of tasks. They cover every cell and contribute both structural stability as well as identity. Lipopolysaccharides are the outermost exposed part of the bacterial cell wall and the primary target for host-pathogen recognition. Understanding the structure and biosynthesis of these polysaccharides is crucial to combat disease and develop new medicine. Structural determinations can be carried out using NMR spectroscopy, a powerful tool giving information on an atomistic scale. This thesis is focused on method development to study polysaccharide structures as well as application on bacterial lipopolysaccharides. The focus has been to incorporate a bioinformatics approach prior to analysis by NMR spectroscopy, and then computer assisted methods to aid in the subsequent analysis of the spectra. The third chapter deals with the recent developments of ECODAB, a tool that can help predict structural fragments in Escherichia coli O-antigens. It was migrated to a relational database and the aforementioned predictions can now be made automatically by ECODAB. The fourth chapter gives insight into the program CASPER, a computer program that helps with structure determination of oligo- and polysaccharides. An approach to determine substituent positions in polysaccharides was investigated. The underlying database was also expanded and the improved capabilities were demonstrated by determining O-antigenic structures that could not previously be solved. The fifth chapter is an application to O-antigen structures of E. coli strains. This is done by a combination of NMR spectroscopy and bioinformatics to predict components as well as linkages prior to spectra analysis. In the first case, a full structure elucidation was performed on E. coli serogroup O63, and in the second case the bioinformatics approach is demonstrated on E. coli serogroup O93.
In the sixth chapter, a new version of the CarbBuilder software is presented. This includes a more robust building algorithm that helps build sterically crowded polysaccharide structures, as well as a general expansion of possible components.
The CASPER program which is used for determination of the structure of oligo- and polysaccharides has been extended. It can now handle a reduced number of experimental signals from an NMR spectrum in the comparison to the simulated spectra of structures that it generates, an improvement which is of practical importance since all signals in NMR spectra cannot always be identified. Furthermore, the program has been enhanced to simulate NMR spectra of multibranched oligo- and polysaccharides. The new developments were tested on four saccharides of known structure but of different complexity and were shown to predict the correct structures.
A fast, efficient program which runs on the ubiquitous personal computer for the analysis, storage, and retrieval of NMR information is presented. It is suitable for abstracting large quantities of data from published literature and includes many plausibility tests which are executed simultaneously with the input. Automatic determination of stereoconfigurations in 2D structures makes the program of great value for natural compounds. A structure elucidation system allows prediction of chemical shifts and J coupling constants and automatic peak assignment. A database of 101 205 C-13, P-31, F-19, Si-29, N-15, B-11, O-17, and S-33 spectra, taken primarily from Russian-language original sources, has been created.
This perspective article is focused on the presentation of the latest advances in NMR methods and applications that are behind the exciting achievements in the understanding of glycan receptors in molecular recognition events. Different NMR-based methodologies are discussed along with their applications to scrutinize the conformation and dynamics of glycans as well as their interactions with protein receptors.
In this chapter an introductory overview is presented of advances in NMR spectroscopy of carbohydrates. The main emphasis is on the application of 1H-NMR spectroscopy for identification and structural studies of glycans.
This chapter contains sections titled: - Introduction - Advantages and Disadvantages of NMR Approaches - NMR is Often Used to Determine Bacterial Polysaccharides - NMR and Informatics Approaches - SugaBase - Spectral Search - Chemical Shift Estimation - Use of Spectral Matching and Shift Estimation in Automatic Procedures.
To address data management and data exchange problems in the nuclear magnetic resonance (NMR) community, the Collaborative Computing Project for the NMR community (CCPN) created a "Data Model" that describes all the different types of information needed in an NMR structural study, from molecular structure and NMR parameters to coordinates. This paper describes the development of a set of software applications that use the Data Model and its associated libraries, thus validating the approach. These applications are freely available and provide a pipeline for high-throughput analysis of NMR data. Three programs work directly with the Data Model: CcpNmr Analysis, an entirely new analysis and interactive display program, the CcpNmr FormatConverter, which allows transfer of data from programs commonly used in NMR to and from the Data Model, and the CLOUDS software for automated structure calculation and assignment (Carnegie Mellon University), which was rewritten to interact directly with the Data Model. The ARIA 2.0 software for structure calculation (Institut Pasteur) and the QUEEN program for validation of restraints (University of Nijmegen) were extended to provide conversion of their data to the Data Model. During these developments the Data Model has been thoroughly tested and used, demonstrating that applications can successfully exchange data via the Data Model. The software architecture developed by CCPN is now ready for new developments, such as integration with additional software applications and extensions of the Data Model into other areas of research.
The (1)H NMR spectra of a number of alcohols, diols and inositols are reported and assigned in CDCl(3), D(2)O and DMSO-d(6) (henceforth DMSO) solutions. These data were used to investigate the effects of the OH group on the (1)H chemical shifts in these molecules and also the effect of changing the solvent. Inspection of the (1)H chemical shifts of those alcohols which were soluble in both CDCl(3) and D(2)O shows that there is no difference in the chemical shifts in the two solvents, provided that the molecules exist in the same conformation in the two solvents. In contrast, DMSO gives rise to significant and specific solvation shifts. The (1)H chemical shifts of these compounds in the three solvents were analysed using the CHARGE model. This model incorporates the electric field, magnetic anisotropy and steric effects of the functional group for long-range protons together with functions for the calculation of the two- and three-bond effects. The long-range effect of the OH group was quantitatively explained without the inclusion of either the C--O bond anisotropy or the C--OH electric field. Differential beta and gamma effects for the 1,2-diol group needed to be included to obtain accurate chemical shift predictions. For DMSO solution the differential solvent shifts were calculated in CHARGE on the basis of a similar model, incorporating two-bond, three-bond and long-range effects. The analyses of the (1)H spectra of the inositols and their derivatives in D(2)O and DMSO solution also gave the ring (1)H,(1)H coupling constants and for DMSO solution the CH--OH couplings and OH chemical shifts. The (1)H,(1)H coupling constants were calculated in the CHARGE program by an extension of the cos(2)phi equation to include the orientation effects of electronegative atoms and the CH--OH couplings by a simple cos(2)phi equation. 
Comparison of the observed and calculated couplings confirmed the proposed conformations of myo-inositol, chiro-inositol, quebrachitol and allo-inositol. The OH chemical shifts were also calculated in the CHARGE program. Comparison of the observed and calculated OH chemical shifts and CH.OH couplings suggested the existence of intramolecular hydrogen bonding in a myo-inositol derivative.
Counterpropagation neural networks were applied to the fast prediction of 1H NMR chemical shifts of CHn groups in organic compounds. The training set consisted of 744 examples of protons that were represented by physicochemical, topological, and geometric descriptors. The selection of descriptors was performed by genetic algorithms, and the models obtained were compared to those containing all the descriptors. The best models yielded very good predictions for an independent prediction set of 259 cases (mean absolute error for whole set, 0.25 ppm; mean absolute error for 90% of cases, 0.19 ppm) and for application cases consisting of four natural products recently described. Some stereochemical effects could be correctly predicted. A useful feature of the system resides in its ability to be retrained with a specific data set of compounds if improved predictions for related structures are required.
A new software suite, called Crystallography & NMR System (CNS), has been developed for macromolecular structure determination by X-ray crystallography or solution nuclear magnetic resonance (NMR) spectroscopy. In contrast to existing structure-determination programs, the architecture of CNS is highly flexible, allowing for extension to other structure-determination methods, such as electron microscopy and solid-state NMR spectroscopy. CNS has a hierarchical structure: a high-level hypertext markup language (HTML) user interface, task-oriented user input files, module files, a symbolic structure-determination language (CNS language), and low-level source code. Each layer is accessible to the user. The novice user may just use the HTML interface, while the more advanced user may use any of the other layers. The source code will be distributed, thus source-code modification is possible. The CNS language is sufficiently powerful and flexible that many new algorithms can be easily implemented in the CNS language without changes to the source code. The CNS language allows the user to perform operations on data structures, such as structure factors, electron-density maps, and atomic properties. The power of the CNS language has been demonstrated by the implementation of a comprehensive set of crystallographic procedures for phasing, density modification and refinement. User-friendly task-oriented input files are available for nearly all aspects of macromolecular structure determination by X-ray crystallography and solution NMR.
A software package, MD2NOE, is presented which calculates Nuclear Overhauser Effect (NOE) build-up curves directly from molecular dynamics (MD) trajectories. It differs from traditional approaches in that it calculates correlation functions directly from the trajectory instead of extracting inverse sixth power distance terms as an intermediate step in calculating NOEs. This is particularly important for molecules that sample conformational states on a timescale similar to molecular reorientation. The package is tested on sucrose and results are shown to differ in small but significant ways from those calculated using an inverse sixth power assumption. Results are also compared to experiment and found to be in reasonable agreement despite an expected underestimation of water viscosity by the water model selected.
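To make the contrast in the abstract above concrete, here is a minimal sketch of the "inverse sixth power" intermediate step that MD2NOE is said to avoid: an effective distance computed as the inverse-sixth-power average of per-frame distances. This is not MD2NOE code; the function name and any input values are hypothetical.

```python
def effective_distance(distances):
    """Return <r^-6>^(-1/6) for a sequence of per-frame distances.

    Illustrative only: this is the traditional averaging step,
    not the correlation-function approach described for MD2NOE.
    """
    if not distances:
        raise ValueError("need at least one distance")
    # Average r^-6 over all trajectory frames.
    mean_inv6 = sum(r ** -6 for r in distances) / len(distances)
    # Convert the averaged quantity back to a distance.
    return mean_inv6 ** (-1.0 / 6.0)
```

Because short distances dominate the r^-6 average, the effective distance for a pair that alternates between a close and a far contact lies nearer to the close contact than the arithmetic mean does.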
The direct (recomputation of two-electron integrals) implementation of the gauge-including atomic orbital (GIAO) and the CSGT (continuous set of gauge transformations) methods for calculating nuclear magnetic shielding tensors at both the Hartree-Fock and density functional levels of theory is presented. Isotropic C-13, N-15, and O-17 magnetic shielding constants for several molecules, including taxol (C47H51NO14 using 1032 basis functions) are reported. Shielding tensor components determined using the GIAO and CSGT methods are found to converge to the same value at sufficiently large basis sets; however, GIAO shielding tensor components for atoms other than carbon are found to converge faster with respect to basis set size than those determined using the CSGT method for both Hartree-Fock and DFT. For molecules where electron correlation effects are significant, shielding constants determined using (gradient-corrected) pure DFT or hybrid methods (including a mixture of Hartree-Fock exchange and DFT exchange-correlation) are closer to experiment than those determined at the Hartree-Fock level of theory. For the series of molecules studied here, the RMS error for C-13 chemical shifts relative to TMS determined using the B3LYP hybrid functional with the 6-311+G(2d,p) basis is nearly three times smaller than the RMS error for shifts determined using Hartree-Fock at this same basis. Hartree-Fock C-13 chemical shifts calculated using the 6-31G* basis set give nearly the same RMS error as compared to experiment as chemical shifts obtained using Hartree-Fock with the bigger 6-311+G(2d,p) basis set for the range of molecules studied here. The RMS error for chemical shifts relative to TMS calculated at the Hartree-Fock 6-31G* level of theory for taxol (C47H51NO14) is 6.4 ppm, indicating that for large systems, this level of theory is sufficient to determine accurate C-13 chemical shifts. (C) 1996 American Institute of Physics.
no abstract.
Regression equations have been developed to predict the 13C NMR spectra of 17 ribonucleosides through the use of atomic environmental descriptors. These descriptors were calculated directly from the structure of the compounds. Fifteen compounds are used as a training set for linear regression analysis, and two compounds are used as an external prediction set. Due to the diverse nature of the atoms within the data set, the chemical shifts were divided into subsets. The results for each subset are reported. Computational neural networks are also used to predict the chemical shifts of the atoms in the subsets.
The development of Karplus‐type formulas that relate the NMR coupling constants of carbohydrates and related molecules to the dihedral angle between the coupled nuclei is surveyed, including couplings over three, four, or five bonds. Most of the major types of coupling constants of interest in carbohydrate characterization are described, including several kinds of proton–proton couplings, those between protons and heteronuclei, and others involving two heteronuclei, over a number of different coupling pathways. The equations range from simple, two‐parameter versions that depict only the torsional dependence of coupling constants, to complex 22‐parameter forms that simulate the variation of the coupling with many different molecular properties. Methods for formulation of the Karplus relationships are discussed, including multidimensional NMR methods, molecular modeling, and theoretical computations that have been used either to support the experimental measurements, or as stand‐alone techniques for elucidation of molecular geometry, and the calculation of coupling constants for direct definition of the equations. Online calculator programs for coupling constants and torsion angles are described.
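For readers unfamiliar with the general shape of a Karplus-type relationship, a minimal sketch of the common three-term form J(phi) = A cos^2(phi) + B cos(phi) + C follows. The coefficients A, B and C used as defaults are placeholders chosen for illustration; they are not taken from the surveyed paper and should be replaced by a published parameterization for the coupling pathway of interest.

```python
import math

def karplus_j(phi_deg, a=8.0, b=-1.0, c=1.0):
    """Estimate a vicinal coupling constant (Hz) from a dihedral angle in degrees.

    Generic three-term Karplus-type form. The default a, b, c are
    illustrative placeholders, not a published parameter set.
    """
    phi = math.radians(phi_deg)
    return a * math.cos(phi) ** 2 + b * math.cos(phi) + c
```

With these placeholder coefficients the curve has its largest value for an anti arrangement (180 degrees) and its smallest near 90 degrees, which is the qualitative behaviour such equations are designed to capture.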
Carbohydrate structures containing alkyl groups as aglycones are useful for investigating enzyme activity and glycan-protein interactions. Moreover, linker-containing oligosaccharides with a spacer group are commonly used to print glycan microarrays or to prepare protein-conjugates as vaccine candidates. The structural accuracy of these synthesized glycans is essential for interpretation of results from biological experiments in which the compounds have been used and NMR spectroscopy can unravel and confirm their structures. An approach for efficient (1)H and (13)C NMR chemical shift assignments employed a parallel NOAH-10 measurement followed by NMR spin-simulation to refine the (1)H NMR chemical shifts, as exemplified for a disaccharide with an azidoethyl group as an aglycone, the NMR chemical shifts of which have been used to enhance the quality of CASPER (http://www.casper.organ.su.se/casper/). The CASPER program has been further developed to aid characterization of linker-containing oligo- and polysaccharides, either by chemical shift prediction for comparison to experimental NMR data or as structural investigation of synthesized glycans based on acquired unassigned NMR data. The ability of CASPER to elucidate structures of linker-containing oligosaccharides is demonstrated and comparisons to assigned or unassigned NMR data show the utility of CASPER in supporting a proposed oligosaccharide structure. Prediction of NMR chemical shifts of an oligosaccharide, corresponding to the repeating unit of an O-antigen polysaccharide, having a linker as an aglycone and a non-natural substituent derivative thereof are presented to exemplify the diversity of structures handled. Furthermore, NMR chemical shift predictions of synthesized polysaccharides, corresponding to bacterial polysaccharides, containing a linker are described showing that in addition to oligosaccharide structures also polysaccharide structures having an aglycone spacer group can be analyzed by CASPER.
The use of NMR methods to determine the three-dimensional structures of carbohydrates and glycoproteins is still challenging, in part because of the lack of standard protocols. In order to increase the convenience of structure determination, the topology and parameter files for carbohydrates in the program Crystallography & NMR System (CNS) were investigated and new files were developed to be compatible with the standard simulated annealing protocols for proteins and nucleic acids. Recalculating the published structures of protein-carbohydrate complexes and glycosylated proteins demonstrates that the results are comparable to the published structures which employed more complex procedures for structure calculation. Integrating the new carbohydrate parameters into the standard structure calculation protocol will facilitate three-dimensional structural study of carbohydrates and glycosylated proteins by NMR spectroscopy.
A nonempirical method for the calculation of nuclear magnetic shielding based on the four current density functional formalism is presented. Using SCF-LCAO-Xα calculations with application of GIAOs, effective one-particle equations are solved. The results for nuclear magnetic shielding in diatomic molecules are of good quality, compared with other theoretical and experimental data.
A theory of NMR shielding tensors is derived from Ramsey’s expressions, using the framework of the random phase approximation (RPA) and localized molecular orbitals. By expanding angular momentum terms relative to a local origin for each orbital and using properties of the RPA solutions, we arrive at shielding expressions that contain no reference to an overall gauge origin and that lead to appropriate damping of basis set errors in contributions from distant groups. The expressions allow an analysis of the shielding into intrinsic bond and bond–bond coupling contributions. The resulting method is a variant of the coupled‐Hartree–Fock approach. Ab initio results are presented for 13C isotropic shieldings and shielding tensors in a number of organic molecules ranging in size up to benzene. These results agree very well with experiment, for both isotropic shieldings and principal tensor components, even in double‐zeta basis sets. For some low symmetry molecules we study the asymmetry of the shielding tensor and show that the antisymmetry can be quite large even if the anisotropy is moderate. The bond analyses shed light on the effects of molecular structural features on the shieldings and their anisotropies, and emphasize in particular the conformational effects in bond–bond coupling contributions.
Several methods for the calculation of indirect nuclear spin-spin coupling constants are discussed. There are two complementary approaches to the calculation of indirect nuclear spin-spin coupling constants from electronic structure theory. The spin-spin coupling constants can be calculated from an approximate electronic wave function or from an electronic density, using DFT. The standard starting point for a wave-function treatment of molecular properties is the Hartree-Fock model, which for many properties yields results in qualitative agreement with experimental measurements. It is suggested that when very accurate results are needed for small molecules, wave-function methods should be used. By carrying out calculations in a hierarchy of approximations, users can approach the exact solution systematically and estimate error bars. A number of program packages such as Aces II, Aces II MAB, Dalton, and Gaussian03, have been developed for the calculation of indirect nuclear spin-spin coupling constants.
The computer program CASPER and its algorithms are described. The program is aimed at facilitating the determination of structures of oligosaccharides and regular polysaccharides, requiring as input either the one-dimensional 1H or 13C NMR spectrum or the 2D C,H-correlation NMR spectrum together with information on components and linkages. The databases, the method of simulating spectra, options of the program, and techniques for faster calculations are described as well as an example of a structural determination.
A WWW-interface to a program for structure elucidation of oligo- and polysaccharides using NMR data, CASPER, is presented. The interface and the underlying program have been extensively tested using published data and it was able to simulate 13C NMR spectra of >200 structures with an average error of about 0.3 ppm/resonance. When applied to the repeating units of Escherichia coli O-antigens the published structures were found among the five highest ranked structures in 75% of the cases. The average deviation between calculated and experimental 13C chemical shifts was 0.45 ppm. Oligosaccharide spectra were calculated with even better accuracy (0.23 ppm/resonance) and the correct structure was ranked 1st or 2nd in all the cases examined. Additional NMR experiments that may be required to distinguish between candidate structures are aided by the assignments provided by the program. This computational approach is also suitable for use in structural confirmation of chemically or enzymatically synthesized oligosaccharides. The program is found at http://www.casper.organ.su.se/casper.
The computer program CASPER, used in the structural analysis of polysaccharides composed of repeating units, has been extended. The extended version uses either unassigned 1H- or 13C-n.m.r. chemical shifts or the complete unassigned C,H-correlation spectrum, and can predict the structure of linear and branched oligo- and poly-saccharides. The number of possible structures, consistent with sugar and methylation analysis, can be decreased by the use of 1JC,H and 3JH,H values. The database, which contains 1H- or 13C-n.m.r. chemical shift data for monosaccharides and 1H- or 13C-glycosylation shifts for all types of glycosidic linkages obtained by combination of the monosaccharides, has been increased and now also contains correction values for sugar residues present in branch-point regions. The program has been tested on four polysaccharides of known structure but with different degrees of complexity. For three polysaccharides, the correct structure was suggested; for the fourth, two structures were consistent with the n.m.r. data, one of them being correct.
Carbohydrates play an immense role in different aspects of life. NMR spectroscopy is the most powerful tool for investigation of these compounds. Nowadays, progress in computational procedures has opened up novel opportunities giving an impulse to the development of new instruments intended to make the research simpler and more efficient. In this paper, we present a new approach for simulating (13)C NMR chemical shifts of carbohydrates. The approach is suitable for any atomic observables, which could be stored in a database. The method is based on sequential generalization of the chemical surroundings of the atom under prediction and heuristic averaging of database data. Unlike existing applications, the generalization scheme is tuned for carbohydrates, including those containing phosphates, amino acids, alditols, and other non-carbohydrate constituents. It was implemented in the Glycan-Optimized Dual Empirical Spectrum Simulation (GODESS) software, which is freely available on the Internet. In the field of carbohydrates, our approach was shown to outperform all other existing methods of NMR spectrum prediction (including quantum-mechanical calculations) in accuracy. Only this approach supports NMR spectrum simulation for a number of structural features in polymeric structures.
Motivation: Carbohydrates play crucial roles in various biochemical processes and are useful for developing drugs and vaccines. However, in case of carbohydrates, the primary structure elucidation is usually a sophisticated task. Therefore, they remain the least structurally characterized class of biomolecules, and it hampers the progress in glycochemistry and glycobiology. Creating a usable instrument designed to assist researchers in natural carbohydrate structure determination would advance glycochemistry in biomedical and pharmaceutical applications. Results: We present GRASS (Generation, Ranking and Assignment of Saccharide Structures), a novel method for semi-automated elucidation of carbohydrate and derivative structures which uses unassigned 13C NMR spectra and information obtained from chromatography, optical, chemical and other methods. This approach is based on new methods of carbohydrate NMR simulation recently reported as the most accurate. It combines a broad diversity of supported structural features, high accuracy and performance. Availability: GRASS is implemented in a free web tool available at http://csdb.glycoscience.ru/grass.html. Contact: kapaev_roman@mail.ru or netbox@toukach.ru. Supplementary information: Supplementary data are available at Bioinformatics online.
The improved Carbohydrate Structure Generalization Scheme has been developed for the simulation of (13)C and (1)H NMR spectra of oligo- and polysaccharides and their derivatives, including those containing noncarbohydrate constituents found in natural glycans. Besides adding the (1)H NMR calculations, we improved the accuracy and performance of prediction and optimized the mathematical model of the precision estimation. This new approach outperformed other methods of chemical shift simulation, including database-driven, neural net-based, and purely empirical methods and quantum-mechanical calculations at high theory levels. It can process structures with rarely occurring and noncarbohydrate constituents unsupported by the other methods. The algorithm is transparent to users and allows tracking used reference NMR data to original publications. It was implemented in the Glycan-Optimized Dual Empirical Spectrum Simulation (GODESS) web service, which is freely available at the platform of the Carbohydrate Structure Database (CSDB) project ( http://csdb.glycoscience.ru).
Glycan Optimized Dual Empirical Spectrum Simulation (GODESS) is a web service, which has been recently shown to be one of the most accurate tools for simulation of (1)H and (13)C 1D NMR spectra of natural carbohydrates and their derivatives. The new version of GODESS supports visualization of the simulated (1)H and (13)C chemical shifts in the form of most 2D spin correlation spectra commonly used in carbohydrate research, such as (1)H-(1)H TOCSY, COSY/COSY-DQF/COSY-RCT, and (1)H-(13)C edHSQC, HSQC-COSY, HSQC-TOCSY, and HMBC. Peaks in the simulated 2D spectra are color-coded and labeled according to the signal assignment and can be exported in JCAMP-DX format. Peak widths are estimated empirically from the structural features. GODESS is available free of charge via the Internet at the platform of the Carbohydrate Structure Database project ( http://csdb.glycoscience.ru ).
A computerised approach to the structural analysis of unbranched regular polysaccharides is described, which is based on an evaluation of the 13C-n.m.r. spectra for all possible primary structures within the additive scheme starting from the chemical shifts of the 13C resonances of the constituent monosaccharides and the average values of the glycosylation effects. The analysis reveals a structure (or structures), the evaluated spectrum of which resembles most closely that observed. The approach has been verified by using a series of bacterial polysaccharides of known structure and, in combination with methylation analysis data, for the determination of the presently unknown structures of the O-specific polysaccharides from Salmonella arizonae O59 and O63, and Proteus hauseri O19.
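The additive scheme described in this abstract can be sketched roughly as follows: predicted shifts are obtained by adding glycosylation effects to monosaccharide shifts, and candidate structures are ranked by how closely their evaluated spectra match the observed, unassigned one. This is only an illustrative outline, not the authors' program; all function names are hypothetical, and the scoring rule (sum of absolute differences between sorted shift lists) is an assumption made for simplicity.

```python
def predict_shifts(base_shifts, glycosylation_effects):
    """Add per-carbon glycosylation effects (ppm) to monosaccharide shifts (ppm)."""
    return {
        carbon: shift + glycosylation_effects.get(carbon, 0.0)
        for carbon, shift in base_shifts.items()
    }

def spectrum_deviation(predicted, observed):
    """Sum of absolute differences between sorted, unassigned shift lists."""
    if len(predicted) != len(observed):
        raise ValueError("spectra must contain the same number of signals")
    return sum(abs(p - o) for p, o in zip(sorted(predicted), sorted(observed)))

def rank_candidates(candidates, observed):
    """Return candidate names ordered from best to worst agreement."""
    return sorted(
        candidates,
        key=lambda name: spectrum_deviation(candidates[name], observed),
    )
```

Here `candidates` maps a structure label to its list of evaluated shifts; in practice the list would be built by calling `predict_shifts` for every residue of every possible primary structure.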
GlyNest and CASPER (www.casper.organ.su.se/casper/) are two independent services aiming to predict (1)H- and (13)C-NMR chemical shifts of glycans. GlyNest estimates chemical shifts of glycans based on a spherical environment encoding scheme for each atom. CASPER is an increment rule-based approach which uses chemical shifts of the free reducing monosaccharides which are altered according to attached residues of an oligo- or polysaccharide sequence. Both services, which are located on separate, distributed, servers are now available through a common interface of the GLYCOSCIENCES.de portal (www.glycosciences.de). The predictive ability of both techniques was evaluated for a test set of 155 (13)C and 181 (1)H spectra of assigned glycan structures. The standard deviations between experimental and estimated shifts ((1)H; 0.081/0.102; (13)C 0.763/0.794; GlyNest/CASPER) are comparable for both methods and significantly better than procedures where stereochemistry is not encoded. The predictive ability of both approaches is in most cases sufficiently precise to be used for an automatic assignment of NMR-spectra. Since both procedures work efficiently and require computation times in the millisecond range on standard computers, they are well suited for the assignment of NMR spectra in high-throughput glycomics projects. The service is available at www.glycosciences.de/sweetdb/start.php?action=form_shift_estimation.
Structural determination of N- and O-linked glycans as well as polysaccharides is hampered by the limited spectral dispersion. The computerized approach CASPER, an acronym for computer assisted spectrum evaluation of regular polysaccharides, uses liquid state NMR data to elucidate carbohydrate structure based on agreement with predicted (1)H and (13)C chemical shifts. We here demonstrate developments based on multiple through-bond J-based correlations that significantly enhance the credence to the sequence connectivities proposed in the analysis exemplified by an oligosaccharide and a bacterial polysaccharide. The approach is also suitable for predicting (1)H and (13)C NMR chemical shifts of synthesized oligosaccharides and glycoconjugates, thereby corroborating a proposed structure.
A sum-over-states perturbation theory is combined with density functional methodology (SOS-DFPT) and is applied to NMR shielding tensor calculations. Individual gauges for localized orbitals (IGLO) were used. Different types of approximations for the energy difference of the ground and 'excited' states are compared. The calculations were carried out using a modified version of the deMon program. The results of NMR shielding tensor calculations using SOS-DFPT are in good agreement with those of the best post-Hartree-Fock approaches and also with experimental data. Results are presented for a number of organic and inorganic compounds (including transition metal complexes) and for a model dipeptide.
Computer-based procedures are developed for simulating the 13C NMR spectra of carbohydrates. By use of data from five sources, models are derived that relate observed chemical shifts to numerical parameters encoding aspects of the chemical environments of the corresponding carbons. Molecular mechanics techniques are used to compute parameters encoding the effects of multiple oxygen atoms on the carbon atom environments. A calibration procedure is introduced for adjusting experimental spectra to the computed models, thereby allowing valid comparisons to be made between the spectrum of an unknown and the simulated spectra of possible candidate structures. The derived models are tested by simulating the spectra of 15 compounds not included in the modeling study. Experimental spectra are used to evaluate the simulations.
Mathematical models are developed that relate the structures of monosaccharides to their 13C nuclear magnetic resonance spectra. The data set of monosaccharides consists of 55 monosaccharides in the six-membered ring configuration and 56 monosaccharides in the five-membered ring configuration. The structural environment of each carbon atom in the data set is encoded using numerical atom-based descriptors which are then used to develop linear regression models relating the 13C chemical shift to the structural features. The atom-based descriptors used in this study encode topological, geometric, and electronic information about the carbon atoms in monosaccharides. Multiple linear regression analysis is used to develop an 11-descriptor model to predict the chemical shifts of pyranoses and pyranosides and an eight-descriptor model to predict the chemical shifts of furanoses and furanosides. The models are then submitted to computational neural networks, giving improved results with final training set rms errors of 1.03 ppm for pyranoses and pyranosides and 1.58 ppm for furanoses and furanosides.
Reliability of calculated (1)H and (13)C NMR chemical shifts for various classes of organic compounds obtained with the gauge-invariant atomic orbital (GIAO) approach has been studied at the PBE/3zeta level (as implemented in PRIRODA code) using linear regression analysis with experimental data. Empirical corrections for the calculated chemical shifts delta(H,calc) = delta(PBE/3zeta) - 0.08 ppm (RMS 0.18 ppm, MAD 0.66 ppm) and delta(C,calc) = delta(PBE/3zeta) - 6.35 ppm (RMS 3.09 ppm, MAD 9.42 ppm) have been developed using the sets of 263 and 308 experimental values for (1)H and (13)C chemical shifts, respectively. The confidence intervals of NMR chemical shifts at 95% confidence probability are delta(H,calc) +/- 0.35 ppm for (1)H and delta(C,calc) +/- 6.05 ppm for (13)C.
Spectra-structure relationships were investigated for estimating the anomeric configuration, residues and type of linkages of linear and branched trisaccharides using 13C-NMR chemical shifts. For this study, 119 pyranosyl trisaccharides were used that are trimers of the alpha or beta anomers of D-glucose, D-galactose, D-mannose, L-fucose or L-rhamnose residues bonded through alpha or beta glycosidic linkages of types 1-->2, 1-->3, 1-->4, or 1-->6, as well as methoxylated and/or N-acetylated amino trisaccharides. Machine learning experiments were performed for: (1) classification of the anomeric configuration of the first unit, second unit and reducing end; (2) classification of the type of first and second linkages; (3) classification of the three residues: reducing end, middle and first residue; and (4) classification of the chain type. Our previous model for predicting the structure of disaccharides was incorporated in this new model with an improvement of the predictive power. The best results were achieved using Random Forests with 204 di- and trisaccharides for the training set; it could correctly classify 83%, 90%, 88%, 85%, 85%, 75%, 79%, 68% and 94% of the test set (69 compounds) for the nine tasks, respectively, on the basis of unassigned chemical shifts.
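The empirical corrections quoted in this abstract amount to a constant offset per nucleus plus a 95% confidence half-width. A minimal sketch of applying them is given below; the offsets (-0.08 ppm for 1H, -6.35 ppm for 13C) and half-widths (0.35 and 6.05 ppm) are the values stated in the abstract, while the function and dictionary names are hypothetical.

```python
# (offset in ppm, 95% confidence half-width in ppm), as quoted in the abstract.
CORRECTIONS = {
    "1H": (-0.08, 0.35),
    "13C": (-6.35, 6.05),
}

def corrected_shift(raw_ppm, nucleus):
    """Apply the constant offset to a raw PBE/3zeta shift.

    Returns the corrected shift and its (low, high) 95% interval.
    """
    offset, half_width = CORRECTIONS[nucleus]
    center = raw_ppm + offset
    return center, (center - half_width, center + half_width)
```

The wide 13C interval illustrates why such corrected values are better suited to coarse consistency checks than to distinguishing closely related structures.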
The performance of conjugate gradient schemes for minimizing unconstrained energy functionals in the context of condensed matter electronic structure density functional calculations is studied. The unconstrained functionals allow a straightforward application of conjugate gradients by removing the explicit orthonormality constraints on the quantum-mechanical wave functions. However, the removal of the constraints can lead to slow convergence, in particular when preconditioning is used. The convergence properties of two previously suggested energy functionals are analyzed, and a new functional is proposed, which unifies some of the advantages of the other functionals. A numerical example derived from a diamond crystal confirms the analysis.
The NMR chemical shift, a six-parameter tensor property, is highly sensitive to the position of the atoms in a molecule. To extract structural parameters from chemical shifts, one must rely on theoretical models. Therefore, a high quality group of shift tensors that serve as benchmarks to test the validity of these models is warranted and necessary to highlight existing computational limitations. Here, a set of 102 13C chemical-shift tensors measured in single crystals, from a series of aromatic and saccharide molecules for which neutron diffraction data are available, is used to survey models based on the density functional (DFT) and Hartree-Fock (HF) theories. The quality of the models is assessed by their least-squares linear regression parameters. It is observed that in general DFT outperforms restricted HF theory. For instance, Becke's three-parameter exchange method and mpw1pw91 generally provide the best predicted shieldings for this group of tensors. However, this performance is not universal, as none of the DFT functionals can predict the saccharide tensors better than HF theory. Both the orientations of the principal axis system and the magnitude of the shielding were compared using the chemical-shift distance to evaluate the quality of the calculated individual tensor components in units of ppm. Systematic shortcomings in the prediction of the principal components were observed, but the theory predicts the corresponding isotropic value more accurately. This is because these systematic errors cancel, thereby indicating that the theoretical assessment of shielding predictions based on the isotropic shift should be avoided.
The efficacy of neural network (NN) and partial least-squares (PLS) methods is compared for the prediction of NMR chemical shifts for both 1H and 13C nuclei using very large databases containing millions of chemical shifts. The chemical structure description scheme used in this work is based on individual atoms rather than functional groups. The performances of each of the methods were optimized in a systematic manner described in this work. Both of the methods, least-squares and neural network analyses, produce results of a very similar quality, but the least-squares algorithm is approximately 2-3 times faster.
The NMR spectra of carbohydrates can be calculated accurately from glycosylation shifts and corrections for steric effects. The discrepancy between calculated and experimental chemical shifts is comparable to the difference between measurements from different laboratories and sufficient to establish the structure of many oligo- and polysaccharides. Programs that combine chemical shift calculations with algorithms for the generation of trial structures provide a powerful tool for structure elucidation. The addition of a web-interface as well as the ongoing work to connect CASPER to other glycomics databases and tools on the Internet will make the program a valuable tool for interpretation of carbohydrate NMR spectra.
Starting from a description of a molecular wave function using bond orbitals a general formula was derived describing the change of an expectation value of a one electron operator with bond polarization. This equation leads to a bond additive scheme for the description of bond polarization effects containing matrix elements of the Fock-operators. For these matrix elements simple formulae were derived for the case of a point charge approximation. In a second part of the paper this formalism is used for the interpretation of the influence of the second co-ordination sphere on the chemical shift. This scheme is used for a discussion of the 1H chemical shifts of selenites. The variation of the chemical shifts in dependence of the geometry of the hydrogen bond can be understood. Furthermore the theory proved to be useful for the interpretation of the bond angle dependence of 29Si chemical shifts in silica polymorphs. Finally it is possible to study π-bond polarization of various substituents and its influence on 13C chemical shifts. Especially vinyl and cyclohexene derivatives are investigated. In the case of propenyl compounds an explanation for the cis-trans differences in chemical shifts can be given.
The semi-empirical bond polarization theory is applied to the calculation of 13C chemical-shift tensors. This method allows prediction of shift tensors with deviations from experiment comparable to the errors of the ab initio methods. In contrast to ab initio calculations, a set of empirical parameters is needed, which can be estimated from experimental chemical-shift tensors solving a set of linear equations. The coefficients of this overdetermined set of equations are bond polarization energies that must be calculated within the framework of this theory. The parameters for C-C, C-H, and C-O bonds of sp3 and sp2 hybridized carbons and C-N bonds of sp3 carbons were obtained from 606 equations formed from experimental data from 20 substances taken from the literature. The substances include sugars, aromatic compounds, amino acids, and organic acids. The mean deviation of calculated from experimental 13C chemical-shift tensor components is 9 ppm.
The effects of the conformation and hydrogen bonding on 13C isotropic chemical shifts have theoretically been investigated for β-D-glucose, D-cellobiose, and the cellobiose units of native cellulose by quantum chemistry calculations based on the DFT method. The linear relationship between the chemical shift of the C6 carbon and the torsion angle around the C6-O6 bond in the CH2OH side group, which was previously obtained in experiments, is successfully reproduced for β-D-glucose by the theoretical calculations. A similar linear relationship is also found to hold for the C4 carbon, supporting the previous finding in experiments. Moreover, the C5 chemical shift also depends on the conformation of the side group, but the conformation of the O6H hydrogen atom at the γ position may mainly contribute to the dependence for the C5 carbon through the possible formation of intramolecular hydrogen bonding. The γH-gauche effect produced by the OH hydrogen atom (γ-H) at the γ position is found, for the first time, to induce 3–5 ppm downfield shift for the carbon in question, and this effect reduces by 2–3 ppm when the intramolecular hydrogen bonding associated with γ-H is formed. Similar calculations for D-cellobiose and the cellobiose units in native cellulose reveal appreciable dependences of the C1 and C4 chemical shifts on the torsion angles ϕ and ψ around the (1 → 4)-β-glycosidic linkage. In contrast, no significant effects of different intramolecular and intermolecular hydrogen bondings forming between neighboring glucose residues are recognized on the chemical shifts of the respective carbons associated with these hydrogen bondings.
All living systems are comprised of four fundamental classes of macromolecules--nucleic acids, proteins, lipids, and carbohydrates (glycans). Glycans play a unique role of joining three principal hierarchical levels of the living world: (1) the molecular level (pathogenic agents and vaccine recognition by the immune system, metabolic pathways involving saccharides that provide cells with energy, and energy accumulation via photosynthesis); (2) the nanoscale level (cell membrane mechanics, structural support of biomolecules, and the glycosylation of macromolecules); (3) the microscale and macroscale levels (polymeric materials, such as cellulose, starch, glycogen, and biomass). NMR spectroscopy is the most powerful research approach for getting insight into the solution structure and function of carbohydrates at all hierarchical levels, from monosaccharides to oligo- and polysaccharides. Recent progress in computational procedures has opened up novel opportunities to reveal the structural information available in the NMR spectra of saccharides and to advance our understanding of the corresponding biochemical processes. The ability to predict the molecular geometry and NMR parameters is crucial for the elucidation of carbohydrate structures. In the present paper, we review the major NMR spectrum simulation techniques with regard to chemical shifts, coupling constants, relaxation rates and nuclear Overhauser effect prediction applied to the three levels of glycomics. Outstanding development in the related fields of genomics and proteomics has clearly shown that it is the advancement of research tools (automated spectrum analysis, structure elucidation, synthesis, sequencing and amplification) that drives the large challenges in modern science. Combining NMR spectroscopy and the computational analysis of structural information encoded in the NMR spectra reveals a way to the automated elucidation of the structure of carbohydrates.
A computer-assisted approach to the prediction of the primary structures of regular glycopolymers is described. The analysis is based on comparing the calculated 13C NMR spectra of all the possible structures of the repeating unit (for the given monomeric composition) to an experimental 13C NMR spectrum. The spectra generation is based on the spectral database containing information on the 13C chemical shifts of monomers, di- and trimeric fragments. If the required data are missing from this database, the special database for average glycosylation effects is used. The analysis reveals those structures with the calculated 13C NMR spectrum most close to observed. The structures of repeating units of any topology containing up to six residues linked by glycosidic, amidic or phospho-diester bridges can be predicted. Unambiguous selection of the proper structure from the output list of possible structures may require additional experimental data. Testing the created program and databases on bacterial polysaccharides and their derivatives containing up to three non-sugar residues (alditols, amino acids, phosphate groups etc.) per repeating unit revealed the good convergence of prediction with independently obtained structural data.
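The core idea of this abstract, ranking candidate repeating units by how closely their calculated 13C spectrum matches the observed one, can be sketched as follows. This is an illustrative assumption, not the published program: the function names, the use of an RMS deviation over sorted peak lists, and the shift values are all hypothetical.

```python
import math


def spectrum_deviation(calculated, experimental):
    """RMS deviation (ppm) between two peak lists, paired after sorting by shift."""
    if len(calculated) != len(experimental):
        raise ValueError("peak lists must have the same number of signals")
    pairs = zip(sorted(calculated), sorted(experimental))
    return math.sqrt(sum((c - e) ** 2 for c, e in pairs) / len(experimental))


def rank_candidates(candidates, experimental):
    """Return (deviation, name) tuples, best-matching candidate first."""
    scored = [(spectrum_deviation(shifts, experimental), name)
              for name, shifts in candidates.items()]
    return sorted(scored)


# Made-up shifts (ppm) for two hypothetical candidate structures.
experimental = [101.2, 77.5, 74.1, 72.3, 70.0, 61.8]
candidates = {
    "candidate_A": [101.0, 77.9, 74.0, 72.5, 69.8, 61.9],
    "candidate_B": [98.4, 80.2, 75.3, 71.0, 68.1, 63.5],
}
for deviation, name in rank_candidates(candidates, experimental):
    print(f"{name}: {deviation:.2f} ppm")
```

As the abstract notes, several candidates may score similarly, so the top of such a list is a shortlist for further experiments rather than a final assignment.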
A 1H NMR database computer program has been developed to determine the primary structure of complex carbohydrates. The database contains carbohydrate structures, their corresponding 1H NMR data, and literature references. From an input list of chemical shift values, the program generates an output list of partially or completely matching carbohydrate structures. In order to facilitate the recognition of the matching part of the selected carbohydrate structures, these structures are displayed with the matching structural elements highlighted. This new 1H NMR database, together with the search program described, now provides a fast access to the published 1H NMR data of complex carbohydrates and furnishes easy links to carbohydrate structures. The performance of the program is demonstrated by the analysis of five carbohydrate fractions prepared from a pool of horse serum glycoproteins.
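A search of the kind described here (an input list of chemical shifts returning partially or completely matching structures) can be sketched with a tolerance-based peak match. The data layout, the function names, the 0.02 ppm tolerance, and the example entries are assumptions for illustration; the abstract does not specify how the original program scores matches.

```python
def match_fraction(query_shifts, reference_shifts, tolerance=0.02):
    """Fraction of query peaks paired with a distinct reference peak within tolerance (ppm)."""
    remaining = sorted(reference_shifts)
    matched = 0
    for q in sorted(query_shifts):
        for i, r in enumerate(remaining):
            if abs(q - r) <= tolerance:
                matched += 1
                del remaining[i]  # each reference peak may be used only once
                break
    return matched / len(query_shifts)


def search_database(query_shifts, database, tolerance=0.02):
    """Return (name, fraction) for every entry with at least one match, best first."""
    hits = []
    for name, shifts in database.items():
        fraction = match_fraction(query_shifts, shifts, tolerance)
        if fraction > 0:
            hits.append((name, fraction))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical database entries with made-up 1H shifts (ppm).
database = {
    "structure_A": [5.10, 4.505, 3.20],
    "structure_B": [5.40, 4.50],
}
print(search_database([5.10, 4.50], database))
```

A fraction of 1.0 corresponds to a complete match and anything lower to a partial match, mirroring the two kinds of hits the abstract mentions.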
Antibody-carbohydrate interactions play central roles in stimulating adverse immune reactions. The most familiar example of such a process is the reaction observed in ABO-incompatible blood transfusion and organ transplantation. The ABO blood groups are defined by the presence of specific carbohydrates expressed on the surface of red blood cells. Preformed antibodies in the incompatible recipient (i.e., different blood groups) recognize cells exhibiting host-incompatible ABO system antigens and proceed to initiate lysis of the incompatible cells. Pig-to-human xenotransplantation presents a similar immunological barrier. Antibodies present in humans recognize carbohydrate antigens on the surface of pig organs as foreign and proceed to initiate hyperacute xenograft rejection. The major carbohydrate xenoantigens all bear terminal Gal alpha(1,3)Gal epitopes (or alphaGal). In this study, we have developed and validated a site mapping technique to investigate protein-ligand recognition and applied it to antibody-carbohydrate systems. This site mapping technique involves the use of molecular docking to generate a series of antibody-carbohydrate complexes, followed by analysis of the hydrogen bonding and van der Waals interactions occurring in each complex. The technique was validated by application to a series of antibody-carbohydrate crystal structures. In each case, the majority of interactions made in the crystal structure complex were able to be reproduced. The technique was then applied to investigate xenoantigen recognition by a panel of monoclonal anti-alphaGal antibodies. The results indicate that there is a significant overlap of the antibody regions engaging the xenoantigens across the panel. Likewise, similar regions of the xenoantigens interact with the antibodies.
Unraveling the structure of lectin-carbohydrate complexes is vital for understanding key biological recognition processes and for the development of glycomimetic drugs. Applying molecular docking to predict them is challenging due to their low affinity, hydrophilic nature and ligand conformational diversity. In the last decade several strategies, such as the inclusion of glycan conformation-specific scoring functions or our solvent-site biased method, have improved carbohydrate docking performance, but significant challenges remain, in particular those related to receptor conformational diversity. In the present work we have analyzed conventional and solvent-site biased AutoDock4 performance concerning receptor conformational diversity as derived from different crystal structures (apo and holo), molecular dynamics snapshots and homology-based models, for 14 different lectin-monosaccharide complexes. Our results show that both conventional and biased docking yield accurate lectin-monosaccharide complexes, starting from either apo or homology-based structures, even when only moderate (45%) sequence identity templates are available. An essential element for success is a proper combination of a middle-sized (10-100 structures) conformational ensemble, derived either from molecular dynamics or multiple homology model building. Consistent with our previous works, results show that solvent-site biased methods improve overall performance, but that results are still highly system dependent. Finally, our results also show that docking can select the correct receptor structure within the ensemble, underscoring the relevance of joint evaluation of both ligand pose and receptor conformation.
Biomolecular NMR spectroscopy has limitations in the determination of protein structures: an inherent size limit and the requirement for expensive and potentially difficult isotope labelling pose considerable hurdles. Therefore, structural analysis of larger proteins is almost exclusively performed by crystallography. However, the diversity of biological NMR applications outperforms that of any other structural biology technique. For the characterization of transient complexes formed by proteins and small ligands, notably oligosaccharides, one NMR technique has recently proven to be particularly powerful: saturation-transfer difference NMR (STD-NMR) spectroscopy. STD-NMR experiments are fast and simple to set up, with no general protein size limit and no requirement for isotope labelling. The method performs best in the moderate-to-low affinity range that is of interest in most of glycobiology. With small amounts of unlabelled protein, STD-NMR experiments can identify hits from mixtures of potential ligands, characterize mutant proteins and pinpoint binding epitopes on the ligand side. STD-NMR can thus be employed to complement and improve protein-ligand complex models obtained by other structural biology techniques or by purely computational means. With a set of protein-glycan interactions from our own work, this review provides an introduction to the technique for structural biologists. It exemplifies how crystallography and STD-NMR can be combined to elucidate protein-glycan (and other protein-ligand) interactions in atomic detail, and how the technique can extend structural biology from simplified systems amenable to crystallization to more complex biological entities such as membranes, live viruses or entire cells.
Structural biology plays a key role in understanding how networks of protein interactions with their partners are organized at the atomic level. In this review, we show that NMR is a very efficient method to solve 3D structures of protein–RNA and protein–carbohydrate complexes of high quality. We explain the importance of studying such interactions and describe the main steps that are required to determine structures of these types of complexes by NMR. Finally, we show that X-ray crystallography and NMR are complementary methods and briefly report on advantages and disadvantages of each approach.
Biomass recalcitrance, the resistance of cellulosic biomass to degradation, is due in part to the stability of the hydrogen bond network and stacking forces between the polysaccharide chains in cellulose microfibers. The fragment molecular orbital (FMO) method at the correlated Møller-Plesset second order perturbation level of theory was used on a model of the crystalline cellulose Iα core with a total of 144 glucose units. These computations show that the intersheet chain interactions are stronger than the intrasheet chain interactions for the crystalline structure, while they are more similar to each other for a relaxed structure. An FMO chain pair interaction energy decomposition analysis for both the crystal and relaxed structures reveals an intricate interplay between electrostatic, dispersion, charge transfer, and exchange repulsion effects. The role of the primary alcohol groups in stabilizing the interchain hydrogen bond network in the inner sheet of the crystal and relaxed structures of cellulose Iα, where edge effects are absent, was analyzed. The maximum attractive intrasheet interaction is observed for the GT-TG residue pair with one intrasheet hydrogen bond, suggesting that the relative orientation of the residues is as important as the hydrogen bond network in strengthening the interaction between the residues.
Docking methods are a valuable tool for the prediction of carbohydrate binding sites and the design of carbohydrate-based drugs. However, there are also significant limitations and care needs to be taken when evaluating the docking results. In this chapter the challenges, limitations, and possible pitfalls in docking of carbohydrates are described. Practical examples explain the use of docking methods for the rational design of carbohydrate-based inhibitors as well as the prediction of carbohydrate binding sites.
Carbohydrate molecules are essential actors in key biological events, being involved as recognition points for cell–cell and cell–matrix interactions related to health and disease. Despite outstanding advances in cryoEM, X-ray crystallography and NMR still remain the most employed techniques to unravel their conformational features and to describe the structural details of their interactions with biomolecular receptors. Given the intrinsic flexibility of saccharides, NMR methods are of paramount importance to deduce the extent of motion around their glycosidic linkages and to explore their receptor-bound conformations. We herein present our particular view on the latest advances in NMR methodologies that are broadening their application to deducing glycan conformation and dynamics and to understanding the recognition events in which they are involved.
Impressive improvements in docking performance can be achieved by applying energy bonuses to poses in which glycan hydroxyl groups occupy positions otherwise preferred by bound waters. In addition, inclusion of glycosidic conformational energies allows unlikely glycan conformations to be appropriately penalized. A method for predicting the binding specificity of glycan-binding proteins has been developed, which is based on grafting glycan branches onto a minimal binding determinant in the binding site. Grafting can be used either to screen virtual libraries of glycans, such as the known glycome, or to identify docked poses of minimal binding determinants that are consistent with specificity data. The reviewed advances allow accurate modelling of carbohydrate-protein 3D co-complexes, but challenges remain in ranking the affinity of congeners.
no abstract.
A variety of computational techniques may be applied to compute theoretical binding free energies for protein-carbohydrate complexes. Elucidation of the intermolecular interactions, as well as the thermodynamic effects, that contribute to the relative strength of receptor binding can shed light on biomolecular recognition, and the resulting initiation or inhibition of a biological process. Three types of free energy methods are discussed here, including MM-PB/GBSA, thermodynamic integration, and a non-equilibrium alternative utilizing SMD. Throughout this chapter, the well-known concanavalin A lectin is employed as a model system to demonstrate the application of these methods to the special case of carbohydrate binding.
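Of the three methods named in this abstract, thermodynamic integration is the easiest to illustrate numerically: the free energy change is the integral over the coupling parameter λ (from 0 to 1) of the ensemble average of dU/dλ. The sketch below applies the trapezoidal rule to per-window averages. The function name, the λ schedule, and the numbers are made up; in practice the averages would come from simulations at each λ window.

```python
def thermodynamic_integration(lambdas, mean_du_dlambda):
    """Trapezoidal estimate of the integral of <dU/dlambda> over the given lambda grid."""
    if len(lambdas) != len(mean_du_dlambda) or len(lambdas) < 2:
        raise ValueError("need matching lists with at least two lambda windows")
    total = 0.0
    for i in range(1, len(lambdas)):
        width = lambdas[i] - lambdas[i - 1]
        total += 0.5 * width * (mean_du_dlambda[i] + mean_du_dlambda[i - 1])
    return total


# Made-up per-window averages of dU/dlambda (e.g. in kcal/mol).
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
averages = [12.4, 8.1, 3.9, -0.6, -4.2]
print(thermodynamic_integration(lambdas, averages))
```

A binding free energy would typically be obtained as the difference between two such integrals (ligand in the complex and ligand in solvent), closed through a thermodynamic cycle.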
Thermodynamic information can be inferred from static atomic configurations. To model the thermodynamics of carbohydrate binding to proteins accurately, a large binding data set has been assembled from the literature. The data set contains information from 262 unique protein-carbohydrate crystal structures for which experimental binding information is known. Hydrogen atoms were added to the structures and training conformations were generated with the automated docking program AutoDock 3.06, resulting in a training set of 225,920 all-atom conformations. In all, 288 formulations of the AutoDock 3.0 free energy model were trained against the data set, testing each of four alternate methods of computing the van der Waals, solvation, and hydrogen-bonding energetic components. The van der Waals parameters from AutoDock 1 produced the lowest errors, and an entropic model derived from statistical mechanics produced the only models with five physically and statistically significant coefficients. Eight models predict the Gibbs free energy of binding with an error of less than 40% of the error of any similar models previously published.
Interactions between proteins and carbohydrates are ubiquitous in biology. Therefore, understanding the factors that determine their affinity and selectivity is correspondingly important. Herein, we have determined the relative strengths of intramolecular interactions between a series of monosaccharides and an aromatic ring close to the glycosylation site in an N-glycoprotein host. We employed the enhanced aromatic sequon, a structural motif found in the reverse turns of some N-glycoproteins, to facilitate face-to-face monosaccharide-aromatic interactions. A protein host was used because the dependence of the folding energetics on the identity of the monosaccharide can be accurately measured to assess the strength of the carbohydrate-aromatic interaction. Our data demonstrate that the carbohydrate-aromatic interaction strengths are moderately affected by changes in the stereochemistry and identity of the substituents on the pyranose rings of the sugars. Galactose seems to make the weakest and allose the strongest sugar-aromatic interactions, with glucose, N-acetylglucosamine (GlcNAc) and mannose in between. The NMR solution structures of several of the monosaccharide-containing N-glycoproteins were solved to further understand the origins of the similarities and differences between the monosaccharide-aromatic interaction energies. Peracetylation of the monosaccharides substantially increases the strength of the sugar-aromatic interaction in the context of our N-glycoprotein host. Finally, we discuss our results in light of recent literature regarding the contribution of electrostatics to CH-pi interactions and speculate on what our observations imply about the absolute conservation of GlcNAc as the monosaccharide through which N-linked glycans are attached to glycoproteins in eukaryotes.
The scoring function is one of the most important components in structure-based drug design. Despite considerable success, accurate and rapid prediction of protein-ligand interactions is still a challenge in molecular docking. In this perspective, we have reviewed three basic types of scoring functions (force-field, empirical, and knowledge-based) and the consensus scoring technique that are used for protein-ligand docking. The commonly-used assessment criteria and publicly available protein-ligand databases for performance evaluation of the scoring functions have also been presented and discussed. We end with a discussion of the challenges faced by existing scoring functions and possible future directions for developing improved scoring functions.
Protein-carbohydrate interactions play pivotal roles in health and disease. However, defining and manipulating these interactions has been hindered by an incomplete understanding of the underlying fundamental forces. To elucidate common and discriminating features in carbohydrate recognition, we have analyzed quantitatively X-ray crystal structures of proteins with noncovalently bound carbohydrates. Within the carbohydrate-binding pockets, aliphatic hydrophobic residues are disfavored, whereas aromatic side chains are enriched. The greatest preference is for tryptophan with an increased prevalence of 9-fold. Variations in the spatial orientation of amino acids around different monosaccharides indicate specific carbohydrate C-H bonds interact preferentially with aromatic residues. These preferences are consistent with the electronic properties of both the carbohydrate C-H bonds and the aromatic residues. Those carbohydrates that present patches of electropositive saccharide C-H bonds engage more often in CH-pi interactions involving electron-rich aromatic partners. These electronic effects are also manifested when carbohydrate-aromatic interactions are monitored in solution: NMR analysis indicates that indole favorably binds to electron-poor C-H bonds of model carbohydrates, and a clear linear free energy relationship with substituted indoles supports the importance of complementary electronic effects in driving protein-carbohydrate interactions. Together, our data indicate that electrostatic and electronic complementarity between carbohydrates and aromatic residues play key roles in driving protein-carbohydrate complexation. Moreover, these weak noncovalent interactions influence which saccharide residues bind to proteins, and how they are positioned within carbohydrate-binding sites.
Norovirus is a major pathogen of nonbacterial acute gastroenteritis in humans and animals. Carbohydrate recognition between norovirus capsid proteins and Lewis antigens is considered to play a critical role in initiating infection of eukaryotic cells. In this article, we first report a detailed atomistic simulation study of the norovirus capsid protein in complex with the Lewis antigen based on ab initio QM/MM combined with MD-FEP simulations. To understand the mechanistic details of ligand binding, we analyzed and compared the carbohydrate recognition mechanism of the wild-type P domain protein with a mutant protein. Small structural differences between two capsid proteins are observed on the weak interaction site of residue 389, which is located on the solvent exposed surface of the P domain. To further clarify affinity differences in ligand binding, we directly evaluated free energy changes of the ligand binding process. Although the mutant protein loses its interaction energy with the Lewis antigen, this small amount of energy penalty is compensated for by an increase in the solvation stability, which is induced by structural reorganization at the ligand binding site on the protein surface. As a sum of these opposite energy components, the mutant P domain obtains a slightly enhanced binding affinity for the Lewis antigen. The present computational study clearly demonstrated that a detailed free energy balance of the interaction energy between the capsid protein and the surrounding aqueous solvent is the mechanistic basis of carbohydrate recognition in the norovirus capsid protein.
Protein-ligand docking is an essential technique in computer-aided drug design. While generally available docking programs work well for most drug classes, carbohydrates and carbohydrate-like compounds are often problematic for docking. We present a new docking method specifically designed to handle docking of carbohydrate-like compounds. BALLDock/SLICK combines an evolutionary docking algorithm for flexible ligands and flexible receptor side chains with carbohydrate-specific scoring and energy functions. The scoring function has been designed to identify accurate ligand poses, while the energy function yields accurate estimates of the binding free energies of these poses. On a test set of known protein-sugar complexes we demonstrate the ability of the approach to generate correct poses for almost all of the structures and achieve very low mean errors for the predicted binding free energies.
Protein-carbohydrate interactions are increasingly being recognized as essential for many important biomolecular recognition processes. From these, numerous biomedical applications arise in areas as diverse as drug design, immunology, or drug transport. We introduce SLICK, a package containing a scoring and an energy function, which were specifically designed to predict binding modes and free energies of sugars and sugarlike compounds to proteins. SLICK accounts for van der Waals interactions, solvation effects, electrostatics, hydrogen bonds, and CH...pi interactions, the latter being a particular feature of most protein-carbohydrate interactions. Parameters for the empirical energy function were calibrated on a set of high-resolution crystal structures of protein-sugar complexes with known experimental binding free energies. We show that SLICK predicts the binding free energies of predicted complexes (through molecular docking) with high accuracy. SLICK is available as part of our molecular modeling package BALL (www.ball-project.org).
S. Makeneni, D.F. Thieker, R.J. Woods.
Applying Pose Clustering and MD Simulations To Eliminate False Positives in Molecular Docking.
Journal of chemical information and modeling, 2018. 58(3): 605-614.
DOI: 10.1021/acs.jcim.7b00588
In this work, we developed a computational protocol that employs multiple molecular docking experiments, followed by pose clustering, molecular dynamic simulations (10 ns), and energy rescoring to produce reliable 3D models of antibody-carbohydrate complexes. The protocol was applied to 10 antibody-carbohydrate co-complexes and three unliganded (apo) antibodies. Pose clustering significantly reduced the number of potential poses. For each system, 15 or fewer clusters out of 100 initial poses were generated and chosen for further analysis. Molecular dynamics (MD) simulations allowed the docked poses to either converge or disperse, and rescoring increased the likelihood that the best-ranked pose was an acceptable pose. This approach is amenable to automation and can be a valuable aid in determining the structure of antibody-carbohydrate complexes provided there is no major side chain rearrangement or backbone conformational change in the H3 loop of the CDR regions. Further, the basic protocol of docking a small ligand to a known binding site, clustering the results, and performing MD with a suitable force field is applicable to any protein ligand system.
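The pose-clustering step of this protocol can be sketched with a simple leader algorithm: poses are visited in rank order and each one joins the first cluster whose representative lies within an RMSD cutoff, otherwise it starts a new cluster. The abstract does not state which clustering algorithm or cutoff the authors used, so the algorithm choice, the 2.0 Å cutoff, the function names, and the coordinates below are all assumptions.

```python
import math


def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples.

    No superposition is performed: docked poses share the receptor frame.
    """
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(pose_a, pose_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(pose_a))


def cluster_poses(poses, cutoff=2.0):
    """Leader clustering; returns lists of pose indices, representative first."""
    clusters = []
    for index, pose in enumerate(poses):
        for members in clusters:
            if rmsd(pose, poses[members[0]]) <= cutoff:
                members.append(index)
                break
        else:
            clusters.append([index])  # no cluster was close enough
    return clusters


# Three made-up two-atom poses, assumed to be sorted by docking score.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)],
]
print(cluster_poses(poses))
```

One representative per cluster would then be carried forward to MD simulation and rescoring, which is where the abstract reports the reduction from 100 initial poses to 15 or fewer.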
Numerous biological processes are dependent on the interaction of proteins with carbohydrates, rendering glycan-binding proteins as potential targets for new medicines. With the rapid growth in genomic data, novel computational methods can be applied to pinpoint putative carbohydrate-binding sites in proteins, for experimental validation. Here, we review the recent developments in the structure and sequence-based protein carbohydrate binding site prediction methods. An attempt has been made to show the developments and the applications for each such method in a chronological order.
Understanding the dynamics of protein-ligand interactions, which lie at the heart of host-pathogen recognition, represents a crucial step to clarify the molecular determinants implicated in binding events, as well as to optimize the design of new molecules with therapeutic aims. Over the last decade, advances in complementary biophysical and spectroscopic methods permitted us to deeply dissect the fine structural details of biologically relevant molecular recognition processes with high resolution. This Review focuses on the development and use of modern nuclear magnetic resonance (NMR) techniques to dissect binding events. These spectroscopic methods, complementing X-ray crystallography and molecular modeling methodologies, will be taken into account as indispensable tools to provide a complete picture of protein-glycoconjugate binding mechanisms related to biomedicine applications against infectious diseases.
Docking algorithms that aim to be applicable to a broad range of ligands suffer reduced accuracy because they are unable to incorporate ligand-specific conformational energies. Here, we develop a set of Carbohydrate Intrinsic (CHI) energy functions that quantify the conformational properties of oligosaccharides, based on the values of their glycosidic torsion angles. The relative energies predicted by the CHI energy functions mirror the conformational distributions of glycosidic linkages determined from a survey of oligosaccharide-protein complexes in the protein data bank. Addition of CHI energies to the standard docking scores in Autodock 3, 4.2, and Vina consistently improves pose ranking of oligosaccharides docked to a set of anticarbohydrate antibodies. The CHI energy functions are also independent of docking algorithm, and with minor modifications, may be incorporated into both theoretical modeling methods, and experimental NMR or X-ray structure refinement programs.
Molecular docking programs are primarily designed to align rigid, drug-like fragments into the binding sites of macromolecules and frequently display poor performance when applied to flexible carbohydrate molecules. A critical source of flexibility within an oligosaccharide is the glycosidic linkages. Recently, Carbohydrate Intrinsic (CHI) energy functions were reported that attempt to quantify the glycosidic torsion angle preferences. In the present work, the CHI-energy functions have been incorporated into the AutoDock Vina (ADV) scoring function, subsequently termed Vina-Carb (VC). Two user-adjustable parameters have been introduced, namely, a CHI-energy weight term (chi_coeff) that affects the magnitude of the CHI-energy penalty and a CHI-cutoff term (chi_cutoff) that negates CHI-energy penalties below a specified value. A data set consisting of 101 protein-carbohydrate complexes and 29 apoprotein structures was used in the development and testing of VC, including antibodies, lectins, and carbohydrate binding modules. Accounting for the intramolecular energies of the glycosidic linkages in the oligosaccharides during docking led VC to produce acceptable structures within the top five ranked poses in 74% of the systems tested, compared to a success rate of 55% for ADV. An enzyme system was employed in order to illustrate the potential application of VC to proteins that may distort glycosidic linkages of carbohydrate ligands upon binding. VC represents a significant step toward accurately predicting the structures of protein-carbohydrate complexes. Furthermore, the described approach is conceptually applicable to any class of ligands that populate well-defined conformational states.
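The roles of the two parameters named in this abstract can be illustrated with a toy rescoring function: per-linkage CHI energies below chi_cutoff are ignored, the rest are summed, scaled by chi_coeff, and added to the docking score. Only the parameter names come from the abstract. The exact functional form used inside Vina-Carb is not given there, so the formula, the function name, the placeholder default values, and the numbers below are assumptions.

```python
def chi_penalized_score(docking_score, linkage_chi_energies, chi_coeff=1.0, chi_cutoff=2.0):
    """Add a weighted conformational penalty to a docking score (lower is better).

    docking_score        -- score from the docking program (e.g. kcal/mol)
    linkage_chi_energies -- one CHI energy per glycosidic linkage in the pose
    chi_coeff            -- weight applied to the summed penalty (placeholder default)
    chi_cutoff           -- energies below this value contribute nothing (placeholder default)
    """
    penalty = sum(e for e in linkage_chi_energies if e >= chi_cutoff)
    return docking_score + chi_coeff * penalty


# Made-up pose: one relaxed linkage (0.5) and one strained linkage (3.0).
print(chi_penalized_score(-7.0, [0.5, 3.0]))
print(chi_penalized_score(-7.0, [0.5, 3.0], chi_cutoff=0.0))
```

Raising chi_cutoff tolerates mildly strained linkages, which matters for the enzyme case mentioned in the abstract, where the protein may distort the ligand on binding.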
The article reviews the significant contributions to, and the present status of, applications of computational methods for the characterization and prediction of protein-carbohydrate interactions. After a presentation of the specific features of carbohydrate modeling, along with a brief description of the experimental data and general features of carbohydrate-protein interactions, the survey provides a thorough coverage of the available computational methods and tools. At the quantum-mechanical level, the use of both molecular orbitals and density-functional theory is critically assessed. These are followed by a presentation and critical evaluation of the applications of semiempirical and empirical methods: QM/MM, molecular dynamics, free-energy calculations, metadynamics, molecular robotics, and others. The usefulness of molecular docking in structural glycobiology is evaluated by considering recent docking-validation studies on a range of protein targets. The range of applications of these theoretical methods provides insights into the structural, energetic, and mechanistic facets that occur in the course of the recognition processes. Selected examples are provided to exemplify the usefulness and the present limitations of these computational methods in their ability to assist in elucidation of the structural basis underlying the diverse function and biological roles of carbohydrates in their dialogue with proteins. These test cases cover the field of both carbohydrate biosynthesis and glycosyltransferases, as well as glycoside hydrolases. The phenomenon of (macro)molecular recognition is illustrated for the interactions of carbohydrates with such proteins as lectins, monoclonal antibodies, GAG-binding proteins, porins, and viruses.
Protein-carbohydrate interactions are involved in various essential biological events. 3D structural data from the Protein Data Bank (PDB) can help to understand the molecular basis of the specificity of carbohydrate recognition by proteins. Such interactions can be analyzed statistically using GlyVicinity. This chapter exemplifies the usage of this tool to find information on the frequency of the occurrence of specific amino acids in the vicinity of individual carbohydrate residues and to analyze the type of interacting atoms and their spatial distribution around the glycans.
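The kind of statistic described here, how often each amino acid occurs near a carbohydrate residue, can be sketched as follows. This is not GlyVicinity's implementation: the function name, the input layout, the 4.0 Å cutoff and the toy coordinates are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def residues_near(carb_atoms, protein_atoms, cutoff=4.0):
    """Count amino-acid types with at least one atom within `cutoff` of the sugar.

    carb_atoms:    list of (x, y, z) coordinates of one carbohydrate residue
    protein_atoms: list of (residue_name, residue_number, (x, y, z))
    Each protein residue is counted once, however many of its atoms are close.
    """
    close = set()
    for resname, resnum, xyz in protein_atoms:
        if any(math.dist(xyz, c) <= cutoff for c in carb_atoms):
            close.add((resname, resnum))
    return Counter(resname for resname, _ in close)


# Toy coordinates (hypothetical, not taken from any PDB entry)
sugar = [(0.0, 0.0, 0.0)]
protein = [
    ("TRP", 10, (3.0, 0.0, 0.0)),
    ("TRP", 10, (3.5, 0.0, 0.0)),   # second atom of the same residue
    ("ASP", 11, (0.0, 3.9, 0.0)),
    ("GLY", 12, (10.0, 0.0, 0.0)),  # too far away
]
print(residues_near(sugar, protein))
```

Summing such counters over many PDB entries would give the frequency tables the chapter refers to; a real analysis would also need to handle alternate locations, symmetry mates and covalently attached glycans.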
Critical Assessment of PRediction of Interactions (CAPRI) rounds 37 through 45 introduced larger complexes, new macromolecules, and multistage assemblies. For these rounds, we used and expanded docking methods in Rosetta to model 23 target complexes. We successfully predicted 14 target complexes and recognized and refined near-native models generated by other groups for two further targets. Notably, for targets T110 and T136, we achieved the closest prediction of any CAPRI participant. We created several innovative approaches during these rounds. Since round 39 (target 122), we have used the new RosettaDock 4.0, which has a revamped coarse-grained energy function and the ability to perform conformer selection during docking with hundreds of pregenerated protein backbones. Ten of the complexes had some degree of symmetry in their interactions, so we tested Rosetta SymDock, realized its shortcomings, and developed the next-generation symmetric docking protocol, SymDock2, which includes docking of multiple backbones and induced-fit refinement. Since the last CAPRI assessment, we also developed methods for modeling and designing carbohydrates in Rosetta, and we used them to successfully model oligosaccharide-protein complexes in round 41. Although the results were broadly encouraging, they also highlighted the pressing need to invest in (a) flexible docking algorithms with the ability to model loop and linker motions and in (b) new sampling and scoring methods for oligosaccharide-protein interactions.
Glycosaminoglycans represent a class of linear anionic periodic polysaccharides, which play a key role in a variety of biological processes in the extracellular matrix via interactions with their protein targets. Computationally, glycosaminoglycans are very challenging due to their high flexibility, periodicity and electrostatics-driven nature of the interactions with their protein counterparts. In this work, we carry out a detailed computational characterization of the interactions in protein-glycosaminoglycan complexes from the Protein Data Bank (PDB), which are split into two subsets accounting for their intrinsic nature: non-enzymatic-protein-glycosaminoglycan and enzyme-glycosaminoglycan complexes. We apply molecular dynamics to analyze the differences in these two subsets in terms of flexibility, retainment of the native interactions in the simulations, free energy components of binding and contributions of protein residue types to glycosaminoglycan binding. Furthermore, we systematically demonstrate that protein electrostatic potential calculations, previously found to be successful for glycosaminoglycan binding sites prediction for individual systems, are in general very useful for proposing protein surface regions as putative glycosaminoglycan binding sites, which can be further used for local docking calculations with these particular polysaccharides. Finally, the performance of six different docking programs (Autodock 3, Autodock Vina, MOE, eHiTS, FlexX and Glide), some of which proved to perform well for particular protein-glycosaminoglycan complexes in previous work, is evaluated on the complete protein-glycosaminoglycan data set from the PDB. This work contributes to widen our knowledge of protein-glycosaminoglycan molecular recognition and could be useful to steer a choice of the strategies to be applied in theoretical studies of these systems.
Glycosaminoglycans (GAGs) are anionic polysaccharides, which participate in key processes in the extracellular matrix by interactions with protein targets. Due to their charged nature, accurate consideration of electrostatic and water-mediated interactions is indispensable for understanding GAGs binding properties. However, solvent is often overlooked in molecular recognition studies. Here we analyze the abundance of solvent in GAG-protein interfaces and investigate the challenges of adding explicit solvent in GAG-protein docking experiments. We observe PDB GAG-protein interfaces being significantly more hydrated than protein-protein interfaces. Furthermore, by applying molecular dynamics approaches we estimate that about half of GAG-protein interactions are water-mediated. With a dataset of eleven GAG-protein complexes we analyze how solvent inclusion affects Autodock 3, eHiTS, MOE and FlexX docking. We develop an approach to de novo place explicit solvent into the binding site prior to docking, which uses the GRID program to predict positions of waters and to locate possible areas of solvent displacement upon ligand binding. To investigate how solvent placement affects docking performance, we compare these results with those obtained by taking into account information about the solvent position in the crystal structure. In general, we observe that inclusion of solvent improves the results obtained with these methods. Our data show that Autodock 3 performs best, though it has difficulty quantitatively reproducing experimental data on the specificity of heparin/heparan sulfate disaccharide binding to IL-8. Our work highlights the current challenges of introducing solvent in protein-GAGs recognition studies, which is crucial for exploiting the full potential of these molecules for rational engineering.
Glycosaminoglycans (GAGs) play key roles in virtually all biologic responses through their interaction with proteins. A major challenge in understanding these roles is their massive structural complexity. Computational approaches are extremely useful in navigating this bottleneck and, in some cases, the only avenue to gain comprehensive insight. We discuss the state-of-the-art on computational approaches and present a flowchart to help answer most basic, and some advanced, questions on GAG-protein interactions. For example, firstly, does my protein bind to GAGs?; secondly, where does the GAG bind?; thirdly, does my protein preferentially recognize a particular GAG type?; fourthly, what is the most optimal GAG chain length?; fifthly, what is the structure of the most favored GAG sequence?; and finally, is my GAG-protein system 'specific', 'non-specific', or a combination of both? Recent advances show the field is now poised to enable a non-computational researcher perform advanced experiments through the availability of various tools and online servers.
This volume focuses on solution and solid-state NMR of carbohydrates, glycoproteins, glyco-technologies, biomass and related topics. It is estimated that at least 80% of all proteins are glycoproteins. Because of the complexity, heterogeneity and flexibility of the sugar chains, the structural biology approaches for glycoconjugates have been generally avoided. NMR techniques although well established for structural analyses of proteins and nucleic acids, cannot be simply applied to this complex class of biomolecules. Nonetheless, recently developed NMR techniques for carbohydrates open the door to conformational studies of a variety of sugar chains of biological interest. NMR studies on glycans will have significant impact on the development of vaccines, adjuvants, therapeutics, biomarkers and on biomass regeneration. In this volume, the Editors have collected the most up-to-date NMR applications from experts in the field of carbohydrate NMR spectroscopy. Timely and useful, not only for NMR specialists, it will appeal to researchers in the general field of structural biology, biochemistry and biophysics, molecular and cellular biology and material science.
The interactions of simple carbohydrates with aromatic moieties have been investigated experimentally by NMR spectroscopy. The analysis of the changes in the chemical shifts of the sugar proton signals induced upon addition of aromatic entities has been interpreted in terms of interaction geometries. Phenol and aromatic amino acids (phenylalanine, tyrosine, tryptophan) have been used. The observed sugar-aromatic interactions depend on the chemical nature of the sugar, and thus on the stereochemistries of the different carbon atoms, and also on the solvent. A preliminary study of the solvation state of a model monosaccharide (methyl beta-galactopyranoside) in aqueous solution, both alone and in the presence of benzene and phenol, has also been carried out by monitoring of intermolecular homonuclear solvent-sugar and aromatic-sugar NOEs. These experimental results have been compared with those obtained by density functional theory methods and molecular mechanics calculations.
Molecular docking is a computational method for predicting the placement of ligands in the binding sites of their receptor(s). In this review, we discuss the methodological developments that occurred in the docking field in 2012 and 2013, with a particular focus on the more difficult aspects of this computational discipline. The main challenges and therefore focal points for developments in docking, covered in this review, are receptor flexibility, solvation, scoring, and virtual screening. We specifically deal with such aspects of molecular docking and its applications as selection criteria for constructing receptor ensembles, target dependence of scoring functions, integration of higher-level theory into scoring, implicit and explicit handling of solvation in the binding process, and comparison and evaluation of docking and scoring methods.
Sugars are the most stereochemically intricate family of biomolecules and present substantial challenges to anyone trying to understand their nomenclature, reactions or branched structures. Current crystallographic programs provide an abstraction layer allowing inexpert structural biologists to build complete protein or nucleic acid model components automatically either from scratch or with little manual intervention. This is, however, still not generally true for sugars. The need for carbohydrate-specific building and validation tools has been highlighted a number of times in the past, concomitantly with the introduction of a new generation of experimental methods that have been ramping up the production of protein-sugar complexes and glycoproteins for the past decade. While some incipient advances have been made to address these demands, correctly modelling and refining carbohydrates remains a challenge. This article will address many of the typical difficulties that a structural biologist may face when dealing with carbohydrates, with an emphasis on problem solving in the resolution range where X-ray crystallography and cryo-electron microscopy are expected to overlap in the next decade.
The conformational flexibility of the glycosaminoglycans (GAGs) is known to be key in their binding and biological function, for example in regulating coagulation and cell growth. In this work, we employ enhanced sampling molecular dynamics simulations to probe the ring conformations of GAG-related monosaccharides, including a range of acetylated and sulfated GAG residues. We first perform unbiased MD simulations of glucose anomers and the epimers glucuronate and iduronate. These calculations indicate that in some cases, an excess of 15 μs is required for adequate sampling of ring pucker due to the high energy barriers between states. However, by applying our recently developed msesMD simulation method (multidimensional swarm-enhanced sampling molecular dynamics), we were able to quantitatively and rapidly reproduce these ring pucker landscapes. From msesMD simulations, the puckering free energy profiles were then compared for 15 further monosaccharides related to GAGs; this includes to our knowledge the first simulation study of sulfation effects on beta-GalNAc ring puckering. For the force field employed, we find that in general the calculated pucker free energy profiles for sulfated sugars were similar to the corresponding unsulfated profiles. This accords with recent experimental studies suggesting that variation in ring pucker of sulfated GAG residues is primarily dictated by interactions with surrounding residues rather than by intrinsic conformational preference. As an exception to this, however, we predict that 4-O-sulfation of beta-GalNAc leads to reduced ring rigidity, with a significant lowering in energy of the ¹C₄ ring conformation; this observation may have implications for understanding the structural basis of the biological function of beta-GalNAc-containing glycosaminoglycans such as dermatan sulfate.
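Ring pucker of a pyranose is commonly quantified with Cremer-Pople coordinates (total amplitude Q and the angles θ and φ). The abstract does not say which pucker coordinates were used, so the sketch below is only an assumed, generic way to turn six ring-atom positions into such coordinates; the function name and the idealised test geometry are illustrative.

```python
import math


def cremer_pople_6(ring):
    """Cremer-Pople puckering coordinates of a six-membered ring.

    ring: six (x, y, z) tuples in ring order.
    Returns (Q, theta_deg, phi_deg); phi is ill-defined when the ring is a chair.
    """
    n = 6
    centre = [sum(p[k] for p in ring) / n for k in range(3)]
    pts = [tuple(p[k] - centre[k] for k in range(3)) for p in ring]
    # Reference vectors whose cross product defines the mean-plane normal
    r1 = [sum(p[k] * math.sin(2 * math.pi * j / n) for j, p in enumerate(pts)) for k in range(3)]
    r2 = [sum(p[k] * math.cos(2 * math.pi * j / n) for j, p in enumerate(pts)) for k in range(3)]
    normal = (
        r1[1] * r2[2] - r1[2] * r2[1],
        r1[2] * r2[0] - r1[0] * r2[2],
        r1[0] * r2[1] - r1[1] * r2[0],
    )
    length = math.sqrt(sum(c * c for c in normal))
    # Out-of-plane displacement of each ring atom
    z = [sum(p[k] * normal[k] for k in range(3)) / length for p in pts]
    a = math.sqrt(2 / n) * sum(zj * math.cos(4 * math.pi * j / n) for j, zj in enumerate(z))
    b = -math.sqrt(2 / n) * sum(zj * math.sin(4 * math.pi * j / n) for j, zj in enumerate(z))
    q2 = math.hypot(a, b)
    q3 = sum((-1) ** j * zj for j, zj in enumerate(z)) / math.sqrt(n)
    total_q = math.hypot(q2, q3)
    theta = math.degrees(math.atan2(q2, q3))
    phi = math.degrees(math.atan2(b, a)) % 360.0
    return total_q, theta, phi


# Idealised chair: atoms on a regular hexagon, alternately above and below the plane
chair = [
    (1.4 * math.cos(math.pi * j / 3), 1.4 * math.sin(math.pi * j / 3), 0.25 * (-1) ** j)
    for j in range(6)
]
print(cremer_pople_6(chair))
```

A θ near 0° or 180° indicates one of the two chairs, and values near 90° indicate boats and skew-boats; which chair corresponds to ⁴C₁ and which to ¹C₄ depends on the order in which the ring atoms are supplied.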
Carbohydrate-active enzymes such as glycoside hydrolases (GHs) and glycosyltransferases (GTs) are of growing importance as drug targets. The development of efficient competitive inhibitors and chaperones to treat diseases related to these enzymes requires a detailed knowledge of their mechanisms of action. In recent years, sophisticated first-principles modeling approaches have significantly advanced our understanding of the catalytic mechanisms of GHs and GTs, not only the molecular details of chemical reactions but also the significant implications that just the conformational dynamics of a sugar ring can have on these mechanisms. Here we provide an overview of the progress that has been made in the past decade, combining molecular dynamics simulations with density functional theory to solve these sweet mysteries of nature.
Outer membranes are a crucial component of Gram-negative bacteria, containing standard lipids in their inner leaflet, lipopolysaccharides (LPSs) in their outer leaflet, and transmembrane beta-barrels known as outer membrane proteins (OMPs). OMPs regulate functions such as substrate transport and cell movement, while LPSs act as a protective barrier for bacteria and can cause toxic reactions in humans. However, the experimental study of outer membranes is challenging. Molecular dynamics simulations are often used for the computational study of membrane systems, but the preparation of complex, LPS-rich outer membranes is not straightforward. The Gram-Negative Outer Membrane Modeler (GNOMM) is an automated pipeline for preparing simulation systems of OMPs embedded in LPS-containing membranes in four different force fields. Given the physiological and clinical importance of outer membranes and their components, GNOMM can be a useful tool in the study of their structure, function, and implications in diseases. GNOMM is available at http://bioinformatics.biol.uoa.gr/GNOMM.
The puckered conformations of furanose and pyranose carbohydrate rings are central to analyzing the action of enzymes on carbohydrates. Enzyme reaction mechanisms are generally inaccessible to experiments and so have become the focus of QM(semiempirical)/MM simulations. We show that the complete free energy of puckering is required to evaluate the accuracy of semiempirical methods used to study reactions involving carbohydrates. Interestingly, we find that reducing the free energy space to lower dimensions results in near meaningless minimum energy pathways. We analyze the furanose and pyranose free energy pucker surfaces and volumes using AM1, PM3, PM3CARB-1, and SCC-DFTB. A comparison with DFT optimized structures and a HF free energy surface reveals that SCC-DFTB provides the best semiempirical description of five- and six-membered carbohydrate ring deformation.
The software tool SWEET accessible through Internet is described which rapidly converts the commonly used sequence information of complex carbohydrates directly into a preliminary but reliable 3D model. The basic idea is to link preconstructed 3D molecular templates of monosaccharides in a specific way of binding as defined in the sequence information. In a subsequent step a fast routine to explore the conformational space for each glycosidic linkage has been implemented. Systematic rotations around the glycosidic linkages are performed, calculating the van der Waals interactions for each step of rotation. The user interaction is supported by an input spreadsheet consisting of a grid of sugar symbol and connection type cells. Several ways to visualise and to output the generated structures and related information are implemented. Since interactivity is an absolute prerequisite for each WWW application, the limitations of the approach are discussed in detail. SWEET will open modelling techniques to a broader range of users, especially for those who do not have access to the required hard- and software equipment.
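The systematic rotation step can be pictured as a grid scan over the two torsion angles of a glycosidic linkage, keeping the lowest-energy point. The sketch below is not SWEET's code: the function names are invented, and the toy energy surface merely stands in for the van der Waals term that the abstract says is evaluated at each rotation step.

```python
import math


def scan_linkage(energy, step_deg=10):
    """Evaluate `energy(phi, psi)` on a regular grid and return the best point.

    Returns (energy, phi, psi) for the lowest-energy grid point.
    """
    best = None
    for phi in range(-180, 180, step_deg):
        for psi in range(-180, 180, step_deg):
            e = energy(phi, psi)
            if best is None or e < best[0]:
                best = (e, phi, psi)
    return best


def toy_energy(phi, psi):
    """Hypothetical smooth surface with one minimum at phi = -60, psi = 120."""
    return (1 - math.cos(math.radians(phi + 60))) + (1 - math.cos(math.radians(psi - 120)))


print(scan_linkage(toy_energy))
```

A 10° grid needs 36 × 36 energy evaluations per linkage, which is why scanning each linkage independently stays fast, while scanning all linkages of a large glycan jointly would not.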
SWEET is a WWW-based tool which rapidly converts the commonly used carbohydrate sequence information directly into a preliminary but reliable 3D model which can be visualised and written to files in several ways. AVAILABILITY: SWEET is accessible via the Internet at http://www.dkfz-heidelberg.de/spec/. CONTACT: a.bohne@dkfz-heidelberg.de or w.vonderlieth@dkfz-heidelberg.de. SUPPLEMENTARY INFORMATION: The current version of SWEET generates only one conformation out of a manifold. Several authors have analysed possible conformations of high-mannose N-linked glycans using a combination of NMR methods and computational approaches showing that such molecules are rather flexible populating normally several conformations for each glycosidic linkage. The displayed model exhibits for all glycosidic linkages a conformation which is in accordance with the reported variations of phi, psi and omega values for the specific linkage (see http://www.dkfz-heidelberg.de/spec/sweet2/doc/input/sba_example.html).
CHARMM (Chemistry at HARvard Molecular Mechanics) is a highly versatile and widely used molecular simulation program. It has been developed over the last three decades with a primary focus on molecules of biological interest, including proteins, peptides, lipids, nucleic acids, carbohydrates, and small molecule ligands, as they occur in solution, crystals, and membrane environments. For the study of such systems, the program provides a large suite of computational tools that include numerous conformational and path sampling methods, free energy estimators, molecular minimization, dynamics, and analysis techniques, and model-building capabilities. The CHARMM program is applicable to problems involving a much broader class of many-particle systems. Calculations with CHARMM can be performed using a number of different energy functions and models, from mixed quantum mechanical-molecular mechanical force fields, to all-atom classical potential energy functions with explicit solvent and various boundary conditions, to implicit solvent and membrane models. The program has been ported to numerous platforms in both serial and parallel architectures. This article provides an overview of the program as it exists today with an emphasis on developments since the publication of the original CHARMM article in 1983.
Molecular dynamics (MD) simulation methods have been an effective source of generating biomolecular-level structural information in immunology, as feedback to understand basic science and to design new experiments, leading to the discovery of drugs and vaccines. Different soluble or surface-bound proteins secreted by immune cells exchange signals through the formation of specialized molecular complexes. Molecules involved in the complex formation are complement proteins, antibodies, T cell receptors, MHC encoded HLA molecules, endogenous peptide antigens, and pathogenic peptides. Understanding the molecular details of the complex formation is very important to systematic design of drugs and vaccines. Experimental data provide only macroscopic reasoning and in many cases fail to perceive subtle differences in behaviors of two apparently very similar systems. Formation of stable complexes depends on complementary residues in proteins and peptides and their matching conformations. Here we present a comprehensive review of applications of MD simulations in immunology. In addition, a short section on computational predictive methods to identify T cell epitopes has been included.
Coot is a tool widely used for model building, refinement, and validation of macromolecular structures. It has been extensively used for crystallography and, more recently, improvements have been introduced to aid in cryo-EM model building and refinement, as cryo-EM structures with resolution ranging 2.5-4 Å are now routinely available. Model building into these maps can be time-consuming and requires experience in both biochemistry and building into low-resolution maps. To simplify and expedite the model building task, and minimize the needed expertise, new tools are being added in Coot. Some examples include morphing, Geman-McClure restraints, full-chain refinement, and Fourier-model based residue-type-specific Ramachandran restraints. Here, we present the current state-of-the-art in Coot usage.
We describe the development, current features, and some directions for future development of the Amber package of computer programs. This package evolved from a program that was constructed in the late 1970s to do Assisted Model Building with Energy Refinement, and now contains a group of programs embodying a number of powerful tools of modern computational chemistry, focused on molecular dynamics and free energy calculations of proteins, nucleic acids, and carbohydrates.
no abstract.
Uromodulin is the pregnancy-associated Tamm-Horsfall glycoprotein, with the enhanced ability to inhibit T-cell proliferation. Pregnancy-associated structural changes mainly occur in the O-glycosylation of this glycoprotein. These include up to 12 glycan structures, made up of an unusual core type 2 sequence terminated with one, two, or three sialyl Lewis(x) sequences; this type of O-glycans could serve as E- and P-selectin ligands. The present work focuses on the most complex one; a tetradecamer made up of a type 2 core carrying three sialyl Lewis(x) branches. Five different monosaccharides are assembled by 14 glycosidic linkages. The conformational behavior of the constituting disaccharide segments was evaluated using the flexible residue procedure of the MM3 molecular mechanics procedure. For each disaccharide, the adiabatic energy surface, along with the local energy minima were established. All these results were used for the generation, prior to complete optimization of the tetradecamer. This was followed by a complete exploration of conformational hyperspace throughout the use of the single coordinate method as implemented in the CICADA program. Despite the potential flexibility of the tetradecasaccharide, only four conformational families occur, accounting for more than 95% of the total low energy conformations. For each family, the molecular properties (electrostatic, lipophilicity, and hydrogen potential) were studied. The shape of the tetradecasaccharide is best described as a flat ribbon, flanked by three branches having terminal sialyl residues. Two of the branches interact through nonbonded interactions, bringing further energy stabilization, and limiting the conformational flexibility of the sialyl residues. Only one branch maintains the original conformational features of sialyl Lewis(x). This O-glycan can be seen as a fascinating example of 'dendrimeric' structure, where the spatial arrangement of three S-Le(x) epitopes may favor its complementary 'presentations' for the interactions with E- and P-selectins.
Mammalian glycosaminoglycans are linear complex polysaccharides comprising heparan sulfate, heparin, dermatan sulfate, chondroitin sulfate, keratan sulfate and hyaluronic acid. They bind to numerous proteins and these interactions mediate their biological activities. GAG-protein interaction data reported in the literature are curated mostly in MatrixDB database (http://matrixdb.univ-lyon1.fr/). However, a standard nomenclature and a machine-readable format of GAGs together with bioinformatics tools for mining these interaction data are lacking. We report here the building of an automated pipeline to (i) standardize the format of GAG sequences interacting with proteins manually curated from the literature, (ii) translate them into the machine-readable GlycoCT format and into SNFG (Symbol Nomenclature For Glycan) images and (iii) convert their sequences into a format processed by a builder generating three-dimensional structures of polysaccharides based on a repertoire of conformations experimentally validated by data extracted from crystallized GAG-protein complexes. We have developed for this purpose a converter (the CT23D converter) to automatically translate the GlycoCT code of a GAG sequence into the input file required to construct a three-dimensional model.
no abstract.
Carbohydrates constitute a structurally and functionally diverse group of biological molecules and macromolecules. In cells they are involved in, e.g., energy storage, signaling, and cell-cell recognition. All of these phenomena take place in atomistic scales, thus atomistic simulation would be the method of choice to explore how carbohydrates function. However, the progress in the field is limited by the lack of appropriate tools for preparing carbohydrate structures and related topology files for the simulation models. Here we present tools that fill this gap. Applications where the tools discussed in this paper are particularly useful include, among others, the preparation of structures for glycolipids, nanocellulose, and glycans linked to glycoproteins. The molecular structures and simulation files generated by the tools are compatible with GROMACS.
This chapter outlines protocols for the preparation, execution, and analysis of molecular dynamics (MD) simulations of glycolipids in biologically relevant environments, i.e., embedded in lipid bilayers or bound to proteins, with the goal of generating biologically relevant structural and dynamic information. Also included is a description of ensemble average (EA) charge set development consistent with the GLYCAM06 force field and its implementation using the AMBER molecular dynamics software suite.
The frequency of glycosylated protein 3D structures in the Protein Data Bank (PDB) is significantly lower than the proportion of glycoproteins in nature, and if glycan 3D structures are present, then they often exhibit a large degree of errors. There are various reasons for this, one of which is a comparably low support of carbohydrates in software tools for 3D structure determination and validation. This chapter illustrates the current features that assist crystallographers with handling glycans during 3D structure determination in Coot and CNS and with validation of the results.
A new procedure (POLYS) for producing three-dimensional structures of polysaccharides and complex carbohydrates is described. This employs a builder concept combining a database of monosaccharide structures with a database containing information on populations of independent neighboring glycosidic linkages in disaccharide fragments. The computer program is written in C, and it can cope with both the complexity and the diversity of carbohydrates and the unique topological features arising from multiple branching. A simple ASCII syntax was developed for describing the primary structures in accordance with IUPAC nomenclature. The translation of the primary structure is made through the combined use of a lexical analyzer and a command interpreter. In this way the program can be considered as a compiler of primary structures of carbohydrates. However, it also generates secondary and tertiary structures in the form of Cartesian coordinates in formats used by most molecular mechanics programs and packages. In our laboratory POLYS was exhaustively tested on standard homopolysaccharide systems such as cellulose and mannan and found to work very well. We now report the ease of use and the efficiency of the molecular builder in applications to more complex carbohydrate systems. These include the structural exploration of a pentaantennary oligosaccharide having 135 residues, the complex family of pectic polysaccharides including the organization and distribution of side chains (arabinan, arabinogalactan, and galactan) on the rhamnogalacturonan backbone.
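The idea of treating a primary structure as source text for a lexical analyzer and an interpreter can be sketched on a deliberately tiny scale. The notation below (residue names joined by linkages such as "(a1-4)", linear chains only) is a hypothetical stand-in and not the POLYS input syntax, which is not given in the abstract; the function names are likewise invented.

```python
import re

# Hypothetical linear notation, e.g. "Glc(a1-4)Glc(a1-6)Glc"; no branching.
TOKEN = re.compile(
    r"(?P<residue>[A-Za-z][A-Za-z0-9]*)"
    r"|\((?P<anomer>[ab])(?P<donor>\d)-(?P<acceptor>\d)\)"
)


def tokenize(seq):
    """Lexical step: split the sequence string into residue and link tokens."""
    tokens, pos = [], 0
    while pos < len(seq):
        m = TOKEN.match(seq, pos)
        if m is None:
            raise ValueError(f"unexpected text at position {pos}: {seq[pos:]!r}")
        if m.group("residue") is not None:
            tokens.append(("residue", m.group("residue")))
        else:
            link = (m.group("anomer"), int(m.group("donor")), int(m.group("acceptor")))
            tokens.append(("link", link))
        pos = m.end()
    return tokens


def parse_linear(seq):
    """Interpreter step: check residue/link alternation and collect both lists."""
    residues, links = [], []
    expect_residue = True
    for kind, value in tokenize(seq):
        if (kind == "residue") != expect_residue:
            raise ValueError(f"misplaced {kind}: {value!r}")
        (residues if kind == "residue" else links).append(value)
        expect_residue = not expect_residue
    if expect_residue:
        raise ValueError("sequence must end with a residue")
    return residues, links


print(parse_linear("Glc(a1-4)Glc(a1-6)Glc"))
```

A builder would then look up a template for each residue and a preferred torsion set for each linkage; supporting branches, as POLYS does, would require a nested grammar rather than this flat alternation.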
This article describes an update of POLYS, the POLYSaccharide builder, for generating three-dimensional structures of polysaccharides and complex carbohydrates (Engelsen et al., Biopolymers 1996, 39, 417-433). POLYS is written in portable ANSI C and is now released under an open source license. Using this software, complex branched carbohydrate structures and polysaccharides can be constructed from their primary structure and the relevant monosaccharides stored in database containing information on optimized glycosidic linkage geometries. The constructed three-dimensional structures are described as Cartesian coordinate files which can be used as input to other molecular modeling software. The new version of POLYS includes a large database of monosaccharides and a helical generator to build and optimize regular single helix or double helix structures. To demonstrate the efficiency of POLYS to build carbohydrate structures, four examples of increasing complexity are presented in the manuscript, from simple alpha glucans over complex starch fragments and the double helical structure of amylopectin to the mega-oligosaccharide RhamnoGalacturonan II.
Complex carbohydrates (glycans) are the most abundant and versatile biopolymers in nature. The broad diversity of biochemical functions that carbohydrates cover is a direct consequence of the variety of 3D architectures they can adopt, displaying branched or linear arrangements, widely ranging in sizes, and with the highest diversity of building blocks of any other natural biopolymer. Despite this unparalleled complexity, a common denominator can be found in the glycans' inherent flexibility, which hinders experimental characterization, but that can be addressed by high-performance computing (HPC)-based molecular simulations. In this short review, I present and discuss the state-of-the-art of molecular simulations of complex carbohydrates and glycoconjugates, highlighting methodological strengths and weaknesses, important insights through emblematic case studies, and suggesting perspectives for future developments.
The characterization of the 3D structure of oligosaccharides, their conjugates and analogs is particularly challenging for traditional experimental methods. Molecular simulation methods provide a basis for interpreting sparse experimental data and for independently predicting conformational and dynamic properties of glycans. Here, we summarize and analyze the issues associated with modeling carbohydrates, with a detailed discussion of four of the most recently developed carbohydrate force fields, reviewed in terms of applicability to natural glycans, carbohydrate-protein complexes and the emerging area of glycomimetic drugs. In addition, we discuss prospectives and new applications of carbohydrate modeling in drug discovery.
Molecular dynamics (MD) simulation is an emerging technique in studying the interactions between macromolecules with small ligands in different media. In this review, the application of MD simulation in food carbohydrate research, including carbohydrate hydration, carbohydrate interaction with other components and carbohydrate inclusion complexation, will be discussed. The advantages and disadvantages of MD simulation in food carbohydrate research and trends will be proposed. The frequently used software to run the MD simulation and a standard protocol for MD simulation procedures have been discussed. This review offers a general idea about how to use MD simulation in food carbohydrate research, and what could be expected from such research.
In previous work [Pol-Fachin, L.; Fernandes, C. L.; Verli, H.; Carbohydr. Res. 2009, 344, 491-500], we demonstrated that the GROMOS96 43a1 force field and Löwdin HF/6-31G**-derived atomic charges adequately represent a glycoprotein's conformational ensemble in aqueous solution, taking NMR-determined structures as the starting geometries. Based on such data, the present work evaluates the use of the main solution conformations of isolated disaccharides to build the carbohydrate moiety of glycoproteins for which no previous experimental information is available. The observed results suggest that the entire glycoprotein scaffold appears unable to promote major modifications in the conformational behavior of glycosidic linkages. Additionally, when compared to energy contour plots, the results support the use of solution ensembles to refine vacuum conformations from carbohydrate databases when assembling glycoprotein 3D structures. Finally, this approach is applied to build fully glycosylated models of the COX-1 and COX-2 enzymes.
This chapter contains sections titled: - Introduction - Conformational Analysis of (Poly)saccharides – the Pioneering Years - The exo-Anomeric Effect – To Be or Not To Be… - Carbohydrates on the Move … - Carbohydrate Meets Protein - Modern Times - Conclusion and Outlook.
Complex carbohydrates usually have a large number of rotatable bonds and consequently a large number of theoretically possible conformations can be generated (combinatorial explosion). The application of systematic search methods for conformational analysis of carbohydrates is therefore limited to disaccharides and trisaccharides in a routine analysis. An alternative approach is to use Monte-Carlo methods or (high-temperature) molecular dynamics (MD) simulations to explore the conformational space of complex carbohydrates. This chapter describes how to use MD simulation data to perform a conformational analysis (conformational maps, hydrogen bonds) of oligosaccharides and how to build realistic 3D structures of large polysaccharides using Conformational Analysis Tools (CAT).
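The combinatorial explosion mentioned in this abstract can be illustrated with a small sketch. It assumes a linear oligosaccharide with two rotatable torsions (phi and psi) per glycosidic linkage and a uniform 10-degree grid; both values and the function name `grid_search_size` are illustrative assumptions, not taken from the chapter.

```python
def grid_search_size(n_linkages, torsions_per_linkage=2, step_deg=10):
    """Count the conformers a systematic grid search would have to evaluate.

    Assumption: every torsion is scanned independently on a uniform grid.
    """
    if 360 % step_deg != 0:
        raise ValueError("step_deg must divide 360 evenly")
    points_per_torsion = 360 // step_deg
    return points_per_torsion ** (n_linkages * torsions_per_linkage)


if __name__ == "__main__":
    for n_residues in (2, 3, 5, 10):
        n_linkages = n_residues - 1  # linear chain assumed, no branching
        print(n_residues, "residues:", grid_search_size(n_linkages), "grid points")
```

Under these assumptions each added linkage multiplies the search space by 36 squared, which is consistent with systematic searches being routine only for di- and trisaccharides.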
This chapter contains sections titled: - Introduction - Factors That Have an Influence on the 3D Structure of Carbohydrates - Exploring the Accessible Conformational Space of Carbohydrates - Generation of 3D Structures of Glycoproteins - Conclusion and Outlook.
Glycosylated proteins are ubiquitous components of extracellular matrices and cellular surfaces where their oligosaccharide moieties are implicated in a wide range of cell-cell and cell-matrix recognition events. Glycans constitute highly flexible molecules. Only a small number of glycan X-ray structures are available for which sufficient electron density for an entire oligosaccharide chain has been observed. An unambiguous structure determination based on NMR-derived geometric constraints alone is often not possible. Time-consuming computational approaches such as Monte Carlo calculations and molecular dynamics simulations have been widely used to explore the conformational space accessible to complex carbohydrates. The generation of a comprehensive database for N-glycan fragments based on long molecular dynamics simulations is presented. The fragments are chosen in such a way that the effects of branched N-glycan structures are taken into account. The prediction database constitutes the basis of a procedure to generate a complete set of all possible conformations for a given N-glycan. The constructed conformations are ranked according to their energy content. The resulting conformations are in reasonable agreement with experimental data. A web interface has been established (http://www.dkfz.de/spec/glydict/), which allows users to input any N-glycan of interest and to receive an ensemble of generated conformations within a few minutes.
Conformational energy maps of the glycosidic linkages are a valuable resource to gain information about preferred conformations and flexibility of carbohydrates. Here we present GlycoMapsDB, a new database containing more than 2500 calculated conformational maps for a variety of di- to pentasaccharide fragments contained in N- and O-glycans. Oligosaccharides representing branchpoints of N-glycans are included in the set of fragments, thus the influence of neighbouring residues is reflected in the conformational maps. During refinement of new crystal structures, maps contained in GlycoMapsDB can serve as a valuable resource to check whether the torsion values of a glycosidic linkage are located in an 'allowed' region similar to the Ramachandran plot analysis for proteins. This might help to improve the structural quality of the glycan data contained in the Protein Data Bank (PDB). A link between GlycoMapsDB and the PDB has been established so that the glycosidic torsions of all glycans contained in the PDB can be retrieved and compared to calculated data. The service is available at www.glycosciences.de/modeling/glycomapsdb/.
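Comparing a glycosidic linkage against a conformational map starts with computing its torsion angles from atomic coordinates. The sketch below shows one common way to obtain a signed dihedral angle from four points; the function name `dihedral_deg` and the test coordinates are illustrative assumptions and are not part of GlycoMapsDB or any of its tools.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees, in the range (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Made-up coordinates: a right-angle twist about the z axis.
    print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Which four atoms define phi and psi depends on the convention in use (for example, ring-oxygen-based versus hydrogen-based definitions), so the atom selection has to match the convention of the map being consulted.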
Some history of computerized modeling is furnished to set the stage for modern work, including the development of the justification for using flexible monosaccharide residues. Modeling studies of mono-and disaccharides, primarily by the author and colleagues, have been reviewed and additional analyses were furnished that relate the conformations found in crystal structures to the adiabatic energies through a Boltzmann-like distribution. Those decreases in the numbers of observed crystal structures with increasing energy, discussed earlier by Dunitz and coworkers, suggest that the consequences of crystal packing forces are equivalent to the distribution of structures that would be found at about 500 K for non-crystalline molecules. A closing study of a large oligosaccharide, a cycloamylose with 26 glucose residues (CA26), revealed the extent to which energy minimization studies could retain molecular features found in crystal structures of a large carbohydrate. As also done for the disaccharides, both empirical force fields (molecular mechanics) and density functional theory (quantum mechanics) were employed. Some intrinsic features of the CA26 were reproduced, but the detailed results from the different modeling methods diverged. Results from the CA26 calculations based on normal termination criteria changed significantly when tighter criteria were employed; it was not practical to tighten the criteria for the quantum calculations.
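The Boltzmann-like relationship between energy and observed frequency described in this abstract can be made concrete with a short sketch. The relative energies below are made-up placeholder values, not data from the study, and the function name `boltzmann_populations` is hypothetical; only the temperature of about 500 K comes from the abstract.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_populations(rel_energies_kcal, temperature_k):
    """Normalized Boltzmann populations for relative energies in kcal/mol."""
    beta = 1.0 / (R_KCAL * temperature_k)
    weights = [math.exp(-beta * e) for e in rel_energies_kcal]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    energies = [0.0, 1.0, 2.0, 4.0]  # placeholder values for illustration only
    for temperature in (298.15, 500.0):
        pops = boltzmann_populations(energies, temperature)
        print(temperature, [round(p, 3) for p in pops])
```

A higher effective temperature flattens the distribution, so higher-energy conformations are expected to appear more often than they would at room temperature.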
Computerized molecular modeling continues to increase in capability and applicability to carbohydrates. This chapter covers nomenclature and conformational aspects of carbohydrates, perhaps of greater use to computational chemists who do not have a strong background in carbohydrates, and its comments on various methods and studies might be of more use to carbohydrate chemists who are inexperienced with computation. Work on the intrinsic variability of glucose, an overall theme, is described. Other areas of the authors' emphasis, including evaluation of hydrogen bonding by the atoms-in-molecules approach, and validation of modeling methods with crystallographic results are also presented.
Enterobacterial common antigen (ECA) is a surface glycolipid shared by all members of the Enterobacteriaceae family. In addition to lipopolysaccharides (LPS), ECA is an important component in the outer membrane (OM) of Gram-negative bacteria, making the OM an effective, selective barrier against the permeation of toxic molecules. Previous modeling and simulation studies represented OMs exclusively with LPS in the outer leaflet. In this work, various ECA molecules were first modeled and incorporated into symmetric bilayers with LPS in different ratios, and all-atom molecular dynamics simulations were conducted to investigate the properties of the mixed bilayers mimicking OM outer leaflets. Dynamic and flexible conformational ensembles are sampled for each ECA/LPS system. Incorporation of ECA(LPS) (an LPS core-linked form) and ECA(PG) (a phosphatidylglycerol-linked form) affects lipid packing and ECA/LPS distributions on the bilayer surface. Hydrophobic thickness and chain order parameter analyses indicate that incorporation of ECA(PG) makes the acyl chains of LPS more flexible and disordered and thus increases the area per lipid of LPS. The calculated area per lipid of each ECA/LPS provides a good estimate for building more realistic OMs with different ratios of ECA/LPS, which will be useful in order to characterize their interactions with outer membrane proteins in more realistic OMs.
High level correlated quantum chemical calculations, using MP2 and local MP2 theory, have been performed for conformations of the disaccharide, beta-maltose, and the trisaccharide, 3,6-di-O-(alpha-D-mannopyranosyl)-alpha-D-mannopyranose. For beta-maltose, MP2 and local MP2 calculations using the 6-311++G** basis set are in good agreement, predicting a global minimum gas-phase conformation with a counterclockwise hydrogen bond network and the experimentally-observed intersaccharide hydrogen bonding arrangement. For conformations of 3,6-di-O-(alpha-D-mannopyranosyl)-alpha-D-mannopyranose, MP2/6-311++G**, and local MP2/6-311++G** calculations do not provide a consensus prediction of relative energetics, with the MP2 method finding large differences in stability between extended and folded trisaccharide conformations. Local MP2 calculations, less susceptible to intramolecular basis set superposition errors, predict a narrower range of trisaccharide energetics, in line with estimates from Hartree-Fock theory and B3LYP and BP86 density functionals. All levels of theory predict compact, highly hydrogen-bonded conformations as lowest in energy on the in vacuo potential energy surface of the trisaccharide. These high level, correlated local MP2/6-311++G** calculations of di- and trisaccharide energetics constitute potential reference data in the development and testing of improved empirical and semiempirical potentials for modeling of carbohydrates in the condensed phase.
Polysaccharides represent an important class of biopolymers. Advance in their investigation depends considerably on the adequate modeling of their structures by using appropriate oligosaccharides. Herein, we discuss the role of the length of oligosaccharide models used in predicting polysaccharide properties, in particular, their conformational and spectral characteristics.
Impressive improvements in docking performance can be achieved by applying energy bonuses to poses in which glycan hydroxyl groups occupy positions otherwise preferred by bound waters. In addition, inclusion of glycosidic conformational energies allows unlikely glycan conformations to be appropriately penalized. A method for predicting the binding specificity of glycan-binding proteins has been developed, which is based on grafting glycan branches onto a minimal binding determinant in the binding site. Grafting can be used either to screen virtual libraries of glycans, such as the known glycome, or to identify docked poses of minimal binding determinants that are consistent with specificity data. The reviewed advances allow accurate modelling of carbohydrate-protein 3D co-complexes, but challenges remain in ranking the affinity of congeners.
no abstract.
Norovirus is a major pathogen of nonbacterial acute gastroenteritis in humans and animals. Carbohydrate recognition between norovirus capsid proteins and Lewis antigens is considered to play a critical role in initiating infection of eukaryotic cells. In this article, we first report a detailed atomistic simulation study of the norovirus capsid protein in complex with the Lewis antigen based on ab initio QM/MM combined with MD-FEP simulations. To understand the mechanistic details of ligand binding, we analyzed and compared the carbohydrate recognition mechanism of the wild-type P domain protein with a mutant protein. Small structural differences between two capsid proteins are observed on the weak interaction site of residue 389, which is located on the solvent exposed surface of the P domain. To further clarify affinity differences in ligand binding, we directly evaluated free energy changes of the ligand binding process. Although the mutant protein loses its interaction energy with the Lewis antigen, this small amount of energy penalty is compensated for by an increase in the solvation stability, which is induced by structural reorganization at the ligand binding site on the protein surface. As a sum of these opposite energy components, the mutant P domain obtains a slightly enhanced binding affinity for the Lewis antigen. The present computational study clearly demonstrated that a detailed free energy balance of the interaction energy between the capsid protein and the surrounding aqueous solvent is the mechanistic basis of carbohydrate recognition in the norovirus capsid protein.
Molecular dynamics simulations of membrane proteins have provided deeper insights into their functions and interactions with surrounding environments at the atomic level. However, compared to solvation of globular proteins, building a realistic protein/membrane complex is still challenging and requires considerable experience with simulation software. Membrane Builder in the CHARMM-GUI website (http://www.charmm-gui.org) helps users to build such a complex system using a web browser with a graphical user interface. Through a generalized and automated building process including system size determination as well as generation of lipid bilayer, pore water, bulk water, and ions, a realistic membrane system with virtually any kinds and shapes of membrane proteins can be generated in 5 minutes to 2 hours depending on the system size. Default values that were elaborated and tested extensively are given in each step to provide reasonable options and starting points for both non-expert and expert users. The efficacy of Membrane Builder is illustrated by its applications to 12 transmembrane and 3 interfacial membrane proteins, whose fully equilibrated systems with three different types of lipid molecules (DMPC, DPPC, and POPC) and two types of system shapes (rectangular and hexagonal) are freely available on the CHARMM-GUI website. One of the most significant advantages of using the web environment is that, if a problem is found, users can go back and re-generate the whole system again before quitting the browser. Therefore, Membrane Builder provides the intuitive and easy way to build and simulate the biologically important membrane system.
The CHARMM-GUI Membrane Builder (http://www.charmm-gui.org/input/membrane), an intuitive, straightforward, web-based graphical user interface, was expanded to automate the building process of heterogeneous lipid bilayers, with or without a protein and with support for up to 32 different lipid types. The efficacy of these new features was tested by building and simulating lipid bilayers that resemble yeast membranes, composed of cholesterol, dipalmitoylphosphatidylcholine, dioleoylphosphatidylcholine, palmitoyloleoylphosphatidylethanolamine, palmitoyloleoylphosphatidylamine, and palmitoyloleoylphosphatidylserine. Four membranes with varying concentrations of cholesterol and phospholipids were simulated, for a total of 170 ns at 303.15 K. Unsaturated phospholipid chain concentration had the largest influence on membrane properties, such as average lipid surface area, density profiles, deuterium order parameters, and cholesterol tilt angle. Simulations with a high concentration of unsaturated chains (73%, membrane(unsat)) resulted in a significant increase in lipid surface area and a decrease in deuterium order parameters, compared with membranes with a high concentration of saturated chains (60-63%, membrane(sat)). The average tilt angle of cholesterol with respect to bilayer normal was largest, and the distribution was significantly broader for membrane(unsat). Moreover, short-lived cholesterol orientations parallel to the membrane surface existed only for membrane(unsat). The membrane(sat) simulations were in a liquid-ordered state, and agree with similar experimental cholesterol-containing membranes.
Glycoproteins and protein-carbohydrate complexes in the worldwide Protein Data Bank (wwPDB) can be an excellent source of information for glycoscientists. Unfortunately, a rather large number of errors and inconsistencies is found in the glycan moieties of these 3D structures. This review illustrates frequent problems of carbohydrate moieties in wwPDB entries, such as nomenclature issues, incorrect N-glycan core structures, missing or erroneous linkages, or poor glycan geometry, and describes the carbohydrate-specific validation tools that are designed to identify such problems. Recommendations how to avoid these issues or how to rectify incorrect structures are also given.
Protein-carbohydrate interactions are increasingly being recognized as essential for many important biomolecular recognition processes. From these, numerous biomedical applications arise in areas as diverse as drug design, immunology, or drug transport. We introduce SLICK, a package containing a scoring and an energy function, which were specifically designed to predict binding modes and free energies of sugars and sugarlike compounds to proteins. SLICK accounts for van der Waals interactions, solvation effects, electrostatics, hydrogen bonds, and CH...pi interactions, the latter being a particular feature of most protein-carbohydrate interactions. Parameters for the empirical energy function were calibrated on a set of high-resolution crystal structures of protein-sugar complexes with known experimental binding free energies. We show that SLICK predicts the binding free energies of predicted complexes (through molecular docking) with high accuracy. SLICK is available as part of our molecular modeling package BALL (www.ball-project.org).
Micelles play an important role in both experimental and computational studies of the effect of lipid interactions on biological systems. The spherical geometry and the dynamical behavior of micelles make generating micelle structures for use in molecular simulations challenging. An easy tool for generating simulation-ready micelle models, covering a broad range of lipids, is highly desirable. Here, we present a new Web server, Micelle Maker, which can provide equilibrated micelle models as a direct input for subsequent molecular dynamics simulations from a broad range of lipids (currently 25 lipid types, including 24 glycolipids). The Web server, which is available at http://www.micellemaker.net, uses error checking routines to prevent clashes during the initial placement of the lipids and uses AMBER's GLYCAM library for generating minimized or equilibrated micelle models, but the resulting structures can be used as starting points for simulations with any force field or simulation package. Extensive validation simulations with an overall simulation time of 12 μs using eight micelle models where assembly information is available show that all of the micelles remain very stable over the whole simulation time. Finally, we discuss the advantages of Micelle Maker relative to other approaches in the field.
Licensed conjugate vaccines have proven to be highly effective in preventing bacterial disease. Coverage of a vaccine is extended when specific antigens elicit immune responses that provide cross-protection against infection by closely related, non-vaccine strains. However, structural similarity between carbohydrate antigens has not proven to be a reliable predictor of cross-protection and the current understanding of the role of saccharide antigen conformation in immunogenicity is sparse. Identification of the conformational effect of specific structural changes in conjugate vaccine antigens may usefully inform the development of conjugate vaccines. The limited ability of experimental methods to establish saccharide conformation has led to the development of systematic molecular modeling protocols. Here we cover the computational methodologies employed to model carbohydrate antigens and demonstrate, through case studies, the valuable role that molecular simulations can play in furthering our understanding of carbohydrate immunogenicity. The case studies comprise molecular modeling of the capsular polysaccharides for meningococcal serogroups Y and W and pneumococcal serogroups 6, 19 and 23, as well O-antigens of Salmonella enterica and Shigella flexneri. Conformational analysis can provide a mechanistic insight into clinical observations on cross-protection and may further indicate the importance of specific structural features, such as substituents, thereby facilitating vaccine design and broadening vaccine coverage.
CarbBuilder is a portable software tool for producing three-dimensional molecular models of carbohydrates from the simple text specification of a primary structure. CarbBuilder can generate a wide variety of carbohydrate structures, ranging from monosaccharides to large, branched polysaccharides. Version 2.0 of the software, described in this article, supports monosaccharides of both mammalian and bacterial origin and a range of substituents for derivatization of individual sugar residues. This improved version has a sophisticated building algorithm to explore the range of possible conformations for a specified carbohydrate molecule. Illustrative examples of models of complex polysaccharides produced by CarbBuilder demonstrate the capabilities of the software. CarbBuilder is freely available under the Artistic License 2.0 from https://people.cs.uct.ac.za/~mkuttel/Downloads.html. (c) 2016 Wiley Periodicals, Inc.
The RosettaCarbohydrate framework is a new tool for modeling a wide variety of saccharide and glycoconjugate structures. This report describes the development of the framework and highlights its applications. The framework integrates with established protocols within the Rosetta modeling and design suite, and it handles the vast complexity and variety of carbohydrate molecules, including branching and sugar modifications. To address challenges of sampling and scoring, RosettaCarbohydrate can sample glycosidic bonds, side-chain conformations, and ring forms, and it utilizes a glycan-specific term within its scoring function. Rosetta can work with standard PDB, GLYCAM, and GlycoWorkbench (.gws) file formats. Saccharide residue-specific chemical information is stored internally, permitting glycoengineering and design. Carbohydrate-specific applications described herein include virtual glycosylation, loop-modeling of carbohydrates, and docking of glyco-ligands to antibodies. Benchmarking data are presented and compared to other studies, demonstrating Rosetta's ability to predict glyco-ligand binding. The framework expands the tools available to glycoscientists and engineers. (c) 2016 Wiley Periodicals, Inc.
Main characteristics are described of the PRIRODA quantum-chemical program suite designed for the study of complex molecular systems by the density functional theory, at the MP2, MP3, and MP4 levels of multiparticle perturbation theory, and by the coupled-cluster single and double excitations method (CCSD) with the application of parallel computing. A number of examples of calculations are presented.
Proper treatment of nonbonded interactions is essential for the accuracy of molecular dynamics (MD) simulations, especially in studies of lipid bilayers. The use of the CHARMM36 force field (C36 FF) in different MD simulation programs can result in disagreements with published simulations performed with CHARMM due to differences in the protocols used to treat the long-range and 1-4 nonbonded interactions. In this study, we systematically test the use of the C36 lipid FF in NAMD, GROMACS, AMBER, OpenMM, and CHARMM/OpenMM. A wide range of Lennard-Jones (LJ) cutoff schemes and integrator algorithms were tested to find the optimal simulation protocol to best match bilayer properties of six lipids with varying acyl chain saturation and head groups. MD simulations of a 1,2-dipalmitoyl-sn-phosphatidylcholine (DPPC) bilayer were used to obtain the optimal protocol for each program. MD simulations with all programs were found to reasonably match the DPPC bilayer properties (surface area per lipid, chain order parameters, and area compressibility modulus) obtained using the standard protocol used in CHARMM as well as from experiments. The optimal simulation protocol was then applied to the other five lipid simulations and resulted in excellent agreement between results from most simulation programs as well as with experimental data. AMBER compared least favorably with the expected membrane properties, which appears to be due to its use of the hard-truncation in the LJ potential versus a force-based switching function used to smooth the LJ potential as it approaches the cutoff distance. The optimal simulation protocol for each program has been implemented in CHARMM-GUI. This protocol is expected to be applicable to the remainder of the additive C36 FF including the proteins, nucleic acids, carbohydrates, and small molecules.
As part of our ongoing efforts to support diverse force fields and simulation programs in CHARMM-GUI, this work presents the development of FF-Converter to prepare Amber simulation inputs with various Amber force fields within the current CHARMM-GUI workflow. The currently supported Amber force fields are ff14SB/ff19SB (protein), Bsc1 (DNA), OL3 (RNA), GLYCAM06 (carbohydrate), Lipid17 (lipid), GAFF/GAFF2 (small molecule), TIP3P/TIP4P-EW/OPC (water), and 12-6-4 ions, and more will be added if necessary. The robustness and usefulness of this new CHARMM-GUI extension are demonstrated by two exemplary systems: a protein/N-glycan/ligand/membrane system and a protein/DNA/RNA system. Currently, CHARMM-GUI supports the Amber force fields only for the Amber program, but we will expand the FF-Converter functionality to support other simulation programs that support the Amber force fields.
The Rosetta software for macromolecular modeling, docking and design is extensively used in laboratories worldwide. During two decades of development by a community of laboratories at more than 60 institutions, Rosetta has been continuously refactored and extended. Its advantages are its performance and interoperability between broad modeling capabilities. Here we review tools developed in the last 5 years, including over 80 methods. We discuss improvements to the score function, user interfaces and usability. Rosetta is available at http://www.rosettacommons.org.
BACKGROUND: Carbohydrates are a class of large and diverse biomolecules, ranging from a simple monosaccharide to large multi-branching glycan structures. The covalent linkage of a carbohydrate to the nitrogen atom of an asparagine, a process referred to as N-linked glycosylation, plays an important role in the physiology of many living organisms. Most software for glycan modeling on a personal desktop computer requires knowledge of molecular dynamics to interface with specialized programs such as CHARMM or AMBER. There are a number of popular web-based tools that are available for modeling glycans (e.g., GLYCAM-WEB (https://dev.glycam.org/gp/) or Glycosciences.db (http://www.glycosciences.de/)). However, these web-based tools are generally limited to a few canonical glycan conformations and do not allow the user to incorporate glycan modeling into their protein structure modeling workflow. RESULTS: Here, we present Glycosylator, a Python framework for the identification, modeling and modification of glycans in protein structure that can be used directly in a Python script through its application programming interface (API) or through its graphical user interface (GUI). The GUI provides a straightforward two-dimensional (2D) rendering of a glycoprotein that allows for a quick visual inspection of the glycosylation state of all the sequons on a protein structure. Modeled glycans can be further refined by a genetic algorithm for removing clashes and sampling alternative conformations. Glycosylator can also identify specific three-dimensional (3D) glycans on a protein structure using a library of predefined templates. CONCLUSIONS: Glycosylator was used to generate models of glycosylated protein without steric clashes. Since the molecular topology is based on the CHARMM force field, new complex sugar moieties can be generated without modifying the internals of the code. Glycosylator provides more functionality for analyzing and modeling glycans than any other available software or webserver at present. Glycosylator will be a valuable tool for the glycoinformatics and biomolecular modeling communities.
An extensive quantum mechanical study of a water dimer suggests that the introduction of a diffuse function into the basis set, which significantly reduces the basis set superposition error (BSSE) in the hydrogen bonding energy calculation, is the key to better calculations of the potential energy surfaces of carbohydrates. This article examines the potential energy surfaces of selected d-aldo- and d-ketohexoses (a total of 82 conformers) by quantum mechanics (QM) and molecular mechanics (MM) methods. In contrast to the results with a smaller basis set (B3LYP/6-31G** 5d), we found at the higher level calculation (B3LYP/6-311++G(2d,2p)//B3LYP/6-31G** 5d) that, in most cases, the furanose forms are less stable than the pyranose forms. These discrepancies are mainly due to the fact that intramolecular hydrogen bonding energies are overestimated in the lower level calculations. The higher level QM calculations of the potential energy surfaces of d-aldo- and d-ketohexoses now are more comparable to the MM3 results.
Knowledge of the three-dimensional structures of the carbohydrate molecules is indispensable for a full understanding of the molecular processes in which carbohydrates are involved, such as protein glycosylation or protein-carbohydrate interactions. The Protein Data Bank (PDB) is a valuable resource for three-dimensional structural information on glycoproteins and protein-carbohydrate complexes. Unfortunately, many carbohydrate moieties in the PDB contain inconsistencies or errors. This article gives an overview of the information that can be obtained from individual PDB entries and from statistical analyses of sets of three-dimensional structures, of typical problems that arise during the analysis of carbohydrate three-dimensional structures and of the validation tools that are currently available to scientists to evaluate the quality of these structures.
A detailed investigation of the conformational properties of all the biologically relevant O-glycosidic linkages using the Hamiltonian replica exchange (HREX) simulation methodology and the recently developed CHARMM carbohydrate force field parameters is presented. Fourteen biologically relevant O-linkages between the five sugars N-acetylgalactosamine (GalNAc), N-acetylglucosamine (GlcNAc), D-glucose (Glc), D-mannose (Man), and L-fucose (Fuc) and the amino acids serine and threonine were studied. The force field was tested by comparing the simulation results of the model glycopeptides to various NMR (3)J coupling constants, NOE distances, and data from molecular dynamics with time-averaged restraints (tar-MD). The results show the force field to be in overall agreement with experimental and previous tar-MD simulations, although some small limitations are identified. An in-depth hydrogen bond and bridging water analysis revealed an interplay of hydrogen bonding and bridge water interactions influencing the geometry of the underlying peptide backbone, with the O-linkages favoring extended beta-sheet and polyproline type II (PPII) conformations over the compact alpha(R)-helical conformation. The newly developed parameters were also able to identify hydrogen bonding and water mediated interactions between O-linked sugars and proteins. These results indicate that the newly developed parameters in tandem with HREX conformational sampling provide the means to study glycoproteins in the absence of targeted NMR restraint data.
Knowledge of the structure and conformational flexibility of carbohydrates in an aqueous solvent is important to improving our understanding of how carbohydrates function in biological systems. In this study, we extend a variant of the Hamiltonian replica-exchange molecular dynamics (MD) simulation to improve the conformational sampling of saccharides in an explicit solvent. During the simulations, a biasing potential along the glycosidic-dihedral linkage between the saccharide monomer units in an oligomer is applied at various levels along the replica runs to enable effective transitions between various conformations. One reference replica runs under the control of the original force field. The method was tested on disaccharide structures and further validated on biologically relevant blood group B, Lewis X and Lewis A trisaccharides. The biasing potential-based replica-exchange molecular dynamics (BP-REMD) method provided a significantly improved sampling of relevant conformational states compared with standard continuous MD simulations, with modest computational costs. Thus, the proposed BP-REMD approach adds a new dimension to existing carbohydrate conformational sampling approaches by enhancing conformational sampling in the presence of solvent molecules explicitly at relatively low computational cost.
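The exchange step behind Hamiltonian replica-exchange schemes such as the one described above can be illustrated with a minimal sketch. It assumes the standard Metropolis criterion for two replicas that share a temperature but use different (biased) potentials; the function names, the kJ/mol energy units, and the 300 K default are assumptions for illustration and are not taken from the cited paper.

```python
import math
import random

KB_KJ_PER_MOL_K = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def hrex_swap_probability(u_i_xi, u_j_xj, u_i_xj, u_j_xi, temperature_k=300.0):
    """Metropolis probability of swapping configurations between replicas i and j.

    u_a_xb is the potential energy (kJ/mol) of configuration x_b evaluated
    with the Hamiltonian of replica a. Both replicas share one temperature.
    """
    beta = 1.0 / (KB_KJ_PER_MOL_K * temperature_k)
    # Energy change of the combined system if the two configurations are swapped.
    delta = beta * ((u_i_xj + u_j_xi) - (u_i_xi + u_j_xj))
    return 1.0 if delta <= 0.0 else math.exp(-delta)


def attempt_swap(u_i_xi, u_j_xj, u_i_xj, u_j_xi, temperature_k=300.0, rng=random.random):
    """Return True if the swap is accepted for one random draw."""
    return rng() < hrex_swap_probability(u_i_xi, u_j_xj, u_i_xj, u_j_xi, temperature_k)
```

In a biasing-potential setup, replica i would typically be the unbiased reference and replica j would carry a bias along the glycosidic dihedrals, so the cross terms differ only by the bias energy.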
Carbohydrate-active enzymes (CAZymes) are families of essential and structurally related enzymes, which catalyze the creation, modification, and degradation of glycosidic bonds in carbohydrates to maintain essentially all kingdoms of life. CAZymes play a key role in many biological processes underpinning human health and diseases (e.g., cancer, diabetes, Alzheimer's disease, AIDS) and have thus emerged as important drug targets in the fight against pathogenesis. The realization of the full potential of CAZymes remains a significant challenge, relying on a deeper understanding of the molecular mechanisms of catalysis. Given the numerous unsettled questions in the literature and the large amount of structural, kinetic, and mutagenesis data available for CAZymes, there is a pressing need and an abundant opportunity for collaborative computational and experimental investigations that aim to unlock the secrets of CAZyme catalysis at an atomic level. In this review, we briefly survey key methodology development in computational studies of CAZyme catalysis. This is complemented by selected case studies highlighting mechanistic insights provided by computational glycobiology. Implications for inhibitor design by mimicking the transition state are also illustrated for both glycoside hydrolases and glycosyltransferases. The challenges for such studies are noted, and finally an outlook for future directions is provided.
We have implemented a system called glygal that can perform conformational searches on oligosaccharides using several different genetic algorithm (GA) search methods. The searches are performed in the torsion angle conformational space, considering both the primary glycosidic linkages as well as the pendant groups (C-5-C-6 and hydroxyl groups) where energy calculations are performed using the MM3(96) force field. The system includes a graphical user interface for setting calculation parameters and incorporates a 3D molecular viewer. The system was tested using dozens of structures and we present two case studies for two previously investigated O-specific oligosaccharides of the Shigella dysenteriae type 2 and 4. The results obtained using glygal show a significant reduction in the number of structures that need to be sampled in order to find the best conformation, as compared to filtered systematic search.
Characterizing glycans and glycoconjugates in the context of three-dimensional structures is important in understanding their biological roles and developing efficient therapeutic agents. Computational modeling and molecular simulation have become an essential tool complementary to experimental methods. Here, we present a computational tool, Glycan Modeler, for in silico N-/O-glycosylation of the target protein and generation of carbohydrate-only systems. In our previous study, we developed Glycan Reader, a web-based tool for detecting carbohydrate molecules in a PDB structure and generating a simulation system and input files. Integrated into Glycan Reader in CHARMM-GUI, Glycan Modeler (Glycan Reader & Modeler) generates the structures of glycans and glycoconjugates for given glycan sequences and glycosylation sites using PDB glycan template structures from the Glycan Fragment Database (http://glycanstructure.org/fragment-db). Our benchmark tests demonstrate the universal applicability of Glycan Reader & Modeler to various glycan sequences and target proteins. We also investigated the structural properties of modeled glycan structures by running 2-μs molecular dynamics simulations of HIV envelope protein. The simulations show that the modeled glycan structures built by Glycan Reader & Modeler have structural features similar to the ones solved by X-ray crystallography. We also describe representative examples of glycoconjugate modeling with video demos to illustrate the practical applications of Glycan Reader & Modeler. Glycan Reader & Modeler is freely available at http://charmm-gui.org/input/glycan.
no abstract.
Glycoscience assembles all the scientific disciplines involved in studying various molecules and macromolecules containing carbohydrates and complex glycans. Such an ensemble involves one of the most extensive sets of molecules in quantity and occurrence since they occur in all microorganisms and higher organisms. Once the compositions and sequences of these molecules are established, the determination of their three-dimensional structural and dynamical features is a step toward understanding the molecular basis underlying their properties and functions. The range of the relevant computational methods capable of addressing such issues is anchored by the specificity of stereoelectronic effects from quantum chemistry to mesoscale modeling throughout molecular dynamics and mechanics and coarse-grained and docking calculations. The Review leads the reader through the detailed presentations of the applications of computational modeling. The illustrations cover carbohydrate-carbohydrate interactions, glycolipids, and N- and O-linked glycans, emphasizing their role in SARS-CoV-2. The presentation continues with the structure of polysaccharides in solution and solid-state and lipopolysaccharides in membranes. The full range of protein-carbohydrate interactions is presented, as exemplified by carbohydrate-active enzymes, transporters, lectins, antibodies, and glycosaminoglycan binding proteins. A final section features a list of 150 tools and databases to help address the many issues of structural glycobioinformatics.
The Tinker software, currently released as version 8, is a modular molecular mechanics and dynamics package written primarily in a standard, easily portable dialect of Fortran 95 with OpenMP extensions. It supports a wide variety of force fields, including polarizable models such as the Atomic Multipole Optimized Energetics for Biomolecular Applications (AMOEBA) force field. The package runs on Linux, macOS, and Windows systems. In addition to canonical Tinker, there are branches, Tinker-HP and Tinker-OpenMM, designed for use on message passing interface (MPI) parallel distributed memory supercomputers and state-of-the-art graphical processing units (GPUs), respectively. The Tinker suite also includes a tightly integrated Java-based graphical user interface called Force Field Explorer (FFE), which provides molecular visualization capabilities as well as the ability to launch and control Tinker calculations.
Protein-glycan recognition regulates a wide range of biological and pathogenic processes. Conformational diversity of glycans in solution is apparently incompatible with specific binding to their receptor proteins. One possibility is that among the different conformational states of a glycan, only one conformer is utilized for specific binding to a protein. However, the labile nature of glycans makes characterizing their conformational states a challenging issue. All-atom molecular dynamics (MD) simulations provide the atomic details of glycan structures in solution, but fairly extensive sampling is required for simulating the transitions between rotameric states. This difficulty limits application of conventional MD simulations to small fragments like di- and tri-saccharides. Replica-exchange molecular dynamics (REMD) simulation, with extensive sampling of structures in solution, provides a valuable way to identify a family of glycan conformers. This article reviews recent REMD simulations of glycans carried out by us or other research groups and provides new insights into the conformational equilibria of N-glycans and their alteration by chemical modification. We also emphasize the importance of statistical averaging over the multiple conformers of glycans for comparing simulation results with experimental observables. The results support the concept of "conformer selection" in protein-glycan recognition.
BACKGROUND: Detailed experimental three dimensional structures of carbohydrates are often difficult to acquire. Molecular modelling and computational conformation prediction are therefore commonly used tools for three dimensional structure studies. Modelling procedures generally require significant training and computing resources, which is often impractical for most experimental chemists and biologists. Shape has been developed to improve the availability of modelling in this field. RESULTS: The Shape software package has been developed for simplicity of use and conformation prediction performance. A trivial user interface coupled to an efficient genetic algorithm conformation search makes it a powerful tool for automated modelling. Carbohydrates up to a few hundred atoms in size can be investigated on common computer hardware. It has been shown to perform well for the prediction of over four hundred bioactive oligosaccharides, as well as compare favourably with previously published studies on carbohydrate conformation prediction. CONCLUSION: The Shape fully automated conformation prediction can be used by scientists who lack significant modelling training, and performs well on computing hardware such as laptops and desktops. It can also be deployed on computer clusters for increased capacity. The prediction accuracy under the default settings is good, as it agrees well with experimental data and previously published conformation prediction studies. This software is available both as open source and under commercial licenses.
Modeling of carbohydrates is particularly challenging because of the variety of structures resulting from the high number of monosaccharides and possible linkages, and also because of their intrinsic flexibility. The development of carbohydrate parameters for molecular modeling is still an active field. Currently, the main carbohydrate force fields are GLYCAM06, CHARMM36, and GROMOS 45A4. GLYCAM06 includes the largest choice of compounds and is compatible with the AMBER force fields and the associated software. Furthermore, AMBER includes tools for the implementation of new parameters. When looking at protein-carbohydrate interactions, the choice of the starting structure is important. Such complexes can sometimes be obtained from the Protein Data Bank, although the stereochemistry of the sugars may require some corrections. When no experimental data are available, molecular docking simulation is generally used to obtain the protein-carbohydrate complex coordinates. As molecular docking parameters are not specifically dedicated to carbohydrates, inaccuracies should be expected, especially for the docking of polysaccharides. This issue can be addressed at least partially by combining molecular docking with molecular dynamics simulation in water.
The human glycome comprises a vast untapped repository of 3D-structural information that holds the key to glycan recognition and a new era of rationally designed mimetic chemical probes, drugs, and biomaterials. Toward routine prediction of oligosaccharide conformational populations and exchange rates at thermodynamic equilibrium, we apply hardware-accelerated aqueous molecular dynamics to model μs motions in N-glycans that underpin inflammation and immunity. In 10 μs simulations, conformational equilibria of mannosyl cores, sialyl Lewis (sLe) antennae, and constituent sub-sequences agreed with prior refinements (X-ray and NMR). Glycosidic linkage and pyranose ring flexing were affected by branching, linkage position, and secondary structure, implicating sequence dependent motions in glycomic functional diversity. Linkage and ring conformational transitions that have eluded precise quantification by experiment and conventional (ns) simulations were predicted to occur on μs timescales. All rings populated non-chair shapes and the stacked galactose and fucose pyranoses of sLe(a) and sLe(x) were rigidified, suggesting an exploitable 3D-signature of cell adhesion protein binding. Analyses of sLe(x) dynamics over 25 μs revealed that only 10 μs were sufficient to explore all aqueous conformers. This simulation protocol, which yields conformational ensembles that are independent of initial 3D-structure, is proposed as a route to understanding oligosaccharide recognition and structure-activity relationships, toward development of carbohydrate-based novel chemical entities.
Population analysis in terms of glycosidic torsion angles is frequently used to reveal preferred conformers of glycans. However, due to the high structural diversity and flexibility of carbohydrates, conformational characterization of complex glycans can be a challenging task. Herein, we present a conformation module of oligosaccharide fragments occurring in natural glycan structures, developed on the platform of the Carbohydrate Structure Database. Currently, this module stores free energy surface and conformer abundance maps plotted as a function of glycosidic torsions for 194 inter-residue bonds. Data are automatically and continuously derived from explicit-solvent molecular dynamics (MD) simulations. The module was also supplemented with high-temperature MD data of saccharides (2,403 maps) provided by GlycoMapsDB (hosted by the GLYCOSCIENCES.de project). Conformational data defined by up to 4 torsional degrees of freedom can be freely explored using a web interface of the module available at http://csdb.glycoscience.ru/database/core/search_conf.html.
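The relation between conformer abundance maps and free energy surfaces mentioned above can be sketched with the usual Boltzmann inversion, F(φ,ψ) = -RT ln P(φ,ψ). The sketch below is an illustration only: the function name, the 30° bin width, and the 300 K default are assumptions, and it does not reproduce how the database itself derives its maps.

```python
import math
from collections import Counter

R_KJ_PER_MOL_K = 0.0083144626  # gas constant in kJ/(mol K)


def free_energy_map(phi_psi_samples, bin_width=30.0, temperature_k=300.0):
    """Convert sampled (phi, psi) torsions in degrees into relative free energies.

    Returns a dict mapping the lower-left corner of each populated bin to
    F = -RT ln P in kJ/mol, shifted so that the most populated bin is at 0.
    """
    counts = Counter()
    for phi, psi in phi_psi_samples:
        # Wrap each angle into [-180, 180) and snap it to the lower bin edge.
        key = tuple(
            math.floor(((angle + 180.0) % 360.0) / bin_width) * bin_width - 180.0
            for angle in (phi, psi)
        )
        counts[key] += 1
    if not counts:
        return {}
    total = sum(counts.values())
    rt = R_KJ_PER_MOL_K * temperature_k
    energies = {key: -rt * math.log(n / total) for key, n in counts.items()}
    f_min = min(energies.values())
    return {key: f - f_min for key, f in energies.items()}
```

Unpopulated bins are simply absent from the result, which mirrors the fact that a finite trajectory gives no free energy estimate for states it never visits.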
Analysis and systematization of accumulated data on carbohydrate structural diversity is a subject of great interest for structural glycobiology. Despite being a challenging task, development of computational methods for efficient treatment and management of spatial (3D) structural features of carbohydrates breaks new ground in modern glycoscience. This review is dedicated to chemo- and glyco-informatics approaches to 3D structural data generation, deposition and processing for carbohydrates and their derivatives. Databases, molecular modeling and experimental data validation services, and structure visualization facilities developed over the last five years are reviewed.
Here we report the launch of a web-tool (the GLYCAM-Web GAG Builder, www.glycam.org/gag) for the rapid and straightforward prediction of 3D structural models for glycosaminoglycans (GAGs). The tool provides the user with coordinate files (PDB format) for use in visualization, as well as files for performing MD simulation with the AMBER software package. Counter ions and water may also be added as desired. The tool is designed with the non-expert in mind, and as such has implemented typical default values for structural parameters, which the user may change if desired. Multiple GAG types are supported, including Heparin/Heparan Sulfate, Chondroitin Sulfate, Dermatan Sulfate, Keratan Sulfate, and Hyaluronan; however, the user may alter the default sulfation patterns to create novel sequences. The common non-natural unsaturated uronic acid (ΔUA) and its sulfated derivative are also supported.
Relaxed MM3 phi, psi potential energy surfaces (conformational maps) were calculated for analogues of α,α-trehalose, β,β-trehalose, α,β-trehalose, maltose, cellobiose and galabiose based on 2-methyltetrahydropyran. Starting structures included not only 4C1 (sugar nomenclature) geometries, but also combinations with 1C4 conformers, and some flexible (boat or skew) forms. These forms were included as part of continuing efforts to eliminate unwarranted assumptions in modelling studies, as well as to account for new experimental findings. Four to nine maps were obtained for each analogue, and from them adiabatic maps were produced. Although the minimum energy regions always resulted from 4C1-4C1 geometries, moderate to large parts of most maps had lower energies when one or both rings were in the 1C4 conformation. Only the adiabatic surface for the (diequatorial) analogue of β,β-trehalose was covered entirely by 4C1-4C1 conformers. For the cellobiose and α,β-trehalose analogues, these conformers covered 74 and 67% of the surfaces, respectively. The remainder of the cellobiose analogue surface was covered by conformers having a 1C4 conformation at the "reducing" end, and for the α,β-trehalose analogue, by conformers having 1C4 shapes for the -linked unit. Adiabatic surfaces of the other three analogues were based on all combinations of 4C1 and 1C4 conformers. The "normal" 4C1-4C1 combination only covered 37-41% of those surfaces, whereas each of the other three conformations accounted for 10-31%. Although the "normal" conformation accounted for 97.0-99.8% of the total population, adiabaticity in disaccharide maps is not guaranteed unless variable ring shapes (another manifestation of the "multiple minima problem") are considered.
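The contrast drawn above between the share of a map covered by a ring-shape combination and its share of the total population follows from Boltzmann weighting, p_i = exp(-E_i/RT) / Σ_j exp(-E_j/RT). The sketch below is a simplified illustration that weights a handful of discrete energies rather than integrating over a full φ,ψ surface; the function name and the example energies are hypothetical and are not values from the cited study.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol K)


def boltzmann_populations(energies_kcal, temperature_k=298.15):
    """Return fractional populations for a list of relative energies (kcal/mol)."""
    rt = R_KCAL_PER_MOL_K * temperature_k
    e_min = min(energies_kcal)
    # Subtracting the minimum keeps the exponentials numerically well behaved.
    weights = [math.exp(-(e - e_min) / rt) for e in energies_kcal]
    partition = sum(weights)
    return [w / partition for w in weights]


# Hypothetical minima for 4C1-4C1, 4C1-1C4, 1C4-4C1 and 1C4-1C4 combinations.
example_populations = boltzmann_populations([0.0, 2.5, 3.0, 5.0])
```

Even a few kcal/mol of separation concentrates nearly all of the population in the lowest-energy combination, which is why high-energy regions can cover much of a map yet contribute little to the ensemble.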
Six empirical force fields were tested for applicability to calculations for automated carbohydrate database filling. They were probed on eleven disaccharide molecules containing representative structural features from widespread classes of carbohydrates. The accuracy of each method was queried by predictions of nuclear Overhauser effects (NOEs) from conformational ensembles obtained from 50 to 100 ns molecular dynamics (MD) trajectories and their comparison to the published experimental data. Using various ranking schemes, it was concluded that explicit solvent MM3 MD yielded non-inferior NOE accuracy with newer GLYCAM-06, and ultimately PBE0-D3/def2-TZVP (Triple-Zeta Valence Polarized) Density Functional Theory (DFT) simulations. For seven of eleven molecules, at least one empirical force field with explicit solvent outperformed DFT in NOE prediction. The aggregate of characteristics (accuracy, speed, and compatibility) made MM3 dynamics with explicit solvent at 300 K the most favorable method for bulk generation of disaccharide conformation maps for massive database filling.
no abstract.
In spite of the abundance of glycoproteins in biological processes, relatively little three-dimensional structural data is available for glycan structures. Here, we study the structure and flexibility of the vast majority of mammalian oligosaccharides appearing in N- and O-glycosylated proteins using a bottom-up approach. We report the conformational free-energy landscapes of all relevant glycosidic linkages as obtained from local elevation simulations and subsequent umbrella sampling. To the best of our knowledge, this represents the first complete conformational library for the construction of N- and O-glycan structures. Next, we systematically study the effect of neighboring residues by extensively simulating all relevant trisaccharides and one tetrasaccharide. This allows for an unprecedented comparison of disaccharide linkages in large oligosaccharides. With a small number of exceptions, the conformational preferences in the larger structures are very similar to those in the disaccharides. This, finally, allows us to suggest several efficient approaches to construct complete N- and O-glycans on glycoproteins, as illustrated with two relevant examples.
N-Glycosylation is one of the most common post-translational modifications and is implicated in, for example, protein folding and interaction with ligands and receptors. N-Glycosylation trees are complex structures of linked carbohydrate residues attached to asparagine residues. While carbohydrates are typically modeled in protein structures, they are often incomplete or have the wrong chemistry. Here, new tools are presented to automatically rebuild existing glycosylation trees, to extend them where possible, and to add new glycosylation trees if they are missing from the model. The method has been incorporated in the PDB-REDO pipeline and has been applied to build or rebuild 16 452 carbohydrate residues in 11 651 glycosylation trees in 4498 structure models, and is also available from the PDB-REDO web server. With better modeling of N-glycosylation, the biological function of this important modification can be better and more easily understood.
Carbohydrates are regarded as the interesting molecules of nature because of their structural diversity and functional characteristics. The nature of existence of carbohydrates in varied forms and conformations is crucial in understanding their functional features in living systems. The dynamical behavior of carbohydrates in free or bound state with other biological molecules influences their functional role in biological systems. In N- and O-glycosylation, sequence, structure, and conformation of carbohydrates play a vital role. Hence, necessity arises for the complete understanding of the three-dimensional structures of carbohydrates. One of the theoretical ways of studying the structural and conformational aspect of carbohydrates is by molecular dynamics simulation. Not only the structure and conformation but also the interaction of carbohydrates with its conjugated forms can be investigated. The resources for carbohydrates in the form of databases available are discussed. Sialic acid-containing oligosaccharides which have an important role in molecular recognition phenomena are attributed to their sequence, structure, and conformational diversity. A three-dimensional structural database for sialic acid-containing carbohydrates (3DSDSCAR) developed based on molecular dynamics simulation results is discussed in detail. Glycoinformatics, knowledge about carbohydrates or glycans, is still a field of informatics to be explored more.
Carbohydrates, in more biologically oriented areas referred to as glycans, constitute one of the four groups of biomolecules. The glycans, often present as glycoproteins or glycolipids, form highly complex structures. In mammals ten monosaccharides are utilized in building glycoconjugates in the form of oligo- (up to about a dozen monomers) and polysaccharides. Subsequent modifications and additions create a large number of different compounds. In bacteria, more than a hundred monosaccharides have been reported to be constituents of lipopolysaccharides, capsular polysaccharides, and exopolysaccharides. Thus, the number of polysaccharide structures possible to create is huge. NMR spectroscopy plays an essential part in elucidating the primary structure, that is, monosaccharide identity and ring size, anomeric configuration, linkage position, and sequence, of the sugar residues. The structural studies may also employ computational approaches for NMR chemical shift predictions (CASPER program). Once the components and sequence of sugar residues have been unraveled, the three-dimensional arrangement of the sugar residues relative to each other (conformation), their flexibility (transitions between and populations of conformational states), together with the dynamics (timescales) should be addressed. To shed light on these aspects we have utilized a combination of experimental liquid state NMR techniques together with molecular dynamics simulations. For the latter a molecular mechanics force field such as our CHARMM-based PARM22/SU01 has been used. The experimental NMR parameters acquired are typically ¹H,¹H cross-relaxation rates (related to NOEs), ³J(CH) and ³J(CC) trans-glycosidic coupling constants and ¹H,¹³C and ¹H,¹H residual dipolar couplings. At a glycosidic linkage two torsion angles φ and ψ are defined and for 6-substituted residues also the ω torsion angle is required.
Major conformers can be identified for which highly populated states are present. Thus, in many cases a well-defined albeit not rigid structure can be identified. However, on longer timescales, oligosaccharides must be considered as highly flexible molecules since also anti-conformations have been shown to exist with H-C-O-C torsion angles of approximately 180°, compared to syn-conformations in which the protons at the carbon atoms forming the glycosidic linkage are in close proximity. The accessible conformational space governs possible interactions with proteins and both minor changes and significant alterations occur for the oligosaccharides in these interaction processes. Transferred NOE NMR experiments give information on the conformation of the glycan ligand when bound to the proteins whereas saturation transfer difference NMR experiments report on the carbohydrate part in contact with the protein. It is anticipated that the subtle differences in conformational preferences for glycan structures facilitate a means to regulate biochemical processes in different environments. Further developments in the analysis of glycan structure and in particular its role in interactions with other molecules, will lead to clarifications of the importance of structure in biochemical regulation processes essential to health and disease.
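The glycosidic torsion angles and the syn/anti distinction discussed in this abstract can be made concrete with a short sketch that computes a torsion from four atomic positions, for example the H-C-O-C atoms across a linkage. The function names are hypothetical, and the 120° cutoff used to separate anti from syn is an illustrative assumption, not a value from the cited work.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_degrees(p0, p1, p2, p3):
    """Signed torsion angle (degrees) defined by four points, about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def classify_h_torsion(angle_deg, anti_cutoff=120.0):
    """Label an H-C-O-C torsion as 'anti' (near 180 degrees) or 'syn' (assumed cutoff)."""
    return "anti" if abs(angle_deg) > anti_cutoff else "syn"
```

Applied frame by frame to a trajectory, such a function would yield the φ and ψ time series from which conformer populations are usually counted.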
Complex carbohydrates are ubiquitous in nature, and together with proteins and nucleic acids they comprise the building blocks of life. But unlike proteins and nucleic acids, carbohydrates form nonlinear polymers, and they are not characterized by robust secondary or tertiary structures but rather by distributions of well-defined conformational states. Their molecular flexibility means that oligosaccharides are often refractory to crystallization, and nuclear magnetic resonance (NMR) spectroscopy augmented by molecular dynamics (MD) simulation is the leading method for their characterization in solution. The biological importance of carbohydrate-protein interactions, in organismal development as well as in disease, places urgency on the creation of innovative experimental and theoretical methods that can predict the specificity of such interactions and quantify their strengths. Additionally, the emerging realization that protein glycosylation impacts protein function and immunogenicity places the ability to define the mechanisms by which glycosylation impacts these features at the forefront of carbohydrate modeling. This review will discuss the relevant theoretical approaches to studying the three-dimensional structures of this fascinating class of molecules and interactions, with reference to the relevant experimental data and techniques that are key for validation of the theoretical predictions.
Modern computational methods offer the tools to provide insight into the structural and dynamic properties of carbohydrate-protein complexes, beyond that provided by experimental structural biology. Dynamic properties such as the fluctuation of inter-molecular hydrogen bonds, the residency times of bound water molecules, side chain motions and ligand flexibility may be readily determined computationally. When taken with respect to the unliganded states, these calculations can also provide insight into the entropic and enthalpic changes in free energy associated with glycan binding. In addition, virtual ligand screening may be employed to predict the three dimensional (3D) structures of carbohydrate-protein complexes, given 3D structures for the components. In principle, the 3D structure of the protein may itself be derived by modeling, leading to the exciting--albeit high risk--realm of virtual structure prediction. This latter approach is appealing, given the difficulties associated with generating experimental 3D structures for some classes of glycan binding proteins; however, it is also the least robust. An unexpected outcome of the development of algorithms for modeling carbohydrate-protein interactions has been the discovery of errors in reported experimental 3D structures and a heightened awareness of the need for carbohydrate-specific computational tools for assisting in the refinement and curation of carbohydrate-containing crystal structures. Here we present a summary of the basic strategies associated with employing classical force field based modeling approaches to problems in glycoscience, with a focus on identifying typical pitfalls and limitations. This is not an exhaustive review of the current literature, but hopefully will provide a guide for the glycoscientist interested in modeling carbohydrates and carbohydrate-protein complexes, as well as the computational chemist contemplating such tasks.
no abstract.
CHARMM-GUI Membrane Builder, http://www.charmm-gui.org/input/membrane, is a web-based user interface designed to interactively build all-atom protein/membrane or membrane-only systems for molecular dynamics simulations through an automated optimized process. In this work, we describe the new features and major improvements in Membrane Builder that allow users to robustly build realistic biological membrane systems, including (1) addition of new lipid types, such as phosphoinositides, cardiolipin (CL), sphingolipids, bacterial lipids, and ergosterol, yielding more than 180 lipid types, (2) enhanced building procedure for lipid packing around protein, (3) reliable algorithm to detect lipid tail penetration to ring structures and protein surface, (4) distance-based algorithm for faster initial ion displacement, (5) CHARMM inputs for P21 image transformation, and (6) NAMD equilibration and production inputs. The robustness of these new features is illustrated by building and simulating a membrane model of the polar and septal regions of E. coli membrane, which contains five lipid types: CL lipids with two types of acyl chains and phosphatidylethanolamine lipids with three types of acyl chains. It is our hope that CHARMM-GUI Membrane Builder becomes a useful tool for simulation studies to better understand the structure and dynamics of proteins and lipids in realistic biological membrane environments.
This is the second in a set of two articles where we describe our newly developed scheme to predict conformations of complex oligosaccharides in solution. We apply our fast sugar conformation prediction tool to the case of two complex human milk oligosaccharides LNF-1 and LND-1. As described in detail in the first paper, our protocol initially delivers a set of "unique structures" corresponding to important minima on the potential-energy landscape of a complex sugar using an implicit solvent model. The nuclear Overhauser effect ranking of individual conformations provides a suitable way for comparison with available experiments. The structures obtained agree well with earlier computational predictions but are obtained at a significantly lower computational cost. Sugar conformations corresponding to stable energy minima not found by earlier molecular dynamics studies were also detected using our methodology. In order to evaluate the effects of explicit solvation and thermal fluctuations on several different predicted conformers, we also performed short-time molecular dynamics simulations in an explicit solvent.
This article reports on the implementation of J coupling calculations in our recently developed Fast Sugar Structure Prediction Software (FSPS). The FSPS combines a smart and exhaustive algorithm to search through conformational space with the calculation of different experimental nuclear magnetic resonance observables to establish the conformation of saccharides in solution. Using our algorithm in combination with NMR data, we investigate the solution structure of three simple disaccharides (methyl alpha-sophoroside, methyl alpha-laminarabioside, and methyl alpha-cellobioside) and one complex bacterial polysaccharide (Shigella flexneri 5a).
The conformation of saccharides in solution is challenging to characterize in the context of a single well-defined three-dimensional structure. Instead, they are better represented by an ensemble of conformations associated with their structural diversity and flexibility. In this study, we delineate the conformational heterogeneity of five trisaccharides via a combination of experimental and computational techniques. Experimental NMR measurements target conformationally sensitive parameters, including J couplings and effective distances around the glycosidic linkages, while the computational simulations apply the well-calibrated additive CHARMM carbohydrate force field in combination with efficient enhanced sampling molecular dynamics simulation methods. Analysis of conformational heterogeneity is performed based on sampling of discrete states as defined by dihedral angles, on root-mean-square differences of Cartesian coordinates and on the extent of volume sampled. Conformational clustering, based on the glycosidic linkage dihedral angles, shows that accounting for the full range of sampled conformations is required to reproduce the experimental data, emphasizing the utility of the molecular simulations in obtaining an atomically detailed description of the conformational properties of the saccharides. Results show the presence of differential conformational preferences as a function of primary sequence and glycosidic linkage types. Significant differences in conformational ensembles associated with the anomeric configuration of a single glycosidic linkage reinforce the impact of such changes on the conformational properties of carbohydrates. The present structural insights of the studied trisaccharides represent a foundation for understanding the range of conformations adopted in larger oligosaccharides and how these molecules encode their conformational heterogeneity into the monosaccharide sequence.
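As an illustration of what "discrete states as defined by dihedral angles" can mean in practice, the sketch below assigns each simulation frame to a cell of a regular grid over the glycosidic Phi/Psi angles and reports the fraction of frames in each cell. This is a simplified stand-in, not the clustering procedure used in the article: the function names, the 60° bin width, and the sample angles are all assumptions made for the example.

```python
from collections import Counter


def wrap(angle):
    """Map an angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def state(phi, psi, bin_width=60.0):
    """Label a (phi, psi) pair by the indices of its cell on a regular grid."""
    i = int((wrap(phi) + 180.0) // bin_width)
    j = int((wrap(psi) + 180.0) // bin_width)
    return (i, j)


def populations(frames, bin_width=60.0):
    """Return the fraction of frames that fall into each occupied grid cell."""
    counts = Counter(state(phi, psi, bin_width) for phi, psi in frames)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


# Hypothetical (phi, psi) values in degrees, one pair per trajectory frame.
frames = [(-60.0, 120.0), (-50.0, 130.0), (300.0, 120.0), (60.0, -170.0)]
for cell, fraction in sorted(populations(frames).items()):
    print(cell, fraction)
```

Wrapping the angles first means that 300° and -60° land in the same cell, which matters because dihedral angles are periodic. A fixed grid is the crudest possible state definition; a density-based or k-means clustering would adapt to the sampled distribution instead.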
The conformation of a molecule strongly affects its function, as demonstrated for peptides and nucleic acids. This correlation is much less established for carbohydrates, the most abundant organic materials in nature. Recent advances in synthetic and analytical techniques have enabled the study of carbohydrates at the molecular level. Recurrent structural features were identified as responsible for particular biological activities or material properties. In this Minireview, recent achievements in the structural characterization of carbohydrates, enabled by systematic studies of chemically defined oligosaccharides, are discussed. These findings can guide the development of more potent glycomimetics. Synthetic carbohydrate materials by design can be envisioned.
We describe a general method to use Monte Carlo simulation followed by torsion-angle molecular dynamics simulations to create ensembles of structures to model a wide variety of soft-matter biological systems. Our particular emphasis is focused on modeling low-resolution small-angle scattering and reflectivity structural data. We provide examples of this method applied to HIV-1 Gag protein and derived fragment proteins, TraI protein, linear B-DNA, a nucleosome core particle, and a glycosylated monoclonal antibody. This procedure will enable a large community of researchers to model low-resolution experimental data with greater accuracy by using robust physics based simulation and sampling methods which are a significant improvement over traditional methods used to interpret such data.
Over the past decade, the Rosetta biomolecular modeling suite has informed diverse biological questions and engineering challenges ranging from interpretation of low-resolution structural data to design of nanomaterials, protein therapeutics, and vaccines. Central to Rosetta's success is the energy function: a model parametrized from small-molecule and X-ray crystal structure data used to approximate the energy associated with each biomolecule conformation. This paper describes the mathematical models and physical concepts that underlie the latest Rosetta energy function, called the Rosetta Energy Function 2015 (REF15). Applying these concepts, we explain how to use Rosetta energies to identify and analyze the features of biomolecular models. Finally, we discuss the latest advances in the energy function that extend its capabilities from soluble proteins to also include membrane proteins, peptides containing noncanonical amino acids, small molecules, carbohydrates, nucleic acids, and other macromolecules.
The MM3 force field has been extended to the title class of compounds. Structures may be generally well calculated for a variety of rather simple alcohols and ethers. A number of conformational properties of molecules of this class were examined. Hydrogen bonding and anomeric effects were also studied. The vibrational spectra of four simple molecules have been fit with an rms of 35 cm-1, and the effects of hydrogen bonding on vibrational spectra have been examined.
A new molecular mechanics force field (called MM3) for the treatment of aliphatic hydrocarbons has been developed and is presented here. This force field will enable one to calculate the structures and energies, including heats of formation, conformational energies, and rotational barriers, for hydrocarbons more accurately than was possible with earlier force fields. In addition to simple molecules, a great many highly strained molecules have been studied, and the results are almost always of experimental accuracy.
We present an extension of the CHARMM Drude polarizable force field to enable modeling of polysaccharides containing pyranose and furanose monosaccharides. The new force field parameters encompass 1<-->2, 1-->3, 1-->4, and 1-->6 pyranose-furanose linkages, 2-->1 and 2-->6 furanose-furanose linkages, 2-->2, 2-->3, and 2-->4 furanose-pyranose, and 1<-->1, 1-->2, 1-->3, 1-->4, and 1-->6 pyranose-pyranose linkages. For the glycosidic linkages, both simple model compounds and the full disaccharides with methylation at the reducing end were used for parameter optimization. The model compounds were chosen to be monomers or glycosidic-linked dimers of tetrahydropyran (THP) and tetrahydrofuran (THF). Target data for optimization included one- and two-dimensional potential energy scans of omega and the Phi/Psi glycosidic dihedral angles in the model compounds and full disaccharides computed by quantum mechanical (QM) RIMP2/cc-pVQZ single point energies on MP2/6-31G(d) optimized structures. Also included in the target data are extensive sets of QM gas phase monohydrate water-saccharide interactions, dipole moments, and molecular polarizabilities for both model compounds and full disaccharides. The resulting polarizable model is shown to be in good agreement with a range of QM data, offering a significant improvement over the additive CHARMM36 carbohydrate force field, as well as experimental data including crystal structures and conformational properties of disaccharides and a trisaccharide in aqueous solution.
Computational description of conformational and dynamic properties of anticoagulant heparin analogue pentasaccharides is of crucial importance in understanding their biological activities. We designed and synthesized idraparinux derivatives modified with sulfonatomethyl moieties at the D, F, and H glucose units that display varied potencies depending on the exact nature of the substitution. In this report we examined the capability of molecular dynamics (MD) simulations to describe the conformational behavior of these novel idraparinux derivatives. We used Gaussian accelerated MD (GAMD) simulations on the parent compound, idraparinux, to choose the most suitable carbohydrate force field for this type of compound. GAMD provided significant acceleration of conformational transitions compared to classical MD. We compared descriptors obtained from GAMD with NMR spectroscopic parameters related to geometrical descriptors such as scalar couplings and nuclear Overhauser effects (NOE) measured on idraparinux. We found that the experimental data of idraparinux are best reproduced by the CHARMM carbohydrate force field. Furthermore, we propose a torsion angle parameter for the sulfonatomethyl group, which was developed for the chosen CHARMM force field using quantum chemical calculations and validated by comparison with NMR data. The work lays the foundation for using MD simulations to gain insight into the conformational properties of sulfonatomethyl group modified idraparinux derivatives and to understand their structure-activity relationship, thus enabling rational design of further modifications.
The CHARMM36 carbohydrate parameter set did not adequately reproduce experimental thermodynamic data of carbohydrate interactions with water or proteins or carbohydrate self-association; thus, a new nonbonded parameter set for carbohydrates was developed. The parameters were developed to reproduce experimental Kirkwood-Buff integral values, defined by the Kirkwood-Buff theory of solutions, and applied to simulations of glycerol, sorbitol, glucose, sucrose, and trehalose. Compared to the CHARMM36 carbohydrate parameters, these new Kirkwood-Buff-based parameters reproduced accurately carbohydrate self-association and the trend of activity coefficient derivative changes with concentration. When using these parameters, preferential interaction coefficients calculated from simulations of these carbohydrates and the proteins lysozyme, bovine serum albumin, alpha-chymotrypsinogen A, and RNase A agreed well with the experimental data, whereas use of the CHARMM36 parameters indicated preferential inclusion of carbohydrates, in disagreement with the experiment. Thus, calculating preferential interaction coefficients from simulations requires using a force field that accurately reproduces trends in the thermodynamic properties of binary excipient-water solutions, and in particular the trend in the activity coefficient derivative. Finally, the carbohydrate-protein simulations using the new parameters indicated that the carbohydrate size was a major factor in the distribution of different carbohydrates around a protein surface.
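For readers unfamiliar with the quantity being fitted, the Kirkwood-Buff integral between species i and j is defined from the radial distribution function as G_ij = 4π ∫ [g_ij(r) − 1] r² dr. The sketch below evaluates that integral by the trapezoidal rule on a tabulated g(r). It is only a minimal illustration: the function name and the step-function test profile are assumptions, and estimates from real simulations also need finite-size and truncation corrections that are not shown here.

```python
import math


def kirkwood_buff_integral(r, g):
    """Trapezoidal estimate of G = 4*pi * integral of (g(r) - 1) * r**2 dr.

    r and g are equal-length sequences holding the tabulated distances and
    the radial distribution function at those distances.
    """
    total = 0.0
    for k in range(1, len(r)):
        f_prev = (g[k - 1] - 1.0) * r[k - 1] ** 2
        f_curr = (g[k] - 1.0) * r[k] ** 2
        total += 0.5 * (f_prev + f_curr) * (r[k] - r[k - 1])
    return 4.0 * math.pi * total


# Hypothetical check: a hard-core profile with g = 0 inside r = 1 and g = 1
# outside should give minus the excluded volume, -4*pi/3.
r = [i * 0.001 for i in range(2001)]
g = [0.0 if x < 1.0 else 1.0 for x in r]
print(kirkwood_buff_integral(r, g), -4.0 * math.pi / 3.0)
```

A negative value indicates net exclusion of species j around species i relative to the bulk, and a positive value indicates net accumulation, which is why these integrals are a sensitive target for tuning solute-solute and solute-water interactions.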
The conductor-like solvation model, as developed in the framework of the polarizable continuum model (PCM), has been reformulated and newly implemented in order to compute energies, geometric structures, harmonic frequencies, and electronic properties in solution for any chemical system that can be studied in vacuo. Particular attention is devoted to large systems requiring suitable iterative algorithms to compute the solvation charges: the fast multipole method (FMM) has been extensively used to ensure a linear scaling of the computational times with the size of the solute. A number of test applications are presented to evaluate the performances of the method.
The OPLS all-atom (AA) force field for organic and biomolecular systems has been expanded to include carbohydrates. Starting with reported nonbonded parameters of alcohols, ethers, and diols, torsional parameters were fit to reproduce results from ab initio calculations on the hexopyranoses, α,β-d-glucopyranose, α,β-d-mannopyranose, α,β-d-galactopyranose, methyl α,β-d-glucopyranoside, and methyl α,β-d-mannopyranoside. In all, geometry optimizations were carried out for 144 conformers at the restricted Hartree–Fock (RHF)/6–31G* level. For the conformers with a relative energy within 3 kcal/mol of the global minima, the effects of electron correlation and basis-set extension were considered by performing single-point calculations with density functional theory at the B3LYP/6–311+G** level. The torsional parameters for the OPLS-AA force field were parameterized to reproduce the energies and structures of these 44 conformers. The resultant force field reproduces the ab initio calculated energies with an average unsigned error of 0.41 kcal/mol. The α/β ratios as well as the relative energies between the isomeric hexopyranoses are in good accord with the ab initio results. The predictive abilities of the force field were also tested against RHF/6–31G* results for d-allopyranose with excellent success; a surprising discovery is that the lowest energy conformer of d-allopyranose is a β anomer.
Carbohydrates present a special set of challenges to the generation of force fields. First, the tertiary structures of monosaccharides are complex merely by virtue of their exceptionally high number of chiral centers. In addition, their electronic characteristics lead to molecular geometries and electrostatic landscapes that can be challenging to predict and model. The monosaccharide units can also interconnect in many ways, resulting in a large number of possible oligosaccharides and polysaccharides, both linear and branched. These larger structures contain a number of rotatable bonds, meaning they potentially sample an enormous conformational space. This article briefly reviews the history of carbohydrate force fields, examining and comparing their challenges, forms, philosophies, and development strategies. Then it presents a survey of recent uses of these force fields, noting trends, strengths, deficiencies, and possible directions for future expansion.
A semiempirical method based on the AM1/d Hamiltonian is introduced to model chemical glycobiological systems. We included in the parameter training set glycans and the chemical environment often found about them in glycoenzymes. Starting with RM1 and AM1/d-PhoT models we optimized H, C, N, O, and P atomic parameters targeting the best performing molecular properties that contribute to enzyme catalyzed glycan reaction mechanisms. The training set comprising glycans, amino acids, phosphates and small organic model systems was used to derive parameters that reproduce experimental data or high-level density functional results for carbohydrate, phosphate and amino acid heats of formation, amino acid proton affinities, amino acid and monosaccharide dipole moments, amino acid ionization potentials, water-phosphate interaction energies, and carbohydrate ring pucker relaxation times. The result is the AM1/d-Chemical Biology 1 or AM1/d-CB1 model that is considerably more accurate than existing NDDO methods modeling carbohydrates and the amino acids often present in the catalytic domains of glycoenzymes as well as the binding sites of lectins. Moreover, AM1/d-CB1 computed proton affinities, dipole moments, ionization potentials and heats of formation for transition state puckered carbohydrate ring conformations, observed along glycoenzyme catalyzed reaction paths, are close to values computed using DFT M06-2X. AM1/d-CB1 provides a platform from which to accurately model reactions important in chemical glycobiology.
We present an all-atom additive empirical force field for the hexopyranose monosaccharide form of glucose and its diastereomers allose, altrose, galactose, gulose, idose, mannose, and talose. The model is developed to be consistent with the CHARMM all-atom biomolecular force fields, and the same parameters are used for all diastereomers, including both the alpha- and beta-anomers of each monosaccharide. The force field is developed in a hierarchical manner and reproduces the gas-phase and condensed-phase properties of small-molecule model compounds corresponding to fragments of pyranose monosaccharides. The resultant parameters are transferred to the full pyranose monosaccharides, and additional parameter development is done to achieve a complete hexopyranose monosaccharide force field. Parametrization target data include vibrational frequencies, crystal geometries, solute-water interaction energies, molecular volumes, heats of vaporization, and conformational energies, including those for over 1800 monosaccharide conformations at the MP2/cc-pVTZ//MP2/6-31G(d) level of theory. Although not targeted during parametrization, free energies of aqueous solvation for the model compounds compare favorably with experimental values. Also well-reproduced are monosaccharide crystal unit cell dimensions and ring pucker, densities of concentrated aqueous glucose systems, and the thermodynamic and dynamic properties of the exocyclic torsion in dilute aqueous systems. The new parameter set expands the CHARMM additive force field to allow for simulation of heterogeneous systems that include hexopyranose monosaccharides in addition to proteins, nucleic acids, and lipids.
We present an extension of the CHARMM hexopyranose monosaccharide additive all-atom force field to enable modeling of glycosidic-linked hexopyranose polysaccharides. The new force field parameters encompass 1-->1, 1-->2, 1-->3, 1-->4, and 1-->6 hexopyranose glycosidic linkages, as well as O-methylation at the C(1) anomeric carbon, and are developed to be consistent with the CHARMM all-atom biomolecular force fields for proteins, nucleic acids, and lipids. The parameters are developed in a hierarchical fashion using model compounds containing the key atoms in the full carbohydrates, in particular O-methyl-tetrahydropyran and glycosidic-linked dimers consisting of two molecules of tetrahydropyran or one of tetrahydropyran and one of cyclohexane. Target data for parameter optimization include full two-dimensional energy surfaces defined by the Phi/Psi glycosidic dihedral angles in the disaccharide analogs as determined by quantum mechanical MP2/cc-pVTZ single point energies on MP2/6-31G(d) optimized structures (MP2/cc-pVTZ//MP2/6-31G(d)). In order to achieve balanced, transferable dihedral parameters for the Phi/Psi glycosidic dihedral angles, surfaces for all possible chiralities at the ring carbon atoms involved in the glycosidic linkages are considered, resulting in over 5000 MP2/cc-pVTZ//MP2/6-31G(d) conformational energies. Also included as target data are vibrational frequencies, pair interaction energies and distances with water molecules, and intramolecular geometries including distortion of the glycosidic valence angle as a function of the glycosidic dihedral angles. The model-compound optimized force field parameters are validated on full disaccharides through comparison of molecular dynamics results to available experimental data. Good agreement is achieved with experiment for a variety of properties including crystal cell parameters and intramolecular geometries, aqueous densities, and aqueous NMR coupling constants associated with the glycosidic linkage. The newly developed parameters allow for the modeling of linear, branched, and cyclic hexopyranose glycosides both alone and in heterogeneous systems including proteins, nucleic acids and/or lipids when combined with existing CHARMM biomolecular force fields.
Monosaccharide derivatives such as xylose, fucose, N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), glucuronic acid, iduronic acid, and N-acetylneuraminic acid (Neu5Ac) are important components of eukaryotic glycans. The present work details development of force-field parameters for these monosaccharides and their covalent connections to proteins via O-linkages to serine or threonine sidechains and via N-linkages to asparagine sidechains. The force field development protocol was designed to explicitly yield parameters that are compatible with the existing CHARMM additive force field for proteins, nucleic acids, lipids, carbohydrates, and small molecules. Therefore, when combined with previously developed parameters for pyranose and furanose monosaccharides, for glycosidic linkages between monosaccharides, and for proteins, the present set of parameters enables the molecular simulation of a wide variety of biologically important molecules such as complex carbohydrates and glycoproteins. Parametrization included fitting to quantum mechanical (QM) geometries and conformational energies of model compounds, as well as to QM pair interaction energies and distances of model compounds with water. Parameters were validated in the context of crystals of relevant monosaccharides, as well as NMR and/or x-ray crystallographic data on larger systems including oligomeric hyaluronan, sialyl Lewis X, O- and N-linked glycopeptides, and a lectin:sucrose complex. As the validated parameters are an extension of the CHARMM all-atom additive biomolecular force field, they further broaden the types of heterogeneous systems accessible with a consistently developed force-field model.
This article introduces MMFF94, the initial published version of the Merck molecular force field (MMFF). It describes the objectives set for MMFF, the form it takes, and the range of systems to which it applies. This study also outlines the methodology employed in parameterizing MMFF94 and summarizes its performance in reproducing computational and experimental data. Though similar to MM3 in some respects, MMFF94 differs in ways intended to facilitate application to condensed-phase processes in molecular-dynamics simulations. Indeed, MMFF94 seeks to achieve MM3-like accuracy for small molecules in a combined “organic/protein” force field that is equally applicable to proteins and other systems of biological significance. A second distinguishing feature is that the core portion of MMFF94 has primarily been derived from high-quality computational data: ca. 500 molecular structures optimized at the HF/6-31G* level, 475 structures optimized at the MP2/6-31G* level, 380 MP2/6-31G* structures evaluated at a defined approximation to the MP4SDQ/TZP level, and 1450 structures partly derived from MP2/6-31G* geometries and evaluated at the MP2/TZP level. A third distinguishing feature is that MMFF94 has been parameterized for a wide variety of chemical systems of interest to organic and medicinal chemists, including many that feature frequently occurring combinations of functional groups for which little, if any, useful experimental data are available. The methodology used in parameterizing MMFF94 represents a fourth distinguishing feature. Rather than using the common “functional group” approach, nearly all MMFF parameters have been determined in a mutually consistent fashion from the full set of available computational data. MMFF94 reproduces the computational data used in its parameterization very well. In addition, MMFF94 reproduces experimental bond lengths (0.014 Å root mean square [rms]), bond angles (1.2° rms), vibrational frequencies (61 cm−1 rms), conformational energies (0.38 kcal/mol rms), and rotational barriers (0.39 kcal/mol rms) very nearly as well as does MM3 for comparable systems. MMFF94 also describes intermolecular interactions in hydrogen-bonded systems in a way that closely parallels that given by the highly regarded OPLS force field.
This article presents a reoptimization of the GROMOS 53A6 force field for hexopyranose-based carbohydrates (nearly equivalent to 45A4 for pure carbohydrate systems) into a new version 56A(CARBO) (nearly equivalent to 53A6 for non-carbohydrate systems). This reoptimization was found necessary to repair a number of shortcomings of the 53A6 (45A4) parameter set and to extend the scope of the force field to properties that had not been included previously into the parameterization procedure. The new 56A(CARBO) force field is characterized by: (i) the formulation of systematic build-up rules for the automatic generation of force-field topologies over a large class of compounds including (but not restricted to) unfunctionalized polyhexopyranoses with arbitrary connectivities; (ii) the systematic use of enhanced sampling methods for inclusion of experimental thermodynamic data concerning slow or unphysical processes into the parameterization procedure; and (iii) an extensive validation against available experimental data in solution and, to a limited extent, theoretical (quantum-mechanical) data in the gas phase. At present, the 56A(CARBO) force field is restricted to compounds of the elements C, O, and H presenting single bonds only, no oxygen functions other than alcohol, ether, hemiacetal, or acetal, and no cyclic segments other than six-membered rings (separated by at least one intermediate atom). After calibration, this force field is shown to reproduce well the relative free energies of ring conformers, anomers, epimers, hydroxymethyl rotamers, and glycosidic linkage conformers. As a result, the 56A(CARBO) force field should be suitable for: (i) the characterization of the dynamics of pyranose ring conformational transitions (in simulations on the microsecond timescale); (ii) the investigation of systems where alternative ring conformations become significantly populated; (iii) the investigation of anomerization or epimerization in terms of free-energy differences; and (iv) the design of simulation approaches accelerating the anomerization process along an unphysical pathway.
A polarizable empirical force field for acyclic polyalcohols based on the classical Drude oscillator is presented. The model is optimized with an emphasis on the transferability of the developed parameters among molecules of different sizes in this series and on the condensed-phase properties validated against experimental data. The importance of the explicit treatment of electronic polarizability in empirical force fields is demonstrated in the cases of this series of molecules with vicinal hydroxyl groups that can form cooperative intra- and intermolecular hydrogen bonds. Compared to the CHARMM additive force field, improved treatment of the electrostatic interactions avoids overestimation of the gas-phase dipole moments resulting in significant improvement in the treatment of the conformational energies and leads to the correct balance of intra- and intermolecular hydrogen bonding of glycerol as evidenced by calculated heat of vaporization being in excellent agreement with experiment. Computed condensed phase data, including crystal lattice parameters and volumes and densities of aqueous solutions are in better agreement with experimental data as compared to the corresponding additive model. Such improvements are anticipated to significantly improve the treatment of polymers in general, including biological macromolecules.
Knowledge of thermodynamic and transport properties of aqueous solutions of carbohydrates is of great interest for process and product design in the food, pharmaceutical, and biotechnological industries. Molecular simulation is a powerful tool to calculate these properties, but current classical force fields cannot provide accurate estimates for all properties of interest. The poor performance of the force fields is mainly observed for concentrated solutions, where solute-solute interactions are overestimated. In this study, we propose a method to refine force fields, such that solute-solute interactions are more accurately described. The OPLS force field combined with the SPC/Fw water model is used as a basis. We scale the nonbonded interaction parameters of sucrose, a disaccharide. The scaling factors are chosen in such a way that experimental thermodynamic and transport properties of aqueous solutions of sucrose are accurately reproduced. Using a scaling factor of 0.8 for Lennard-Jones energy parameters (ϵ) and a scaling factor of 0.95 for partial atomic charges (q), we find excellent agreement between experiments and computed liquid densities, thermodynamic factors, shear viscosities, self-diffusion coefficients, and Fick (mutual) diffusion coefficients. The transferability of these optimum scaling factors to other carbohydrates is verified by computing thermodynamic and transport properties of aqueous solutions of d-glucose, a monosaccharide. The good agreement between computed properties and experiments suggests that the scaled interaction parameters are transferable to other carbohydrates, especially for concentrated solutions.
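The scaling described in this abstract amounts to multiplying two per-atom nonbonded parameters by constant factors. A minimal sketch of that step follows, assuming a simple in-memory atom record: the `Atom` class, the function name, and the numerical parameter values are hypothetical placeholders rather than actual OPLS parameters, and only the factors 0.8 and 0.95 come from the abstract.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Atom:
    name: str
    charge: float   # partial atomic charge q, in units of e
    epsilon: float  # Lennard-Jones well depth
    sigma: float    # Lennard-Jones diameter (left unscaled)


def scale_nonbonded(atoms, eps_scale=0.8, q_scale=0.95):
    """Return new atom records with epsilon and charge multiplied by the given factors."""
    return [
        replace(a, charge=a.charge * q_scale, epsilon=a.epsilon * eps_scale)
        for a in atoms
    ]


# Placeholder values for a neutral hydroxyl-like fragment, for illustration only.
solute = [
    Atom("C", 0.20, 0.066, 3.50),
    Atom("O", -0.60, 0.170, 3.12),
    Atom("H", 0.40, 0.000, 0.00),
]
for atom in scale_nonbonded(solute):
    print(atom)
```

Because every charge is multiplied by the same factor, a neutral molecule stays neutral after scaling. In this sketch the factors are applied to the solute records only, which leaves the water model untouched.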
An empirical all-atom CHARMM polarizable force field for aldopentofuranoses and methyl-aldopentofuranosides based on the classical Drude oscillator is presented. A single electrostatic model is developed for eight different diastereoisomers of aldopentofuranoses by optimizing the existing electrostatic and bonded parameters as transferred from ethers, alcohols, and hexopyranoses to reproduce quantum mechanical (QM) dipole moments, furanose-water interaction energies and conformational energies. Optimization of selected electrostatic and dihedral parameters was performed to generate a model for methyl-aldopentofuranosides. Accuracy of the model was tested by reproducing experimental data for crystal intramolecular geometries and lattice unit cell parameters, aqueous phase densities, and ring pucker and exocyclic rotamer populations as obtained from NMR experiments. In most cases the model is found to reproduce both QM data and experimental observables in an excellent manner, whereas for the remainder the level of agreement is satisfactory. In aqueous phase simulations the monosaccharides have significantly enhanced dipoles as compared to the gas phase. The final model from this study is transferable for future studies on carbohydrates and can be used with the existing CHARMM Drude polarizable force field for biomolecules.
The parametrization and testing of the OPLS all-atom force field for organic molecules and peptides are described. Parameters for both torsional and nonbonded energetics have been derived, while the bond stretching and angle bending parameters have been adopted mostly from the AMBER all-atom force field. The torsional parameters were determined by fitting to rotational energy profiles obtained from ab initio molecular orbital calculations at the RHF/6-31G*//RHF/6-31G* level for more than 50 organic molecules and ions. The quality of the fits was high with average errors for conformational energies of less than 0.2 kcal/mol. The force-field results for molecular structures are also demonstrated to closely match the ab initio predictions. The nonbonded parameters were developed in conjunction with Monte Carlo statistical mechanics simulations by computing thermodynamic and structural properties for 34 pure organic liquids including alkanes, alkenes, alcohols, ethers, acetals, thiols, sulfides, disulfides, aldehydes, ketones, and amides. Average errors in comparison with experimental data are 2% for heats of vaporization and densities. The Monte Carlo simulations included sampling all internal and intermolecular degrees of freedom. It is found that such non-polar and monofunctional systems do not show significant condensed-phase effects on internal energies in going from the gas phase to the pure liquids.
Lipopolysaccharides (LPS) comprise the outermost layer of the Gram-negative bacteria cell envelope. Packed onto a lipid layer, the outer membrane displays remarkable physical-chemical differences compared to cell membranes. The carbohydrate-rich region confers a membrane asymmetry that underlies many biological processes such as endotoxicity, antibiotic resistance, and cell adhesion. Furthermore, unlike membrane proteins from other sources, integral outer-membrane proteins do not consist of transmembrane alpha helices; instead they consist of antiparallel beta-barrels, which highlights the importance of the LPS membrane as a medium. In this work, we present an extension of the GLYCAM06 force field that has been specifically developed for LPS membranes using our Wolf2Pack program. This new set of parameters for lipopolysaccharide molecules expands the GLYCAM06 repertoire of monosaccharides to include phosphorylated N- and O-acetylglucosamine, 3-deoxy-d-manno-oct-2-ulosonic acid, l-glycero-D-manno-heptose and its O-carbamoylated variant, and N-alanine-d-galactosamine. A total of 1 μs of molecular dynamics simulations of the rough LPS membrane of Pseudomonas aeruginosa PA01 is used to showcase the added parameter set. The equilibration of the LPS membrane is shown to be significantly slower compared to phospholipid membranes, on the order of 500 ns. It is further shown that water molecules penetrate the hydrocarbon region up to the terminal methyl groups, much deeper than commonly observed for phospholipid bilayers, and in agreement with neutron diffraction measurements. A comparison of simulated structural, dynamical, and electrostatic properties against corresponding experimentally available data shows that the present parameter set reproduces well the overall structure and the permeability of LPS membranes in the liquid-crystalline phase.
A new derivation of the GLYCAM06 force field, which removes its previous specificity for carbohydrates, and its dependency on the AMBER force field and parameters, is presented. All pertinent force field terms have been explicitly specified and so no default or generic parameters are employed. The new GLYCAM is no longer limited to any particular class of biomolecules, but is extendible to all molecular classes in the spirit of a small-molecule force field. The torsion terms in the present work were all derived from quantum mechanical data from a collection of minimal molecular fragments and related small molecules. For carbohydrates, there is now a single parameter set applicable to both alpha- and beta-anomers and to all monosaccharide ring sizes and conformations. We demonstrate that deriving dihedral parameters by fitting to QM data for internal rotational energy curves for representative small molecules generally leads to correct rotamer populations in molecular dynamics simulations, and that this approach removes the need for phase corrections in the dihedral terms. However, we note that there are cases where this approach is inadequate. Reported here are the basic components of the new force field as well as an illustration of its extension to carbohydrates. In addition to reproducing the gas-phase properties of an array of small test molecules, condensed-phase simulations employing GLYCAM06 are shown to reproduce rotamer populations for key small molecules and representative biopolymer building blocks in explicit water, as well as crystalline lattice properties, such as unit cell dimensions, and vibrational frequencies.
Starting from the screening in conductors, an algorithm for the accurate calculation of dielectric screening effects in solvents is presented, which leads to rather simple explicit expressions for the screening energy and its analytic gradient with respect to the solute coordinates. Thus geometry optimization of a solute within a realistic dielectric continuum model becomes practicable for the first time. The algorithm is suited for molecular mechanics as well as for any molecular orbital algorithm. The implementation into MOPAC and some example applications are reported.
This work describes an improved version of the original OPLS-all atom (OPLS-AA) force field for carbohydrates (Damm et al., J Comp Chem 1997, 18, 1955). The improvement is achieved by applying additional scaling factors for the electrostatic interactions between 1,5- and 1,6-interactions. This new model is tested first for improving the conformational energetics of 1,2-ethanediol, the smallest polyol. With a 1,5-scaling factor of 1.25 the force field calculated relative energies are in excellent agreement with the ab initio-derived data. Applying the new 1,5-scaling makes it also necessary to use a 1,6-scaling factor for the interactions between the C4 and C6 atoms in hexopyranoses. After torsional parameter fitting, this improves the conformational energetics in comparison to the OPLS-AA force field. The set of hexopyranoses included in the torsional parameter derivation consists of the two anomers of D-glucose, D-mannose, and D-galactose, as well as of the methyl-pyranosides of D-glucose, D-mannose. Rotational profiles for the rotation of the exocyclic group and of different hydroxyl groups are also compared for the two force fields and at the ab initio level of theory. The new force field reduces the overly high barriers calculated using the OPLS-AA force field. This leads to better sampling, which was shown to produce more realistic conformational behavior for hexopyranoses in liquid simulation. From 10-ns molecular dynamics (MD) simulations of alpha-D-glucose and alpha-D-galactose the ratios for the three different conformations of the hydroxymethylene group and the average (3)J(H,H) coupling constants are derived and compared to experimental values. The results obtained for OPLS-AA-SEI force field are in good agreement with experiment whereas the properties derived for the OPLS-AA force field suffer from sampling problems. 
The undertaken investigations show that the newly derived OPLS-AA-SEI force field will allow simulating larger carbohydrates or polysaccharides with improved sampling of the hydroxyl groups.
A new parameter set (referred to as 45A4) is developed for the explicit-solvent simulation of hexopyranose-based carbohydrates. This set is compatible with the most recent version of the GROMOS force field for proteins, nucleic acids, and lipids, and the SPC water model. The parametrization procedure relies on: (1) reassigning the atomic partial charges based on a fit to the quantum-mechanical electrostatic potential around a trisaccharide; (2) refining the torsional potential parameters associated with the rotations of the hydroxymethyl, hydroxyl, and anomeric alkoxy groups by fitting to corresponding quantum-mechanical profiles for hexopyranosides; (3) adapting the torsional potential parameters determining the ring conformation so as to stabilize the (experimentally predominant) (4)C(1) chair conformation. The other (van der Waals and nontorsional covalent) parameters and the rules for third and excluded neighbors are taken directly from the most recent version of the GROMOS force field (except for one additional exclusion). The new set is general enough to define parameters for any (unbranched) hexopyranose-based mono-, di-, oligo- or polysaccharide. In the present article, this force field is validated for a limited set of monosaccharides (alpha- and beta-D-glucose, alpha- and beta-D-galactose) and disaccharides (trehalose, maltose, and cellobiose) in solution, by comparing the results of simulations to available experimental data. More extensive validation will be the scope of a forthcoming article. (c) 2005 Wiley Periodicals, Inc. J Comput Chem 26: 1400-1412, 2005.
We present an extension of the Martini coarse-grained force field to carbohydrates. The parametrization follows the same philosophy as was used previously for lipids and proteins, focusing on the reproduction of partitioning free energies of small compounds between polar and nonpolar phases. The carbohydrate building blocks considered are the monosaccharides glucose and fructose and the disaccharides sucrose, trehalose, maltose, cellobiose, nigerose, laminarabiose, kojibiose, and sophorose. Bonded parameters for these saccharides are optimized by comparison to conformations sampled with an atomistic force field, in particular with respect to the representation of the most populated rotameric state for the glycosidic bond. Application of the new coarse-grained carbohydrate model to the oligosaccharides amylose and Curdlan shows a preservation of the main structural properties with 3 orders of magnitude more efficient sampling than the atomistic counterpart. Finally, we investigate the cryo- and anhydro-protective effect of glucose and trehalose on a lipid bilayer and find a strong decrease of the melting temperature, in good agreement with both experimental findings and atomistic simulation studies.
We present an extension of the Martini coarse-grained force field to glycolipids. The glycolipids considered here are the glycoglycerolipids monogalactosyldiacylglycerol (MGDG), sulfoquinovosyldiacylglycerol (SQDG), digalactosyldiacylglycerol (DGDG), and phosphatidylinositol (PI) and its phosphorylated forms (PIP, PIP2), as well as the glycosphingolipids galactosylceramide (GCER) and monosialotetrahexosylganglioside (GM1). The parametrization follows the same philosophy as was used previously for lipids, proteins, and carbohydrates focusing on the reproduction of partitioning free energies of small compounds between polar and nonpolar solvents. Bonded parameters are optimized by comparison to lipid conformations sampled with an atomistic force field, in particular with respect to the representation of the most populated states around the glycosidic linkage. Simulations of coarse-grained glycolipid model membranes show good agreement with atomistic simulations as well as experimental data available, especially concerning structural properties such as electron densities, area per lipid, and membrane thickness. Our coarse-grained model opens the way to large scale simulations of biological processes in which glycolipids are important, including recognition, sorting, and clustering of both external and membrane bound proteins.
New protein parameters are reported for the all-atom empirical energy function in the CHARMM program. The parameter evaluation was based on a self-consistent approach designed to achieve a balance between the internal (bonding) and interaction (nonbonding) terms of the force field and among the solvent-solvent, solvent-solute, and solute-solute interactions. Optimization of the internal parameters used experimental gas-phase geometries, vibrational spectra, and torsional energy surfaces supplemented with ab initio results. The peptide backbone bonding parameters were optimized with respect to data for N-methylacetamide and the alanine dipeptide. The interaction parameters, particularly the atomic charges, were determined by fitting ab initio interaction energies and geometries of complexes between water and model compounds that represented the backbone and the various side chains. In addition, dipole moments, experimental heats and free energies of vaporization, solvation and sublimation, molecular volumes, and crystal pressures and structures were used in the optimization. The resulting protein parameters were tested by applying them to noncyclic tripeptide crystals, cyclic peptide crystals, and the proteins crambin, bovine pancreatic trypsin inhibitor, and carbonmonoxy myoglobin in vacuo and in crystals. A detailed analysis of the relationship between the alanine dipeptide potential energy surface and calculated protein phi, chi angles was made and used in optimizing the peptide group torsional parameters. The results demonstrate that use of ab initio structural and energetic data by themselves are not sufficient to obtain an adequate backbone representation for peptides and proteins in solution and in crystals. Extensive comparisons between molecular dynamics simulations and experimental data for polypeptides and proteins were performed for both structural and dynamic properties. 
Energy minimization and dynamics simulations for crystals demonstrate that the latter are needed to obtain meaningful comparisons with experimental crystal structures. The presented parameters, in combination with the previously published CHARMM all-atom parameters for nucleic acids and lipids, provide a consistent set for condensed-phase simulations of a wide variety of molecules of biological interest.
Presented is an extension of the CHARMM additive all-atom carbohydrate force field to enable the modeling of phosphate and sulfate linked to carbohydrates. The parameters are developed in a hierarchical fashion using model compounds containing the key atoms in the full carbohydrates. Target data for parameter optimization included full two-dimensional energy surfaces defined by the glycosidic dihedral angle pairs in the phosphate/sulfate model compound analogs of hexopyranose monosaccharide phosphates and sulfates, as determined by quantum mechanical (QM) MP2/cc-pVTZ single point energies on MP2/6-31+G(d) optimized structures. In order to achieve balanced, transferable dihedral parameters for the dihedral angles, surfaces for all possible anomeric and conformational states were included during the parametrization process. In addition, to model physiologically relevant systems both the mono- and di-anionic charged states were studied for the phosphates. This resulted in over 7000 MP2/cc-pVTZ//MP2/6-31+G(d) model compound conformational energies which, supplemented with QM geometries, were the main target data for the parametrization. Parameters were validated against crystals of relevant monosaccharide derivatives obtained from the Cambridge Structural Database (CSD) and larger systems, namely inositol-(tri/tetra/penta) phosphates non-covalently bound to the pleckstrin homology (PH) domain and oligomeric chondroitin sulfate in solution and in complex with cathepsin K protein.
We present a new continuum solvation model based on the quantum mechanical charge density of a solute molecule interacting with a continuum description of the solvent. The model is called SMD, where the "D" stands for "density" to denote that the full solute electron density is used without defining partial atomic charges. "Continuum" denotes that the solvent is not represented explicitly but rather as a dielectric medium with surface tension at the solute-solvent boundary. SMD is a universal solvation model, where "universal" denotes its applicability to any charged or uncharged solute in any solvent or liquid medium for which a few key descriptors are known (in particular, dielectric constant, refractive index, bulk surface tension, and acidity and basicity parameters). The model separates the observable solvation free energy into two main components. The first component is the bulk electrostatic contribution arising from a self-consistent reaction field treatment that involves the solution of the nonhomogeneous Poisson equation for electrostatics in terms of the integral-equation-formalism polarizable continuum model (IEF-PCM). The cavities for the bulk electrostatic calculation are defined by superpositions of nuclear-centered spheres. The second component is called the cavity-dispersion-solvent-structure term and is the contribution arising from short-range interactions between the solute and solvent molecules in the first solvation shell. This contribution is a sum of terms that are proportional (with geometry-dependent proportionality constants called atomic surface tensions) to the solvent-accessible surface areas of the individual atoms of the solute. 
The SMD model has been parametrized with a training set of 2821 solvation data including 112 aqueous ionic solvation free energies, 220 solvation free energies for 166 ions in acetonitrile, methanol, and dimethyl sulfoxide, 2346 solvation free energies for 318 neutral solutes in 91 solvents (90 nonaqueous organic solvents and water), and 143 transfer free energies for 93 neutral solutes between water and 15 organic solvents. The elements present in the solutes are H, C, N, O, F, Si, P, S, Cl, and Br. The SMD model employs a single set of parameters (intrinsic atomic Coulomb radii and atomic surface tension coefficients) optimized over six electronic structure methods: M05-2X/MIDI!6D, M05-2X/6-31G*, M05-2X/6-31+G**, M05-2X/cc-pVTZ, B3LYP/6-31G*, and HF/6-31G*. Although the SMD model has been parametrized using the IEF-PCM protocol for bulk electrostatics, it may also be employed with other algorithms for solving the nonhomogeneous Poisson equation for continuum solvation calculations in which the solute is represented by its electron density in real space. This includes, for example, the conductor-like screening algorithm. With the 6-31G* basis set, the SMD model achieves mean unsigned errors of 0.6-1.0 kcal/mol in the solvation free energies of tested neutrals and mean unsigned errors of 4 kcal/mol on average for ions with either Gaussian03 or GAMESS.
Protein-carbohydrate recognition is crucial in many vital biological processes including host-pathogen recognition, cell-signaling, and catalysis. Accordingly, computational prediction of protein-carbohydrate binding free energies is of enormous interest for drug design. However, the accuracy of current force fields (FFs) for predicting binding free energies of protein-carbohydrate complexes is not well understood owing to technical challenges such as the highly polar nature of the complexes, anomerization, and conformational flexibility of carbohydrates. The present study evaluated the performance of alchemical predictions of binding free energies with the GAFF1.7/AM1-BCC and GLYCAM06j force fields for modeling protein-carbohydrate complexes. Mean unsigned errors of 1.1 ± 0.06 (GLYCAM06j) and 2.6 ± 0.08 (GAFF1.7/AM1-BCC) kcal mol⁻¹ are achieved for a large data set of monosaccharide ligands for Ralstonia solanacearum lectin (RSL). The level of accuracy provided by GLYCAM06j is sufficient to discriminate potent, moderate, and weak binders, a goal that has been difficult to achieve through other scoring approaches. Accordingly, the protocols presented here could find useful applications in carbohydrate-based drug and vaccine developments.
A new force field has been designed to implement the calculation of Coulomb interactions with fluctuating atomic charges. The charges are calculated by use of a semi-empirical quantum chemical method – bond polarization theory (BPT). The BPT method establishes a direct proportionality between molecular properties, for instance atomic charges or chemical shifts, and bond polarization energies. These energies are calculated from bond orbitals that are constructed for every bond of the force field. Thus the charges depend on the three-dimensional geometry of the molecular system, and it is possible to include all mutual polarizations in the term for electrostatic interaction. The primary goal of this new force field is better description of the intermolecular interactions of molecular systems. No special term within the force field is applied for the description of hydrogen bonds. The inclusion of the polarization effect over the whole system is one of the most important advantages of the method in respect of force fields that divide the molecular system into molecular mechanics and quantum chemical regions. The force field was tested by being used to describe the structure and interaction energies of several small molecular systems (26 hydrogen-bonded dimers) from a web-based ab initio data collection by Halgren. The results show an overall RMS deviation of 2.5 kcal mol⁻¹ for the interaction energies, 0.06 Å for the hydrogen bond distances (X...Z) and 20.1° for the X-H...Z angles. This is comparable with most existing force fields. The results were obtained with the original parametrization of Halgren for the van der Waals interactions without any fine tuning of the interaction parameters. Additional interaction energies and structures of selected DNA/RNA base pairs were studied. 
The geometries of hydrogen bonds, in particular, are reproduced satisfactorily – after geometry optimization the distances differ on average by 0.06 Å and in the angles by 6° from the ab initio Hartree–Fock results including correlation.
The GROMOS 56A(CARBO) force field for the description of carbohydrates was modified for calculations of chitosan (poly-1,4-(N-acetyl)-beta-D-glucopyranosamine-2) with protonated and non-protonated amino groups and its derivatives. Additional parameterization was developed on the basis of quantum chemical calculations. The modified force field (56A(CARBO_CHT)) allows molecular dynamics calculations of chitosans with different degrees of protonation, corresponding to different acidities of the medium. Test calculations of the conformational transitions in the chitosan rings and polymeric chains as well as the chitosan nanocrystal dissolution demonstrate good agreement with experimental data. Graphical abstract: The GROMOS 56A(CARBO_CHT) force field allows molecular dynamics calculations of chitosans with different types of amino group: free, protonated, substituted.
The article describes a GROMOS force field parameter set for molecular dynamics simulations of furanose carbohydrates. The proposed united-atom force field is designed and validated with respect to the conformational properties of furanose mono-, di-, oligo-, and polymers in aqueous solvent. The set accounts for the possibility of arbitrary glycosidic linkage connectivity between units, O-alkylation, as well as of different anomery. The compatibility with the already existing, pyranose-dedicated GROMOS 56A6(CARBO/CARBO_R) set allows one to use the presently proposed extension for studying more diverse and biologically relevant carbohydrates that exploit both pyranose and furanose units. The validation performed against the quantum-mechanical and experimental data concerning the structural and conformational features shows that the newly developed set is capable to reproduce conformational equilibrium within the furanose ring, relative free energies of anomers, hydroxymethyl rotamers, and glycosidic linkage conformers. Additionally, the results concerning the conformation of the furanose ring with relation to the two-state model as well as other conformational features of furanose-containing saccharides are discussed.
An extension of the GROMOS 56a6(CARBO/CARBO_R) force field for hexopyranose-based carbohydrates is presented. The additional parameters describe the conformational properties of uronate residues. The three distinct chemical states of the carboxyl group are considered: deprotonated (negatively charged), protonated (neutral), and esterified (neutral). The parametrization procedure was based on quantum-chemical calculations, and the resulting parameters were tested in the context of (i) flexibility of the pyranose rings under different pH conditions, (ii) conformation of the glycosidic linkage of the (1→4)-type for uronates with different chemical states of carboxyl moieties, (iii) conformation of the exocyclic (i.e., carboxylate and lactol) moieties, and (iv) structure of the Ca(2+)-linked chain-chain complexes of uronates. The presently proposed parameters in combination with the 56a6(CARBO/CARBO_R) set can be used to describe the naturally occurring polyuronates, composed either of homogeneous (e.g., glucuronans) or heterogeneous (e.g., pectins, alginates) segments. The results of simulations relying on the new set of parameters indicate that the conformation of glycosidic linkage is nearly unaffected by the chemical state of the carboxyl group, in contrast to the ring conformational equilibria. The calculations for the poly(alpha-d-galacturonate)-Ca(2+) and poly(alpha-l-guluronate)-Ca(2+) complexes show that both parallel and antiparallel arrangements of uronate chains are possible but differ in several structural aspects.
In this work, we report the development of Drude polarizable force field parameters for the carboxylate and N-acetyl amine derivatives, extending the functionality of the existing Drude polarizable carbohydrate force field. The force field parameters have been developed in a hierarchical manner, reproducing the quantum mechanical gas-phase properties of small model compounds representing the key functional group in the carbohydrate derivatives, including optimization of the electrostatic and bonded parameters. The optimized parameters were then used to generate the models for carboxylate and N-acetyl amine carbohydrate derivatives. The transferred parameters were further tested and optimized to reproduce crystal geometries and J-coupling data from nuclear magnetic resonance experiments. The parameter development resulted in the incorporation of d-glucuronate, l-iduronate, N-acetyl-d-glucosamine (GlcNAc), and N-acetyl-d-galactosamine (GalNAc) sugars into the Drude polarizable force field. The parameters developed in this study were then applied to study the conformational properties of glycosaminoglycan polymer hyaluronan, composed of d-glucuronate and N-acetyl-d-glucosamine, in aqueous solution. Upon comparing the results from the additive and polarizable simulations, it was found that the inclusion of polarization improved the description of the electrostatic interactions observed in hyaluronan, resulting in enhanced conformational flexibility. The developed Drude polarizable force field parameters in conjunction with the remainder of the Drude polarizable force field parameters can be used for future studies involving carbohydrates and their conjugates in complex, heterogeneous systems.
A polarizable empirical force field based on the classical Drude oscillator is presented for the hexopyranose form of selected monosaccharides. Parameter optimization targeted quantum mechanical (QM) dipole moments, solute-water interaction energies, vibrational frequencies, and conformational energies. Validation of the model was based on experimental data on crystals, densities of aqueous-sugar solutions, diffusion constants of glucose, and rotational preferences of the exocyclic hydroxymethyl of d-glucose and d-galactose in aqueous solution as well as additional QM data. Notably, the final model involves a single electrostatic model for all sixteen diastereomers of the monosaccharides, indicating the transferability of the polarizable model. The presented parameters are anticipated to lay the foundation for a comprehensive polarizable force field for saccharides that will be compatible with the polarizable Drude parameters for lipids and proteins, allowing for simulations of glycolipids and glycoproteins.
We propose a simple analytic representation of the correlation energy $E_c$ for a uniform electron gas, as a function of the density parameter $r_s$ and the relative spin polarization $g$. Within the random-phase approximation (RPA), this representation allows for the $r_s^{-3/4}$ behavior as $r_s \to \infty$. Close agreement with numerical RPA values for $E_c(r_s, 0)$, $E_c(r_s, 1)$, and the spin stiffness $a_c(r_s) = \partial^2 E_c(r_s, g = 0)/\partial g^2$, and recovery of the correct $r_s \ln r_s$ term for $r_s \to 0$, indicate the appropriateness of the chosen analytic form. Beyond RPA, different parameters for the same analytic form are found by fitting to the Green's-function Monte Carlo data of Ceperley and Alder [Phys. Rev. Lett. 45, 566 (1980)], taking into account data uncertainties that have been ignored in earlier fits by Vosko, Wilk, and Nusair (VWN) [Can. J. Phys. 58, 1200 (1980)] or by Perdew and Zunger (PZ) [Phys. Rev. B 23, 5048 (1981)]. While we confirm the practical accuracy of the VWN and PZ representations, we eliminate some minor problems with these forms. We study the $g$-dependent coefficients in the high- and low-density expansions, and the $r_s$-dependent spin susceptibility. We also present a conjecture for the exact low-density limit. The correlation potential (...) is evaluated for use in self-consistent density-functional calculations.
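As an illustration of what such an analytic representation looks like in practice, the sketch below evaluates the correlation energy per electron of the unpolarized gas as a function of the density parameter. The functional form and the numerical constants are the commonly quoted ones for the fit beyond RPA; they are not given in the abstract itself and are reproduced here as an assumption that should be checked against the original paper before use.

```python
import math

# Constants for the unpolarized (zero spin polarization) gas, in hartree
# atomic units. ASSUMPTION: commonly quoted values, not taken from the
# abstract; verify against the original paper before relying on them.
A = 0.031091
ALPHA1 = 0.21370
BETA1, BETA2, BETA3, BETA4 = 7.5957, 3.5876, 1.6382, 0.49294

def correlation_energy(rs):
    """Correlation energy per electron (hartree) at density parameter rs > 0."""
    if rs <= 0.0:
        raise ValueError("rs must be positive")
    sqrt_rs = math.sqrt(rs)
    denom = 2.0 * A * (BETA1 * sqrt_rs + BETA2 * rs
                       + BETA3 * rs * sqrt_rs + BETA4 * rs * rs)
    return -2.0 * A * (1.0 + ALPHA1 * rs) * math.log(1.0 + 1.0 / denom)

# Tabulate a few densities from high (small rs) to low (large rs).
table = {rs: correlation_energy(rs) for rs in (0.5, 1.0, 2.0, 5.0, 10.0)}
```

The logarithm supplies the high-density behavior, while the polynomial in the square root of the density parameter controls the low-density side; the spin-polarized case needs additional parameter sets and an interpolation that this sketch does not cover.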
The performance of conjugate gradient schemes for minimizing unconstrained energy functionals in the context of condensed matter electronic structure density functional calculations is studied. The unconstrained functionals allow a straightforward application of conjugate gradients by removing the explicit orthonormality constraints on the quantum-mechanical wave functions. However, the removal of the constraints can lead to slow convergence, in particular when preconditioning is used. The convergence properties of two previously suggested energy functionals are analyzed, and a new functional is proposed, which unifies some of the advantages of the other functionals. A numerical example derived from a diamond crystal confirms the analysis.
This article describes a revised version 56A6(CARBO_R) of the GROMOS 56A6(CARBO) force field for hexopyranose-based carbohydrates. The simulated properties of unfunctionalized hexopyranoses are unaltered with respect to 56A6(CARBO). In the context of both O1-alkylated hexopyranoses and oligosaccharides, the revision stabilizes the regular (4)C1 chair for alpha-anomers, with the opposite effect for beta-anomers. As a result, spurious ring inversions observed in alpha(1→4)-linked chains when using the original 56A6(CARBO) force field are alleviated. The (4)C1 chair is now the most stable conformation for all d-hexopyranose residues, irrespective of the linkage type and anomery, and of the position of the residue along the chain. The methylation of a d-hexopyranose leads to a systematic shift in the ring-inversion free energy ((4)C1 to (1)C4) by 7-8 kJ mol⁻¹, positive for the alpha-anomers and negative for the beta-anomers, which is qualitatively compatible with the expected enhancement of the anomeric effect upon methylation at O1. The ring-inversion free energies for residues within chains are typically smaller in magnitude compared to those of the monomers, and correlate rather poorly with the latter. This suggests that the crowding of ring substituents upon chain formation alters the ring flexibility in a nonsystematic fashion. In general, the description of carbohydrate chains afforded by 56A6(CARBO_R) suggests a significant extent of ring flexibility, i.e., small but often non-negligible equilibrium populations of inverted chairs, and challenges the "textbook" picture of conformationally locked carbohydrate rings.
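To connect the ring-inversion free energies quoted above with the "small but non-negligible" populations of inverted chairs, the following sketch converts a free-energy difference into an equilibrium population. It assumes a strict two-state equilibrium between the regular and inverted chairs (boats and skew-boats are ignored) and a temperature of 298.15 K; both are simplifying assumptions, and the function name is hypothetical.

```python
import math

GAS_CONSTANT_KJ = 8.314462618e-3  # kJ mol^-1 K^-1

def inverted_chair_population(delta_g_kj, temperature=298.15):
    """Two-state equilibrium fraction of the inverted chair.

    delta_g_kj is the free energy of the inverted chair relative to the
    regular chair, in kJ/mol (positive means the regular chair is favored).
    """
    boltzmann_ratio = math.exp(-delta_g_kj / (GAS_CONSTANT_KJ * temperature))
    return boltzmann_ratio / (1.0 + boltzmann_ratio)

# Illustrative free-energy differences (kJ/mol), not values from the paper.
populations = {dg: inverted_chair_population(dg) for dg in (0.0, 5.0, 7.5, 15.0)}
```

Under these assumptions a difference of 7.5 kJ/mol still leaves a few percent of the ring in the inverted chair, which is why shifts of 7-8 kJ/mol matter for the simulated ensembles.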
Considering the small number of papers assessing the conformational profile of glycoproteins through molecular dynamics (MD) simulations, the current work reports on a systematic analysis of the performance of the GROMOS96 43a1 force field and Löwdin HF/6-31G**-derived atomic charges in the conformational description of glycoproteins. The results substantiate the accuracy of the computational representation of glycoprotein conformational ensembles in aqueous solution based on their agreement to available experimental information, supporting further contributions of computational techniques, mainly MD, in future studies on the characterization of glycoprotein structure and function.
An improved parameter set for explicit-solvent simulations of carbohydrates (referred to as GROMOS 53A6GLYC) is presented, allowing proper description of the most stable conformation of all 16 possible aldohexopyranose-based monosaccharides. This set includes refinement of the torsional potential parameters that determine the hexopyranose ring conformation, by fitting to the corresponding quantum-mechanical profiles. Other parameters, such as the rules for third and excluded neighbors, are taken directly from the GROMOS 53A6 force field. Comparisons of the herein presented parameter set to our previous version (GROMOS 45A4), the GLYCAM06 force field, and available NMR data are presented in terms of ring puckering free energies, conformational distribution of the hydroxymethyl group, and glycosidic linkage geometries for 16 selected monosaccharides and eight disaccharides. The proposed parameter modifications have shown a significant improvement for the above-mentioned quantities over the two tested force fields, while retaining full compatibility with the GROMOS 53A6 and 54A7 parameter sets for other classes of biomolecules.
An extension of the GROMOS 53A6GLYC force field for carbohydrates to encompass glycoprotein linkages is presented. The set includes new atomic charges and incorporates adequate torsional potential parameters for N-, S-, C-, P-, and O-glycosidic linkages, offering compatibility with the GROMOS force field family for proteins. Validation included the description of glycosidic linkage geometries between amino acid and monosaccharide residues, comparison of NMR-derived protein-carbohydrate and carbohydrate-carbohydrate nuclear Overhauser effect (NOE) signals for glycoproteins and the effects of glycosylation on protein flexibility and dynamics.
Presented is an extension of the CHARMM additive carbohydrate all-atom force field to enable modeling of polysaccharides containing furanose sugars. The new force field parameters encompass 1↔2, 1→3, 1→4, and 1→6 pyranose-furanose linkages and 2→1 and 2→6 furanose-furanose linkages, building on existing hexopyranose and furanose monosaccharide parameters. The model compounds were chosen to be monomers or glycosidic-linked dimers of tetrahydropyran (THP) and tetrahydrofuran (THF) so as to contain the key atoms in full carbohydrates. Target data for optimization included two-dimensional quantum mechanical (QM) potential energy scans of the Phi/Psi glycosidic dihedral angles, with geometry optimization at the MP2/6-31G(d) level followed by MP2/cc-pVTZ single-point energies. All possible chiralities of the model compounds at the linkage carbons were considered, and for each geometry, the THF ring was constrained to the favorable south or north conformations. Target data also included QM vibrational frequencies and pair interaction energies and distances with water molecules. Force field validation included comparison of computed crystal properties, aqueous solution densities, and NMR J-coupling constants to experimental reference values. Simulations of infinite crystals showed good agreement with experimental values for intramolecular geometries as well as for crystal unit cell parameters. Additionally, aqueous solution densities and available NMR data were reproduced to a high degree of accuracy, thus validating the hierarchically optimized parameters in both crystalline and aqueous condensed phases. The newly developed parameters allow for the modeling of linear, branched, and cyclic pyranose/furanose polysaccharides both alone and in heterogeneous systems including proteins, nucleic acids, and/or lipids when combined with existing additive CHARMM biomolecular force fields.
Polysaccharides (carbohydrates) are key regulators of a large number of cell biological processes. However, precise biochemical or genetic manipulation of these often complex structures is laborious and hampers experimental structure-function studies. Molecular Dynamics (MD) simulations provide a valuable alternative tool to generate and test hypotheses on saccharide function. Yet, currently used MD force fields often overestimate the aggregation propensity of polysaccharides, affecting the usability of those simulations. Here we tested MARTINI, a popular coarse-grained (CG) force field for biological macromolecules, for its ability to accurately represent molecular forces between saccharides. To this end, we calculated a thermodynamic solution property, the second virial coefficient of the osmotic pressure (B(22)). Comparison with light scattering experiments revealed a nonphysical aggregation of a prototypical polysaccharide in MARTINI, pointing at an imbalance of the nonbonded solute-solute, solute-water, and water-water interactions. This finding also applies to smaller oligosaccharides which were all found to aggregate in simulations even at moderate concentrations, well below their solubility limit. Finally, we explored the influence of the Lennard-Jones (LJ) interaction between saccharide molecules and propose a simple scaling of the LJ interaction strength that makes MARTINI more reliable for the simulation of saccharides.
A description of the ab initio quantum chemistry package GAMESS is presented. Chemical systems containing atoms through radon can be treated with wave functions ranging from the simplest closed-shell case up to a general MCSCF case, permitting calculations at the necessary level of sophistication. Emphasis is given to novel features of the program. The parallelization strategy used in the RHF, ROHF, UHF, and GVB sections of the program is described, and detailed speedup results are given. Parallel calculations can be run on ordinary workstations as well as dedicated parallel machines.
CA-26 is the largest cyclodextrin (546 atoms) for which refined X-ray structural data is available. Because of its size, 26 D-glucose residues, it is beyond the scope of study of most ab initio or density functional methods and to date has only been computationally examined using empirical force fields. The crystal structure of CA-26 is folded like a figure "8" into two 10 D-glucoses long antiparallel left-handed V (Verkleisterung)-type helices with a "band-flip" and "kink" at the top and bottom of the helices. DFTr methods were applied to CA-26 to determine if a carbohydrate molecule of this size could be geometry optimized and if it would show structural variances from application of dispersion and/or solvation. The DFTr reduced basis set method developed by the authors uses 4-31G on the carbon atoms of the glucose rings and 6-31+G* on all other atoms. B3LYP is the density functional used to successfully optimize CA-26, and other density functionals were then applied, including a self-consistent charge density functional tight binding (SCC-DFTB) method and the B97D (dispersion-corrected) and B97D-PCM (dispersion + implicit solvent) methods. Heavy atom coordinates were taken from one X-ray structure, fitted with hydrogen atoms, and geometry optimized using PM3 followed by B3LYP/6-31+G*/4-31G optimization. After optimization, the heavy atom rms deviation of the optimized DFTr (B3LYP) structure to the crystal structure was 0.89 Å, the rmsd of the B97D optimization was 1.38 Å, that for B97D-PCM was 0.95 Å, and that for SCC-DFTB was 0.94 Å. These results are very good considering that no explicit water molecules were included in the computational analysis and there were ~32-38 water molecules around each CA-26 molecule in the crystal structure. Tables of internal coordinates and puckering parameters were compared to the X-ray structures, and close correspondence was found.
Glycans play a vital role in a large number of cellular processes. Their complex and flexible nature hampers structure-function studies using experimental techniques. Molecular dynamics (MD) simulations can help in understanding dynamic aspects of glycans if the force field parameters used can reproduce key experimentally observed properties. Here, we present optimized coarse-grained (CG) Martini force field parameters for N-glycans, calibrated against experimentally derived binding affinities for lectins. The CG bonded parameters were obtained from atomistic (ATM) simulations for different glycan topologies including high mannose and complex glycans with various branching patterns. In the CG model, additional elastic networks are shown to improve maintenance of the overall conformational distribution. Solvation free energies and octanol-water partition coefficients were also calculated for various N-glycan disaccharide combinations. When using standard Martini nonbonded parameters, we observed that glycans spontaneously aggregated in the solution and required down-scaling of their interactions for reproduction of ATM model radial distribution functions. We also optimized the nonbonded interactions for glycans interacting with seven lectin candidates and show that a relatively modest scaling down of the glycan-protein interactions can reproduce free energies obtained from experimental studies. These parameters should be of use in studying the role of glycans in various glycoproteins and carbohydrate binding proteins as well as their complexes, while benefiting from the efficiency of CG sampling.
Glycosaminoglycans (GAGs) are an important class of carbohydrates that serve critical roles in blood clotting, tissue repair, cell migration and adhesion, and lubrication. The variable sulfation pattern and iduronate ring conformations in GAGs influence their polymeric structure and nature of interaction. This study characterizes several heparin-like GAG disaccharides and tetrasaccharides using NMR and molecular dynamics simulations to assist in the development of parameters for GAGs within the GLYCAM06 force field. The force field additions include parameters and charges for a transferable sulfate group for O- and N-sulfation, neutral (COOH) forms of iduronic and glucuronic acid, and Delta4,5-unsaturated uronate (DeltaUA) residues. DeltaUA residues frequently arise from the enzymatic digestion of heparin and heparin sulfate. Simulations of disaccharides containing DeltaUA reveal that the presence of sulfation on this residue alters the relative populations of (1)H(2) and (2)H(1) ring conformations. Simulations of heparin tetrasaccharides containing N-sulfation in place of N-acetylation on glucosamine residues influence the ring conformations of adjacent iduronate residues.
Aromatic amino acid residues are often present in carbohydrate-binding sites of proteins. These binding sites are characterized by a placement of a carbohydrate moiety in a stacking orientation to an aromatic ring. This arrangement is an example of CH/pi interactions. Ab initio interaction energies for 20 carbohydrate-aromatic complexes taken from 6 selected ultra-high resolution X-ray structures of glycosidases and carbohydrate-binding proteins were calculated. All interaction energies of a pyranose moiety with a side chain of an aromatic residue were calculated as attractive with interaction energy ranging from -2.8 to -12.3 kcal/mol as calculated at the MP2/6-311+G(d) level. Strong attractive interactions were observed for a wide range of orientations of carbohydrate and aromatic ring as present in selected X-ray structures. The most attractive interaction was associated with apparent combination of CH/pi interactions and classical H-bonds. The failure of Hartree-Fock method (interaction energies from +1.0 to -6.9 kcal/mol) can be explained by a dispersion nature of a majority of the studied complexes. We also present a comparison of interaction energies calculated at the MP2 level with those calculated using molecular mechanics force fields (OPLS, GROMOS, CSFF/CHARMM, CHEAT/CHARMM, Glycam/AMBER, MM2 and MM3). For a majority of force fields there was a strong correlation with MP2 values. RMSD between MP2 and force field values were 1.0 for CSFF/CHARMM, 1.2 for Glycam/AMBER, 1.2 for GROMOS, 1.3 for MM3, 1.4 for MM2, 1.5 for OPLS and 2.3 for CHEAT/CHARMM (in kcal/mol). These results show that molecular mechanics approximates interaction energies very well and support an application of molecular mechanics methods in the area of glycochemistry and glycobiology.
Several modifications that have been made to the NDDO core-core interaction term and to the method of parameter optimization are described. These changes have resulted in a more complete parameter optimization, called PM6, which has, in turn, allowed 70 elements to be parameterized. The average unsigned error (AUE) between calculated and reference heats of formation for 4,492 species was 8.0 kcal mol(-1). For the subset of 1,373 compounds involving only the elements H, C, N, O, F, P, S, Cl, and Br, the PM6 AUE was 4.4 kcal mol(-1). The equivalent AUE for other methods were: RM1: 5.0, B3LYP 6-31G*: 5.2, PM5: 5.7, PM3: 6.3, HF 6-31G*: 7.4, and AM1: 10.0 kcal mol(-1). Several long-standing faults in AM1 and PM3 have been corrected and significant improvements have been made in the prediction of geometries.
The production of an adiabatic map for a di- or trisaccharide requires the generation of many relaxed maps, ideally 59,049 for a disaccharide or 4,782,969 for a trisaccharide composed of hexose residues, due to a combination of exocyclic angle torsions. As the production of this number of maps is usually ruled out for time considerations, different approaches were exploited. When working at low dielectric constants, starting points originated in cooperative hydrogen bonds through the rings are usually sufficient to produce an adiabatic map, but at higher dielectric constants those circuits are meaningless, and many low-energy conformers appear in each energy well. Herein, different conformations of four disaccharides (beta-4-linked mannobiose, and three galactobioses, linked alpha-(1-->3), alpha-(1-->4), and beta-(1-->4)) and one trisaccharide (beta-4-linked mannotriose) were minimized using MM3 at epsilon = 80, and the difference in energy produced by changes in torsional angles was recorded. A remarkable additive effect was found to occur when the exocyclics were gathered in groupings of two or three neighboring angles. Thus, in most cases, each grouping can be studied separately, and the minimum energy conformers can be predicted without the need of resorting to thousands of calculations. In some cases where two protons of different groups show steric interactions in some specific conformations, small deviations from additivity were encountered. Anyway, a complex system with many variables can be transformed into one with many fewer variables, thus simplifying further studies. An attempt to calculate the same effect at epsilon = 3 shows that hydrogen bonding and electrostatic interactions make it impossible to find those additive effects, thus precluding its utilization at such low dielectric constants.
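The map counts quoted in the abstract above are consistent with three staggered orientations per exocyclic torsion, since 3^10 = 59,049 and 3^14 = 4,782,969. The sketch below reproduces that arithmetic and contrasts it with the additive, group-by-group strategy the abstract describes. The torsion counts (10 and 14), the three-state assumption, and the example grouping are inferred or hypothetical; the abstract does not state them.

```python
# Assumption (not stated in the abstract): every exocyclic torsion is sampled
# at three staggered orientations.
STAGGERED_STATES = 3


def relaxed_map_count(n_exocyclic_torsions: int) -> int:
    """Number of starting combinations when all torsions are varied jointly."""
    return STAGGERED_STATES ** n_exocyclic_torsions


def grouped_minimization_count(group_sizes) -> int:
    """Number of combinations when each group of torsions is scanned separately.

    If energy contributions of the groups are additive, the groups can be
    studied independently, so the costs add instead of multiplying.
    """
    return sum(STAGGERED_STATES ** size for size in group_sizes)


if __name__ == "__main__":
    print(relaxed_map_count(10))   # disaccharide, 10 torsions assumed
    print(relaxed_map_count(14))   # trisaccharide, 14 torsions assumed
    # Hypothetical grouping of 10 torsions into neighbouring sets of 3, 3, 2, 2.
    print(grouped_minimization_count([3, 3, 2, 2]))
```

Under these assumptions, scanning groups of two or three neighbouring angles separately needs tens of calculations rather than tens of thousands, which is the practical point of the additivity result.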
The 1992 version of MM3 was largely used for modeling mono-, di-, and trisaccharides. In later versions of MM3 improvements were made in some parameters that may be important for carbohydrates. This corrected MM3 force field is part of the Tinker package, freely available (as its 4.1 version), and included in the Chem 3D Ultra 8.0 package (as the 3.7 version). The latter version lacks the corrections to the standard bond lengths produced by electronegativity and anomeric effects, whereas the Tinker 4.1 version only lacks the latter correction. The present work compares the performance of the three MM3 versions (and in some cases, DFT and/or HF/ab initio procedures) on several carbohydrate model problems such as the chair and rotamer equilibria in 2-hydroxy- and 2-methoxytetrahydropyran, hydrogen bonding in cis-2,3-dihydroxytetrahydropyran, and the potential energy surfaces around the glycosidic bonds of two sulfated disaccharides and two trisaccharides. Tinker MM3 can be used accurately to estimate carbohydrate energies and geometries, and, with the help of some programming, to pursue studies on the potential energy surfaces of di- and trisaccharides. In most cases results obtained using the three MM3 versions are similar, although large energy differences are obtained when comparing a rotameric distribution around a O-C-O-H dihedral, which is almost forced to the exo-anomeric position by the Tinker versions. In other systems smaller energy differences are found, but they can nevertheless lead to a different global minimum when comparing conformers of similar energy. MM3(92) better establishes the differences between the bond lengths in both anomers, as an expected expression of the anomeric correction.
The adiabatic potential energy surfaces (PES) of six trisaccharides, sulfated derivatives of alpha-D-Gal p-(1-->3)-beta-D-Gal p-(1-->4)-alpha-D-Gal p and beta-D-Gal p-(1-->4)-alpha-D-Gal p-(1-->3)-beta-D-Gal p representing models of lambda-, mu-, and nu-carrageenans were obtained using the MM3 force-field at epsilon = 3. Each PES was described by a single contour map for which the energy is plotted against the two psi glycosidic angles, given the small variations of the phi glycosidic torsional angle in the low-energy regions of disaccharide maps. Most surfaces appear as expected from the maps of the disaccharidic repeating units of carrageenans, with less important factors altering the additive effect of both linkages. Only small interactions between the first and third monosaccharidic moieties of the trisaccharides are observed. The flexibility of the alpha-linkages appears nearly identical to that in their disaccharide counterparts, with only one exception, where it appears reduced by the presence of the third monosaccharide. On the other hand, the flexibility of the beta-linkage appears to be equal or sometimes even higher than that observed for the corresponding disaccharide.
Eighteen empirical force fields and the semi-empirical quantum method PM3CARB-1 were compared for studying beta-cellobiose, alpha-maltose, and alpha-galabiose [alpha-D-Galp-(1-->4)-alpha-D-Galp]. For each disaccharide, the energies of 54 conformers with differing hydroxymethyl, hydroxyl, and glycosidic linkage orientations were minimized by the different methods, some at two dielectric constants. By comparing these results and the available crystal structure data and/or higher level density functional theory results, it was concluded that the newer parameterizations for force fields (GROMOS, GLYCAM06, OPLS-2005 and CSFF) give results that are reasonably similar to each other, whereas the older parameterizations for Amber, CHARMM or OPLS were more divergent. However, MM3, an older force field, gave energy and geometry values comparable to those of the newer parameterizations, but with less sensitivity to dielectric constant values. These systems worked better than MM2 variants, which were still acceptable. PM3CARB-1 also gave adequate results in terms of linkage and exocyclic torsion angles. GROMOS, GLYCAM06, and MM3 appear to be the best choices, closely followed by MM4, CSFF, and OPLS-2005. With GLYCAM06 and to a lesser extent, CSFF, and OPLS-2005, a number of the conformers that were stable with MM3 changed to other forms.
Six empirical force fields were tested for applicability to calculations for automated carbohydrate database filling. They were probed on eleven disaccharide molecules containing representative structural features from widespread classes of carbohydrates. The accuracy of each method was queried by predictions of nuclear Overhauser effects (NOEs) from conformational ensembles obtained from 50 to 100 ns molecular dynamics (MD) trajectories and their comparison to the published experimental data. Using various ranking schemes, it was concluded that explicit solvent MM3 MD yielded non-inferior NOE accuracy with newer GLYCAM-06, and ultimately PBE0-D3/def2-TZVP (Triple-Zeta Valence Polarized) Density Functional Theory (DFT) simulations. For seven of eleven molecules, at least one empirical force field with explicit solvent outperformed DFT in NOE prediction. The aggregate of characteristics (accuracy, speed, and compatibility) made MM3 dynamics with explicit solvent at 300 K the most favorable method for bulk generation of disaccharide conformation maps for massive database filling.
The new ONIOM (our own n-layered integrated molecular orbital and molecular mechanics) approach has been proposed and shown to be successful in reproducing benchmark calculations and experimental results. ONIOM3, a three-layered version, divides a system into an active part treated at a very high level of ab initio molecular orbital theory like CCSD(T), a semiactive part that includes important electronic contributions and is treated at the HF or MP2 level, and a nonactive part that is handled using force field approaches. The three-layered scheme allows us to study a larger system more accurately than the previously proposed two-layered schemes IMOMO, which can treat a medium size system very accurately, and IMOMM, which can handle a very large system with modest accuracy. This three-layered scheme has been applied to activation barriers for the Diels−Alder reaction of acrolein + isoprene, acrolein + 2-tert-butyl-1,3-butadiene, and ethylene + 1,4-di-tert-butyl-1,3-butadiene. In general, the results for both geometry optimizations and single point energy calculations agree well with benchmark predictions and experimental results. The scheme has also been applied to the transition state for the oxidative addition of H2 to Pt(P(t-Bu)3)2. The activation energy of this 83-atom reaction is predicted to be 14.2 kcal/mol with the ONIOM3(CCSD(T):MP2:MM3) method.
GLYCAM06 is a generalisable biomolecular force field that is extendible to diverse molecular classes in the spirit of a small-molecule force field. Here we report parameters for lipids, lipid bilayers and glycolipids for use with GLYCAM06. Only three lipid-specific atom types have been introduced, in keeping with the general philosophy of transferable parameter development. Bond stretching, angle bending, and torsional force constants were derived by fitting to quantum mechanical data for a collection of minimal molecular fragments and related small molecules. Partial atomic charges were computed by fitting to ensemble-averaged quantum-computed molecular electrostatic potentials. In addition to reproducing quantum mechanical internal rotational energies and experimental valence geometries for an array of small molecules, condensed-phase simulations employing the new parameters are shown to reproduce the bulk physical properties of a DMPC lipid bilayer. The new parameters allow for molecular dynamics simulations of complex systems containing lipids, lipid bilayers, glycolipids, and carbohydrates, using an internally consistent force field. By combining the AMBER parameters for proteins with the GLYCAM06 parameters, it is also possible to simulate protein-lipid complexes and proteins in biologically relevant membrane-like environments.
Carbohydrate dynamics plays a vital role in many biological processes, but we are not currently able to probe this with experimental approaches. The highly flexible nature of carbohydrate structures differs in many aspects from other biomolecules, posing significant challenges for studies employing computational simulation. Over past decades, computational study of carbohydrates has been focused on the development of structure prediction methods, force field optimization, molecular dynamics simulation, and scoring functions for carbohydrate-protein interactions. Advances in carbohydrate force fields and scoring functions can be largely attributed to enhanced computational algorithms, application of quantum mechanics, and the increasing number of experimental structures determined by X-ray and NMR techniques. The conformational analysis of carbohydrates is challenging and has been studied intensively to elucidate the anomeric, the exo-anomeric, and the gauche effects. Here, we review the issues associated with carbohydrate force fields and scoring functions, which will have a broad application in the field of carbohydrate-based drug design.
Although density functional theory is widely used in the computational chemistry community, the most popular density functional, B3LYP, has some serious shortcomings: (i) it is better for main-group chemistry than for transition metals; (ii) it systematically underestimates reaction barrier heights; (iii) it is inaccurate for interactions dominated by medium-range correlation energy, such as van der Waals attraction, aromatic-aromatic stacking, and alkane isomerization energies. We have developed a variety of databases for testing and designing new density functionals. We used these data to design new density functionals, called M06-class (and, earlier, M05-class) functionals, for which we enforced some fundamental exact constraints such as the uniform-electron-gas limit and the absence of self-correlation energy. Our M06-class functionals depend on spin-up and spin-down electron densities (i.e., spin densities), spin density gradients, spin kinetic energy densities, and, for nonlocal (also called hybrid) functionals, Hartree-Fock exchange. 
We have developed four new functionals that overcome the above-mentioned difficulties: (a) M06, a hybrid meta functional, is a functional with good accuracy "across-the-board" for transition metals, main group thermochemistry, medium-range correlation energy, and barrier heights; (b) M06-2X, another hybrid meta functional, is not good for transition metals but has excellent performance for main group chemistry, predicts accurate valence and Rydberg electronic excitation energies, and is an excellent functional for aromatic-aromatic stacking interactions; (c) M06-L is not as accurate as M06 for barrier heights but is the most accurate functional for transition metals and is the only local functional (no Hartree-Fock exchange) with better across-the-board average performance than B3LYP; this is very important because only local functionals are affordable for many demanding applications on very large systems; (d) M06-HF has good performance for valence, Rydberg, and charge transfer excited states with minimal sacrifice of ground-state accuracy. In this Account, we compared the performance of the M06-class functionals and one M05-class functional (M05-2X) to that of some popular functionals for diverse databases and their performance on several difficult cases. The tests include barrier heights, conformational energy, and the trend in bond dissociation energies of Grubbs' ruthenium catalysts for olefin metathesis. 
Based on these tests, we recommend (1) the M06-2X, BMK, and M05-2X functionals for main-group thermochemistry and kinetics, (2) M06-2X and M06 for systems where main-group thermochemistry, kinetics, and noncovalent interactions are all important, (3) M06-L and M06 for transition metal thermochemistry, (4) M06 for problems involving multireference rearrangements or reactions where both organic and transition-metal bonds are formed or broken, (5) M06-2X, M05-2X, M06-HF, M06, and M06-L for the study of noncovalent interactions, (6) M06-HF when the use of full Hartree-Fock exchange is important, for example, to avoid the error of self-interaction at long-range, (7) M06-L when a local functional is required, because a local functional has much lower cost for large systems.
Resource Description Framework (RDF) and Property Graph databases are emerging technologies that are used for storing graph-structured data. We compare these technologies through a molecular biology use case: glycan substructure search. Glycans are branched tree-like molecules composed of building blocks linked together by chemical bonds. The molecular structure of a glycan can be encoded into a directed acyclic graph where each node represents a building block and each edge serves as a chemical linkage between two building blocks. In this context, Graph databases are possible software solutions for storing glycan structures and Graph query languages, such as SPARQL and Cypher, can be used to perform a substructure search. Glycan substructure searching is an important feature for querying structure and experimental glycan databases and retrieving biologically meaningful data. This applies for example to identifying a region of the glycan recognised by a glycan binding protein (GBP). In this study, 19,404 glycan structures were selected from GlycomeDB (www.glycome-db.org) and modelled for storage in an RDF triple store and a Property Graph. We then performed two different sets of searches and compared the query response times and the results from both technologies to assess performance and accuracy. The two implementations produced the same results, but interestingly we noted a difference in the query response times. Qualitative measures such as portability were also used to define further criteria for choosing the technology adapted to solving glycan substructure search and other comparable issues.
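To make the graph encoding in the abstract above concrete, here is a minimal in-memory sketch of glycan substructure search: residues are nodes, linkages are labelled edges pointing away from the reducing end, and a pattern matches wherever its residues and linkages can be mapped onto distinct branches of the target. It is not the SPARQL or Cypher implementation from the study; the `Residue` class, the linkage strings such as `"b1-4"`, and the helper names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Residue:
    """One monosaccharide node; children are (linkage, Residue) edges."""
    name: str
    children: list = field(default_factory=list)

    def add(self, linkage, child):
        self.children.append((linkage, child))
        return self  # return the parent so calls can be chained


def matches_at(pattern, target):
    """True if `pattern` can be embedded with its root placed on `target`."""
    if pattern.name != target.name:
        return False
    return _assign(pattern.children, target.children, 0, set())


def _assign(p_children, t_children, i, used):
    """Backtracking: give each pattern branch its own distinct target branch."""
    if i == len(p_children):
        return True
    p_link, p_child = p_children[i]
    for j, (t_link, t_child) in enumerate(t_children):
        if j in used or t_link != p_link:
            continue
        if matches_at(p_child, t_child):
            used.add(j)
            if _assign(p_children, t_children, i + 1, used):
                return True
            used.discard(j)
    return False


def iter_nodes(root):
    """Depth-first traversal over every residue in the glycan."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(child for _, child in node.children)


def contains_substructure(target, pattern):
    return any(matches_at(pattern, node) for node in iter_nodes(target))


def n_glycan_core():
    """Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc, rooted at the reducing end."""
    branch_point = Residue("Man")
    branch_point.add("a1-3", Residue("Man")).add("a1-6", Residue("Man"))
    inner = Residue("GlcNAc").add("b1-4", branch_point)
    return Residue("GlcNAc").add("b1-4", inner)


if __name__ == "__main__":
    trimannosyl = Residue("Man").add("a1-3", Residue("Man")).add("a1-6", Residue("Man"))
    print(contains_substructure(n_glycan_core(), trimannosyl))
```

A graph database performs the same kind of pattern matching declaratively, which is why query response time rather than correctness is the interesting point of comparison between stores.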
The Integrated Life Science Database Project of Japan funded a group of glycoscientists to carry out a project to integrate glycoscience databases using Semantic Web technologies. As a continuation of the previous project period, the Japan Consortium for Glycobiology and Glycotechnology Database (JCGGDB) developed several glycoscience-related databases. The GlycoProtDB database is among those being integrated, providing an important resource to understand protein glycosylation. Another database being integrated is GlycoEpitope, a comprehensive database of carbohydrate epitopes and antibodies. In the current project period, we started the development of GlyTouCan, the international glycan structure repository providing unique accession numbers to all glycan structures. Although such databases are sufficiently important in and of themselves, their integration with other −omics data such as the protein information in UniProt will be crucial to bring glycosciences to the forefront of life sciences. However, to integrate such disparate sets of data among different fields in a way such that future maintenance costs are minimal, standardized ontologies and formats must be established. Our latest project has attempted to define the minimal standards that are necessary to enable this integration. The technical challenges to integrate all these databases and the technologies to overcome these challenges will be described.
BACKGROUND: Glycoscience is a research field focusing on complex carbohydrates (otherwise known as glycans), which can, for example, serve as "switches" that toggle between different functions of a glycoprotein or glycolipid. Due to the advancement of glycomics technologies that are used to characterize glycan structures, many glycomics databases are now publicly available and provide useful information for glycoscience research. However, these databases have almost no link to other life science databases. RESULTS: In order to implement support for the Semantic Web most efficiently for glycomics research, the developers of major glycomics databases agreed on a minimal standard for representing glycan structure and annotation information using RDF (Resource Description Framework). Moreover, all of the participants implemented this standard prototype and generated preliminary RDF versions of their data. To test the utility of the converted data, all of the data sets were uploaded into a Virtuoso triple store, and several SPARQL queries were tested as "proofs-of-concept" to illustrate the utility of the Semantic Web in querying across databases which were originally difficult to implement. CONCLUSIONS: We were able to successfully retrieve information by linking UniCarbKB, GlycomeDB and JCGGDB in a single SPARQL query to obtain our target information. We also tested queries linking UniProt with GlycoEpitope as well as lectin data with GlycomeDB through PDB. As a result, we have been able to link proteomics data with glycomics data through the implementation of Semantic Web technologies, allowing for more flexible queries across these domains.
The level of ambiguity in describing glycan structure has significantly increased with the upsurge of large-scale glycomics and glycoproteomics experiments. Consequently, an ontology-based model appears as an appropriate solution for navigating these data. However, navigation is not sufficient and the model should also enable advanced search and comparison. A new ontology with a tree logical structure is introduced to represent glycan structures irrespective of the precision of molecular details. The model heavily relies on the GlycoCT encoding of glycan structures. Its implementation in the GlySTreeM knowledge base was validated with GlyConnect data and benchmarked with the Glycowork library. GlySTreeM is shown to be fast, consistent, reliable and more flexible than existing solutions for matching parts of or whole glycan structures. The model is also well suited for painless future expansion.
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
The improvement and diversification of experimental technologies have caused a flood of data. In order to share and integrate such huge and diverse data, it is important to describe the relationship between data using Semantic Web technology. A goal of the Semantic Web is that computers can automatically process data by linking meaningful data and by forming a web of data. The Semantic Web consists of key technologies such as Resource Description Framework (RDF), ontologies, triple stores (databases for RDF), and SPARQL Protocol and RDF Query Language (SPARQL), which is a query language for triple stores. Although the Semantic Web has been used in some specific domains such as government and media, it has recently also been applied to the life sciences. In this chapter, I will describe the Semantic Web and its application to life science including glycobiology. Finally, I will introduce TogoTable, which is a web application using the Semantic Web, used for collecting annotations from distributed databases.
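The triple-store and SPARQL ideas summarised above can be illustrated with a toy example. The sketch below stores subject-predicate-object triples in a plain list and joins two patterns the way a SPARQL basic graph pattern would. It is not a SPARQL engine and uses no RDF library; the `ex:` identifiers, the predicate names, and the functions `match` and `query` are all hypothetical.

```python
def match(triples, pattern):
    """Yield variable bindings for one (s, p, o) pattern.

    Terms beginning with '?' are variables; everything else must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated inside one pattern must bind consistently.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def query(triples, patterns):
    """Join several patterns on shared variables and return all solutions."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that are already bound before matching.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(triples, bound):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


# Hypothetical data: two glycans annotated with motifs, each carried by a protein.
TRIPLES = [
    ("ex:glycan1", "ex:hasMotif", "ex:LewisX"),
    ("ex:glycan2", "ex:hasMotif", "ex:CoreFucose"),
    ("ex:protein1", "ex:carries", "ex:glycan1"),
    ("ex:protein2", "ex:carries", "ex:glycan2"),
]

if __name__ == "__main__":
    # "Which proteins carry a glycan that has the Lewis X motif?"
    patterns = [("?g", "ex:hasMotif", "ex:LewisX"), ("?p", "ex:carries", "?g")]
    for row in query(TRIPLES, patterns):
        print(row["?p"], row["?g"])
```

Because both facts are expressed in the same triple form, data from separate sources can be joined simply by sharing identifiers, which is the integration argument the chapter makes.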
BACKGROUND: In the era of semantic web, life science ontologies play an important role in tasks such as annotating biological objects, linking relevant data pieces, and verifying data consistency. Understanding ontology structures and overlapping ontologies is essential for tasks such as ontology reuse and development. We present an exploratory study where we examine structure and look for patterns in BioPortal, a comprehensive publicly available repository of life science ontologies. METHODS: We report an analysis of biomedical ontology mapping data over time. We apply graph theory methods such as Modularity Analysis and Betweenness Centrality to analyse data gathered at five different time points. We identify communities, i.e., sets of overlapping ontologies, and define similar and closest communities. We demonstrate evolution of identified communities over time and identify core ontologies of the closest communities. We use BioPortal project and category data to measure community coherence. We also validate identified communities with their mutual mentions in scientific literature. RESULTS: By comparing mapping data gathered at five different time points, we identified similar and closest communities of overlapping ontologies, and demonstrated evolution of communities over time. Results showed that anatomy and health ontologies tend to form more isolated communities compared to other categories. We also showed that communities contain all or the majority of ontologies being used in narrower projects. In addition, we identified major changes in mapping data after migration to BioPortal Version 4.
MOTIVATION: Over the last decades several glycomics-based bioinformatics resources and databases have been created and released to the public. Unfortunately, there is no common standard in the representation of the stored information or a common machine-readable interface allowing bioinformatics groups to easily extract and cross-reference the stored information. RESULTS: An international group of bioinformatics experts in the field of glycomics have worked together to create a standard Resource Description Framework (RDF) representation for glycomics data, focused on glycan sequences and related biological source, publications and experimental data. This RDF standard is defined by the GlycoRDF ontology and will be used by database providers to generate common machine-readable exports of the data stored in their databases. AVAILABILITY AND IMPLEMENTATION: The ontology, supporting documentation and source code used by database providers to generate standardized RDF are available online (http://www.glycoinfo.org/GlycoRDF/).
no abstract.
BACKGROUND: In recent years, a large amount of "-omics" data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. RESULTS: We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. CONCLUSIONS: BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic.
The life sciences field is entering an era of big data with the breakthroughs of science and technology. More and more big data-related projects and activities are being performed in the world. Life sciences data generated by new technologies are continuing to grow in not only size but also variety and complexity, with great speed. To ensure that big data has a major influence in the life sciences, comprehensive data analysis across multiple data sources and even across disciplines is indispensable. The increasing volume of data and the heterogeneous, complex varieties of data are two principal issues mainly discussed in life science informatics. The ever-evolving next-generation Web, characterized as the Semantic Web, is an extension of the current Web, aiming to provide information for not only humans but also computers to semantically process large-scale data. The paper presents a survey of big data in life sciences, big data related projects and Semantic Web technologies. The paper introduces the main Semantic Web technologies and their current situation, and provides a detailed analysis of how Semantic Web technologies address the heterogeneous variety of life sciences big data. The paper helps to understand the role of Semantic Web technologies in the big data era and how they provide a promising solution for the big data in life sciences.
Recent years have seen great advances in the development of glycoproteomics protocols and methods resulting in a sustainable increase in the reporting of proteins, their attached glycans and glycosylation sites. However, only very few of these reports find their way into databases or data repositories. One of the major reasons is the absence of a digital standard to represent glycoproteins and the challenging annotations with glycans. Depending on the experimental method, such a standard must be able to represent glycans as complete structures or as compositions, store not just single glycans but also represent glycoforms on a specific glycosylation site, deal with partially missing site information if no site mapping was performed, and store abundances or ratios of glycans within a glycoform of a specific site. To support the above, we have developed the GlycoConjugate Ontology (GlycoCoO) as a standard semantic framework to describe and represent glycoproteomics data. GlycoCoO can be used to represent glycoproteomics data in triplestores and can serve as a basis for data exchange formats. The ontology, database providers and supporting documentation are available online (https://github.com/glycoinfo/GlycoCoO).
The field of Glycomics has emerged as a result of technical advances that allow high-throughput, high-sensitivity analysis of structurally complex molecules. The complexity of glycan biosynthesis makes its analysis difficult, limiting glycomics knowledge of the mechanisms by which glycans carry out their biological functions. This chapter provides an overview of the general problem of integrating glycomics information and structural databases. Development of tools is required to create informatics systems that significantly advance our understanding of glycobiology. The basic infrastructure for semantic integration of glycomics data includes knowledge repositories, data repositories and tools for glycoinformatics that are required for high-throughput data acquisition, integration of the data, and discovery of knowledge that is inferred by the data. This provides an understanding of the structure and biosynthesis of glycans. These powerful new analytical techniques produce large amounts of information-rich raw data, making it necessary to develop equally powerful informatics techniques that process and mine data in order to understand its biological implications in detail. The highest priority is the extension and utilization of curational tools to populate ontologies with trusted knowledge in the glycomics domain. Ontological specification of the fundamental knowledge serves as a springboard for the further development of powerful tools for interpreting glycomics data by providing a basis for the expressive annotation of experimental data, facilitating the realization of its biological relevance.
Here, we used data of complete genomes to study comparatively the metabolism of different species. We built phenetic trees based on the enzymatic functions present in different parts of metabolism. Seven broad metabolic classes, comprising a total of 69 metabolic pathways, were comparatively analyzed for 27 fully sequenced organisms of the domains Eukarya, Bacteria and Archaea. Phylogenetic profiles based on the presence/absence of enzymatic functions for each metabolic class were determined and distance matrices for all the organisms were then derived from the profiles. Unrooted phenetic trees based upon the matrices revealed the distribution of the organisms according to their metabolic capabilities, reflecting the ecological pressures and adaptations that those species underwent during their evolution. We found that organisms that are closely related in phylogenetic terms could be distantly related metabolically and that the opposite is also true. For example, obligate bacterial pathogens were usually grouped together in our metabolic trees, demonstrating that obligate pathogens share common metabolic features regardless of their diverse phylogenetic origins. The branching order of proteobacteria often did not match their classical phylogenetic classification and Gram-positive bacteria showed diverse metabolic affinities. Archaea were found to be metabolically as distant from free-living bacteria as from eukaryotes, and sometimes were placed close to the metabolically highly specialized group of obligate bacterial pathogens. Metabolic trees represent an integrative approach for the comparison of the evolution of the metabolism and its correlation with the evolution of the genome, helping to find new relationships in the tree of life.
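The step from presence/absence profiles to a distance matrix can be sketched as follows. The abstract does not state which distance measure was used, so the Jaccard distance here is an assumption chosen for illustration, and the organisms and EC numbers are made up.

```python
from itertools import combinations

# Hypothetical profiles: organism -> set of enzymatic functions present.
PROFILES = {
    "org_A": {"1.1.1.1", "2.7.1.1", "4.1.2.13"},
    "org_B": {"1.1.1.1", "2.7.1.1"},
    "org_C": {"3.2.1.4"},
}


def jaccard_distance(a, b):
    """1 - |intersection| / |union|; two empty profiles are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Symmetric distance matrix as a dict keyed by (name, name)."""
    names = sorted(profiles)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(profiles[x], profiles[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


if __name__ == "__main__":
    for pair, d in sorted(distance_matrix(PROFILES).items()):
        print(pair, round(d, 3))
```

A matrix of this kind is the usual input for distance-based tree building methods.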
Phylogenetic trees resulting from molecular phylogenetic analysis are available in Newick format from specialized databases but when it comes to phylogenetic networks, which provide an explicit representation of reticulate evolutionary events such as recombination, hybridization or lateral gene transfer, the lack of a standard format for their representation has hindered the publication of explicit phylogenetic networks in the specialized literature and their incorporation in specialized databases. Two different proposals to represent phylogenetic networks exist: as a single Newick string (where each hybrid node is split once for each parent) or as a set of Newick strings (one for each hybrid node plus another one for the phylogenetic network).
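To make the single-string proposal concrete, here is a minimal sketch of a Newick parser in Python. It handles only names, branch lengths and parentheses (no quoting or comments). A hybrid node that has been split appears as a label occurring more than once; the `x#H1` tag in the example is an assumed labelling convention, not something specified in the abstract.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Names of the terminal nodes, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [n for child in node["children"] for n in leaf_names(child)]


def all_labels(node):
    """All non-empty node labels in pre-order."""
    labels = [node["name"]] if node["name"] else []
    for child in node["children"]:
        labels.extend(all_labels(child))
    return labels


def repeated_labels(node):
    """Labels that occur more than once, i.e. candidate split hybrid nodes."""
    labels = all_labels(node)
    return {lab for lab in labels if labels.count(lab) > 1}


if __name__ == "__main__":
    tree = parse_newick("((A:0.1,B:0.2)x:0.3,C:0.4);")
    print(leaf_names(tree))
    network = parse_newick("((A,(B)x#H1),(x#H1,C));")
    print(repeated_labels(network))
```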
The Tree of Life (ToL) is meant to be a unique representation of the evolutionary relationships between all species on earth. Huge efforts are made to assemble such a large tree, helped by the decrease of sequencing costs and improved methods to reconstruct and combine phylogenies, but no tool exists today to explore the ToL in its entirety in a satisfying manner. By combining methods used in modern cartography, such as OpenStreetMap, with a new way of representing tree-like structures, I created Lifemap, a tool allowing the exploration of a complete representation of the ToL (between 800,000 and 2.2 million species depending on the data source) in a zoomable interface. A server version of Lifemap also allows users to visualize their own trees. This should help researchers in ecology and evolutionary biology in their everyday work, but may also permit the diffusion to a broader audience of our current knowledge of the evolutionary relationships linking all organisms.
The Minimum Evolution (ME) approach to phylogeny estimation has been shown to be statistically consistent when it is used in conjunction with ordinary least-squares (OLS) fitting of a metric to a tree structure. The traditional approach to using ME has been to start with the Neighbor Joining (NJ) topology for a given matrix and then do a topological search from that starting point. The first stage requires O(n^3) time, where n is the number of taxa, while the current implementations of the second are in O(p n^3) or more, where p is the number of swaps performed by the program. In this paper, we examine a greedy approach to minimum evolution which produces a starting topology in O(n^2) time. Moreover, we provide an algorithm that searches for the best topology using nearest neighbor interchanges (NNIs), where the cost of doing p NNIs is O(n^2 + p n), i.e., O(n^2) in practice because p is always much smaller than n. The Greedy Minimum Evolution (GME) algorithm, when used in combination with NNIs, produces trees which are fairly close to NJ trees in terms of topological accuracy. We also examine ME under a balanced weighting scheme, where sibling subtrees have equal weight, as opposed to the standard "unweighted" OLS, where all taxa have the same weight so that the weight of a subtree is equal to the number of its taxa. The balanced minimum evolution scheme (BME) runs slower than the OLS version, requiring O(n^2 × diam(T)) operations to build the starting tree and O(p n × diam(T)) to perform the NNIs, where diam(T) is the topological diameter of the output tree. In the usual Yule-Harding distribution on phylogenetic trees, the diameter expectation is in log(n), so our algorithms are in practice faster than NJ. Moreover, this BME scheme yields a very significant improvement over NJ and other distance-based algorithms, especially with large trees, in terms of topological accuracy.
From comparative analyses of the nucleotide sequences of genes encoding ribosomal RNAs and several proteins, molecular phylogeneticists have constructed a "universal tree of life," taking it as the basis for a "natural" hierarchical classification of all living things. Although confidence in some of the tree's early branches has recently been shaken, new approaches could still resolve many methodological uncertainties. More challenging is evidence that most archaeal and bacterial genomes (and the inferred ancestral eukaryotic nuclear genome) contain genes from multiple sources. If "chimerism" or "lateral gene transfer" cannot be dismissed as trivial in extent or limited to special categories of genes, then no hierarchical universal classification can be taken as natural. Molecular phylogeneticists will have failed to find the "true tree," not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot properly be represented as a tree. However, taxonomies based on molecular sequences will remain indispensable, and understanding of the evolutionary process will ultimately be enriched, not impoverished.
Carbohydrates are biological blocks participating in diverse and crucial processes both at cellular and organism levels. They protect individual cells, establish intracellular interactions, take part in the immune reaction and participate in many other processes. Glycosylation is considered one of the most important modifications of proteins and other biologically active molecules. Still, the data on the enzymatic machinery involved in the carbohydrate synthesis and processing are scattered, and the advance on its study is hindered by the vast bulk of accumulated genetic information not supported by any experimental evidence for functions of proteins that are encoded by these genes. In this article, we present novel instruments for statistical analysis of glycomes in taxa. These tools may be helpful for investigating carbohydrate-related enzymatic activities in various groups of organisms and for comparison of their carbohydrate content. The instruments are developed on the Carbohydrate Structure Database (CSDB) platform and are available freely on the CSDB web-site at http://csdb.glycoscience.ru. Database URL: http://csdb.glycoscience.ru.
The NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank, ENA (EMBL) and DDBJ databases. It includes organism names and taxonomic lineages for each of the sequences represented in the INSDC's nucleotide and protein sequence databases. The taxonomy database is manually curated by a small group of scientists at the NCBI who use the current taxonomic literature to maintain a phylogenetic taxonomy for the source organisms represented in the sequence databases. The taxonomy database is a central organizing hub for many of the resources at the NCBI, and provides a means for clustering elements within other domains of NCBI web site, for internal linking between domains of the Entrez system and for linking out to taxon-specific external resources on the web. Our primary purpose is to index the domain of sequences as conveniently as possible for our user community.
A phylogenetic 'tree of life' has been constructed based on the observed presence and absence of families of protein-encoding genes observed in 11 complete genomes of free-living microorganisms. Past attempts to reconstruct the evolutionary relationships of microorganisms have been limited to sets of genes rather than complete genomes. Despite apparent rampant lateral gene transfer among microorganisms, these results indicate a single robust underlying evolutionary history for these organisms. Broadly, the tree produced is very similar to the small subunit rRNA tree although several additional phylogenetic relationships appear to be resolved, including the relationship of Archaeoglobus to the methanogens studied. This result is in contrast to notions that a robust phylogenetic reconstruction of microorganisms is impossible due to their genomes being composed of an incomprehensible amalgam of genes with complicated histories and suggests that this style of genome-wide phylogenetic analysis could become an important method for studying the ancient diversification of life on Earth. Analyses using informational and operational subsets of the genes showed that this 'tree of life' is not dependent on the phylogenetically more consistent informational genes.
We propose an improved version of the neighbor-joining (NJ) algorithm of Saitou and Nei. This new algorithm, BIONJ, follows the same agglomerative scheme as NJ, which consists of iteratively picking a pair of taxa, creating a new node which represents the cluster of these taxa, and reducing the distance matrix by replacing both taxa by this node. Moreover, BIONJ uses a simple first-order model of the variances and covariances of evolutionary distance estimates. This model is well adapted when these estimates are obtained from aligned sequences. At each step it permits the selection, from the class of admissible reductions, of the reduction which minimizes the variance of the new distance matrix. In this way, we obtain better estimates to choose the pair of taxa to be agglomerated during the next steps. Moreover, in comparison with NJ's estimates, these estimates become better and better as the algorithm proceeds. BIONJ retains the good properties of NJ, especially its low run time. Computer simulations have been performed with 12-taxon model trees to determine BIONJ's efficiency. When the substitution rates are low (maximum pairwise divergence approximately 0.1 substitutions per site) or when they are constant among lineages, BIONJ is only slightly better than NJ. When the substitution rates are higher and vary among lineages, BIONJ clearly has better topological accuracy. In the latter case, for the model trees and the conditions of evolution tested, the topological error reduction is on the average around 20%. With highly-varying-rate trees and with high substitution rates (maximum pairwise divergence approximately 1.0 substitutions per site), the error reduction may even rise above 50%, while the probability of finding the correct tree may be augmented by as much as 15%.
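The agglomerative scheme that BIONJ shares with NJ (pick a pair, create a node, reduce the matrix) can be sketched as below. This is the classical NJ of Saitou and Nei with the standard Q-criterion and an equal-weight matrix reduction; it does not implement BIONJ's variance-based reduction. The function names, the `node1`, `node2`, ... labels and the input format are assumptions made for the example.

```python
def make_dist(names, rows):
    """Build a pairwise distance dict from a square matrix given as nested lists."""
    return {
        frozenset((names[i], names[j])): rows[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }


def neighbor_joining(names, dist):
    """Classical NJ. Returns edges as (child, parent, branch_length) tuples."""
    d = dict(dist)
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        total = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Pick the pair minimising Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        best = None
        for a in range(n):
            for b in range(a + 1, n):
                i, j = nodes[a], nodes[b]
                q = (n - 2) * d[frozenset((i, j))] - total[i] - total[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        dij = d[frozenset((i, j))]
        # Branch lengths from the two joined taxa to the new node.
        li = 0.5 * dij + (total[i] - total[j]) / (2 * (n - 2))
        lj = dij - li
        counter += 1
        u = f"node{counter}"
        edges.append((i, u, li))
        edges.append((j, u, lj))
        # Reduce the matrix: replace i and j by the new node u (equal weights).
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    matrix = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    for edge in neighbor_joining(taxa, make_dist(taxa, matrix)):
        print(edge)
```

BIONJ differs in the reduction step, where the weights given to the two joined taxa are chosen to minimise the variance of the new distances rather than being fixed at one half.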
BACKGROUND: There are considerable differences between bacterial and mammalian glycans. In contrast to most eukaryotic carbohydrates, bacterial glycans are often composed of repeating units with diverse functions ranging from structural reinforcement to adhesion, colonization and camouflage. Since bacterial glycans are typically displayed at the cell surface, they can interact with the environment and, therefore, have significant biomedical importance. RESULTS: The sequence characteristics of glycans (monosaccharide composition, modifications, and linkage patterns) for the higher bacterial taxonomic classes have been examined and compared with the data for mammals, with both similarities and unique features becoming evident. Compared to mammalian glycans, the bacterial glycans deposited in the current databases have a more than ten-fold greater diversity at the monosaccharide level, and the disaccharide pattern space is approximately nine times larger. Specific bacterial subclasses exhibit characteristic glycans which can be distinguished on the basis of distinctive structural features or sequence properties. CONCLUSION: For the first time a systematic database analysis of the bacterial glycome has been performed. This study summarizes the current knowledge of bacterial glycan architecture and diversity and reveals putative targets for the rational design and development of therapeutic intervention strategies by comparing bacterial and mammalian glycans.
The Rosetta software for macromolecular modeling, docking and design is extensively used in laboratories worldwide. During two decades of development by a community of laboratories at more than 60 institutions, Rosetta has been continuously refactored and extended. Its advantages are its performance and interoperability between broad modeling capabilities. Here we review tools developed in the last 5 years, including over 80 methods. We discuss improvements to the score function, user interfaces and usability. Rosetta is available at http://www.rosettacommons.org.
Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. In addition to classical tree viewer functions, iTOL offers many novel ways of annotating trees with various additional data. Current version introduces numerous new features and greatly expands the number of supported data set types. Trees can be interactively manipulated and edited. A free personal account system is available, providing management and sharing of trees in user defined workspaces and projects. Export to various bitmap and vector graphics formats is supported. Batch access interface is available for programmatic access or inclusion of interactive trees into other web services.
We built whole-genome trees based on the presence or absence of particular molecular features, either orthologs or folds, in the genomes of a number of recently sequenced microorganisms. To put these genomic trees into perspective, we compared them to the traditional ribosomal phylogeny and also to trees based on the sequence similarity of individual orthologous proteins. We found that our genomic trees based on the overall occurrence of orthologs did not agree well with the traditional tree. This discrepancy, however, vanished when one restricted the tree to proteins involved in transcription and translation, not including problematic proteins involved in metabolism. Protein folds unite superficially unrelated sequence families and represent a most fundamental molecular unit described by genomes. We found that our genomic occurrence tree based on folds agreed fairly well with the traditional ribosomal phylogeny. Surprisingly, despite this overall agreement, certain classes of folds, particularly all-beta ones, had a somewhat different phylogenetic distribution. We also compared our occurrence trees to whole-genome clusters based on the composition of amino acids and di-nucleotides. Finally, we analyzed some technical aspects of genomic trees, e.g., comparing parsimony versus distance-based approaches and examining the effects of increasing numbers of organisms. Additional information (e.g. clickable trees) is available from http://bioinfo.mbb.yale.edu/genome/trees.
NEXUS is a file format designed to contain systematic data for use by computer programs. The goals of the format are to allow future expansion, to include diverse kinds of information, to be independent of particular computer operating systems, and to be easily processed by a program. To this end, the format is modular, with a file consisting of separate blocks, each containing one particular kind of information, and consisting of standardized commands. Public blocks (those containing information utilized by several programs) house information about taxa, morphological and molecular characters, distances, genetic codes, assumptions, sets, trees, etc.; private blocks contain information of relevance to single programs. A detailed description of commands in public blocks is given. Guidelines are provided for reading and writing NEXUS files and for extending the format.
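The block structure described above can be illustrated with a small writer that emits a TAXA block and a TREES block. This is a minimal sketch covering only those two public blocks; the function name `to_nexus` and the example taxa are invented, and real files may carry many more commands.

```python
def to_nexus(taxa, trees):
    """Render taxon labels and named Newick trees as a minimal NEXUS document."""
    lines = [
        "#NEXUS",
        "",
        "BEGIN TAXA;",
        f"    DIMENSIONS NTAX={len(taxa)};",
        "    TAXLABELS " + " ".join(taxa) + ";",
        "END;",
        "",
        "BEGIN TREES;",
    ]
    for name, newick in trees.items():
        # Strip any trailing ';' so each TREE command ends with exactly one.
        lines.append(f"    TREE {name} = {newick.rstrip(';')};")
    lines += ["END;", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_nexus(["A", "B", "C"], {"tree1": "((A,B),C);"}))
```

Each block starts with `BEGIN <name>;` and closes with `END;`, which is what lets a program skip blocks it does not understand.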
We detail new algorithms for the major agglomerative hierarchic clustering methods. These include: optimal O(N^2) time and O(N) space algorithms for the centroid (UPGMC), median (WPGMC) and Ward's minimum variance methods; and O(N^2) time and O(N^2) space algorithms for any linkage-based method. For the former agglomerative methods, these represent a significant advance on currently implemented algorithms; for the latter, the time-optimal algorithms for the weighted and unweighted group average methods (UPGMA, WPGMA) are also superior to currently implemented versions.
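For orientation, here is what the unweighted group average method (UPGMA) does, written as a deliberately naive sketch. It rescans all pairs at every step, so it is cubic in the number of items and is not one of the O(N^2) algorithms the abstract refers to. The function name and input format are assumptions.

```python
def upgma(names, dist):
    """Naive UPGMA. `dist` maps frozenset({a, b}) -> distance.

    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    where height is half the distance between the merged clusters.
    """
    sizes = {name: 1 for name in names}
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        labels = list(sizes)
        # Closest pair of current clusters (first one wins on ties).
        dab, a, b = min(
            (
                (d[frozenset((x, y))], x, y)
                for i, x in enumerate(labels)
                for y in labels[i + 1:]
            ),
            key=lambda t: t[0],
        )
        new = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        # Size-weighted average of the distances to the two merged clusters.
        for k in sizes:
            d[frozenset((new, k))] = (
                na * d[frozenset((a, k))] + nb * d[frozenset((b, k))]
            ) / (na + nb)
        sizes[new] = na + nb
        merges.append((a, b, dab / 2))
    return merges


if __name__ == "__main__":
    example = {
        frozenset(("A", "B")): 2.0,
        frozenset(("A", "C")): 6.0,
        frozenset(("B", "C")): 8.0,
    }
    for merge in upgma(["A", "B", "C"], example):
        print(merge)
```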
The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. Two algorithms are found in the literature and software, both announcing that they implement the Ward clustering method. When applied to the same distance matrix, they produce different results. One algorithm preserves Ward’s criterion, the other does not. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward’s hierarchical clustering method.
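For reference, a commonly stated form of the Lance-Williams update for Ward's method is given below, with $n_i$, $n_j$, $n_k$ the cluster sizes and $\delta^2$ denoting squared Euclidean dissimilarities. The abstract itself does not spell out the two algorithms. One plausible reading, offered here as an assumption, is that they differ in whether this update is applied to squared dissimilarities (which preserves the error sum of squares criterion) or to unsquared ones.

```latex
\delta^2(i \cup j,\, k) =
  \frac{(n_i + n_k)\,\delta^2(i, k) + (n_j + n_k)\,\delta^2(j, k) - n_k\,\delta^2(i, j)}
       {n_i + n_j + n_k}
```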
SUMMARY: We describe an algorithm and software tool for comparing alternative phylogenetic trees. The main application of the software is to compare phylogenies obtained using different phylogenetic methods for some fixed set of species or obtained using different gene sequences from those species. The algorithm pairs up each branch in one phylogeny with a matching branch in the second phylogeny and finds the optimum 1-to-1 map between branches in the two trees in terms of a topological score. The software enables the user to explore the corresponding mapping between the phylogenies interactively, and clearly highlights those parts of the trees that differ, both in terms of topology and branch length. AVAILABILITY: The software is implemented as a Java applet at http://www.mrc-bsu.cam.ac.uk/personal/thomas/phylo_comparison/comparison_page.html. It is also available on request from the authors.
BACKGROUND: To infer the tree of life requires knowledge of the common characteristics of each species descended from a common ancestor as the measuring criteria and a method to calculate the distance between the resulting values of each measure. Conventional phylogenetic analysis based on genomic sequences provides information about the genetic relationships between different organisms. In contrast, comparative analysis of metabolic pathways in different organisms can yield insights into their functional relationships under different physiological conditions. However, evaluating the similarities or differences between metabolic networks is a computationally challenging problem, and systematic methods of doing this are desirable. Here we introduce a graph-kernel method for computing the similarity between metabolic networks in polynomial time, and use it to profile metabolic pathways and to construct phylogenetic trees. RESULTS: To compare the structures of metabolic networks in organisms, we adopted the exponential graph kernel, which is a kernel-based approach with a labeled graph that includes a label matrix and an adjacency matrix. To construct the phylogenetic trees, we used an unweighted pair-group method with arithmetic mean, i.e., a hierarchical clustering algorithm. We applied the kernel-based network profiling method in a comparative analysis of nine carbohydrate metabolic networks from 81 biological species encompassing Archaea, Eukaryota, and Eubacteria. The resulting phylogenetic hierarchies generally support the tripartite scheme of three domains rather than the two domains of prokaryotes and eukaryotes. CONCLUSION: By combining the kernel machines with metabolic information, the method infers the context of biosphere development that covers physiological events required for adaptation by genetic reconstruction. The results show that one may obtain a global view of the tree of life by comparing the metabolic pathway structures using meta-level information rather than sequence information. This method may yield further information about biological evolution, such as the history of horizontal transfer of each gene, by studying the detailed structure of the phylogenetic tree constructed by the kernel-based method.
no abstract.
Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics. APE provides both utility functions for reading and writing data and manipulating phylogenetic trees, as well as several advanced methods for phylogenetic and evolutionary analysis (e.g. comparative and population genetic methods). APE takes advantage of the many R functions for statistics and graphics, and also provides a flexible framework for developing and implementing further statistical methods for the analysis of evolutionary processes. AVAILABILITY: The program is free and available from the official R package archive at http://cran.r-project.org/src/contrib/PACKAGES.html#ape. APE is licensed under the GNU General Public License.
Detailed knowledge of gene maps or even complete nucleotide sequences for small genomes leads to the feasibility of evolutionary inference based on the macrostructure of entire genomes, rather than on the traditional comparison of homologous versions of a single gene in different organisms. The mathematical modeling of evolution at the genomic level, however, and the associated inferential apparatus are qualitatively different from the usual sequence comparison theory developed to study evolution at the level of individual gene sequences. We describe the construction of a database of 16 mitochondrial gene orders from fungi and other eukaryotes by using complete or nearly complete genomic sequences; propose a measure of gene order rearrangement based on the minimal set of chromosomal inversions, transpositions, insertions, and deletions necessary to convert the order in one genome to that of the other; report on algorithm design and the development of the DERANGE software for the calculation of this measure; and present the results of analyzing the mitochondrial data with the aid of this tool.
The National Center for Biotechnology Information (NCBI) Taxonomy includes organism names and classifications for every sequence in the nucleotide and protein sequence databases of the International Nucleotide Sequence Database Collaboration. Since the last review of this resource in 2012, it has undergone several improvements. Most notable is the shift from a single SQL database to a series of linked databases tied to a framework of data called NameBank. This means that relations among data elements can be adjusted in more detail, resulting in expanded annotation of synonyms, the ability to flag names with specific nomenclatural properties, enhanced tracking of publications tied to names and improved annotation of scientific authorities and types. Additionally, practices utilized by NCBI Taxonomy curators specific to major taxonomic groups are described, terms peculiar to NCBI Taxonomy are explained, external resources are acknowledged and updates to tools and other resources are documented. Database URL: https://www.ncbi.nlm.nih.gov/taxonomy.
Molecular structures and sequences are generally more revealing of evolutionary relationships than are classical phenotypes (particularly so among microorganisms). Consequently, the basis for the definition of taxa has progressively shifted from the organismal to the cellular to the molecular level. Molecular comparisons show that life on this planet divides into three primary groupings, commonly known as the eubacteria, the archaebacteria, and the eukaryotes. The three are very dissimilar, the differences that separate them being of a more profound nature than the differences that separate typical kingdoms, such as animals and plants. Unfortunately, neither of the conventionally accepted views of the natural relationships among living systems (i.e., the five-kingdom taxonomy or the eukaryote-prokaryote dichotomy) reflects this primary tripartite division of the living world. To remedy this situation we propose that a formal system of organisms be established in which above the level of kingdom there exists a new taxon called a "domain." Life on this planet would then be seen as comprising three domains, the Bacteria, the Archaea, and the Eucarya, each containing two or more kingdoms. (The Eucarya, for example, contain Animalia, Plantae, Fungi, and a number of others yet to be defined). Although taxonomic structure within the Bacteria and Eucarya is not treated herein, Archaea is formally subdivided into the two kingdoms Euryarchaeota (encompassing the methanogens and their phenotypically diverse relatives) and Crenarchaeota (comprising the relatively tight clustering of extremely thermophilic archaebacteria, whose general phenotype appears to resemble most the ancestral phenotype of the Archaea).
Privateer (http://www.ccp4.ac.uk/html/privateer.html) is a new software package aimed at the detection and prevention of conformational, regiochemical and stereochemical anomalies in cyclic monosaccharide structures. Carbohydrates, including O- and N-glycans attached to protein and lipid structures, are increasingly being studied in cellular biology. Crystallographic refinement of sugars is, however, poorly performed, thus leading to thousands of incorrect structures having been deposited in the Protein Data Bank (PDB) [1,2]. Although nomenclature validation has become possible in the past decade, with the introduction of tools such as PDB carbohydrate residue check (pdb-care) [3], inappropriate refinement protocols at resolutions lower than 1.6 Å can still force a correct sugar into a highly improbable ring conformation, let alone distort one with chemical errors [1]. High-energy conformations are very infrequent in nature (perhaps even out of the question in N-glycans) and must always be backed by clear electron density. Otherwise, such conformations should be treated as outliers. Privateer identifies incorrect regiochemistry and stereochemistry and unlikely conformations. A real-space correlation coefficient against omit mFo – DFc electron density is also calculated as a quality-of-fit indicator. Bias-minimized map coefficients are exported automatically and can be subsequently used to assess identification of the sugars. Using this information, Privateer produces a visual checklist for rapid correction of the errors in real space and ensures that conformational preferences are accounted for during subsequent rebuilding and refinement.
In the bioinformatics field, many computer algorithmic and data mining technologies have been developed for gene prediction, protein-protein interaction analysis, sequence analysis, and protein folding predictions, to name a few. This kind of research has branched off from the genomics field, creating the transcriptomics, proteomics, metabolomics, and glycomics research areas in the postgenomic age. In the glycomics field, given the complexity of glycan structures with their branches of monosaccharides in various conformations, new data mining and algorithmic methods have been developed in an attempt to gain a better understanding of glycans. However, these methods have not all been implemented as tools such that the glycobiology community may utilize them in their research. Thus, we have developed RINGS (Resource for INformatics of Glycomes at Soka) as a freely available Web resource for glycobiologists to analyze their data using the latest data mining and algorithmic techniques. It provides a number of tools including a 2D glycan drawing and querying interface called DrawRINGS, a Glycan Pathway Predictor (GPP) tool for dynamically computing the N-glycan biosynthesis pathway from a given glycan structure, and data mining tools Glycan Miner Tool and Profile PSTMM. These tools and other utilities provided by RINGS will be described. The URL for RINGS is http://rings.t.soka.ac.jp/.
In the bioinformatics field, many computer algorithmic and data mining technologies have been developed for gene prediction, protein-protein interaction analysis, sequence analysis, and protein folding predictions, to name a few. This kind of research has branched off from the genomics field, creating the transcriptomics, proteomics, metabolomics, and glycomics research areas in the postgenomic age. In the glycomics field, given the complexity of glycan structures with their branches of monosaccharides in various conformations, new data mining and algorithmic methods have been developed in an attempt to gain a better understanding of glycans. However, these methods have not all been implemented as tools such that the glycobiology community may utilize them in their research. Thus, we have developed RINGS (Resource for INformatics of Glycomes at Soka) as a freely available Web resource for glycobiologists to analyze their data using the latest data mining and algorithmic techniques. It provides a number of tools including a 2D glycan drawing and querying interface called DrawRINGS, a Glycan Pathway Predictor (GPP) tool for dynamically computing the N-glycan biosynthesis pathway from a given glycan structure, and data mining tools Glycan Miner Tool and Profile PSTMM. These tools and other utilities provided by RINGS will be described. The URL for RINGS is http://rings.t.soka.ac.jp/.
Knowledge of glycoproteins, their site-specific glycosylation patterns, and the glycan structures that they present to their recognition partners in health and disease is gradually being built on using a range of experimental approaches. The data from these analyses are increasingly being standardized and presented in various sources, from supplemental tables in publications to localized servers in investigator laboratories. Bioinformatics tools are now needed to collect these data and enable the user to search, display, and connect glycomics and glycoproteomics to other sources of related proteomics, genomics, and interactomics information. We here introduce GlyConnect (https://glyconnect.expasy.org/), the central platform of the Glycomics@ExPASy portal for glycoinformatics. GlyConnect has been developed to gather, monitor, integrate, and visualize data in a user-friendly way to facilitate the interpretation of collected glycoscience data. GlyConnect is designed to accommodate and integrate multiple data types as they are increasingly produced.
SugarSketcher is an intuitive and fast JavaScript interface module for online drawing of glycan structures in the popular Symbol Nomenclature for Glycans (SNFG) notation and exporting them to various commonly used formats encoding carbohydrate sequences (e.g., GlycoCT) or quality images (e.g., svg). It does not require a backend server or any specific browser plugins and can be integrated in any web glycoinformatics project. SugarSketcher allows both glycobiologists and non-expert users to draw glycans. The “quick mode” allows a newcomer to build up a glycan structure with only a limited knowledge of carbohydrate chemistry. The “normal mode” integrates advanced options which enable glycobiologists to tailor complex carbohydrate structures. The source code is freely available on GitHub and glycoinformaticians are encouraged to participate in the development process while users are invited to test a prototype available on the ExPASy website and send feedback.
This chapter introduces the web resource called RINGS which can be accessed at http://www.rings.t.soka.ac.jp/ and provides freely available tools for analyzing and data mining glycomics data. These include DrawRINGS, an applet to draw glycan structures and obtain the KCF (KEGG Chemical Function)-formatted representation of the structure; ProfilePSTMM, a tool based on a probabilistic model which can extract patterns within groups of glycan structures; Glycan Miner, a tool for finding the common glycan substructures within a group of glycan structures; MCAW, a multiple alignment tool for glycans; Kernel Tool, a tool to find glycan substructures that are particular to one glycan structure data set compared to another; GPP, glycan pathway predictor for generating N-glycan biosynthesis pathways; and utilities for converting between various glycan structure representations. Each tool will be described in its usage and application, along with tips for using the tools most effectively. Moreover, RINGS provides a data management system as well as a feedback system to allow users to store their data on the RINGS server as well as to interact with developers to improve RINGS functionality.
Many databases of carbohydrate structures and related information can be found on the World Wide Web. This review covers the major carbohydrate databases that have potential utility for glycoscientists and researchers entering the glycosciences. The first half provides a brief overview of carbohydrate databases and web resources (including a history of carbohydrate databases and carbohydrate notations used in these databases), and the second half provides a guide that can be used as an index to determine which resources provide the data of most interest to the user.
no abstract.
Glycans are key molecules in many physiological and pathological processes. As with other molecules, like proteins, visualization of the 3D structures of glycans adds valuable information for understanding their biological function. Hence, here we introduce Azahar, a computing environment for the creation, visualization and analysis of glycan molecules. Azahar is implemented in Python and works as a plugin for the well known PyMOL package (Schrodinger in The PyMOL molecular graphics system, version 1.3r1, 2010). Besides the already available visualization and analysis options provided by PyMOL, Azahar includes 3 cartoon-like representations and tools for 3D structure characterization such as a conformational search using a Monte Carlo with minimization routine and also tools to analyse single glycans or trajectories/ensembles including the calculation of radius of gyration, Ramachandran plots and hydrogen bonds. Azahar is freely available to download from http://www.pymolwiki.org/index.php/Azahar and the source code is available at https://github.com/agustinaarroyuelo/Azahar.
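As an illustration of one of the analyses named in the abstract above, the following is a minimal sketch of a radius-of-gyration calculation. It is not Azahar's code and does not use the PyMOL API; the function name `radius_of_gyration` and the plain-tuple coordinate format are assumptions made for this example.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration for a list of (x, y, z) tuples.

    Hypothetical helper for illustration only; if no masses are given,
    every atom is given unit mass.
    """
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    # Centre of mass, one component per axis.
    com = [sum(m * c[i] for m, c in zip(masses, coords)) / total
           for i in range(3)]
    # Mass-weighted mean squared distance from the centre of mass.
    msd = sum(m * sum((c[i] - com[i]) ** 2 for i in range(3))
              for m, c in zip(masses, coords)) / total
    return math.sqrt(msd)
```

Applied to each frame of a trajectory, such a function gives a simple measure of how compact or extended a glycan is over time.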
Glycans are crucial to the functioning of multicellular organisms. They may also play a role as mediators between host and parasite or symbiont. As many proteins (>50%) are posttranslationally modified by glycosylation, this mechanism is considered to be the most widespread posttranslational modification in eukaryotes. These surface modifications alter and regulate structure and biological activities/functions of proteins/biomolecules as they are largely involved in the recognition process of the appropriate structure in order to bind to the target cells. Consequently, the recognition of glycans on cellular surfaces plays a crucial role in the promotion or inhibition of various diseases and, therefore, glycosylation itself is considered to be a critical protein quality control attribute for commercial therapeutics, which is one of the fastest growing segments in the pharmaceutical industry. With the development of glycobiology as a separate discipline, a number of databases and tools became available in a similar way to other well-established "omics." Alleviating the recognized shortcomings of the available tools for data storage and retrieval is one of the highest priorities of the international glycoinformatics community. In the last decade, major efforts have been made, by leading scientific groups, towards the integration of a number of major databases and tools into a single portal, which would act as a centralized data repository for glycomics, equipped with a number of comprehensive analytical tools for data systematization, analysis, and comparison. This chapter provides an overview of the most important carbohydrate-related databases and glycoinformatic tools.
MOTIVATION: Complex carbohydrates play a central role in cellular communication and in disease development. O- and N-glycans, which are post-translationally attached to proteins and lipids, are sugar chains that are rooted tree structures. Independent efforts to develop computational tools for analyzing complex carbohydrate structures have been designed to exploit specific databases requiring unique formatting and limited transferability. Attempts have been made at integrating these resources, yet it remains difficult to communicate and share data across several online resources. A disadvantage of the lack of coordination between development efforts is the inability of the user community to create reproducible analyses (workflows). The latter results in the more serious unreliability of glycomics metadata. RESULTS: In this paper, we realize the significance of connecting multiple online glycan resources that can be used to design reproducible experiments for obtaining, generating and analyzing cell glycomes. To address this, a suite of tools and utilities has been integrated into the analytic functionality of the Galaxy bioinformatics platform to provide a Glycome Analytics Platform (GAP). Using this platform, users can design in silico workflows to manipulate various formats of glycan sequences and analyze glycomes through access to web data and services. We illustrate the central functionality and features of the GAP by way of example; we analyze and compare the features of the N-glycan glycome of monocytic cells sourced from two separate data depositions. This paper highlights the use of reproducible research methods for glycomics analysis and the GAP presents an opportunity for integrating tools in glycobioinformatics.
AVAILABILITY AND IMPLEMENTATION: This software is open-source and available online at https://bitbucket.org/scientificomputing/glycome-analytics-platform CONTACTS: chris.barnett@uct.ac.za or kevin.naidoo@uct.ac.za SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
N-glycosylation is a post-translational modification heavily impacting protein functions. Some alterations of glycosylation, such as sialic acid hydrolysis, are related to protein dysfunction. Because of their high flexibility and the many reactive groups of the glycan chains, studying glycans with in vitro methods is a challenging task. Molecular dynamics is a useful tool and probably the only one in biology able to overcome this problem and gives access to conformational information through exhaustive sampling. To better decipher the impact of N-glycans, the analysis and visualization of their influence over time on protein structure is a prerequisite. We developed the Umbrella Visualization, a graphical method that assigns the glycan intrinsic flexibility during a molecular dynamics trajectory. The density plot generated by this method provided relevant information regarding glycan dynamics and flexibility, but needs further development in order to integrate an accurate description of the protein topology and its interactions. We propose here to transform this analysis method into a visualization mode in UnityMol. UnityMol is a molecular editor, viewer and prototyping platform, coded in C#. The new representation of glycan chains presented in this study takes into account both the main positions adopted by each antenna of a glycan and their statistical relevance. By displaying the collected data on the protein surface, one is then able to investigate the protein/glycan interactions.
This article describes features, usage, and application of a CSDB/SNFG Structure Editor, a new online tool for quick and intuitive input of carbohydrate and derivative structures using Symbol Nomenclature for Glycans (SNFG). The Editor is built on a platform of the Carbohydrate Structure Database (CSDB) and relies on its online services via the dedicated web-API. The Editor allows building of oligo- and polymeric glycan structures and supports most features of natural glycans, such as underdetermined structures, alternative branches, repeating subunits, SMILES specification of atypical monomers, and others. The vocabulary of building blocks contains 600+ monomeric residues, including 327 monosaccharides. Support for SMILES allows input and visualization of chemical structures of virtually unlimited complexity. On the other hand, the interface follows the recognized GlycanBuilder style, which is easy for novice users. The export feature includes support for CSDB Linear, GlycoCT, WURCS, SweetDB, and Glycam notations, SMILES codes, MOL/PDB atomic coordinate formats, raster and vector SNFG images, and on-the-fly visualization as 2D structural formulas and 3D molecular models. Integration of the Editor into any web-based glycoinformatics project is straightforward and simple, similarly to any other modern JavaScript application.
A new web tool, PDB2MultiGIF (http://www.dkfz-heidelberg.de/spec/pdb2mgif/), which converts the topological information (atom types, 3D coordinates, molecular connectivity) of molecules (given in PDB format [1]) to a series of animated images (in GIF format [2]), is described. The molecular visualisation program RASMOL [3] is used to generate the images.
GlyProt (http://www.glycosciences.de/glyprot/) is a web-based tool that enables meaningful N-glycan conformations to be attached to all the spatially accessible potential N-glycosylation sites of a known three-dimensional (3D) protein structure. The probabilities of physicochemical properties such as mass, accessible surface and radius of gyration are calculated. The purpose of this service is to provide rapid access to reliable 3D models of glycoproteins, which can subsequently be refined by using more elaborate simulations and validated by comparing the generated models with experimental data.
Formerly developed inside the BASF company for in-house use, SpecInfo is now available to the public due to a project of the German Federal Ministry of Research and Technology (BMFT). The system is meant as a supporting tool for spectroscopists elucidating the structure of chemical compounds. SpecInfo supports the synergistic use of NMR, IR and MS spectra for the correlation between structure and spectral patterns, based on a large spectroscopic databank. Additionally, statistical evaluations on this database are used to link spectral patterns to (sub-)structures responsible for these patterns.
CHARMM (Chemistry at HARvard Molecular Mechanics) is a highly versatile and widely used molecular simulation program. It has been developed over the last three decades with a primary focus on molecules of biological interest, including proteins, peptides, lipids, nucleic acids, carbohydrates, and small molecule ligands, as they occur in solution, crystals, and membrane environments. For the study of such systems, the program provides a large suite of computational tools that include numerous conformational and path sampling methods, free energy estimators, molecular minimization, dynamics, and analysis techniques, and model-building capabilities. The CHARMM program is applicable to problems involving a much broader class of many-particle systems. Calculations with CHARMM can be performed using a number of different energy functions and models, from mixed quantum mechanical-molecular mechanical force fields, to all-atom classical potential energy functions with explicit solvent and various boundary conditions, to implicit solvent and membrane models. The program has been ported to numerous platforms in both serial and parallel architectures. This article provides an overview of the program as it exists today with an emphasis on developments since the publication of the original CHARMM article in 1983.
Version 1.2 of the software system, termed Crystallography and NMR system (CNS), for crystallographic and NMR structure determination has been released. Since its first release, the goals of CNS have been (i) to create a flexible computational framework for exploration of new approaches to structure determination, (ii) to provide tools for structure solution of difficult or large structures, (iii) to develop models for analyzing structural and dynamical properties of macromolecules and (iv) to integrate all sources of information into all stages of the structure determination process. Version 1.2 includes an improved model for the treatment of disordered solvent for crystallographic refinement that employs a combined grid search and least-squares optimization of the bulk solvent model parameters. The method is more robust than previous implementations, especially at lower resolution, generally resulting in lower R values. Other advances include the ability to apply thermal factor sharpening to electron density maps. Consistent with the modular design of CNS, these additions and changes were implemented in the high-level computing language of CNS.
A new software suite, called Crystallography & NMR System (CNS), has been developed for macromolecular structure determination by X-ray crystallography or solution nuclear magnetic resonance (NMR) spectroscopy. In contrast to existing structure-determination programs, the architecture of CNS is highly flexible, allowing for extension to other structure-determination methods, such as electron microscopy and solid-state NMR spectroscopy. CNS has a hierarchical structure: a high-level hypertext markup language (HTML) user interface, task-oriented user input files, module files, a symbolic structure-determination language (CNS language), and low-level source code. Each layer is accessible to the user. The novice user may just use the HTML interface, while the more advanced user may use any of the other layers. The source code will be distributed, thus source-code modification is possible. The CNS language is sufficiently powerful and flexible that many new algorithms can be easily implemented in the CNS language without changes to the source code. The CNS language allows the user to perform operations on data structures, such as structure factors, electron-density maps, and atomic properties. The power of the CNS language has been demonstrated by the implementation of a comprehensive set of crystallographic procedures for phasing, density modification and refinement. User-friendly task-oriented input files are available for nearly all aspects of macromolecular structure determination by X-ray crystallography and solution NMR.
Compared to proteomics, computational platforms for glycoproteomics are at an early stage and many researchers rely on the manual interpretation of large data sets to gain structural insights into the glycoproteome. Over the last few years there has been a steady increase in the availability of bioinformatics tools for processing and annotating glycoproteomics data sets. This mini-review describes advances in the development of algorithms and software applications and their applications. Furthermore, an update on structural and analytical databases is presented with a focus on those resources still actively maintained by the community, and how these resources are now being integrated into glycoproteomics pipelines to improve data interpretation.
Despite the success of several international initiatives the glycosciences still lack a managed infrastructure that contributes to the advancement of research through the provision of comprehensive structural and experimental glycan data collections. UniCarbKB is an initiative that aims to promote the creation of an online information storage and search platform for glycomics and glycobiology research. The knowledgebase will offer a freely accessible and information-rich resource supported by querying interfaces, annotation technologies and the adoption of common standards to integrate structural, experimental and functional data. The UniCarbKB framework endeavors to support the growth of glycobioinformatics and the dissemination of knowledge through the provision of an open and unified portal to encourage the sharing of data. In order to achieve this, the framework is committed to the development of tools and procedures that support data annotation, and expanding interoperability through cross-referencing of existing databases. Database URL: http://www.unicarbkb.org.
BACKGROUND: Recent progress in method development for characterising the branched structures of complex carbohydrates has now enabled higher throughput technology. Automation of structure analysis then calls for software development since adding meaning to large data collections in reasonable time requires corresponding bioinformatics methods and tools. Current glycobioinformatics resources do cover information on the structure and function of glycans, their interaction with proteins or their enzymatic synthesis. However, this information is partial, scattered and often difficult to find for non-glycobiologists. METHODS: Following our diagnosis of the causes of the slow development of glycobioinformatics, we review the "objective" difficulties encountered in defining adequate formats for representing complex entities and developing efficient analysis software. RESULTS: Various solutions already implemented and strategies defined to bridge glycobiology with different fields and integrate the heterogeneous glyco-related information are presented. CONCLUSIONS: Despite the initial stage of our integrative efforts, this paper highlights the rapid expansion of glycomics, the validity of existing resources and the bright future of glycobioinformatics.
SUMMARY: The development of robust high-performance liquid chromatography (HPLC) technologies continues to improve the detailed analysis and sequencing of glycan structures released from glycoproteins. Here, we present a database (GlycoBase) and analytical tool (autoGU) to assist the interpretation and assignment of HPLC-glycan profiles. GlycoBase is a relational database which contains the HPLC elution positions for over 350 2-AB labelled N-glycan structures together with predicted products of exoglycosidase digestions. AutoGU assigns provisional structures to each integrated HPLC peak and, when used in combination with exoglycosidase digestions, progressively assigns each structure automatically based on the footprint data. These tools are potentially very promising and facilitate basic research as well as the quantitative high-throughput analysis of low concentrations of glycans released from glycoproteins. AVAILABILITY: http://glycobase.ucd.ie.
Coot is a tool widely used for model building, refinement, and validation of macromolecular structures. It has been extensively used for crystallography and, more recently, improvements have been introduced to aid in cryo-EM model building and refinement, as cryo-EM structures with resolution ranging from 2.5 to 4 Å are now routinely available. Model building into these maps can be time-consuming and requires experience in both biochemistry and building into low-resolution maps. To simplify and expedite the model building task, and minimize the needed expertise, new tools are being added in Coot. Some examples include morphing, Geman-McClure restraints, full-chain refinement, and Fourier-model based residue-type-specific Ramachandran restraints. Here, we present the current state-of-the-art in Coot usage.
We describe the development, current features, and some directions for future development of the Amber package of computer programs. This package evolved from a program that was constructed in the late 1970s to do Assisted Model Building with Energy Refinement, and now contains a group of programs embodying a number of powerful tools of modern computational chemistry, focused on molecular dynamics and free energy calculations of proteins, nucleic acids, and carbohydrates.
Recent years have seen an increase in both the development and use of informatics tools and databases in glycobiology-based research. Mass spectrometric methods, which are capable of detecting oligosaccharides in the low pico- to femtomole range, are fundamental technologies used in glycan analysis. The availability of robust and reliable algorithms to automatically interpret MS spectra is critical to many glycomic projects. Unfortunately, the current state-of-the-art in glycoinformatics is characterized by the existence of disconnected and incompatible islands of experimental data, resources, and proprietary applications. The development of tools for the robust automatic assignment of glycans on the basis of MS measurements is often hampered by the paucity of available MS data. Here, we review the methodologies for semi‐automatic interpretation of MS spectra of glycans, based upon current technology. Three promising approaches are highlighted: (a) combinatorial approaches to the automatic assignment of possible monosaccharide superclass composition—Glyco‐Peakfinder, (b) the scoring of a set of identified structures with theoretically calculated fragments–GlycoWorkbench and (c) the correlation of experimental masses to a database of theoretical fragment masses, in a technique known as Glycofragment Mass Fingerprinting.
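The combinatorial approach (a) in the abstract above can be sketched as a brute-force search over monosaccharide counts. This is a minimal illustration, not the Glyco-Peakfinder implementation: the function name `candidate_compositions`, the four residue classes, the count limit and the tolerance are all assumptions chosen for the example. The residue masses are standard monoisotopic values, and the input is taken to be the neutral mass of a free, underivatized glycan.

```python
from itertools import product

# Monoisotopic residue masses (monosaccharide minus water), in Da.
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}
WATER = 18.0106  # one water is added back for the free reducing end


def candidate_compositions(observed_mass, max_count=6, tol=0.01):
    """Return every composition whose neutral mass is within tol of observed_mass.

    Hypothetical helper: counts of each residue class run from 0 to max_count.
    """
    names = list(RESIDUE_MASS)
    hits = []
    for counts in product(range(max_count + 1), repeat=len(names)):
        mass = WATER + sum(n * RESIDUE_MASS[name]
                           for n, name in zip(counts, names))
        if abs(mass - observed_mass) <= tol:
            # Keep only the residue classes that are actually present.
            hits.append({name: n for name, n in zip(names, counts) if n})
    return hits
```

For example, `candidate_compositions(910.3278)` includes `{"Hex": 3, "HexNAc": 2}`, the composition of the N-glycan core. A mass alone can match several compositions, which is why such assignments are described as semi-automatic.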
Mass spectrometry is the main analytical technique currently used to address the challenges of glycomics as it offers unrivalled levels of sensitivity and the ability to handle complex mixtures of different glycan variations. Determination of glycan structures from analysis of MS data is a major bottleneck in high-throughput glycomics projects, and robust solutions to this problem are of critical importance. However, all the approaches currently available have inherent restrictions to the type of glycans they can identify, and none of them have proved to be a definitive tool for glycomics. GlycoWorkbench is a software tool developed by the EUROCarbDB initiative to assist the manual interpretation of MS data. The main task of GlycoWorkbench is to evaluate a set of structures proposed by the user by matching the corresponding theoretical list of fragment masses against the list of peaks derived from the spectrum. The tool provides an easy to use graphical interface, a comprehensive and increasing set of structural constituents, an exhaustive collection of fragmentation types, and a broad list of annotation options. The aim of GlycoWorkbench is to offer complete support for the routine interpretation of MS data. The software is available for download from: http://www.eurocarbdb.org/applications/ms-tools.
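The matching step described in the abstract above, comparing theoretical fragment masses with the peak list, can be sketched as follows. This is an illustrative outline only, not GlycoWorkbench code: the function names `annotate_peaks` and `coverage`, the tolerance, and the fragment labels and m/z values in the usage note are hypothetical.

```python
def annotate_peaks(peaks, fragments, tol=0.5):
    """Attach to each observed peak the labels of fragments within tol of it.

    peaks: list of observed m/z values.
    fragments: dict mapping a fragment label to its theoretical m/z.
    Returns a list of (observed m/z, [matching labels]) pairs.
    """
    annotations = []
    for mz in peaks:
        matches = [label for label, theo in fragments.items()
                   if abs(theo - mz) <= tol]
        annotations.append((mz, matches))
    return annotations


def coverage(annotations):
    """Fraction of peaks explained by at least one theoretical fragment."""
    if not annotations:
        return 0.0
    matched = sum(1 for _, labels in annotations if labels)
    return matched / len(annotations)
```

With the made-up inputs `peaks = [204.1, 300.0, 222.0]` and `fragments = {"B1": 204.09, "Y1": 222.10}`, the first and third peaks are annotated and the coverage is 2/3. Ranking several proposed structures by such a coverage value is one simple way to compare them against a spectrum.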
SUMMARY: This manuscript describes an open-source program, DrawGlycan-SNFG (version 2), that accepts IUPAC (International Union of Pure and Applied Chemistry)-condensed inputs to render Symbol Nomenclature For Glycans (SNFG) drawings. A wide range of local and global options enable display of various glycan/peptide modifications including bond breakages, adducts, repeat structures, ambiguous identifications, etc. These facilities make DrawGlycan-SNFG ideal for integration into various glycoinformatics software, including glycomics and glycoproteomics mass spectrometry applications. As a demonstration of such usage, we incorporated DrawGlycan-SNFG into gpAnnotate, a standalone application to score and annotate individual MS/MS glycopeptide spectra in different fragmentation modes. AVAILABILITY AND IMPLEMENTATION: DrawGlycan-SNFG and gpAnnotate are platform independent. While originally coded using MATLAB, compiled packages are also provided to enable DrawGlycan-SNFG implementation in Python and Java. All programs are available from https://virtualglycome.org/drawglycan; https://virtualglycome.org/gpAnnotate. SUPPLEMENTARY INFORMATION: Supplementary Material is available at Bioinformatics online.
Glycan or carbohydrate structures can be pictorially represented using symbolic nomenclatures. The symbol nomenclature for glycans (SNFG) contains 67 different monosaccharides represented using various colors and geometric shapes. A simple tool to convert International Union of Pure and Applied Chemistry (IUPAC) format text to SNFG will be useful for sketching glycans and glycopeptides. Such code can also enable the development of more sophisticated applications, where the visual representation of carbohydrate structures is necessary. To address this need, the current manuscript describes DrawGlycan-SNFG, a freely available, platform-independent, open-source tool. It allows: i. the display of glycans and glycopeptides from IUPAC-condensed text inputs and ii. the depiction of glycan and glycopeptide fragments. The online version of this program is provided with a user-friendly web interface at www.virtualglycome.org/DrawGlycan. Downloadable, stand-alone GUI (Graphical User Interface) version and the program source code are also available from this repository. DrawGlycan-SNFG will be useful for experimentalists looking for a ready to use, simple program for sketching carbohydrates and for software developers interested in incorporating SNFG into their program suite.
MOTIVATION: Glycans and glycoconjugates are usually recorded in dedicated databases in residue-based notations. Only a few of them can be converted into chemical (atom-based) formats highly demanded in conformational and biochemical studies. In this work, we present a tool for translation from a residue-based glycan notation to SMILES. RESULTS: The REStLESS algorithm for translation from the CSDB Linear notation to SMILES was developed. REStLESS stands for ResiduEs as Smiles and LinkagEs as SmartS, where SMARTS reaction expressions are used to merge pre-encoded residues into a molecule. The implementation supports virtually all structural features reported in natural carbohydrates and glycoconjugates. The translator is equipped with a mechanism for conversion of SMILES strings into optimized atomic coordinates which can be used as starting geometries for various computational tasks. AVAILABILITY AND IMPLEMENTATION: REStLESS is integrated in the Carbohydrate Structure Database (CSDB) and is freely available on the web (http://csdb.glycoscience.ru/csdb2atoms.html). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
no abstract.
Mammalian glycosaminoglycans are linear complex polysaccharides comprising heparan sulfate, heparin, dermatan sulfate, chondroitin sulfate, keratan sulfate and hyaluronic acid. They bind to numerous proteins and these interactions mediate their biological activities. GAG-protein interaction data reported in the literature are curated mostly in MatrixDB database (http://matrixdb.univ-lyon1.fr/). However, a standard nomenclature and a machine-readable format of GAGs together with bioinformatics tools for mining these interaction data are lacking. We report here the building of an automated pipeline to (i) standardize the format of GAG sequences interacting with proteins manually curated from the literature, (ii) translate them into the machine-readable GlycoCT format and into SNFG (Symbol Nomenclature For Glycan) images and (iii) convert their sequences into a format processed by a builder generating three-dimensional structures of polysaccharides based on a repertoire of conformations experimentally validated by data extracted from crystallized GAG-protein complexes. We have developed for this purpose a converter (the CT23D converter) to automatically translate the GlycoCT code of a GAG sequence into the input file required to construct a three-dimensional model.
GlycoMod (http://www.expasy.ch/tools/glycomod/) is a software tool designed to find all possible compositions of a glycan structure from its experimentally determined mass. The program can be used to predict the composition of any glycoprotein-derived oligosaccharide comprised of either underivatised, methylated or acetylated monosaccharides, or with a derivatised reducing terminus. The composition of a glycan attached to a peptide can be computed if the sequence or mass of the peptide is known. In addition, if the protein is known and is contained in the SWISS-PROT or TrEMBL databases, the program will match the experimentally determined masses against all the predicted protease-produced peptides (including any post-translational modifications annotated in these databases) which have the potential to be glycosylated with either N- or O-linked oligosaccharides. Since many possible glycan compositions can be generated from the same mass, the program can apply compositional constraints to the output if the user supplies either known or suspected monosaccharide constituents. Furthermore, known oligosaccharide structural constraints on monosaccharide composition are also incorporated into the program to limit the output.
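The mass-to-composition search described in the GlycoMod abstract above can be illustrated with a minimal sketch. This is not GlycoMod's implementation: the function name `compositions_for_mass`, the brute-force enumeration, the default tolerance and the four-residue alphabet are assumptions chosen for illustration. The residue masses are standard monoisotopic values for underivatised monosaccharide residues, and the free glycan is assumed to carry one additional water.

```python
from itertools import product

# Monoisotopic masses (Da) of underivatised monosaccharide residues.
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}
WATER = 18.0106  # added once for a free (unreduced) glycan


def compositions_for_mass(target, tol=0.02, max_count=10):
    """Return all residue-count combinations whose mass is within tol of target.

    Hypothetical helper: exhaustive enumeration, no biosynthetic constraints.
    """
    names = list(RESIDUE_MASS)
    hits = []
    for counts in product(range(max_count + 1), repeat=len(names)):
        mass = WATER + sum(n * RESIDUE_MASS[name] for n, name in zip(counts, names))
        if abs(mass - target) <= tol:
            hits.append(dict(zip(names, counts)))
    return hits


if __name__ == "__main__":
    # 1234.4334 Da corresponds to Hex5HexNAc2 (a Man5 N-glycan).
    for composition in compositions_for_mass(1234.4334):
        print(composition)
```

Several compositions can fall within the tolerance of a single mass, which is why the abstract stresses compositional and structural constraints; in this sketch such constraints would be applied as a filter on the returned list.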
We report the addition of two visualisation algorithms, termed PaperChain and Twister, to the freely available Visual Molecular Dynamics (VMD) package. These algorithms produce visualisations of complex cyclic molecules and multi-branched polysaccharides and are a generalization and optimization of those we previously developed in a standalone package for carbohydrates. PaperChain highlights each ring in a molecular structure with a polygon, which is coloured according to the ring pucker. Twister traces glycosidic bonds with a ribbon that twists according to the relative orientation of successive sugar residues. Combination of these novel algorithms and new ring selection statements with the large set of visualisations already available in VMD allows for unprecedented flexibility in the level of detail displayed for glycoconjugate, glycoprotein and carbohydrate-binding protein structures, as well as other cyclic structures. We highlight the efficacy of these algorithms with selected illustrative examples, clearly demonstrating the value of the new visualisations, not only for structure validation, but also for facilitating insights into molecular structure and mechanism.
During the EUROCarbDB project our group developed the GlycanBuilder and GlycoWorkbench glycoinformatics tools. This short communication summarizes the capabilities of these two tools and updates which have been made since the original publications in 2007 and 2008. GlycanBuilder is a tool that allows for the fast and intuitive drawing of glycan structures; this tool can be used standalone, embedded in web pages and can also be integrated into other programs. GlycoWorkbench has been designed to semi-automatically annotate glycomics data. This tool can be used to annotate mass spectrometry (MS) and MS/MS spectra of free oligosaccharides, N and O-linked glycans, GAGs (glycosaminoglycans) and glycolipids, as well as MS spectra of glycoproteins.
Carbohydrates constitute a structurally and functionally diverse group of biological molecules and macromolecules. In cells they are involved in, e.g., energy storage, signaling, and cell-cell recognition. All of these phenomena take place in atomistic scales, thus atomistic simulation would be the method of choice to explore how carbohydrates function. However, the progress in the field is limited by the lack of appropriate tools for preparing carbohydrate structures and related topology files for the simulation models. Here we present tools that fill this gap. Applications where the tools discussed in this paper are particularly useful include, among others, the preparation of structures for glycolipids, nanocellulose, and glycans linked to glycoproteins. The molecular structures and simulation files generated by the tools are compatible with GROMACS.
The carbohydrate fraction of most mammalian milks contains a variety of oligosaccharides that encompass a range of structures and monosaccharide compositions. Human milk oligosaccharides have received considerable attention due to their biological roles in neonatal gut microbiota, immunomodulation, and brain development. However, a major challenge in understanding the biology of milk oligosaccharides across other mammals is that reports span more than 5 decades of publications with varying data reporting methods. In the present study, publications on milk oligosaccharide profiles were identified and harmonized into a standardized format to create a comprehensive, machine-readable database of milk oligosaccharides across mammalian species. The resulting database, MilkOligoDB, includes 3193 entries for 783 unique oligosaccharide structures from the milk of 77 different species harvested from 113 publications. Cross-species and cross-publication comparisons of milk oligosaccharide profiles reveal common structural motifs within mammalian orders. Of the species studied, only chimpanzees, bonobos, and Asian elephants share the specific combination of fucosylation, sialylation, and core structures that are characteristic of human milk oligosaccharides. However, agriculturally important species do produce diverse oligosaccharides that may be valuable for human supplementation. Overall, MilkOligoDB facilitates cross-species and cross-publication comparisons of milk oligosaccharide profiles and the generation of new data-driven hypotheses for future research.
Most currently available glycan structure databases use their own proprietary structure representation schema and contain numerous annotation errors. These cause problems when glycan databases are used for the annotation or mining of data generated in the laboratory. Due to the complexity of glycan structures, curating these databases is often a tedious and labor-intensive process. However, rigorously validating glycan structures can be made easier with a curation workflow that incorporates a structure-matching algorithm that compares candidate glycans to a canonical tree that embodies structural features consistent with established mechanisms for the biosynthesis of a particular class of glycans. To this end, we have implemented Qrator, a web-based application that uses a combination of external literature and database references, user annotations and canonical trees to assist and guide researchers in making informed decisions while curating glycans. Using this application, we have started the curation of large numbers of N-glycans, O-glycans and glycosphingolipids. Our curation workflow allows creating and extending canonical trees for these classes of glycans, which have subsequently been used to improve the curation workflow.
The frequency of glycosylated protein 3D structures in the Protein Data Bank (PDB) is significantly lower than the proportion of glycoproteins in nature, and if glycan 3D structures are present, then they often exhibit a large degree of errors. There are various reasons for this, one of which is a comparably low support of carbohydrates in software tools for 3D structure determination and validation. This chapter illustrates the current features that assist crystallographers with handling glycans during 3D structure determination in Coot and CNS and with validation of the results.
Coot is a graphics application that is used to build or manipulate macromolecular models; its particular forte is manipulation of the model at the residue level. The model-building tools of Coot have been combined and extended to assist or automate the building of N-linked glycans. The model is built by the addition of monosaccharides, placed by variation of internal coordinates. The subsequent model is refined by real-space refinement, which is stabilized with modified and additional restraints. It is hoped that these enhanced building tools will help to reduce building errors of N-linked glycans and improve our knowledge of the structures of glycoproteins.
This article describes an update of POLYS, the POLYSaccharide builder, for generating three-dimensional structures of polysaccharides and complex carbohydrates (Engelsen et al., Biopolymers 1996, 39, 417-433). POLYS is written in portable ANSI C and is now released under an open source license. Using this software, complex branched carbohydrate structures and polysaccharides can be constructed from their primary structure and the relevant monosaccharides stored in database containing information on optimized glycosidic linkage geometries. The constructed three-dimensional structures are described as Cartesian coordinate files which can be used as input to other molecular modeling software. The new version of POLYS includes a large database of monosaccharides and a helical generator to build and optimize regular single helix or double helix structures. To demonstrate the efficiency of POLYS to build carbohydrate structures, four examples of increasing complexity are presented in the manuscript, from simple alpha glucans over complex starch fragments and the double helical structure of amylopectin to the mega-oligosaccharide RhamnoGalacturonan II.
Recent advances in single-particle cryo-electron microscopy (cryoEM) have resulted in determination of an increasing number of protein structures with resolved glycans. However, existing protocols for the refinement of glycoproteins at low resolution have failed to keep up with these advances. As a result, numerous deposited structures contain glycan stereochemical errors. Here, we describe a Rosetta-based approach for both cryoEM and X-ray crystallography refinement of glycoproteins that is capable of correcting conformational and configurational errors in carbohydrates. Building upon a previous Rosetta framework, we introduced additional features and score terms enabling automatic detection, setup, and refinement of glycan-containing structures. We benchmarked this approach using 12 crystal structures and showed that glycan geometries can be automatically improved while maintaining good fit to the crystallographic data. Finally, we used this method to refine carbohydrates of the human coronavirus NL63 spike glycoprotein and of an HIV envelope glycoprotein, demonstrating its usefulness for cryoEM refinement.
no abstract.
Bioinformatics approaches to carbohydrate research have recently begun using large amounts of protein and carbohydrate data. In this field called glycome informatics, the foremost necessity is a comprehensive resource for genome-scale bioinformatics analysis of glycan data. Although the accumulation of experimental data may be useful as a reference of biological and biochemical information on carbohydrates, this is insufficient for bioinformatics analysis. Thus, we have developed a glycome informatics resource (http://www.genome.jp/kegg/glycan/) in KEGG (Kyoto Encyclopedia of Genes and Genomes), an integrated knowledge base of protein networks, genomic information, and chemical information. This review describes three noteworthy features: (1) GLYCAN, a database of carbohydrate structures; (2) glycan-related pathways; and (3) Composite Structure Map (CSM), a map illustrating all possible variations of carbohydrate structures within organisms. GLYCAN includes two useful tools: an intuitive drawing tool called KegDraw, and an efficient glycan search and alignment tool called KEGG Carbohydrate Matcher (KCaM). KEGG's glycan biosynthesis and metabolism pathways, integrating carbohydrate structures, proteins, and reactions, are also a pivotal resource. CSM is constructed as a bridge between carbohydrate functions and structures. CSM is able to display, for example, expression data of glycosyltransferases in a compact manner. In all the KEGG resources, various objects including KEGG pathways, chemical compounds, as well as carbohydrate structures are commonly represented as graphs, which are widely studied and utilized in the computer science field.
The following sections are included: - Introduction - Glycan Databases — Current - Glycan Databases — Historical - Glycan Tools (Computational Applications) - MIRAGE Initiative and the Definition of an Exchange Format between Glycan Databases - Glycan Data Formats.
VMD is a molecular graphics program designed for the display and analysis of molecular assemblies, in particular biopolymers such as proteins and nucleic acids. VMD can simultaneously display any number of structures using a wide variety of rendering styles and coloring methods. Molecules are displayed as one or more "representations," in which each representation embodies a particular rendering method and coloring scheme for a selected subset of atoms. The atoms displayed in each representation are chosen using an extensive atom selection syntax, which includes Boolean operators and regular expressions. VMD provides a complete graphical user interface for program control, as well as a text interface using the Tcl embeddable parser to allow for complex scripts with variable substitution, control loops, and function calls. Full session logging is supported, which produces a VMD command script for later playback. High-resolution raster images of displayed molecules may be produced by generating input scripts for use by a number of photorealistic image-rendering applications. VMD has also been expressly designed with the ability to animate molecular dynamics (MD) simulation trajectories, imported either from files or from a direct connection to a running MD simulation. VMD is the visualization component of MDScope, a set of tools for interactive problem solving in structural biology, which also includes the parallel MD program NAMD, and the MDCOMM software used to connect the visualization and simulation programs. VMD is written in C++, using an object-oriented design; the program, including source code and extensive documentation, is freely available via anonymous ftp and through the World Wide Web.
The computer program CASPER and its algorithms are described. The program is aimed at facilitating the determination of structures of oligosaccharides and regular polysaccharides, requiring as input either the one-dimensional 1H or 13C NMR spectrum or the 2D C,H-correlation NMR spectrum together with information on components and linkages. The databases, the method of simulating spectra, options of the program, and techniques for faster calculations are described as well as an example of a structural determination.
A WWW-interface to a program for structure elucidation of oligo- and polysaccharides using NMR data, CASPER, is presented. The interface and the underlying program have been extensively tested using published data and it was able to simulate 13C NMR spectra of >200 structures with an average error of about 0.3 ppm/resonance. When applied to the repeating units of Escherichia coli O-antigens the published structures were found among the five highest ranked structures in 75% of the cases. The average deviation between calculated and experimental 13C chemical shifts was 0.45 ppm. Oligosaccharide spectra were calculated with even better accuracy (0.23 ppm/resonance) and the correct structure was ranked 1st or 2nd in all the cases examined. Additional NMR experiments that may be required to distinguish between candidate structures are aided by the assignments provided by the program. This computational approach is also suitable for use in structural confirmation of chemically or enzymatically synthesized oligosaccharides. The program is found at http://www.casper.organ.su.se/casper.
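The ranking of candidate structures by agreement between calculated and experimental 13C shifts, as reported in the CASPER abstract above, can be sketched as follows. This is a simplified illustration and not CASPER's algorithm: pairing the two peak lists by sorted order is an assumed heuristic for unassigned spectra, the function names are hypothetical, and the shift values in the usage example are invented placeholders rather than measured data.

```python
def mean_abs_deviation(simulated, experimental):
    """Mean absolute deviation (ppm/resonance) between two unassigned peak lists.

    Assumption: peaks are paired by sorted order, which requires equal lengths.
    """
    if len(simulated) != len(experimental):
        raise ValueError("peak lists must contain the same number of resonances")
    pairs = zip(sorted(simulated), sorted(experimental))
    return sum(abs(s - e) for s, e in pairs) / len(experimental)


def rank_candidates(candidates, experimental):
    """Return candidate names ordered from best to worst agreement."""
    scored = [
        (mean_abs_deviation(shifts, experimental), name)
        for name, shifts in candidates.items()
    ]
    return [name for _, name in sorted(scored)]


if __name__ == "__main__":
    # Placeholder values for illustration only (not real spectra).
    experimental = [100.0, 75.0, 61.5]
    candidates = {
        "candidate_A": [61.4, 75.2, 100.1],
        "candidate_B": [62.5, 77.0, 103.0],
    }
    print(rank_candidates(candidates, experimental))
```

A per-resonance deviation of this kind is the same type of figure the abstract quotes (for example 0.45 ppm for the O-antigen repeating units), although the published values come from CASPER's own simulation and assignment procedure.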
The computer program CASPER, used in the structural analysis of polysaccharides composed of repeating units, has been extended. The extended version uses either unassigned 1H- or 13C-n.m.r. chemical shifts or the complete unassigned C,H-correlation spectrum, and can predict the structure of linear and branched oligo- and poly-saccharides. The number of possible structures, consistent with sugar and methylation analysis, can be decreased by the use of 1JC,H and 3JH,H values. The database, which contains 1H- or 13C-n.m.r. chemical shift data for monosaccharides and 1H- or 13C-glycosylation shifts for all types of glycosidic linkages obtained by combination of the monosaccharides, has been increased and now also contains correction values for sugar residues present in branch-point regions. The program has been tested on four polysaccharides of known structure but with different degrees of complexity. For three polysaccharides, the correct structure was suggested; for the fourth, two structures were consistent with the n.m.r. data, one of them being correct.
The CHARMM-GUI Membrane Builder (http://www.charmm-gui.org/input/membrane), an intuitive, straightforward, web-based graphical user interface, was expanded to automate the building process of heterogeneous lipid bilayers, with or without a protein and with support for up to 32 different lipid types. The efficacy of these new features was tested by building and simulating lipid bilayers that resemble yeast membranes, composed of cholesterol, dipalmitoylphosphatidylcholine, dioleoylphosphatidylcholine, palmitoyloleoylphosphatidylethanolamine, palmitoyloleoylphosphatidylamine, and palmitoyloleoylphosphatidylserine. Four membranes with varying concentrations of cholesterol and phospholipids were simulated, for a total of 170 ns at 303.15 K. Unsaturated phospholipid chain concentration had the largest influence on membrane properties, such as average lipid surface area, density profiles, deuterium order parameters, and cholesterol tilt angle. Simulations with a high concentration of unsaturated chains (73%, membrane(unsat)) resulted in a significant increase in lipid surface area and a decrease in deuterium order parameters, compared with membranes with a high concentration of saturated chains (60-63%, membrane(sat)). The average tilt angle of cholesterol with respect to bilayer normal was largest, and the distribution was significantly broader for membrane(unsat). Moreover, short-lived cholesterol orientations parallel to the membrane surface existed only for membrane(unsat). The membrane(sat) simulations were in a liquid-ordered state, and agree with similar experimental cholesterol-containing membranes.
Understanding how glycosylation affects protein structure, dynamics, and function is an emerging and challenging problem in biology. As a first step toward glycan modeling in the context of structural glycobiology, we have developed Glycan Reader and integrated it into the CHARMM-GUI, http://www.charmm-gui.org/input/glycan. Glycan Reader greatly simplifies the reading of PDB structure files containing glycans through (i) detection of carbohydrate molecules, (ii) automatic annotation of carbohydrates based on their three-dimensional structures, (iii) recognition of glycosidic linkages between carbohydrates as well as N-/O-glycosidic linkages to proteins, and (iv) generation of inputs for the biomolecular simulation program CHARMM with the proper glycosidic linkage setup. In addition, Glycan Reader is linked to other functional modules in CHARMM-GUI, allowing users to easily generate carbohydrate or glycoprotein molecular simulation systems in solution or membrane environments and visualize the electrostatic potential on glycoprotein surfaces. These tools are useful for studying the impact of glycosylation on protein structure and dynamics.
The GlycoViewer (http://www.systemsbiology.org.au/glycoviewer) is a web-based tool that can visualize, summarize and compare sets of glycan structures. Its input is a group of glycan structures; these can be entered as a list in IUPAC format or via a sugar structure builder. Its output is a detailed graphic, which summarizes all salient features of the glycans according to the shapes of the core structures, the nature and length of any chains, and the types of terminal epitopes. The tool can summarize up to hundreds of structures in a single figure. This allows unique, high-level views to be generated of glycans from one protein, from a cell, a tissue or a whole organism. Use of the tool is illustrated in the analysis of normal and disease-associated glycans from the human glycoproteome.
Motivation: Carbohydrates play crucial roles in various biochemical processes and are useful for developing drugs and vaccines. However, in case of carbohydrates, the primary structure elucidation is usually a sophisticated task. Therefore, they remain the least structurally characterized class of biomolecules, and it hampers the progress in glycochemistry and glycobiology. Creating a usable instrument designed to assist researchers in natural carbohydrate structure determination would advance glycochemistry in biomedical and pharmaceutical applications. Results: We present GRASS (Generation, Ranking and Assignment of Saccharide Structures), a novel method for semi-automated elucidation of carbohydrate and derivative structures which uses unassigned 13C NMR spectra and information obtained from chromatography, optical, chemical and other methods. This approach is based on new methods of carbohydrate NMR simulation recently reported as the most accurate. It combines a broad diversity of supported structural features, high accuracy and performance. Availability: GRASS is implemented in a free web tool available at http://csdb.glycoscience.ru/grass.html. Contact: kapaev_roman@mail.ru or netbox@toukach.ru. Supplementary information: Supplementary data are available at Bioinformatics online.
Glycan Optimized Dual Empirical Spectrum Simulation (GODESS) is a web service, which has been recently shown to be one of the most accurate tools for simulation of 1H and 13C 1D NMR spectra of natural carbohydrates and their derivatives. The new version of GODESS supports visualization of the simulated 1H and 13C chemical shifts in the form of most 2D spin correlation spectra commonly used in carbohydrate research, such as 1H-1H TOCSY, COSY/COSY-DQF/COSY-RCT, and 1H-13C edHSQC, HSQC-COSY, HSQC-TOCSY, and HMBC. Peaks in the simulated 2D spectra are color-coded and labeled according to the signal assignment and can be exported in JCAMP-DX format. Peak widths are estimated empirically from the structural features. GODESS is available free of charge via the Internet at the platform of the Carbohydrate Structure Database project (http://csdb.glycoscience.ru).
The Arabidopsis Information Portal (https://www.araport.org) is a new online resource for plant biology research. It houses the Arabidopsis thaliana genome sequence and associated annotation. It was conceived as a framework that allows the research community to develop and release 'modules' that integrate, analyze and visualize Arabidopsis data that may reside at remote sites. The current implementation provides an indexed database of core genomic information. These data are made available through feature-rich web applications that provide search, data mining, and genome browser functionality, and also by bulk download and web services. Araport uses software from the InterMine and JBrowse projects to expose curated data from TAIR, GO, BAR, EBI, UniProt, PubMed and EPIC CoGe. The site also hosts 'science apps,' developed as prototypes for community modules that use dynamic web pages to present data obtained on-demand from third-party servers via RESTful web services. Designed for sustainability, the Arabidopsis Information Portal strategy exploits existing scientific computing infrastructure, adopts a practical mixture of data integration technologies and encourages collaborative enhancement of the resource by its user community.
Micelles play an important role in both experimental and computational studies of the effect of lipid interactions on biological systems. The spherical geometry and the dynamical behavior of micelles make generating micelle structures for use in molecular simulations challenging. An easy tool for generating simulation-ready micelle models, covering a broad range of lipids, is highly desirable. Here, we present a new Web server, Micelle Maker, which can provide equilibrated micelle models as a direct input for subsequent molecular dynamics simulations from a broad range of lipids (currently 25 lipid types, including 24 glycolipids). The Web server, which is available at http://www.micellemaker.net, uses error checking routines to prevent clashes during the initial placement of the lipids and uses AMBER's GLYCAM library for generating minimized or equilibrated micelle models, but the resulting structures can be used as starting points for simulations with any force field or simulation package. Extensive validation simulations with an overall simulation time of 12 μs using eight micelle models where assembly information is available show that all of the micelles remain very stable over the whole simulation time. Finally, we discuss the advantages of Micelle Maker relative to other approaches in the field.
CarbBuilder is a portable software tool for producing three-dimensional molecular models of carbohydrates from the simple text specification of a primary structure. CarbBuilder can generate a wide variety of carbohydrate structures, ranging from monosaccharides to large, branched polysaccharides. Version 2.0 of the software, described in this article, supports monosaccharides of both mammalian and bacterial origin and a range of substituents for derivatization of individual sugar residues. This improved version has a sophisticated building algorithm to explore the range of possible conformations for a specified carbohydrate molecule. Illustrative examples of models of complex polysaccharides produced by CarbBuilder demonstrate the capabilities of the software. CarbBuilder is freely available under the Artistic License 2.0 from https://people.cs.uct.ac.za/~mkuttel/Downloads.html.
Main characteristics are described of the PRIRODA quantum-chemical program suite designed for the study of complex molecular systems by the density functional theory, at the MP2, MP3, and MP4 levels of multiparticle perturbation theory, and by the coupled-cluster single and double excitations method (CCSD) with the application of parallel computing. A number of examples of calculations are presented.
The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature.
MOTIVATION: Glycans play critical roles in many biological processes, and their structural diversity is key for specific protein-glycan recognition. Comparative structural studies of biological molecules provide useful insight into their biological relationships. However, most computational tools are designed for protein structure, and despite their importance, there is no currently available tool for comparing glycan structures in a sequence order- and size-independent manner. RESULTS: A novel method, GS-align, is developed for glycan structure alignment and similarity measurement. GS-align generates possible alignments between two glycan structures through iterative maximum clique search and fragment superposition. The optimal alignment is then determined by the maximum structural similarity score, GS-score, which is size-independent. Benchmark tests against the Protein Data Bank (PDB) N-linked glycan library and PDB homologous/non-homologous N-glycoprotein sets indicate that GS-align is a robust computational tool to align glycan structures and quantify their structural similarity. GS-align is also applied to template-based glycan structure prediction and monosaccharide substitution matrix generation to illustrate its utility. AVAILABILITY AND IMPLEMENTATION: http://www.glycanstructure.org/gsalign. CONTACT: wonpil@ku.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
BACKGROUND: Carbohydrates are a class of large and diverse biomolecules, ranging from a simple monosaccharide to large multi-branching glycan structures. The covalent linkage of a carbohydrate to the nitrogen atom of an asparagine, a process referred to as N-linked glycosylation, plays an important role in the physiology of many living organisms. Most software for glycan modeling on a personal desktop computer requires knowledge of molecular dynamics to interface with specialized programs such as CHARMM or AMBER. There are a number of popular web-based tools that are available for modeling glycans (e.g., GLYCAM-WEB (https://dev.glycam.org/gp/) or Glycosciences.db (http://www.glycosciences.de/)). However, these web-based tools are generally limited to a few canonical glycan conformations and do not allow the user to incorporate glycan modeling into their protein structure modeling workflow. RESULTS: Here, we present Glycosylator, a Python framework for the identification, modeling and modification of glycans in protein structure that can be used directly in a Python script through its application programming interface (API) or through its graphical user interface (GUI). The GUI provides a straightforward two-dimensional (2D) rendering of a glycoprotein that allows for a quick visual inspection of the glycosylation state of all the sequons on a protein structure. Modeled glycans can be further refined by a genetic algorithm for removing clashes and sampling alternative conformations. Glycosylator can also identify specific three-dimensional (3D) glycans on a protein structure using a library of predefined templates. CONCLUSIONS: Glycosylator was used to generate models of glycosylated protein without steric clashes. Since the molecular topology is based on the CHARMM force field, new complex sugar moieties can be generated without modifying the internals of the code. Glycosylator provides more functionality for analyzing and modeling glycans than any other available software or webserver at present. Glycosylator will be a valuable tool for the glycoinformatics and biomolecular modeling communities.
Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. In addition to classical tree viewer functions, iTOL offers many novel ways of annotating trees with various additional data. The current version introduces numerous new features and greatly expands the number of supported data set types. Trees can be interactively manipulated and edited. A free personal account system is available, providing management and sharing of trees in user-defined workspaces and projects. Export to various bitmap and vector graphics formats is supported. A batch access interface is available for programmatic access or inclusion of interactive trees into other web services.
The past decade has witnessed the modern advances of high-throughput technology and rapid growth of research capacity in producing large-scale biological data, both of which were concomitant with an exponential growth of biomedical literature. This wealth of scholarly knowledge is of significant importance for researchers in making scientific discoveries and healthcare professionals in managing health-related matters. However, the acquisition of such information is becoming increasingly difficult due to its large volume and rapid growth. In response, the National Center for Biotechnology Information (NCBI) is continuously making changes to its PubMed Web service for improvement. Meanwhile, different entities have devoted themselves to developing Web tools for helping users quickly and efficiently search and retrieve relevant publications. These practices, together with maturity in the field of text mining, have led to an increase in the number and quality of various Web tools that provide comparable literature search service to PubMed. In this study, we review 28 such tools, highlight their respective innovations, compare them to the PubMed system and one another, and discuss directions for future development. Furthermore, we have built a website dedicated to tracking existing systems and future advances in the field of biomedical literature search. Taken together, our work serves information seekers in choosing tools for their needs and service providers and developers in keeping current in the field. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/search.
Glycans are essential to all scales of biology, with their intricate structures being crucial for their biological functions. The structural complexity of glycans is communicated through simplified and unified visual representations according to the Symbol Nomenclature for Glycans (SNFG) guidelines adopted by the community. Here, we introduce GlycoDraw, a Python-native implementation for high-throughput generation of high-quality, SNFG-compliant glycan figures with flexible display options. GlycoDraw is released as part of our glycan analysis ecosystem, glycowork, facilitating integration into existing workflows by enabling fully automated annotation of glycan-related figures and thus assisting the analysis of e.g. differential abundance data or glycomics mass spectra.
The GLYCOSCIENCES.de web portal (www.glycosciences.de) combines various databases and applications related to glycobiology and glycomics. This chapter demonstrates the use of these resources to find a variety of information on the LewisX (LeX) epitope such as literature references, NMR data, or corresponding PDB entries. Several query options are presented that enable the users to find the information of interest from different point of views. The main focus of GLYCOSCIENCES.de is put on 3D structural data. Therefore, methods to create 3D structure models with the Sweet-II tool and to analyze conformational properties of LeX by usage of PDB data and conformational maps from GlycoMapsDB are also introduced in this chapter.
Various glycoinformatics resources developed their individual notations for encoding of glycan structure information. Therefore, translation of glycan structures is required for an efficient use of multiple resources and for data exchange. A major problem for translations is residue notation, because individual notations use different names to encode the same residues, and the number of different monosaccharides is too large to quickly build a translation table manually. MonosaccharideDB offers various means to perform translation of carbohydrate residue names. This chapter illustrates the usage of the MonosaccharideDB web interface for both manual and automated conversion and validation of glycan residues.
Mass spectrometric techniques are the key technology for rapid and reliable glycan analysis. However, the lack of robust, dependable, and freely available software for the (semi-) automatic annotation of mass spectra is still a severe bottleneck that hampers their rapid interpretation. In this article the "Glyco-Peakfinder" web-service is described allowing de novo determination of glycan compositions from their mass signals. Starting from a basic set of mandatory masses of glycan components, the calculation can be performed without any knowledge concerning the biological background of the sample or the fragmentation technique used. "Glyco-Peakfinder" assigns all types of fragment ions including monosaccharide cross-ring cleavage products and multiply charged ions. It provides full user control to handle modified glycans (persubstituted molecules, reducing-end modifications, glycoconjugates) and ion types. The formula applied to calculate the fragment masses and an outline of the implemented algorithm are discussed. A systematic evaluation of the dependence of all factors influencing the computation time revealed strikingly different impact of the individual calculation steps. To provide access to known carbohydrate structures a "composition search" in the open access database GLYCOSCIENCES.de can be performed. The service is available at the URL: www.eurocarbdb.org/applications/ms-tools.
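The idea of determining a glycan composition from a mass signal alone can be illustrated with a minimal sketch. This is not the Glyco-Peakfinder algorithm: it is a brute-force enumeration over four common building blocks, assuming a neutral, underivatized glycan with a free reducing end. The function name find_compositions, the tolerance and the upper residue count are hypothetical choices made for illustration; the residue masses are standard monoisotopic values.

```python
from itertools import product

# Monoisotopic residue masses in Da (monosaccharide minus one water).
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}
WATER = 18.0106  # added once for the free reducing end


def find_compositions(neutral_mass, tol=0.01, max_count=12):
    """Return every residue-count combination whose mass matches within tol (Da).

    Assumption: a neutral, underivatized glycan; adducts, charge states and
    modifications are deliberately ignored in this sketch.
    """
    names = list(RESIDUE_MASS)
    hits = []
    # Exhaustively try 0..max_count copies of each building block.
    for counts in product(range(max_count + 1), repeat=len(names)):
        mass = WATER + sum(n * RESIDUE_MASS[r] for n, r in zip(counts, names))
        if abs(mass - neutral_mass) <= tol:
            hits.append(dict(zip(names, counts)))
    return hits


if __name__ == "__main__":
    # 910.3278 Da is the neutral monoisotopic mass of Hex3HexNAc2.
    print(find_compositions(910.3278))
```

Exhaustive enumeration is only practical for a handful of building blocks; a real service has to handle ion types, modifications and cross-ring fragments, which this sketch does not attempt.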
BACKGROUND: High-throughput technologies have become common tools to decipher genome-wide changes of gene expression (GE) patterns. Functional analysis of GE patterns is a daunting task as it often requires recourse to the public repositories of biological knowledge. On the other hand, in many cases a researcher's inquiry can be served by a comprehensive glimpse. The KEGG PATHWAY database is a compilation of manually verified maps of biological interactions represented by the complete set of pathways related to signal transduction and other cellular processes. Rapid mapping of the differentially expressed genes to the KEGG pathways may provide an idea about the functional relevance of the gene lists corresponding to the high-throughput expression data. RESULTS: Here we present a web-based graphic tool, KEGG Pathway Painter (KPP). KPP paints pathways from the KEGG database using large sets of the candidate genes accompanied by "overexpressed" or "underexpressed" marks, for example, those generated by microarrays or miRNA profilings. CONCLUSION: KPP provides fast and comprehensive visualization of the global GE changes by consolidating a list of the color-coded candidate genes into the KEGG pathways. KPP is freely available and can be accessed at http://web.cos.gmu.edu/~gmanyam/kegg/.
Glycomics@ExPASy (https://www.expasy.org/glycomics) is the glycomics tab of ExPASy, the server of SIB Swiss Institute of Bioinformatics. It was created in 2016 to centralize web-based glycoinformatics resources developed within an international network of glycoscientists. The hosted collection currently includes mainly databases and tools created and maintained at SIB but also links to a range of reference resources popular in the glycomics community. The philosophy of our toolbox is that it should be {glycoscientist AND protein scientist}-friendly with the aim of (1) popularizing the use of bioinformatics in glycobiology and (2) emphasizing the relationship between glycobiology and protein-oriented bioinformatics resources. The scarcity of data bridging these two disciplines led us to design tools as interactive as possible based on database connectivity to facilitate data exploration and support hypothesis building. Glycomics@ExPASy was designed, and is developed, with a long-term vision in close collaboration with glycoscientists to meet as closely as possible the growing needs of the community for glycoinformatics.
CCP4mg is a molecular-graphics program that is designed to give rapid access to both straightforward and complex static and dynamic representations of macromolecular structures. It has recently been updated with a new interface that provides more sophisticated atom-selection options and a wizard to facilitate the generation of complex scenes. These scenes may contain a mixture of coordinate-derived and abstract graphical objects, including text objects, arbitrary vectors, geometric objects and imported images, which can enhance a picture and eliminate the need for subsequent editing. Scene descriptions can be saved to file and transferred to other molecules. Here, the substantially enhanced version 2 of the program, with a new underlying GUI toolkit, is described. A built-in rendering module produces publication-quality images.
MOTIVATION: Glycan structures are commonly represented using symbols or linear nomenclature such as that from the Consortium for Functional Glycomics (also known as modified IUPAC-condensed nomenclature). No current tool allows for writing the name in such format using a graphical user interface (GUI); thus, names are prone to errors or non-standardized representations. RESULTS: Here we present GlycoGlyph, a web application built using JavaScript, which is capable of drawing glycan structures using a GUI and providing the linear nomenclature as an output or using it as an input in a dynamic manner. GlycoGlyph also allows users to save the structures as an SVG vector graphic, and allows users to export the structure as condensed GlycoCT. AVAILABILITY AND IMPLEMENTATION: The application can be used at: https://glycotoolkit.com/Tools/GlycoGlyph/. The application is tested to work in modern web browsers such as Firefox or Chrome. CONTACT: aymehta@bidmc.harvard.edu or rcummin1@bidmc.harvard.edu.
Hierarchical Editing Language for Macromolecules (HELM version 2.0) is a molecular line notation similar to SMILES but specifically for communicating and managing biopolymer structures. The HELM project, part of the Pistoia Alliance nonprofit organization, has been tasked to develop and promote HELM as a global exchange format and recently released version 2.0 of the specification. Here we will describe the specifics of the HELM v2.0 notation along with the large ecosystem of software to support HELM-based structure management. We will highlight a recent open-source software and database for HELM monomers and a new, simpler approach to deploying a large complicated molecular management system.
BACKGROUND: A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats. RESULTS: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion. CONCLUSIONS: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license from http://openbabel.org.
New computer software, GlycoMiner, has been developed to automatically identify tandem (MS/MS) spectra obtained in liquid chromatography/mass spectrometry (LC/MS) runs which correspond to N-glycopeptides. The program complements conventional proteomics analysis, and can be used in a high-throughput environment. The program interprets the spectra and determines the structure of the corresponding glycopeptides. GlycoMiner runs under Windows, can process spectra obtained on various instruments, and can be downloaded from our website (w3.chemres.hu/ms/glycominer). The algorithm works similarly to a human expert: it evaluates the low-mass oxonium ions, deduces oligosaccharide losses from the protonated molecule, and identifies the mass of the peptide residue. The program has been tested on tryptic digests of two glycoproteins: AGP (which has five different N-glycosylation sites) and transferrin (with two N-glycosylation sites). Results have been evaluated both manually and by GlycoMiner. Out of 3132 MS/MS spectra 338 were found to correspond to glycopeptides; identification by GlycoMiner showed a 0.1% false positive and 0.1% false negative rate. From these it was possible to identify 196 glycan structures manually; GlycoMiner correctly identified all of these, with no false positives. The rest were low quality spectra, not suitable for structure assignment.
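The first step named in this abstract, screening MS/MS spectra for low-mass oxonium ions, can be sketched as follows. This is not GlycoMiner's implementation: the function names, the tolerance, the intensity threshold and the rule "at least two markers" are assumptions made for illustration. The m/z values are the usual singly protonated oxonium ions.

```python
# Singly protonated oxonium marker ions (m/z) typical of glycopeptide spectra.
OXONIUM_IONS = {
    "Hex": 163.0601,
    "HexNAc": 204.0867,
    "NeuAc-H2O": 274.0921,
    "NeuAc": 292.1027,
    "HexHexNAc": 366.1395,
}


def oxonium_hits(peaks, tol=0.02, min_rel_intensity=0.05):
    """Return the sorted names of marker ions found in a peak list.

    peaks: list of (mz, intensity) tuples.
    Peaks weaker than min_rel_intensity times the base peak are ignored.
    """
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    found = set()
    for mz, intensity in peaks:
        if intensity < min_rel_intensity * base:
            continue  # too weak to count as evidence
        for name, ref in OXONIUM_IONS.items():
            if abs(mz - ref) <= tol:
                found.add(name)
    return sorted(found)


def looks_like_glycopeptide(peaks, min_markers=2, **kwargs):
    """Hypothetical decision rule: flag spectra with enough distinct markers."""
    return len(oxonium_hits(peaks, **kwargs)) >= min_markers


if __name__ == "__main__":
    spectrum = [(204.087, 1000.0), (366.140, 400.0), (500.25, 800.0)]
    print(oxonium_hits(spectrum), looks_like_glycopeptide(spectrum))
```

A filter of this kind only selects candidate spectra; deducing oligosaccharide losses and the peptide mass, as the abstract describes, requires further interpretation of the higher-mass region.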
Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics. APE provides both utility functions for reading and writing data and manipulating phylogenetic trees, as well as several advanced methods for phylogenetic and evolutionary analysis (e.g. comparative and population genetic methods). APE takes advantage of the many R functions for statistics and graphics, and also provides a flexible framework for developing and implementing further statistical methods for the analysis of evolutionary processes. AVAILABILITY: The program is free and available from the official R package archive at http://cran.r-project.org/src/contrib/PACKAGES.html#ape. APE is licensed under the GNU General Public License.
Characterizing glycans and glycoconjugates in the context of three-dimensional structures is important in understanding their biological roles and developing efficient therapeutic agents. Computational modeling and molecular simulation have become an essential tool complementary to experimental methods. Here, we present a computational tool, Glycan Modeler, for in silico N-/O-glycosylation of the target protein and generation of carbohydrate-only systems. In our previous study, we developed Glycan Reader, a web-based tool for detecting carbohydrate molecules in a PDB structure and generating simulation systems and input files. As integrated into Glycan Reader in CHARMM-GUI, Glycan Modeler (Glycan Reader & Modeler) enables generation of the structures of glycans and glycoconjugates for given glycan sequences and glycosylation sites using PDB glycan template structures from the Glycan Fragment Database (http://glycanstructure.org/fragment-db). Our benchmark tests demonstrate the universal applicability of Glycan Reader & Modeler to various glycan sequences and target proteins. We also investigated the structural properties of modeled glycan structures by running 2-μs molecular dynamics simulations of HIV envelope protein. The simulations show that the modeled glycan structures built by Glycan Reader & Modeler have structural features similar to those solved by X-ray crystallography. We also describe representative examples of glycoconjugate modeling with video demos to illustrate the practical applications of Glycan Reader & Modeler. Glycan Reader & Modeler is freely available at http://charmm-gui.org/input/glycan.
A molecular visualization program tailored to deal with the range of 3D structures of complex carbohydrates and polysaccharides, either alone or in their interactions with other biomacromolecules, has been developed using advanced technologies elaborated by the video games industry. All the specific structural features displayed by the simplest to the most complex carbohydrate molecules have been considered and can be depicted. This concerns the monosaccharide identification and classification, conformations, location in single or multiple branched chains, depiction of secondary structural elements and the essential constituting elements in very complex structures. Particular attention was given to cope with the accepted nomenclature and pictorial representation used in glycoscience. This achievement provides a continuum between the most popular ways to depict the primary structures of complex carbohydrates to visualizing their 3D structures while giving the users many options to select the most appropriate modes of representations including new features such as those provided by the use of textures to depict some molecular properties. These developments are incorporated in a stand-alone viewer capable of displaying molecular structures, biomacromolecule surfaces and complex interactions of biomacromolecules, with powerful, artistic and illustrative rendering methods. They result in an open source software compatible with multiple platforms, i.e., Windows, MacOS and Linux operating systems, web pages, and producing publication-quality figures. The algorithms and visualization enhancements are demonstrated using a variety of carbohydrate molecules, from glycan determinants to glycoproteins and complex protein-carbohydrate interactions, as well as very complex mega-oligosaccharides and bacterial polysaccharides and multi-stranded polysaccharide architectures.
The design, implementation, and capabilities of an extensible visualization system, UCSF Chimera, are discussed. Chimera is segmented into a core that provides basic services and visualization, and extensions that provide most higher level functionality. This architecture ensures that the extension mechanism satisfies the demands of outside developers who wish to incorporate new features. Two unusual extensions are presented: Multiscale, which adds the ability to visualize large-scale molecular assemblies such as viral coats, and Collaboratory, which allows researchers to share a Chimera session interactively despite being at separate locales. Other extensions include Multalign Viewer, for showing multiple sequence alignments and associated structures; ViewDock, for screening docked ligand orientations; Movie, for replaying molecular dynamics trajectories; and Volume Viewer, for display and analysis of volumetric data. A discussion of the usage of Chimera in real-world situations is given, along with anticipated future directions. Chimera includes full user documentation, is free to academic and nonprofit users, and is available for Microsoft Windows, Linux, Apple Mac OS X, SGI IRIX, and HP Tru64 Unix from http://www.cgl.ucsf.edu/chimera/.
The Tinker software, currently released as version 8, is a modular molecular mechanics and dynamics package written primarily in a standard, easily portable dialect of Fortran 95 with OpenMP extensions. It supports a wide variety of force fields, including polarizable models such as the Atomic Multipole Optimized Energetics for Biomolecular Applications (AMOEBA) force field. The package runs on Linux, macOS, and Windows systems. In addition to canonical Tinker, there are branches, Tinker-HP and Tinker-OpenMM, designed for use on message passing interface (MPI) parallel distributed memory supercomputers and state-of-the-art graphical processing units (GPUs), respectively. The Tinker suite also includes a tightly integrated Java-based graphical user interface called Force Field Explorer (FFE), which provides molecular visualization capabilities as well as the ability to launch and control Tinker calculations.
Bioinformatics for glycobiology is still considered to be in its infancy. Nevertheless, there are various applications and databases available for glycoscientists by now. This article summarizes the problems that glycoinformatics is facing and gives an overview of the existing resources, including web portals, databases and tools. Software for structure input and display, for processing of analytical data, for prediction and analysis of glycosylation sites, and applications related to carbohydrate 3D structures are described. Special emphasis is put on GlycomeDB, a project that aims to integrate all freely available carbohydrate structure data already stored in databases, and the taxonomic annotation of these structures, into one resource. By this means it allows researchers to locate data in many databases without having to learn the different query types and carbohydrate notations used in the individual resources. Keywords: Bioinformatics; Carbohydrate database; GlycomeDB; Glycan; Glycosylation sites; Automatic annotation; 3D structure; Analytical software; Carbohydrate software tools.
MOTIVATION: The interactive visualization of very large macromolecular complexes on the web is becoming a challenging problem as experimental techniques advance at an unprecedented rate and deliver structures of increasing size. RESULTS: We have tackled this problem by developing highly memory-efficient and scalable extensions for the NGL WebGL-based molecular viewer and by using the Macromolecular Transmission Format (MMTF), a binary and compressed format. These enable NGL to download and render molecular complexes with millions of atoms interactively on desktop computers and smartphones alike, making it a tool of choice for web-based molecular visualization in research and education. AVAILABILITY AND IMPLEMENTATION: The source code is freely available under the MIT license at github.com/arose/ngl and distributed on NPM (npmjs.com/package/ngl). MMTF-JavaScript encoders and decoders are available at github.com/rcsb/mmtf-javascript.
The NGL Viewer (http://proteinformatics.charite.de/ngl) is a web application for the visualization of macromolecular structures. By fully adopting capabilities of modern web browsers, such as WebGL, for molecular graphics, the viewer can interactively display large molecular complexes and is also unaffected by the retirement of third-party plug-ins like Flash and Java Applets. Generally, the web application offers comprehensive molecular visualization through a graphical user interface so that life scientists can easily access and profit from available structural data. It supports common structural file-formats (e.g. PDB, mmCIF) and a variety of molecular representations (e.g. 'cartoon, spacefill, licorice'). Moreover, the viewer can be embedded in other web sites to provide specialized visualizations of entries in structural databases or results of structure-related calculations.
no abstract.
We present the LiteMol suite, a tool for visualizing large macromolecular structure data sets that is freely available at https://www.litemol.org.
Here we report the launch of a web-tool (the GLYCAM-Web GAG Builder, www.glycam.org/gag) for the rapid and straightforward prediction of 3D structural models for glycosaminoglycans (GAGs). The tool provides the user with coordinate files (PDB format) for use in visualization, as well as files for performing MD simulation with the AMBER software package. Counter ions and water may also be added as desired. The tool is designed with the non-expert in mind, and as such has implemented typical default values for structural parameters, which the user may change if desired. Multiple GAG types are supported, including Heparin/Heparan Sulfate, Chondroitin Sulfate, Dermatan Sulfate, Keratan Sulfate, and Hyaluronan; however, the user may alter the default sulfation patterns to create novel sequences. The common non-natural unsaturated uronic acid (ΔUA) and its sulfated derivative are also supported.
The CASPER program which is used for determination of the structure of oligo- and polysaccharides has been extended. It can now handle a reduced number of experimental signals from an NMR spectrum in the comparison to the simulated spectra of structures that it generates, an improvement which is of practical importance since all signals in NMR spectra cannot always be identified. Furthermore, the program has been enhanced to simulate NMR spectra of multibranched oligo- and polysaccharides. The new developments were tested on four saccharides of known structure but of different complexity and were shown to predict the correct structures.
The LIPID MAPS Structure Database (LMSD) is a relational database encompassing structures and annotations of biologically relevant lipids. Structures of lipids in the database come from four sources: (i) LIPID MAPS Consortium's core laboratories and partners; (ii) lipids identified by LIPID MAPS experiments; (iii) computationally generated structures for appropriate lipid classes; (iv) biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public sources. All the lipid structures in LMSD are drawn in a consistent fashion. In addition to a classification-based retrieval of lipids, users can search LMSD using either text-based or structure-based search options. The text-based search implementation supports data retrieval by any combination of these data fields: LIPID MAPS ID, systematic or common name, mass, formula, category, main class, and subclass data fields. The structure-based search, in conjunction with optional data fields, provides the capability to perform a substructure search or exact match for the structure drawn by the user. Search results, in addition to structure and annotations, also include relevant links to external databases. The LMSD is publicly available at www.lipidmaps.org/data/structure/.
Since the 2-D/3-D HPLC mapping technique was proposed for structural analyses of N-glycans, approximately 500 different structures have been elucidated. Based on the accumulated data, we developed a web application, GALAXY (Glycoanalysis by the three axes of MS and chromatography), to utilize the 2-D/3-D maps more effectively. This application will facilitate search of candidate structures satisfying 2-D/3-D HPLC and/or mass spectrometric data and enable us to predict coordinates of putative PA-glycans and to trace the effects of glycosidase treatments in a graphical manner.
The glycosylation of proteins and lipids is known to be closely related to the mechanisms of various diseases such as influenza, cancer, and muscular dystrophy. Therefore, it has become clear that the analysis of post-translational modifications of proteins, including glycosylation, is important to accurately understand the functions of each protein molecule and the interactions among them. In order to conduct large-scale analyses more efficiently, it is essential to promote the accumulation, sharing, and reuse of experimental and analytical data in accordance with the FAIR (Findability, Accessibility, Interoperability, and Re-usability) data principles. However, a FAIR data repository for storing and sharing glycoconjugate information, including glycopeptides and glycoproteins, in a standardized format did not exist. Therefore, we have developed GlyComb (https://glycomb.glycosmos.org) as a new standardized data repository for glycoconjugate data. Currently, GlyComb can assign a unique identifier to a set of glycosylation information associated with a specific peptide sequence or UniProt ID. By standardizing glycoconjugate data via GlyComb identifiers and coordinating with existing web resources such as GlyTouCan and GlycoPOST, a comprehensive system for data submission and data sharing among researchers can be established. Here we introduce how GlyComb is able to integrate the variety of glycoconjugate data already registered in existing data repositories to obtain a better understanding of the available glycopeptides and glycoproteins, and their glycosylation patterns. We also explain how this system can serve as a foundation for a better understanding of glycan function.
MOTIVATION: Glycans are biomolecules that play an important role in the biological processes of living organisms. They form diverse, complicated structures such as branched and cyclic forms. Web3 Unique Representation of Carbohydrate Structures (WURCS) was proposed as a new linear notation for uniquely representing glycans during the GlyTouCan project. WURCS defines rules for complex glycan structures that other text formats did not support, and so it is possible to represent a wide variety of glycans. However, WURCS uses a complicated nomenclature, so it is not human-readable. Therefore, we aimed to support the interpretation of WURCS by converting WURCS to the most basic and widely used format, IUPAC. RESULTS: In this study, we developed GlycanFormatConverter and succeeded in converting WURCS to the three kinds of IUPAC formats (IUPAC-Extended, IUPAC-Condensed and IUPAC-Short). Furthermore, we have implemented functionality to import IUPAC-Extended, KEGG Chemical Function (KCF) and LinearCode formats and to export WURCS. We have thoroughly tested our GlycanFormatConverter and were able to show that it was possible to convert all the glycans registered in the GlyTouCan repository, with exceptions owing only to the limitations of the original format. The source code for this conversion tool has been released as an open source tool. AVAILABILITY AND IMPLEMENTATION: https://github.com/glycoinfo/GlycanFormatConverter.git.
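To make the remark about readability concrete, the sketch below splits a WURCS 2.0-style string into its header counts, unique residue list, residue sequence and linkage list. It does not convert anything to IUPAC and is unrelated to the GlycanFormatConverter code base. The function name split_wurcs is hypothetical, the example string is an illustrative N-glycan core written from memory rather than taken from the paper, and the regular expression assumes plain integer counts, so strings with repeat or other extended count markers are rejected.

```python
import re

# Simplified layout assumed here:
#   WURCS=<version>/<unique>,<residues>,<linkages>/[res][res].../<sequence>/<linkages>
# Unique residues may contain "/" inside their brackets, so a plain split("/")
# is not enough.
WURCS_RE = re.compile(
    r"^WURCS=(?P<version>\d+\.\d+)/"
    r"(?P<n_unique>\d+),(?P<n_res>\d+),(?P<n_lin>\d+)/"
    r"(?P<unique>(?:\[[^\]]*\])+)/"
    r"(?P<sequence>[^/]*)/"
    r"(?P<linkages>.*)$"
)


def split_wurcs(text):
    """Split a WURCS 2.0-style string into labelled sections (no chemistry)."""
    match = WURCS_RE.match(text.strip())
    if match is None:
        raise ValueError("not a WURCS 2.0-style string this sketch understands")
    return {
        "version": match["version"],
        "counts": (int(match["n_unique"]), int(match["n_res"]), int(match["n_lin"])),
        "unique": re.findall(r"\[([^\]]*)\]", match["unique"]),
        "sequence": match["sequence"].split("-") if match["sequence"] else [],
        "linkages": match["linkages"].split("_") if match["linkages"] else [],
    }


if __name__ == "__main__":
    # Illustrative string only (assumed to describe a Man3GlcNAc2 core).
    example = (
        "WURCS=2.0/3,5,4/[a2122h-1b_1-5_2*NCC/3=O][a1122h-1b_1-5]"
        "[a1122h-1a_1-5]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1"
    )
    for key, value in split_wurcs(example).items():
        print(key, value)
```

Even after splitting, the residue codes remain opaque without the WURCS specification, which is the motivation the abstract gives for converting to IUPAC formats.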
The prediction of the quaternary structure of biomolecular macromolecules is of paramount importance for fundamental understanding of cellular processes and drug design. In the era of integrative structural biology, one way of increasing the accuracy of modeling methods used to predict the structure of biomolecular complexes is to include as much experimental or predictive information as possible in the process. This has been at the core of our information-driven docking approach HADDOCK. We present here the updated version 2.2 of the HADDOCK portal, which offers new features such as support for mixed molecule types, additional experimental restraints and improved protocols, all of this in a user-friendly interface. With well over 6000 registered users and 108,000 jobs served, an increasing fraction of which on grid resources, we hope that this timely upgrade will help the community to solve important biological questions and further advance the field. The HADDOCK2.2 Web server is freely accessible to non-profit users at http://haddock.science.uu.nl/services/HADDOCK2.2.
Structure validation has become a major issue in the structural biology community, and an essential step is checking the ligand structure. This paper introduces MotiveValidator, a web-based application for the validation of ligands and residues in PDB or PDBx/mmCIF format files provided by the user. Specifically, MotiveValidator is able to evaluate in a straightforward manner whether the ligand or residue being studied has a correct annotation (3-letter code), i.e. if it has the same topology and stereochemistry as the model ligand or residue with this annotation. If not, MotiveValidator explicitly describes the differences. MotiveValidator offers a user-friendly, interactive and platform-independent environment for validating structures obtained by any type of experiment. The results of the validation are presented in both tabular and graphical form, facilitating their interpretation. MotiveValidator can process thousands of ligands or residues in a single validation run that takes no more than a few minutes. MotiveValidator can be used for testing single structures, or the analysis of large sets of ligands or fragments prepared for binding site analysis, docking or virtual screening. MotiveValidator is freely available via the Internet at http://ncbr.muni.cz/MotiveValidator.
To address data management and data exchange problems in the nuclear magnetic resonance (NMR) community, the Collaborative Computing Project for the NMR community (CCPN) created a "Data Model" that describes all the different types of information needed in an NMR structural study, from molecular structure and NMR parameters to coordinates. This paper describes the development of a set of software applications that use the Data Model and its associated libraries, thus validating the approach. These applications are freely available and provide a pipeline for high-throughput analysis of NMR data. Three programs work directly with the Data Model: CcpNmr Analysis, an entirely new analysis and interactive display program, the CcpNmr FormatConverter, which allows transfer of data from programs commonly used in NMR to and from the Data Model, and the CLOUDS software for automated structure calculation and assignment (Carnegie Mellon University), which was rewritten to interact directly with the Data Model. The ARIA 2.0 software for structure calculation (Institut Pasteur) and the QUEEN program for validation of restraints (University of Nijmegen) were extended to provide conversion of their data to the Data Model. During these developments the Data Model has been thoroughly tested and used, demonstrating that applications can successfully exchange data via the Data Model. The software architecture developed by CCPN is now ready for new developments, such as integration with additional software applications and extensions of the Data Model into other areas of research.
High-throughput methods to identify and quantify glycans in a given sample are rare. We have optimised a robotic platform for analysing biopharmaceuticals at each stage of the manufacturing process. In addition, it can be applied to basic research. The plate format makes it convenient for large sample sets; it is relatively cheap, robust and quantitative. However, the large datasets churned out by this platform require significant time to interpret. Consequently, informatics tools are required to help with this annotation. This article briefly describes our robotic platform and concentrates on a set of software tools for the interpretation of quantitative glycoprofiling data.
With over 150,000 entries, the worldwide protein data bank (PDB) is the primary repository for 3D macromolecular structure. Unfortunately, structural, annotational, and ambiguity errors exist for carbohydrates throughout the database due in part to the lack of carbohydrate‐specific tools for checking the quality of structures prior to deposition. Our group has partnered with the PDB Biocuration team to assist in the identification and remediation of carbohydrates in their database and we have developed a user‐friendly web interface called GlyFinder to accurately find, retrieve, and assess these glycans and glycoproteins. Using GlyFinder, we have found that 45,852 of the PDB entries contain carbohydrates (30.1% of the PDB). Nearly 6,000 glycoproteins have been identified, with an average of five N‐linked glycans per glycoprotein. Only 415 glycoproteins contained O‐linked glycans, with an average of three O‐linked glycans each. A surprisingly high number of glycoprotein PDBs (500, 7.45%) contain one or more N‐linked glycans that are alpha‐linked to the asparagine, illustrating the unfortunate errors that sometimes exist in the data. Because of such errors, and because the glycans in crystal structures are often truncated, we have also developed tools (Glycoprotein Builder) that can build realistic models of the glycoprotein with intact glycans employing the crystal structure of the protein core. These models allow us to predict the impact of glycosylation on protein function, antigenicity, immunogenicity and stability. We illustrate these capabilities for several proteins, including human Erythropoietin, HIV gp120, and influenza A hemagglutinin.
CHARMM-GUI Membrane Builder, http://www.charmm-gui.org/input/membrane, is a web-based user interface designed to interactively build all-atom protein/membrane or membrane-only systems for molecular dynamics simulations through an automated optimized process. In this work, we describe the new features and major improvements in Membrane Builder that allow users to robustly build realistic biological membrane systems, including (1) addition of new lipid types, such as phosphoinositides, cardiolipin (CL), sphingolipids, bacterial lipids, and ergosterol, yielding more than 180 lipid types, (2) enhanced building procedure for lipid packing around protein, (3) reliable algorithm to detect lipid tail penetration to ring structures and protein surface, (4) distance-based algorithm for faster initial ion displacement, (5) CHARMM inputs for P21 image transformation, and (6) NAMD equilibration and production inputs. The robustness of these new features is illustrated by building and simulating a membrane model of the polar and septal regions of E. coli membrane, which contains five lipid types: CL lipids with two types of acyl chains and phosphatidylethanolamine lipids with three types of acyl chains. It is our hope that CHARMM-GUI Membrane Builder becomes a useful tool for simulation studies to better understand the structure and dynamics of proteins and lipids in realistic biological membrane environments.
This is the second in a set of two articles where we describe our newly developed scheme to predict conformations of complex oligosaccharides in solution. We apply our fast sugar conformation prediction tool to the case of two complex human milk oligosaccharides LNF-1 and LND-1. As described in detail in the first paper, our protocol initially delivers a set of "unique structures" corresponding to important minima on the potential-energy landscape of a complex sugar using an implicit solvent model. The nuclear Overhauser effect ranking of individual conformations provides a suitable way for comparison with available experiments. The structures obtained agree well with earlier computational predictions but are obtained at a significantly lower computational cost. Sugar conformations corresponding to stable energy minima not found by earlier molecular dynamics studies were also detected using our methodology. In order to evaluate the effects of explicit solvation and thermal fluctuations on several different predicted conformers, we also performed short-time molecular dynamics simulations in an explicit solvent.
In two recent back-to-back articles (Xia et al., J Chem Theory Comput 3:1620-1628 and 1629-1643, 2007a, b) we have started to address the problem of complex oligosaccharide conformation and folding. The scheme previously presented was based on exhaustive searches in configuration space in conjunction with Nuclear Overhauser Effect (NOE) calculations and the use of a complex rotameric library that takes branching into account. NOEs are extremely useful for structural determination but only provide information about short range interactions and ordering. Instead, the measurement of residual dipolar couplings (RDC) yields information about molecular ordering or folding that is long range in nature. In this article we show the results obtained by incorporating RDC calculations into our prediction scheme. Using this new approach we are able to accurately predict the structure of six human milk sugars: LNF-1, LND-1, LNF-2, LNF-3, LNnT and LNT. Our exhaustive search in dihedral configuration space combined with RDC and NOE calculations allows for highly accurate structural predictions that, because of the non-ergodic nature of these molecules on a time scale compatible with molecular dynamics simulations, are extremely hard to obtain otherwise (Almond et al., Biochemistry 43:5853-5863, 2004). Molecular dynamics simulations in explicit solvent using as initial configurations the structures predicted by our algorithm show that the histo-blood group epitopes in these sugars are relatively rigid and that the whole family of oligosaccharides derives its conformational variability almost exclusively from their common linkage (β-D-GlcNAc-(1→3)-β-D-Gal) which can exist in two distinct conformational states. A population analysis based on the conformational variability of this flexible glycosidic link indicates that the relative population of the two distinct states varies for different human milk oligosaccharides.
This article reports on the implementation of J coupling calculations in our recently developed Fast Sugar Structure Prediction Software (FSPS). The FSPS combines a smart and exhaustive algorithm to search through conformational space with the calculation of different experimental nuclear magnetic resonance observables to establish the conformation of saccharides in solution. Using our algorithm in combination with NMR data, we investigate the solution structure of three simple disaccharides (methyl alpha-sophoroside, methyl alpha-laminarabioside, and methyl alpha-cellobioside) and one complex bacterial polysaccharide (Shigella flexneri 5a).
no abstract.
BACKGROUND: Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure-property relationships modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. While interchangeable for many cheminformatics purposes there have been no studies on the inconsistency of these structure identifiers due to various approaches for data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation. RESULTS: The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2%-98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%). CONCLUSIONS: We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence especially when merging data and if systematic identifiers are used as a key index for structure integration or cross-querying several databases. Regenerating systematic identifiers starting from their MOL representation and applying well-defined and documented chemistry standardisation rules to all compounds prior to creating them can dramatically increase internal consistency.
New knowledge is produced at a continuously increasing speed, and the list of papers, databases and other knowledge sources that a researcher in the life sciences needs to cope with is actually turning into a problem rather than an asset. The adequate management of knowledge is therefore becoming fundamentally important for life scientists, especially if they work with approaches that thoroughly depend on knowledge integration, such as systems biology. Several initiatives to organize biological knowledge sources into a readily exploitable resourceome are presently being carried out. Ontologies and Semantic Web technologies revolutionize these efforts. Here, we review the benefits, trends, current possibilities, and the potential this holds for the biosciences.
BACKGROUND: A thorough understanding of the National Library of Medicine's Medical Subject Headings can increase the efficiency and precision of one's literature searching skills using the Medline database. AIMS: To describe how to use the Medical Subject Headings to conduct a search for literature, and how to write up a description of the search strategy. MATERIALS AND METHODS: The author interprets National Library of Medicine documentation to describe Medical Subject Headings, and shares strategies from daily literature searching. CONCLUSION: Knowing how to use Medical Subject Headings improves the efficiency and quality of one's literature searches.
no abstract.
The wide application of next-generation sequencing has presented a new hurdle to bioinformatics for managing the fast-growing sequence data. The management of biomacromolecules at the chemistry level imposes an even greater challenge in cheminformatics because of the lack of a good chemical representation of biopolymers. Here we introduce the self-contained sequence representation (SCSR). SCSR combines the best features of bioinformatics and cheminformatics notations. SCSR is the first general, extensible, and comprehensive representation of biopolymers in a compressed format that retains chemistry detail. The SCSR-based high-performance exact structure and substructure searching methods (NEMA key and SSS) offer new ways to search biopolymers that complement bioinformatics approaches. The widely used chemical structure file format (molfile) has been enhanced to support SCSR. SCSR offers a solid framework for future development of new methods and systems for managing and handling sequences at the chemistry level. SCSR lays the foundation for the integration of bioinformatics and cheminformatics.
The current rise in the use of open lab notebook techniques means that there are an increasing number of scientists who make chemical information freely and openly available to the entire community as a series of micropublications that are released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiary of open data is in fact chemical algorithms, which are capable of absorbing vast quantities of data, and using it to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine readable formats. Graphical abstract: Lab notebook entries must target both visualisation by scientists and use by machine learning algorithms.
no abstract.
A series of file formats used for storing and transferring chemical structure information that have evolved over several years at Molecular Design Limited are described. These files are built using one or more connection table (Ctab) blocks. The Ctab block format is described in detail. The file formats described are the MOLfile for a single (multifragment) molecule, the RGfile for a generic query, the SDfile for multiple structures and data, the RXNfile for a single reaction, and the RDfile for multiple reactions and data. The relationships of these files are given as well as examples.
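To make the Ctab idea concrete, here is a minimal Python sketch that reads the counts line, atom block and bond block of a V2000 MOLfile. The column positions follow the commonly documented fixed-width V2000 layout and are not stated in the abstract itself; the names `Atom`, `Bond`, `parse_ctab_v2000` and `EXAMPLE_MOLFILE` are hypothetical, and the two-atom example is hand-written for illustration only. Property lines, charges and the V3000 variant are ignored.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    symbol: str
    x: float
    y: float
    z: float


@dataclass
class Bond:
    begin: int  # 1-based atom index, as stored in the file
    end: int
    order: int


def parse_ctab_v2000(molfile: str):
    """Return (atoms, bonds) from a V2000 MOLfile string (illustrative only)."""
    lines = molfile.splitlines()
    counts = lines[3]  # the counts line follows the three header lines
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # Assumed fixed-width layout: three 10-character coordinates, then the symbol.
        atoms.append(Atom(
            symbol=line[31:34].strip(),
            x=float(line[0:10]),
            y=float(line[10:20]),
            z=float(line[20:30]),
        ))

    bonds = []
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        # Assumed layout: three 3-character integer fields (atom, atom, bond type).
        bonds.append(Bond(int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


# Hand-written two-atom example (heavy atoms of methanol), for illustration only.
EXAMPLE_MOLFILE = (
    "methanol\n"
    "  sketch\n"
    "\n"
    "  2  1  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "  1  2  1  0\n"
    "M  END\n"
)

if __name__ == "__main__":
    atoms, bonds = parse_ctab_v2000(EXAMPLE_MOLFILE)
    print(atoms)
    print(bonds)
```

For real data an established toolkit is the safer choice; the sketch only shows why the format is easy to read positionally.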
The technological advances of the past century, marked by the computer revolution and the advent of high-throughput screening technologies in drug discovery, opened the path to the computational analysis and visualization of bioactive molecules. For this purpose, it became necessary to represent molecules in a syntax that would be readable by computers and understandable by scientists of various fields. A large number of chemical representations have been developed over the years, their numerosity being due to the fast development of computers and the complexity of producing a representation that encompasses all structural and chemical characteristics. We present here some of the most popular electronic molecular and macromolecular representations used in drug discovery, many of which are based on graph representations. Furthermore, we describe applications of these representations in AI-driven drug discovery. Our aim is to provide a brief guide on structural representations that are essential to the practice of AI in drug discovery. This review serves as a guide for researchers who have little experience with the handling of chemical representations and plan to work on applications at the interface of these fields.
no abstract.
CurlySMILES is a chemical line notation which extends SMILES with annotations for storage, retrieval and modeling of interlinked, coordinated, assembled and adsorbed molecules in supramolecular structures and nanodevices. Annotations are enclosed in curly braces and anchored to an atomic node or at the end of the molecular graph depending on the annotation type. CurlySMILES includes predefined annotations for stereogenicity, electron delocalization charges, extra-molecular interactions and connectivity, surface attachment, solutions, and crystal structures and allows extensions for domain-specific annotations. CurlySMILES provides a shorthand format to encode molecules with repetitive substructural parts or motifs such as monomer units in macromolecules and amino acids in peptide chains. CurlySMILES further accommodates special formats for non-molecular materials that are commonly denoted by composition of atoms or substructures rather than complete atom connectivity.
no abstract.
A molecule editor, that is, a program for input and editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. This review focuses on a special type of molecule editors, namely those that are used for molecule structure input on the web. Scientific computing is now moving more and more in the direction of web services and cloud computing, with servers scattered all around the Internet. Thus a web browser has become the universal scientific user interface, and a tool to edit molecules directly within the web browser is essential. The review covers a history of web-based structure input, starting with simple text entry boxes and early molecule editors based on clickable maps, before moving to the current situation dominated by Java applets. One typical example, the popular JME Molecule Editor, will be described in more detail. Modern Ajax server-side molecule editors are also presented. And finally, the possible future direction of web-based molecule editing, based on technologies like JavaScript and Flash, is discussed.
The World Wide Web has succeeded in large part because its software architecture has been designed to meet the needs of an Internet-scale distributed hypermedia system. The Web has been iteratively developed over the past ten years through a series of modifications to the standards that define its architecture. In order to identify those aspects of the Web that needed improvement and avoid undesirable modifications, a model for the modern Web architecture was needed to guide its design, definition, and deployment. Software architecture research investigates methods for determining how best to partition a system, how components identify and communicate with each other, how information is communicated, how elements of a system can evolve independently, and how all of the above can be described using formal and informal notations. My work is motivated by the desire to understand and evaluate the architectural design of network-based application software through principled use of architectural constraints, thereby obtaining the functional, performance, and social properties desired of an architecture. An architectural style is a named, coordinated set of architectural constraints. This dissertation defines a framework for understanding software architecture via architectural styles and demonstrates how styles can be used to guide the architectural design of network-based application software. A survey of architectural styles for network-based applications is used to classify styles according to the architectural properties they induce on an architecture for distributed hypermedia. I then introduce the Representational State Transfer (REST) architectural style and describe how REST has been used to guide the design and development of the architecture for the modern Web. 
REST emphasizes scalability of component interactions, generality of interfaces, independent deployment of components, and intermediary components to reduce interaction latency, enforce security, and encapsulate legacy systems. I describe the software engineering principles guiding REST and the interaction constraints chosen to retain those principles, contrasting them to the constraints of other architectural styles. Finally, I describe the lessons learned from applying REST to the design of the Hypertext Transfer Protocol and Uniform Resource Identifier standards, and from their subsequent deployment in Web client and server software.
We present an easy, human-readable line notation to describe even complex peptides.
The software for the IUPAC Chemical Identifier, InChI, is extraordinarily reliable. It has been tested on large databases around the world, and has proved itself to be an essential tool in the handling and integration of large chemical databases. InChI version 1.05 was released in January 2017 and version 1.06 in December 2020. In this paper, we report on the current state of the InChI Software, the details of the improvements in the v1.06 release, and the results of a test of the InChI run on PubChem, a database of more than a hundred million molecules. The upgrade introduces significant new features, including support for pseudo-element atoms and an improved description of polymers. We expect that few, if any, applications using the standard InChI will need to change as a result of the changes in version 1.06. Numerical instability was discovered for 0.002% of this database, and a small number of other molecules were discovered for which the algorithm did not run smoothly. On the basis of PubChem data, we can demonstrate that InChI version 1.05 was 99.996% accurate, and InChI version 1.06 represents a step closer to perfection. Finally, we look forward to future releases and extensions for the InChI Chemical identifier.
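As a small illustration of what an InChI string looks like to downstream code, the following Python sketch splits an identifier into its slash-separated layers. It is a plain string operation and does not call the InChI software described above; the function name `split_inchi_layers` is hypothetical, and the sketch assumes a standard InChI in which each layer prefix occurs once (if a prefix repeats, for example after a fixed-H layer, only the first occurrence is kept).

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into a {layer prefix: layer body} mapping."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    rest = parts[1:]

    # The formula layer carries no letter prefix; the other layers start
    # with a lowercase letter (c = connections, h = hydrogens, ...).
    if rest and rest[0] and not rest[0][0].islower():
        layers["formula"] = rest[0]
        rest = rest[1:]

    for segment in rest:
        if segment:
            # Keep the first occurrence if a prefix appears more than once.
            layers.setdefault(segment[0], segment[1:])
    return layers


if __name__ == "__main__":
    # Standard InChI of ethanol.
    print(split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Generating or validating identifiers still requires the official InChI library; the sketch only shows the layered structure of the string.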
The Reaction InChI (RInChI) extends the idea of the InChI, which provides a unique descriptor of molecular structures, towards reactions. Prototype versions of the RInChI have been available since 2011. The first official release (RInChI-V1.00), funded by the InChI Trust, is now available for download ( http://www.inchi-trust.org/downloads/ ). This release defines the format and generates hashed representations (RInChIKeys) suitable for database and web operations. The RInChI provides a concise description of the key data in chemical processes, and facilitates the manipulation and analysis of reaction data.
no abstract.
BACKGROUND: SMILES and SMARTS are two well-defined structure matching languages that have gained wide use in cheminformatics. Jmol is a widely used open-source molecular visualization and analysis tool written in Java and implemented in both Java and JavaScript. Over the past 10 years, from 2007 to 2016, work on Jmol has included the development of dialects of SMILES and SMARTS that incorporate novel aspects that allow new and powerful applications. RESULTS: The specifications of "Jmol SMILES" and "Jmol SMARTS" are described. The dialects most closely resemble OpenSMILES and OpenSMARTS. Jmol SMILES is a superset of OpenSMILES, allowing a freer format, including whitespace and comments, the addition of "processing directives" that modify the meaning of certain aspects of SMILES processing such as aromaticity and stereochemistry, a more extensive treatment of stereochemistry, and several minor additions. Jmol SMARTS similarly adds these same modifications to OpenSMARTS, but also adds a number of additional "primitives" and elements of syntax tuned to matching 3D molecular structures and selecting their atoms. The result is an expansion of the capabilities of SMILES and SMARTS primarily for use in 3D molecular analysis, allowing a broader range of matching involving any combination of 3D molecular structures, SMILES strings, and SMARTS patterns. While developed specifically for Jmol, these dialects of SMILES and SMARTS are independent of the Jmol application itself. CONCLUSIONS: Jmol SMILES and Jmol SMARTS add value to standard SMILES and SMARTS. Together they have proven exceptionally capable in extracting valuable information from 3D structural models, as demonstrated in Jmol. 
Capabilities in Jmol enabled by Jmol SMILES and Jmol SMARTS include efficient MMFF94 atom typing, conformational identification, SMILES comparisons without canonicalization, identification of stereochemical relationships, quantitative comparison of 3D structures from different sources (including differences in Kekulization), conformational flexible fitting, and atom mapping used to synchronize interactive displays of 2D structures, 3D structures, and spectral correlations, where data are being drawn from multiple sources.
Since its public introduction in 2005 the IUPAC InChI chemical structure identifier standard has become the international, worldwide standard for defined chemical structures. This article will describe the extensive use and dissemination of the InChI and InChIKey structure representations by and for the world-wide chemistry community, the chemical information community, and major publishers and disseminators of chemical and related scientific offerings in manuscripts and databases.
Jmol is free, open source software for interactive molecular visualization. Since it is written in the Java programming language, it is compatible with all major operating systems and, in the applet form, with most modern web browsers. This article summarizes Jmol development and features that make it a valid and promising replacement for Rasmol and Chime in the development of educational materials, as well as in basic investigation of biomolecular structure. The description is set up by comparison with the well known abilities of Rasmol and Chime. Jmol is suitable for molecular model display and analysis in biochemistry, molecular biology, organic and inorganic chemistry, crystallography, and materials science.
Symbolic approaches to Artificial Intelligence (AI) represent things within a domain of knowledge through physical symbols, combine symbols into symbol expressions, and manipulate symbols and symbol expressions through inference processes. While a large part of Data Science relies on statistics and applies statistical approaches to AI, there is an increasing potential for successfully applying symbolic approaches as well. Symbolic representations and symbolic inference are close to human cognitive representations and therefore comprehensible and interpretable; they are widely used to represent data and metadata, and their specific semantic content must be taken into account for analysis of such information; and human communication largely relies on symbols, making symbolic representations a crucial part in the analysis of natural language. Here we discuss the role symbolic representations and inference can play in Data Science, highlight the research challenges from the perspective of the data scientist, and argue that symbolic methods should become a crucial component of the data scientists’ toolbox.
SYBYL line notation (SLN) is a powerful way to represent molecular structures, reactions, libraries of structures, molecular fragments, formulations, molecular queries, and reaction queries. Nearly any chemical structure imaginable, including macromolecules, pharmaceuticals, catalysts, and even combinatorial libraries can be represented as an SLN string. The language provides a rich syntax for database queries comparable to SMARTS. It provides full Markush, R-Group, reaction, and macro atom capabilities in a single unified notation. It includes the ability to specify 3D conformations and 2D depictions. All the information necessary to recreate the structure in a modeling or drawing package is present in a single, concise string of ASCII characters. This makes SLN ideal for structure communication over global computer networks between applications sitting at remote sites. Unlike SMILES and its derivatives, SLN accomplishes this within a single unified syntax. Structures, queries, compounds, reactions, and virtual libraries can all be represented in a single notation.
In this article we discuss our experience designing and implementing a statistical computing language. In developing this new language, we sought to combine what we felt were useful features from two existing computer languages. We feel that the new language provides advantages in the areas of portability, computational efficiency, memory management, and scoping.
This chapter covers a part of the spectrum of neural-network uses in analytical chemistry. Different architectures of neural networks are described briefly. The chapter focuses on the development of a three-layer artificial neural network for modeling the anti-HIV activity of the HETP derivatives and activity parameters (pIC50) of heparanase inhibitors. The use of a genetic algorithm-kernel partial least squares algorithm combined with an artificial neural network (GA-KPLS-ANN) is described for predicting the activities of a series of aromatic sulfonamides. The retention behavior of terpenes and volatile organic compounds and the prediction of the response surface of different detection systems are presented as typical applications of ANNs in the chromatographic area. The use of ANNs is explored in electrophoresis with emphasis on its application to peptide mapping. Simulation of the electropherogram of glucagons and horse cytochrome C is described as peptide models. This chapter also focuses on discussing the role of ANNs in the simulation of mass and 13C-NMR spectra for noncyclic alkenes and alkanes and lignin and xanthones, respectively.
In a cell or microorganism, the processes that generate mass, energy, information transfer and cell-fate specification are seamlessly integrated through a complex network of cellular constituents and reactions. However, despite the key role of these networks in sustaining cellular functions, their large-scale structure is essentially unknown. Here we present a systematic comparative mathematical analysis of the metabolic networks of 43 organisms representing all three domains of life. We show that, despite significant variation in their individual constituents and pathways, these metabolic networks have the same topological scaling properties and show striking similarities to the inherent organization of complex non-biological systems. This may indicate that metabolic organization is not only identical for all living organisms, but also complies with the design principles of robust and error-tolerant scale-free networks, and may represent a common blueprint for the large-scale organization of interactions among all cellular constituents.
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort(1-4), the structures of around 100,000 unique proteins have been determined(5), but this represents a small fraction of the billions of known protein sequences(6,7). Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence, the structure prediction component of the 'protein folding problem'(8), has been an important open research problem for more than 50 years(9). Despite recent progress(10-14), existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)(15), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
Swarm intelligence is a research branch that models the population of interacting agents or swarms that are able to self-organize. An ant colony, a flock of birds or an immune system is a typical example of a swarm system. Bees’ swarming around their hive is another example of swarm intelligence. Artificial Bee Colony (ABC) Algorithm is an optimization algorithm based on the intelligent behaviour of honey bee swarm. In this work, ABC algorithm is used for optimizing multivariable functions and the results produced by ABC, Genetic Algorithm (GA), Particle Swarm Algorithm (PSO) and Particle Swarm Inspired Evolutionary Algorithm (PS-EA) have been compared. The results showed that ABC outperforms the other algorithms.
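The employed, onlooker and scout phases that define an Artificial Bee Colony search can be sketched in a few lines. The block below is only an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`sphere`, `abc_minimize`), the parameter defaults and the fitness formula are choices made here for the example.

```python
import random


def sphere(x):
    """Toy multivariable objective: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def abc_minimize(f, dim, bounds, n_sources=10, limit=20, iters=200, seed=0):
    """Minimal ABC sketch. Returns (initial_best_cost, best_cost, best_x)."""
    rng = random.Random(seed)
    lo, hi = bounds

    def random_source():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def neighbour(i):
        # Move one coordinate of source i relative to a random other source.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = list(sources[i])
        cand[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        cand[d] = min(hi, max(lo, cand[d]))
        return cand

    def try_improve(i):
        # Greedy selection: keep the candidate only if it is strictly better.
        cand = neighbour(i)
        c = f(cand)
        if c < costs[i]:
            sources[i], costs[i], trials[i] = cand, c, 0
        else:
            trials[i] += 1

    def fitness(c):
        # Assumed fitness transform: larger is better, defined for any cost.
        return 1.0 / (1.0 + c) if c >= 0 else 1.0 + abs(c)

    sources = [random_source() for _ in range(n_sources)]
    costs = [f(s) for s in sources]
    trials = [0] * n_sources
    initial_best = min(costs)
    best_cost = initial_best
    best_x = list(sources[costs.index(initial_best)])

    for _ in range(iters):
        # Employed bees: one local move per food source.
        for i in range(n_sources):
            try_improve(i)
        # Onlooker bees: revisit sources with probability proportional to fitness.
        weights = [fitness(c) for c in costs]
        for _ in range(n_sources):
            i = rng.choices(range(n_sources), weights=weights)[0]
            try_improve(i)
        # Record the best solution before scouts may discard it.
        i_best = costs.index(min(costs))
        if costs[i_best] < best_cost:
            best_cost, best_x = costs[i_best], list(sources[i_best])
        # Scout bees: abandon sources that have stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_source()
                costs[i] = f(sources[i])
                trials[i] = 0

    return initial_best, best_cost, best_x
```

Calling `abc_minimize(sphere, dim=3, bounds=(-5.0, 5.0))` returns the best cost of the random initial population, the best cost found, and the corresponding point; because the best solution is tracked separately, the reported cost never gets worse than the initial one.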
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
There are a lot of chemical data stored in large databases, repositories and other resources. These data are used by different researchers in different applications in various areas of chemistry. Since these data are stored in several standard chemical file formats, there is a need for the inter-conversion of chemical structures between different formats, because not all formats are supported by the various software and tools used by the researchers. Therefore it becomes essential to convert one file format to another. This paper reviews some of the chemical file formats and also presents a few inter-conversion tools such as Open Babel [1], Mol converter [2] and CncTranslate [3].
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded string (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
Having a compact yet robust structurally based identifier or representation system is a key enabling factor for efficient sharing and dissemination of research results within the chemistry community, and such systems lay down the essential foundations for future informatics and data-driven research. While substantial advances have been made for small molecules, the polymer community has struggled in coming up with an efficient representation system. This is because, unlike other disciplines in chemistry, the basic premise that each distinct chemical species corresponds to a well-defined chemical structure does not hold for polymers. Polymers are intrinsically stochastic molecules that are often ensembles with a distribution of chemical structures. This difficulty limits the applicability of all deterministic representations developed for small molecules. In this work, a new representation system that is capable of handling the stochastic nature of polymers is proposed. The new system is based on the popular "simplified molecular-input line-entry system" (SMILES), and it aims to provide representations that can be used as indexing identifiers for entries in polymer databases. As a pilot test, the entries of the standard data set of the glass transition temperature of linear polymers (Bicerano, 2002) were converted into the new BigSMILES language. Furthermore, it is hoped that the proposed system will provide a more effective language for communication within the polymer community and increase cohesion between the researchers within the community.
Machine learning enables computers to address problems by learning from data. Deep learning is a type of machine learning that uses a hierarchical recombination of features to extract pertinent information and then learn the patterns represented in the data. Over the last eight years, its abilities have increasingly been applied to a wide variety of chemical challenges, from improving computational chemistry to drug and materials design and even synthesis planning. This review aims to explain the concepts of deep learning to chemists from any background and follows this with an overview of the diverse applications demonstrated in the literature. We hope that this will empower the broader chemical community to engage with this burgeoning field and foster the growing movement of deep learning accelerated chemistry.
Hierarchical Editing Language for Macromolecules (HELM version 2.0) is a molecular line notation similar to SMILES but specifically for communicating and managing biopolymer structures. The HELM project, part of the Pistoia Alliance nonprofit organization, has been tasked to develop and promote HELM as a global exchange format and recently released version 2.0 of the specification. Here we will describe the specifics of the HELM v2.0 notation along with the large ecosystem of software to support HELM-based structure management. We will highlight a recent open-source software and database for HELM monomers and a new, simpler approach to deploying a large complicated molecular management system.
Contemporary peptide science exploits methods and tools of bioinformatics and cheminformatics. These approaches use different languages to describe peptide structures: amino acid sequences and chemical codes (especially SMILES), respectively. The latter may be applied, e.g., in comparative studies involving structures and properties of peptides and peptidomimetics. Progress in peptide science "in silico" may be achieved via better communication between biologists and chemists, involving the translation of peptide representation from amino acid sequence into SMILES code. Recent recommendations concerning good practice in chemical information include careful verification of data and their annotation. This publication discusses the generation of SMILES representations of peptides using existing software. Construction of peptide structures containing unnatural and modified amino acids (with special attention paid to glycosylated peptides) is also included. Special attention is paid to the detection of typical errors occurring in SMILES representations of peptides and their correction using molecular editors. Brief recommendations for training of staff working on peptide annotations are discussed as well.
A retrospective view of the design and evolution of Chemical Markup Language (CML) is presented by its original authors.
BACKGROUND: There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string. RESULTS: I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 m compounds in the ChEMBL database, and a 1 m compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset. CONCLUSIONS: The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain - such as the development of a standard aromatic model for SMILES - the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
BACKGROUND: A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats. RESULTS: We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion. CONCLUSIONS: Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license from http://openbabel.org.
The Digital Object Identifier (DOI) is a system for identifying content objects in the digital environment. DOIs are names assigned to any entity for use on Internet digital networks. Scientific data sets may be identified by DOIs, and several efforts are now underway in this area. This paper outlines the underlying architecture of the DOI system, and two such efforts which are applying DOIs to content objects of scientific data.
We have developed a Java library for substructure matching that features easy-to-read syntax and extensibility. This molecular query language (MQL) is grounded on a context-free grammar, which allows for straightforward modification and extension. The formal description of MQL is provided in this paper. Molecule primitives are atoms, bonds, properties, branching, and rings. User-defined features can be added via a Java interface. In MQL, molecules are represented as graphs. Substructure matching was implemented using the Ullmann algorithm because of favorable run-time performance. The Ullmann algorithm carries out a fast subgraph isomorphism search by combining backtracking with effective forward checking. MQL software design was driven by the aim to facilitate the use of various cheminformatics toolkits. Two Java interfaces provide a bridge from our MQL package to an external toolkit: the first one provides the matching rules for every feature of a particular toolkit; the second one converts the found match from the internal format of MQL to the format of the external toolkit. We already implemented these interfaces for the Chemistry Development Toolkit.
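The combination of backtracking with forward checking mentioned above can be illustrated with a small Ullmann-style search. This is a simplified Python sketch written for this note, not the Java code of the MQL package; the graph representation (a dict mapping each node to the set of its neighbours) and the function name `subgraph_match` are assumptions, and atom or bond labels are ignored.

```python
def subgraph_match(pattern, target):
    """Return one mapping {pattern node: target node}, or None if no match.

    Both graphs are undirected and given as dicts: node -> set of neighbours.
    The match preserves pattern edges (it is not required to be induced).
    """
    # Visit high-degree pattern nodes first; they are the most constrained.
    p_nodes = sorted(pattern, key=lambda n: -len(pattern[n]))
    # Initial candidates: a target node needs at least the pattern node's degree.
    candidates = {
        p: {t for t in target if len(target[t]) >= len(pattern[p])}
        for p in p_nodes
    }

    def refine(cands):
        # Forward checking: candidate t for p survives only if every neighbour
        # of p still has at least one candidate among the neighbours of t.
        changed = True
        while changed:
            changed = False
            for p in p_nodes:
                for t in list(cands[p]):
                    if any(not (cands[q] & target[t]) for q in pattern[p]):
                        cands[p].discard(t)
                        changed = True
                if not cands[p]:
                    return False
        return True

    def search(depth, cands):
        if depth == len(p_nodes):
            # Every candidate set is now a singleton that passed refinement.
            return {p: next(iter(cands[p])) for p in p_nodes}
        p = p_nodes[depth]
        for t in sorted(cands[p]):
            trial = {q: set(c) for q, c in cands.items()}
            trial[p] = {t}
            # Keep the mapping injective: t is no longer available to others.
            for q in p_nodes[depth + 1:]:
                trial[q].discard(t)
            if refine(trial):
                result = search(depth + 1, trial)
                if result is not None:
                    return result
        return None  # backtrack

    if not refine(candidates):
        return None
    return search(0, candidates)
```

For a molecular query language the candidate test would also compare atom and bond properties; here only connectivity is checked, which keeps the pruning logic visible.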
The amount of data available on chemical structures and their properties has increased steadily over the past decades. In particular, articles published before the mid-1990s are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, Optical Chemical Structure Recognition (OCSR) tools were developed, of which the best performing are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50-100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.
Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.
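The Tanimoto similarity index quoted above compares two molecular fingerprints as the ratio of shared to total "on" bits. A minimal sketch follows, assuming fingerprints are given as collections of set-bit indices; the abstract does not say which fingerprint type was used, and the value returned for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index |A ∩ B| / |A ∪ B| for two collections of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union
```

For example, the bit sets `{1, 2, 3}` and `{2, 3, 4}` share two of four distinct bits, so the index is 0.5; an index above 0.9 therefore means the predicted and true structures share nearly all fingerprint features.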
Chemical structure extraction from documents remains a hard problem because of both false positive identification of structures during segmentation and errors in the predicted structures. Current approaches rely on handcrafted rules and subroutines that perform reasonably well generally but still routinely encounter situations where recognition rates are not yet satisfactory and systematic improvement is challenging. Complications impacting the performance of current approaches include the diversity in visual styles used by various software to render structures, the frequent use of ad hoc annotations, and other challenges related to image quality, including resolution and noise. We present end-to-end deep learning solutions for both segmenting molecular structures from documents and predicting chemical structures from the segmented images. This deep-learning-based approach does not require any handcrafted features, is learned directly from data, and is robust against variations in image quality and style. Using the deep learning approach described herein, we show that it is possible to perform well on both segmentation and prediction of low-resolution images containing moderately sized molecules found in journal articles and patents.
Here, chemoinformatics is considered as a theoretical chemistry discipline complementary to quantum chemistry and force-field molecular modeling. These three fields are compared with respect to molecular representation, inference mechanisms, basic concepts and application areas. A chemical space, a fundamental concept of chemoinformatics, is considered with respect to complex relations between chemical objects (graphs or descriptor vectors). Statistical Learning Theory, one of the main mathematical approaches in structure-property modeling, is briefly reviewed. Links between chemoinformatics and its "sister" fields - machine learning, chemometrics and bioinformatics are discussed.
Substructural fragments are proposed as a simple and safe way to encode molecular structures in a matrix containing the occurrence of fragments of a given type. The knowledge retrieved from QSPR modelling can also be stored in that matrix in addition to the information about fragments. Complex supramolecular systems (using special bond types) and chemical reactions (represented as Condensed Graphs of Reactions, CGR) can be treated similarly. The efficiency of fragments as descriptors has been demonstrated in QSPR studies of aqueous solubility for a diverse set of organic compounds as well as in the analysis of thermodynamic parameters for hydrogen-bonding in some supramolecular complexes. It has also been shown that CGR may be an interesting opportunity to perform similarity searches for chemical reactions. The relationship between the density of information in descriptors/knowledge matrices and the robustness of QSPR models is discussed.
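The idea of a matrix holding fragment occurrences can be shown with a toy example. The sketch below assumes the simplest possible fragment type, pairs of bonded atoms such as `C-O`, and a hypothetical molecule format of an atom-symbol list plus a list of bonded index pairs; the fragment schemes used in the paper are richer than this.

```python
from collections import Counter


def fragment_counts(atoms, bonds):
    """Count bonded atom-pair fragments, e.g. 'C-O', in one molecule."""
    return Counter("-".join(sorted((atoms[i], atoms[j]))) for i, j in bonds)


def fragment_matrix(molecules):
    """Return (fragment names, rows); rows[m][f] is the count of fragment f in molecule m."""
    counts = [fragment_counts(atoms, bonds) for atoms, bonds in molecules]
    columns = sorted(set().union(*counts))
    rows = [[c[name] for name in columns] for c in counts]
    return columns, rows


# Heavy-atom graphs only (hydrogens omitted for brevity).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])
columns, rows = fragment_matrix([ethanol, dimethyl_ether])
```

Here `columns` is `['C-C', 'C-O']` and `rows` is `[[1, 1], [0, 2]]`: the two isomers have the same atoms but different fragment counts, which is what makes such a matrix usable as a descriptor table.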
no abstract.
At the root of applications for substructure and similarity searching, reaction retrieval, synthesis planning, drug discovery, and physicochemical property prediction is the need for a machine-readable representation of a structure. Systematic nomenclature is unsuitable, and notations and fragment codes have been superseded, except in certain specific applications. Connection tables are widely used, but there is no formal standard. Recently the International Union of Pure and Applied Chemistry (IUPAC) International Chemical Identifier (InChI) has started to attract interest. This review also summarizes the representation of chemical reactions and three-dimensional structures.
Research in chemistry increasingly requires interdisciplinary work prompted by, among other things, advances in computing, machine learning, and artificial intelligence. Everyone working with molecules, whether chemist or not, needs an understanding of the representation of molecules in a machine‐readable format, as this is central to computational chemistry. Four classes of representations are introduced: string, connection table, feature‐based, and computer‐learned representations. Three of the most significant representations are simplified molecular‐input line‐entry system (SMILES), International Chemical Identifier (InChI), and the MDL molfile, of which SMILES was the first to successfully be used in conjunction with a variational autoencoder (VAE) to yield a continuous representation of molecules. This is noteworthy because a continuous representation allows for efficient navigation of the immensely large chemical space of possible molecules. Since 2018, when the first model of this type was published, considerable effort has been put into developing novel and improved methodologies. Most, if not all, researchers in the community make their work easily accessible on GitHub, though discussion of computation time and domain of applicability is often overlooked. Herein, we present questions for consideration in future work which we believe will make chemical VAEs even more accessible.
Jmol is a free, open source molecule viewer for chemistry and biochemistry. It is cross-platform, running on Windows, Mac OS X, and Linux/Unix systems. The software consists of three parts: the JmolApplet is a web browser applet that can be integrated into web pages; the Jmol application is a standalone Java application that runs on the desktop; and the JmolViewer is a development tool kit that can be integrated into other Java applications.
The gradual development of systems of typed symbols for describing chemical structures and substructures (radicals) is traced from its beginning in 1787. Such symbols are emphasized because "computers are symbol processors par excellence." Computers are the central tools of the Information Age, but future trends in the management of chemical structure information are likely to reflect deeply rooted human habits of the past, like the use of symbols in music and algebra.
When biological macromolecules are used as therapeutic agents, it is often necessary to introduce non-natural chemical modifications to improve their pharmaceutical properties. The final products are complex structures where entities such as proteins, peptides, oligonucleotides, and small molecule drugs may be covalently linked to each other, or may include chemically modified biological moieties. An accurate in silico representation of these complex structures is essential, as it forms the basis for their electronic registration, storage, analysis, and visualization. The size of these molecules (henceforth referred to as "biomolecules") often makes them too unwieldy and impractical to represent at the atomic level, while the presence of non-natural chemical modifications makes it impossible to represent them by sequence alone. Here we describe the Hierarchical Editing Language for Macromolecules ("HELM") and demonstrate its utility in the representation of structures such as antisense oligonucleotides, short interference RNAs, peptides, proteins, and antibody drug conjugates.
Background: Complex diseases such as cancer are a consequence of numerous causes. State of the art personalised medicine approaches are mostly based on evaluating patients' individual genetic background. Despite the advances of genomics it fails to take individual dynamic influences into account that contribute to the individual and unique glycomic and glycoproteomic “configurations” of every living being. Scope of review: Glycomic and glycoproteomic-based personalised medicine diagnostics are still in their infancies, however some initial success stories indicate that these fields are highly promising to mediate novel early diagnosis and disease stratification markers, subsequently resulting in improved patient well-being and reduced treatment costs. In this review we not only summarise current protein glycosylation based examples that substantially improve or possess great potential for personalised medicine, but also describe current limitations as well as future perspectives and challenges associated with establishing protein glycosylation aspects for this purpose. Major conclusions: Many protein biomarkers currently in clinical use are glycoproteins, however, their glycosylation status is seldom evaluated in a clinical context. To date just few examples have already been successfully translated into clinical practice, making protein glycosylation a highly promising diagnostic target with humongous potential for personalised medicine. General significance: There is an urgent need for markers that enable the establishment of an individualised and optimised patient treatment at the earliest disease stage possible. The glycosylation status of a patient and/or specific marker proteins can provide important clues that result in improved patient management. This article is part of a Special Issue entitled “Glycans in personalised medicine” Guest Editor: Professor Gordan Lauc.
Nowadays, due to the advance of experimental techniques in glycomics, large collections of glycan profiles are regularly published. The rapid growth of available glycan data accentuates the lack of innovative tools for visualizing and exploring large amounts of information. Scientists resort to using general-purpose spreadsheet applications to create ad hoc data visualizations. Thus, results end up being encoded in publication images and text, while valuable curated data is stored in files as supplementary information. To tackle this problem, we have built an interactive pipeline composed of three tools: Glynsight, EpitopeXtractor and Glydin'. Glycan profile data can be imported in Glynsight, which generates a custom interactive glycan profile. Several profiles can be compared and glycan composition is integrated with structural data stored in databases. Glycan structures of interest can then be sent to EpitopeXtractor to perform a glycoepitope extraction. EpitopeXtractor results can be superimposed on the Glydin' glycoepitope network. The network visualization allows fast detection of clusters of glycoepitopes and discovery of potential new targets. Each of these tools is standalone or can be used in conjunction with the others, depending on the data and the specific interest of the user. All the tools composing this pipeline are part of the Glycomics@ExPASy initiative and are available at https://www.expasy.org/glycomics.
Long avoided by chemists and biologists, sugar-based drugs are suddenly on medicine's menu and garnering impressive early reviews.
Protein glycosylation is critically important in vivo; current estimates are that more than half of the proteins in the SWISS-PROT database are glycoproteins. Glycosylation plays a substantial role in a wide range of physiological and pathological processes including development, immunology, cancer, and infectious disease. Protein glycosylation is also vitally important in the development of therapeutic bioproducts. Currently, more than 165 recombinant protein pharmaceuticals are approved for human use, with another 500 in preclinical and clinical trials. Of these, approximately 70% are glycosylated proteins. Glycosylation affects the structure, activity, immunogenicity, protease sensitivity, stability, and biological clearance of glycoproteins. Hence, an understanding of the mechanisms by which proteins are glycosylated, and strategies for analyzing and controlling glycoforms has become increasingly important in the development of biopharmaceuticals. Advances in chromatography and mass spectrometry have permitted more detailed identification of glycans, while cellular and protein engineering strategies have allowed manipulation of the glycoforms. In this chapter, we review the biology of protein glycosylation, methods for identifying and characterizing glycans and glycoproteins, and the effects of host cell line, culture conditions, and cellular engineering on the glycoforms of recombinant glycoproteins, providing a comprehensive overview of glycosylation of recombinant protein therapeutics.
Recent technological advances in glycobiology and glycochemistry are paving the way for a new era in carbohydrate vaccine design. This is enabling greater efficiency in the identification, synthesis and evaluation of unique glycan epitopes found on a plethora of pathogens and malignant cells. Here, we review the progress being made in addressing challenges posed by targeting the surface carbohydrates of bacteria, protozoa, helminths, viruses, fungi and cancer cells for vaccine purposes.
The methodology underpinning the construction, refinement, validation and analysis of atomic models of glycoproteins and protein-carbohydrate complexes has received a long-overdue boost in the last five years. This is a very timely development, as the resolution revolution in electron cryo-microscopy is now routinely delivering structures of key glycomedical importance, with a three-dimensional precision where X-ray crystallographic methods have traditionally floundered. This review will focus on the new software developments that have been introduced in the past two years, and their impact on the field of structural glycobiology in terms of published structures.
Glycosylation is the most prevalent post-translational modification found on proteins, occurring in all domains of life. Ever since the discovery of asparagine-linked (N-linked) protein glycosylation pathways in bacteria, major efforts have been made to harness these systems for the creation of novel therapeutics, vaccines, and diagnostics. Recent advances such as the ability to produce designer glycans in bacteria, some containing unnatural sugars, and techniques for evolving glycosylation enzymes have spawned an entirely new discipline known as bacterial glycoengineering. In addition to their biotechnological and therapeutic potential, bacteria equipped with recombinant N-linked glycosylation pathways are improving our understanding of the N-glycosylation process. This review discusses the key role played by microorganisms in glycosciences, particularly in the context of N-linked glycosylation.
Despite being the most abundant type of biopolymers in Nature, the biological relevance of carbohydrates has systematically been underrated for decades, associating them with far less sophisticated functions (structural or energy sourcing) than those unraveled for polynucleotides and proteins. The inherently large and complex diversity of carbohydrates and glycoconjugates, together with the lack of efficient technologies to either isolate them from natural sources or produce them synthetically in useful amounts, have burdened the appreciation of their utmost importance in the most fundamental biological processes. For these reasons, carbohydrate-mediated transmission of biological information was largely unexplored. However, over the decades, it became clear that the expression of complex carbohydrates is critical in the development of living systems. Nature uses this diverse repertoire of structures as codes in fundamental biological processes such as cellular differentiation, cellular signaling, fertilization or immune response, among many others. The urgency to elucidate the glycan code in terms of structure-function relationships has fuelled chemical biology approaches uncovering new frontiers in molecular biology, for which the term glycobiology had to be coined in the early 1980s. Novel strategies for assembling oligosaccharides, glycoproteins, glycolipids and a range of glycoconjugates have flourished ever since, providing access to glycomaterials for interrogating and interfering with glycan function. This account focuses on the major breakthroughs made during the last decades in strategies to synthetically reproduce the overwhelming glycodiversity, emphasizing the dazzling array of concepts and techniques whose development was required to cope with the task. First, a succinct overview of the structural and functional diversity of biologically relevant saccharides and glycoconjugates will be given. Then, a selection of the most relevant strategies that compose the complex and diversity-oriented toolbox of modern carbohydrate synthesis will be dissected. Finally, a selection of the most recent applications of this synthetic toolbox to chemical biology will be captured.
Like the other biopolymers (proteins and nucleic acids), glycans come in a diversity of structures that underlie a vast array of biological functions. To understand these functions at a molecular level, we must first understand glycan structures at a chemical level. This chapter begins with an introduction to the building blocks (monosaccharides) that are assembled to generate more complex glycans. After a brief summary of nomenclature, we present the salient chemical features of the monosaccharides that define their structural diversity, with an emphasis on stereochemical properties. We then illustrate how monosaccharide diversity, combined with a multiplicity of ways in which they can be linked together, can create the wealth of glycan structures found in nature. An understanding of the structural features that distinguish glycans from other biopolymers will help the reader to appreciate the origin of their biological capabilities.
Many glycans show remarkably discontinuous distribution across evolutionary lineages. These differences play major roles when organisms belonging to different lineages interact as host-pathogen or host-symbiont. Certain lineage-specific glycans have become important signals for multicellular host organisms, which use them as molecular signatures of their pathogens and symbionts through recognition by a toolkit of innate defense molecules. In turn, pathogens have evolved to exploit host lineage-specific glycans and are constantly shaping the glycomes of their hosts. These interactions take place in the face of numerous critical endogenous functions played by glycans within host organisms. Whether due to simple evolutionary divergence or adaptive changes under natural selection resulting from endogenous functional requirements, once different lineages elaborate on differential glycomes these mutual differences provide opportunities for host exploitation and/or pathogen defense between lineages. Such phylogenetic molecular recognition mechanisms will augment and likely contribute to the maintenance of lineage-specific differences in glycan repertoires.
Elaboration of a capsule composed of one of a range of acidic polysaccharides is a common feature of many bacteria, particularly those capable of causing serious infections in humans. Biochemical and genetical analyses of capsule biogenesis in Escherichia coli are beginning to reveal new aspects of polysaccharide biosynthesis. Genes have been identified which are thought to encode products responsible for the translocation of these high molecular-weight polysaccharides across the cytoplasmic and outer membranes, and the organization of exported polysaccharide into a capsule. Their further analysis should provide new insights into membrane biology, particularly since the genes in question are absent from the often used laboratory strains of E. coli. Genetic analysis of capsule diversity is beginning to suggest possible mechanisms for the generation of the structural diversity of polysaccharides.
Adaptive immune responses have long been considered the "territory" of antigenic proteins, whereas carbohydrates are characterized as T-cell-independent antigens that are not typically recognized by the complete adaptive machinery. The current modus operandi when searching for dominant epitopes is the use of synthetic peptides designed from the primary structure of interesting target proteins; however, there is growing evidence that sugars can also play a critical role in immune recognition. Findings reported in this issue of the European Journal of Immunology begin to shed light on the differences in protein glycosylation that can occur in association with disorders like rheumatoid arthritis and the effect these changes have on collagen recognition by the immune system. Other recent studies have shown that immunodominant glycopeptide "remnant" epitopes as well as glycosylation changes on self-proteins can generate autoimmunity. Finally, some types of carbohydrates are now known to be processed and presented to T cells by class II MHC. Taken together, these advances illustrate a clear importance for carbohydrate recognition in foreign and self antigens by the adaptive immune system. With the common presence of carbohydrate molecules on eukaryotic, prokaryotic, and viral surfaces, the impact of carbohydrates on adaptive immunity is now indisputable.
Carbohydrates represent one of the building blocks of life, along with nucleic acids, proteins and lipids. Although glycans are involved in a wide range of processes from embryogenesis to protein trafficking and pathogen infection, we are still a long way from deciphering the glycocode. In this review, we aim to present a few of the challenges that researchers working in the area of glycobiology can encounter and what strategies can be utilised to overcome them. Our goal is to paint a comprehensive picture of the current saccharide landscape available in the Protein Data Bank (PDB). We also review recently updated repositories relevant to the topic proposed, the impact of software development on strategies to structurally solve carbohydrate moieties, and state-of-the-art molecular and cellular biology methods that can shed some light on the function and structure of glycans.
Glycosylation is the process of attachment of carbohydrates (glycans) to proteins and lipids to form the glycoproteins and glycolipids found in eukaryotic organisms. Glycosylation reactions are amongst the most common posttranslational modifications that have been reported on proteins, and they function to alter protein size, stability, charge and antigenicity. Glycosylation plays an important role in cancer, inherited diseases, pathogen–host interactions and immune recognition. Key Concepts: The analysis of glycosylation is more complex than DNA and RNA or protein analysis as glycosylation is a non‐template‐driven event. N‐linked glycosylation commences in the endoplasmic reticulum and is completed in the Golgi apparatus. O‐linked glycosylation commences and is completed in the Golgi apparatus. N‐linked glycosylation occurs on asparagine residues in the consensus sequon Asn‐X‐Ser/Thr. O‐linked glycosylation occurs on serine and threonine residues. Many enzymes are involved in both N‐ and O‐linked glycosylation. Glycans play a role in sperm–egg interactions, the implantation process and embryonic development. Congenital disorders in glycosylation result from autosomally dominant mutations in enzymes involved in glycosylation pathways. Glycosylation of proteins influences their half‐life in the serum. N‐ and O‐linked glycosylation have been reported to be altered in cancer and implicated in all of the steps in the metastatic cascade.
Carbohydrates are the most abundant natural products. Besides their role in metabolism and as structural building blocks, they are fundamental constituents of every cell surface, where they are involved in vital cellular recognition processes. Carbohydrates are a relatively untapped source of new drugs and therefore offer exciting new therapeutic opportunities. Advances in the functional understanding of carbohydrate-protein interactions have enabled the development of a new class of small-molecule drugs, known as glycomimetics. These compounds mimic the bioactive function of carbohydrates and address the drawbacks of carbohydrate leads, namely their low activity and insufficient drug-like properties. Here, we examine examples of approved carbohydrate-derived drugs, discuss the potential of carbohydrate-binding proteins as new drug targets (focusing on the lectin families) and consider ways to overcome the challenges of developing this unique class of novel therapeutics.
no abstract.
This chapter provides an overview of glycosylation patterns across biological taxa and discusses glycan complexity and diversity from an evolutionary perspective. As much of the currently available information concerns vertebrates, this chapter emphasizes comparisons between vertebrate glycans and those of other taxa. Evolutionary processes that likely determine generation of glycan diversity are briefly considered, including intrinsic host glycan-binding protein functions and interactions of hosts with extrinsic pathogens or symbionts.
The oligosaccharide chains (glycans) attached to cell surface and extracellular proteins and lipids are known to mediate many important biological roles. However, for many glycans, there are still no evident functions that are of obvious benefit to the organism that synthesizes them. There is also no clear explanation for the extreme complexity and diversity of glycans that can be found on a given glycoconjugate or cell type. Based on the limited information available about the scope and distribution of this diversity among taxonomic groups, it is difficult to see clear trends or patterns consistent with different evolutionary lineages. It appears that closely related species may not necessarily share close similarities in their glycan diversity, and that more derived species may have simpler as well as more complex structures. Intraspecies diversity can also be quite extensive, often without obvious functional relevance. We suggest one general explanation for these observations, that glycan diversification in complex multicellular organisms is driven by evolutionary selection pressures of both endogenous and exogenous origin. We argue that exogenous selection pressures mediated by viral and microbial pathogens and parasites that recognize glycans have played a more prominent role, favoring intra- and interspecies diversity. This also makes it difficult to appreciate and elucidate the specific endogenous roles of the glycans within the organism that synthesizes them.
Based on important cell-biological and biochemical results concerning the structural difference between membrane glycoproteins of normal epithelial cells and epithelial tumour cells, tumour-associated glycopeptide antigens have been chemically synthesised and structurally confirmed. Glycopeptide structures of the tandem repeat sequence of mucin MUC1 of epithelial tumour cells constitute the most promising tumour-associated antigens. In order to generate sufficient immunogenicity of these endogenous structures, which are usually tolerated by the immune system, these synthetic glycopeptide antigens were conjugated to immune stimulating components: in fully synthetic two-component vaccines either with T-cell peptide epitopes or with Toll-like receptor 2 lipopeptide ligands, or in three-component vaccines with both these stimulants. Alternatively, the synthetic glycopeptide antigens were coupled to immune stimulating carrier proteins. In particular, MUC1 glycopeptide conjugates with Tetanus toxoid proved to be efficient vaccines inducing very strong immune responses in mice. The antibodies elicited with the fully synthetic vaccines showed selective recognition of the tumour-associated glycopeptides, as was shown by neutralisation and micro-array binding experiments. After booster immunisations, most of the immune responses showed the installation of an immunological memory. Immunisation with fully synthetic three-component vaccines induced immune reactions with therapeutic effects in terms of reduction of the tumour burden in mice or in killing of tumour cells in culture, while MUC1 glycopeptide-Tetanus toxoid vaccines elicited antibodies in mice which recognised tumour cells in human tumour tissues. The results achieved so far are considered to be promising for the development of an active immunisation against tumours.
Defining how a glycan-binding protein (GBP) specifically selects its cognate glycan from among the ensemble of glycans within the cellular glycome is an area of intense study. Powerful insight into recognition mechanisms can be gained from 3D structures of GBPs complexed to glycans; however, such structures remain difficult to obtain experimentally. Here an automated 3D structure generation technique, called computational carbohydrate grafting, is combined with the wealth of specificity information available from glycan array screening. Integration of the array data with modeling and crystallography allows generation of putative co-complex structures that can be objectively assessed and iteratively altered until a high level of agreement with experiment is achieved. Given an accurate model of the co-complexes, grafting is also able to discern which binding determinants are active when multiple potential determinants are present within a glycan. In some cases, induced fit in the protein or glycan was necessary to explain the observed specificity, while in other examples a revised definition of the minimal binding determinants was required. When applied to a collection of 10 GBP-glycan complexes, for which crystallographic and array data have been reported, grafting provided a structural rationalization for the binding specificity of >90% of 1223 arrayed glycans. A webtool that enables researchers to perform computational carbohydrate grafting is available at www.glycam.org/gr (accessed 03 March 2016).
Cell-surface glycans are a diverse class of macromolecules that participate in many key biological processes, including cell-cell communication, development, and disease progression. Thus, the ability to modulate the structures of glycans on cell surfaces provides a powerful means not only to understand fundamental processes but also to direct activity and elicit desired cellular responses. Here, we describe methods to sculpt glycans on cell surfaces and highlight recent successes in which artificially engineered glycans have been employed to control biological outcomes such as the immune response and stem cell fate.
Reliable and rapid access to defined biopolymers by automated DNA and peptide synthesis has fundamentally altered biological research and medical practice. Similarly, the procurement of defined glycans is key to establishing structure-activity relationships and thereby progress in the glycosciences. Here, we describe the rapid assembly of oligosaccharides using the commercially available Glyconeer 2.1 automated glycan synthesizer, monosaccharide building blocks, and a linker-functionalized polystyrene solid support. Purification and quality-control protocols for the oligosaccharide products have been standardized. Synthetic glycans prepared in this way are useful reagents as the basis for glycan arrays, diagnostics, and carbohydrate-based vaccines.
Carbohydrates are highly abundant biomolecules found extensively in nature. Besides playing important roles in energy storage and supply, they often serve as essential biosynthetic precursors or structural elements needed to sustain all forms of life. A number of unusual sugars that have certain hydroxyl groups replaced by a hydrogen, an amino group, or an alkyl side chain play crucial roles in determining the biological activity of the parent natural products in bacterial lipopolysaccharides or secondary metabolite antibiotics. Recent investigation of the biosynthesis of these monosaccharides has led to the identification of the gene clusters whose protein products facilitate the unusual sugar formation from the ubiquitous NDP-glucose precursors. This review summarizes the mechanistic studies of a few enzymes crucial to the biosynthesis of C-2, C-3, C-4, and C-6 deoxysugars, the characterization and mutagenesis of nucleotidyl transferases that can recognize and couple structural analogs of their natural substrates and the identification of glycosyltransferases with promiscuous substrate specificity. Information gleaned from these studies has allowed pathway engineering, resulting in the creation of new macrolides with unnatural deoxysugar moieties for biological activity screening. This represents a significant progress toward our goal of searching for more potent agents against infectious diseases and malignant tumors.
This chapter contains sections titled: - Introduction - Monosaccharide Nomenclature and Diversity - The Oligosaccharide Assembly Level - Major Empirical Structural Classifications.
Structural diversity of carbohydrates plays a crucial role in their large variety of roles in biological systems. This paper focuses on aspects of structure and biological functions of three classes of carbohydrates, N-linked oligosaccharides, blood group oligosaccharides and glycosaminoglycans. Conformations and dynamics in solution, as well as structure of protein-carbohydrate complexes are discussed. A short overview also describes theoretical and experimental methodologies that are used in current glycobiological research, particularly high-resolution NMR spectroscopy, X-ray crystallography and methods of computational chemistry.
Glycosylation is the most important posttranslational modification occurring mainly in the cytosol, the endoplasmic reticulum, the Golgi apparatus and the sarcolemmal membrane. A rapidly growing family of genetic diseases is due to defects in protein N- and O-glycosylation, glycosylphosphatidylinositol glycosylation, and lipid glycosylation (congenital disorders of glycosylation (CDG)). Most CDG are severe, multisystem diseases with important neurological involvement. Some 76 CDG have been identified. Screening methods are limited to serum transferrin isoelectrofocusing (for protein N-glycosylation disorders with sialic acid deficiency) and serum apolipoprotein C-III isoelectrofocusing (for core 1 mucin-type O-glycosylation disorders). Whole exome/genome sequencing is increasingly used in the diagnostic work-up of patients with an unidentified CDG. Only one CDG is efficiently treatable namely MPI-CDG.
Glycoconjugate vaccines, in which a cell surface carbohydrate from a micro-organism is covalently attached to an appropriate carrier protein, are proving to be the most effective means to generate protective immune responses to prevent a wide range of diseases. The technology appears to be generic and applicable to a wide range of pathogens, as long as antibodies against surface carbohydrates help protect against infection. Three such vaccines, against Haemophilus influenzae type b, Neisseria meningitidis Group C and seven serotypes of Streptococcus pneumoniae, have already been licensed and many others are in development. This article discusses the rationale for the development and use of glycoconjugate vaccines, the mechanisms by which they elicit T cell-dependent immune responses and the implications of this for vaccine development, the role of physicochemical methods in the characterisation and quality control of these vaccines, and the novel products which are under development.
The effects of the backbone and side chain on the molecular environments in the chiral cavities of three commercially important polysaccharide-based chiral sorbents--cellulose tris(3,5-dimethylphenylcarbamate) (CDMPC), amylose tris(3,5-dimethylphenylcarbamate) (ADMPC), and amylose tris[(S)-alpha-methylbenzylcarbamate] (ASMBC)--are studied by attenuated total reflection infrared spectroscopy (ATR-IR), X-ray diffraction (XRD), 13C cross-polarization/magic-angle spinning (CP/MAS) and MAS solid-state NMR, and density functional theory (DFT) modeling. These sorbents are used widely in preparative-scale chiral separations. ATR-IR is used to determine how the H-bonding states of the C=O and NH groups of the polymer depend on the backbone and side chain. The changes in the polymer crystallinity are characterized with XRD. The changes in the polymer helicity and molecular mobility for polymer-coated silica beads (commercially called Chiralcel OD, Chiralpak AD, and Chiralpak AS) are probed with 13C CP/MAS and MAS solid-state NMR. The IR wavenumbers and the NMR chemical shifts for the polymer backbone monomers and dimers and the side chains are predicted at the DFT/B3LYP/6-311+G(d,p) level of theory. It is concluded that the molecular environments of the C=O, NH, and phenyl groups show significant differences in intramolecular and intermolecular interactions and in the nanostructures of the chiral cavities of these biopolymers. These results have implications for understanding how the molecular environments of chiral cavities of these polymers affect their molecular recognition mechanisms.
Glycan structural analysis and glycan synthesis are important for the development of glycan-related therapeutics and vaccines. The utility of glycoenzymes, including glycosidases and glycosyltransferases, for glycan analysis and synthesis has been increasing because these enzymes have specificity for their substrates. Glycosidase catalyzes the hydrolysis of a glycosidic linkage and is generally used to determine the presence of specific glycosidic linkages. Glycosyltransferase catalyzes the transfer of monosaccharide from a glycosyl donor substrate onto an acceptor to form a new specific glycosidic linkage. Enzyme-catalyzed glycan synthesis is often superior to chemical synthesis in view of its regio- and stereospecificity and technical simplicity, although there are still some limitations. Moreover, transglycosylation activity of glycosidase and glycosynthase is also useful for the glycan synthesis. The reservoir of glycoenzymes as an array of useful tools for glycan manipulation will be further enriched by an increase in number of sequenced genome, which will be then accelerating their applications and also enhancing the feasibility in development of glycan-related therapeutics.
The cell wall is composed of a polysaccharide-based three-dimensional network. Considered for a long time as an inert exoskeleton, the cell wall is now seen as a dynamic structure that is continuously changing as a result of the modification of culture conditions and environmental stresses. Although the cell wall composition varies among fungal species, chemogenomic comparative analyses have led to a better understanding of the genes and mechanisms involved in the construction of the common central core composed of branched beta1,3 glucan-chitin. Because of its essential biological role, unique biochemistry and structural organization and the absence in mammalian cells of most of its constitutive components, the cell wall is an attractive target for the development of new antifungal agents. Genomic as well as drug studies have shown that the death of the fungus can result from inhibition of cell wall polysaccharide synthases. To date, only beta1,3 glucan synthase inhibitors have been launched clinically and many more targets remain to be explored.
Carbohydrates have been in the center of human interest since the very beginning of civilization, but we have only recently started to understand the importance of complex oligosaccharides (glycans) attached to protein or lipid backbones. This is perhaps not surprising, since the branched structures of sugars make analysis of glycoconjugates significantly more challenging than the analysis of linear DNA and protein sequences. A typical glycan is a complex molecule containing between 10 and 15 monosaccharides linked in a rather complicated manner that many of us have not yet learned to interpret. Two or more such glycans are attached to the protein backbone of an average glycoprotein, and since there is no genetic blueprint for glycans, individual glycan structures can vary depending on the current level of expression and intracellular localization of biosynthetic enzymes (glycosyltransferases and glycosidases). Consequently, slightly different glycan structures can be attached to the same protein backbone, and after the glycosylation of a protein is completed, proteins with the same amino acid sequence can end up in one of several hundred possible glycoforms. Since naturally glycosylated proteins still cannot be produced in vitro, structural analysis has to be performed on small quantities of glycoproteins that can be isolated from nature, and considering the complexity of glycosylation, this can be a formidable task.
Glycans are expressed on the surface of nearly all host and bacterial cells. Not surprisingly, glycan-mediated molecular interactions play a vital role in bacterial pathogenesis and host responses against pathogens. Glycan-mediated host-pathogen interactions can benefit the pathogen, host, or both. Here, we discuss (i) bacterial glycans that play a critical role in bacterial colonization and/or immune evasion, (ii) host glycans that are utilized by bacteria for pathogenesis, and (iii) bacterial and host glycans involved in immune responses against pathogens. We further discuss (iv) opportunities and challenges for transforming these research findings into more effective antibacterial strategies, and (v) technological advances in glycoscience that have helped to accelerate progress in research. These studies collectively offer valuable insights into new perspectives on antibacterial strategies that may effectively tackle the drug-resistant pathogens that are rapidly spreading globally.
no abstract.
Carbohydrates are ubiquitous in nature and present across all kingdoms of life – bacteria, fungi, viruses, yeast, plants, animals and humans. They are essential to many biological processes. However, due to their complexity and heterogeneous nature they are often neglected, sometimes referred to as the ‘dark matter’ of biology. Nevertheless, due to their extensive biological impact on health and disease, glycans and the field of glycobiology have become increasingly popular in recent years, giving rise to glycan-based drug development and therapeutics. Forecasting of communicable diseases predicts that we will see an increase in pandemics of humans and livestock due to global loss of biodiversity from changes to land use, intensification of agriculture, climate change and disruption of ecosystems. As such, the development of point-of-care devices to detect pathogens is vital to prevent the transmission of infectious disease, as we have seen with the COVID-19 pandemic. So, can glycans be exploited to detect COVID-19 and other infectious diseases? And is this technology sensitive and accurate? Here, I discuss the structure and function of glycans, the current glycan-based therapeutics and how glycan binding can be exploited for detection of infectious disease, like COVID-19.
The four essential building blocks of cells are proteins, nucleic acids, lipids, and glycans. Also referred to as carbohydrates, glycans are composed of saccharides that are typically linked to lipids and proteins in the secretory pathway. Glycans are highly abundant and diverse biopolymers, yet their functions have remained relatively obscure. This is changing with the advent of genetic reagents and techniques that in the past decade have uncovered many essential roles of specific glycan linkages in living organisms. Glycans appear to modulate biological processes in the development and function of multiple physiologic systems, in part by regulating protein-protein and cell-cell interactions. Moreover, dysregulation of glycan synthesis represents the etiology for a growing number of human genetic diseases. The study of glycans, known as glycobiology, has entered an era of renaissance that coincides with the acquisition of complete genome sequences for multiple organisms and an increased focus upon how posttranslational modifications to protein contribute to the complexity of events mediating normal and disease physiology. Glycan production and modification comprise an estimated 1% of genes in the mammalian genome. Many of these genes encode enzymes termed glycosyltransferases and glycosidases that reside in the Golgi apparatus where they play the major role in constructing the glycan repertoire that is found at the cell surface and among extracellular compartments. We present a review of the recently established functions of glycan structures in the context of mammalian genetic studies focused upon the mouse and human species. Nothing tends so much to the advancement of knowledge as the application of a new instrument. The native intellectual powers of men in different times are not so much the causes of the different success of their labours, as the peculiar nature of the means and artificial resources in their possession. T. Hager: Force of Nature (1).
Glycosylation is one of the most common post-translational modifications in eukaryotic cells and plays important roles in many biological processes, such as the immune response and protein quality control systems. It has been notoriously difficult to study glycoproteins by X-ray crystallography since the glycan moieties usually have a heterogeneous chemical structure and conformation, and are often mobile. Nonetheless, recent technical advances in glycoprotein crystallography have accelerated the accumulation of 3D structural information. Statistical analysis of "snapshots" of glycoproteins can provide clues to understanding their structural and dynamic aspects. In this review, we provide an overview of crystallographic analyses of glycoproteins, in which electron density of the glycan moiety is clearly observed. These well-defined N-glycan structures are in most cases attributed to carbohydrate-protein and/or carbohydrate-carbohydrate interactions and may function as "molecular glue" to help stabilize inter- and intra-molecular interactions. However, the more mobile N-glycans on cell surface receptors, the electron density of which is usually missing on X-ray crystallography, seem to guide the partner ligand to its binding site and prevent irregular protein aggregation by covering oligomerization sites away from the ligand-binding site.
Glycosylation produces an abundant, diverse, and highly regulated repertoire of cellular glycans that are frequently attached to proteins and lipids. The past decade of research on glycan function has revealed that the enzymes responsible for glycosylation—the glycosyltransferases and glycosidases—are essential in the development and physiology of living organisms. Glycans participate in many key biological processes including cell adhesion, molecular trafficking and clearance, receptor activation, signal transduction, and endocytosis. This review discusses the increasingly sophisticated molecular mechanisms being discovered by which mammalian glycosylation governs physiology and contributes to disease.
Access to complex carbohydrates remains a limiting factor for the development of the glycosciences. Automated glycan assembly (AGA) has accelerated and simplified the synthetic process and, with the first commercially available instrument and building blocks, glycan synthesis can now be practiced by any chemist. All classes of glycans, including sulfated or sialylated carbohydrates and polysaccharides as long as 50mers are now accessible owing to optimized reaction conditions and new methodologies. These synthetic glycans have helped to understand many biological functions and to advance diagnostic and vaccine development. Establishing detailed structure-function relationships will eventually enable the production of unnatural materials with tuned properties.
Glycan structural information is a prerequisite for elucidation of carbohydrate function in biological systems. To this end we employ a tripod approach for investigation of carbohydrate 3D structure and dynamics based on organic synthesis; different experimental spectroscopy techniques, NMR being of prime importance; and molecular simulations using, in particular, molecular dynamics (MD) simulations. The synthesis of oligosaccharides in the form of glucosyl fluorides is described, and their use as substrates for the Lam16A E115S glucosyl synthase is exemplified as well as a conformational analysis of a cyclic β-(1→3)-heptaglucan based on molecular simulations. The flexibility of the N-acetyl group of aminosugars is by MD simulations indicated to function as a gatekeeper for transitions of glycosidic torsion angles to other regions of conformational space. A novel approach to visualize glycoprotein (GP) structures is presented in which the protein is shown by, for example, ribbons, but instead of stick or space-filling models for the carbohydrate portion it is visualized by the colored geometrical figures known as CFG representation in a 3D way, which we denote 3D-CFG, thereby effectively highlighting the sugar residues of the glycan part of the GP and the position(s) on the protein.
S.C. Purcell, K. Godula.
Synthetic glycoscapes: addressing the structural and functional complexity of the glycocalyx.
Interface Focus, 2019. 9(2): ID 20180080.
DOI: 10.1098/rsfs.2018.0080
The glycocalyx is an information-dense network of biomacromolecules extensively modified through glycosylation that populates the cellular boundary. The glycocalyx regulates biological events ranging from cellular protection and adhesion to signalling and differentiation. Owing to the characteristically weak interactions between individual glycans and their protein binding partners, multivalency of glycan presentation is required for the high-avidity interactions needed to trigger cellular responses. As such, biological recognition at the glycocalyx interface is determined by both the structure of glycans that are present as well as their spatial distribution. While genetic and biochemical approaches have proven powerful in controlling glycan composition, modulating the three-dimensional complexity of the cell-surface 'glycoscape' at the sub-micrometre scale remains a considerable challenge in the field. This focused review highlights recent advances in glycocalyx engineering using synthetic nanoscale glycomaterials, which allows for controlled de novo assembly of complexity with precision not accessible with traditional molecular biology tools. We discuss several exciting new studies in the field that demonstrate the power of precision glycocalyx editing in living cells in revealing and controlling the complex mechanisms by which the glycocalyx regulates biological processes.
The function of deciphering the biological information encoded by the glycome, which is the entire repertoire of complex sugar structures expressed by cells and tissues, is assigned in part to endogenous glycan-binding proteins or lectins. Galectins, a family of animal lectins that bind N-acetyllactosamine-containing glycans, have many roles in diverse immune cell processes, including those relevant to pathogen recognition, shaping the course of adaptive immune responses and fine-tuning the inflammatory response. How do galectins translate glycan-encoded information into tolerogenic or inflammatory cell programmes? An improved understanding of the mechanisms underlying these functions will provide further opportunities for developing new therapies based on the immunoregulatory properties of this multifaceted protein family.
The glycome describes the complete repertoire of glycoconjugates composed of carbohydrate chains, or glycans, that are covalently linked to lipid or protein molecules. Glycoconjugates are formed through a process called glycosylation and can differ in their glycan sequences, the connections between them and their length. Glycoconjugate synthesis is a dynamic process that depends on the local milieu of enzymes, sugar precursors and organelle structures as well as the cell types involved and cellular signals. Studies of rare genetic disorders that affect glycosylation first highlighted the biological importance of the glycome, and technological advances have improved our understanding of its heterogeneity and complexity. Researchers can now routinely assess how the secreted and cell-surface glycomes reflect overall cellular status in health and disease. In fact, changes in glycosylation can modulate inflammatory responses, enable viral immune escape, promote cancer cell metastasis or regulate apoptosis; the composition of the glycome also affects kidney function in health and disease. New insights into the structure and function of the glycome can now be applied to therapy development and could improve our ability to fine-tune immunological responses and inflammation, optimize the performance of therapeutic antibodies and boost immune responses to cancer. These examples illustrate the potential of the emerging field of 'glycomedicine'.
Carbohydrates are one of the most abundant and widely distributed organic compounds found on the Earth. They are ubiquitous, being found as important constituents in plants, animals, and microorganisms. They are formed de novo in the process of photosynthesis. The process uses the energy of sunlight to combine carbon dioxide with water to form carbohydrate and molecular oxygen as abbreviated in the following reaction: 6 CO2 + 6 H2O → C6H12O6 + 6 O2. Carbohydrates thus represent the conservation of the energy of the sun as chemical energy and serve as the major source of energy for nonphotosynthesizing organisms. Therefore, they must have been very early products in the evolution of life.
Carbohydrates are the most abundant biopolymers on earth and part of every living creature. Glycans are essential as materials for nutrition and for information transfer in biological processes. To date, in few cases a detailed correlation between glycan structure and glycan function has been established. A molecular understanding of glycan function will require pure glycans for biological, immunological, and structural studies. Given the immense structural complexity of glycans found in living organisms and the lack of amplification methods or expression systems, chemical synthesis is the only means to access usable quantities of pure glycan molecules. While the solid-phase synthesis of DNA and peptides has become routine for decades, access to glycans has been technically difficult, time-consuming and confined to a few expert laboratories. In this Account, the development of a comprehensive approach to the automated synthesis of all classes of mammalian glycans, including glycosaminoglycans and glycosylphosphatidyl inositol (GPI) anchors, as well as bacterial and plant carbohydrates is described. A conceptual advance concerning the logic of glycan assembly was required in order to enable automated execution of the synthetic process. Based on the central glycosidic bond forming reaction, a general concept for the protecting groups and leaving groups has been developed. Building blocks that can be procured on large scale, are stable for prolonged periods of time, but upon activation result in high yields and selectivities were identified. A coupling-capping and deprotection cycle was invented that can be executed by an automated synthesis instrument. Straightforward postsynthetic protocols for cleavage from the solid support as well as purification of conjugation-ready oligosaccharides have been established. 
Introduction of methods to install selectively a wide variety of glycosidic linkages has enabled the rapid assembly of linear and branched oligo- and polysaccharides as large as 30-mers. Fast, reliable access to defined glycans that are ready for conjugation has given rise to glycan arrays, glycan probes, and synthetic glycoconjugate vaccines. While an ever increasing variety of glycans are accessible by automated synthesis, further methodological advances in carbohydrate chemistry are needed to make all possible glycans found in nature. These tools begin to fundamentally impact the medical but also materials aspects of the glycosciences.
Sugars and chains of sugar units are the most abundant constituent of living matter. New carbohydrates are still being discovered, as are new roles for them in normal biological processes and disease.
Glycans, oligo- and polysaccharides secreted or attached to proteins and lipids, cover the surfaces of all cells and have a regulatory capacity and structural diversity beyond any other class of biological molecule. Glycans may have evolved these properties because they mediate cellular interactions and often face pressure to evolve new functions rapidly. We approach this idea two ways. First, we discuss evolutionary innovation. Glycan synthesis, regulation, and mode of chemical interaction influence the spectrum of new forms presented to evolution. Second, we describe the evolutionary conflicts that arise when alleles and individuals interact. Glycan regulation and diversity are integral to these biological negotiations. Glycans are tasked with such an amazing diversity of functions that no study of cellular interaction can begin without considering them. We propose that glycans predominate the cell surface because their physical and chemical properties allow the rapid innovation required of molecules on the frontlines of evolutionary conflict.
Simple and complex carbohydrates (glycans) have long been known to play major metabolic, structural and physical roles in biological systems. Targeted microbial binding to host glycans has also been studied for decades. But such biological roles can only explain some of the remarkable complexity and organismal diversity of glycans in nature. Reviewing the subject about two decades ago, one could find very few clear-cut instances of glycan-recognition-specific biological roles of glycans that were of intrinsic value to the organism expressing them. In striking contrast there is now a profusion of examples, such that this updated review cannot be comprehensive. Instead, a historical overview is presented, broad principles outlined and a few examples cited, representing diverse types of roles, mediated by various glycan classes, in different evolutionary lineages. What remains unchanged is the fact that while all theories regarding biological roles of glycans are supported by compelling evidence, exceptions to each can be found. In retrospect, this is not surprising. Complex and diverse glycans appear to be ubiquitous to all cells in nature, and essential to all life forms. Thus, >3 billion years of evolution consistently generated organisms that use these molecules for many key biological roles, even while sometimes coopting them for minor functions. In this respect, glycans are no different from other major macromolecular building blocks of life (nucleic acids, proteins and lipids), simply more rapidly evolving and complex. It is time for the diverse functional roles of glycans to be fully incorporated into the mainstream of biological sciences.
Many different theories have been advanced concerning the biological roles of the oligosaccharide units of individual classes of glycoconjugates. Analysis of the evidence indicates that while all of these theories are correct, exceptions to each can also be found. The biological roles of oligosaccharides appear to span the spectrum from those that are trivial, to those that are crucial for the development, growth, function or survival of an organism. Some general principles emerge. First, it is difficult to predict a priori the functions a given oligosaccharide on a given glycoconjugate might be mediating, or their relative importance to the organism. Second, the same oligosaccharide sequence may mediate different functions at different locations within the same organism, or at different times in its ontogeny or life cycle. Third, the more specific and crucial biological roles of oligosaccharides are often mediated by unusual oligosaccharide sequences, unusual presentations of common terminal sequences, or by further modifications of the sugars themselves. However, such oligosaccharide sequences are also more likely to be targets for recognition by pathogenic toxins and microorganisms. As such, they are subject to more intra- and inter-species variation because of ongoing host-pathogen interactions during evolution. In the final analysis, the only common features of the varied functions of oligosaccharides are that they either mediate 'specific recognition' events or that they provide 'modulation' of biological processes. In so doing, they generate much of the functional diversity required for the development and differentiation of complex organisms, and for their interactions with other organisms in the environment.
This chapter contains sections titled: - Introduction - Carbohydrate Research: Main Steps of Development and Discovery of Their Functionalities - Development of Nanoparticular Drug Carriers: Current States and Limitations - Integration of Carbohydrate in the Design of Nanoparticle Drug Carriers With New Functionalities - Enhancement of Carbohydrate Therapeutic Activity by Association With Nanoparticle Delivery Systems - Nanoparticle Drug Carriers Made of Carbohydrates for the Delivery of Fragile Molecules - Conclusion.
Glycomics or the study of structure-function relationships of complex glycans has reshaped post-genomics biology. Glycans mediate fundamental biological functions via their specific interactions with a variety of proteins. Recognizing the importance of glycomics, large-scale research initiatives such as the Consortium for Functional Glycomics (CFG) were established to address these challenges. Over the past decade, the Consortium for Functional Glycomics (CFG) has generated novel reagents and technologies for glycomics analyses, which in turn have led to generation of diverse datasets. These datasets have contributed to understanding glycan diversity and structure-function relationships at molecular (glycan-protein interactions), cellular (gene expression and glycan analysis), and whole organism (mouse phenotyping) levels. Among these analyses and datasets, screening of glycan-protein interactions on glycan array platforms has gained much prominence and has contributed to cross-disciplinary realization of the importance of glycomics in areas such as immunology, infectious diseases, cancer biomarkers, etc. This manuscript outlines methodologies for capturing data from glycan array experiments and online tools to access and visualize glycan array data implemented at the CFG.
The peptidoglycan (murein) sacculus is a unique and essential structural element in the cell wall of most bacteria. Made of glycan strands cross-linked by short peptides, the sacculus forms a closed, bag-shaped structure surrounding the cytoplasmic membrane. There is a high diversity in the composition and sequence of the peptides in the peptidoglycan from different species. Furthermore, in several species examined, the fine structure of the peptidoglycan significantly varies with the growth conditions. Limited number of biophysical data on the thickness, elasticity and porosity of peptidoglycan are available. The different models for the architecture of peptidoglycan are discussed with respect to structural and physical parameters.
A methylation-analysis procedure has been developed by which the glycosyl-linkage compositions of microgram quantities of complex carbohydrates, including those containing hexosyluronic acid residues, can be determined. The effectiveness of the procedure was demonstrated by correctly determining the glycosyl-linkage compositions of 1 μg of a disaccharide and 5 μg of an acidic polysaccharide whose structures were unknown to the analyst. The development of a new technique, namely, reversed-phase chromatography on Sep-Pak C18 cartridges, to recover and purify microgram quantities of per-O-methylated complex carbohydrates from methylation-reaction mixtures, was critical to the success of the microscale procedure. The use of gas-liquid chromatography-mass spectrometry with multiple, selected-ion monitoring was also essential for identification and semiquantitation of the partially O-methylated alditol acetates derived from 1 to 5 μg of a complex carbohydrate.
Experimental techniques to identify and quantify glycan structures in a given sample are continuously improving. However, as they advance data analysis and annotation seems to become more complex. To address this issue, much progress has been made in developing software for interpretation of quantitative glycan profiles. Here, we focus on these informatics tools for high/ultra performance liquid chromatography (H/UPLC), mass spectrometry (MS), tandem mass spectrometry (MS(n)) and combinations thereof. Software for biomarker discovery, pathway, genomic and disease analysis and a final note on some future prospects for glycoinformatics are also mentioned.
Carbohydrates in the form of capsular polysaccharides and/or lipopolysaccharides are the major components on the surface of bacteria. These molecules are important virulence factors in many bacteria isolated from infected persons. Immunity against these components confers protection against the disease. However, developing vaccines based on polysaccharides is difficult and several problems have to be solved. First of all, most of the bacterial polysaccharides are T-lymphocyte independent antigens. Anti-polysaccharide immune response is characterised by lack of T-lymphocyte memory, isotype restriction and delayed ontogeny. Children below 2 years of age and elderly respond poorly to polysaccharide antigens. Secondly, the wide structural heterogeneity among the polysaccharides within and between species is also a problem. Thirdly, some bacterial polysaccharides are poor immunogens in humans due to their structural similarities with glycolipids and glycoproteins present in man. The T-lymphocyte independent nature of a polysaccharide may be overcome by conjugating the native or depolymerised polysaccharide to a protein carrier. Such neoglycoconjugates have been proven to be efficient in inducing T-lymphocyte dependent immunity and to protect both infants as well as elderly from disease. Another approach to circumvent the T-lymphocyte independent property of polysaccharides is to select peptides mimicking the immunodominant structures. Several examples of such peptides have been described.
Consideration of host-parasite interactions encompasses a wide range of phenomena from adhesion to epithelial surfaces to interactions with cells of the immune system. This review focuses on the role of carbohydrates as recognition molecules in these complex interactions. The abundant glycoproteins and glycolipids of cell surfaces of both prokaryotic and eukaryotic cells have the ability to exist in a variety of spatial configurations through alpha- and beta-linkages and the formation of branched structures. This ability carries with it the opportunity of acting as informational molecules greater than that possible for proteins or nucleic acids. The blood group substances are probably the best characterized of these carbohydrate containing molecules. Whilst at present a detailed understanding of the importance of these molecules in host-parasite interaction is lacking, the material covered in this discussion emphasizes the way in which carbohydrate based recognition has been shown to be involved and may provide the basis for further understanding.
Carbohydrates, in more biologically oriented areas referred to as glycans, constitute one of the four groups of biomolecules. The glycans, often present as glycoproteins or glycolipids, form highly complex structures. In mammals ten monosaccharides are utilized in building glycoconjugates in the form of oligo- (up to about a dozen monomers) and polysaccharides. Subsequent modifications and additions create a large number of different compounds. In bacteria, more than a hundred monosaccharides have been reported to be constituents of lipopolysaccharides, capsular polysaccharides, and exopolysaccharides. Thus, the number of polysaccharide structures possible to create is huge. NMR spectroscopy plays an essential part in elucidating the primary structure, that is, monosaccharide identity and ring size, anomeric configuration, linkage position, and sequence, of the sugar residues. The structural studies may also employ computational approaches for NMR chemical shift predictions (CASPER program). Once the components and sequence of sugar residues have been unraveled, the three-dimensional arrangement of the sugar residues relative to each other (conformation), their flexibility (transitions between and populations of conformational states), together with the dynamics (timescales) should be addressed. To shed light on these aspects we have utilized a combination of experimental liquid state NMR techniques together with molecular dynamics simulations. For the latter a molecular mechanics force field such as our CHARMM-based PARM22/SU01 has been used. The experimental NMR parameters acquired are typically ¹H,¹H cross-relaxation rates (related to NOEs), ³JCH and ³JCC trans-glycosidic coupling constants and ¹H,¹³C- and ¹H,¹H-residual dipolar couplings. At a glycosidic linkage two torsion angles φ and ψ are defined and for 6-substituted residues also the ω torsion angle is required.
Major conformers can be identified for which highly populated states are present. Thus, in many cases a well-defined albeit not rigid structure can be identified. However, on longer timescales, oligosaccharides must be considered as highly flexible molecules since also anti-conformations have been shown to exist with H-C-O-C torsion angles of approximately 180°, compared to syn-conformations in which the protons at the carbon atoms forming the glycosidic linkage are in close proximity. The accessible conformational space governs possible interactions with proteins and both minor changes and significant alterations occur for the oligosaccharides in these interaction processes. Transferred NOE NMR experiments give information on the conformation of the glycan ligand when bound to the proteins whereas saturation transfer difference NMR experiments report on the carbohydrate part in contact with the protein. It is anticipated that the subtle differences in conformational preferences for glycan structures facilitate a means to regulate biochemical processes in different environments. Further developments in the analysis of glycan structure and in particular its role in interactions with other molecules, will lead to clarifications of the importance of structure in biochemical regulation processes essential to health and disease.
Glycans are widely distributed in biological systems in free state as well as conjugated forms as parts of glycoproteins, glycolipids, and proteoglycans. Because glycans are not synthesized directly by the corresponding genes but a combination of the related enzymes and substrates, the structures of glycans are quite diverse and sensitive with the changes of physiological conditions. Due to the extremely complex heterogeneities of glycans, it has been a big challenging target to analyze comprehensive glycan profiles (i.e. glycome) and determine characteristic glycan(s) for clinical use. Recent advances in separation sciences such as capillary/microchip electrophoresis and mass spectrometry have made it possible to analyze glycans for various practical uses. New emerging technologies of microarray and bioinformatics have also been applied to glycome/glycomics studies. In this review, recent topics on glycan analysis in clinical use are described with their historical background. Some results obtained by our studies are also shown.
Carbohydrates, or glycans, are as integral to biology as nucleic acids and proteins. In immunology, glycans are well known to drive diverse functions ranging from glycosaminoglycan-mediated chemokine presentation and selectin-dependent leukocyte trafficking to the discrimination of self and non-self through the recognition of sialic acids by Siglec (sialic acid-binding Ig-like lectin) receptors. In recent years, a number of key immunological discoveries are driving a renewed and burgeoning appreciation for the importance of glycans. In this review, we highlight these findings which collectively help to define and refine our knowledge of the function and impact of glycans within the immune response.
Last updated: 1 May 2024