Elsevier

Carbohydrate Research

Volume 339, Issue 5, 2 April 2004, Pages 1015-1020

Data mining the protein data bank: automatic detection and assignment of carbohydrate structures

https://doi.org/10.1016/j.carres.2003.09.038

Abstract

Knowledge of the 3D structure of glycans is a prerequisite for a complete understanding of the biological processes in which glycoproteins are involved. However, owing to the lack of a standardised nomenclature, carbohydrate compounds are difficult to locate within the Protein Data Bank (PDB). Using an algorithm that detects carbohydrate structures from element types and atom coordinates alone, we detected 1663 entries containing a total of 5647 carbohydrate chains. The majority of chains are N-glycosidically bound. Noncovalently bound ligands are also frequent, while O-glycans form a minority. About 30% of all carbohydrate-containing PDB entries contain one or more errors. The automatic assignment of carbohydrate structures in PDB entries will improve the cross-linking of glycobiology resources with genomic and proteomic data collections, which will be an important issue for the upcoming glycomics projects. By aiding the detection of erroneous annotations and structures, the algorithm might also help to increase database quality.

Introduction

Protein glycosylation is probably the most common and most complex type of co- and posttranslational modification encountered in proteins. Glycosylation differs from most other covalent protein modifications, such as phosphorylation, acetylation and formylation, in the size and complexity of the added group and in the magnitude of the cellular machinery devoted to its synthesis and modulation [1], [2]. Inspection of protein databases reveals that as many as 70% of all proteins have potential N-glycosylation sites (Asn-X-Ser/Thr, where X is any residue except proline), and O-glycosylation is even more ubiquitous [3].
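The sequon rule quoted above (Asn-X-Ser/Thr, X not proline) can be expressed as a short pattern search over a one-letter protein sequence. The following is only a minimal sketch: the function name `find_sequons` and the example sequence are hypothetical, and a match marks a potential site, not evidence that the site is actually glycosylated.

```python
import re

# N-glycosylation sequon: Asn, then any residue except Pro, then Ser or Thr.
# The lookahead keeps overlapping sequons (e.g. "NNTS") from hiding each other.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues that start a potential sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical sequence, used only to exercise the pattern.
    print(find_sequons("MNGTANPSNNTS"))
```

The lookahead is a deliberate choice: a plain `N[^P][ST]` pattern would consume three residues per match and miss a sequon that begins inside the previous one.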

The oligosaccharide moieties cover a range of diverse biological functions. First of all, they are involved in folding and subsequent conformational maturation in the rough endoplasmic reticulum. Without added glycans, many plasma membrane proteins and secretory proteins are not able to fold properly [1], [4]. Simply because of their large size and hydrophilicity, glycans can alter the physico-chemical properties of a glycoprotein, making it more soluble, reducing backbone flexibility and thereby increasing protein stability, protecting it from proteolysis, etc. [3]. Carbohydrates are absolutely required for the correct maturation, function and intracellular sorting of many glycoproteins. The failure of glycoproteins to be correctly processed, trafficked and degraded leads to disease in humans. Certain carbohydrates found on the surface of cells can affect the onset and progression of disease states [2].

Another major function of protein-linked glycans is to provide additional recognition epitopes for protein receptors. These are implicated in a variety of cell–cell and cell–matrix recognition events. By participating in the process of cellular recognition, the oligosaccharide moieties play a pivotal role in inflammation, immune response and cancer [4], [5], [6]. These recognition events involve specific carbohydrate-binding proteins, the lectins [4]. They depend on the precise three-dimensional shape of the glycan [3]; knowledge of the 3D structure of the glycan is therefore a prerequisite for a full understanding of the biological processes in which glycoproteins are involved.

The progressing glycomics projects will dramatically accelerate the understanding of the roles of carbohydrates in cell communication and hopefully lead to novel therapeutic approaches for the treatment of human disease. MIT's magazine of innovation (21 January 2003) identified glycomics as one of the top 10 technologies that will change the future. The development of new and advanced bioinformatic tools, algorithms and data collections for glycobiology is an absolute requirement for successfully managing and analysing the large amount of data that the upcoming glycomics projects will produce. An important issue will be the cross-linking of glycobiology resources with genomic and proteomic data collections. The existing glyco-related databases are not reciprocally cross-referenced as are, for instance, the diverse gene and protein databases. Although the oligosaccharide databases make reference to major protein databases, no significant effort is made to link protein entries to their glycan structures [7].

The largest publicly available source of biomolecular 3D structures is the Protein Data Bank (PDB) [8]. As of September 2003, the PDB consisted of about 22,500 entries, most of which are proteins. Only 18 entries are pure carbohydrate structures, but many protein structures include carbohydrate compounds. The problem, however, is that in contrast to proteins or nucleic acids there is no standard nomenclature for carbohydrate residues within PDB files [9]. In some cases, entire carbohydrate chains are combined into one single residue (e.g., the residues ASL and ASF in the PDB entry 1agm). Furthermore, for many monosaccharide residues as defined in the PDB Het Group Dictionary (http://pdb.rutgers.edu/het_dictionary.txt) there is no distinction between the α- and β-forms. Information about how the individual carbohydrate residues are linked to each other may be given in the LINK records of a PDB file but is missing in most cases. For these reasons, it is difficult for glycobiologists to find a carbohydrate structure of interest within the PDB. So far, only very few attempts have been made to analyse special types of these carbohydrate-related data [3], [4], [9]. Here, we present an algorithm that detects and assigns carbohydrate compounds, covalently attached glycans as well as noncovalently bound ligands, solely on the basis of element types and atom coordinates. This approach has the advantage that it does not depend on the annotation given in the PDB file format. The algorithm was implemented in the software pdb2linucs. The program analyses the input file and lists the carbohydrates it contains using the LINUCS notation. LINUCS is a linear, unique nomenclature for carbohydrate structures that is well suited for use in data processing and computer algorithms [10].
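To illustrate what working from element types and atom coordinates alone can look like, the sketch below reads ATOM/HETATM records by their fixed PDB columns, infers bonds from interatomic distances, and reports rings built from one oxygen and five carbons as pyranose-like candidates. It is not the pdb2linucs implementation, whose details are not given here: the distance criterion, the covalent radii, the tolerance value and all function names are assumptions chosen for illustration.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4  # hypothetical slack added to the radius sum, in angstroms


def parse_atoms(pdb_lines):
    """Read (element, (x, y, z)) from ATOM/HETATM records via fixed columns."""
    atoms = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            element = line[76:78].strip().upper()
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((element, xyz))
    return atoms


def perceive_bonds(atoms):
    """Connect two atoms when their distance is within radius sum plus slack."""
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in combinations(range(len(atoms)), 2):
        (el_i, xyz_i), (el_j, xyz_j) = atoms[i], atoms[j]
        if el_i not in COVALENT_RADIUS or el_j not in COVALENT_RADIUS:
            continue  # hydrogens, metals and other untabulated elements
        limit = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if math.dist(xyz_i, xyz_j) <= limit:
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours


def find_sugar_rings(atoms, neighbours, n_carbons=5):
    """Return rings of one oxygen plus n_carbons carbons (5 = pyranose-like)."""
    rings = set()

    def extend(oxygen, path):
        # Grow a chain of distinct carbons; accept it if it closes on the oxygen.
        if len(path) == n_carbons:
            if oxygen in neighbours[path[-1]]:
                rings.add(frozenset([oxygen, *path]))
            return
        for nxt in neighbours[path[-1]]:
            if atoms[nxt][0] == "C" and nxt not in path:
                extend(oxygen, path + [nxt])

    for idx, (element, _) in enumerate(atoms):
        if element == "O":
            for first in neighbours[idx]:
                if atoms[first][0] == "C":
                    extend(idx, [first])
    return rings
```

A call such as `find_sugar_rings(atoms, perceive_bonds(atoms))` returns sets of atom indices; passing `n_carbons=4` looks for furanose-like rings instead. The all-pairs distance loop is quadratic in the number of atoms, which keeps the sketch short but would call for a spatial grid on large entries, and a ring candidate found this way still needs its substituents checked before it can be named as a particular monosaccharide.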

Section snippets

Protein Data Bank

According to the PDB Holdings List of 9 September 2003, the PDB contains a total of 22,448 structures, of which 19,062 were solved by X-ray crystallography and the remaining 3386 by NMR. The vast majority of the entries (20,262) are classified as Proteins, Peptides and Viruses, 1231 are nucleic acids, 937 represent protein/nucleic acid complexes and 18 entries are referred to as carbohydrate structures.

Algorithm

To detect carbohydrate compounds in a molecular 3D structure, a perception of the molecular topology

Cross-referencing with other data collections

The automatic annotation of carbohydrate structures and their conversion to a unique linear notation enables efficient cross-referencing with other data collections containing glyco-related data. As a showcase, direct links to SWEET-DB [11] were established. For each glycan chain detected in a selected PDB file, SWEET-DB is automatically searched and a direct link is established in case of success. The data contained in SWEET-DB can be retrieved by clicking on the corresponding button. In case

Outlook

The high frequency of wrong annotations and other errors in the carbohydrate structures in PDB files highlights the need for validation software. In the area of protein structures, several quality control tools such as WhatCheck [14] and ProCheck [15] are available. Frequent use of such programs leads to increasing database quality [16]. Similarly, the algorithm presented here could help to improve the reliability of the annotation of carbohydrate compounds stored in PDB entries.

The automatic, IUPAC-conform

Acknowledgements

This work was supported by a grant from the German Research Council (DFG) within the digital library program.

References (17)

  • J Charlwood et al., Biomol. Eng. (2001)
  • I Marchal et al., Biochimie (2003)
  • A Bohne-Lang et al., Carbohydr. Res. (2001)
  • R Apweiler et al., Biochim. Biophys. Acta (1999)
  • A Helenius et al., Science (2001)
  • M.R Wormald et al., Chem. Rev. (2002)
  • A Imberty et al., Protein Eng. (1995)
  • P Rudd et al., Science (2001)
There are more references available in the full text version of this article.

Cited by (102)

  • How molecular modelling can better broaden the understanding of glycosylations

    2022, Current Opinion in Structural Biology
    Citation excerpt:

    For the saccharides whose electron densities are resolved, there are major concerns about their reliability. As many as 30% of glycoprotein structures in PDB had errors associated with the carbohydrate electron densities [34], and large number of errors were associated with poor fit of pyranose residue assignments [35]. However, there have been numerous efforts to alleviate these problems.

  • The current structural glycome landscape and emerging technologies

    2020, Current Opinion in Structural Biology
    Citation excerpt:

    The most important of these challenges are briefly listed below: Not all saccharide moieties are correctly annotated in the PDB [8,9]. Significant errors in the structural rendering, partially due to incomplete electron density maps [7•,9], especially missing or incorrect glycosidic bonds.

  • Structural glycobiology in the age of electron cryo-microscopy

    2020, Current Opinion in Structural Biology
  • Umbrella Visualization: A method of analysis dedicated to glycan flexibility with UnityMol

    2020, Methods
    Citation excerpt:

    Before running MD simulation, it is useful to analyze and understand the structure of glycosylated proteins. Some tools can help identify conformational errors in structures [21,22] while others are dedicated to the study of the torsion angles and saccharidic linkage [23–25]. Carbohydrates have many hydroxyl groups that are very reactive and can interact with their surroundings.
