Positive-unlabelled learning of glycosylation sites in the human proteome

dc.contributor.author: Li, F.
dc.contributor.author: Zhang, Y.
dc.contributor.author: Purcell, A.W.
dc.contributor.author: Webb, G.I.
dc.contributor.author: Chou, K.C.
dc.contributor.author: Lithgow, T.
dc.contributor.author: Li, C.
dc.contributor.author: Song, J.
dc.date.issued: 2019
dc.description.abstract: Background: As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites). Results: In this study, we propose a positive unlabelled (PU) learning-based method, PA2DE (V2.0), based on the AlphaMax algorithm for protein glycosylation site prediction. The predictive performance of this proposed method was evaluated by a range of glycosylation data collected over a ten-year period based on an interval of three years. Experiments using both benchmarking and independent tests show that our method outperformed the representative supervised-learning algorithms (including support vector machines and random forests) and one-class learners, as well as currently available prediction methods in terms of F1 score, accuracy and AUC measures. In addition, we developed an online web server as an implementation of the optimized model (available at http://glycomine.erc.monash.edu/Lab/GlycoMine_PU/) to facilitate community-wide efforts for accurate prediction of protein glycosylation sites. Conclusion: The proposed PU learning approach achieved a competitive predictive performance compared with currently available methods. This PU learning schema may also be effectively employed and applied to address the prediction problems of other important types of protein PTM site and functional sites.
dc.description.statementofresponsibility: Fuyi Li, Yang Zhang, Anthony W. Purcell, Geoffrey I. Webb, Kuo-Chen Chou, Trevor Lithgow, Chen Li and Jiangning Song
dc.identifier.citation: BMC Bioinformatics, 2019; 20(1):1-17
dc.identifier.doi: 10.1186/s12859-019-2700-1
dc.identifier.issn: 1471-2105
dc.identifier.orcid: Li, F. [0000-0001-5216-3213]
dc.identifier.uri: https://hdl.handle.net/2440/139603
dc.language.iso: en
dc.publisher: BioMed Central
dc.relation.grant: http://purl.org/au-research/grants/arc/LP110200333
dc.relation.grant: http://purl.org/au-research/grants/arc/DP120104460
dc.relation.grant: http://purl.org/au-research/grants/nhmrc/4909809
dc.relation.grant: http://purl.org/au-research/grants/nhmrc/1127948
dc.relation.grant: http://purl.org/au-research/grants/nhmrc/1144652
dc.relation.grant: http://purl.org/au-research/grants/nhmrc/1137739
dc.relation.grant: http://purl.org/au-research/grants/nhmrc/1143366
dc.relation.grant: http://purl.org/au-research/grants/arc/FL130100038
dc.rights: © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
dc.source.uri: https://doi.org/10.1186/s12859-019-2700-1
dc.subject: Protein glycosylation prediction; Positive unlabelled-learning; Supervised-learning; AlphaMax; Sequence analysis; Sequence-derived features
dc.subject.mesh: Humans
dc.subject.mesh: Proteome
dc.subject.mesh: Staining and Labeling
dc.subject.mesh: ROC Curve
dc.subject.mesh: Computational Biology
dc.subject.mesh: Protein Processing, Post-Translational
dc.subject.mesh: Glycosylation
dc.subject.mesh: Databases, Protein
dc.subject.mesh: Support Vector Machine
dc.title: Positive-unlabelled learning of glycosylation sites in the human proteome
dc.type: Journal article
pubs.publication-status: Published

Files

Original bundle
Name: hdl_139603.pdf
Size: 2.3 MB
Format: Adobe Portable Document Format
Description: Published version