Using sequence similarity to predict the function of biological sequences

Jones, Craig E.

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/40403

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Brown, Alfred	en
dc.contributor.advisor	Baumann, Ute	en
dc.contributor.author	Jones, Craig E.	en
dc.date.issued	2007	en
dc.identifier.uri	http://hdl.handle.net/2440/40403	-
dc.description.abstract	In this thesis we examine issues surrounding the development of software that predicts the function of biological sequences using sequence similarity. There is a pressing need for high-throughput software that can annotate protein or DNA sequences with functional information, due to the exponential growth in sequence data. In Chapter 1 we briefly introduce the molecular biology and bioinformatics that are assumed knowledge, and the objectives for the research presented here. In Chapter 2 we discuss the development of a method of comparing competing designs for software annotators, using precision and recall metrics, and a benchmark method referred to as Best BLAST. From this we conclude that data-mining approaches may be useful in the development of annotation algorithms, and that any new annotator should demonstrate its effectiveness against other approaches before being adopted. As any new annotator that utilises sequence similarity to predict the function of a sequence will rely on the quality of existing annotations, we examine the error rate of existing sequence annotations in Chapter 3. We develop a new method that allows for the estimation of annotation error rates. This involves adding annotation errors at known rates to a sample of reference sequence annotations that was found to be similar to query sequences. The precision at each error rate treatment is determined, and linear regression is then used to find the error rate at estimated values for the maximum precision possible, given assumptions concerning the impact of semantic variation on precision. We found that the error rate of curated annotations based on sequence similarity (ISS) is far higher than that of annotations that use other forms of evidence (49% versus 13-18%, respectively). As such, we conclude that software annotators should avoid basing predictions on ISS annotations where possible. 
In Chapter 4 we detail the development of GOSLING, Gene Ontology Similarity Listing using Information Graphs, a software annotator with a design based on the principles discovered in previous chapters. Chapter 5 concludes the thesis by discussing the major findings from the research presented.	en
dc.format.extent	511805 bytes	en
dc.format.extent	28403 bytes	en
dc.format.mimetype	application/pdf	en
dc.format.mimetype	application/pdf	en
dc.language.iso	en	en
dc.subject	bioinformatics, computer science	en
dc.subject.lcsh	Bioinformatics	en
dc.subject.lcsh	Computer science	en
dc.title	Using sequence similarity to predict the function of biological sequences	en
dc.type	Thesis	en
dc.contributor.school	School of Computer Science	en
dc.provenance	This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals	en
dc.description.dissertation	Thesis (M.Sc.(M&CS)) -- School of Computer Science, 2007	en
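
The abstract compares annotators using precision and recall against a Best BLAST benchmark. As a minimal sketch of how such a comparison could be scored, assuming annotations are represented as sets of Gene Ontology term identifiers and assuming "Best BLAST" means copying the terms of the single highest-scoring hit (the abstract does not spell out either detail), the function names, scores, and term identifiers below are hypothetical and are not taken from the thesis:

```python
from typing import Iterable, List, Set, Tuple


def precision_recall(predicted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one sequence's predicted terms.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference terms
    Empty inputs yield 0.0 rather than dividing by zero.
    """
    predicted_set, reference_set = set(predicted), set(reference)
    correct = len(predicted_set & reference_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(reference_set) if reference_set else 0.0
    return precision, recall


def best_hit_terms(hits: List[Tuple[float, Set[str]]]) -> Set[str]:
    """Assumed benchmark: copy the terms of the highest-scoring hit.

    `hits` is a list of (similarity_score, annotation_terms) pairs,
    a hypothetical stand-in for parsed similarity search output.
    """
    if not hits:
        return set()
    _, terms = max(hits, key=lambda hit: hit[0])
    return set(terms)


# Illustrative values only; these are not results from the thesis.
hits = [
    (87.5, {"GO:0003677", "GO:0005634"}),
    (212.0, {"GO:0003677", "GO:0005634", "GO:0006355"}),
]
reference = {"GO:0003677", "GO:0006355", "GO:0008270", "GO:0045893"}

predicted = best_hit_terms(hits)
precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Scoring every candidate annotator with the same function against the same reference set is what makes a benchmark comparison of this kind meaningful; exact set matching is a simplification, since the abstract notes that semantic variation between terms also affects precision.
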
Appears in Collections:	Research Theses

Files in This Item:

File	Size	Format
Jones2007_MSc.pdf	794.87 kB	Adobe PDF

Adelaide Research & Scholarship