Using sequence similarity to predict the function of biological sequences

Jones, Craig E.

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/40403

Type:	Thesis
Title:	Using sequence similarity to predict the function of biological sequences
Author:	Jones, Craig E.
Issue Date:	2007
School/Discipline:	School of Computer Science
Abstract:	In this thesis we examine issues surrounding the development of software that predicts the function of biological sequences using sequence similarity. Because sequence data is growing exponentially, there is a pressing need for high-throughput software that can annotate protein or DNA sequences with functional information. In Chapter 1 we briefly introduce the molecular biology and bioinformatics that are assumed knowledge, and the objectives of the research presented here. In Chapter 2 we discuss the development of a method for comparing competing designs for software annotators, using precision and recall metrics and a benchmark method referred to as Best BLAST. From this we conclude that data-mining approaches may be useful in the development of annotation algorithms, and that any new annotator should demonstrate its effectiveness against other approaches before being adopted. As any new annotator that utilises sequence similarity to predict the function of a sequence will rely on the quality of existing annotations, we examine the error rate of existing sequence annotations in Chapter 3. We develop a new method that allows annotation error rates to be estimated. This involves adding annotation errors at known rates to a sample of reference sequence annotations that was found to be similar to query sequences. The precision at each error rate treatment is determined, and linear regression is then used to find the error rate at estimated values for the maximum precision possible, given assumptions concerning the impact of semantic variation on precision. We found that the error rate of curated annotations based on sequence similarity (ISS) is far higher than that of annotations that use other forms of evidence (49% versus 13-18%, respectively). As such, we conclude that software annotators should avoid basing predictions on ISS annotations where possible.
In Chapter 4 we detail the development of GOSLING, Gene Ontology Similarity Listing using Information Graphs, a software annotator with a design based on the principles discovered in previous chapters. Chapter 5 concludes the thesis by discussing the major findings from the research presented.
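The error-rate estimation outlined in the abstract can be illustrated with a minimal sketch. This is not the thesis's own code. It assumes that precision falls linearly as errors are added, and that pre-existing and added errors lie on the same fitted line, so the existing error rate can be read off by extrapolating back to an assumed maximum precision. The function names, the placeholder term identifiers, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision(predicted, correct):
    """Fraction of predicted annotation terms that are also in the correct set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(correct)) / len(predicted)


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_error_rate(added_rates, precisions, max_precision):
    """Extrapolate the fitted line to max_precision.

    The line reaches max_precision at a negative "added" error rate;
    its magnitude is taken as the estimate of the pre-existing error rate.
    This interpretation is an assumption of the sketch.
    """
    slope, intercept = fit_line(added_rates, precisions)
    rate_at_max = (max_precision - intercept) / slope
    return -rate_at_max


# Hypothetical treatments: errors added at known rates, with made-up precisions.
added_rates = [0.0, 0.1, 0.2, 0.3]
precisions = [0.60, 0.54, 0.48, 0.42]

# Hypothetical ceiling on precision once semantic variation is allowed for.
estimated = estimate_error_rate(added_rates, precisions, max_precision=0.72)
print(f"estimated existing error rate: {estimated:.2f}")
```

In practice each precision value would come from comparing predicted terms against reference terms, for example `precision({"T1", "T2", "T3", "T4"}, {"T1", "T2"})`, where the term labels are placeholders rather than real ontology identifiers.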
Advisor:	Brown, Alfred; Baumann, Ute
Dissertation Note:	Thesis (M.Sc.(M&CS)) -- School of Computer Science, 2007
Subject:	Bioinformatics; Computer science
Keywords:	bioinformatics, computer science
Provenance:	This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at: http://www.adelaide.edu.au/legals
Appears in Collections:	Research Theses

Files in This Item:

File	Size	Format
Jones2007_MSc.pdf	794.87 kB	Adobe PDF
