Type: Theses
Title: Consensus sequence estimation methods for DNA sequence data sets
Author: James, Sarah Ellen
Issue Date: 2017
School/Discipline: School of Mathematical Sciences
Abstract: A combination of ancient and modern DNA sequences and geographical information is used in phylogenetics, to study the evolutionary relationships between individuals or species. A common problem with using ancient DNA sequences in phylogenetic analyses is that ancient DNA sequence data sets can often contain damaged or missing bases. This limits the accuracy of the analysis and reduces the number of statistical methods available for use. Since DNA sequencing technologies have improved, more informative statistical techniques have been developed to estimate the DNA sequence for a single individual. In the 1990s, several DNA sequence estimation methods were developed using only the reads contained in the alignment. Current statistical methods use both the reads in the alignment and the qualities associated with each read, to estimate the consensus sequence. We then obtain the DNA sequence by removing any gaps in the consensus sequence. A limitation of these DNA sequence estimation methods is that these methods may return estimated DNA sequences with missing bases, due to low coverage or poor quality data. DNA sequences with missing bases are problematic when used in statistical analyses, and this can limit the accuracy of the analysis. Therefore, in this thesis we focus on developing a consensus sequence estimation method that uses information from the alignment as well as outside sources of information. In particular, our consensus sequence estimation method estimates bases for the sites in the DNA sequence that may have otherwise been assigned a missing base, due to low coverage or poor quality data. We develop two consensus sequence estimation methods based on the EM algorithm and Gibbs sampling respectively. Our consensus sequence estimation method based on the EM algorithm uses the alignment data as well as the associated quality data to estimate the consensus sequence. 
Our consensus sequence estimation method based on Gibbs sampling uses the alignment data, quality data, and cytosine deamination rates to also estimate sites that were damaged by ancient DNA damage. Since we often do not know the true DNA sequence for an individual, we use simulated DNA sequences and alignment data to assess the accuracy of our consensus sequence estimation methods. Using a DNA sequence distance metric, we compare our estimated DNA sequences to the true DNA sequence for each consensus sequence estimation method presented in this thesis. We also consider the entropy at each site along the consensus sequence to quantify the amount of uncertainty in the estimated consensus sequence. Based on these results, we make recommendations on which consensus sequence estimation methods are suitable for particular DNA sequence data sets. An advantage of the Gibbs sampling approach over the EM algorithm approach is that we have informative prior distributions for sites with low coverage or poor quality data. Hence, the estimated bases at these sites are more informative than those estimated using the EM algorithm approach. Our consensus sequence estimation methods generate estimates for all sites, including those that may have otherwise been allocated missing bases. Therefore, our consensus sequence estimation methods reduce the problem of using DNA sequences with missing bases in phylogenetic analyses.
Advisor: Bean, Nigel Geoffrey
Tuke, Simon Jonathan
Dissertation Note: Thesis (M.Phil.) -- University of Adelaide, School of Mathematical Sciences, 2017.
Keywords: consensus sequence estimation
ancient DNA
Provenance: This electronic version is made publicly available by the University of Adelaide in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. This thesis may incorporate third party material which has been used by the author pursuant to Fair Dealing exceptions. If you are the owner of any included third party copyright material you wish to be removed from this electronic version, please complete the take down form located at:
DOI: 10.25909/5b874d36c13f9
Appears in Collections: Research Theses

Files in This Item:
File          Description                  Size        Format
01front.pdf                                51.98 kB    Adobe PDF
02whole.pdf                                11.4 MB     Adobe PDF
Permissions   Library staff access only    234.31 kB   Adobe PDF
Restricted    Library staff access only    43.22 MB    Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.