Protein sequence comparison based on K-string dictionary

dc.contributor.author: Yu, C.
dc.contributor.author: He, R.
dc.contributor.author: Yau, S.
dc.date.issued: 2013
dc.description: Data source: Supplementary data, https://doi.org/10.1016/j.gene.2013.07.092
dc.description.abstract: The current K-string-based protein sequence comparisons require large amounts of computer memory because the dimension of the protein vector representation grows exponentially with K. In this paper, we propose a novel concept, the "K-string dictionary", to solve this high-dimensional problem. It allows us to use a much lower dimensional K-string-based frequency or probability vector to represent a protein, and thus significantly reduce the computer memory requirements for their implementation. Furthermore, based on this new concept, we use Singular Value Decomposition to analyze real protein datasets, and the improved protein vector representation allows us to obtain accurate gene trees.
dc.identifier.citation: Gene, 2013; 529(2):250-256
dc.identifier.doi: 10.1016/j.gene.2013.07.092
dc.identifier.issn: 0378-1119
dc.identifier.issn: 1879-0038
dc.identifier.orcid: Yu, C. [0000-0002-3248-8421]
dc.identifier.uri: https://hdl.handle.net/11541.2/131862
dc.language.iso: en
dc.publisher: Elsevier
dc.relation.funding: US NSF DMS-1120824
dc.relation.funding: China NSF 31271408
dc.relation.funding: Tsinghua University
dc.rights: Copyright 2013 Elsevier
dc.source.uri: https://doi.org/10.1016/j.gene.2013.07.092
dc.subject: Proteins
dc.subject: Data Interpretation, Statistical
dc.subject: Sequence Alignment
dc.subject: Sequence Analysis, Protein
dc.subject: Phylogeny
dc.subject: Databases, Protein
dc.title: Protein sequence comparison based on K-string dictionary
dc.type: Journal article
pubs.publication-status: Published
ror.mmsid: 9916188090201831
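
Illustrative note on the abstract: the record does not give the paper's exact construction, so the following is only a minimal sketch of the general idea, under the assumption that the "K-string dictionary" is the set of K-strings actually observed in the dataset. Under that assumption a protein is represented by a frequency vector indexed by the dictionary rather than by all 20^K possible K-strings. The function names (`k_strings`, `build_dictionary`, `frequency_vector`) and the two toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def k_strings(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dictionary(seqs, k):
    """Assumed definition: the sorted set of K-strings seen in any sequence."""
    observed = set()
    for s in seqs:
        observed.update(k_strings(s, k))
    return sorted(observed)


def frequency_vector(seq, k, dictionary):
    """Relative frequency of each dictionary K-string within seq."""
    counts = Counter(k_strings(seq, k))
    total = sum(counts[w] for w in dictionary)
    return [counts[w] / total if total else 0.0 for w in dictionary]


# Toy sequences invented for illustration only.
toy = ["MKVLAAGMKV", "MKVLSAGMKL"]
K = 3

dictionary = build_dictionary(toy, K)
vectors = [frequency_vector(s, K, dictionary) for s in toy]

full_dim = 20 ** K  # dimension if every possible K-string over 20 amino acids were indexed
print(len(dictionary), "dictionary entries versus", full_dim, "possible K-strings")
```

The vectors produced this way could then be stacked into a matrix and passed to a Singular Value Decomposition routine, as the abstract describes; that step is omitted here because the record does not specify how the decomposition is applied.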
