Protein sequence comparison based on K-string dictionary

Yu, C.; He, R.; Yau, S.

doi:10.1016/j.gene.2013.07.092

dc.contributor.author	Yu, C.
dc.contributor.author	He, R.
dc.contributor.author	Yau, S.
dc.date.issued	2013
dc.description	Data source: Supplementary data, https://doi.org/10.1016/j.gene.2013.07.092
dc.description.abstract	The current K-string-based protein sequence comparisons require large amounts of computer memory because the dimension of the protein vector representation grows exponentially with K. In this paper, we propose a novel concept, the "K-string dictionary", to solve this high-dimensional problem. It allows us to use a much lower dimensional K-string-based frequency or probability vector to represent a protein, and thus significantly reduce the computer memory requirements for their implementation. Furthermore, based on this new concept, we use Singular Value Decomposition to analyze real protein datasets, and the improved protein vector representation allows us to obtain accurate gene trees.
dc.identifier.citation	Gene, 2013; 529(2):250-256
dc.identifier.doi	10.1016/j.gene.2013.07.092
dc.identifier.issn	0378-1119
dc.identifier.issn	1879-0038
dc.identifier.orcid	Yu, C. [0000-0002-3248-8421]
dc.identifier.uri	https://hdl.handle.net/11541.2/131862
dc.language.iso	en
dc.publisher	Elsevier
dc.relation.funding	US NSF DMS-1120824
dc.relation.funding	China NSF 31271408
dc.relation.funding	Tsinghua University
dc.rights	Copyright 2013 Elsevier
dc.source.uri	https://doi.org/10.1016/j.gene.2013.07.092
dc.subject	Proteins
dc.subject	Data Interpretation, Statistical
dc.subject	Sequence Alignment
dc.subject	Sequence Analysis, Protein
dc.subject	Phylogeny
dc.subject	Databases, Protein
dc.title	Protein sequence comparison based on K-string dictionary
dc.type	Journal article
pubs.publication-status	Published
ror.mmsid	9916188090201831
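
The abstract's central point is that a vector indexed by every possible K-string has 20^K entries for the 20 standard amino acids, while a vector indexed only by a dictionary of K-strings is much shorter. The sketch below illustrates that idea under an assumption: the abstract does not define the dictionary, so here it is taken to be the set of distinct K-strings that occur in the dataset. The function names (`k_strings`, `build_dictionary`, `frequency_vector`) and the toy sequences are hypothetical and are not taken from the paper; the Singular Value Decomposition step is not shown.

```python
from collections import Counter


def k_strings(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_dictionary(seqs, k):
    """Sorted list of the distinct K-strings observed in the dataset.

    Assumption: this is one plausible reading of "K-string dictionary";
    the paper's exact construction may differ.
    """
    return sorted({s for seq in seqs for s in k_strings(seq, k)})


def frequency_vector(seq, dictionary, k):
    """Relative frequency of each dictionary entry within one sequence."""
    counts = Counter(k_strings(seq, k))
    total = sum(counts.values())
    return [counts[s] / total if total else 0.0 for s in dictionary]


# Toy sequences for illustration only, not real protein data.
proteins = ["MKVLAAGIV", "MKVLSAGIV", "MRVLAAGLV"]
k = 3

dictionary = build_dictionary(proteins, k)
vectors = [frequency_vector(p, dictionary, k) for p in proteins]

print(len(dictionary), "dictionary entries versus", 20 ** k, "possible K-strings")
```

Each vector has one entry per dictionary K-string, so its length depends on how many distinct K-strings the dataset contains and not on 20^K.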
