Reducing Alignment Time Complexity of Ultra-Large Sets of Sequences

Research output: Contribution to journalArticle

Abstract

The alignment of three or more protein or nucleotide sequences is known as Multiple Sequence Alignment problem. The complexity of this problem increases exponentially with the number of sequences; therefore, many of the current approaches published in the literature suffer a computational overhead when thousands of sequences are required to be aligned. We introduce a new approach for dealing with ultra-large sets of sequences. A two-level clustering method is considered. The first level clusters the input sequences by using their biological composition, that is, the number of positive, negative, polar, special, and hydrophobic amino acids. In the second level, each cluster is divided into different clusters according to their similarity. Then, each cluster is aligned by using any method/aligner. After aligning the centroid sequences of each second-level cluster, we extrapolate the new gaps to each cluster of sequences to obtain the final alignment. We present a study on biological data with up to ∼100,000 sequences, showing that the proposed approach is able to obtain accurate alignments in a reduced amount of time; for example, in >10,000 sequences datasets, it is able to reduce up to ∼45 times the required runtime of the well-known Kalign.

Original languageEnglish
Pages (from-to)1144-1152
Number of pages9
JournalJournal of Computational Biology
Volume24
Issue number11
DOIs
Publication statusPublished - 1 Nov 2017

Keywords

  • Algorithms
  • Amino acids
  • Clustering
  • Large sequences
  • Multiple alignment

Fingerprint Dive into the research topics of 'Reducing Alignment Time Complexity of Ultra-Large Sets of Sequences'. Together they form a unique fingerprint.

  • Cite this