Information in many applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In the Big Data era, the lengths and sizes of these sequences are growing explosively, posing grand challenges for a classical NP-hard problem: searching for the Multiple Longest Common Subsequences (MLCS) of multiple sequences. In this paper, we first show that state-of-the-art MLCS algorithms cannot be applied to long, large-scale sequence alignments. To overcome these defects and handle longer, large-scale, or even big sequence alignments, we present a novel parallel MLCS algorithm based on a proposed problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation, and serialization. Extensive experiments on datasets of both synthetic and real-world biological sequences demonstrate that both the time and space costs of the proposed algorithm are only linear in the number of dominants from the aligned sequences, and that the proposed algorithm significantly outperforms state-of-the-art MLCS algorithms and is applicable to longer, large-scale sequence alignments.
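To make the problem statement concrete, the following is a minimal sketch of a sequential baseline that finds one longest common subsequence of several short strings by searching over tuples of matching positions. It is not the parallel algorithm described in the abstract. The function name `mlcs` and the table `nxt` are hypothetical names chosen for illustration, and the memoized recursion is only practical for small inputs.

```python
from functools import lru_cache


def mlcs(seqs):
    """Return one longest common subsequence of all strings in seqs.

    Illustrative baseline only (an assumption of this sketch, not the
    paper's method): memoized search over tuples of positions.
    """
    if not seqs or any(len(s) == 0 for s in seqs):
        return ""

    # Only characters present in every sequence can appear in the answer.
    alphabet = set(seqs[0]).intersection(*map(set, seqs[1:]))

    # nxt[k][i][c] = smallest index j >= i with seqs[k][j] == c (absent if none).
    nxt = []
    for s in seqs:
        table = [dict() for _ in range(len(s) + 1)]
        for i in range(len(s) - 1, -1, -1):
            table[i] = dict(table[i + 1])
            table[i][s[i]] = i
        nxt.append(table)

    @lru_cache(maxsize=None)
    def best(point):
        # point[k] is the first index of seqs[k] that is still unused.
        result = ""
        for c in sorted(alphabet):
            # Earliest position of c in every sequence at or after point.
            match = tuple(nxt[k][p].get(c) for k, p in enumerate(point))
            if None in match:
                continue  # c no longer occurs in at least one sequence
            candidate = c + best(tuple(j + 1 for j in match))
            if len(candidate) > len(result):
                result = candidate
        return result

    return best(tuple(0 for _ in seqs))
```

For example, `mlcs(["ABCBDAB", "BDCABA"])` returns a common subsequence of length 4. The number of position tuples this search can visit grows quickly with the number and length of the sequences, which is the scaling difficulty the abstract refers to.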
Publication status: Published - 2016
Event: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - California, United States
Duration: 13 Aug 2016 → 17 Aug 2016
Bibliographical note: The full text is currently unavailable on the repository.
- Multiple Longest Common Subsequences (MLCS)
- Non-redundant Common Subsequence Graph (NCSG)
- Topological Sorting
- Subsection Calculation and Serialization