A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm

Y. Li, H. Li, T. Duan, Sheng Wang, Z. Wang, Y. Cheng

Research output: Contribution to conferencePaper

7 Citations (Scopus)

Abstract

Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In Big Data era, the lengths and sizes of these sequences are growing explosively, leading to grand challenges for the classical NP-hard problem, namely searching for the Multiple Longest Common Subsequences (MLCS) from multiple sequences. In this paper, we first unveil the fact that the state-of-the-art MLCS algorithms are unable to be applied to long and large-scale sequences alignments. To overcome their defects and tackle the longer and large-scale or even big sequences alignments, based on the proposed novel problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation and serialization, etc., we present a novel parallel MLCS algorithm. Exhaustive experiments on the datasets of both synthetic and real-world biological sequences demonstrate that both the time and space of the proposed algorithm are only linear in the number of dominants from aligned sequences, and the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, being applicable to longer and large-scale sequences alignments.
Original languageEnglish
Pages1725-1734
DOIs
Publication statusPublished - 2016
EventACM SIGKDD International Conference on Knowledge Discovery and Data Mining - California, United States
Duration: 13 Aug 201617 Aug 2016

Conference

ConferenceACM SIGKDD International Conference on Knowledge Discovery and Data Mining
CountryUnited States
CityCalifornia
Period13/08/1617/08/16

Fingerprint

Sorting
Computational complexity
DNA
Proteins
Defects
Experiments
Big data

Bibliographical note

The full text is currently unavailable on the repository.

Keywords

  • Multiple Longest Common Subsequences (MLCS)
  • Non-redundant Common Subsequence Graph (NCSG)
  • Topological Sorting
  • Subsection Calculation and Serialization

Cite this

Li, Y., Li, H., Duan, T., Wang, S., Wang, Z., & Cheng, Y. (2016). A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm. 1725-1734. Paper presented at ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, United States. https://doi.org/10.1145/2939672.2939842

A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm. / Li, Y.; Li, H.; Duan, T.; Wang, Sheng; Wang, Z.; Cheng, Y.

2016. 1725-1734 Paper presented at ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, United States.

Research output: Contribution to conferencePaper

Li, Y, Li, H, Duan, T, Wang, S, Wang, Z & Cheng, Y 2016, 'A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm' Paper presented at ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, United States, 13/08/16 - 17/08/16, pp. 1725-1734. https://doi.org/10.1145/2939672.2939842
Li Y, Li H, Duan T, Wang S, Wang Z, Cheng Y. A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm. 2016. Paper presented at ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, United States. https://doi.org/10.1145/2939672.2939842
Li, Y. ; Li, H. ; Duan, T. ; Wang, Sheng ; Wang, Z. ; Cheng, Y. / A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm. Paper presented at ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, United States.
@conference{829b290f1e96498e92abc72dd769b65e,
title = "A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm",
abstract = "Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In Big Data era, the lengths and sizes of these sequences are growing explosively, leading to grand challenges for the classical NP-hard problem, namely searching for the Multiple Longest Common Subsequences (MLCS) from multiple sequences. In this paper, we first unveil the fact that the state-of-the-art MLCS algorithms are unable to be applied to long and large-scale sequences alignments. To overcome their defects and tackle the longer and large-scale or even big sequences alignments, based on the proposed novel problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation and serialization, etc., we present a novel parallel MLCS algorithm. Exhaustive experiments on the datasets of both synthetic and real-world biological sequences demonstrate that both the time and space of the proposed algorithm are only linear in the number of dominants from aligned sequences, and the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, being applicable to longer and large-scale sequences alignments.",
keywords = "Multiple Longest Common Subsequences (MLCS), Non-redundant Common Subsequence Graph (NCSG), Topological Sorting, Subsection Calculation and Serialization",
author = "Y. Li and H. Li and T. Duan and Sheng Wang and Z. Wang and Y. Cheng",
note = "The full text is currently unavailable on the repository.; ACM SIGKDD International Conference on Knowledge Discovery and Data Mining ; Conference date: 13-08-2016 Through 17-08-2016",
year = "2016",
doi = "10.1145/2939672.2939842",
language = "English",
pages = "1725--1734",

}

TY - CONF

T1 - A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm

AU - Li, Y.

AU - Li, H.

AU - Duan, T.

AU - Wang, Sheng

AU - Wang, Z.

AU - Cheng, Y.

N1 - The full text is currently unavailable on the repository.

PY - 2016

Y1 - 2016

N2 - Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In Big Data era, the lengths and sizes of these sequences are growing explosively, leading to grand challenges for the classical NP-hard problem, namely searching for the Multiple Longest Common Subsequences (MLCS) from multiple sequences. In this paper, we first unveil the fact that the state-of-the-art MLCS algorithms are unable to be applied to long and large-scale sequences alignments. To overcome their defects and tackle the longer and large-scale or even big sequences alignments, based on the proposed novel problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation and serialization, etc., we present a novel parallel MLCS algorithm. Exhaustive experiments on the datasets of both synthetic and real-world biological sequences demonstrate that both the time and space of the proposed algorithm are only linear in the number of dominants from aligned sequences, and the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, being applicable to longer and large-scale sequences alignments.

AB - Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In Big Data era, the lengths and sizes of these sequences are growing explosively, leading to grand challenges for the classical NP-hard problem, namely searching for the Multiple Longest Common Subsequences (MLCS) from multiple sequences. In this paper, we first unveil the fact that the state-of-the-art MLCS algorithms are unable to be applied to long and large-scale sequences alignments. To overcome their defects and tackle the longer and large-scale or even big sequences alignments, based on the proposed novel problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation and serialization, etc., we present a novel parallel MLCS algorithm. Exhaustive experiments on the datasets of both synthetic and real-world biological sequences demonstrate that both the time and space of the proposed algorithm are only linear in the number of dominants from aligned sequences, and the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, being applicable to longer and large-scale sequences alignments.

KW - Multiple Longest Common Subsequences (MLCS)

KW - Non-redundant Common Subsequence Graph (NCSG)

KW - Topological Sorting

KW - Subsection Calculation and Serialization

U2 - 10.1145/2939672.2939842

DO - 10.1145/2939672.2939842

M3 - Paper

SP - 1725

EP - 1734

ER -