Efficient and Accurate Discovery of Patterns in Sequence Datasets

A. Pandian; J. Venkatasubramanian; S.E. Chandiran

Efficient and Accurate Discovery of Patterns in Sequence Datasets

A. Pandian, J. Venkatasubramanian, S.E. Chandiran

Abstract

-Existing sequence mining algorithms mostly focus on
mining for subsequences. However, a large class of applications, such as biological DNA and protein motif mining, require efficient mining of “approximate” patterns that are contiguous. The few existing algorithms that can be applied to find such contiguous approximate pattern mining have drawbacks like poor scalability, lack of guarantees in finding the pattern, and difficulty in adapting to other applications. In this paper, we present a new algorithm called FLAME (FLexible and Accurate Motif DEtector). FLAME is a flexible suffix tree based algorithm that can be used to find frequent patterns with a variety of definitions of motif (pattern) models. It is also accurate, as it always find the pattern if it exists. Using both real and synthetic datasets, we
demonstrate that FLAME is fast, scalable, and outperforms existing algorithms on a variety of performance metrics. Using FLAME, it is now possible to mine datasets that would have been prohibitively difficult with existing tools.

Keywords

Keywords---FLAME, Data Mining, Distributed Algorithms, Dataset, Decision Trees, Classification

Full Text:

PDF

References

M. O. Dayhoff, R. M. Schwartz, and B. Orcutt, “A Model for

Evolutionary Changes in Proteins,” Atlas of Protein Sequence and

Structure, vol. 5, pp. 345–352, 1978.

S. Henikoff and J. Henikoff, “Amino Acid Substitution Matrices from

Protein Blocks,” National Academy of Sciences, USA, vol. 89, no. 22, pp.

915–9, 1992.

R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association

Rules,” in VLDB, 1994, pp. 487–499.

“Mining Sequential Patterns,” in ICDE, 1995, pp. 3–14.

[5] M. J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent

Sequences,” Machine Learning, vol. 42, no. 1/2, pp. 31–60, 2001.

J. Wang and J. Han, “BIDE: Efficient Mining of Frequent Closed

Sequences,” in ICDE, 2004, pp. 79–90.

X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential

Patterns in Large Datasets,” in SDM, 2003.

J. Yang, W. Wang, P. S. Yu, and J. Han, “Mining Long Sequential

Patterns in a Noisy Environment,” in SIGMOD, 2002, pp. 406–417.

M. J. Zaki, “Sequence Mining in Categorical Domains: Incorporating

Constrains,” in CIKM, 2000, pp. 442–429.

J. Pei, J. Han, and W. Wang, “Mining Sequential Patterns With

Constraints in Large Databases,” in CIKM, 2002, pp. 18–25.

G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder Web:

Discovery of Transcription Factor Binding Sites in a Set of Sequences

From Co-Regulated Genes,” Nucleic Acids Research, vol. 32(Web Server

issue), pp. W199–W203, 2004.

E. Eskin and P. A. Pevzner, “Finding Composite Regulatory Patterns in

DNA Sequences,” in ISMB, 2002, pp. S354–63.

J. Buhler and M. Tompa, “Finding Motifs Using Random Projections,”

Journal Computational Biology, vol. 9, no. 2, pp. 225–242, 2002.

G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth, “Rule

Discovery From Time Series,” in KDD, 1998, pp. 16–22.

S. Hoppner, “Discovery of Temporal Patterns – Learning Rules about the

Qualitative Behaviour of Time Series,” in 5th European Conference on

Principles and Practice of Knowledge Discovery in Databases, 2001, pp.

–203.

P. Patel, E. Keogh, J. Lin, and S. Lonardi, “Mining Motifs in Massive

Time Series Databases,” in ICDM, 2002, pp. 370–377.

H. Wu, B. Salzberg, G. C. Sharp, S. B. Jiang, H. Shirato, and D. Kaeli,

“Subsequence Matching on Structured Time Series Data,” in SIGMOD,

, pp. 682–693.

B. Y.-C. Chiu, E. J. Keogh, and S. Lonardi, “Probabilistic Discovery of

Time Series Motifs,” in KDD, 2003, pp. 493–498.

W. Wang and J. Yang, Mining Sequential Patterns from Large Data Sets.

Springer-Verlag, 2005, vol. 28.

M. Das and H. K. Dai, “A Survey of DNA Motif Finding Algorithms,”

BMC Bioinformatics, vol. 8, 2007.

G. K. Sandve and F. Drabløs, “A Survey of Motif Discovery Methods in

an Integrated Framework,” Biology Direct, vol. 1, 2006.

Y. Zhang and M. J. Zaki, “ExMOTIF: Efficient Structured Motif

Extraction,” Algorithms for Molecular Biology, vol. 1, no. 21, November

A. M. Carvalho et al., “An Efficient Algorithm for the Identification of

Structured Motifs in DNA Promoter Sequences,” IEEE/ACM

Transactions on computational biology and bioinformatics, vol. 3, 2006.

J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.

Hsu, “PrefixSpan: Mining Sequential Patterns by Prefix-Projected

Growth,” in ICDE, 2001, pp. 215–224.

A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, “Approaches to the

Automatic Discovery of Patterns in Biosequences,” Journal of

Computational Biology, vol. 5, pp. 279–305, 1998.

L. Marsan and M.-F. Sagot, “Algorithms for Extracting Structured Motifs

Using a Suffix Tree with Application to Promoter and Regulatory Site

Consensus Identification,” Journal of Computational Biology, vol. 7, no.

/4, pp. 345–360, 2000.

F. Zhu, X. Yan, J. Han, and P. S. Yu, “Efficient discovery of frequent

approximate sequential patterns,” in ICDM, 2007.

T. L. Bailey and C. Elkan, “Unsupervised Learning of Multiple Motifs in

Biopolymers using EM,” Machine Learning, vol. 21, no. 1-2, pp. 51–80,

W. Thompson, E. C. Rouchka, and C. E. Lawrence, “Gibbs Recursive

Sampler: Finding Transcription Factor Binding Sites,” Nucleic Acids

Research, vol. 31, no. 13, pp. 3580–3585, 2003.

M. Tompa et al., “Assessing Computational Tools for the Discovery of

Transcription Factor Binding Sites,” Nature Biotechnology, vol. 23, pp.

–144, 2005.

Refbacks

There are currently no refbacks.

This work is licensed under a Creative Commons Attribution 3.0 License.

Username
Password
Remember me