An efficient algorithm for pattern discovery in large text databases

Dan Li, Kefei Wang, Jitender S. Deogun, Ruben O. Donis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we present novel text mining algorithms for pattern discovery in large gene sequence databases. Our approach works with a small subset of all possible patterns, which reduces both space and time requirements. We call this algorithm Generating All Frequent Patterns (GAFP). Representative subword association rules are introduced to express associations between subword patterns and user-specified target conditions. A rule has the form P ⇒ C, where P is a subword association pattern of the form (α1, α2, ⋯, αk, d) and C is a target condition. The pattern (α1, α2, ⋯, αk, d) is called a k-subword association pattern, where the αi are subwords from the input text sequences and d is the distance constraint, which specifies the maximum distance between two adjacent subwords in the pattern. GAFP provides an efficient approach for computing frequent patterns that optimize the rule confidence.
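The definitions in the abstract can be made concrete with a small sketch. The Python below is not the GAFP algorithm itself, which the abstract does not describe in detail. It only illustrates what it could mean for a sequence to match a k-subword association pattern (α1, ..., αk, d), and how the confidence of a rule P ⇒ C is conventionally computed. The function names `matches` and `rule_confidence` are hypothetical. The reading of "distance" as the number of characters between the end of one subword and the start of the next is an assumption, as is the standard association-rule definition of confidence (matching sequences that satisfy C, divided by all matching sequences).

```python
from typing import Iterable, Sequence, Tuple


def matches(sequence: str, subwords: Sequence[str], d: int) -> bool:
    """Check whether `subwords` occur in order in `sequence`.

    Assumption: the gap between the end of one subword and the start of
    the next must be at most `d` characters. The abstract does not define
    the distance measure precisely, so this is one plausible reading.
    """

    def extend(i: int, prev_end: int) -> bool:
        if i == len(subwords):
            return True  # every subword has been placed
        word = subwords[i]
        last_start = len(sequence) - len(word)
        if i == 0:
            lo, hi = 0, last_start  # first subword may start anywhere
        else:
            lo, hi = prev_end, min(prev_end + d, last_start)
        # Try every allowed start position; backtrack if a later subword fails.
        for start in range(lo, hi + 1):
            if sequence.startswith(word, start) and extend(i + 1, start + len(word)):
                return True
        return False

    return extend(0, 0)


def rule_confidence(
    records: Iterable[Tuple[str, bool]], subwords: Sequence[str], d: int
) -> float:
    """Confidence of P => C over (sequence, satisfies_C) records.

    Assumption: confidence = (#records matching P and satisfying C)
                           / (#records matching P).
    Returns 0.0 when no record matches the pattern.
    """
    outcomes = [ok for seq, ok in records if matches(seq, subwords, d)]
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


# Toy data, invented purely for illustration.
records = [
    ("ACGTTGCA", True),
    ("ACGAAAATGCA", False),
    ("ACGTGCA", False),
    ("TTTT", False),
]
pattern = ("ACG", "TGC")
confidence = rule_confidence(records, pattern, d=1)
```

With the toy records above and d = 1, the first and third sequences match the pattern and only the first satisfies the target condition. Raising d to 4 lets the second sequence match as well, which changes the confidence. This shows how the distance constraint controls which sequences count as evidence for a rule.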

Original language: English (US)
Title of host publication: Proceedings of the International Conference on Information and Knowledge Engineering 2003
Editors: N. Goharian, N. Goharian
Pages: 96-102
Number of pages: 7
State: Published - 2003
Event: Proceedings of the International Conference on Information and Knowledge Engineering 2003 - Las Vegas, NV, United States
Duration: Jun 23 2003 → Jun 26 2003

Publication series

Name: Proceedings of the International Conference on Information and Knowledge Engineering
Volume: 1

Conference

Conference: Proceedings of the International Conference on Information and Knowledge Engineering 2003
Country/Territory: United States
City: Las Vegas, NV
Period: 6/23/03 → 6/26/03

ASJC Scopus subject areas

  • General Engineering
