Association Rule Mining for the QSAR Problem

CHARM (Zaki & Hsiao, 1999) and Closet (Pei, Han & Mao, 2000). The closed-itemset approaches rely on the application of Formal Concept Analysis (FCA) to the association rule problem, first mentioned in (Zaki & Ogihara, 1998). For more details on lattice theory see (Ganter & Wille, 1999). Another approach leading to a small number of results is finding representative association rules (Kryszkiewicz, 1998).

The difference between Apriori-based and closed itemset-based approaches lies in the treatment of sub-unitary confidence and unitary confidence association rules: Apriori makes no distinction between them, while FCA-based approaches report sub-unitary association rules (also named partial implication rules) structured in a concept lattice and, eventually, the pseudo-intents, a base for the unitary association rules (also named global implications, exhibiting a logical implication behavior). The advantage of a closed itemset approach is the smaller size of the resulting concept lattice versus the number of frequent itemsets, i.e. search space reduction.

MAIN THRUST OF THE CHAPTER

While there are many application domains for association rule mining methods, they have only started to be used in relation to the QSAR problem. There are two main approaches: one that attempts to classify chemical compounds using frequent sub-structure mining (Deshpande, Kuramochi, Wale, & Karypis, 2005), a modified version of association rule mining, and one that attempts to predict activity using an association rule-based model (Dumitriu, Segal, Craciun, Cocu, & Georgescu, 2006).

Mined Data

For the QSAR problem, the items are called chemical compound descriptors. There are various types of descriptors that can be used to represent the chemical structure of compounds: chemical element presence in a compound, chemical element mass, normalized chemical element mass, topological structure of the molecule, geometrical structure of the molecule, etc. Generally, a feature selection algorithm is applied before mining, in order to reduce the search space as well as the model dimension. We do not focus on feature selection methods in this chapter.

The classification approach uses both the topological representation, which sees a chemical compound as an undirected graph with atoms as vertices and bonds as edges, and the geometric representation, which sees a chemical compound as an undirected graph with 3D coordinates attached to the vertices.

The predictive association-based model approach is applied to organic compounds only and uses typical sub-structure presence/count descriptors (a typical sub-structure can be, for example, -CH3 or -CH2-). It also includes a pre-clustered target item, the activity to be predicted.

Resulting Data Model

Frequent sub-structure mining attempts to build, just like frequent itemsets, frequent connected sub-graphs, by adding vertices step by step, in an Apriori fashion. The main difference from frequent itemset mining is that graph isomorphism has to be checked in order to correctly compute sub-graph support in the database. The purpose of frequent sub-structure mining is the classification of chemical compounds, using a Support Vector Machine-based classification algorithm on the chemical compound structure expressed in terms of the resulting frequent sub-structures.

The predictive association rule-based model considers as mining result only the global implications with predictive capability, namely the ones comprising the target item in the rule’s conclusion. The prediction is achieved by applying to a new compound all the rules in the model. Some rules may not apply (the rule’s premises are not satisfied by the compound’s structure) and some rules may predict activity clusters. Each cluster can be predicted by a number of rules. After subjecting the compound to the predictive model, it can yield:
- a “none” result, meaning that the compound’s activity cannot be predicted with the model,
- a cluster id result, meaning the predicted activity cluster,
- several cluster ids; whenever this situation occurs it can be dealt with in various manners: a vote can be held and the majority cluster id can be declared the winner, or the rule set (the model) can be refined since it is too general.
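
The three possible outcomes of applying the rule-based model to a new compound can be illustrated with a minimal Python sketch. It is not the implementation of the cited authors: the `Rule` type, the `predict` function, the example rules and the decision to report a tied vote as several cluster ids are all assumptions made for illustration. A compound is reduced to the set of sub-structure descriptors present in it.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical global implication: premise -> activity cluster."""
    premise: FrozenSet[str]  # sub-structures that must all be present
    cluster: int             # activity cluster id in the rule's conclusion


def predict(compound: FrozenSet[str], rules: Iterable[Rule]) -> List[int]:
    """Apply every rule and hold a majority vote among those that fire.

    Returns [] for the "none" result, one id for a clear winner,
    and several ids when the vote is tied (assumed tie handling).
    """
    # A rule applies only if its whole premise is contained in the compound.
    votes = Counter(r.cluster for r in rules if r.premise <= compound)
    if not votes:
        return []
    best = max(votes.values())
    return sorted(c for c, n in votes.items() if n == best)


# Illustrative rules only; not taken from any real QSAR model.
rules = [
    Rule(frozenset({"CH3"}), 1),
    Rule(frozenset({"CH3", "CH2"}), 1),
    Rule(frozenset({"OH"}), 2),
]

print(predict(frozenset({"CH3", "CH2", "OH"}), rules))  # two votes vs one
print(predict(frozenset({"NH2"}), rules))               # no rule applies
print(predict(frozenset({"CH3", "OH"}), rules))         # tied vote
```

A tied result is one signal that the rule set may be too general and could be refined, which is the second remedy mentioned in the text.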

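The distinction between unitary confidence rules (global implications) and sub-unitary confidence rules (partial implications) discussed in this chapter can also be made concrete. The sketch below uses the standard definitions of support and confidence; the function names and the toy transaction database are assumptions chosen for illustration.

```python
from fractions import Fraction
from typing import FrozenSet, List


def support(itemset: FrozenSet[str], db: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def confidence(premise: FrozenSet[str], conclusion: FrozenSet[str],
               db: List[FrozenSet[str]]) -> Fraction:
    """support(premise ∪ conclusion) / support(premise), kept exact."""
    covered = support(premise, db)
    if covered == 0:
        raise ValueError("premise does not occur in the database")
    return Fraction(support(premise | conclusion, db), covered)


def is_global_implication(premise: FrozenSet[str],
                          conclusion: FrozenSet[str],
                          db: List[FrozenSet[str]]) -> bool:
    """True when the rule has unitary confidence (holds without exception)."""
    return confidence(premise, conclusion, db) == 1


# Toy database, invented for this example.
db = [
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    frozenset({"b", "c"}),
    frozenset({"c"}),
]

a, b, c = frozenset({"a"}), frozenset({"b"}), frozenset({"c"})
print(confidence(a, b, db), is_global_implication(a, b, db))
print(confidence(b, c, db), is_global_implication(b, c, db))
```

In this toy database every transaction containing `a` also contains `b`, so `a -> b` behaves as a logical implication, while `b -> c` holds in only two of the three transactions containing `b` and is therefore a partial implication.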

