
Mining Frequent Patterns without Candidate Generation
a paper by Jiawei Han, Jian Pei and Yiwen Yin
School of Computing Science
Simon Fraser University
Presented by Maria Cutumisu
Department of Computing Science
University of Alberta

This paper proposes:
• A novel frequent pattern tree structure: FP-tree
• An efficient FP-tree-based mining method: FP-growth

This approach is very efficient due to:
• Compression of a large database into a smaller data structure
• Pattern fragment growth mining method
• Partitioning-based divide-and-conquer search method

FP-tree: Design and Construction
• To ensure that the tree structure is compact, only frequent length-1 items will have nodes in the tree
• More frequently occurring nodes will have better chances of sharing nodes than the others

Example: a transaction database

Transaction ID   Items Bought              (Ordered) Frequent Items
100              f, a, c, d, i, m, p       f, c, a, m, p
200              a, b, c, f, l, m, o       f, c, a, b, m
300              b, f, h, j, o             f, b
400              b, c, k, s, p             c, b, p
500              a, f, c, e, l, p, m, n    f, c, a, m, p

The corresponding FP-tree
• Transactions sharing an identical itemset can be merged into one, with the number of occurrences registered as count.
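The ordered column of the example table can be reproduced with a short script. This is only a sketch: the slide does not state the minimum support, so a threshold of 3 is assumed here, and the tie-break order `TIE_ORDER` is a hypothetical helper chosen so that items with equal support come out in the same order as on the slide.

```python
from collections import Counter

# Example database from the slide (transaction ID -> items bought).
DB = {
    100: "facdimp",
    200: "abcflmo",
    300: "bfhjo",
    400: "bcksp",
    500: "afcelpmn",
}
MIN_SUPPORT = 3        # assumption: the slide does not give the threshold
TIE_ORDER = "fcabmp"   # assumption: tie-break picked to match the slide's column

# Support of an item = number of transactions that contain it.
support = Counter(item for items in DB.values() for item in set(items))

# L: frequent items in support-descending order (ties resolved by TIE_ORDER).
L = sorted(
    (i for i in support if support[i] >= MIN_SUPPORT),
    key=lambda i: (-support[i], TIE_ORDER.find(i)),
)

# Keep only the frequent items of each transaction, listed in the order of L.
ordered = {tid: [i for i in L if i in items] for tid, items in DB.items()}

for tid, items in ordered.items():
    print(tid, ", ".join(items))
```

Under these assumptions the frequent items are f and c (support 4) and a, b, m and p (support 3); every other item occurs at most twice and is dropped.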

An FP-tree is a tree structure which consists of:
• One root labeled as "null"
• A set of item prefix sub-trees with each node formed by three fields: item-name, count, node-link
• A frequent-item header table with two fields for each entry: item-name, head of node-link

FP-tree construction algorithm
• Input: a transaction database DB and a minimum support threshold ε
• Output: Its frequent pattern tree, FP-tree
• Method: The FP-tree is constructed in the following steps:
1. Scan DB once:
  • Collect the set of frequent items F and their supports
  • Sort F in support descending order as L, the list of frequent items
2. Create a root of an FP-tree, T, and label it as "null"
  • For each transaction Trans in DB do the following:
    • select and sort the frequent items in Trans according to the order of L
    • let the sorted frequent item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T)
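The node fields and the header table listed for the FP-tree can be sketched as a small data structure. The names `FPNode`, `header_table` and `nodes_of` are illustrative, not taken from the paper, and the `parent` and `children` fields are added only so that the tree can be walked; the slides list item-name, count and node-link as the three node fields.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional

@dataclass
class FPNode:
    item_name: Optional[str]                # None stands for the "null" root
    count: int = 0
    node_link: Optional["FPNode"] = None    # next node with the same item_name
    parent: Optional["FPNode"] = None       # assumption: kept for upward walks
    children: Dict[str, "FPNode"] = field(default_factory=dict)

# Header table: item-name -> head of that item's node-link chain.
header_table: Dict[str, FPNode] = {}

def nodes_of(item_name: str, header: Dict[str, FPNode]) -> Iterator[FPNode]:
    """Follow the node-links from the header entry and yield each node."""
    node = header.get(item_name)
    while node is not None:
        yield node
        node = node.node_link

# Two hand-built "b" nodes chained through node_link.
root = FPNode(None)
b1 = FPNode("b", 1, parent=root)
b2 = FPNode("b", 2, parent=root)
b1.node_link = b2
header_table["b"] = b1

total_b = sum(n.count for n in nodes_of("b", header_table))
print(total_b)
```

Summing the counts along a node-link chain gives the support of that item in the tree, which is what the mining step relies on.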

Note: insert_tree([p|P], T) is performed as follows:
• IF T has a child N such that N.item_name = p.item_name, then increment N's count by 1
• ELSE create a new node N, and let its count be 1, its parent link be linked to T, and its node-link be linked to the nodes with the same item_name via the node-link structure
• IF P is nonempty, call insert_tree(P, N) recursively

Analysis
• Two scans of the DB are necessary: the first collects the set of frequent items and the second constructs the FP-tree.
• The cost of inserting a transaction Trans into the FP-tree is O(|Trans|), where |Trans| is the number of frequent items in Trans.

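The insert_tree procedure and the second scan can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the `Node` class, the dictionary-based `header` and the `show` helper are assumptions, and the input is the ordered frequent-item column of the example table.

```python
class Node:
    def __init__(self, item_name, parent):
        self.item_name = item_name   # None for the "null" root
        self.count = 0
        self.parent = parent
        self.children = {}           # item_name -> child Node
        self.node_link = None        # next node with the same item_name

def insert_tree(items, T, header):
    """Insert the sorted frequent-item list [p|P] below node T."""
    if not items:
        return
    p, P = items[0], items[1:]
    N = T.children.get(p)
    if N is None:
        # No matching child: create it and append it to p's node-link chain.
        N = Node(p, T)
        T.children[p] = N
        if p not in header:
            header[p] = N
        else:
            last = header[p]
            while last.node_link is not None:
                last = last.node_link
            last.node_link = N
    N.count += 1                     # a new node ends up with count 1
    insert_tree(P, N, header)

# Ordered frequent items of the five example transactions.
ordered_transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

root, header = Node(None, None), {}
for trans in ordered_transactions:
    insert_tree(list(trans), root, header)

def show(node, depth=0):
    """Print the tree, one item:count pair per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item_name}:{child.count}")
        show(child, depth + 1)

show(root)
```

Each call does constant work per item, which matches the O(|Trans|) insertion cost stated in the analysis.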
• The FP-tree contains the complete information for frequent pattern mining.
• The size of the FP-tree is bounded by the size of the database, but because frequent items are shared, the tree is usually much smaller than its original database.
• High compaction is achieved by placing more frequent items closer to the root (being thus more likely to be shared).

FP-growth: the FP-tree-based mining method
• Starts from a frequent length-1 pattern
• Examines only its conditional pattern base
• Constructs its conditional FP-tree
• Performs mining recursively on the tree
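To make the notion of a conditional pattern base concrete, the sketch below collects, for one item, the prefixes that precede it in the ordered example transactions. This is a shortcut taken for illustration: the method itself reads these prefix paths off the FP-tree by following node-links, but the result is the same because each tree path is a merged set of such prefixes. The function name is hypothetical.

```python
from collections import Counter

# Ordered frequent items of the five example transactions.
ordered_transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_base(item, transactions):
    """Map each non-empty prefix that precedes `item` to its occurrence count."""
    base = Counter()
    for trans in transactions:
        if item in trans:
            prefix = trans[: trans.index(item)]
            if prefix:
                base[prefix] += 1
    return base

m_base = conditional_pattern_base("m", ordered_transactions)
p_base = conditional_pattern_base("p", ordered_transactions)

print(dict(m_base))
print(dict(p_base))
```

For item m the base holds the prefix f, c, a twice and f, c, a, b once, so mining patterns that end in m only has to look at these two short paths.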

FP-growth algorithm
• Input: FP-tree constructed using DB and a minimum support threshold ε
• Output: The complete set of frequent patterns
• Method: Call FP-growth(FP-tree, null)

Procedure FP-growth(Tree, α)
• IF Tree contains a single path P
• THEN for each combination β of the nodes in the path P DO generate pattern β ∪ α with support = minimum support of nodes in β
• ELSE for each a_i in the header of Tree DO
  • generate pattern β = a_i ∪ α with support = a_i.support;
  • construct β's conditional pattern base and FP-tree Tree_β
  • IF Tree_β is not empty THEN call FP-growth(Tree_β, β)

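A compact Python sketch of the recursive branch of this procedure follows. It is written under several assumptions: the single-path shortcut is omitted, each conditional FP-tree is rebuilt from its conditional pattern base, a plain list per item stands in for the node-link chain, ties in support are broken alphabetically, and the minimum support of 3 is not given on the slides. The names `Node`, `build_tree` and `fp_growth` are illustrative.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_support):
    """Build an FP-tree from (items, count) pairs; return (root, header)."""
    support = defaultdict(int)
    for items, count in weighted_transactions:
        for item in set(items):
            support[item] += count
    frequent = {i for i, s in support.items() if s >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted_transactions:
        # Frequent items only, in support-descending order (ties by name).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)   # list replaces the node-link chain
            child.count += count
            node = child
    return root, header

def fp_growth(weighted_transactions, min_support, suffix=frozenset()):
    """Yield (pattern, support) for every frequent pattern ending in suffix."""
    _, header = build_tree(weighted_transactions, min_support)
    for item, nodes in header.items():
        pattern = suffix | {item}
        yield pattern, sum(n.count for n in nodes)
        # Conditional pattern base: prefix path of every node carrying item.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        # An empty base yields an empty tree, which ends the recursion.
        yield from fp_growth(base, min_support, pattern)

db = [("facdimp", 1), ("abcflmo", 1), ("bfhjo", 1), ("bcksp", 1), ("afcelpmn", 1)]
patterns = dict(fp_growth(db, min_support=3))

for items, sup in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(items)), sup)
```

Because ties are broken alphabetically, the tree built here places c above f, unlike the ordering on the example slide; the set of frequent patterns does not depend on that choice.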
Analysis of the FP-growth algorithm
• Finds the complete set of frequent itemsets
• Efficient because:
  • it works on a reduced set of pattern bases
  • it performs mining operations less costly than generation and test:
    • prefix count adjustment
    • counting
    • pattern fragment concatenation

Search technique: partitioning-based divide-and-conquer
• Used instead of the Apriori-like bottom-up generation of frequent itemset combinations
• Reduces the size of the conditional pattern base generated at the subsequent level of search and of its corresponding FP-tree
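Pattern fragment concatenation is easiest to see in the single-path case of the FP-growth procedure, where every combination of nodes on the path is joined with the current suffix. The sketch below assumes the conditional FP-tree of item m in the example database is the single path f:3, c:3, a:3, which follows from the example table if the minimum support is taken to be 3 (a value the slides do not state). The function name is hypothetical.

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """path: (item, count) pairs from root to leaf; suffix: frozenset of items.

    Returns every pattern formed by a non-empty combination of path nodes
    joined with the suffix, mapped to the minimum count in the combination.
    """
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(item for item, _ in combo) | suffix
            patterns[items] = min(count for _, count in combo)
    return patterns

m_tree_path = [("f", 3), ("c", 3), ("a", 3)]   # assumed conditional FP-tree of m
found = single_path_patterns(m_tree_path, frozenset("m"))

for items, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(items)), sup)
```

A path of three nodes gives 2^3 - 1 = 7 patterns without any further counting pass over the database.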

• Transforms the problem of finding long frequent patterns into looking for shorter ones and then concatenating the suffix.
• Employs the least frequent items as suffix, which offers good selectivity.

Performance comparison with other algorithms
• TreeProjection is the supporting algorithm of another novel tree structure: the lexicographic tree
• Comparative analysis of FP-growth with the Apriori and TreeProjection algorithms shows that FP-growth outperforms both of them

Improvements: how to design a disk-resident FP-tree
• Cluster FP-tree nodes by path and by item prefix sub-tree
• B+-tree for FP-tree not fitting into main memory
• Group access mode mining to reduce the I/O cost
• Release space of the conditional pattern base or conditional FP-tree after usage
• Remove the node-links of the FP-tree

Performance improvements
• Materialization of an FP-tree
• Incremental updates of an FP-tree
• FP-tree mining with item constraints
• FP-tree mining of other frequent patterns

Advantages of the FP-growth mining method:
• Efficient and scalable for both long and short frequent patterns; the running memory requirements of FP-growth increase linearly when the support threshold goes down
• An order of magnitude faster than the Apriori algorithm
• Faster than recently reported new frequent pattern mining methods

Drawbacks:
• The tree does not achieve maximal compactness all the time.
• For databases with mostly short transactions, the reduction ratio of the tree with respect to the database is not very high.
• The FP-tree does not always fit into main memory.

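The effect of transaction shape on compression can be estimated without building the tree, since each non-root FP-tree node corresponds to one distinct prefix of the ordered transactions. In the sketch below the "reduction ratio" is taken to mean tree nodes divided by frequent-item occurrences in the database; that definition, the function names and the `short` data set are assumptions made for illustration.

```python
def fp_tree_node_count(ordered_transactions):
    """Number of non-root FP-tree nodes = number of distinct non-empty prefixes."""
    prefixes = set()
    for trans in ordered_transactions:
        for k in range(1, len(trans) + 1):
            prefixes.add(tuple(trans[:k]))
    return len(prefixes)

def reduction_ratio(ordered_transactions):
    """Tree nodes per stored item occurrence (lower means better compression)."""
    occurrences = sum(len(t) for t in ordered_transactions)
    return fp_tree_node_count(ordered_transactions) / occurrences

example = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]   # ordered example database
short = ["ab", "cd", "ef", "gh"]                     # hypothetical short, disjoint transactions

print(fp_tree_node_count(example), reduction_ratio(example))
print(fp_tree_node_count(short), reduction_ratio(short))
```

The example database needs 11 nodes for 20 item occurrences, while the short transactions share no prefixes and are not compressed at all.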
Conclusions
• The FP-growth method has satisfactory performance when tested on large industrial databases
• It is open to a lot of research issues
• Due to compression, large databases (on the order of gigabytes) containing many long patterns may sometimes generate FP-trees which fit in main memory
