20093173016
Table of Contents
Introduction
Literature Review
Problem definition
MRFI-SW algorithm (window initialization phase, window sliding phase, mining frequent itemsets phase)
Introduction
A data stream is a massive sequence of data elements generated continuously at a rapid rate. Unlike traditional static datasets, data streams are continuous and unbounded, and their data distribution changes with time.
Many applications generate large volumes of streaming data in real time, such as sensor data from sensor networks, online transaction flows in retail chains, and Web records and clickstreams in Web applications.
Data streams can be classified into offline data streams [1] and online data streams [2].
Cont..
[1] Offline data streams are characterized by bulk additions of new transactions, as in a data warehouse system.
[2] Online data streams are characterized by real-time updated data. The data of an online stream arrive one by one over time, such as the continuously generated transactions in a network monitoring system.
Literature Review
Researchers have proposed many algorithms for mining frequent itemsets over data streams. They fall under three models: the landmark window model, the time-fading model, and the sliding window model.
Sampling and Lossy Counting: mines frequent items over offline data streams under the landmark window model.
Cont..
SWFI-stream is an algorithm for mining frequent itemsets in online data streams under the transaction-sensitive sliding window model.
An incremental mining algorithm has also been proposed to mine frequent itemsets in offline data streams with a time-sensitive sliding window.
Problem definition
Let I = {i1, i2, ..., im} be a set of literals, called items.
A transaction is T = {id, x1 x2 ... xn}.
A transaction data stream DS = {T1, T2, ..., TN} is a continuous sequence of transactions.
A data stream can also be denoted as DS = {W1, W2, ..., Wm}, where each basic window Wi is a transaction-sensitive sliding window (SW) of size w.
s is a user-defined minimum support threshold in the range [0, 1].
The support of an itemset X over SW, written sup(X), is the number of transactions in SW that contain X as a subset.
If sup(X) is at least s*w, X is called a frequent itemset (FI).
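The definitions above can be checked with a small Python sketch. The function names are hypothetical, the window holds transactions T2, T3 and T4 as they can be read off the bit-order sequences given later in these slides, and the frequency test is assumed to be "at least s*w".

```python
def support(itemset, window):
    """Count the transactions in the window that contain every item of itemset."""
    target = set(itemset)
    return sum(1 for transaction in window if target.issubset(transaction))


def is_frequent(itemset, window, s):
    """Assumed test: support must reach s * w, where w is the window size."""
    return support(itemset, window) >= s * len(window)


# Window of size w = 3 holding T2, T3, T4 (item order inside a transaction is kept).
window = [["b", "c", "e"], ["a", "b", "c", "e"], ["b", "e"]]

sup_be = support(["b", "e"], window)   # b and e appear together in all three
sup_a = support(["a"], window)         # a appears only in T3
```

With s = 0.6 the threshold is 0.6*3 = 1.8, so any itemset contained in at least two of the three transactions counts as frequent.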
MRFI-SW algorithm
The proposed MRFI-SW algorithm consists of three phases: the window initialization phase, the window sliding phase, and the mining frequent itemsets phase.
Window initialization phase. This phase runs as transactions arrive and lasts until the transaction-sensitive sliding window is full. When the sliding window is full, the items of its w transactions are transformed into bit-order representations. Each entry has the form (bit, order), and the sequence of entries for item X is denoted R(X). If item X is in the i-th transaction of the current sliding window, the bit of the i-th entry, R(X)_bit, is set to 1 and the position of X within that transaction is stored in R(X)_order; otherwise the i-th entry of R(X) is set to 0 (R(X)_bit = R(X)_order = 0).
Cont..
T1, T2, and T3. The bit-order representations of items in SW1 are shown in Table 1.
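As an illustration of the bit-order representation, here is a minimal Python sketch. The function name is hypothetical. T2 and T3 follow from the sequences quoted in these slides; the full contents of T1 are an assumption (only the position of item a in T1 is stated in the text).

```python
def build_bit_order(window):
    """Map each item to one entry per transaction: (1, position) if present, else 0."""
    w = len(window)
    rep = {}
    for i, transaction in enumerate(window):
        for position, item in enumerate(transaction, start=1):
            # Create an all-zero sequence on first sight, then fill slot i.
            rep.setdefault(item, [0] * w)[i] = (1, position)
    return rep


# SW1 = T1, T2, T3. T1 = [a, c, d] is an assumed transaction for illustration.
sw1 = [["a", "c", "d"], ["b", "c", "e"], ["a", "b", "c", "e"]]
rep = build_bit_order(sw1)
```

Here `rep["a"]` comes out as `[(1, 1), 0, (1, 1)]`, which matches the sequence the slides give for R(a) in SW1.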
Cont..
Window sliding phase. This phase starts after the sliding window becomes full. In this phase, a newly arriving transaction is inserted into the sliding window, and the oldest transaction in the current sliding window is removed. Because the bit-order representation is a sequence, the removal is done with a left-shift operation on each sequence. To reduce memory usage, a pruning operation is executed after the window slides: the entry of an item is pruned when its bit-order sequence is all 0. That is, if item X does not appear in any transaction of the current sliding window, so that sup(X)_SW = 0, the entry R(X) is pruned.
Cont..
For instance, in Table 1, when the fourth transaction T4 arrives, the first transaction T1 must be removed from the current SW. The bit-order sequences of the items in SW1 are left-shifted. R(a) is modified from <(1, 1), 0, (1, 1)> to <0, (1, 1), 0>. Similarly:
R(c) = <(1, 2), (1, 3), 0>
R(d) = <0, 0, 0>
R(b) = <(1, 1), (1, 2), (1, 1)>
R(e) = <(1, 3), (1, 4), (1, 2)>
Because R(d) is now all 0, its entry is pruned.
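The left-shift and pruning steps can be sketched in Python as follows. The function name and the dictionary layout are hypothetical, and the SW1 entries for c and d rest on an assumed T1 = [a, c, d]; the resulting SW2 sequences are the ones listed above.

```python
def slide_window(rep, new_transaction, w):
    """Left-shift every sequence, append the new transaction's entries, prune all-zero items."""
    positions = {item: pos for pos, item in enumerate(new_transaction, start=1)}
    for item in positions:
        rep.setdefault(item, [0] * w)          # items not seen before start as all zeros
    for item in list(rep):
        entry = (1, positions[item]) if item in positions else 0
        rep[item] = rep[item][1:] + [entry]    # drop the oldest slot, add the newest
        if all(e == 0 for e in rep[item]):
            del rep[item]                      # pruning: item absent from the whole window
    return rep


# Bit-order sequences for SW1 (c and d depend on the assumed T1).
sw1_rep = {
    "a": [(1, 1), 0, (1, 1)],
    "b": [0, (1, 1), (1, 2)],
    "c": [(1, 2), (1, 2), (1, 3)],
    "d": [(1, 3), 0, 0],
    "e": [0, (1, 3), (1, 4)],
}
sw2_rep = slide_window(sw1_rep, ["b", "e"], w=3)   # T4 = [b, e]
```

A Python list slice copies the sequence, so this is linear in w per item; an implementation that packs the bits into machine words could use a true shift instruction instead.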
Mining frequent itemsets phase. This phase runs after the bit-order sequences are updated and the frequent itemsets are requested. We propose a method that generates frequent k-itemsets (itemsets with k items) from the known frequent (k-1)-itemsets. The method is based on the Apriori property (if a pattern is frequent, all of its sub-patterns are also frequent). We use the SUM operation on the bit of each entry to compute the support of each item and find the frequent 1-itemsets in the current SW. The algorithm then uses the AND operation on the bits of the entries to form 2-itemsets. The support of each 2-itemset is computed, and the itemsets whose support is below the user-defined threshold are pruned. The process repeats until no new (k+1)-itemsets are generated.
Cont..
For instance, consider the DS in Table 1. Let the minimum support threshold s be 0.6. Hence, an itemset X is frequent if sup(X) ≥ 0.6*3 = 1.8. We discuss the steps of mining frequent itemsets in SW2. First, the MRFI-SW algorithm finds the frequent 1-itemsets by computing the support of each item:
R(a) = <0, (1, 1), 0>, i.e., sup(a) = 1
R(c) = <(1, 2), (1, 3), 0>, i.e., sup(c) = 2
R(b) = <(1, 1), (1, 2), (1, 1)>, i.e., sup(b) = 3
R(e) = <(1, 3), (1, 4), (1, 2)>, i.e., sup(e) = 3
Since sup(a) = 1 is below 1.8, a is not frequent; the frequent 1-itemsets are c, b, and e.
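The SUM step of this example can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the frequency test is assumed to be "at least s*w".

```python
def bits(sequence):
    """Extract the bit of each (bit, order) entry; a plain 0 entry has bit 0."""
    return [1 if entry != 0 else 0 for entry in sequence]


def frequent_1_itemsets(rep, s, w):
    """SUM the bits of every item and keep the items whose support reaches s * w."""
    supports = {item: sum(bits(seq)) for item, seq in rep.items()}
    return {item: sup for item, sup in supports.items() if sup >= s * w}


# Bit-order sequences of SW2 as listed in the example.
sw2 = {
    "a": [0, (1, 1), 0],
    "c": [(1, 2), (1, 3), 0],
    "b": [(1, 1), (1, 2), (1, 1)],
    "e": [(1, 3), (1, 4), (1, 2)],
}
fi1 = frequent_1_itemsets(sw2, s=0.6, w=3)
```

The result keeps c, b and e with supports 2, 3 and 3, and drops a, whose support of 1 is below the threshold of 1.8.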
Cont..
itemsets.
Find frequent 1-itemsets FI1
For (k = 2; FIk-1 ≠ null; k++)
    Do AND operation on R(FIk-1).bit to find Candidate FIk
    For each Candidate FIk do
        Do bitwise SUM operation on R(Candidate FIk)
        If SUM(R(Candidate FIk).bit) ≥ s*w
            If k = 2
                Scan R(Candidate FIk).order
                Output FIk
            End if
        End if
End for
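The level-wise loop can be approximated in Python as below. This is a sketch under stated assumptions, not the authors' C implementation: the function name is hypothetical, it works on the bit part only (the scan of the order field is omitted because the slides do not say what it is used for), and it reports frequent itemsets of every size.

```python
from itertools import combinations


def mine_frequent_itemsets(bit_vectors, s, w):
    """Level-wise mining: SUM gives support, AND of two k-1 vectors gives a k-itemset vector."""
    threshold = s * w
    # Level 1: keep single items whose bit sum reaches the threshold.
    current = {frozenset([item]): bv for item, bv in bit_vectors.items()
               if sum(bv) >= threshold}
    result = {}
    while current:
        result.update({itemset: sum(bv) for itemset, bv in current.items()})
        candidates = {}
        for (x, bx), (y, by) in combinations(current.items(), 2):
            union = x | y
            if len(union) != len(x) + 1 or union in candidates:
                continue                      # only join itemsets that differ in one item
            bv = [p & q for p, q in zip(bx, by)]
            if sum(bv) >= threshold:
                candidates[union] = bv
        current = candidates
    return result


# Bit parts of the SW2 sequences from the example (s = 0.6, w = 3).
sw2_bits = {"a": [0, 1, 0], "b": [1, 1, 1], "c": [1, 1, 0], "e": [1, 1, 1]}
found = mine_frequent_itemsets(sw2_bits, s=0.6, w=3)
```

The AND of the bit vectors of two itemsets marks exactly the transactions that contain both, so its sum is the support of their union. The loop stops when a level produces no frequent candidates.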
Experiment
Our algorithm was written in C and compiled using Microsoft Visual C++ 6.0. We generated online data streams using the IBM synthetic data generator.
sliding window
Conclusion
Mining online data streams is an interesting and challenging research field. The characteristics of data streams make many traditional mining algorithms inapplicable.
This paper proposed an efficient three-phase algorithm for mining recent frequent itemsets over online data streams with a transaction-sensitive sliding window. Experiments show that the proposed algorithm not only attains highly accurate mining results, but also runs significantly faster and consumes less memory than the SWFI-stream algorithm for mining recent frequent itemsets over online data streams.
Questions??