Mining Data Streams Presentation

ONLINE DATA STREAM MINING OF RECENT FREQUENT ITEMSETS BASED ON SLIDING WINDOW MODEL IEEE 2008 Conference
Presented by: Baha Nawafleh
20093173016
Table of Contents
Introduction.
Literature Review.
Problem definition.
MRFI-SW algorithm (window initialization phase, window sliding phase, mining frequent itemsets phase).
Experiment.
Conclusion.
Questions??
Introduction
A data stream is a massive sequence of data elements continuously generated at a rapid rate. Unlike traditional static datasets, data streams are continuous and unbounded, and their data distribution changes over time.
Many applications generate large volumes of streaming data in real time, such as sensor data from sensor networks, online transaction flows in retail chains, and Web records and clickstreams in Web applications.
Data streams can be classified into offline data streams [1] and
online data streams [2].
Cont..
[1] The target application domains of offline data streams involve bulk additions of new transactions, as in a data warehouse system.
[2] Online data streams are characterized by data updated in real time. The data of an online stream arrive one by one over time, such as the continuously generated transactions in a network monitoring system.
Literature Review
Researchers have proposed many algorithms for mining frequent itemsets in data streams.

Research on mining frequent itemsets in data streams can be divided into three categories:

the landmark window model.
the time-fading model.
the sliding window model.
Manku and Motwani developed two single-pass algorithms, Sticky Sampling and Lossy Counting. These algorithms mine frequent items over offline data streams under the landmark window model.
Cont..
SWFI-stream is an algorithm for mining frequent itemsets in online data streams under the transaction-sensitive sliding window model. Other work proposed an incremental algorithm for mining frequent itemsets in offline data streams with a time-sensitive sliding window.
The purpose of this paper:

MRFI-SW, an algorithm for Mining Recent Frequent Itemsets over online data streams with a Sliding Window.
Problem definition
Let {i1, i2, ..., im} be a set of literals, called items.
A transaction T = {id, x1 x2 ... xn} consists of a transaction identifier id and the items x1, x2, ..., xn it contains.
A transaction data stream DS = {T1, T2, ..., TN} is a continuous sequence of transactions.
A data stream can also be denoted as DS = {W1, W2, ..., Wm}, where each basic window is a transaction-sensitive sliding window (SW).
w is the size of the transaction-sensitive sliding window, i.e., the number of transactions it holds.
s is a user-defined minimum support threshold in the range [0, 1].
The support of an itemset X over SW, sup(X), is the number of transactions in SW containing X as a subset.
If sup(X) is at least s*w, X is called a frequent itemset (FI).
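The definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `support` and `is_frequent` and the sample window are hypothetical, and the frequency test is written as `>=`, which should be changed to `>` if a strict comparison is intended.

```python
def support(itemset, window):
    """Count the transactions in the window that contain every item of itemset."""
    target = set(itemset)
    return sum(1 for transaction in window if target.issubset(transaction))


def is_frequent(itemset, window, s):
    """Compare the support of itemset against s * w, where w is the window size."""
    w = len(window)
    return support(itemset, window) >= s * w


# Hypothetical sliding window holding w = 3 transactions (illustrative data).
window = [("b", "c", "e"), ("a", "b", "c", "e"), ("b", "e")]

print(support({"b", "e"}, window))       # 3
print(is_frequent({"a"}, window, 0.6))   # False: support 1 is below 0.6 * 3
```

Counting by scanning every transaction is the naive baseline; the bit-order representation described next is meant to avoid rescanning the window.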
MRFI-SW algorithm
The proposed MRFI-SW algorithm consists of three phases:
window initialization phase.
window sliding phase.
mining frequent itemsets phase.
window initialization phase.

The window initialization phase is activated when the first transaction arrives and lasts until the transaction-sensitive sliding window is full. When the sliding window is full, the items of its w transactions are transformed into bit-order representations. For each item X, the representation R(X) is a sequence of w entries, each of the form (bit, order). If item X is in the i-th transaction of the current sliding window, the bit of the i-th entry of R(X) is set to 1 and its order records the position of X within that transaction; otherwise the i-th entry is set to 0 (R(X)_bit = R(X)_order = 0).
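A minimal sketch of this transformation, assuming a full window is given as a list of item tuples. The function name `build_bit_order` and the sample transactions are hypothetical and chosen only for illustration; the sketch is not the authors' C implementation.

```python
def build_bit_order(window):
    """Map each item to w entries: (1, order) if the item occurs in the
    i-th transaction at position `order` (1-based), otherwise 0."""
    w = len(window)
    rep = {}
    for i, transaction in enumerate(window):
        for order, item in enumerate(transaction, start=1):
            rep.setdefault(item, [0] * w)[i] = (1, order)
    return rep


# Illustrative window of three transactions (assumed data).
sw = [("a", "c", "d"), ("b", "c", "e"), ("a", "b", "c", "e")]
rep = build_bit_order(sw)

print(rep["a"])  # [(1, 1), 0, (1, 1)]
print(rep["d"])  # [(1, 3), 0, 0]
```

Storing a plain 0 for absent items mirrors the R(X)_bit = R(X)_order = 0 convention described above.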
Cont..
For example, there are three transactions in SW1,
T1, T2, and T3. The bit-order representations of items in SW1 are shown in Table 1.
Cont..

Table 1. Bit-order of items in window initialization phase
window sliding phase

The window sliding phase is activated when the sliding window becomes full. In this phase, a newly arriving transaction is inserted into the sliding window and the oldest transaction in the current sliding window is removed. Because the bit-order representation is a sequence, the removal is done with a left-shift operation on the sequence. To reduce memory usage, an entry-pruning operation is executed after the window slides: the entry of an item is pruned when its bit-order sequence is all 0. That is, if item X does not appear in any transaction of the current sliding window, so that sup(X)_SW = 0, the entry R(X) is pruned.
Cont..
For instance, in Table 1, when the fourth transaction T4 arrives, the first transaction T1 must be removed from the current SW. A left-shift is applied to the bit-order sequence entries of the items in SW1. R(a) is modified from <(1, 1), 0, (1, 1)> to <0, (1, 1), 0>.

Similarly:
R(c)=<(1, 2), (1, 3), 0>
R(d)=<0, 0, 0>
R(b)=<(1, 1), (1, 2), (1, 1)>
R(e)=<(1, 3), (1, 4), (1, 2)>
Note that item d is dropped because R(d)=<0, 0, 0>, i.e., sup(d)_SW2 = 0.
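The slide step in this example can be reproduced with a short sketch. The function name `slide` is hypothetical. The starting entry for a and the resulting sequences come from the example above; the starting entries for b, c, d and e, and the contents of T4 as (b, e), are assumptions chosen to be consistent with those results.

```python
def slide(rep, new_transaction, w):
    """Left-shift every sequence, record the new transaction in the last
    slot, then prune items whose sequence is all zeros."""
    for entries in rep.values():
        del entries[0]          # drop the oldest transaction's entry
        entries.append(0)       # open a slot for the new transaction
    for order, item in enumerate(new_transaction, start=1):
        rep.setdefault(item, [0] * w)[-1] = (1, order)
    for item in [x for x, entries in rep.items() if all(e == 0 for e in entries)]:
        del rep[item]           # sup(item) = 0 in the current window
    return rep


# Assumed state of SW1 before T4 arrives (only R(a) is given in the text).
rep = {
    "a": [(1, 1), 0, (1, 1)],
    "b": [0, (1, 1), (1, 2)],
    "c": [(1, 2), (1, 2), (1, 3)],
    "d": [(1, 3), 0, 0],
    "e": [0, (1, 3), (1, 4)],
}
slide(rep, ("b", "e"), w=3)

print(rep["a"])     # [0, (1, 1), 0]
print("d" in rep)   # False: d was pruned
```

Deleting the first list element is linear in w; a fixed-width integer bitmask with a shift would be closer to a true left-shift, at the cost of storing the order values separately.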
Algorithm 1
Output: updated bit-order sequence

Initialize sliding window and bit-order sequence;
While a new transaction Ti arrives in SW do
  If (SW is full)
    Transform all items in SW to bit-order sequences;
  Else
    Do left_shift operation on the bit-order sequences of all items;
    For each item X arriving in SW
      Transform X to its bit-order sequence representation;
    End for
  End if
  For each R(X) in SW
    If SUM(R(X).bit) = 0
      Drop X from SW;
    End if
  End for
End while

Mining frequent itemsets phase

The mining frequent itemsets phase is activated when the bit-order sequences have been updated and the frequent itemsets are requested. We propose a method that generates frequent k-itemsets (itemsets with k items) from the known frequent (k-1)-itemsets. The method is based on the Apriori property (if a pattern is frequent, all of its sub-patterns are also frequent). We use a SUM operation on the bits of each entry to compute the support of items and find the frequent 1-itemsets in the current SW. The algorithm then uses an AND operation on the bits of the entries to find 2-itemsets. The support of each 2-itemset is computed, and itemsets whose support is below the user-defined threshold are pruned. The same steps are repeated for larger k, and the process terminates when no new (k+1)-itemsets are generated.
Cont..
For instance, consider the DS in Table 1. Let the minimum support threshold s be 0.6. Hence, an itemset X is frequent if sup(X) ≥ 0.6*3 = 1.8. We discuss the steps of mining frequent itemsets in SW2. First, the MRFI-SW algorithm finds the frequent 1-itemsets by computing the support of each item:
R(a)=<0, (1, 1), 0>, i.e., sup(a)=1
R(c)=<(1, 2), (1, 3), 0>, i.e., sup(c)=2
R(b)=<(1, 1), (1, 2), (1, 1)>, i.e., sup(b)=3
R(e)=<(1, 3), (1, 4), (1, 2)>, i.e., sup(e)=3
So item a is not frequent because its support is 1.
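The SUM step of this example can be checked with a small sketch. The helper names `bit` and `support` are hypothetical; the sequences and the threshold s = 0.6 with w = 3 are taken from the example, and the comparison is assumed to be "at least s*w".

```python
def bit(entry):
    """Return the bit component of an entry: 0 for an empty entry, else 1."""
    return 0 if entry == 0 else entry[0]


def support(entries):
    """SUM over the bit components of a bit-order sequence."""
    return sum(bit(e) for e in entries)


# Bit-order sequences of SW2 as listed in the example.
sw2 = {
    "a": [0, (1, 1), 0],
    "c": [(1, 2), (1, 3), 0],
    "b": [(1, 1), (1, 2), (1, 1)],
    "e": [(1, 3), (1, 4), (1, 2)],
}
s, w = 0.6, 3

frequent_1 = sorted(x for x, r in sw2.items() if support(r) >= s * w)
print(frequent_1)  # ['b', 'c', 'e']
```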
Cont..
Cont..
Algorithm 2
Output: a set of frequent itemsets.

Find frequent 1-itemsets FI1;
For (k=2; FIk-1 ≠ null; k++)
  Do AND operation on R(FIk-1).bit to find Candidate FIk;
  For each Candidate FIk do
    Do bitwise SUM operation on R(Candidate FIk);
    If SUM(R(Candidate FIk).bit) ≥ s*w
      If k=2
        Scan R(Candidate FIk).order;
        Output FIk;
      End if
    End if
  End for
End for
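The level-wise loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `mine` is hypothetical, only the bit components are used (the order components and the order scan are ignored), and the threshold test is assumed to be "at least s*w".

```python
from itertools import combinations


def mine(bits, s, w):
    """Level-wise mining on bit vectors: SUM gives support, AND of two
    frequent (k-1)-itemsets gives the bit vector of a candidate k-itemset."""
    min_count = s * w
    current = {frozenset([x]): b for x, b in bits.items() if sum(b) >= min_count}
    result = dict(current)
    k = 2
    while current:
        candidates = {}
        for (a, bits_a), (b, bits_b) in combinations(current.items(), 2):
            union = a | b
            if len(union) != k or union in candidates:
                continue
            # Apriori property: every (k-1)-subset must itself be frequent.
            if any(union - {x} not in current for x in union):
                continue
            anded = [p & q for p, q in zip(bits_a, bits_b)]
            if sum(anded) >= min_count:
                candidates[union] = anded
        result.update(candidates)
        current = candidates
        k += 1
    return {tuple(sorted(items)): sum(b) for items, b in result.items()}


# Bit components of the SW2 sequences from the example.
sw2_bits = {"a": [0, 1, 0], "c": [1, 1, 0], "b": [1, 1, 1], "e": [1, 1, 1]}
frequent = mine(sw2_bits, s=0.6, w=3)
print(frequent[("b", "c", "e")])  # 2
```

With s = 0.6 and w = 3 the sketch reports b, c, e, every pair of them, and {b, c, e} as frequent, while a is pruned at the first level.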

Experiment
Our algorithm was written in C and compiled using Microsoft Visual C++ 6.0. We generated online data streams using the IBM synthetic data generator.
Figure 1. Memory usage in the window initialization phase
Figure 2. Memory usage in the window sliding phase
Figure 3. Memory usage in the mining frequent itemsets phase
Figure 4. Processing time of the algorithm
Conclusion
Mining online data streams is an interesting and challenging research field. The characteristics of data streams make many traditional mining algorithms inapplicable. This paper proposed an efficient three-phase algorithm for mining recent frequent itemsets over online data streams with a transaction-sensitive sliding window. Experiments show that the proposed algorithm not only attains highly accurate mining results, but also runs significantly faster and consumes less memory than the SWFI-stream algorithm for mining recent frequent itemsets over online data streams.
Questions??
