
Mining Stream, Time-Series,

and Sequence Data

Md. Yasser Arafat


MS Student, Dept of CSE, DU
7/11/15

Topics Covered
Methodologies for Stream Data Processing
Association
Tilted Time Frame
Critical Layers
Lossy Counting Algorithm
Hoeffding Tree Algorithm
VFDT (Very Fast Decision Tree learner)
Categories of Time-Series Movements
Estimation of Trend Curve
Similarity Search in Time-Series Analysis

Methodologies for Stream Data Processing

Random Sampling
Sliding Windows
Histograms
Multiresolution Methods
Sketches
Randomized Algorithms

Tilted Time Frame


Natural tilted time frame
Time frame structured in multiple granularities
based on the natural or usual time scale
Example: minimal granularity of a quarter (15 minutes); then 4 quarters in an hour, 24 hours in a day, 31 days in a month, 12 months in a year
[Figure: time axis drawn at granularities of 12 months, 31 days, 24 hours, 4 quarters]

Tilted Time Frame


Logarithmic tilted time frame
Time frame is structured in multiple
granularities according to a logarithmic
scale
Example: minimal granularity of 1 minute; then 1, 2, 4, 8, 16, 32, ... minutes
[Figure: time axis drawn at granularities of 64t, 32t, 16t, 8t, 4t, 2t, t]

Tilted Time Frame


Progressive logarithmic tilted time
frame
Snap-shots are stored at differing levels of
granularity depending on the recency
Example: Suppose there are 5 frames and each holds at most 3 snapshots. Given a snapshot number N, if N mod 2^d = 0, insert the snapshot into frame number d. If a frame then holds more than 3 snapshots, kick out the oldest one.
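This insertion rule can be sketched in Python. A minimal sketch, assuming (consistent with the example above) that snapshot N goes into the frame for the largest d with N mod 2^d = 0, capped at the highest frame number; the class and function names are illustrative, not from the source:

```python
from collections import deque

def frame_for(n, max_frame):
    """Largest d such that n mod 2^d == 0, capped at max_frame."""
    d = 0
    while n % (2 ** (d + 1)) == 0 and d + 1 <= max_frame:
        d += 1
    return d

class ProgressiveLogFrames:
    """Progressive logarithmic tilted time frame: a few frames,
    each holding at most max_snapshots snapshots."""
    def __init__(self, num_frames=5, max_snapshots=3):
        self.frames = [deque(maxlen=max_snapshots) for _ in range(num_frames)]

    def insert(self, n, snapshot):
        d = frame_for(n, len(self.frames) - 1)
        # deque with maxlen kicks out the oldest snapshot automatically
        self.frames[d].append((n, snapshot))
```

With 5 frames and snapshots 1..24, frame 0 ends up holding the 3 most recent odd-numbered snapshots, frame 3 holds snapshots 8 and 24, and frame 4 holds snapshot 16.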

Critical Layers
Computing and storing a stream cube at all layers is too costly; instead, two critical layers are maintained: the minimal interest layer (m-layer), the minimally interesting layer an analyst would examine, and the observation layer (o-layer), the layer at which analysts typically watch the stream. Layers in between are computed on demand.

Lossy Counting Algorithm


User provides two input parameters:
Min support threshold, σ
Error bound, ε
Incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉
Each list entry keeps an approximate frequency count, f, and a maximum possible error, Δ
If a given item already exists, we simply increase its frequency count, f. Otherwise, we insert it into the list with a frequency count of 1.
If the new item is from the bth bucket, we set Δ to be b − 1.

Lossy Counting Algorithm


An item entry is deleted if, for that entry, f + Δ ≤ b.
We know that b ≤ N/w, that is, b ≤ εN. So an item's count can be underestimated by at most εN.
If an item's actual frequency is at least σN, then its stored count f is at least σN − εN.
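The one-pass procedure above can be sketched in Python. A minimal sketch, assuming hash-table entries of the form item -> [f, Δ] and pruning at each bucket boundary; function names are illustrative, not from the source:

```python
import math

def lossy_count(stream, epsilon):
    """Single-pass Lossy Counting with error bound epsilon."""
    w = math.ceil(1 / epsilon)          # bucket width w = ceil(1/epsilon)
    entries = {}                        # item -> [f, delta]
    n = 0
    for item in stream:
        n += 1
        b = math.ceil(n / w)            # current bucket number
        if item in entries:
            entries[item][0] += 1       # existing item: bump f
        else:
            entries[item] = [1, b - 1]  # new item: f = 1, delta = b - 1
        if n % w == 0:                  # bucket boundary: prune f + delta <= b
            entries = {e: fd for e, fd in entries.items() if fd[0] + fd[1] > b}
    return entries, n

def frequent_items(entries, n, s, epsilon):
    """Output items whose stored count exceeds (s - epsilon) * n."""
    return {e for e, (f, _) in entries.items() if f >= (s - epsilon) * n}
```

For example, with ε = 0.1 and support s = 0.3, a 90-item stream dominated by two items reports exactly those two, and rare items are pruned at the bucket boundaries.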

Lossy Counting - Example


Step 1: Divide the stream into windows
[Figure: stream split into Window 1, Window 2, Window 3]

Lossy Counting - Example


[Figure: frequency counts, empty at first, filled after processing the first window]

Lossy Counting - Example


[Figure: frequency counts updated after processing the next window]

Lossy Counting Example (Error Analysis)

How much do we undercount?
If current size of stream = N and window size = 1/ε, then frequency error ≤ #windows = εN

Rule of thumb:
Set ε = 10% of support s
Example: given support frequency s = 1%, set error frequency ε = 0.1%

Output:
Elements with counter values exceeding (s − ε)N

Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s − ε)N

How many counters do we need?
Worst case: (1/ε) log(εN) counters
[See paper for proof]

Classification of Dynamic Data Streams


Hoeffding Tree Algorithm
Sufficient to consider only a small subset
of the training examples that pass through
that node to find the best split
For example, use the first few examples to
choose the split at the root


Hoeffding Bound
Independent of the probability distribution
generating the observations
A real-valued random variable r whose range is R
n independent observations of r with observed mean r̄
The Hoeffding bound states that P(μ ≥ r̄ − ε) = 1 − δ, where μ is the true mean of r, δ is a small number, and
ε = √(R² ln(1/δ) / (2n))

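The bound is straightforward to compute; a small sketch (for a binary-class information gain, R = log2(2) = 1):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability
    1 - delta, the true mean is within epsilon of the observed mean."""
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))
```

Note that ε shrinks as the number of observations n grows, which is what lets the tree commit to a split after seeing only a modest number of examples.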

Hoeffding Bound (cont.)


Let G(Xi) be the heuristic measure
used to choose the split, where Xi is a
discrete attribute
Let Xa, Xb be the attributes with the highest and second-highest observed G() after seeing n examples, respectively
Let ΔG = G(Xa) − G(Xb) ≥ 0

Hoeffding Bound (cont.)


Given a desired δ, if ΔG > ε, the Hoeffding bound guarantees that P(ΔG_true ≥ ΔG − ε) = 1 − δ
ΔG_true > 0 means G(Xa) − G(Xb) > 0, i.e., G(Xa) > G(Xb)
So Xa is the best attribute to split on, with probability 1 − δ
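Putting the two pieces together, the split decision at a node can be sketched as follows. This is an illustrative sketch, not the source's code: the tie threshold tau reflects VFDT-style tie breaking, and all names and default values are assumptions:

```python
import math

def hoeffding_bound(R, delta, n):
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))

def should_split(g_best, g_second, n, R=1.0, delta=1e-7, tie_tau=0.05):
    """Split when Delta-G = G(Xa) - G(Xb) exceeds epsilon,
    or when epsilon < tie_tau (the attributes are a near tie)."""
    eps = hoeffding_bound(R, delta, n)
    dg = g_best - g_second
    return dg > eps or eps < tie_tau
```

With few examples the bound ε is large and the node waits; as n grows, ε shrinks until either ΔG clears it or the near-tie rule fires.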

Decision-Tree Induction with Data Streams

[Figure: a decision tree grown incrementally from a data stream of network records. Initially a root test "Packets > 10" leads on the yes-branch to "Protocol = http"; after more stream data arrives, the no-branch is expanded with "Bytes > 60K", whose yes-branch becomes "Protocol = ftp"]

Ack. From Gehrke's SIGMOD tutorial slides

Hoeffding Tree: Strengths and Weaknesses
Strengths

Scales better than traditional methods


Sublinear with sampling
Very small memory utilization

Incremental
Make class predictions in parallel
New examples are added as they come
Weaknesses

Could spend a lot of time with ties


Memory used with tree expansion
Number of candidate attributes

VFDT (Very Fast Decision Tree learner)

A learning system based on the Hoeffding tree algorithm
Improvements
Breaking near ties
Computation of G()
Memory utilization
Dropping poor attributes
Initialization method

Categories of Time-Series
Movements
Trend or long-term movements
General direction in which a time series is moving over a long
interval of time

Cyclic movements or cycle variations


Long-term oscillations about a trend line or curve
e.g., business cycles, may or may not be periodic

Seasonal movements or seasonal variations


almost identical patterns that a time series appears to follow during
corresponding months of successive years.

Irregular or random movements


Time series analysis
decomposition of a time series into these four basic movements
Additive model: TS = T + C + S + I
Multiplicative model: TS = T × C × S × I

Estimation of Trend Curve


Freehand method
Fit the curve by looking at the graph
Costly and barely reliable for large-scale data mining

Least-square method
Find the curve minimizing the sum of
the squares of the deviation of points on
the curve from the corresponding data
points
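For a straight-line trend, the least-squares fit has a closed form. A small stdlib-only sketch, assuming equally spaced observations at t = 0, 1, 2, ...:

```python
def linear_trend(ys):
    """Least-squares fit of y = a + b*t to points (0, y0), (1, y1), ...
    Returns (a, b) minimizing the sum of squared deviations."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    b = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / \
        sum((t - t_mean) ** 2 for t in ts)
    a = y_mean - b * t_mean
    return a, b
```

For a series that is exactly linear, such as [1, 3, 5, 7], the fit recovers intercept 1 and slope 2.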

Moving Average
Moving average of order n

Smoothes the data


Eliminates cyclic, seasonal and irregular

movements
Loses the data at the beginning or end

of a series
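The order-n moving average can be sketched directly; note that the smoothed series is shorter by n − 1 points, the data lost at the ends:

```python
def moving_average(ys, n):
    """Moving average of order n: mean of each run of n consecutive
    values; smooths the series at the cost of losing n - 1 points."""
    return [sum(ys[i:i + n]) / n for i in range(len(ys) - n + 1)]
```

For example, the order-3 moving average of [1, 2, 3, 4, 5] is [2.0, 3.0, 4.0].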

Similarity Search in Time-Series Analysis
Normal database query finds exact match
Similarity search finds data sequences that
differ only slightly from the given query
sequence
Two categories of similarity queries
Whole matching: find sequences that are similar, as a whole, to the query sequence
Subsequence matching: find all sequences that contain subsequences similar to a given query sequence

Typical Applications

Financial market
Market basket data analysis
Scientific databases
Medical diagnosis

Data Transformation
Many techniques for signal analysis
require the data to be in the
frequency domain
Reduction techniques
discrete Fourier transform (DFT)
discrete wavelet transform (DWT)

By Parseval's theorem, the Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
Subsequence Matching
Break each sequence into a set of windows of length w
Extract the features of the
subsequence inside the
window
Map each sequence to a trail
in the feature space
Divide the trail of each
sequence into subtrails and
represent each of them with
minimum bounding rectangle
Use a multi-piece assembly algorithm to search for longer sequence matches
Analysis of Similar Time Series

Steps for Performing a Similarity Search


Atomic matching
Find all pairs of gap-free windows of a small
length that are similar
Window stitching
Stitch similar windows to form pairs of large
similar subsequences allowing gaps between
atomic matches
Subsequence Ordering
Linearly order the subsequence matches to
determine whether enough similar pieces exist

Reference
Chapter 6, Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han, Micheline Kamber, and Jian Pei.


Thank You

