
Mining Stream, Time-Series,

and Sequence Data

Md. Yasser Arafat


MS Student, Dept of CSE, DU
7/11/15

Topics Covered
Methodologies for Stream Data Processing
Association
Tilted Time Frame
Critical Layers
Lossy Counting Algorithm
Hoeffding Tree Algorithm
VFDT (Very Fast Decision Tree learner)
Categories of Time-Series Movements
Estimation of Trend Curve
Similarity Search in Time-Series Analysis

Methodologies for Stream Data Processing

Random Sampling
Sliding Windows
Histograms
Multiresolution Methods
Sketches
Randomized Algorithms

Tilted Time Frame


Natural tilted time frame
Time frame structured in multiple granularities
based on the natural or usual time scale
Example: minimal granularity of a quarter (15 minutes); then 4 quarters in an hour, 24 hours in a day, 31 days in a month, 12 months in a year
[Figure: time axis drawn at granularities of 12 months, 31 days, 24 hours, 4 quarters]

Tilted Time Frame


Logarithmic tilted time frame
Time frame is structured in multiple
granularities according to a logarithmic
scale
Example: minimal granularity of 1 minute; then 1, 2, 4, 8, 16, 32, ... minutes
[Figure: time axis drawn at granularities of 64t, 32t, 16t, 8t, 4t, 2t, t]

Tilted Time Frame


Progressive logarithmic tilted time
frame
Snap-shots are stored at differing levels of
granularity depending on the recency
Example: Suppose there are 5 frames and each holds at most 3 snapshots. Given a snapshot number N, if N mod 2^d = 0, insert the snapshot into frame number d. If a frame then holds more than 3 snapshots, kick out the oldest one.
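This insertion rule can be sketched in Python. A minimal sketch, assuming (consistent with the example above) that snapshot N goes into the frame for the largest d with N mod 2^d = 0, capped at the highest frame number; the class and function names are illustrative, not from the source:

```python
from collections import deque

def frame_for(n, max_frame):
    """Largest d such that n mod 2^d == 0, capped at max_frame."""
    d = 0
    while n % (2 ** (d + 1)) == 0 and d + 1 <= max_frame:
        d += 1
    return d

class ProgressiveLogFrames:
    """Progressive logarithmic tilted time frame: a few frames,
    each holding at most max_snapshots snapshots."""
    def __init__(self, num_frames=5, max_snapshots=3):
        self.frames = [deque(maxlen=max_snapshots) for _ in range(num_frames)]

    def insert(self, n, snapshot):
        d = frame_for(n, len(self.frames) - 1)
        # deque with maxlen kicks out the oldest snapshot automatically
        self.frames[d].append((n, snapshot))
```

With 5 frames and snapshots 1..24, frame 0 ends up holding the 3 most recent odd-numbered snapshots, frame 3 holds snapshots 8 and 24, and frame 4 holds snapshot 16.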

Critical Layers
Computing and storing a stream cube at all layers is too costly; instead, two critical layers are maintained: the minimal interest layer (m-layer), the minimally interesting layer an analyst would examine, and the observation layer (o-layer), the layer at which analysts typically watch the stream. Layers in between are computed on demand.

Lossy Counting Algorithm


User provides two input parameters:
Min support threshold, σ
Error bound, ε
Incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉
Each list entry keeps an approximate frequency count, f, and a maximum possible error, Δ
If a given item already exists, we simply increase its frequency count, f. Otherwise, we insert it into the list with a frequency count of 1.
If the new item is from the bth bucket, we set Δ to be b − 1.

Lossy Counting Algorithm


An item entry is deleted if, for that entry, f + Δ ≤ b.
We know that b ≤ N/w, that is, b ≤ εN. So an item's count can be underestimated by at most εN.
If an item's actual frequency is at least σN, then its stored count f is at least σN − εN.
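The one-pass procedure above can be sketched in Python. A minimal sketch, assuming hash-table entries of the form item -> [f, Δ] and pruning at each bucket boundary; function names are illustrative, not from the source:

```python
import math

def lossy_count(stream, epsilon):
    """Single-pass Lossy Counting with error bound epsilon."""
    w = math.ceil(1 / epsilon)          # bucket width w = ceil(1/epsilon)
    entries = {}                        # item -> [f, delta]
    n = 0
    for item in stream:
        n += 1
        b = math.ceil(n / w)            # current bucket number
        if item in entries:
            entries[item][0] += 1       # existing item: bump f
        else:
            entries[item] = [1, b - 1]  # new item: f = 1, delta = b - 1
        if n % w == 0:                  # bucket boundary: prune f + delta <= b
            entries = {e: fd for e, fd in entries.items() if fd[0] + fd[1] > b}
    return entries, n

def frequent_items(entries, n, s, epsilon):
    """Output items whose stored count exceeds (s - epsilon) * n."""
    return {e for e, (f, _) in entries.items() if f >= (s - epsilon) * n}
```

For example, with ε = 0.1 and support s = 0.3, a 90-item stream dominated by two items reports exactly those two, and rare items are pruned at the bucket boundaries.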

Lossy Counting - Example


Step 1: Divide the stream into windows
[Figure: stream split into Window 1, Window 2, Window 3]

Lossy Counting - Example


[Figure: frequency counts, empty at first, filled after processing the first window]

Lossy Counting - Example


[Figure: frequency counts updated after processing the next window]

Lossy Counting Example (Error Analysis)

How much do we undercount?
If current size of stream = N and window size = 1/ε, then frequency error ≤ #windows = εN

Rule of thumb:
Set ε = 10% of support s
Example: given support frequency s = 1%, set error frequency ε = 0.1%

Output:
Elements with counter values exceeding (s − ε)N

Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s − ε)N

How many counters do we need?
Worst case: (1/ε) log(εN) counters
[See paper for proof]

Classification of Dynamic Data Streams


Hoeffding Tree Algorithm
Sufficient to consider only a small subset
of the training examples that pass through
that node to find the best split
For example, use the first few examples to
choose the split at the root


Hoeffding Bound
Independent of the probability distribution
generating the observations
A real-valued random variable r whose range is R
n independent observations of r with observed mean r̄
The Hoeffding bound states that P(μ ≥ r̄ − ε) = 1 − δ, where μ is the true mean of r, δ is a small number, and
ε = √(R² ln(1/δ) / (2n))

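The bound is straightforward to compute; a small sketch (for a binary-class information gain, R = log2(2) = 1):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability
    1 - delta, the true mean is within epsilon of the observed mean."""
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))
```

Note that ε shrinks as the number of observations n grows, which is what lets the tree commit to a split after seeing only a modest number of examples.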

Hoeffding Bound (cont.)


Let G(Xi) be the heuristic measure
used to choose the split, where Xi is a
discrete attribute
Let Xa, Xb be the attributes with the highest and second-highest observed G() after seeing n examples, respectively
Let ΔG = G(Xa) − G(Xb) ≥ 0

Hoeffding Bound (cont.)


Given a desired δ, if ΔG > ε, the Hoeffding bound guarantees that P(ΔG_true ≥ ΔG − ε) = 1 − δ
ΔG_true > 0 means G(Xa) − G(Xb) > 0, i.e., G(Xa) > G(Xb)
So Xa is the best attribute to split on, with probability 1 − δ
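Putting the two pieces together, the split decision at a node can be sketched as follows. This is an illustrative sketch, not the source's code: the tie threshold tau reflects VFDT-style tie breaking, and all names and default values are assumptions:

```python
import math

def hoeffding_bound(R, delta, n):
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))

def should_split(g_best, g_second, n, R=1.0, delta=1e-7, tie_tau=0.05):
    """Split when Delta-G = G(Xa) - G(Xb) exceeds epsilon,
    or when epsilon < tie_tau (the attributes are a near tie)."""
    eps = hoeffding_bound(R, delta, n)
    dg = g_best - g_second
    return dg > eps or eps < tie_tau
```

With few examples the bound ε is large and the node waits; as n grows, ε shrinks until either ΔG clears it or the near-tie rule fires.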

Decision-Tree Induction with Data Streams

[Figure: a decision tree grown incrementally from a data stream of network records. Initially a root test "Packets > 10" leads on the yes-branch to "Protocol = http"; after more stream data arrives, the no-branch is expanded with "Bytes > 60K", whose yes-branch becomes "Protocol = ftp"]

Ack. From Gehrke's SIGMOD tutorial slides

Hoeffding Tree: Strengths and Weaknesses
Strengths

Scales better than traditional methods


Sublinear with sampling
Very small memory utilization

Incremental
Make class predictions in parallel
New examples are added as they come
Weaknesses

Could spend a lot of time with ties


Memory used with tree expansion
Number of candidate attributes

VFDT (Very Fast Decision Tree learner)

A learning system based on the Hoeffding tree algorithm
Improvements
Breaking near ties
Computation of G()
Memory utilization
Dropping poor attributes
Initialization method

Categories of Time-Series
Movements
Trend or long-term movements
General direction in which a time series is moving over a long
interval of time

Cyclic movements or cycle variations


Long-term oscillations about a trend line or curve
e.g., business cycles, may or may not be periodic

Seasonal movements or seasonal variations


almost identical patterns that a time series appears to follow during
corresponding months of successive years.

Irregular or random movements


Time series analysis
decomposition of a time series into these four basic movements
Additive model: TS = T + C + S + I
Multiplicative model: TS = T × C × S × I

Estimation of Trend Curve


Freehand method
Fit the curve by looking at the graph
Costly and barely reliable for large-scale data mining

Least-square method
Find the curve minimizing the sum of
the squares of the deviation of points on
the curve from the corresponding data
points
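For a straight-line trend, the least-squares fit has a closed form. A small stdlib-only sketch, assuming equally spaced observations at t = 0, 1, 2, ...:

```python
def linear_trend(ys):
    """Least-squares fit of y = a + b*t to points (0, y0), (1, y1), ...
    Returns (a, b) minimizing the sum of squared deviations."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    b = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / \
        sum((t - t_mean) ** 2 for t in ts)
    a = y_mean - b * t_mean
    return a, b
```

For a series that is exactly linear, such as [1, 3, 5, 7], the fit recovers intercept 1 and slope 2.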

Moving Average
Moving average of order n

Smoothes the data


Eliminates cyclic, seasonal and irregular

movements
Loses the data at the beginning or end

of a series
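The order-n moving average can be sketched directly; note that the smoothed series is shorter by n − 1 points, the data lost at the ends:

```python
def moving_average(ys, n):
    """Moving average of order n: mean of each run of n consecutive
    values; smooths the series at the cost of losing n - 1 points."""
    return [sum(ys[i:i + n]) / n for i in range(len(ys) - n + 1)]
```

For example, the order-3 moving average of [1, 2, 3, 4, 5] is [2.0, 3.0, 4.0].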

Similarity Search in Time-Series Analysis
Normal database query finds exact match
Similarity search finds data sequences that
differ only slightly from the given query
sequence
Two categories of similarity queries
Whole matching: find sequences that are similar, as a whole, to the query sequence
Subsequence matching: find all sequences that contain subsequences similar to a given query sequence

Typical Applications

Financial market
Market basket data analysis
Scientific databases
Medical diagnosis

Data Transformation
Many techniques for signal analysis
require the data to be in the
frequency domain
Reduction techniques
discrete Fourier transform (DFT)
discrete wavelet transform (DWT)

By Parseval's theorem, the Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
Subsequence Matching
Break each sequence into a set of windows of length w
Extract the features of the
subsequence inside the
window
Map each sequence to a trail
in the feature space
Divide the trail of each
sequence into subtrails and
represent each of them with
minimum bounding rectangle
Use a multi-piece assembly algorithm to search for longer sequence matches
Analysis of Similar Time Series

Steps for Performing a Similarity Search


Atomic matching
Find all pairs of gap-free windows of a small
length that are similar
Window stitching
Stitch similar windows to form pairs of large
similar subsequences allowing gaps between
atomic matches
Subsequence Ordering
Linearly order the subsequence matches to
determine whether enough similar pieces exist

Reference
Chapter 6, Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han, Micheline Kamber, and Jian Pei.


Thank You

