
Introduction to Data Warehouses and OLAP Technologies

Overview

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• Further development of data cube technology
• From data warehousing to data mining

Data Mining Lecture 2 2

What Is a Data Warehouse?

• Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
• Data warehousing:
  - The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented

• Organized around major subjects, such as customer, product, sales.
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.


Data Warehouse: Integrated

• Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.

Data Warehouse: Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  - Contains an element of time, explicitly or implicitly
  - But the key of operational data may or may not contain a "time element".

Data Warehouse: Non-Volatile

• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing:
    • initial loading of data and access of data

Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration:
  - Build wrappers/mediators on top of heterogeneous databases
  - Query-driven approach
    • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
    • Complex information filtering; queries compete for resources
• Data warehouse: update-driven, high performance
  - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis


Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  - Major task of data warehouse system
  - Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                     OLTP                                                        OLAP
users                clerk, IT professional                                      knowledge worker
function             day-to-day operations                                       decision support
DB design            application-oriented                                        subject-oriented
data                 current, up-to-date, detailed, flat relational, isolated    historical, summarized, multidimensional, integrated, consolidated
usage                repetitive                                                  ad hoc
access               read/write, index/hash on primary key                       lots of scans
unit of work         short, simple transaction                                   complex query
# records accessed   tens                                                        millions
# users              thousands                                                   hundreds
DB size              100MB-GB                                                    100GB-TB
metric               transaction throughput                                      query throughput, response

Why Separate Data Warehouse?

• High performance for both systems
  - DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  - Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
• Different functions and different data:
  - missing data: decision support requires historical data, which operational DBs do not typically maintain
  - data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  - Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids:        (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid:  (time, item, location, supplier)
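
A minimal sketch (not part of the original slides) of how the lattice of cuboids can be enumerated: every subset of the dimension set is one cuboid, grouped here by its number of dimensions. The function name `cuboid_lattice` is an illustrative assumption.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by size: 0-D apex up to n-D base."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


# The four dimensions used in the lattice slide.
lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")

# Without concept hierarchies, n dimensions give 2**n cuboids in total.
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
```

For four dimensions this yields 1 + 4 + 6 + 4 + 1 cuboids, matching the levels listed on the slide.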

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellations: multiple fact tables share dimension tables; viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

Sales Fact Table:  time_key, item_key, branch_key, location_key
        Measures:  units_sold, dollars_sold, avg_sales
time:              time_key, day, day_of_the_week, month, quarter, year
item:              item_key, item_name, brand, type, supplier_type
branch:            branch_key, branch_name, branch_type
location:          location_key, street, city, province_or_state, country
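
As a rough illustration (not from the original slides), the star schema can be tried out with Python's built-in `sqlite3` module. The table names `time_dim`, `item_dim`, and `sales_fact`, the reduced column set, and the sample rows are all assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two dimension tables (reduced to a few attributes for brevity).
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds foreign keys to the dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2000), (2, 'Q2', 2000);
INSERT INTO item_dim VALUES (10, 'TV', 'BrandA'), (11, 'PC', 'BrandB');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 11, 1, 1200.0), (2, 10, 3, 1200.0);
""")

# A typical star join: fact table joined to each dimension, then aggregated.
rows = conn.execute("""
SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_key = t.time_key
JOIN item_dim AS i ON f.item_key = i.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter, i.item_name
""").fetchall()
print(rows)
```

A snowflake variant would split a dimension further (for example, moving supplier attributes into their own table referenced from the item table) at the cost of one more join.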

Example of Snowflake Schema

Sales Fact Table:  time_key, item_key, branch_key, location_key
        Measures:  units_sold, dollars_sold, avg_sales
time:              time_key, day, day_of_the_week, month, quarter, year
item:              item_key, item_name, brand, type, supplier_key
supplier:          supplier_key, supplier_type
branch:            branch_key, branch_name, branch_type
location:          location_key, street, city_key
city:              city_key, city, province_or_state, country

Example of Fact Constellation

Sales Fact Table:     time_key, item_key, branch_key, location_key
           Measures:  units_sold, dollars_sold, avg_sales
Shipping Fact Table:  time_key, item_key, shipper_key, from_location, to_location
           Measures:  dollars_cost, units_shipped
time:                 time_key, day, day_of_the_week, month, quarter, year
item:                 item_key, item_name, brand, type, supplier_type
branch:               branch_key, branch_name, branch_type
location:             location_key, street, city, province_or_state, country
shipper:              shipper_key, shipper_name, location_key, shipper_type

A Data Mining Query Language, DMQL: Language Primitives

• Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
  - First time as "cube definition"
  - define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL

    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)


Defining a Snowflake Schema in DMQL

    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL

    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales

Measures: Three Categories

• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
    • E.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
    • E.g., avg(), min_N(), standard_deviation().
• holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
    • E.g., median(), mode(), rank().

A Concept Hierarchy: Dimension (location)

all:      all
region:   Europe ... North_America
country:  Germany ... Spain (Europe); Canada ... Mexico (North_America)
city:     Frankfurt ... (Germany); Vancouver ... Toronto (Canada)
office:   L. Chan ... M. Wind (Vancouver)
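
To make the three measure categories (distributive, algebraic, holistic) concrete, here is a small sketch that is not part of the original slides. The partitioned sample values are invented for illustration.

```python
from statistics import median

# Hypothetical data, split into three partitions (e.g., three chunks of a fact table).
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: combining the partial sums gives the same result as one global sum.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg is derived from two distributive sub-aggregates, sum and count.
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: partial medians are not enough; in general the median of the
# partial medians differs from the true median over all the data.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

The practical consequence is that distributive and algebraic measures can be rolled up from already computed cuboids, while holistic measures generally need access to the underlying data.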

Multidimensional Data

• Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time

Hierarchical summarization paths:
  Product:   Industry > Category > Product
  Location:  Region > Country > City > Office
  Time:      Year > Quarter > Month or Week > Day

[Figure: cube with axes Product, Region, and Month]

A Sample Data Cube

[Figure: 3-D data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum). One aggregate cell is labeled "Total annual sales of TV in U.S.A."]

Cuboids Corresponding to the Cube

0-D (apex) cuboid:  all
1-D cuboids:        product; date; country
2-D cuboids:        (product, date); (product, country); (date, country)
3-D (base) cuboid:  (product, date, country)

Browsing a Data Cube

• Visualization
• OLAP capabilities
• Interactive manipulation

Typical OLAP Operations

• Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  - from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
• Slice and dice:
  - project and select
• Pivot (rotate):
  - reorient the cube, visualization, 3D to series of 2D planes
• Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
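
The roll-up, slice, dice, and pivot operations can be sketched on a tiny in-memory cube. This example is not from the original slides: the cell values, the city-to-country mapping, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical base cuboid: (product, city, quarter) -> sales.
cells = {
    ("TV", "Vancouver", "Q1"): 100,
    ("TV", "Toronto", "Q1"): 150,
    ("TV", "Frankfurt", "Q1"): 80,
    ("PC", "Vancouver", "Q1"): 200,
    ("PC", "Toronto", "Q2"): 120,
    ("TV", "Vancouver", "Q2"): 90,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}


def roll_up_city_to_country(cube):
    """Roll-up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (product, city, quarter), sales in cube.items():
        out[(product, city_to_country[city], quarter)] += sales
    return dict(out)


def drop_dimension(cube, axis):
    """Roll-up by dimension reduction: aggregate one axis away."""
    out = defaultdict(int)
    for key, sales in cube.items():
        out[key[:axis] + key[axis + 1:]] += sales
    return dict(out)


def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value and drop that axis."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == value}


def dice(cube, allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets per axis."""
    return {k: v for k, v in cube.items()
            if all(k[axis] in values for axis, values in allowed.items())}


def pivot(cube, order):
    """Pivot: reorder the axes without changing any cell value."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}


by_country = roll_up_city_to_country(cells)
without_city = drop_dimension(cells, 1)
q1_slice = slice_cube(cells, 2, "Q1")
canada_tv = dice(cells, {0: {"TV"}, 1: {"Vancouver", "Toronto"}})
rotated = pivot(cells, (2, 0, 1))
```

Drill-down is the inverse direction: it needs the finer-grained cuboid (here `cells`), because a rolled-up result such as `by_country` no longer contains the city-level detail.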

Design of a Data Warehouse: A Business Analysis Framework

• Four views regarding the design of a data warehouse
  - Top-down view
    • allows selection of the relevant information necessary for the data warehouse
  - Data source view
    • exposes the information being captured, stored, and managed by operational systems
  - Data warehouse view
    • consists of fact tables and dimension tables
  - Business query view
    • sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process

• Top-down, bottom-up approaches, or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
• From a software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, short turnaround time
• Typical data warehouse design process
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measure that will populate each fact table record


Multi-Tiered Architecture

[Figure: four tiers. Data Sources (operational DBs, other sources) feed Extract, Transform, Load, and Refresh steps into Data Storage (data warehouse, data marts, metadata, monitor & integrator), which serves the OLAP Engine (OLAP server), which in turn supports the Front-End Tools (analysis, query, reports, data mining).]

Three Data Warehouse Models

• Enterprise warehouse
  - collects all of the information about subjects spanning the entire organization
• Data mart
  - a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
    • Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
  - A set of views over operational databases
  - Only some of the possible summary views may be materialized



OLAP Server Architectures

• Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
  - Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
  - greater scalability
• Multidimensional OLAP (MOLAP)
  - Array-based multidimensional storage engine (sparse matrix techniques)
  - fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
  - User flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers
  - specialized support for SQL queries over star/snowflake schemas

Efficient Data Cube Computation

• Data cube can be viewed as a lattice of cuboids
  - The bottom-most cuboid is the base cuboid
  - The top-most cuboid (apex) contains only one cell
  - How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?

        T = \prod_{i=1}^{n} (L_i + 1)

• Materialization of data cube
  - Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
  - Selection of which cuboids to materialize
    • Based on size, sharing, access frequency, etc.

Cube Operation

• Cube definition and computation in DMQL
    define cube sales [item, city, year]: sum(sales_in_dollars)
    compute cube sales
• Transform it into a SQL-like language (with a new operator, cube by, introduced by Gray et al.'96)
    SELECT item, city, year, SUM(amount)
    FROM SALES
    CUBE BY item, city, year
• Need to compute the following group-bys:
    (item, city, year),
    (item, city), (item, year), (city, year),
    (item), (city), (year),
    ()

[Figure: lattice of group-bys. () at the top; (city), (item), (year); (city, item), (city, year), (item, year); (city, item, year) at the bottom.]
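
A small sketch, not taken from the slides, of the two ideas above: counting cuboids with the product formula, and naively computing every group-by that a cube-by over (item, city, year) implies. The sample rows and the helper names `count_cuboids` and `cube_by` are assumptions; real systems share work between group-bys instead of rescanning the base data for each one.

```python
from collections import defaultdict
from itertools import combinations
from math import prod


def count_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1); the +1 is the virtual level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


def cube_by(rows, dims, measure):
    """Compute SUM(measure) for every subset of dims (2**n group-bys)."""
    result = {}
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(totals)
    return result


# Hypothetical fact rows for the SALES table in the cube-by query.
sales = [
    {"item": "TV", "city": "Toronto", "year": 2000, "amount": 500.0},
    {"item": "TV", "city": "Vancouver", "year": 2000, "amount": 300.0},
    {"item": "PC", "city": "Toronto", "year": 2001, "amount": 900.0},
]
cube = cube_by(sales, ("item", "city", "year"), "amount")

print(count_cuboids([1, 1, 1]))  # three dimensions, one level each
print(sorted(cube, key=len))     # the eight group-bys, from () to the base
```

With one level per dimension the formula reduces to 2**n, which is why the three-dimensional example needs eight group-bys; with four levels per dimension it would already be 5**3 cuboids.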

Cube Computation: ROLAP-Based Method

• Efficient cube computation methods
  - ROLAP-based cubing algorithms (Agarwal et al.'96)
  - Array-based cubing algorithm (Zhao et al.'97)
  - Bottom-up computation method (Beyer & Ramakrishnan'99)
• ROLAP-based cubing algorithms
  - Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
  - Grouping is performed on some subaggregates as a "partial grouping step"
  - Aggregates may be computed from previously computed aggregates, rather than from the base fact table

Indexing OLAP Data: Bitmap Index

• Index on a particular column
• Each value in the column has a bit vector: bit operations are fast
• The length of the bit vector: number of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high-cardinality domains

Base table                  Index on Region                    Index on Type
Cust  Region   Type         RecID  Asia  Europe  America       RecID  Retail  Dealer
C1    Asia     Retail       1      1     0       0             1      1       0
C2    Europe   Dealer       2      0     1       0             2      0       1
C3    Asia     Dealer       3      1     0       0             3      0       1
C4    America  Retail       4      0     0       1             4      1       0
C5    Europe   Dealer       5      0     1       0             5      0       1
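
The bitmap index idea can be sketched in a few lines (this example is an illustration, not part of the original slides). It uses the five-row base table from the slide and stores each bit vector as a Python integer, where bit i corresponds to row i; the function names are assumptions.

```python
# Base table from the slide: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Map each distinct column value to a bit vector (int); bit i is set for row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_customers(bits, rows):
    """Return the Cust value of every row whose bit is set."""
    return [row[0] for i, row in enumerate(rows) if (bits >> i) & 1]


region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# A conjunctive selection becomes a single bitwise AND of two bit vectors.
europe_dealers = matching_customers(
    region_index["Europe"] & type_index["Dealer"], base_table
)
print(europe_dealers)
```

Each distinct value needs its own vector of one bit per row, which is why the approach becomes expensive for high-cardinality columns.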

Indexing OLAP Data: Join Indices

• Join index: JI(R-id, S-id), where R(R-id, …) joins S(S-id, …)
• Traditional indices map the values to a list of record ids
  - A join index materializes the relational join in a JI file and speeds up the relational join, a rather costly operation
• In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
  - E.g., fact table Sales and two dimensions, city and product
    • A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city
  - Join indices can span multiple dimensions

Efficient Processing of OLAP Queries

• Determine which operations should be performed on the available cuboids:
  - transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
• Determine to which materialized cuboid(s) the relevant operations should be applied.
• Explore indexing structures and compressed vs. dense array structures in MOLAP
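
A minimal sketch of a join index for the Sales example (not from the original slides). The fact rows and the function name `build_join_index` are hypothetical; the row id is assumed to equal the row's position in the list so that lookups stay trivial.

```python
from collections import defaultdict

# Hypothetical fact table rows: (rid, city, product, dollars_sold).
sales = [
    (0, "Toronto", "TV", 500.0),
    (1, "Vancouver", "PC", 900.0),
    (2, "Toronto", "PC", 700.0),
    (3, "Toronto", "TV", 200.0),
]


def build_join_index(fact_rows, *columns):
    """Map each combination of dimension values to the list of matching fact row ids."""
    index = defaultdict(list)
    for row in fact_rows:
        index[tuple(row[c] for c in columns)].append(row[0])
    return dict(index)


# Join index on one dimension (city) and on two dimensions (city, product).
city_index = build_join_index(sales, 1)
city_product_index = build_join_index(sales, 1, 2)

# Answer "total sales in Toronto" by following row ids instead of scanning and joining.
toronto_total = sum(sales[rid][3] for rid in city_index[("Toronto",)])
print(city_index, toronto_total)
```

The precomputed row-id lists play the role of the materialized join: the query touches only the fact rows it needs.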

Metadata Repository

• Metadata is the data defining warehouse objects. It has the following kinds:
  - Description of the structure of the warehouse
    • schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
  - Operational metadata
    • data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  - The algorithms used for summarization
  - The mapping from the operational environment to the data warehouse
  - Data related to system performance
    • warehouse schema, view, and derived data definitions
  - Business data
    • business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools and Utilities

• Data extraction:
  - get data from multiple, heterogeneous, and external sources
• Data cleaning:
  - detect errors in the data and rectify them when possible
• Data transformation:
  - convert data from legacy or host format to warehouse format
• Load:
  - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh:
  - propagate the updates from the data sources to the warehouse


Discovery-Driven Exploration of Data Cubes

• Hypothesis-driven: exploration by user, huge search space
• Discovery-driven (Sarawagi et al.'98)
  - pre-compute measures indicating exceptions, guide the user in the data analysis, at all levels of aggregation
  - Exception: significantly different from the value anticipated, based on a statistical model
  - Visual cues such as background color are used to reflect the degree of exception of each cell
  - Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction


Examples: Discovery-Driven Data Cubes

[Figure]

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes

• Multi-feature cubes (Ross et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities
• Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum-price tuples
    select item, region, month, max(price), sum(R.sales)
    from purchases
    where year = 1997
    cube by item, region, month: R
    such that R.price = max(price)
• Continuing the last example, among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max-price tuples
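
The first multi-feature query (maximum price per group, then total sales among the maximum-price tuples only) can be imitated in plain Python. This is a sketch under assumptions rather than the method from the cited work: the `purchases` rows are invented, the filter year is passed as a parameter, and every group-by is computed by a separate naive pass.

```python
from itertools import combinations

# Hypothetical purchases table.
purchases = [
    {"item": "TV", "region": "East", "month": 1, "year": 1997, "price": 500, "sales": 10},
    {"item": "TV", "region": "East", "month": 1, "year": 1997, "price": 500, "sales": 4},
    {"item": "TV", "region": "West", "month": 1, "year": 1997, "price": 450, "sales": 7},
    {"item": "PC", "region": "East", "month": 2, "year": 1997, "price": 900, "sales": 3},
    {"item": "PC", "region": "East", "month": 2, "year": 1996, "price": 999, "sales": 8},
]


def multi_feature_cube(rows, dims, year):
    """For every subset of dims and every group, return (max price, sales of max-price tuples)."""
    rows = [r for r in rows if r["year"] == year]
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            groups = {}
            for r in rows:
                groups.setdefault(tuple(r[d] for d in group), []).append(r)
            for key, members in groups.items():
                max_price = max(m["price"] for m in members)
                # Dependent aggregate: sum sales only over tuples at the group's max price.
                total = sum(m["sales"] for m in members if m["price"] == max_price)
                result[(group, key)] = (max_price, total)
    return result


cube = multi_feature_cube(purchases, ("item", "region", "month"), 1997)
print(cube[(("item",), ("TV",))])
```

The second aggregate depends on the first, which is what distinguishes a multi-feature cube from an ordinary cube of independent aggregates.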

Data Warehouse Usage

• Three kinds of data warehouse applications
  - Information processing
    • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs
  - Analytical processing
    • multidimensional analysis of data warehouse data
    • supports basic OLAP operations: slice-dice, drilling, pivoting
  - Data mining
    • knowledge discovery from hidden patterns
    • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
• Differences among the three tasks


From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)

• Why online analytical mining?
  - High quality of data in data warehouses
    • DW contains integrated, consistent, cleaned data
  - Available information processing structure surrounding data warehouses
    • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  - OLAP-based exploratory data analysis
    • mining with drilling, dicing, pivoting, etc.
  - On-line selection of data mining functions
    • integration and swapping of multiple mining functions, algorithms, and tasks
• Architecture of OLAM

An OLAM Architecture

[Figure: four layers. Layer 4, User Interface: user GUI API (mining query in, mining result out). Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, on top of the data cube API. Layer 2, MDDB: multidimensional database and metadata, on top of the database API (filtering & integration). Layer 1, Data Repository: databases and data warehouse (data cleaning, data integration, filtering).]

Summary

• Data warehouse
  - A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
• A multi-dimensional model of a data warehouse
  - Star schema, snowflake schema, fact constellations
  - A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing, and pivoting
• OLAP servers: ROLAP, MOLAP, HOLAP
• Efficient computation of data cubes
  - Partial vs. full vs. no materialization
  - Multiway array aggregation
  - Bitmap index and join index implementations
• Further development of data cube technology
  - Discovery-driven and multi-feature cubes
  - From OLAP to OLAM (on-line analytical mining)

