Introduction to Data Warehouses and OLAP Technologies

Overview
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining

What is Data Warehouse?
Defined in many different ways, but not rigorously.
  – A decision support database that is maintained separately from the organization's operational database
  – Support information processing by providing a solid platform of consolidated, historical data for analysis
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
  – The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
  – relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
      E.g., hotel price: currency, tax, breakfast covered, etc.
  – When data is moved to the warehouse, it is converted.

Data Warehouse: Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provide information from a historical perspective (e.g., past years)
Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a "time element".
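
The hotel-price example on the Integrated slide can be sketched in a few lines. This is only an illustration of "converted when moved to the warehouse": the record fields, the exchange rates, and the flat breakfast price are made-up assumptions, not values from the lecture.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to the warehouse currency (USD); illustrative only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}


@dataclass
class SourcePrice:
    amount: float
    currency: str
    tax_included: bool
    tax_rate: float           # e.g. 0.10 for 10%
    breakfast_included: bool


def to_warehouse_price(p: SourcePrice, breakfast_usd: float = 10.0) -> float:
    """Convert a source price to one convention: USD, tax included, no breakfast."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1.0 + p.tax_rate
    if p.breakfast_included:
        usd -= breakfast_usd   # assumed flat breakfast price
    return round(usd, 2)
```

Two sources that quote the same room differently (for example, USD with tax versus EUR without tax but with breakfast) end up with comparable values once both pass through `to_warehouse_price`.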

Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
  – Build wrappers/mediators on top of heterogeneous databases
  – Query driven approach
      When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set
      Complex information filtering, compete for resources
Data warehouse: update-driven, high performance
  – Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
  – Major task of traditional relational DBMS
  – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  – Major task of data warehouse system
  – Data analysis and decision making
Distinct features (OLTP vs. OLAP):
  – User and system orientation: customer vs. market
  – Data contents: current, detailed vs. historical, consolidated
  – Database design: ER + application vs. star + subject
  – View: current, local vs. evolutionary, integrated
  – Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP
                     OLTP                             OLAP
users                clerk, IT professional           knowledge worker
function             day to day operations            decision support
DB design            application-oriented             subject-oriented
data                 current, up-to-date,             historical,
                     detailed, flat relational,       summarized, multidimensional,
                     isolated                         integrated, consolidated
usage                repetitive                       ad-hoc
access               read/write,                      lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction        complex query
# records accessed   tens                             millions
# users              thousands                        hundreds
DB size              100MB-GB                         100GB-TB
metric               transaction throughput           query throughput, response
Data Mining Lecture 2 9 Data Mining Lecture 2 10

Why Separate Data Warehouse?
High performance for both systems
  – DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  – Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
Different functions and different data:
  – missing data: Decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  – Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  – Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids
0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids:        (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid:  (time, item, location, supplier)
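
The lattice for the four dimensions time, item, location, and supplier can be enumerated mechanically: a k-D cuboid is any choice of k dimensions to keep. The sketch below is a minimal illustration that ignores concept hierarchies within a dimension; the function name `cuboid_lattice` is made up for this example.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: list of k-D cuboids}; each cuboid is a tuple of dimension names."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")
```

With n dimensions and no hierarchies this yields 2^n cuboids in total, from the single apex cuboid (the empty tuple, "all") down to the single base cuboid.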

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
  – Star schema: A fact table in the middle connected to a set of dimension tables
  – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
  – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema
Sales Fact Table:  time_key, item_key, branch_key, location_key
    Measures:      units_sold, dollars_sold, avg_sales
time:              time_key, day, day_of_the_week, month, quarter, year
item:              item_key, item_name, brand, type, supplier_type
branch:            branch_key, branch_name, branch_type
location:          location_key, street, city, province_or_state, country
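
As a rough illustration, the star schema above can be written as relational tables and queried with a join between the fact table and two dimensions. This sketch uses Python's built-in `sqlite3`; the table name `time_dim` (used instead of `time`), the column types, and all sample rows are assumptions made for the example. The derived measure avg_sales is computed at query time rather than stored.

```python
import sqlite3

SCHEMA = """
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the stored measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
"""


def build_demo_warehouse():
    """Create the schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?,?)",
                     [(1, 15, "Sat", 1, "Q1", 2000), (2, 20, "Thu", 4, "Q2", 2000)])
    conn.executemany("INSERT INTO item VALUES (?,?,?,?,?)",
                     [(1, "TV-A", "BrandX", "TV", "wholesale"),
                      (2, "PC-B", "BrandY", "PC", "retail")])
    conn.execute("INSERT INTO branch VALUES (1, 'B1', 'store')")
    conn.execute("INSERT INTO location VALUES (1, '1 Main St', 'Vancouver', 'BC', 'Canada')")
    conn.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?)",
                     [(1, 1, 1, 1, 2, 800.0), (1, 2, 1, 1, 1, 1200.0), (2, 1, 1, 1, 3, 1200.0)])
    return conn


def dollars_by_quarter_and_type(conn):
    """Join the fact table to the time and item dimensions and aggregate."""
    return conn.execute("""
        SELECT t.quarter, i.type, SUM(s.dollars_sold)
        FROM sales s
        JOIN time_dim t ON s.time_key = t.time_key
        JOIN item i     ON s.item_key = i.item_key
        GROUP BY t.quarter, i.type
        ORDER BY t.quarter, i.type
    """).fetchall()
```

Every analytical query follows the same shape: filter or group on dimension attributes, aggregate the measures in the fact table.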

Example of Snowflake Schema
Sales Fact Table:  time_key, item_key, branch_key, location_key
    Measures:      units_sold, dollars_sold, avg_sales
time:              time_key, day, day_of_the_week, month, quarter, year
item:              item_key, item_name, brand, type, supplier_key
supplier:          supplier_key, supplier_type
branch:            branch_key, branch_name, branch_type
location:          location_key, street, city_key
city:              city_key, city, province_or_state, country

Example of Fact Constellation
Sales Fact Table:     time_key, item_key, branch_key, location_key
    Measures:         units_sold, dollars_sold, avg_sales
Shipping Fact Table:  time_key, item_key, shipper_key, from_location, to_location
    Measures:         dollars_cost, units_shipped
time:                 time_key, day, day_of_the_week, month, quarter, year
item:                 item_key, item_name, brand, type, supplier_type
branch:               branch_key, branch_name, branch_type
location:             location_key, street, city, province_or_state, country
shipper:              shipper_key, shipper_name, location_key, shipper_type

A Data Mining Query Language, DMQL: Language Primitives

Defining a Star Schema in DMQL

Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
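
To make the three measures of the sales cube concrete, here is a small Python sketch that evaluates them for one group-by. It is not a DMQL interpreter; the helper name `sales_measures` and the sample rows are assumptions for illustration, with one input row standing for one sale.

```python
from collections import defaultdict

# Made-up transaction rows; only the fields needed for the example.
ROWS = [
    {"item": "TV", "branch": "B1", "sales_in_dollars": 100.0},
    {"item": "TV", "branch": "B2", "sales_in_dollars": 50.0},
    {"item": "PC", "branch": "B1", "sales_in_dollars": 200.0},
]


def sales_measures(rows, dims):
    """Group rows by the given dimensions and compute the sales-cube measures."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[d] for d in dims)].append(r["sales_in_dollars"])
    return {
        key: {
            "dollars_sold": sum(vals),           # sum(sales_in_dollars)
            "avg_sales": sum(vals) / len(vals),  # avg(sales_in_dollars)
            "units_sold": len(vals),             # count(*): one per row
        }
        for key, vals in groups.items()
    }
```

Calling `sales_measures(ROWS, ["item"])` gives one cell per item; passing more dimensions gives a finer-grained cuboid.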
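
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
[Figure: cube with axes Product, Region, and Month; hierarchy labels recovered: Country, Office, Day, Month]

A Sample Data Cube
[Figure: sales cube; labels recovered: products TV, PC, VCR; quarters 3Qtr, 4Qtr; countries U.S.A, Canada; sum cells along the axes]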

Cuboids Corresponding to the Cube
0-D (apex) cuboid:  all
1-D cuboids:        product; date; country
2-D cuboids:        (product, date); (product, country); (date, country)
3-D (base) cuboid:  (product, date, country)

Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation

Typical OLAP Operations

Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse
  – Top-down view
      allows selection of the relevant information necessary for the data warehouse
  – Data source view
      exposes the information being captured, stored, and managed by operational systems
  – Data warehouse view
      consists of fact tables and dimension tables
  – Business query view
      sees the perspectives of data in the warehouse from the view of end-user

Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
  – Top-down: Starts with overall design and planning (mature)
  – Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
  – Waterfall: structured and systematic analysis at each step before proceeding to the next
  – Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around
Typical data warehouse design process
  – Choose a business process to model, e.g., orders, invoices, etc.
  – Choose the grain (atomic level of data) of the business process
  – Choose the dimensions that will apply to each fact table record
  – Choose the measure that will populate each fact table record

Multi-Tiered Architecture
[Figure: operational DBs and other sources are extracted, transformed, loaded, and refreshed, under a Monitor & Integrator with Metadata, into the Data Warehouse and Data Marts; these serve an OLAP Server, which supports Analysis, Query, Reports, and Data mining]

Three Data Warehouse Models
Enterprise warehouse
  – collects all of the information about subjects spanning the entire organization
Data Mart
  – a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
      Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
  – A set of views over operational databases
  – Only some of the possible summary views may be materialized
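
The virtual-warehouse idea (views over operational tables, only some of them materialized) can be sketched with `sqlite3`. The table and view names and the rows are hypothetical. SQLite has no materialized views, so the "materialized" summary here is simply a snapshot table, which is an assumption of this sketch rather than how a production system would do it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Asia', 100.0), (2, 'Europe', 50.0), (3, 'Asia', 25.0);

-- Virtual summary: recomputed from the operational table on every query.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- "Materialized" summary: a stored snapshot that must be refreshed explicitly.
CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;

-- A later operational update is visible in the view but not in the snapshot.
INSERT INTO orders VALUES (4, 'Europe', 10.0);
""")


def totals(table):
    """Read a summary (view or snapshot table) as a {region: total} dict."""
    return dict(conn.execute(f"SELECT region, total FROM {table}").fetchall())
```

The view always reflects current operational data at the cost of recomputation; the snapshot answers quickly but goes stale until it is refreshed.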

OLAP Server Architectures
Relational OLAP (ROLAP)
  – Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces
  – Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – greater scalability
Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse matrix techniques)
  – fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers
  – specialized support for SQL queries over star/snowflake schemas

Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
  – The bottom-most cuboid is the base cuboid
  – The top-most cuboid (apex) contains only one cell

Cube Operation
Cube definition and computation in DMQL
    define cube sales [item, city, year]: sum(sales_in_dollars)
    compute cube sales
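
What `compute cube sales` has to produce can be shown with a deliberately naive Python sketch: every cuboid in the lattice over item, city, and year is aggregated directly from the base rows. The function name, the `"*"` placeholder, and the data are assumptions for illustration; efficient methods avoid this repeated scanning by reusing previously computed aggregates.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # placeholder for a dimension that has been aggregated away

ROWS = [
    {"item": "TV", "city": "Vancouver", "year": 2000, "sales_in_dollars": 100.0},
    {"item": "TV", "city": "Toronto", "year": 2000, "sales_in_dollars": 50.0},
    {"item": "PC", "city": "Vancouver", "year": 2001, "sales_in_dollars": 200.0},
]


def compute_cube(rows, dims, measure):
    """Return {cell: sum(measure)} for every cell of every cuboid."""
    cube = defaultdict(float)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):       # one cuboid per subset of dims
            for r in rows:
                cell = tuple(r[d] if d in kept else ALL for d in dims)
                cube[cell] += r[measure]
    return dict(cube)


cube = compute_cube(ROWS, ["item", "city", "year"], "sales_in_dollars")
print(cube[(ALL, ALL, ALL)])  # the single apex cell: total sales
```

The apex cuboid contributes exactly one cell, `("*", "*", "*")`, and the base cuboid contributes one cell per distinct (item, city, year) combination.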

Cube Computation: ROLAP-Based Method
Efficient cube computation methods
  – ROLAP-based cubing algorithms (Agarwal et al.)
  – Array-based cubing algorithm (Zhao et al.)
  – Bottom-up computation method (Beyer & Ramakrishnan)
ROLAP-based cubing algorithms
  – Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
  – Grouping is performed on some subaggregates as a "partial grouping step"
  – Aggregates may be computed from previously computed aggregates, rather than from the base fact table

Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
not suitable for high cardinality domains

Base table               Index on Region                  Index on Type
Cust  Region   Type      RecID  Asia  Europe  America     RecID  Retail  Dealer
C1    Asia     Retail    1      1     0       0           1      1       0
C2    Europe   Dealer    2      0     1       0           2      0       1
C3    Asia     Dealer    3      1     0       0           3      0       1
C4    America  Retail    4      0     0       1           4      1       0
C5    Europe   Dealer    5      0     1       0           5      0       1
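
The bitmap index in the table above can be sketched with Python integers as bit vectors, where bit i stands for row i of the base table (0-based here, whereas the RecID column counts from 1). The function names are made up for this example.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose i-th bit marks row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(bitmap):
    """Return the 0-based row positions whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


BASE = [
    {"Cust": "C1", "Region": "Asia", "Type": "Retail"},
    {"Cust": "C2", "Region": "Europe", "Type": "Dealer"},
    {"Cust": "C3", "Region": "Asia", "Type": "Dealer"},
    {"Cust": "C4", "Region": "America", "Type": "Retail"},
    {"Cust": "C5", "Region": "Europe", "Type": "Dealer"},
]

region_index = build_bitmap_index(BASE, "Region")
type_index = build_bitmap_index(BASE, "Type")

# Region = 'Europe' AND Type = 'Dealer' becomes a single bitwise AND.
hits = matching_rows(region_index["Europe"] & type_index["Dealer"])
print([BASE[i]["Cust"] for i in hits])
```

Each distinct value needs its own bit vector of one bit per record, which is why the approach suits low-cardinality columns such as Region and Type rather than high-cardinality ones.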

Indexing OLAP Data: Join Indices

Efficient Processing OLAP Queries

Metadata Repository
Meta data is the data defining warehouse objects. It has the following kinds
  – Description of the structure of the warehouse
      schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents
  – Operational meta-data
      data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  – The algorithms used for summarization
  – The mapping from operational environment to the data warehouse
  – Data related to system performance
      warehouse schema, view and derived data definitions
  – Business data
      business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools and Utilities
Data extraction:
  – get data from multiple, heterogeneous, and external sources
Data cleaning:
  – detect errors in the data and rectify them when possible
Data transformation:
  – convert data from legacy or host format to warehouse format
Load:
  – sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh
  – propagate the updates from the data sources to the warehouse
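
The back-end steps (extraction, cleaning, transformation, load, refresh) compose into a pipeline. The sketch below uses in-memory lists as stand-ins for sources and a dict as the "warehouse"; the record fields and the cleaning rule are assumptions chosen only to make each step visible.

```python
def extract(sources):
    """Gather raw records from several (here: in-memory) sources."""
    return [rec for src in sources for rec in src]


def clean(records):
    """Drop records with a missing or negative amount (stand-in for real checks)."""
    return [r for r in records if r.get("amount") is not None and r["amount"] >= 0]


def transform(records):
    """Convert source-specific formats to one warehouse format."""
    return [{"region": r["region"].strip().title(), "amount": float(r["amount"])}
            for r in records]


def load(warehouse, records):
    """Summarize by region while loading into the warehouse dict."""
    for r in records:
        warehouse[r["region"]] = warehouse.get(r["region"], 0.0) + r["amount"]
    return warehouse


def refresh(warehouse, new_sources):
    """Propagate new source records into the already-loaded warehouse."""
    return load(warehouse, transform(clean(extract(new_sources))))
```

An initial load is `load({}, transform(clean(extract(sources))))`; later source updates go through `refresh`, which reuses the same steps.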

Discovery-Driven Exploration of Data Cubes
Hypothesis-driven: exploration by user, huge search space
Discovery-driven (Sarawagi et al.)
  – pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation
  – Exception: significantly different from the value anticipated, based on a statistical model
  – Visual cues such as background color are used to reflect the degree of exception of each cell
  – Computation of exception indicator (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction
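
The notion of an exception ("significantly different from the value anticipated, based on a statistical model") can be illustrated with a much simpler model than the one behind SelfExp, InExp, and PathExp. The sketch below assumes a plain additive row-plus-column model on a 2-D table and flags cells whose residual is large relative to the spread of all residuals; the function name and the threshold are assumptions, and this is not the published method.

```python
from statistics import mean, pstdev


def exception_cells(table, threshold=2.0):
    """Flag cells that deviate strongly from an additive row + column model."""
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    grand = mean(row_means)  # equals the overall mean for a rectangular table
    residuals = [
        [v - (row_means[i] + col_means[j] - grand) for j, v in enumerate(row)]
        for i, row in enumerate(table)
    ]
    spread = pstdev([r for row in residuals for r in row])
    if spread == 0:
        return []  # the model fits every cell exactly
    return [
        (i, j)
        for i, row in enumerate(residuals)
        for j, r in enumerate(row)
        if abs(r) / spread > threshold
    ]


SALES = [
    [10.0, 10.0, 10.0, 10.0],
    [10.0, 10.0, 50.0, 10.0],
    [10.0, 10.0, 10.0, 10.0],
    [10.0, 10.0, 10.0, 10.0],
]
print(exception_cells(SALES))
```

A front end could then shade the flagged cells, in the spirit of the visual cues mentioned above.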

Examples: Discovery-Driven Data Cubes

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes

Data Warehouse Usage
Three kinds of data warehouse applications
  – Information processing
      supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  – Analytical processing
      multidimensional analysis of data warehouse data
      supports basic OLAP operations, slice-dice, drilling, pivoting
  – Data mining
      knowledge discovery from hidden patterns
      supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Differences among the three tasks
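
The basic OLAP operations named under analytical processing can be sketched on a tiny fact list. The data and function names are made up for illustration; drill-down is simply the inverse of roll-up (keep more dimensions), and pivoting, which only rearranges the presentation, is left out.

```python
from collections import defaultdict

DIMS = ("product", "quarter", "country")

# (product, quarter, country, dollars_sold); illustrative values only.
FACTS = [
    ("TV", "3Qtr", "U.S.A", 100.0),
    ("TV", "4Qtr", "Canada", 80.0),
    ("PC", "3Qtr", "Canada", 200.0),
    ("VCR", "4Qtr", "U.S.A", 40.0),
]


def roll_up(facts, keep):
    """Aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(float)
    for f in facts:
        out[tuple(f[i] for i in idx)] += f[3]
    return dict(out)


def slice_(facts, dim, value):
    """Select on a single value of one dimension."""
    i = DIMS.index(dim)
    return [f for f in facts if f[i] == value]


def dice(facts, **allowed):
    """Select on several dimensions at once: dimension name -> allowed values."""
    return [
        f for f in facts
        if all(f[DIMS.index(d)] in vals for d, vals in allowed.items())
    ]
```

For example, `roll_up(FACTS, ["country"])` summarizes sales per country, and `roll_up(slice_(FACTS, "quarter", "3Qtr"), ["product"])` combines a slice with a roll-up.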

From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)

An OLAM Architecture

Summary
Data warehouse
  – A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
  – Partial vs. full vs. no materialization
  – Multiway array aggregation
  – Bitmap index and join index implementations
Further development of data cube technology
  – Discovery-driven and multi-feature cubes
  – From OLAP to OLAM (on-line analytical mining)