
Overview of Data Mining

Meeting of WP Data Mining


April 28, 2008 Bowo Prasetyo
http://www.scribd.com/prazjp http://www.slideshare.net/bowoprasetyo

Contents
What Is Data Mining?
Does It Differ from Statistics?
Why Use Data Mining?
What Can Data Mining Do?
Methods of Data Mining
Case Study - Department Store
Mining Environmental Data
Conclusion

What Is Data Mining?


The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules 1).

1) Berry and Linoff, Data Mining Techniques for Marketing, Sales and Customer Support (Book), 1997

Does It Differ from Statistics?


Data mining is a blend of statistics, artificial intelligence, and database research 16).

[Figure: diagram relating Statistics, Artificial Intelligence, Database, and Data Mining]

16) D. Pregibon, Data Mining: Statistical Computing and Graphics, p. 7-8, 1997

Statistics, AI, Database


Statistics
Distribution, mean, median, standard deviation

Artificial Intelligence (AI)
Neural network, fuzzy theory, genetic algorithm, particle swarm optimization

Database
Relational, object-oriented, spatial, temporal

Why Use Data Mining?


Data explosion
Automated data collection
Log data of large organizations 2):
  44%: 1 terabyte per month
  11%: 10 terabytes per month
World's digital data on PCs, digital cameras, servers, sensors, etc. 3):
  2006: 161 billion gigabytes
  2010: 988 billion gigabytes (predicted)

Large amounts of data, but small amounts of knowledge
Data mining to discover the knowledge

2) ESG Research, New ESG Research Finds Large Organizations Experiencing Explosive Growth in Log Data Collection, Analysis, and Storage, 2007 (http://www.enterprisestrategygroup.com/_documents/NewsEvent/NewsEvent439.pdf)
3) EMC IDC Research, The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, 2006 (http://www.emc.com/about/destination/digital_universe/)

What Can Data Mining Do?


Examples

On Business and Network Security


Builds customer profiles based on their transactional histories 4)
Analyzes corporate credit ratings using public financial statements, such as financial ratios 5)
Detects credit card fraud by analyzing customer transaction databases 6)
Detects network intrusion based on system program behavior, such as sendmail and tcpdump 7)

4) G. Adomavicius and A. Tuzhilin, Using data mining methods to build customer profiles, in Computer magazine p. 74-82, 2001
5) Z. Huang, H. Chen, C. Hsu, W. Chen, S. Wu, Credit rating analysis with support vector machines and neural networks: a market comparative study, in Journal of Decision Support Systems p. 543-558, 2004
6) T. Fawcett and F. Provost, Adaptive Fraud Detection, in Journal of Data Mining and Knowledge Discovery p. 291-316, 2004
7) W. Lee and S. J. Stolfo, Data Mining Approaches for Intrusion Detection, in Proceedings of

On The Web
Discovers useful patterns from log files, contents, and links of websites 8)
Ranks the web pages on the internet using link structure analysis 9)
Personalizes a website based on log files, contents, and profile data 10)
Supports on-line recommendation to customers by analyzing e-commerce transaction records 11)

8) R. Cooley, B. Mobasher, J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, in Proceedings of 9th International Conference on Tools with Artificial Intelligence (ICTAI) p. 0558, 1997
9) Larry Page, Sergey Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 1998 (http://citeseer.ist.psu.edu/page98pagerank.html)
10) M. Eirinaki and M. Vazirgiannis, Web mining for web personalization, in ACM Transactions on Internet Technology (TOIT) p. 1-27, 2003
11) S. W. Changchien and T. Lu, Mining association rules procedure to support on-line recommendation by customers and products fragmentation, in Journal of Expert Systems with Applications v. 20-4 p. 325-335, 2001

On Environment
Discovers rules in geo-spatial database 12)
Analyzes weather impacts on airspace system 13)
Discovers interesting patterns on Earth Science variables (soil moisture, temperature, precipitation) along with ecosystem data (Net Primary Production) 14)
Finds Ocean Climate Indices based on pressure and temperature data 15)

12) J. Han, K. Koperski, N. Stefanovic, GeoMiner: a system prototype for spatial data mining, in Proceedings of ACM SIGMOD international conference on Management of data p. 553-556, 1997
13) Z. Nazeri and J. Zhang, Mining aviation data to understand impacts of severe weather on airspace system performance, in Proceedings of International Conference on Coding and Computing p. 518-523, 2002
14) V. Kumar, M. Steinbach, P. Tan, S. Klooster, C. Potter, A. Torregrosa, Mining Scientific Data: Discovery of Patterns in the Global Climate System, in Proceedings of the Joint Statistical Meetings p. 5-9, 2001
15) M. Steinbach, P. Tan, V. Kumar, S. Klooster, C. Potter, Data Mining for the Discovery of Ocean Climate Indices, in Proceedings of the 5th Workshop on Scientific Data Mining p. 7-16,


Methods in Data Mining

Basic Methods


Classification, Clustering, Association Rules


Data mining consists of several basic methods:
Classification
Places items into groups based on a training set of previously labeled items (supervised)

Clustering
Places items into groups based on some defined distance measure (unsupervised)

Association Rules
Discovers items that co-occur frequently within a data set and also their rules, such as implication or correlation


Classification
Naive Bayesian classifier

Spam/Non-spam classification

Spam if P(spam | message) > P(non-spam | message)

17) http://en.wikipedia.org/wiki/Naive_Bayes_classifier
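
The spam/non-spam decision can be sketched in a few lines. This is a minimal illustration and not the method of any particular mail filter: the training messages, the label names, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical training messages, invented for this sketch.
TRAINING = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "non-spam"),
    ("project meeting notes", "non-spam"),
]
LABELS = ("spam", "non-spam")


def train(examples):
    """Count messages per class and word occurrences per class."""
    doc_counts = Counter()
    word_counts = {label: Counter() for label in LABELS}
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set().union(*word_counts.values())
    return doc_counts, word_counts, vocabulary


def log_score(text, label, model):
    """Unnormalized log P(label | text), with add-one (Laplace) smoothing."""
    doc_counts, word_counts, vocabulary = model
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.split():
        if word in vocabulary:  # words never seen in training are skipped
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
    return score


def is_spam(text, model):
    """Spam if the spam score exceeds the non-spam score."""
    return log_score(text, "spam", model) > log_score(text, "non-spam", model)


model = train(TRAINING)
print(is_spam("cheap money", model))
print(is_spam("meeting notes", model))
```

The "naive" part is the assumption that words occur independently given the class, which is what allows the per-word log probabilities to be summed.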


Clustering
K-means algorithm 18)
1. Partitions items into k clusters
2. Calculates the mean of each cluster as its centroid
3. Associates each item with the closest centroid, using a defined distance
4. Goes back to step 2 until convergence
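
The four steps can be sketched as follows. This is an illustrative version only: the round-robin initial partition, the Euclidean distance, the rule that an empty cluster keeps its previous centroid, and the sample points are assumptions made for the example, not details taken from the cited paper.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points, k, max_iter=100):
    if len(points) < k:
        raise ValueError("need at least k points")
    # Step 1: initial partition (round-robin is an arbitrary choice here).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: the centroid of each cluster is the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
        # Step 3: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: repeat until the assignment stops changing.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = k_means(points, k=2)
print(labels)
```

The result depends on the initial partition, so practical implementations usually run the algorithm several times from different starting points.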

18) J. A. Hartigan and M. A. Wong, A k-means clustering algorithm, in Applied Statistics 28 (1), p. 100-108, 1979

Association Rules
If a customer buys bread and butter, then she will likely buy milk too, with 90% confidence
Algorithm 19):
1. Finds frequent itemsets whose support >= minsup
2. Finds interesting rules, from the frequent itemsets above, whose confidence >= minconf
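
Support and confidence are simple ratios over the transaction list. The sketch below uses a hypothetical set of five baskets, so the resulting confidence is not the 90% quoted above; support is expressed here as a fraction of all transactions, which is one common convention.

```python
# Hypothetical baskets, invented for this sketch.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"bread", "butter", "milk"})
rule_confidence = confidence({"bread", "butter"}, {"milk"})
print(rule_support, rule_confidence)
```

A rule is kept only if its itemset passes the minsup threshold and the rule itself passes the minconf threshold.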
19) R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, in Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994

Association Rules
Apriori algorithm to find frequent itemsets L in database D 19):
Find frequent set L(k-1)
Join step
  Ck is generated by joining L(k-1) with itself
Prune step
  Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence any candidate in Ck that contains it should be removed

(Ck: candidate itemset of size k)
(Lk: frequent itemset of size k whose support >= minsup)
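
The join and prune steps can be sketched as below. This is a simplified illustration: the join takes the union of any two frequent (k-1)-itemsets that yields a k-itemset, instead of the ordered prefix join of the original paper, and support is an absolute count. The baskets are toy data.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for every itemset with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    def keep_frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= minsup}

    # L1: frequent 1-itemsets.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Toy baskets for illustration.
baskets = [
    {"rice", "cooking oil", "beef"},
    {"sugar", "cooking oil", "eggs"},
    {"rice", "sugar", "cooking oil", "eggs"},
    {"sugar", "eggs"},
]
frequent = apriori(baskets, minsup=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Pruning pays off at k = 3 here: a candidate such as {rice, cooking oil, sugar} is discarded without scanning the baskets, because its subset {rice, sugar} is not frequent.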

Association Rules
Apriori algorithm to find rules R from frequent itemsets L 19):
For each l in L, generate S = nonempty proper subsets of l
For each s in S, generate rule s => (l - s) if confidence >= minconf
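
Rule generation can be sketched as a loop over the subsets of each frequent itemset. The support counts below are hypothetical, and the function assumes its input contains every subset of each frequent itemset, which holds for the output of a frequent-itemset search.

```python
from itertools import combinations


def generate_rules(frequent, minconf):
    """frequent maps each frequent itemset (frozenset) to its support count."""
    rules = []
    for itemset, count in frequent.items():
        # Proper, nonempty subsets s of the itemset become antecedents.
        for size in range(1, len(itemset)):
            for s in combinations(sorted(itemset), size):
                antecedent = frozenset(s)
                confidence = count / frequent[antecedent]
                if confidence >= minconf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical support counts, invented for this sketch.
frequent = {
    frozenset({"bread"}): 4,
    frozenset({"butter"}): 3,
    frozenset({"milk"}): 4,
    frozenset({"bread", "butter"}): 3,
    frozenset({"bread", "milk"}): 3,
    frozenset({"butter", "milk"}): 2,
    frozenset({"bread", "butter", "milk"}): 2,
}
rules = generate_rules(frequent, minconf=0.75)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 2))
```

Single-item sets produce no rules because they have no proper nonempty subset.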


Visualization Of Mining Results


Problems with mining results
Too many results to display
Difficult to find important rules
Difficult to understand the rules

Needs good visualization tools
Charts for statistical results
Graphs (nodes & edges) for association rules
Globe maps for geo-spatial results
Animation for temporal results
Use of colors, styles, thickness, etc.


Case Study

Association Rules in a Department Store

Items and Transactions

Mr. Joko's purchases in January:
1. rice, cooking oil, beef
2. sugar, cooking oil, eggs
3. rice, sugar, cooking oil, eggs
4. sugar, eggs
(Each numbered purchase is a transaction; each product in it is an item.)

Frequent Items

Frequent: purchased >= 2 times (the minimum support)
Support: the number of purchases that contain the item
beef = 1 purchase, so not frequent
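
The support count of every item can be tallied directly from the four January purchases. A minimal sketch, with item names given in English and the threshold of 2 taken from the slide:

```python
from collections import Counter

# The four January purchases from the case study.
purchases = [
    {"rice", "cooking oil", "beef"},
    {"sugar", "cooking oil", "eggs"},
    {"rice", "sugar", "cooking oil", "eggs"},
    {"sugar", "eggs"},
]
MIN_SUPPORT = 2

# Support of an item = number of purchases that contain it.
item_support = Counter(item for basket in purchases for item in basket)
frequent_items = {item for item, n in item_support.items() if n >= MIN_SUPPORT}

for item, n in sorted(item_support.items()):
    status = "frequent" if item in frequent_items else "not frequent"
    print(f"{item}: {n} ({status})")
```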



n-Length Item (n-Item)


n > 1
2-length item

3-length item


Association Rule
Customers who buy rice will also buy cooking oil.

rice => cooking oil
(antecedent => consequent)

"if rice, then cooking oil"

confidence = support(cooking oil & rice) / support(rice) = 2/2 = 1
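
The same calculation can be checked in code, and it also shows that a rule and its reverse generally have different confidence. A minimal sketch over the four January purchases, with item names in English; the helper names are chosen for this example.

```python
# The four January purchases from the case study.
purchases = [
    {"rice", "cooking oil", "beef"},
    {"sugar", "cooking oil", "eggs"},
    {"rice", "sugar", "cooking oil", "eggs"},
    {"sugar", "eggs"},
]


def support_count(itemset):
    """Number of purchases that contain every item of the itemset."""
    return sum(1 for basket in purchases if itemset <= basket)


def confidence(antecedent, consequent):
    return support_count(antecedent | consequent) / support_count(antecedent)


forward = confidence({"rice"}, {"cooking oil"})   # rice => cooking oil
backward = confidence({"cooking oil"}, {"rice"})  # cooking oil => rice
print(forward, backward)
```

Every purchase with rice also contains cooking oil, so the forward rule has confidence 1, while only two of the three purchases with cooking oil contain rice.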

Complete Association Rules

Mining Environmental Data


Examples


Explosion in Environmental Data


Temperature, humidity, pressure, precipitation, sound, light, shock
Weather & rainfall trends, river height & flows, air & water quality, pollution levels, salinity, emissions, FPAR, NPP
Earth science, oceanography, meteorology, ecology
Sensors, hand-held/wireless devices, remote sensing (satellites), other automated logging devices

Geo-spatial Database
Discovers rules in geo-spatial database (GeoMiner) 12)
Given Western Canada, describe the weather patterns
Given temperature, precipitation, etc., describe the regions
Show the differences in weather patterns between British Columbia and Alberta
If a Canadian town is large and is adjacent to a large water body, then it is close to the U.S. border, with a probability of 78%

Earth Science
Interesting patterns in Earth Science 14)

Regions that are covered by the highly correlated pattern FPAR-Hi => NPP-Hi
Shrubland regions
FPAR: Fractional Intercepted Photosynthetically Active Radiation
NPP: Net Primary Production

Earth Science
Interesting patterns in Earth Science 14)

Two clusters for NPP (land) and two clusters for SST (ocean). The clusters approximate the northern and southern hemispheres, for land and ocean.
SST: sea surface temperature

Earth Science
Interesting patterns in Earth Science 14)
Clusters of ocean near the Philippines (SST) and lands of Eastern Brazil, Southern Africa, and a bit of Australia (NPP) are highly correlated (0.47).
In particular, this sea region is highly correlated (0.66) with SOI, which is a climate index related to El Niño, and it is known that parts of Southern Africa and Australia experience droughts related to El Niño.
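
A correlation between two cluster time series can be computed as sketched below. The slides do not say which correlation measure was used, so the Pearson coefficient is an assumption here, and both series are synthetic values invented for the example, not data from the cited study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Synthetic anomaly series (hypothetical values, one per time step).
sst_anomaly = [0.2, -0.1, 0.5, 0.9, -0.4, 0.1, -0.6, 0.3]
npp_anomaly = [0.1, 0.0, 0.3, 0.4, -0.5, 0.2, -0.2, 0.1]

r = pearson(sst_anomaly, npp_anomaly)
print(round(r, 2))
```

The coefficient ranges from -1 to 1, which is the scale on which figures such as 0.47 and 0.66 would be read.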

Conclusion
Today's data repositories are huge and are collected at enormous speed
Traditional statistical methods are no longer sufficient to analyze data
Data mining is very important to discover knowledge hidden in data
Helps decision making in a broad range of fields: business, network security, web, environment, etc.
A good visualization tool is needed to understand mining results easily
