Contents
What Is Data Mining?
Does It Differ from Statistics?
Why Use Data Mining?
What Can Data Mining Do?
Methods of Data Mining
Case Study: Department Store
Mining Environmental Data
Conclusion
1) Berry and Linoff, Data Mining Techniques for Marketing, Sales and Customer Support (Book), 1997
Data Mining
Database
16) D. Pregibon, Data Mining: Statistical Computing and Graphics, p. 7-8, 1997
Database
Relational, object-oriented, spatial, temporal
Large amounts of data, but small amounts of knowledge
Data mining to discover the knowledge
2) ESG Research, New ESG Research Finds Large Organizations Experiencing Explosive Growth in Log Data Collection, Analysis, and Storage, 2007 (http://www.enterprisestrategygroup.com/_documents/NewsEvent/NewsEvent439.pdf)
3) EMC IDC Research, The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, 2006 (http://www.emc.com/about/destination/digital_universe/)
On The Web
Discovers useful patterns from log files, contents, and links of websites8)
Ranks the web pages on the internet using link structure analysis9)
Personalizes a website based on log files, contents, and profile data10)
Supports on-line recommendation to customers by analyzing e-commerce transaction records11)
8) R. Cooley, B. Mobasher, J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, in Proceedings of 9th International Conference on Tools with Artificial Intelligence (ICTAI) p. 0558, 1997
9) Larry Page, Sergey Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 1998 (http://citeseer.ist.psu.edu/page98pagerank.html)
10) M. Eirinaki and M. Vazirgiannis, Web mining for web personalization, in ACM Transactions on Internet Technology (TOIT) p. 1-27, 2003
11) S. W. Changchien and T. Lu, Mining association rules procedure to support on-line recommendation by customers and products fragmentation, in Journal of Expert Systems with Applications v. 20-4 p. 325-335, 2001
On Environment
Discovers rules in geo-spatial database12)
Analyzes weather impacts on airspace system13)
Discovers interesting patterns on Earth Science variables (soil moisture, temperature, precipitation) along with ecosystem data (Net Primary Production)14)
Finds Ocean Climate Indices based on pressure and temperature data15)
12) J. Han, K. Koperski, N. Stefanovic, GeoMiner: a system prototype for spatial data mining, in Proceedings of ACM SIGMOD international conference on Management of data p. 553-556, 1997
13) Z. Nazeri and J. Zhang, Mining aviation data to understand impacts of severe weather on airspace system performance, in Proceedings of International Conference on Coding and Computing p. 518-523, 2002
14) V. Kumar, M. Steinbach, P. Tan, S. Klooster, C. Potter, A. Torregrosa, Mining Scientific Data: Discovery of Patterns in the Global Climate System, in Proceedings of the Joint Statistical Meetings p. 5-9, 2001
15) M. Steinbach, P. Tan, V. Kumar, S. Klooster, C. Potter, Data Mining for the Discovery of Ocean Climate Indices, in Proceedings of the 5th Workshop on Scientific Data Mining p. 7-16,
Basic Methods
Clustering
Places items into groups based on some defined distance measure (unsupervised)
Association Rules
Discovers items that co-occur frequently within a data set and also their rules, such as implication or correlation
Classification
Naive Bayesian classifier
Spam/Non-spam classification
Spam if
17) http://en.wikipedia.org/wiki/Naive_Bayes_classifier
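As a rough illustration of the Naive Bayesian spam classifier named above, the sketch below labels a message as spam when its smoothed log-score under the spam class exceeds its score under the non-spam class. The decision rule, the function names (train, log_score, is_spam), the "ham" label for non-spam, and the Laplace smoothing are assumptions made for this example, not details taken from the slide.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. messages: list of (text, label), label in {'spam', 'ham'}."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def log_score(text, label, model):
    """log P(label) + sum of log P(word | label), with Laplace (add-one) smoothing."""
    word_counts, class_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.lower().split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score

def is_spam(text, model):
    """Assumed decision rule: spam if the spam score beats the non-spam score."""
    return log_score(text, "spam", model) > log_score(text, "ham", model)
```

A model trained on a handful of labeled messages can then be queried with is_spam(text, model); the "naive" part is the assumption that words occur independently given the class.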
Clustering
K-means algorithm18)
1. Partitions items into k clusters
2. Calculates the mean of each cluster as its centroid
3. Associates each item with the closest centroid using the defined distance
4. Returns to step 2 until convergence
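The four steps above can be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance, a random initial partition, and convergence defined as "no assignment changed"; the function name kmeans and its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: random initial partition. Shuffling and dealing round-robin
    # keeps every cluster non-empty at the start when len(points) >= k.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centroids = []
    for _ in range(max_iter):
        # Step 2: the centroid of each cluster is the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                # A cluster that lost all members keeps its previous centroid.
                new_centroids.append(centroids[c])
        centroids = new_centroids
        # Step 3: assign each item to the closest centroid (Euclidean distance).
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

The result depends on the initial partition, so in practice the algorithm is often run several times with different seeds.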
18) J. A. Hartigan and M. A. Wong, A k-means clustering algorithm, in Applied Statistics, 28 (1) p. 100-108, 1979
Association Rules
If a customer buys bread and butter, then she will likely buy milk too, with 90% confidence
Algorithm19):
Finds frequent itemsets whose support >= minsup
Finds interesting rules from the frequent itemsets above whose confidence >= minconf
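To make support and confidence concrete, here is a small sketch using the usual definitions: support is the fraction of transactions containing an itemset, and the confidence of a rule is the support of the whole rule divided by the support of its left-hand side. The function names and the toy transactions are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]
rule_conf = confidence({"bread", "butter"}, {"milk"}, transactions)
```

In this toy data, bread and butter appear together in three of five transactions and milk accompanies them in two of those three, so the rule holds with a confidence of two thirds.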
19) R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, in Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994
Association Rules
Apriori algorithm to find frequent itemsets L in database D19):
Find frequent set L_{k-1}
Join step
C_k is generated by joining L_{k-1} with itself
Prune step
Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence any candidate in C_k containing it should be removed
(C_k: candidate itemset of size k)
(L_k: frequent itemset of size k whose support >= minsup)
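The join and prune steps can be sketched as below. This is a simplified illustration rather than the exact procedure of the cited paper: the join unions any two frequent (k-1)-itemsets that yield a k-itemset instead of using the prefix-based join, and minsup is assumed to be an absolute transaction count. The function name apriori is hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset (frozenset): support count} for itemsets with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    # L_1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)  # L_{k-1}
        # Join step: C_k from pairs of (k-1)-itemsets whose union has size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support of the surviving candidates and keep L_k.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result
```

The prune step is what saves work: a candidate is discarded without scanning the database as soon as one of its subsets is known to be infrequent.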
Association Rules
Apriori algorithm to find rules R from frequent itemsets L19):
For each l ∈ L, generate S = nonempty proper subsets of l
For each s ∈ S, generate rule s ⇒ (l - s) if confidence >= minconf
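Rule generation from frequent itemsets can be sketched as follows, assuming the support count of every frequent itemset (and therefore of every subset) is already available in a dictionary. The function name generate_rules and the counts used in the example are hypothetical.

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """frequent: {frozenset: support count}, containing all subsets of each itemset.

    Returns a list of (antecedent, consequent, confidence) tuples.
    """
    rules = []
    for itemset, count in frequent.items():
        # Enumerate every nonempty proper subset s of the itemset.
        for size in range(1, len(itemset)):
            for s in combinations(sorted(itemset), size):
                antecedent = frozenset(s)
                # confidence(s => l - s) = support(l) / support(s)
                conf = count / frequent[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Hypothetical support counts for illustration.
frequent = {
    frozenset({"bread"}): 4,
    frozenset({"butter"}): 3,
    frozenset({"bread", "butter"}): 3,
}
rules = generate_rules(frequent, minconf=0.9)
```

With these counts, only the rule from butter to bread passes the 0.9 threshold; the reverse rule reaches just 0.75.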
Case Study
support
3-length item
Association Rules
Customers who buy rice will also buy cooking oil.
Geo-spatial Database
Discovers rules in geo-spatial database12)
Given Western Canada, describe the weather patterns
Given temperature, precipitation, etc., describe the regions
Show the differences in weather patterns between British Columbia and Alberta
GeoMiner
If a Canadian town is large and is adjacent to a large water body, then it is close to the U.S. border, with a probability of 78%
Earth Science
Interesting patterns on Earth Science14)
Shrubland regions
Regions that are covered by the highly correlated pattern, FPAR-Hi ⇒ NPP-Hi
FPAR: Fractional Intercepted Photosynthetically Active Radiation
NPP: Net Primary Production
Earth Science
Interesting patterns on Earth Science14)
Two clusters for NPP (land) and two clusters for SST (ocean). The clusters approximate the northern and southern hemispheres, for land and ocean.
SST: sea surface temperature
Earth Science
Interesting patterns on Earth Science14)
Clusters of ocean near the Philippines (SST) and lands of Eastern Brazil, Southern Africa, and a bit of Australia (NPP) are highly correlated (0.47).
In particular, this sea region is highly correlated (0.66) with SOI, a climate index related to El Niño, and it is known that parts of Southern Africa and Australia experience droughts related to El Niño.
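The slides do not say which correlation measure produced the figures above. As an illustration only, the sketch below computes the Pearson correlation coefficient between two equally long series, such as a cluster's averaged time series and a climate index; the function name pearson is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

The coefficient ranges from -1 (perfectly opposed) to 1 (perfectly aligned), with 0 meaning no linear relationship.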
Conclusion
Today's data repositories are huge and are collected at enormous speed
Traditional statistical methods are no longer sufficient to analyze data
Data mining is very important to discover knowledge hidden in data
Helps decision making in a broad range of fields: business, network security, web, environment, etc.
A good visualization tool is needed to understand mining results easily