Chapter 1 Introduction
The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases, information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth. Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data.
Forecasting demand is both a science and an art. Forecasting is an important element of decision-making support: the common theme running through decision making is selection, and forecasting is indispensable for realizing that theme optimally. This is reflected in the position of forecasting as a core method of the decision support systems (DSS) developed to date.
Today, high industrial activity and a heavy dependence on electric power in daily life demand a larger and more stable supply of electric power. Highly accurate prediction of electric power demand therefore plays a decisive role in meeting these requirements. This first chapter of the report gives a brief overview of the terms used repeatedly in data mining.
A predictive analysis for real time data through data mining techniques
1.1 OVERVIEW
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
1.2 Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
Operational or transactional data, such as sales, cost, inventory, payroll, and accounting
Nonoperational data, such as industry sales, forecast data, and macroeconomic data
Metadata: data about the data itself, such as logical database design or data dictionary definitions
1.3 Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.
1.4 Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behaviour. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
There are two main kinds of models in data mining: predictive and descriptive. Predictive models can be used to forecast explicit values, based on patterns determined from known results. For example, from a database of customers who have already responded to a particular offer, a model can be built that predicts which prospects are likeliest to respond to the same offer. Descriptive models describe patterns in existing data, and are generally used to create meaningful subgroups such as demographic clusters. Descriptive models, that is, the unsupervised learning functions, do not predict a target value but focus on the intrinsic structure, relations, and interconnectedness of the data.
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behaviour patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Extract, transform, and load transaction data onto the data warehouse system.
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
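Of the techniques listed above, the nearest neighbor method is the easiest to show in a few lines. The following is a minimal sketch, not the report's implementation: the function name `knn_classify`, the toy two-feature records, and the use of Euclidean distance with majority voting are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    `history` is a list of (feature_vector, label) pairs (hypothetical layout).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort historical records by Euclidean distance to the query and keep k of them.
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy historical dataset: two numeric features per record, one class label.
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
]

print(knn_classify(history, (1.1, 1.0), k=3))
print(knn_classify(history, (5.1, 5.0), k=3))
```

In practice the features would usually be scaled first, since an unscaled attribute with a large range would dominate the distance.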
the business objectives and desired outcomes for the project and translate
2. Exploration:
Analyze source data to determine the most appropriate data and model building
4. Model Building:
Create, test, and validate models, and evaluate whether they will meet project
5. Deployment:
Apply model results to business decisions or processes. This ranges from sharing insights with business users to embedding models into applications to automate decisions and business processes.
1.10 Barriers to Usage
A host of barriers can prevent organizations from venturing into the domain of predictive analytics or impede their growth. This analytics bottleneck arises from:
1. Complexity. Developing sophisticated models has traditionally been a slow, iterative, and labour-intensive process.
2. Data. Most corporate data is full of errors and inconsistencies but most predictive models require
Complex analytical queries and scoring processes can clog networks and bog
Qualified business analysts who can create sophisticated models are hard to find,
accessing or moving data and models among multiple machines, operating platforms, and applications, which requires interoperable software.
6. Pricing. The price of most predictive analytic software and the hardware to run it on is beyond the reach of most midsize organizations or departments in large organizations.
Clinical decision support systems link health observations with health knowledge to influence health choices by clinicians for improved health care.
iii) Cross-sell
For an organization that offers multiple products, an analysis of existing customer behaviour can lead to efficient cross-selling of products. Predictive analytics can help analyze customers' spending, usage and other behaviour, and help cross-sell the right product at the right time.
iv) Customer retention
By frequently examining a customer's past service usage, service performance, spending and other behaviour patterns, predictive models can determine the likelihood of a customer wanting to terminate service in the near future.
v) Direct marketing
Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of product versions, marketing material, communication channels and timing that should be used to target a given consumer. The goal of predictive analytics is typically to lower the cost per order or cost per action.
Fraud is a big problem for many businesses and can be of various types. Inaccurate credit applications, fraudulent transactions, identity thefts and false insurance claims are some examples of this problem. These problems plague firms all across the spectrum; likely victims include credit card issuers, insurance companies, retail merchants, manufacturers, business-to-business suppliers and even service providers. This is an area where a predictive model is often used to help weed out the "bads" and reduce a business's exposure to fraud.
Chapter 2 Objectives
1. Upgrade an e-science infrastructure to support collaborative, data mining enabled experimental research.
3. Design and implement mechanisms for meta-mining the knowledge discovery process.
Company profile
3.2 Introduction
Under Veena Industries Ltd., the Agarwal Group started a new software solutions company on 12th July 2012. Headquartered in Chakan, Pune, it delivers software solutions that best meet the requirements of a global clientele. Its software supports clients in achieving their business goals in the most cost-effective way. The company can do this because it invests quality time in understanding, analyzing, and sometimes even discovering client requirements at an early stage. Its ability to choose suitable in-house resources for each project, its capacity to scale resources and domain knowledge to match project needs, and above all its knowledge of the industry translate a customer requirement into a successfully delivered software solution.
It solves critical business problems using analytics, maximizing predictive power while ensuring actionable use by addressing operational, legal, and data issues. It has deep expertise in working with data from multiple sources (e.g. demographic, application, master file, credit bureau) and at multiple levels of data aggregation (e.g. transaction, account, customer, portfolio, enterprise).
Its mission is to provide customers with innovative solutions and progressive technology that help improve customers' working methods and profitability.
Company History
2012
2011
2010
2009
The Mecc Alte plant was set up at Sanaswadi, Pune, and has been operational since 25th January 2009.
2007
Flair Technologies Limited, UK was incorporated as a 100% subsidiary of Veena Industries Ltd.
Flair Technologies is the sales, marketing & distribution arm operating on the business principle of JIT deliveries to our global customers.
2006
Veena Industries Private Limited formed a Joint Venture with Mecc Alte, UK for manufacture of Alternator assemblies.
2002
Branch I and Branch II were ISO 9001:2000 certified by DNV in June 2002
1980
Veena Industries commissioned its first plant at MIDC, Bhosari, Pune, India to manufacture trailers for Mahindra Owel and base-frames & under-carriages for Atlas Copco India Ltd
Branch 1 - Bhosari, Pune
Branch 2 - Chakan, Pune
Branch 3 - Waki, Pune
Branch 4 - Samba, Jammu
Branch 5 - (EOU) Chakan, Pune
Branch 6 - Pithampur, Indore
Mecc Alte India Limited - Sanaswadi, Pune (JV)
3.7 Organisational Chart of the company
Board of directors
Shailendra N. Agarwal, Managing Director
Brijendra N. Agarwal, Chairman
Atin Agarwal, Director
Avinash Agarwal, Director
3.8 Management team & Organization of the company
Mr. Raman Tandon - GM Operations Division
Mr. Milind Vyas - Head Procurement and IT
Mr. S. Srinivas - GM Business Development
Various Products
1. Proclaim
2. Makros
3. Universal Desk
4. Acuity
Veena Industries Ltd. provides powerful custom analytics and analytic consulting that are considered the gold standard in the industries they serve.
Chapter 4
3. Predict what each individual accessing a Web site is most likely interested in seeing.
4.4 IMPORTANT ASPECTS OF THE PROJECT
As an MBA student, I found that the project provided an excellent opportunity to apply what I had learnt in classroom sessions in a practical setting. It helped me learn more about data mining, data mailing, software, and websites, as well as about software development. The company's headquarters had been established in Pune only months earlier. At the beginning of this year the management recognized the importance of software solutions and the need for software companies, and the company recruited a new team and interns for this purpose. I was recruited as an intern to work on the data mining process.
Research Methodology
5.1 INTRODUCTION
In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and presenting strong rules discovered in databases using different measures of interestingness.
Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute value conditions that occur frequently together in a given dataset. A typical and widely used example of association rule mining is Market Basket Analysis.
For example, data are collected using bar-code scanners in supermarkets. Such market basket databases consist of a large number of transaction records. Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together. They could use this data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design and to identify customer segments based on buying patterns. Association rules provide information of this type in the form of "if-then" statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.
The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. The support is sometimes expressed as a percentage of the total number of records in the database.
The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent.
Lift is one more parameter of interest in association analysis. Lift is the ratio of confidence to expected confidence. Expected confidence in this case means, using the above example, "confidence, if buying A and B does not enhance the probability of buying C." It is the number of transactions that include the consequent divided by the total number of transactions. A lift above 1 therefore indicates that the antecedent raises the probability of the consequent.
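The three measures defined above (support, confidence, and lift) can be computed directly from a list of transactions. The sketch below is illustrative only: the function name `rule_metrics` and the eight toy transactions are assumptions, and the rule {A, B} ==> {C} simply mirrors the "buying A and B ... buying C" wording used in the text.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return support, confidence, expected confidence and lift for a rule."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= set(t))
    n_cons = sum(1 for t in transactions if consequent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    if n == 0 or n_ante == 0 or n_cons == 0:
        raise ValueError("rule is undefined on this data")
    confidence = n_both / n_ante          # support count / antecedent count
    expected_confidence = n_cons / n      # consequent count / all transactions
    return {
        "support_count": n_both,
        "support": n_both / n,            # support as a fraction of all records
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": confidence / expected_confidence,
    }

# Hypothetical market-basket data.
transactions = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"},
    {"B", "D"}, {"C", "D"}, {"A", "B", "C", "D"}, {"D"},
]

metrics = rule_metrics(transactions, {"A", "B"}, {"C"})
print(metrics)
```

On this toy data the rule has a support count of 3 out of 8, a confidence of 3/4, an expected confidence of 5/8, and hence a lift of 1.2.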
Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:
1. First, minimum support is applied to find all frequent item sets in a database.
2. Second, these frequent item sets and the minimum confidence constraint are used to form rules.
While the second step is straightforward, the first step needs more attention.
Finding all frequent item sets in a database is difficult since it involves searching all possible item sets (item combinations). The set of possible item sets is the power set over I and has size $2^n - 1$ (excluding the empty set, which is not a valid item set). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that all subsets of a frequent item set are also frequent and thus that all supersets of an infrequent item set must be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent item sets.
1. Frequent item set generation, whose objective is to find all the item sets that satisfy the minimum support threshold. These item sets are called frequent item sets.
2. Rule generation, whose objective is to extract all the high-confidence rules from the frequent item sets found in the previous step. These rules are called strong rules.
The computational requirements for frequent item set generation are generally more expensive than those of rule generation.
If an item set is frequent, then all of its subsets must also be frequent. To illustrate the idea behind the Apriori principle, suppose {c, d, e} is a frequent item set. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.
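The Apriori principle can be turned into a level-wise search: candidates of size k are built only from frequent item sets of size k - 1, and any candidate with an infrequent subset is pruned before the database is scanned. The following is a simplified sketch under stated assumptions; the function name `apriori`, the absolute support count used as threshold, and the five toy transactions are illustrative and are not taken from the report or from any particular tool.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {item set: support count} for every item set whose count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        candidates = set()
        # Join step: union pairs of frequent (k-1)-item sets into k-item sets.
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) != k:
                    continue
                # Prune step (downward closure): every (k-1)-subset must be frequent.
                if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Scan the database once per level to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical transactions built around the {c, d, e} example.
transactions = [
    {"c", "d", "e"}, {"c", "d", "e"}, {"c", "d"}, {"a", "c"}, {"a", "d", "e"},
]
frequent_sets = apriori(transactions, min_support=2)
for items, count in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(items), count)
```

With a minimum support count of 2, {c, d, e} is frequent here, and so are all six of its non-empty proper subsets, as the principle requires.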
The Apriori heuristic achieves good performance by (possibly significantly) reducing the size of candidate sets. However, in situations with a large number of frequent patterns, long patterns, or quite low minimum support thresholds, an Apriori-like algorithm may suffer from the following two nontrivial costs:
It is costly to handle a huge number of candidate sets. For example, if there are $10^4$ frequent 1-item sets, the Apriori algorithm will need to generate more than $10^7$ length-2 candidates and accumulate and test their occurrence frequencies. Moreover, to discover a frequent pattern of size 100, such as {a1, . . . , a100}, it must generate $2^{100} - 2 \approx 10^{30}$ candidates in total. This is the inherent cost of candidate generation, no matter what implementation technique is applied.
It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching, which is especially true for mining long patterns.
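The orders of magnitude quoted for candidate generation can be checked with a few lines of arithmetic. This is only a sanity check: it assumes that the length-2 candidates are all unordered pairs of the frequent 1-item sets, and that the candidates for a size-100 pattern are its non-empty proper subsets.

```python
import math

# Length-2 candidates formed from 10^4 frequent 1-item sets: all unordered pairs.
n_frequent_1 = 10**4
length2_candidates = math.comb(n_frequent_1, 2)
print(length2_candidates)

# Non-empty proper subsets of a 100-item pattern: 2^100 - 2.
subsets_of_100 = 2**100 - 2
print(f"{subsets_of_100:.3e}")
```

The first figure is 49,995,000, which is indeed more than $10^7$; the second is on the order of $10^{30}$.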
5.4.1 FP-GROWTH.
FP-growth is an algorithm for generating frequent item sets for association rules from Jiawei Han's research group at Simon Fraser University. It generates all frequent item sets satisfying a given minimum support by growing a frequent pattern tree structure that stores compressed information about the frequent patterns. In this way, FP-growth can avoid repeated database scans and also avoid the generation of a large number of candidate item sets.
There are several advantages of FP-growth over other approaches:
i) It constructs a highly compact FP-tree, which is usually substantially smaller than the original database and thus saves the costly database scans in the subsequent mining processes.
ii) It applies a pattern growth method which avoids costly candidate generation and testing by successively concatenating frequent 1-item sets found in the (conditional) FP-trees. This ensures that it never generates any combinations of new candidate sets which are not in the database, because the item set in any transaction is always encoded in the corresponding path of the FP-trees.
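The compactness claimed for the FP-tree comes from shared prefixes: transactions whose frequent items begin with the same sequence share the same path. The sketch below covers only the tree construction (two passes over the data) and not the mining of conditional FP-trees. The class `FPNode`, the functions `build_fp_tree` and `count_nodes`, and the toy transactions are hypothetical names and data chosen for illustration; ties in item frequency are broken alphabetically, which is one convention among several.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in freq.items() if c >= min_support}

    # Scan 2: insert each transaction's frequent items, most frequent first,
    # so that common prefixes share a path.
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

transactions = [
    ["c", "d", "e"], ["c", "d", "e"], ["c", "d"], ["a", "c"], ["a", "d", "e"],
]
root, frequent = build_fp_tree(transactions, min_support=2)
total_items = sum(len(t) for t in transactions)
print(count_nodes(root), "tree nodes for", total_items, "item occurrences")
```

On this toy input the 13 item occurrences in the transactions are stored in 7 tree nodes, because three transactions share the prefix c, d.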
5.4.2 OBJECTIVE
Develop a recommender system with the help of supermarket transactional data.
Fig 4.2 Supermarket Data view in Table Format
As we can see, the data is highly sparse and represents binary values. For the mining process, besides the input data, a minimum support threshold is needed. Choosing the value of this threshold is one of the key issues. The right value can be found only through user interaction and many iterations. Because user interaction is needed in this phase of the mining process, it is advisable to run the frequent pattern discovery algorithm iteratively on a relatively small part of the whole dataset. With the right sample size, the response time of the application remains small while the sample still represents the whole data accurately. Setting the minimum support threshold is not a trivial task, and it requires a lot of practice and attention on the part of the user. The frequent web access patterns are written in a text file along with the sessions in which they are accessed and the day on which they are accessed, in the correct order sequence. The above figure shows the output of the Weka association analysis on the supermarket dataset, with rules generated by Apriori for a minimum support of 10% and a minimum confidence of 90%.
Output
The output of the Weka association analysis on the supermarket dataset is given below.
Figure 4.3 Weka output for Apriori algorithm on supermarket dataset
As Figure 4.3 shows, the algorithm generated the 10 best rules found under the given constraints. The rules are listed below.
1. biscuits=t frozen foods=t pet foods=t milk-cream=t vegetables=t 516 ==> bread and cake=t 475 conf:(0.92)
2. baking needs=t biscuits=t milk-cream=t margarine=t fruit=t vegetables=t 505 ==> bread and cake=t 464 conf:(0.92)
3. biscuits=t frozen foods=t milk-cream=t margarine=t vegetables=t 585 ==> bread and cake=t 537 conf:(0.92)
4. biscuits=t canned vegetables=t frozen foods=t fruit=t vegetables=t 536 ==> bread and cake=t 492 conf:(0.92)
5. baking needs=t frozen foods=t milk-cream=t margarine=t fruit=t vegetables=t 517 ==> bread and cake=t 474 conf:(0.92)
6. biscuits=t frozen foods=t pet foods=t milk-cream=t fruit=t 511 ==> bread and cake=t 468 conf:(0.92)
7. biscuits=t frozen foods=t tissues-paper prd=t milk-cream=t vegetables=t 575 ==> bread and cake=t 526 conf:(0.91)
8. biscuits=t frozen foods=t beef=t fruit=t vegetables=t 536 ==> bread and cake=t 490 conf:(0.91)
9. baking needs=t biscuits=t frozen foods=t cheese=t fruit=t 538 ==> bread and cake=t 491 conf:(0.91)
10. biscuits=t frozen foods=t milk-cream=t margarine=t fruit=t 592 ==> bread and cake=t 540 conf:(0.91)
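Each rule in this listing carries two counts: the number of transactions matching the antecedent and the number matching both antecedent and consequent. The reported confidence is their ratio, rounded to two decimals. The sketch below parses a few of the listed rules and re-derives the confidence as a consistency check. The regular expression `RULE_RE` and the function `parse_rule` are assumptions written for this text layout; they are not part of Weka.

```python
import re

# Pattern for lines of the form "<items> <count> ==> <items> <count> conf:(<value>)".
RULE_RE = re.compile(
    r"^(?:\d+\.\s*)?(?P<body>.+?)\s+(?P<n_body>\d+)\s+==>\s+"
    r"(?P<head>.+?)\s+(?P<n_both>\d+)\s+conf:\((?P<conf>[\d.]+)\)\.?$"
)

def parse_rule(line):
    """Split one rule line into antecedent, consequent, counts and confidence."""
    match = RULE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized rule line: {line!r}")
    return {
        "body": match["body"],
        "head": match["head"],
        "n_body": int(match["n_body"]),
        "n_both": int(match["n_both"]),
        "conf": float(match["conf"]),
    }

lines = [
    "1. biscuits=t frozen foods=t pet foods=t milk-cream=t vegetables=t 516 ==> bread and cake=t 475 conf:(0.92)",
    "7. biscuits=t frozen foods=t tissues-paper prd=t milk-cream=t vegetables=t 575 ==> bread and cake=t 526 conf:(0.91)",
    "10. biscuits=t frozen foods=t milk-cream=t margarine=t fruit=t 592 ==> bread and cake=t 540 conf:(0.91)",
]

rules = [parse_rule(line) for line in lines]
for rule in rules:
    recomputed = rule["n_both"] / rule["n_body"]
    print(rule["head"], rule["conf"], round(recomputed, 4))
```

For rule 1, for instance, 475 / 516 is approximately 0.9205, which rounds to the reported 0.92.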
A major advantage of FPGrowth compared to Apriori is that it uses only two data scans and is therefore often applicable even to large data sets. The given data set may contain only binominal attributes, i.e. nominal attributes with only two different values.
5.5.1 Process
The process for frequent item set generation in RapidMiner is given below. This process takes its input using the Read ARFF operator. The input should be in binominal form; if it is not, we have to convert it into binominal form using the Nominal2Binominal operator.
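To make the binominal form concrete: each item becomes a column that holds true or false for every transaction. The following plain-Python sketch shows the idea on three hypothetical baskets. It is an analogy to the conversion step, not RapidMiner's operator, and the function name `to_binominal` is an assumption.

```python
def to_binominal(transactions):
    """Convert item lists into a boolean table with one column per item."""
    items = sorted({item for t in transactions for item in t})
    rows = [[item in t for item in items] for t in transactions]
    return items, rows

# Hypothetical baskets.
baskets = [{"bread", "milk"}, {"milk"}, {"bread", "fruit"}]
columns, table = to_binominal(baskets)

print(columns)
for row in table:
    print(row)
```

Each row of the resulting table has exactly one true/false value per item, which is the two-valued attribute form described above.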
6.1 INTRODUCTION
Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the available techniques have much in common with the techniques used in the more traditional methods
The flexibility of classification trees makes them a very attractive analysis option. Classification trees readily lend themselves to being displayed graphically, helping to make them easier to interpret. Classification trees can be and sometimes are quite complex. However, graphical procedures can be developed to help simplify interpretation even for complex trees.
6.2 OBJECTIVE
Develop a classification system using the census bureau dataset taken from www.census.gov. The prediction task is to determine whether a person makes over or below 50K a year.
C4.5 Algorithm.
C4.5 builds decision trees from a set of training data using the concept of information entropy. At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sub lists.
3. Let a_best be the attribute with the highest normalized information gain
4. Create a decision node that splits on a_best
5. Recur on the sub lists obtained by splitting on a_best, and add those nodes as children of node
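The attribute-selection step can be sketched as follows. Normalized information gain is computed here as information gain divided by the split information of the attribute (the gain ratio). This is a simplified illustration for categorical attributes only; C4.5's handling of numeric thresholds, missing values and pruning is not shown. The function names, the four toy records and the attribute names `marital` and `sex` are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on `attr`, normalized by split information."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many small partitions.
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical records in the style of the census task.
rows = [
    {"marital": "Married", "sex": "M"},
    {"marital": "Married", "sex": "F"},
    {"marital": "Single", "sex": "M"},
    {"marital": "Single", "sex": "F"},
]
labels = [">50K", ">50K", "<=50K", "<=50K"]

ratios = {attr: gain_ratio(rows, labels, attr) for attr in ("marital", "sex")}
a_best = max(ratios, key=ratios.get)
print(ratios, a_best)
```

In this toy data `marital` separates the two classes perfectly and `sex` not at all, so `marital` is chosen as a_best.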
6.6 Process
Output: The output contains two things to analyze:
1. Decision tree
2. Confusion matrix
From the above tree (Fig 5.4), which we obtained through Data Miner, we can derive rules that can be applied to a given problem. The rules can be written in text form, and the number of rules is equal to the number of leaves in the tree. As there are nine leaves in the output tree, we can derive nine rules as follows:
1. if capital-gain > 7,073.500 then >50K (13 / 673)
2. if capital-gain <= 7,073.500 and marital-status = Divorced then <=50K (1958 / 168)
3. if capital-gain <= 7,073.500 and marital-status = Married-AF-spouse then <=50K (8 / 4)
4. if capital-gain <= 7,073.500 and marital-status = Married-civ-spouse then <=50K (4077 / 2743)
5. if capital-gain <= 7,073.500 and marital-status = Married-spouse-absent then <=50K (193 / 10)
6. if capital-gain <= 7,073.500 and marital-status = Never-married then <=50K (5007 / 182)
7. if capital-gain <= 7,073.500 and marital-status = Separated and education-num > 13.500 then >50K (9 / 12)
8. if capital-gain <= 7,073.500 and marital-status = Separated and education-num <= 13.500 then <=50K (452 / 18)
9. if capital-gain <= 7,073.500 and marital-status = Widowed then <=50K (448 / 25)
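The nine rules can be applied to a new record with a small function. The sketch below reads rules 2 to 9 as covering capital-gain at or below 7,073.5 (the complement of rule 1) and rule 8 as covering education-num at or below 13.5 (the complement of rule 7). The function name `predict_income` and its argument names are hypothetical; unknown marital-status values raise an error rather than being guessed.

```python
# Marital-status values whose leaf predicts <=50K regardless of education-num.
LOW_INCOME_STATUSES = {
    "Divorced", "Married-AF-spouse", "Married-civ-spouse",
    "Married-spouse-absent", "Never-married", "Widowed",
}

def predict_income(capital_gain, marital_status, education_num):
    """Apply the nine leaf rules of the tree to one record."""
    if capital_gain > 7073.5:                      # rule 1
        return ">50K"
    if marital_status in LOW_INCOME_STATUSES:      # rules 2-6 and 9
        return "<=50K"
    if marital_status == "Separated":              # rules 7 and 8
        return ">50K" if education_num > 13.5 else "<=50K"
    raise ValueError(f"no rule covers marital-status {marital_status!r}")

print(predict_income(8000, "Never-married", 10))
print(predict_income(0, "Separated", 14))
print(predict_income(0, "Married-civ-spouse", 16))
```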
Fig 5.5 Confusion Matrix for Adult Dataset
As the confusion matrix in Fig 5.5 shows, 12143 instances are predicted as class <=50K and do belong to that class, while 3150 instances are predicted as class <=50K but actually belong to class >50K. In the second line, 22 instances that belong to class <=50K are wrongly predicted as class >50K, and 685 instances are correctly classified as class >50K.
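The four counts reported for the confusion matrix are enough to derive the usual summary measures. The sketch below uses only those counts; the dictionary layout keyed by (predicted, actual) and the function names are assumptions made for illustration.

```python
# Counts as reported for Fig 5.5, keyed by (predicted class, actual class).
confusion = {
    ("<=50K", "<=50K"): 12143,
    ("<=50K", ">50K"): 3150,
    (">50K", "<=50K"): 22,
    (">50K", ">50K"): 685,
}

def accuracy(cm):
    """Share of all instances whose predicted class equals the actual class."""
    correct = sum(n for (pred, actual), n in cm.items() if pred == actual)
    return correct / sum(cm.values())

def precision(cm, label):
    """Share of instances predicted as `label` that really are `label`."""
    predicted = sum(n for (pred, _), n in cm.items() if pred == label)
    return cm[(label, label)] / predicted

def recall(cm, label):
    """Share of instances that really are `label` that were predicted as `label`."""
    actual_total = sum(n for (_, actual), n in cm.items() if actual == label)
    return cm[(label, label)] / actual_total

total = sum(confusion.values())
print("instances:", total)
print("accuracy:", round(accuracy(confusion), 4))
for label in ("<=50K", ">50K"):
    print(label, "precision:", round(precision(confusion, label), 4),
          "recall:", round(recall(confusion, label), 4))
```

These counts give an overall accuracy of about 80% on 16000 instances, but a recall of only about 18% for the >50K class: the tree rarely mislabels a low earner, yet it misses most high earners.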
We can visualize this confusion matrix output using a scatter plot of actual class against predicted class.
The rules can be applied to design new marketing campaigns and promotional schemes and to find valuable customers, which helps a business categorize its customers and provide the type of service needed for each class of customers.