A predictive analysis for real-time data through data mining techniques


Chapter 1 Introduction
The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases: information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth. Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data.

Forecasting demand is both a science and an art. Needless to say, forecasting is an important element of decision-making support. The common theme running through decision-making is selection among alternatives, and forecasting is indispensable for making that selection well. This is evident from the fact that forecasting is positioned as a core method of the Decision Support Systems (DSS) developed to date.

Today, intense industrial activity and daily life's heavy dependence on electric power demand an ever-greater and more stable power supply. Highly accurate prediction of electric power demand plays a decisive role in meeting these requirements. This first chapter of the report gives a brief overview of the terms used repeatedly in data mining.


1.1 OVERVIEW
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

1.2 Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

Operational or transactional data, such as sales, cost, inventory, payroll, and accounting.

Non-operational data, such as industry sales, forecast data, and macroeconomic data.

Meta data: data about the data itself, such as logical database design or data dictionary definitions.

1.3 Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.


1.4 Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behaviour. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

There are two main kinds of models in data mining: predictive and descriptive. Predictive models can be used to forecast explicit values, based on patterns determined from known results. For example, from a database of customers who have already responded to a particular offer, a model can be built that predicts which prospects are likeliest to respond to the same offer. Descriptive models describe patterns in existing data and are generally used to create meaningful subgroups such as demographic clusters. Descriptive models, that is, unsupervised learning functions, do not predict a target value but focus on the intrinsic structure, relations, and interconnectedness of the data.

1.5 Predictive modeling


Predictive modeling is used when the goal is to estimate the value of a particular target attribute and there exist sample training data for which values of that attribute are known. An example is classification, which takes a set of data already divided into predefined groups and searches for patterns in the data that differentiate those groups. These discovered patterns then can be used to classify other data where the right group designation for the target attribute is unknown (though other attributes may be known). For instance, a manufacturer could develop a predictive model that distinguishes parts that fail under extreme heat, extreme cold, or other conditions based on their manufacturing environment, and this model may then be used to determine appropriate applications for each part.
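As a toy illustration of this idea, the part-failure example might be sketched as a one-attribute classification rule learned from labelled data (the temperatures and the simple threshold rule below are invented for illustration):

```python
# Hypothetical training data for the part-failure example:
# (curing temperature, whether the part later failed under extreme heat).
training = [
    (180, False), (185, False), (190, False),
    (205, True), (210, True), (215, True),
]

def learn_threshold(data):
    """Learn a one-attribute classification rule: the midpoint between the
    hottest non-failing part and the coolest failing part."""
    highest_ok = max(temp for temp, failed in data if not failed)
    lowest_bad = min(temp for temp, failed in data if failed)
    return (highest_ok + lowest_bad) / 2

def predict_failure(threshold, temp):
    """Classify a new part whose failure status is unknown."""
    return temp > threshold

threshold = learn_threshold(training)   # 197.5 for the sample data above
```

A real classifier (a decision tree, say) would consider many attributes at once, but the pattern is the same: learn from records with known group labels, then apply the learned rule to unlabelled records.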


1.6 Descriptive modeling


Descriptive modeling, or clustering, also divides data into groups. With clustering, however, the proper groups are not known in advance; the patterns discovered by analyzing the data are used to determine the groups. For example, an advertiser could analyze a general population in order to classify potential customers into different clusters and then develop separate advertising campaigns targeted to each group. Fraud detection also makes use of clustering to identify groups of individuals with similar purchasing patterns.
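A minimal sketch of the clustering idea, using one-dimensional k-means on invented customer ages (the seed centroids are fixed here so the run is deterministic; real clustering tools choose them automatically and handle many dimensions):

```python
def kmeans_1d(values, centroids, iterations=10):
    """Toy k-means on one-dimensional data: repeatedly assign each value to
    its nearest centroid, then move each centroid to its cluster's mean."""
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for v in values:
            nearest = min(centroids, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        centroids = [sum(members) / len(members)
                     for members in clusters.values() if members]
    return sorted(centroids)

# Hypothetical customer ages: two natural groups, young and middle-aged.
ages = [18, 19, 21, 45, 48, 50]
segments = kmeans_1d(ages, centroids=[18, 50])
```

The two returned centroids land near the middle of each age group, and each customer can then be assigned to the segment whose centroid is closest, without any group labels having been supplied in advance.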

1.7 How does data mining work?


While large-scale information technology has been evolving separate transactional and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.


Sequential patterns: Data is mined to anticipate behaviour patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology professionals.

Analyze the data by application software.

Present the data in a useful format, such as a graph or table.

1.8 Different levels of analysis are available

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.


Rule induction: The extraction of useful if-then rules from data based on statistical significance.
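The nearest neighbor method listed above can be sketched in a few lines of Python (the customer records and labels below are invented for illustration):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k most similar training
    records; `train` is a list of ((feature, ...), label) pairs."""
    def sq_dist(a, b):
        # Squared Euclidean distance; the square root is not needed for ranking.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda record: sq_dist(record[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical historical records: two feature values per customer,
# labelled by whether the customer responded to an offer.
history = [((0, 0), "non-responder"), ((1, 0), "non-responder"),
           ((0, 1), "non-responder"), ((5, 5), "responder"),
           ((6, 5), "responder"), ((5, 6), "responder")]
```

A new record near (0, 0) is voted "non-responder" by its three nearest neighbours, while one near (5, 5) is voted "responder".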

1.9 Predictive models


Predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior in the future in order to improve marketing effectiveness. This category also encompasses models that seek out subtle data patterns to answer questions about customer performance, such as fraud detection models.

1.9.1 Predictive Models Methodology

1. Project Definition: Define the business objectives and desired outcomes for the project and translate them into predictive analytic objectives and tasks.

2. Exploration: Analyze source data to determine the most appropriate data and model-building approach, and scope the effort.

3. Data Preparation: Select, extract, and transform the data upon which to create models.

4. Model Building: Create, test, and validate models, and evaluate whether they will meet project metrics and goals.

5. Deployment: Apply model results to business decisions or processes. This ranges from sharing insights with business users to embedding models into applications to automate decisions and business processes.


1.10 Barriers to Usage

A host of barriers can prevent organizations from venturing into the domain of predictive analytics or impede their growth. This analytics bottleneck arises from:

1. Complexity. Developing sophisticated models has traditionally been a slow, iterative, and labour-intensive process.

2. Data. Most corporate data is full of errors and inconsistencies, but most predictive models require clean, scrubbed, expertly formatted data to work.

3. Processing Expense. Complex analytical queries and scoring processes can clog networks and bog down database performance, especially when performed on the desktop.

4. Expertise. Qualified business analysts who can create sophisticated models are hard to find, expensive to pay, and difficult to retain.

5. Interoperability. The process of creating and deploying predictive models traditionally involves accessing or moving data and models among multiple machines, operating platforms, and applications, which requires interoperable software.

6. Pricing. The price of most predictive analytic software, and of the hardware to run it on, is beyond the reach of most midsize organizations or departments in large organizations.


1.11 Applications of predictive analysis.


Although predictive analytics can be put to use in many applications, we outline a few examples where predictive analytics has shown positive impact in recent years.

i) Analytical customer relationship management (CRM)

ii) Clinical decision support systems

Clinical decision support systems link health observations with health knowledge to influence health choices by clinicians for improved health care.

iii) Cross-sell

For an organization that offers multiple products, an analysis of existing customer behaviour can lead to efficient cross-selling of products. Predictive analytics can help analyze customers' spending, usage, and other behaviour, and help cross-sell the right product at the right time.

iv) Customer retention

By frequently examining a customer's past service usage, service performance, spending, and other behaviour patterns, predictive models can determine the likelihood of a customer wanting to terminate service in the near future.

v) Direct marketing

Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of product versions, marketing material, communication channels and timing that should


be used to target a given consumer. The goal of predictive analytics is typically to lower the cost per order or cost per action.

vi) Fraud detection

Fraud is a big problem for many businesses and can take various forms. Inaccurate credit applications, fraudulent transactions, identity theft, and false insurance claims are some examples of this problem. These problems plague firms across the spectrum; some examples of likely victims are credit card issuers, insurance companies, retail merchants, manufacturers, business-to-business suppliers, and even service providers. This is an area where a predictive model is often used to help weed out the bad cases and reduce a business's exposure to fraud.


Chapter 2 Objectives

1. Upgrade an e-science infrastructure to support collaborative, data mining enabled experimental research.

2. Develop a knowledge-driven data mining assistant to support researchers in data-intensive, knowledge-rich domains.

3. Design and implement mechanisms for meta-mining the knowledge discovery process.

4. Demonstrate e-LICO on a systems biology approach to disease studies


Chapter 3 Company Profile


3.1 About Veena Industries Ltd.
Veena Industries Ltd., the flagship company of the Pune-based Agarwal Group, was established in 1980 by two technocrat entrepreneurs, Mr. Shailendra N. Agarwal and Mr. Brijendra N. Agarwal. Veena Industries has pioneered the manufacturing of silent gensets and engineered fabricated components, such as canopies for gensets and industrial equipment, under-carriages, and structural elements for off-highway vehicles, for over 30 years. Veena Industries Limited has been a closely family-held public limited company. Traditionally, VIL has been an owner-driven, professionally managed organisation. Veena Industries commissioned its first plant at MIDC, Bhosari, Pune, India to manufacture trailers for Mahindra Owel and base-frames and under-carriages for Atlas Copco India Ltd.

3.2 Introduction
Under Veena Industries Ltd., the Agarwal Group started a new software solutions company on 12th July 2012. Headquartered in Chakan, Pune, it delivers software solutions that best meet the requirements of a global clientele. Its software successfully supports clients in achieving their business goals in the most cost-effective way. The company can do this because it invests quality time in understanding, analyzing, and sometimes even discovering client requirements at an early stage. Its ability to choose suitable in-house resources for projects, its capacity to scale in terms of resources and domain knowledge to match project needs, and above all its knowledge of the industry translate a customer requirement into a successfully delivered software solution.


Veena Industries Ltd. offers the following services:

1. Offshore software development
2. Platform/data migration services
3. Digitization and conversion of documents
4. Product development

3.3 Why Veena Industries Ltd. Consulting?

The company solves its clients' most critical business problems using analytics: it maximizes predictive power while ensuring actionable use by addressing operational, legal, and data issues. It has deep expertise in working with data from multiple sources (e.g. demographic, application, master file, credit bureau) and at multiple levels of data aggregation (e.g. transaction, account, customer, portfolio, enterprise).

3.4 MISSION & VISION

Their mission is to provide customers with innovative solutions and progressive technology that help improve customers' working methods and profitability.


3.5 COMPANY HISTORY

2012

Started the software solutions company on 12th July 2012.

2011

Integrated heavy fabrication plant set up at Chakan.

2010

Consolidation of manufacturing facilities from Silvasa and Santoshnagar to Chakan

2009

Mecc Alte plant was set up at Sanaswadi, Pune and is operational since 25th January 2009

2007

Flair Technologies Limited, UK was incorporated as a 100% subsidiary of Veena Industries Ltd.

Flair Technologies is the sales, marketing & distribution arm operating on the business principle of JIT deliveries to our global customers.

Veena Industries Private Limited became a public limited company

2006

Veena Industries Private Limited formed a Joint Venture with Mecc Alte, UK for manufacture of Alternator assemblies.

2002

Branch I and Branch II were ISO 9001:2000 certified by DNV in June 2002

1980

Veena Industries commissioned its first plant at MIDC, Bhosari, Pune, India to manufacture trailers for Mahindra Owel and base-frames & under-carriages for Atlas Copco India Ltd


3.6 Branches of Veena Industries Ltd.


Branch 1 - Bhosari, Pune
Branch 2 - Chakan, Pune
Branch 3 - Waki, Pune
Branch 4 - Samba, Jammu
Branch 5 - (EOU) Chakan, Pune
Branch 6 - Pithampur, Indore
Mecc Alte India Limited - Sanaswadi, Pune (JV)

3.7 Organisational Chart of the Company

Board of Directors:
Shailendra N. Agarwal, Managing Director
Brijendra N. Agarwal, Chairman
Atin Agarwal, Director
Avinash Agarwal, Director

3.8 Management Team & Organization of the Company

Mr. Raman Tandon - GM, Operations Division
Mr. Milind Vyas - Head, Procurement and IT
Mr. S. Srinivas - GM, Business Development


Various Products

1. Proclaim
2. Makros
3. Universal Desk
4. Acuity

Veena Industries Ltd. provides powerful custom analytics and analytic consulting that are considered the gold standard in the industries they serve.

What Veena Industries Ltd. aspires to be!


1) To have the leading edge in the use of Software technology.


Chapter 4

PROJECT WORK UNDERTAKEN


4.1 WHY THIS PROJECT?
The study that I conducted provided me an excellent opportunity to implement all that I learnt in my classroom sessions in a practical, real-world setting. I am doing my project on this topic because it will help me to know more about data mining. This project will help the organization determine to what extent its IT sector's activities were successful in delivering software solutions. Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases, and it is very helpful for finding the data that is hidden or stored in the system. This project will also help the company understand what customers look for in a software service portal. My project will help the company know more about its strengths, prospects, etc.

4.2 SCOPE OF THE PROJECT


The project includes techniques of data mining, which are most important for collecting data that is useful to:

1. Predict cross-sell opportunities and make recommendations.

2. Predict what each individual accessing a Web site is most likely interested in seeing.


4.3 HOW IS IT USEFUL FOR THE COMPANY?


Data mining is useful for identifying the need for software solutions, identifying the best prospects and retaining them as customers, and learning which parameters influence trends in sales and margins.

4.4 IMPORTANT ASPECTS OF THE PROJECT

As an MBA student, the project provided me an excellent opportunity to implement all that I learnt in my classroom sessions in a practical, real-world setting. It helped me to know more about data mining, data mailing, software, and websites, and also about software development. The headquarters of the company was established in Pune a few months earlier. At the beginning of this year, the management realized the importance of software solutions and the need for software companies, and the company recruited a new team and interns accordingly. I was recruited as an intern to work on the data mining process.

4.5 PLACE OF THE PROJECT


Gat No 309 Plot No C/3, Nanekarwadi, Chakan, Pune 410501.


Chapter 5. RESEARCH METHODOLOGY


This chapter presents a methodology known as association analysis, which is useful for discovering interesting relationships hidden in large datasets. The uncovered relationships can be represented as association rules or sets of frequent item sets.

5.1 INTRODUCTION
In data mining, association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and presenting strong rules discovered in databases using different measures of interestingness.

Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur together frequently in a given dataset. A typical and widely used example of association rule mining is market basket analysis.

For example, data are collected using bar-code scanners in supermarkets. Such market basket databases consist of a large number of transaction records. Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together. They could use this data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design and to identify customer segments based on buying patterns. Association rules provide information of this type in the form of "if-then" statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.


5.2 Statistical terms


In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule. In association analysis the antecedent and consequent are sets of items (called item sets) that are disjoint (do not have any items in common).

The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. The support is sometimes expressed as a percentage of the total number of records in the database.

The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent.

Lift is one more parameter of interest in the association analysis. Lift is nothing but the ratio of Confidence to Expected Confidence. Expected Confidence in this case means, using the above example, "confidence, if buying A and B does not enhance the probability of buying C." It is the number of transactions that include the consequent divided by the total number of transactions.
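These three measures can be computed directly from a toy transaction database (the baskets below are invented; the beer-and-diapers items echo the classic example):

```python
# Toy transaction database (hypothetical market baskets).
transactions = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "chips"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by expected confidence, i.e. support(consequent)."""
    return confidence(antecedent, consequent) / support(consequent)
```

For the rule {beer} ⇒ {diapers} this gives support 2/5 = 0.4, confidence 0.4/0.6 ≈ 0.67, and lift ≈ 1.11: buying beer slightly raises the chance of buying diapers.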

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:

1. First, minimum support is applied to find all frequent item sets in a database.

2. Second, these frequent item sets and the minimum confidence constraint are used to form rules.

While the second step is straightforward, the first step needs more attention.


Finding all frequent item sets in a database is difficult since it involves searching all possible item sets (item combinations). The set of possible item sets is the power set over I and has size 2^n - 1 (excluding the empty set, which is not a valid item set). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that all subsets of a frequent item set are also frequent and, consequently, that all supersets of an infrequent item set are infrequent. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent item sets.

5.3 ASSOCIATION RULE


An association rule is an implication expression of the form X ⇒ Y, where X and Y are disjoint item sets, i.e., X ∩ Y = ∅. For a rule A ⇒ C:

Support(A ⇒ C) = Support(A ∪ C)

Confidence(A ⇒ C) = Support(A ∪ C) / Support(A)

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:

1. Frequent Item Set Generation, whose objective is to find all the item sets that satisfy the minimum support threshold. These item sets are called frequent item sets.

2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent item sets found in the previous step. These rules are called strong rules.

The computational requirements for frequent item set generation are generally more expensive than those of rule generation.


5.4 APRIORI PRINCIPLE


Apriori is Christian Borgelt's implementation of the well-known Apriori association rule algorithm. Apriori takes transactional data in the form of one row for each pair of transaction and item identifiers. It first generates frequent item sets and then creates association rules from these item sets. It can generate both association rules and frequent item sets.

If an item set is frequent, then all of its subsets must also be frequent. To illustrate the idea behind the Apriori principle, suppose {c, d, e} is a frequent item set. Clearly, any transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.

Apriori pseudo code:
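The level-wise loop the pseudocode describes can also be sketched directly in Python (an illustrative toy, not Borgelt's optimized implementation): generate size-k candidates from frequent (k-1)-item sets, prune by the Apriori principle, then count support.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent item set mining over a list of transaction sets."""
    n = len(transactions)

    def sup(itemset):
        # Fraction of transactions containing every item of `itemset`.
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    # L1: frequent 1-item sets.
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    frequent = set(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if sup(c) >= min_support}
        frequent |= level
        k += 1
    return frequent
```

With five toy baskets and a 60% support threshold, only the singletons survive; lowering the threshold to 40% admits the pairwise combinations as well.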


The Apriori heuristic achieves good performance by (possibly significantly) reducing the size of candidate sets. However, in situations with a large number of frequent patterns, long patterns, or quite low minimum support thresholds, an Apriori-like algorithm may suffer from the following two nontrivial costs:

It is costly to handle a huge number of candidate sets. For example, if there are 10^4 frequent 1-item sets, the Apriori algorithm will need to generate more than 10^7 length-2 candidates and accumulate and test their occurrence frequencies. Moreover, to discover a frequent pattern of size 100, such as {a1, . . . , a100}, it must generate 2^100 ≈ 10^30 candidates in total. This is the inherent cost of candidate generation, no matter what implementation technique is applied.

It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching, which is especially true for mining long patterns.

5.4.1 FP-GROWTH.
FP-growth is an algorithm for generating frequent item sets for association rules, from Jiawei Han's research group at Simon Fraser University. It generates all frequent item sets satisfying a given minimum support by growing a frequent pattern tree structure that stores compressed information about the frequent patterns. In this way, FP-growth can avoid repeated database scans and also avoid the generation of a large number of candidate item sets.


There are several advantages of FP-growth over other approaches:

i) It constructs a highly compact FP-tree, which is usually substantially smaller than the original database and thus saves costly database scans in the subsequent mining processes.

ii) It applies a pattern-growth method which avoids costly candidate generation and testing by successively concatenating the frequent 1-item sets found in the (conditional) FP-trees. This ensures that it never generates any new candidate sets that are not in the database, because the item set in any transaction is always encoded in the corresponding path of the FP-trees.
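The compression that advantage (i) refers to can be demonstrated with a tiny FP-tree builder (a simplified sketch: it builds only the prefix tree, omitting the header table and the mining phase of the real FP-growth algorithm):

```python
class FPNode:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Build the compact prefix tree that FP-growth mines: frequent items in
    each transaction are sorted by global frequency and inserted as a path,
    so transactions sharing a prefix share nodes."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    for t in transactions:
        # Most frequent items first, so common prefixes are shared.
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

def tree_size(node):
    """Number of nodes below `node` (the root itself is excluded)."""
    return sum(1 + tree_size(child) for child in node.children.values())
```

For the baskets {a, b}, {a, b, c}, {a, c} with a minimum count of 2, the tree stores seven item occurrences in only four nodes, because all three transactions share the prefix a.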


5.4.2 OBJECTIVE
Develop a recommender system with the help of supermarket transactional data.

5.4.3 Data Set Information


Data is created in a supermarket by recording the daily transactions of sold items. The dataset was taken from a supermarket located in New Zealand about 10 years ago.

5.4.4 Attribute Information


The dataset contains 217 attributes and 4,627 instances. The attributes are the names of the items in the store. For each transaction, a record of 217 attributes is generated in which each sold item is recorded as True. Not all attributes are given names.

5.4.5 Data Pre-processing


The dataset is in the ARFF file format, which is suited to most mining tools, so there is very little need for pre-processing. To convert it into CSV format, we only have to remove the @attribute and @data tags present in the ARFF file.

Fig. 4.1 ARFF file of supermarket dataset
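The tag-stripping step described above might look like the following (a naive sketch: it ignores quoting rules and the sparse ARFF variant that the real format allows):

```python
def arff_to_csv(arff_text):
    """Naive ARFF-to-CSV conversion: collect attribute names from @attribute
    lines and copy the rows that follow @data."""
    header, rows, in_data = [], [], False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        lowered = line.lower()
        if lowered.startswith("@attribute"):
            header.append(line.split()[1].strip("'\""))
        elif lowered.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line)
    return "\n".join([",".join(header)] + rows)
```

Weka itself (and most mining tools) can do this conversion directly; the sketch only shows why so little pre-processing is needed here.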


Fig. 4.2 Supermarket data view in table format

As we can see, the data is highly sparse and represents binary values. For the mining process, besides the input data, a minimum support threshold value is needed. One of the key issues is the value to which the support threshold should be set. The right answer can be found only through user interaction and many iterations until appropriate values have been found. For this reason, namely that user interaction is needed in this phase of the mining process, it is advisable to execute the frequent pattern discovery algorithm iteratively on a relatively small part of the whole dataset. By choosing the right size of sample data, the response time of the application remains small while the sample still represents the whole data accurately. Setting the minimum support threshold parameter is not a trivial task, and it requires a lot of practice and attention on the part of the user. The frequent web access patterns are written in a text file along with the


sessions in which they are accessed, the days on which they are accessed, and the correct order sequence. Some of the rules generated by Apriori, given a minimum support of 10% and a minimum confidence level of 90%, are listed below; the figure below shows the output of the Weka association analysis on the supermarket dataset.

Output
Output of the Weka association analysis on the supermarket dataset is given below.

Figure 4.3 Weka output for the Apriori algorithm on the supermarket dataset

As we can see in Fig. 4.3, the algorithm has generated the 10 best rules found according to the given constraints; here are the rules.


1. biscuits=t frozen foods=t pet foods=t milk-cream=t vegetables=t 516 ==> bread and cake=t 475 conf:(0.92)
2. baking needs=t biscuits=t milk-cream=t margarine=t fruit=t vegetables=t 505 ==> bread and cake=t 464 conf:(0.92)
3. biscuits=t frozen foods=t milk-cream=t margarine=t vegetables=t 585 ==> bread and cake=t 537 conf:(0.92)
4. biscuits=t canned vegetables=t frozen foods=t fruit=t vegetables=t 536 ==> bread and cake=t 492 conf:(0.92)
5. baking needs=t frozen foods=t milk-cream=t margarine=t fruit=t vegetables=t 517 ==> bread and cake=t 474 conf:(0.92)
6. biscuits=t frozen foods=t pet foods=t milk-cream=t fruit=t 511 ==> bread and cake=t 468 conf:(0.92)
7. biscuits=t frozen foods=t tissues-paper prd=t milk-cream=t vegetables=t 575 ==> bread and cake=t 526 conf:(0.91)
8. biscuits=t frozen foods=t beef=t fruit=t vegetables=t 536 ==> bread and cake=t 490 conf:(0.91)
9. baking needs=t biscuits=t frozen foods=t cheese=t fruit=t 538 ==> bread and cake=t 491 conf:(0.91)
10. biscuits=t frozen foods=t milk-cream=t margarine=t fruit=t 592 ==> bread and cake=t 540 conf:(0.91)


5.4.6 Output Interpretation


Interpreting the rules is also important, as a rule may reveal a new opportunity for a store owner to increase sales. Taking the first rule given here, it shows that 92% of the customers who bought biscuits, frozen foods, pet foods, milk-cream, and vegetables also bought bread and cake. The store owner can use this information to boost sales by placing these items together on a shelf, or by placing them at opposite corners of the store so that customers have to walk the aisle past the other products. Another option is to discount a slow-selling product alongside the above group, increasing turnover and reducing the time the product spends on the shelf.
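The 92% figure in the rules above is simply the rule's confidence: the count of baskets containing both the antecedent and the consequent, divided by the antecedent count. Checking rule 1 from the Weka output:

```python
# Rule 1: 516 baskets matched the antecedent (biscuits, frozen foods,
# pet foods, milk-cream, vegetables); 475 of those also contained
# "bread and cake". Confidence = consequent count / antecedent count.
antecedent_count = 516
rule_count = 475
confidence = rule_count / antecedent_count
print(round(confidence, 2))  # → 0.92
```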

5.5 FREQUENT ITEM SET GENERATION USING FP-GROWTH


The FP-Growth operator in RapidMiner calculates all frequent item sets from a data set by building an FP-tree data structure on the transaction database. This is a very compressed copy of the data, which in many cases fits into main memory even for large databases. All frequent item sets are then derived from this FP-tree.

A major advantage of FP-Growth compared to Apriori is that it needs only two scans of the data and is therefore often applicable even to large data sets. The given data set is only allowed to contain binominal attributes, i.e. nominal attributes with only two different values.


5.5.1 Process
The process for frequent item set generation in RapidMiner is given below. This process takes its input via the Read ARFF operator. The input should be in binominal form; if it is not, we have to convert it using the Nominal2Binominal operator.
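The nominal-to-binominal conversion mentioned above amounts to one-hot encoding each nominal value into a true/false attribute. The sketch below mirrors that idea in plain Python on a hypothetical mini-dataset; it is an illustration, not RapidMiner's actual implementation.

```python
def nominal_to_binominal(rows, attribute):
    """One-hot a nominal attribute into one true/false column per value,
    similar in spirit to RapidMiner's Nominal to Binominal conversion."""
    values = sorted({row[attribute] for row in rows})
    out = []
    for row in rows:
        converted = {k: v for k, v in row.items() if k != attribute}
        for value in values:
            converted[f"{attribute} = {value}"] = (row[attribute] == value)
        out.append(converted)
    return out

# Hypothetical rows with a three-valued nominal attribute.
rows = [{"payment": "cash"}, {"payment": "card"}, {"payment": "voucher"}]
print(nominal_to_binominal(rows, "payment")[0])
```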

Fig. 4.4 Frequent item set generation process in RapidMiner

5.5.2 Output
The output of FP-Growth is given below.

Fig. 4.5 Frequent item sets given by FP-Growth


5.5.3 Output Interpretation


The most frequent item sets are given in the output with their support level. In the first line of the output, item1, item2, and item3 are bread and cake, fruit, and vegetables, and the support for this item set across all transactions is 38.7%. Using this, the store owner can arrange these items together on a shelf or at different places in the store, or give discounts on this item set so that overall sales increase. Items with a low support level and a long shelf life can be clubbed together with a high-support item; offering an attractive discount on the bundle will increase sales and reduce the time the items spend on the shelf. From this output many rules can be generated and applied in real scenarios, boosting the overall profit of the business.


Chapter 6 ANALYSIS AND INTERPRETATION OF DATA


This section of the report gives information about classification trees and how they are used to generate rules for prediction in real-world scenarios. Classification tree analysis is one of the main techniques used in data mining.

6.1 INTRODUCTION
Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the available techniques have much in common with those used in more traditional methods.

The flexibility of classification trees makes them a very attractive analysis option. Classification trees readily lend themselves to being displayed graphically, helping to make them easier to interpret. Classification trees can be and sometimes are quite complex. However, graphical procedures can be developed to help simplify interpretation even for complex trees.

6.2 OBJECTIVE
Develop a classification system using the census bureau dataset taken from www.census.gov. The prediction task is to determine whether a person makes over or under 50K a year.


6.3 Data Set Information


This data was extracted from the census bureau database found at www.census.gov. It consists of various pieces of information about people.

Fig. 5.1 Adult Dataset information

6.4 Attribute Information


The dataset contains 15 attributes describing characteristics of a person, and 16000 instances, each representing one person with values for the 15 attributes. Persons are identified by numbers. Of the 15 features in the original Adult data set, six are continuous and nine are categorical.

Fig 5.2 Adult Dataset in table format


6.4 Data Pre-processing.


The data set obtained was in a simple text format in which the instance values are separated by commas. To do classification on this data we have to convert it into the required form: first clean the dataset by removing unnecessary information, then convert the data into the format needed by the mining algorithms.
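The cleaning step can be sketched as follows. This is a hypothetical cleaner, not the exact procedure used in the report: it strips stray whitespace around the comma-separated fields and drops rows containing the "?" marker that the census extract uses for missing values.

```python
def clean_rows(lines):
    """Strip whitespace around comma-separated values and drop rows that
    contain the '?' missing-value marker used in the census extract."""
    cleaned = []
    for line in lines:
        fields = [f.strip() for f in line.strip().rstrip('.').split(',')]
        if not line.strip() or '?' in fields:
            continue
        cleaned.append(fields)
    return cleaned

# Hypothetical raw lines in the style of the Adult dataset.
raw = [
    "39, State-gov, Bachelors, <=50K",
    "50, ?, Doctorate, >50K",
    "38, Private, HS-grad, <=50K",
]
print(len(clean_rows(raw)))  # → 2
```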

6.5 Data Analysis


The Decision Tree operator in RapidMiner learns decision trees from both nominal and numerical data. Decision trees are powerful classification methods which can often also be easily understood. To classify an example, the tree is traversed top-down. Every inner node in a decision tree is labeled with an attribute, and the example's value for this attribute determines which of the outgoing edges is taken. For nominal attributes there is one outgoing edge per possible attribute value; for numerical attributes the outgoing edges are labeled with disjoint ranges. This decision tree learner works similarly to Quinlan's C4.5 or CART.

C4.5 Algorithm.
C4.5 builds decision trees from a set of training data using the concept of information entropy. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision, and the algorithm then recurses on the smaller sublists.


In pseudo-code, the general algorithm for building decision trees is:

1. Check for base cases.
2. For each attribute a:
   2.1 Find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
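The normalized information gain in step 2.1 (C4.5's gain ratio) can be computed as below. This is a sketch on a hypothetical toy split, not the report's actual run: the gain from a candidate split is divided by the split's own entropy to penalize many-valued attributes.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, partitions):
    """C4.5's criterion: information gain divided by split information."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    gain = entropy(labels) - remainder
    # Split information: entropy of which partition each example fell into.
    split_info = entropy([i for i, p in enumerate(partitions) for _ in p])
    return gain / split_info if split_info else 0.0

# Toy example: 8 labels, split by a hypothetical binary attribute into a
# pure partition and a mixed one.
labels = ["<=50K"] * 5 + [">50K"] * 3
partitions = [["<=50K"] * 4, ["<=50K"] + [">50K"] * 3]
print(round(gain_ratio(labels, partitions), 3))  # → 0.549
```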

6.6 Process

Fig.5.3 Classification process using decision tree in Rapid Miner

Output: the output contains two things to analyze, as follows.
1. Decision tree
2. Confusion matrix


6.7 Decision tree

Fig. 5.4 Decision tree for adult dataset

From the tree in Fig 5.4, obtained through RapidMiner, we can derive rules that can be applied to the given problem. The rules can be written in text form, and the number of rules equals the number of leaves in the tree. Since there are nine leaves in the output tree, we can derive nine rules, as follows.

1. if capital-gain > 7,073.500 then >50K (13 / 673)
2. if capital-gain ≤ 7,073.500 and marital-status = Divorced then <=50K (1958 / 168)
3. if capital-gain ≤ 7,073.500 and marital-status = Married-AF-spouse then <=50K (8 / 4)
4. if capital-gain ≤ 7,073.500 and marital-status = Married-civ-spouse then <=50K (4077 / 2743)
5. if capital-gain ≤ 7,073.500 and marital-status = Married-spouse-absent then <=50K (193 / 10)
6. if capital-gain ≤ 7,073.500 and marital-status = Never-married then <=50K (5007 / 182)
7. if capital-gain ≤ 7,073.500 and marital-status = Separated and education-num > 13.500 then >50K (9 / 12)
8. if capital-gain ≤ 7,073.500 and marital-status = Separated and education-num ≤ 13.500 then <=50K (452 / 18)
9. if capital-gain ≤ 7,073.500 and marital-status = Widowed then <=50K (448 / 25)
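The nine rules from Fig 5.4 translate directly into a chain of conditionals. Below is a sketch in Python (not the report's tooling) encoding the tree's decisions, with the count annotations dropped:

```python
def predict_income(capital_gain, marital_status, education_num):
    """Apply the nine rules read off the decision tree in Fig 5.4."""
    if capital_gain > 7073.5:
        return ">50K"
    if marital_status == "Separated":
        return ">50K" if education_num > 13.5 else "<=50K"
    # Every remaining marital status leads to <=50K in the tree.
    return "<=50K"

print(predict_income(8000, "Divorced", 9))       # → >50K  (rule 1)
print(predict_income(500, "Separated", 14))      # → >50K  (rule 7)
print(predict_income(500, "Never-married", 10))  # → <=50K (rule 6)
```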

6.8 Confusion matrix


The confusion matrix gives the number or proportion of examples from one class classified into another (or the same) class. This way, one can observe which types of examples were misclassified in a certain way. One benefit of a confusion matrix is that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabeling one as another).

Fig 5.5 Confusion Matrix for Adult Dataset

As we can see in the confusion matrix of Fig 5.5, 12143 instances are predicted as class <=50K and actually belong to that class, while 3150 instances are predicted as class <=50K but actually belong to class >50K. In the second row, 22 instances which belong to class <=50K are wrongly predicted as class >50K, and 685 instances are correctly classified as class >50K.


6.9 Accuracy of the model


Using the above model for classification, 12828 of the 16000 instances are predicted correctly, so the overall accuracy of the classification is 80.18%. The class recall is 99.82% for the class <=50K and 17.86% for the class >50K.
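These figures follow directly from the four cells of the confusion matrix in Fig 5.5; recomputing them:

```python
# Cells of the confusion matrix in Fig 5.5 (predicted class vs. true class).
pred_low_true_low = 12143    # predicted <=50K, actually <=50K
pred_low_true_high = 3150    # predicted <=50K, actually >50K
pred_high_true_low = 22      # predicted >50K, actually <=50K
pred_high_true_high = 685    # predicted >50K, actually >50K

total = (pred_low_true_low + pred_low_true_high
         + pred_high_true_low + pred_high_true_high)
accuracy = (pred_low_true_low + pred_high_true_high) / total
recall_low = pred_low_true_low / (pred_low_true_low + pred_high_true_low)
recall_high = pred_high_true_high / (pred_high_true_high + pred_low_true_high)

print(total)       # → 16000
print(accuracy)    # 0.80175, i.e. the 80.18% quoted above
print(recall_low)  # ≈ 0.9982 for class <=50K
print(recall_high) # ≈ 0.1786 for class >50K
```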

We can visualize this confusion matrix output using a scatter plot of the actual class plotted against the predicted class.

Fig. 5.6 Scatter plot for classification result


6.9.1 Output interpretation


By carrying out the classification on the adult dataset using a decision tree algorithm we can conclude the following. Decision trees are simple to understand and interpret: people are able to understand decision tree models after a brief explanation.

The rules can be applied to form new marketing campaigns and promotional schemes and to find valuable customers, which will help a business categorize its customers and deliver the type of service best suited to each class of customer.


Chapter 7 FINDINGS & SUGGESTIONS


During this project I carried out different data mining techniques: classification, association, and time series analysis.

The classification technique used for prediction is simple to understand and interpret; people are able to understand decision tree models after a brief explanation. The rules generated by classification can be applied to form new marketing campaigns and promotional schemes and to find valuable customers, which will help a business categorize its customers and deliver the type of service best suited to each class of customer.

Association analysis is very good for finding hidden relationships in transactional data, which can be used in different sectors, such as market basket analysis in a retail store. It gives new insights into the data that can be used for the well-being of the business.

From the time series analysis of predicting short-term load on the electric feeder using simple moving averages and weighted moving averages, we can say that the two techniques have given promising results despite the inconsistencies in the data. With nearly the same accuracy, both methods predict the short-term load effectively. In summary, we believe that the proposed techniques are adaptable to short-term forecasting of feeder load and will give good results if used in managing the generation of electricity and the load management of the feeders.

In future, we can apply more advanced forecasting methods that will predict load more accurately. With the onset of inflation and rapidly rising energy prices, the emergence of alternative fuels and technologies (in energy supply and end-use), changes in lifestyles, institutional changes, and so on, it has become imperative to use modeling techniques which capture the effect of factors such as prices, income, population, technology, and other economic, demographic, policy, and technological variables.
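The two forecasting techniques mentioned in the findings can be sketched as below. The load values are hypothetical hourly feeder readings, not data from the report; the weighted variant gives the most recent observations the largest weights.

```python
def simple_moving_average(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

def weighted_moving_average(series, weights):
    """Forecast the next value with recent observations weighted most
    heavily; weights are given oldest-first and normalized here."""
    tail = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, tail)) / sum(weights)

# Hypothetical hourly feeder-load readings (MW).
load = [110, 120, 130, 140]
print(simple_moving_average(load, 3))            # → 130.0
print(weighted_moving_average(load, [1, 2, 3]))  # → ≈133.33
```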


BIBLIOGRAPHY
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann, San Francisco.
[1] http://www.sas.com/feature/analytics/102892_0107.pdf
[2] International Journal of Management and Decision Making
[3] http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[4] http://www.britannica.com/EBchecked/topic/1671124/predictive-modeling
[5] http://www.dynamicintegration.net/predictive_analytics.aspx
[6] http://en.wikipedia.org/wiki/Predictive_analytics
[7] http://www.cs.waikato.ac.nz/~ihw/papers/04-EF-etal-DataminingWEKA.pdf
