Wal-Mart Sales Forecasting

94-832: Business Intelligence & Data Mining SAS
TEAM 7
MITHUN MATHEW
MEAGAN MUSGRAVE
AKASH PATEL
RENU THOMAS
IVY YANG
Report
Team 7
Table of Contents
1 Introduction
2 Business Questions
2.1 Question One
2.2 Question Two
3 Description and Preparation of Data
3.1 Data Source
3.2 Data Sets Utilized
3.3 Data Preparation: Merging, Cleaning, and Transforming the Data
4 Exploratory Analysis
4.1 Top 10 Stores by Sales
4.2 Top 5 Departments across the stores
4.3 Sales vs CPI & Sales vs Fuel price
5 Unsupervised Learning: Clustering
5.1 Initial Results
5.2 Insight from Cluster A
5.3 Insight from Cluster B
5.4 Insight from Cluster C
5.5 Overall Insight
6 Supervised Learning: Regression
6.1 Linear Regression with Full Data
6.2 Linear Regression with Imputed and Transformed Full Data
6.3 Linear Regression with Filtered Data
6.4 Linear Regression with Normalized Data
7 Supervised Learning: Decision Tree
7.1 Two-way Split
7.2 Three-way Split
7.3 Two-way Split without DEPT and STORE
7.4 Decision Tree on Sampled Data
8 Time Series Analysis
8.1 Data Exploration
8.2 Hierarchical Clustering [6]
8.3 Sales Forecasting [7]
9 Business Implications
10 References
Appendix A
Appendix B
1 Introduction
This project was completed to fulfil the project requirement of the course 94-832: Business
Intelligence & Data Mining SAS. The core of our analysis is the Walmart data set obtained from Kaggle.
The data contain weekly sales for various departments within different stores over different periods of
time. Most of the work in the project revolves around staging and cleaning the data and modelling it
with different parameters and methodologies.
Using three methodologies (clustering, regression and decision trees), we generated different models
and noted their errors. We identified the variables of importance and drew insights from the clusters.
Time series analysis was used for hierarchical clustering on sales trends, which showed how the
clusters differ from one another. Time series forecasting was then used to predict sales for the 2012
end-of-year holiday season.
2 Business Questions
2.1 Question One
Retailers face many challenges when forecasting sales: the scale of the problem, erratic sales at each
individual store, seasonal changes, the constant introduction of new items, and repeated promotional
activity [1]. To address these issues, retailers have turned to large-scale demand forecasting that can
accommodate large amounts of transaction data. By collecting these data, retailers can mine them and
project future customer behavior. The ability to forecast on such a large scale gives retailers the
opportunity to optimize their revenue systems, enabling better choices on promotions and pricing. For
our project we take on this challenge and attempt to forecast sales at Walmart accurately. Given
Walmart's reputation for competitive pricing, the ability to project sales accurately is key to its
ability to function.
Research out of the University of Michigan recently affirmed that clustering prior to forecasting
sales greatly increases the accuracy of forecasts [2]. By clustering stores based on sales and on
attributes such as average temperature and fuel prices, stores can eliminate the need to control for
seasonal indices and classes (summer shoes versus winter shirts, etc.). After applying hierarchical
clustering to the data, we hope to determine which stores are similar in terms of both sales and store
attributes, so that we can ascertain which characteristics are key drivers of sales and thus generate
more accurate forecasts.
2.2 Question Two

Recent news reports have underscored the importance of getting an accurate forecast. In January of
2014, Walmart had several chains cut their forecasts due to the holiday season and profit-eating
discounts [3]. Toward the end of 2014, Walmart again acknowledged that it needs to do a better job at
forecasting in order to ensure that it keeps appropriate levels of inventory [4]. Given these recent
developments, it is clear that forecasting plays an integral role in a retailer's success. We will
address Walmart's challenge by leveraging sales data from 45 Walmart stores in different regions of the
United States. With these data we will be able to make predictions of department-wide sales at each of
the 45 stores. In addition to predicting department-wide sales, we will also attempt to understand the
impact of markdowns (price reductions) on holiday weeks. It is important to note, however, that while we
have department-wide sales data for each of the 45 stores, we will be modeling the effect of markdowns
without complete historical data. Overall, we hope to understand which attributes significantly impact
sales at the store level via regression, time series analysis, and decision tree models. These results
can then support an accurate prediction of 2012 sales, allowing us to determine the best time to hire
new employees.
3 Description and Preparation of Data

3.1 Data Source
The Walmart Store Sales data is published as a Walmart recruiting competition on Kaggle [5]. It covers
historical sales data for 45 Walmart stores in different regions of the United States from 2010-02-05 to
2012-11-01. There are three files in the data set: stores.csv, features.csv and train.csv.
3.2 Data Sets Utilized

stores.csv
This file describes three important features of 45 stores. Each store (1-45) is defined with a store type
(A-C) and a store size (numeric).
features.csv
This file describes additional information about each store for the given weeks. Each record contains 5
types of promotional markdowns for the given week. It also includes the average temperature, fuel
price, CPI and unemployment rate for the store's geographic region in that week. In addition, each
record indicates whether the week is a special holiday week.
train.csv
This is the main historical sales data for training. Each record represents weekly sales for a certain
department in a given store in a given week. It also carries the isHoliday field specifying whether
the week is a holiday week.
Based on preliminary analysis, we decided to use all the tables provided. Although we use the official
train data as our dataset, our business goals in this project are not restricted to sales prediction.
The next step focuses on data cleaning, merging and pre-processing.
3.3 Data Preparation: Merging, Cleaning, and Transforming the Data

To combine the three .csv files (train.csv, features.csv and stores.csv), the primary key and foreign
key (PK-FK) relations were identified. Before denormalizing the data, all the NA values in features.csv
were changed to NULL. The TRUE and FALSE values of the ISHOLIDAY attribute were changed to the binary
values 1 and 0, respectively.
The following statements generated the denormalized Walmart_Train dataset, which was used for the
remainder of the project.
Combining the Stores and Features tables as Store_Features:
CREATE TABLE Store_Features
AS
SELECT *
FROM
Stores JOIN Features USING(Store);
Combining the Store_Features and Train tables as Walmart_Train:

CREATE TABLE Walmart_Train
AS
SELECT *
FROM
Train JOIN Store_Features USING(Store, Week, IsHoliday);
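For readers working outside SQL, the two joins can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the team's code: the helper join_on and the sample rows are hypothetical, and the key names simply mirror the USING clauses.

```python
def join_on(left, right, keys):
    """Inner-join two lists of row dicts on the shared key columns."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined


# Tiny made-up rows, for illustration only (not values from the data set).
stores = [{"Store": 1, "Type": "A", "Size": 150000}]
features = [{"Store": 1, "Week": "2010-02-05", "IsHoliday": 0, "CPI": 211.1}]
train = [{"Store": 1, "Dept": 1, "Week": "2010-02-05", "IsHoliday": 0,
          "Weekly_Sales": 24924.5}]

# Same order as the SQL: Stores with Features first, then Train with the result.
store_features = join_on(stores, features, ["Store"])
walmart_train = join_on(train, store_features, ["Store", "Week", "IsHoliday"])
```

Each row of walmart_train then carries the store attributes, the weekly features and the sales figure, which is the denormalized shape the later sections rely on.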
For analytical purposes and visualization, the variables TEMPERATURE, FUEL_PRICE and WEEKLY_SALES
were categorized into the following classes. (Refer to Appendix A for the SQL queries.)
Condition                                 TEMP_CLASS
TEMPERATURE < 32                          Freezing
TEMPERATURE >= 32 AND TEMPERATURE < 64    Cold
...                                       Comfortable
...                                       Hot
TEMPERATURE > 95                          Extremely Hot
3-1: TEMP_CLASS
Condition                                 FUEL_CLASS
FUEL_PRICE < 2.75                         Low
FUEL_PRICE >= 2.75 AND FUEL_PRICE < 3.12  Medium
FUEL_PRICE > 3.12                         High
3-2: FUEL_CLASS
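The fuel price thresholds can be expressed as a small classification function. The sketch below is only an illustration of the binning rule; the function name fuel_class is hypothetical, and assigning a price of exactly 3.12 to High is an assumption, since the listed conditions leave that boundary value unassigned.

```python
def fuel_class(price):
    """Map a fuel price to the FUEL_CLASS label used in this report."""
    if price is None:
        return None  # keep missing values missing
    if price < 2.75:
        return "Low"
    if price < 3.12:
        return "Medium"
    # Assumption: the boundary value 3.12 itself is treated as High.
    return "High"


labels = [fuel_class(p) for p in (2.50, 2.75, 3.50)]
```

The same pattern applies to TEMP_CLASS and SALES_CLASS once their full threshold lists are taken from Appendix A.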
Condition                                     SALES_CLASS
WEEKLY_SALES <= 0                             Negative
WEEKLY_SALES > 0 AND WEEKLY_SALES <= 10000    Low
...                                           Medium
...                                           High
WEEKLY_SALES > 100000                         Very High
3-3: SALES_CLASS
To support richer visualizations, further categorical attributes were added, including HOLIDAY
(Super Bowl, Labor Day, Thanksgiving, Christmas). The two weeks before each holiday were labeled
accordingly (Before Super Bowl, Before Labor Day, Before Thanksgiving, Before Christmas).
Furthermore, unemployment and CPI were categorized into Low, Medium and High, and store size into
Small, Medium and Large. (Refer to Appendix B for the SQL queries.)
4 Exploratory Analysis
4.1 Top 10 Stores by Sales
4-1: Top 10 Stores
The above chart shows the top 10 stores by sales revenue and each store's percentage contribution to
the total sales generated between them. Store 20 was the highest contributor with a total of 301
million. The stores are a mix of 7 large and 3 medium-sized stores. Together, these 10 stores
accounted for 39% of the revenue generated by the 45 stores.
4.2 Top 5 Departments across the stores
4-2: Top 5 Departments
The above figure shows the top 5 departments across the 3 store types, A, B and C. Interestingly,
Department 72 showed a significant hike in sales in store types A and B. Store type A recorded the
most sales, whereas store type C recorded the least.
4-3: Top 10 Stores

4-4: Top 5 Departments
The above figure shows the pre-holiday sales registered by the 3 store types. Sales were highest
before Christmas, followed by pre-Thanksgiving, pre-Labor Day and pre-Super Bowl sales. Store type A
registered the highest sales, followed by store type B and store type C.
4.3 Sales vs CPI & Sales vs Fuel price
4-5: Sales vs CPI & Sales vs Fuel Price
No strong relationships were clear from visualizing the weekly sales data with respect to the CPI and the
fuel price during that week.
5 Unsupervised Learning: Clustering
5-1: Clustering Nodes
5.1 Initial Results

The clustering model uses all of the attributes in the data except weekly sales and the markdown
variables. We assign the store ID the segment role as the cluster variable and indicate that the model
should standardize the data. The centroid clustering method yields three unique clusters.
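Standardization matters here because attributes such as store size are on a far larger numeric scale than fuel price or unemployment. As a rough sketch of what standardizing one attribute involves, the function below computes z-scores. It is an assumption that a z-score transform with the population standard deviation matches what the clustering node does internally; the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant attribute carries no information for clustering.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up store sizes, for illustration only.
z_sizes = standardize([40000, 120000, 200000])
```

After this step every attribute has mean 0 and unit spread, so no single attribute dominates the distance to a cluster centroid.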
5-2: Clustering
Each of these clusters represents a group of stores that share similar values of each distinct attribute
that has been clustered around the store ID. Based on the initial results table, we can see that each
cluster has different averages across each attribute.
Comparing these averages via the input means plot allows us to draw conclusions about each individual
segment (see Sections 5.2-5.4).
5-3: Clusters
5.2 Insight from Cluster A

Cluster A contains the largest number of stores in this data set. Based on the input means plot
(above), this cluster of stores has experienced lower than average fuel prices and unemployment rates,
complemented by a higher than average consumer price index. We can also observe that stores in this
cluster are typically larger than the other stores. Overall, we might infer that Cluster A consists of
stores in richer, suburban regions, which would explain the high CPI and the low unemployment rate and
gas prices. However, because we do not have geographic information in this dataset, we are unable to
draw further conclusions. The chart below shows the importance of each attribute within this cluster:
5-4: Cluster A - Variable Importance
Per the Variable Importance graph, CPI, unemployment rate, and store size are the top three important
variables when considering this cluster of stores.
5.3 Insight from Cluster B

Cluster B, per the input means plot, has a higher than average unemployment rate and temperature,
but a lower than average fuel price, store size, and consumer price index. Again, because we do not
have geographic data for the stores, we are unable to make any further assumptions about the location
of the stores within Cluster B. The variable importance graph (below) shows results similar to those
for Cluster A.
5-5: Cluster B - Variable Importance
Again, the consumer price index, unemployment rate, and store size are all important variables within
this cluster of stores. It appears that the same variables are important across Clusters A and B, but the
averages of the attributes differ slightly relative to the overall attribute averages.
5.4 Insight from Cluster C

Cluster C is completely different from Clusters A and B in that it is the only segment that addresses
the importance of holidays. Overall, Cluster C has lower than average fuel prices and temperature, but
all other attributes are on par with the overall attribute averages. The variable importance graph
below confirms that this cluster's important variables are in stark contrast to those of Clusters A and B.
5-6: Cluster C - Variable Importance
This cluster of stores is grouped together because holidays have a large impact, with the variable
Holiday? dwarfing all other attribute values.
5.5 Overall Insight

Looking at all of the clusters relative to the overall population averages reveals that clustering prior
to forecasting can help eliminate errors that are often caused by seasonal changes or population
disparity. The impact of store size remains constant across the clusters, but moving to the other
attributes reveals that the correlation between an attribute and weekly sales differs across the three
clusters.
5-7: Correlation with Weekly Sales
To sum up, an initial clustering analysis reveals that groups of stores have different relationships
with weekly sales depending on the cluster they belong to. Holidays only appear to have an impact
within Cluster C, while the other attributes of interest are more relevant to Clusters B and C. We now
move on to supervised learning in an effort to test whether the relationships seen above are
statistically significant.
6 Supervised Learning: Regression

6.1 Linear Regression with Full Data
6-1: Regression
In this model, we retain all the variables (CPI, DEPT, FUEL_PRICE, ISHOLIDAY, MARKDOWN1-5,
STORE_SIZE, STORE_TYPE, TEMPERATURE, UNEMPLOYMENT). We also use WEEK as the Time ID, STORE as the
ID and WEEKLY_SALES as the target. We first use the Data Partition node to split the data into a 70%
training set and a 30% validation set. We then set the model selection method to stepwise, forward and
backward in turn, with validation error as the selection criterion.
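The partition and the error measure used throughout this section can be sketched in a few lines of Python. This is a simplified illustration, not the behavior of the Data Partition node itself: the function names, the simple random shuffle and the fixed seed are all assumptions.

```python
import random


def partition(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into a training set and a validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


train_rows, validation_rows = partition(list(range(10)))
example_error = average_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A model is fitted on the training rows only, and the average squared error is then computed separately on each partition. Training and validation errors that are close to each other, as in the figures reported below, suggest the model is not overfitting, even when both errors are large.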
Effect        DF    Sum of Squares    F Value    Pr > F
DEPT          80    2.46E+13          1456.02    <.0001
STORE_SIZE    1     2.15E+12          10162.5    <.0001
Table 6-1: Regression Error
The results of the stepwise and forward models are very similar, while the backward model gives a
worse result. We therefore take the stepwise result here, which usually gives the best solution. In
this model, we get an average squared training error of 2.0121E8 and a validation error of 2.0451E8.
Although the plot looks good, especially at the beginning, the overall error statistics are poor. As
the Type 3 Analysis of Effects above shows, this is because only two important variables remain in the
final model: DEPT and STORE_SIZE. The linear regression model includes every value of DEPT, which
means the nominal department values strongly affect the regression result. The average price of
products may vary a lot between departments. However, it does not make sense to predict sales by
looking only at departments. In addition, STORE_SIZE contains large numbers compared with the other
variables, so it can mask the effects of the other variables and reduce the accuracy of the model.
6-2: Linear Regression with Full Data
6.2 Linear Regression with Imputed and Transformed Full Data
6-3: Linear Regression with Impute and Transform
To improve the results, we imputed the missing values of MARKDOWN1-5 and took the log of each
interval variable to reduce its skew. This gave a better model, with an average squared training
error of 1.9549E8 and an average squared validation error of 1.9831E8. This model also makes more
sense than the previous one, since more attributes are involved.
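The two preprocessing steps can be illustrated as follows. The report does not state which imputation method or log offset was used, so mean imputation and log(x + 1) are assumptions made purely for this sketch, and the function names are hypothetical.

```python
import math
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Apply log(x + 1) to reduce right skew; requires x > -1."""
    return [math.log1p(v) for v in values]


# Made-up markdown amounts with one missing week, for illustration only.
markdown = impute_mean([1000.0, None, 3000.0])
markdown_log = log_transform(markdown)
```

Note that log(x + 1) is undefined for values at or below -1, so negative inputs would need separate handling before this transform.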
6-4: Linear Regression with Imputed and Transformed Full Data
From the screenshot of the model below, we can see that DEPT still has a huge influence.
6.3 Linear Regression with Filtered Data
6-7: Linear Regression with Filtered Data
To reduce the negative effect of DEPT, we filter out the department variable. We use similar
settings for all other variables and obtain a new result. However, the result is even worse: an
average squared training error of 4.842E8 and a validation error of 4.881E8. This means that the
department is really important in this dataset, and if we want a more accurate model, we need to keep
the department in our regression model.
6-8: Linear Regression with Filtered Data
6.4 Linear Regression with Normalized Data

To normalize the interval data, we simply modify the data in Microsoft Excel. For each variable, we
create a normalized variable by dividing the original value by the largest value of that feature. This
yields normalized versions of STORE_SIZE, TEMPERATURE, FUEL_PRICE, CPI, UNEMPLOYMENT and
MARKDOWN1-5. As discussed above, we add DEPT back to our model. We then fit a new model with these
interval variables. However, the average squared training and validation errors are still poor, at
4.83E8 and 4.87E8 respectively.
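The division-by-maximum rule described here is simple enough to state directly in code. The sketch below mirrors that rule only; the function name and the sample values are hypothetical.

```python
def normalize_by_max(values):
    """Divide every value by the largest value of the feature."""
    largest = max(values)
    if largest == 0:
        raise ValueError("cannot normalize a feature whose maximum is 0")
    return [v / largest for v in values]


# Made-up store sizes, for illustration only.
normalized_sizes = normalize_by_max([50000, 100000, 200000])
```

With non-negative inputs the results fall between 0 and 1. Because this is a linear rescaling of each input, it changes the size of the regression coefficients but not the fitted values of an ordinary least squares model, which may help explain why the errors did not improve.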
6-9: Linear Regression with Normalized Data
6-10: Linear Regression with Normalized Data
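Hence, among the linear regression models, the two fitted on the full data perform best: the imputed
and transformed model (Section 6.2) has the lowest errors, followed by the untransformed full-data
model (Section 6.1).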
7 Supervised Learning: Decision Tree
7-1: Decision Tree
7.1 Two-way Split

Variable      Number of Splitting Rules  Importance             Validation Importance  Ratio of Validation to Training Importance
DEPT          72                         1.0                    1.0                    1.0
STORE         41                         0.4964515440958596     0.4964821958816832     1.0000617417473832
MARKDOWN3     14                         0.04084201073998688    0.04144053541125881    1.014654632826046
STORE_SIZE    7                          0.34207810450886517    0.3480075010929593     1.0173334583708806
STORE_TYPE    6                          0.09548333258131239    0.096222212274534      1.0077383106899038
UNEMPLOYMENT  4                          0.03701878109843059    0.031383248828117584   0.8477655907867824
TEMPERATURE   4                          0.023291162585169126   0.021652254529342163   0.9296339094352227
MARKDOWN5     3                          0.012643930440875074   0.01080177216385246    0.8543049342420205
MARKDOWN1     2                          0.0037723260628764943  0.0011206103465749833  0.297060839359281
CPI           1                          0.034734597871631134   0.03245507329474942    0.9343730828464989
ISHOLIDAY     1                          0.004704913825175917   0.00679092144395149    1.443367869484321
MARKDOWN4     1                          0.0017796209442395006  0.0                    0.0
MARKDOWN2     0                          0.0                    0.0                    NaN
FUEL_PRICE    0                          0.0                    0.0                    NaN
Table 7-1: Two Way Split
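The last column of Table 7-1 is the validation importance divided by the training importance, which is why variables with zero importance in both partitions show NaN. A minimal sketch of that calculation follows; the function name is hypothetical, and returning infinity when only the training importance is zero is an assumption, since that case does not occur in the table.

```python
import math


def importance_ratio(training, validation):
    """Ratio of validation importance to training importance."""
    if training == 0:
        # 0/0 is undefined and appears as NaN in the table.
        # Assumption: a nonzero numerator over zero is reported as infinity.
        return float("nan") if validation == 0 else float("inf")
    return validation / training


markdown3_ratio = importance_ratio(0.04084201073998688, 0.04144053541125881)
markdown4_ratio = importance_ratio(0.0017796209442395006, 0.0)
markdown2_ratio = importance_ratio(0.0, 0.0)
```

Ratios close to 1 indicate that a variable is about as useful on unseen data as on the training data. Ratios well below 1, such as the one for MARKDOWN1, suggest that some of the training importance did not carry over to the validation set.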
A two-way split decision tree was generated on the train dataset. The weekly sales classes generated
earlier were used as target classes. The model was heavily dependent on department (DEPT) and store
(STORE): the majority of the splitting rules were based on these two attributes. The two-way split
decision tree generated an average squared error of 0.04222.
WEEKLY_SALES is less dependent on ISHOLIDAY than on STORE_SIZE, STORE_TYPE, UNEMPLOYMENT and
TEMPERATURE. Looking at the data from a broader perspective, the location of the store plays a major
role in weekly sales. A store located in a densely populated urban area would have more sales than one
in a rural area, regardless of whether the week is a holiday week. Holiday sales in a store located far
from the city might still be lower than the average sales of a city store on a day that is not a public
holiday. Stores in cities would be larger and would have higher sales. To explore this scenario,
another approach was pursued. (Refer to Section 7.4.)
7-2: Two-way Split Decision Tree
7.2 Three-way Split

Variable      Number of Splitting Rules  Importance             Validation Importance  Ratio of Validation to Training Importance
DEPT          150                        1.0                    1.0                    1.0
STORE         73                         0.7197389364808112     0.7172217848271354     0.9965026879524074
STORE_SIZE    39                         0.16985483324111295    0.17565253608886602    1.034133281562398
CPI           69                         0.13354907518567805    0.11875136590034423    0.8891964675550165
TEMPERATURE   49                         0.12228743722593886    0.1227097368514723     1.003453336132584
STORE_TYPE    11                         0.10896237311981295    0.1091215150571583     1.0014605219470611
UNEMPLOYMENT  43                         0.0870216497060246     0.08140314276386681    0.9354355271229843
MARKDOWN3     42                         0.0838513980255293     0.07108265278770555    0.8477217370432371
FUEL_PRICE    29                         0.055805194471475056   0.045466506369912014   0.8147360976074067
MARKDOWN4     4                          0.02260829885174643    0.01494100486542834    0.660863736957997
MARKDOWN5     4                          0.01614486406343703    0.013490146227052534   0.8355688951016575
MARKDOWN2     4                          0.014878670018523063   0.013747824483138145   0.9239955228540533
ISHOLIDAY     3                          0.014085153665462117   0.008854538152353212   0.6286433476452034
MARKDOWN1     4                          0.012841667939644022   0.016799803214149107   1.3082259479927658
Table 7-2: Three way split
In terms of variable importance, DEPT and STORE were again the most important variables. However, the
three-way split gave the model more flexibility in decision making and hence, as expected, fewer
errors in classifying observations into the weekly sales classes. The average squared error was
found to be 0.02765.
7-3: Three-way Split Decision Tree
7.3 Two-way Split without DEPT and STORE

To explore how much effect the attributes DEPT and STORE had on the decision tree model, these
attributes were rejected and the model was regenerated with the same parameters as before. The
resulting model showed a more than twofold increase in the average squared error (0.11243, up from
0.04222). Without information on which DEPT and which STORE the sales belong to, the model
misclassified nearly every observation outside the LOW WEEKLY_SALES class. This can be seen from the
plots shown below.
7-4: Two-way Split without Dept. and Store
7.4 Decision Tree on Sampled Data
7-5: Decision Tree on Sampled Data
To observe how weekly sales depend on the other features in the dataset, information on the
department and store ID was rejected. The data was filtered to remove the NEGATIVE and VERY HIGH
WEEKLY_SALES classes, and then sampled so that the remaining classes (LOW, MEDIUM and HIGH
WEEKLY_SALES) had the same number of observations.
The decision trees modeled on this data returned the expected results: STORE_SIZE was one of the
major factors determining weekly sales and hence ended up as the most important variable for
splitting nodes.
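The filtering and equal-size sampling step can be sketched as follows. This is an illustration of the idea rather than the sampling node's actual procedure: the function name, the row layout and the choice to downsample every class to the size of the smallest one are assumptions.

```python
import random
from collections import defaultdict


def balanced_sample(rows, class_key, keep_classes, seed=0):
    """Keep only the listed classes and downsample each to the same size."""
    groups = defaultdict(list)
    for row in rows:
        if row[class_key] in keep_classes:
            groups[row[class_key]].append(row)
    if not groups:
        return []
    # Assumption: every class is cut down to the size of the smallest class.
    size = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):
        sample.extend(rng.sample(groups[label], size))
    return sample


# Made-up rows with an uneven class mix, for illustration only.
rows = ([{"SALES_CLASS": "Low"}] * 5 + [{"SALES_CLASS": "Medium"}] * 3
        + [{"SALES_CLASS": "High"}] * 2 + [{"SALES_CLASS": "Negative"}])
balanced = balanced_sample(rows, "SALES_CLASS", {"Low", "Medium", "High"})
```

Balancing the classes stops the tree from scoring well simply by predicting the dominant LOW class, which is the failure mode described in Section 7.3.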
Variable      Number of Splitting Rules  Importance             Validation Importance  Ratio of Validation to Training Importance
STORE_SIZE    16                         1.0                    1.0                    1.0
CPI           4                          0.2651285102856137     0.24717493036369145    0.9322834805559709
UNEMPLOYMENT  5                          0.24441028209455445    0.20442631071106554    0.8364063449342921
STORE_TYPE    3                          0.15295892375537554    0.15072179548628872    0.9853743200189836
MARKDOWN3     2                          0.052709576258346755   0.0315604300793954     0.5987608385373367
FUEL_PRICE    1                          0.022610228455857168   0.012600386988009991   0.5572870266485904
TEMPERATURE   1                          0.02168392343651502    0.010275651333624921   0.4738833986252257
MARKDOWN2     0                          0.0                    0.0                    NaN
MARKDOWN1     0                          0.0                    0.0                    NaN
MARKDOWN5     0                          0.0                    0.0                    NaN
ISHOLIDAY     0                          0.0                    0.0                    NaN
MARKDOWN4     0                          0.0                    0.0                    NaN
Table 7-3: Sampled Data Decision Tree
However, the average squared error for both trees (two-way split and three-way split) turned out to be
0.211, meaning that the trees produced nodes with lower purity.
The following table summarizes the decision tree models that were generated for the WALMART_TRAIN
dataset.
Model                                        Average Squared Error
Two-way Split                                0.042
Three-way Split                              0.028
Two-way Split without DEPT and STORE         0.112
Two-way & Three-way Split on Sampled Data    0.211
Table 7-4: Avg. Error
8 Time Series Analysis

Due to the nature of the data, the standard algorithms used in the previous sections provided little
insight. To generate better results, the time series analysis tools of SAS were used.
8.1 Data Exploration

To analyze the data from a time series perspective, the time dimension was set up in conjunction with
the cross-sectional dimensions, store and department, using the SAS TS Data Preparation node.
8-1: Dimension Cube
Setting up this structure allowed flexibility in visualizing the data aggregated over different
dimensions.
The following plot shows the weekly sales for 100 of the store-department combinations. It is evident
from the plot that sales were high during the holiday seasons: Christmas in December, and before
summer in May. Other notable peaks in sales occurred around Thanksgiving in November, the Super Bowl
in February and Labor Day in September.
8-2: Weekly Sales for Store-Department
For further analysis, the mean weekly sales for each store and for each department were plotted. Some
departments had very high average weekly sales compared to the others. These departments, although
not named by Walmart for privacy purposes, might be those that sell products people need on a
day-to-day basis, such as groceries, or high-grossing departments such as electronics.
8-3: Mean Weekly Sales by Store
8-4: Mean Weekly Sales by Department
8.2 Hierarchical Clustering [6]
8-5: Hierarchical Clustering
Based on the values of the different input variables such as CPI, UNEMPLOYMENT, ISHOLIDAY,
TEMPERATURE and the MARKDOWN values, the time series inputs were used for clustering. The
clustering mechanism used the mean squared error between the total weekly sales of the stores as the
similarity measure.
The following dendrogram shows the distance between the different clusters that were generated.
8-6: Clustering Dendrogram
Based on the minimum distance between clusters, at a distance value of 0.1, three main clusters were
generated. Stores 7, 16, 17, 38 and 44 were clustered together as cluster A. Stores 28, 30, 33, 36, 37,
42 and 43 were clustered together as cluster B. The rest of the stores belonged to cluster C. The
features of these clusters became more evident during the forecasting process.
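To make the procedure concrete, the sketch below clusters a few toy sales series using mean squared error as the distance and stops merging once the closest pair of clusters is farther apart than a threshold. It is a simplified stand-in for the SAS node, not its algorithm: single linkage, the function names and the toy series are all assumptions, and real sales series would first need to be scaled for a cut-off such as 0.1 to be meaningful.

```python
def mse_distance(a, b):
    """Mean squared error between two equally long series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cluster_series(series, threshold):
    """Agglomerative clustering of named series with single linkage."""
    clusters = [[name] for name in series]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(mse_distance(series[a], series[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy series, for illustration only (not store data).
toy = {"s1": [1.0, 2.0, 3.0], "s2": [1.1, 2.1, 3.1], "s3": [10.0, 11.0, 12.0]}
toy_clusters = cluster_series(toy, threshold=0.1)
```

Cutting at the threshold corresponds to drawing a horizontal line across the dendrogram: every branch that joins below the line ends up in the same cluster.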
The following graph shows how the different stores were clustered in terms of their trends on weekly
sales based on the trends of other attributes.
8-7: Clustering Graph
The features of these clusters became more evident when the trends in sales for the stores were
analyzed. Stores from the same cluster showed similar trends in weekly sales.
8.3 Sales Forecasting [7]

Using SAS Enterprise Miner's Time Series Exponential Smoothing tool, sales for the stores were
forecast for the next six weeks, until December 2012. This sales forecasting methodology is
independent of any of the earlier mentioned input variables. The forecasting takes into consideration
seasonal effects and trends in sales over the period of February 2010 to October 2012.
8-8: Sales Forecasting
For each store, different models were used to forecast sales. The model with the lowest standard
error was automatically selected as the best model for that store. The additive Winters and seasonal
models proved to be the best fit for most stores. The following table shows which model was used for
each store, along with the parameter estimates and the associated standard errors.
Time Series ID  Store  Model       Parameter  Parameter Estimate     Standard Error
1               1      ADDWINTERS  LEVEL      0.0034631198964860067  0.00441693090198638
1               1      ADDWINTERS  SEASON     0.6055475095192919     0.040597985445483306
1               1      ADDWINTERS  TREND      0.001                  0.002625959422274718
2               2      WINTERS     LEVEL      0.16545582491041058    0.02507185069755535
2               2      WINTERS     SEASON     0.921247729943568      0.05853998565921424
2               2      WINTERS     TREND      0.001                  0.009608906293211008
3               3      ADDWINTERS  TREND      0.001                  0.004528332723662851
3               3      ADDWINTERS  SEASON     0.6250302553028629     0.04869015651345081
3               3      ADDWINTERS  LEVEL      0.18422291502403845    0.025950895825345977
4               4      ADDWINTERS  LEVEL      0.08795186376589817    0.019526521663305617
4               4      ADDWINTERS  TREND      0.001                  0.006350001834242006
4               4      ADDWINTERS  SEASON     0.7130415339734698     0.04124489642150956
5               5      ADDWINTERS  TREND      0.001                  0.01145787934036644
5               5      ADDWINTERS  SEASON     0.5774493510072217     0.04518008362353768
5               5      ADDWINTERS  LEVEL      0.1367460310044868     0.023629244732261773
6               6      SEASONAL    SEASON     0.7157152817395214     0.04219586666398573
6               6      SEASONAL    LEVEL      0.12046874034834318    0.016873522037056683
7               7      ADDWINTERS  LEVEL      0.1653571721391764     0.020244620448452346
7               7      ADDWINTERS  SEASON     0.7697969237235239     0.050169815195813205
7               7      ADDWINTERS  TREND      0.001                  0.004356459668509099
8               8      ADDWINTERS  SEASON     0.6850143638004        0.04029437117033705
8               8      ADDWINTERS  LEVEL      0.0662748336317954     0.013963823601356068
8               8      ADDWINTERS  TREND      0.001                  0.005333437425109881
9               9      ADDWINTERS  LEVEL      0.17980061535722747    0.02424450837665769
9               9      ADDWINTERS  TREND      0.001                  0.030250497941758804
9               9      ADDWINTERS  SEASON     0.814285370219351      0.05696269299246877
Table 8-1: Models
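The LEVEL, TREND and SEASON estimates in Table 8-1 are smoothing weights. To show how such weights drive a forecast, here is a minimal Python sketch of the textbook additive Winters recursion. It is an illustration under stated assumptions, not the SAS implementation: the initialization from the first two seasons, the season length and the toy series are all assumptions, and SAS may parameterize the model differently. The weights in the usage lines are the Store 7 estimates from the table.

```python
def additive_winters(y, period, alpha, beta, gamma, horizon):
    """Additive Winters smoothing; returns `horizon` forecasts past the end of y.

    alpha, beta and gamma are the level, trend and season smoothing weights.
    Initialization is deliberately simple and uses the first two seasons.
    """
    if len(y) < 2 * period:
        raise ValueError("need at least two full seasons")
    first = sum(y[:period]) / period
    second = sum(y[period:2 * period]) / period
    level = first
    trend = (second - first) / period
    season = [y[i] - first for i in range(period)]

    for t in range(period, len(y)):
        s_old = season[t % period]
        prev_level = level
        # New level: deseasonalized observation blended with the previous projection.
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + trend)
        # New trend: latest change in level blended with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # New seasonal term: latest deviation from the level blended with the old term.
        season[t % period] = gamma * (y[t] - level) + (1 - gamma) * s_old

    n = len(y)
    return [level + h * trend + season[(n + h - 1) % period]
            for h in range(1, horizon + 1)]

# Toy quarterly-style series (hypothetical), smoothed with the Store 7 weights.
toy_sales = [10, 20, 30, 40] * 3
forecast = additive_winters(toy_sales, period=4,
                            alpha=0.1653571721391764,
                            beta=0.001,
                            gamma=0.7697969237235239,
                            horizon=4)
print(forecast)
```

A trend weight as small as 0.001 means the trend estimate is updated very slowly, while a large season weight lets the seasonal pattern adapt quickly to recent years.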
Based on the models, the weekly sales of each store were forecast for 12 weeks, covering the holiday
season in December (the forecast sales are shown after the vertical line on the graph). The following
graph shows the forecast sales of a store that is doing fairly well. Store 1 is a store from cluster B.
All stores in this cluster show a similar trend: a very high sales peak at Christmas.
8-9: Store 1 - Cluster B
The following graph shows the sales for Store 7. The store shows strong sales from May to
September and from November to January, and it could have good growth potential in the future.
This store was selected from cluster A. All stores in this cluster show a similar trend, which brings in a
steady amount of income in addition to higher sales during holidays. These can be considered stores
with steady growth rates.
8-10: Store 7 - Cluster A
The following graph shows the sales for Store 36, a store that Walmart should focus on. The store has
been losing sales and, if the trend continues, is likely to go out of business over the next couple of
years. Its total sales decreased by half over a period of two years. Store 36 was taken from cluster C.
Stores from this cluster generally showed a declining trend.
8-11: Store 36 - Cluster C
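A claim such as "sales decreased by half over two years" can be checked directly from the weekly series. The following is a minimal Python sketch, assuming one list of weekly totals per store. The 52-week window, the -25% threshold and the store names are illustrative assumptions, not values used in the analysis.

```python
def relative_change(weekly_sales, window=52):
    """Fractional change between the first and last `window` weeks of a series."""
    if len(weekly_sales) < 2 * window:
        raise ValueError("series must cover at least two windows")
    first = sum(weekly_sales[:window])
    last = sum(weekly_sales[-window:])
    return (last - first) / first

def declining_stores(sales_by_store, threshold=-0.25, window=52):
    """Stores whose relative change is at or below `threshold` (a negative fraction)."""
    return sorted(
        store for store, series in sales_by_store.items()
        if relative_change(series, window) <= threshold
    )

# Hypothetical data: store "A" halves its sales, store "B" stays flat.
toy = {"A": [100.0] * 52 + [50.0] * 52, "B": [100.0] * 104}
print(relative_change(toy["A"]))
print(declining_stores(toy))
```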
9 Business Implications
Based on this analysis, Walmart should hire personnel a few weeks before the holiday seasons,
especially Thanksgiving and Christmas. This would allow the stores to perform better as sales rise
gradually with the approach of the holidays.
The cluster information from Section 8.2 can be used in conjunction with sales forecasting to
produce more accurate predictions.
Walmart should keep a close eye on the stores that are going out of business. It should also provide
an incentive to the other stores to improve their sales, and hire the right sales representatives.
10 References
[1] M. Gilliland, "Demand Forecasting in Retail," [Online]. Available:
http://www.sas.com/news/feature/retail/aug06forecast.html.
[2] M. K. &. R. R. Nitin Patel, "Clustering Models to Improve Forecasts in Retails Merchandising,"
[Online]. Available: http://www.cytel.com/Papers/INFORMS_Prac_%2004.pdf.
[3] L. C.-L. &. R. Dudley, "Wal-Mart Sees Profit at Low End of Forecast," [Online]. Available:
http://www.bloomberg.com/news/2014-01-31/wal-mart-sees-profit-at-low-end-of-forecast.html.
[4] R. Dudley, "Wal-Mart Cuts Annual Sales Forecast as Supercenters Struggle," [Online]. Available:
http://www.businessweek.com/news/2014-10-16/wal-mart-cuts-annual-sales-forecast-as-itssupercenters-.
[5] "Kaggle - Walmart Recruiting - Stores Sales Forecasting," [Online]. Available:
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting.
[6] T. L. Sascha Schubert, "Time Series Data Mining with SAS Enterprise Miner," [Online]. Available:
http://support.sas.com/resources/papers/proceedings11/160-2011.pdf.
[7] S. J. Satyajit Dwivedi, "Time-series Data Mining," [Online]. Available:
http://www.iasri.res.in/sscnars/data_mining/10SAS%20Enterprise%20Miner%207.1%20Time%20Series%20Data%20Mining.pdf.
Appendix A
ALTER TABLE WALMART_TRAIN
ADD TEMP_CLASS VARCHAR2(15);

-- Bins are half-open, so every non-NULL value falls into exactly one class.
UPDATE WALMART_TRAIN
SET TEMP_CLASS = (CASE
    WHEN TEMPERATURE < 32 THEN 'Freezing'
    WHEN TEMPERATURE >= 32 AND TEMPERATURE < 64 THEN 'Cold'
    WHEN TEMPERATURE >= 64 AND TEMPERATURE < 79 THEN 'Comfortable'
    WHEN TEMPERATURE >= 79 AND TEMPERATURE < 95 THEN 'Hot'
    WHEN TEMPERATURE >= 95 THEN 'Extremely Hot'
    ELSE NULL
END);

-- http://www.gasbuddy.com/gb_gastemperaturemap.aspx
ALTER TABLE WALMART_TRAIN
ADD FUEL_CLASS VARCHAR2(15);

UPDATE WALMART_TRAIN
SET FUEL_CLASS = (CASE
    WHEN FUEL_PRICE < 2.75 THEN 'Low'
    WHEN FUEL_PRICE >= 2.75 AND FUEL_PRICE < 3.12 THEN 'Medium'
    WHEN FUEL_PRICE >= 3.12 THEN 'High'
    ELSE NULL
END);

-- http://www.statisticbrain.com/wal-mart-company-statistics/
ALTER TABLE WALMART_TRAIN
ADD SALES_CLASS VARCHAR2(15);

UPDATE WALMART_TRAIN
SET SALES_CLASS = (CASE
    WHEN WEEKLY_SALES <= 0 THEN 'Negative'
    WHEN WEEKLY_SALES > 0 AND WEEKLY_SALES <= 10000 THEN 'Low'
    WHEN WEEKLY_SALES > 10000 AND WEEKLY_SALES <= 25000 THEN 'Medium'
    WHEN WEEKLY_SALES > 25000 AND WEEKLY_SALES <= 100000 THEN 'High'
    WHEN WEEKLY_SALES > 100000 THEN 'Very High'
    ELSE NULL
END);
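The same kind of binning can be prototyped outside the database to check how boundary values are classified. The following is a minimal Python sketch using the temperature and fuel price thresholds from this appendix. The helper name `classify` is hypothetical, and the sketch assumes that a value lying exactly on an edge (for example a temperature of 95) belongs to the higher class.

```python
from bisect import bisect_right

TEMP_EDGES = [32, 64, 79, 95]
TEMP_LABELS = ["Freezing", "Cold", "Comfortable", "Hot", "Extremely Hot"]

FUEL_EDGES = [2.75, 3.12]
FUEL_LABELS = ["Low", "Medium", "High"]

def classify(value, edges, labels):
    """Map a value to a label using half-open bins [edge_i, edge_i+1).

    `labels` must have exactly one more entry than `edges`.
    A missing value (None) stays unclassified, like NULL in the SQL.
    """
    if value is None:
        return None
    return labels[bisect_right(edges, value)]

print(classify(31.9, TEMP_EDGES, TEMP_LABELS))
print(classify(95, TEMP_EDGES, TEMP_LABELS))
print(classify(3.0, FUEL_EDGES, FUEL_LABELS))
```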
Appendix B
CREATE TABLE WALMART_TRAIN_HOLIDAY
AS
SELECT * FROM WALMART_TRAIN;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD HOLIDAY VARCHAR2(25);

-- Label the holiday weeks.
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Super Bowl'
WHERE WEEK IN (TO_DATE('12-Feb-10', 'DD-Mon-RR'), TO_DATE('11-Feb-11', 'DD-Mon-RR'),
               TO_DATE('10-Feb-12', 'DD-Mon-RR'), TO_DATE('08-Feb-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Labor Day'
WHERE WEEK IN (TO_DATE('10-Sep-10', 'DD-Mon-RR'), TO_DATE('09-Sep-11', 'DD-Mon-RR'),
               TO_DATE('07-Sep-12', 'DD-Mon-RR'), TO_DATE('06-Sep-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Thanksgiving'
WHERE WEEK IN (TO_DATE('26-Nov-10', 'DD-Mon-RR'), TO_DATE('25-Nov-11', 'DD-Mon-RR'),
               TO_DATE('23-Nov-12', 'DD-Mon-RR'), TO_DATE('29-Nov-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Christmas'
WHERE WEEK IN (TO_DATE('31-Dec-10', 'DD-Mon-RR'), TO_DATE('30-Dec-11', 'DD-Mon-RR'),
               TO_DATE('28-Dec-12', 'DD-Mon-RR'), TO_DATE('27-Dec-13', 'DD-Mon-RR'));

-- Label the two weeks before each holiday. The upper bound is exclusive
-- so that the holiday week itself keeps the label assigned above.
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Super Bowl'
WHERE (WEEK >= TO_DATE('12-Feb-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('12-Feb-10', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('11-Feb-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('11-Feb-11', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('10-Feb-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('10-Feb-12', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('08-Feb-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('08-Feb-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Labor Day'
WHERE (WEEK >= TO_DATE('10-Sep-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('10-Sep-10', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('09-Sep-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('09-Sep-11', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('07-Sep-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('07-Sep-12', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('06-Sep-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('06-Sep-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Thanksgiving'
WHERE (WEEK >= TO_DATE('26-Nov-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('26-Nov-10', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('25-Nov-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('25-Nov-11', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('23-Nov-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('23-Nov-12', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('29-Nov-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('29-Nov-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Christmas'
WHERE (WEEK >= TO_DATE('31-Dec-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('31-Dec-10', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('30-Dec-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('30-Dec-11', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('28-Dec-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('28-Dec-12', 'DD-Mon-RR'))
   OR (WEEK >= TO_DATE('27-Dec-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('27-Dec-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Not Holiday'
WHERE HOLIDAY IS NULL;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD STORE_SIZE_CLASS VARCHAR2(10);

UPDATE WALMART_TRAIN_HOLIDAY
SET STORE_SIZE_CLASS = CASE
    WHEN STORE_SIZE < 100000 THEN 'Small'
    WHEN STORE_SIZE >= 100000 AND STORE_SIZE < 200000 THEN 'Medium'
    WHEN STORE_SIZE >= 200000 THEN 'Large'
END;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD UNEMPLOYMENT_CLASS VARCHAR2(10);

UPDATE WALMART_TRAIN_HOLIDAY
SET UNEMPLOYMENT_CLASS = CASE
    WHEN UNEMPLOYMENT < 7 THEN 'Low'
    WHEN UNEMPLOYMENT >= 7 AND UNEMPLOYMENT < 11 THEN 'Medium'
    WHEN UNEMPLOYMENT >= 11 THEN 'High'
END;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD CPI_CLASS VARCHAR2(10);

UPDATE WALMART_TRAIN_HOLIDAY
SET CPI_CLASS = CASE
    WHEN CPI < 159 THEN 'Low'
    WHEN CPI >= 159 AND CPI < 192 THEN 'Medium'
    WHEN CPI >= 192 THEN 'High'
END;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD DEPT_CLASS VARCHAR2(12);

-- Classify each department by its median weekly sales.
UPDATE WALMART_TRAIN_HOLIDAY OH
SET DEPT_CLASS = 'Low Sales'
WHERE DEPT IN (SELECT DEPT
               FROM (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                     FROM WALMART_TRAIN_HOLIDAY
                     GROUP BY DEPT)
               WHERE MD < 20000);

UPDATE WALMART_TRAIN_HOLIDAY OH
SET DEPT_CLASS = 'Medium Sales'
WHERE DEPT IN (SELECT DEPT
               FROM (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                     FROM WALMART_TRAIN_HOLIDAY
                     GROUP BY DEPT)
               WHERE MD >= 20000 AND MD < 40000);

UPDATE WALMART_TRAIN_HOLIDAY OH
SET DEPT_CLASS = 'High Sales'
WHERE DEPT IN (SELECT DEPT
               FROM (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                     FROM WALMART_TRAIN_HOLIDAY
                     GROUP BY DEPT)
               WHERE MD >= 40000);
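The holiday labeling rule can also be checked in isolation. The following is a minimal Python sketch that uses the holiday dates listed in this appendix. The function name `holiday_label` is hypothetical, and the sketch assumes that the holiday week keeps its own label while the 14 days before it (holiday date excluded) are labeled 'Before ...'.

```python
from datetime import date, timedelta

HOLIDAYS = {
    "Super Bowl": [date(2010, 2, 12), date(2011, 2, 11), date(2012, 2, 10), date(2013, 2, 8)],
    "Labor Day": [date(2010, 9, 10), date(2011, 9, 9), date(2012, 9, 7), date(2013, 9, 6)],
    "Thanksgiving": [date(2010, 11, 26), date(2011, 11, 25), date(2012, 11, 23), date(2013, 11, 29)],
    "Christmas": [date(2010, 12, 31), date(2011, 12, 30), date(2012, 12, 28), date(2013, 12, 27)],
}

def holiday_label(week, holidays=HOLIDAYS, lead_days=14):
    """Return the holiday label for the week identified by the date `week`."""
    # The holiday week itself takes priority over any 'Before ...' window.
    for name, dates in holidays.items():
        if week in dates:
            return name
    # Window is [holiday - lead_days, holiday): the holiday date is excluded.
    for name, dates in holidays.items():
        if any(d - timedelta(days=lead_days) <= week < d for d in dates):
            return "Before " + name
    return "Not Holiday"

print(holiday_label(date(2010, 2, 12)))
print(holiday_label(date(2010, 2, 5)))
print(holiday_label(date(2010, 3, 5)))
```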