
Cover Page

Project Summary Report

IS 6841 Business Intelligence & Analytics
BI with Borders
Shruti Karvande
Anurag Bhavsar
Anuja Anil Gutal
Jiangnan Zhu

Table of Contents

Executive summary


Main body


References (either on each page or in a dedicated section)

Data Set Overview:

We chose a mortality data set on the leading causes of death in the United States in
2014. Each year, the Centers for Disease Control and Prevention (CDC) releases a record of every death
in the country. The report includes information about demographic background and causes of
death. The U.S. government uses the data to determine life expectancy and to understand the
complex circumstances of death across the country.

Data Source: Kaggle

Project Overview
For the project, we hope to use analytical tools like Tableau and R to achieve two goals:
1. Perform business intelligence analysis on the data set to determine the factors related to
2. Provide recommendations on business analysis strategy implementation

Every year, the Centers for Disease Control and Prevention (CDC) releases the country's most
detailed report on death. The mortality dataset contains records of every death in the
country in 2014, including information about demographic background and causes
of death. The U.S. government uses the data to determine life expectancy and to
understand the complex circumstances of death across the country. In our study, we will
apply key business analytics concepts to analyze the DeathRecord dataset, assess the BA
maturity of the CDC, and provide our insights.

Main Body
The dataset contains four main tables:
DeathRecord: the primary table containing all the pertinent information in a
single row per death
EntityAxisConditions: ordered list of causes of death
RecordAxisConditions: unordered list of causes of death
Lookup Tables: a reference to show the code and description for each column
name from DeathRecord table
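
As an illustration of how a lookup table can be attached to DeathRecord, here is a minimal sketch in base R. The data frames are tiny made-up stand-ins, and the lookup column names (Code, Description) and the code values are assumptions for illustration, not taken from the actual files.

```r
# Toy stand-in for DeathRecord: one row per death (made-up values).
death_record <- data.frame(
  Id = 1:4,
  MannerOfDeath = c(7, 1, 2, 7)
)

# Toy stand-in for a lookup table; column names and codes are assumed.
manner_lookup <- data.frame(
  Code = c(1, 2, 7),
  Description = c("Accident", "Suicide", "Natural")
)

# Left join: keep every death record and attach the matching description.
labeled <- merge(death_record, manner_lookup,
                 by.x = "MannerOfDeath", by.y = "Code", all.x = TRUE)

# merge() sorts by the join key, so restore the original record order.
labeled <- labeled[order(labeled$Id), ]
print(labeled)
```

Using all.x = TRUE keeps records whose code has no entry in the lookup table; their description is simply left as NA rather than the row being dropped.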

In our study, we will apply the following topics from the course: Lead/Lag Data, Critical
Success Factor, SMART, Rockart Model, Insights, Explorative model, and BI Maturity,
to analyze a Kaggle competition dataset.

Lag/ Lead Data

Lag data:
Following is the list of lag data from which inferences could be made.
DeathRecord provides information on the manner of death.
E.g., MannerOfDeath is the column that records whether the death was due
to accident, suicide, natural causes, etc.
Details of the person such as education, age, gender, and race are part of the lag data
and can be used to infer insights.

Lead data:
We have lag data on age and year of death, from which insights can be drawn
about deaths in a particular year for a particular age group.
Using the age data, patterns between deaths in a specific age group and a specific
manner of death can be evaluated, e.g., suicide as the manner of death for a
particular age group.
The relationship between age group and gender can be evaluated using the lag data on
age and gender from DeathRecord.
Manner-of-death lag data can be used to infer the relationship between the age of a
person and the manner of death.
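
The age group versus manner of death comparison described above can be sketched as a cross-tabulation in base R. The records, the text labels for manner of death, and the age break points below are all made up for illustration; only the column name MannerOfDeath comes from the dataset description, and Age is an assumed column name.

```r
# Toy records standing in for DeathRecord (values are illustrative only).
deaths <- data.frame(
  Age = c(19, 23, 34, 47, 52, 68, 71, 85),
  MannerOfDeath = c("Accident", "Suicide", "Accident", "Natural",
                    "Suicide", "Natural", "Natural", "Natural")
)

# Bin ages into groups; the break points are an arbitrary choice.
deaths$AgeGroup <- cut(deaths$Age,
                       breaks = c(-Inf, 24, 44, 64, Inf),
                       labels = c("0-24", "25-44", "45-64", "65+"))

# Count deaths for each age group / manner of death pair.
age_by_manner <- table(deaths$AgeGroup, deaths$MannerOfDeath)
print(age_by_manner)

# Row-wise shares: within each age group, the fraction of deaths per manner.
print(round(prop.table(age_by_manner, margin = 1), 2))
```

Row-wise shares are used so that age groups with very different death counts can still be compared on the same footing.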

Critical Success Factor:

In our study, we are particularly interested in answering the following questions in order to
understand the causes of death.
Marital status and Age: what is the relationship between marital status and age at death?
Do married or single people live longer?
Accidents and Resident Status:
Education and Place of Injury:
Top causes of death: visualize the top causes of death by referring to the Icd10Code
Top death month: Is there more death in any particular month?
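
Several of the questions above reduce to simple counts and group means. The following is a rough sketch in base R on made-up records; Icd10Code is named in the source, while MonthOfDeath, MaritalStatus, Age, and all values shown are assumptions for illustration.

```r
# Toy records standing in for DeathRecord (values are illustrative only).
deaths <- data.frame(
  Icd10Code = c("I251", "C349", "I251", "J449", "C349", "I251"),
  MonthOfDeath = c(1, 1, 2, 12, 12, 12),
  MaritalStatus = c("M", "S", "M", "W", "D", "S"),
  Age = c(80, 45, 77, 90, 60, 38)
)

# Top causes of death: count rows per ICD-10 code, most frequent first.
top_causes <- sort(table(deaths$Icd10Code), decreasing = TRUE)
print(head(top_causes, 3))

# Deaths per month; fixing the levels keeps months with zero deaths visible.
deaths_per_month <- table(factor(deaths$MonthOfDeath, levels = 1:12))
print(deaths_per_month)

# Mean age at death by marital status.
mean_age <- aggregate(Age ~ MaritalStatus, data = deaths, FUN = mean)
print(mean_age)
```

A difference in mean age between marital groups is only descriptive here; it does not by itself show that marital status affects longevity.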
We want to find out whether the data that the CDC collects and releases meets the SMART criteria:
Specific: the dataset is very specific covering all the aspects of a death, from
demographic background to causes of death. A very detailed record of each death
will help us understand death and improve the overall well-being of the American
Measurable: the data collected is measurable. For example, the number of deaths from a
particular cause can be traced and measured over time.
Agreed: we believe that statisticians and data warehouse architects at the CDC
have reached a consensus to publish this annual report.
Realistic: all the data recorded is realistic.
Time-bound: every year, the CDC releases the country's most detailed report on
death. It is time-bound.

Explorative Analysis:
Explorative Analysis is an approach to analyzing data sets to summarize their main
characteristics with visual methods. By exploring this data, it is possible to formulate
hypotheses that can lead to new data collection thus helping with lead data for future

The DeathRecord table contains 38 variables and 1,048,576 observations. Before conducting a
BI analysis, we first need to perform a data cleansing process. Using R, we randomly select
a sample of 2,000 records from the dataset (see footnote 1). We use the sample to represent the whole population.
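
One way to check whether a 2,000-row sample is a fair stand-in for the population is to compare the share of each category in the sample against the full data. The sketch below uses a simulated population rather than the real file; the Sex column, the population size, and the seed are assumptions for illustration.

```r
# Fix the random seed so the example is reproducible.
set.seed(42)

# Simulated population standing in for DeathRecord (not real data).
population <- data.frame(
  Sex = sample(c("F", "M"), size = 100000, replace = TRUE)
)

# Draw 2,000 distinct rows at random (sampling without replacement).
sample_rows <- population[sample(nrow(population), size = 2000), ,
                          drop = FALSE]

# Compare category shares in the population and in the sample.
pop_share <- prop.table(table(population$Sex))
sample_share <- prop.table(table(sample_rows$Sex))
print(round(rbind(population = pop_share, sample = sample_share), 3))
```

If the shares differ noticeably for a variable that matters to the analysis, a larger or stratified sample may be worth considering.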

Data Reduction

1. Find most significant measurements: In our dataset, the following variables are
considered to be the most significant for measuring the required KPIs: Age, Education
Status, Causes of Death, and Marital Status.

2. Eliminate/Ignore less significant items: Causes of death recorded as Pending
Investigation or undetermined are ignored in the analysis, since they may not lead
us to the actual causes at a larger scale. Factors such as method of disposition and
age groups with irrelevant data were excluded.
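
The two reduction steps above can be sketched in base R as a row filter followed by a column selection. The records, the text labels, and the column names other than MannerOfDeath are made up for illustration and may not match the coding used in the real dataset.

```r
# Toy records standing in for DeathRecord (values are illustrative only).
deaths <- data.frame(
  Age = c(54, 31, 77, 62),
  Education = c("High school", "Bachelor", "8th grade", "Graduate"),
  MaritalStatus = c("Married", "Single", "Widowed", "Divorced"),
  MannerOfDeath = c("Natural", "Pending investigation",
                    "Natural", "Could not determine"),
  MethodOfDisposition = c("Burial", "Cremation", "Burial", "Other"),
  stringsAsFactors = FALSE
)

# Step 2: manners of death to ignore (assumed labels).
excluded <- c("Pending investigation", "Could not determine")

# Step 1: columns judged most significant for the KPIs.
keep_cols <- c("Age", "Education", "MaritalStatus", "MannerOfDeath")

# Drop the excluded rows and keep only the significant columns.
reduced <- deaths[!(deaths$MannerOfDeath %in% excluded), keep_cols]
print(reduced)
```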

Cluster Analysis

1. Grouping/Segmentation: Group the same causes together and analyze them by various factors:
Age Groups, Marital Status, and Education. For example, marital status values (married, divorced,
single, etc.) were grouped to support a clearer analysis. Education levels were clustered as
8th grade, bachelor's, graduate, etc. to understand how deaths are distributed across
education levels. Here, we will use Tableau for visualization.

Footnote 1: Please refer to the Appendix for the R code used to draw the sample of 2,000 records.
2. Allows for targeting: after grouping the dataset, we set the following targets:
a. The relationship between death and marital status
b. Top causes of death
c. Death and seasonality
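
The grouping step can be sketched in base R by mapping detailed category labels to broader clusters and then counting records per cluster. The education labels and the mapping below are assumptions for illustration; the real dataset may code education differently.

```r
# Toy records standing in for DeathRecord (values are illustrative only).
deaths <- data.frame(
  Education = c("8th grade or less", "High school graduate",
                "Bachelor's degree", "Master's degree",
                "Doctorate", "High school graduate"),
  stringsAsFactors = FALSE
)

# Assumed mapping from detailed education labels to broader clusters.
education_group <- c(
  "8th grade or less"    = "8th grade",
  "High school graduate" = "High school",
  "Bachelor's degree"    = "Bachelors",
  "Master's degree"      = "Graduate",
  "Doctorate"            = "Graduate"
)

# Look up the cluster for each record by its detailed label.
deaths$EducationGroup <- unname(education_group[deaths$Education])

# Number of deaths in each education cluster.
group_counts <- table(deaths$EducationGroup)
print(group_counts)
```

The same lookup-vector pattern could be applied to marital status or age groups before exporting the grouped data for visualization.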


Appendix
R code to randomly select 2,000 samples from 1,048,576 observations

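# Install data.table once if it is not already available, then load it
install.packages("data.table", dependencies = TRUE)
library(data.table)

# Read the column-reduced export of the DeathRecord table
d <- fread(input = "C:/Users/Anuja/Desktop/subset_column.csv")

# Draw 2,000 random rows; replace = TRUE samples with replacement,
# so the same record can be selected more than once
subset_rows <- d[sample(nrow(d), size = 2000, replace = TRUE), ]
write.csv(subset_rows, file = "subset.csv")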

Concepts applied in the study:

Lead/Lag Data
Critical Success Factor
Rockart Model
Explorative model
BI Maturity