
GROUP (UP TO 2 PEOPLE) WEB ANALYTICS ASSIGNMENT #2 (50 POINTS)

DUE DATE: Wednesday, 4/13/2017

In this assignment you will analyze the data in HotelClickStream.xls and interpret the results. The
dataset contains clickstream data on online hotel-booking transactions from 2011. The Appendix
describes each variable in detail.
Please follow the instructions carefully. Perform the analyses below and answer the corresponding
questions. Copy or summarize your key results for each question into a Word file, along with your
answers, to produce the final report for submission.

1. First, add the following 2 variables to your data:
1) REF_D (a dummy variable indicating whether the transaction was referred from another
website, as opposed to the final booking website being accessed directly. If the variable
REF_DOMAIN_NAME is empty, REF_D = 0; otherwise REF_D = 1)
2) LOG_PRICE (the log transformation of the variable PROD_TOTPRICE, computed with the LOG
function in Excel)
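
For anyone who wants to cross-check the Excel columns outside the spreadsheet, the two derived variables could be sketched in Python as below. This is only an illustration: the function name `add_derived_variables` and the two sample rows are made up, and the sketch assumes a missing REF_DOMAIN_NAME shows up as an empty string or None. Note that Excel's LOG function uses base 10 when no base is given, so the sketch uses `math.log10` to match it.

```python
import math

def add_derived_variables(rows):
    """Return copies of the rows with REF_D and LOG_PRICE added.

    Hypothetical helper: each row is a dict keyed by the variable names
    from the Appendix.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        # REF_D = 0 when no referring domain is recorded, otherwise 1.
        ref = (row.get("REF_DOMAIN_NAME") or "").strip()
        new_row["REF_D"] = 1 if ref else 0
        # Base-10 log, matching Excel's LOG default; undefined for
        # missing or non-positive prices, so those become None.
        price = row.get("PROD_TOTPRICE")
        if price is not None and price > 0:
            new_row["LOG_PRICE"] = math.log10(price)
        else:
            new_row["LOG_PRICE"] = None
        out.append(new_row)
    return out

# Made-up rows for illustration only (not taken from HotelClickStream.xls).
sample = [
    {"REF_DOMAIN_NAME": "google.com", "PROD_TOTPRICE": 100.0},
    {"REF_DOMAIN_NAME": "", "PROD_TOTPRICE": 250.0},
]
result = add_derived_variables(sample)
```

If the natural log is preferred instead, `math.log` (or LN in Excel) would be the equivalent choice; the coefficient on LOG_PRICE is then scaled differently, but the model fit is unchanged.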

a) (4 points) Please provide a summary table showing the top 10 domain names (DOMAIN_NAME)
that generated the highest volume of transactions. The report should look like the following table
(Hint: one way to do this is to use the COUNTIF function in Excel). Briefly summarize
your observations from the results.
Rank   Domain Name     # of Transactions
1      marriott.com    524
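
As an alternative to COUNTIF, the ranking could be sketched in Python with `collections.Counter`. The helper name `top_domains` and the sample list are hypothetical; the same function would apply unchanged to the REF_DOMAIN_NAME column, where empty values are skipped.

```python
from collections import Counter

def top_domains(names, n=10):
    """Return (rank, domain, count) tuples for the n most frequent names."""
    # Skip empty/missing values and normalize case before counting.
    counts = Counter(
        name.strip().lower() for name in names if name and name.strip()
    )
    return [
        (rank, domain, count)
        for rank, (domain, count) in enumerate(counts.most_common(n), start=1)
    ]

# Made-up values for illustration only.
sample_names = [
    "marriott.com", "hilton.com", "marriott.com",
    "expedia.com", "marriott.com", "hilton.com",
]
ranking = top_domains(sample_names, n=2)
for rank, domain, count in ranking:
    print(rank, domain, count)
```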

b) (4 points) Please provide a summary table showing the top 10 referring domain names
(REF_DOMAIN_NAME) that generated the highest volume of transactions. The report should look
like the following table. Briefly summarize your observations from the results.
Rank   Referring Domain Name   # of Transactions
1      google.com              620

c) (4 points) Please provide summary statistics (N, Max, Min, Mean, and Std.) for the variables
DIRECT_D, REF_D, DURATION, PAGES_VIEWED, LOG_PRICE, and TRANS_FREQ.
Report your summary statistics table and provide a short description (a few bullet points) of
your observations.
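
A minimal sketch of the five requested statistics for one variable, using only Python's standard library. The helper `summarize` and the sample values are hypothetical, and the sketch assumes Std. means the sample standard deviation (the n - 1 version, as STDEV.S computes in Excel).

```python
import statistics

def summarize(values):
    """Return N, Max, Min, Mean and sample Std. for one variable."""
    # Drop missing observations before computing anything.
    clean = [v for v in values if v is not None]
    return {
        "N": len(clean),
        "Max": max(clean),
        "Min": min(clean),
        "Mean": statistics.mean(clean),
        "Std": statistics.stdev(clean),  # sample standard deviation (n - 1)
    }

# Made-up PAGES_VIEWED values for illustration only.
pages = [2, 4, 4, 6, None]
stats = summarize(pages)
print(stats)
```

For the dummy variables, the mean is simply the share of transactions with the value 1, which is usually the most useful number to comment on.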

2. (6 points) Please use the binary outcome (logistic/logit) regression technique to answer the
following question: what factors influence people's decision to book directly on a hotel website
rather than through a third-party website? Use DIRECT_D as your dependent variable
(DV), and REF_D, LOG_PRICE, TRANS_FREQ, DURATION, HOUSEHOLD_SIZE,
CHILDREN_D, and CONNECTIONSPEED_D as your independent variables (IVs). Report and
interpret your regression results, including an interpretation of the regression
coefficients.
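
One common way to interpret logit coefficients is to exponentiate them into odds ratios: a coefficient b implies that a one-unit increase in the IV multiplies the odds of DIRECT_D = 1 by exp(b), holding the other IVs fixed. The sketch below shows only that arithmetic. The coefficient values are made up for illustration and are not estimates from HotelClickStream.xls; `odds_ratios` is a hypothetical helper.

```python
import math

def odds_ratios(coefficients):
    """Convert logit coefficients into odds ratios via exp(b)."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}

# Made-up coefficients purely for illustration; NOT results from the data.
example_coefs = {"REF_D": -0.50, "LOG_PRICE": 0.20}
ratios = odds_ratios(example_coefs)

for name, ratio in ratios.items():
    change = 100 * (ratio - 1)
    print(f"{name}: odds ratio {ratio:.3f} "
          f"({change:+.1f}% change in the odds of DIRECT_D = 1)")
```

An odds ratio below 1 means the IV is associated with lower odds of booking directly, and above 1 with higher odds. The sign and statistical significance of each coefficient still need to be read from the regression output itself.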

3. a) (6 points) Please use the count data (Poisson) regression model to answer the following
question: what factors influence people's booking frequency? Use TRANS_FREQ as your DV,
and REF_D, LOG_PRICE, PAGES_VIEWED, HOUSEHOLD_SIZE, CHILDREN_D, and
CONNECTIONSPEED_D as your IVs. Report and interpret your regression results, including
an interpretation of the regression coefficients.
b) (6 points) Please repeat the analysis in question a) using the negative binomial regression model.
Report and interpret your regression results and coefficients.
c) (2 points) Please summarize your observations by comparing the results from a) and b).
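
When comparing the two models, it can help to recall that the Poisson model assumes the variance of the count equals its mean, while the negative binomial model allows the variance to exceed the mean (overdispersion). A rough first check, sketched below, is the ratio of the sample variance to the sample mean of the DV. This is only an unconditional diagnostic under stated assumptions: it ignores the IVs, the helper name `dispersion_ratio` is hypothetical, and the counts are made up.

```python
import statistics

def dispersion_ratio(counts):
    """Return sample variance divided by sample mean for a count variable.

    A value near 1 is consistent with the Poisson assumption; a value
    well above 1 suggests overdispersion.
    """
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # sample variance (n - 1)
    return variance / mean

# Made-up TRANS_FREQ values for illustration only.
example_counts = [1, 1, 1, 2, 2, 3, 10]
ratio = dispersion_ratio(example_counts)
print(f"variance / mean = {ratio:.2f}")
```

The dispersion parameter reported in the negative binomial output is the model-based counterpart of this check and is the one to cite in the report.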

4. a) (5 points) Please use the linear regression technique to answer the following question: what
factors influence how much time people spend on a website? Use DURATION as your
DV; you may choose the IVs by carrying out exercises similar to those in Assignment #1.
Please report and interpret ONLY your final regression results.
b) (5 points) Please use the linear regression technique to answer the following question: what
factors influence how many pages people view when visiting a website? Use
PAGES_VIEWED as your DV; you may choose the IVs by carrying out exercises similar to those in
Assignment #1. Please report and interpret ONLY your final regression results.
c) (5 points) Alternatively, you can use a count data model (Poisson or negative binomial), since
PAGES_VIEWED takes discrete, non-negative integer values. Using a similar set of IVs,
do you see significantly different results from linear regression vs. count data models?
d) (3 points) Please summarize your observations by comparing the results from a), b), and c).

Appendix

This is a sample drawn from a large online clickstream dataset collected in
2011 by tracking the online shopping behavior of over 100,000 unique households. This
small sample includes transactions for booking hotels online.

Variable Descriptions
Variable            Description and Measure
ID                  Unique transaction ID
DOMAIN_ID           Unique ID for the web domain
MACHINE_ID          Unique ID for the computer (household) on which the transaction was made
SITE_SESSION_ID     Unique ID for the session in which the transaction was made
TRANS_FREQ          Total number of transactions for the household
DOMAIN_NAME         The website (domain) name where the transaction was made
DIRECT_D            A dummy variable indicating whether the transaction was made directly on a hotel website (1) or on a third-party travel website (0)
PROD_NAME           The product (e.g., hotel or package) purchased by the household
PROD_TOTPRICE       Total price paid for this transaction
REF_DOMAIN_NAME     The referring website (domain) name through which the final purchase website was reached
DURATION            Total time spent at a site (mins)
PAGES_VIEWED        Total pages viewed at a site
HOUSEHOLD_SIZE      Total number of people in the household
CHILDREN_D          A dummy variable indicating whether the household has any children
CONNECTIONSPEED_D   A dummy variable indicating whether the household has a high-speed internet connection
