
Homework 4: Time Series Analysis of "Gone Girl" Daily Box Office

Sebastian Rojas
Regression and Multivariate Data Analysis
Prof. Jeffrey Simonoff
Fall 2014

Trying to find predictors for a film's box office is a complex problem that has haunted the film industry for years. Over the last decade, though, Google search and social media have been praised as good thermometers of social phenomena. In fact, in June 2013, Andrea Chen, Google's principal industry analyst for media and entertainment, claimed that a movie's box office could be predicted as far as four weeks in advance using Google search.1 For the purpose of this paper, I wanted to test this claim by combining the publicly available Google Trends tool with Facebook and Twitter data. The movie analyzed for this paper was "Gone Girl", directed by David Fincher and starring Ben Affleck and Rosamund Pike, released on October 3rd, 2014. However, given the nature of the publicly available data, this relation was analyzed on a day-to-day simultaneous basis, and not in the predictive way described by Chen.
The target variable of the analysis is the daily box office of the movie between October 10th, 2014 and November 8th, 2014 (30 days), taken from Box Office Mojo. The reason to use only 30 days is that most of the predicting variables are only available for the 30 days prior to the moment the data is extracted. Data covering the complete period from the release date of the movie until now would be ideal, but it is only available through paid social media analytics tools designed for businesses. The predicting variables are the following:
a) Google Trends results for the search topic "Gone Girl", with label "2014 Film". Google instantly separates topics that might have a similar title using labels, allowing us to differentiate the film from the homonymous book on which the film is based. Google Trends results are presented as observations in the range 0-100: the value 100 is assigned to the day in the defined period with the highest number of searches, and every other observation is expressed relative to that peak. Google doesn't allow setting customized dates. The only possible fine-tuning is to define the search as relative to the past week, past 30 days, 90 days, or 365 days, or to set month parameters in the form "every October and November day in 2014". This is the way it was done (the 100 value fell on the first Saturday the movie was being shown), and the observations used were the ones covering the above-mentioned period (Oct 10-Nov 8).
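The peak-relative scaling that Google Trends applies can be sketched as follows. The raw counts below are made up for illustration, since Google only publishes the rescaled index:

```python
# Google Trends rescales raw daily search counts so the busiest day is 100
# and every other day is expressed relative to that peak. The raw counts
# here are hypothetical: Google does not expose them.
raw = [420, 1050, 630, 210]
peak = max(raw)
gti = [round(100 * r / peak) for r in raw]

print(gti)  # [40, 100, 60, 20]
```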
b) Changes in likes for users located in the US for the movie's Facebook page.2 This data point was extracted using the trial version of the social media analytics tool Quintly. Unfortunately, it only allowed getting data for 30 days. It can be argued that people only liked the movie on Facebook after watching it, and that the relation therefore has a lagging effect on the predictor side, but one of the interesting elements of using this metric is that the action of liking is amplified by appearing on the likers' friends' feeds, for likers that don't have restrictive privacy settings. This is one of the reasons why the editorial slant of the page is oriented toward people that haven't watched the movie yet.
c) Twitter mentions in English of "Gone Girl" and @GoneGirlMovie (the official Twitter account for the movie). This data was taken from the social media analytics tool Topsy. While also considering mentions of "Ben Affleck", "Rosamund Pike" and "David Fincher" was tempting, it was difficult to control whether the names were mentioned as part of the same tweet (which would obviously lead to a multicollinearity problem) or as part of a tweet related to the movie. I acknowledge that making sure that these tweets are either generated or viewed by American audiences is impossible, and the data might overestimate the predictor.

1 "Google Unveils Model to Predict Box Office", The Hollywood Reporter, Jun 2013. http://www.hollywoodreporter.com/news/google-unveils-model-predict-box-563660

2 https://www.facebook.com/GoneGirlMovie

Since we are trying to predict a money variable, and our hypothesis is that all the predictors are related to audience variables that should account for proportional changes in the target variable (they act as a sample of the overall population's interest in the movie), the relation between them was treated as multiplicative/multiplicative. Therefore, the target variable as well as the predictors were logged in base 10. The variables are then:
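As a minimal sketch of the transformation (with made-up values for one day, not the actual series), logging in base 10 turns the multiplicative relations into additive ones:

```python
import math

# Hypothetical values for a single day: gross in US$ and a Google Trends index.
ddg = 8_139_000
gti = 100

# In log10 space a multiplicative model becomes linear, which is why both
# the target and the predictors are logged before fitting the regression.
log_ddg = math.log10(ddg)
log_gti = math.log10(gti)

print(round(log_ddg, 3), log_gti)  # 6.911 2.0
```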
Predictors
Log(gti): Logged Google Trends index
Log(FBuschange): Logged change in American fans of the Facebook page for the movie
Log(twGG): Logged number of mentions of "Gone Girl"
Log(tw@GG): Logged number of mentions of the official Twitter account of "Gone Girl"

Target Variable
Log(DDG): Logged Daily Domestic Gross
The scatterplots for the different variables are the following:

All four scatterplots appear to show a strong and significant relation between the predictors and the target variable, particularly Log(gti). If we run a best subsets regression, the output is the following:

Best Subsets Regression: Log(DDG) versus Log(gti), Log(FBuschange), ...

Response is Log(DDG)

Vars  R-Sq  R-Sq(adj)  R-Sq(pred)  Mallows Cp        S  Log(gti)  Log(FBuschange)  Log(twGG)  Log(tw@GG)
   1  85.9       85.4        84.1         6.0  0.13252      X
   1  44.9       42.9        34.9        99.3  0.26239                                  X
   2  88.6       87.8        86.0         1.8  0.12134      X             X
   2  86.8       85.8        84.5         6.1  0.13086      X                                       X
   3  89.0       87.7        84.8         3.0  0.12162      X             X             X
   3  88.8       87.5        84.8         3.6  0.12296      X             X                         X
   4  89.0       87.3        83.4         5.0  0.12402      X             X             X           X

With these results, the best model appears to be undoubtedly the two-variable model that keeps Log(gti) and Log(FBuschange). Under this model Mallows' Cp is the smallest, and both the predicted R² and the adjusted R² are maximized. It is interesting to notice that when we take the simple linear model that only considers Log(gti), which from the scatterplots appears to be the predictor with the strongest relation to the target variable, Mallows' Cp is 6.0, way above the rule of p+1. This can be a first warning that there's autocorrelation in the target variable, leading to wrong estimates and causing a high residual SS.
The output of the regression that uses Log(gti) and Log(FBuschange) is the following:
Analysis of Variance

Source             DF   Adj SS   Adj MS  F-Value  P-Value
Regression          2  3.10155  1.55078   105.33    0.000
  Log(gti)          1  1.70632  1.70632   115.90    0.000
  Log(FBuschange)   1  0.09421  0.09421     6.40    0.018
Error              27  0.39751  0.01472
Total              29  3.49906

Model Summary

       S    R-sq  R-sq(adj)  R-sq(pred)
0.121336  88.64%     87.80%      86.03%

Coefficients

Term               Coef  SE Coef  T-Value  P-Value   VIF
Constant          4.645    0.447    10.39    0.000
Log(gti)          2.182    0.203    10.77    0.000  2.66
Log(FBuschange)  -0.485    0.192    -2.53    0.018  2.66

Regression Equation

Log(DDG) = 4.645 + 2.182*Log(gti) - 0.485*Log(FBuschange)

The regression appears to have a high level of significance, and the same goes for the Log(gti) coefficient. The significance level is lower for Log(FBuschange), but still within acceptable levels since the p-value is less than 0.05. The regression also has a strong fit, with an R² of 88.64%. However, before moving any further in analyzing these results, it would be good to check whether our suspicions of autocorrelation are justified.
When we look at the plots of the standardized residuals, the errors appear to be normally distributed, which allows us to use the Durbin-Watson statistic:

The Durbin-Watson statistic for the regression is d = 1.25613, and the lower and upper limits for a regression with p = 2 and n = 30 are 1.134 and 1.264 at 1% significance, which makes the test inconclusive for positive autocorrelation. Similarly, (4 - d) = 2.74, which is higher than the upper bound for the test and suggests that there is no statistical evidence that the error terms are negatively autocorrelated.
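For reference, the statistic itself is simple to compute from the residual series; a minimal sketch, using an illustrative residual vector rather than the model's actual residuals:

```python
def durbin_watson(resid):
    # d = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 indicate no
    # AR(1) autocorrelation, values well below 2 suggest positive
    # autocorrelation, and values well above 2 suggest negative autocorrelation.
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Illustrative residual series (not the regression's actual residuals)
e = [0.1, 0.12, 0.05, -0.08, -0.1, 0.02, 0.09, -0.04]
print(round(durbin_watson(e), 3))  # 1.101
```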
However, Durbin-Watson only tests for autocorrelation in the form of an autoregressive model of order 1. When we look at the autocorrelation function for the standardized residuals, the two points where the plot exceeds the significance level of α = 5% suggest not only autocorrelation in the form of an autoregressive model of order 1, but also of order 7:

A runs test was also performed with the following results:


Runs test for SRES1
Runs above and below K = 0.00298467
The observed number of runs = 11
The expected number of runs = 15.9333
14 observations above K, 16 below
P-value = 0.066

The p-value (0.066) fails to reject the null hypothesis of randomness at the 5% level, indicating that there is no autocorrelation in the errors.
What is happening then? If we look at the plot of the unlogged target variable against days of the week, we can see that there's a clear and logical seasonality effect in the data: during weekends people go to the movies more than during the week, since they have more time.

This spike starts on Friday, which is one of the reasons why most movies premiere on Thursdays. Since the first observation corresponds to a Friday and these spikes are spaced by one week, there's an autocorrelation of order 7.
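Since the 30-day sample starts on a Friday (October 10th), the indicator can be generated directly from the day index; a quick sketch:

```python
# Day 0 of the sample (Oct 10, 2014) is a Friday and the weekend spikes
# repeat every 7 days, so a 0/1 Friday indicator captures the order-7 pattern.
n_days = 30
FR = [1 if t % 7 == 0 else 0 for t in range(n_days)]

print(FR[:8])   # [1, 0, 0, 0, 0, 0, 0, 1]
print(sum(FR))  # 5 Fridays fall inside the 30-day window
```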
In order to correct this, a seasonal indicator variable was added for the observations corresponding to Fridays. The output and residual plots for the new regression model are the following:

Regression Analysis: Log(DDG) versus Log(gti), Log(FBuschange), FR

Method

Categorical predictor coding  (1, 0)

Analysis of Variance

Source             DF   Adj SS   Adj MS  F-Value  P-Value
Regression          3  3.32048  1.10683   161.15    0.000
  Log(gti)          1  1.65024  1.65024   240.27    0.000
  Log(FBuschange)   1  0.07606  0.07606    11.07    0.003
  FR                1  0.21893  0.21893    31.87    0.000
Error              26  0.17858  0.00687
Total              29  3.49906

Model Summary

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0828758  94.90%     94.31%      93.39%

Coefficients

Term               Coef  SE Coef  T-Value  P-Value   VIF
Constant          4.496    0.306    14.67    0.000
Log(gti)          2.148    0.139    15.50    0.000  2.66
Log(FBuschange)  -0.437    0.131    -3.33    0.003  2.67
FR
  1              0.2297   0.0407     5.65    0.000  1.00

Regression Equation

FR
0    Log(DDG) = 4.496 + 2.148*Log(gti) - 0.437*Log(FBuschange)

1    Log(DDG) = 4.726 + 2.148*Log(gti) - 0.437*Log(FBuschange)

Fits and Diagnostics for Unusual Observations

Obs  Log(DDG)     Fit    Resid  Std Resid
  1    6.9105  6.8764   0.0342       0.54  X
 22    6.2548  6.3976  -0.1428      -2.01  R
 25    5.8428  6.0249  -0.1821      -2.30  R

R  Large residual
X  Unusual X

Durbin-Watson Statistic

Durbin-Watson Statistic = 1.59679

As we can see in the results, not only is the regression still statistically significant, but the significance of the Log(FBuschange) coefficient also improved, and the fit of the regression is higher, with an R² of 94.9%. The VIFs also remain within thresholds that rule out the presence of multicollinearity.
Though slightly negatively skewed, the standardized residuals still appear to tend toward a normal distribution:


If we look at the Autocorrelation function under this new model, all forms of autocorrelation
seem to have gone away:


Similarly, the p-value in the runs test is above the level of significance, so we fail to reject the null hypothesis of randomness, again indicating no autocorrelation:

Runs Test: SRES3


Runs test for SRES3
Runs above and below K = 0.00375006
The observed number of runs = 13
The expected number of runs = 15.4
18 observations above K, 12 below
P-value = 0.352

Lastly, the Cook's distances for the observations are all small and less than 1:

COOK3
0.052907  0.046384  0.027199  0.002994  0.010496  0.000132
0.008043  0.032571  0.029246  0.014872  0.000407  0.024950
0.017652  0.009046  0.020785  0.009391  0.033414  0.043981
0.019056  0.008794  0.000685  0.356469  0.044249  0.064075
0.122753  0.003134  0.000277  0.011167  0.003478  0.037717

Coming back to the regression equation, it is of the form:

When Friday:
Log(DDG) = 4.726 + 2.148*Log(gti) - 0.437*Log(FBuschange)

When rest of the week:
Log(DDG) = 4.496 + 2.148*Log(gti) - 0.437*Log(FBuschange)

Which can also be expressed as:
Log(DDG) = 4.496 + 2.148*Log(gti) - 0.437*Log(FBuschange) + 0.23*FR + ε

If we antilog the regression equation, the expression is:

Daily Domestic Gross = 31,333 * (Google Trend Index)^2.148 * (1 / FBuschange^0.437) * 1.698^FR
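The constants in the antilogged equation come directly from raising 10 to the fitted coefficients; checking the arithmetic:

```python
# Back-transforming the fitted log10 equation to the dollar scale:
#   Log(DDG) = 4.496 + 2.148*Log(gti) - 0.437*Log(FBuschange) + 0.2297*FR
# antilogging each term gives
#   DDG = 10^4.496 * gti^2.148 * FBuschange^(-0.437) * (10^0.2297)^FR
const = 10 ** 4.496    # baseline multiplier, about 31,333
friday = 10 ** 0.2297  # Friday multiplier, about 1.70 (a ~70% lift)

print(round(const), round(friday, 2))
```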


In conclusion, the estimated regression equation indicates that for the month of the analysis:

- Leaving everything else fixed, proportional increments in the Google Trend Index have multiplicative effects that are more or less equal to that proportional increment squared. This strong relation could be seen in the initial scatterplot. Unlike social network activities, performing a Google search is most probably what everybody that goes to the movies does to find a theater, showtimes, etc. Part of the exponential effect could also be tied to the fact that a single search on Google can imply more than one ticket bought.

- Leaving everything else fixed, Fridays have an approximately 70% higher daily Box Office than the rest of the days of the week.

- Leaving everything else fixed, a proportional increase in likers of the official movie Facebook page is tied to a proportional decrease in Box Office that is slightly less than half as large. This result is certainly surprising, and the difficulty in understanding it could be linked either to the possibility that the predictor helps calibrate an overestimation of the relation between the Google Trends Index and the Daily Box Office, or to the fact that we still don't fully understand how behavior on social networks affects consumption and the effect was not properly coded in the regression.

