
Finding Needles in a Haystack:

Using Data Analytics to Improve Fraud Prediction

Johan Perols
Associate Professor of Accounting
University of San Diego
jperols@sandiego.edu

Robert Bowen
Distinguished Professor of Accounting
University of San Diego
rbowen@sandiego.edu

Carsten Zimmermann
Associate Professor of Management
University of San Diego
zimmermann@sandiego.edu

Basamba Samba
RWTH Aachen University
basamba.samba@rwth-aachen.de

April 2016

We acknowledge financial assistance from the School of Business Administration at the
University of San Diego and helpful comments from Darren Bernard, Nicole Cade, Ed deHaan,
Weili Ge, Jane Jollineau, Yen-Ting (Daniel) Lin, Sarah Lyon, Dawn Matsumoto, Barry Mishra,
Ted Mock, Ryan Ratcliff, Terry Shevlin, Brady Williams, and workshop participants at the
University of California, Riverside and the University of San Diego. All remaining errors are
our own.

Electronic copy available at: http://ssrn.com/abstract=2590588


Finding Needles in a Haystack: Using Data Analytics to Improve Fraud Prediction

Abstract: Developing models to detect financial statement fraud involves challenges related
to (i) the rarity of fraud observations, (ii) the relative abundance of explanatory variables
identified in the prior literature, and (iii) the broad underlying definition of fraud. Following
the emerging data analytics literature, we introduce and systematically evaluate three
methods to address these challenges. Results from evaluating actual cases of financial
statement fraud suggest that two of these methods improve fraud prediction performance by
approximately ten percent relative to the best current techniques. Improved fraud prediction
can result in meaningful benefits, such as improving the ability of the SEC to detect
fraudulent filings and improving audit firms' client portfolio decisions.

Key words: Financial statement fraud, Data analytics, Fraud prediction, Risk
assessment, Data rarity, Data imbalance.
Data availability: Data are available from sources identified in the text.



I. INTRODUCTION
Organizations lose an estimated 5 percent of annual revenues to fraud in general and 1.6

percent of annual revenues specifically to financial statement fraud (ACFE 2014). Further, when

resources are misallocated because of misleading financial data, fraud can harm the efficiency of

capital, labor, and product markets. Financial statement fraud (henceforth fraud) also increases

business risk. For example, audit firms can face lawsuits, reputational costs, and loss of clients;

investors and banks are more likely to make suboptimal investment and loan decisions.

Data analytics is an important emerging field in both academic research (e.g., Agarwal and

Dhar 2014; Chen, Chiang, and Storey 2012) and in practice (e.g., Brown, Chui, and Manyika

2011; LaValle, Lesser, Shockley, Hopkins, and Kruschwitz 2013).1 In the fraud context, data

analytics can, for example, be used to create fraud prediction models that help (i) auditors

improve client portfolio management and audit planning decisions and (ii) regulators and other

oversight agencies identify firms for potential fraud investigation (e.g., SEC 2015; Walter 2013).

However, the usefulness of data analytics in fraud prediction is hindered by three challenges.

First, fraud prediction is a "needle in a haystack" problem. That is, the relative rarity of fraud

firms compared to non-fraud2 control firms (Bell and Carcello 2000) makes fraud prediction

difficult (Perols 2011). Second, fraud prediction is complicated by the curse of data

dimensionality (Bellman 1961). The rarity of fraud observations relative to the large number of

explanatory variables identified in the fraud literature (Whiting, Hansen, McDonald, Albrecht,

1 Data analytics refers to techniques that are grounded in data mining (e.g., decision trees, artificial neural networks, and support
vector machines) and statistics (e.g., ANOVA, regression analysis, and logistic regressions) (Chen et al. 2012). Data analytics
draws from statistics, artificial intelligence, computer science, and database research. It is related to big data in that it provides
tools that enable the analysis of large datasets. Data analytics is typically focused on prediction as opposed to explanation.
2 We use the term fraud as opposed to other terminology, such as material misstatements (Dechow, Ge, Larson, and Sloan 2011)

or misreporting. The primary difference between fraud and misstatements is that fraud is intentional while misstatements can be
either intentional or errors. Further, we use the term non-fraud firms to describe all firms for which fraud has not been detected.
This primarily includes firms that have not committed fraud, but also includes undetected cases of fraud. To the extent that
undetected fraud exists in our data, noise is introduced. This noise reduces the effectiveness of all prediction models, and
methods that address this noise might further improve fraud prediction. However, this noise is not likely to bias performance
comparisons among prediction models that use the same data.

and Albrecht 2012) can result in over-fitted prediction models that perform poorly when

predicting new observations. Third, prior research generally treats all frauds as homogeneous

events. This can make fraud prediction more difficult because prediction models have to detect

patterns that are common across different fraud types (e.g., revenue vs. expense fraud).

While prior fraud detection research enhances our general understanding of fraud indicators

and prediction methods, this research rarely addresses these problems explicitly. With a primary

objective of improving fraud prediction, we address these challenges by introducing three

methods grounded in data analytics research.3 The methods we examine have performed well in

other settings characterized by data rarity, such as predicting credit card fraud (e.g., Chan and

Stolfo 1998). The first method, Multi-subset Observation Undersampling (OU), addresses the

imbalance between the low number of fraud observations relative to the number of non-fraud

observations by creating multiple subsets of the original dataset that each contain all fraud

observations and different random subsamples of non-fraud observations. The second method,

Multi-subset Variable Undersampling (VU), addresses the imbalance between the low number of

fraud observations relative to the number of explanatory variables identified in the fraud

prediction literature by creating multiple subsets of randomly selected explanatory variables.

The third method, VU partitioned by type of fraud (PVU), is a variation of the second method

that addresses issues associated with treating all fraud cases as homogeneous events. Rather than

randomly selecting variables, we instead use our a priori knowledge to partition the variables

into subsets based on their relation to specific types of fraud (e.g., revenue vs. expense fraud).

We use a dataset with 51 fraud firms, 15,934 non-fraud firm years, and 109 explanatory

variables from prior research. We then analyze over 10,000 prediction models to systematically

3We evaluate our results on out-of-sample data and thus perform predictive modeling. To clearly delineate our work from
explanatory models, we refer to our models as prediction models throughout the paper.

evaluate how to best implement these methods, e.g., how many data subsets to use in OU. In

addition, we examine the prediction performance of these implementations relative to various

benchmarks that represent the current standard in the literature, e.g., model 2 in Dechow et al.

(2011) and simple undersampling as used in Perols (2011). To avoid biasing the results, we

evaluate prediction performance using the prediction models' probability predictions on hold-out

data that are not processed by the proposed methods.4

Results indicate that including additional data subsets (up to approximately 12 subsets)

increases OU fraud prediction performance, i.e., additional subsets after 12 do not appear to

enhance performance. This 12-subset configuration improves prediction performance by 10.8

percent relative to the best performing benchmark.

While results indicate that VU also has the potential to improve fraud prediction, the

performance of this method is highly dependent on the specific variables selected in the various

subsets. However, performance improves when we use a priori knowledge to separate

independent variables into different subsets based on the type of fraud they are likely to predict,

e.g., revenue or expense fraud. This method, i.e., PVU, improves fraud prediction performance

by 9.6 percent relative to the best performing VU benchmark. Additional analyses also show

that performance can be further improved by combining OU and PVU, but only under certain

conditions as described in section IV.

Our paper makes at least five important contributions. First, by introducing and

systematically evaluating three new methods and showing that two of these methods improve

prediction performance relative to the best performing benchmarks, we directly contribute to

4 We follow recent fraud data analytics research (e.g., Cecchini et al. 2010) and findings in Perols (2011) and implement all
prediction models using support vector machines. Support vector machines determine how to separate fraud firms from non-
fraud firms by finding the hyperplane that provides the maximum separation in the training data between fraud and non-fraud
firms. In additional analyses we also use logistic regression and bootstrap aggregation to examine the robustness of our results.

research that focuses on improving the performance of fraud prediction models. The

performance improvements from OU and PVU are large relative to other approaches for

improving prediction performance, e.g., (i) a 0.9 percent performance advantage in Dechow et al.

(2011) when two additional significant independent variables are added to their initial model and

(ii) a 2.2 percent improvement in Price, Sharp, and Wood (2011), when comparing Audit

Integrity's Accounting and Governance Risk measure to Dechow et al. (2011) model 2.5

Second, the finding that OU significantly improves prediction performance has important

methodological implications for research that evaluates the value of new explanatory variables.

This research can potentially benefit from applying OU to ensure that (i) results are robust across

different subsamples and (ii) new variables provide incremental predictive value to models

implemented using our recommended methods.

Third, we show that the ability of VU to predict fraud improves consistently only when we

recognize that not all frauds are alike and subdivide the general fraud problem into types of

fraud. The importance of this approach likely extends beyond variable undersampling. For

example, future research could reorganize or design new fraud variables to predict a specific

fraud type (e.g., revenue fraud or expense fraud).

Fourth, OU and PVU can be extended to address rarity and data dimensionality problems that

are prevalent in other accounting classification settings, including prediction of financial

statement restatements, material weaknesses in internal controls, auditor resignations, audit

qualifications, and bankruptcy.

5 Dechow et al. (2011) do not report predictive performance and the 0.9 percent difference is based on a separate analysis that we
performed using the two models in their paper (Model 1 and Model 2). This analysis uses the same procedures used in our
material misstatement analysis described in Section IV. Price et al. (2011) compare Audit Integrity's Accounting and
Governance Risk measure, which is considered the "gold standard" in commercial risk measures, to Dechow et al. (2011) Model 1
using material misstatement data. Based on their results, we calculate a 3.16 percent fraud prediction performance improvement
of the commercial measure to model 1. This implies a 2.24 percent improvement over Dechow et al. (2011) Model 2, which we
include as one of our benchmark models.

Finally, the introduction and evaluation of these methods makes an important contribution to

practice. Better prediction models can, for example, help the SEC and external auditors improve

their identification of potentially fraudulent accounting practices (Walter 2013; SEC 2015).

The remainder of the paper is organized as follows. Section II summarizes the fraud

literature, discusses data rarity, and describes how methods drawn from the data analytics

literature can be applied to fraud prediction. Section III describes the data, performance

measure, and experimental design. Section IV provides results, and section V concludes.

II. PRIOR LITERATURE, BACKGROUND, AND PROPOSED METHODS


Prior Fraud Prediction Research
Research on financial statement fraud prediction contributes to understanding factors that can

be used to predict fraud. Prior research includes testing fraud hypotheses grounded in the

earnings management and corporate governance literatures (e.g., Beasley 1996; Dechow, Sloan,

and Sweeney 1996; Summers and Sweeney 1998; Beneish 1999; Sharma 2004; Erickson,

Hanlon, and Maydew 2006; Lennox and Pittman 2010; Feng, Ge, Luo, and Shevlin 2011; Perols

and Lougee 2011; Caskey and Hanlon 2012; Armstrong, Larcker, Ormazabal, and Taylor 2013;

Markelevich and Rosner 2013). This research also evaluates the significance of a variety of

other potential explanatory variables, such as red flags emphasized in auditing standards,

discretionary accruals measures, and non-financial indicators (e.g., Loebbecke, Eining, and

Willingham 1989; Beneish 1997; Lee, Ingram, and Howard 1999; Apostolou, Hassell, and

Webber 2000; Kaminski, Wetzel, and Guan 2004; Ettredge, Sun, Lee, and Anandarajan 2008;

Jones, Krishnan, and Melendrez 2008; Brazel, Jones, and Zimbelman 2009; Dechow et al. 2011).

Varian (2014) highlights the importance of the emerging field of data analytics. He suggests

that researchers using traditional econometric methods should consider adapting recent advances

from this field. A second stream of financial statement fraud prediction research follows this

suggestion and applies developments in data analytics research to improve fraud prediction.

Early research within this stream concludes that artificial neural networks perform well relative

to discriminant analysis and logistic regressions (e.g., Green and Choi 1997; Fanning and Cogger

1998; Lin, Hwang, and Becker 2003). More recent research in this stream examines additional

classification algorithms, such as support vector machines, decision trees, and adaptive learning

methods (e.g., Cecchini et al. 2010; Perols 2011; Abbasi, Albrecht, Vance, and Hansen 2012;

Gupta and Gill 2012; Whiting et al. 2012) and text mining methods (e.g., Glancy and Yadav

2011; Humpherys, Moffitt, Burns, Burgoon, and Felix 2011; Goel and Gangolly 2012; Larcker

and Zakolyukina 2012).

Data Rarity, Related Prior Research, and Proposed Methods


Data rarity is observed in diverse prediction settings, such as credit card fraud (Chan and

Stolfo 1998), auto insurance fraud (Phua, Alahakoon, and Lee 2004), bankruptcy (Shin, Lee, and

Kim 2005), and financial statement fraud (Whiting et al. 2012). Classification algorithms (e.g.,

logistic regression) have inherent difficulties in processing rarity (Weiss 2004), and data rarity is

regarded as one of the primary challenges in data analytics research (Yang and Wu 2006). Data

rarity is particularly severe in financial statement fraud detection because financial statement

fraud is characterized by both (i) relative rarity (a.k.a. the "needle in the haystack" problem) and

(ii) absolute rarity combined with an abundance of explanatory variables proposed in the

literature (a.k.a. the "curse of data dimensionality" problem).

The needle in a haystack problem. Relative rarity occurs when detected fraud observations

are a relatively small percentage of the majority non-fraud observations, e.g., only approximately

0.6 percent of all audited U.S. financial reports have been identified as fraudulent (Bell and

Carcello 2000). Relative rarity is a challenge since it forces classification algorithms to consider

a large number of potential patterns without having enough fraud observations to determine

which patterns are driven by noisy data. This increases the risk that identified patterns are based

on spurious relations in a particular sample, resulting in increased false positive rates for a given

false negative rate when the developed model is applied to a new sample (Weiss 2004). Further,

to minimize total classification errors, algorithms tend to be biased towards classifying

observations from the majority class correctly (e.g., Maloof 2003). To illustrate, if 99 percent of

all observations are non-fraudulent, a prediction model identifying all observations as non-

fraudulent achieves an overall accuracy of 99 percent: it correctly classifies 100 percent of the

non-fraudulent observations but 0 percent of the fraudulent observations.
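This accuracy paradox can be reproduced directly; the 10,000-observation sample below is hypothetical, constructed only to match the 99 percent illustration:

```python
# Hypothetical sample: 1 percent fraud (label 1), 99 percent non-fraud (label 0)
labels = [1] * 100 + [0] * 9900

# A degenerate model that classifies every observation as non-fraud
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
fraud_caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(accuracy)      # 0.99 -- overall accuracy looks excellent
print(fraud_caught)  # 0    -- yet not a single fraud observation is flagged
```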

Perols (2011) takes an initial step towards addressing the relative rarity problem in a fraud

context by examining the performance of classification algorithms after undersampling the non-

fraud observations. However, while the simple undersampling method used in Perols (2011),

i.e., a method that simply removes non-fraud observations from the sample, generates more

balanced datasets, it also discards potentially useful non-fraud observations. We, therefore,

introduce a more sophisticated undersampling method that does not discard non-fraud

observations (and include simple undersampling as a benchmark).

More specifically, we use Multi-subset Observation Undersampling (OU), developed by

Chan and Stolfo (1998), to address relative rarity. OU uses multiple data subsets, where each

subset contains all fraud observations but different subsamples of non-fraud observations. We

specifically select OU because prior research shows that it performs well in other settings

constrained by relative rarity, such as predicting credit card fraud (e.g., Chan and Stolfo 1998).

OU is also effective compared to (i) other undersampling and oversampling methods (Nguyen,

Cooper, and Kamei 2012) and (ii) various types of bootstrap aggregation, boosting, and hybrid

ensemble data rarity methods used in the data analytics literature (Galar, Fernández,

Barrenechea, Bustince, and Herrera 2012). OU is conjectured to improve performance (e.g.,

Nguyen et al. 2012) not only because it improves the balance between minority (fraud) and

majority (non-fraud) observations, but also because it more efficiently incorporates potentially

useful majority cases. By creating multiple prediction models that are based on different non-

overlapping subsets of majority observations, each prediction model is likely to differ somewhat

from the other prediction models. Importantly, patterns that are predictive of fraud are likely to

be present in multiple subsets. However, spurious patterns that exist by random chance in

individual subsets are unlikely to also exist in other subsets. By using a combination of these

models rather than a model built using a single data set, potentially important patterns are more

likely to be identified and estimated accurately (assuming that each model has a slightly different

estimate of the pattern). Additionally, when individual models are combined, spurious patterns

are likely to be discarded (or given less weight). This decreases the risk of overfitting, i.e., that

the prediction model has good in-sample performance but does not generalize to new

observations.

When applied in the fraud setting, OU first preprocesses the model building data by dividing

the data into multiple subsets, where each subset includes all fraud observations and a random

sample of non-fraud observations selected without replacement (Figure 1). Thus, all fraud

observations are included in all subsets while each non-fraud observation is part of at most one

subset. Each subset is then used in combination with a classification algorithm to build a fraud

prediction model.6 To perform fraud prediction, each prediction model is then applied to out-of-

sample data. For each observation in the out-of-sample data, the resulting model predictions are

6 Please refer to www.fraudpredictionmodels.com/ou for further details on how OU undersamples data.

averaged into an overall fraud probability prediction for the observation.7 For example, if OU is

implemented with 12 subsets, the method first creates 12 subsets as described above. Each

subset is then used to build a prediction model, for a total of 12 prediction models. The

prediction models are then applied to out-of-sample data, resulting in 12 fraud probability

predictions for each observation in the out-of-sample data. The probability predictions for each

observation are then combined by taking the average of the 12 probability predictions. Section

IV provides further details on how OU was implemented in the experiments.
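The preprocessing and combination steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the function names are ours, and the per-subset classifiers (support vector machines in the paper) are abstracted as callables that return a fraud probability.

```python
import random

def make_ou_subsets(fraud_obs, nonfraud_obs, n_subsets, fraud_ratio=0.20, seed=0):
    """Create OU training subsets: every subset contains all fraud observations
    plus a distinct random sample of non-fraud observations (drawn without
    replacement), sized so fraud makes up `fraud_ratio` of each subset."""
    rng = random.Random(seed)
    pool = list(nonfraud_obs)
    rng.shuffle(pool)
    # e.g., 51 fraud observations at a 20 percent ratio -> 204 non-fraud each
    per_subset = round(len(fraud_obs) * (1 - fraud_ratio) / fraud_ratio)
    return [list(fraud_obs) + pool[i * per_subset:(i + 1) * per_subset]
            for i in range(n_subsets)]

def combine_predictions(models, observation):
    """Average the fraud probability predictions of the per-subset models."""
    return sum(model(observation) for model in models) / len(models)
```

With 51 fraud and 15,934 non-fraud observations, 12 subsets of 255 observations each consume 2,448 distinct non-fraud observations, so no non-fraud observation appears in more than one subset.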

<Insert Figure 1 Here>

The curse of data dimensionality problem. According to the curse of data dimensionality,

data requirements increase exponentially with the number of explanatory variables in the dataset

(Bellman 1961).8 This is a potential problem in fraud prediction because the number of known

fraud cases is small relative to the extensive number of independent variables identified in prior

fraud research. Hence, only a small number of fraud observations are available to identify

patterns among the large number of independent variables and fraud. This may result in over-

fitted prediction models that perform poorly when predicting new observations.
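The exponential growth in data requirements can be illustrated with a toy calculation; the ten-interval grid and the 100-observation count are our assumptions for illustration only. Dividing each variable's range into ten intervals yields 10^d cells for d variables, so a fixed number of observations covers an exponentially shrinking fraction of the feature space:

```python
def best_case_coverage(n_obs, n_variables, intervals=10):
    """Largest fraction of grid cells that n_obs observations can occupy
    when each variable's range is split into `intervals` bins."""
    return min(1.0, n_obs / intervals ** n_variables)

# 100 observations cover at most this fraction of the feature space:
for d in (1, 2, 3, 5):
    print(d, best_case_coverage(100, d))
# 1 variable -> 1.0, 2 -> 1.0, 3 -> 0.1, 5 -> 0.001
```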

By using stepwise backward variable selection to build a parsimonious fraud prediction

model, Dechow et al. (2011) partially address the problem of data dimensionality in the fraud

context. However, while stepwise backward variable selection is designed to retain explanatory

7 This method has been found to perform well compared to more complex combiner methods (Duin and Tax 2000). Other
combiner methods, such as a Dempster-Shafer Fusion method, may be able to further improve the effectiveness of our proposed
methods; we encourage future research to examine this and other methods in more detail.
8 More specifically, when the number of explanatory variables increase, data used to fit models are spread across an increasingly

large feature space that grows exponentially with each additional explanatory variable, e.g., with one explanatory variable the
feature space is a line, with two variables the feature space is a plane, with three variables the feature space is a three-dimensional
space, etc. For example, with a dataset containing 50 fraud and 50 non-fraud observations and only one continuous explanatory
variable, the 100 observations are positioned on a line. If another variable is added, these same 100 observations are spread
across a two dimensional space. If a third variable is added, the 100 observations are spread within a three dimensional space.
For every variable that is added, the observations cover a smaller portion of the feature space. Thus, to cover a given percentage
of the feature space, the number of required observations would have to increase exponentially with the number of variables.

variables with the highest significance levels, it may discard potentially useful variables. We

build on Dechow et al. (2011) and introduce a new method that attempts to address the curse of

data dimensionality, while simultaneously retaining potentially useful explanatory variables. We

include the Dechow et al. (2011) model as a benchmark in our analyses.

To reduce the imbalance between minority fraud observations and the number of variables

identified in the literature to predict fraud, we design a new data rarity method, Multi-subset

Variable Undersampling (VU).9 VU randomly splits the set of explanatory variables without

replacement into different subsets (Figure 2). Each subset contains the same observations, but

different non-overlapping sets of explanatory variables. As with OU, each subset is then used in

combination with a classification algorithm to build a fraud prediction model that is applied to

out-of-sample data. For each observation in the out-of-sample data, the resulting model

predictions are then combined into an overall fraud probability prediction for the observation.
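The VU preprocessing step can be sketched as follows; this is an illustrative sketch with a function name of our choosing, and the per-subset classifiers are again abstracted away:

```python
import random

def make_vu_subsets(variables, n_subsets, seed=0):
    """Randomly partition the explanatory variables, without replacement,
    into n_subsets non-overlapping groups of near-equal size. Each group is
    then paired with the full set of observations to build one model."""
    rng = random.Random(seed)
    shuffled = list(variables)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

With the paper's 109 explanatory variables and 20 subsets, each subset receives five or six variables, and every variable appears in exactly one subset.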

<Insert Figure 2 Here>

Partitioning Fraud into Types


Managers commit financial statement fraud by manipulating specific accounts, e.g., they may

improve reported earnings by artificially increasing revenue or reducing expenses. Many

financial statement fraud variables used in the literature are inherently related to a specific type

of fraud. For example, abnormal revenue growth is a potential measure of revenue fraud while

9 In an attempt to further mitigate problems associated with having a small number of fraud observations to learn from, we
examine the usefulness of an observation oversampling method named SMOTE in fraud prediction. SMOTE was developed by
Chawla, Bowyer, Hall, and Kegelmeyer (2002) and performs well across multiple classification problems (e.g., Chawla et al.
2002; He and Garcia 2009). We perform two experiments to investigate (i) the number of fraud observations to use when
creating a new synthetic fraud observation and (ii) the oversampling ratio to use, which determines how many additional
synthetic fraud observations are generated. In the first experiment, untabulated results indicate that SMOTE only performs
significantly better than the benchmark (simple oversampling, i.e., duplication of fraud observations in the training data) in one
out of 27 comparisons. In the second experiment, we again fail to find a significant performance advantage for SMOTE relative
to simple oversampling. Finally, we implement SMOTE after partitioning the data on fraud types and find that this
implementation does not statistically differ from the original implementation of SMOTE. Based on the above results, we cannot
recommend SMOTE to address data rarity in the fraud context.

an abnormally low amount of allowance for doubtful accounts is a potential measure of expense

fraud. Although these variables may provide useful information about a specific type of fraud,

they are less likely to detect multiple types of fraud.10 When different fraud types are combined

into a binary classification problem, variables that are helpful when detecting a specific type of

fraud may be discarded if they do not do well in predicting fraud in general. For example, a

variable that provides a good signal about expense fraud but provides no useful information

about other types of fraud will only provide value when classifying expense fraud cases, which

in our sample is only about ten percent of the fraud cases. Additionally, by combining different

fraud types into a binary classification problem, the classification algorithms focus on finding

patterns common to all fraud types. Given heterogeneity among different fraud types, such

patterns may be difficult to detect.

To reduce the potential negative effects associated with combining different fraud types into

binary classification models, we implement VU by partitioning the independent variables based

on different fraud types (PVU).11 When implementing PVU, we place all variables that appear

to predict a specific fraud type into a separate variable subset. Variables that can be used to

predict multiple fraud types are placed in multiple subsets. This creates four subsets of variables

relating to revenue, expenses, assets, and liabilities (each subset is also restricted to fraud

observations that represent the associated fraud type). We also include three additional variable

subsets, because some fraud variables measure general attributes of fraud, such as incentives,

opportunities, or the aggregate effect of fraud. The first of these subsets includes all variables

10 Since accounting information is recorded using a double entry system, specific variables may capture the effect of multiple
fraud types.
11 Additionally, the use of multiple VU variable subsets that focus on different fraud types increases the likelihood that different

prediction models capture different fraud patterns, which improves diversity among the prediction models. Prediction model
diversity is important for performance when combining multiple models (Kittler et al. 1998). We do not modify OU based on
different fraud types because OU only undersamples the non-fraud data and does not preprocess the fraud data.

not categorized as a specific fraud type variable. The second subset includes the variables used

in Dechow et al. (2011). These variables are included for their utility in binary fraud prediction.

The third subset includes all variables and is created to allow the classifiers to find patterns

among both fraud type specific and non-fraud type specific variables.
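The construction of the seven PVU variable subsets can be sketched under stated assumptions: the variable-to-fraud-type mapping used below is hypothetical, the function name is ours, and the paper's additional restriction of each fraud-type subset to fraud observations of the matching type is omitted for brevity.

```python
FRAUD_TYPES = ("revenue", "expense", "asset", "liability")

def make_pvu_subsets(variable_fraud_types, dechow_variables):
    """Build the seven PVU variable subsets.
    variable_fraud_types: dict mapping each variable name to the set of
    fraud types it is thought to predict (empty set = general variable)."""
    subsets = {t: [] for t in FRAUD_TYPES}
    subsets["general"] = []                      # variables tied to no specific type
    for var, types in variable_fraud_types.items():
        if not types:
            subsets["general"].append(var)
        for t in types:
            subsets[t].append(var)               # a variable may join several subsets
    subsets["dechow"] = list(dechow_variables)   # the Dechow et al. (2011) variables
    subsets["all"] = list(variable_fraud_types)  # one subset with every variable
    return subsets
```

For example, a variable mapped to both revenue and expense fraud (such as an accruals measure) would appear in both of those subsets as well as in the all-variables subset.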

III. DATA AND EXPERIMENTAL DESIGN


Sample Data
We obtain a sample containing 51 fraud firms12 and 15,934 non-fraud firm years from Perols

(2011). We only include one firm year for each fraud observation that corresponds to the first

year that the Accounting and Auditing Enforcement Release (AAER) alleges that fraud was

committed. We do not include previous years as the fraud may have predated the reported first

fraud year. We do not include multiple fraud years for each fraud firm to prevent a single fraud

firm from being included in both the model building dataset and the out-of-sample model

evaluation dataset.

Perols (2011) identifies fraud firms in SEC investigations reported in AAERs between 1998

and 2005 that explicitly reference Section 10(b) Rule 10b-5 (Beasley 1996) or contain

descriptions of fraud. This fraud firm dataset excludes: financial firms; firms without the first

fraud year specified in the SEC release; non-annual financial statement fraud; foreign firms;

releases related to auditors; not-for-profit organizations; fraud related to registration statements,

10-KSB or IPO; and firms with missing Compustat (financial statement data), Compact D/SEC

(executive and director names, titles and company holdings), or I/B/E/S data (one-year-ahead

analyst earnings per share forecasts and actual earnings per share) in relevant years.13 Randomly

12 This sample size of 51 fraud firms is comparable to other fraud studies (e.g., Beasley 1996, Erickson et al. 2006; Brazel et al.
2009). Other research (e.g., Dechow et al. 2011) uses AAERs to create samples focused on material misstatements. Material
misstatement data include firms with AAERs that explicitly allege fraud as well as other firms that describe a material
misstatement without explicitly alleging fraud. While such samples are larger, they do not necessarily focus on fraud.
13 Since we add additional variables to the Perols (2011) dataset, some of the variables have missing values. Missing values are

replaced by global means/modes. The effect of this is a reduction in the utility of variables that have many missing values.

selected Compustat non-fraud firms (excluding observations following the applicable criteria

specified for fraud firms above) are added to the fraud firm dataset to create a sample with 0.3

percent fraud firms, which allows us to examine the robustness of the results around best

estimates of prior fraud probability, i.e., 0.6 percent (Bell and Carcello 2000), in the population

of interest. We include explanatory variables (summarized in Appendix A) that have been used

in recent literature to predict fraud or material misstatements (Cecchini et al. 2010; Dechow et al.

2011; Perols 2011). More specifically, we include all variables from Perols (2011) and all

variables from the final Dechow et al. (2011) model that can be calculated using Compustat data.

Following and extending Cecchini et al. (2010), we also include 48 variables measuring levels

and changes in levels, percentage changes in levels, and abnormal percentage changes of

commonly manipulated financial statement items and ratios.
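To make this construction concrete, the following sketch computes the four variants for a single financial statement item. The function and variable names are illustrative, and the benchmark used for the abnormal change (an industry-mean percentage change) is an assumption; the text does not specify the exact benchmark.

```python
# Illustrative sketch: level/change-based predictors for one financial
# statement item (e.g., receivables), in the spirit of the construction
# described above. Names and the "abnormal" benchmark are assumptions.

def build_item_features(current, prior, industry_mean_pct_change):
    """Return level, change, percentage change, and abnormal percentage
    change for a single item, given current- and prior-year values."""
    level = current
    change = current - prior
    pct_change = (current - prior) / abs(prior) if prior != 0 else 0.0
    # "Abnormal" is taken here as the firm's percentage change minus an
    # assumed industry-mean benchmark.
    abnormal_pct_change = pct_change - industry_mean_pct_change
    return {"level": level, "change": change,
            "pct_change": pct_change,
            "abnormal_pct_change": abnormal_pct_change}

features = build_item_features(current=120.0, prior=100.0,
                               industry_mean_pct_change=0.05)
```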

Experimental Design
Overview of the experiments. As summarized in Table 1, we perform multiple experiments

to (i) determine how to best implement OU and VU (e.g., how many subsets to use) and (ii)

evaluate their relative performance compared to various benchmarks. The primary objective in

these experiments is to detect trends that indicate how to implement the methods in future

research. By detecting clear trends between the number of subsets and predictive ability rather

than selecting implementations that happen to be the most predictive, we reduce the risk that we

recommend implementations that perform well on our test data, but do not generalize well.

In experiment 1, we use OU to create observation subsets that contain all fraud observations

and random samples of non-fraud observations that yield 20 percent fraud observations per

subset. In an evaluation of simple undersampling ratios, Perols (2011) finds that this ratio

provides relatively good performance. We then evaluate how many observation subsets to

include when implementing OU. In experiment 2a, we use VU to randomly divide the variables

used in prior fraud prediction research into 20 subsets. We then assess how many variable

subsets to include when implementing VU. In experiment 2b, we examine whether the number

of variables included in each subset affects performance by dividing the total number of

variables into subsets as follows: one subset with all variables, two subsets each with one-half of

the variables, four subsets each with one-quarter, six subsets each with one-sixth, eight subsets

each with one-eighth, etc. We then evaluate how many variables per subset to include when

implementing VU. Finally, in experiment 3, we evaluate the performance of VU when

independent variables are grouped together based on their relation to specific types of fraud.
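The two subset schemes can be sketched as follows. `draw_fixed_size_subsets` mirrors experiment 2a (a fixed number of variables per subset, drawn at random), while `partition_variables` mirrors experiment 2b (all variables divided evenly across the subsets). Variable names are illustrative, not from our data.

```python
# Illustrative sketch of variable undersampling (VU) subset creation.
import random

def partition_variables(variables, k, seed=0):
    """Experiment 2b style: shuffle the variable names and split them
    into k near-equal, non-overlapping subsets."""
    rng = random.Random(seed)
    shuffled = variables[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def draw_fixed_size_subsets(variables, k, size, seed=0):
    """Experiment 2a style: draw k random subsets of a fixed size
    (a variable may appear in more than one subset)."""
    rng = random.Random(seed)
    return [rng.sample(variables, size) for _ in range(k)]

vars_ = [f"x{i}" for i in range(40)]  # placeholder variable names
parts = partition_variables(vars_, 4)
```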

<Insert Table 1 Here>

After selecting what appear to be robust implementations, we determine whether these

implementations outperform assorted benchmarks in predicting fraud. Because we introduce OU

to the fraud detection literature to reduce the imbalance between the number of fraud versus non-

fraud observations, we use simple undersampling as a benchmark (Perols 2011) for OU.14 This

benchmark randomly removes non-fraud observations from the sample to generate a more

balanced model-building sample. OU and the OU benchmarks use all variables (as independent

variable reduction is examined in the VU analysis) and are implemented using support vector

machines, following recent fraud data analytics research (e.g., Cecchini et al. 2010; Perols 2011).

We introduce VU (and PVU) as an independent variable (data dimensionality) reduction

method that has the potential to improve the performance over currently used variable selection

methods. As a baseline we include a benchmark (the Dechow benchmark) that uses the

independent variables from model 2 in Dechow et al. (2011). We also use (i) a benchmark that

randomly selects variables and (ii) a benchmark that includes all variables (the all variables

14We also used no undersampling as an additional benchmark. However, because simple undersampling performs better than
no undersampling by 7.3 percent, we adopted simple undersampling as the benchmark.

benchmark) where data dimensionality is not reduced. The benchmark that randomly selects

variables performs better than both the Dechow benchmark and the all variables benchmark.15

Thus, we report our VU (and PVU) results using the benchmark that randomly selects variables.

VU, PVU, and their benchmarks use all observations (observation undersampling is examined in

the OU analysis) and are implemented using support vector machines.

10-fold cross-validation. Out-of-sample performance measures are generally preferred over in-sample performance measures because they provide a more realistic assessment of prediction performance than the measures commonly used in economics (Varian 2014, 7); cross-validation is particularly useful in this regard. We use stratified 10-fold cross-validation, where 10 folds (i.e.,

subsamples of observations) are generated using random sampling without replacement. The 10

folds rotate between being used for training and testing the prediction models. In each rotation,

nine folds are used for training (i.e., model building) and one fold is used for testing (i.e., model

evaluation). For example, in the first round, subsets one through nine are used for training and

subset 10 is used for testing; in round two, subsets one through eight and subset 10 are used for

training, and subset nine is used for testing. By using stratified cross-validation, we ensure that

the ratio of fraud to non-fraud observations is kept consistent across the training and test sets in

the different rounds. With a total of 51 fraud firms in the sample, 45 or 46 fraud firms are used

for model building and five or six fraud firms are used for model evaluation in each cross-

validation round. In our experiments, the OU and VU methods are only applied to training data.
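A minimal sketch of such a stratified split follows (illustrative, not the implementation used in the paper). With 51 fraud observations dealt across 10 folds, each fold receives five or six fraud firms, as noted above.

```python
# Illustrative sketch: stratified 10-fold assignment drawn without
# replacement, keeping the fraud/non-fraud ratio stable across folds.
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of observation indices, stratified on label."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal indices round-robin
    return folds

# 51 fraud firms as in the sample; the non-fraud count is illustrative.
labels = [1] * 51 + [0] * 949
folds = stratified_folds(labels, 10)
```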

Prediction performance metric. Following prior financial statement fraud research (e.g.,

Beneish 1997; Feroz, Kwon, Pastena, and Park 2000; Lin et al. 2003), we use expected cost of

misclassification (ECM) as the preferred performance metric. ECM allows the researcher to

15The Dechow benchmark performed 0.02 percent better than the all variables benchmark and the random variable selection
benchmark performed 3.87 percent better than the Dechow benchmark.

vary two important parameters in evaluating the prediction models' performance on out-of-

sample data: (i) estimated percentage of fraud firms in the population of interest and (ii)

estimated ratio of the cost of a false negative to the cost of a false positive in the population of

interest. Including both parameters is important in settings such as fraud prediction that are

characterized by relative rarity and uneven misclassification costs. Given specific classification

results, ECM is calculated as follows:

ECM = CFN x P(Fraud) x nFN / nP + CFP x P(Non-Fraud) x nFP / nN (1)

where CFP and CFN are estimates of the cost of false positive and false negative classifications,

respectively, deflated by the lower of CFP or CFN; P(Fraud) and P(Non-Fraud) are estimates of

prior probability of fraud and non-fraud, respectively; nFP and nFN are the number of false

positive and false negative classifications, respectively, on the cross-validation test data;16 and nP

and nN are the number of fraud and non-fraud observations, respectively, in the cross-validation

test data. Bayley and Taylor (2007) estimate that actual cost ratios (FN to FP cost) average

between 20:1 and 40:1, while Bell and Carcello (2000) estimate that approximately 0.6 percent

of all firm years represent detected fraud. Thus, in experiments that compare model prediction performance at best estimates of prior fraud probability and cost ratios, we calculate ECM at a cost ratio of 30:1 and a prior fraud probability of 0.6 percent (together with the prediction models' actual false positive and false negative rates). The goal of the prediction

models is to minimize ECM.
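Equation (1) can be sketched directly. The classification counts below are illustrative, while the 30:1 cost ratio and 0.6 percent prior follow the best estimates above.

```python
# Illustrative sketch of equation (1): expected cost of misclassification.

def ecm(c_fn, c_fp, p_fraud, n_fn, n_fp, n_pos, n_neg):
    """ECM = CFN x P(Fraud) x nFN/nP + CFP x P(Non-Fraud) x nFP/nN,
    with costs deflated by the lower of CFN and CFP (so the smaller
    cost equals one)."""
    deflator = min(c_fn, c_fp)
    c_fn, c_fp = c_fn / deflator, c_fp / deflator
    return (c_fn * p_fraud * n_fn / n_pos
            + c_fp * (1.0 - p_fraud) * n_fp / n_neg)

# Hypothetical test-fold results: 5 fraud firms (1 missed) and
# 995 non-fraud firms (20 flagged), at the best-estimate parameters.
score = ecm(c_fn=30, c_fp=1, p_fraud=0.006, n_fn=1, n_fp=20,
            n_pos=5, n_neg=995)
```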

Experimental procedure: OU example. The following example provides a summary step-

by-step description of the experimental procedures using OU (see Figure 3).

16Following prior research (e.g., Beneish 1997; Feroz, Kwon, Pastena, and Park 2000; Lin et al. 2003), nFP and nFN are obtained
using optimal fraud classification thresholds (e.g., probability cutoffs for classifying an observation as fraud or non-fraud) for
each combination of prior fraud probability and cost ratio. These optima are established by examining ECM scores using all
unique fraud probability predictions as potential thresholds.

1) The full sample is first separated into model-building data (a.k.a., training data) and model-
evaluation data (a.k.a., test data) using 10-fold cross-validation.
2) For each cross-validation round and OU implementation, the OU method is applied to the
training data (but not the test data, which is left intact) to partition the training data into OU
subsets. For example, in the first cross-validation round when evaluating the OU
implementation with 12 subsets, the OU method creates 12 subsets of the first training set.
3) A classification algorithm is used with each OU training subset generated in step 2 to build
one prediction model for each OU subset. For example, in OU with 12 subsets, a total of 12
prediction models are generated.
4) The test set, which was not modified using the OU method, is applied to each of the
prediction models generated in step 3.
5) For each observation in the test set, the probability predictions from each prediction model
are averaged. After combining the probability predictions, each observation in the test set
has a single probability prediction representing the average prediction of all the prediction
models developed in step 3.
6) The probability predictions along with the class labels (e.g., fraud or non-fraud) are used to
calculate ECM scores. When calculating ECM scores, optimal fraud classification thresholds
(cutoffs) are first determined for each combination of prior fraud probability and cost ratio
by examining ECM scores at different classification threshold levels (Beneish 1997).
Optimal thresholds are then used to calculate ECM scores for each combination of prior
fraud probability and cost ratio for that specific test dataset.
7) The experimental procedure repeats steps two through six for each cross validation round and
each OU implementation, e.g., OU with two subsets, OU with three subsets, etc., within each
cross validation round.
8) After completing all ten rounds, each OU implementation has ten ECM scores (one for each
test set) for each prior fraud probability and cost ratio combination. Averages of the ten ECM
scores are then used to examine prediction performance of different OU implementations and
against the benchmarks at different prior fraud probability and cost ratio levels.
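The core OU mechanics in the steps above, subset creation (step 2) and probability averaging (step 5), can be sketched as follows. The classifier itself is left abstract (a stand-in for the support vector machines used in the paper), and all names are illustrative.

```python
# Illustrative sketch of multi-subset observation undersampling (OU).
import random

def make_ou_subsets(fraud_idx, nonfraud_idx, n_subsets,
                    fraud_share=0.20, seed=0):
    """Step 2: each subset keeps ALL fraud observations plus enough
    randomly drawn non-fraud observations to make fraud roughly
    fraud_share of the subset."""
    rng = random.Random(seed)
    n_nonfraud = int(len(fraud_idx) * (1 - fraud_share) / fraud_share)
    return [fraud_idx + rng.sample(nonfraud_idx, n_nonfraud)
            for _ in range(n_subsets)]

def ensemble_predict(models, x):
    """Step 5: average the probability predictions of the per-subset
    models (each model maps an observation to a fraud probability)."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

fraud_idx = list(range(46))          # ~46 fraud firms in a training fold
nonfraud_idx = list(range(46, 946))  # illustrative non-fraud indices
subsets = make_ou_subsets(fraud_idx, nonfraud_idx, n_subsets=12)
```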

<Insert Figure 3 Here>

IV. RESULTS
Main Results
Figures 4-6 summarize the performance results of different OU and VU implementations.

For each implementation, the results represent the average expected cost of misclassification

(ECM) from ten test folds. ECM is reported at the best estimates of (i) prior fraud probability,

i.e., 0.6 percent, and (ii) false negative to false positive cost ratios, i.e., 30:1. The results are

presented as the percentage difference in ECM between each OU and VU implementation and

their respective benchmarks.17 Given that each figure is plotted using a single benchmark that is

held constant across different implementations, we first use the figures to look for clear trends

that indicate how to implement OU and VU, respectively. We then compare the performance of

selected implementations to their respective benchmarks.

Multi-subset Observation Undersampling (OU): Experiment 1. Figure 4 (with supporting

details in Table 2) presents the performance results of OU relative to the best performing OU

benchmark (i.e., simple undersampling) as the number of subsets in OU increases. Our results

indicate that the benefit provided by OU initially increases as additional subsets are used but

remains relatively constant after 10 subsets.

<Insert Figure 4 and Table 2 Here>

Figure 4 also includes the corresponding results from two sensitivity analyses, i.e., the

experiments in which the subsets are selected in a different order and the random selection of

non-fraud cases is repeated. The results across all three versions of experiment 1 are similar in

that each shows a performance benefit from using OU that initially increases in the number of

OU subsets, but starts to plateau after about 10 subsets. These results indicate that the marginal

17Reported p-values are based on pairwise t-tests using the average and standard deviation of ECM scores across the ten test
folds and are one-tailed unless otherwise noted. Assumptions related to normality and independent observations are unlikely to
be satisfied, and p-values are only included as an indication of the relation between the magnitude and the variance of the
difference between each implementation and the respective benchmarks.

performance benefit from adding subsets declines as new subsets become less and less likely to

contain information not already included in the prior subsets.

Taken together, these experiments indicate that OU provides performance benefits and that

the number of subsets to include in OU is relatively consistent in the fraud setting. In an attempt

to balance performance benefits (we want to include enough subsets to make sure that we have

reached the performance plateau) with analysis costs (given that we have reached the plateau, we

want to keep the number of subsets low since adding additional subsets increases processing

costs), we include 12 subsets in OU in subsequent experiments and label this configuration

OU(12). This configuration lowers the expected cost of misclassification in the primary analysis

by 10.8 percent (p = 0.003) relative to the best performing OU benchmark.18

Multi-subset Variable Undersampling (VU): Experiment 2. Figure 5 presents the

performance of VU relative to the best performing VU benchmark (i.e., random selection of

explanatory variables) as the number of subsets in VU increases. As summarized in Table 1, we

examine two versions. The dashed line shows the results when the number of variables in each

subset remains constant per experimental round (Experiment 2a). The round dotted line shows

the results when all variables are included and divided evenly across the subsets in each

experimental round (Experiment 2b).

18 OU, which uses all variables and under-sampled non-fraud firm observations (across multiple subsets), appears to improve
performance in two ways. First, simple undersampling improves performance over no undersampling by 7.3 percent. Second,
OU(12) further improves the performance over simple undersampling by another 10.8 percent. This indicates that OU improves
performance relative to the benchmarks that use all observations (i) because it undersamples observations, but more importantly
(ii) because of the way it undersamples these observations. That is, it creates multiple subsets including non-overlapping non-
fraud observations. This suggests that OU creates diverse models using different subsets. To better understand the source of this
diversity, i.e., if using different observations in the subsets allows OU to obtain more robust parameter estimates of a subset of
important variables or if different variables are emphasized in the different models, we perform an additional comparison. This
supplemental analysis indicates that OU(12) with all variables (as implemented in the paper) performs 7.0 percent better than
OU(12) with only the Dechow variables, which in turn performs 11.1 percent better than the Dechow benchmark that uses all
observations. The improvement in the Dechow benchmark when combined with OU(12) suggests that some performance benefit
is obtained by OU(12) creating more robust parameter estimates. The additional performance benefit of OU(12) with all
variables over OU(12) with only the Dechow variables (together with results in footnote 19), indicates that different models at
least partially rely on different variables. OU thus appears to improve performance by generating more robust parameter
estimates and by emphasizing different variables in different models.

<Insert Figure 5 Here>

When the number of variables is kept constant in each subset (the dashed line), the

performance of VU increases as additional variable subsets are included, plateauing at about 11

subsets, and then decreasing at 19 subsets. However, even at the plateau (VU with 11 to 18

subsets), the performance difference between VU and the benchmark only approaches statistical

significance (p = 0.125 on average). In addition, the jagged line indicates that VU is sensitive to

the usefulness of the individual explanatory variables in each additional subset.

When all available variables are divided into the selected subsets (the round dotted line), VU

does not provide a performance benefit relative to the random variable selection benchmark.

Consistent with the results from the analysis where the number of variables is kept constant in

each subset, these results indicate that the performance of VU is dependent on the specific

variables included in each subset. This second VU experiment also emphasizes the importance

of how variables are grouped together.

Multi-subset Variable Undersampling Partitioned on Fraud Types (PVU): Experiment 3.

The VU results discussed above suggest that a more deliberate partitioning of variables may be

important. We earlier argued that fraud consists of multiple types (e.g., revenue vs. expense

fraud) and that it might be beneficial to partition the explanatory variables with this in mind. Our

results for PVU support this conjecture. More specifically, in untabulated results, PVU lowers

the expected cost of misclassification by 9.6 percent (p = 0.019) relative to the best performing

VU benchmark.19

19 To better understand why PVU (and VU) improves performance over the benchmarks, we first note that the small performance
difference (0.02 percent) between the all variables benchmark (that uses all observations and all variables) and the Dechow
benchmark (that uses all observations and a subset of variables as selected in Dechow et al. 2011) suggests that performance does
not improve by simply adding more variables. Given that VU (as well as PVU that performs even better) improves performance
relative to the all variables benchmark by 7.2 percent, it appears that the segmentation of the variables rather than the inclusion of
additional variables contributes to the performance improvement. Additionally, because PVU performs 6.3 percent better than
VU, it appears that how the variables are segmented matters.

Additional Analyses
Further validation using misstatement data. We use the observations in a material

misstatement dataset that is an expanded version (additional years) of the data used in Dechow et

al. (2011) to perform three additional analyses. This dataset is available from the Center for

Financial Reporting and Management at the University of California, Berkeley and includes the

fraud firms used in our primary dataset as well as additional material misstatement firms reported

in AAERs by the SEC.20 Unless otherwise noted, the prediction models are implemented using

the same variables as in the main experiments (e.g., OU is implemented using all variables) and

we use the Dechow benchmark given that these data are based on Dechow et al. (2011). To

evaluate predictive performance, we again use 10-fold cross validation. Further, due to a lack of

good estimates of prior probabilities and cost ratios for material misstatements, we use a

performance metric known as the area under the Receiver Operating Characteristic (ROC) curve, or simply AUC.21

The first analysis provides further validation of out-of-sample prediction performance of the

proposed methods and compares OU and PVU to the Dechow benchmark when using the

20 We exclude firms from the finance industry and, following Dechow et al. (2011), add all Compustat non-fraud firms in the
same year and industry as the fraud firms. We do, however, only include the first fraud year, i.e., we do not include multiple
years for each fraud firm, due to the potential bias introduced when including fraud firm years. We also follow the procedure
used in Dechow et al. (2011) to eliminate observations with missing values in one or more of the variables included in the
Dechow benchmark. We use mean replacement to handle missing values in the remaining variables. We also perform the
analyses reported in this section after eliminating all observations with one or more missing values. Before performing this
elimination, we remove six variables with over 25 percent missing values: abnormal change in order backlog, allowance for
doubtful accounts, allowance for doubtful accounts to accounts receivable, allowance for doubtful accounts to net sales, expected
return on pension plan assets, and change in expected return on pension plan assets.
21 While ECM is a preferred performance metric when prior probabilities and cost ratios are known, AUC is preferred over other performance measures in settings with unknown error costs and prior probabilities (Provost, Fawcett, and Kohavi 1998). AUC has become the de facto standard performance measure in machine learning research and has also been used in accounting research (e.g., Larcker and Zakolyukina 2012). A single ROC curve is generated for each predicted evaluation dataset by changing the classification threshold and then plotting the true positive rate (the proportion of positive cases classified correctly among all positive cases) against the false positive rate (the proportion of negative cases classified incorrectly among all negative cases). ROC curves thus depict the trade-off, as the classification threshold decreases, between classifying additional positive cases correctly and the cost of classifying additional negative cases incorrectly. Alternatively, they show how well the prediction model ranks the evaluation dataset observations. The area under the ROC curve (AUC) provides a numeric value of this trade-off and represents the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance. An AUC of 0.5 is equivalent to a random rank order, while an AUC of 1 represents a perfect ranking of the evaluation cases.

observations in the material misstatement data. This analysis also provides insight into the

usefulness of the proposed methods in a slightly different setting (material misstatements vs.

fraud). Results in Table 3 suggest that OU and PVU (panel A) continue to improve performance

over the Dechow benchmark when using material misstatement data, now by 16.9 (p = 0.004)

and 26.0 (p < 0.003) percent, respectively.

The second analysis examines the sensitivity of the results to the classification algorithm

used. This analysis evaluates the performance of the methods when combined with logistic

regression and bootstrap aggregation instead of support vector machines (used in all other

analyses). Results in Table 3 (panel B) show that the performance of OU and PVU is

consistent (OU more so than PVU) across the different classification algorithms. The

performance of the Dechow benchmark, however, appears to be sensitive to the classification

algorithm used. More importantly, OU and PVU perform significantly better than the Dechow

benchmark across all of the different classification algorithms. OU improves the performance

over the benchmark by 3.6 (p = 0.004) and 50.5 (p = 0.003) percent when logistic regression and

bootstrap aggregation are used, respectively.22 Similarly, PVU improves the performance over

the benchmark by 7.8 (p < 0.001) and 36.0 (p < 0.001) percent when logistic regression and

bootstrap aggregation are used, respectively. These results suggest that the performance benefits

from using OU and PVU are robust to the specific classification algorithm used.23

22 The difference between OU and the Dechow benchmark when using logistic regression does not appear to be as strong as that
suggested in the main experiment using fraud data. In the main experiment, we used the same classification algorithm (support
vector machines) for all methods and benchmarks to maintain internal validity and to avoid making the experiments overly
complex. To evaluate the effect of a potential bias against the Dechow benchmark associated with this decision, we examine the
performance of the Dechow benchmark in the main experiment with logistic regression instead of support vector machines. The
results indicate an insignificant difference between the two implementations (p = 0.984, two-tailed), and this result is robust
across different prior probability levels and cost ratios. Thus, the decision to use support vector machines for all methods and
benchmarks does not appear to have biased the results against the Dechow benchmark.
23 These results are also robust to an additional analysis using a sample that excludes all variables with over 25 percent missing

values and all observations with one or more missing values in remaining variables. We also performed some limited analysis
using boosting, and OU and PVU continue to outperform the Dechow benchmark by 44.8 (p < 0.001) and 5.1 (p = 0.044)
percent, respectively. However, the performance of both PVU and the Dechow benchmark fell considerably (while the

<Insert Table 3 Here>

The third analysis provides insight into (i) the usefulness of OU when used in combination

with a different set of independent variables (based on the financial kernel of Cecchini et al.

2010) and (ii) whether OU provides incremental predictive power when used in combination

with this kernel. Cecchini et al. (2010) based their financial kernel on 23 financial statement

variables commonly used to construct independent variables for fraud prediction models. The

financial kernel forms ratios between each pair of the 23 original variables, in both the current year and the prior year, and calculates changes in these ratios. Both current and lagged ratios as well

as their changes are then used to construct a dataset with 1,518 independent variables.
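A sketch consistent with that variable count follows: for every ordered pair (i, j) of the 23 items, it forms the current-year ratio, the prior-year ratio, and the change in the ratio, giving 23 x 22 x 3 = 1,518 features. This is an illustration of the construction only; the exact transformations in Cecchini et al. (2010) may differ in detail, and item names are placeholders.

```python
# Illustrative sketch of a financial-kernel-style feature set built
# from 23 raw financial statement items.

def kernel_features(current, prior):
    """current, prior: dicts mapping the 23 item names to values.
    Real data would also need handling for zero denominators."""
    names = sorted(current)
    feats = {}
    for i in names:
        for j in names:
            if i == j:
                continue
            r_now = current[i] / current[j]    # current-year ratio
            r_prev = prior[i] / prior[j]       # prior-year ratio
            feats[f"{i}/{j}"] = r_now
            feats[f"lag({i}/{j})"] = r_prev
            feats[f"d({i}/{j})"] = r_now - r_prev  # change in the ratio
    return feats

items = {f"v{k}": float(k + 1) for k in range(23)}  # placeholder items
feats = kernel_features(items, items)
```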

We use the same initial set of observations used in the previous analysis and recreate the

financial kernel following Cecchini et al. (2010). We also follow their procedures and exclude

all observations with missing values. We then compare OU implemented with the variables in

the financial kernel to the Cecchini benchmark, which uses the financial kernel but does not

undersample observations (both implementations use support vector machines). We do not

attempt to implement PVU, as it is not clear how we would separate the 1,518 variables into

different fraud types. Results in Table 3 (panel C) indicate that OU (AUC = 0.67) outperforms

the financial kernel (AUC = 0.59) in misstatement prediction by 14.2 percent (p = 0.004).24

Combining the Methods. We analyze whether various combinations of OU, PVU, and

SMOTE (see footnote 9) provide additional performance benefits compared to OU(12), the best

performing individual method. Figure 6 plots the performance difference of various method

combinations compared to OU(12) at different cost ratios. The selection of the specific

performance of OU only fell slightly) when using boosting. Similarly, we performed some limited experiments using Bayesian
learning, but the performance of all three methods fell drastically. Thus, boosting and Bayesian learning do not appear to be
viable options, and we do not tabulate these results.
24 When including fraud firm years, OU performs 5.7 percent (p < 0.001) better than the Cecchini benchmark and both

approaches have high AUC values (AUC = 0.863 and AUC = 0.816, respectively).

configurations used in these combinations is based on their general performance in the previous

experiments. The combinations are generated by creating prediction models using OU and PVU separately and then averaging the predictions from the OU and PVU prediction models.25

In untabulated results, the three-method combination does not perform significantly differently from OU(12) (p = 0.465) at best estimates of prior fraud probability and cost ratios. Similarly, the two-method combination of OU(12) and PVU also does not perform significantly differently from OU(12) (p = 0.421) at best estimates of prior fraud probability and cost ratios. Thus, in

typical fraud prediction research settings, we recommend using OU(12). However, the two- and

three-method combinations provide performance benefits over OU(12) at higher cost ratios and

higher prior fraud probability levels (see Figure 6). Given that the combination of OU(12) and

PVU either performs significantly better than or not significantly differently from OU(12), we

recommend using this combination of the two methods if maximizing predictive ability is more

important than minimizing implementation costs. For example, when the SEC uses a prediction

model to help decide which firms to investigate for potential fraud, the additional

implementation costs associated with using the combination are likely to be small relative to the

costs of misclassifying a non-fraud firm and using resources to investigate the firm (and even

more so relative to misclassifying a fraud firm and not detecting the fraud).

<Insert Figure 6 Here>

Using OU to Explore Robustness of Independent Variables. Fraud research often seeks to

identify new explanatory variables to improve fraud prediction. Traditionally, this research uses

the entire sample (i.e., all observations) or a single matched sample to evaluate the significance

of one or more independent variables that are hypothesized to be associated with the dependent

25 SMOTE is incorporated by oversampling the data used by OU and PVU. We also first create the OU subsets and then apply
SMOTE and PVU to these subsets, but this more integrated and complex combination does not improve performance further.

variable. However, the predictive performance benefits of OU reported earlier suggest that

classification algorithms (e.g., logistic regression) recognize different fraud patterns when

trained on different subsets of non-fraud firms. Thus, when evaluating explanatory variables in

hypothesis testing research, it may be important to consider the robustness of results across

different subsamples of the original data.

As an example, we perform an analysis that compares traditional hypothesis-testing results

(full sample) to results from implementing OU (summary of the 12 OU subsamples). The

example uses data from the additional analyses that examine misstatement data. In this example

we examine the significance of Sales to Employees given a set of control variables selected

based on prior research (the control variables in this example were selected using step-wise

backward feature selection). Traditionally, the hypothesis would be tested using all observations

in the sample, i.e., the full sample. The results for the full sample in Table 4 indicate that the

hypothesis is supported (p = 0.0116). However, the OU subsample analyses indicate that this

result might not be robust. For example, the average p-value of all Sales to Employees estimates

across the 12 models obtained using different sub-samples is p = 0.180 and the p-value is above

0.05 in four of the 12 models. Please see fraudpredictionmodels.com/ou/hypothesis_testing for

details on how to perform this analysis.
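The robustness check can be sketched as follows. Here `fit_and_get_pvalue` is a placeholder for the researcher's estimation routine (e.g., a logistic regression that returns the p-value of the coefficient of interest), and the p-values below are constructed to echo the Sales to Employees example (a mean of 0.180 and four of 12 values above 0.05); they are not the actual estimates.

```python
# Illustrative sketch: summarizing a coefficient's p-values across
# the 12 OU subsample models.
import statistics

def ou_robustness(subsamples, fit_and_get_pvalue, alpha=0.05):
    """Re-estimate the model on each subsample and summarize the
    p-value of the variable of interest across the resulting models."""
    pvals = [fit_and_get_pvalue(s) for s in subsamples]
    return {"mean_p": statistics.mean(pvals),
            "median_p": statistics.median(pvals),
            "n_above_alpha": sum(p > alpha for p in pvals)}

# Constructed p-values standing in for 12 subsample estimates; the
# identity lambda stands in for the actual estimation routine.
pvals = [0.01, 0.02, 0.02, 0.03, 0.04, 0.04,
         0.05, 0.05, 0.40, 0.45, 0.50, 0.55]
summary = ou_robustness(pvals, lambda p: p)
```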

Results in Table 4 suggest that OU yields similar results to the traditional hypothesis testing

analysis, i.e., the most significant variables in the traditional approach tend to be the most

significant in the OU analysis. However, the OU results are generally more conservative. For

example, in only two cases are the median p-values from the OU results numerically smaller

(more significant) than the corresponding parametric result. For 12 of 17 variables, the median p-

values are numerically larger (less significant) than their parametric counterparts. Thus, we

encourage future research to consider applying OU as a robustness check for hypothesis testing.26

<Insert Table 4 Here>

V. DISCUSSION, RECOMMENDATIONS, AND FUTURE RESEARCH


Financial statement fraud is a costly problem that has far-reaching negative consequences.

Hence, the accounting literature investigates a wide range of explanatory variables and various

classification algorithms that contribute to more accurate prediction of fraud and material

misstatements. However, the rarity of fraud data, the relative abundance of variables identified in

prior literature, and the broad definition of fraud create challenges in specifying effective

prediction models.

Research in the emerging field of data analytics has been applied successfully in other

settings constrained by data rarity, such as predicting credit card fraud (Chan and Stolfo 1998).

We, therefore, follow the call of Varian (2014) to apply recent advances in data analytics in other

settings and investigate the ability of methods drawn from data analytics to improve fraud

prediction. We first use Multi-subset Observation Undersampling (OU) to investigate

undersampling of non-fraud observations to establish a more effective balance with scarce fraud

observations. When used with 12 subsamples, this method improves fraud prediction by

lowering the expected cost of misclassification by more than ten percent relative to the best

performing benchmark. This method is also both efficient and relatively easy to implement.
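A minimal sketch of the OU prediction procedure (illustrative Python on synthetic data; the classifier choice, data sizes, and averaging rule are assumptions for exposition only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic imbalanced data: fraud is the rare positive class.
X = rng.normal(size=(1250, 5))
y = np.r_[np.ones(50), np.zeros(1200)]
X[y == 1] += 0.4                     # plant a weak signal for the fraud class

fraud_idx = np.flatnonzero(y == 1)
nonfraud_idx = rng.permutation(np.flatnonzero(y == 0))

# OU: every subset keeps all fraud observations and gets a disjoint slice of
# non-fraud observations; one model is trained per subset and the predicted
# fraud probabilities are averaged across the 12 models.
models = []
for part in np.array_split(nonfraud_idx, 12):
    idx = np.r_[fraud_idx, part]
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    models.append(m)

X_new = rng.normal(size=(3, 5))      # out-of-sample firms to score
p_fraud = np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
print(p_fraud)
```

Each subset is roughly class-balanced, so each model sees the scarce fraud cases against a different slice of non-fraud firms.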

Second, we use Multi-subset Variable Undersampling (VU) to investigate undersampling of

explanatory variables to put them more in balance with scarce fraud observations. Fraud

prediction improves in select situations when we randomly undersample explanatory variables

26 In untabulated results, we repeat the analysis using bootstrapping. More specifically, the full sample is used to generate 1,000
bootstrap subsamples (each sample contained observations selected randomly with replacement). Each bootstrap subsample is
then used to fit a logistic regression model from which 2.5 and 97.5 percentiles of independent variable coefficient estimates are
obtained. The bootstrapping results are similar to the OU results in that they are also generally more conservative.

into different subsets. However, it does not do so reliably. When we instead implement Multi-

subset Variable Undersampling by partitioning variables into subsets based on the type of fraud

they are likely to predict (PVU), the expected cost of misclassification is reduced by 9.6 percent

relative to the best performing VU benchmark.
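The PVU idea of partitioning predictors by the fraud type they target can be sketched as follows (the variable groups and the max-score combination rule below are hypothetical placeholders; the paper's actual partition and combination method are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Columns 0-1: hypothetical revenue-fraud predictors; 2-3: expense-fraud;
# 4-5: liability-fraud. A real partition would follow the fraud-type taxonomy.
X = rng.normal(size=(500, 6))
y = (rng.random(500) < 0.1).astype(float)
X[y == 1, :2] += 0.5                  # plant signal in the revenue group

variable_groups = {"revenue": [0, 1], "expense": [2, 3], "liability": [4, 5]}

# PVU: one model per variable subset; each model specializes in the patterns
# of one fraud type, and the models' scores are then combined.
models = {name: LogisticRegression(max_iter=1000).fit(X[:, cols], y)
          for name, cols in variable_groups.items()}

X_new = rng.normal(size=(4, 6))
scores = np.column_stack([models[n].predict_proba(X_new[:, c])[:, 1]
                          for n, c in variable_groups.items()])
p_fraud = scores.max(axis=1)          # flag a firm if ANY fraud-type model fires
print(p_fraud)
```

Taking the maximum across fraud-type models is one simple combination rule; averaging or learned weighting are equally plausible alternatives.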

Our research makes multiple contributions to the prior literature. First, we identify and

directly address financial statement fraud data rarity problems by systematically evaluating

multiple methods that we believe are new to the accounting literature. Based on our

experiments, we conclude that OU and PVU each produce economically and statistically

significant reductions in the expected cost of misclassification of about ten percent.27 This

compares to, for example, a 0.9 percent prediction performance advantage when, following

Dechow et al. (2011), two additional significant independent variables are added to their initial

model. The introduction and evaluation of these methods directly contributes to research that

focuses on improving fraud prediction. Beneish (1997) and Dechow et al. (2011), among others,

create fraud prediction models that can be used to indicate the likelihood that a company has

committed financial statement fraud. Our methods can be used to improve the quality of such

fraud predictions. We also directly extend research that examines the usefulness of data

analytics methods in fraud prediction (e.g., Cecchini et al. 2010; Perols 2011; Larcker and

Zakolyukina 2012; and Whiting et al. 2012).28

27 We specifically recommend the use of OU(12), at times in combination with PVU. The choice between using OU by itself or
in combination with PVU depends on the cost ratio and the prior fraud probability assumed by the specific entity that is trying to
predict fraud (see Figure 6).
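For readers unfamiliar with the expected cost of misclassification, a standard textbook formulation weights the two error rates by the prior fraud probability and the misclassification costs (the numeric values below are purely illustrative; the paper's exact specification may differ):

```python
# Expected cost of misclassification under assumed priors and costs.
def ecm(p_fraud_prior, cost_fn, cost_fp, fn_rate, fp_rate):
    # cost of missed fraud firms + cost of false alarms on non-fraud firms
    return (p_fraud_prior * fn_rate * cost_fn
            + (1 - p_fraud_prior) * fp_rate * cost_fp)

# Illustrative values only: 0.5% prior, misses 30x as costly as false alarms.
print(ecm(0.005, cost_fn=30.0, cost_fp=1.0, fn_rate=0.4, fp_rate=0.1))  # 0.1595
```

Varying the cost ratio and the prior in this expression is what drives the choice between OU alone and OU with PVU discussed in footnote 27.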
28 Future research that tries to improve fraud prediction using data analytics methods can examine other problems related to

rarity, such as (i) noisy data that potentially have more significant negative effects on rare cases (Weiss 2004), and (ii) mislabeled
non-fraud firms, i.e., firms that are labeled non-fraud but have actually committed fraud. We performed a limited analysis of one
potential approach. We (1) manipulated the training data in each cross-validation round by using OU to generate fraud
probability predictions for all the observations in the training data and then removed all non-fraud firms with high fraud
probability predictions (we tried five different thresholds: 0.9, 0.8, 0.7, 0.6, and 0.5) from the training data; (2) used the modified
training data from step 1 as input into OU; and (3) compared the results from step 2 to the original OU implementation.
Untabulated results did not show any significant performance improvements over the original OU implementation. When
compared to the original implementation, the average change in AUC across the ten test folds was -0.08% (p = 0.809; two-tailed),

Second, by showing that performance benefits can be gained by (i) addressing data rarity

problems in fraud detection and (ii) partitioning financial statement fraud into different fraud

types, our results provide an indication of the potential benefits that may result from addressing

similar problems in other settings. For example, bankruptcy, financial statement restatements,

material weaknesses in internal control over financial reporting, and audit qualifications are also

rare events in both absolute and relative terms.

Third, our research has implications for research that focuses on designing new explanatory

variables and developing parsimonious prediction models (e.g., Dechow et al. 2011; and

Markelevich and Rosner 2013). Our findings suggest that classification algorithms recognize

different fraud patterns when trained on different subsets of non-fraud firms. Thus, even if an

explanatory variable is deemed significant in one subsample, it is valuable to show that it is also

significant in other subsamples. Example techniques include OU, bootstrapping, and a robustness

measure proposed by Athey and Imbens (2015) that creates subsamples based on values of the

independent variables in the model. While we perform additional analyses that suggest that OU

(i) performs better than bootstrapping in predictive modeling and (ii) can be used to evaluate the

robustness of explanatory models, future research is needed to provide more definitive

recommendations about which method(s) to use for hypothesis testing.29 Further, research that

concludes that a new explanatory variable provides incremental predictive power should

0.08% (p = 0.360; one-tailed), 0.12% (p = 0.337; one-tailed), 0.31% (p = 0.182; one-tailed), and 0.24% (p = 0.228; one-tailed)
when using thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9, respectively. Future research is also needed to more directly address the
challenges associated with biases introduced by only having few fraud observations in absolute terms, while at the same time
having a potentially large number of undetected fraud cases. For example, to assess the impact of a potential overreliance on a
small sample of fraud firms and to attempt to improve out-of-sample predictive performance, future research could use random
subsamples rather than all fraud firms in each OU subset.
29 Future research can also examine the use of OU in conjunction with propensity score matching. For example, can OU be used

to generate more robust propensity scores? Alternatively, by applying OU, generating propensity scores and matched
samples, and evaluating differences between the samples within each OU subset, can OU be used to evaluate the robustness of
propensity score matching results?

consider showing that the variable provides incremental predictive value to models implemented

using our methods.30

Fourth, we also make a contribution by following the call to consider different types of fraud

(Brazel et al. 2009). We partition financial statement fraud into types and show that this

reframing improves the performance of VU in fraud prediction. The importance of this finding

may extend beyond VU. Research that examines predictors of fraud could, similar to Brazel et

al. (2009), design new explanatory variables to detect a specific type of fraud instead of fraud in

general. For example, fraud research could potentially develop variables that predict different

fraud types using different types of analyst forecasts (e.g., revenue vs. earnings) or different

types of debt covenants (e.g., leverage vs. interest coverage). To illustrate, an independent

variable that indicates whether a firm uses a leverage (interest expense) debt covenant can in turn

be used in a prediction model that predicts liabilities (expense) fraud. This reframing could thus

contribute to a better theoretical understanding of fraud and a more precise evaluation of

explanatory variables.

Finally, we believe that regulators and practitioners can potentially benefit from our findings.

Regulators, such as the SEC, are investing resources in developing better fraud risk models

(Walter 2013; SEC 2015). Our findings may enhance their ability to identify firms that have

committed fraud. This is important because, due to resource constraints, the SEC has to focus

investigations on a small sample of firms, and improvements in financial statement fraud

prediction models can be cost-effective in identifying potential fraud firms. The negative effects

of financial statement fraud on other stakeholders, such as employees, auditors, suppliers,

30 Please refer to www.fraudpredictionmodels.com/ou for further details on OU in general and more specific guidance on how to
use OU to evaluate (1) the robustness of independent variable hypothesis testing results and (2) the incremental predictive
performance of new independent variables. The hypothesis testing example includes further details on the analysis performed in
Table 4 of this paper and also includes mock data. The predictive performance example explains how to use OU in combination
with out-of-sample testing and includes mock data and SAS code.

customers, and lenders can also be potentially reduced. For example, auditors can use our

methods to potentially improve fraud risk assessment models that, in turn, can improve audit

client portfolio management and audit planning decisions. Given the significant costs and

widespread effects of financial statement fraud, improvements in fraud prediction models can

have a substantial positive impact on society.

REFERENCES
Abbasi, A., C. Albrecht, A. Vance, and J. Hansen. 2012. MetaFraud: A Meta-Learning
Framework for Detecting Financial Fraud. MIS Quarterly. 36(4): 1293-1327.
Agarwal, R., and V. Dhar. 2014. Editorial - Big Data, Data Science, and Analytics: The
Opportunity and Challenge for IS Research. Information Systems Research. 25(3): 443-448.
Apostolou, B., J. Hassell, and S. Webber. 2000. Forensic Expert Classification of Management
Fraud Risk Factors. Journal of Forensic Accounting. 1(2): 181-192.
Armstrong, C. S., D. F. Larcker, G. Ormazabal, and D. J. Taylor. 2013. The relation between
equity incentives and misreporting: the role of risk-taking incentives. Journal of Financial
Economics. 109(2): 327-350.
Association of Certified Fraud Examiners. 2014. Report to the Nations on Occupational Fraud
and Abuse. Austin, TX.
Athey, S., and G. Imbens. 2015. A Measure of Robustness to Misspecification. American
Economic Review. 105(5): 476-80.
Bayley, L., and S. Taylor. 2007. Identifying earnings management: A financial statement
analysis (red flag) approach. Proceedings of the American Accounting Association Annual
Meeting, Chicago, IL.
Beasley, M. 1996. An Empirical Analysis of the Relation between the Board of Director
Composition and Financial Statement Fraud. The Accounting Review. 71(4): 443-465.
Bell, T., and J. Carcello. 2000. A Decision Aid for Assessing the Likelihood of Fraudulent
Financial Reporting. Auditing: A Journal of Practice & Theory. 19(1): 169-184.
Bellman, R. 1961. Adaptive Control Processes: A Guided Tour. Princeton, NJ: Princeton University Press.
Beneish, M. 1997. Detecting GAAP Violation: Implications for Assessing Earnings Management
among Firms with Extreme Financial Performance. Journal of Accounting and Public Policy.
16(3): 271-309.
Beneish, M. 1999. Incentives and Penalties Related to Earnings Overstatements That Violate
GAAP. The Accounting Review. 74(4): 425-457.
Brazel, J. F., K. L. Jones, and M. F. Zimbelman. 2009. Using nonfinancial measures to assess
fraud risk. Journal of Accounting Research. 47(5): 1135-1166.
Breiman, L. 1996. Bagging predictors. Machine learning. 24(2): 123-140.
Brown, B., M. Chui, and J. Manyika. 2011. Are you ready for the era of big data. McKinsey
Quarterly. 4: 24-35.
Caskey, J., and M. Hanlon. 2013. Dividend Policy at Firms Accused of Accounting Fraud.
Contemporary Accounting Research. 30(2): 818-850.
Cecchini, M., G. Koehler, H. Aytug, and P. Pathak. 2010. Detecting Management Fraud in
Public Companies. Management Science. 56(7): 1146-1160.
Chan, P., and S. Stolfo. 1998. Toward Scalable Learning with Non-uniform Class and Cost
Distributions: A Case Study in Credit Card Fraud Detection. Proceedings of the Fourth
International Conference on Knowledge Discovery and Data Mining, New York, NY.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic
Minority Oversampling Technique. Journal of Artificial Intelligence Research. 16: 321-357.

Chen, H., R. H. Chiang, and V. C. Storey. 2012. Business Intelligence and Analytics: From Big
Data to Big Impact. MIS Quarterly. 36(4): 1165-1188.
Dechow, P. M., R. G. Sloan, and A. P. Sweeney. 1996. Causes and consequences of earnings
manipulation: An analysis of firms subject to enforcement actions by the SEC. Contemporary
Accounting Research. 13(1): 1-36.
Dechow, P. M., W. Ge, C. R. Larson, and R. G. Sloan. 2011. Predicting Material Accounting
Misstatements. Contemporary Accounting Research. 28(1): 17-82.
Duin, R. P. W., and D. M. J. Tax. 2000. Experiments with Classifier Combining Rules.
Proceedings of the International Workshop on Multiple Classifier Systems 2000.
Erickson, M., M. Hanlon, and E. L. Maydew. 2006. Is There a Link between Executive Equity
Incentives and Accounting Fraud? Journal of Accounting Research. 44(1): 113-143.
Ettredge, M. L., L. Sun, P. Lee, and A. A. Anandarajan. 2008. Is earnings fraud associated with
high deferred tax and/or book minus tax levels? Auditing: A Journal of Practice & Theory.
27(1): 1-33.
Fanning, K., and K. Cogger. 1998. Neural network detection of management fraud using
published financial data. International Journal of Intelligent Systems in Accounting, Finance
and Management. 7(1): 21-41.
Feng, M., W. Ge, S. Luo, and T. Shevlin. 2011. Why do CFOs become involved in material
accounting manipulations? Journal of Accounting and Economics. 51(1): 21-36.
Feroz, E., T. Kwon, V. Pastena, and K. Park. 2000. The Efficacy of Red-Flags in Predicting the
SEC's Targets: An Artificial Neural Networks Approach. International Journal of Intelligent
Systems in Accounting, Finance & Management. 9(3): 145-157.
Galar, M., A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. 2012. A review on
ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based
approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and
Reviews. 42(4): 463-484.
Glancy, F. H., and S. B. Yadav. 2011. A computational model for financial reporting fraud
detection. Decision Support Systems. 50(3): 595-601.
Goel, S., and J. Gangolly. 2012. Beyond The Numbers: Mining The Annual Reports For Hidden
Cues Indicative Of Financial Statement Fraud. Intelligent Systems in Accounting, Finance
and Management. 19(2): 75-89.
Green, B. P., and J. H. Choi. 1997. Assessing the Risk of Management Fraud Through Neural
Network Technology. Auditing: A Journal of Practice & Theory. 16(1): 14-28.
Gupta, R., and N. S. Gill. 2012. A Solution for Preventing Fraudulent Financial Reporting using
Descriptive Data Mining Techniques. International Journal of Computer Applications. 58(1):
22-28.
He, H., and E. A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering. 21(9): 1263-1284.
Humpherys, S. L., K. C. Moffitt, M. B. Burns, J. K. Burgoon, and W. F. Felix. 2011.
Identification of fraudulent financial statements using linguistic credibility analysis. Decision
Support Systems. 50(3), 585-594.
Jones, K. L., G. V. Krishnan, and K. D. Melendrez. 2008. Do Models of Discretionary Accruals

Detect Actual Cases of Fraudulent and Restated Earnings? An Empirical Analysis.
Contemporary Accounting Research. 25(2): 499-531.
Kaminski, K., S. Wetzel, and L. Guan. 2004. Can Financial Ratios Detect Fraudulent Financial
Reporting? Managerial Auditing Journal. 19(1): 15-28.
Kittler, J., M. Hatef, R.P.W. Duin, and J. Matas. 1998. On Combining Classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence. 20(3): 226-239.
Larcker, D. F., and A. A. Zakolyukina. 2012. Detecting deceptive discussions in conference
calls. Journal of Accounting Research. 50(2): 495-540.
LaValle, S., E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz. 2013. Big data, analytics
and the path from insights to value. MIT Sloan Management Review. 21.
Lee, T. A., R. W. Ingram, and T. P. Howard. 1999. The Difference between Earnings and
Operating Cash Flow as an Indicator of Financial Reporting Fraud. Contemporary
Accounting Research. 16(4): 749-786.
Lennox, C., and J. A. Pittman. 2010. Big Five Audits and Accounting Fraud. Contemporary
Accounting Research, 27(1): 209-247.
Lin, J., M. Hwang, and J. Becker. 2003. A Fuzzy Neural Network for Assessing the Risk of
Fraudulent Financial Reporting. Managerial Auditing Journal. 18(8): 657-665.
Loebbecke, J. K., M. M. Eining, and J. J. Willingham. 1989. Auditors' experience with material
irregularities: Frequency, nature, and detectability. Auditing: A Journal of Practice and
Theory. 9(1): 1-28.
Maloof, M. 2003. Learning When Data Sets are Imbalanced and When Costs are Unequal and
Unknown. Proceedings of the Twentieth International Conference on Machine Learning,
Washington, DC.
Markelevich, A., and R. L. Rosner. 2013. Auditor Fees and Fraud Firms. Contemporary
Accounting Research. 30(4), 1590-1625.
Nguyen, H. M., E. W. Cooper, and K. Kamei. 2012. A comparative study on sampling
techniques for handling class imbalance in streaming data. Soft Computing and Intelligent
Systems. 1762-1767.
Perols, J. 2011. Financial statement fraud detection: An analysis of statistical and machine
learning algorithms. Auditing: A Journal of Practice & Theory. 30(2): 19-50.
Perols, J. L., and B. A. Lougee. 2011. The relation between earnings management and financial
statement fraud. Advances in Accounting. 27(1): 39-53.
Phua, C., D. Alahakoon, and V. Lee. 2004. Minority Report in Fraud Detection: Classification of
Skewed Data. SIGKDD Explorations. 6(1): 50-59.
Price III, R. A., N. Y. Sharp, and D. A. Wood. 2011. Detecting and predicting accounting
irregularities: A comparison of commercial and academic risk measures. Accounting
Horizons. 25(4): 755-780.
Provost, F. J., T. Fawcett, and R. Kohavi. 1998. The case against accuracy estimation for
comparing induction algorithms. Proceedings of the Fifteenth International Conference on
Machine Learning, Madison, WI. 98: 445-453.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA.

SEC. 2015. Examination Priorities for 2015. Retrieved from
http://www.sec.gov/about/offices/ocie/national-examination-program-priorities-2015.pdf.
Sharma, V. 2004. Board of Director Characteristics, Institutional Ownership, and Fraud:
Evidence from Australia, Auditing: A Journal of Practice & Theory. 23(2): 105-117.
Shin, K. S., T. Lee, and H. J. Kim. 2005. An Application of Support Vector Machines in
Bankruptcy Prediction Models. Expert Systems with Application. 28: 127-135.
Summers, S. L., and J. T. Sweeney. 1998. Fraudulently Misstated Financial Statements and
Insider Trading: An Empirical Analysis. The Accounting Review. 73(1): 131-146.
Varian, H. R. 2014. Big data: New tricks for econometrics. The Journal of Economic
Perspectives. 28(2): 3-27.
Walter, E. 2013. Harnessing Tomorrow's Technology for Today's Investors and Markets.
Speech presented at American University School of Law, Washington, D.C. (February).
Weiss, G. 2004. Mining with Rarity: A Unifying Framework. ACM SIGKDD Explorations
Newsletter. 6(1): 7-19.
Whiting, D. G., J. V. Hansen, J. B. McDonald, C. Albrecht, and W. S. Albrecht. 2012. Machine
Learning Methods For Detecting Patterns Of Management Fraud. Computational
Intelligence. 28(4): 505-527.
Yang, Q., and X. Wu. 2006. 10 challenging problems in data mining research. International
Journal of Information Technology & Decision Making. 5(4): 597-604.

APPENDIX A: Definitions of explanatory variables (note a)

Panel A: Variables from Dechow et al. (2011)

Each entry lists Variable: Definition (Compustat mnemonics are defined in note b).

Abnormal change in order backlog: (OB - OBt-1) / OBt-1 - (SALE - SALEt-1) / SALEt-1
Actual issuance: IF SSTK > 0 OR DLTIS > 0 THEN 1 ELSE 0
Book to market: CEQ / (CSHO * PRCC_F)
Change in expected return on pension plan assets: PPROR - PPRORt-1
Change in free cash flows: (IB - RSST Accruals) / Average total assets - (IBt-1 - RSST Accrualst-1) / Average total assetst-1
Change in inventory: (INVT - INVTt-1) / Average total assets
Change in operating lease activity: ((MRC1/1.1 + MRC2/1.1^2 + MRC3/1.1^3 + MRC4/1.1^4 + MRC5/1.1^5) - (MRC1t-1/1.1 + MRC2t-1/1.1^2 + MRC3t-1/1.1^3 + MRC4t-1/1.1^4 + MRC5t-1/1.1^5)) / Average total assets
Change in receivables: (RECT - RECTt-1) / Average total assets
Change in return on assets: IB / Average total assets - IBt-1 / Average total assetst-1
Deferred tax expense: TXDI / ATt-1
Demand for financing (ex ante): IF (OANCF - (CAPXt-3 + CAPXt-2 + CAPXt-1)/3) / ACT < -0.5 THEN 1 ELSE 0
Earnings to price: IB / (CSHO * PRCC_F)
Existence of operating leases: IF MRC1 > 0 OR MRC2 > 0 OR MRC3 > 0 OR MRC4 > 0 OR MRC5 > 0 THEN 1 ELSE 0
Expected return on pension plan assets: PPROR
Level of finance raised: FINCF / Average total assets
Leverage: DLTT / AT
Percentage change in cash margin: ((1 - (COGS + (INVT - INVTt-1)) / (SALE - (RECT - RECTt-1))) - (1 - (COGSt-1 + (INVTt-1 - INVTt-2)) / (SALEt-1 - (RECTt-1 - RECTt-2)))) / (1 - (COGSt-1 + (INVTt-1 - INVTt-2)) / (SALEt-1 - (RECTt-1 - RECTt-2)))
Percentage change in cash sales: ((SALE - (RECT - RECTt-1)) - (SALEt-1 - (RECTt-1 - RECTt-2))) / (SALEt-1 - (RECTt-1 - RECTt-2))
RSST accruals: (WC + NCO + FIN) / Average total assets, where WC = (ACT - CHE) - (LCT - DLC); NCO = (AT - ACT - IVAO) - (LT - LCT - DLTT); FIN = (IVST + IVAO) - (DLTT + DLC + PSTK)
Soft assets: (AT - PPENT - CHE) / Average total assets
Unexpected employee productivity (note c): FIRM((SALE/EMP - SALEt-1/EMPt-1) / (SALEt-1/EMPt-1)) - INDUSTRY((SALE/EMP - SALEt-1/EMPt-1) / (SALEt-1/EMPt-1))
WC accruals: (((ACT - ACTt-1) - (CHE - CHEt-1)) - ((LCT - LCTt-1) - (DLC - DLCt-1) - (TXP - TXPt-1)) - DP) / Average total assets
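The ratio definitions above map directly onto Compustat-named columns. A small pandas sketch for two Panel A variables (hypothetical numbers; this assumes a single firm's consecutive fiscal years, so real data would first be grouped by firm):

```python
import pandas as pd

# Hypothetical firm-year panel with Compustat mnemonics as columns,
# sorted by fiscal year for one firm.
df = pd.DataFrame({
    "AT":    [100.0, 120.0, 150.0],
    "PPENT": [30.0, 35.0, 40.0],
    "CHE":   [10.0, 12.0, 20.0],
    "RECT":  [15.0, 18.0, 30.0],
})

avg_at = (df["AT"] + df["AT"].shift()) / 2          # average total assets

# Soft assets: (AT - PPENT - CHE) / Average total assets
df["soft_assets"] = (df["AT"] - df["PPENT"] - df["CHE"]) / avg_at

# Change in receivables: (RECT - RECTt-1) / Average total assets
df["chg_receivables"] = (df["RECT"] - df["RECT"].shift()) / avg_at
print(df[["soft_assets", "chg_receivables"]])
```

The first row is NaN by construction, since lagged values are unavailable for the earliest firm-year.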

Panel B: Variables from Perols (2011)

Accounts receivable to sales: RECT / SALE
Accounts receivable to total assets: RECT / AT
Allowance for doubtful accounts: RECD
Allowance for doubtful accounts to accounts receivable: RECD / RECT
Allowance for doubtful accounts to net sales: RECD / SALE
Altman Z score: 3.3*(IB + XINT + TXT)/AT + 0.999*SALE/AT + 0.6*CSHO*PRCC_F/LT + 1.2*WCAP/AT + 1.4*RE/AT
Big four auditor: IF 0 < AU < 9 THEN 1 ELSE 0
Current minus prior year inventory to sales: INVT/SALE - INVTt-1/SALEt-1
Days in receivables index: (RECT/SALE) / (RECTt-1/SALEt-1)
Debt to equity: LT / CEQ
Declining cash sales dummy: IF SALE - (RECT - RECTt-1) < SALEt-1 - (RECTt-1 - RECTt-2) THEN 1 ELSE 0
Fixed assets to total assets: PPEGT / AT
Four year geometric sales growth rate: (SALE/SALEt-3)^(1/4) - 1
Gross margin: (SALE - COGS) / SALE
Holding period return in the violation period: (PRCC_F - PRCC_Ft-1) / PRCC_Ft-1
Industry ROE minus firm ROE: NIindustry/CEQindustry - NI/CEQ
Inventory to sales: INVT / SALE
Net sales: SALE
Positive accruals dummy: IF (IB - OANCF) > 0 AND (IBt-1 - OANCFt-1) > 0 THEN 1 ELSE 0
Prior year ROA to total assets current year: (NIt-1/ATt-1) / AT
Property plant and equipment to total assets: PPENT / AT
Sales to total assets: SALE / AT
The number of auditor turnovers: (IF AU <> AUt-1 THEN 1 ELSE 0) + (IF AUt-1 <> AUt-2 THEN 1 ELSE 0) + (IF AUt-2 <> AUt-3 THEN 1 ELSE 0)
Times interest earned: (IB + XINT + TXT) / XINT
Total accruals to total assets: (IB - OANCF) / AT
Total debt to total assets: LT / AT
Total discretionary accrual: RSST Accrualst-1 + RSST Accrualst-2 + RSST Accrualst-3 (RSST accruals as defined in Panel A)
Value of issued securities to market value: IF CSHI > 0 THEN CSHI*PRCC_F / (CSHO*PRCC_F) ELSE IF (CSHO - CSHOt-1) > 0 THEN ((CSHO - CSHOt-1)*PRCC_F) / (CSHO*PRCC_F) ELSE 0
Whether accounts receivable > 1.1 of last year's: IF (RECT/RECTt-1) > 1.1 THEN 1 ELSE 0
Whether firm was listed on AMEX: IF EXCHG = 5, 15, 16, 17, or 18 THEN 1 ELSE 0
Whether gross margin percent > 1.1 of last year's: IF ((SALE - COGS)/SALE) / ((SALEt-1 - COGSt-1)/SALEt-1) > 1.1 THEN 1 ELSE 0
Whether LIFO: IF INVVAL = 2 THEN 1 ELSE 0
Whether new securities were issued: IF (CSHO - CSHOt-1) > 0 OR CSHI > 0 THEN 1 ELSE 0
Whether SIC code larger (smaller) than 2999 (4000): IF 2999 < SIC < 4000 THEN 1 ELSE 0

Panel C: Variables based on Cecchini et al. (2010) (note d)

Sales: SALE
Change in sales: SALE - SALEt-1
% Change in sales: (SALE - SALEt-1) / SALEt-1
Abnormal % change in sales: (SALE - SALEt-1)/SALEt-1 - INDUSTRY((SALE - SALEt-1)/SALEt-1)
Sales to assets: SALE / AT
Change in sales to assets: SALE/AT - SALEt-1/ATt-1
% Change in sales to assets: (SALE/AT - SALEt-1/ATt-1) / (SALEt-1/ATt-1)
Abnormal % change in sales to assets: (SALE/AT - SALEt-1/ATt-1)/(SALEt-1/ATt-1) - INDUSTRY((SALE/AT - SALEt-1/ATt-1)/(SALEt-1/ATt-1))
Sales to employees: SALE / EMP
Change in sales to employees: SALE/EMP - SALEt-1/EMPt-1
% Change in sales to employees: (SALE/EMP - SALEt-1/EMPt-1) / (SALEt-1/EMPt-1)
Sales to operating expenses: SALE / XOPR
Change in sales to operating expenses: SALE/XOPR - SALEt-1/XOPRt-1
% Change in sales to operating expenses: (SALE/XOPR - SALEt-1/XOPRt-1) / (SALEt-1/XOPRt-1)
Abnormal % change in sales to operating expenses: (SALE/XOPR - SALEt-1/XOPRt-1)/(SALEt-1/XOPRt-1) - INDUSTRY((SALE/XOPR - SALEt-1/XOPRt-1)/(SALEt-1/XOPRt-1))
Return on assets: NI / AT
Change in return on assets: NI/AT - NIt-1/ATt-1
% Change in return on assets: (NI/AT - NIt-1/ATt-1) / (NIt-1/ATt-1)
Abnormal % change in return on assets: (NI/AT - NIt-1/ATt-1)/(NIt-1/ATt-1) - INDUSTRY((NI/AT - NIt-1/ATt-1)/(NIt-1/ATt-1))
Return on equity: NI / CEQ
Change in return on equity: NI/CEQ - NIt-1/CEQt-1
% Change in return on equity: (NI/CEQ - NIt-1/CEQt-1) / (NIt-1/CEQt-1)
Abnormal % change in return on equity: (NI/CEQ - NIt-1/CEQt-1)/(NIt-1/CEQt-1) - INDUSTRY((NI/CEQ - NIt-1/CEQt-1)/(NIt-1/CEQt-1))
Return on sales: NI / SALE
Change in return on sales: NI/SALE - NIt-1/SALEt-1
% Change in return on sales: (NI/SALE - NIt-1/SALEt-1) / (NIt-1/SALEt-1)
Abnormal % change in return on sales: (NI/SALE - NIt-1/SALEt-1)/(NIt-1/SALEt-1) - INDUSTRY((NI/SALE - NIt-1/SALEt-1)/(NIt-1/SALEt-1))
Accounts payable to inventory: AP / INVT
Change in accounts payable to inventory: AP/INVT - APt-1/INVTt-1
% Change in accounts payable to inventory: (AP/INVT - APt-1/INVTt-1) / (APt-1/INVTt-1)
Abnormal % change in accounts payable to inventory: (AP/INVT - APt-1/INVTt-1)/(APt-1/INVTt-1) - INDUSTRY((AP/INVT - APt-1/INVTt-1)/(APt-1/INVTt-1))
Liabilities: LT
Change in liabilities: LT - LTt-1
% Change in liabilities: (LT - LTt-1) / LTt-1
Abnormal % change in liabilities: (LT - LTt-1)/LTt-1 - INDUSTRY((LT - LTt-1)/LTt-1)
Liabilities to interest expenses: LT / XINT
Change in liabilities to interest expenses: LT/XINT - LTt-1/XINTt-1
% Change in liabilities to interest expenses: (LT/XINT - LTt-1/XINTt-1) / (LTt-1/XINTt-1)
Abnormal % change in liabilities to interest expenses: (LT/XINT - LTt-1/XINTt-1)/(LTt-1/XINTt-1) - INDUSTRY((LT/XINT - LTt-1/XINTt-1)/(LTt-1/XINTt-1))
Assets: AT
Change in assets: AT - ATt-1
% Change in assets: (AT - ATt-1) / ATt-1
Abnormal % change in assets: (AT - ATt-1)/ATt-1 - INDUSTRY((AT - ATt-1)/ATt-1)
Assets to liabilities: AT / LT
Change in assets to liabilities: AT/LT - ATt-1/LTt-1
% Change in assets to liabilities: (AT/LT - ATt-1/LTt-1) / (ATt-1/LTt-1)
Abnormal % change in assets to liabilities: (AT/LT - ATt-1/LTt-1)/(ATt-1/LTt-1) - INDUSTRY((AT/LT - ATt-1/LTt-1)/(ATt-1/LTt-1))
Expenses: XOPR
Change in expenses: XOPR - XOPRt-1
% Change in expenses: (XOPR - XOPRt-1) / XOPRt-1
Abnormal % change in expenses: (XOPR - XOPRt-1)/XOPRt-1 - INDUSTRY((XOPR - XOPRt-1)/XOPRt-1)

Notes:
a. The explanatory variables included represent a relatively comprehensive set of variables based on recent fraud and
material misstatement literature (Cecchini et al. 2010; Dechow et al. 2011; Perols 2011). We include all variables
from Perols (2011) and all variables from the final Dechow et al. (2011) model that can be calculated using
Compustat data. Dechow et al. (2011) perform step-wise backward feature selection to derive more parsimonious
material misstatement models. We use their second model, which is the most complete model in their study that
only relies on Compustat data (they also include a model that requires market related data). This study predicts
material misstatements using the following variables: RSST accruals, change in receivables, change in inventory,
soft assets, percentage change in cash sales, change in return on assets, actual issuance of securities, abnormal
change in employees, and existence of operating leases. The model in Cecchini et al. (2010) includes a total of
1,518 explanatory variables derived using 23 financial statement items. These items are divided by each other both
in the current year and in the prior year and used to calculate changes in the ratios. Both current and lagged ratios as
well as their changes are then used to construct a dataset with 1,518 independent variables. Rather than including all
1,518 variables in our study, we follow and extend the approach used in Cecchini et al. (2010) by including 48
variables measuring levels and changes in levels, percentage change in levels, and abnormal percentage change of
commonly manipulated financial statement items and ratios. We examine a model with all 1,518 variables from
Cecchini et al. (2010) in an additional analysis.
b. ACT is Current Assets - Total; AT is Assets - Total; AU is Auditor; CAPX is Capital Expenditures; CEQ is
Common/Ordinary Equity - Total; CHE is Cash and Short-Term
Investments; COGS is Cost of Goods Sold; CSHI is Common Shares Issued; CSHO is Common Shares
Outstanding; DLC is Debt in Current Liabilities - Total; DLTIS is Long-Term Debt Issuance; DLTT is Long-Term
Debt - Total; DP is Depreciation and Amortization; EMP is Employees; EXCHG is Stock Exchange; FINCF is
Financing Activities Net Cash Flow; IB is Income Before Extraordinary Items; INVT is Inventories - Total;
INVVAL is Inventory Valuation Method; IVAO is Investment and Advances Other; IVST is Short-Term
Investments - Total; LCT is Current Liabilities - Total; LT is Liabilities - Total; MRC1 is Rental Commitments
Minimum 1st Year; MRC2 is Rental Commitments Minimum 2nd Year; MRC3 is Rental Commitments
Minimum 3rd Year; MRC4 is Rental Commitments Minimum 4th Year; MRC5 is Rental Commitments
Minimum 5th Year; NI is Net Income (Loss); OANCF is Operating Activities Net Cash Flow; OB is Order
Backlog; PPEGT is Property Plant and Equipment - Total (Gross); PPENT is Property Plant and Equipment - Total
(Net); PPROR is Pension Plans Anticipated Long-Term Rate of Return on Plan Assets; PRCC_F is Price Close -
Annual - Fiscal Year; PSTK is Preferred/Preference Stock (Capital) - Total; RE is Retained Earnings; RECD is
Receivables - Estimated Doubtful; RECT is Receivables Total; SALE is Sales/Turnover (Net); SIC is SIC Code;
SSTK is Sale of Common and Preferred Stock; TXDI is Income Taxes - Deferred; TXP is Income Taxes Payable;
TXT is Income Taxes - Total; WCAP is Working Capital (Balance Sheet); XINT is Interest and Related Expense -
Total; and XOPR is Operating Expense. We also included controls
for year and industry (two-digit SIC code).
c
Similar variable used in both Dechow et al. (2011) and Perols (2011).
d
Variable construction based on Financial Kernel in Cecchini et al. (2010).
Figure 1 Multi-subset Observation Undersampling (OU)
[Schematic omitted: four columns, (1) through (4), described in the notes below.]
Notes:
Column 1 represents the raw data with the fraud observations stacked on top and non-fraud cases below. Column 1
also shows that model building and out-of-sample data are kept separated. Column 2 shows the data subsets that are
created based on the OU method. All fraud data are used in each subset while the non-fraud data are under-sampled
to address data rarity within each subset. Cumulatively across all subsets, all of the non-fraud data can be used, but
a single non-fraud observation is only used in one subset. In column 3, a classification algorithm is used to build
one prediction model per subset with the goal of accurately classifying firms into fraud or non-fraud cases. Each
model is then applied out-of-sample and generates a fraud probability prediction for each observation in the out-of-
sample data. In column 4, for each out-of-sample observation, the individual fraud prediction probabilities are then
combined to arrive at an overall combined fraud probability prediction for each observation.
More formally, let M = {f1, f2, f3, …, fk} be a set of k fraud observations f and let C = {c1, c2, c3, …, cn} be a set of n non-
fraud observations c, where M is the minority class, i.e., k < n. Note that the union of M and C, i.e., M ∪ C, forms a
set that contains k fraud and n non-fraud observations. To achieve a more balanced dataset, d non-fraud
observations c are removed from the non-fraud set C, where 0 < d ≤ n - k. However, instead of deleting these
removed non-fraud observations, OU segments the non-fraud observations into n / (n - d) or fewer subsets Ui that
each contain n - d different non-fraud observations c, i.e., C = {U1, U2, U3, …, Un/(n-d)}. Note that all subsets Ui
contain mutually exclusive (disjoint) sets of non-fraud observations, Ui ∩ Uj = ∅ for i ≠ j. OU then combines all
fraud observations, i.e., the entire set M, with each Ui to create subsets Wi. OU thus creates up to n / (n - d) subsets
Wi that each contain all k fraud observations f and n - d unique non-fraud observations c. Each subset Wi is then used to
build a prediction model that is used to predict out-of-sample observations. In our experiments, OU is applied only to
the model building data; the model evaluation data are left intact. Finally, for each out-of-sample observation, the
different prediction models' probability predictions are averaged into an overall probability prediction for each
observation.
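The OU procedure above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation; the classifiers are placeholders and the function names are our own:

```python
import random
from statistics import mean

def ou_subsets(fraud, nonfraud, subset_size, seed=0):
    """Split the non-fraud observations into disjoint subsets Ui of
    `subset_size` (= n - d) and pair each with ALL fraud observations M,
    yielding the OU model-building subsets Wi."""
    rng = random.Random(seed)
    pool = list(nonfraud)
    rng.shuffle(pool)
    subsets = []
    for start in range(0, len(pool) - subset_size + 1, subset_size):
        ui = pool[start:start + subset_size]   # disjoint non-fraud draw
        subsets.append(list(fraud) + ui)       # Wi = all fraud + Ui
    return subsets

def combined_prediction(models, observation):
    """Average the fraud probabilities from the per-subset models."""
    return mean(model(observation) for model in models)
```

With, say, k = 3 fraud and n = 12 non-fraud observations and subset_size = 4, this yields three subsets of seven observations each, every non-fraud observation appearing in exactly one subset.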
Figure 2 Multi-subset Variable Undersampling (VU)
[Schematic omitted: three columns, (1) through (3), described in the notes below.]
Notes:
Column 1 represents the raw data that include all explanatory variables used to predict fraud. These explanatory
variables are partitioned into different subsets represented by the vertical lines. Each subset contains a subset of the
explanatory variables and all of the observations. Column 1 also shows that model building and out-of-sample data
are kept separated. In column 2, a classification algorithm is used to build one prediction model per variable subset
with the goal of classifying firms into fraud vs. non-fraud cases. Each prediction model is then applied out-of-
sample to generate a fraud probability prediction for each observation in the out-of-sample data. In column 3, for
each out-of-sample observation, the fraud prediction probabilities from the different prediction models are combined
to arrive at an overall combined fraud prediction probability for each observation.
More formally, let W denote a dataset with m variables x, i.e., W = {x1, x2, x3, …, xm}. VU reduces data dimensionality
by randomly dividing the variables in W into q subsets X, where each X contains m/q variables, i.e., VU creates the following
variable subsets: X1 = {x1, x2, x3, …, xm/q}, X2 = {xm/q+1, xm/q+2, xm/q+3, …, x2m/q}, X3 = {x2m/q+1, x2m/q+2,
x2m/q+3, …, x3m/q}, …, Xq = {xm-m/q+1, xm-m/q+2, xm-m/q+3, …, xm}. The subsets X are then used to build q prediction models.
The prediction models are then (i) used to predict out-of-sample observations and (ii) for each out-of-sample
observation, the prediction models' probability predictions are combined into an overall prediction for each
observation by taking an average of the individual probability predictions.
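A minimal sketch of the VU partitioning step (illustrative Python under our own naming, not the paper's code; model training is abstracted away):

```python
import random
from statistics import mean

def vu_partition(variables, q, seed=0):
    """Randomly divide the m variables into q subsets X1..Xq; when m is not
    divisible by q, subset sizes differ by at most one variable."""
    rng = random.Random(seed)
    shuffled = list(variables)
    rng.shuffle(shuffled)
    return [shuffled[i::q] for i in range(q)]

def vu_combined_prediction(models, observation):
    """Average the q per-subset models' fraud probabilities."""
    return mean(model(observation) for model in models)
```

Partitioning the paper's 109 variables into q = 20 subsets gives subsets of five or six variables each, matching Experiment 2a in Table 1.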
Figure 3 Experimental Procedures Multi-subset Observation Undersampling (OU) Example
[Flowchart omitted; its steps are summarized below.]

1. Start with the raw data and perform 10-fold cross-validation.
2. For each cross-validation round n = {1, 2, 3, …, 10}, split the data into round n training data and round n test data.
3. For each OU implementation l = {1, 2, 3, …, 20}:
   (a) Create l OU subsets from the round n training data.
   (b) Build one prediction model per OU subset, yielding l prediction models.
   (c) Use each model to predict the round n test data, yielding l sets of test-data predictions.
   (d) For each round n test-data observation, average the l probability predictions to obtain the combined predictions.
   (e) Determine the optimal classification threshold and calculate ECM scores for each test set.
4. Repeat step 3 until l = 20 and step 2 until n = 10; then end.
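The cross-validation split driving this procedure can be sketched as follows (an illustrative helper that assigns folds by round-robin after shuffling; the authors' exact fold construction is not reproduced here):

```python
import random

def ten_fold_indices(n_obs, seed=0):
    """Assign observation indices to 10 disjoint folds; in each round n,
    fold n serves as test data and the remaining nine as training data."""
    rng = random.Random(seed)
    idx = list(range(n_obs))
    rng.shuffle(idx)
    return [idx[i::10] for i in range(10)]
```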
Figure 4 Multi-subset Observation Undersampling (OU) with Different Numbers of Subsets -
Percentage Performance Improvement Relative to Benchmark
[Line chart omitted: ECM percentage improvement relative to the benchmark (y-axis, 0 to 15 percent) plotted against the number of OU subsets (x-axis, 1 to 20) for three series: original order, new order, and new subsets.]
Notes:
ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, three versions of the experiment were conducted. "Original order" refers to the main
OU experiment; "new order" refers to the analysis in which the OU subsets are selected in a different order; and
"new subsets" refers to the analysis in which the random sampling of non-fraud cases is repeated using a different
random draw.
The benchmark is simple undersampling (Perols 2011), which randomly removes non-fraud observations from
the sample to generate a more balanced training sample. This benchmark performs better than a benchmark that
includes all fraud and non-fraud observations. OU and the OU benchmarks use all variables (independent
variable reduction is examined in the VU analysis) and are implemented using support vector machines.
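One common formulation of the Expected Cost of Misclassification in the fraud prediction literature weights the error rates by the prior fraud probability and the relative misclassification costs. The sketch below follows that formulation under the parameter values stated above (0.6 percent prior, 30:1 cost ratio, false positive cost normalized to 1); it is an assumption for illustration, not necessarily the paper's exact functional form:

```python
def ecm(tp, fn, fp, tn, prior=0.006, cost_ratio=30.0):
    """Expected Cost of Misclassification: false-negative and false-positive
    RATES weighted by the assumed prior fraud probability, with the false
    positive cost normalized to 1 and the false negative cost = cost_ratio."""
    fn_rate = fn / (fn + tp)          # share of fraud cases missed
    fp_rate = fp / (fp + tn)          # share of non-fraud cases flagged
    return prior * fn_rate * cost_ratio + (1.0 - prior) * fp_rate
```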
Figure 5 Multi-subset Variable Undersampling (VU) with Different Numbers of Subsets of
Explanatory Variables - Percentage Performance Improvement Relative to Benchmark
[Line chart omitted: ECM percentage improvement relative to the benchmark (y-axis, -6 to 8 percent) plotted against the number of VU subsets (x-axis, 1 to 20) for two series: constant number of variables in each subset, and all variables in each round.]
Notes:
ECM is calculated using a 0.6 percent fraud probability and a 30:1 false negative to false positive cost ratio.
As discussed in the text, two versions of the experiment were conducted. The constant number of variables in
each subset experiment (the dashed line) uses subsets that contain five or six variables in each subset; the all
variables in each round experiment (the round dotted line) uses all variables in each experimental round by
randomly dividing all 109 variables into different subsets (consequently, as the number of subsets increases, the
number of variables in each subset decreases).
The all variables in each round experiment only manipulates the number of VU subsets in even increments.
The benchmark contains six randomly selected variables (from the 109 variables described in Appendix A) and
is equivalent to the VU implementation with only one subset. This benchmark performed better than
benchmarks implemented using (i) all the variables in the dataset and (ii) the variables selected in Dechow et al.
(2011), i.e., RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash
sales, change in return on assets, actual issuance of securities, abnormal change in employees, and existence of
operating leases. VU and the VU benchmarks use all fraud and non-fraud observations (observation
undersampling is examined in the OU analysis) and are implemented using support vector machines.
Figure 6 Performance of combinations of OU, PVU, and SMOTE
Percentage Performance Improvement Relative to OU(12)
[Line chart omitted: ECM percentage difference relative to OU(12) (y-axis, -12 to 8 percent) plotted against the false negative to false positive cost ratio (x-axis, 1:1 to 100:1) for four series: PVU + OU(12) + SMOTE(600), PVU + OU(12), PVU, and SMOTE(600).]
Notes:
OU is Multi-subset Observation Undersampling. OU(12) represents the best performing individual OU
implementation.
PVU is Multi-subset Variable Undersampling partitioned on fraud type.
SMOTE(600) is Multi-subset Observation Oversampling with an oversampling ratio of 600 percent. This represents
the best performing SMOTE implementation.
ECM is calculated assuming an evaluation fraud probability of 0.6 percent.
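SMOTE creates synthetic minority (fraud) observations by interpolating between existing minority observations. The sketch below approximates this by interpolating toward a randomly chosen minority point rather than one of the k nearest neighbors, so it is a simplification of SMOTE, not the authors' implementation:

```python
import random

def smote_oversample(minority, ratio_pct=600, seed=0):
    """Generate ratio_pct/100 synthetic points per minority observation by
    linear interpolation toward another randomly chosen minority point.
    `minority` is a list of numeric feature vectors (at least two needed)."""
    rng = random.Random(seed)
    per_obs = ratio_pct // 100
    synthetic = []
    for x in minority:
        others = [m for m in minority if m is not x]
        for _ in range(per_obs):
            neighbor = rng.choice(others)
            gap = rng.random()  # random point on the segment x -> neighbor
            synthetic.append([xi + gap * (ni - xi)
                              for xi, ni in zip(x, neighbor)])
    return synthetic
```

With a 600 percent ratio, each original fraud observation contributes six synthetic observations, all lying between existing fraud observations in feature space.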
TABLE 1
Summary of Experiments

Experiment 1. Evaluating the number of Multi-subset Observation Undersampling (OU) subsets
Description: We create OU subsets with 20 percent fraud cases in each subset.b Each subset includes all original fraud cases and a random sample of the original non-fraud cases selected without replacement. We then empirically examine the optimal number of subsets for implementing OU.c As sensitivity analyses, we repeat the experiment using the same subsets, but select the subsets in a different order, and re-perform the random selection procedure of non-fraud cases. OU and the benchmarks use all variables (data dimensionality reduction is examined in the VU analyses below).
Benchmark:a We use two benchmarks: (i) simple observation undersampling, i.e., OU with only one subset, as used in Perols (2011), and (ii) no undersampling.

Experiment 2a. Evaluating the number of Multi-subset Variable Undersampling (VU) subsets
Description: To evaluate how many VU subsets to use, we first randomly select variables from the 109 variables used in the prior literature and place these variables into 20 different subsets. Thus, each variable subset contains five or six variables. To determine how many VU variable subsets to use, we perform an experiment in which the subsets are randomly added one by one to VU.
Benchmark:a We use three benchmarks: (i) simple variable undersampling (i.e., VU with only one subset); (ii) a model that includes all variables; and (iii) model 2 in Dechow et al. (2011).

Experiment 2b. Evaluating the number of variables in each VU subset
Description: This experiment evaluates the performance of VU as we change the number of variables in each subset. We use all variables in each experimental round and randomly divide the variables into the different subsets. Thus, as the number of subsets increases, the number of variables in each subset decreases. For example, all variables are included in one subset in the first experimental round, half the variables are included in each of two subsets in the second experimental round, etc. This experiment skips all odd-numbered rounds except for the first round to reduce processing time.
Benchmark:a We use the same three benchmarks as in the first VU experiment.

Experiment 3. Evaluating VU partitioned on fraud types (PVU)
Description: This experiment evaluates the performance of VU when partitioned on fraud types. Note that we do not examine the performance of different PVU implementations in this experiment, as the specific subsets included in PVU are driven by the partitioning rather than by an empirical evaluation.
Benchmark:a We compare the performance of PVU to the best performing benchmark in the VU experiments (i.e., simple variable undersampling).
TABLE 1 (continued)
Summary of Experiments

Notes:
a
Since we introduce OU to the fraud detection literature to reduce the imbalance between the number of fraud and the number of non-fraud observations, we use
simple undersampling as a benchmark (Perols 2011) when evaluating the performance of OU. This benchmark randomly removes non-fraud observations from
the sample to generate a more balanced training sample. We also use no undersampling as an additional benchmark. However, simple undersampling performs
on average 7.3 percent better than no undersampling and we consequently report only simple undersampling. The OU and the OU benchmarks use all variables (as
data dimensionality reduction is examined in the VU analysis). VU is introduced as a data dimensionality reduction method that is argued to improve the
performance over currently used variable selection methods. As a baseline, we use a benchmark that was created using the variables included in Dechow et al.
(2011) model 2 (the Dechow benchmark): RSST accruals, change in receivables, change in inventory, soft assets, percentage change in cash sales, change in
return on assets, actual issuance of securities, abnormal change in employees, and existence of operating leases. This model compares different fraud detection
variables with the objective of creating a parsimonious fraud prediction model. We also use (i) a benchmark that randomly selects variables and (ii) a benchmark
that includes all variables (the all variables benchmark), i.e., where data dimensionality is not reduced. The benchmark that randomly selects variables performs
better than the Dechow benchmark and the all variables benchmark. More specifically, VU with 12 variable subsets performs on average 7.2 percent better than
both the All Variable Benchmark and the benchmark based on Dechow et al. (2011). Thus, we report our results using the benchmark that randomly selects
variables. VU and the VU benchmarks use all fraud and non-fraud observations. Following recent fraud prediction research (e.g., Cecchini et al. 2010) and
findings in Perols (2011), all prediction models are implemented using support vector machines. Sensitivity analyses are used to examine other classification
algorithms.
b
Perols (2011) finds that a simple undersampling ratio of 20 percent provides relatively good performance compared to other undersampling ratios.
c
More specifically, we first create one subset and examine the performance of OU with this single subset. We then create a second subset and use this subset
along with the previously created subset to evaluate the performance of OU with two subsets. Note that while it is possible to derive a total of 41 subsets
following Chan and Stolfo's (1998) approach, the addition of another OU subset is only valuable if the additional subset contains new information. We expect
that the marginal benefit of adding an additional subset decreases as the total number of subsets in OU increases. Additionally, for each subset that is added,
another prediction model has to be built, used for prediction, and combined with the other prediction models' predictions. Thus, there is a computational cost
associated with increasing the number of subsets used. Based on this and the results that indicate that the performance benefit tapers off around 12 subsets, we
do not extend the experiment beyond 20 subsets.
TABLE 2
Multi-subset Observation Undersampling (OU)
Performancea - Increasing the Number of Subsets

Number of                        Percentage Difference
OU Subsets           ECM         to Benchmark            p-valueb

Benchmarkc           0.160

ECM improving:
2                    0.156        2.3%                   0.146
3                    0.151        5.4%                   0.015
4                    0.151        5.4%                   0.036
5                    0.148        7.3%                   0.031
6                    0.149        6.7%                   0.039
7                    0.148        7.4%                   0.012
8                    0.146        8.9%                   0.005
9                    0.145        9.3%                   0.005
10                   0.143       10.8%                   0.003
11                   0.142       11.1%                   0.003

Performance plateau:
12                   0.143       10.8%                   0.006
13                   0.142       11.1%                   0.006
14                   0.143       10.4%                   0.011
15                   0.144       10.1%                   0.013
16                   0.142       11.2%                   0.008
17                   0.142       11.1%                   0.008
18                   0.143       10.7%                   0.009
19                   0.143       10.4%                   0.010
20                   0.143       10.6%                   0.009
Notes:
a
Performance is the average Expected Cost of Misclassification (ECM)
across the ten test folds. ECM is measured at best estimates of prior fraud
probability, i.e., 0.6 percent, and cost ratios, i.e., 30:1.
b
Reported p-values are based on pairwise t-tests using the average and
standard deviation in ECM scores across the ten test folds and are one-tailed
unless otherwise noted. Assumptions related to normality and independent
observations are unlikely to be satisfied and p-values are only included as
an indication of the relation between the magnitude and the variance of the
difference between each implementation and the respective benchmarks.
c
The benchmark is simple undersampling (Perols 2011), which randomly
removes non-fraud observations from the sample to generate a more
balanced training sample. This benchmark performed better than a
benchmark that included all fraud and non-fraud observations. OU and the
OU benchmarks use all variables (independent variable reduction is
examined in the VU analysis) and are implemented using support vector
machines. (Other classification algorithms are used in additional analyses.)
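The fold-level comparison described in note b can be sketched as a paired t-statistic on per-fold ECM differences (illustrative only; as the note cautions, the usual normality and independence assumptions may not hold for cross-validation folds):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(method_ecm, benchmark_ecm):
    """Paired t-statistic across folds; negative values favor the method,
    since lower ECM means lower expected misclassification cost."""
    diffs = [m - b for m, b in zip(method_ecm, benchmark_ecm)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```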
TABLE 3
Prediction Performancea,b of OU and PVU on a
Material Misstatements Hold-Out Sample

Notes:
a
Prediction performance is evaluated using 10-fold cross-validation in which separate datasets are used for model
building vs. model evaluation. Performance is area under the ROC curve (AUC). AUC provides a numeric
value of how well the prediction model ranks the observations in the test sets and represents the probability that a
randomly selected positive (misstatement) instance is ranked higher than a randomly selected negative (non-
misstatement) instance. An AUC of 0.5 is equivalent to a random rank order while an AUC of 1 is perfect
ranking of the evaluation cases.
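The rank interpretation of AUC in note a maps directly to code. A brute-force O(P x N) sketch (for illustration only, not the authors' evaluation code):

```python
def auc(positive_scores, negative_scores):
    """Probability that a randomly selected positive (misstatement) case is
    scored above a randomly selected negative case; ties count as one half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```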
b
The results in Panel A compare the performance of OU and PVU to the Dechow benchmark using material
misstatement data (all methods and benchmarks are implemented using support vector machines; Panel B reports
results when other classification algorithms are used). This comparison provides further validation of the results
reported earlier on fraud data and provides insight into the usefulness of the proposed methods in a slightly
different setting. The results in Panel B examine the sensitivity of the proposed methods to the use of other
classification algorithms, i.e., logistic regression and bootstrap aggregation. Please see footnotes 4, 22, and 26 in
the text for details about support vector machines and bootstrap aggregation. The results in Panel C compare the
performance of the financial kernel from Cecchini et al. (2010) with and without OU (both implementations use
support vector machines). This analysis provides insight into (i) the usefulness of OU when used in combination
with a different set of independent variables (created using the financial kernel of Cecchini et al. (2010)) and (ii)
whether OU provides incremental predictive power when used in combination with the financial kernel.
TABLE 3 (continued)
Prediction Performance of OU and PVU on a
Material Misstatements Hold-Out Sample

c
In panels A and B, given the source, i.e., Dechow et al. (2011), and the nature of the material misstatement data,
we use the Dechow et al. (2011) benchmark in these comparisons. This benchmark is based on model 2 from
Dechow et al. (2011): material misstatement = RSST accruals + change in receivables + change in inventory +
soft assets + percentage change in cash sales + change in return on assets + actual issuance of securities +
abnormal change in employees + existence of operating leases. The independent variables in this model were
selected using a material misstatement sample that is similar to the sample used in this experiment. Because the
entire sample was used when selecting these variables, it is possible that the benchmark performance represents
an overfitted model. In this experiment, OU uses all 107 variables, but under-samples the non-fraud
observations using the OU method. PVU uses all data, but partitions the original 107 variables based on fraud
types.
d
The financial kernel consists of 1,518 independent variables representing current and lagged ratios and changes
in the ratios of 23 financial statement variables commonly used to construct independent variables in fraud
research. In this experiment, OU is implemented using the same 1,518 independent variables and support vector
machines. PVU is not implemented in this experiment, as it is not clear how to partition the 1,518 independent
variables into different fraud categories.
e
p-values are one-tailed based on pairwise t-tests using the average and standard deviation of ECM scores across
the ten test folds. Assumptions related to normality and independent observations are unlikely to be satisfied and
p-values are only included as an indication of the relation between the magnitude and the variance of the
difference between each implementation and the benchmark.
TABLE 4
Hypothesis Testing: Results on Full Sample Logistic Regressions versus 12 OU Subsamples Logistic Regressions

Columns, left to right: Variable; Full Sample (Estimate, Std Error, ChiSquare, Prob>ChiSq); Summary of 12 OU Subsamples (Average Estimate, St. Dev. of Estimates, p-value Mean, p-value Minimum, Percent of p-values below 0.05, p-value Lower Quartile, p-value Median, p-value Upper Quartile).
7.820 0.698 125.46 <0.001 5.513 0.632 <0.001 <0.001 100% <0.001 <0.001 <0.001
SOFT_ASSETS -3.012 0.611 24.34 <0.001 -3.251 0.636 <0.001 <0.001 100% <0.001 <0.001 <0.001
FOURYGEOM_S -1.750 0.392 19.96 <0.001 -1.917 0.873 0.017 <0.001 92% <0.001 <0.001 0.004
AZSCORE -0.101 0.026 15.15 <0.001 -0.118 0.056 0.012 <0.001 83% <0.001 <0.001 0.009
TACCRU_T_TA -3.247 1.007 10.40 0.0013 -3.383 0.729 0.012 <0.001 92% <0.001 0.002 0.008
T_XOPR 0.000 0.000 9.52 0.0020 0.000 0.000 0.016 <0.001 92% 0.002 0.006 0.012
PPANDEQ_T_TA -3.644 1.193 9.33 0.0023 -3.887 1.379 0.041 <0.001 83% <0.001 <0.001 0.012
NETS 0.000 0.000 7.26 0.0070 0.000 0.000 0.034 <0.001 83% 0.003 0.016 0.046
S_T_T_EMP 0.001 0.000 6.37 0.0116 0.001 0.001 0.180 <0.001 67% <0.001 0.044 0.280
FA_T_TA 1.816 0.720 6.36 0.0117 1.841 1.146 0.138 <0.001 58% <0.001 0.029 0.206
PCHG_ACCP_T_INV -0.005 0.002 5.93 0.0149 -0.006 0.002 0.045 <0.001 83% 0.010 0.017 0.034
T_APCHG_ACCP_T_INV 0.004 0.002 4.39 0.0362 0.004 0.001 0.112 0.002 25% 0.054 0.076 0.108
ASS_T_LIAB 0.162 0.081 3.97 0.0462 0.166 0.095 0.191 <0.001 33% 0.008 0.178 0.319
PCHG_ASS_T_LIAB -0.005 0.003 3.72 0.0538 -0.005 0.002 0.114 0.013 33% 0.043 0.112 0.150
ACCR_T_TA 1.550 0.817 3.60 0.0578 1.694 1.391 0.168 <0.001 58% 0.008 0.029 0.321
INDUSTRY_FIRM_ROE -0.060 0.033 3.29 0.0695 -0.058 0.012 0.127 0.052 0% 0.076 0.113 0.189
LIAB_T_IEXP 0.001 0.001 2.90 0.0887 0.002 0.001 0.159 0.024 25% 0.050 0.103 0.237
RSST_ACCRUALS 0.022 0.013 2.82 0.0928 0.021 0.005 0.140 0.054 0% 0.078 0.102 0.227

Note: Average estimates, standard deviation estimates, and average p-values are based on estimates and p-values from the 12 OU subsample logistic regression
results. P-values less than 0.0001 were converted to 0.0001 before taking the average.