
SPE-169523-MS

Potential Pitfalls in Exploration and Production Applications of Machine Learning
Cole Harris, Chevron Information Technology Company
Copyright 2014, Society of Petroleum Engineers
This paper was prepared for presentation at the SPE Western North American and Rocky Mountain Joint Regional Meeting held in Denver, Colorado, USA, 16-18 April 2014.
This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents of the paper have not been
reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect any position of the Society of Petroleum Engineers, its
officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written consent of the Society of Petroleum Engineers is prohibited. Permission to
reproduce in print is restricted to an abstract of not more than 300 words; illustrations may not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.

Abstract
Exploration and production applications of machine learning algorithms are varied and numerous. However, less
attention has been given to the underlying assumptions critical to the application of such techniques. As the breadth of
applications increases, it is critical to understand those characteristics of the problem and data that may impact the results.
Independent of the particulars of any specific algorithm, the standard model development and evaluation process
may be described as follows. From a set of data points consisting of features and a response, a machine learning algorithm
produces a model that may then be used to compute predicted responses on new data. This model can be applied to additional
data for which the response is available, and performance estimated by comparing the predicted and actual response.
However, the reliability of this estimate may depend strongly on the statistical characteristics of the data. If the
observations are not independent, then the results may not reflect performance in application.
To explore the impact of the violation of the assumption of independence on predictive model development and
evaluation, standard machine learning algorithms were used to develop models from synthetic time series data and real
monthly oil production data from the Wolfcamp play in the Midland basin. For both, standard approaches to model
evaluation fail. For the oil production data, an alternative approach to model development and evaluation is shown to produce
both more reliable estimates of model performance and improved model performance.
Introduction
The use of machine learning algorithms in exploration and production applications is broad. For example, Mohaghegh
and Esmaili (2013) describe how fuzzy pattern recognition models have been used as an aid in optimizing drilling
parameters in shale, and Raghavenda et al. (2013) detail the application of support vector machine models in predicting
sucker rod failures. While the popularity of such techniques has increased, the implicit statistical assumptions of the data
underlying the application of such techniques have not been adequately addressed. It is crucial to understand how violation of
such assumptions can impact results.
Generically, a machine learning algorithm takes as input the training data: a set of data points consisting of features,
and a response corresponding to those features. From this data, the algorithm produces a model that may be used to compute
a response prediction from new data. Test data, consisting of data not overlapping with the training data, may be used to
estimate model performance. However, the reliability of this estimate depends on the statistical characteristics of the data.
For example, if the observations are not independent, then test data results may not be indicative of real-world results.
Consider the task of developing a production model from drilling and geologic data across a developed field. A model may
be designed from data corresponding to a randomly selected subset of wells, and subsequently tested on the remaining wells.
Strong predictive performance across these test set well locations does not imply similar performance outside these locations,
as geologic features are typically strongly spatially correlated, violating the assumption of independent observations. While
characteristics of the data may impair estimation of model performance, techniques are under development to minimize this
effect. For example, if production potential is sought in a particular region, then evaluation of production model performance
biased towards wells closest to that region is preferred over random assignment of well data to training and test sets.
In short, the design of relevant and robust models requires care not only in the selection of the machine learning algorithm,
but also an understanding of the statistical characteristics of the available data and the specific aim of the project.

Overview of Datasets and Machine Learning Techniques


Software
All analyses employed the R statistical computing environment. The interested reader is referred to Ekstrom (2011)
for an introduction to the R language.
Datasets
Time Series Data
The synthetic time series data was designed to represent, generically, data that might arise from a frequently sampled
sensor. A sequence of 10800 normally distributed values was generated and filtered using a 39-point triangular filter. For
descriptive purposes, the sampling rate is defined as 1 sample/second, or 1 Hz. Thus the duration of this time series is three
hours.
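The paper's analyses used R; purely for illustration, the construction described above might be sketched in Python as follows. The random seed and the use of NumPy's Bartlett (triangular) window are assumptions, not the author's code.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily
noise = rng.normal(size=10800)          # 3 hours at 1 sample/second

# 39-point triangular filter, normalized to unit gain
tri = np.bartlett(39)
tri /= tri.sum()
series = np.convolve(noise, tri, mode="same")
```

Smoothing with a 39-point filter induces autocorrelation over roughly the filter length, consistent with the roughly 30-second decorrelation lag discussed below.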
Oil Production Data
526 vertical wells were selected from the Wolfcamp play in the Midland Basin based on the public availability of at least
seven years of monthly oil production rates and well coordinates.
Machine Learning Techniques
Nearest Neighbor Regression
The nearest neighbor regression algorithm is quite simple. For a test data point, the training set data point with
minimum distance to the test data point is identified, and the predicted response is the response of that nearest neighbor.
While alternative measures of distance might be appropriate for certain applications, the Euclidean distance has been
employed in this analysis. The algorithm is further described in Altman (1992), and the FNN package for R used here is
described in Beygelzimer et al. (2013).
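The paper implements this with the knn.reg function from R's FNN package; the core idea can be sketched in Python for illustration (the function name below is hypothetical):

```python
import numpy as np

def nn_regress(train_X, train_y, test_X):
    """1-nearest-neighbor regression: each prediction is the response
    of the Euclidean-nearest training point."""
    preds = []
    for x in np.atleast_2d(test_X):
        d = np.linalg.norm(train_X - x, axis=1)   # Euclidean distances
        preds.append(train_y[np.argmin(d)])
    return np.array(preds)

# toy example: a query at 0.9 is nearest to training point 1.0
pred = nn_regress(np.array([[0.0], [1.0], [2.0]]),
                  np.array([10.0, 20.0, 30.0]),
                  np.array([[0.9]]))
```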
Random Forest Regression
Random forest regression may be described as an average over a collection of regression trees, each designed from a
distinct subsample of the input training data. Please refer to Breiman (2001) for additional information on the algorithm, and
to Liaw and Wiener (2002) for information on the randomForest R package used in this analysis.
Methodology
Time Series Data Model development
The synthetic time series data, a segment of which is shown in Figure 1, is sampled at 1 Hz and has a duration of three
hours. As Figure 2 shows, the data are strongly autocorrelated at short lags, but not at lags greater than about 30 seconds.
This is consistent with the construction and filtering of the data as described above. With the aim of clearly demonstrating the
potential impact of such autocorrelation on estimates of predictive model performance, a nearest neighbor regression model
was designed to predict the time series value 10 minutes in the future from the most recent one minute (60 samples) of data.
The data were reorganized into an input matrix such that each row contains values for 60 contiguous samples, and a
response vector with associated values ten minutes ahead. These were split into training (two hours) and test (49 minutes 1
second) segments.
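The reorganization described above can be sketched as follows (Python for illustration; the paper's analysis used R, and the function name is hypothetical):

```python
import numpy as np

def windowed(series, window=60, horizon=600):
    """Each row of X holds `window` contiguous samples; y holds the value
    `horizon` samples (here 10 minutes at 1 Hz) after each window's end."""
    n = len(series) - window - horizon + 1
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window + horizon - 1 : window + horizon - 1 + n]
    return X, y

# on a 10,800-sample series this yields 10,141 (row, response) pairs
X, y = windowed(np.arange(10800.0))
```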
A nearest neighbor regression model was designed from the training data using the knn.reg function in the FNN package.
The accuracy of this model was estimated within the training data using leave-one-out cross-validation (LOOCV).
The LOOCV process is described in the following pseudocode.
For each sample in training data
{
    remove sample from training dataset
    design model from remaining samples
    apply model to excluded sample
    record predicted response
    return sample to training dataset
}
From this vector of recorded predictions, the model may be evaluated using any of several available measures, such as mean
absolute error and correlation. These measures can be compared with evaluation based on test set predictions.
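For a nearest neighbor model the pseudocode above collapses to excluding each point from its own neighbor search. An illustrative Python sketch (the paper used knn.reg in R; the data here are toy values, not the paper's):

```python
import numpy as np

def loocv_1nn(X, y):
    """Leave-one-out predictions for 1-nearest-neighbor regression."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                   # 'remove sample from training dataset'
        preds[i] = y[np.argmin(d)]      # predict from the remaining samples
    return preds

# evaluation measure computed from the recorded predictions
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
preds = loocv_1nn(X, y)
mae = np.mean(np.abs(preds - y))        # mean absolute error
```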

Oil Production Model Development


The spatial distribution of the wells represented in this data is shown in Figures 3-4. As is evident, there are regions of
relatively high and low well density. To explore the possible impact of spatial correlations in this data, random forest models
were designed from distinct subsets of training wells to predict production for wells in a specific test region as depicted in
Figure 5. Predictive variables were chosen to be the first twelve monthly oil production values, and the response is the
cumulative production over years two through seven. Thus, this analysis is somewhat analogous to a decline curve analysis,
but with empirically derived models substituted for parametric decline curve models.
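Given a wells-by-months matrix of rates, that feature/response construction might look like the following sketch. The 84-column layout (seven years of monthly rates per well) and the function name are assumptions about how the data could be arranged, not the author's code:

```python
import numpy as np

def decline_features(monthly_rates):
    """monthly_rates: (n_wells, 84) array of monthly oil production (7 years).
    Predictors: first 12 months. Response: cumulative months 13-84 (years 2-7)."""
    X = monthly_rates[:, :12]
    y = monthly_rates[:, 12:84].sum(axis=1)
    return X, y

# toy input: 3 wells producing a constant 1 unit/month
X, y = decline_features(np.ones((3, 84)))
```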
For each test well, training wells were selected as those wells within a specified distance of that test well, for a range of
distances, as shown in Figure 5. It is expected that production at wells distant from the test well region will not be as
informative as that at nearby wells, and furthermore that spatial correlations within the data for wells far from the test area
may negatively impact model evaluation. Thus models designed from nearby well data may outperform a model incorporating
all data. If a distance-related model performance effect is observed, the notion of "nearby" can be quantified.
In a subsequent analysis, models designed from nearby training wells were evaluated, over a range of cutoff distances, for a
small set of surrounding wells at the test region boundary (Figure 4). A distance-related model performance effect in this
analysis could be used to select, a priori, more appropriate training wells for the test set, as would be required in real-world
applications in which future production is not known.
Results and Discussion
Time Series Nearest Neighbor Regression Model
The LOOCV predictions are plotted against the actual sample values in Figure 6. The LOOCV mean absolute error
estimate for the nearest neighbor regression model is 0.012, and the correlation between predicted and actual values is 0.996.
The test set predictions are plotted against the actual sample values in Figure 7. In the test data the mean absolute error is 0.21,
and the correlation is -0.06. Clearly the standard model evaluation process has failed, as the LOOCV results suggest a near-perfect
model, while the test set results demonstrate a lack of predictive capability. The root of this failure is the LOOCV
algorithm's reliance on the statistical independence of the time series samples. Adjacent time samples are so strongly
associated that the nearest neighbor is always an adjacent sample. In the LOOCV algorithm as described above, a sample
adjacent to the removed sample is available for prediction. However, the test set data, while exhibiting the same high
adjacent-sample correlation, is statistically independent of the training data. No very near neighbors to the test set samples
are present in the training set.
Oil Production Random Forest Models
Initially, a random forest regression model was designed from all training set wells, including those identified as
surrounding wells. Analogous to LOOCV, predictions for each training set well may be generated from those trees within
the random forest model that did not incorporate the data for that well in their design. These predictions, known as out-of-bag
predictions, are shown in Figure 8. The mean relative error in cumulative production over years two through seven is 23.0%.
Across all training wells, the error in the total field production is 0.4%. When applied to the test wells, this model performs
less well, with a mean relative error of 34.8% (Figure 9). Across all test wells, the prediction error in the total field
production is 24.1%.
The impact on performance of selecting only nearby training wells for model design was investigated as described in the
following pseudocode.
For R in a range of inclusion radii
{
    For each well in test well set
    {
        identify training wells within a distance R of the test well
        design model using only these training wells
        apply model to test well
        record predicted response
    }
}
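A generic Python sketch of the loop above follows; the `fit` callable stands in for random forest training, and all names and data are illustrative stand-ins rather than the author's code:

```python
import numpy as np

def radius_predictions(train_xy, train_X, train_y, test_xy, test_X, radius, fit):
    """For each test well, fit a model only on training wells whose surface
    location lies within `radius` of that test well, then predict."""
    preds = np.empty(len(test_X))
    for i in range(len(test_X)):
        d = np.linalg.norm(train_xy - test_xy[i], axis=1)   # well-to-well distances
        near = d <= radius
        model = fit(train_X[near], train_y[near])           # e.g. a random forest
        preds[i] = model(test_X[i:i + 1])[0]
    return preds

# toy check with a mean-response 'model': only the two nearby wells contribute
mean_fit = lambda X, y: (lambda Xt: np.full(len(Xt), y.mean()))
train_xy = np.array([[0.0, 0.0], [100.0, 0.0], [50000.0, 0.0]])
train_y = np.array([1.0, 2.0, 30.0])
p = radius_predictions(train_xy, np.zeros((3, 1)), train_y,
                       np.array([[0.0, 0.0]]), np.zeros((1, 1)), 200.0, mean_fit)
```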
The baseline result, using all training wells, may be thought of as a limit case with a very large inclusion radius. The
total field production error is shown in Figure 10, and the mean relative error in Figure 11, both for inclusion radii up to
100,000 feet. Both errors reach a minimum at 25,000 feet, though the relationship between error and inclusion radius is
clearer for the total field production error. At 25,000 feet, the mean relative error is 22.7%, and the total field
production error is 4.1%. This compares very favorably with the baseline model results for the test wells, suggesting that
consideration of training well distance may enable more accurate production modeling. While these results do not directly
implicate spatial statistical correlations within the training data as a contributing factor, the results are consistent with such an
interpretation.

In any application in which later production data is not available for the wells of interest, the proper inclusion radius
cannot be identified as described above. Can nearby wells for which sufficient production data is available be substituted? In
Figure 4, such a set of surrounding wells is depicted. The above algorithm for identifying the appropriate inclusion radius
has been applied to these wells. The resulting total field production error and mean relative error as a function of inclusion
radius are shown in Figures 12 and 13. The relationship between total field production error and inclusion radius for these
surrounding wells is very similar to that observed for the test wells, with a minimum error at an inclusion radius of 30,000
feet. Had this been used as the basis for inclusion radius selection, the test well total field production error and mean relative
error would be 5.3% and 23.3% respectively. The relationship between the inclusion radius and the mean relative error is less
clear, with a rather broad range of radii resulting in similar errors.
Conclusion
While the impact of autocorrelation on model evaluation was described in reference to temporal data, this finding
generalizes to spatial data as well, since the synthetic series could equally have been interpreted as spatial. For all types of data and
modeling approaches, correlations within the data can render standard model evaluation procedures unreliable. Thoughtful
selection of both data points likely to be useful for model development, and of data points likely to be useful for model
evaluation can increase model performance and reduce model evaluation error. Additional work remains to better understand
the relationship between the statistical characteristics of a dataset and the optimal selection of a subset of that data for
model building and evaluation.
Acknowledgement
The author would like to acknowledge the management of Chevron Information Technology Company for allowing
this presentation.
References
Altman, N. S. 1992. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician 46 (3): 175-185.
Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. 2013. FNN: Fast Nearest Neighbor Search Algorithms and Applications. R package version 1.1. http://CRAN.R-project.org/package=FNN
Breiman, L. 2001. Random Forests. Machine Learning 45: 5-32.
Liaw, A. and Wiener, M. 2002. Classification and Regression by randomForest. R News 2 (3): 18-22.
Mohaghegh, S. D. and Esmaili, S. 2013. Using Data-Driven Analytics to Assess the Impact of Design Parameters on Production from Shale. Society of Petroleum Engineers.
Raghavenda, C. S., Liu, Y., Wu, A., Olabinjo, L., Balogun, O., Ershaghi, I., and Yao, K.-T. 2013. Global Model for Failure Prediction for Rod Pump Artificial Lift Systems. Society of Petroleum Engineers.
R Core Team 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

Figures

Fig. 1 - Segment of synthetic time series data
Fig. 2 - Autocorrelation of synthetic time series
Fig. 3 - Location of wells used in oil production modeling
Fig. 4 - Well set assignment (Training, Surrounding, Test). Size of circle representative of total oil production for years 2-7
Fig. 5 - Training wells within radius used to design model for each test well
Fig. 6 - LOOCV predictions vs. actual, synthetic time series training data
Fig. 7 - Test set predictions vs. actual, synthetic time series data
Fig. 8 - Out-of-bag predictions for random forest model using all training data
Fig. 9 - Test well predictions for baseline random forest model
Fig. 10 - Total test well field production error vs. training well inclusion radius
Fig. 11 - Test well mean relative error vs. training well inclusion radius
Fig. 12 - Total surrounding well field production error vs. training well inclusion radius
Fig. 13 - Surrounding well mean relative error vs. training well inclusion radius
