Abstract
Exploration and production applications of machine learning algorithms are varied and numerous. However, less
attention has been given to the underlying assumptions critical to the application of such techniques. As the breadth of
applications increases, it is critical to understand those characteristics of the problem and data that may impact the results.
Independent of the particulars of any specific algorithm, the standard model development and evaluation process
may be described as follows. From a set of data points consisting of features and a response, a machine learning algorithm
produces a model that may then be used to compute predicted responses on new data. This model can be applied to additional
data for which the response is available, and performance estimated by comparing the predicted and actual response.
However, the reliability of this estimate may be strongly dependent on the statistical characteristics of the data. If the
observations are not independent, then the results may not reflect performance in application.
To explore the impact of the violation of the assumption of independence on predictive model development and
evaluation, standard machine learning algorithms were used to develop models from synthetic time series data and real
monthly oil production data from the Wolfcamp play in the Midland basin. For both, standard approaches to model
evaluation fail. For the oil production data, an alternative approach to model development and evaluation is shown to produce
both more reliable estimates of model performance and improved model performance.
Introduction
The use of machine learning algorithms in exploration and production applications is broad. For example, Esmaili
and Mohaghegh (2013) describe how fuzzy pattern recognition models have been used as an aid in optimizing drilling
parameters in shale, and Raghavenda et al. (2013) detail the application of support vector machine models in predicting
sucker rod failures. While the popularity of such techniques has increased, the implicit statistical assumptions of the data
underlying the application of such techniques have not been adequately addressed. It is crucial to understand how violation of
such assumptions can impact results.
Generically, a machine learning algorithm takes as input the training data: a set of data points consisting of features,
and a response corresponding to those features. From this data, the algorithm produces a model that may be used to compute
a response prediction from new data. Test data, consisting of data not overlapping with the training data, may be used to
estimate model performance. However, the reliability of this estimate is dependent on the statistical characteristics of the data.
For example, if the observations are not independent, then test data results may not be indicative of real world results.
Consider the task of developing a production model from drilling and geologic data across a developed field. A model may
be designed from data corresponding to a randomly selected subset of wells, and subsequently tested on the remaining wells.
Strong predictive performance across these test set well locations does not imply similar performance outside these locations,
as geologic features are typically strongly spatially correlated, violating the assumption of independent observations. While
characteristics of the data may impair estimation of model performance, techniques are under development to minimize this
effect. For example, if production potential is sought in a particular region, then evaluation of production model performance
biased towards wells closest to that region is preferred over random assignment of well data to training and test sets.
In short, the design of relevant and robust models requires not only care in the selection of the machine learning algorithm,
but also an understanding of the statistical characteristics of the available data and the specific aim of the project.
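The effect described in this section can be illustrated with a small sketch. The following is not code from this study: the synthetic series, the `knn_predict` helper, and the split sizes are assumptions made for illustration, and Python with NumPy stands in for the R packages cited in the references. A simple nearest-neighbor model evaluated on a random split of autocorrelated data appears far more accurate than the same model evaluated on a held-out contiguous block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic autocorrelated response along one coordinate (time or easting):
# a smooth trend plus AR(1)-style correlated noise.
n = 500
x = np.arange(n, dtype=float)
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.95 * noise[i - 1] + rng.normal(scale=0.3)
y = np.sin(x / 50.0) + noise

def knn_predict(x_train, y_train, x_test, k=5):
    """Plain k-nearest-neighbor regression on the single coordinate."""
    preds = []
    for xt in x_test:
        idx = np.argsort(np.abs(x_train - xt))[:k]
        preds.append(y_train[idx].mean())
    return np.array(preds)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Random split: test points are interleaved with training points whose
# noise is strongly correlated with theirs.
perm = rng.permutation(n)
tr, te = perm[:400], perm[400:]
err_random = rmse(knn_predict(x[tr], y[tr], x[te]), y[te])

# Contiguous (blocked) split: test points lie outside the training range,
# mimicking prediction at genuinely new times or locations.
tr2, te2 = np.arange(400), np.arange(400, n)
err_block = rmse(knn_predict(x[tr2], y[tr2], x[te2]), y[te2])

print(err_random, err_block)  # blocked error is typically much larger
```

The random split's error estimate is optimistic because nearby observations share correlated noise; the blocked split is the one that reflects performance in application.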
SPE-169523-MS
In any application in which later production data is not available for the wells of interest, the proper inclusion radius
cannot be identified as described above. Can nearby wells for which sufficient production data is available be substituted? In
Figure 4, such a set of surrounding wells is depicted. The above algorithm for identifying the appropriate inclusion radius
has been applied to these wells. The resulting total field production error and mean relative error as a function of inclusion
radius are shown in Figures 12 and 13. The relationship between total field production error and inclusion radius for these
surrounding wells is very similar to that observed for the test wells, with a minimum error at an inclusion radius of 30,000
feet. Had this radius been used as the basis for inclusion radius selection, the test well total field production error and mean
relative error would be 5.3% and 23.3%, respectively. The relationship between the inclusion radius and the mean relative
error is less clear, with a rather broad range of radii resulting in similar errors.
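The radius-selection procedure applied to the surrounding wells can be sketched as follows. All specifics here are hypothetical: the well coordinates, the production values, and the neighborhood-mean stand-in for the per-well model are invented for illustration and are not the data or models used in this study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical well set: (x, y) locations in feet and a production value.
# In the study these would be surrounding wells with sufficient history.
n_wells = 200
coords = rng.uniform(0, 100_000, size=(n_wells, 2))
prod = 1000.0 + 0.01 * coords[:, 0] + rng.normal(scale=50.0, size=n_wells)

def predict_with_radius(i, radius):
    """Predict well i from the other wells within `radius` feet.

    A neighborhood mean stands in for the model the paper trains on the
    wells inside the inclusion radius around each target well."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    mask = (d > 0) & (d <= radius)
    return prod[mask].mean() if mask.any() else np.nan

def errors_for_radius(radius):
    """Total field production error and mean relative error at one radius."""
    preds = np.array([predict_with_radius(i, radius) for i in range(n_wells)])
    ok = ~np.isnan(preds)
    total_field = abs(preds[ok].sum() - prod[ok].sum()) / prod[ok].sum()
    mean_rel = float(np.mean(np.abs(preds[ok] - prod[ok]) / prod[ok]))
    return total_field, mean_rel

# Sweep candidate inclusion radii and keep the one minimizing total field
# production error, as done above for the surrounding wells.
radii = [5_000, 10_000, 20_000, 30_000, 50_000]
tf_error = {r: errors_for_radius(r)[0] for r in radii}
best_radius = min(tf_error, key=tf_error.get)
```

The selected radius would then be used when assembling training wells for the wells of interest; the result above is that a radius chosen on surrounding wells transfers well to the test wells in terms of total field production error.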
Conclusion
While the impact of autocorrelation on model evaluation was described in reference to temporal data, this finding
generalizes to spatial data as well, since the synthetic data could equally have been interpreted as spatial. For all types of data
and modeling approaches, correlations within the data can render standard model evaluation procedures unreliable. Thoughtful
selection both of data points likely to be useful for model development and of data points likely to be useful for model
evaluation can increase model performance and reduce model evaluation error. Additional work remains to better understand
the relationship between the statistical characteristics of a dataset and the optimal selection of a subset of that data for
model building and evaluation.
Acknowledgement
The author would like to acknowledge the management of Chevron Information Technology Company for allowing
this presentation.
References
Altman, N.S. 1992. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American
Statistician 46 (3): 175-185.
Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. 2013. FNN: Fast Nearest Neighbor Search
Algorithms and Applications. R package version 1.1. http://CRAN.R-project.org/package=FNN
Breiman, L. 2001. Random Forests. Machine Learning 45: 5-32.
Liaw, A. and Wiener, M. 2002. Classification and Regression by randomForest. R News 2 (3): 18-22.
Mohaghegh, S.D. and Esmaili, S. 2013. Using Data-Driven Analytics to Assess the Impact of Design Parameters on
Production from Shale. Society of Petroleum Engineers.
Raghavenda, C.S., Liu, Y., Wu, A., Olabinjo, L., Balogun, O., Ershaghi, I., and Yao, K.-T. 2013. Global Model for
Failure Prediction for Rod Pump Artificial Lift Systems. Society of Petroleum Engineers.
R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. http://www.R-project.org/
Figures
Fig. 4 - Well set assignment (training, surrounding, and test wells). Size of circle representative of total oil production for years 2-7
Fig. 5 - Training wells within radius used to design model for each test well