
Geoderma xxx (xxxx) xxx–xxx

Contents lists available at ScienceDirect

Geoderma
journal homepage: www.elsevier.com/locate/geoderma

Modelling and mapping soil organic carbon stocks in Brazil


Lucas Carvalho Gomes a,b,*, Raiza Moniz Faria a, Eliana de Souza a, Gustavo Vieira Veloso a, Carlos Ernesto G.R. Schaefer a, Elpídio Inácio Fernandes Filho a

a Department of Soil and Plant Nutrition, Federal University of Viçosa, campus UFV, 36570-900 Viçosa, Brazil
b Farming Systems Ecology Group, Wageningen University & Research, P.O. Box 430, 6700 AK Wageningen, the Netherlands

ARTICLE INFO

Handling Editor: Alex McBratney

Keywords:
Soil carbon stock
Machine learning
Spatial prediction
Protected areas
Random Forests

ABSTRACT

Brazil has extensive forests and savannas on deeply weathered soils and plays a key role in discussions about carbon sequestration, but the distribution of soil organic carbon (SOC) stocks up to 1 m depth has not been investigated in Brazil using machine learning techniques. In this study, we applied a methodological framework to optimize the prediction of SOC stocks for the entire Brazilian territory and to determine how the environmental heterogeneity of Brazil influences the distribution of SOC stocks. We used a legacy dataset of 8227 soil profiles, which consisted of 37,693 samples. For each profile, the vertical distribution of SOC and bulk density was interpolated to standard depths (0–5, 5–15, 15–30, 30–60 and 60–100 cm) using mass-preserving equal-area quadratic splines. The covariates database was composed of 74 variables including bioclimatic (temperature and precipitation) data, soil and biome maps, vegetation indexes and morphometric maps derived from a digital elevation model, with a 1 km spatial resolution. To obtain the best prediction performance, we tested four machine learning algorithms: Random Forests, Cubist, Generalized Linear Model Boosting and Support Vector Machines. Random Forests showed the best performance in predicting SOC stocks for all depths, with the highest performance at 30–60 cm for training (R² = 0.32) and validation (R² = 0.33); hence, it was selected for the spatial prediction of SOC stocks. The most important covariates selected by Random Forests using recursive feature elimination were soil class, sum of monthly mean temperature (SAMT), precipitation, slope height and vegetation indexes (NDVI, GPP). In total, Brazilian soils store approximately 71.3 PgC within the top 100 cm, of which the first 0–30 cm contains almost 36 PgC. Approximately 31% of the total SOC stocks (22.2 PgC) occurs in protected areas (2.6 million km²), which are not subjected to land use pressure and carbon losses. Although the Amazon biome has the highest amount of stored SOC (36.1 PgC), its soils do not represent a good potential for carbon accumulation. Among soil classes, the Luvisols showed the lowest SOC density (6.45 kg m−2) and the Histosols presented the highest values (14.87 kg m−2). More than 57% of the total SOC was found in nutrient-poor, deeply weathered Ferralsols and Acrisols, which are the dominant soils in Brazil. The presented methodological framework covers all steps of the prediction process, building maps with known accuracy, and has great potential to be used in future soil carbon inventories at large scales. Concerning conservation issues, the results highlight the importance of nature reserves for protecting SOC in the long term.

1. Introduction

Soil organic carbon (SOC) stocks are the largest reservoir of carbon in the biosphere, important sinks of CO₂ from the atmosphere (Batjes, 1998; Davidson and Janssens, 2006; Lal, 2004), and have the potential to mitigate the impacts of present-day and future climate changes (Edenhofer et al., 2014). However, the prediction of global SOC stocks is highly variable; Scharlemann et al. (2014) reviewed 27 studies and found values varying from 504 to 3000 PgC (mean of 1460.5 PgC) up to 1 m depth, possibly resulting from different soil data and methods employed. The development of digital soil mapping (DSM) technologies and their applications (McBratney et al., 2003) enabled mapping SOC stocks from local to continental scales, using the scorpan factors (e.g., soil, climate, organisms, parent material) (Minasny et al., 2013). Mapping SOC stocks in large and heterogeneous environments using the scorpan factors is a challenge due to high variability and limited sample data availability, but it is crucial to improve global estimates of SOC stocks and highlight the main drivers to support carbon inventories and policy

* Corresponding author at: Department of Soil and Plant Nutrition, Federal University of Viçosa, campus UFV, 36570-900 Viçosa, Brazil.

E-mail addresses: lucas.c.gomes@ufv.br (L.C. Gomes), raiza.faria@ufv.br (R.M. Faria), eliana.souza@ufv.br (E. de Souza), gustavo.veloso@ufv.br (G.V. Veloso),
carlos.schaefer@ufv.br (C.E.G.R. Schaefer), elpidio@ufv.br (E.I.F. Filho).

https://doi.org/10.1016/j.geoderma.2019.01.007
Received 1 December 2017; Received in revised form 23 November 2018; Accepted 3 January 2019
0016-7061/ © 2019 Elsevier B.V. All rights reserved.

Please cite this article as: Gomes, L.C., Geoderma, https://doi.org/10.1016/j.geoderma.2019.01.007


L.C. Gomes et al. Geoderma xxx (xxxx) xxx–xxx

decisions. The climate and environmental heterogeneity of Brazil, from humid tropical to semiarid zones, makes it uniquely suitable for coupling methodological prediction approaches to assess the distribution of SOC stocks and to understand how the scorpan factors control that distribution at large scale.

The SOC stocks depend on the balance between carbon inputs and outputs and on the environmental conditions (scorpan factors) that directly or indirectly affect the processes controlling the spatial distribution of SOC stocks. At global and continental scales, the SOC stock distribution is mainly controlled by climate conditions, such as temperature and precipitation, increasing with higher precipitation and lower temperatures (Hengl et al., 2015; Minasny et al., 2013). Soil texture also influences SOC stocks, and clay amounts are key to SOC protection (Powers et al., 2011), due to interactions between the SOC and reactive surfaces of clay minerals (Grüneberg et al., 2013; Mayer, 1994). Vegetation also plays a key role in the SOC stocks, favoring the input of decaying plant biomass, especially in the topsoil (0–30 cm) (Bui et al., 2009), and vegetation indexes, such as NDVI, have been used to predict SOC stocks (Yang et al., 2016b). In areas with large topographic variations the SOC is reasonably predicted by terrain attributes, such as elevation, slope and curvature (Fissore et al., 2017; Nyssen et al., 2008; Oueslati et al., 2013). The combinations of these different environmental factors create unique local conditions that increase or decrease SOC stocks, which makes them useful for predicting the spatial distribution of SOC stocks in DSM.

Several models are used to estimate the spatial distribution of SOC, including machine learning algorithms, such as Random Forests (Hengl et al., 2015; Hounkpatin et al., 2018; Ließ et al., 2016; Wang et al., 2018; Yang et al., 2016a), Cubist (Adhikari et al., 2014; Gray et al., 2015; Rossel et al., 2016), Support Vector Machines-SVM (Ließ et al., 2016; Ottoy et al., 2017; Rudiyanto et al., 2018; Wang et al., 2018) and kriging-based models (Bonfatti et al., 2016; Gamble et al., 2017). The application of predictive models should follow the principle of parsimony – Occam's razor – suggesting that the best model can explain the same phenomena using fewer variables without loss of performance (Batty and Torrens, 2001). The use of fewer predictive variables enables a better understanding of the main drivers and quicker computer processing (Brungard et al., 2015). Aiming to create simpler models, the Recursive Feature Elimination (RFE) algorithm is widely applied to select optimal subsets of variables by backward elimination while maintaining model performance (Ballabio, 2009; Brungard et al., 2015; Hounkpatin et al., 2018; Stevens et al., 2013; Vašát et al., 2017).

Soil carbon mapping is usually done in 2 dimensions (2D) at a single soil depth interval. However, the SOC is highly variable with soil depth, and studies have modelled the vertical distribution of SOC using negative exponential functions and interpolation (Minasny et al., 2006; Mishra et al., 2009). Equal area quadratic splines are also used to harmonize the soil data by depth (Bishop et al., 1999), enabling the creation of models to map soil characteristics in 3D (Adhikari et al., 2013; Mulder et al., 2016). The study of SOC stocks over the entire soil depth is important since it provides information on carbon pools less prone to climatic variations, as the SOC stored in the subsurface has higher stability and protection than that in the upper layers (Fontaine et al., 2007). The distribution of SOC stocks in the upper layers (0–50 cm) is mainly explained by environmental variables, and it remains a challenge to find covariates that explain the variability of subsurface SOC stocks. The discovery of such covariates can improve our understanding of the processes leading to carbon sinks and resilience in deep soils.

Regardless of the model used for SOC stocks mapping, the prediction process has an uncertainty that can be assessed by Monte Carlo simulation (Meersmans et al., 2009), Bootstrap (Pan and Politis, 2016) or by the prediction interval (PI) based on the difference of the residuals between the modelled outputs and corresponding observed data (Shrestha and Solomatine, 2006; Solomatine and Shrestha, 2009). Using regression kriging, Malone et al. (2011) implemented the PI method in DSM to create prediction and uncertainty maps. The GlobalSoilMap (GSM) project defined the uncertainty as the 90% PI that describes the range of values within which the true value is expected to occur 9 times out of 10, with a 1 out of 20 probability for each of the two tails (Arrouays et al., 2014).

The aims of the present study were to: i) model the vertical distribution of SOC, bulk density and SOC stocks in soil profiles, ii) analyse the effect of sub-setting the covariate set on prediction performance, iii) select the best model for prediction of SOC stocks, iv) predict the horizontal and vertical spatial distribution of SOC stocks across the Brazilian territory and analyse the SOC stocks per biome, soil type and within protected areas.

2. Materials and methods

2.1. Study area

The Brazilian territory covers an 8.5 million km² area from 05° 16′ 19″ N to 33° 45′ 07″ S latitude and 34° 47′ 34″ W to 73° 59′ 26″ W longitude, with the highest elevation of 2972 m in the Amazon region. Its geographical position enables a wide range of biodiversity, with six biomes possessing different climatic conditions, mostly tropical (Fig. 1a). The majority of soils in Brazil are deeply weathered Ferralsols and Acrisols, which together cover more than 60% of Brazil's area (Fig. 1b) (IBGE and EMBRAPA, 2001), whereas Histosols cover only 1613 km² and Chernosols 33,685 km².

The Amazon biome covers almost half of Brazil (49.29%), with a humid tropical climate (Af, Am and Aw), mean annual rainfall greater than 3100 mm, and mean annual temperature of 25.9 to 27.7 °C. The Cerrado biome (Brazilian savannas) is the second largest biome, covering 22% of the territory, with a predominantly semi-humid climate (Aw), mean annual temperature between 22 and 23 °C and mean annual precipitation between 1200 and 1800 mm. The Atlantic Forest biome in eastern Brazil extends from the coast to the interior dissected plateau and has the highest diversity of environments, characterized by well-drained highlands. In this biome, there are diverse climate types (Cfb, Cfa, Cwb, Aw and As), with a mean annual temperature of 11 to 26 °C depending on the altitude and mean annual precipitation between 700 and 1500 mm. The Caatinga biome is the driest biome, under a semi-arid climate (Bsh), with an annual average rainfall of 500 mm and mean annual temperature of 20 to 29 °C. The Pantanal biome is characterized by long periods of flooding and a seasonal Aw climate, with a mean annual temperature of 22 to 24 °C and mean annual precipitation between 1000 and 1600 mm. The Pampa biome in southern Brazil is covered by temperate grasslands with a Cfa climate, mean annual temperature of 14 to 20 °C and mean annual precipitation between 1300 and 2500 mm. Across the whole territory, Brazil has a large extent of natural protected areas, especially in the Amazon biome (Fig. 1c), covering about 29% of the country.

2.2. Soil data

We used legacy soil data that were mainly produced by the RadamBrasil Project, a Brazilian national systematic soil survey program that ran from 1973 to 1986. These data comprise information on the soil classes and soil properties of representative locations across the territory. Part of the data was compiled by Cooper et al. (2005) and contains 8227 soil profiles (Fig. 1a), accounting for 37,693 soil samples widely distributed across the country. Nonetheless, the density of soil profiles varies among the biomes, with the Atlantic Forest showing the highest value with 2.1 soil profiles per 1000 km², followed by Caatinga (1.5), Pantanal (1.3), Cerrado (1.2), Pampa (1.0) and the Amazon with only 0.5 soil profiles per 1000 km².

The soil bulk density is essential for calculating the SOC stocks, but within the soil samples database, only 10% (3278 samples) had soil bulk density values. Pedotransfer functions are widely used to estimate


Fig. 1. Brazilian biomes and the distribution of soil profiles used to calculate the soil organic carbon (SOC) stocks (a), soil classes (b) and protected areas in Brazil (c).

the soil bulk density of missing depths (Abdelbaki, 2018; Rodríguez-Lado et al., 2015; Tranter et al., 2007), but Brazil does not have a countrywide pedotransfer function (PTF) for bulk density, and few functions have been published for specific regions (Benites et al., 2007; Bernoux et al., 1998; Souza et al., 2016; Tomasella and Hodnett, 1998). To overcome this problem, we used chemical and physical soil properties data to calibrate a bulk density PTF. The PTF was calibrated using a linear model based on SOC and clay content, resulting in a linear regression model (Eq. (1)): BD = 1.5701 − 0.06834 (SOC) − 0.00568 (clay), with an R² of 0.52. The assessment of the PTF accuracy, using a randomly selected 75% of the data to build the function and 25% for validation (holdout), showed a root mean squared error (RMSE) of 0.15 and a mean absolute error (MAE) of 0.114.

After the missing depths were filled with BD estimates, we harmonized the soil data (BD and SOC) vertically over the 100 cm depth in standard depths by applying the mass-preserving spline (Bishop et al., 1999; Malone et al., 2009) function of the GSIF package (Hengl et al., 2017). For each spline we used a smoothing parameter value of 0.1, leading to values of BD and SOC for each cm of the profile. Five depth intervals were defined (0–5, 5–15, 15–30, 30–60, and 60–100 cm) according to GlobalSoilMap specifications (Arrouays et al., 2014), and the values of BD and SOC represent the average values at each interval. The SOC stock was calculated per soil depth according to Eq. (2). The legacy data did not report stoniness, and for this reason, we did not include this variable in Eq. (2).

SOC stocks = [SOC content × Ds × D] / 100    (2)

In which SOC stocks are reported in kg m−2, SOC content in g kg−1, soil bulk density (Ds) in g cm−3, and soil thickness (D) in cm.

2.3. Environmental covariates

To model the spatial distribution of SOC stocks we used environmental covariates related to the scorpan factors (Table 1). A set of 18 morphometric maps was generated from a digital elevation model (DEM) using the RSAGA package (Brenning, 2008). The maps include elevation, slope, aspect, curvatures, valleys, hills and other aspects of the terrain topography. The DEM was derived from a radar image from the Shuttle Radar Topography Mission – SRTM with a 90 m spatial resolution (Jarvis et al., 2008).

The normalized difference vegetation index (NDVI) and the gross primary production (GPP) index from MODIS satellite imagery were included as proxies of the vegetation cover, representing the “biological” factor. Additionally, the map of Brazilian Biomes was also considered as a possible predictor variable, since the biomes relate to different vegetation types and therefore can affect the distribution of SOC stocks spatially and by depth, across the regions.

The relationship of SOC stocks with climate factors was investigated by assessing the significance of temperature and precipitation from 1970 to 2000. These climate variables were obtained from the WorldClim Data Portal (Hijmans et al., 2005). Temperature is an important variable that affects soil carbon dynamics, and thus we summed the mean temperature of each month at each pixel to create a unique covariate map that we called “sum of monthly mean temperature” (SAMT).

A group of 12 maps of bioclimatic variables (Bio 2, Bio 3, … Bio 13) (Hijmans et al., 2005) was evaluated as potential climatic predictors of SOC stocks. A description of the variables is shown in Table 1. The bioclimatic variables are more biologically meaningful climate proxies than temperature and precipitation alone. Bioclimatic variables were derived from the monthly temperature and rainfall data from 1970 to 2000. These biologically meaningful variables represent annual trends (e.g., mean annual temperature, annual precipitation), seasonality (e.g., annual range in temperature and precipitation) and extreme or limiting environmental factors (e.g., the temperature of the coldest and warmest month, and precipitation of the wet and dry quarters). We included a map of Brazil's climatic zones according to the Köppen classification (Alvares et al., 2013).

To account for parent material effects on the SOC stocks, we used a geological map (CPRM, 2004) and the pedological map published by IBGE and EMBRAPA (2001). The pedological map contains the spatial distribution of all soil orders, and was updated to the current System of Soil Classification - SiBCS (Santos, 2011). All covariate maps were resampled to a common 1 km grid cell using the Equal Area Lambert Projection System.

2.4. Methodological framework

To predict reliable maps of SOC stocks with known accuracy, we applied a methodological framework that aggregated different methodologies to optimize the prediction process and estimate the uncertainty (Fig. 2). The methodological framework was applied to each soil depth separately, covering all stages of model prediction, with


Table 1
Groups of covariates from which the most important sets of predictors for SOC stocks were selected.

| Morphometric | Temperature | Bioclimatic | Precipitation | Vegetation | Soil/parent material |
|---|---|---|---|---|---|
| Elevation | Tmean 1 | Bio 2 | Precipitation 1 | Biome | Soil class |
| Aspect | Tmean 2 | Bio 3 | Precipitation 2 | NDVI | Lithology |
| Curvature flow line | Tmean 3 | Bio 4 | Precipitation 3 | GPP | |
| Curvature profile | Tmean 4 | Bio 5 | Precipitation 4 | | |
| Curvature maximal | Tmean 5 | Bio 6 | Precipitation 5 | | |
| Curvature minimal | Tmean 6 | Bio 7 | Precipitation 6 | | |
| Curvature plan | Tmean 7 | Bio 8 | Precipitation 7 | | |
| Curvature total | Tmean 8 | Bio 9 | Precipitation 8 | | |
| Hill | Tmean 9 | Bio 10 | Precipitation 9 | | |
| Mid-slope position | Tmean 10 | Bio 11 | Precipitation 10 | | |
| Normalized height | Tmean 11 | Bio 12 | Precipitation 11 | | |
| Slope | Tmean 12 | Bio 13 | Precipitation 12 | | |
| Slope height | Tmin 1 | Köppen | Annual Precipitation | | |
| Standardized height | Tmin 2 | | | | |
| Terrain surface convexity | Tmin 3 | | | | |
| Terrain surface texture | Tmin 4 | | | | |
| Valley depth | Tmin 5 | | | | |
| Valley | Tmin 6 | | | | |
| | Tmin 7 | | | | |
| | Tmin 8 | | | | |
| | Tmin 9 | | | | |
| | Tmin 10 | | | | |
| | Tmin 11 | | | | |
| | Tmin 12 | | | | |
| | Sum of monthly mean temperature (SAMT) | | | | |

Tmean: monthly mean temperature (1, 2, 3, … 12 = Jan., Feb., Mar., … Dec.); Tmin: monthly minimum temperature.
Köppen: map of climate zones according to the Köppen classification.
Precipitation 1, 2, 3, … 12: monthly rainfall of Jan., Feb., Mar., … Dec.
GPP: Gross Primary Production vegetation index from MODIS imagery.
Bioclimatic variables: Bio 2 = Mean Diurnal Range (mean of monthly (max temp − min temp)); Bio 3 = Isothermality (Bio 2/Bio 7) (× 100); Bio 4 = Temperature Seasonality (standard deviation × 100); Bio 5 = Max Temperature of Warmest Month; Bio 6 = Min Temperature of Coldest Month; Bio 7 = Temperature Annual Range (Bio 5 − Bio 6); Bio 8 = Mean Temperature of Wettest Quarter; Bio 9 = Mean Temperature of Driest Quarter; Bio 10 = Mean Temperature of Warmest Quarter; Bio 11 = Mean Temperature of Coldest Quarter; Bio 12 = Annual Precipitation; Bio 13 = Precipitation of Wettest Month.
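
As a rough illustration of the calculations described in Sections 2.2 and 2.3, the sketch below evaluates the bulk density pedotransfer function (Eq. (1)), the SOC stock of one depth interval (Eq. (2)) and the sum of monthly mean temperature (SAMT). The function names are hypothetical, and the input units of Eq. (1) are not stated in the text, so the unit note in that function is an assumption rather than the authors' specification.

```python
from typing import Sequence


def bulk_density_ptf(soc: float, clay: float) -> float:
    """Bulk density (g cm-3) from the linear PTF of Eq. (1).

    Assumption: soc and clay are given in the units of the calibration
    data, which the text does not state explicitly.
    """
    return 1.5701 - 0.06834 * soc - 0.00568 * clay


def soc_stock(soc_content_g_kg: float, bulk_density_g_cm3: float,
              thickness_cm: float) -> float:
    """SOC stock (kg m-2) of one depth interval, following Eq. (2)."""
    return soc_content_g_kg * bulk_density_g_cm3 * thickness_cm / 100.0


def samt(monthly_mean_temps_c: Sequence[float]) -> float:
    """Sum of the twelve monthly mean temperatures of one pixel."""
    if len(monthly_mean_temps_c) != 12:
        raise ValueError("expected 12 monthly values")
    return sum(monthly_mean_temps_c)


if __name__ == "__main__":
    # Illustrative inputs only; these are not data from the study.
    print(bulk_density_ptf(soc=1.0, clay=30.0))
    print(soc_stock(soc_content_g_kg=10.0, bulk_density_g_cm3=1.2,
                    thickness_cm=5.0))
    print(samt([25.0] * 12))
```

For instance, a 0–5 cm layer with 10 g kg−1 SOC and a bulk density of 1.2 g cm−3 gives 10 × 1.2 × 5 / 100 = 0.6 kg m−2.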

methods for variable selection, model selection, prediction of SOC stocks and uncertainty analysis using prediction intervals.

2.5. Environmental covariates and model selection

Several potential covariates are available to predict SOC stocks, but using all available covariates is time consuming and computationally expensive; to overcome this problem, we applied the ‘recursive feature elimination - RFE’ algorithm of the caret package (Kuhn, 2017) to select a representative group of covariates to predict the SOC stocks. The RFE is a backward selection algorithm that iteratively eliminates the least promising predictors from the model based on an initial predictor importance measure (Kuhn and Johnson, 2013). It is implemented within a model-based selection procedure that is compatible with the Cubist, Random Forests and linear models. RFE was run with the full set of covariates (Table 1), and eighteen subsets: 5, 6, 7 … 20, 25, 30 and/or 40 covariates. The selection of the optimal subset of covariates was based on cross-validation with 5 folds and the R-squared accuracy metric for each subset.

In recent years, the use of data-driven machine learning models, such as tree-based methods, has gained popularity in the digital soil mapping field. In this study, we compared four machine learning methods (Random Forests, Cubist, Support Vector Machines and Generalized Linear Models) with different characteristics to assess the most accurate prediction of SOC stocks. These models mainly differ in how they manage the training dataset, which can result in one variable being important to model x but not important for model y. To overcome this and enable a comparison of model performance, we used each model separately to select its optimal set of covariates with the RFE approach, which means that covariates for prediction with Random Forests were selected on an RFE-Random Forests basis, Cubist used RFE-Cubist and so forth.

The Random Forests algorithm was developed to perform both classification and regression (Breiman, 2001). The model is based on the idea of ensemble learning (bagging), known for its ability to reduce the experimental noise and enhance the prediction accuracy by aggregating several different predictions. A large set of random trees is constructed during the model training, and their outputs are merged into a single prediction. Each tree in the forest is constructed based on a unique bootstrap sample of the original training data, with each bootstrap set leaving out about one-third of the observations (Hastie et al., 2009). Accuracy is reached with low bias and low variance as a large number of trees are averaged, thereby providing a reliable error estimation based on the remaining dataset, called the out-of-bag (OOB) data. Three parameters have to be defined for model training: the number of trees (ntree) in the forest, the minimum number of data points in each terminal node (nodesize), and the number of features tried at each node (mtry). The ‘mtry’ is the only parameter that requires special judgment (Breiman, 2002), and is the parameter used for model optimization in the caret package, used in this study.

Cubist is a rule-based model built as an extension of the M5 tree model (Quinlan, 1992). Its structure consists of a conditional component - or piecewise function - acting as a decision tree, coupled with multiple linear regression models (Kuhn, 2017). It is similar to a typical regression tree model in terms of being a data-partitioning algorithm that allows non-linear relationships in observed data to be explored. The model recursively partitions the data into subsets, which are more internally homogeneous with respect to the target variable and covariates than the dataset as a whole. A regression model is fitted for each rule based on the data subset defined by the model rules. A series of rules, hierarchically arranged, defines the recursive partitioning of the prediction variable in order to minimize the standard deviation across all potential splits (Wilford and Thomas, 2013).

Support vector machines (SVM) are a set of supervised learning


Fig. 2. Methodological flowchart showing the sequence of methodologies applied for soil organic carbon (SOC) stocks prediction. The most accurate model among Cubist, Random Forests (RF), Support Vector Machines (SVM) and Generalized Linear Models (GLM) was selected to model and map the SOC stocks, producing maps of the 5th percentile (Lower (Q5)), mean values (mean), 95th percentile (Upper (Q95)) and the coefficient of variation (CV).
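
The evaluation and summary steps in the flowchart can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, R² is written in the ratio form of Eq. (3), and the percentiles use simple linear interpolation between ordered values, which is an assumption because the text does not say how Q5 and Q95 were computed.

```python
import math
import statistics
from typing import Dict, Sequence


def r_squared(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Ratio form of Eq. (3): sum((P_i - O_m)^2) / sum((O_i - O_m)^2)."""
    o_mean = statistics.fmean(obs)
    explained = sum((p - o_mean) ** 2 for p in pred)
    total = sum((o - o_mean) ** 2 for o in obs)
    return explained / total


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean squared error, Eq. (4)."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Mean absolute error, Eq. (5)."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)


def percentile(values: Sequence[float], p: float) -> float:
    """p-th percentile by linear interpolation (assumed method)."""
    ordered = sorted(values)
    rank = p * (len(ordered) - 1) / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarise_pixel(predictions: Sequence[float]) -> Dict[str, float]:
    """Mean, Q5, Q95 and CV (%) of the repeated predictions at one pixel."""
    mean = statistics.fmean(predictions)
    return {
        "mean": mean,
        "q5": percentile(predictions, 5),
        "q95": percentile(predictions, 95),
        "cv_percent": 100.0 * statistics.stdev(predictions) / mean,
    }


if __name__ == "__main__":
    # Toy numbers for illustration only.
    observed = [1.0, 2.0, 3.0]
    predicted = [1.2, 1.9, 2.7]
    print(r_squared(observed, predicted), rmse(observed, predicted),
          mae(observed, predicted))
    print(summarise_pixel([2.1, 2.4, 1.9, 2.2, 2.0]))
```

In a full workflow, `summarise_pixel` would be applied to the stack of repeated predictions at every grid cell, so that the Q5 and Q95 layers together bound a 90% interval.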

techniques, for classification and regression, proposed by Cortes and Vapnik (1995). Originally developed as an algorithm for classification, it searches for a hyperplane which leaves the largest possible margin between classes, leading to a better generalization probability (Ließ et al., 2016). The SVM for regression has inherited some of the properties of the SVM classifier in its configuration (Hastie et al., 2009). In the case of regression, the model searches for a function which fulfills the error criterion, in which points on the correct side of the decision boundary and far away from it are ignored in the model optimization. These are called the “low error” points, and they are the ones with small residuals. Points outside the margin are allowed while a penalty weight is introduced. The penalty determines the trade-off between allowing points outside and the flatness of the regression function and is particularly important to decrease the impact of outliers (Smola and Schölkopf, 2004). Within SVM, non-linear regression is typically achieved by applying the so-called Kernel Trick. For this, the training data are first transformed into a higher-dimensional feature space by applying a non-linear kernel function. Then a linear model is adapted to this new feature space (Olson and Delen, 2008) and finally, the linear regression in the new feature space is equivalent to a non-linear regression in the original predictor space. The generalization performance of SVM depends on a good setting of hyperparameters (Cherkassky and Ma, 2004). The parameters used for model tuning in the caret package (Kuhn, 2008) include the penalty (cost) that controls the trade-off between margin and training errors, and the kernel width (sigma) that controls the degree of nonlinearity of the model.

Generalized Linear Models (GLM) are one of the mathematical extensions of linear regression models that account for nonlinearity and inhomogeneous variance structures in the data (Hastie and Tibshirani, 1990). In concept, a GLM is based on an assumed relationship (called a link function) between the mean of the dependent variable and the linear combination of the explanatory variables. The link function allows distributions other than a normal distribution to be used for predictions (Lane, 2002). GLMs, unlike standard regression models, allow incorporating both categorical and quantitative factors in the regression analysis. The boosting procedure, originally developed in the machine learning community primarily to handle classification problems, has been successfully translated into the statistical field (Breiman, 1998; Friedman, 2001) and extended to many statistical problems. Due to its resistance to overfitting, it is particularly useful in the construction of prediction models. Its iterative nature allows straightforward adaptations to cope with high-dimensional data, through its component-wise version (Buehlmann, 2006; Bühlmann and Yu, 2003). The basic idea of boosting is to provide estimates of the parameters (e.g., the regression coefficients) by updating their values step by step. At each iteration, a “weak” estimator is fitted onto a modified version of the data, with the goal of minimizing a pre-specified loss function. The obtained value provides a small contribution used to update the estimate of the parameters: the result of all the contributions is the final estimate. In essence, it combines a set of ‘weak learners’ creating a single ‘strong learner’ (Ridgeway, 2017). Boosting relies on two tuning parameters. A first parameter controls the


“weakness” of the estimator and is usually called the penalty or boosting step size. A second, much more influential tuning parameter is related to the stopping criterion, i.e., it specifies how many boosting iterations are performed.

To select the most accurate model, we predicted SOC stocks by applying each model separately using its own optimal subset of covariates selected by the RFE analysis. The data were randomly separated into 80% for training and 20% for validation (holdout). We used cross-validation with 5 folds to select the best hyperparameters using the training dataset. Model performance was evaluated by applying the fitted model to the validation data (20%), and the accuracy was expressed by the statistical indexes R-squared, R² (Eq. (3)), root mean squared error, RMSE (Eq. (4)), and mean absolute error, MAE (Eq. (5)). For each model the process was repeated 50 times with its own subset of variables, and the models were compared by the mean values of the accuracy parameters (R², RMSE and MAE). Repeating the process is important to determine the variability of the prediction, since different groups of training and validation datasets can generate different accuracy results (Kuhn and Johnson, 2013).

$$R^2 = \frac{\sum_{i=1}^{n} (P_i - O_m)^2}{\sum_{i=1}^{n} (O_i - O_m)^2} \quad (3)$$

$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (P_i - O_i)^2 \right]^{1/2} \quad (4)$$

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |P_i - O_i|}{n} \quad (5)$$

In Eqs. (3), (4) and (5), $O_i$ represents the SOC stocks calculated for each soil depth and $O_m$ the average of these values, $P_i$ is the SOC stocks predicted by the Random Forests model and $n$ is the number of locations used for the prediction.

2.6. Prediction and uncertainty of SOC stocks maps

To build the SOC stocks maps at different depths, we selected among the four models the one that presented the best performance and ran 100 predictions, randomly allocating 80% of the data to training and 20% to validation as described in detail in Section 2.5. This resulted in 100 maps of SOC stocks for each soil depth, and the final map of SOC stocks was the resulting mean value for each pixel. The performance was evaluated in terms of R² (Eq. (3)), RMSE (Eq. (4)) and MAE (Eq. (5)). The SOC stocks predicted for each soil depth (i.e., 0–5, 5–15, 15–30, 30–60, 60–100 cm) were aggregated to a 0–100 cm soil depth and stratified by biome, soil class, and protected areas.

Based on the 100 predictions of SOC stocks for each depth we

(30.5 dag kg−1 at 0–5 cm to 41.58 dag kg−1 at 60–100 cm soil depth). The soil bulk density varied from 0.2 to 1.68 g cm−3, with a mean of 1.28 g cm−3 at the 0–100 cm depth. The SOC stocks varied from 0 to 9.9 kg m−2 and increased with depth, from a mean of 0.98 kg m−2 at 0–5 cm to 2.26 kg m−2 at 60–100 cm. The CV was 65% on average for all depths.

3.2. Covariates selected and models' performance

The models selected distinct groups of covariates for each soil depth (Supplementary material - Tables S1.1–S1.5), and the performance analysis shows that, on average, approximately 15 variables had the same performance as the total set of 74 variables (Fig. 3). The Random Forests algorithm showed the highest performance in the RFE selection, and the most important variables selected for each soil depth are presented in Table 3. The main characteristics selected were soil class, vegetation indexes, temperature, precipitation and morphometric data such as elevation, slope and valley depth.

The Random Forests algorithm also showed the best performance in predicting the SOC stocks from 50 runs, with the highest R² and lowest RMSE and MAE for all depths (Fig. 4). Random Forests had an R² of 0.28 to 0.32 and the performance increased with depth. The second-best performance was from Cubist, just behind Random Forests. The SVM algorithm presented moderate performance, and GLM presented the worst performance for SOC stocks prediction at all depths, with a maximum of 24% of model variance explained. Hence, the Random Forests algorithm was selected for spatial prediction of the distribution of SOC stocks.

The importance of covariates in predicting the SOC stocks by soil depth shows that the soil class was the most important variable to predict the SOC stocks, accounting for a 45 to 52% decrease in mean accuracy (Fig. 5). Covariates related to vegetation, such as NDVI and GPP, were important to predict the distribution of SOC stocks at the surface layer, whereas in deeper layers, the temperature and rainfall were more important.

3.3. Prediction of SOC stocks

The highest SOC stocks occur in southern Brazil and in northwestern Amazon, located at high elevation and under tropical climate, whereas the lowest SOC stocks occurred in northeastern Brazil, where the climate is dry and the land is covered by xeric shrublands (Fig. 2). Approximately 71.3 PgC are stored within the top 100 cm, of which 35.9 PgC are in the 0–30 cm layer (Fig. 6). The spatial variation of SOC stocks on the maps of the 90% prediction interval (lower and upper) and the prediction of the
computed the upper (Q95) and lower (Q5) percentiles for each pixel, mean value follow a similar pattern (Fig. 6). The analysis of the un-
resulting in maps of SOC stocks for upper and lower percentiles. The certainty of prediction (Fig. 6) shows that the total amount of SOC
uncertainty was estimated as the difference between the 95th and the stocks up to 100 cm soil depth can vary from 64.6 to 78 PgC from the
5th percentiles (i.e. 90% prediction interval), as proposed by the lower and upper limits of 90% prediction interval.
GlobalSoilMap project (Arrouays et al., 2014). Additionally, we calcu- The coefficient of variation of SOC stocks predictions ranged from
late the coefficient of variation (CV % = standard deviation / mean) of 0.8 to 42% in the soil depths (Fig. 6, a.4 to e.4). The uncertainty was
the 100 times predictions of SOC stocks for each pixel at the different higher in the Amazon, Cerrado e Pantanal biomes with values of CV
soil depths. from 8.51 to 42%. A spatial analysis of the CV in soil depths shows that,
country areas with CV higher than 8.5% account for 21.2% in the
3. Results 0–5 cm depth, and from 8.4 to 13% in the bellow depths down to
100 cm. An average of 50% of the total area follows within the range of
3.1. Vertical distribution of soil properties 4.5 to 8.5% of CV on predictions.
The performance of the Random Forests model in predicting the
Countrywide, the SOC varied from 0.01 to 205.8 g kg−1 in the SOC stocks is summarized in Table 4 and is relative to average values of
topsoil (0–30 cm) and 0 to 38.27 g kg−1 in the subsoil (30–100 cm) 100 predictions. The highest performance occurred in the subsurface
(Table 2). The mean SOC decreased with depth, from 17.18 g kg−1 in (30–60 cm) for training (R2 = 0.32) and validation (R2 = 0.33). The
the topsoil to 4.55 g kg−1 in the subsoil. The coefficient of variation lowest performance was found at 60–100 cm with R2 = 0.17 for
(CV) of the SOC also decreased with depth, from 92.47% at the 0–5 cm training and validation data sets. Overall, Random Forests showed good
depth to 74.34% at the 60–100 cm depth. The average clay content power of generalization indicated by the similar accuracy results be-
varied from 0 to 96.2 dag kg−1 and increased with depth tween cross-validation and holdout for all soil depths.
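The accuracy indexes used to compare cross-validation and holdout results can be sketched in a few lines. This is a minimal Python illustration, not the authors' implementation: the function names and the toy values are hypothetical, and Eq. (3) is assumed to be the ratio between the sum of squared deviations of the predictions from the observed mean and the total sum of squares of the observations.

```python
import math

def r_squared(observed, predicted):
    """R2 as read from Eq. (3): sum((P_i - O_m)^2) / sum((O_i - O_m)^2)."""
    o_mean = sum(observed) / len(observed)
    explained = sum((p - o_mean) ** 2 for p in predicted)
    total = sum((o - o_mean) ** 2 for o in observed)
    return explained / total

def rmse(observed, predicted):
    """Root mean squared error, Eq. (4)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error, Eq. (5)."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n

# Hypothetical SOC stocks (kg m-2) at four validation locations.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
print(r_squared(obs, pred), rmse(obs, pred), mae(obs, pred))
```

Under this reading, R2 is not the same quantity as the also common 1 − SSE/SST form, so values computed with the two definitions should not be compared directly.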


Table 2
Descriptive statistics of soil organic carbon (SOC), soil bulk density (Ds), clay content and SOC stocks (number of samples = 8227).

Depth (cm)  Variable   Unit       Min    Max     Mean    Sd      Median  CV (%)
0–5         SOC        g kg−1     0.1    190.5   17.18   15.89   13.56   92
            Ds         g cm−3     0.21   1.66    1.28    0.18    1.31    14
            Clay       dag kg−1   0.01   95.68   30.65   20.27   27.16   66
            SOC stock  kg m−2     0      4.31    0.98    0.64    0.88    65
5–15        SOC        g kg−1     0.1    177.4   15.31   14.23   12.12   93
            Ds         g cm−3     0.25   1.57    1.29    0.17    1.313   14
            Clay       dag kg−1   0.01   95.28   31.74   20.33   28.77   64
            SOC stock  kg m−2     0.01   8.68    1.78    1.16    1.59    65
15–30       SOC        g kg−1     0      205.8   11.7    11.7    8.99    100
            Ds         g cm−3     0.21   1.63    1.29    0.16    1.32    13
            Clay       dag kg−1   0.06   95.71   34.7    20.68   32.59   60
            SOC stock  kg m−2     0.01   9.98    2.07    1.44    1.74    70
30–60       SOC        g kg−1     0      38.27   6.67    5.01    5.28    75
            Ds         g cm−3     0.84   1.68    1.3     0.14    1.32    11
            Clay       dag kg−1   0.01   96.26   38.78   21.17   37.85   55
            SOC stock  kg m−2     0      10      2.47    1.63    2.08    66
60–100      SOC        g kg−1     0      24.42   4.55    3.38    3.66    74
            Ds         g cm−3     0.89   1.66    1.3     0.13    1.31    10
            Clay       dag kg−1   0.04   96.21   41.58   21.41   40.89   51
            SOC stock  kg m−2     0      9.87    2.26    1.51    1.91    67

Sd: Standard deviation, CV: Coefficient of variation.
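As an illustration of how layer values combine into a profile total, the sketch below accumulates the mean SOC stocks of the five depth layers in Table 2. It is only an arithmetic illustration in Python with hypothetical variable names: because the number of samples differs between layers, the sum of layer means is not necessarily the mean 0–100 cm stock of the sampled profiles, and it is not the procedure used to build the maps.

```python
from itertools import accumulate

# Mean SOC stock per layer (kg m-2), copied from Table 2.
layer_means = {
    "0-5": 0.98,
    "5-15": 1.78,
    "15-30": 2.07,
    "30-60": 2.47,
    "60-100": 2.26,
}

# Running total down the profile, rounded to two decimals for display.
cumulative = [round(v, 2) for v in accumulate(layer_means.values())]
topsoil_0_30 = cumulative[2]    # sum of the 0-5, 5-15 and 15-30 cm layers
profile_0_100 = cumulative[-1]  # sum of all five layers

for depth, total in zip(layer_means, cumulative):
    lower_limit = depth.split("-")[1]
    print(f"0 to {lower_limit} cm: {total:.2f} kg m-2")
```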

Fig. 3. Models' performance in the process of selecting variables for each soil depth using Recursive Feature Elimination.

3.4. Distribution of SOC stocks by biome, soil classes and in protected areas

The Amazon biome holds the largest amount of carbon (36.10 PgC) among the biomes, and the Pantanal the smallest (0.77 PgC) (Fig. 7a). The regions with the highest SOC stocks at 1 m depth are the northwestern Amazon, the southeastern Atlantic Forest and dispersed areas in the Cerrado. The SOC stocks density up to 1 m depth varied considerably across the biomes, with the Atlantic Forest having the highest average value (12.08 kg m−2), especially in the southeast, whereas the Pantanal and Caatinga showed the lowest values (6.04 and 6.40 kg m−2) (Fig. 7b). The cumulative SOC stocks density was more homogeneous between biomes in the first soil layer (0–5 cm), whereas deeper layers showed large differences. In the first layers of soil (0–15 cm), the average cumulative SOC stock is higher in the forested biomes (Amazon and Atlantic Forest), while in deeper layers the highest values are found in the Atlantic Forest biome. More than 57% of the total SOC stocks were found in Ferralsols and Acrisols. However, Histosols showed the highest SOC stocks density (14.87 kg m−2) and Luvisols the lowest (6.45 kg m−2) (Table 5). The SOC stock stored in protected areas (conservation units and indigenous areas) is about 22.2 PgC, representing approximately 31% of the total SOC stock of the entire Brazilian territory.


Table 3
Selected variables using the Recursive Feature Elimination-Random Forests method to model the distribution of soil organic carbon (SOC) stocks at the different soil
depths.
Soil depth (cm)

0–5 5–15 15–30 30–60 60–100

Soil class Soil class Soil class SAMT Soil class


NDVI Precipitation 9 SAMT Precipitation 12 SAMT
GPP GPP Valley depth Soil class Precipitation 2
Bio 12 SAMT Precipitation 9 Slope height Precipitation 9
Valley depth Bio 12 Precipitation 10 GPP Precipitation 12
Precipitation 7 Precipitation 1 GPP Normalized height Precipitation 8
Precipitation 9 Normalized height Slope height Valley depth Slope height
Precipitation 3 Valley depth Precipitation 01 Precipitation 4 Valley depth
SAMT NDVI NDVI Precipitation 1 GPP
Bio 5 Precipitation 8 Precipitation 03 NDVI NDVI
Elevation Bio 5 Standardized height Precipitation 9 Normalized height
Precipitation 10 Slope height Bio 12 Precipitation 11 Terrain surface convexity
Slope height slope Precipitation 2 Bio 12
Standardized height Precipitation 10 Normalized height Bio 7
Biome Elevation Slope Standardized height

Fig. 4. Performance of the Cubist, Generalized Linear Model Boosting (GLM), Random Forests (RF) and Support Vector Machines (SVM) models in predicting soil organic carbon (SOC) stocks, assessed by the R-squared (R2), root mean squared error (RMSE) and mean absolute error (MAE) of the validation dataset (Holdout).

Fig. 5. The relative importance of covariates, given by the decrease in mean accuracy (IncMSE), for soil organic carbon (SOC) stocks prediction per soil depth interval (cm) using the RFE-Random Forests model.

4. Discussion

4.1. Model performance and uncertainty

The methodological framework optimized the prediction of SOC stocks by applying DSM methodologies for selecting variables and models, as well as for assessing prediction uncertainty. In general, the RFE did not improve the model performance, as found by Stevens et al. (2013), but selecting the best subset of variables for each soil depth, reducing the total of 74 to 10–20 variables, considerably decreased the processing time. This is particularly relevant for large territories such as Brazil. The framework enables choosing the best model based on its performance with the optimal subset of covariates with the highest confidence, since the variables were selected a priori by the respective model using RFE. The framework also provides a measure of the prediction uncertainty, which is one of the gains of applying DSM to model SOC stocks. Therefore, this framework enables the production of maps with known accuracy and is suitable for mapping SOC stocks in large areas.

Our results confirm that the Random Forests model is a powerful machine learning algorithm for predictions, as reported by Nawar and Mouazen (2017). The prediction statistics (Table 4) show no overfitting, as the R2 values of the training and validation sets are similar for all five depths. The RMSE and MAE corroborate this assertion, presenting small differences between training and validation. The variation of R2 (CV) was less than 4% across all depth increments in the training set, which indicates low variation in prediction. For the validation sets, the CV was higher than for the training sets, but the values were lower than 11% for all depths. The higher variation in the validation sets indicates an increase in the prediction uncertainty, which may be related to the smaller number of samples used for validation compared to training, or to the internal data selection process of the Random Forests model.

As pointed out by Malone et al. (2018), a major source of error in DSM is the sparseness of soil data; in this regard, we consider the low representativeness of soil samples in the Amazon and Cerrado biomes to be one possible reason for the high uncertainty of the prediction in these regions (Fig. 6). In addition, the use of legacy data implies unknown uncertainties due to measurement, positioning and temporal changes of the soil information.


Fig. 6. Total SOC stocks predicted for the 0–5 (a), 5–15 (b), 15–30 (c), 30–60 (d) and 60–100 cm (e) soil depths from the 100 predictions with the Random Forests model: lower 5% limit of the prediction interval (a.1–e.1), mean (a.2–e.2), upper 95% limit (a.3–e.3), and coefficient of variation, CV (a.4–e.4), with the respective percentage of area per class range in brackets.
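The per-pixel summaries mapped in Fig. 6 can be sketched as follows. This is a minimal Python illustration rather than the authors' code: the function name and the example values are hypothetical, the percentile interpolation rule (the "inclusive" method of `statistics.quantiles`) is an assumption because the source does not state which rule was used, and the sample standard deviation is assumed for the CV.

```python
import statistics

def summarize_pixel(predictions):
    """Summarize repeated SOC stock predictions (kg m-2) for one pixel."""
    # 19 cut points at 5% steps: the first is the 5th percentile (Q5),
    # the last is the 95th percentile (Q95).
    cuts = statistics.quantiles(predictions, n=20, method="inclusive")
    q5, q95 = cuts[0], cuts[-1]
    mean = statistics.fmean(predictions)
    # CV % = standard deviation / mean, expressed as a percentage.
    cv_percent = 100 * statistics.stdev(predictions) / mean
    return {
        "mean": mean,
        "q5": q5,
        "q95": q95,
        "pi90_width": q95 - q5,  # width of the 90% prediction interval
        "cv_percent": cv_percent,
    }

# Hypothetical example: 100 predictions for a single pixel.
example = [9.0 + 0.02 * i for i in range(100)]
summary = summarize_pixel(example)
print({key: round(value, 2) for key, value in summary.items()})
```

Applying the same function to every pixel of the 100 prediction maps would give the lower, mean, upper and CV layers for one soil depth.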

Table 4
Performance of the Random Forests model to predict the spatial distribution of SOC stocks with the mean values from 100 predictions.

Soil depth (cm)  R2    CV (%) R2  RMSE (kg m−2)  CV (%) RMSE  MAE (kg m−2)  CV (%) MAE  n
5-fold cross-validation
0–5              0.22  3.4        0.56           0.6          0.41          0.6         6583
5–15             0.26  2.6        0.99           0.7          0.71          0.5         6574
15–30            0.31  2.1        1.19           0.7          0.84          0.6         6464
30–60            0.32  2.6        1.31           0.9          0.95          0.9         5618
60–100           0.17  2.9        1.49           0.9          1.08          0.8         4923
Validation (Holdout)
0–5              0.23  10.3       0.56           2.4          0.40          2.07        1644
5–15             0.26  8.0        0.66           2.8          0.70          1.8         1640
15–30            0.32  6.9        1.19           2.6          0.84          1.7         1615
30–60            0.33  7.3        1.33           3.1          0.94          2.2         1404
60–100           0.17  8.8        1.48           3.3          1.08          2.6         1228

Mean absolute error (MAE), root mean squared error (RMSE), number of samples (n).

Fig. 7. Total SOC stocks in PgC (a) and cumulative SOC stocks (kg m−2) for different soil depth intervals (b) for the biomes Amazon, Atlantic Forest, Caatinga, Cerrado, Pampa and Pantanal.

4.2. Soil carbon stocks in Brazil

The estimates of SOC stock to the 100 cm depth in previous studies were mainly based on average SOC stocks by ecosystem types and soil classes. Schroeder and Winjum (1995) estimated that Brazilian soils contain approximately 72 PgC in the 0–100 cm depth, based on ecosystem types (Zinke et al., 1986). Bernoux et al. (2002) estimated SOC stocks of 36.4 ± 3.4 PgC for the upper 0–30 cm only, using soil and vegetation data and determining missing soil bulk density data by linear regression. Batjes (2005) estimated SOC stocks of 65.9–67.5 PgC for the whole territory using SOTER 1:5 M.

In the present study, using machine learning algorithms, we found values of SOC stocks varying from 64.6 to 78 PgC with a mean value of 71.3 PgC at 0–100 cm soil depth (Fig. 6), consistent with previous reports. Although total SOC stocks are in agreement with the literature, our study highlights a very contrasting distribution pattern of SOC stocks between biomes (Fig. 7a).

The SOC stock for the Amazon biome (36.1 PgC) at 0–100 cm was lower than previously published values (Batjes and Dijkshoorn, 1999; Moraes et al., 1995), which estimated 47 and 46.5 PgC at 0–100 cm soil depth, respectively. These authors used the mean values of SOC and soil bulk density of each soil class to determine the SOC stocks. Cerri et al. (1999) estimated the SOC stocks of the Amazon at 23.4 PgC at 0–30 cm and 41 PgC up to 100 cm soil depth, using a soil map of the national database and soil bulk density calculated by linear regression.

4.3. Soil carbon stock distribution in Brazil

Soil organic carbon stocks are the result of the balance between inputs and outputs of carbon in soils; several factors affect this balance, according to environmental conditions (Davidson and Janssens, 2006) and local characteristics, such as vegetation (Jobbágy and Jackson, 2000) and topography (Tang et al., 2017). In our study, soil class, the sum of monthly mean temperature, precipitation and vegetation were the most important covariates driving the distribution of SOC stocks in Brazil, with varying influences according to soil depth (Fig. 5). In the first layer (0–5 cm), vegetation characteristics (NDVI and GPP) largely influenced the SOC stock distribution, indicating the importance of vegetation as a driver of organic matter accumulation. In deeper layers, the climate factor (sum of monthly mean temperature and precipitation) was more important for SOC accumulation, suggesting a cumulative long-term effect of climate variables on the accumulation of SOC stocks.

The Amazon biome showed the highest SOC stocks, whereas the Pantanal and Caatinga showed the lowest (Fig. 7), but absolute values of SOC stock only serve as a general measure of the magnitude of the carbon stored in each region. A much more relevant understanding of SOC accumulation is attained by analysing the SOC density. In terms of SOC accumulation per unit area, there is a clear trend of higher values in the milder climates of the mountainous regions of southeastern and southern Brazil, where moderate, prominent or humic A horizons coexist, suggesting a long-term pattern of climatically driven SOC stabilization. These values are nearly twice those of the Caatinga biome (Fig. 7), where erosion is widespread and weak (ochric) A horizons predominate, with little accumulation of SOC. The NW region of the Amazon biome also revealed a high SOC stocks density, mostly associated with mobile, unprotected SOC forms that are highly unstable and subject to alluvial losses in spodic horizons, as shown by Schaefer et al. (2008).

The lowest potential for carbon accumulation in soils occurs along a NE/SW dry axis, with minimum values at both extremes (Pantanal and Caatinga), where the climates are much drier and seasonal. In the Cerrado, due to the high altitude of the Central Plateau, there is a greater relative accumulation of carbon, but less than under forest. Among the forests, the highest values are those of the Atlantic Forest, since the much warmer Amazon climate leads to higher rates of SOC mineralization, greatly reducing its potential for accumulation (Schaefer et al., 2008).

Soil classes were key to predicting the distribution of SOC stocks (Fig. 5), as they are associated with texture and structure, directly affecting SOC accumulation/stabilization processes (Harrison-Kirk et al., 2013; Lenka and Lal, 2013). Most of the SOC stocks are stored in Ferralsols and Acrisols, the dominant soils throughout Brazil, regardless of the SOC density. When analysing the SOC stocks density (kg m−2), we found that Histosols had the highest values (14.87 kg m−2), as expected for organic-rich soils, and semi-arid Luvisols the lowest (6.45 kg m−2; Table 5). Histosols are found under very local conditions of low temperature or low oxygen availability that prevent organic matter degradation. Luvisols are located in the hottest and driest regions (Caatinga), under a semi-arid climate that favors strong losses by erosion and lower organic matter inputs by vegetation.

The SOC stocks modelled in this study are based on legacy data collected between 1970 and 2005; since then, the agricultural area in Brazil has greatly increased. This means that the values for the first soil layers may now be overestimated, especially in the Cerrado biome, where intensive agriculture has expanded in recent decades. However, the adoption of no-till in recent years may reverse this trend in Cerrado areas, increasing SOC content (Bayer et al., 2006). Nevertheless, Brazil contains large forested areas without anthropogenic interference and protected areas for nature conservation, for which the modelled SOC stocks in the first layers may be more accurate.

Brazil has the largest areas of tropical vegetation in the world and plays a key global role in the provision of ecosystem services, such as conservation of biodiversity (Barlow et al., 2016), climate regulation (Salazar et al., 2007) and carbon sequestration in vegetation biomass (Englund et al., 2017). Extensive areas of tropical forest in the Amazon biome are protected by Brazilian legislation (Fig. 1c), which contributes to maintaining the provision of ecosystem services. Although protected areas are important for the provision of ecosystem services, this study highlights an additional benefit by showing that these areas contain approximately 22.2 PgC at 0–100 cm soil depth, which is supposedly protected from anthropogenic actions in the long term. The magnitude of SOC stocks in the Brazilian protected areas is nearly twice the total combined SOC stock (11.8 PgC) within 100 cm soil depth estimated for the following countries: Bulgaria (0.31 PgC), Denmark (0.37 PgC), Italy (0.99 PgC), Netherlands (0.30 PgC), Poland (1.75 PgC), Slovakia (0.12 PgC), France (3.1 PgC), Great Britain (4.6 PgC) and Belgium (0.303 PgC) (Arrouays et al., 2001; Bradley et al., 2005; Lettens et al., 2004; Panagos et al., 2013). This highlights the strategic position of Brazil for global policies aiming to maintain stored SOC and mitigate future climate change and land use impacts.
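The comparison with the European estimates rests on simple arithmetic that can be checked directly. The sketch below (Python, with hypothetical variable names) sums the country values quoted in the text and relates the protected-area stock to that sum and to the national total of 71.3 PgC reported in this study.

```python
# SOC stocks within 100 cm (PgC) quoted in the text for each country.
country_stocks = {
    "Bulgaria": 0.31,
    "Denmark": 0.37,
    "Italy": 0.99,
    "Netherlands": 0.30,
    "Poland": 1.75,
    "Slovakia": 0.12,
    "France": 3.1,
    "Great Britain": 4.6,
    "Belgium": 0.303,
}
protected_areas = 22.2  # PgC in Brazilian protected areas (this study)
national_total = 71.3   # PgC in the top 100 cm of Brazilian soils (this study)

combined = sum(country_stocks.values())
ratio = protected_areas / combined
share = 100 * protected_areas / national_total

print(f"Combined stock of the listed countries: {combined:.1f} PgC")
print(f"Protected areas relative to combined stock: {ratio:.2f}")
print(f"Protected areas as share of national total: {share:.1f}%")
```

With the figures quoted, the combined stock is about 11.8 PgC, the protected-area stock is roughly 1.9 times larger, and it corresponds to about 31% of the national total.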


Table 5
Average of 100 runs prediction of soil organic carbon stocks in the top 100 cm soil according to FAO soil groups.

Soil class (SiBCS)              Soil class (IUSS-WRB)  Area (km2)  SOC stock (PgC)  SOC density (kg C m−2)
Afloramento de rochas           Rock outcrops          262,456     2.34             9.45
Argissolos                      Acrisols               2,017,050   17.73            9.30
Cambissolos                     Cambisols              410,276     4.82             12.44
Chernossolos                    Chernozems             33,685      0.32             10.14
Espodossolos                    Podzols                168,806     1.65             10.35
Gleissolos                      Gleysols               392,046     3.55             9.58
Latossolos                      Ferralsols             2,586,520   24.01            9.82
Luvissolos                      Luvisols               234,998     1.43             6.45
Neossolos Regoliticos           Regosols               1,048,710   9.06             9.14
Nitossolos                      Nitisols               82,920      0.92             11.77
Organossolos                    Histosols              1613        0.02             14.87
Planossolos                     Planosols              208,183     1.32             6.69
Plintossolos                    Plinthosols            575,049     4.54             8.35
Vertissolos/Neossolos Flúvicos  Vertisols/Fluvisols    15,319      0.10             7.08

SiBCS - Brazilian System of Soil Classification.
IUSS - WRB (World reference base for soil resources, 2015).

5. Conclusions

This study shows that machine learning approaches can be successfully applied for mapping SOC stocks and their uncertainty at large scale.

• The methodological framework optimizes the prediction process without loss of performance, selecting only the most important variables and the best model, and predicting the SOC stocks with the associated uncertainty.
• The most important covariates that influence SOC stocks distribution in Brazil are the soil class, climate (sum of monthly mean temperature, precipitation), slope height and vegetation indexes (NDVI, GPP).
• The Random Forests model showed the best performance in predicting SOC stocks at the national level, attaining the highest accuracy compared to the other tested machine learning methods.
• Brazilian soils store approximately 71.3 PgC within the top 100 cm, half of which (35.91 PgC) is stored in the surface layer (0–30 cm).
• Although the Amazon biome has the highest absolute SOC stocks (36.1 PgC), it does not have the greatest relative potential for accumulation.
• More than 57% of total SOC stocks are in deep-weathered Ferralsols and Acrisols, showing that deeper layers must be considered for a more accurate and representative SOC stock estimate.
• High altitudes and subtropical areas of Brazil (southern, southeastern) have the greatest SOC density, revealing the highest potential for carbon sequestration.

Acknowledgment

We thank the anonymous reviewers for their careful reading and valuable comments, which helped to improve the manuscript, and Cristiano Souza for the help with graphic design. We thank the Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) for the first author's scholarship, and the ‘Programa de Pós-Graduação em Solos e Nutrição de Plantas’ (PGSNP) of the Soil Department of the Federal University of Viçosa, Brazil, for the support given. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, through a scholarship of the PNPD project to the co-author Eliana de Souza.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.geoderma.2019.01.007.

References

Abdelbaki, A.M., 2018. Evaluation of pedotransfer functions for predicting soil bulk density for U.S. soils. Ain Shams Eng. J. 9 (4), 1611–1619.
Adhikari, K., et al., 2013. High-resolution 3-D mapping of soil texture in Denmark. Soil Sci. Soc. Am. J. 77 (3), 860–876.
Adhikari, K., et al., 2014. Digital mapping of soil organic carbon contents and stocks in Denmark. PLoS One 9 (8), e105519.
Alvares, C.A., Stape, J.L., Sentelhas, P.C., de Moraes Gonçalves, J.L., Sparovek, G., 2013. Köppen's climate classification map for Brazil. Meteorol. Z. 22 (6), 711–728.
Arrouays, D., Deslais, W., Badeau, V., 2001. The carbon content of topsoil and its geographical distribution in France. Soil Use Manag. 17 (1), 7–11.
Arrouays, D., et al., 2014. Chapter three-GlobalSoilMap: toward a fine-resolution global grid of soil properties. Adv. Agron. 125, 93–134.
Ballabio, C., 2009. Spatial prediction of soil properties in temperate mountain regions using support vector regression. Geoderma 151 (3–4), 338–350.
Barlow, J., et al., 2016. Anthropogenic disturbance in tropical forests can double biodiversity loss from deforestation. Nature 535 (7610), 144–147.
Batjes, N.H., 1998. Mitigation of atmospheric CO2 concentrations by increased carbon sequestration in the soil. Biol. Fertil. Soils 27 (3), 230–235.
Batjes, N., 2005. Organic carbon stocks in the soils of Brazil. Soil Use Manag. 21 (1), 22–24.
Batjes, N., Dijkshoorn, J., 1999. Carbon and nitrogen stocks in the soils of the Amazon region. Geoderma 89 (3), 273–286.
Batty, M., Torrens, P.M., 2001. Modelling complexity: the limits to prediction. Cybergeo Eur. J. Geogr. https://doi.org/10.4000/cybergeo.1035. Dossiers, document 201, published online 4 December 2001, accessed 14 January 2019. URL: http://journals.openedition.org/cybergeo/1035.
Bayer, C., Martin-Neto, L., Mielniczuk, J., Pavinato, A., Dieckow, J., 2006. Carbon sequestration in two Brazilian Cerrado soils under no-till. Soil Tillage Res. 86 (2), 237–245.
Benites, V.M., Machado, P.L., Fidalgo, E.C., Coelho, M.R., Madari, B.E., 2007. Pedotransfer functions for estimating soil bulk density from existing soil survey reports in Brazil. Geoderma 139 (1), 90–97.
Bernoux, M., Cerri, C., Arrouays, D., Jolivet, C., Volkoff, B., 1998. Bulk densities of Brazilian Amazon soils related to other soil properties. Soil Sci. Soc. Am. J. 62 (3), 743–749.
Bernoux, M., da Conceição Santana Carvalho, M., Volkoff, B., Cerri, C.C., 2002. Brazil's soil carbon stocks. Soil Sci. Soc. Am. J. 66 (3), 888–896.
Bishop, T.F.A., McBratney, A.B., Laslett, G.M., 1999. Modelling soil attribute depth functions with equal-area quadratic smoothing splines. Geoderma 91 (1), 27–45.
Bonfatti, B.R., Hartemink, A.E., Giasson, E., Tornquist, C.G., Adhikari, K., 2016. Digital mapping of soil carbon in a viticultural region of southern Brazil. Geoderma 261, 204–221.
Bradley, R., et al., 2005. A soil carbon and land use database for the United Kingdom. Soil Use Manag. 21 (4), 363–369.
Breiman, L., 1998. Arcing classifier (with discussion and a rejoinder by the author). Ann. Stat. 26 (3), 801–849.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Breiman, L., 2002. Manual on setting up, using, and understanding random forests v3.1. Statistics Department University of California Berkeley, CA, USA, pp. 1.
Brenning, A., 2008. In: Boehner, J., Blaschke, T., Montanarella, L. (Eds.), Statistical Geocomputing Combining R and SAGA: The Example of Landslide Susceptibility Analysis with Generalized Additive Models. Hamburger Beitrage zur Physischen Geographie und Landschaftsoekologie.
Brungard, C.W., Boettinger, J.L., Duniway, M.C., Wills, S.A., Edwards Jr., T.C., 2015. Machine learning for predicting soil classes in three semi-arid landscapes. Geoderma 239, 68–83.
Buehlmann, P., 2006. Boosting for high-dimensional linear models. Ann. Stat. 34 (2), 559–583.
Bühlmann, P., Yu, B., 2003. Boosting with the L2 loss: regression and classification. J. Am. Stat. Assoc. 98 (462), 324–339.
Bui, E., Henderson, B., Viergever, K., 2009. Using knowledge discovery with data mining from the Australian Soil Resource Information System database to inform soil carbon mapping in Australia. Glob. Biogeochem. Cycles 23 (4) (n/a-n/a).
Cerri, C., Bernoux, M., Arrouays, D., Feigl, B., Piccolo, M., 1999. Carbon stocks in soils of the Brazilian Amazon. In: Global Climate Change and Tropical Ecosystems. 199. CRC, Boca Raton, pp. 33–50.
Cherkassky, V., Ma, Y., 2004. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw. 17 (1), 113–126.
Cooper, M., Mendes, L.M.S., Silva, W.L.C., Sparovek, G., 2005. A national soil profile database for Brazil available to international scientists. Soil Sci. Soc. Am. J. 69 (3), 649.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297.
CPRM, 2004. Carta Geológica do Brasil ao Milionésimo: sistema de informações geográficas-SIG [Geological Map of Brazil 1:1.000.000 scale: geographic information system-GIS]. CPRM, Brasília.
Davidson, E.A., Janssens, I.A., 2006. Temperature sensitivity of soil carbon decomposition and feedbacks to climate change. Nature 440 (7081), 165–173.


Edenhofer, O., et al., 2014. IPCC, 2014: Climate Change 2014: Mitigation of climate change. In: Contribution of Working Group III to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge, United Kingdom and New York, NY, USA.
Englund, O., et al., 2017. A new high-resolution nationwide aboveground carbon map for Brazil. Geo: Geogr. Environ. 4 (2), e00045.
Fissore, C., et al., 2017. Influence of topography on soil organic carbon dynamics in a Southern California grassland. Catena 149, 140–149.
Fontaine, S., et al., 2007. Stability of organic carbon in deep soil layers controlled by fresh carbon supply. Nature 450 (7167), 277.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 1189–1232.
Gamble, J.D., Feyereisen, G.W., Papiernik, S.K., Wente, C., Baker, J., 2017. Regression-Kriged soil organic carbon stock changes in manured corn silage–alfalfa production systems. Soil Sci. Soc. Am. J. 81 (6), 1557.
Gray, J.M., Bishop, T.F.A., Yang, X., 2015. Pragmatic models for the prediction and digital mapping of soil properties in eastern Australia. Soil Res. 53 (1), 24.
Grüneberg, E., Schöning, I., Hessenmöller, D., Schulze, E.D., Weisser, W.W., 2013. Organic layer and clay content control soil organic carbon stocks in density fractions of differently managed German beech forests. For. Ecol. Manag. 303, 1–10.
Harrison-Kirk, T., Beare, M.H., Meenken, E.D., Condron, L.M., 2013. Soil organic matter and texture affect responses to dry/wet cycles: effects on carbon dioxide and nitrous oxide emissions. Soil Biol. Biochem. 57, 43–55.
Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Wiley Online Library.
Hastie, T., Tibshirani, R., Friedman, J., 2009. Overview of Supervised Learning, the Elements of Statistical Learning. Springer, pp. 9–41.
Hengl, T., et al., 2015. Mapping soil properties of Africa at 250 m resolution: random forests significantly improve current predictions. PLoS One 10 (6), e0125814.
Hengl, T., Kempen, B., Heuvelink, G., Malone, B., 2017. Package ‘GSIF’.
Hijmans, R.J., Cameron, S.E., Parra, J.L., Jones, P.G., Jarvis, A., 2005. Very high resolution interpolated climate surfaces for global land areas. Int. J. Climatol. 25 (15), 1965–1978.
Hounkpatin, O.K.L., Op de Hipt, F., Bossa, A.Y., Welp, G., Amelung, W., 2018. Soil organic carbon stocks and their determining factors in the Dano catchment (Southwest Burkina Faso). Catena 166, 298–309.
Mishra, U., et al., 2009. Predicting soil organic carbon stock using profile depth distribution functions and ordinary kriging. Soil Sci. Soc. Am. J. 73 (2), 614–621.
Moraes, J.L., Cerri, C.C., Melillo, J.M., Kicklighter, D., Neill, C., Steudler, P.A., Skole, D.L., 1995. Soil carbon stocks of the Brazilian Amazon Basin. Soil Sci. Soc. Am. J. 59 (1), 244–247.
Mulder, V.L., Lacoste, M., Richer-de-Forges, A.C., Martin, M.P., Arrouays, D., 2016. National versus global modelling the 3D distribution of soil organic carbon in mainland France. Geoderma 263, 16–34.
Nawar, S., Mouazen, A.M., 2017. Comparison between random forests, artificial neural networks and gradient boosted machines methods of on-line Vis-NIR spectroscopy measurements of soil total nitrogen and total carbon. Sensors 17 (10), 2428.
Nyssen, J., et al., 2008. Spatial and temporal variation of soil organic carbon stocks in a lake retreat area of the Ethiopian Rift Valley. Geoderma 146 (1–2), 261–268.
Olson, D.L., Delen, D., 2008. Advanced Data Mining Techniques. Springer Science & Business Media.
Ottoy, S., De Vos, B., Sindayihebura, A., Hermy, M., Van Orshoven, J., 2017. Assessing soil organic carbon stocks under current and potential forest cover using digital soil mapping and spatial generalisation. Ecol. Indic. 77, 139–150.
Oueslati, I., Allamano, P., Bonifacio, E., Claps, P., 2013. Vegetation and topographic control on spatial variability of soil organic carbon. Pedosphere 23 (1), 48–58.
Pan, L., Politis, D.N., 2016. Bootstrap prediction intervals for linear, nonlinear and nonparametric autoregressions. J. Statist. Plann. Inference 177, 1–27.
Panagos, P., Hiederer, R., Van Liedekerke, M., Bampa, F., 2013. Estimating soil organic carbon in Europe based on data collected through an European network. Ecol. Indic. 24 (Supplement C), 439–450.
Powers, J.S., Corre, M.D., Twine, T.E., Veldkamp, E., 2011. Geographic bias of field observations of soil carbon stocks with tropical land-use changes precludes spatial extrapolation. Proc. Natl. Acad. Sci. 108 (15), 6318–6322.
Quinlan, J.R., 1992. Learning with continuous classes. In: 5th Australian joint conference on artificial intelligence, pp. 343–348 (Singapore).
Ridgeway, G., 2017. Generalized boosted regression models. In: R package version 2.1.3. http://CRAN.R-project.org/package=gbm.
Rodríguez-Lado, L., Rial, M., Taboada, T., Cortizas, A.M., 2015. A pedotransfer function to map soil bulk density from limited data. Procedia Environ. Sci. 27, 45–48.
Rossel, R.V., Brus, D., Lobsey, C., Shi, Z., McLachlan, G., 2016. Baseline estimates of soil organic carbon by proximal sensing: comparing design-based, model-assisted and model-based inference. Geoderma 265, 152–163.
IBGE and EMBRAPA, 2001. Mapa de solos do Brasil. 1:5.000.000. IBGE, Rio de Janeiro, pp. Working Group WRB. World reference base for soil resources 2014, update 2015:
International soil classification system for naming soils and creating legends for soil Rudiyanto, Minasny, B., Setiawan, B.I., Saptomo, S.K., McBratney, A.B., 2018. Open di-
maps. Rome: Food and Agriculture Organization of the United Nations; 2015. World gital mapping as a cost-effective method for mapping peat thickness and assessing the
Soil Resources Reports, 106. carbon stock of tropical peatlands. Geoderma 313, 25–40.
Jarvis, A., Reuter, H.I., Nelson, A., Guevara, E., 2008. Hole-filled SRTM for the globe Salazar, L.F., Nobre, C.A., Oyama, M.D., 2007. Climate change consequences on the
version 4, available from the CGIAR-CSI SRTM 90m database. http://srtm.csi.cgiar. biome distribution in tropical South America. Geophys. Res. Lett. 34 (9).
org. Santos, H.G., 2011. O novo mapa de solos do Brasil. Embrapa Solos, Rio de Janeiro.
Jobbágy, E.G., Jackson, R.B., 2000. The vertical distribution of soil organic carbon and its Schaefer, C.E., et al., 2008. Soil and vegetation carbon stocks in Brazilian Western
relation to climate and vegetation. Ecol. Appl. 10 (2), 423–436. Amazonia: relationships and ecological implications for natural landscapes. Environ.
Kuhn, M., 2008. Caret package. J. Stat. Softw. 28 (5), 1–26. Monit. Assess. 140 (1–3), 279–289.
Kuhn, M., 2017. Classification and regression with random forest. In: R package version Scharlemann, J.P., Tanner, E.V., Hiederer, R., Kapos, V., 2014. Global soil carbon: un-
4.6–12, . http://CRAN.R-project.org/package=randomForest. derstanding and managing the largest terrestrial carbon pool. Carbon Manag. 5 (1),
Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling, 810. Springer. 81–91.
Lal, R., 2004. Soil carbon sequestration to mitigate climate change. Geoderma 123 (1–2), Schroeder, P.E., Winjum, J.K., 1995. Assessing Brazil's carbon budget: I. biotic carbon
1–22. pools. For. Ecol. Manag. 75 (1–3), 77–86.
Lane, P., 2002. Generalized linear models in soil science. Eur. J. Soil Sci. 53 (2), 241–251. Shrestha, D.L., Solomatine, D.P., 2006. Machine learning approaches for estimation of
Lenka, N.K., Lal, R., 2013. Soil aggregation and greenhouse gas flux after 15 years of prediction interval for the model output. Neural Netw. 19 (2), 225–235.
wheat straw and fertilizer management in a no-till system. Soil Tillage Res. 126, Smola, A.J., Schölkopf, B., 2004. A tutorial on support vector regression. Stat. Comput. 14
78–89. (3), 199–222.
Lettens, S., Orshoven, J.v., Wesemael, B.a., Muys, B., 2004. Soil organic and inorganic Solomatine, D.P., Shrestha, D.L., 2009. A novel method to estimate model uncertainty
carbon contents of landscape units in Belgium derived using data from 1950 to 1970. using machine learning techniques. Water Resour. Res. 45 (12).
Soil Use Manag. 20 (1), 40–47. Souza, E.d., et al., 2016. Pedotransfer functions to estimate bulk density from soil
Ließ, M., Schmidt, J., Glaser, B., 2016. Improving the spatial prediction of soil organic properties and environmental covariates: Rio Doce basin. Sci. Agric. 73 (6), 525–534.
carbon stocks in a complex Tropical Mountain landscape by methodological speci- Stevens, A., Nocita, M., Tóth, G., Montanarella, L., van Wesemael, B., 2013. Prediction of
fications in machine learning approaches. PLoS One 11 (4), e0153673. soil organic carbon at the European scale by visible and near InfraRed reflectance
Malone, B.P., McBratney, A.B., Minasny, B., Laslett, G.M., 2009. Mapping continuous spectroscopy. PLoS One 8 (6), e66409.
depth functions of soil carbon storage and available water capacity. Geoderma 154 Tang, X., Xia, M., Pérez-Cruzado, C., Guan, F., Fan, S., 2017. Spatial distribution of soil
(1), 138–152. organic carbon stock in Moso bamboo forests in subtropical China. Sci. Rep. 7, 42640.
Malone, B.P., McBratney, A.B., Minasny, B., 2011. Empirical estimates of uncertainty for Tomasella, J., Hodnett, M.G., 1998. Estimating soil water retention characteristics from
mapping continuous depth functions of soil attributes. Geoderma 160 (3–4), limited data in Brazilian Amazonia. Soil Sci. 163 (3), 190–202.
614–626. Tranter, G., et al., 2007. Building and testing conceptual and empirical models for pre-
Malone, B.P., Odgers, N.P., Stockmann, U., Minasny, B., McBratney, A.B., 2018. Digital dicting soil bulk density. Soil Use Manag. 23 (4), 437–443.
mapping of soil classes and continuous soil properties. In: McBratney, A., Minasny, B., Vašát, R., et al., 2017. Combining reflectance spectroscopy and the digital elevation
Stockmann, U. (Eds.), Pedometrics. Progress in Soil Science. Springer, Cham. model for soil oxidizable carbon estimation. Geoderma 303, 133–142.
Mayer, L.M., 1994. Relationships between mineral surfaces and organic carbon con- Wang, B., et al., 2018. High resolution mapping of soil organic carbon stocks using remote
centrations in soils and sediments. Chem. Geol. 114 (3), 347–363. sensing variables in the semi-arid rangelands of eastern Australia. Sci. Total Environ.
McBratney, A.B., Mendonça Santos, M.L., Minasny, B., 2003. On digital soil mapping. 630, 367–378.
Geoderma 117 (1–2), 3–52. Wilford, J., Thomas, M., 2013. Predicting regolith thickness in the complex weathering
Meersmans, J., van Wesemael, B., De Ridder, F.a., Van Molle, M., 2009. Modelling the setting of the central Mt lofty ranges, South Australia. Geoderma 206, 1–13.
three-dimensional spatial distribution of soil organic carbon (SOC) at the regional Yang, R.-M., et al., 2016a. Comparison of boosted regression tree and random forest
scale (Flanders, Belgium). Geoderma 152 (1–2), 43–52. models for mapping topsoil organic carbon concentration in an alpine ecosystem.
Minasny, B., McBratney, A.B., Mendonça-Santos, M.d.L., Odeh, I., Guyon, B., 2006. Ecol. Indic. 60, 870–878.
Prediction and digital mapping of soil carbon storage in the lower Namoi Valley. Soil Yang, R.-M., et al., 2016b. Precise estimation of soil organic carbon stocks in the northeast
Res. 44 (3), 233–244. Tibetan plateau. Sci. Rep. 6, 21842.
Minasny, B., McBratney, A.B., Malone, B.P., Wheeler, I., 2013. Chapter one - digital Zinke, P.J., Stangenberger, A., Post, W., Emanual, W., Olson, J., 1986. Worldwide Organic
mapping of soil carbon. In: Sparks, D.L. (Ed.), Advances in Agronomy. Academic Soil Carbon and Nitrogen Data. Oak Ridge National Lab., TN (United States).
Press, pp. 1–47.
