Submitted By:
JAI DHALL (9915103184)
TARUN KHARE (9915103200)
AKSHAT JAIN (9915103210)
MAY 2019
Submitted in partial fulfilment of the Degree of
Bachelor of Technology
in
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our knowledge
and belief, it contains no material previously published or written by another person nor
material which has been accepted for the award of any other degree or diploma of the university
or other institute of higher learning, except where due acknowledgment has been made in the
text.
Akshat Jain(9915103210)
CERTIFICATE
This is to certify that the work titled “Solar Radiation Prediction” submitted by Akshat Jain,
Jai Dhall, Tarun Khare in partial fulfilment for the award of degree of B Tech of Jaypee
Institute of Information Technology, Noida has been carried out under my supervision. This
work has not been submitted partially or wholly to any other University or Institute for the
award of this or any other degree or diploma.
Signature of Supervisor:
Date :
ACKNOWLEDGEMENT
First and foremost, we would like to thank our guide Dr. Himani Bansal – Assistant Professor
(Senior Grade), Jaypee Institute of Information Technology, Noida for guiding us
thoughtfully and efficiently throughout this project, giving us an opportunity to work at our
own pace along our own lines, while providing us with very useful directions whenever
necessary. We would also like to thank our friends and classmates for being great sources of
motivation and for providing encouragement throughout the length of this project. We offer
our sincere thanks to all other persons who knowingly or unknowingly helped us in this
project.
Signatures of Students
SUMMARY
Solar energy is a clean, renewable energy resource that is freely available across the world.
Existing methods have focused either only on machine learning or only on deep learning models;
here we compare both kinds of model on the same datasets in order to get the best possible
results.
The aim of our work is to predict solar irradiance, which helps identify the areas that receive
the most sunlight. This helps not only in estimating the available solar power but also in
improving the productivity of solar panels. Five forecasting techniques, namely Linear
Regression, Lasso Regression, Decision Tree, Multilayer Perceptron (MLP) and Gradient Boosting
Regression (GBR), have been evaluated by estimating global horizontal irradiance (GHI) for 4
cities in India: Barmer, Jaipur, Jodhpur and Bikaner. The datasets have been taken from the
National Solar Radiation Database (NSRDB), a collection of solar irradiance and meteorological
datasets for international locations that describes the quantity of solar energy available at a
given location. The ML and deep learning models have been evaluated in terms of accuracy and
root mean square error (RMSE). It was observed that the deep learning models performed better
than the machine learning models in terms of both accuracy and RMSE, with GBR giving the
highest accuracy, followed by MLP and then the other machine learning models.
1. INTRODUCTION
Fossil energy sources are being depleted rapidly, so new energy sources are needed to meet the
energy needs of future generations. These sources also need to be more sustainable, like solar
energy, geothermal energy etc. To use an energy source effectively, we should know in advance
how much demand it can fulfil. Solar energy varies daily with the weather and the sun's relative
position, so we must be able to predict how much can be produced on a given day when solar
energy is the main power source. The rest of the demand can be managed by other resources.
Given the effectiveness of machine learning and deep learning models, such predictions could
also help locate the sites best suited for a solar energy project.
This project uses data-driven approaches such as machine learning and deep learning in the
field of climate science to predict how much solar power can be generated during the day.
Currently, modelling solar radiation is an important part of climate simulators, which spend a
majority of their computation cycles estimating solar radiation with complex and costly
calculations. Therefore, instead of the classical and complex mathematical models, the approach
taken here is to use machine learning models such as Linear Regression, Lasso Regression and
Decision Trees, together with MLP and GBR, which this report groups under deep learning models
(GBR itself is an ensemble of decision trees rather than a neural network).
The project compares 5 different models. For each model, accuracy and RMSE values have been
generated.
The tool used to implement this project is Jupyter Notebook, an open-source web-based
application for creating and sharing documents that contain live source code, narrative text
etc. Jupyter Notebook makes the task easier for both the developer and the user. It has been
used here for data pre-processing, statistical modelling and data simulation. The language used
is Python.
1.4 Significance of Problem
Solar energy is a clean, renewable energy resource that is freely available across the world,
but it is still not harnessed at full capacity, which leads to under-utilisation of power
generation through photovoltaic panels. Our work addresses this problem by predicting solar
irradiance with machine learning and deep learning models on the same dataset. The predictions
help identify the areas that receive the most sunlight, which helps not only in estimating the
available solar power but also in improving the productivity of solar panels.
2. LITERATURE SURVEY
It has been observed that, out of various statistical models fitted to solar radiation data,
linear regression has the best performance. Wu Ji et al. [1] discussed various methods for
dealing with unprocessed solar radiation data and models for estimating the time series pattern
of solar radiation. Several simulations were run in order to smooth the data over time, and
they showed that a 3-point moving average was the most efficient smoothing method. They also
applied other machine learning algorithms. One was the artificial neural network (ANN), which
uses many parallel, interconnected nodes whose weights are repeatedly multiplied with the
values in the feature set to produce the predicted solar radiation value. The other was the
Evolutionary Algorithm (EA). Both of these algorithms can be used in combination with linear
regression to further improve the accuracy of the predicted values.

R. Muthukrishnan et al. [2] discussed the regression methods OLS, Ridge and LASSO. Regression
is a supervised machine learning technique, and thus it fits the problem statement well.
Regression models usually deal with two problems: parameter estimation and variable selection.
To address these, a shrinkage approach is used in algorithms like Ridge regression and LASSO
regression. Analysis of these algorithms on both real and simulated datasets showed that LASSO
is comparatively better than both ordinary linear regression and ridge regression, as it can
shrink coefficients all the way to zero. Hence, LASSO can be used as a substitute for the usual
feature selection methods. The traditional regression methods are more prone than LASSO
regression to problems such as low accuracy and over-fitting.

Rathod et al. [3] suggested that atmospheric and geographical factors have a major influence on
the amount of insolation received by the earth; for example, solar radiation increases with
altitude. In plains and plateaus solar radiation is more uniform, whereas in hilly and
mountainous regions it is uneven. Insolation also depends on air quality and local vegetation.
Indian states with high altitude, very little vegetation cover and low rainfall get a greater
share of the sun's radiation. Latitude also matters: Indian states lying closer to the equator
receive a larger share of the radiation. Cloud cover plays a critical role as well. Clouds in
the sky increase the chances of light being absorbed and scattered, while on a clear-sky day
the amount of radiation received can be very high.
Furthermore, the more polluted the air is because of brown clouds and smog, the less insolation
is received. Therefore, while emphasising research on solar radiation and photovoltaics, we
should also consider the region in which a panel is going to be placed so that it can perform
at its maximum efficiency.

Patel et al. [4] discussed how relevant data and information are classified into different
categories subject to certain requirements: the data must be clean and adequate, the problem
must be properly defined, and so on. The decision tree is a tree-based classifier comprising
decision nodes and leaf nodes, with the major objective of making the leaf nodes as homogeneous
as possible. The decision nodes contain the features of the dataset, the leaf nodes hold class
labels and the branches represent the decisions. K-means, on the other hand, forms K clusters
based on the mean values generated: the centroid coordinates are determined, the distance of
each object to the centroids is calculated, and finally each object is put into its respective
cluster. On the basis of the experiments performed, the decision tree (supervised learning) was
observed to be much better in terms of accuracy and complexity than K-means (unsupervised
learning).

Dedgaonkar et al. [5] applied the least squares linear regression method to estimate solar
radiation at a given place based on environmental factors like temperature, rainfall and wind
speed, using the NDC weather forecast data. Solar intensity and day of the year were observed
to be the most correlated, and summers had comparatively higher intensity than winters. Dew
point and temperature were also highly correlated. When either of these parameters is high,
solar intensity tends to be higher, but when either is low, intensity fluctuates between high
and low values. Relative humidity, sky cover, wind speed and amount of cloud, however, were
negatively correlated with solar intensity. To assess solar radiation, the model used was
linear regression: all the parameters of the dataset were multiplied by coefficients to form a
linear equation. Its accuracy was evaluated using the RMSE value, and the accuracy of the model
was found to be 71%.

Verma et al. [6] used linear, logarithmic and polynomial regression models and ANN to forecast
solar generation; ANN gave the minimum error and was an extremely dependable technique. Solar
radiation also depends on several other factors such as temperature, wind speed and humidity.
The dataset covered January 2014 to December 2014. Study of the parameters of the dataset
showed that when any of temperature, cloud cover or relative humidity was higher, the
efficiency of the solar panels decreased.
Higher wind speed, by contrast, reduced the temperature and cleared cloud cover, and thereby
helped to increase efficiency. As time passes, solar panels degrade due to factors like
rainfall and wind, and hence full efficiency cannot be achieved.

Prakash et al. [7] focused on a comparison of the least squares linear regression algorithm and
the Gradient Boosting Regression (GBR) algorithm. Gradient Boosting Regression is a technique
that combines a collection of many models, such as decision trees or neural networks. The model
is built in many stages and is generalised by optimising a loss function; they used 'least
squares' as the loss function. Analysis of the results from both algorithms showed that
Gradient Boosted Regression is more accurate than least squares linear regression when applied
to the solar radiation prediction problem. The results can be further improved by adjusting the
number of boosting stages and the learning rate of the GBR algorithm.

Asradj et al. [8] drew a comparison between empirical models and predictive models based on
neural networks for solar radiation prediction. Several empirical models such as Bahel, Newland
and Abdalla already existed for prediction, but they performed worse than a neural network
model, the multilayer perceptron (MLP). An MLP has more than one layer of perceptrons. The
signals are received by an input layer, the decision about the input is made by the output
layer, and between them lie the hidden layers, which are the true computational engine of the
model. The size of the input layer equals the number of variables (parameters) from which it
receives information, and the output layer consists of 1 neuron that gives the predicted
radiation value. The number of hidden layers and the number of neurons in each are decided
during training. Based on the RMSE values, the predictive model, i.e. MLP, was seen to be far
more accurate than the empirical models.

Shruthi et al. [9] considered datasets for 23 Indian stations to ensure that the data varied
with climate and weather across the stations. These ranged from cold areas like Srinagar to
relatively hotter places like Chennai and Port Blair. Parameters like altitude, longitude,
latitude, minimum temperature, extra-terrestrial radiation, sunshine hours and clearness index
were considered for prediction. The algorithms used in this study were ANN, Radial Basis
Function (RBF), Generalised Regression Neural Network (GRNN) and Multilayer Perceptron (MLP).
To check the performance of these models, RMSE, mean square error (MSE) and mean absolute
percentage error (MAPE) were calculated for each model and compared, and MAPE was seen to be
the most reliable of these performance measures.

The existing research papers focused either only on machine learning or only on deep learning
models, but here we have compared both on the same dataset in order to get the best possible
results.
3. ANALYSIS, DESIGN AND MODELLING
Solar energy is a clean, renewable energy resource that is freely available across the world.
Existing methods have focused either only on machine learning or only on deep learning models;
here we compare both kinds of model on the same datasets in order to get the best possible
results.
The aim of our work is to predict solar irradiance, which helps identify the areas that receive
the most sunlight. This helps not only in estimating the available solar power but also in
improving the productivity of solar panels. Five forecasting techniques, namely Linear
Regression, Lasso Regression, Decision Tree, Multilayer Perceptron and Gradient Boosting
Regression, have been evaluated by estimating global horizontal irradiance (GHI) for 4 cities
in India: Barmer, Jaipur, Jodhpur and Bikaner. The ML and deep learning models have been
evaluated in terms of accuracy and root mean square error (RMSE).
Dataset loading: The project supports loading datasets for multiple cities such as Jaipur,
Barmer and Jodhpur.
Jupyter Notebook support: The outputs and graphs generated in this project can be viewed in
real time in Jupyter Notebook.
Graph representation: The datasets and their corresponding predictions can be represented
pictorially using graphs, providing an interactive view of the data.
This project does not use a formal database like MySQL or MongoDB. Instead, CSV files are used
to store the solar radiation data. Each CSV file contains a table with the following columns:
Year, Month, Day, Hour, Minute, DHI, DNI, GHI, Dew point, Temperature, Pressure,
Relative Humidity, Wind speed and Wind direction
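As an illustration of the file layout described above, the following is a minimal sketch of reading such a CSV with Python's standard `csv` module. The exact header spellings and the two sample rows are assumptions made for the example (the values are invented, not taken from the NSRDB files), and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# Two invented rows in the 14-column layout listed above.
# Header spellings are an assumption; the real files may differ.
SAMPLE_CSV = """Year,Month,Day,Hour,Minute,DHI,DNI,GHI,Dew Point,Temperature,Pressure,Relative Humidity,Wind Speed,Wind Direction
2014,1,1,5,30,0,0,0,4.0,11.0,980,62.1,1.2,40
2014,1,1,12,30,95,810,640,6.0,24.0,978,31.5,2.4,65
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows with", len(rows[0]), "columns each")
print("GHI at hour", rows[1]["Hour"], "is", rows[1]["GHI"])
```

In practice the same function could be given the contents of a city's CSV file instead of the in-memory sample.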
Dataset Description: The dataset has been taken from the National Solar Radiation Database
(NSRDB), which is a serially complete collection of meteorological and solar irradiance
datasets for international locations.
The pre-processed Jaipur dataset consists of 23,725 rows. We have used 17,793 rows for
training and 5,932 rows for testing.
The pre-processed Barmer dataset consists of 47,450 rows. We have used 35,587 rows for
training and 11,863 rows for testing.
The pre-processed Jodhpur dataset consists of 23,725 rows. We have used 17,793 rows for
training and 5,932 rows for testing.
The pre-processed Bikaner dataset consists of 23,725 rows. We have used 17,793 rows for
training and 5,932 rows for testing.
Data Preprocessing: Removing the null values, dropping the unnecessary columns, identifying
the hours of solar radiation and finally cleaning the data.
Applying ML and DL models: We use Linear regression, Lasso regression, Decision tree, MLP and
GBR based models for our problem statement.
Identifying the results: Comparing the accuracy of all the implemented ML and DL models and
finding the one best suited for solar radiation prediction.
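The comparison step relies on two metrics, accuracy and RMSE. The sketch below shows how they could be computed in plain Python. The report does not define how its accuracy figure is obtained, so treating it as the coefficient of determination (R squared) expressed as a percentage is an assumption of this example; the function names and sample values are likewise illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r2_percent(actual, predicted):
    """R squared as a percentage (assumed stand-in for the report's 'accuracy')."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 100.0 * (1.0 - ss_res / ss_tot)

# Invented GHI values, for illustration only.
actual = [0.0, 200.0, 400.0, 600.0]
predicted = [50.0, 150.0, 450.0, 550.0]

print("RMSE:", rmse(actual, predicted))
print("R^2 (%):", r2_percent(actual, predicted))
```

A lower RMSE and a higher R squared both indicate predictions closer to the measured GHI.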
4. IMPLEMENTATION DETAILS AND ISSUES
Fig 3: Radiation vs Hour for Barmer
Fig 8: Bikaner dataset graph
Unavailability of solar radiation before 6 AM and after 6 PM: Since solar radiation is
not available during these intervals, the dataset contains unnecessary rows which
we have to drop before performing predictions.
Speed: Since the datasets have a large number of rows, it takes some time to
clean the data and train on it.
Unneeded columns in dataset: Several columns such as DHI, DNI and Wind direction are
not required for making predictions, so we have to drop these columns.
The datasets have been prepared as comma-separated values (CSV) files. These files contain
data recorded at an interval of one hour. Each line contains 14 columns with the
corresponding values of year, month, day, hour, minute, DHI, DNI, GHI, dew point, temperature,
pressure, relative humidity, wind speed and wind direction for each of the four Indian cities,
namely Barmer, Jaipur, Jodhpur and Bikaner.
For any model to be able to use this data, we need to do some pre-processing.
Data Pre-processing:
Different pre-processing approaches have been applied to different models based on their
input requirements. The steps are described briefly below.
1) Visualising all the datasets graphically, which showed that solar radiation data is only
available from 6 AM to 6 PM and no data is present outside these hours.
2) Analysing the data against the target value (GHI), which showed that not all the columns
were required to obtain the result, so the unnecessary columns could be dropped.
3) Converting all the values to a common data type so that there are no discrepancies in the
final result.
4) After these steps, dividing the relevant data into training and testing sets for further
evaluation.
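The pre-processing steps listed above could be sketched in plain Python as follows. This is only an illustration under stated assumptions: rows are dicts of strings as read from a CSV, the column names are hypothetical spellings, the 6 AM to 6 PM window is treated as inclusive at both ends, and `preprocess` is an invented helper rather than the project's own code.

```python
# Columns the report says are not needed for prediction (names assumed).
DROP_COLUMNS = {"DHI", "DNI", "Wind Direction"}

def preprocess(rows, first_hour=6, last_hour=18):
    """Drop incomplete rows, keep daylight hours, remove unused columns, cast to float."""
    cleaned = []
    for row in rows:
        # Skip rows that contain a missing value.
        if any(value is None or value == "" for value in row.values()):
            continue
        # Keep only the hours in which radiation data is present (inclusive bounds assumed).
        if not first_hour <= int(row["Hour"]) <= last_hour:
            continue
        # Drop unused columns and convert the remaining values to one data type.
        cleaned.append({k: float(v) for k, v in row.items() if k not in DROP_COLUMNS})
    return cleaned

# Invented rows, for illustration only.
raw = [
    {"Hour": "5", "GHI": "0", "DHI": "0", "DNI": "0", "Temperature": "11.0", "Wind Direction": "40"},
    {"Hour": "12", "GHI": "640", "DHI": "95", "DNI": "810", "Temperature": "24.0", "Wind Direction": "65"},
    {"Hour": "13", "GHI": "", "DHI": "90", "DNI": "800", "Temperature": "25.0", "Wind Direction": "70"},
]

clean = preprocess(raw)
print(clean)
```

Only the noon row survives here: the 5 AM row falls outside the daylight window and the 1 PM row has a missing GHI value.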
The datasets were split into two parts: the data used to train our models and the data used to
test them. We split our datasets such that 70% of the data was used for training while 30% of
the data was used for testing. The row counts listed below, however, correspond to roughly 75%
of each dataset for training and 25% for testing.
Barmer dataset: Training 35,587 rows, Testing 11,863 rows
Jaipur dataset: Training 17,793 rows, Testing 5,932 rows
Jodhpur dataset: Training 17,793 rows, Testing 5,932 rows
Bikaner dataset: Training 17,793 rows, Testing 5,932 rows
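The reported row counts can be checked with a few lines of arithmetic, and a simple shuffled split can be sketched alongside. The counts below are the ones given in this report; `split_rows`, its seed and the use of a random shuffle are assumptions for illustration, since the report does not say how the rows were assigned.

```python
import random

# Training and testing row counts as reported for each city.
reported = {
    "Barmer": (35587, 11863),
    "Jaipur": (17793, 5932),
    "Jodhpur": (17793, 5932),
    "Bikaner": (17793, 5932),
}

def train_fraction(train_rows, test_rows):
    """Share of all rows that ended up in the training set."""
    return train_rows / (train_rows + test_rows)

def split_rows(rows, train_share, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

for city, (train_rows, test_rows) in reported.items():
    share = train_fraction(train_rows, test_rows)
    print(f"{city}: {train_rows + test_rows} rows, {share:.1%} used for training")

# Splitting 23,725 placeholder rows with a 0.75 training share.
train, test = split_rows(range(23725), 0.75)
print(len(train), len(test))
```

With a training share of 0.75 the sketch yields the same 17,793 and 5,932 row counts that the report lists for the three smaller datasets.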
4.1.3 Algorithms
The following is a brief description of the algorithms used in this project.
A. Linear regression
Suppose that (g1, h1), (g2, h2), ..., (gn, hn) are realisations of the random variable pairs
(G1, H1), (G2, H2), ..., (Gn, Hn). Linear regression assumes that the mean of H is a
straight-line function of g. This can be written as:
\[ E(H_i) = \beta_0 + \beta_1 g_i \]
where E(Hi) represents the mean (expected) value, the subscript i denotes the (hypothetical)
ith unit in the population, β0 and β1 are the regression coefficients and gi is the
independent feature.
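For a single feature, the coefficients β0 and β1 of the straight-line model can be estimated by ordinary least squares in closed form. The following is a small sketch of that calculation; `fit_line` and the sample points are illustrative and not taken from the project.

```python
def fit_line(g, h):
    """Ordinary least squares for one feature: returns (beta0, beta1)."""
    n = len(g)
    g_mean = sum(g) / n
    h_mean = sum(h) / n
    # Slope = covariance of g and h divided by variance of g.
    s_gh = sum((gi - g_mean) * (hi - h_mean) for gi, hi in zip(g, h))
    s_gg = sum((gi - g_mean) ** 2 for gi in g)
    beta1 = s_gh / s_gg
    beta0 = h_mean - beta1 * g_mean
    return beta0, beta1

# Points generated from h = 1 + 2 g, so the fit should recover those coefficients.
g = [1.0, 2.0, 3.0, 4.0]
h = [3.0, 5.0, 7.0, 9.0]

beta0, beta1 = fit_line(g, h)
print("beta0 =", beta0, "beta1 =", beta1)
```

With several features the same idea generalises to solving the normal equations, which is what library implementations of linear regression do internally.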
B. Lasso Regression
Lasso regression is an extension of linear regression that uses the concept of shrinkage.
Shrinkage pulls estimates towards a central point; in lasso regression the regression
coefficients are shrunk towards zero, and some can become exactly zero. This procedure
encourages simple and sparse models. It is well suited to models with high levels of
multicollinearity, or when parts of model selection, such as variable selection, are to be
automated. The objective of the algorithm is to minimise:
\[ \sum_{i} \Big( Y_i - \sum_{j} X_{ij} \beta_j \Big)^2 + \lambda \sum_{j} |\beta_j| \]
where λ is the tuning parameter, βj are the regression coefficients, Xij represents the
features and Yi represents the labels.
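One common way to minimise a lasso objective of this kind is coordinate descent with soft-thresholding. The sketch below assumes the objective is the sum of squared errors plus λ times the sum of absolute coefficients, with no intercept term; the function names and the tiny dataset are invented for illustration. It shows how a weak coefficient is driven exactly to zero as λ grows.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly 0.0 inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_fit(X, y, lam, n_iter=100):
    """Coordinate descent for sum((y - X beta)^2) + lam * sum(|beta|)."""
    n_features = len(X[0])
    beta = [0.0] * n_features
    for _ in range(n_iter):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for row, target in zip(X, y):
                partial = sum(row[k] * beta[k] for k in range(n_features) if k != j)
                rho += row[j] * (target - partial)
                z += row[j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta

# Invented data: y = 3 * x1 + 0.1 * x2, with orthogonal features.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.1, 2.9, -2.9, -3.1]

beta_plain = lasso_fit(X, y, lam=0.0)   # no penalty: ordinary least squares
beta_lasso = lasso_fit(X, y, lam=2.0)   # penalty removes the weak second feature
print(beta_plain, beta_lasso)
```

This is the sense in which lasso can stand in for a separate feature selection step: features whose coefficients reach zero drop out of the model.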
C. Decision Tree
The decision tree is a tree-based model comprising decision nodes and leaf nodes. The decision
nodes contain the features of the dataset, the branches represent the decisions, and the leaf
nodes hold the outputs: class labels in classification, or predicted values in a regression
problem such as this one. A top-down approach is followed.
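To make the idea of decision nodes and leaf nodes concrete for regression, the sketch below builds the smallest possible tree: a single decision node on one feature with two leaves, each predicting the mean target of its side. Choosing the split by minimum squared error is a standard criterion but an assumption here, as are the function names and the invented hour and GHI values.

```python
def mean(values):
    return sum(values) / len(values)

def squared_error(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) with the lowest total squared error."""
    points = sorted(set(x))
    if len(points) < 2:
        raise ValueError("need at least two distinct feature values to split")
    best = None
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        error = squared_error(left) + squared_error(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    return best[1:]

def predict(tree, xi):
    """Follow the single decision node down to a leaf value."""
    threshold, left_mean, right_mean = tree
    return left_mean if xi <= threshold else right_mean

# Invented data: hour of day against GHI.
hours = [6, 8, 10, 12]
ghi = [50.0, 100.0, 600.0, 650.0]

tree = best_split(hours, ghi)
print(tree, predict(tree, 7), predict(tree, 11))
```

A full decision tree repeats this split search recursively on each side until a stopping rule, such as a maximum depth, is reached.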
D. Multilayer Perceptron
A multilayer perceptron is a well-known deep learning model and a type of artificial neural
network. It has more than one layer of perceptrons. The signals are received by an input layer,
the decision about the input is made by the output layer, and between them lie one or more
hidden layers, which are the true computational engine of the model. MLPs with at least one
hidden layer can approximate any continuous mathematical function. MLPs are mostly applied to
supervised learning problems: they are trained on a set of input-output pairs and learn to
model the correlation (or dependencies) between those inputs and outputs. Training involves
adjusting the parameters, i.e. the weights and biases, to minimise the error. Backpropagation
is used to compute the gradient of the error with respect to the weights and biases, and the
error itself can be measured in a variety of ways, including root mean squared error (RMSE).
The output of a single neuron is:
\[ y = \varphi(w^{T} x + b) \]
where w denotes the vector of weights, x is the vector of inputs, b is the bias and φ (phi) is
the non-linear activation function.
Many such basic neurons are combined to form a multilayer perceptron model.
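The way basic neurons combine into a multilayer perceptron can be sketched with a forward pass through one hidden layer. In this illustration the sigmoid activation, the linear output neuron and all weights are assumptions chosen so that the arithmetic is easy to follow; they are not the configuration used in the project.

```python
import math

def sigmoid(z):
    """A common non-linear activation function (assumed here)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs, bias, activation=sigmoid):
    """Single neuron: activation of the weighted sum of inputs plus bias."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mlp_forward(inputs, hidden_layer, output_neuron):
    """Forward pass: inputs -> hidden neurons -> one linear output neuron."""
    hidden = [neuron(w, inputs, b) for w, b in hidden_layer]
    out_weights, out_bias = output_neuron
    # A linear output is a usual choice for regression targets such as GHI.
    return neuron(out_weights, hidden, out_bias, activation=lambda z: z)

# Invented weights: both hidden neurons see a weighted sum of 0, so each outputs 0.5.
inputs = [1.0, 1.0]
hidden_layer = [([0.5, -0.5], 0.0), ([1.0, 1.0], -2.0)]
output_neuron = ([2.0, 4.0], 1.0)

prediction = mlp_forward(inputs, hidden_layer, output_neuron)
print(prediction)
```

Training would adjust these weights and biases by backpropagation; the sketch covers only the prediction step.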
E. GBR
Gradient Boosting Regression is a regression technique that combines multiple weak ML models
(usually decision trees) into an ensemble to form a prediction model. It identifies patterns in
the residuals of the current model and uses them to strengthen the model so that it makes
better predictions. This is repeated many times; once no further patterns can be found in the
residuals, the iterations are stopped (ideally to prevent over-fitting). This keeps the loss
function low on the test data. A Gradient Boosting Model has three components:
1. Loss function: The function that is minimised in order to get the best predictions. It must
be differentiable.
2. Weak learner: In GBR, the role of weak learners is usually played by decision trees. These
are deliberately constrained so that they do not become too strong.
3. Additive model: Weak decision trees are added one at a time, with gradient descent used to
minimise the loss as each tree is added, giving a single strong predictive model.
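The three components above can be illustrated with a deliberately small boosting loop. This sketch assumes a squared-error loss (so each stage fits the residuals), uses one-split trees as the weak learners, and picks an arbitrary learning rate and stage count; the names and the invented hour and GHI values are for illustration only.

```python
import math

def mean(values):
    return sum(values) / len(values)

def fit_stump(x, targets):
    """Weak learner: one split on x, predicting the mean target on each side."""
    best = None
    points = sorted(set(x))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def boost(x, y, n_stages=50, learning_rate=0.1):
    """Additive model: start from the mean, then add stumps fitted to residuals."""
    base = mean(y)
    current = [base] * len(y)
    stumps = []
    for _ in range(n_stages):
        # For squared-error loss the negative gradient is the residual.
        residuals = [yi - ci for yi, ci in zip(y, current)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        current = [ci + learning_rate * stump_predict(stump, xi)
                   for xi, ci in zip(x, current)]
    return base, learning_rate, stumps

def boosted_predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Invented data: hour of day against GHI.
hours = [6, 8, 10, 12, 14, 16, 18]
ghi = [40.0, 300.0, 620.0, 780.0, 600.0, 280.0, 30.0]

model = boost(hours, ghi)
baseline_rmse = rmse(ghi, [mean(ghi)] * len(ghi))
boosted_rmse = rmse(ghi, [boosted_predict(model, h) for h in hours])
print("baseline RMSE:", baseline_rmse, "boosted RMSE:", boosted_rmse)
```

The number of stages and the learning rate are the two settings that the literature survey mentions as the main levers for tuning GBR.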
4.2 Risk Analysis and Mitigation
8 | Personnel Related | Incompetent Skills | Time | High | High
9 | Personnel Related | Irregularity | Time | Medium | High
Table 3: Mitigation
5. TESTING
5.2 Component decomposition and type of testing required
5.5 Limitations of the above Solution
Although the solution developed performs with decent accuracy on our dataset, it cannot handle
other types of datasets and depends only on datasets provided by NSRDB. Also, the simpler
regression-based algorithms, Linear and Lasso Regression, do not perform up to the mark.
6. FINDINGS AND CONCLUSIONS
6.1 Findings
Table-8 shows the output for the Barmer dataset: linear regression gives an accuracy of 52.53
and an RMSE of 211, lasso regression an accuracy of 52.52 and an RMSE of 211, decision tree an
accuracy of 87.18 and an RMSE of 109, MLP an accuracy of 93.04 and an RMSE of 109, and GBR an
accuracy of 93.74 and an RMSE of 76.
Ml model Accuracy RMSE
Linear regression 48.491051596233425 218.92277104024262
Table-9 shows the output for the Jaipur dataset: linear regression gives an accuracy of 48.49
and an RMSE of 218, lasso regression an accuracy of 42.49 and an RMSE of 218, decision tree an
accuracy of 85 and an RMSE of 114, MLP an accuracy of 91.93 and an RMSE of 114, and GBR an
accuracy of 93.01 and an RMSE of 80.
Ml model Accuracy RMSE
Linear regression 48.7558991918188 220.10321247698928
Table-10 shows the output for the Bikaner dataset: linear regression gives an accuracy of 48.75
and an RMSE of 220, lasso regression an accuracy of 48.75 and an RMSE of 220, decision tree an
accuracy of 87.61 and an RMSE of 108, MLP an accuracy of 91.76 and an RMSE of 108, and GBR an
accuracy of 93.04 and an RMSE of 109.
Ml model Accuracy RMSE
Linear regression 49.272618251221706 219.04443890408334
Lasso regression 49.27261520006399 219.04444549164114
Decision tree 85.20957001362939 118.27730168323379
MLP 90.8875000982763 118.27730168323379
GBR 92.1356259856941 86.24681935952573
Table-11 shows the output for the Jodhpur dataset: linear regression gives an accuracy of 49.27
and an RMSE of 219, lasso regression an accuracy of 49.27 and an RMSE of 219, decision tree an
accuracy of 85.20 and an RMSE of 118, MLP an accuracy of 90.88 and an RMSE of 118, and GBR an
accuracy of 92.13 and an RMSE of 86.
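The comparison for a single city can be reproduced from the tabulated values with a short script. The figures below are the Jodhpur results from the table above, rounded to two decimals; the variable names are illustrative.

```python
# Jodhpur results as (accuracy, RMSE), rounded to two decimals from the table.
jodhpur = {
    "Linear regression": (49.27, 219.04),
    "Lasso regression": (49.27, 219.04),
    "Decision tree": (85.21, 118.28),
    "MLP": (90.89, 118.28),
    "GBR": (92.14, 86.25),
}

# Rank models from highest to lowest accuracy.
by_accuracy = sorted(jodhpur, key=lambda name: jodhpur[name][0], reverse=True)
# Pick the model with the lowest RMSE.
lowest_rmse = min(jodhpur, key=lambda name: jodhpur[name][1])

for name in by_accuracy:
    accuracy, error = jodhpur[name]
    print(f"{name:18s} accuracy {accuracy:6.2f}  RMSE {error:7.2f}")
print("Lowest RMSE:", lowest_rmse)
```

The same ranking could be repeated for the other cities once their full tables are entered in the same form.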
6.2 Conclusion
Accurate solar energy predictions are needed in order to choose sites for solar parks and to
build efficient solar systems. This work therefore compares multiple machine learning and deep
learning algorithms to get the best possible predictions. Out of all the models applied, GBR
gave the highest accuracy on the test data for all four cities, and the lowest RMSE for three
of them, followed by MLP. Among the machine learning models, the decision tree was found to be
the best, while the two regression algorithms, linear regression and Lasso regression, gave
almost identical results. Linear regression assumes that a linear relationship exists between
the dependent variable and the independent variables; this can be incorrect when the variables
have a curved relationship. Overall, we observed that the deep learning models fared better
than the machine learning models.
A lot of research has already been done in the area of solar radiation prediction, and it has
led to many applications. Significant results have been obtained from this work, on the basis
of which this research can be taken to real-world application for solar radiation prediction.
We look forward to implementing more efficient and advanced algorithms in the hope of getting
better accuracy, and to requesting more datasets from the concerned organisations, which would
help us learn more about solar radiation patterns in different parts of the country.
REFERENCES
3. A. P. S. Rathod, P. Mittal and B. Kumar, "Analysis of factors affecting the solar radiation
received by any region," 2016 International Conference on Emerging Trends in
Communication Technologies (ETCT), Dehradun, 2016, pp. 1-4.
4. B. N. Patel, S. G. Prajapati and K. I. Lakhtaria, "Efficient Classification of Data Using
Decision Tree," Bonfring International Journal of Data Mining, vol. 2, no. 1, 2012,
pp. 06-12.
5. S. Dedgaonkar, V. Patil, N. Rathod, G. Hakare and J. Bhosale, "Solar Energy Prediction
using Least Square Linear Regression Method," International Journal of Current Engineering
and Technology, vol. 6, no. 5, Oct 2016.
7. A. Prakash and S. K. Singh, "Towards an efficient regression model for solar energy
prediction," 2014 Innovative Applications of Computational Intelligence on Power, Energy
and Controls with their impact on Humanity (CIPECH), Ghaziabad, 2014, pp. 18-23.
8. Z. Asradj and R. Alkama, "Prediction of solar radiation in Bejaia city using neurnal
network (MLP)," 2014 International Conference on Electrical Sciences and Technologies in
Maghreb (CISTEM), Tunis, 2014, pp. 1-4.
9. D. Shruthi, M. S. P. Subathra, J. Kumari and K. Raimond, "Artificial neural network based
prediction of monthly global solar radiation in Indian stations," 2017 International
Conference on Signal Processing and Communication (ICSPC), Coimbatore, 2017, pp. 410-414.