Solar Radiation Prediction: Dr. Himani Bansal

SOLAR RADIATION PREDICTION
Submitted By:
JAI DHALL (9915103184)
TARUN KHARE (9915103200)
AKSHAT JAIN (9915103210)
Under The Supervision Of:

DR. HIMANI BANSAL
MAY 2019
Submitted in partial fulfilment of the Degree of
Bachelor of Technology
in
Computer Science Engineering
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING &

INFORMATION TECHNOLOGY
JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY, NOIDA

TABLE OF CONTENTS
Chapter No Topics Page No.
Student Declaration II
Certificate from the Supervisor III
Acknowledgement IV
Summary V
List of Figures VI
List of Tables VII
List of Symbols and Acronyms VIII
Chapter-1 Introduction 1-2
1.1 General Introduction 1
1.2 Problem Statement 1
1.3 Approach to problem in terms of technology/platform to be used 1
1.4 Support for Novelty/ significance of problem 2
Chapter-2 Literature Survey 3-10
2.1 Summary of papers 3
2.2 Integrated summary of the literature studied 6
Chapter 3: Analysis, Design and Modeling 11-13
3.1 Overall description of the project 11
3.2 Functional requirements 11
3.3 Non Functional requirements 11

3.4 Logical database requirements 12
3.5 Design Diagrams 12
3.6 Component Description 12
Chapter-4 Implementation details and issues 14-23
4.1 Implementation details and issues 14
4.1.1 Implementation Issues 17
4.1.2 Implementation Details 18
4.1.3 Algorithms 20
4.2 Risk Analysis and Mitigation 23
Chapter-5 Testing (Focus on Quality of Robustness and Testing) 24-26
5.1 Testing Plan 24
5.2 Component decomposition and type of testing required 25
5.3 List all test cases 25
5.4 Error and Exception Handling 25
5.5 Limitations of the solution 26
Chapter-6 Findings & Conclusion 27-31
6.1 Findings 27
6.2 Conclusion 30
6.3 Future Work 31
References 32-33
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our knowledge
and belief, it contains no material previously published or written by another person nor
material which has been accepted for the award of any other degree or diploma of the university
or other institute of higher learning, except where due acknowledgment has been made in the
text.
Place: Noida Jai Dhall(9915103184)
Date: 6 May 2019 Tarun Khare(9915103200)
Akshat Jain(9915103210)
CERTIFICATE
This is to certify that the work titled “Solar Radiation Prediction” submitted by Akshat Jain,
Jai Dhall, Tarun Khare in partial fulfilment for the award of degree of B Tech of Jaypee
Institute of Information Technology, Noida has been carried out under my supervision. This
work has not been submitted partially or wholly to any other University or Institute for the
award of this or any other degree or diploma.
Signature of Supervisor:
Name of Supervisor : Dr Himani Bansal
Designation : Assistant Professor (Senior Grade)
Date :
ACKNOWLEDGEMENT
First and foremost, we would like to thank our guide Dr. Himani Bansal – Assistant Professor
(Senior Grade), Jaypee Institute of Information Technology, Noida for guiding us
thoughtfully and efficiently throughout this project, giving us an opportunity to work at our
own pace along our own lines, while providing us with very useful directions whenever
necessary. We would also like to thank our friends and classmates for being great sources of
motivation and for providing encouragement throughout the length of this project. We offer
our sincere thanks to all other persons who knowingly or unknowingly helped us in this
project.
Signatures of Students
Jai Dhall (9915103184)
Tarun Khare (9915103200)
Akshat Jain (9915103210)
SUMMARY
Solar energy is among the cleanest renewable energy resources and is freely available across
the world. Existing methods have focused either only on machine learning models or only on
deep learning models; here we compare both kinds of model on the same dataset in order to
get the best possible results.
The aim of our work is to predict solar irradiation, which helps identify the areas that
receive maximum sunlight. This helps not only in estimating solar radiation but also in
improving the productivity of solar panels. Five forecasting techniques, namely Linear
Regression, Lasso Regression, Decision Tree, Multilayer Perceptron (MLP) and Gradient
Boosting Regression (GBR), have been evaluated by estimating global horizontal irradiance
(GHI) across 4 cities in India: Barmer, Jaipur, Jodhpur and Bikaner. The datasets have been
taken from the National Solar Radiation Database (NSRDB), a collection of solar irradiance
and meteorological datasets for international locations that describes the quantity of solar
energy available at a given location. The machine learning and deep learning models have
been evaluated in terms of accuracy and root mean square error (RMSE). It was observed that
the deep learning models performed better than the machine learning models in terms of both
accuracy and RMSE, with GBR giving the highest accuracy, followed by MLP and then the other
machine learning models.
LIST OF FIGURES
S No. Title Page No

1 Architecture Diagram 12
2 Barmer Dataset Graph 14
3 Radiation vs Hour for Barmer 14
4 Jaipur Dataset Graph 15
5 Radiation vs Hour for Jaipur 15
6 Jodhpur Dataset Graph 16
7 Radiation vs Hour for Jodhpur 16
8 Bikaner Dataset Graph 17
9 Radiation vs Hour for Bikaner 17
10 Splitting of Dataset 19
11 Basic Neuron 21
12 Multilayer perceptron model 22
13 RMSE and Accuracy Graph for Barmer 27
14 RMSE and Accuracy Graph for Jaipur 28
15 RMSE and Accuracy Graph for Bikaner 29
16 RMSE and Accuracy Graph for Jodhpur 30
LIST OF TABLES
S No. Table Name Page

1 Literature Summary 10
2 Risk Analysis 23
3 Mitigation 23
4 Type of Testing 24
5 Component Decomposition and Identification of Tests Required 25
6 Test Cases for Components 25
7 Type of Debugging Technique used 25
8 Results for Barmer Dataset 27
9 Results for Jaipur Dataset 28
10 Results for Bikaner Dataset 29
11 Results for Jodhpur Dataset 30
LIST OF ACRONYMS
S No. Acronym Expansion

1 GHI Global Horizontal Irradiance
2 NSRDB National Solar Radiation DataBase
3 MLP Multilayer Perceptron
4 GBR Gradient Boosting Regression
5 RMSE Root Mean Square Error
6 DHI Diffuse Horizontal Irradiance
7 DNI Direct Normal Irradiance
1. INTRODUCTION
1.1 General Introduction
Fossil energy sources are being depleted rapidly, so new energy sources are needed to meet the
energy needs of future generations. These sources also need to be sustainable, like solar
energy and geothermal energy. To use an energy source effectively, we should know in advance
how much demand it can fulfil. Solar energy varies daily with the weather and the sun’s
relative position, so we must be able to predict how much energy can be produced on a given
day when solar energy is the main power source. The rest of the demand can be met by other
resources. Given the effectiveness of machine learning and deep learning models, it becomes
possible to identify the locations best suited for starting a solar energy project.
1.2 Problem Statement
This project uses data-driven approaches, namely machine learning and deep learning, in the
field of climate science to predict how much solar power can be generated during the day.
Modelling solar radiation is currently an important part of climate simulators, which spend a
majority of their computation cycles estimating solar radiation with complex and costly
calculations. Therefore, instead of the classical, complex mathematical models, the approach
taken here is to use machine learning models such as Linear Regression, Lasso Regression and
Decision Trees, and deep learning models such as MLP and GBR.
1.3 Approach to the problem in terms of technology/Platform used
The project compares 5 different models. For each model, the accuracy and the RMSE value have
been generated.
The tool used for implementation of this project is Jupyter Notebook, an open-source web-based
application used to create and share documents containing live source code, narrative text etc.
Jupyter Notebook makes the task easier at both ends, i.e. for the developer as well as the
user. Hence the application has been used for data pre-processing, statistical modelling and
data simulation. The language used is Python.
1.4 Significance of Problem
Sun’s energy is a clean renewable energy resource which is freely available across the world,
but it is still not being harnessed at full capacity, leading to under-utilisation of power
generation through photovoltaic panels. The significance of our work is to address this problem
by predicting solar irradiation with the help of machine learning and deep learning models on
the same dataset. This would help us identify the areas that receive maximum sunlight, which
would help not only in estimating solar radiation but also in improving the productivity of
solar panels.
2. LITERATURE SURVEY
2.1 Summary of the papers studied
It has been observed that, out of various statistical models fitted to solar radiation data,
linear regression has the best performance. Wu Ji et al. [1] discussed various methods for
dealing with unprocessed solar radiation data and models to estimate the time series pattern
of solar radiation. Several simulations were run to smooth the data over time, and they showed
that a 3-point moving average was the most efficient way of smoothing the data. They also
applied other machine learning algorithms. One was the Artificial Neural Network (ANN), which
uses many parallel, interconnected nodes with different weights that are multiplied recursively
with the values in the feature set to find the predicted solar radiation value. The other was
the Evolutionary Algorithm (EA). Both of these algorithms can be used in combination with
linear regression to further improve the accuracy of the predicted values.

R. Muthukrishnan et al. [2] discussed three regression methods: OLS, Ridge and LASSO
regression. Regression is a supervised machine learning technique and thus fits well within
the problem statement. Regression models usually deal with two problems: parameter estimation
and variable selection. To solve these problems, a shrinkage approach is used in algorithms
like Ridge regression and LASSO regression. On analysing these algorithms over both real and
simulated datasets, LASSO was found to be comparatively better than both ordinary linear
regression and ridge regression, as it can shrink coefficients completely to zero. Hence,
LASSO can be used as a substitute for the usual feature selection methods. The traditional
regression methods are more prone than LASSO regression to problems such as low accuracy and
over-fitting.

Rathod et al. [3] suggested that atmospheric and geographical factors have a major influence
on the insolation received by the earth; for example, solar radiation increases with altitude.
In plains and plateaus solar radiation is more uniform, whereas in hilly and mountainous
regions it is uneven. Insolation also depends on air quality and local vegetation. Indian
states with high altitude, little vegetation cover and low rainfall receive a greater share of
the sun’s radiation. Latitude also matters: the Indian states lying closer to the equator
receive a larger share of the radiation. Cloud cover also plays a critical role in the
radiation collected. If
there are a few clouds in the sky, this would increase the chances of light getting absorbed as
well as scattered, while on a clear-sky day the amount of radiation received could be very high.
Furthermore, the more polluted the air is because of brown clouds and smog, the lower the
insolation received. Therefore, while emphasizing research on solar radiation and
photovoltaics, we should also consider the regions in which panels are going to be placed so
that they can perform at maximum efficiency.

Patel et al. [4] discussed how relevant data and information need to be classified into
different categories according to certain rules: the data must be clean and adequate, the
problem must be properly defined, and so on. The decision tree is a tree-based classifier
comprising decision nodes and leaf nodes, the major objective being to make the leaf nodes as
homogeneous as possible. The decision nodes contain the features of the dataset, the leaf
nodes contain the class labels and the branches contain the decisions. K-means, on the other
hand, forms K clusters based on the mean values generated: the centroid coordinates are
determined, the distance of each object is calculated and finally the object is put into the
respective cluster. On the basis of the experiments performed, it was observed that the
decision tree (supervised learning) is much better in terms of accuracy and complexity than
K-means (unsupervised learning).

Dedgaonkar et al. [5] applied the least squares linear regression method to estimate solar
radiation at a given place based on environmental factors like temperature, rainfall and wind
speed, using the NDC weather forecast data. It was observed that solar intensity and the day
of the year are the most correlated, and that summers had comparatively more intensity than
winters. It was also noted that dew point and temperature are highly correlated. If either of
these parameters is high, solar intensity tends to be higher; if either is low, intensity
fluctuates between high and low values. However, relative humidity, sky cover, wind speed and
amount of clouds were negatively correlated with solar intensity. Linear regression was the
model used to assess solar radiation: all the parameters of the dataset were multiplied by
coefficients to form a linear equation. Using the RMSE value, the accuracy of the model was
found to be 71%.

Verma et al. [6] used linear, logarithmic and polynomial regression models and an ANN to
forecast solar generation, and found that the ANN gave the minimum error and was an extremely
dependable technique. Solar radiation also depends on several other factors such as
temperature, wind speed and humidity. The dataset covered January 2014 to December 2014.
While studying different parameters of the dataset, it was found that if any of temperature,
cloud cover and relative humidity was higher, the efficiency of the solar panels decreased.
By contrast, higher wind speed reduced the temperature and cleared cloud cover, thereby
helping to increase efficiency. As time passes, solar panels start to degrade due to factors
like rainfall and wind, and hence full efficiency cannot be achieved.

Prakash et al. [7] focused on the comparison of the least squares linear regression algorithm
and the Gradient Boosting Regression (GBR) algorithm. GBR is a technique which uses a
collection of many models, such as decision trees or neural networks. The model is built in
many stages and is generalized by optimizing a loss function; they used ‘least squares’ as the
loss function. On analysing the results obtained using both algorithms, it was observed that
GBR is more accurate than least squares linear regression when applied to the solar radiation
prediction problem. The results can be further improved by adjusting the boosting stage
parameter and the learning rate of the GBR algorithm.

Asradj et al. [8] compared empirical models with predictive models based on neural networks
for solar radiation prediction. Several empirical models such as Bahel, Newland and Abdalla
already existed for prediction, but they performed worse than a neural network model, the
Multilayer Perceptron (MLP). An MLP has more than one layer of perceptrons. Signals are
received by an input layer, the decision about the input is made by the output layer, and
between the two lie the hidden layers, which are the true computational engine of the model.
The size of the input layer equals the number of variables (parameters), and the output layer
consists of 1 neuron that gives the predicted radiation value. The number of hidden layers and
the number of neurons in them are decided during training. Based on the RMSE values, the
predictive model (MLP) was seen to perform far better than the empirical models.

Shruthi et al. [9] considered datasets for 23 Indian states so that the data varies with
climate and weather across the states. These ranged from cold areas like Srinagar to
relatively hotter places like Chennai and Port Blair. Parameters such as altitude, longitude,
latitude, minimum temperature, extra-terrestrial radiation, sunshine hours and clearness index
were considered for prediction. The algorithms used in this study were ANN, Radial Basis
Function (RBF), Generalised Regression Neural Network (GRNN) and Multilayer Perceptron (MLP).
To check the performance of these models, RMSE, mean square error (MSE) and mean absolute
percentage error (MAPE) were calculated for each model and compared, out of which it was seen
that MAPE was most reliable in
comparison to the other performance measures.

The existing research papers focused either only on machine learning models or only on deep
learning models, but here we have compared both machine learning and deep learning models on
the same dataset in order to get the best possible results.
2.2 Integrated Summary of the Literature studied
Each entry lists the research paper title, the authors and a summary.

[1] Title: Solar Radiation Prediction using Statistical Approaches
Authors: Wu Ji, CK Chan, JW Loh, FH Choo, LH Chen
Summary: Wu Ji et al. [1] discussed various methods for dealing with unprocessed solar
radiation data and models to estimate the time series pattern of solar radiation. Several
simulations were run to smooth the data over time, and they showed that a 3-point moving
average was the most efficient way of smoothing the data. They also applied other machine
learning algorithms. One was ANN, which uses many parallel, interconnected nodes with
different weights that are multiplied recursively with the values in the feature set to find
the predicted solar radiation value. The other was the Evolutionary Algorithm (EA). Both can
be used in combination with linear regression to further improve the accuracy of the
predicted values.

[2] Title: LASSO: A Feature Selection Technique In Predictive Modeling For Machine Learning
Authors: Muthukrishnan R, Rohini R
Summary: R. Muthukrishnan et al. [2] discussed three regression methods: OLS, Ridge and LASSO
regression. Regression is a supervised machine learning technique and thus fits well within
the problem statement. Regression models usually deal with two problems: parameter estimation
and variable selection. To solve these problems, a shrinkage approach is used in algorithms
like Ridge regression and LASSO regression. On analysing these algorithms over both real and
simulated datasets, LASSO was found to be comparatively better than both ordinary linear
regression and ridge regression, as it can shrink coefficients completely to zero.

[3] Title: Analysis of Factors Affecting the Solar Radiation received by any Region
Authors: Arun Pratap Singh Rathod, Poornima Mittal, Brijesh Kumar
Summary: Rathod et al. [3] suggested that atmospheric and geographical factors have a major
influence on the insolation received by the earth; for example, solar radiation increases
with altitude. In plains and plateaus solar radiation is more uniform, whereas in hilly and
mountainous regions it is uneven. Insolation also depends on air quality and local
vegetation. Indian states with high altitude, little vegetation cover and low rainfall
receive a greater share of the sun’s radiation. Latitude also matters: the Indian states
lying closer to the equator receive a larger share of the radiation. Cloud cover also plays a
critical role in the radiation collected. If there are a few clouds in the sky, this would
increase the chances of light getting absorbed as well as scattered, while on a clear-sky day
the amount of radiation received could be very high.

[4] Title: Efficient Classification of Data Using Decision Tree
Authors: Bhaskar N. Patel, Satish G. Prajapati and Dr. Kamaljit I. Lakhtaria
Summary: Patel et al. [4] discussed how relevant data and information need to be classified
into different categories according to certain rules: the data must be clean and adequate,
the problem must be properly defined, and so on. The decision tree is a tree-based classifier
comprising decision nodes and leaf nodes, the major objective being to make the leaf nodes as
homogeneous as possible. In K-means, the centroid coordinates are determined, the distance of
each object is calculated and finally the object is put into the respective cluster. On the
basis of the experiments performed, it was observed that the decision tree (supervised
learning) is much better in terms of accuracy and complexity than K-means (unsupervised
learning).

[5] Title: Solar Energy Prediction using Least Square Linear Regression Method
Authors: Suruchi Dedgaonkar, Vishal Patil, Niraj Rathod, Gajanan Hakare & Jyotiba Bhosale
Summary: Dedgaonkar et al. [5] applied the least squares linear regression method to estimate
solar radiation at a given place based on environmental factors like temperature, rainfall
and wind speed, using the NDC weather forecast data. It was observed that solar intensity and
the day of the year are the most correlated, and that summers had comparatively more
intensity than winters. It was also noted that dew point and temperature are highly
correlated. If either of these parameters is high, solar intensity tends to be higher; if
either is low, intensity fluctuates between high and low values. However, relative humidity,
sky cover, wind speed and amount of clouds were negatively correlated with solar intensity.
All the parameters of the dataset were multiplied by coefficients to form a linear equation.

[6] Title: Data Analysis to Generate Models Based on Neural Network and Regression for Solar
Power Generation Forecasting
Authors: Tushar Verma, A. P. S. Tiwana, C. C. Reddy, Vikas Arora and P. Devanand
Summary: Verma et al. [6] used linear, logarithmic and polynomial regression models and an
ANN to forecast solar generation, and found that the ANN gave the minimum error and was an
extremely dependable technique. Solar radiation also depends on several other factors such as
temperature, wind speed and humidity. While studying different parameters of the dataset, it
was found that if any of temperature, cloud cover and relative humidity was higher, the
efficiency of the solar panels decreased. As time passes, solar panels start to degrade due
to factors like rainfall and wind, and hence full efficiency cannot be achieved.

[7] Title: Towards an efficient regression model for solar energy prediction
Authors: A. Prakash and S. K. Singh
Summary: Prakash et al. [7] focused on the comparison of least squares linear regression and
Gradient Boosting Regression (GBR). GBR is a technique which uses a collection of many
models, such as decision trees or neural networks. The model is generalized by optimizing a
loss function; they used ‘least squares’ as the loss function. On analysing the results
obtained using both algorithms, it was observed that GBR is more accurate than least squares
linear regression when applied to the solar radiation prediction problem. The results can be
further improved by adjusting the boosting stage parameter and the learning rate of the GBR
algorithm.

[8] Title: Prediction of solar radiation in Bejaia city using neural network (MLP)
Authors: Z. Asradj and R. Alkama
Summary: Asradj et al. [8] compared empirical models with predictive models based on neural
networks for solar radiation prediction. Several empirical models such as Bahel, Newland and
Abdalla already existed for prediction, but they performed worse than a neural network
model, the Multilayer Perceptron (MLP). An MLP has more than one layer of perceptrons.
Signals are received by an input layer, the decision about the input is made by the output
layer, and between the two lie the hidden layers, which are the true computational engine of
the model. The size of the input layer equals the number of variables, and the output layer
consists of 1 neuron that gives the predicted radiation value. Based on the RMSE values, MLP
was seen to perform far better than the empirical models.

[9] Title: Artificial neural network based prediction of monthly global solar radiation in
Indian stations
Authors: D. Shruthi, M. S. P. Subathra, J. Kumari and K. Raimond
Summary: Shruthi et al. [9] considered datasets for 23 Indian states so that the data varies
with climate across the states. The algorithms used in this study were ANN, Radial Basis
Function (RBF), Generalised Regression Neural Network (GRNN) and Multilayer Perceptron (MLP).
To check the performance of these models, RMSE, mean square error (MSE) and mean absolute
percentage error (MAPE) were calculated and compared, out of which it was seen that MAPE was
most reliable in comparison to the other performance measures.
Table 1: Literature Summary
3. ANALYSIS, DESIGN AND MODELLING
3.1 Overall description of the project
Sun’s energy is a clean renewable energy resource which is freely available across the world.
Existing methods have focused either only on machine learning models or only on deep learning
models, but here we have compared both machine learning and deep learning models on the same
dataset in order to get the best possible results.
The aim of our work is to predict solar irradiation, which helps identify the areas that
receive maximum sunlight. This helps not only in estimating solar radiation but also in
improving the productivity of solar panels. Five forecasting techniques, namely Linear
Regression, Lasso Regression, Decision Tree, Multilayer Perceptron and Gradient Boosting
Regression, have been evaluated by estimating global horizontal irradiance (GHI) across 4
cities in India: Barmer, Jaipur, Jodhpur and Bikaner. The ML and deep learning models have
been evaluated in terms of accuracy and root mean square error (RMSE).
3.2 Functional Requirements
This project has the following functional requirements:
• Dataset loading: The project supports loading of datasets for multiple cities
such as Jaipur, Barmer and Jodhpur.
• Support of multiple ML algorithms: The project supports multiple ML
algorithms such as Linear Regression, Lasso Regression, Decision Tree and Gradient
Boosting Regression.
• Score calculation: Based on the number of predictions correctly made by an
algorithm, a percentage-based score is calculated for each algorithm.
3.3 Non Functional requirements
This project has the following non-functional requirements:
• Jupyter Notebook support: The outputs and the graphs generated in this project
can be viewed in real time in Jupyter Notebook.
• Graph representation: The datasets and their corresponding predictions can be
represented pictorially using graphs, providing an interactive view of the
dataset.
3.4 Logical database requirements
This project does not use a formal database such as MySQL or MongoDB. Instead, a CSV file is
used to store the solar radiation data. The CSV file contains a table with the following
columns:
Year, Month, Day, Hour, Minute, DHI, DNI, GHI, Dew Point, Temperature, Pressure,
Relative Humidity, Wind Speed and Wind Direction
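As a rough illustration of this layout, the sketch below parses CSV text with the 14 columns listed above using only the Python standard library. The exact header spellings, the helper name load_rows and the two sample rows are assumptions made for illustration; they are not taken from the project files.

```python
import csv
import io

# Hypothetical header names; the real NSRDB export may label columns differently.
COLUMNS = ["Year", "Month", "Day", "Hour", "Minute", "DHI", "DNI", "GHI",
           "Dew Point", "Temperature", "Pressure", "Relative Humidity",
           "Wind Speed", "Wind Direction"]

# Two made-up rows standing in for a city's CSV file (values are illustrative only).
SAMPLE_CSV = (
    ",".join(COLUMNS) + "\n"
    "2014,1,1,5,30,0,0,0,4,11,980,60.5,1.8,45\n"
    "2014,1,1,12,30,110,820,640,6,24,978,32.1,2.4,90\n"
)


def load_rows(text):
    """Parse CSV text into a list of dicts with every value converted to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = load_rows(SAMPLE_CSV)
print(len(rows), rows[1]["GHI"])
```

For a file on disk, the same reader could be fed an open file handle instead of the in-memory string.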
3.5 Design Diagrams
Fig 1: Architecture Diagram
3.6 Component Description
Dataset Description: The dataset has been taken from the National Solar Radiation Database,
which is a serially complete collection of meteorological and solar irradiance datasets for
international locations.
The pre-processed Jaipur dataset consists of 23,725 rows. We have used 17,793 rows for
training and 5,932 rows for testing.
The pre-processed Barmer dataset consists of 47,450 rows. We have used 35,587 rows for
training and 11,863 rows for testing.
The pre-processed Jodhpur dataset consists of 23,725 rows. We have used 17,793 rows for
training and 5,932 rows for testing.
The pre-processed Bikaner dataset consists of 23,725 rows. We have used 17,793 rows for
training and 5,932 rows for testing.
Data Preprocessing: Removing the null values, dropping the unnecessary columns,
identifying the hours of solar radiation and finally cleaning the data.
Applying ML and DL models: We use Linear Regression, Lasso Regression, Decision Tree, MLP
and GBR based models for our problem statement.
Identifying the results: Comparing the accuracy of all the implemented ML and DL models
and finding the one which is best suited for solar radiation prediction.
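To make the comparison step concrete, the sketch below computes RMSE from its standard definition, together with the coefficient of determination (R²). Treating R² as the "accuracy" figure is an assumption made here for illustration; the report does not state which accuracy measure was used. The GHI values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(actual, predicted):
    """Coefficient of determination; assumed here as a stand-in for 'accuracy'."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative GHI values only, not results from the report.
ghi_actual = [100.0, 300.0, 500.0, 700.0]
ghi_predicted = [110.0, 290.0, 520.0, 680.0]

print(round(rmse(ghi_actual, ghi_predicted), 3))
print(round(r2_score(ghi_actual, ghi_predicted), 3))
```

A lower RMSE and a higher R² both indicate a better fit, so the two measures can be read side by side when ranking models.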
4. IMPLEMENTATION DETAILS AND ISSUES
4.1 Implementation details and issues.

Five models, namely Linear Regression, LASSO Regression, Decision Tree, Multilayer Perceptron
(MLP) and Gradient Boosting Regression (GBR), have been implemented on solar radiation
datasets corresponding to four cities: Jaipur, Barmer, Jodhpur and Bikaner. Following are the
graphs corresponding to the datasets:
Fig 2: Barmer dataset graph
Fig 3: Radiation vs Hour for Barmer
Fig 4: Jaipur dataset graph
Fig 5: Radiation vs hour for Jaipur

Fig 6: Jodhpur dataset graph
Fig 7: Radiation vs hour for Jodhpur
Fig 8: Bikaner dataset graph
Fig 9: Radiation vs hour for Bikaner
4.1.1 Implementation Issues
• Unavailability of solar radiation before 6 AM and after 6 PM: Since there is no solar
radiation during these intervals, the dataset contains unnecessary rows, which we have to
drop before making predictions.
• Speed: Since the datasets contain a large number of rows, it takes some time to
clean the data and perform training on it.
• Useless columns in dataset: Several columns such as DHI, DNI and Wind Direction are
not required for making predictions, so we have to drop these columns.
4.1.2 Implementation Details
Preparing Readable Datasets:
The datasets have been prepared as comma-separated values (CSV) files. These files contain
data recorded at an interval of one hour. Each line contains 14 columns with the
corresponding values of year, month, day, hour, minute, DHI, DNI, GHI, dew point, temperature,
pressure, relative humidity, wind speed and wind direction for all four Indian cities, namely
Barmer, Jaipur, Jodhpur and Bikaner.
For any model to be able to use this data, we need to do some pre-processing.
Data Pre-processing:
Pre-processing was applied according to the input requirements of the models. The steps are
briefly described below.
1) The datasets were visualised graphically, which showed that solar radiation data is
available only from 6 AM to 6 PM and that no data is present outside these hours.
2) The data was analysed against the target value (GHI), which showed that not all
columns are required to obtain the result; the unneeded columns could therefore be
dropped.
3) All values were converted to a common data type so that no discrepancies arise in
the final result.
4) After these steps, the relevant data was divided into training and testing data for
further evaluation.
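The first three steps can be sketched as follows. This is a minimal illustration in plain Python rather than the project's notebook code: the column names (`Hour`, `GHI`, `DHI`, `DNI`, `Wind Direction`) and the small inline sample are assumptions made for the example, and the real files contain more columns and rows.

```python
import csv
import io

# Hypothetical sample in the shape described above (not real measurements).
SAMPLE_CSV = """Year,Month,Day,Hour,Minute,GHI,DHI,DNI,Temperature,Wind Direction
2014,1,1,5,0,0,0,0,11.2,40
2014,1,1,9,0,412,98,640,17.5,55
2014,1,1,13,0,655,120,810,24.1,60
2014,1,1,19,0,0,0,0,16.0,70
"""

DROP_COLUMNS = {"DHI", "DNI", "Wind Direction"}  # columns reported as not needed
DAY_START, DAY_END = 6, 18                        # 6 AM to 6 PM


def preprocess(csv_text):
    """Keep daytime rows, drop unused columns, and convert every value to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        hour = int(record["Hour"])
        if not DAY_START <= hour <= DAY_END:
            continue  # step 1: no radiation data outside daytime hours
        rows.append({
            name: float(value)            # step 3: one common data type
            for name, value in record.items()
            if name not in DROP_COLUMNS   # step 2: discard unneeded columns
        })
    return rows


clean_rows = preprocess(SAMPLE_CSV)
```

In this sample only the 9 AM and 1 PM rows survive, each reduced to the seven remaining columns.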
Training and Test Datasets:
The datasets were split into two parts: the data used to train our models and the data used
to test them. We split the datasets such that 70% of the data was used for training and 30%
of the data was used for testing.
Dataset | Training | Testing
Barmer | 35,587 | 11,863
Jaipur | 17,793 | 5,932
Jodhpur | 17,793 | 5,932
Bikaner | 17,793 | 5,932
Fig 10: Splitting of Datasets
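A split of this kind can be sketched as below. The report does not say whether the rows were shuffled before splitting, so the random shuffle, the fixed `seed` and the helper name `train_test_split` are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1000))  # stand-in for 1,000 cleaned data rows
train, test = train_test_split(rows)
```

Fixing the seed keeps the split reproducible, so the models are compared on the same test rows.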
4.1.3 Algorithms
Several algorithms were applied to the pre-processed data. Following is a brief description
of these algorithms.
A. Linear regression
Suppose that $(g_1, h_1), (g_2, h_2), \ldots, (g_n, h_n)$ are realisations of the random
variable pairs $(G_1, H_1), (G_2, H_2), \ldots, (G_n, H_n)$. Linear regression assumes that
the mean of H is a straight-line function of g. This can be written as:
$$E(H_i) = \beta_0 + \beta_1 g_i$$
where $E(H_i)$ denotes the mean (expected) value, the subscript $i$ denotes the
(hypothetical) ith unit in the population, $\beta_0$ and $\beta_1$ are the coefficients of
regression and $g_i$ is the independent feature.
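For a single feature, the least-squares estimates of the two coefficients have a closed form. The sketch below computes them in plain Python on toy data; the function name and the data are illustrative assumptions, not taken from the project.

```python
def fit_simple_linear_regression(g, h):
    """Return (beta0, beta1) minimising the squared error of h ~ beta0 + beta1 * g."""
    n = len(g)
    g_mean = sum(g) / n
    h_mean = sum(h) / n
    sxy = sum((gi - g_mean) * (hi - h_mean) for gi, hi in zip(g, h))
    sxx = sum((gi - g_mean) ** 2 for gi in g)
    beta1 = sxy / sxx                 # slope
    beta0 = h_mean - beta1 * g_mean   # intercept
    return beta0, beta1


# Toy data generated from h = 2 + 3g, so the fit should recover those values.
g = [0.0, 1.0, 2.0, 3.0, 4.0]
h = [2.0, 5.0, 8.0, 11.0, 14.0]
beta0, beta1 = fit_simple_linear_regression(g, h)
```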
B. Lasso Regression
Lasso regression is an extension of linear regression that uses the concept of shrinkage.
Shrinkage is the process in which estimates are pulled towards a central point; in lasso,
the regression coefficients are shrunk towards zero and some of them become exactly zero.
This procedure encourages simple and sparse models. It is well suited to models with high
levels of multicollinearity, or to cases where parts of model selection, such as variable
selection, are to be automated. The objective of the algorithm is to minimise:
$$\sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{j=1}^{p} X_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
where $\lambda$ is the tuning parameter, $\beta_0$ and $\beta_j$ are the coefficients of
regression, $X_{ij}$ represents the features, $Y_i$ represents the labels, $n$ is the number
of observations and $p$ is the number of features.
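One common way to minimise a lasso objective is coordinate descent with soft-thresholding. The sketch below is a minimal plain-Python version under that assumption; the report does not state which solver was used, and the function names and toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink `value` towards zero; values within the threshold band become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, lam, n_iters=100):
    """Minimise sum_i (y_i - b0 - sum_j X_ij b_j)^2 + lam * sum_j |b_j|."""
    n, p = len(X), len(X[0])
    # Centre features and labels so the unpenalised intercept can be recovered later.
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [yi - y_mean for yi in y]

    beta = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = sum(
                Xc[i][j] * (yc[i] - sum(Xc[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            )
            z = sum(Xc[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam / 2) / z if z > 0 else 0.0
    intercept = y_mean - sum(x_means[j] * beta[j] for j in range(p))
    return intercept, beta


# Toy data: y depends on the first feature only; the second feature is irrelevant.
X = [[1.0, 0.3], [2.0, -0.1], [3.0, 0.2], [4.0, -0.4], [5.0, 0.0]]
y = [2.1, 3.9, 6.0, 8.1, 9.9]
intercept, beta = fit_lasso(X, y, lam=1.0)
```

On this toy data the coefficient of the irrelevant second feature is shrunk to exactly zero, which is the variable-selection behaviour described above.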
C. Decision Tree
The decision tree is a tree-based model comprising decision nodes and leaf nodes. Each
decision node tests a feature of the dataset, the branches represent the outcomes of that
test, and the leaf nodes hold the output (class labels in classification, predicted values
in regression). The tree is built following a top-down approach.
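The top-down construction can be sketched with a small regression tree that, at each decision node, picks the split minimising the squared error of its two children. This is an illustrative plain-Python sketch; the split criterion, the depth limit and the toy data are assumptions, not details taken from the project.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean (impurity of a regression node)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, depth=0, max_depth=2, min_samples=2):
    """Grow a regression tree top-down; returns a leaf value or a decision-node dict."""
    if depth >= max_depth or len(y) < min_samples or len(set(y)) == 1:
        return mean(y)  # leaf node: predicted value
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[feature] <= threshold]
            right = [i for i, row in enumerate(X) if row[feature] > threshold]
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, feature, threshold, left, right)
    if best is None:
        return mean(y)  # no valid split: all feature values are identical
    _, feature, threshold, left, right = best
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth, min_samples),
    }


def predict(tree, row):
    """Follow the branches from the root until a leaf is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Toy data: one feature (hour of day) and a radiation-like target (made-up numbers).
X = [[6], [8], [10], [12], [14], [16], [18]]
y = [50.0, 300.0, 600.0, 800.0, 650.0, 320.0, 40.0]
tree = build_tree(X, y)
```

Calling `predict(tree, [12])` walks from the root to a leaf and returns the mean target of the training rows that ended up in that leaf.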
D. Multilayer Perceptron
A multilayer perceptron (MLP) is a well-known deep learning model and a type of artificial
neural network with more than one layer of perceptrons. The signals are received by an
input layer, the conclusion or decision about the input is made by the output layer, and
between the two lie one or more hidden layers, which are the true computational engine of
the model. MLPs with at least one hidden layer and enough hidden units can approximate any
continuous function to arbitrary accuracy. MLPs are mostly applied to supervised learning
problems: they are trained on a set of input-output pairs and learn to model the
correlation (or dependencies) between those inputs and outputs. Training involves adjusting
the parameters, i.e. the weights and biases, in order to minimise the error. Backpropagation
is used to compute the gradient of the error with respect to the weights and biases, and
the error itself can be measured in a variety of ways, including by root mean squared error
(RMSE).
The output of a single neuron is defined as:
$$y = \varphi(w^{T} x + b)$$
where $w$ denotes the vector of weights, $x$ is the vector of inputs, $b$ is the bias and
$\varphi$ is the non-linear activation function.
Fig 11: Basic neuron

The basic neuron takes features as input and uses weights and activation function to generate
the output.
Fig 12: Multilayer perceptron model
Many basic neurons have been combined together to form a multilayer perceptron model.
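The output of a single neuron, and the way neurons are stacked into layers, can be sketched as a forward pass in plain Python. The sigmoid activation, the linear output neuron and all weights below are assumptions chosen for illustration; the report does not specify the architecture that was trained.

```python
import math


def sigmoid(z):
    """Logistic activation, one common choice for the non-linear function phi."""
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    """Linear activation, commonly used on the output neuron for regression."""
    return z


def neuron(weights, inputs, bias, activation=sigmoid):
    """Output of a single neuron: phi(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def mlp_forward(layers, inputs):
    """Feed `inputs` through the layers; a layer is a list of (weights, bias, activation)."""
    values = inputs
    for layer in layers:
        values = [neuron(w, values, b, act) for w, b, act in layer]
    return values


# Hypothetical network: 2 inputs, one hidden layer of 2 neurons, 1 output neuron.
layers = [
    [([0.5, -0.5], 0.0, sigmoid), ([1.0, 1.0], -1.0, sigmoid)],  # hidden layer
    [([2.0, -2.0], 0.5, identity)],                             # output layer
]
output = mlp_forward(layers, [1.0, 1.0])
```

Training would then adjust the weights and biases by backpropagation, which this sketch does not cover.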
E. Gradient Boosting Regression (GBR)
Gradient Boosting Regression is a regression technique that combines multiple weak ML
models (usually decision trees) into an ensemble to form a single prediction model. At each
iteration it looks for patterns in the residuals of the current model and fits a new weak
learner to them, which strengthens the overall model. Once no further patterns can be found
in the residuals, the iterations are stopped, ideally before the model starts to overfit.
This keeps the loss function low on the test data as well as on the training data. A
gradient boosting model has three components:
1. Loss function: The function that is minimised in order to get the best predictions. It
must be differentiable.
2. Weak learner: In GBR, the role of weak learners is usually played by decision trees.
These trees are deliberately constrained so that no single tree becomes a strong learner.
3. Additive model: The weak decision trees are added one at a time, using a gradient
descent procedure to reduce the loss with each addition, and together they form a single
strong predictive model.
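The three components can be seen together in the following minimal sketch, which boosts one-split trees (stumps) under a squared-error loss. It is an illustration under stated assumptions: the stump learner, the number of rounds, the learning rate and the toy data are all hypothetical and are not the settings used in the project.

```python
def fit_stump(x, residuals):
    """Best one-split regression stump; assumes x has at least two distinct values."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((r - left_mean) ** 2 for r in left)
                + sum((r - right_mean) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_gbr(x, y, n_rounds=50, learning_rate=0.1):
    """Boosting for squared-error loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)  # initial constant prediction
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump_predict(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, learning_rate, stumps


def gbr_predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy data: hour of day against a radiation-like target (made-up numbers).
x = [6, 8, 10, 12, 14, 16, 18]
y = [50.0, 300.0, 600.0, 800.0, 650.0, 320.0, 40.0]
model = fit_gbr(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [gbr_predict(model, xi) for xi in x])
```

Each round lowers the training error a little; in practice the number of rounds would be chosen on held-out data to avoid overfitting.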
4.2 Risk Analysis and Mitigation
Risk Id | Classification | Description of risk | Risk area | Probability (P) | Impact (I)
1 | Hardware | Incapability of hardware like RAM, Processor, Memory etc. | Performance, Hardware, Time | Medium | Medium
2 | Multitenancy (Shared access) | All the users are using the same physical architecture. | Security | Low | Low
3 | Security | Critical data at risk. | Security | Low | Low
4 | Security | Authentication, authorization, and access control. | User, Project Scope, Time | Low | Low
5 | Hardware | Processor | Performance, Time | Low | High
6 | Ownership | User is the owner of data. | Security | High | Low
7 | Environment | Windows | Performance, Time | High | Medium
8 | Personnel Related | Incompetent skills | Time | High | High
9 | Personnel Related | Irregularity | Time | Medium | High
Table 2: Risk Analysis
Risk | Mitigation Plan
Hardware | Hardware related issues can be resolved by using powerful processors, faster RAMs and bigger storage devices.
Security | Secure connection must be established.
Personnel Related | We will try to avoid irregularity.
Table 3: Mitigation
5. TESTING
5.1 Testing Plan
Type of Test | Has it been Performed | Explanations | Software Components
Requirement Testing | Yes | Requirements specification must contain all the requirements that are to be solved by our system. | Manual work, need to plan out all the software requirements, time needed to develop, technology to be used etc.
Unit | Yes | Testing technique using which individual modules are tested by the developer to determine if there are any issues. | Manual check is required.
Integration | No | Testing where individual components are combined and tested as a group. (Not needed at present.) | Compiling full part of the code and testing it together.
Performance | Yes | Testing to evaluate the input where the best and most optimal output is yielded by the system. | Protocols used ensure this.
Stress | No | Not needed | N/A
Compliance | No | Not needed | N/A
Security | No | Not needed | There are no security issues.
Load | No | Not needed | Does not depend on multiple user access.
Volume | Yes | Testing done with high volume of data. | Executes in time user can easily wait.
Table 4: Type of Testing
5.2 Component decomposition and type of testing required
S.No. | List of various modules that required testing | Type of Testing required | Techniques for writing test cases
1 | Bamer.ipynb | Requirement, Unit, Performance | White box
2 | Jaipur.ipynb | Requirement, Unit, Performance | White box
3 | Jodhpur.ipynb | Requirement, Unit, Performance | White box
4 | Bikaner.ipynb | Requirement, Unit, Performance | White box
Table 5: Component Decomposition and Identification of Tests required
5.3 Test Cases
Test Case ID | Input | Expected Output | Status
1 | Bamer.ipynb | Accuracy and RMSE | Pass
2 | Jaipur.ipynb | Accuracy and RMSE | Pass
3 | Jodhpur.ipynb | Accuracy and RMSE | Pass
4 | Bikaner.ipynb | Accuracy and RMSE | Pass
Table 6: Test cases for components
5.4 Error and Exception Handling
Test Case ID | Test case for components | Debugging Technique
1 | Bamer.ipynb | Print debugging
2 | Jaipur.ipynb | Print debugging
3 | Jodhpur.ipynb | Print debugging
4 | Bikaner.ipynb | Print debugging
Table 7: Type of debugging technique used
5.5 Limitations of the above Solution
Although the solution performs with decent accuracy on our datasets, it does not handle
other types of datasets and depends solely on the datasets provided by NSRDB. Also, the
simpler regression-based algorithms, Linear and Lasso Regression, do not perform up to the
mark.
6. FINDINGS AND CONCLUSIONS
6.1 Findings
ML model | Accuracy | RMSE
Linear regression | 52.51865034073566 | 211.25675343944854
Lasso regression | 52.51861348215003 | 211.25675343944854
Decision tree | 87.18658080908012 | 109.74416090799727
MLP | 93.04640476692928 | 109.74416090799727
GBR | 93.74755592735533 | 76.66088224310354
Table 8: Results for Barmer Dataset
Table 8 shows the results for the Barmer dataset. Linear regression gives an accuracy of
52.52 and an RMSE of 211.26, lasso regression an accuracy of 52.52 and an RMSE of 211.26,
the decision tree an accuracy of 87.19 and an RMSE of 109.74, the MLP an accuracy of 93.05
and an RMSE of 109.74, and GBR an accuracy of 93.75 and an RMSE of 76.66.
Fig 13: RMSE and Accuracy graph for Barmer
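The two metrics reported in these tables can be computed as sketched below. RMSE is standard; how the accuracy figure was obtained is not stated in the report, so reading it as the coefficient of determination (R²) expressed as a percentage is an assumption made here for illustration. The sample values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_percent(y_true, y_pred):
    """Coefficient of determination R^2, expressed as a percentage (assumed metric)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 100.0 * (1.0 - ss_res / ss_tot)


# Made-up GHI values, for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
error = rmse(observed, predicted)
score = r2_percent(observed, predicted)
```

RMSE is expressed in the units of the target (GHI), so a lower value means predictions lie closer to the observed radiation.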
ML model | Accuracy | RMSE
Linear regression | 48.491051596233425 | 218.92277104024262
Lasso regression | 48.491115152926134 | 218.92263597621968
Decision tree | 85.82293746839373 | 114.85310016420861
MLP | 91.93545587069794 | 114.85310016420861
GBR | 93.0178307177374 | 80.60180767045756
Table 9: Results for Jaipur Dataset
Table 9 shows the results for the Jaipur dataset. Linear regression gives an accuracy of
48.49 and an RMSE of 218.92, lasso regression an accuracy of 48.49 and an RMSE of 218.92,
the decision tree an accuracy of 85.82 and an RMSE of 114.85, the MLP an accuracy of 91.94
and an RMSE of 114.85, and GBR an accuracy of 93.02 and an RMSE of 80.60.
Fig 14: RMSE and Accuracy graph for Jaipur
ML model | Accuracy | RMSE
Decision tree | 87.61724105630887 | 108.19648502934594
MLP | 91.76710388291591 | 108.19648502934594
GBR | 93.28076991319929 | 79.70107869381548
Table 10: Results for Bikaner Dataset
Table 10 shows the results for the Bikaner dataset. Linear regression gives an accuracy of
48.75 and an RMSE of about 220, lasso regression an accuracy of 48.75 and an RMSE of about
220, the decision tree an accuracy of 87.62 and an RMSE of 108.20, the MLP an accuracy of
91.77 and an RMSE of 108.20, and GBR an accuracy of 93.28 and an RMSE of 79.70.
Fig 15: RMSE and Accuracy graph for Bikaner
ML model | Accuracy | RMSE
Decision tree | 85.20957001362939 | 118.27730168323379
MLP | 90.8875000982763 | 118.27730168323379
GBR | 92.1356259856941 | 86.24681935952573
Table 11: Results for Jodhpur Dataset
Table 11 shows the results for the Jodhpur dataset. Linear regression gives an accuracy of
49.27 and an RMSE of about 219, lasso regression an accuracy of 49.27 and an RMSE of about
219, the decision tree an accuracy of 85.21 and an RMSE of 118.28, the MLP an accuracy of
90.89 and an RMSE of 118.28, and GBR an accuracy of 92.14 and an RMSE of 86.25.
Fig 16: RMSE and Accuracy graph for Jodhpur
6.2 Conclusion
Accurate solar energy predictions are needed in order to choose sites for solar parks and to
build efficient solar systems. This report therefore compares several machine learning and
deep learning algorithms to obtain the best possible predictions. Of all the models applied,
GBR gave the highest accuracy on the test data with the lowest root mean square error on all
four datasets (Tables 8 to 11), closely followed by the MLP. Among the remaining machine
learning models, the decision tree performed best, while the two regression algorithms,
linear regression and lasso regression, gave almost identical results and lagged well
behind. Linear regression only captures linear relationships between the dependent variable
and the independent variables, i.e. it assumes that a linear relationship exists among them.
This assumption is incorrect when the variables have a curved relationship. Overall, we
observed that the deep learning model (MLP) fared better than the decision tree and the
linear models, although the GBR ensemble performed best of all.
6.3 Future Scope
A lot of research has already been done in the area of solar radiation prediction, and it
has led to many applications. This work has produced significant results, on the basis of
which the research can be taken to the level of real-world application for solar radiation
prediction. We look forward to implementing more efficient and advanced algorithms in the
hope of obtaining better accuracy, and to requesting more datasets from the concerned
organizations, which would help us learn more about solar radiation patterns in different
parts of the country.
