This paper addresses the preparation and analysis of datasets in order to improve machine learning predictions of consumed and generated electrical energy. The importance and influence of the time of day, month, year, temperature, humidity, atmospheric pressure, and other factors on the prediction were determined. The dataset used in this article contains data from a smart house equipped with photovoltaic cells that generate electrical energy covering part of the house's demand. The dataset contains the following values: «time», consumed electrical energy («use [kW]»), generated electrical energy («gen [kW]»), «temperature», «humidity», «visibility», «pressure», «windSpeed», «cloudCover», «windBearing», the temperature as felt by a human («apparentTemperature»), precipitation intensity («precipIntensity»), «dewPoint», and precipitation probability («precipProbability»). The data were collected over 11 months at a sampling interval of 1 minute. Before analysis and model training, the data must be preprocessed. In the first stage, the share of missing and zero values in the dataset was examined. The second stage eliminates outliers, that is, values lying at an anomalous distance from the other values in the sample. Such outliers can be caused by measurement errors or the use of wrong measurement units, but they can also be correct extreme values. The cleaning procedure involves determining the lower and upper quartiles of the existing data for the distribution of used energy. For the model to learn effectively, the values that are most important and suitable for training must be selected. Pearson's correlation coefficient was used to quantify the strength and sign of the linear relationship between pairs of values, as well as to estimate their influence on the used and generated energy.
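The outlier-elimination and correlation steps described above might be sketched as follows. This is a minimal illustration and not the authors' code: the abstract states only that the lower and upper quartiles are determined, so the 1.5 × IQR fence, the helper names `drop_iqr_outliers` and `correlation_with_target`, and the synthetic `demo` frame are all assumptions introduced here.

```python
import numpy as np
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR].

    The multiplier k = 1.5 is a common convention, assumed here; the
    abstract does not state which fence the authors used.
    """
    q1 = df[column].quantile(0.25)  # lower quartile
    q3 = df[column].quantile(0.75)  # upper quartile
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


def correlation_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`,
    ordered from the strongest to the weakest absolute value."""
    corr = df.select_dtypes(include="number").corr(method="pearson")[target]
    corr = corr.drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 1000
temperature = rng.normal(15, 8, n)
demo = pd.DataFrame({
    "temperature": temperature,
    "apparentTemperature": temperature + rng.normal(0, 0.5, n),
    "humidity": rng.uniform(0.2, 1.0, n),
    "use [kW]": 0.8 + 0.02 * temperature + rng.normal(0, 0.1, n),
})
demo.loc[:4, "use [kW]"] = 50.0  # inject a few implausible readings

cleaned = drop_iqr_outliers(demo, "use [kW]")
ranking = correlation_with_target(cleaned, "use [kW]")
print(f"rows removed: {len(demo) - len(cleaned)}")
print(ranking)
```

In this synthetic frame «temperature» and «apparentTemperature» are nearly collinear by construction, which is the kind of situation in which only one variable of a correlated pair would be kept for training.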
From each group of highly correlated values, only one was chosen, which improved the adequacy, generalization, and interpretability of the results. As a result of the correlation analysis, three parameters were selected for training: «apparentTemperature», «dewPoint», and «precipProbability». The proposed preprocessing methods increase prediction accuracy by 25% for the used energy and by 2% for the generated energy. The initial dataset was divided as follows: 70% of the values were used as training samples and 30% as testing samples. To compare training methods, three machine learning models from the Python library Scikit-learn were considered: «Linear», «Random forest», and «k nearest neighbors». The coefficient of determination R² was used as the accuracy metric. Diagrams of the R² values for electrical energy generation and consumption were built for the three models considered. Among the tested models, the best result was achieved by the «Random forest» model (84% for the used energy and 95% for the generated energy). Accuracy could be increased further by using more testing samples and parameters in the analysis, longer observation intervals, and additional data preprocessing methods.
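The comparison described above (a 70/30 split, three Scikit-learn models, R² as the metric) could look roughly like the sketch below. The abstract does not give the model classes or hyperparameters, so the regressor variants, the library defaults, the shuffled split, the helper name `compare_models`, and the synthetic `demo` data are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# The three parameters selected by the correlation analysis.
FEATURES = ["apparentTemperature", "dewPoint", "precipProbability"]


def compare_models(df: pd.DataFrame, target: str, seed: int = 0) -> dict:
    """Fit three regressors on a 70/30 split and return their test R² scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df[target], test_size=0.3, random_state=seed
    )
    # Default hyperparameters are an assumption; the abstract lists none.
    models = {
        "Linear": LinearRegression(),
        "Random forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "k nearest neighbors": KNeighborsRegressor(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data (hypothetical): generation depends nonlinearly
# on apparent temperature and precipitation probability.
rng = np.random.default_rng(0)
n = 600
apparent = rng.normal(15, 8, n)
demo = pd.DataFrame({
    "apparentTemperature": apparent,
    "dewPoint": apparent - rng.uniform(0, 10, n),
    "precipProbability": rng.uniform(0, 1, n),
})
demo["gen [kW]"] = (
    np.clip(0.05 * apparent * (1 - demo["precipProbability"]), 0, None)
    + rng.normal(0, 0.02, n)
)

scores = compare_models(demo, "gen [kW]")
for name, r2 in scores.items():
    print(f"{name}: R² = {r2:.3f}")
```

The same function can be called with `"use [kW]"` as the target on a real frame that contains that column; the scores on synthetic data say nothing about the values reported in the paper.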