Predicting-Power-Outages-Durations


Project maintained by hgallocodes Hosted on GitHub Pages — Theme by mattgraham

Predicting-Power-Outage-Durations

by Sarah Borsotto (sborsott@ucsd.edu) and Hector Gallo (hgallo@ucsd.edu)


Problem Identification

Major power outages can have detrimental effects on public safety, infrastructure, and the economy. Therefore, predicting power outage severity is crucial for proactive planning and resource allocation. It enables utilities and emergency services to allocate resources efficiently and mitigate potential damages, ultimately reducing the impact on the community. We seek to anticipate power outage severity by predicting outage durations, since longer power outages tend to be more negatively influential. Given that a power outage just occurred, we want to know how long the outage will last depending on the cause of the power outage, as well as population metrics, cost metrics, geography, and time. We will be implementing a k-Nearest Neighbors (k-NN) regressor that incorporates these variables to predict outage duration. In order to evaluate our model for its accuracy, we plan to use RMSE and R-squared as our metrics. We hope to minimize RMSE while maximizing R-squared.

Here is a brief explanation of how the k-NN regressor model works: When a new data point is given for prediction, the algorithm identifies the ‘k’ nearest neighbors of that point in the feature space in terms of euclidean distance. That is, it will calculate the distance between the new data point and all points in the training set. Then, it will sort the distances in ascending order to identify the ‘k’ nearest neighbors. The predicted value for the new data point is often the weighted average (or mean) of the target values of its ‘k’ nearest neighbors. The weights can be based on the inverse of the distance or any other suitable weighting scheme. The predicted value for the new data point is then output as the result of the regression. In summary, KNN Regressor makes predictions based on the average of the target values of the ‘k’ nearest neighbors in the feature space. The simplicity and flexibility of KNN make it a popular choice for regression tasks, especially in cases where the underlying relationships in the data are non-linear or where interpretability is essential.

Based on our research, we know that the features that generally have a higher impact on the power outage durations are the cauge of the outage, the peak hours of energy consumption (which are generally between 4:00 pm and 9:00 pm), whether the area is urban or rural, population density, state, and year. At the “time of prediction”, we would have access to all of these features since all of them are knowledge that we have access to before a power outage even starts. Some of the features that we wouldn’t have access to before the power outage, for example, would be customers affected, and outage restoration date.


Baseline Model

What can we expect in terms of outage duration in general? Will the values be high? Will there be a lot of variation? Let’s investigate the distribution of power outage durations by plotting a boxplot of outage durations.

Wow! There seems to be a lot of outliers in our dataset. The boxplot appears small compared to the rest of our graph because the mean outage duration is very far from the max outage duration. This could negatively influence our prediction model, since outliers could skew our prediction towards higher values, even though most of our data centers around a lower threshold. In order to ensure that our prediction model follows the trend of the majority of our data, and that it doesn’t become too biased from outlier datapoints, we will only look at power outages that are within the upper fence of the boxplot, which is 7020 minutes.

We still have a good amount of data to work with. As we noticed in the boxplot, only a small amount of our data is above 7020 minutes.

We also don’t have any null values, so we can work with our new dataset directly without having to drop or impute null values.

Since we are using a regression model to predict outage duration times, we need to find variables that can act as good predictors for our dependent variable. We have an abundant list of quantitative variables we could use to predict outage duration times, so let’s see if we can find any correlations between each variable and outage duration.

Below we generated scatter plots of each quantitative variable with outage duration times. While there are only 3 graphs illustrated below, we went through all of the graphs in increments of 5 and looked for any possible relationships between outage durations and other quantitative variables.

Unfortunately it doesn’t look like there are any clear relationships between duration times and the other quantitative variables, such as COM.CUSTOMERS. Most of the points seem to be clustered. Some graphs did look more promising than others, including RES.CUST.PCT, COM.CUST.PCT, IND.CUST.PCT, PC.REALGSP.USA, UTIL.CONTRI, POPPCT_UC, POPDEN_RURAL, PCT_LAND, and PCT_WATER_TOT. To get a closer look, let’s see what the correlation coefficents are between outage duration and each quantitative variable.

** Correlation coefficients table ** The correlation coefficients seem to fit our interpretation of the graphs. Yet, these variables have relatively low correlation with outage duration, the highest being only 0.266. As such, our quantitative variables may not be good linear predictors for outage duration. We will further explore this in our final model, but for now let’s focus on categorical variables, as they may tell us more about outage duration times.

Intuitively, a strong predictor for outage duration may be the cause of the power outage. In our analysis, we are assuming that the power outage has just occurred, and that the cause of the outage is known. Based on this information, we may be able to generalize outage duration times. For example, severe-weather induced power outages may last longer than intentional attacks since there is a higher sense of urgency for attacks, and because most of the time companies must wait for the harsh weather conditions to pass. Another parameter that could significantly influence power outage durations is location. Different states may have different weather patterns, population sizes, availability of resources, local regulations, and so on. Accordingly, for our baseline model we focus on these two predictors, CAUSE.CATEGORY and U.S._STATE. These variables are both nominal. CAUSE,CATEGORY depicts the cause of the power outage, like severe-weather or intentional attack, while U.S._STATE depicts the state where the power outage occurred, like California. We are one-hot encoding both variables so we can input them into our KNN regression model.

  train_rmse_err test_rmse_err train_r2_err test_r2_err
1 1642.270999 1875.361017 0.050308 -0.511042
2 1373.207860 1484.866550 0.336004 0.052713
3 1332.848949 1467.738424 0.374460 0.074441
4 1372.078114 1400.169371 0.337096 0.157698
5 1349.243155 1379.142951 0.358977 0.182806
96 1370.640540 1293.449097 0.338484 0.281204
97 1371.944761 1295.082746 0.337225 0.279387
98 1372.836064 1293.390659 0.336363 0.281269
99 1373.395286 1294.065417 0.335822 0.280519
100 1374.111346 1293.866986 0.335130 0.280740

The output of our knn_reg_perf function is a dataframe of training and testing rmse and r-squared values from 1 to 200. The index of the dataframe denotes the number of neighbors used in the regressor. We plot this below to get a better view of our error data.

Our rmse for both training and testing appears to be minimized around 20. What does our r-squared look like?

Luckily for us, both our training and test r-squared errors appear to be maximized at around 10. To get a more accurate number for our neighbor metric we will use gridsearchcv.

Based on these plots, we decided to use 10 as our number of neighbors. Below is the data for our rmse and r-2 squared values for testing and training with 10 neighbors using a KNN model.

  Value
train_rmse_err 1294.576506
test_rmse_err 1286.928656
train_r2_err 0.409869
test_r2_err 0.288433

As we can see, our rmse is about 1300 minutes, which is about 16 hours. Additionally, our r-squared values are around .30, which isn’t very high. While our rmse is high, we have a lot of variation in our outage duration data that could explain this score. We believe that for a simple KNN regression model these scores are not bad, but we hope to improve them in our final model by adding more features.


Final Model

For our final model, we want to try to improve upon the baseline model. We will continue to one-hot encode U.S.STATE and CAUSE.CATEGORY. Additionally, we decided to feature engineer two new features, the time of the outage, as well as one of the quantitative features we explored previously. To better predict power outages based on time, we looked at peak hours for energy consumption, which tends to be from 4pm to 9pm. We expect that power outages that started during peak hours would impact the severity of the power outage. In order to include this in our KNN regression model, we transformed our OUTAGE.START column into 0 or 1 for peak vs non-peak hours. We utilized a function transformer, as well as a helper function, to do this. For our other additional feature, we need to look at possible combinations of MONTH and YEAR with the quantitative variables that had the highest correlation coefficient with outage duration. In order to identify an optimized combination, we will perform a manual iterative method that finds the average rmse score for each KNN regressor that includes said combination. For example, in our first iteration, we will employ a stdscalarbygroup transformer on MONTH and RES.CUST.PCT, where MONTH is the month that the power outage occurred in and RES.CUST.PCT is the percent of residential customers served in the U.S. state in percentage. The stdscalarbygroup transformer standardizes the quantitative variable by grouping the categorical variable. This means that the RES.CUST.PCT would be standardized based on the month. We have created a function that goes through each combination, finds the average rmse for 10 different k neighbors, and returns a dictionary with this information that we can then use to pick the best parameters for our stdscalarbygroup transformer.

We see that the lowest rmse score is YEAR and POPPCT_UC, where POPPCT_UC is percentage of the total population of the U.S. state represented by the population of the urban clusters. We will therefore incorporate this into our final baseline model.

  train_rmse_err test_rmse_err train_r2_err test_r2_err
1 105.684702 1840.035670 0.997300 1.125817
2 910.761805 1633.587352 0.708155 0.678476
3 1053.317783 1545.492809 0.557461 0.524874
4 1111.706816 1505.790350 0.526024 0.479102
5 1152.026594 1475.341636 0.483599 0.448388
6 1173.089981 1435.087526 0.467839 0.393032
7 1189.793687 1438.795772 0.442593 0.380050
8 1205.354492 1424.373958 0.413447 0.361813
9 1220.067789 1423.852698 0.397547 0.351167
10 1232.077148 1420.939274 0.382934 0.354234
11 1241.251799 1417.351332 0.374690 0.350008
12 1246.214292 1415.914804 0.370883 0.347326
13 1254.141361 1395.002525 0.360587 0.336891
14 1255.123179 1374.835545 0.353124 0.331480
15 1261.756344 1379.408443 0.349293 0.323860
16 1265.901720 1376.964935 0.345111 0.319492
17 1267.736803 1364.314707 0.341199 0.321177
18 1274.951097 1364.237629 0.335359 0.317671
19 1276.773141 1355.857298 0.331533 0.313840
20 1280.277572 1343.414699 0.328424 0.308299
21 1289.140633 1347.857259 0.325885 0.299305
22 1295.931427 1346.961912 0.322734 0.293356
23 1300.007807 1350.867097 0.318751 0.288839
24 1301.418411 1350.796035 0.318024 0.283763
25 1301.110768 1354.411895 0.314845 0.277796
26 1302.816367 1355.197238 0.310633 0.273237
27 1304.752940 1354.283373 0.306981 0.269502
28 1307.214645 1351.632309 0.303739 0.263166
29 1307.815539 1349.204668 0.298618 0.260607
30 1310.412254 1353.617253 0.297752 0.259972
31 1313.379848 1358.735671 0.293442 0.255846
32 1313.252969 1359.235548 0.292632 0.252869
33 1313.030619 1363.457462 0.290917 0.251648
34 1314.326382 1366.786330 0.287943 0.250035
35 1314.517797 1367.075777 0.284985 0.247926
36 1314.315319 1366.985188 0.283564 0.248222
37 1316.566700 1365.240244 0.281018 0.246938
38 1319.331858 1363.494964 0.277981 0.244985
39 1322.186877 1367.151699 0.277404 0.242297
40 1325.100049 1370.370870 0.275620 0.239591
41 1326.888421 1370.690068 0.273390 0.239710
42 1327.323262 1374.084230 0.272829 0.237424
43 1327.005737 1377.289026 0.269717 0.235372
44 1327.306803 1377.754334 0.268244 0.234932
45 1327.606345 1373.275409 0.266370 0.232430
46 1327.964071 1373.303228 0.264351 0.230166
47 1329.346467 1372.804152 0.262345 0.228614
48 1331.733503 1373.719738 0.260238 0.225789
49 1333.236596 1375.092115 0.258422 0.222594
50 1333.171581 1375.985922 0.256788 0.219934

We can graph the output of our rmse and r-squared scores below, like we did for our baseline.

Based on these graphs, it seems that rmse is minimized at the beginning, and r-squared is similarily maximized at the beginning. We will therefore use a parameter of 2 for the number of neighbors.

  Value
train_rmse_err 910.761805
test_rmse_err 1633.587352
train_r2_err 0.708155
test_r2_err 0.678476

Our training rmse score is 910, our testing rmse score is 1633, our training r-squared score is 0.71, and our testing r-squared score is .68. As we can see the training rmse score decreased from our baseline and our r-squared scores for both the training and test sets improved by about 0.30. However, our rmse score for the testing set increased slightly. We may be able to improve our model to better fit unknown data with other parameters. We believe that we were able to achieve better training rmse because we have added more features to our model that can help predict outage duration time. More specifically, peak energy consumption hours can impact severity of a power outage since resources are exhausted. Similarily, year and U.S. state percentage population can inform us about possible power outage durations, as some years may have more outages and higher populations could lead to longer durations, as more people are utilizing energy. Lastly, as we saw in our baseline, the location and cause of the category is a strong indicator of outage duration. This makes sense, as most power outages in our dataset are severe-weather induced, meaning that other variables like cost or population don’t have great influence over duration time, since the most important attribute is how long the weather event lasts.


Fairness Analysis

We will be performing a “fairness analysis” of our Final Model from the previous step. The question we will try to answer is: Does our model perform better when predicting the power outage duration for states that are located in the East of the United States than it does for states that are located in the West?

Since we are comparing a quantitative attribute, in this case R^2 across the two groups, we will be using a permutation test. First, we will divide all the obsevations into two categories: East and West. We will do this by creating a new column ‘region’, that will contain a 1 if the power outage took place in a state that is in the East side of the country, or a 0 if the power outage happened in a state in the West side of the country.

Therfore, Group X will be power outages that took place in Eastern states and Group Y will be power outages that took place in Western states.

As we know, the model will output the R^2 score that it achieves, so we can use this R^2 score to compare how the model does for predicting power outage duration for Eastern and Western states.

From the whole dataframe, we obtain the difference in performance when the model predicts the power outage duration for Western and Eastern states. Surprisingly, it seems like the model performs better when predicting the power outage durations for Eastern states. Therefore, our null hypothesis is: The k-NN regressor model is fair. It will perform equally when predicting the power outage duraiton for the Eastern and Western states. Our alternative hypothesis is: The k-NN regresssor model is unfair. It will perform better when predicting the power outage duration for Eastern states than for Western states. The test statistic will be the difference between the R^2 obtained from predicting the power outage durations for Eastern states and Western states. If the test statistic is positive, then this means that the model seems to be better at predicting the power outage duration for Eastern states.

In order to perform the permutation test, we will create a function that shuffles the ‘region’ column and computes the difference of R^2 between the Eastern and Wester states. This will allow us to run a permutation test in which we create a distribution under the null, and observe wheter our observed statistic is way too high and, therefore, see if the model seems to be unfair since it has a higher R^2 when predicting the power outage duration for Eastern states.

Now, we can proceed to run the permutation test

After running the permutation test, we can see that the p-value of the observed statistic is 0.437. This is way higher than a significance level of 0.05, therefore, we fail to reject the null. Although this isn’t enough evidence to conclude that the null hypothesis is 100% true, we can at least observe that the model appears to make fair predictions for both the Eastern and Western states.

In order to visualize this, we can create a histogram with the distribution of R^2 differences under the null. As we can see, the observed statistic lies almost in the middle of the disribution created under the null.