Evaluation of Distribution Fault Diagnosis
Algorithms using ROC Curves
Yixin Cai, Student Member, IEEE, Mo-Yuen Chow, Fellow, IEEE, Wenbin Lu, and Lexin Li
Abstract-- In power distribution fault data, the percentage of faults with different causes can differ widely and varies from region to region. This data imbalance issue seriously affects the performance evaluation of fault diagnosis algorithms. Due to the limitations of conventional accuracy (ACC) and geometric mean (G-mean) measures, this paper discusses the application of Receiver Operating Characteristic (ROC) curves in evaluating distribution fault diagnosis performance. After introducing how to obtain ROC curves, Artificial Neural Networks (ANN), Logistic Regression (LR), Support Vector Machines (SVM), Artificial Immune Recognition Systems (AIRS), and K-Nearest Neighbor (KNN) algorithms are compared using ROC curves and Area Under the Curve (AUC) on real-world fault datasets from Progress Energy Carolinas. Experimental results show that AIRS performs best most of the time and ANN is potentially a good algorithm with a proper decision threshold.

Index Terms-- artificial neural networks, artificial immune recognition systems, classification, fault cause identification, k-nearest neighbor algorithm, logistic regression, power distribution systems, support vector machine, ROC curves

This research was sponsored by the National Science Foundation (NSF) through Grant No. ECS-0653017 (Small World Stratification for Power System Fault Diagnosis with Causality).
Y. Cai and M.-Y. Chow are with Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA (e-mails: ycai2@ncsu.edu, chow@ncsu.edu).
W. Lu and L. Li are with Department of Statistics, North Carolina State University, Raleigh, NC 27695 USA (e-mails: wlu4@stat.ncsu.edu, li@stat.ncsu.edu).
978-1-4244-6551-4/10/$26.00 ©2010 IEEE

I. INTRODUCTION
OUTAGE management in power distribution systems has drawn considerable attention during the past decades, as distribution outages are the major source of customer reliability problems [1]. As interruptions of power supply become more and more costly in modern society [2], fast service restoration is in increasingly high demand. Although most modern Outage Management Systems (OMS) feature fault location functions, the root cause of a fault still needs to be identified by on-site engineers due to safety concerns. The spatially dispersed distribution systems, stochastic nature of faults, and noisy and limited data make fault cause identification a challenging and time-consuming task. To provide distribution engineers with clues of the root cause and expedite the repairing process, automated fault diagnosis has been studied for years.
Automated fault diagnosis generally learns from historical fault events and infers the actual root cause from the fault features. For example, Chen et al. [3] proposed an online diagnosis approach using cause-effect network and fuzzy rules to find the root cause based on protective device settings and their operations during the fault in distribution substations; Xiao and Wen [4] solved a similar problem with fuzzy set and optimization technologies; Xu and Chow [5-7] formulated fault diagnosis as a classification problem and applied several biologically inspired algorithms, including Artificial Neural Networks (ANN), fuzzy systems and Artificial Immune Recognition Systems (AIRS); Bowers et al. [8] and Butler-Purry et al. [9] used signal processing technologies to diagnose incipient faults and prevent sustained outages.
Although equipment failures, tree contacts, animal contacts and lightning strikes constitute the majority of distribution faults [1], their frequency of occurrence varies drastically. Tree-caused faults could be the majority in a wooded area but could be rare cases in a metropolitan distribution system. This property of practical outage databases is usually referred to as data imbalance. Typically, data imbalance is the situation where there are many more samples of one class than of the others. It affects the classification in different aspects and has been recognized as a significant issue in machine learning and data mining [10].
The major problem brought by imbalanced data is with the performance evaluation. If the class of interest (e.g. animal-caused fault) is taken as the positive class, a confusion matrix can be built from the classification results by counting the number of samples falling into the four cases listed in Table I.

TABLE I
CONFUSION MATRIX

                        Predicted Positive Class   Predicted Negative Class
Actual Positive Class   True Positive (TP)         False Negative (FN)
Actual Negative Class   False Positive (FP)        True Negative (TN)

The most commonly used performance measure is accuracy (ACC), defined as:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN} \tag{1}$$

Suppose an imbalanced dataset with animal-caused faults accounting for only around 10%. When all samples are classified as non-animal, a high ACC close to 90% could be achieved, which misleadingly suggests good diagnosis performance. Therefore, the geometric mean (G-mean) measure, which considers accuracy on both positive and negative classes, was adopted [11]:

$$G\text{-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} \tag{2}$$

The term TP/(TP+FN) represents how many of the positive samples are actually detected, so it is usually referred to as probability of detection (POD) or sensitivity [12]. The term TN/(TN+FP) is the probability of detection on the negative class and is called specificity. In an ideal case, both terms equal 1 because all samples are correctly classified.
G-mean is balanced in representing the performance on both classes, which is especially important with imbalanced data. Consider the aforementioned example. We can achieve 90% ACC by classifying all samples as non-animal but cannot prevent G-mean from being 0, because POD is 0. A high G-mean can only be achieved when the classification is mostly correct in both classes.
One problem with G-mean is that it could be boosted by manipulating TP. Suppose the negative class is dominant in an imbalanced dataset. TN is much bigger while TP, FP and FN are comparable in magnitude. Thus, intentionally reporting more positives would possibly increase TP and improve POD a lot without affecting specificity much. Another problem of G-mean is that the number is only a relative measure. Unlike ACC or POD, it is not a direct explanation of the performance. A G-mean of 0.9 is obviously better than 0.8, but we do not know what value represents good enough performance.
This paper explores the effectiveness of evaluating distribution fault diagnosis performance using Receiver Operating Characteristic (ROC) curves. With a brief review of how to generate and interpret ROC curves, ANN, Logistic Regression (LR), Support Vector Machines (SVM), AIRS and K-Nearest Neighbor (KNN) algorithms are compared on the actual outage datasets from the distribution systems of Progress Energy Carolinas. The rest of this paper is structured as follows: Section II reviews fundamentals of ROC curves and discusses their application in distribution fault diagnosis performance evaluation; Section III presents the case study of comparing algorithms using ROC curves; Section IV is the conclusion.

II. PERFORMANCE EVALUATION WITH ROC CURVES
A. ROC Space and ROC Curves
Having its origin in signal detection theory, an ROC curve is a graphical plot of the sensitivity vs. (1−specificity) for a binary classifier as its decision threshold is varied [13]. The term (1−specificity) can be simplified to FP/(TN+FP) and is called false positive rate (FPR). ROC space is a plane where the y-axis and x-axis are sensitivity and (1−specificity) respectively. Every classification result of a given dataset can be mapped into one point on this plane through the confusion matrix.
There are generally two types of algorithms being used in fault diagnosis. One type, such as AIRS and KNN, reports the class of a given sample directly. Thus, their performance is represented as a point in the ROC space. Another type of algorithm, including ANN, LR and SVM, predicts a probability of the sample being the positive class, and the sample is assigned to the positive class when this probability is greater than a predefined threshold. By varying this threshold, we can get a group of classification results, which are mapped into the ROC space to form the ROC curve. Fig. 1 shows a sample ROC curve.

[Figure: sample ROC curve with decision thresholds marked, shown with the perfect case and random guess for reference; x-axis 1-specificity, y-axis sensitivity.]
Fig. 1. A sample ROC curve.
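The threshold sweep described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `roc_points`, the scores and the labels are hypothetical and are not taken from the paper's datasets, and a sample is assumed to be assigned to the positive class when its predicted probability is strictly greater than the threshold.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, sensitivity) point per decision threshold.

    scores: predicted probability of the positive class for each sample.
    labels: 1 for the positive class, 0 for the negative class.
    A sample is assigned to the positive class when score > threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and labels, for illustration only.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

# Threshold 1 classifies every sample as negative, threshold 0 as positive
# (as long as no score is exactly 0).
points = roc_points(scores, labels, thresholds=[1.0, 0.5, 0.0])
print(points)
```

With these hypothetical inputs the sweep moves from the lower left corner of the ROC plane to the upper right corner as the threshold is lowered.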
An ROC curve always starts at the point (0, 0), where the decision threshold is set to 1 such that all samples are classified as the negative class. As the decision threshold decreases, sensitivity increases and specificity decreases. The curve stops with decision threshold 0 when all the samples are classified as the positive class, which leads to sensitivity value 1 and specificity value 0, the point (1, 1) on the ROC plane. Circle markers on the sample ROC curve show the decision thresholds with 0.1 step length.
As discussed in Section I, a perfect classification yields sensitivity value 1 and specificity value 1. This is the upper left corner, or the point (0, 1), on the ROC plane. Another extreme case is a completely random guess, which would be correct on both classes half of the time and yields a point on the diagonal from the bottom left corner to the top right corner.
ACC is not available from ROC space because each of the two axes is evaluated on a single class. This is exactly how ROC curves overcome the effect of data imbalance. G-mean can be obtained from the ROC space: given a fixed threshold, it is the square root of the area of the rectangle whose opposite corners are the point on the curve and the point (1, 0).
B. Performance Evaluation
An ROC curve is a simple and intuitive way to visualize the classification performance under different decision thresholds, so it is a good tool for evaluating the overall performance of an algorithm.
As discussed above, the perfect classification occurs at the upper left corner in ROC space and the random guess is on the diagonal. Therefore, a good algorithm yields an ROC curve as close to the perfect case shown in Fig. 1 as possible. An algorithm generating an ROC curve close to the diagonal
performs as poorly as random guess. An algorithm could be even worse than random guess, make more mistakes than correct decisions, and yield ROC curves below the diagonal.
Following is an example of evaluating fault diagnosis performance using ROC curves. Tree-caused and animal-caused faults are identified from others by LR. Fault data are randomly split into training and testing sets of equal size. The logistic model is estimated on the training set and used to predict the probability of being the class of interest on both the training and testing sets. ROC curves are generated as shown in Fig. 2.

[Figure: ROC curves on the training set and the testing set with decision thresholds marked; panels a) Tree-caused faults, b) Animal-caused faults.]
Fig. 2. An example of using ROC curves to evaluate fault diagnosis performance.

According to the shape of the ROC curves, LR is a reasonably good algorithm for both tree-caused and animal-caused faults because the curves are away from the diagonal and can achieve POD 0.6 while maintaining FPR 0.2. The ROC curves on the training set and testing set are very close to each other, indicating that LR has good generalization capability.
Sometimes, a quantitative measure based on ROC curves is preferred, of which the area under the curve (AUC) is the most commonly used. Intuitively, AUC describes how much the curve is stretched towards the upper left corner from the diagonal. It is proven to equal the probability that an algorithm will predict a higher probability for a randomly chosen positive sample than for a randomly chosen negative one [13]. For the results shown in Fig. 2, the average AUC of tree-caused faults is 0.773 on both the training and testing sets. For animal-caused faults, the average AUCs on the training and testing sets are 0.853 and 0.855 respectively. These numbers reflect the same result as the curves: the LR algorithm performs reasonably well and is able to generalize. However, the tradeoff between POD and FPR as the threshold varies is lost with AUC.
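The two views of AUC mentioned above, the area under the curve and the probability that a positive sample receives a higher score than a negative one, can be checked against each other on a small example. The sketch below is illustrative only: the function names and the data are hypothetical, the ROC curve is built by lowering the threshold past one sample at a time (which assumes no tied scores), and the area is computed with the trapezoidal rule.

```python
def roc_from_scores(scores, labels):
    """ROC points obtained by lowering the threshold past one sample at a time.

    Assumes no two samples share the same score.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under an ROC curve given as (FPR, sensitivity) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores and labels, for illustration only.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

area = auc_trapezoid(roc_from_scores(scores, labels))
prob = auc_pairwise(scores, labels)
print(area, prob)
```

On this toy input both computations give the same value, which is the equivalence stated in [13].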
III. CASE STUDY ON PERFORMANCE COMPARISON
This section discusses the use of ROC curves to compare the performance of different distribution fault diagnosis algorithms. The same rules apply as for evaluating a single algorithm: for algorithms reporting a single class, a point closer to the upper left corner represents better performance; for algorithms reporting the probability of being the positive class, a high curve with a steep rise is favorable. To illustrate this idea, ANN, LR, SVM, AIRS and KNN algorithms are compared on real-world datasets.
Three regions in North Carolina are chosen: the service regions of the Asheville operation center for the mountainous area, the Garner operation center for the piedmont area and the Wilmington operation center for the coastal plains (as shown in Fig. 3). Fault characteristics are quite different in these regions. As Fig. 4 shows, the percentage of tree-caused faults varies widely, from as high as 46% in Asheville to as low as 13% in Wilmington.
Fig. 3. Regions under study.
Fig. 4. Percentage of fault causes in three study regions.
Tree-caused and animal-caused faults in these three regions are to be identified. Six factors (number of phases, season, time of day, protective device, weather condition, and overhead or underground device) are converted into likelihood measures [14] to form the feature vector.
A. Comparison of Overall Performance
ANN [15] is commonly used as a nonlinear classifier which can learn from various types of data. A one-hidden-layer feedforward network with 6 hidden nodes is used in this comparison. The number of hidden nodes is determined experimentally to obtain the best performance while avoiding over-fitting as far as possible.
LR [16] is a well-known statistical method for analyzing problems with a binary dependent variable. LR builds a model to fit the log odds of a sample being in one class based on training data. The probability of a sample being the positive class can then be predicted.
SVM [17] is another popular technique to learn from complex data. SVM with a Gaussian radial basis function is used. Similar to ANN, the parameters are tuned to obtain the best performance on the testing set.
AIRS [18] is a supervised learning algorithm simulating human immune mechanisms. Memory cells which capture the characteristics of each class are generated during training, and new samples are assigned to a class based on the memory cells they are close to.
KNN [19] classifies a new sample based on the vote of its k nearest neighbors. Neighbors are defined by the Euclidean distance between feature vectors.
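As a rough illustration of the KNN rule just described, the sketch below votes among the k nearest training samples under Euclidean distance. It is a minimal sketch, not the implementation used in this study: the function name, the value k = 3 and the two-dimensional feature vectors are assumptions made for illustration (the feature vectors in this paper have six likelihood measures).

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority class among the k training samples nearest to query."""
    # Euclidean distance from the query to every training sample.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors; 1 is the positive class.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7)]
train_y = [0, 0, 1, 1, 1]

near_positive = knn_predict(train_x, train_y, (0.75, 0.80), k=3)
near_negative = knn_predict(train_x, train_y, (0.15, 0.15), k=3)
print(near_positive, near_negative)
```

Because the rule returns a class label rather than a probability, its performance on a dataset maps to a single point in the ROC space.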
For each region, data are randomly split into a training set and a testing set of equal size. The classifiers are trained on the training set and the performance on the corresponding testing set is recorded. Experiments are repeated 30 times. For ANN, LR and SVM, the ROC curves generated in different runs are averaged by threshold [13] and plotted in Figs. 5 and 6. The cross markers indicate decision threshold 0.5, which is a natural choice for a binary classification problem. For AIRS and KNN, the plus sign and the solid dot indicate the average performance, and the dotted ellipses around them mark the region within which 95% of the experimental results fall.
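One reading of averaging by threshold [13] is that, for each decision threshold, the ROC point of every run at that threshold is computed and the points are then averaged. The sketch below follows that reading under stated assumptions: the function names and the two small runs are hypothetical, whereas the case study uses 30 runs per region.

```python
def roc_point(scores, labels, threshold):
    """(FPR, sensitivity) of one run at one decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    return fp / n_neg, tp / n_pos


def threshold_average(runs, thresholds):
    """Average the ROC points of several runs at each threshold.

    runs: list of (scores, labels) pairs, one per repeated experiment.
    """
    averaged = []
    for t in thresholds:
        pts = [roc_point(scores, labels, t) for scores, labels in runs]
        mean_fpr = sum(p[0] for p in pts) / len(pts)
        mean_tpr = sum(p[1] for p in pts) / len(pts)
        averaged.append((mean_fpr, mean_tpr))
    return averaged


# Two hypothetical runs, for illustration only.
runs = [
    ([0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10], [1, 1, 0, 1, 0, 0, 1, 0]),
    ([0.90, 0.60, 0.40, 0.20], [1, 0, 1, 0]),
]
curve = threshold_average(runs, thresholds=[1.0, 0.5, 0.0])
print(curve)
```

Averaging at fixed thresholds keeps each point on the averaged curve tied to a specific decision threshold, which is what allows the threshold 0.5 to be marked on the curves.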
[Figure: ROC curves of ANN, LR and SVM with the operating points of AIRS and KNN; panels a) Tree-caused faults in Asheville, b) Tree-caused faults in Garner, c) Tree-caused faults in Wilmington.]
Fig. 5. Use ROC curves to compare algorithms for tree-caused faults.
From Figs. 5 and 6, we can see that the overall performance of ANN is similar to that of LR most of the time. Both are in general good algorithms with high ROC curves. The ROC curves of SVM are comparable to those of ANN and LR at the starting section but drop below them as the threshold decreases, which indicates worse overall performance of SVM. At the decision threshold 0.5, the classification results of ANN, LR and SVM are close to each other, although none of them is good because POD is quite low. These results are similar to the performance of KNN as well. AIRS performs better than KNN and than ANN, LR and SVM at threshold 0.5, except for tree-caused faults in Asheville, the most balanced dataset in this case study.
The average AUC of ANN, LR and SVM is calculated and summarized in Table II. These AUCs confirm that ANN and LR are comparable in terms of overall fault diagnosis performance and that SVM does not perform as well.
[Figure: ROC curves of ANN, LR and SVM with the operating points of AIRS and KNN; panels a) Animal-caused faults in Asheville, b) Animal-caused faults in Garner, c) Animal-caused faults in Wilmington.]
Fig. 6. Use ROC curves to compare algorithms for animal-caused faults.
TABLE II
USE AUC TO COMPARE ALGORITHM PERFORMANCE

Algorithm   Region       Tree-caused faults   Animal-caused faults
ANN         Asheville    0.771(0.004)         0.828(0.005)
ANN         Garner       0.777(0.007)         0.843(0.009)
ANN         Wilmington   0.746(0.013)         0.826(0.010)
LR          Asheville    0.769(0.004)         0.827(0.004)
LR          Garner       0.770(0.006)         0.852(0.009)
LR          Wilmington   0.755(0.011)         0.831(0.007)
SVM         Asheville    0.740(0.004)         0.631(0.063)
SVM         Garner       0.672(0.021)         0.714(0.023)
SVM         Wilmington   0.603(0.021)         0.696(0.044)

B. Revisit ANN vs. AIRS
As we can see in Figs. 5 and 6, some points on the ROC curves of ANN are closer to the upper left corner (perfect classification) than those of AIRS, showing higher POD at the same FPR. To quantify the comparison, we first set the decision threshold for ANN by selecting the points on the ROC curves where FPR equals 0.2, according to the observations. Then the conventional performance measures ACC, POD, FAR and G-mean are calculated. FAR = FP/(TP+FP) represents how much we can believe the result when a sample is diagnosed as the positive class. In a perfect case, FAR equals 0, which means all samples classified as positive are actually positive.
The average performance measures of ANN and AIRS are summarized in Table III, with standard deviations listed in brackets. Similar to the results reported in [7], AIRS is generally better than ANN with the natural decision threshold 0.5. In most cases, although ACC drops slightly, AIRS detects many more positive samples (higher POD) and improves G-mean by up to 84%. However, with the decision threshold selected through ROC curves, ANN not only delivers higher ACC, POD and G-mean, but also yields smaller FAR, which means fewer false positives. These results suggest that ROC curves are critical for identifying the optimal decision threshold and help select the best algorithm.
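The procedure above (pick the threshold whose FPR is about 0.2, then compute ACC, POD, FAR and G-mean from the confusion matrix) can be sketched as follows. This is an illustrative sketch only: the function names, the candidate thresholds and the data are hypothetical, and choosing the candidate whose FPR is closest to the target is an assumption about how the point on the curve would be picked.

```python
import math


def confusion(scores, labels, threshold):
    """Count TP, FN, FP, TN when score > threshold means predicted positive."""
    tp = fn = fp = tn = 0
    for s, y in zip(scores, labels):
        if y == 1:
            if s > threshold:
                tp += 1
            else:
                fn += 1
        else:
            if s > threshold:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def measures(tp, fn, fp, tn):
    """ACC, POD, FAR and G-mean as defined in this paper."""
    pod = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "ACC": (tp + tn) / (tp + fn + fp + tn),
        "POD": pod,
        # FAR is undefined when nothing is classified as positive.
        "FAR": fp / (tp + fp) if (tp + fp) > 0 else float("nan"),
        "G-mean": math.sqrt(pod * specificity),
    }


def select_threshold(scores, labels, candidates, target_fpr=0.2):
    """Return the candidate threshold whose FPR is closest to target_fpr."""
    def fpr(t):
        _, _, fp, tn = confusion(scores, labels, t)
        return fp / (fp + tn)
    return min(candidates, key=lambda t: abs(fpr(t) - target_fpr))


# Hypothetical scores and labels, for illustration only.
scores = [0.90, 0.80, 0.70, 0.60, 0.35, 0.65, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

chosen = select_threshold(scores, labels, candidates=[0.25, 0.5, 0.75])
at_chosen = measures(*confusion(scores, labels, chosen))

# The imbalanced example of Section I: 10% positives, everything called negative.
all_negative = measures(tp=0, fn=10, fp=0, tn=90)
print(chosen, at_chosen, all_negative)
```

The last call reproduces the point made in Section I: classifying every sample as negative gives an ACC of 0.9 while POD and G-mean are both 0.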
IV. CONCLUSION
Performance evaluation of distribution fault diagnosis is affected by the data imbalance issue. To cope with this problem, this paper discusses the application of ROC curves. As a case study, five algorithms are compared using ROC curves on real-world datasets.
An ROC curve is a group of classification results mapped into points on the ROC plane as the decision threshold is varied. It is not affected by data imbalance because the performance is evaluated based on each class. The curve provides more information than G-mean measure and G-mean can actually be derived from the curve. The overall performance of an algorithm can be evaluated qualitatively by the shape of its ROC curve – a high curve with steep rise is good, or quantitatively by the area under the curve.
In the ROC space, ANN and LR are good algorithms in general. SVM is comparable to ANN and LR at large decision thresholds but performs worse as the threshold decreases. KNN yields results similar to those of ANN, LR and SVM with a fixed decision threshold 0.5. AIRS outperforms the others most of the time. With a proper decision threshold selected based on the data and corresponding ROC curves, ANN is able to achieve similar or even better performance than AIRS. The selection of the optimal decision threshold and the best algorithm will be discussed in our future work.

TABLE III
COMPARISON OF FAULT DIAGNOSIS PERFORMANCE

Region       Fault   Measure   ANN with natural threshold   AIRS           ANN with selected threshold
Asheville    TF      ACC       0.705(0.004)                 0.622(0.024)   0.703(0.004)
Asheville    TF      POD       0.619(0.019)                 0.586(0.052)   0.590(0.015)
Asheville    TF      FAR       0.299(0.010)                 0.409(0.037)   0.287(0.007)
Asheville    TF      G-mean    0.694(0.005)                 0.615(0.020)   0.686(0.005)
Asheville    AF      ACC       0.868(0.003)                 0.761(0.027)   0.792(0.011)
Asheville    AF      POD       0.186(0.095)                 0.640(0.087)   0.663(0.022)
Asheville    AF      FAR       0.455(0.203)                 0.692(0.032)   0.654(0.016)
Asheville    AF      G-mean    0.383(0.182)                 0.704(0.039)   0.733(0.008)
Garner       TF      ACC       0.794(0.004)                 0.725(0.034)   0.752(0.009)
Garner       TF      POD       0.266(0.026)                 0.421(0.095)   0.560(0.033)
Garner       TF      FAR       0.354(0.036)                 0.571(0.056)   0.523(0.020)
Garner       TF      G-mean    0.504(0.022)                 0.581(0.046)   0.674(0.014)
Garner       AF      ACC       0.880(0.006)                 0.788(0.112)   0.792(0.015)
Garner       AF      POD       0.285(0.062)                 0.613(0.176)   0.713(0.040)
Garner       AF      FAR       0.457(0.091)                 0.641(0.079)   0.662(0.022)
Garner       AF      G-mean    0.514(0.102)                 0.686(0.073)   0.756(0.013)
Wilmington   TF      ACC       0.879(0.004)                 0.809(0.038)   0.771(0.028)
Wilmington   TF      POD       0.094(0.025)                 0.220(0.157)   0.524(0.067)
Wilmington   TF      FAR       0.483(0.080)                 0.800(0.078)   0.730(0.027)
Wilmington   TF      G-mean    0.302(0.038)                 0.399(0.171)   0.647(0.027)
Wilmington   AF      ACC       0.848(0.005)                 0.782(0.023)   0.786(0.011)
Wilmington   AF      POD       0.207(0.096)                 0.591(0.126)   0.680(0.045)
Wilmington   AF      FAR       0.451(0.125)                 0.635(0.063)   0.610(0.017)
Wilmington   AF      G-mean    0.423(0.143)                 0.686(0.093)   0.740(0.018)

TF stands for tree-caused faults and AF stands for animal-caused faults.

V. ACKNOWLEDGMENT
The authors gratefully acknowledge the contribution of John W. Gajda and Glenn C. Lampley from Progress Energy Carolinas Inc. for their support on data and field experience.

VI. REFERENCES
[1] R. E. Brown, Electric Power Distribution Reliability. New York: Marcel Dekker, Inc, 2002.
[2] Electricity Advisory Committee, US Department of Energy, "Smart grid: enabler of the new energy economy," [Online]. Available: http://www.oe.energy.gov/DocumentsandMedia/final-smart-grid-report.pdf.
[3] W.-H. Chen, C.-W. Liu, and M.-S. Tsai, "On-line fault diagnosis of distribution substations using hybrid cause-effect network and fuzzy rule-based method," IEEE Trans. Power Delivery, vol. 15, pp. 710-717, 2000.
[4] J. Xiao and F. Wen, "Combined use of fuzzy set-covering theory and mode identification technique for fault diagnosis in power systems," presented at IEEE Power Engineering Society General Meeting, Tampa, FL, USA, 2007.
[5] L. Xu and M.-Y. Chow, "A classification approach for power distribution systems fault cause identification," IEEE Trans. Power Systems, vol. 21, pp. 53-60, 2006.
[6] L. Xu, M.-Y. Chow, and L. S. Taylor, "Power distribution fault cause identification with imbalanced data using the data mining-based fuzzy classification E-Algorithm," IEEE Trans. Power Systems, vol. 22, pp. 164-171, 2007.
[7] L. Xu, M.-Y. Chow, J. Timmis, and L. S. Taylor, "Power distribution outage cause identification with imbalanced data using artificial immune recognition system (AIRS) algorithm," IEEE Trans. Power Systems, vol. 22, pp. 198-204, 2007.
[8] J. S. Bowers, A. Sundaram, C. L. Benner, and B. D. Russell, "Outage avoidance through intelligent detection of incipient equipment failures on distribution feeders," presented at IEEE Power and Energy Society General Meeting, Pittsburgh, PA, USA, 2008.
[9] K. L. Butler-Purry and J. Cardoso, "Characterization of underground cable incipient behavior using time-frequency multi-resolution analysis and artificial neural networks," presented at IEEE Power and Energy Society General Meeting, Pittsburgh, PA, USA, 2008.
[10] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: special issue on learning from imbalanced data sets," ACM SIGKDD Explorations Newsletter, vol. 6, pp. 1-6, 2004.
[11] M. Kubat, R. C. Holte, and S. Matwin, "Machine learning for the detection of oil spills in satellite radar images," Machine Learning, vol. 30, pp. 195-215, 1998.
[12] WWRP/WGNE Joint Working Group on Verification, "Forecast verification - issues, methods and FAQ," [Online]. Available: http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.htm.
[13] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, pp. 861-874, 2006.
[14] Y. Cai and M.-Y. Chow, "Exploratory analysis of massive data for distribution fault diagnosis in Smart Grids," presented at IEEE Power & Energy Society General Meeting, Calgary, Canada, 2009.
[15] W. S. Sarle, "Neural network FAQ, part 1 of 7: Introduction, periodic posting to the Usenet newsgroup comp.ai.neural-nets," [Online]. Available: ftp://ftp.sas.com/pub/neural/FAQ.html.
[16] R. L. Ott and M. Longnecker, An Introduction to Statistical Methods and Data Analysis, 5th ed. Pacific Grove, CA, USA: Duxbury, 2001.
[17] A. Karatzoglou and D. Meyer, "Support vector machines in R," Journal of Statistical Software, vol. 15, pp. 1-28, 2006.
[18] A. Watkins, J. Timmis, and L. Boggess, "Artificial immune recognition system (AIRS): an immune-inspired supervised learning algorithm," Genetic Programming and Evolvable Machines, vol. 5, pp. 291-317, 2004.
[19] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, 2001.