ROCit: An R Package for Performance Assessment of Binary Classifier with Visualization

Md Riaz Ahmed Khan



Sensitivity (also called recall or true positive rate), false positive rate, specificity, precision (or positive predictive value), negative predictive value, misclassification rate, accuracy, and F-score are popular metrics for assessing the performance of a binary classifier at a given threshold. The receiver operating characteristic (ROC) curve is a common tool for assessing the overall diagnostic ability of a binary classifier. Rather than depending on a single threshold, the area under the ROC curve (AUC) summarizes how well a binary classifier performs across all thresholds. The ROCit package makes it easy to evaluate threshold-bound metrics. The ROC curve and its AUC can be obtained using different methods: empirical, binormal, and non-parametric. ROCit provides several methods for constructing confidence intervals of the ROC curve and the AUC. It can also construct an empirical gains table, which is a handy tool for direct marketing. The package offers commonly used visualizations such as the ROC curve, KS plot, and lift plot. Along with built-in default graphics settings, there is room for manual tweaking through function arguments. ROCit offers a broad range of functionality, yet it is easy to use.

Binary Classifier

In statistics and machine learning, classification is the problem of assigning an observation a label from a finite number of possible classes. Binary classification is the special case where the number of possible labels is two. The dependent variable represents one of two conceptually opposed values (often coded with 0 and 1), for example:

There are many algorithms that can be used to predict a binary response. Some widely used techniques are logistic regression, discriminant analysis, Naive Bayes classification, decision trees, random forests, neural networks, and support vector machines (James et al. 2013). In general, the algorithms model the probability that one of the two events occurs for given values of the covariates, which in mathematical terms can be expressed as \(Pr(Y=1|X_1=x_1, X_2=x_2,\dots,X_n=x_n)\). A threshold can then be applied to convert the probabilities into classes.

Binary Classifier Performance Metrics

Hard Classification

When hard classifications are made (either by applying a threshold to the probabilities or directly by the algorithm), there are four possible cases for any observation:

  1. The response is actually negative, and the algorithm predicts it to be negative. This is known as true negative (TN).

  2. The response is actually negative, but the algorithm predicts it to be positive. This is known as false positive (FP).

  3. The response is actually positive, and the algorithm predicts it to be positive. This is known as true positive (TP).

  4. The response is actually positive, but the algorithm predicts it to be negative. This is known as false negative (FN).

All the observations fall into one of the four categories stated above and form a confusion matrix.

                      Predicted Negative (0)   Predicted Positive (1)
Actual Negative (0)   True Negative (TN)       False Positive (FP)
Actual Positive (1)   False Negative (FN)      True Positive (TP)
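The four cases can be counted directly once predictions are dichotomized. The following is a minimal base-R sketch; the data and the 0.5 threshold are made up for illustration and do not come from the package.

```r
# Hypothetical actual classes and predicted probabilities (illustration only)
actual <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.30, 0.90, 0.20, 0.55, 0.60)

# Hard classification: a probability at or above the threshold is labelled 1
threshold <- 0.5
predicted <- as.integer(prob >= threshold)

# Count the four possible cases
TN <- sum(actual == 0 & predicted == 0)
FP <- sum(actual == 0 & predicted == 1)
FN <- sum(actual == 1 & predicted == 0)
TP <- sum(actual == 1 & predicted == 1)

# Confusion matrix with actual classes in rows and predictions in columns
confusion <- matrix(c(TN, FP, FN, TP), nrow = 2, byrow = TRUE,
                    dimnames = list(Actual = c("0", "1"),
                                    Predicted = c("0", "1")))
print(confusion)
```

Every observation lands in exactly one cell, so the four counts always sum to the number of observations.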

Following are some popular performance metrics, when observations are hard classified:

Specificity is also known as true negative rate (TNR).

\[Positive\ DLR=\frac{TPR}{FPR}\] \[Negative\ DLR=\frac{FNR}{TNR}\]

\[ F\text{-}Score=\frac{2}{\frac{1}{PPV} +\frac{1}{TPR}}=2\times \frac{PPV\times TPR}{PPV+TPR} \]
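As a worked check of these formulas, the sketch below computes a few rates from hypothetical confusion-matrix counts and confirms that the two forms of the F-score agree. The counts are illustrative, and the rate definitions used are the standard ones (TPR = TP/(TP+FN), FPR = FP/(FP+TN), PPV = TP/(TP+FP)).

```r
# Hypothetical confusion-matrix counts (illustration only)
TN <- 4; FP <- 2; FN <- 1; TP <- 3

TPR <- TP / (TP + FN)   # sensitivity (recall)
FPR <- FP / (FP + TN)   # false positive rate
TNR <- TN / (TN + FP)   # specificity
PPV <- TP / (TP + FP)   # precision

# Positive diagnostic likelihood ratio
positive_dlr <- TPR / FPR

# F-score as the harmonic mean of PPV and TPR, in both algebraic forms
f_harmonic <- 2 / (1 / PPV + 1 / TPR)
f_product  <- 2 * PPV * TPR / (PPV + TPR)

print(c(TPR = TPR, FPR = FPR, TNR = TNR, PPV = PPV,
        PositiveDLR = positive_dlr, FScore = f_harmonic))
```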

Observations Are Scored

Rather than making hard classifications, models often return probability scores, \(Pr(Y=1)\). Using a cutoff or threshold value, we can dichotomize the scores and calculate these metrics. The same is true when a diagnostic variable is used to categorize the observations. For example, a hemoglobin A1c level lower than 6.5% may be treated as no diabetes, and a level equal to or greater than 6.5% as having the disease. Here the diagnostic measure is not bounded between 0 and 1 like a probability, yet all the metrics stated above can be derived. However, these metrics describe performance only at a particular threshold. Other metrics measure the overall performance of the binary classifier by considering all possible thresholds. Two such metrics are

  1. Area under receiver operating characteristic (ROC) curve
  2. KS statistic

The receiver operating characteristic (ROC) curve (Lusted 1971; Hanley and McNeil 1982; Bewick, Cheek, and Ball 2004) is a simple yet powerful tool used to evaluate a binary classifier quantitatively. The most common quantitative measure is the area under the curve (Hanley and McNeil 1982). The ROC curve is drawn by plotting the sensitivity (TPR) along the \(Y\) axis against the corresponding 1-specificity (FPR) along the \(X\) axis for all possible cutoff values. Mathematically, it is the set of all ordered pairs \((FPR(c), TPR(c))\), where \(c\in \mathbb{R}\).
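The set of ordered pairs can be traced out by sweeping the cutoff over the observed scores. This is a small base-R sketch with made-up scores; it treats a score greater than or equal to the cutoff as positive, and it is not the package's internal implementation.

```r
# Hypothetical diagnostic scores and their 1/0 classes (illustration only)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
class <- c(1,   1,   0,   1,   1,    0,   0,   1,   0,   0)

# Cutoffs in descending order, starting at +Inf so the curve begins at (0, 0)
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))

# A score >= cutoff is called positive; rates are computed within each class
tpr <- sapply(cutoffs, function(cut_val) mean(score[class == 1] >= cut_val))
fpr <- sapply(cutoffs, function(cut_val) mean(score[class == 0] >= cut_val))

roc_points <- data.frame(cutoff = cutoffs, FPR = fpr, TPR = tpr)
print(roc_points)
```

Lowering the cutoff can only add predicted positives, so both FPR and TPR are non-decreasing along the sweep and the curve runs from (0, 0) to (1, 1).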

Some Properties of ROC curve

If the diagnostic variable is unrelated to the binary outcome, the expected ROC curve is simply the \(y=x\) line. When the diagnostic variable perfectly separates the two classes, the ROC curve consists of a vertical line (\(x=0\)) and a horizontal line (\(y=1\)). For practical data, the ROC curve usually lies between these two extremes. The figure below illustrates different types of ROC curves. The red and green curves show the two extreme scenarios: the red line is the expected ROC curve when the diagnostic variable has no predictive power, and the green curve, made of one vertical and one horizontal line, arises when the observations are perfectly separable. The other curves are typical of practical data. The further a curve shifts toward the north-west, the better the predictive power.

ROC curves example

For more details, see Pepe (2003).

Common approaches to estimate ROC curve

  • Empirical: The empirical method simply constructs the ROC curve empirically, applying the definitions of TPR and FPR to the observed data. Figure 1 is an example of such approach. For every possible cutoff value c, TPR and FPR are estimated by:

\[ \hat{TPR}(c)=\sum_{i=1}^{n_Y}I(D_{Y_i}\geq c)/n_Y \]

\[ \hat{FPR}(c)=\sum_{j=1}^{n_{\bar{Y}}}I(D_{{\bar{Y}}_j}\geq c)/n_{\bar{Y}} \]

where \(Y\) and \(\bar{Y}\) represent the positive and negative responses, \(n_Y\) and \(n_{\bar{Y}}\) are the total numbers of positive and negative responses, and \(D_Y\) and \(D_{\bar{Y}}\) are the values of the diagnostic variable in the positive and negative responses. The indicator function has the usual meaning: it evaluates to 1 if the expression is true, and 0 otherwise. The area under the empirically estimated ROC curve is given by:

\[ \hat{AUC}=\frac{1}{n_Yn_{\bar{Y}}} \sum_{i=1}^{n_Y}\sum_{j=1}^{n_{\bar{Y}}} \Big(I(D_{Y_i}>D_{{\bar{Y}}_j})+ \frac{1}{2}I(D_{Y_i}=D_{{\bar{Y}}_j})\Big) \]

The variance of AUC can be estimated as (Hanley and McNeil 1982):

\[ V(AUC)=\frac{1}{n_Yn_{\bar{Y}}}( AUC(1-AUC) + (n_Y-1)(Q_1-AUC^2) + (n_{\bar{Y}}-1)(Q_2-AUC^2) ) \]

where \(Q_1=\frac{AUC}{2-AUC}\) and \(Q_2=\frac{2\times AUC^2}{1+AUC}\).
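A minimal base-R sketch of the empirical AUC and the Hanley and McNeil variance follows. It compares every positive score with every negative score, counting 1 when the positive score is larger and 1/2 when the two are tied. The scores are hypothetical.

```r
# Hypothetical diagnostic scores by class (illustration only)
pos <- c(0.9, 0.8, 0.6, 0.55, 0.3)   # positive responses
neg <- c(0.7, 0.5, 0.4, 0.2, 0.1)    # negative responses
n_pos <- length(pos)
n_neg <- length(neg)

# All pairwise differences: positive score minus negative score
diffs <- outer(pos, neg, "-")

# A pair counts 1 if the positive score is larger, 1/2 if tied
auc <- (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / (n_pos * n_neg)

# Hanley and McNeil (1982) variance estimate
q1 <- auc / (2 - auc)
q2 <- 2 * auc^2 / (1 + auc)
var_auc <- (auc * (1 - auc) +
            (n_pos - 1) * (q1 - auc^2) +
            (n_neg - 1) * (q2 - auc^2)) / (n_pos * n_neg)

print(c(AUC = auc, Variance = var_auc))
```

With these scores, 20 of the 25 positive-negative pairs are ordered correctly, so the empirical AUC is 0.8.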

An alternate formula is developed by DeLong, DeLong, and Clarke-Pearson (1988) which is given in terms of survivor functions: \[ V(AUC)=\frac{V(S_{D_{\bar{Y}}}(D_Y))}{n_Y} +\frac{V(S_{D_Y}(D_{\bar{Y}}))}{n_{\bar{Y}}} \]
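The survivor-function form can be sketched in base R by evaluating each group's empirical survivor function at the scores of the other group. This sketch assumes no tied scores and uses the sample variance (denominator n - 1); the data are hypothetical and this is not the package's own code.

```r
# Hypothetical diagnostic scores by class (illustration only, no ties)
pos <- c(0.9, 0.8, 0.6, 0.55, 0.3)   # positive responses
neg <- c(0.7, 0.5, 0.4, 0.2, 0.1)    # negative responses
n_pos <- length(pos)
n_neg <- length(neg)

# Empirical survivor function of the negatives, evaluated at each positive score
surv_neg_at_pos <- sapply(pos, function(d) mean(neg > d))

# Empirical survivor function of the positives, evaluated at each negative score
surv_pos_at_neg <- sapply(neg, function(d) mean(pos > d))

# Variance of AUC as the sum of the two scaled sample variances
var_auc_delong <- var(surv_neg_at_pos) / n_pos + var(surv_pos_at_neg) / n_neg

# The mean of the second set of values recovers the empirical AUC
auc <- mean(surv_pos_at_neg)
print(c(AUC = auc, Variance = var_auc_delong))
```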

A confidence interval for the AUC can be computed using the usual normal approximation. For example, a \((1-\alpha)\times 100\%\) confidence interval can be constructed using:

\[ AUC\pm\phi^{-1}(1-\alpha/2)\sqrt{V(AUC)} \]

Here \(\phi\) denotes the standard normal cumulative distribution function. The above formula does not restrict the computed upper and lower bounds, although the AUC is bounded between 0 and 1. One systematic way to keep the interval within these bounds is the logit transformation (Pepe 2003). Instead of constructing the interval directly for the AUC, an interval on the logit scale is first constructed using:

\[ L_{AUC}\pm \phi^{-1}(1-\alpha/2)\frac{\sqrt{V(AUC)}}{AUC(1-AUC)} \]

where \(L_{AUC}=\log(\frac{AUC}{1-AUC})\) is the logit of AUC. The logit-scale limits can then be inverse-logit transformed to obtain the bounds for AUC.
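The difference between the two intervals is easy to see numerically. The sketch below uses hypothetical values for the AUC and its variance and takes the standard error on the logit scale to be \(\sqrt{V(AUC)}/(AUC(1-AUC))\), which follows from the delta method.

```r
# Hypothetical AUC estimate and variance (illustration only)
auc     <- 0.80
var_auc <- 0.022
alpha   <- 0.05

z <- qnorm(1 - alpha / 2)   # standard normal quantile

# Interval built directly on the AUC scale; nothing keeps it inside [0, 1]
ci_plain <- auc + c(-1, 1) * z * sqrt(var_auc)

# Interval built on the logit scale, then mapped back with the inverse logit
logit_auc <- log(auc / (1 - auc))
se_logit  <- sqrt(var_auc) / (auc * (1 - auc))
ci_logit  <- plogis(logit_auc + c(-1, 1) * z * se_logit)

print(rbind(plain = ci_plain, logit = ci_logit))
```

With a small sample and a large variance, the plain upper limit exceeds 1, while the back-transformed logit limits always stay strictly between 0 and 1.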

Confidence interval of ROC curve: For large values of \(n_Y\) and \(n_{\bar{Y}}\), the distribution of \(TPR(c)\) at \(FPR(c)\) can be approximated as a normal distribution with following mean and variance:

\[ \mu_{TPR(c)}=\sum_{i=1}^{n_Y}I(D_{Y_i}\geq c)/n_Y \]

\[ V \Big( TPR(c) \Big)= \frac{ TPR(c) \Big( 1- TPR(c)\Big) }{n_Y} + \bigg( \frac{g(c^*)}{f(c^*) } \bigg)^2\times K \]

where \(f\) and \(g\) are the probability density functions of the diagnostic variable in the negative and positive groups, respectively,

\[ K=\frac{ FPR(c) \Big(1-FPR(c)\Big)}{n_{\bar{Y}} } \]

\[ c^*=S^{-1}_{D_{\bar{ Y}}}\Big( FPR(c) \Big) \]

and \(S\) is the survival function given by

\[ S(t)=P\Big(T>t\Big)=\int_t^{\infty}f_T(u)du=1-F(t) \]

For details, see Pepe (2003).

  • Binormal: This is a parametric approach in which the diagnostic variable is assumed to be normally distributed in each of the two groups.

\[ D_Y\sim N(\mu_{D_Y}, \sigma_{D_Y}^2) \]

\[ D_{\bar{Y}}\sim N(\mu_{D_{\bar{Y}}}, \sigma_{D_{\bar{Y}}}^2) \]

When such distributional assumptions are made, ROC curve can be defined as:

\[ y(x)=1-G(F^{-1}(1-x)), \ \ 0\leq x\leq 1 \]

where \(F\) and \(G\) are the cumulative distribution functions of the diagnostic score in the negative and positive groups respectively, with \(f\) and \(g\) being the corresponding probability density functions. Under the normality assumption, the ROC curve and the AUC are given by:

\[ ROC\ Curve: y= \phi(A+BZ_x) \]

\[ AUC=\phi(\frac{A}{\sqrt{1+B^2}}) \]

where, \(Z_x=\phi^{-1}(x(t))=\frac{\mu_{D_{\bar{Y}}}-t}{\sigma_{D_{\bar{Y}}}}\), \(t\) being a cutoff; and \(A=\frac{|\mu_{D_{{Y}}}-\mu_{D_{\bar{Y}}}|}{\sigma_{D_{{Y}}}}\), \(B=\frac{\sigma_{D_{\bar{Y}}}}{\sigma_{D_{{Y}}}}\).
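The binormal formulas can be checked numerically. In the sketch below the group means and standard deviations are hypothetical, `pnorm` and `qnorm` play the role of \(\phi\) and \(\phi^{-1}\), and the closed-form AUC is compared with a numerical integral of the curve.

```r
# Hypothetical binormal parameters (illustration only)
mu_pos <- 2.0; sd_pos <- 1.5   # positive group
mu_neg <- 0.0; sd_neg <- 1.0   # negative group

A <- abs(mu_pos - mu_neg) / sd_pos
B <- sd_neg / sd_pos

# Binormal ROC curve as a function of the false positive rate x
binormal_roc <- function(x) pnorm(A + B * qnorm(x))

# Closed-form AUC and a numerical cross-check by integrating the curve
auc_closed  <- pnorm(A / sqrt(1 + B^2))
auc_numeric <- integrate(binormal_roc, lower = 0, upper = 1)$value

# The same curve traced through a cutoff t: FPR and TPR are upper-tail areas
t_cut    <- 1.2
fpr_t    <- 1 - pnorm(t_cut, mean = mu_neg, sd = sd_neg)
tpr_t    <- 1 - pnorm(t_cut, mean = mu_pos, sd = sd_pos)
tpr_from_curve <- binormal_roc(fpr_t)

print(c(A = A, B = B, AUC = auc_closed, AUC_numeric = auc_numeric))
```

The point obtained from the cutoff lies on the curve because \(A+BZ_x\) reduces to \((\mu_{D_Y}-t)/\sigma_{D_Y}\) when the positive mean is the larger one.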

Confidence interval of ROC curve: To get the confidence interval, variance of \(A+BZ_x\) is derived using:

\[ V(A+B Z_x)=V(A)+Z_x^2V(B)+2Z_xCov(A, B) \]

A \((1-\alpha)\times100\%\) confidence limit for \(A+BZ_x\) can be obtained as

\[ (A+BZ_x)\pm \phi^{-1}(1-\alpha/2)\sqrt{V(A+BZ_x)} \]

Pointwise confidence limits for the ROC curve are obtained by applying \(\phi\) to these limits.

  • Non-parametric: Non-parametric estimates of \(f\) and \(g\) are used in this approach. Zou, Hall, and Shapiro (1997) presented one such approach using Kernel densities:

\[ \hat{f}(x)=\frac{1}{n_{\bar{Y}}h_{\bar{ Y}}}\sum_{i=1}^{n_{\bar{ Y}}} K\big( \frac{x-D_{\bar{ Y}i} }{h_{\bar{ Y}}} \big) \]

\[ \hat{g}(x)= \frac{1}{n_{{Y}}h_Y}\sum_{i=1}^{n_{{ Y}}} K\big( \frac{x-D_{{ Y}i} }{h_Y} \big) \]

where \(K\) is the kernel function and \(h\) is the smoothing parameter (bandwidth). Zou, Hall, and Shapiro (1997) suggested a biweight kernel:

\[ K\big(\frac{x-\alpha}{\beta}\big)=\begin{cases} \frac{15}{16} \Big[ 1-\big(\frac{x-\alpha}{\beta}\big)^2 \Big]^2 , & x\in (\alpha - \beta, \alpha + \beta)\\ 0, & \text{otherwise} \end{cases} \]

with the bandwidths given by

\[ h_{\bar{Y}}=0.9\times \min\big( \sigma_{\bar{ Y}}, \frac{IQR(D_{\bar{ Y}})}{1.34} \big)/ (n_{\bar{ Y}} )^{\frac{1}{5}} \]

\[ h_{{Y}}=0.9\times \min\big( \sigma_{{ Y}}, \frac{IQR(D_{{ Y}})}{1.34} \big)/ (n_{{ Y}} )^{\frac{1}{5}} \]

Smoothed versions of TPR and FPR are obtained as the areas to the right of the cutoff under the smoothed \(g\) and \(f\), respectively. That is,

\[ \hat{TPR}(t)=1-\int_{-\infty}^{t}\hat{g}(u)du=1-\hat{G}(t) \]

\[ \hat{FPR}(t)=1-\int_{-\infty}^{t}\hat{f}(u)du=1-\hat{F}(t) \]

Once discrete pairs of \((FPR, TPR)\) are obtained, the trapezoidal rule can be applied to calculate the AUC.
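The kernel-smoothed curve can be sketched in base R as follows. The simulated scores, the grid size, and the helper names are assumptions for illustration. The sketch uses the standard biweight kernel \(\frac{15}{16}(1-u^2)^2\), whose cumulative distribution function has a closed form, so no numerical integration of the density is needed.

```r
set.seed(1)
# Hypothetical simulated scores (illustration only)
neg <- rnorm(200, mean = 0, sd = 1)   # negative responses
pos <- rnorm(150, mean = 1, sd = 1)   # positive responses

# Bandwidth rule: 0.9 * min(sd, IQR / 1.34) / n^(1/5)
bandwidth <- function(d) 0.9 * min(sd(d), IQR(d) / 1.34) / length(d)^(1 / 5)

# CDF of the standard biweight kernel (15/16)(1 - u^2)^2 on (-1, 1)
biweight_cdf <- function(u) {
  u <- pmin(pmax(u, -1), 1)
  0.5 + (15 / 16) * (u - 2 * u^3 / 3 + u^5 / 5)
}

# Smoothed CDF: average of kernel CDFs centred at each observation
smooth_cdf <- function(t, d, h) {
  sapply(t, function(v) mean(biweight_cdf((v - d) / h)))
}

h_neg <- bandwidth(neg)
h_pos <- bandwidth(pos)

# Cutoff grid in descending order, padded so the curve reaches (0,0) and (1,1)
pad  <- max(h_neg, h_pos)
grid <- seq(max(neg, pos) + pad, min(neg, pos) - pad, length.out = 500)

fpr <- 1 - smooth_cdf(grid, neg, h_neg)
tpr <- 1 - smooth_cdf(grid, pos, h_pos)

# Trapezoidal rule over the discrete (FPR, TPR) pairs
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```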

Using Package ROCit

1/0 coding of response

A binary response can exist as factor, character, or numerics other than 1 and 0. Often it is desired to have the response coded with just 1/0. This makes many calculations easier.

So the response is a factor variable. There are 131 cases of charged off and 769 cases of fully paid. Often the probability of defaulting is modeled in loan data, making the fully paid group as reference.

If the reference is not specified, the charged off group, which comes first alphabetically, is set as the reference.
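The idea behind 1/0 coding can be sketched in base R. The level labels "CO" and "FP" and the helper name `to_binary` are assumptions made for this illustration, and the helper is not the package's own conversion routine; only the group sizes (131 and 769) are taken from the text.

```r
# Hypothetical status factor: "CO" = charged off, "FP" = fully paid
status <- factor(rep(c("FP", "CO"), times = c(769, 131)))
print(table(status))

# Hypothetical helper: 0 for the reference level, 1 for the other level.
# When no reference is given, the alphabetically first level is used.
to_binary <- function(x, reference = NULL) {
  x <- as.character(x)
  if (is.null(reference)) reference <- sort(unique(x))[1]
  as.integer(x != reference)
}

y_default <- to_binary(status)                     # "CO" is the reference
y_fp_ref  <- to_binary(status, reference = "FP")   # "FP" is the reference

print(c(ones_default = sum(y_default), ones_fp_ref = sum(y_fp_ref)))
```

Choosing the fully paid group as the reference makes the charged off cases the 1s, which matches the usual practice of modeling the probability of default.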

Performance metrics of binary classifier

Various cutoff-specific performance metrics for a binary classifier are available. The following metrics can be requested via the measure argument:

Accuracy vs Cutoff
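How accuracy changes with the cutoff can be reproduced by hand. The following base-R sketch uses made-up scores and computes the accuracy at every candidate cutoff without relying on the package.

```r
# Hypothetical diagnostic scores and their 1/0 classes (illustration only)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
class <- c(1,   1,   0,   1,   1,    0,   0,   1,   0,   0)

# Candidate cutoffs in descending order, starting at +Inf
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))

# Accuracy at a cutoff: share of observations whose dichotomized
# score (score >= cutoff is positive) matches the actual class
accuracy <- sapply(cutoffs, function(cut_val) {
  mean(as.integer(score >= cut_val) == class)
})

best_cutoff <- cutoffs[which.max(accuracy)]
print(data.frame(cutoff = cutoffs, accuracy = accuracy))
```

At the extreme cutoffs every observation receives the same label, so accuracy there equals the share of negatives (highest cutoff) or positives (lowest cutoff).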

ROC curve estimation

rocit is the main function of the ROCit package. Given the diagnostic score and the class of each observation, it calculates the true positive rate (sensitivity) and false positive rate (1-specificity) at convenient cutoff values to construct the ROC curve. The function returns a “rocit” object, which can be passed as an argument to other S3 methods.

The Diabetes data contain information on 403 of the 1046 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. According to Dr John Hong, Diabetes Mellitus Type II (adult onset diabetes) is associated most strongly with obesity. The waist/hip ratio may be a predictor in diabetes and heart disease. DM II is also associated with hypertension; they may both be part of “Syndrome X”. The 403 subjects were the ones who were actually screened for diabetes. Glycosylated hemoglobin > 7.0 is usually taken as a positive diagnosis of diabetes.

In the data, the dtest variable indicates whether glyhb is greater than 7 or not.

The variable is a character variable in the dataset. There are 60 positive and 330 negative instances. There are also 13 instances of NAs.

Now let us use the total cholesterol as a diagnostic measure of having the disease.

The negative group was taken as the reference in the rocit function. No method was specified, so the empirical method was used by default.

The summary method is available for a rocit object.

The cutoffs are in descending order; TPR and FPR are in ascending order. The first cutoff is set to \(+\infty\), and the last cutoff equals the lowest score in the data used for ROC curve estimation. A score greater than or equal to the cutoff is treated as positive.

Other methods:


Trying a better model:

Confidence interval of AUC:

Confidence interval of ROC curve:

Empirical ROC curve with 90% CI

Options available for plotting ROC curve with CI

KS plot: The KS plot shows the cumulative distribution functions \(F(c)\) and \(G(c)\) of the diagnostic score in the negative and positive populations, respectively. If the positive population tends to have higher scores, the negative curve (\(F(c)\)) ramps up more quickly. The KS statistic is the maximum difference between \(F(c)\) and \(G(c)\).
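The KS statistic can be computed from the two empirical distribution functions. This is a base-R sketch with hypothetical scores; the result is cross-checked against the statistic reported by `ks.test` from the stats package.

```r
# Hypothetical diagnostic scores by class (illustration only)
pos <- c(0.9, 0.8, 0.6, 0.55, 0.3)   # positive responses
neg <- c(0.7, 0.5, 0.4, 0.2, 0.1)    # negative responses

# Empirical cumulative distribution functions of the two groups
F_neg <- ecdf(neg)
G_pos <- ecdf(pos)

# Evaluate both at every observed score and take the largest gap
cutoffs <- sort(unique(c(pos, neg)))
gap     <- abs(F_neg(cutoffs) - G_pos(cutoffs))

ks_stat   <- max(gap)
ks_cutoff <- cutoffs[which.max(gap)]

# Cross-check with the two-sample Kolmogorov-Smirnov statistic
ks_builtin <- unname(ks.test(pos, neg)$statistic)
print(c(KS = ks_stat, cutoff = ks_cutoff, builtin = ks_builtin))
```

Here the largest gap occurs at a score of 0.5, where 80% of the negatives but only 20% of the positives lie at or below the cutoff.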

KS plot

Gains table

A gains table is a useful tool in direct marketing. The observations are first rank ordered, and a certain number of buckets are created from them. The gains table shows several statistics associated with the buckets. This package includes the gainstable function, which creates a gains table containing ngroup groups or buckets. The algorithm first orders the observations with respect to the score variable. In case of a tie, class becomes the ordering variable, keeping the positive responses first. The algorithm calculates the ending index of each bucket as \(round((length(score) / ngroup) * (1:ngroup))\). Each bucket should have at least 5 observations.
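The ending-index rule is simple to try out. The sketch below applies it to 900 observations (131 + 769, the loan data size mentioned earlier) with 15 groups, and then with 7 groups to show what happens when the division is not exact. The variable names are chosen for illustration.

```r
# Number of scored observations and requested number of buckets
n_obs  <- 900
ngroup <- 15

# Ending index of each bucket, following the rounding rule in the text
end_index <- round((n_obs / ngroup) * (1:ngroup))

# Bucket sizes and cumulative depth (share of the population covered)
obs_per_bucket <- diff(c(0, end_index))
depth          <- round(end_index / n_obs, 3)
print(data.frame(Bucket = 1:ngroup, Obs = obs_per_bucket,
                 CObs = end_index, Depth = depth))

# With 7 groups the buckets cannot all be equal; rounding spreads the remainder
end_index_7 <- round((n_obs / 7) * (1:7))
obs_7       <- diff(c(0, end_index_7))
print(obs_7)
```

Because the last index is always the total number of observations, every observation is assigned to a bucket regardless of rounding.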

If the buckets are to end at desired percentages of the population, then breaks should be specified. If specified, breaks overrides ngroup, and ngroup is ignored. By default, breaks always includes 100. If a specified percentage does not correspond to a whole number of observations, the nearest integer is used. The following statistics are computed:

A rocit object can also be passed to gainstable:

rocit_emp <- rocit(score = score, 
                   class = class, 
                   negref = "FP")
gtable_custom <- gainstable(rocit_emp, 
                    breaks = seq(1,100,15))
# ------------------------------
#>    Bucket Obs CObs Depth Resp CResp RespRate CRespRate CCapRate  Lift CLift
#> 1       1  60   60 0.067   20    20    0.333     0.333    0.153 2.290 2.290
#> 2       2  60  120 0.133   11    31    0.183     0.258    0.237 1.260 1.775
#> 3       3  60  180 0.200   12    43    0.200     0.239    0.328 1.374 1.641
#> 4       4  60  240 0.267   14    57    0.233     0.238    0.435 1.603 1.632
#> 5       5  60  300 0.333   11    68    0.183     0.227    0.519 1.260 1.557
#> 6       6  60  360 0.400   13    81    0.217     0.225    0.618 1.489 1.546
#> 7       7  60  420 0.467    9    90    0.150     0.214    0.687 1.031 1.472
#> 8       8  60  480 0.533    7    97    0.117     0.202    0.740 0.802 1.388
#> 9       9  60  540 0.600    5   102    0.083     0.189    0.779 0.573 1.298
#> 10     10  60  600 0.667    9   111    0.150     0.185    0.847 1.031 1.271
#> 11     11  60  660 0.733    4   115    0.067     0.174    0.878 0.458 1.197
#> 12     12  60  720 0.800    7   122    0.117     0.169    0.931 0.802 1.164
#> 13     13  60  780 0.867    3   125    0.050     0.160    0.954 0.344 1.101
#> 14     14  60  840 0.933    6   131    0.100     0.156    1.000 0.687 1.071
#> 15     15  60  900 1.000    0   131    0.000     0.146    1.000 0.000 1.000

#>   Bucket Obs CObs Depth Resp CResp RespRate CRespRate CCapRate  Lift CLift
#> 1      1   9    9  0.01    5     5    0.556     0.556    0.038 3.817 3.817
#> 2      2 135  144  0.16   33    38    0.244     0.264    0.290 1.679 1.813
#> 3      3 135  279  0.31   26    64    0.193     0.229    0.489 1.323 1.576
#> 4      4 135  414  0.46   26    90    0.193     0.217    0.687 1.323 1.494
#> 5      5 135  549  0.61   13   103    0.096     0.188    0.786 0.662 1.289
#> 6      6 135  684  0.76   18   121    0.133     0.177    0.924 0.916 1.215
#> 7      7 135  819  0.91    7   128    0.052     0.156    0.977 0.356 1.074
#> 8      8  81  900  1.00    3   131    0.037     0.146    1.000 0.254 1.000

Lift and Cum. Lift plot
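The Lift and CLift columns can be reproduced from the counts in the first three rows of the 15-bucket table above. The sketch reads Lift as the bucket response rate divided by the overall response rate, and CLift as the cumulative response rate divided by the overall response rate; this reading is an interpretation of the printed columns, checked against the printed values.

```r
# Counts taken from the first three rows of the 15-bucket table
obs  <- c(60, 60, 60)   # observations per bucket (Obs)
resp <- c(20, 11, 12)   # positive responses per bucket (Resp)

# Totals taken from the last row of the same table
total_obs  <- 900
total_resp <- 131
overall_rate <- total_resp / total_obs

# Bucket-level and cumulative response rates
resp_rate  <- resp / obs
cresp_rate <- cumsum(resp) / cumsum(obs)

# Lift, cumulative lift, and cumulative capture rate
lift      <- resp_rate / overall_rate
clift     <- cresp_rate / overall_rate
ccap_rate <- cumsum(resp) / total_resp

print(round(data.frame(RespRate = resp_rate, CRespRate = cresp_rate,
                       CCapRate = ccap_rate, Lift = lift, CLift = clift), 3))
```

A lift above 1 means the bucket contains a higher share of positive responses than the population as a whole, which is why the top-ranked buckets are the most valuable in direct marketing.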


Altman, Douglas G, and J Martin Bland. 1994a. “Diagnostic Tests. 1: Sensitivity and Specificity.” BMJ: British Medical Journal 308 (6943): 1552.

———. 1994b. “Statistics Notes: Diagnostic Tests 2: Predictive Values.” BMJ 309 (6947): 102.

Beitzel, Steven M, Eric C Jensen, Abdur Chowdhury, Ophir Frieder, and David Grossman. 2007. “Temporal Analysis of a Very Large Topically Categorized Web Query Log.” Journal of the American Society for Information Science and Technology 58 (2): 166–78.

Bermingham, Adam, and Alan Smeaton. 2011. “On Using Twitter to Monitor Political Sentiment and Predict Election Results.” In Proceedings of the Workshop on Sentiment Analysis Where AI Meets Psychology (SAAIP 2011), 2–10.

Bewick, Viv, Liz Cheek, and Jonathan Ball. 2004. “Statistics Review 13: Receiver Operating Characteristic Curves.” Critical Care 8 (6): 508.

DeLong, Elizabeth R, David M DeLong, and Daniel L Clarke-Pearson. 1988. “Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach.” Biometrics, 837–45.

Denecke, Kerstin. 2008. “Using SentiWordNet for Multilingual Sentiment Analysis.” In Data Engineering Workshop, 2008. ICDEW 2008. IEEE 24th International Conference on, 507–12. IEEE.

Hanley, James A, and Barbara J McNeil. 1982. “The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve.” Radiology 143 (1): 29–36.

Huang, Jeff, and Efthimis N Efthimiadis. 2009. “Analyzing and Evaluating Query Reformulation Strategies in Web Search Logs.” In Proceedings of the 18th ACM Conference on Information and Knowledge Management, 77–86. ACM.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Lusted, Lee B. 1971. “Decision-Making Studies in Patient Management.” New England Journal of Medicine 284 (8): 416–24.

Nguyen, Thuy TT, and Grenville Armitage. 2006. “Training on Multiple Sub-Flows to Optimise the Use of Machine Learning Classifiers in Real-World IP Networks.” In Proceedings. 2006 31st IEEE Conference on Local Computer Networks, 369–76. IEEE.

Pepe, Margaret Sullivan. 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. Medicine.

Siddiqi, Naeem. 2012. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Vol. 3. John Wiley & Sons.

Zou, Kelly H, WJ Hall, and David E Shapiro. 1997. “Smooth Non-Parametric Receiver Operating Characteristic (ROC) Curves for Continuous Diagnostic Tests.” Statistics in Medicine 16 (19): 2143–56.