Classifiers are frequently developed using class-imbalanced data, i.e., data sets in which the number of subjects in each class is not equal. Standard classification methods applied to class-imbalanced data produce classifiers that do not accurately predict the smaller class. We previously showed that additional challenges exist when the data are both class-imbalanced and high-dimensional, i.e., when the number of samples is smaller than the number of measured variables. Most classification methods base the classification rule on a numerical variable produced by the classification algorithm, for example the probability that a sample belongs to a class (penalized logistic regression), the proportion of nearest neighbors that belong to a class (k-NN), or the proportion of bootstrap trees that classify the sample in a given class (random forests). If the value of this numerical variable is above a pre-specified threshold, the new sample is classified in a given class. We evaluated whether the performance of some classifiers on imbalanced data could be improved by estimating the threshold value on which their classification rule is based, and we addressed the issue of how to choose this threshold. We estimated the threshold (on a training set) by maximizing Youden's index (sensitivity + specificity - 1), the positive or the negative predictive value, or their sum. The results obtained on independent test sets were evaluated in terms of both class-specific predictive accuracies and class-specific predictive values, and we compared the empirically determined thresholds with the thresholds commonly used in practice. In this talk we will show the simulation-based results obtained using penalized logistic regression models.
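As an illustration of the threshold estimation described above, the following is a minimal sketch of choosing a threshold on a training set by maximizing Youden's index. It is not the authors' implementation: the function names `youden_j` and `youden_threshold` are hypothetical, the scores and labels are made-up toy values, and restricting the candidate thresholds to the observed scores is an assumption made for simplicity.

```python
def youden_j(scores, labels, threshold):
    """Youden's index (sensitivity + specificity - 1) at a given threshold.

    scores: predicted probabilities of belonging to class 1.
    labels: true classes (0 or 1).
    A sample is classified in class 1 when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp / n_pos + tn / n_neg - 1


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's index over the observed scores."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        j = youden_j(scores, labels, t)
        if j > best_j:  # ties keep the smallest threshold
            best_t, best_j = t, j
    return best_t, best_j


# Toy training data (made up): 2 minority-class and 6 majority-class samples.
train_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45]
train_labels = [0, 0, 0, 0, 0, 1, 0, 1]

threshold, j_best = youden_threshold(train_scores, train_labels)
j_default = youden_j(train_scores, train_labels, 0.5)  # commonly used threshold
print(threshold, j_best, j_default)
```

In this toy example no minority-class sample reaches a score of 0.5, so the commonly used threshold classifies every sample in the majority class, whereas the estimated threshold is lower. As in the abstract, a threshold chosen this way would then need to be evaluated on an independent test set.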
Author(s): Lara Lusa, Rok Blagus