This question might sound a little silly: how can a sophisticated, advanced, well-built statistical or machine learning model be even worse than my guess? Well, it can happen, and it happens quite often with financial data. You must have heard of “garbage in, garbage out”, and financial data are no exception. What if you have been trying to find gold in pure sand? Since the signal-to-noise ratio of financial data is almost always very low, chances are that you have tried very hard but ended up with a model that is no better, or even worse, than a simple guess. It’s not your fault; it’s the data’s.

But how do we know whether the data and the model bring any predictive power at all? From textbooks, we know many metrics for comparing the performance of classifiers, such as accuracy, the F1 score, and the area under the ROC curve (AUC). One can certainly use those quantities to compare different models, but my question here is really humble: I just want to know whether the input data are more than random noise and whether the model, together with the data, brings anything useful.

Now, suppose we are given a training set and a test set, and our goal is to classify the test set based on what we have learned from the training set. The class sizes can be very imbalanced, and the class distributions of the training set and the test set can be very different. Without any model, we can at least use the following simple strategies for classification:

  • Random guess: based on the total number of classes in the training set, randomly assign a class to each observation in the test set, with all classes equally likely.

  • Educated guess: based on the percentage of each class in the training set, assign classes to the test observations with those percentages as probabilities.

  • Majority guess: assign the class that appears most often in the training set to every observation in the test set.
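As a toy illustration, the three strategies can be sketched in R as follows; the class counts and test-set size here are hypothetical, chosen just to make the behavior visible:

```r
# Hypothetical class counts in a training set: class "a" dominates
train_class <- c(a = 70, b = 20, c = 10)
n_test <- 5                       # size of a (tiny) test set
set.seed(1)                       # for reproducibility
# Random guess: every class is equally likely
random_guess <- sample(names(train_class), n_test, replace = TRUE)
# Educated guess: classes drawn with their training-set proportions
educated_guess <- sample(names(train_class), n_test, replace = TRUE,
                         prob = train_class / sum(train_class))
# Majority guess: always predict the most frequent training class
majority_guess <- rep(names(which.max(train_class)), n_test)
```

Note that the majority guess is deterministic, while the other two vary from run to run, which is why they will be assessed by simulation below.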

Then we can compare the accuracy of our model to the accuracies achieved by the above guesses. For the random guess and the educated guess, we can run simulations to check whether the model-based accuracy is significantly better than the guesses; for the majority guess, we can simply compare the model-based accuracy to the majority-guess accuracy.

To assess how hard it is to beat the random guess and the educated guess, we can first repeat each guess a large number of times, say 1000 times, and then compute the percentage of those 1000 guessed accuracies that match or exceed the model-based accuracy; we call this percentage a p-value. If the p-value is very small, we can say that the model-based accuracy is significantly better than the corresponding guess.
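As a minimal sketch of this simulation idea, consider a made-up binary test set and a made-up model accuracy of 0.62; the p-value is simply the fraction of simulated random guesses that do at least as well:

```r
set.seed(42)
true_class <- rep(1:2, times = c(60, 40))  # hypothetical test labels
s <- 1000                                  # number of simulated guesses
# accuracy of each of the s random guesses on the test set
sim_acc <- replicate(s, mean(sample(1:2, length(true_class),
                                    replace = TRUE) == true_class))
# one-sided p-value: fraction of guesses at least as accurate as the model
p_value <- mean(sim_acc >= 0.62)
```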

The following R function computes the guess-based accuracies and the corresponding p-values.

acc_lucky <- function(train_class, test_class, my_acc, s = 1000){
  # train_class, test_class: vectors of per-class observation counts
  nTrain_class <- length(train_class)
  nTest_class <- length(test_class)
  nTrain <- sum(train_class)
  nTest <- sum(test_class)
  if(nTrain_class != nTest_class){
    stop("Error: The numbers of classes in the test and train sets are different!")
  }
  # true labels of the test set, expanded from the class counts
  true_class <- unlist(sapply(seq_len(nTrain_class),
                              function(i){rep(i, test_class[i])}))

  # s simulated guesses; each row is one guess for the whole test set
  random_guess <- sample(1:nTrain_class, nTest*s, replace = TRUE)
  random_guess <- matrix(random_guess, s, nTest)
  educated_guess <- sample(1:nTrain_class, nTest*s,
                           prob = train_class/nTrain, replace = TRUE)
  educated_guess <- matrix(educated_guess, s, nTest)
  acc_random_guess <- apply(random_guess, 1,
                            function(xvec){sum(true_class == xvec) / nTest})
  acc_educated_guess <- apply(educated_guess, 1,
                              function(xvec){sum(true_class == xvec) / nTest})
  # majority guess: always predict the most frequent training class
  acc_majority_guess <- sum(
    true_class == rep(which.max(train_class), nTest)) / nTest
  # one-sided p-values: fraction of simulated guesses at least as accurate
  p_random_guess <- sum(my_acc <= acc_random_guess)/length(acc_random_guess)
  p_educated_guess <- sum(
    my_acc <= acc_educated_guess)/length(acc_educated_guess)
  list(my_accuracy = my_acc,
       p_random_guess = p_random_guess,
       p_educated_guess = p_educated_guess,
       mean_random_guess = mean(acc_random_guess),
       mean_educated_guess = mean(acc_educated_guess),
       acc_majority_guess = acc_majority_guess)
}

For example, suppose there are 3 classes in both the training set and the test set; the numbers of observations in the three training classes are 1223, 1322, and 1144, respectively; the numbers of observations in the three test classes are 345, 544, and 233, respectively; and the model-based accuracy is 0.45. Then the following calculation indicates whether the model is useful in terms of classification accuracy.

train_class <- c(1223,1322,1144)
test_class <- c(345,544,233)
my_acc <- 0.45
acc_lucky(train_class, test_class, my_acc)
## $my_accuracy
## [1] 0.45
## $p_random_guess
## [1] 0
## $p_educated_guess
## [1] 0
## $mean_random_guess
## [1] 0.3335045
## $mean_educated_guess
## [1] 0.340025
## $acc_majority_guess
## [1] 0.4848485

From the above example, we can conclude that the model-based accuracy is significantly better than both the random guess and the educated guess. However, it does not beat the majority guess. This situation is quite common for low-signal financial data, where a machine learning model, however sophisticated, may have learned only the class-imbalance pattern. Therefore, for prediction on financial market data, even the humble goal of outperforming a simple guess is not easy to achieve. Hence, we have to resort to better financial data and more research on methodologies.