📎Sampling Metrics

Sampling real-time data for model retraining

Data Sampling

"Not all data is equal"

Expert anecdote by Jennifer Prendki, founder and CEO of Alectio

If you care about your nutrition, you don’t go to the supermarket and randomly select items from the shelves. You might eventually get the nutrients you need by eating random items from supermarket shelves, but you will eat a lot of junk food in the process. I think it is weird that in machine learning, people still think it’s better to sample the supermarket randomly than figure out what they need and focus their efforts there.

Credit: Human-in-the-Loop Machine Learning by Robert (Munro) Monarch

The assumption is that some data points are more valuable for the model than others. The focus is on how to identify these valuable data points.

Uncertainty Sampling Techniques:

The following techniques help to sample the data points where the model is most uncertain:

  • Least Confidence Sampling

  • Marginal Confidence

  • Ratio Confidence

  • Entropy Confidence

Before we dive deep into each of the sampling techniques, we will briefly discuss where sampling fits in the overall MLOps lifecycle.

Least Confidence Sampling

Least Confidence Sampling is the most common method for uncertainty sampling. It takes the difference between 100% confidence and the probability of the most confidently predicted label for each item. Least confidence is sensitive to the base used for the softmax algorithm. Scores fall in the 0-1 range, where 1 is most uncertain.

Formula:

  • The model queries the instances where it is least confident in its prediction.

  • If the model predicts class ŷ = argmax P(y∣x), where P(y∣x) is the probability of class y given input x, the uncertainty score is:

Uncertainty(x) = 1 − P(ŷ∣x)

  • Here, P(ŷ∣x) is the model's predicted probability for the most likely class.

Thresholds:

  • Threshold value: you can set a threshold by selecting, for example, the top 10% of samples with the lowest confidence (highest uncertainty). Out of 1000 samples, pick the 100 where the model is least confident.

Intuition:

  • This technique queries the samples where the model is "unsure" of its prediction. It only looks at the highest prediction probability and assesses its confidence.

  • Imagine a binary classifier trying to determine whether an image is a cat or a dog. If it predicts "cat" with 55% confidence, the model is not very sure, making it a good candidate for sampling.

Example:

Let's assume a multi-class problem with 3 classes (dog, cat, bird):

  • P(dog∣x)=0.55

  • P(cat∣x)=0.35

  • P(bird∣x)=0.10

The model predicts "dog" since it's the most likely class, but its confidence is P(dog∣x) = 0.55. The uncertainty is 1 − 0.55 = 0.45, so this instance is not very confidently predicted.

Based on this, you may decide to query instances where the model is least confident about at least one of the labels.
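As a rough sketch, the scoring and the top-10% selection described above might look like this in NumPy (the function name and the query budget are illustrative assumptions, not from any particular library):

```python
import numpy as np

def least_confidence_scores(probs: np.ndarray) -> np.ndarray:
    """Uncertainty(x) = 1 - P(y_hat | x) for each row of class probabilities."""
    return 1.0 - probs.max(axis=1)

# Predicted class probabilities for a small batch (rows sum to 1).
probs = np.array([
    [0.55, 0.35, 0.10],  # dog/cat/bird example from above -> score 0.45
    [0.90, 0.07, 0.03],  # confident prediction            -> score 0.10
])
scores = least_confidence_scores(probs)

# Query the most uncertain 10% of samples (the budget is a tunable choice).
budget = max(1, int(0.10 * len(scores)))
query_indices = np.argsort(-scores)[:budget]
print(scores)         # ≈ [0.45 0.10]
print(query_indices)  # [0]
```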

Best For:

  • Imbalanced Classification: If your data has a dominant class, LC works well because it focuses on the least confident predictions.

  • Simple Classification Problems: When your model performs well and you just need to focus on the hardest cases where it’s least confident.

  • Binary Classification: LC is particularly efficient in binary classification since it focuses on the most uncertain cases between two classes.

Not Ideal For:

  • Multi-Class Classification with Close Predictions: If multiple classes have similar probabilities, LC might miss some nuanced uncertainty because it only looks at the top class.

  • Scenarios with High Confidence Predictions: If your model has very high confidence in most predictions, LC may not differentiate well between uncertain cases.

Example:

  • Scenario: A medical diagnosis task with one dominant class (healthy) and several rare classes (different types of diseases). You want to focus on the less confident instances where the model is unsure about the disease class.

  • Use LC: When you're unsure which cases may be false negatives and need to catch edge cases where the model is least confident.

Pros:

  • Simple to implement and works well for many classification tasks.

Cons:

  • It doesn’t consider the second-best prediction. For example, if the top two classes have very similar probabilities (e.g., 51% vs. 49%), LC might not capture this uncertainty well.

Marginal Confidence

The most intuitive form of uncertainty sampling is the difference between the two most confident predictions. Margin of confidence is less sensitive than least confidence sampling to the base used for the softmax algorithm, but it is still sensitive. Expressed as 1 minus the margin, scores fall in the 0-1 range, where 1 is most uncertain.

Intuition:

  • Instead of only looking at the most confident prediction, this strategy examines how close the top two predicted classes are. If they’re close, the model is more uncertain.

  • The smaller the difference (margin) between the top two predicted classes, the more uncertain the model is.

Formula:

  • Let P(y1∣x) and P(y2∣x) be the probabilities of the first and second most probable classes, respectively.

  • The uncertainty score is:

Uncertainty(x) = P(y1∣x) − P(y2∣x)

  • A smaller margin means higher uncertainty.

Example:

  • P(dog∣x)=0.50

  • P(cat∣x)=0.40

  • P(bird∣x)=0.10

Here, the margin between the most probable class (dog) and the second most probable class (cat) is 0.50 − 0.40 = 0.10. Since the margin is small, the model is highly uncertain about whether the correct class is "dog" or "cat."
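A minimal NumPy sketch of margin scoring under the same assumptions (illustrative names and toy probabilities):

```python
import numpy as np

def margin_scores(probs: np.ndarray) -> np.ndarray:
    """Margin = P(y1|x) - P(y2|x); a smaller margin means higher uncertainty."""
    top2 = np.sort(probs, axis=1)[:, -2:]  # two largest probabilities per row
    return top2[:, 1] - top2[:, 0]

probs = np.array([
    [0.50, 0.40, 0.10],  # close top two -> margin 0.10 (uncertain)
    [0.90, 0.07, 0.03],  # clear winner  -> margin 0.83 (confident)
])
margins = margin_scores(probs)

# For querying, pick the samples with the *smallest* margins.
query_indices = np.argsort(margins)[:1]
print(margins)        # ≈ [0.10 0.83]
print(query_indices)  # [0]
```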

Pros:

  • Takes into account more information by looking at the top two classes.

  • More robust than Least Confidence Sampling when two classes are close in probability.

Cons:

  • Ignores the rest of the class probabilities.

  • May still miss uncertainty when the third and fourth class probabilities are significant.

Best For:

  • Multi-Class Problems with Close Confidences: When your task involves many classes, and the top two predictions are often close, MS captures the uncertainty better than LC.

  • Problems with Class Ambiguity: Useful when the task involves classes that are often confused with each other, as it focuses on the difference between the top two predictions.

  • Intermediate Models: If your model is not overconfident, but makes close predictions between classes, MS is a good choice to identify cases where two classes are equally likely.

Not Ideal For:

  • Highly Confident Models: If your model’s top prediction is much more confident than the second best, MS won’t offer much insight.

  • Highly Imbalanced Classes: Margin sampling may miss rare classes since it focuses only on the difference between the top two predictions, not the overall distribution.

Example:

  • Scenario: In image classification where the model frequently confuses “cat” and “dog,” MS will help by identifying the images where the model is equally uncertain between these two classes.

  • Use MS: When classes are ambiguous and you want to focus on improving classification where two (or more) classes are nearly indistinguishable.

Ratio Confidence

Ratio of confidence is a variation on margin of confidence, looking at the ratio between the top two scores instead of the difference. Ratio of confidence is invariant across any base used in softmax. Scores fall in the 0-1 range, where 1 is most uncertain.

Intuition:

  • This method compares the ratio between the probability of the most confident class and the second most confident class.

  • A ratio close to 1 means the model is relatively uncertain because the second-best class is nearly as likely as the best one.

Formula:

  • The uncertainty score is based on the ratio

Uncertainty(x) = P(y2∣x) / P(y1∣x)

  • The higher the ratio (closer to 1), the more uncertain the model is.

Example:

  • P(dog∣x)=0.50

  • P(cat∣x)=0.40

  • P(bird∣x)=0.10

In this case, the ratio is:

P(cat∣x) / P(dog∣x) = 0.40 / 0.50 = 0.80

Since the ratio is high (close to 1), the model is quite uncertain between "dog" and "cat."
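A similar sketch for the ratio score, again with illustrative names and toy inputs:

```python
import numpy as np

def ratio_scores(probs: np.ndarray) -> np.ndarray:
    """Ratio = P(y2|x) / P(y1|x); values close to 1 mean high uncertainty."""
    top2 = np.sort(probs, axis=1)[:, -2:]  # two largest probabilities per row
    return top2[:, 0] / top2[:, 1]

probs = np.array([
    [0.50, 0.40, 0.10],  # ratio 0.40 / 0.50 = 0.80 (uncertain)
    [0.90, 0.07, 0.03],  # ratio 0.07 / 0.90 ≈ 0.08 (confident)
])
print(ratio_scores(probs))  # ≈ [0.80 0.08]
```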

Pros:

  • More flexible than margin sampling because it considers the relative confidence.

  • Particularly useful when the model has comparable confidence for two classes.

Cons:

  • Sensitive to close probabilities but ignores other less likely classes.

  • Computationally more intensive than Least Confidence Sampling.

Best For:

  • Highly Confident Predictions with Close Runner-Up: Ratio sampling is useful when the model has very high confidence in one class but also considers another class highly probable.

  • Fine-Grained Uncertainty: If you want to investigate whether a high-confidence prediction is truly reliable by seeing how close the second-highest class is.

  • Calibration Checks: Good for detecting if a model is overconfident, as it can highlight when the difference between the top two predictions is minimal, even if the top class has high confidence.

Not Ideal For:

  • High Entropy Models: If your model distributes probability across many classes, ratio sampling might not be effective since it only looks at the top two.

  • Binary Classification: In binary cases, Ratio Sampling is similar to Least Confidence Sampling, and it might not provide additional benefits.

Example:

  • Scenario: In speech recognition, where the model assigns probability 0.48 to one word (e.g., "cat") and 0.45 to another ("bat"), ratio sampling will help identify cases where the top prediction has a nearly equally likely second-best guess.

  • Use Ratio Sampling: When the top two predictions are close, even though the model appears confident, and you want to validate predictions with a small margin between them.

Entropy Confidence

Entropy measures the information (surprise) in the model's prediction. High entropy occurs when the class probabilities are nearly equal, so a high normalized entropy of 1 means the model is maximally confused.

Intuition:

  • Entropy measures the uncertainty in the probability distribution across all classes. Higher entropy means the prediction is more uncertain.

  • It considers the full distribution of predicted probabilities, making it a comprehensive method to assess uncertainty.

Formula:

  • For a classification problem with k classes and predicted probabilities P(y1∣x), P(y2∣x), …, P(yk∣x), the entropy is:

Entropy(x) = −∑(i=1..k) P(yi∣x) log P(yi∣x)

  • The higher the entropy, the more uncertain the model is about the prediction.

Example:

  • P(dog∣x)=0.55

  • P(cat∣x)=0.35

  • P(bird∣x)=0.10

The entropy is:

Entropy(x) = −[0.55 log 0.55 + 0.35 log 0.35 + 0.10 log 0.10] ≈ 0.93 (using the natural logarithm)

This score reflects the overall uncertainty in the prediction, taking all three classes into account.
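A short NumPy sketch of entropy scoring, using the natural logarithm as in the worked example; the small epsilon guarding against log(0) is an implementation choice, not part of the formula:

```python
import numpy as np

def entropy_scores(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy over the full class distribution (natural log)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

probs = np.array([
    [0.55, 0.35, 0.10],  # spread-out distribution -> ≈ 0.93 nats
    [0.98, 0.01, 0.01],  # peaked distribution     -> ≈ 0.11 nats
])
print(entropy_scores(probs))
```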

Pros:

  • Considers the entire distribution of class probabilities.

  • Captures the overall uncertainty, not just between the top predictions.

Cons:

  • More computationally expensive, as it requires evaluating the logarithms of probabilities.

  • May not always provide significantly better results than simpler methods like margin sampling.

Best For:

  • Multi-Class Problems: Entropy is particularly well-suited for tasks where there are many classes, and the model’s uncertainty is spread across several of them.

  • Highly Confident and Highly Uncertain Models: If your model gives uncertain predictions across all classes (high entropy), it helps find instances where the model has no clear decision. It also helps detect cases where many classes have similar probabilities.

  • Complex Tasks (e.g., NLP, Vision): ES is useful in tasks that involve complex and diverse data (e.g., object detection or natural language understanding) where the uncertainty might not be concentrated on just two classes.

Not Ideal For:

  • Binary Classification: In binary classification, entropy sampling usually reduces to Least Confidence Sampling, so it may not offer significant advantages.

  • Low Computational Resources: Entropy sampling is computationally expensive as it requires calculating the full probability distribution and taking the logarithm of each class probability.

Example:

  • Scenario: In an object detection model for autonomous vehicles, where there are many possible objects (cars, people, bikes, etc.) and the model has uncertainty spread across these classes, entropy sampling helps capture the overall uncertainty.

  • Use ES: When the prediction is uncertain across many classes (e.g., 30% cat, 30% dog, 20% bird, 20% other), and you want to address cases where there’s a lot of indecision between classes.

Summary of Pros and Cons for Multi-Class Classification

| Method | Pros | Cons |
| --- | --- | --- |
| Least Confidence | Simple to implement; works well for many classification tasks | Ignores the second-best prediction, so it can miss near-ties |
| Marginal Confidence | Considers the top two classes; robust when they are close | Ignores the remaining class probabilities |
| Ratio Confidence | Captures the relative confidence of the top two classes | Sensitive only to the top two classes; ignores the rest |
| Entropy Confidence | Considers the entire probability distribution | More computationally expensive; not always better than simpler methods |

Thresholds

Best Practices for Setting Thresholds:

  • Adjust dynamically: Start with a moderate threshold (5-10%) and adjust as you observe model performance and uncertainty distribution in your data.

  • Analyze Confidence Distribution: Use histograms or other visual tools to understand how the model's confidence is spread across the dataset. This can help you choose the right uncertainty sampling method and threshold.

  • Iterative Approach: Thresholds should be adaptive. As your model becomes more confident (and accurate), lower your threshold to focus on the remaining difficult cases. Conversely, if your model struggles, increase the threshold to gather more uncertain data points.

  • Use domain knowledge: If you know that certain types of misclassifications are more critical for your use case, adjust your uncertainty sampling method and threshold accordingly.
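As an illustrative sketch of this iterative workflow (the 10% starting budget, the random stand-in scores, and all names are assumptions):

```python
import numpy as np

def select_top_fraction(scores: np.ndarray, fraction: float) -> np.ndarray:
    """Return indices of the `fraction` most uncertain samples (higher score = more uncertain)."""
    k = max(1, int(fraction * len(scores)))
    return np.argsort(-scores)[:k]

# Stand-in uncertainty scores for 1,000 unlabeled samples.
rng = np.random.default_rng(0)
scores = rng.random(1000)

# Analyze the confidence distribution before committing to a threshold.
hist, bin_edges = np.histogram(scores, bins=10)

# Start with a moderate budget and adjust it between labeling rounds
# as the model's performance and confidence distribution change.
queried = select_top_fraction(scores, fraction=0.10)  # 100 of 1000 samples
```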

Summary:

  • Least Confidence Sampling is effective when focusing on instances where the model is least certain overall, and works well for both binary and multi-class tasks.

  • Margin Sampling is useful when the model is confused between two classes, making it a good choice when the focus is on fine-tuning decision boundaries.

  • Ratio Sampling compares the relative confidence of the top two classes, helping flag predictions whose runner-up is almost as likely as the top choice.

  • Entropy Sampling captures the full distribution of uncertainty across all classes, making it useful for multi-class or multi-label classification.

The sampling threshold depends on the classification task's nature, the precision-recall trade-off, and the stage of model development. Start with moderate thresholds and adapt based on your model's performance and confidence distribution.
