📎Sampling Metrics
Sampling real-time data for model retraining
Data Sampling
"Not all data is equal"
Expert anecdote by Jennifer Prendki, founder and CEO of Alectio
If you care about your nutrition, you don’t go to the supermarket and randomly select items from the shelves. You might eventually get the nutrients you need by eating random items from supermarket shelves, but you will eat a lot of junk food in the process. I think it is weird that in machine learning, people still think it’s better to sample the supermarket randomly than figure out what they need and focus their efforts there.
Credits: Human-in-the-Loop Machine Learning by Robert Munro
The assumption is that some data points are more valuable for the model than others. The focus is on how to identify these valuable data points.
Uncertainty Sampling Techniques:
The following techniques help sample the data points where the model is most uncertain:
Least Confidence Sampling
Marginal Confidence
Ratio Confidence
Entropy Confidence
Before we dive deep into each of the sampling techniques, we will briefly discuss the place of sampling in the whole MLOps lifecycle.
Least Confidence Sampling
Least Confidence Sampling is the most common method for uncertainty sampling. It takes the difference between 100% confidence and the probability of the most confidently predicted label for each item. Least confidence is sensitive to the base used for the softmax algorithm. The least confidence score lies in the 0-1 range, where 1 is most uncertain.
Formula:
The model queries the instances where it is least confident in its prediction.
If the model predicts the class ŷ = argmax P(y∣x), where P(y∣x) is the probability of class y given input x, the uncertainty score is:
Uncertainty(x) = 1 − P(ŷ∣x)
Here, P(ŷ∣x) is the model's predicted probability for the most likely class.
Thresholds:
Threshold value: you can set a threshold by selecting, for example, the top 10% of samples with the lowest confidence (highest uncertainty). Out of 1,000 samples, pick the 100 where the model is least confident.
Intuition:
This technique queries the samples where the model is "unsure" of its prediction. It only looks at the highest prediction probability and assesses its confidence.
Imagine a binary classifier trying to determine whether an image is a cat or a dog. If it predicts "cat" with 55% confidence, the model is not very sure, making it a good candidate for sampling.
Example:
Let's assume a multi-class problem with 3 classes (dog, cat, bird):
P(dog∣x)=0.55
P(cat∣x)=0.35
P(bird∣x)=0.10
The model predicts "dog" since it's the most likely class, but its confidence is only P(dog∣x) = 0.55. The uncertainty is 1 − 0.55 = 0.45, so this instance is not very confidently predicted.
Based on this, you may decide to query the instances where the model is least confident in its top prediction.
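To make this concrete, here is a minimal sketch (not from the original text) of computing Least Confidence scores with NumPy and querying a fixed fraction of the most uncertain samples; the probability matrix and the 40% query fraction are made-up values for illustration.

```python
import numpy as np

# Illustrative softmax outputs for 5 samples over 3 classes (dog, cat, bird).
probs = np.array([
    [0.55, 0.35, 0.10],
    [0.90, 0.07, 0.03],
    [0.40, 0.35, 0.25],
    [0.98, 0.01, 0.01],
    [0.50, 0.45, 0.05],
])

# Least Confidence: 1 minus the probability of the most likely class.
least_confidence = 1.0 - probs.max(axis=1)

# Query the 40% most uncertain samples (2 of 5 here) for labeling/retraining.
n_query = int(0.4 * len(probs))
query_idx = np.argsort(least_confidence)[::-1][:n_query]

print(least_confidence)  # [0.45 0.1  0.6  0.02 0.5 ]
print(query_idx)         # [2 4] -> the two least confident predictions
```

The same pattern scales to the "top 10% of 1,000 samples" example under Thresholds above: only the pool size and the query fraction change.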
Best For:
Imbalanced Classification: If your data has a dominant class, LC works well because it focuses on the least confident predictions.
Simple Classification Problems: When your model performs well and you just need to focus on the hardest cases where it’s least confident.
Binary Classification: LC is particularly efficient in binary classification since it focuses on the most uncertain cases between two classes.
Not Ideal For:
Multi-Class Classification with Close Predictions: If multiple classes have similar probabilities, LC might miss some nuanced uncertainty because it only looks at the top class.
Scenarios with High Confidence Predictions: If your model has very high confidence in most predictions, LC may not differentiate well between uncertain cases.
Example:
Scenario: A medical diagnosis task where you have one dominant class (healthy) and several rare classes (different types of diseases), and you want to focus on the less confident instances where the model is unsure about the disease class.
Use LC: When you're unsure which cases may be false negatives and need to catch edge cases where the model is least confident.
Pros:
Simple to implement and works well for many classification tasks.
Cons:
It doesn’t consider the second-best prediction. For example, if the top two classes have very similar probabilities (e.g., 51% vs. 49%), LC might not capture this uncertainty well.
Marginal Confidence
The most intuitive form of uncertainty sampling is the difference between the two most confident predictions. Margin of confidence is less sensitive than least confidence sampling to the base used for the softmax algorithm, but it is still sensitive. The margin-of-confidence score lies in the 0-1 range, where 1 is most uncertain.
Intuition:
Instead of only looking at the most confident prediction, this strategy examines how close the top two predicted classes are. If they’re close, the model is more uncertain.
The smaller the difference (margin) between the top two predicted classes, the more uncertain the model is.
Formula:
Let P(y1∣x) and P(y2∣x) be the probabilities of the first and second most probable classes, respectively.
The uncertainty score is
Uncertainty(x)=P(y1∣x)−P(y2∣x)
A smaller margin means higher uncertainty.
Example:
P(dog∣x)=0.50
P(cat∣x)=0.40
P(bird∣x)=0.10
Here, the margin between the most probable class (dog) and the second most probable class (cat) is 0.50 − 0.40 = 0.10. Since the margin is small, the model is highly uncertain about whether the correct class is "dog" or "cat."
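A small sketch, using the same illustrative probabilities as the example above, of how the margin score could be computed; mapping it to the 0-1 scale via 1 − margin is one common convention, not a requirement.

```python
import numpy as np

# Illustrative softmax output for one sample (dog, cat, bird).
probs = np.array([0.50, 0.40, 0.10])

# Sort descending and take the two most probable classes.
top2 = np.sort(probs)[::-1][:2]

# Margin of confidence: difference between the top two probabilities.
margin = top2[0] - top2[1]   # 0.50 - 0.40 = 0.10 -> small margin, high uncertainty
uncertainty = 1.0 - margin   # one way to map to a 0-1 score where 1 is most uncertain

print(margin, uncertainty)   # ≈ 0.1, 0.9
```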
Pros:
Takes into account more information by looking at the top two classes.
More robust than Least Confidence Sampling when two classes are close in probability.
Cons:
Ignores the rest of the class probabilities.
May still miss uncertainty when the third and fourth class probabilities are significant.
Best For:
Multi-Class Problems with Close Confidences: When your task involves many classes, and the top two predictions are often close, MS captures the uncertainty better than LC.
Problems with Class Ambiguity: Useful when the task involves classes that are often confused with each other, as it focuses on the difference between the top two predictions.
Intermediate Models: If your model is not overconfident, but makes close predictions between classes, MS is a good choice to identify cases where two classes are equally likely.
Not Ideal For:
Highly Confident Models: If your model’s top prediction is much more confident than the second best, MS won’t offer much insight.
Highly Imbalanced Classes: Margin sampling may miss rare classes since it focuses only on the difference between the top two predictions, not the overall distribution.
Example:
Scenario: In image classification where the model frequently confuses “cat” and “dog,” MS will help by identifying the images where the model is equally uncertain between these two classes.
Use MS: When classes are ambiguous and you want to focus on improving classification where two (or more) classes are nearly indistinguishable.
Ratio Confidence
Ratio of confidence is a variation on margin of confidence that looks at the ratio between the top two scores instead of the difference. Ratio of confidence is invariant across any base used in softmax. The ratio-of-confidence score lies in the 0-1 range, where 1 is most uncertain.
Intuition:
This method compares the ratio between the probability of the most confident class and the second most confident class.
A low ratio means the model is relatively uncertain because the second-best class is nearly as likely as the best one.
Formula:
The uncertainty score is based on the ratio
Uncertainty(x) = P(y2∣x) / P(y1∣x)
The higher the ratio (closer to 1), the more uncertain the model is.
Example:
P(dog∣x)=0.50
P(cat∣x)=0.40
P(bird∣x)=0.10
In this case, the ratio is:
P(cat∣x) / P(dog∣x) = 0.40 / 0.50 = 0.80
Since the ratio is high, the model is quite uncertain between "dog" and "cat."
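As a sketch under the same illustrative probabilities, the ratio score could be computed as follows; it matches the formula above (second-best probability divided by the best).

```python
import numpy as np

# Illustrative softmax output for one sample (dog, cat, bird).
probs = np.array([0.50, 0.40, 0.10])

# Take the two most probable classes.
top2 = np.sort(probs)[::-1][:2]

# Ratio of confidence: second most probable class divided by the most probable one.
ratio = top2[1] / top2[0]

print(ratio)   # ≈ 0.8 -> close to 1, so the model is uncertain between dog and cat
```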
Pros:
More flexible than margin sampling because it considers the relative confidence.
Particularly useful when the model has comparable confidence for two classes.
Cons:
Sensitive to close probabilities but ignores other less likely classes.
Computationally more intensive than Least Confidence Sampling.
Best For:
Highly Confident Predictions with Close Runner-Up: Ratio sampling is useful when the model has very high confidence in one class but also considers another class highly probable.
Fine-Grained Uncertainty: If you want to investigate whether a high-confidence prediction is truly reliable by seeing how close the second-highest class is.
Calibration Checks: Good for detecting if a model is overconfident, as it can highlight when the difference between the top two predictions is minimal, even if the top class has high confidence.
Not Ideal For:
High Entropy Models: If your model distributes probability across many classes, ratio sampling might not be effective since it only looks at the top two.
Binary Classification: In binary cases, Ratio Sampling is similar to Least Confidence Sampling, and it might not provide additional benefits.
Example:
Scenario: In speech recognition, where the model assigns high probability to one word (e.g., "cat") but nearly as much to another ("bat"), ratio sampling will help identify cases where the model appears confident yet is close in its second-best guess.
Use Ratio Sampling: When the top two predictions are close, even though the model appears confident, and you want to validate predictions with a small margin between them.
Entropy Confidence
Entropy measures the information (surprise) content of the model's prediction. High entropy occurs when the class probabilities are nearly equal. So in our case, a high (normalized) entropy of 1 means the model is most confused.
Intuition:
Entropy measures the uncertainty in the probability distribution across all classes. Higher entropy means the prediction is more uncertain.
It considers the full distribution of predicted probabilities, making it a comprehensive method to assess uncertainty.
Formula:
For a classification problem with k classes and predicted probabilities P(y1∣x), P(y2∣x), …, P(yk∣x), the entropy is:
Entropy(x) = −∑(i=1,k) P(yi∣x)logP(yi∣x)
The higher the entropy, the more uncertain the model is about the prediction.
Example:
P(dog∣x)=0.55
P(cat∣x)=0.35
P(bird∣x)=0.10
The entropy is:
Entropy(x) = −[0.55 log 0.55 + 0.35 log 0.35 + 0.10 log 0.10] ≈ 0.93 (using the natural logarithm)
This score reflects the overall uncertainty in the prediction, taking all three classes into account.
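A minimal sketch of the entropy computation, using the natural logarithm as in the worked example; the epsilon and the log(k) normalization are implementation choices, not part of the formula above.

```python
import numpy as np

# Illustrative softmax output for one sample (dog, cat, bird).
probs = np.array([0.55, 0.35, 0.10])

# Entropy with the natural logarithm; the tiny epsilon guards against log(0).
entropy = -np.sum(probs * np.log(probs + 1e-12))
print(entropy)       # ≈ 0.93

# Optional: divide by log(k) so the score lies in 0-1, where 1 is most uncertain.
normalized = entropy / np.log(len(probs))
print(normalized)    # ≈ 0.84
```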
Pros:
Considers the entire distribution of class probabilities.
Captures the overall uncertainty, not just between the top predictions.
Cons:
More computationally expensive, as it requires evaluating the logarithms of probabilities.
May not always provide significantly better results than simpler methods like margin sampling.
Best For:
Multi-Class Problems: Entropy is particularly well-suited for tasks where there are many classes, and the model’s uncertainty is spread across several of them.
Highly Confident and Highly Uncertain Models: If your model gives uncertain predictions across all classes (high entropy), it helps find instances where the model has no clear decision. It also helps detect cases where many classes have similar probabilities.
Complex Tasks (e.g., NLP, Vision): ES is useful in tasks that involve complex and diverse data (e.g., object detection or natural language understanding) where the uncertainty might not be concentrated on just two classes.
Not Ideal For:
Binary Classification: In binary classification, entropy sampling usually reduces to Least Confidence Sampling, so it may not offer significant advantages.
Low Computational Resources: Entropy sampling is computationally expensive as it requires calculating the full probability distribution and taking the logarithm of each class probability.
Example:
Scenario: In an object detection model for autonomous vehicles, where there are many possible objects (cars, people, bikes, etc.) and the model has uncertainty spread across these classes, entropy sampling helps capture the overall uncertainty.
Use ES: When the prediction is uncertain across many classes (e.g., 30% cat, 30% dog, 20% bird, 20% other), and you want to address cases where there’s a lot of indecision between classes.
Summary of Pros and Cons for Multi-Class Classification
| Sampling Method | Pros | Cons |
| --- | --- | --- |
| Least Confidence (LC) | Simple, efficient, works well when the top prediction dominates. | Ignores how close other class probabilities are. |
| Margin Sampling (MS) | More sensitive to differences between the top two predictions. | Ignores the full probability distribution; only considers the top two. |
| Ratio Sampling | Captures relative uncertainty between top classes. | Ignores the rest of the class distribution. |
| Entropy Sampling (ES) | Considers the entire probability distribution; comprehensive measure of uncertainty. | More computationally expensive; may not be necessary for all use cases. |
Thresholds
1. Classification vs. Uncertainty Sampling Threshold
| Factor | Classification Threshold | Uncertainty Sampling Threshold |
| --- | --- | --- |
| Purpose | Decides the cutoff for assigning a class label (e.g., 0.5 for binary classification). | Proportion of least confident predictions chosen for retraining or further examination. |
| Effect on Confidence | Higher threshold results in fewer positive classifications but more confident predictions. | Higher threshold selects more uncertain instances (e.g., top 10% uncertain samples). |
| Impact on Sampling | Higher thresholds result in more confident predictions, reducing the pool of uncertain data points. | A higher uncertainty sampling threshold selects a broader range of uncertain data, useful for retraining. |
| Interaction | A lower classification threshold increases the number of uncertain instances, and vice versa. | Higher uncertainty sampling thresholds are usually paired with lower classification thresholds. |
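The sketch below contrasts the two thresholds on made-up binary-classifier outputs: the classification threshold assigns the served label, while the uncertainty sampling threshold (expressed as a fraction of the pool) picks which samples go back for labeling.

```python
import numpy as np

# Hypothetical positive-class probabilities from a binary classifier.
p_positive = np.array([0.95, 0.62, 0.48, 0.55, 0.91, 0.30, 0.51, 0.05])

# Classification threshold: decides the label that is served in production.
clf_threshold = 0.5
labels = (p_positive >= clf_threshold).astype(int)

# Uncertainty sampling threshold: fraction of least confident samples to re-label.
sampling_fraction = 0.25
confidence = np.maximum(p_positive, 1 - p_positive)   # confidence in the predicted label
uncertainty = 1 - confidence                           # least-confidence score
n_query = max(1, int(sampling_fraction * len(p_positive)))
query_idx = np.argsort(uncertainty)[::-1][:n_query]

print(labels)       # [1 1 0 1 1 0 1 0]
print(query_idx)    # [6 2] -> the 25% least confident predictions
```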
2. Precision and Recall Influence on Thresholds
| Use Case | Precision | Recall | Classification Threshold | Uncertainty Sampling Method | Uncertainty Sampling Threshold |
| --- | --- | --- | --- | --- | --- |
| Fraud Detection | High | Moderate | High (e.g., 0.8) | Margin Sampling to refine decision boundaries | 10-15% (target least confident predictions) |
| Medical Diagnosis (Tumor Detection) | Moderate | High | Low (e.g., 0.5) | Least Confidence Sampling for identifying possible false negatives | 10-20% (target lowest confidence cases) |
| Spam Detection | Balanced | Balanced | Medium (e.g., 0.5) | Entropy Sampling to capture uncertainty in various classes | 5-10% (focus on medium uncertainty cases) |
| Autonomous Driving | High | High | High (e.g., 0.8) | Least Confidence or Entropy Sampling to ensure model certainty in safety-critical decisions | 5-10% (limit to highly uncertain scenarios) |
3. Examples of Thresholds for Different Classification Types
| Classification Type | Scenario | Sampling Method | Suggested Threshold | Explanation |
| --- | --- | --- | --- | --- |
| Binary Classification | Fraud detection | Least Confidence | 10-15% | Focus on least confident predictions to refine the decision boundary and avoid false positives. |
| Multi-Class Classification | Skin lesion classification | Entropy Sampling | 15-20% | Capture samples where the model has similar confidence in multiple classes (high entropy). |
| Multi-Label Classification | Content moderation (spam, hate speech) | Margin Sampling | 5-10% | Target instances where the model struggles between two or more labels, improving recall. |
4. Best Practices for Threshold Selection Based on Scenarios
| Scenario | Approach | Best Method | Threshold |
| --- | --- | --- | --- |
| New Model in Initial Stages | Start with a moderate threshold (10-20%) to focus on the least confident samples. | Least Confidence Sampling | 10-20% |
| Model Maturing with High Accuracy | Gradually lower the uncertainty sampling threshold to focus on fewer, more challenging cases. | Entropy Sampling | 5-10% |
| High Precision Use Case (e.g., fraud detection) | Prioritize precision by capturing uncertain predictions with potential false positives. | Margin Sampling | 10-15% |
| High Recall Use Case (e.g., medical diagnosis) | Prioritize recall by focusing on uncertain samples with potential false negatives. | Least Confidence Sampling | 10-20% |
| Balanced Precision-Recall Use Case (e.g., spam detection) | Use Entropy Sampling to capture uncertainty across all classes and labels. | Entropy Sampling | 5-10% |
| Edge Cases or Rare Classes | Focus on edge cases where the model is least confident, particularly for rare classes. | Least Confidence or Margin Sampling | 10-20% |
5. Thresholds in Relation to Classification Metrics
| Metric | Effect on Thresholds | Recommended Approach |
| --- | --- | --- |
| Precision-Focused | Higher uncertainty sampling threshold to avoid false positives. | Use Margin Sampling with a moderate threshold (10-15%). |
| Recall-Focused | Lower classification threshold, higher sampling threshold to catch false negatives. | Use Least Confidence Sampling with a higher threshold (10-20%). |
| Balanced Precision-Recall | Focus on entropy to balance between catching both false positives and false negatives. | Use Entropy Sampling with a medium threshold (5-10%). |
6. Threshold Selection for Different Scenarios
| Sampling Method | Scenario | Example Thresholds | Explanation |
| --- | --- | --- | --- |
| Least Confidence | Fraud detection (binary classification) | 0.2 (query samples where confidence < 0.8) | Target cases where the model is less than 80% confident in its predictions. |
| Margin Sampling | Fine-grained image classification (multi-class) | 0.05 (query samples with a small margin between the top 2 classes) | Focus on instances where the model is confused between two similar classes. |
| Entropy Sampling | Medical diagnosis (multi-class classification) | 1.0 (query instances with high entropy) | Target samples with the highest uncertainty across all classes. |
| Ratio Sampling | Text classification (binary) | 1.2 (query based on the ratio of the top two predictions) | Focus on cases where the model has low certainty between the top two predictions. |
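A sketch of how the absolute example thresholds in the table above might be applied to a batch of softmax outputs; the probability values and cutoffs are illustrative assumptions, not recommendations.

```python
import numpy as np

# Illustrative softmax outputs for 3 samples over 3 classes.
probs = np.array([
    [0.75, 0.20, 0.05],
    [0.48, 0.45, 0.07],
    [0.34, 0.33, 0.33],
])

sorted_probs = np.sort(probs, axis=1)[:, ::-1]
confidence = sorted_probs[:, 0]
margin = sorted_probs[:, 0] - sorted_probs[:, 1]
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Example absolute thresholds, mirroring the table above.
query_lc = confidence < 0.8       # least confidence
query_margin = margin < 0.05      # margin sampling
query_entropy = entropy > 1.0     # entropy sampling

print(query_lc)        # [ True  True  True]
print(query_margin)    # [False  True  True]
print(query_entropy)   # [False False  True]
```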
7. Consideration of Precision and Recall with Thresholds
| Precision/Recall Trade-off | Effect on Sampling Threshold | Example Threshold |
| --- | --- | --- |
| High Precision Use Case | Set a lower uncertainty sampling threshold to focus on confident, uncertain cases. | 0.1 (Least Confidence) |
| High Recall Use Case | Set a higher uncertainty sampling threshold to catch more possible false negatives. | 0.3 (Least Confidence) |
| Balanced Precision/Recall | Set a medium threshold to balance sampling between uncertain and confident cases. | 1.0 (Entropy Sampling) |
8. Best Practices for Setting Thresholds:
Adjust dynamically: Start with a moderate threshold (5-10%) and adjust as you observe model performance and uncertainty distribution in your data.
Analyze Confidence Distribution: Use histograms or other visual tools to understand how the model's confidence is spread across the dataset. This can help you choose the right uncertainty sampling method and threshold.
Iterative Approach: Thresholds should be adaptive. As your model becomes more confident (and accurate), lower your threshold to focus on the remaining difficult cases. Conversely, if your model struggles, increase the threshold to gather more uncertain data points.
Use domain knowledge: If you know that certain types of misclassifications are more critical for your use case, adjust your uncertainty sampling method and threshold accordingly.
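As a rough illustration of the "analyze confidence distribution" advice above, one way is to plot a histogram of the model's top-class probabilities and derive the cutoff from a quantile; the data here is simulated, and the 10% query fraction is only an example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated top-class probabilities for 1,000 unlabeled samples.
rng = np.random.default_rng(0)
confidence = rng.beta(5, 2, size=1000)   # skewed toward confident predictions

# Visualize how confidence is spread across the pool.
plt.hist(confidence, bins=30)
plt.xlabel("Top-class probability")
plt.ylabel("Number of samples")
plt.title("Confidence distribution of the unlabeled pool")
plt.show()

# Pick the cutoff so that roughly the 10% least confident samples are queried.
cutoff = np.quantile(confidence, 0.10)
print(f"Query samples with confidence below {cutoff:.2f}")
```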
Summary:
Least Confidence Sampling is effective when focusing on instances where the model is least certain overall, and works well for both binary and multi-class tasks.
Margin Sampling is useful when the model is confused between two classes, making it a good choice when the focus is on fine-tuning decision boundaries.
Entropy Sampling captures the full distribution of uncertainty across all classes, making it useful for multi-class or multi-label classification.
The sampling threshold depends on the classification task's nature, the precision-recall trade-off, and the stage of model development. Start with moderate thresholds and adapt based on your model's performance and confidence distribution.