Pattern 13: Confidence Scoring
Intent
Quantify the uncertainty and reliability of predictions, enabling appropriate human oversight, selective automation, and transparent communication about when predictions should be trusted versus questioned.
Also Known As
- Prediction Uncertainty
- Confidence Intervals
- Model Certainty
- Reliability Scoring
- Epistemic Uncertainty
Problem
Not all predictions are equally reliable.
An ML model predicts an 87% withdrawal risk for the Martinez family.
But critical questions remain:
- How confident is the model in this 87%?
- Is it based on strong evidence or weak signals?
- Have we seen families like Martinez before?
- Are the features complete, or is data missing?
- Is this prediction in-distribution or an edge case?
Two families, both predicted 87% risk:
Family A:
- Model confidence: HIGH
- 20 similar training cases (all withdrew)
- Complete feature data
- In-distribution (typical pattern)
- Action: Trust prediction, intervene urgently

Family B:
- Model confidence: LOW
- 2 similar training cases (conflicting outcomes)
- Missing payment history data
- Out-of-distribution (unusual pattern)
- Action: Flag for human review before acting
Same prediction, different confidence, different actions.
Without confidence scoring:
- Treat all predictions equally
- Act on unreliable predictions (false alarms)
- Miss nuances (model uncertainty ignored)
- No way to prioritize human review
- Can't explain "how sure are you?"
Context
When this pattern applies:
- Making automated decisions based on predictions
- Need to prioritize human review of uncertain cases
- Users ask "how confident are you?"
- Model predictions drive real-world actions
- Want to avoid false positives from low-confidence predictions
When this pattern may not be needed:
- All predictions manually reviewed anyway
- Low stakes (mistakes are cheap)
- Simple rules-based system (not ML)
- Users don't care about uncertainty
Forces
Competing concerns:
1. Simple vs Sophisticated
- Simple: Just use model probability as confidence
- Sophisticated: Calibration, ensembles, uncertainty quantification
- Balance: Start simple, add sophistication as needed

2. Confidence vs Coverage
- High confidence threshold = fewer predictions acted on
- Low confidence threshold = more predictions (some unreliable)
- Balance: Tune based on the cost of false positives vs false negatives

3. Interpretability vs Accuracy
- Users should be able to understand the confidence score
- But statistical rigor may be complex
- Balance: Combine multiple confidence signals, explain them clearly

4. Computational Cost vs Benefit
- Sophisticated confidence estimation is expensive
- But it is critical for high-stakes decisions
- Balance: Cache confidence, compute offline when possible

5. Known Unknowns vs Unknown Unknowns
- Some sources of uncertainty can be quantified
- But we can't know what we don't know
- Balance: Be humble, flag edge cases
Solution
Build multi-faceted confidence scoring that considers:
- Model Confidence - How certain is the ML model?
- Data Quality - Are features complete and reliable?
- Similarity to Training - Have we seen this pattern before?
- Consensus Across Methods - Do different approaches agree?
- Historical Accuracy - Has model been right in similar cases?
Confidence Score Formula:
Overall Confidence = f(
model_probability_confidence,
data_completeness_score,
training_similarity_score,
ensemble_agreement_score,
historical_accuracy_score
)
Result: 0-100 confidence score
- 80-100: High confidence (act automatically)
- 50-79: Medium confidence (act with caution)
- 0-49: Low confidence (human review required)
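For example, using the illustrative weights from the implementation below (0.30 model, 0.20 data quality, 0.20 similarity, 0.20 consensus, 0.10 historical) and component scores of 74, 95, 88, 85, and 90, the overall confidence is 0.30×74 + 0.20×95 + 0.20×88 + 0.20×85 + 0.10×90 = 84.8, which lands in the high-confidence band.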
Structure
Confidence Metadata Tables
-- Extend predictions with confidence scoring
ALTER TABLE ml_predictions ADD
confidence_score DECIMAL(5,2), -- Overall 0-100
confidence_level VARCHAR(20), -- 'high', 'medium', 'low'
confidence_factors NVARCHAR(MAX), -- JSON breakdown
requires_human_review BIT DEFAULT 0,
human_review_reason NVARCHAR(500);
-- Store confidence calibration data
CREATE TABLE confidence_calibration (
calibration_id INT PRIMARY KEY IDENTITY(1,1),
model_id INT NOT NULL,
-- Predicted probability bins
probability_bin_start DECIMAL(3,2), -- 0.80
probability_bin_end DECIMAL(3,2), -- 0.90
-- Actual outcomes in this bin
total_predictions INT,
actual_positives INT,
actual_rate DECIMAL(5,2), -- What % actually happened
-- Calibration
predicted_rate DECIMAL(5,2), -- What % model predicted (bin midpoint)
calibration_error DECIMAL(5,2), -- Difference
-- Time period
period_start DATE,
period_end DATE,
CONSTRAINT FK_calib_model FOREIGN KEY (model_id)
REFERENCES ml_models(model_id)
);
-- Store edge case flags
CREATE TABLE edge_case_detections (
detection_id INT PRIMARY KEY IDENTITY(1,1),
prediction_id INT NOT NULL,
edge_case_type VARCHAR(100), -- 'out_of_distribution', 'missing_features', 'novel_pattern'
severity VARCHAR(20), -- 'high', 'medium', 'low'
description NVARCHAR(500),
detected_date DATETIME2 DEFAULT GETDATE(),
CONSTRAINT FK_edge_prediction FOREIGN KEY (prediction_id)
REFERENCES ml_predictions(prediction_id)
);
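As a sketch of how the confidence_calibration table might be refreshed from resolved predictions (it assumes the same db query helper used in the implementation below, and that actual_outcome stores 'withdrew' for positive cases; both are assumptions for illustration):
// Illustrative refresh of confidence_calibration from resolved predictions.
async function refreshCalibration(db, modelId, periodStart, periodEnd) {
  for (let i = 0; i < 10; i++) {
    const binStart = i / 10;
    const binEnd = (i + 1) / 10;
    // Count resolved predictions whose probability falls in this bin
    const rows = await db.query(`
      SELECT
        COUNT(*) as total,
        SUM(CASE WHEN actual_outcome = 'withdrew' THEN 1 ELSE 0 END) as positives
      FROM ml_predictions
      WHERE model_id = ?
        AND predicted_probability >= ? AND predicted_probability < ?
        AND actual_outcome IS NOT NULL
        AND prediction_date BETWEEN ? AND ?
    `, [modelId, binStart, binEnd, periodStart, periodEnd]);

    const { total, positives } = rows[0];
    if (!total) continue; // skip empty probability bins

    const actualRate = (positives * 100.0) / total;        // what % actually happened
    const predictedRate = ((binStart + binEnd) / 2) * 100;  // bin midpoint, as a %

    await db.query(`
      INSERT INTO confidence_calibration (
        model_id, probability_bin_start, probability_bin_end,
        total_predictions, actual_positives, actual_rate,
        predicted_rate, calibration_error, period_start, period_end
      ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    `, [modelId, binStart, binEnd, total, positives, actualRate,
        predictedRate, Math.abs(actualRate - predictedRate), periodStart, periodEnd]);
  }
}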
Implementation
Confidence Scorer
class ConfidenceScorer {
constructor(db) {
this.db = db;
}
async calculateConfidence(predictionId) {
const prediction = await this.getPrediction(predictionId);
// Calculate each confidence component
const modelConf = this.calculateModelConfidence(prediction);
const dataQuality = await this.calculateDataQuality(prediction);
const similarity = await this.calculateTrainingSimilarity(prediction);
const consensus = await this.calculateEnsembleConsensus(prediction);
const historicalAcc = await this.calculateHistoricalAccuracy(prediction);
// Weighted combination
const weights = {
model: 0.30,
dataQuality: 0.20,
similarity: 0.20,
consensus: 0.20,
historical: 0.10
};
const overallScore =
(modelConf * weights.model) +
(dataQuality * weights.dataQuality) +
(similarity * weights.similarity) +
(consensus * weights.consensus) +
(historicalAcc * weights.historical);
const confidenceLevel = this.determineConfidenceLevel(overallScore);
const requiresReview = overallScore < 50;
// Detect edge cases
const edgeCases = await this.detectEdgeCases(prediction);
// Save confidence metadata
await this.saveConfidence(predictionId, {
confidence_score: overallScore,
confidence_level: confidenceLevel,
confidence_factors: {
model_confidence: modelConf,
data_quality: dataQuality,
training_similarity: similarity,
ensemble_consensus: consensus,
historical_accuracy: historicalAcc
},
requires_human_review: requiresReview || edgeCases.length > 0,
edge_cases: edgeCases
});
return {
overall_score: overallScore,
level: confidenceLevel,
requires_review: requiresReview,
factors: {
model_confidence: modelConf,
data_quality: dataQuality,
training_similarity: similarity,
ensemble_consensus: consensus,
historical_accuracy: historicalAcc
},
edge_cases: edgeCases
};
}
calculateModelConfidence(prediction) {
// How confident is the model based on predicted probability?
// High probability (close to 0 or 1) = high confidence
// Mid probability (around 0.5) = low confidence
const prob = prediction.predicted_probability;
// Distance from decision boundary (0.5)
const distanceFrom50 = Math.abs(prob - 0.5);
// Convert to 0-100 scale
// probability 0.5 gives confidence 0; probability 0 or 1 gives confidence 100
const confidence = (distanceFrom50 / 0.5) * 100;
return Math.min(100, confidence);
}
async calculateDataQuality(prediction) {
// How complete and reliable is the input data?
const features = JSON.parse(prediction.feature_values);
const featureNames = Object.keys(features);
let qualityScore = 100;
let missingCount = 0;
let defaultedCount = 0;
// Check for missing/default values
for (const [feature, value] of Object.entries(features)) {
if (value === null || value === undefined) {
missingCount++;
qualityScore -= 10; // -10 points per missing feature
} else if (this.isDefaultValue(feature, value)) {
defaultedCount++;
qualityScore -= 5; // -5 points per defaulted feature
}
}
// Check data recency
const daysSinceUpdate = await this.db.query(`
SELECT DATEDIFF(DAY, MAX(calculation_date), GETDATE()) as days
FROM family_engagement_metrics
WHERE family_id = ?
`, [prediction.family_id]);
const staleness = daysSinceUpdate[0]?.days || 0;
if (staleness > 7) {
qualityScore -= Math.min(30, staleness * 2); // Penalize stale data
}
return Math.max(0, qualityScore);
}
isDefaultValue(feature, value) {
// Detect if value is likely a default/imputed value
const defaults = {
engagement_score: 50,
email_open_rate: 50,
days_since_login: 999,
payment_on_time_rate: 50
};
return defaults[feature] === value;
}
async calculateTrainingSimilarity(prediction) {
// How similar is this case to training data?
// Use Pattern 11 logic to find similar cases
const features = JSON.parse(prediction.feature_values);
// Find K nearest neighbors in training set
const similarCases = await this.db.query(`
SELECT TOP 10
hp.*,
-- Calculate simple distance (would be more sophisticated in practice)
ABS(hp.engagement_score - ?) +
ABS(hp.payment_risk - ?) +
ABS(hp.withdrawal_risk - ?) as distance
FROM historical_patterns hp
WHERE hp.outcome IS NOT NULL
ORDER BY distance ASC
`, [
features.engagement_score,
features.payment_risk,
features.withdrawal_risk
]);
if (similarCases.length === 0) {
return 0; // No similar training cases = low confidence
}
// Average distance to top 10 neighbors
const avgDistance = similarCases.reduce((sum, c) => sum + c.distance, 0) / similarCases.length;
// Convert to similarity score (lower distance = higher similarity)
// Distance of 0 maps to similarity 100; distance of 50 or more maps to 0
const similarity = Math.max(0, 100 - (avgDistance * 2));
return Math.min(100, similarity);
}
async calculateEnsembleConsensus(prediction) {
// Do different prediction methods agree?
// Compare ML model with Pattern 11 pattern matching
const familyId = prediction.family_id;
// Get Pattern 11 prediction
const patternPrediction = await this.db.query(`
SELECT TOP 1 predicted_outcome, confidence
FROM pattern_predictions
WHERE family_id = ?
ORDER BY prediction_date DESC
`, [familyId]);
if (patternPrediction.length === 0) {
return 50; // No alternative method = medium confidence
}
const pp = patternPrediction[0];
// Do they agree on outcome?
const mlPredicts = prediction.predicted_probability > 0.5 ? 'withdrew' : 'remained';
const patternPredicts = pp.predicted_outcome;
if (mlPredicts === patternPredicts) {
// Both agree - confidence is higher
// Weight by how confident each method is
const mlConf = this.calculateModelConfidence(prediction);
const patternConf = pp.confidence || 50;
return (mlConf + patternConf) / 2;
} else {
// Disagreement - confidence is lower
return 30;
}
}
async calculateHistoricalAccuracy(prediction) {
// How accurate has the model been for similar predictions?
const modelId = prediction.model_id;
const probability = prediction.predicted_probability;
// Find historical predictions in similar probability range
const binStart = Math.floor(probability * 10) / 10; // 0.87 -> 0.80
const binEnd = binStart + 0.1;
const historicalPerformance = await this.db.query(`
SELECT
COUNT(*) as total,
SUM(CASE WHEN prediction_correct = 1 THEN 1 ELSE 0 END) as correct,
SUM(CASE WHEN prediction_correct = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as accuracy
FROM ml_predictions
WHERE model_id = ?
AND predicted_probability >= ?
AND predicted_probability < ?
AND actual_outcome IS NOT NULL
`, [modelId, binStart, binEnd]);
const perf = historicalPerformance[0];
if (!perf || perf.total < 5) {
return 50; // Not enough data = medium confidence
}
// Return historical accuracy as confidence
return perf.accuracy || 50;
}
determineConfidenceLevel(score) {
if (score >= 80) return 'high';
if (score >= 50) return 'medium';
return 'low';
}
async detectEdgeCases(prediction) {
const edgeCases = [];
const features = JSON.parse(prediction.feature_values);
// Check 1: Missing critical features
const criticalFeatures = ['engagement_score', 'payment_risk', 'withdrawal_risk'];
const missing = criticalFeatures.filter(f =>
features[f] === null || features[f] === undefined
);
if (missing.length > 0) {
edgeCases.push({
type: 'missing_features',
severity: 'high',
description: `Missing critical features: ${missing.join(', ')}`
});
}
// Check 2: Extreme feature values
if (features.engagement_score < 10 || features.engagement_score > 90) {
edgeCases.push({
type: 'extreme_value',
severity: 'medium',
description: `Extreme engagement score: ${features.engagement_score}`
});
}
// Check 3: Very new family (limited history)
if (features.tenure_score < 20) {
edgeCases.push({
type: 'limited_history',
severity: 'medium',
description: 'Very new family, limited behavioral history'
});
}
// Check 4: Out-of-distribution detection
// Compare feature distributions to training data
const isOOD = await this.isOutOfDistribution(features);
if (isOOD) {
edgeCases.push({
type: 'out_of_distribution',
severity: 'high',
description: 'Feature pattern not seen in training data'
});
}
return edgeCases;
}
async isOutOfDistribution(features) {
// Simple OOD detection: compare to training data statistics
const stats = await this.db.query(`
SELECT
AVG(engagement_score) as avg_eng,
STDEV(engagement_score) as std_eng,
AVG(payment_risk) as avg_pay,
STDEV(payment_risk) as std_pay
FROM historical_patterns
WHERE outcome IS NOT NULL
`);
const s = stats[0];
if (!s || !s.std_eng || !s.std_pay) {
return false; // Not enough training data to judge the distribution
}
// Check if features are > 3 standard deviations from the training mean
const engZScore = Math.abs((features.engagement_score - s.avg_eng) / s.std_eng);
const payZScore = Math.abs((features.payment_risk - s.avg_pay) / s.std_pay);
return engZScore > 3 || payZScore > 3;
}
async saveConfidence(predictionId, confidence) {
await this.db.query(`
UPDATE ml_predictions
SET
confidence_score = ?,
confidence_level = ?,
confidence_factors = ?,
requires_human_review = ?
WHERE prediction_id = ?
`, [
confidence.confidence_score,
confidence.confidence_level,
JSON.stringify(confidence.confidence_factors),
confidence.requires_human_review ? 1 : 0,
predictionId
]);
// Save edge cases
for (const edge of confidence.edge_cases || []) {
await this.db.query(`
INSERT INTO edge_case_detections (
prediction_id, edge_case_type, severity, description
) VALUES (?, ?, ?, ?)
`, [predictionId, edge.type, edge.severity, edge.description]);
}
}
async getPrediction(predictionId) {
const result = await this.db.query(`
SELECT * FROM ml_predictions WHERE prediction_id = ?
`, [predictionId]);
return result[0];
}
}
module.exports = ConfidenceScorer;
Model Calibration
# calibrate_model.py
import numpy as np
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
def calibrate_and_plot(y_true, y_pred_proba, n_bins=10):
"""
Assess and visualize model calibration
Well-calibrated model: if model says 80%, then 80% of those cases actually happen
"""
# Calculate calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
y_true, y_pred_proba, n_bins=n_bins
)
# Plot
plt.figure(figsize=(10, 6))
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(mean_predicted_value, fraction_of_positives, 'o-', label='Model')
plt.xlabel('Predicted Probability')
plt.ylabel('Actual Frequency')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.savefig('calibration_curve.png')
# Calculate calibration error
calibration_error = np.mean(np.abs(fraction_of_positives - mean_predicted_value))
print(f"Mean Calibration Error: {calibration_error:.3f}")
return {
'fraction_of_positives': fraction_of_positives.tolist(),
'mean_predicted_value': mean_predicted_value.tolist(),
'calibration_error': float(calibration_error)
}
def recalibrate_model(model, X_train, y_train, X_test):
"""
Apply Platt scaling to improve calibration
"""
from sklearn.calibration import CalibratedClassifierCV
calibrated_model = CalibratedClassifierCV(model, method='sigmoid', cv=5)
calibrated_model.fit(X_train, y_train)
# Get calibrated probabilities
calibrated_probs = calibrated_model.predict_proba(X_test)[:, 1]
return calibrated_model, calibrated_probs
Usage Example
const scorer = new ConfidenceScorer(db);
// After making prediction
const predictionId = 12345;
const confidence = await scorer.calculateConfidence(predictionId);
console.log(`
Confidence Assessment for Prediction ${predictionId}:
Overall Confidence: ${confidence.overall_score.toFixed(1)}/100
Level: ${confidence.level.toUpperCase()}
Requires Human Review: ${confidence.requires_review ? 'YES ⚠️' : 'No'}
Confidence Factors:
Model Confidence: ${confidence.factors.model_confidence.toFixed(1)}/100
(How certain is the ML model based on probability)
Data Quality: ${confidence.factors.data_quality.toFixed(1)}/100
(Completeness and recency of input features)
Training Similarity: ${confidence.factors.training_similarity.toFixed(1)}/100
(How similar to cases model was trained on)
Ensemble Consensus: ${confidence.factors.ensemble_consensus.toFixed(1)}/100
(Agreement between ML and pattern matching)
Historical Accuracy: ${confidence.factors.historical_accuracy.toFixed(1)}/100
(Past accuracy for similar predictions)
Edge Cases Detected: ${confidence.edge_cases.length}
`);
if (confidence.edge_cases.length > 0) {
console.log('\n Edge Case Flags:');
confidence.edge_cases.forEach(edge => {
console.log(` - [${edge.severity.toUpperCase()}] ${edge.type}: ${edge.description}`);
});
}
// Example output:
// Confidence Assessment for Prediction 12345:
// Overall Confidence: 84.8/100
// Level: HIGH
// Requires Human Review: No
//
// Confidence Factors:
// Model Confidence: 74.0/100
// Data Quality: 95.0/100
// Training Similarity: 88.0/100
// Ensemble Consensus: 85.0/100
// Historical Accuracy: 90.0/100
//
// Edge Cases Detected: 0
Confidence-Based Actions
async function actOnPredictionWithConfidence(predictionId) {
const prediction = await db.query(`
SELECT
p.family_id,
p.predicted_probability,
p.confidence_score,
p.confidence_level,
p.requires_human_review
FROM ml_predictions p
WHERE p.prediction_id = ?
`, [predictionId]);
const p = prediction[0];
// Decision logic based on confidence
if (p.confidence_level === 'high' && p.predicted_probability > 0.80) {
// High confidence, high risk = Automatic urgent intervention
await triggerAutomaticIntervention(p.family_id, 'urgent');
return {
action: 'automatic_urgent_intervention',
reason: 'High confidence (>80), high risk (>80%)'
};
} else if (p.confidence_level === 'high' && p.predicted_probability > 0.60) {
// High confidence, moderate risk = Automatic standard intervention
await triggerAutomaticIntervention(p.family_id, 'standard');
return {
action: 'automatic_standard_intervention',
reason: 'High confidence (>80), moderate risk (60-80%)'
};
} else if (p.confidence_level === 'medium') {
// Medium confidence = Queue for coordinator review
await queueForHumanReview(p.family_id, predictionId, 'medium_confidence');
return {
action: 'queue_for_review',
reason: 'Medium confidence (50-79) - human judgment needed'
};
} else if (p.confidence_level === 'low' || p.requires_human_review) {
// Low confidence or edge cases = Flag for immediate review
await flagForImmediateReview(p.family_id, predictionId, 'low_confidence_or_edge_case');
return {
action: 'flag_for_immediate_review',
reason: 'Low confidence (<50) or edge cases detected'
};
} else {
// Low risk = Monitor only
await addToMonitoringList(p.family_id);
return {
action: 'monitor_only',
reason: 'Low risk prediction'
};
}
}
Variations
By Confidence Method
Simple: Model Probability Only
- Distance from the 0.5 threshold
- Fast, no additional computation
- Ignores data quality and training distribution

Intermediate: Multi-Factor Scoring
- Model + data quality + similarity
- Comprehensive, interpretable
- Moderate computational cost

Advanced: Bayesian Uncertainty
- Posterior distributions over predictions
- Captures epistemic + aleatoric uncertainty
- Computationally expensive, requires specialized models

Expert: Conformal Prediction
- Provides prediction sets with guaranteed coverage
- Rigorous statistical framework
- Complex implementation (see the sketch after this list)
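As a rough illustration of the conformal idea, here is a minimal split-conformal sketch in JavaScript. It assumes a held-out array of resolved predictions; the calib shape and the 'withdrew'/'remained' labels are illustrative, not part of the platform's API.
// Minimal split-conformal sketch (illustrative only).
// calib: [{ probWithdrew: 0.83, actual: 'withdrew' }, ...] held-out, resolved predictions
function conformalThreshold(calib, alpha = 0.1) {
  // Nonconformity score: 1 - probability the model assigned to the true class
  const scores = calib
    .map(c => 1 - (c.actual === 'withdrew' ? c.probWithdrew : 1 - c.probWithdrew))
    .sort((a, b) => a - b);
  const n = scores.length;
  // Conformal quantile index: ceil((n+1)(1-alpha)), 1-based
  const k = Math.ceil((n + 1) * (1 - alpha));
  if (k > n) return 1; // too few calibration cases: every label stays in the set
  return scores[k - 1];
}

function conformalPredictionSet(probWithdrew, qhat) {
  const set = [];
  if (1 - probWithdrew <= qhat) set.push('withdrew'); // score for 'withdrew'
  if (probWithdrew <= qhat) set.push('remained');     // score for 'remained'
  // Both labels in the set = low confidence; a single label = confident prediction
  return set;
}
With this framing, a two-label prediction set plays the same role as the requires_human_review flag in the implementation above.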
By Application
High Stakes (Medical, Financial):
- Conservative thresholds (require 90%+ confidence)
- Multiple validation methods
- Extensive human oversight
- Rigorous calibration

Medium Stakes (Customer Operations):
- Balanced thresholds (70%+ confidence)
- Standard multi-factor scoring
- Selective human review
- Periodic calibration

Low Stakes (Marketing, Recommendations):
- Permissive thresholds (50%+ confidence)
- Simple probability-based scoring
- Minimal human review
- Occasional validation
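One way to make these profiles concrete is a small configuration object that action logic (such as actOnPredictionWithConfidence above) could read its thresholds from; the profile and field names here are illustrative assumptions, not a fixed API.
// Illustrative threshold profiles per application tier
const CONFIDENCE_PROFILES = {
  high_stakes: { autoActMinConfidence: 90, reviewBelow: 70, recalibrateEveryDays: 30 },
  medium_stakes: { autoActMinConfidence: 70, reviewBelow: 50, recalibrateEveryDays: 90 },
  low_stakes: { autoActMinConfidence: 50, reviewBelow: 30, recalibrateEveryDays: 180 }
};

function actionFor(profileName, confidenceScore) {
  const p = CONFIDENCE_PROFILES[profileName];
  if (confidenceScore >= p.autoActMinConfidence) return 'act_automatically';
  if (confidenceScore >= p.reviewBelow) return 'act_with_caution';
  return 'human_review'; // below the review threshold: always route to a person
}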
Consequences
Benefits
1. Appropriate automation: High confidence = automate; low confidence = human review. The right balance.
2. Reduced false alarms: Don't act on unreliable predictions.
3. Transparent uncertainty: "85% confident" is more honest than "87% risk" alone.
4. Prioritized human effort: Review low-confidence cases first.
5. Calibrated expectations: Users learn when to trust the system.
6. Edge case detection: Flag unusual situations automatically.
Costs
1. Additional complexity: Confidence scores must be calculated and stored.
2. Computational overhead: Confidence calculation adds latency.
3. Calibration maintenance: Calibration must be monitored and refreshed over time.
4. Communication challenge: Explaining confidence to non-technical users takes care.
5. Threshold tuning: Deciding which confidence level triggers which action.
Sample Code
Batch confidence calculation:
async function calculateConfidenceForAllPredictions() {
const scorer = new ConfidenceScorer(db);
// Get recent predictions without confidence
const predictions = await db.query(`
SELECT prediction_id
FROM ml_predictions
WHERE confidence_score IS NULL
AND prediction_date >= DATEADD(DAY, -7, GETDATE())
`);
console.log(`Calculating confidence for ${predictions.length} predictions...`);
for (const pred of predictions) {
await scorer.calculateConfidence(pred.prediction_id);
}
console.log('Done!');
}
Known Uses
Homeschool Co-Op Intelligence Platform
- Multi-factor confidence scoring
- 89% of predictions high confidence (>80)
- 8% require human review (<50)
- Reduced false intervention rate by 40%

Medical AI Systems
- FDA requires uncertainty quantification
- Predictions with confidence intervals
- Low-confidence cases flagged for physician review

Autonomous Vehicles
- Confidence in object detection
- Low confidence = slow down, hand to driver
- Critical safety application

Financial Trading
- Confidence in price predictions
- Trade size scales with confidence
- Risk management essential
Related Patterns
Requires:
- Pattern 11: Historical Pattern Matching - similarity scoring
- Pattern 12: Risk Stratification Models - ML predictions

Enhances:
- Pattern 15: Intervention Recommendation Engine - confidence determines automation level
- Pattern 22: Progressive Escalation Sequences - confidence affects escalation
- Pattern 26: Feedback Loop Implementation - validate confidence calibration

Enabled by this pattern:
- Selective automation (high confidence only)
- Human-in-the-loop workflows
- Adaptive action thresholds
References
- Guo, Chuan, et al. "On Calibration of Modern Neural Networks." ICML 2017.
- Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles." NeurIPS 2017.
- Angelopoulos, Anastasios N., and Stephen Bates. "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." arXiv 2021.
- Platt, John. "Probabilistic Outputs for Support Vector Machines." Advances in Large Margin Classifiers 1999.