Pattern 16: Cohort Discovery & Analysis
Intent
Automatically discover meaningful cohorts (groups) within the population based on shared behavioral patterns, characteristics, and outcomes, enabling targeted strategies, segment-specific interventions, and discovery of hidden subpopulations with unique needs.
Also Known As
- Cluster Analysis
- Behavioral Segmentation
- Market Segmentation Discovery
- Pattern-Based Grouping
- Unsupervised Cohort Identification
Problem
Manual segmentation misses hidden patterns.
Sarah segments families manually:
- "New families" (enrolled < 6 months)
- "Established families" (enrolled 6+ months)
- "At-risk families" (low engagement scores)
But the data reveals unexpected cohorts:
Cohort A: "Silent Strugglers" (12 families)
- Never complain or ask for help
- Payment always on time
- But quietly disengaging (portal use dropping, event attendance declining)
- 83% withdrawal rate if unaddressed

Cohort B: "Vocal But Committed" (18 families)
- Frequent questions, concerns, feedback
- Occasional payment delays
- But deeply engaged (high participation, referrals)
- Only 8% withdrawal rate

Manual segmentation would treat these similarly:
- Both might score 60-70 on engagement
- Both have some issues
- But they have completely different risk profiles and needs

Without cohort discovery:
- Miss these distinct subpopulations
- Apply wrong interventions (one-size-fits-all)
- Can't target resources effectively
- Don't learn segment-specific patterns

With cohort discovery:
- Find 8-12 natural cohorts automatically
- Understand each cohort's unique characteristics
- Tailor interventions per cohort
- Track cohort-specific outcomes
Context
When this pattern applies:
- Population large enough for meaningful segments (100+ entities)
- Behavioral diversity exists (not all the same)
- Want to discover unknown patterns (not predefined segments)
- Targeted strategies are feasible
- Can act on segment-specific insights
When this pattern may not be needed:
- Very small population (<50)
- Population is homogeneous
- Already have good manual segmentation
- Can't differentiate strategies by segment
Forces
Competing concerns:
1. Granularity vs Actionability
- Many small cohorts = precise but hard to act on
- Few large cohorts = actionable but lose nuance
- Balance: 6-12 cohorts for most organizations

2. Stability vs Freshness
- Stable cohorts = consistent but may be outdated
- Dynamic cohorts = current but members shift constantly
- Balance: Quarterly reclustering

3. Interpretability vs Accuracy
- Simple features = interpretable cohorts ("high engagement group")
- Complex features = accurate but hard to explain
- Balance: Use interpretable features, validate with complex ones

4. Supervised vs Unsupervised
- Supervised = optimize for outcome (e.g., retention)
- Unsupervised = discover natural groups
- Balance: Start unsupervised, validate with outcomes

5. Predefined vs Discovered
- Predefined = hypothesis-driven (test ideas)
- Discovered = data-driven (find surprises)
- Balance: Do both, compare results
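The "do both, compare results" balance can be quantified with the adjusted Rand index, which measures how much a rule-based segmentation and a discovered clustering agree. A minimal sketch on toy data (the feature names and manual thresholds are illustrative, not from the platform):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Toy features: engagement score (0-100) and tenure in days
engagement = rng.uniform(0, 100, 200)
tenure = rng.uniform(0, 720, 200)
X = np.column_stack([engagement, tenure])

# Predefined, hypothesis-driven segments (like the manual ones above)
predefined = np.where(tenure < 180, 0, np.where(engagement < 50, 1, 2))

# Discovered, data-driven segments
discovered = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(
    StandardScaler().fit_transform(X))

# Adjusted Rand Index: 1.0 = identical groupings, near 0 = unrelated
ari = adjusted_rand_score(predefined, discovered)
print(f"Agreement between manual and discovered segments (ARI): {ari:.2f}")
```

A low ARI is not a failure: it means the algorithm found structure the manual rules miss, which is exactly what this pattern is after.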
Solution
Use clustering algorithms to discover natural groupings:
Approach:
1. Feature Selection - Choose behavioral and characteristic features
2. Clustering - Apply an algorithm (K-means, hierarchical, DBSCAN)
3. Validation - Ensure cohorts are meaningful and stable
4. Characterization - Describe what makes each cohort unique
5. Outcome Analysis - How do outcomes differ by cohort?
6. Strategy Development - Tailor the approach per cohort
7. Monitoring - Track cohort evolution over time
Popular Clustering Algorithms:
K-Means:
- Specify number of clusters (k)
- Fast, simple, interpretable
- Assumes spherical clusters
- Good starting point

Hierarchical:
- Creates a tree of clusters
- Don't need to specify k upfront
- Can cut the tree at different levels
- Visual dendrogram is helpful

DBSCAN:
- Finds arbitrary-shaped clusters
- Automatically determines cluster count
- Handles noise (outliers)
- More complex to tune

Gaussian Mixture Models (GMM):
- Probabilistic clustering
- Soft assignment (probability of membership)
- Handles overlapping clusters
- More sophisticated
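GMM's soft assignment is what the membership_confidence column in the schema below is meant to hold. A minimal sketch on toy data (the two blobs stand in for two hypothetical cohorts):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data: two loose blobs of (engagement, attendance) values
X = np.vstack([rng.normal([80.0, 70.0], 10.0, (50, 2)),
               rng.normal([35.0, 20.0], 10.0, (50, 2))])
X_scaled = StandardScaler().fit_transform(X)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_scaled)
probs = gmm.predict_proba(X_scaled)  # shape: (n_samples, n_components)
labels = probs.argmax(axis=1)

# A family near a cluster boundary gets a split probability
# instead of a hard 0/1 assignment
print(probs[0].round(2), "-> cohort", labels[0])
```

Families whose highest membership probability is well below 1.0 sit between cohorts, and that ambiguity itself can be a useful signal.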
Structure
Cohort Analysis Tables
-- Store discovered cohorts
CREATE TABLE cohorts (
cohort_id INT PRIMARY KEY IDENTITY(1,1),
cohort_name VARCHAR(100),
cohort_description NVARCHAR(1000),
-- Clustering details
clustering_algorithm VARCHAR(50), -- 'kmeans', 'hierarchical', 'dbscan', 'gmm'
clustering_run_id INT,
cluster_number INT, -- Original cluster ID from algorithm
-- Characteristics (what defines this cohort)
characteristic_features NVARCHAR(MAX), -- JSON
avg_engagement_score DECIMAL(5,2),
avg_risk_score DECIMAL(5,2),
-- Size
member_count INT,
-- Outcomes
retention_rate DECIMAL(5,2),
avg_lifetime_value DECIMAL(10,2),
-- Period
analysis_date DATE NOT NULL,
active BIT DEFAULT 1,
CONSTRAINT UQ_cohort_name UNIQUE (cohort_name, analysis_date)
);
-- Cohort membership (which families belong to which cohorts)
CREATE TABLE cohort_membership (
membership_id INT PRIMARY KEY IDENTITY(1,1),
cohort_id INT NOT NULL,
family_id INT NOT NULL,
-- Assignment
assignment_date DATE NOT NULL,
membership_confidence DECIMAL(5,2), -- For soft clustering (GMM)
-- Historical tracking
previous_cohort_id INT, -- Which cohort were they in before?
cohort_tenure_days INT, -- How long in this cohort?
CONSTRAINT FK_membership_cohort FOREIGN KEY (cohort_id)
REFERENCES cohorts(cohort_id),
CONSTRAINT FK_membership_family FOREIGN KEY (family_id)
REFERENCES families(family_id)
);
-- Cohort-specific strategies
CREATE TABLE cohort_strategies (
strategy_id INT PRIMARY KEY IDENTITY(1,1),
cohort_id INT NOT NULL,
strategy_name VARCHAR(200),
strategy_description NVARCHAR(MAX),
-- Recommended interventions for this cohort
recommended_interventions NVARCHAR(MAX), -- JSON
communication_approach NVARCHAR(500),
engagement_tactics NVARCHAR(500),
-- Effectiveness
times_applied INT DEFAULT 0,
success_rate DECIMAL(5,2),
active BIT DEFAULT 1,
created_date DATETIME2 DEFAULT GETDATE(),
CONSTRAINT FK_strategy_cohort FOREIGN KEY (cohort_id)
REFERENCES cohorts(cohort_id)
);
-- Track cohort metrics over time
CREATE TABLE cohort_metrics_history (
history_id INT PRIMARY KEY IDENTITY(1,1),
cohort_id INT NOT NULL,
snapshot_date DATE NOT NULL,
member_count INT,
avg_engagement_score DECIMAL(5,2),
avg_risk_score DECIMAL(5,2),
retention_rate DECIMAL(5,2),
CONSTRAINT FK_history_cohort FOREIGN KEY (cohort_id)
REFERENCES cohorts(cohort_id)
);
Implementation
Cohort Discovery Engine
class CohortDiscoveryEngine {
constructor(db) {
this.db = db;
}
async discoverCohorts(k = 8, algorithm = 'kmeans') {
// Step 1: Extract features for all families
const features = await this.extractFeaturesForClustering();
// Step 2: Run clustering algorithm (call Python)
const clusteringResult = await this.runClustering(features, k, algorithm);
// Step 3: Validate cluster quality
const validation = this.validateClusters(clusteringResult);
if (validation.silhouette_score < 0.3) {
console.warn('Poor cluster quality. Consider different k or algorithm.');
}
// Step 4: Characterize each cohort
const cohorts = await this.characterizeCohorts(clusteringResult, features);
// Step 5: Analyze outcomes by cohort
const withOutcomes = await this.analyzeCohortOutcomes(cohorts);
// Step 6: Save to database
await this.saveCohorts(withOutcomes);
// Step 7: Generate human-readable names
await this.nameCohorts(withOutcomes);
return withOutcomes;
}
async extractFeaturesForClustering() {
const families = await this.db.query(`
SELECT
f.family_id,
f.family_name,
-- Engagement features
fem.engagement_score,
fem.communication_score,
fem.platform_engagement_score,
fem.participation_score,
fem.score_velocity,
-- Risk features
ra.withdrawal_risk,
ra.payment_risk,
-- Behavioral features
DATEDIFF(DAY, f.enrollment_date, GETDATE()) as tenure_days,
COALESCE(pmt.on_time_rate, 50) as payment_reliability,
COALESCE(evt.attendance_rate, 0) as event_attendance_rate,
COALESCE(vol.total_hours, 0) as volunteer_hours,
COALESCE(ref.referral_count, 0) as referrals_made
FROM families f
LEFT JOIN family_engagement_metrics fem ON f.family_id = fem.family_id
LEFT JOIN risk_assessments ra ON f.family_id = ra.family_id
LEFT JOIN (
SELECT family_id,
SUM(CASE WHEN outcome = 'paid_on_time' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as on_time_rate
FROM interaction_log
WHERE interaction_type = 'payment_received'
GROUP BY family_id
) pmt ON f.family_id = pmt.family_id
LEFT JOIN (
SELECT family_id,
SUM(CASE WHEN interaction_type = 'event_attended' THEN 1 ELSE 0 END) * 100.0 /
NULLIF(SUM(CASE WHEN interaction_type = 'event_invited' THEN 1 ELSE 0 END), 0) as attendance_rate
FROM interaction_log
GROUP BY family_id
) evt ON f.family_id = evt.family_id
LEFT JOIN (
SELECT family_id,
SUM(CAST(JSON_VALUE(metadata, '$.hours') AS DECIMAL)) as total_hours
FROM interaction_log
WHERE interaction_type = 'volunteer_hours_logged'
GROUP BY family_id
) vol ON f.family_id = vol.family_id
LEFT JOIN (
SELECT family_id,
COUNT(*) as referral_count
FROM interaction_log
WHERE interaction_type = 'referral_made'
GROUP BY family_id
) ref ON f.family_id = ref.family_id
WHERE f.enrolled_current_semester = 1
`);
return families;
}
async runClustering(features, k, algorithm) {
// Call Python clustering script
const { spawn } = require('child_process');
const path = require('path');
return new Promise((resolve, reject) => {
const python = spawn('python3', [
path.join(__dirname, 'clustering.py'),
JSON.stringify(features),
k.toString(),
algorithm
]);
let result = '';
python.stdout.on('data', (data) => result += data.toString());
python.stderr.on('data', (data) => console.error(data.toString()));
python.on('close', (code) => {
if (code !== 0) {
reject(new Error(`Clustering failed with code ${code}`));
} else {
resolve(JSON.parse(result));
}
});
});
}
validateClusters(clusteringResult) {
// Cluster quality metrics returned from Python
return {
silhouette_score: clusteringResult.silhouette_score,
calinski_harabasz_score: clusteringResult.calinski_harabasz_score,
davies_bouldin_score: clusteringResult.davies_bouldin_score
};
}
async characterizeCohorts(clusteringResult, features) {
const cohorts = [];
for (let i = 0; i < clusteringResult.n_clusters; i++) {
// Get members of this cluster
const members = clusteringResult.labels
.map((label, idx) => label === i ? features[idx] : null)
.filter(f => f !== null);
// Calculate average characteristics
const characteristics = this.calculateCohortCharacteristics(members);
cohorts.push({
cluster_number: i,
member_count: members.length,
members: members,
characteristics: characteristics,
centroid: clusteringResult.centroids[i]
});
}
return cohorts;
}
calculateCohortCharacteristics(members) {
const avg = (arr, key) => arr.reduce((sum, m) => sum + (m[key] || 0), 0) / arr.length;
return {
avg_engagement_score: avg(members, 'engagement_score'),
avg_communication_score: avg(members, 'communication_score'),
avg_platform_engagement: avg(members, 'platform_engagement_score'),
avg_participation: avg(members, 'participation_score'),
avg_score_velocity: avg(members, 'score_velocity'),
avg_withdrawal_risk: avg(members, 'withdrawal_risk'),
avg_payment_risk: avg(members, 'payment_risk'),
avg_tenure_days: avg(members, 'tenure_days'),
avg_payment_reliability: avg(members, 'payment_reliability'),
avg_attendance_rate: avg(members, 'event_attendance_rate'),
avg_volunteer_hours: avg(members, 'volunteer_hours'),
avg_referrals: avg(members, 'referrals_made')
};
}
async analyzeCohortOutcomes(cohorts) {
for (const cohort of cohorts) {
const memberIds = cohort.members.map(m => m.family_id);
if (memberIds.length === 0) continue; // avoid an empty (and invalid) IN () clause
// memberIds are internal numeric IDs, so inlining them in the IN clause is safe
// Calculate retention rate
const outcomes = await this.db.query(`
SELECT
COUNT(*) as total,
SUM(CASE WHEN enrollment_status = 'active' THEN 1 ELSE 0 END) as retained,
SUM(CASE WHEN withdrawal_date IS NOT NULL THEN 1 ELSE 0 END) as withdrew
FROM families
WHERE family_id IN (${memberIds.join(',')})
`);
const o = outcomes[0];
cohort.retention_rate = o.total > 0 ? (o.retained / o.total) * 100 : 0;
cohort.withdrawal_rate = o.total > 0 ? (o.withdrew / o.total) * 100 : 0;
// Calculate lifetime value (simplified)
const ltv = await this.db.query(`
SELECT AVG(total_paid) as avg_ltv
FROM (
SELECT family_id, SUM(amount) as total_paid
FROM payments
WHERE family_id IN (${memberIds.join(',')})
GROUP BY family_id
) subq
`);
cohort.avg_lifetime_value = ltv[0]?.avg_ltv || 0;
}
return cohorts;
}
async nameCohorts(cohorts) {
// Generate human-readable names based on characteristics
for (const cohort of cohorts) {
const c = cohort.characteristics;
// Pattern matching to generate descriptive names
if (c.avg_engagement_score > 80 && c.avg_referrals > 0.5) {
cohort.name = "Champions & Advocates";
cohort.description = "High engagement, active promoters, excellent retention";
} else if (c.avg_engagement_score > 70 && c.avg_participation > 70) {
cohort.name = "Highly Engaged";
cohort.description = "Active participants, strong community connection";
} else if (c.avg_engagement_score < 50 && c.avg_communication_score < 40) {
cohort.name = "Silent Strugglers";
cohort.description = "Low engagement, minimal communication, high risk";
} else if (c.avg_payment_reliability < 60 && c.avg_engagement_score > 60) {
cohort.name = "Engaged But Financially Stressed";
cohort.description = "Active participation but payment challenges";
} else if (c.avg_tenure_days < 180 && c.avg_engagement_score > 65) {
cohort.name = "New & Promising";
cohort.description = "Recently enrolled, showing strong early engagement";
} else if (c.avg_tenure_days < 180 && c.avg_engagement_score < 50) {
cohort.name = "Struggling Newcomers";
cohort.description = "Recently enrolled, having difficulty connecting";
} else if (c.avg_score_velocity < -3 && c.avg_engagement_score < 60) {
cohort.name = "Declining Rapidly";
cohort.description = "Engagement dropping fast, urgent intervention needed";
} else if (c.avg_volunteer_hours > 5) {
cohort.name = "Core Volunteers";
cohort.description = "Heavy volunteer contributors, deeply committed";
} else {
cohort.name = `Cohort ${cohort.cluster_number + 1}`;
cohort.description = "General population segment";
}
}
return cohorts;
}
async saveCohorts(cohorts) {
const today = new Date().toISOString().split('T')[0];
for (const cohort of cohorts) {
// Save cohort
const cohortResult = await this.db.query(`
INSERT INTO cohorts (
cohort_name,
cohort_description,
clustering_algorithm,
cluster_number,
characteristic_features,
avg_engagement_score,
avg_risk_score,
member_count,
retention_rate,
avg_lifetime_value,
analysis_date
)
OUTPUT INSERTED.cohort_id
VALUES (?, ?, 'kmeans', ?, ?, ?, ?, ?, ?, ?, ?)
`, [
cohort.name,
cohort.description,
cohort.cluster_number,
JSON.stringify(cohort.characteristics),
cohort.characteristics.avg_engagement_score,
cohort.characteristics.avg_withdrawal_risk,
cohort.member_count,
cohort.retention_rate,
cohort.avg_lifetime_value,
today
]);
const cohortId = cohortResult[0].cohort_id;
// Save memberships
for (const member of cohort.members) {
await this.db.query(`
INSERT INTO cohort_membership (
cohort_id,
family_id,
assignment_date
) VALUES (?, ?, ?)
`, [cohortId, member.family_id, today]);
}
}
}
}
module.exports = CohortDiscoveryEngine;
Python Clustering Script
# clustering.py
import sys
import json
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
def cluster_data(features_json, k, algorithm='kmeans'):
    # Parse features
    features_list = json.loads(features_json)
    df = pd.DataFrame(features_list)

    # Select numeric features for clustering
    feature_cols = [
        'engagement_score', 'communication_score', 'platform_engagement_score',
        'participation_score', 'score_velocity', 'withdrawal_risk', 'payment_risk',
        'tenure_days', 'payment_reliability', 'event_attendance_rate',
        'volunteer_hours', 'referrals_made'
    ]
    X = df[feature_cols].fillna(0)

    # Normalize features so no single feature dominates the distance metric
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Apply clustering algorithm
    if algorithm == 'kmeans':
        model = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = model.fit_predict(X_scaled)
        centroids = scaler.inverse_transform(model.cluster_centers_)
    elif algorithm == 'hierarchical':
        model = AgglomerativeClustering(n_clusters=k)
        labels = model.fit_predict(X_scaled)
        # AgglomerativeClustering has no centroids; calculate them manually
        centroids = []
        for i in range(k):
            cluster_points = X_scaled[labels == i]
            centroid = cluster_points.mean(axis=0)
            centroids.append(scaler.inverse_transform([centroid])[0])
    elif algorithm == 'dbscan':
        model = DBSCAN(eps=0.5, min_samples=5)
        labels = model.fit_predict(X_scaled)
        k = len(set(labels)) - (1 if -1 in labels else 0)  # Exclude noise (-1)
        # Calculate centroids for the non-noise clusters
        centroids = []
        for i in range(k):
            cluster_points = X_scaled[labels == i]
            if len(cluster_points) > 0:
                centroid = cluster_points.mean(axis=0)
                centroids.append(scaler.inverse_transform([centroid])[0])
    elif algorithm == 'gmm':
        model = GaussianMixture(n_components=k, random_state=42)
        labels = model.fit_predict(X_scaled)
        centroids = scaler.inverse_transform(model.means_)
    else:
        raise ValueError(f"Unknown algorithm: {algorithm}")

    # Calculate quality metrics (they require at least 2 distinct labels)
    valid = len(set(labels)) > 1
    silhouette = silhouette_score(X_scaled, labels) if valid else 0
    calinski = calinski_harabasz_score(X_scaled, labels) if valid else 0
    davies = davies_bouldin_score(X_scaled, labels) if valid else 0

    result = {
        'n_clusters': k,
        'labels': labels.tolist(),
        'centroids': [np.asarray(c).tolist() for c in centroids],
        'silhouette_score': float(silhouette),
        'calinski_harabasz_score': float(calinski),
        'davies_bouldin_score': float(davies),
        'feature_names': feature_cols
    }
    # Emit JSON on stdout for the Node.js caller
    print(json.dumps(result))

if __name__ == "__main__":
    features_json = sys.argv[1]
    k = int(sys.argv[2])
    algorithm = sys.argv[3] if len(sys.argv) > 3 else 'kmeans'
    cluster_data(features_json, k, algorithm)
Usage Example
const engine = new CohortDiscoveryEngine(db);
// Discover cohorts
const cohorts = await engine.discoverCohorts(8, 'kmeans');
console.log(`\n=== DISCOVERED ${cohorts.length} COHORTS ===\n`);
cohorts.forEach((cohort, i) => {
console.log(`COHORT ${i+1}: ${cohort.name}`);
console.log(`Description: ${cohort.description}`);
console.log(`Size: ${cohort.member_count} families`);
console.log(`Retention Rate: ${cohort.retention_rate.toFixed(1)}%`);
console.log(`Avg LTV: $${cohort.avg_lifetime_value.toFixed(0)}`);
console.log(`\nKey Characteristics:`);
console.log(` Engagement: ${cohort.characteristics.avg_engagement_score.toFixed(1)}/100`);
console.log(` Withdrawal Risk: ${cohort.characteristics.avg_withdrawal_risk.toFixed(1)}/100`);
console.log(` Payment Reliability: ${cohort.characteristics.avg_payment_reliability.toFixed(1)}%`);
console.log(` Volunteer Hours: ${cohort.characteristics.avg_volunteer_hours.toFixed(1)}`);
console.log(`\n`);
});
// Example output:
// === DISCOVERED 8 COHORTS ===
//
// COHORT 1: Champions & Advocates
// Description: High engagement, active promoters, excellent retention
// Size: 18 families
// Retention Rate: 97.2%
// Avg LTV: $2,450
//
// Key Characteristics:
// Engagement: 89.3/100
// Withdrawal Risk: 12.4/100
// Payment Reliability: 98.7%
// Volunteer Hours: 8.3
//
// COHORT 2: Silent Strugglers
// Description: Low engagement, minimal communication, high risk
// Size: 12 families
// Retention Rate: 25.0%
// Avg LTV: $890
//
// Key Characteristics:
// Engagement: 38.2/100
// Withdrawal Risk: 82.6/100
// Payment Reliability: 68.3%
// Volunteer Hours: 0.2
Variations
By Algorithm Choice
K-Means:
- Simple, fast, interpretable
- Need to specify k upfront
- Works for spherical clusters
- Good default choice

Hierarchical:
- Don't need to specify k
- Creates a cluster tree (dendrogram)
- Can explore different granularities
- Slower for large datasets

DBSCAN:
- Finds arbitrary shapes
- Auto-detects cluster count
- Handles outliers well
- Requires tuning eps/min_samples

Gaussian Mixture Models:
- Probabilistic (soft) clustering
- Handles overlapping cohorts
- More sophisticated
- Computationally expensive
By Feature Selection
Behavioral Only:
- Engagement, participation, communication
- Pure behavior-based segmentation
- Independent of demographics

Demographic + Behavioral:
- Add age, location, family size
- Richer segmentation
- May capture natural groups

Outcome-Optimized:
- Weight features by outcome correlation
- Segments optimized for retention/LTV
- More predictive, less exploratory
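One simple way to implement outcome-optimized weighting is to scale each standardized feature by the absolute value of its correlation with the outcome before clustering. A sketch with synthetic data (the column names and the retention rule are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "engagement_score": rng.uniform(0, 100, 300),
    "tenure_days": rng.uniform(0, 720, 300),
    "volunteer_hours": rng.exponential(2.0, 300),
})
# Synthetic retention outcome driven mostly by engagement
retained = (df["engagement_score"] + rng.normal(0, 20, 300) > 50).astype(int)

X = StandardScaler().fit_transform(df)
# Weight each feature by |correlation with the outcome| so features
# that track retention dominate the distance metric
weights = np.abs([np.corrcoef(X[:, i], retained)[0, 1]
                  for i in range(X.shape[1])])
labels = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(X * weights)
```

The resulting segments separate more along outcome-relevant dimensions, at the cost of possibly hiding structure unrelated to the chosen outcome.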
By Update Frequency
Static (Annual):
- Cluster once, use all year
- Simple, stable
- May become outdated

Periodic (Quarterly):
- Recluster each quarter
- Balances stability and freshness
- Most common approach

Dynamic (Continuous):
- Assign new members to the nearest cluster
- Always current
- Requires online clustering
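Between full reclustering runs, the dynamic variation only needs nearest-centroid assignment: keep the fitted scaler and model from the last run and place each newly enrolled family into the closest existing cohort. A sketch with toy data (in practice the fitted objects would be persisted, not refit):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Feature matrix from the last quarterly reclustering run (toy data)
X_history = rng.uniform(0, 100, (200, 3))
scaler = StandardScaler().fit(X_history)
kmeans = KMeans(n_clusters=5, random_state=7, n_init=10).fit(
    scaler.transform(X_history))

def assign_to_cohort(new_features):
    """Place one newly enrolled family into the nearest existing centroid."""
    x = scaler.transform(np.atleast_2d(new_features))
    return int(kmeans.predict(x)[0])

cohort = assign_to_cohort([72.0, 15.0, 3.0])
```

KMeans.predict already returns the nearest-centroid label, so no custom distance code is needed; the important detail is reusing the same scaler that was fit during the clustering run.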
Consequences
Benefits
1. Discover hidden patterns - Find groups like "Silent Strugglers" that never would have been segmented manually.
2. Targeted strategies - A different approach for each cohort (not one-size-fits-all).
3. Resource optimization - Focus on high-value or high-risk cohorts.
4. Benchmarking - Compare families to their cohort (not the global average).
5. Predictive power - Cohort membership predicts outcomes.
6. Learning at scale - Learn what works for each cohort type.
Costs
1. Requires sufficient data - Need 100+ families for meaningful clustering.
2. Interpretation challenge - Algorithms find patterns; humans must interpret them.
3. Membership fluidity - Families move between cohorts over time.
4. Naming difficulty - Converting clusters into human-understandable names takes judgment.
5. Computational cost - Clustering large datasets can be slow.
6. Validation complexity - How do you know the clusters are "good"?
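One practical answer to the validation question, beyond the silhouette score, is a stability check: recluster bootstrap resamples of the data and measure how much the assignments agree with the original clustering. A sketch (the threshold for "stable enough" is a judgment call, not a standard):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_stability(X, k, n_boot=20, seed=42):
    """Mean adjusted Rand index between the base clustering and
    clusterings of bootstrap resamples (near 1.0 = stable)."""
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=len(X), replace=True)
        boot = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X[idx])
        # Compare the two models' assignments on the resampled points
        scores.append(adjusted_rand_score(base.predict(X[idx]), boot.labels_))
    return float(np.mean(scores))

# Toy data: pure Gaussian noise, i.e., no real cluster structure
X = np.random.default_rng(0).normal(size=(300, 4))
score = cluster_stability(X, k=4)
print(f"Mean bootstrap ARI: {score:.2f}")
```

Unstable cohorts that reshuffle under resampling are a warning that the "segments" may be artifacts of the algorithm rather than real subpopulations.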
Sample Code
Optimal k selection (elbow method):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_optimal_k(X_scaled, k_range=range(2, 15)):
    inertias = []
    silhouettes = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_scaled)
        inertias.append(kmeans.inertia_)
        silhouettes.append(silhouette_score(X_scaled, labels))

    # Plot elbow curve and silhouette analysis side by side
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(list(k_range), inertias, 'o-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method')

    plt.subplot(1, 2, 2)
    plt.plot(list(k_range), silhouettes, 'o-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Analysis')
    plt.tight_layout()
    plt.savefig('optimal_k.png')

    # Recommend the k with the highest silhouette score
    optimal_k = list(k_range)[np.argmax(silhouettes)]
    print(f"Recommended k: {optimal_k} (highest silhouette score)")
    return optimal_k
Known Uses
Homeschool Co-op Intelligence Platform:
- 8 cohorts discovered
- "Silent Strugglers" identified (83% withdrawal without intervention)
- "Champions & Advocates" receive different engagement
- Improved retention by 12% through targeted strategies

E-Commerce:
- Customer segmentation is standard practice
- RFM analysis (Recency, Frequency, Monetary)
- Personalized recommendations per segment

Healthcare:
- Patient cohorts for clinical trials
- Risk stratification (Hierarchical Condition Categories)
- Personalized treatment protocols

SaaS:
- User segmentation by usage patterns
- Churn-risk cohorts
- Expansion-opportunity cohorts
Related Patterns
Requires:
- Pattern 1: Universal Event Log - behavioral data for clustering
- Pattern 6: Composite Health Scoring - features for clustering

Enables:
- Pattern 17: Anomaly Detection - cohort baselines for anomalies
- Pattern 18: Opportunity Mining - identify high-value cohorts
- Pattern 15: Intervention Recommendation - cohort-specific recommendations

Enhanced by:
- Pattern 11: Historical Pattern Matching - validate cohort patterns
- Pattern 12: Risk Stratification Models - predict cohort transitions
References
On Unsupervised Learning:
- Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. (Chapter 14: Unsupervised Learning) https://web.stanford.edu/~hastie/ElemStatLearn/
- Aggarwal, Charu C., and Chandan K. Reddy, eds. Data Clustering: Algorithms and Applications. CRC Press, 2013.

On Clustering Algorithms:
- Kaufman, Leonard, and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
- Jain, Anil K. "Data Clustering: 50 Years Beyond K-Means." Pattern Recognition Letters 31(8), 2010: 651-666. (Survey of clustering methods)
- Ester, Martin, et al. "A Density-Based Algorithm for Discovering Clusters (DBSCAN)." KDD 1996. (DBSCAN algorithm)

On Association Rule Mining:
- Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast Algorithms for Mining Association Rules." VLDB 1994. (The Apriori algorithm)
- Han, Jiawei, et al. Data Mining: Concepts and Techniques, 3rd Edition. Morgan Kaufmann, 2011. (Chapter 6: Mining Frequent Patterns)

On Anomaly Detection:
- Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly Detection: A Survey." ACM Computing Surveys 41(3), 2009. (Comprehensive anomaly detection survey)
- Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation Forest." ICDM 2008. (Isolation Forest algorithm)

On Implementation:
- Scikit-learn Clustering: https://scikit-learn.org/stable/modules/clustering.html (Clustering algorithms)
- mlxtend (Apriori): https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/ (Association rules in Python)
- PyOD: https://pyod.readthedocs.io/ (Python Outlier Detection library)

On Visualization:
- van der Maaten, Laurens, and Geoffrey Hinton. "Visualizing Data using t-SNE." JMLR 9, 2008. https://lvdmaaten.github.io/tsne/ (Dimensionality reduction for visualization)
- UMAP: https://umap-learn.readthedocs.io/ (Modern alternative to t-SNE)

Related Patterns in This Trilogy:
- Pattern 1 (Universal Event Log): Data source for pattern mining
- Pattern 11 (Historical Pattern Matching): Uses discovered patterns
- Pattern 17 (Anomaly Detection): Specialized pattern mining
- Pattern 18 (Cohort Analysis): Grouping by discovered patterns
- Pattern 26 (Feedback Loop): Validates discovered patterns
- Volume 3: Can discover UX problems in form interactions