Volume 2: Organizational Intelligence Platforms

Pattern 16: Cohort Discovery & Analysis

Intent

Automatically discover meaningful cohorts (groups) within the population based on shared behavioral patterns, characteristics, and outcomes, enabling targeted strategies, segment-specific interventions, and discovery of hidden subpopulations with unique needs.

Also Known As

  • Cluster Analysis
  • Behavioral Segmentation
  • Market Segmentation Discovery
  • Pattern-Based Grouping
  • Unsupervised Cohort Identification

Problem

Manual segmentation misses hidden patterns.

Sarah segments families manually:

  • "New families" (enrolled < 6 months)
  • "Established families" (enrolled 6+ months)
  • "At-risk families" (low engagement scores)

But the data reveals unexpected cohorts:

Cohort A: "Silent Strugglers" (12 families)

  • Never complain or ask for help
  • Payment always on time
  • But quietly disengaging (portal use dropping, event attendance declining)
  • 83% withdrawal rate if unaddressed

Cohort B: "Vocal But Committed" (18 families)

  • Frequent questions, concerns, feedback
  • Occasional payment delays
  • But deeply engaged (high participation, referrals)
  • Only 8% withdrawal rate

Manual segmentation would treat these similarly:

  • Both might score 60-70 on engagement
  • Both have some issues
  • But the risk profiles and needs are completely different

Without cohort discovery:

  • Miss these distinct subpopulations
  • Apply wrong interventions (one-size-fits-all)
  • Can't target resources effectively
  • Don't learn segment-specific patterns

With cohort discovery:

  • Find 8-12 natural cohorts automatically
  • Understand each cohort's unique characteristics
  • Tailor interventions per cohort
  • Track cohort-specific outcomes

Context

When this pattern applies:

  • Population large enough for meaningful segments (100+ entities)
  • Behavioral diversity exists (not all the same)
  • Want to discover unknown patterns (not predefined segments)
  • Targeted strategies are feasible
  • Can act on segment-specific insights

When this pattern may not be needed:

  • Very small population (<50)
  • Population is homogeneous
  • Already have good manual segmentation
  • Can't differentiate strategies by segment

Forces

Competing concerns:

1. Granularity vs Actionability

  • Many small cohorts = precise but hard to act on
  • Few large cohorts = actionable but lose nuance
  • Balance: 6-12 cohorts for most organizations

2. Stability vs Freshness

  • Stable cohorts = consistent but may be outdated
  • Dynamic cohorts = current but members shift constantly
  • Balance: Quarterly reclustering

3. Interpretability vs Accuracy

  • Simple features = interpretable cohorts ("high engagement group")
  • Complex features = accurate but hard to explain
  • Balance: Use interpretable features, validate with complex ones

4. Supervised vs Unsupervised

  • Supervised = optimize for an outcome (e.g., retention)
  • Unsupervised = discover natural groups
  • Balance: Start unsupervised, validate with outcomes

5. Predefined vs Discovered

  • Predefined = hypothesis-driven (test ideas)
  • Discovered = data-driven (find surprises)
  • Balance: Do both, compare results

Solution

Use clustering algorithms to discover natural groupings:

Approach:

1. Feature Selection - Choose behavioral and characteristic features
2. Clustering - Apply an algorithm (K-means, hierarchical, DBSCAN)
3. Validation - Ensure cohorts are meaningful and stable
4. Characterization - Describe what makes each cohort unique
5. Outcome Analysis - Determine how outcomes differ by cohort
6. Strategy Development - Tailor the approach per cohort
7. Monitoring - Track cohort evolution over time
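Steps 1-4 can be sketched end to end with a tiny hand-rolled k-means on hypothetical feature values (the full implementation below delegates clustering to scikit-learn; this is only an illustration of the mechanics):

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=42):
    """Toy k-means: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: centroid = mean of assigned points
        for c in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical families: [engagement_score, event_attendance_rate]
families = [[85, 90], [88, 95], [82, 88],   # an engaged group
            [30, 10], [35, 5], [28, 12]]    # a disengaged group
labels, centroids = kmeans(families, k=2)
# The two natural cohorts fall out as two distinct labels
```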

Popular Clustering Algorithms:

K-Means:

  • Specify the number of clusters (k)
  • Fast, simple, interpretable
  • Assumes spherical clusters
  • Good starting point

Hierarchical:

  • Creates a tree of clusters
  • No need to specify k upfront
  • Can cut the tree at different levels
  • Visual dendrogram is helpful

DBSCAN:

  • Finds arbitrary-shaped clusters
  • Automatically determines cluster count
  • Handles noise (outliers)
  • More complex to tune

Gaussian Mixture Models (GMM):

  • Probabilistic clustering
  • Soft assignment (probability of membership)
  • Handles overlapping clusters
  • More sophisticated
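The soft assignment GMM produces can be illustrated without fitting a full model: convert a family's distance to each centroid into a membership probability. A simplified sketch (softmax over negative squared distance with a hypothetical temperature, not true Gaussian densities):

```python
import math

def soft_assign(point, centroids, temperature=1000.0):
    """Membership probabilities from distances to centroids.
    A real GMM uses Gaussian densities and mixture weights;
    this softmax over negative squared distance just shows the idea."""
    scores = [math.exp(-sum((x - c) ** 2 for x, c in zip(point, cen)) / temperature)
              for cen in centroids]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical centroids: "Highly Engaged" vs "Silent Strugglers"
centroids = [[85.0, 90.0], [35.0, 10.0]]
probs = soft_assign([65.0, 60.0], centroids)  # a family between the two
# probs sums to 1.0; the nearer cohort gets the larger share
```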

Structure

Cohort Analysis Tables

-- Store discovered cohorts
CREATE TABLE cohorts (
  cohort_id INT PRIMARY KEY IDENTITY(1,1),

  cohort_name VARCHAR(100),
  cohort_description NVARCHAR(1000),

  -- Clustering details
  clustering_algorithm VARCHAR(50),  -- 'kmeans', 'hierarchical', 'dbscan'
  clustering_run_id INT,
  cluster_number INT,  -- Original cluster ID from algorithm

  -- Characteristics (what defines this cohort)
  characteristic_features NVARCHAR(MAX),  -- JSON
  avg_engagement_score DECIMAL(5,2),
  avg_risk_score DECIMAL(5,2),

  -- Size
  member_count INT,

  -- Outcomes
  retention_rate DECIMAL(5,2),
  avg_lifetime_value DECIMAL(10,2),

  -- Period
  analysis_date DATE NOT NULL,
  active BIT DEFAULT 1,

  CONSTRAINT UQ_cohort_name UNIQUE (cohort_name, analysis_date)
);

-- Cohort membership (which families belong to which cohorts)
CREATE TABLE cohort_membership (
  membership_id INT PRIMARY KEY IDENTITY(1,1),
  cohort_id INT NOT NULL,
  family_id INT NOT NULL,

  -- Assignment
  assignment_date DATE NOT NULL,
  membership_confidence DECIMAL(5,2),  -- For soft clustering (GMM)

  -- Historical tracking
  previous_cohort_id INT,  -- Which cohort were they in before?
  cohort_tenure_days INT,  -- How long in this cohort?

  CONSTRAINT FK_membership_cohort FOREIGN KEY (cohort_id)
    REFERENCES cohorts(cohort_id),
  CONSTRAINT FK_membership_family FOREIGN KEY (family_id)
    REFERENCES families(family_id)
);

-- Cohort-specific strategies
CREATE TABLE cohort_strategies (
  strategy_id INT PRIMARY KEY IDENTITY(1,1),
  cohort_id INT NOT NULL,

  strategy_name VARCHAR(200),
  strategy_description NVARCHAR(MAX),

  -- Recommended interventions for this cohort
  recommended_interventions NVARCHAR(MAX),  -- JSON
  communication_approach NVARCHAR(500),
  engagement_tactics NVARCHAR(500),

  -- Effectiveness
  times_applied INT DEFAULT 0,
  success_rate DECIMAL(5,2),

  active BIT DEFAULT 1,
  created_date DATETIME2 DEFAULT GETDATE(),

  CONSTRAINT FK_strategy_cohort FOREIGN KEY (cohort_id)
    REFERENCES cohorts(cohort_id)
);

-- Track cohort metrics over time
CREATE TABLE cohort_metrics_history (
  history_id INT PRIMARY KEY IDENTITY(1,1),
  cohort_id INT NOT NULL,

  snapshot_date DATE NOT NULL,
  member_count INT,
  avg_engagement_score DECIMAL(5,2),
  avg_risk_score DECIMAL(5,2),
  retention_rate DECIMAL(5,2),

  CONSTRAINT FK_history_cohort FOREIGN KEY (cohort_id)
    REFERENCES cohorts(cohort_id)
);
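The previous_cohort_id and cohort_tenure_days columns exist to track movement between reclustering runs. A sketch of how membership rows might be derived from old and new assignments (hypothetical helper; plain dicts stand in for the tables):

```python
from datetime import date

def build_membership_rows(new_assignments, old_memberships, assignment_date):
    """new_assignments: {family_id: cohort_id} from the latest clustering run.
    old_memberships: {family_id: (cohort_id, assignment_date)} from the prior run.
    Returns dicts shaped like cohort_membership rows."""
    rows = []
    for family_id, cohort_id in new_assignments.items():
        prev = old_memberships.get(family_id)
        rows.append({
            'family_id': family_id,
            'cohort_id': cohort_id,
            'assignment_date': assignment_date,
            # Which cohort were they in before? (None for new families)
            'previous_cohort_id': prev[0] if prev else None,
            # Days since the prior assignment approximates cohort tenure
            'cohort_tenure_days': (assignment_date - prev[1]).days if prev else 0,
        })
    return rows

rows = build_membership_rows(
    {101: 2, 102: 3},               # latest run: family 101 moved, 102 is new
    {101: (1, date(2025, 1, 1))},   # prior run
    date(2025, 4, 1),
)
```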

Implementation

Cohort Discovery Engine

class CohortDiscoveryEngine {
  constructor(db) {
    this.db = db;
  }

  async discoverCohorts(k = 8, algorithm = 'kmeans') {
    // Step 1: Extract features for all families
    const features = await this.extractFeaturesForClustering();

    // Step 2: Run clustering algorithm (call Python)
    const clusteringResult = await this.runClustering(features, k, algorithm);

    // Step 3: Validate cluster quality
    const validation = this.validateClusters(clusteringResult);

    if (validation.silhouette_score < 0.3) {
      console.warn('Poor cluster quality. Consider different k or algorithm.');
    }

    // Step 4: Characterize each cohort
    const cohorts = await this.characterizeCohorts(clusteringResult, features);

    // Step 5: Analyze outcomes by cohort
    const withOutcomes = await this.analyzeCohortOutcomes(cohorts);

    // Step 6: Save to database
    await this.saveCohorts(withOutcomes);

    // Step 7: Generate human-readable names
    await this.nameCohorts(withOutcomes);

    return withOutcomes;
  }

  async extractFeaturesForClustering() {
    const families = await this.db.query(`
      SELECT 
        f.family_id,
        f.family_name,

        -- Engagement features
        fem.engagement_score,
        fem.communication_score,
        fem.platform_engagement_score,
        fem.participation_score,
        fem.score_velocity,

        -- Risk features
        ra.withdrawal_risk,
        ra.payment_risk,

        -- Behavioral features
        DATEDIFF(DAY, f.enrollment_date, GETDATE()) as tenure_days,
        COALESCE(pmt.on_time_rate, 50) as payment_reliability,
        COALESCE(evt.attendance_rate, 0) as event_attendance_rate,
        COALESCE(vol.total_hours, 0) as volunteer_hours,
        COALESCE(ref.referral_count, 0) as referrals_made

      FROM families f
      LEFT JOIN family_engagement_metrics fem ON f.family_id = fem.family_id
      LEFT JOIN risk_assessments ra ON f.family_id = ra.family_id
      LEFT JOIN (
        SELECT family_id, 
          SUM(CASE WHEN outcome = 'paid_on_time' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as on_time_rate
        FROM interaction_log
        WHERE interaction_type = 'payment_received'
        GROUP BY family_id
      ) pmt ON f.family_id = pmt.family_id
      LEFT JOIN (
        SELECT family_id,
          SUM(CASE WHEN interaction_type = 'event_attended' THEN 1 ELSE 0 END) * 100.0 /
          NULLIF(SUM(CASE WHEN interaction_type = 'event_invited' THEN 1 ELSE 0 END), 0) as attendance_rate
        FROM interaction_log
        GROUP BY family_id
      ) evt ON f.family_id = evt.family_id
      LEFT JOIN (
        SELECT family_id,
          SUM(CAST(JSON_VALUE(metadata, '$.hours') AS DECIMAL)) as total_hours
        FROM interaction_log
        WHERE interaction_type = 'volunteer_hours_logged'
        GROUP BY family_id
      ) vol ON f.family_id = vol.family_id
      LEFT JOIN (
        SELECT family_id,
          COUNT(*) as referral_count
        FROM interaction_log
        WHERE interaction_type = 'referral_made'
        GROUP BY family_id
      ) ref ON f.family_id = ref.family_id

      WHERE f.enrolled_current_semester = 1
    `);

    return families;
  }

  async runClustering(features, k, algorithm) {
    // Call Python clustering script
    const { spawn } = require('child_process');
    const path = require('path');

    return new Promise((resolve, reject) => {
      const python = spawn('python3', [
        path.join(__dirname, 'clustering.py'),
        JSON.stringify(features),
        k.toString(),
        algorithm
      ]);

      let result = '';
      python.stdout.on('data', (data) => result += data.toString());
      python.stderr.on('data', (data) => console.error(data.toString()));

      python.on('close', (code) => {
        if (code !== 0) {
          reject(new Error(`Clustering failed with code ${code}`));
        } else {
          resolve(JSON.parse(result));
        }
      });
    });
  }

  validateClusters(clusteringResult) {
    // Cluster quality metrics returned from Python
    return {
      silhouette_score: clusteringResult.silhouette_score,
      calinski_harabasz_score: clusteringResult.calinski_harabasz_score,
      davies_bouldin_score: clusteringResult.davies_bouldin_score
    };
  }

  async characterizeCohorts(clusteringResult, features) {
    const cohorts = [];

    for (let i = 0; i < clusteringResult.n_clusters; i++) {
      // Get members of this cluster
      const members = clusteringResult.labels
        .map((label, idx) => label === i ? features[idx] : null)
        .filter(f => f !== null);

      // Calculate average characteristics
      const characteristics = this.calculateCohortCharacteristics(members);

      cohorts.push({
        cluster_number: i,
        member_count: members.length,
        members: members,
        characteristics: characteristics,
        centroid: clusteringResult.centroids[i]
      });
    }

    return cohorts;
  }

  calculateCohortCharacteristics(members) {
    const avg = (arr, key) =>
      arr.length ? arr.reduce((sum, m) => sum + (m[key] || 0), 0) / arr.length : 0;

    return {
      avg_engagement_score: avg(members, 'engagement_score'),
      avg_communication_score: avg(members, 'communication_score'),
      avg_platform_engagement: avg(members, 'platform_engagement_score'),
      avg_participation: avg(members, 'participation_score'),
      avg_score_velocity: avg(members, 'score_velocity'),
      avg_withdrawal_risk: avg(members, 'withdrawal_risk'),
      avg_payment_risk: avg(members, 'payment_risk'),
      avg_tenure_days: avg(members, 'tenure_days'),
      avg_payment_reliability: avg(members, 'payment_reliability'),
      avg_attendance_rate: avg(members, 'event_attendance_rate'),
      avg_volunteer_hours: avg(members, 'volunteer_hours'),
      avg_referrals: avg(members, 'referrals_made')
    };
  }

  async analyzeCohortOutcomes(cohorts) {
    for (const cohort of cohorts) {
      const memberIds = cohort.members.map(m => m.family_id);
      if (memberIds.length === 0) continue;  // avoid an empty IN () clause

      // Calculate retention rate
      const outcomes = await this.db.query(`
        SELECT 
          COUNT(*) as total,
          SUM(CASE WHEN enrollment_status = 'active' THEN 1 ELSE 0 END) as retained,
          SUM(CASE WHEN withdrawal_date IS NOT NULL THEN 1 ELSE 0 END) as withdrew
        FROM families
        WHERE family_id IN (${memberIds.join(',')})
      `);

      const o = outcomes[0];
      cohort.retention_rate = o.total > 0 ? (o.retained / o.total) * 100 : 0;
      cohort.withdrawal_rate = o.total > 0 ? (o.withdrew / o.total) * 100 : 0;

      // Calculate lifetime value (simplified)
      const ltv = await this.db.query(`
        SELECT AVG(total_paid) as avg_ltv
        FROM (
          SELECT family_id, SUM(amount) as total_paid
          FROM payments
          WHERE family_id IN (${memberIds.join(',')})
          GROUP BY family_id
        ) subq
      `);

      cohort.avg_lifetime_value = ltv[0]?.avg_ltv || 0;
    }

    return cohorts;
  }

  async nameCohorts(cohorts) {
    // Generate human-readable names based on characteristics

    for (const cohort of cohorts) {
      const c = cohort.characteristics;

      // Pattern matching to generate descriptive names
      if (c.avg_engagement_score > 80 && c.avg_referrals > 0.5) {
        cohort.name = "Champions & Advocates";
        cohort.description = "High engagement, active promoters, excellent retention";

      } else if (c.avg_engagement_score > 70 && c.avg_participation > 70) {
        cohort.name = "Highly Engaged";
        cohort.description = "Active participants, strong community connection";

      } else if (c.avg_engagement_score < 50 && c.avg_communication_score < 40) {
        cohort.name = "Silent Strugglers";
        cohort.description = "Low engagement, minimal communication, high risk";

      } else if (c.avg_payment_reliability < 60 && c.avg_engagement_score > 60) {
        cohort.name = "Engaged But Financially Stressed";
        cohort.description = "Active participation but payment challenges";

      } else if (c.avg_tenure_days < 180 && c.avg_engagement_score > 65) {
        cohort.name = "New & Promising";
        cohort.description = "Recently enrolled, showing strong early engagement";

      } else if (c.avg_tenure_days < 180 && c.avg_engagement_score < 50) {
        cohort.name = "Struggling Newcomers";
        cohort.description = "Recently enrolled, having difficulty connecting";

      } else if (c.avg_score_velocity < -3 && c.avg_engagement_score < 60) {
        cohort.name = "Declining Rapidly";
        cohort.description = "Engagement dropping fast, urgent intervention needed";

      } else if (c.avg_volunteer_hours > 5) {
        cohort.name = "Core Volunteers";
        cohort.description = "Heavy volunteer contributors, deeply committed";

      } else {
        cohort.name = `Cohort ${cohort.cluster_number + 1}`;
        cohort.description = "General population segment";
      }
    }

    return cohorts;
  }

  async saveCohorts(cohorts) {
    const today = new Date().toISOString().split('T')[0];

    for (const cohort of cohorts) {
      // Save cohort
      const cohortResult = await this.db.query(`
        INSERT INTO cohorts (
          cohort_name,
          cohort_description,
          clustering_algorithm,
          cluster_number,
          characteristic_features,
          avg_engagement_score,
          avg_risk_score,
          member_count,
          retention_rate,
          avg_lifetime_value,
          analysis_date
        )
        OUTPUT INSERTED.cohort_id  -- SQL Server; use RETURNING cohort_id on Postgres
        VALUES (?, ?, 'kmeans', ?, ?, ?, ?, ?, ?, ?, ?)
      `, [
        cohort.name,
        cohort.description,
        cohort.cluster_number,
        JSON.stringify(cohort.characteristics),
        cohort.characteristics.avg_engagement_score,
        cohort.characteristics.avg_withdrawal_risk,
        cohort.member_count,
        cohort.retention_rate,
        cohort.avg_lifetime_value,
        today
      ]);

      const cohortId = cohortResult[0].cohort_id;

      // Save memberships
      for (const member of cohort.members) {
        await this.db.query(`
          INSERT INTO cohort_membership (
            cohort_id,
            family_id,
            assignment_date
          ) VALUES (?, ?, ?)
        `, [cohortId, member.family_id, today]);
      }
    }
  }
}

module.exports = CohortDiscoveryEngine;

Python Clustering Script

# clustering.py
import sys
import json
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

def cluster_data(features_json, k, algorithm='kmeans'):
    # Parse features
    features_list = json.loads(features_json)
    df = pd.DataFrame(features_list)

    # Select numeric features for clustering
    feature_cols = [
        'engagement_score', 'communication_score', 'platform_engagement_score',
        'participation_score', 'score_velocity', 'withdrawal_risk', 'payment_risk',
        'tenure_days', 'payment_reliability', 'event_attendance_rate',
        'volunteer_hours', 'referrals_made'
    ]

    X = df[feature_cols].fillna(0)

    # Normalize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Apply clustering algorithm
    if algorithm == 'kmeans':
        model = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = model.fit_predict(X_scaled)
        centroids = scaler.inverse_transform(model.cluster_centers_)

    elif algorithm == 'hierarchical':
        model = AgglomerativeClustering(n_clusters=k)
        labels = model.fit_predict(X_scaled)
        # Calculate centroids manually
        centroids = []
        for i in range(k):
            cluster_points = X_scaled[labels == i]
            centroid = cluster_points.mean(axis=0)
            centroids.append(scaler.inverse_transform([centroid])[0])

    elif algorithm == 'dbscan':
        model = DBSCAN(eps=0.5, min_samples=5)
        labels = model.fit_predict(X_scaled)
        k = len(set(labels)) - (1 if -1 in labels else 0)  # Exclude noise
        # Calculate centroids
        centroids = []
        for i in range(k):
            cluster_points = X_scaled[labels == i]
            if len(cluster_points) > 0:
                centroid = cluster_points.mean(axis=0)
                centroids.append(scaler.inverse_transform([centroid])[0])

    elif algorithm == 'gmm':
        model = GaussianMixture(n_components=k, random_state=42)
        labels = model.fit_predict(X_scaled)
        centroids = scaler.inverse_transform(model.means_)

    else:
        raise ValueError(f"Unknown algorithm: {algorithm}")

    # Calculate quality metrics (require at least 2 distinct labels)
    n_labels = len(set(labels))
    silhouette = silhouette_score(X_scaled, labels) if n_labels > 1 else 0
    calinski = calinski_harabasz_score(X_scaled, labels) if n_labels > 1 else 0
    davies = davies_bouldin_score(X_scaled, labels) if n_labels > 1 else 0

    result = {
        'n_clusters': k,
        'labels': labels.tolist(),
        'centroids': [c.tolist() for c in centroids],
        'silhouette_score': float(silhouette),
        'calinski_harabasz_score': float(calinski),
        'davies_bouldin_score': float(davies),
        'feature_names': feature_cols
    }

    print(json.dumps(result))

if __name__ == "__main__":
    features_json = sys.argv[1]
    k = int(sys.argv[2])
    algorithm = sys.argv[3] if len(sys.argv) > 3 else 'kmeans'

    cluster_data(features_json, k, algorithm)
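For intuition about the silhouette score the script reports: each point gets s = (b - a) / max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster; the overall score averages s across points. A dependency-free sketch (sklearn's silhouette_score is what the script actually uses):

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [d(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:                 # singleton cluster: defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)     # cohesion: mean intra-cluster distance
        b = min(                    # separation: nearest other cluster
            sum(d(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score near 1.0
score = silhouette([[0, 0], [0, 1], [10, 10], [10, 11]], [0, 0, 1, 1])
```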

Usage Example

const engine = new CohortDiscoveryEngine(db);

// Discover cohorts
const cohorts = await engine.discoverCohorts(8, 'kmeans');

console.log(`\n=== DISCOVERED ${cohorts.length} COHORTS ===\n`);

cohorts.forEach((cohort, i) => {
  console.log(`COHORT ${i+1}: ${cohort.name}`);
  console.log(`Description: ${cohort.description}`);
  console.log(`Size: ${cohort.member_count} families`);
  console.log(`Retention Rate: ${cohort.retention_rate.toFixed(1)}%`);
  console.log(`Avg LTV: $${cohort.avg_lifetime_value.toFixed(0)}`);
  console.log(`\nKey Characteristics:`);
  console.log(`  Engagement: ${cohort.characteristics.avg_engagement_score.toFixed(1)}/100`);
  console.log(`  Withdrawal Risk: ${cohort.characteristics.avg_withdrawal_risk.toFixed(1)}/100`);
  console.log(`  Payment Reliability: ${cohort.characteristics.avg_payment_reliability.toFixed(1)}%`);
  console.log(`  Volunteer Hours: ${cohort.characteristics.avg_volunteer_hours.toFixed(1)}`);
  console.log(`\n`);
});

// Example output:
// === DISCOVERED 8 COHORTS ===
//
// COHORT 1: Champions & Advocates
// Description: High engagement, active promoters, excellent retention
// Size: 18 families
// Retention Rate: 97.2%
// Avg LTV: $2,450
//
// Key Characteristics:
//   Engagement: 89.3/100
//   Withdrawal Risk: 12.4/100
//   Payment Reliability: 98.7%
//   Volunteer Hours: 8.3
//
// COHORT 2: Silent Strugglers
// Description: Low engagement, minimal communication, high risk
// Size: 12 families
// Retention Rate: 25.0%
// Avg LTV: $890
//
// Key Characteristics:
//   Engagement: 38.2/100
//   Withdrawal Risk: 82.6/100
//   Payment Reliability: 68.3%
//   Volunteer Hours: 0.2

Variations

By Algorithm Choice

K-Means:

  • Simple, fast, interpretable
  • Need to specify k upfront
  • Works for spherical clusters
  • Good default choice

Hierarchical:

  • No need to specify k
  • Creates a cluster tree (dendrogram)
  • Can explore different granularities
  • Slower for large datasets

DBSCAN:

  • Finds arbitrary shapes
  • Auto-detects cluster count
  • Handles outliers well
  • Requires tuning eps/min_samples

Gaussian Mixture Models:

  • Probabilistic (soft) clustering
  • Handles overlapping cohorts
  • More sophisticated
  • Computationally expensive

By Feature Selection

Behavioral Only:

  • Engagement, participation, communication
  • Pure behavior-based segmentation
  • Independent of demographics

Demographic + Behavioral:

  • Add age, location, family size
  • Richer segmentation
  • May capture natural groups

Outcome-Optimized:

  • Weight features by outcome correlation
  • Segments optimized for retention/LTV
  • More predictive, less exploratory

By Update Frequency

Static (Annual):

  • Cluster once, use all year
  • Simple, stable
  • May become outdated

Periodic (Quarterly):

  • Recluster each quarter
  • Balances stability and freshness
  • Most common approach

Dynamic (Continuous):

  • Assign new members to the nearest cluster
  • Always current
  • Requires online clustering
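Between full reclustering runs, the dynamic variant reduces to nearest-centroid assignment. A minimal sketch (hypothetical centroids keyed by cohort_id; production code would first standardize features with the scaler fitted at clustering time):

```python
import math

def assign_to_nearest_cohort(features, centroids):
    """Return the cohort_id whose centroid is closest (Euclidean)."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda cid: d(features, centroids[cid]))

# Hypothetical centroids: [engagement_score, event_attendance_rate]
centroids = {1: [85.0, 90.0],   # "Highly Engaged"
             2: [35.0, 10.0]}   # "Silent Strugglers"
cohort_id = assign_to_nearest_cohort([40.0, 20.0], centroids)  # -> 2
```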

Consequences

Benefits

1. Discover hidden patterns: Find "Silent Strugglers" - a group manual segmentation would never have surfaced.

2. Targeted strategies: A different approach for each cohort (not one-size-fits-all).

3. Resource optimization: Focus on high-value or high-risk cohorts.

4. Benchmarking: Compare families to their cohort (not the global average).

5. Predictive power: Cohort membership predicts outcomes.

6. Learning at scale: Learn what works for each cohort type.
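Cohort-relative benchmarking can be as simple as a z-score against the cohort's own statistics rather than the global mean; a sketch with hypothetical numbers:

```python
import statistics

def cohort_z_score(value, cohort_values):
    """How far a family sits from its cohort's norm, in standard deviations."""
    mean = statistics.mean(cohort_values)
    stdev = statistics.pstdev(cohort_values)
    return 0.0 if stdev == 0 else (value - mean) / stdev

# Engagement of 55 looks weak against a global average of, say, 70,
# but is exactly typical for this (hypothetical) cohort
cohort_scores = [50, 55, 52, 58, 60]
z = cohort_z_score(55, cohort_scores)  # -> 0.0 (at the cohort mean)
```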

Costs

1. Requires sufficient data: Need 100+ families for meaningful clustering.

2. Interpretation challenge: Algorithms find patterns; humans must interpret them.

3. Membership fluidity: Families move between cohorts over time.

4. Naming difficulty: Converting clusters to human-understandable names takes judgment.

5. Computational cost: Clustering large datasets can be slow.

6. Validation complexity: How do you know the clusters are "good"?

Sample Code

Optimal k selection (elbow method):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_optimal_k(X_scaled, k_range=range(2, 15)):
    inertias = []
    silhouettes = []

    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(X_scaled)

        inertias.append(kmeans.inertia_)
        silhouettes.append(silhouette_score(X_scaled, labels))

    # Plot elbow curve
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    plt.plot(k_range, inertias, 'o-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method')

    plt.subplot(1, 2, 2)
    plt.plot(k_range, silhouettes, 'o-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Analysis')

    plt.tight_layout()
    plt.savefig('optimal_k.png')

    # Recommend k
    optimal_k = k_range[np.argmax(silhouettes)]
    print(f"Recommended k: {optimal_k} (highest silhouette score)")

    return optimal_k

Known Uses

Homeschool Co-op Intelligence Platform:

  • 8 cohorts discovered
  • "Silent Strugglers" identified (83% withdrawal without intervention)
  • "Champions & Advocates" receive a different engagement approach
  • Improved retention by 12% through targeted strategies

E-Commerce:

  • Customer segmentation is standard practice
  • RFM analysis (Recency, Frequency, Monetary)
  • Personalized recommendations per segment

Healthcare:

  • Patient cohorts for clinical trials
  • Risk stratification (Hierarchical Condition Categories)
  • Personalized treatment protocols

SaaS:

  • User segmentation by usage patterns
  • Churn-risk cohorts
  • Expansion-opportunity cohorts

Related Patterns

Requires:

  • Pattern 1: Universal Event Log - behavioral data for clustering
  • Pattern 6: Composite Health Scoring - features for clustering

Enables:

  • Pattern 17: Anomaly Detection - cohort baselines for anomalies
  • Pattern 18: Opportunity Mining - identify high-value cohorts
  • Pattern 15: Intervention Recommendation - cohort-specific recommendations

Enhanced by:

  • Pattern 11: Historical Pattern Matching - validate cohort patterns
  • Pattern 12: Risk Stratification Models - predict cohort transitions

References

On Unsupervised Learning:

  • Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. (Chapter 14: Unsupervised Learning) https://web.stanford.edu/~hastie/ElemStatLearn/
  • Aggarwal, Charu C., and Chandan K. Reddy, eds. Data Clustering: Algorithms and Applications. CRC Press, 2013.

On Clustering Algorithms:

  • Kaufman, Leonard, and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
  • Jain, Anil K. "Data Clustering: 50 Years Beyond K-Means." Pattern Recognition Letters 31(8), 2010: 651-666. (Survey of clustering methods)
  • Ester, Martin, et al. "A Density-Based Algorithm for Discovering Clusters (DBSCAN)." KDD 1996. (DBSCAN algorithm)

On Association Rule Mining:

  • Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast Algorithms for Mining Association Rules." VLDB 1994. (The Apriori algorithm)
  • Han, Jiawei, et al. Data Mining: Concepts and Techniques, 3rd Edition. Morgan Kaufmann, 2011. (Chapter 6: Mining Frequent Patterns)

On Anomaly Detection:

  • Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly Detection: A Survey." ACM Computing Surveys 41(3), 2009. (Comprehensive anomaly detection survey)
  • Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation Forest." ICDM 2008. (Isolation forest algorithm)

On Implementation:

  • Scikit-learn Clustering: https://scikit-learn.org/stable/modules/clustering.html (Clustering algorithms)
  • mlxtend (Apriori): https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/ (Association rules in Python)
  • PyOD: https://pyod.readthedocs.io/ (Python Outlier Detection library)

On Visualization:

  • van der Maaten, Laurens, and Geoffrey Hinton. "Visualizing Data using t-SNE." JMLR 9, 2008. https://lvdmaaten.github.io/tsne/ (Dimensionality reduction for visualization)
  • UMAP: https://umap-learn.readthedocs.io/ (Modern alternative to t-SNE)
