BeyondCX_Insights/docs/ARCHITECTURE.md

# CXInsights - System Architecture

## Product Vision

CXInsights turns 5,000-20,000 contact center calls into **executive RCA Trees** that identify the root causes of:
- **Lost Sales**: missed sales opportunities
- **Poor CX**: poor customer experiences

---

## Critical Design Principles

### 1. Strict Separation: Observed vs Inferred

**Every data point must be clearly classified as FACT or INFERENCE.**

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    OBSERVED vs INFERRED                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  OBSERVED (measurable facts)         INFERRED (model opinion)               │
│  ─────────────────────────           ──────────────────────────────        │
│  ✓ Call duration                     ✗ Customer sentiment                   │
│  ✓ Number of transfers               ✗ Reason for the lost sale             │
│  ✓ Hold time (measured)              ✗ Agent quality                        │
│  ✓ Detected silences (>N s)          ✗ Intent classification                │
│  ✓ Transcribed text                  ✗ Call summary                         │
│  ✓ Talk share per speaker (%)        ✗ Outcome (sale/no_sale/resolved)      │
│  ✓ Event timestamps                  ✗ RCA drivers                          │
│                                                                             │
│  Rule: if the LLM generates it → it is INFERRED                             │
│        if it comes from the audio/STT → it is OBSERVED                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

**Impact**: an RCA that holds up in front of stakeholders, a clear audit trail, and a clean separation of fact from opinion.
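
As an illustration of the rule above, provenance can be attached where a value is produced instead of being guessed afterwards. The sketch below is only one possible shape: `Source`, `TaggedValue`, `PRODUCER_SOURCE` and the producer names are hypothetical and are not taken from the CXInsights codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Source(str, Enum):
    OBSERVED = "observed"  # measured from the audio / STT output
    INFERRED = "inferred"  # generated by the LLM


# Hypothetical producer names; the real component names may differ.
PRODUCER_SOURCE = {
    "stt": Source.OBSERVED,
    "feature_extractor": Source.OBSERVED,
    "llm": Source.INFERRED,
}


@dataclass(frozen=True)
class TaggedValue:
    name: str
    value: Any
    source: Source


def tag(name: str, value: Any, producer: str) -> TaggedValue:
    """Attach provenance based on the component that produced the value."""
    return TaggedValue(name, value, PRODUCER_SOURCE[producer])


duration = tag("duration_seconds", 300, producer="stt")
sentiment = tag("sentiment", -0.3, producer="llm")
```

Because the tag depends only on the producer, a value generated by the LLM cannot be relabelled as a fact further down the pipeline.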

### 2. Mandatory Evidence per Driver

**Hard rule: no `evidence_spans` → the driver DOES NOT EXIST**

```json
{
  "rca_code": "WAIT_HOLD_LONG",
  "confidence": 0.77,
  "evidence_spans": [
    {"start": "02:14", "end": "03:52", "text": "[silence - hold]", "source": "observed"}
  ]
}
```

A driver without timestamped evidence is rejected by validation.
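
A minimal sketch of how this rule could be applied to a list of drivers, assuming each driver is a plain dict shaped like the JSON above. The function name `split_by_evidence` is hypothetical.

```python
def split_by_evidence(drivers: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (kept, rejected): a driver is kept only with timestamped evidence."""
    kept, rejected = [], []
    for driver in drivers:
        spans = driver.get("evidence_spans") or []
        # A span counts as evidence only if it carries both timestamps.
        has_evidence = any(s.get("start") and s.get("end") for s in spans)
        (kept if has_evidence else rejected).append(driver)
    return kept, rejected


drivers = [
    {"rca_code": "WAIT_HOLD_LONG", "confidence": 0.77,
     "evidence_spans": [{"start": "02:14", "end": "03:52", "text": "[silence - hold]"}]},
    {"rca_code": "AGENT_RUDE", "confidence": 0.90, "evidence_spans": []},
]
kept, rejected = split_by_evidence(drivers)
```

Note that the second driver is rejected even though its confidence is higher: confidence never substitutes for evidence.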

### 3. Prompt + Schema Versioning

**Every output includes version metadata for reproducibility.**

```json
{
  "_meta": {
    "schema_version": "1.0.0",
    "prompt_version": "call_analysis_v1.2",
    "model": "gpt-4o-mini",
    "model_version": "2024-07-18",
    "processed_at": "2024-01-15T10:30:00Z"
  }
}
```
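
One way to make sure no output is written without this block is to build `_meta` in a single helper and stamp every payload with it. The helper names `build_meta` and `stamp` below are assumptions for illustration, not the project's API.

```python
from datetime import datetime, timezone
from typing import Optional


def build_meta(schema_version: str, prompt_version: str, model: str,
               model_version: str, now: Optional[datetime] = None) -> dict:
    """Build the _meta block; `now` must be timezone-aware if provided."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "schema_version": schema_version,
        "prompt_version": prompt_version,
        "model": model,
        "model_version": model_version,
        "processed_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def stamp(payload: dict, meta: dict) -> dict:
    """Return a new dict with _meta as the first key."""
    return {"_meta": meta, **payload}


meta = build_meta("1.0.0", "call_analysis_v1.2", "gpt-4o-mini", "2024-07-18",
                  now=datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc))
output = stamp({"call_id": "c001"}, meta)
```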

### 4. Closed RCA Taxonomy + Emergent Channel

**Only codes from the enum. The single controlled exception: `OTHER_EMERGENT`**

```json
{
  "rca_code": "OTHER_EMERGENT",
  "proposed_label": "agent_rushed_due_to_queue_pressure",
  "evidence_spans": [...]
}
```

`OTHER_EMERGENT` entries are reviewed manually and promoted to the official taxonomy in the next version.
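
To support that manual review, emergent drivers can be grouped by `proposed_label` so that reviewers see the most frequent candidates first. This is a sketch under the assumption that drivers are plain dicts; `collect_emergent` is a hypothetical name.

```python
from collections import Counter


def collect_emergent(drivers: list[dict]) -> list[tuple[str, int]]:
    """Count proposed labels of OTHER_EMERGENT drivers, most frequent first."""
    counts = Counter(
        d["proposed_label"]
        for d in drivers
        if d.get("rca_code") == "OTHER_EMERGENT" and d.get("proposed_label")
    )
    return counts.most_common()


review_queue = collect_emergent([
    {"rca_code": "OTHER_EMERGENT", "proposed_label": "agent_rushed_due_to_queue_pressure"},
    {"rca_code": "PRICING_TOO_EXPENSIVE"},
    {"rca_code": "OTHER_EMERGENT", "proposed_label": "agent_rushed_due_to_queue_pressure"},
    {"rca_code": "OTHER_EMERGENT", "proposed_label": "promo_expired"},
])
```

The label `promo_expired` is invented for the example; only `agent_rushed_due_to_queue_pressure` comes from this document.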

### 5. Journey Events as Structure

**No free text. Typed objects with a timestamp.**

```json
{
  "journey_events": [
    {"type": "CALL_START", "t": "00:00"},
    {"type": "GREETING", "t": "00:03"},
    {"type": "TRANSFER", "t": "01:42"},
    {"type": "HOLD_START", "t": "02:10"},
    {"type": "HOLD_END", "t": "03:40"},
    {"type": "NEGATIVE_SENTIMENT_PEAK", "t": "04:05", "source": "inferred"},
    {"type": "RESOLUTION_ATTEMPT", "t": "05:20"},
    {"type": "CALL_END", "t": "06:15"}
  ]
}
```
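
Typed events make simple derivations mechanical. The sketch below assumes `mm:ss` timestamps as in the example and treats a missing `source` as `observed`; both are assumptions, and `JourneyEvent`, `to_seconds` and `hold_durations` are hypothetical names.

```python
from dataclasses import dataclass


def to_seconds(t: str) -> int:
    """Convert an 'mm:ss' timestamp to seconds."""
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + int(seconds)


@dataclass(frozen=True)
class JourneyEvent:
    type: str
    t: str
    source: str = "observed"  # assumption: default when the field is omitted

    @property
    def seconds(self) -> int:
        return to_seconds(self.t)


def hold_durations(events: list[JourneyEvent]) -> list[int]:
    """Pair each HOLD_START with the next HOLD_END and return durations in seconds."""
    durations, start = [], None
    for event in sorted(events, key=lambda e: e.seconds):
        if event.type == "HOLD_START":
            start = event.seconds
        elif event.type == "HOLD_END" and start is not None:
            durations.append(event.seconds - start)
            start = None
    return durations


events = [
    JourneyEvent("CALL_START", "00:00"),
    JourneyEvent("HOLD_START", "02:10"),
    JourneyEvent("HOLD_END", "03:40"),
    JourneyEvent("CALL_END", "06:15"),
]
holds = hold_durations(events)
```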

### 6. STT Adapter (No Lock-in)

**Abstract interface. The provider is swappable.**

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         TRANSCRIBER INTERFACE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Interface: Transcriber                                                     │
│  ├─ transcribe(audio_path) → TranscriptContract                            │
│  └─ transcribe_batch(paths) → List[TranscriptContract]                     │
│                                                                             │
│  Implementations:                                                           │
│  ├─ AssemblyAITranscriber (default)                                        │
│  ├─ WhisperTranscriber (local/offline)                                     │
│  ├─ GoogleSTTTranscriber (alternative)                                     │
│  └─ AWSTranscribeTranscriber (alternative)                                 │
│                                                                             │
│  TranscriptContract (normalized output):                                    │
│  ├─ call_id: str                                                           │
│  ├─ utterances: List[Utterance]                                            │
│  ├─ observed_events: List[ObservedEvent]                                   │
│  └─ metadata: TranscriptMetadata                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
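
The interface in the box could be expressed in Python roughly as follows. This is a sketch, not the project's implementation: the field types are assumptions, and `FakeTranscriber` is a stand-in that calls no real provider.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Utterance:
    speaker: str
    text: str
    start_ms: int
    end_ms: int


@dataclass
class TranscriptContract:
    call_id: str
    utterances: list[Utterance] = field(default_factory=list)
    observed_events: list[dict] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class Transcriber(ABC):
    @abstractmethod
    def transcribe(self, audio_path: str) -> TranscriptContract:
        """Transcribe one file into the normalized contract."""

    def transcribe_batch(self, paths: list[str]) -> list[TranscriptContract]:
        # Sequential default; a provider adapter may override this with parallel uploads.
        return [self.transcribe(p) for p in paths]


class FakeTranscriber(Transcriber):
    """Example-only adapter: derives call_id from the file name, no audio is read."""

    def transcribe(self, audio_path: str) -> TranscriptContract:
        return TranscriptContract(call_id=Path(audio_path).stem,
                                  metadata={"transcriber": "fake"})


transcripts = FakeTranscriber().transcribe_batch(["data/raw/c001.mp3", "data/raw/c002.wav"])
```

Downstream modules depend only on `TranscriptContract`, so swapping the provider means writing one new subclass.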

---

## End-to-End Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              CXINSIGHTS PIPELINE                                │
└─────────────────────────────────────────────────────────────────────────────────┘

INPUT                           PROCESSING                              OUTPUT
─────                           ──────────                              ──────

┌──────────────┐
│  5K-20K      │
│  Audio Files │
│  (.mp3/.wav) │
└──────┬───────┘
       │
       ▼
╔══════════════════════════════════════════════════════════════════════════════╗
║  MODULE 1: BATCH TRANSCRIPTION (via Transcriber Interface)                   ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  Transcriber Adapter (pluggable: AssemblyAI, Whisper, Google, AWS)     │  ║
║  │  ├─ Parallel uploads (configurable concurrency)                        │  ║
║  │  ├─ Spanish language model                                             │  ║
║  │  ├─ Speaker diarization (Agent vs Customer)                            │  ║
║  │  └─ Output: TranscriptContract (normalized)                            │  ║
║  └────────────────────────────────────────────────────────────────────────┘  ║
║         │                                                                     ║
║         ▼                                                                     ║
║  📁 data/transcripts/{call_id}.json (TranscriptContract)                     ║
╚══════════════════════════════════════════════════════════════════════════════╝
       │
       ▼
╔══════════════════════════════════════════════════════════════════════════════╗
║  MODULE 2: FEATURE EXTRACTION (OBSERVED ONLY)                                ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  Extracts ONLY measurable facts from the transcript:                   │  ║
║  │  ├─ Total duration                                                     │  ║
║  │  ├─ Agent vs customer talk share (ratio)                               │  ║
║  │  ├─ Silences > 5s (timestamp + duration)                               │  ║
║  │  ├─ Detected interruptions                                             │  ║
║  │  ├─ Transfers (if detectable from audio/metadata)                      │  ║
║  │  └─ Literal keywords (no interpretation)                               │  ║
║  │                                                                         │  ║
║  │  Output: observed_features (100% verifiable)                           │  ║
║  └────────────────────────────────────────────────────────────────────────┘  ║
║         │                                                                     ║
║         ▼                                                                     ║
║  📁 data/transcripts/{call_id}_features.json                                 ║
╚══════════════════════════════════════════════════════════════════════════════╝
       │
       ▼
╔══════════════════════════════════════════════════════════════════════════════╗
║  MODULE 3: PER-CALL INFERENCE (MAP) - Observed/Inferred Separation          ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  LLM Analysis (GPT-4o-mini / Claude 3.5 Sonnet)                        │  ║
║  │                                                                         │  ║
║  │  Input to the LLM:                                                     │  ║
║  │  ├─ Compressed transcript                                              │  ║
║  │  ├─ observed_features (factual context)                                │  ║
║  │  └─ RCA taxonomy (closed enum)                                         │  ║
║  │                                                                         │  ║
║  │  Structured output:                                                    │  ║
║  │  ├─ OBSERVED (pass-through, not inferred):                             │  ║
║  │  │   └─ observed_outcome (if explicit in audio: "sale closed")        │  ║
║  │  │                                                                      │  ║
║  │  ├─ INFERRED (with mandatory confidence + evidence):                   │  ║
║  │  │   ├─ intent: {code, confidence, evidence_spans[]}                   │  ║
║  │  │   ├─ outcome: {code, confidence, evidence_spans[]}                  │  ║
║  │  │   ├─ sentiment: {score, confidence, evidence_spans[]}               │  ║
║  │  │   ├─ lost_sale_driver: {rca_code, confidence, evidence_spans[]}    │  ║
║  │  │   ├─ poor_cx_driver: {rca_code, confidence, evidence_spans[]}      │  ║
║  │  │   └─ agent_quality: {scores{}, confidence, evidence_spans[]}       │  ║
║  │  │                                                                      │  ║
║  │  └─ JOURNEY_EVENTS (structured timeline):                              │  ║
║  │      └─ events[]: {type, t, source: observed|inferred}                │  ║
║  └────────────────────────────────────────────────────────────────────────┘  ║
║         │                                                                     ║
║         ▼                                                                     ║
║  📁 data/processed/{call_id}_analysis.json                                   ║
╚══════════════════════════════════════════════════════════════════════════════╝
       │
       ▼
╔══════════════════════════════════════════════════════════════════════════════╗
║  MODULE 4: VALIDATION & QUALITY GATE                                         ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  Strict validation before aggregating:                                 │  ║
║  │  ├─ Does every driver have evidence_spans? → If not, REJECT driver     │  ║
║  │  ├─ Is rca_code in the taxonomy? → If not, flag as OTHER_EMERGENT      │  ║
║  │  ├─ Confidence > threshold? → If not, flag as low_confidence           │  ║
║  │  ├─ Schema version match? → If not, ERROR                              │  ║
║  │  └─ Do journey events have valid timestamps?                           │  ║
║  │                                                                         │  ║
║  │  Output: validated_analysis.json + validation_report.json             │  ║
║  └────────────────────────────────────────────────────────────────────────┘  ║
╚══════════════════════════════════════════════════════════════════════════════╝
       │
       ▼
╔══════════════════════════════════════════════════════════════════════════════╗
║  MODULE 5: AGGREGATION (REDUCE)                                              ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  Statistical consolidation (validated data only):                      │  ║
║  │  ├─ Count by rca_code (closed taxonomy)                                │  ║
║  │  ├─ Distributions with confidence_weighted                             │  ║
║  │  ├─ Split: high_confidence vs low_confidence                           │  ║
║  │  ├─ List of OTHER_EMERGENT for manual review                           │  ║
║  │  ├─ Cross-tabs (intent × outcome × driver)                             │  ║
║  │  └─ Correlations observed_features ↔ inferred_outcomes                 │  ║
║  └────────────────────────────────────────────────────────────────────────┘  ║
║         │                                                                     ║
║         ▼                                                                     ║
║  📁 data/outputs/aggregated_stats.json                                       ║
║  📁 data/outputs/emergent_drivers_review.json                                ║
╚══════════════════════════════════════════════════════════════════════════════╝
       │
       ▼
╔══════════════════════════════════════════════════════════════════════════════╗
║  MODULE 6: RCA TREE GENERATION                                               ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  Tree construction (deterministic, no LLM):                           │  ║
║  │                                                                         │  ║
║  │  🔴 LOST SALES RCA TREE                                                │  ║
║  │  └─ Lost Sales (N=1,250, 25%)                                          │  ║
║  │     ├─ PRICING (45%, avg_conf=0.82)                                   │  ║
║  │     │  ├─ PRICING_TOO_EXPENSIVE (30%, n=375)                           │  ║
║  │     │  │  └─ evidence_samples: ["...", "..."]                          │  ║
║  │     │  └─ PRICING_COMPETITOR_CHEAPER (15%, n=187)                      │  ║
║  │     │     └─ evidence_samples: ["...", "..."]                         │  ║
║  │     └─ ...                                                             │  ║
║  │                                                                         │  ║
║  │  Each node includes:                                                   │  ║
║  │  ├─ rca_code (from the enum)                                           │  ║
║  │  ├─ count, pct                                                        │  ║
║  │  ├─ avg_confidence                                                    │  ║
║  │  ├─ evidence_samples[] (representative verbatims)                      │  ║
║  │  └─ call_ids[] (for drill-down)                                        │  ║
║  └────────────────────────────────────────────────────────────────────────┘  ║
║         │                                                                     ║
║         ▼                                                                     ║
║  📁 data/outputs/rca_lost_sales.json                                         ║
║  📁 data/outputs/rca_poor_cx.json                                            ║
╚══════════════════════════════════════════════════════════════════════════════╝
       │
       ▼
╔══════════════════════════════════════════════════════════════════════════════╗
║  MODULE 7: EXECUTIVE REPORTING                                               ║
║  ┌────────────────────────────────────────────────────────────────────────┐  ║
║  │  Output formats:                                                       │  ║
║  │  ├─ 📊 Streamlit Dashboard (with observed/inferred filter)             │  ║
║  │  ├─ 📑 PDF Executive Summary (includes confidence disclaimers)         │  ║
║  │  ├─ 📈 Excel with drill-down (links to evidence_spans)                 │  ║
║  │  └─ 🖼️ PNG of RCA trees (with confidence legend)                      │  ║
║  └────────────────────────────────────────────────────────────────────────┘  ║
╚══════════════════════════════════════════════════════════════════════════════╝
```
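
The REDUCE step of Module 5 can be pictured as a single pass over validated drivers. The sketch below assumes that "confidence-weighted" means each driver contributes its confidence as a weight, and that a confidence below the threshold counts as low. Both are interpretations, and `aggregate_drivers` is a hypothetical name.

```python
from collections import defaultdict


def aggregate_drivers(drivers: list[dict], threshold: float = 0.70) -> dict:
    """Count drivers per rca_code and compute a confidence-weighted share."""
    stats = defaultdict(lambda: {"count": 0, "high_confidence": 0,
                                 "low_confidence": 0, "confidence_sum": 0.0})
    for driver in drivers:
        entry = stats[driver["rca_code"]]
        entry["count"] += 1
        entry["confidence_sum"] += driver["confidence"]
        bucket = "low_confidence" if driver["confidence"] < threshold else "high_confidence"
        entry[bucket] += 1

    total_weight = sum(e["confidence_sum"] for e in stats.values())
    return {
        code: {**e, "weighted_share": e["confidence_sum"] / total_weight if total_weight else 0.0}
        for code, e in stats.items()
    }


stats = aggregate_drivers([
    {"rca_code": "PRICING_TOO_EXPENSIVE", "confidence": 0.9},
    {"rca_code": "PRICING_TOO_EXPENSIVE", "confidence": 0.6},
    {"rca_code": "WAIT_HOLD_LONG", "confidence": 0.5},
])
```

Since the step is plain counting and arithmetic, the same input always yields the same statistics, which keeps the aggregation auditable.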

---

## Data Model (Updated)

### TranscriptContract (Module 1 output)

```json
{
  "_meta": {
    "schema_version": "1.0.0",
    "transcriber": "assemblyai",
    "transcriber_version": "2024-07",
    "processed_at": "2024-01-15T10:30:00Z"
  },
  "call_id": "c001",
  "observed": {
    "duration_seconds": 367,
    "language_detected": "es",
    "speakers": [
      {"id": "A", "label": "agent", "talk_time_pct": 0.45},
      {"id": "B", "label": "customer", "talk_time_pct": 0.55}
    ],
    "utterances": [
      {
        "speaker": "A",
        "text": "Buenos días, gracias por llamar a Movistar...",
        "start_ms": 0,
        "end_ms": 3500
      }
    ],
    "detected_events": [
      {"type": "SILENCE", "start_ms": 72000, "end_ms": 80000, "duration_ms": 8000},
      {"type": "CROSSTALK", "start_ms": 45000, "end_ms": 46500}
    ]
  }
}
```
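
Given utterances with `start_ms` and `end_ms`, the observed features of Module 2 reduce to arithmetic. The sketch below assumes a silence is any gap longer than 5 seconds between consecutive utterances. The STT provider may instead report silences directly, and the function names are hypothetical.

```python
SILENCE_THRESHOLD_MS = 5000  # the "> 5s" rule from Module 2


def talk_share(utterances: list[dict]) -> dict:
    """Fraction of total spoken time per speaker."""
    totals: dict = {}
    for u in utterances:
        totals[u["speaker"]] = totals.get(u["speaker"], 0) + (u["end_ms"] - u["start_ms"])
    spoken = sum(totals.values())
    return {speaker: ms / spoken for speaker, ms in totals.items()} if spoken else {}


def find_silences(utterances: list[dict], threshold_ms: int = SILENCE_THRESHOLD_MS) -> list[dict]:
    """Return gaps between utterances that exceed the threshold."""
    silences, latest_end = [], None
    for u in sorted(utterances, key=lambda u: u["start_ms"]):
        if latest_end is not None and u["start_ms"] - latest_end > threshold_ms:
            silences.append({"type": "SILENCE", "start_ms": latest_end,
                             "end_ms": u["start_ms"],
                             "duration_ms": u["start_ms"] - latest_end})
        # Track the furthest end seen so overlapping speech is not counted as silence.
        latest_end = u["end_ms"] if latest_end is None else max(latest_end, u["end_ms"])
    return silences


utterances = [
    {"speaker": "A", "start_ms": 0, "end_ms": 3500},
    {"speaker": "B", "start_ms": 4000, "end_ms": 6000},
    {"speaker": "A", "start_ms": 14000, "end_ms": 16000},
]
share = talk_share(utterances)
silences = find_silences(utterances)
```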

### CallAnalysis (Module 3 output) - WITH OBSERVED/INFERRED SEPARATION

```json
{
  "_meta": {
    "schema_version": "1.0.0",
    "prompt_version": "call_analysis_v1.2",
    "model": "gpt-4o-mini",
    "model_version": "2024-07-18",
    "processed_at": "2024-01-15T10:35:00Z"
  },
  "call_id": "c001",

  "observed": {
    "duration_seconds": 367,
    "agent_talk_pct": 0.45,
    "customer_talk_pct": 0.55,
    "silence_total_seconds": 38,
    "silence_events": [
      {"start": "01:12", "end": "01:20", "duration_s": 8}
    ],
    "transfer_count": 0,
    "hold_events": [
      {"start": "02:14", "end": "03:52", "duration_s": 98}
    ],
    "explicit_outcome": null
  },

  "inferred": {
    "intent": {
      "code": "SALES_INQUIRY",
      "confidence": 0.91,
      "evidence_spans": [
        {"start": "00:15", "end": "00:28", "text": "Quería información sobre la fibra de 600 megas"}
      ]
    },

    "outcome": {
      "code": "SALE_LOST",
      "confidence": 0.85,
      "evidence_spans": [
        {"start": "05:40", "end": "05:52", "text": "Lo voy a pensar y ya les llamo yo"}
      ]
    },

    "sentiment": {
      "overall_score": -0.3,
      "evolution": [
        {"segment": "start", "score": 0.2},
        {"segment": "middle", "score": -0.1},
        {"segment": "end", "score": -0.6}
      ],
      "confidence": 0.78,
      "evidence_spans": [
        {"start": "04:10", "end": "04:25", "text": "Es que me parece carísimo, la verdad"}
      ]
    },

    "lost_sale_driver": {
      "rca_code": "PRICING_TOO_EXPENSIVE",
      "confidence": 0.83,
      "evidence_spans": [
        {"start": "03:55", "end": "04:08", "text": "59 euros al mes es mucho dinero"},
        {"start": "04:10", "end": "04:25", "text": "Es que me parece carísimo, la verdad"}
      ],
      "secondary_driver": {
        "rca_code": "PRICING_COMPETITOR_CHEAPER",
        "confidence": 0.71,
        "evidence_spans": [
          {"start": "04:30", "end": "04:45", "text": "En Vodafone me lo dejan por 45"}
        ]
      }
    },

    "poor_cx_driver": {
      "rca_code": "WAIT_HOLD_LONG",
      "confidence": 0.77,
      "evidence_spans": [
        {"start": "02:14", "end": "03:52", "text": "[hold - 98 seconds]", "source": "observed"}
      ]
    },

    "agent_quality": {
      "overall_score": 6,
      "dimensions": {
        "empathy": 7,
        "product_knowledge": 8,
        "objection_handling": 4,
        "closing_skills": 5
      },
      "confidence": 0.72,
      "evidence_spans": [
        {"start": "04:50", "end": "05:10", "text": "Bueno, es el precio que tenemos...", "dimension": "objection_handling"}
      ]
    },

    "summary": "Cliente interesado en fibra 600Mb abandona por precio (59€) comparando con Vodafone (45€). Hold largo de 98s. Agente no rebatió objeción de precio."
  },

  "journey_events": [
    {"type": "CALL_START", "t": "00:00", "source": "observed"},
    {"type": "GREETING", "t": "00:03", "source": "observed"},
    {"type": "INTENT_STATED", "t": "00:15", "source": "inferred"},
    {"type": "HOLD_START", "t": "02:14", "source": "observed"},
    {"type": "HOLD_END", "t": "03:52", "source": "observed"},
    {"type": "PRICE_OBJECTION", "t": "03:55", "source": "inferred"},
    {"type": "NEGATIVE_SENTIMENT_PEAK", "t": "04:10", "source": "inferred"},
    {"type": "COMPETITOR_MENTION", "t": "04:30", "source": "inferred"},
    {"type": "SOFT_DECLINE", "t": "05:40", "source": "inferred"},
    {"type": "CALL_END", "t": "06:07", "source": "observed"}
  ]
}
```

### RCA Tree Node (Module 6 output)

```json
{
  "_meta": {
    "schema_version": "1.0.0",
    "generated_at": "2024-01-15T11:00:00Z",
    "taxonomy_version": "rca_taxonomy_v1.0",
    "total_calls_analyzed": 5000,
    "confidence_threshold_used": 0.70
  },
  "tree_type": "lost_sales",
  "total_affected": {
    "count": 1250,
    "pct_of_total": 25.0
  },
  "root": {
    "label": "Lost Sales",
    "children": [
      {
        "rca_code": "PRICING",
        "label": "Pricing Issues",
        "count": 562,
        "pct_of_parent": 45.0,
        "avg_confidence": 0.82,
        "children": [
          {
            "rca_code": "PRICING_TOO_EXPENSIVE",
            "label": "Too Expensive",
            "count": 375,
            "pct_of_parent": 66.7,
            "avg_confidence": 0.84,
            "evidence_samples": [
              {"call_id": "c001", "text": "59 euros al mes es mucho dinero", "t": "03:55"},
              {"call_id": "c042", "text": "No puedo pagar tanto", "t": "02:30"}
            ],
            "call_ids": ["c001", "c042", "c078", "..."]
          },
          {
            "rca_code": "PRICING_COMPETITOR_CHEAPER",
            "label": "Competitor Cheaper",
            "count": 187,
            "pct_of_parent": 33.3,
            "avg_confidence": 0.79,
            "evidence_samples": [
              {"call_id": "c001", "text": "En Vodafone me lo dejan por 45", "t": "04:30"}
            ],
            "call_ids": ["c001", "c015", "..."]
          }
        ]
      }
    ]
  },
  "other_emergent": [
    {
      "proposed_label": "agent_rushed_due_to_queue_pressure",
      "count": 23,
      "evidence_samples": [
        {"call_id": "c234", "text": "Perdona que voy con prisa que hay cola", "t": "01:15"}
      ],
      "recommendation": "Consider adding to taxonomy v1.1"
    }
  ]
}
```
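
Because tree construction is deterministic, a cluster node can be computed directly from validated drivers. The sketch below assumes each driver dict carries `call_id`, `rca_code` and `confidence`, and that a `code_to_cluster` mapping is derived from the taxonomy. `build_cluster_node` is a hypothetical name.

```python
def build_cluster_node(cluster: str, code_to_cluster: dict, drivers: list[dict]) -> dict:
    """Build one cluster node with per-code children, largest child first."""
    by_code: dict = {}
    for d in drivers:
        if code_to_cluster.get(d["rca_code"]) == cluster:
            by_code.setdefault(d["rca_code"], []).append(d)

    total = sum(len(items) for items in by_code.values())
    children = [
        {
            "rca_code": code,
            "count": len(items),
            "pct_of_parent": round(100 * len(items) / total, 1),
            "avg_confidence": round(sum(i["confidence"] for i in items) / len(items), 2),
            "call_ids": sorted(i["call_id"] for i in items),
        }
        for code, items in sorted(by_code.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    ]
    return {"rca_code": cluster, "count": total, "children": children}


mapping = {"PRICING_TOO_EXPENSIVE": "PRICING", "PRICING_COMPETITOR_CHEAPER": "PRICING"}
node = build_cluster_node("PRICING", mapping, [
    {"call_id": "c001", "rca_code": "PRICING_TOO_EXPENSIVE", "confidence": 0.8},
    {"call_id": "c042", "rca_code": "PRICING_TOO_EXPENSIVE", "confidence": 0.9},
    {"call_id": "c015", "rca_code": "PRICING_COMPETITOR_CHEAPER", "confidence": 0.7},
])
```

Sorting by count and then by code keeps the output stable across runs, which matters when two trees are compared between taxonomy versions.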

---

## RCA Taxonomy (config/rca_taxonomy.yaml)

```yaml
# config/rca_taxonomy.yaml
# Version: 1.0.0
# Last updated: 2024-01-15

_meta:
  version: "1.0.0"
  author: "CXInsights Team"
  description: "Closed taxonomy for RCA classification. Only these codes are valid."

# ============================================================================
# INTENTS (Reason for the call)
# ============================================================================
intents:
  - SALES_INQUIRY           # Sales inquiry
  - SALES_UPGRADE           # Product upgrade
  - SUPPORT_TECHNICAL       # Technical support
  - SUPPORT_BILLING         # Billing inquiry
  - COMPLAINT               # Complaint/claim
  - CANCELLATION            # Cancellation request
  - GENERAL_INQUIRY         # General inquiry
  - OTHER_EMERGENT          # Captures new intents

# ============================================================================
# OUTCOMES (Result of the call)
# ============================================================================
outcomes:
  - SALE_COMPLETED          # Sale closed
  - SALE_LOST               # Sale lost
  - ISSUE_RESOLVED          # Issue resolved
  - ISSUE_UNRESOLVED        # Issue not resolved
  - ESCALATED               # Escalated to supervisor/another dept
  - CALLBACK_SCHEDULED      # Callback scheduled
  - OTHER_EMERGENT

# ============================================================================
# LOST SALE DRIVERS (Why the sale was lost)
# ============================================================================
lost_sale_drivers:

  # Pricing cluster
  PRICING:
    - PRICING_TOO_EXPENSIVE         # "It's too expensive"
    - PRICING_COMPETITOR_CHEAPER    # "X offers it to me cheaper"
    - PRICING_NO_DISCOUNT           # No discount was offered
    - PRICING_PAYMENT_TERMS         # Unacceptable payment terms

  # Product fit cluster
  PRODUCT_FIT:
    - PRODUCT_FEATURE_MISSING       # Required feature missing
    - PRODUCT_WRONG_OFFERED         # Wrong product offered
    - PRODUCT_COVERAGE_AREA         # No coverage in the customer's area
    - PRODUCT_TECH_REQUIREMENTS     # Does not meet technical requirements

  # Process cluster
  PROCESS:
    - PROCESS_TOO_COMPLEX           # Process too complicated
    - PROCESS_DOCUMENTATION         # Requires too much documentation
    - PROCESS_ACTIVATION_TIME       # Long activation time
    - PROCESS_CONTRACT_TERMS        # Unacceptable contract terms

  # Agent cluster
  AGENT:
    - AGENT_COULDNT_CLOSE           # Did not close the sale
    - AGENT_POOR_OBJECTION          # Poor objection handling
    - AGENT_LACK_URGENCY            # Did not create urgency
    - AGENT_MISSED_UPSELL           # Missed upsell opportunity

  # Timing cluster
  TIMING:
    - TIMING_NOT_READY              # Customer is not ready
    - TIMING_COMPARING              # Comparing options
    - TIMING_BUDGET_PENDING         # Budget pending

  # Catch-all
  OTHER_EMERGENT: []

# ============================================================================
# POOR CX DRIVERS (Why the experience was poor)
# ============================================================================
poor_cx_drivers:

  # Wait time cluster
  WAIT_TIME:
    - WAIT_INITIAL_LONG             # Long initial wait (>2min)
    - WAIT_HOLD_LONG                # Long hold during the call (>1min)
    - WAIT_CALLBACK_NEVER           # Promised callback never came

  # Resolution cluster
  RESOLUTION:
    - RESOLUTION_NOT_ACHIEVED       # Issue not resolved
    - RESOLUTION_NEEDED_ESCALATION  # Needed escalation
    - RESOLUTION_CALLBACK_BROKEN    # Promised callback not honored
    - RESOLUTION_INCORRECT          # Incorrect resolution

  # Agent behavior cluster
  AGENT_BEHAVIOR:
    - AGENT_LACK_EMPATHY            # Lack of empathy
    - AGENT_RUDE                    # Rude/dismissive
    - AGENT_RUSHED                  # Rushed
    - AGENT_NOT_LISTENING           # Was not listening

  # Information cluster
  INFORMATION:
    - INFO_WRONG_GIVEN              # Wrong information
    - INFO_INCONSISTENT             # Inconsistent information
    - INFO_COULDNT_ANSWER           # Could not answer

  # Process/System cluster
  PROCESS_SYSTEM:
    - SYSTEM_DOWN                   # System down
    - POLICY_LIMITATION             # Policy limitation
    - TOO_MANY_TRANSFERS            # Too many transfers
    - AUTH_ISSUES                   # Authentication problems

  # Catch-all
  OTHER_EMERGENT: []

# ============================================================================
# JOURNEY EVENT TYPES (Timeline events)
# ============================================================================
journey_event_types:
  # Observed (come from the audio/STT)
  observed:
    - CALL_START
    - CALL_END
    - GREETING
    - SILENCE                       # >5 seconds
    - HOLD_START
    - HOLD_END
    - TRANSFER
    - CROSSTALK                     # Both speak at once

  # Inferred (come from the LLM)
  inferred:
    - INTENT_STATED
    - PRICE_OBJECTION
    - COMPETITOR_MENTION
    - NEGATIVE_SENTIMENT_PEAK
    - POSITIVE_SENTIMENT_PEAK
    - RESOLUTION_ATTEMPT
    - SOFT_DECLINE
    - HARD_DECLINE
    - COMMITMENT
    - ESCALATION_REQUEST
```
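
For validation, the nested taxonomy has to be flattened into a set of valid codes. The sketch below works on an already-parsed dict, such as a YAML loader would return for the file above. It assumes only leaf codes are valid for per-call drivers, with an empty cluster like `OTHER_EMERGENT: []` counting as a code itself. `flatten_codes` is a hypothetical helper.

```python
def flatten_codes(section) -> set:
    """Collect valid codes from a taxonomy section (flat list or dict of clusters)."""
    if isinstance(section, list):
        return set(section)
    codes = set()
    for cluster, members in section.items():
        if members:
            codes.update(members)  # cluster with leaf codes
        else:
            codes.add(cluster)     # e.g. OTHER_EMERGENT: []
    return codes


# Excerpt of the parsed taxonomy, written inline to keep the example self-contained.
taxonomy_excerpt = {
    "intents": ["SALES_INQUIRY", "OTHER_EMERGENT"],
    "lost_sale_drivers": {
        "PRICING": ["PRICING_TOO_EXPENSIVE", "PRICING_COMPETITOR_CHEAPER"],
        "OTHER_EMERGENT": [],
    },
}
valid_lost_sale_codes = flatten_codes(taxonomy_excerpt["lost_sale_drivers"])
```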

---

## Component Diagram (Updated)

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           CXINSIGHTS COMPONENTS                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    TRANSCRIBER INTERFACE (Adapter Pattern)          │   │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │   │
│  │  │ AssemblyAI   │ │   Whisper    │ │  Google STT  │ │    AWS     │ │   │
│  │  │ Transcriber  │ │ Transcriber  │ │ Transcriber  │ │ Transcribe │ │   │
│  │  └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └─────┬──────┘ │   │
│  │         └────────────────┴────────────────┴───────────────┘        │   │
│  │                              ▼                                      │   │
│  │                    TranscriptContract (normalized output)           │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                 │                                           │
│                                 ▼                                           │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐        │
│  │    Feature      │    │    Inference    │    │   Validation    │        │
│  │   Extractor     │───▶│     Service     │───▶│     Gate        │        │
│  │ (observed only) │    │ (observed/infer)│    │ (evidence check)│        │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘        │
│                                                         │                  │
│                                                         ▼                  │
│  ┌─────────────────────────────────────────────────────────────────────┐  │
│  │                         AGGREGATION LAYER                            │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │  │
│  │  │ Stats Engine │  │  RCA Builder │  │   Emergent   │              │  │
│  │  │ (by rca_code)│  │(deterministic│  │   Collector  │              │  │
│  │  └──────────────┘  └──────────────┘  └──────────────┘              │  │
│  └─────────────────────────────────────────────────────────────────────┘  │
│                                 │                                           │
│                                 ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐  │
│  │                      VISUALIZATION LAYER                             │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐    │  │
│  │  │ Dashboard  │  │    PDF     │  │   Excel    │  │    PNG     │    │  │
│  │  │(obs/infer) │  │ (disclaim) │  │(drill-down)│  │  (legend)  │    │  │
│  │  └────────────┘  └────────────┘  └────────────┘  └────────────┘    │  │
│  └─────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐  │
│  │                         CONFIG LAYER                                 │  │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐        │  │
│  │  │ rca_taxonomy   │  │ prompts/ +     │  │   settings     │        │  │
│  │  │ v1.0 (enum)    │  │ VERSION FILE   │  │    (.env)      │        │  │
│  │  └────────────────┘  └────────────────┘  └────────────────┘        │  │
│  └─────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Validation Rules (Quality Gate)

```python
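# Validation sketch. CallAnalysis is the Module 3 output model; TAXONOMY is a
# placeholder for the set of codes loaded from config/rca_taxonomy.yaml.
import re
from dataclasses import dataclass, field

EXPECTED_SCHEMA_VERSION = "1.0.0"
CONFIDENCE_THRESHOLD = 0.70
TAXONOMY: set[str] = set()
TIMESTAMP_RE = re.compile(r"^\d{2,}:[0-5]\d$")  # assumes "mm:ss" timestamps


@dataclass
class ValidationResult:
    valid: bool
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def is_valid_timestamp(t) -> bool:
    return isinstance(t, str) and TIMESTAMP_RE.match(t) is not None


def validate_call_analysis(analysis: "CallAnalysis") -> ValidationResult:
    errors = []
    warnings = []

    # Rules 1-3 apply to each driver that is present
    drivers = [analysis.inferred.lost_sale_driver, analysis.inferred.poor_cx_driver]
    for driver in drivers:
        if driver is None:
            continue

        # RULE 1: every driver must have evidence_spans
        if not driver.evidence_spans:
            errors.append(f"Driver {driver.rca_code} has no evidence_spans → REJECTED")

        # RULE 2: rca_code must be in the taxonomy; OTHER_EMERGENT needs a label
        if driver.rca_code == "OTHER_EMERGENT":
            if not getattr(driver, "proposed_label", None):
                errors.append("OTHER_EMERGENT requires proposed_label")
        elif driver.rca_code not in TAXONOMY:
            errors.append(f"rca_code {driver.rca_code} is not in the taxonomy")

        # RULE 3: minimum confidence
        if driver.confidence < CONFIDENCE_THRESHOLD:
            warnings.append(f"Driver {driver.rca_code} has low confidence: {driver.confidence}")

    # RULE 4: schema version must match
    if analysis._meta.schema_version != EXPECTED_SCHEMA_VERSION:
        errors.append(f"Schema mismatch: {analysis._meta.schema_version}")

    # RULE 5: journey events must have valid timestamps
    for event in analysis.journey_events:
        if not is_valid_timestamp(event.t):
            errors.append(f"Invalid timestamp in event: {event}")

    return ValidationResult(valid=not errors, errors=errors, warnings=warnings)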
```

---

## Prompt Versioning

```
config/prompts/
├── versions.yaml                    # Version registry
├── call_analysis/
│   ├── v1.0/
│   │   ├── system.txt
│   │   ├── user.txt
│   │   └── schema.json              # Expected JSON Schema
│   ├── v1.1/
│   │   ├── system.txt
│   │   ├── user.txt
│   │   └── schema.json
│   └── v1.2/                        # Current
│       ├── system.txt
│       ├── user.txt
│       └── schema.json
└── rca_synthesis/
    └── v1.0/
        ├── system.txt
        └── user.txt
```

```yaml
# config/prompts/versions.yaml
current:
  call_analysis: "v1.2"
  rca_synthesis: "v1.0"

history:
  call_analysis:
    v1.0: "2024-01-01"
    v1.1: "2024-01-10"  # Added secondary_driver support
    v1.2: "2024-01-15"  # Added journey_events structure
```
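
With this registry, locating the active prompt files and building the `prompt_version` tag stored in `_meta` is a lookup. The sketch below takes the parsed registry as a dict; the helper names and the default root are assumptions based on the layout shown above.

```python
from pathlib import Path


def prompt_dir(versions: dict, name: str, root: str = "config/prompts") -> Path:
    """Directory holding system.txt / user.txt for the current version of a prompt."""
    return Path(root) / name / versions["current"][name]


def prompt_version_tag(versions: dict, name: str) -> str:
    """Tag in the same format as _meta.prompt_version, e.g. call_analysis_v1.2."""
    return f"{name}_{versions['current'][name]}"


versions = {"current": {"call_analysis": "v1.2", "rca_synthesis": "v1.0"}}
active_dir = prompt_dir(versions, "call_analysis")
tag_value = prompt_version_tag(versions, "call_analysis")
```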

---

## Estimates

### Total Time (5,000 calls, ~4 min average)

| Stage | Estimated Time |
|-------|----------------|
| Transcription | 3-4 hours |
| Feature Extraction | 15 min |
| Inference | 2-3 hours |
| Validation | 10 min |
| Aggregation | 10 min |
| RCA Tree Build | 5 min |
| Reporting | 5 min |
| **Total** | **6-8 hours** |

### Costs (see TECH_STACK.md for details)

| Volumen | Transcription | Inference | Total |
|---------|---------------|-----------|-------|
| 5,000 calls | ~$300 | ~$15 | ~$315 |
| 20,000 calls | ~$1,200 | ~$60 | ~$1,260 |
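
Both rows of the table imply the same per-call rates ($300 / 5,000 = $0.06 for transcription, $15 / 5,000 = $0.003 for inference), so other volumes can be estimated by linear scaling. The sketch below assumes that linearity holds, which ignores any volume discounts or changes in average call length.

```python
# Per-call rates derived from the 5,000-call row of the table above.
TRANSCRIPTION_USD_PER_CALL = 300 / 5000
INFERENCE_USD_PER_CALL = 15 / 5000


def estimate_cost(n_calls: int) -> dict:
    """Estimate costs in USD, assuming cost scales linearly with call count."""
    transcription = n_calls * TRANSCRIPTION_USD_PER_CALL
    inference = n_calls * INFERENCE_USD_PER_CALL
    return {
        "transcription": round(transcription, 2),
        "inference": round(inference, 2),
        "total": round(transcription + inference, 2),
    }


estimate_20k = estimate_cost(20_000)
```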

---

## Implementation Status (2026-01-19)

| Module | Status | Location |
|--------|--------|----------|
| Transcription | ✅ Done | `src/transcription/` |
| Feature Extraction | ✅ Done | `src/features/` |
| Compression | ✅ Done | `src/compression/` |
| Inference | ✅ Done | `src/inference/` |
| Validation | ✅ Done | Built into models |
| Aggregation | ✅ Done | `src/aggregation/` |
| RCA Trees | ✅ Done | `src/aggregation/rca_tree.py` |
| Pipeline | ✅ Done | `src/pipeline/` |
| Exports | ✅ Done | `src/exports/` |
| CLI | ✅ Done | `cli.py` |

**Last updated**: 2026-01-19 | **Version**: 1.0.0