# CXInsights - Project Structure

## Full Folder Tree

```
cxinsights/
│
├── 📁 data/                          # Data (git-ignored except .gitkeep)
│   ├── raw/                          # Original input
│   │   ├── audio/                    # Audio files (.mp3, .wav)
│   │   │   └── batch_2024_01/
│   │   │       ├── call_001.mp3
│   │   │       └── ...
│   │   └── metadata/                 # CSV with optional metadata
│   │       └── calls_metadata.csv
│   │
│   ├── transcripts/                  # STT output
│   │   └── batch_2024_01/
│   │       ├── raw/                  # Original STT transcripts
│   │       │   └── call_001.json
│   │       └── compressed/           # Reduced transcripts for the LLM
│   │           └── call_001.json
│   │
│   ├── features/                     # Feature extraction output (OBSERVED)
│   │   └── batch_2024_01/
│   │       └── call_001_features.json
│   │
│   ├── processed/                    # LLM output (labels with INFERRED)
│   │   └── batch_2024_01/
│   │       └── call_001_labels.json
│   │
│   ├── outputs/                      # Final output
│   │   └── batch_2024_01/
│   │       ├── aggregated_stats.json
│   │       ├── call_matrix.csv
│   │       ├── rca_lost_sales.json
│   │       ├── rca_poor_cx.json
│   │       ├── emergent_drivers_review.json
│   │       ├── executive_summary.pdf
│   │       ├── full_analysis.xlsx
│   │       └── figures/
│   │           ├── rca_tree_lost_sales.png
│   │           └── rca_tree_poor_cx.png
│   │
│   ├── .checkpoints/                 # Pipeline state for resume
│   │   ├── transcription_state.json
│   │   ├── features_state.json
│   │   ├── inference_state.json
│   │   └── pipeline_state.json
│   │
│   └── logs/                         # Execution logs
│       └── pipeline_2024_01_15.log
│
├── 📁 src/                           # Source code
│   ├── __init__.py
│   │
│   ├── 📁 transcription/             # Module 1: STT (transcription ONLY)
│   │   ├── __init__.py
│   │   ├── base.py                   # Abstract Transcriber interface
│   │   ├── assemblyai_client.py      # AssemblyAI implementation
│   │   ├── whisper_client.py         # Whisper implementation (future)
│   │   ├── batch_processor.py        # Parallel processing
│   │   ├── compressor.py             # Text reduction for the LLM ONLY
│   │   └── models.py                 # Pydantic models: TranscriptContract
│   │
│   ├── 📁 features/                  # Module 2: OBSERVED extraction
│   │   ├── __init__.py
│   │   ├── turn_metrics.py           # talk ratio, interruptions, silence duration
│   │   ├── event_detector.py         # HOLD, TRANSFER, SILENCE events
│   │   └── models.py                 # Pydantic models: ObservedFeatures, Event
│   │
│   ├── 📁 inference/                 # Module 3: LLM Analysis (INFERRED)
│   │   ├── __init__.py
│   │   ├── client.py                 # OpenAI/Anthropic client wrapper
│   │   ├── prompt_manager.py         # Loads and renders versioned prompts
│   │   ├── analyzer.py               # Per-call analysis → CallLabels
│   │   ├── batch_analyzer.py         # Batch processing with rate limiting
│   │   ├── rca_synthesizer.py        # (optional) Narrative RCA synthesis via LLM
│   │   └── models.py                 # CallLabels, InferredData, EvidenceSpan
│   │
│   ├── 📁 validation/                # Module 4: Quality Gate
│   │   ├── __init__.py
│   │   ├── validator.py              # Validation of evidence_spans, taxonomy, etc.
│   │   ├── schema_checker.py         # schema_version verification
│   │   └── models.py                 # ValidationResult, ValidationError
│   │
│   ├── 📁 aggregation/               # Module 5-6: Stats + RCA (DETERMINISTIC)
│   │   ├── __init__.py
│   │   ├── stats_engine.py           # Statistical calculations (pandas + DuckDB)
│   │   ├── rca_builder.py            # DETERMINISTIC construction of the RCA tree
│   │   ├── emergent_collector.py     # Collects OTHER_EMERGENT for review
│   │   ├── correlations.py           # Observed↔inferred correlation analysis
│   │   └── models.py                 # AggregatedStats, RCATree, RCANode
│   │
│   ├── 📁 visualization/             # Module 7: Reports (presentation ONLY)
│   │   ├── __init__.py
│   │   ├── dashboard.py              # Streamlit app
│   │   ├── charts.py                 # Chart generation (plotly/matplotlib)
│   │   ├── tree_renderer.py          # Renders RCA trees as PNG/SVG
│   │   ├── pdf_report.py             # Executive PDF generation
│   │   └── excel_export.py           # Excel export with drill-down
│   │
│   ├── 📁 pipeline/                  # Orchestration
│   │   ├── __init__.py
│   │   ├── orchestrator.py           # Main pipeline
│   │   ├── stages.py                 # Stage definitions
│   │   ├── checkpoint.py             # Checkpoint management
│   │   └── cli.py                    # Command-line interface
│   │
│   └── 📁 utils/                     # Shared utilities
│       ├── __init__.py
│       ├── file_io.py                # File reading/writing
│       ├── logging_config.py         # Logging setup
│       └── validators.py             # Audio file validation
│
├── 📁 config/                        # Configuration
│   ├── rca_taxonomy.yaml             # Closed driver taxonomy (versioned)
│   ├── settings.yaml                 # General config (no secrets)
│   │
│   └── 📁 prompts/                   # LLM prompt templates (versioned)
│       ├── versions.yaml             # Registry of active versions
│       ├── call_analysis/
│       │   └── v1.2/
│       │       ├── system.txt
│       │       ├── user.txt
│       │       └── schema.json
│       └── rca_synthesis/
│           └── v1.0/
│               ├── system.txt
│               └── user.txt
│
├── 📁 tests/                         # Tests
│   ├── __init__.py
│   ├── conftest.py                   # Shared fixtures
│   │
│   ├── 📁 fixtures/                  # Test data
│   │   ├── sample_audio/
│   │   │   └── test_call.mp3
│   │   ├── sample_transcripts/
│   │   │   ├── raw/
│   │   │   └── compressed/
│   │   ├── sample_features/
│   │   └── expected_outputs/
│   │
│   ├── 📁 unit/                      # Unit tests
│   │   ├── test_transcription.py
│   │   ├── test_features.py
│   │   ├── test_inference.py
│   │   ├── test_validation.py
│   │   ├── test_aggregation.py
│   │   └── test_visualization.py
│   │
│   └── 📁 integration/               # Integration tests
│       └── test_pipeline.py
│
├── 📁 notebooks/                     # Jupyter notebooks for EDA
│   ├── 01_eda_transcripts.ipynb
│   ├── 02_feature_exploration.ipynb
│   ├── 03_prompt_testing.ipynb
│   ├── 04_aggregation_validation.ipynb
│   └── 05_visualization_prototypes.ipynb
│
├── 📁 scripts/                       # Helper scripts
│   ├── estimate_costs.py             # Cost estimator to run before execution
│   ├── validate_audio.py             # Validate audio files
│   └── sample_calls.py               # Extract a sample for testing
│
├── 📁 docs/                          # Documentation
│   ├── ARCHITECTURE.md
│   ├── TECH_STACK.md
│   ├── PROJECT_STRUCTURE.md          # This document
│   ├── DEPLOYMENT.md
│   └── PROMPTS.md                    # Prompt documentation
│
├── .env.example                      # Environment variable template
├── .gitignore
├── pyproject.toml                    # Dependencies and metadata
├── Makefile                          # Useful commands
└── README.md                         # Main documentation
```

---

## Module Responsibilities

### 📁 `src/transcription/`

**Purpose**: Convert audio to text with diarization. **STT ONLY, no analytics.**

| File | Responsibility |
|------|----------------|
| `base.py` | Abstract `Transcriber` interface. Defines the output contract. |
| `assemblyai_client.py` | AssemblyAI implementation. Handles auth, upload, polling. |
| `whisper_client.py` | Local Whisper implementation (future). |
| `batch_processor.py` | Processes N files in parallel. Manages concurrency. |
| `compressor.py` | **Text reduction ONLY**: strips filler words, normalizes, shortens for the LLM. **Does NOT extract features.** |
| `models.py` | `TranscriptContract`, `Utterance`, `Speaker`: Pydantic schemas. |

**Main interfaces**:
```python
from abc import ABC, abstractmethod
from pathlib import Path

class Transcriber(ABC):
    """Abstract interface: lets the STT provider change without a refactor."""
    @abstractmethod
    async def transcribe(self, audio_path: Path) -> TranscriptContract: ...
    @abstractmethod
    async def transcribe_batch(self, paths: list[Path]) -> list[TranscriptContract]: ...

class TranscriptCompressor:
    """ONLY reduces text. Does NOT compute metrics or detect events."""
    def compress(self, transcript: TranscriptContract) -> CompressedTranscript: ...
```
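
Bounded concurrency for batch transcription could be sketched as follows. This is a minimal illustration, not the project's `batch_processor.py`: the standalone function, the `max_concurrency` parameter, and the `transcribe_one` callable are assumptions made for the example, and the results are generic values rather than `TranscriptContract` objects.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def transcribe_batch(
    paths: list[Path],
    transcribe_one: Callable[[Path], Awaitable[T]],
    max_concurrency: int = 4,
) -> list[T]:
    """Run transcribe_one over all paths, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: Path) -> T:
        # The semaphore caps how many provider requests are in flight.
        async with semaphore:
            return await transcribe_one(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(p) for p in paths))
```

A semaphore keeps the number of in-flight requests below a fixed cap, which is one simple way to stay within a provider's concurrency limits.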

**Output**:
- `data/transcripts/{batch_id}/raw/{call_id}.json` → Original STT transcript
- `data/transcripts/{batch_id}/compressed/{call_id}.json` → Reduced text for the LLM
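
To make the "text reduction only" idea concrete, here is a minimal sketch of what the compression step might do to a single utterance. The filler list, the `compress_text` name, and the `max_chars` cap are hypothetical; the real `compressor.py` works on `TranscriptContract` objects and its rules are not specified in this document.

```python
import re

# Hypothetical filler vocabulary, for illustration only.
FILLERS = ("um", "uh", "eh", "you know")

_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b[,.]?\s*",
    flags=re.IGNORECASE,
)


def compress_text(text: str, max_chars: int = 2000) -> str:
    """Strip filler words, collapse whitespace, and cap the length."""
    without_fillers = _FILLER_RE.sub("", text)
    normalized = re.sub(r"\s+", " ", without_fillers).strip()
    return normalized[:max_chars]
```

Note that the function only rewrites text: it computes no metrics and detects no events, in line with the module boundary described above.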

---

### 📁 `src/features/`

**Purpose**: **Deterministic** extraction of metrics and events from transcripts. **100% OBSERVED.**

| File | Responsibility |
|------|----------------|
| `turn_metrics.py` | Computes: talk_ratio, interruption_count, silence_total_seconds, avg_turn_duration. |
| `event_detector.py` | Detects observable events: HOLD_START, HOLD_END, TRANSFER, SILENCE, CROSSTALK. |
| `models.py` | `ObservedFeatures`, `ObservedEvent`, `TurnMetrics`. |

**Main interfaces**:
```python
class TurnMetricsExtractor:
    """Computes turn metrics from utterances."""
    def extract(self, transcript: TranscriptContract) -> TurnMetrics: ...

class EventDetector:
    """Detects observable events (silences, holds, transfers)."""
    def detect(self, transcript: TranscriptContract) -> list[ObservedEvent]: ...
```

**Output**:
- `data/features/{batch_id}/{call_id}_features.json` → OBSERVED metrics and events

**Note**: This module **does NOT use an LLM**. Everything is deterministic computation over the transcript.
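
As an illustration of deterministic event detection, the sketch below derives `SILENCE` events from gaps between utterances. The `Utterance` fields (`start` and `end` in seconds), the `min_gap_seconds` threshold, and the dict-shaped events are assumptions for this example; the actual `ObservedEvent` model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Simplified stand-in for the project's utterance model (assumed fields)."""
    speaker: str
    start: float  # seconds from call start
    end: float


def detect_silences(utterances: list[Utterance], min_gap_seconds: float = 3.0) -> list[dict]:
    """Emit a SILENCE event for every gap of at least min_gap_seconds."""
    events = []
    latest_end = None
    for utterance in sorted(utterances, key=lambda u: u.start):
        if latest_end is not None and utterance.start - latest_end >= min_gap_seconds:
            events.append(
                {"type": "SILENCE", "start": latest_end, "end": utterance.start, "source": "observed"}
            )
        # Track the furthest end seen so overlapping speech is not counted as silence.
        latest_end = utterance.end if latest_end is None else max(latest_end, utterance.end)
    return events
```

The same input always yields the same events, which is the property that separates OBSERVED data from LLM output.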

---

### 📁 `src/inference/`

**Purpose**: Analyze transcripts with an LLM to extract **INFERRED data**.

| File | Responsibility |
|------|----------------|
| `client.py` | Wrapper around the OpenAI/Anthropic SDK. Handles retries and rate limiting. |
| `prompt_manager.py` | Loads versioned templates, renders them with variables, validates the schema. |
| `analyzer.py` | Single-call analysis → `CallLabels` with observed/inferred separation. |
| `batch_analyzer.py` | Processes N calls with rate limiting and checkpoints. |
| `rca_synthesizer.py` | **(Optional)** Narrative synthesis of the RCA tree via LLM. Does NOT build the tree. |
| `models.py` | `CallLabels`, `InferredData`, `EvidenceSpan`, `JourneyEvent`. |

**Main interfaces**:
```python
class CallAnalyzer:
    """Generates INFERRED labels with mandatory evidence_spans."""
    async def analyze(self, transcript: CompressedTranscript, features: ObservedFeatures) -> CallLabels: ...

class RCASynthesizer:
    """(Optional) Generates an executive narrative over an already-built RCA tree."""
    async def synthesize_narrative(self, rca_tree: RCATree) -> str: ...
```

**Output**:
- `data/processed/{batch_id}/{call_id}_labels.json` → Labels with observed + inferred
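
Loading a versioned prompt from the `config/prompts/{task}/{version}/` layout could look roughly like this. It is a sketch under assumptions: the `load_prompt` function is hypothetical, the version is passed in directly instead of being read from `versions.yaml`, and `$name` placeholders via `string.Template` are a stand-in for whatever template syntax the project actually uses.

```python
from pathlib import Path
from string import Template


def load_prompt(prompts_dir: Path, task: str, version: str, **variables: str) -> dict[str, str]:
    """Read system.txt and user.txt for one prompt version and render the user part."""
    version_dir = prompts_dir / task / version
    system = (version_dir / "system.txt").read_text(encoding="utf-8")
    user_template = Template((version_dir / "user.txt").read_text(encoding="utf-8"))
    return {
        "system": system,
        # substitute() raises KeyError if a placeholder has no value.
        "user": user_template.substitute(**variables),
        # Returned so the caller can record it in the labels metadata.
        "prompt_version": version,
    }
```

Returning the version alongside the rendered text makes it straightforward to stamp `prompt_version` into each labels file.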

---

### 📁 `src/validation/`

**Purpose**: Quality gate before aggregation. Rejects invalid data.

| File | Responsibility |
|------|----------------|
| `validator.py` | Validates: evidence_spans present, rca_code in the taxonomy, confidence above the threshold. |
| `schema_checker.py` | Verifies that schema_version and prompt_version match the expected values. |
| `models.py` | `ValidationResult`, `ValidationError`. |

**Main interfaces**:
```python
class CallLabelsValidator:
    """Validates CallLabels before aggregation."""
    def validate(self, labels: CallLabels) -> ValidationResult: ...

    # Rules:
    # - Driver without evidence_spans → REJECTED
    # - rca_code not in taxonomy → marked as OTHER_EMERGENT or ERROR
    # - schema_version mismatch → ERROR
```
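
The rules above can be sketched over plain dictionaries as follows. This is an illustrative approximation, not the project's validator: the `ValidationResult` fields, the `validate_labels` function, and the assumption that each driver carries `code` and `evidence_spans` keys (like `intent` in the labels file) are all hypothetical.

```python
from dataclasses import dataclass, field

EXPECTED_SCHEMA_VERSION = "1.0.0"
DRIVER_FIELDS = ("lost_sale_driver", "poor_cx_driver")


@dataclass
class ValidationResult:
    """Hypothetical result shape; the real model may carry more detail."""
    is_valid: bool = True
    errors: list[str] = field(default_factory=list)
    emergent_codes: list[str] = field(default_factory=list)


def validate_labels(labels: dict, taxonomy_codes: set[str]) -> ValidationResult:
    result = ValidationResult()
    version = labels.get("_meta", {}).get("schema_version")
    if version != EXPECTED_SCHEMA_VERSION:
        result.errors.append(f"schema_version mismatch: {version!r}")
    for name in DRIVER_FIELDS:
        driver = labels.get("inferred", {}).get(name)
        if driver is None:  # a null driver is allowed
            continue
        if not driver.get("evidence_spans"):
            result.errors.append(f"{name} has no evidence_spans")
        if driver.get("code") not in taxonomy_codes:
            # Unknown codes are set aside for review instead of failing the call.
            result.emergent_codes.append(driver.get("code"))
    result.is_valid = not result.errors
    return result
```

In this sketch an unknown code is routed to review rather than treated as an error; the source leaves open whether the real validator marks it as `OTHER_EMERGENT` or `ERROR`.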

---

### 📁 `src/aggregation/`

**Purpose**: Consolidate validated labels into statistics and RCA trees. **DETERMINISTIC, does not use an LLM.**

| File | Responsibility |
|------|----------------|
| `stats_engine.py` | Calculations: distributions, percentiles, cross-tabs. Uses pandas + DuckDB. |
| `rca_builder.py` | **DETERMINISTIC construction** of the RCA tree from stats and taxonomy. Does NOT use an LLM. |
| `emergent_collector.py` | Collects `OTHER_EMERGENT` for manual review and possible promotion into the taxonomy. |
| `correlations.py` | Correlation analysis between observed_features and inferred_outcomes. |
| `models.py` | `AggregatedStats`, `RCATree`, `RCANode`, `Correlation`. |

**Main interfaces**:
```python
class StatsEngine:
    """Aggregates validated labels into statistics."""
    def aggregate(self, labels: list[CallLabels]) -> AggregatedStats: ...

class RCABuilder:
    """Builds the RCA tree DETERMINISTICALLY (counts + taxonomy hierarchy)."""
    def build_lost_sales_tree(self, stats: AggregatedStats, taxonomy: RCATaxonomy) -> RCATree: ...
    def build_poor_cx_tree(self, stats: AggregatedStats, taxonomy: RCATaxonomy) -> RCATree: ...

class EmergentCollector:
    """Collects OTHER_EMERGENT for human review."""
    def collect(self, labels: list[CallLabels]) -> EmergentDriversReport: ...
```

**Note on RCA**:
- `rca_builder.py` → **Deterministic**: counts occurrences, groups by taxonomy, computes percentages
- `inference/rca_synthesizer.py` → **(Optional) LLM**: generates narrative text about the already-built tree
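
The "count, group by taxonomy, compute percentages" recipe can be sketched in a few lines. The function below is an assumption-laden illustration: `build_rca_tree`, the flat code-to-category `taxonomy` mapping, the nested-dict output, and the driver codes used in the usage example are all hypothetical, whereas the real builder consumes `AggregatedStats` and returns an `RCATree`.

```python
from collections import Counter


def build_rca_tree(driver_codes: list[str], taxonomy: dict[str, str]) -> dict:
    """Group driver codes under their taxonomy category with counts and percentages.

    taxonomy maps each driver code to its parent category; codes missing
    from it fall under OTHER_EMERGENT.
    """
    total = len(driver_codes)
    tree: dict = {}
    for code, count in sorted(Counter(driver_codes).items()):
        category = taxonomy.get(code, "OTHER_EMERGENT")
        node = tree.setdefault(category, {"count": 0, "pct": 0.0, "children": {}})
        node["count"] += count
        node["children"][code] = {"count": count, "pct": round(100 * count / total, 1)}
    for node in tree.values():
        node["pct"] = round(100 * node["count"] / total, 1)
    return tree
```

Usage with made-up codes:

```python
taxonomy = {"PRICE": "COMMERCIAL", "COMPETITOR": "COMMERCIAL", "RUDE_AGENT": "SERVICE"}
tree = build_rca_tree(["PRICE", "PRICE", "COMPETITOR", "RUDE_AGENT"], taxonomy)
```

Because the tree is built from counts alone, rerunning it on the same labels gives the same result, which is what allows the optional LLM narrative to be layered on top without affecting the numbers.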

---

### 📁 `src/visualization/`

**Purpose**: Output layer. Generates visual reports. **Does NOT recompute metrics or inferences.**

| File | Responsibility |
|------|----------------|
| `dashboard.py` | Streamlit app: filters, interactive charts, drill-down. |
| `charts.py` | Functions for generating charts (plotly/matplotlib). |
| `tree_renderer.py` | Renders RCA trees as PNG/SVG. |
| `pdf_report.py` | Executive PDF generation with ReportLab. |
| `excel_export.py` | Excel export with multiple sheets and formatting. |

**Critical constraint**: This module **ONLY presents pre-computed data**. It contains no analytical logic.

**Main interfaces**:
```python
class ReportGenerator:
    """Generates reports from already-computed data."""
    def generate_pdf(self, stats: AggregatedStats, trees: dict[str, RCATree]) -> Path: ...
    def generate_excel(self, labels: list[CallLabels], stats: AggregatedStats) -> Path: ...

class TreeRenderer:
    """Renders an RCATree as an image."""
    def render_png(self, tree: RCATree, output_path: Path) -> None: ...
```

---

### 📁 `src/pipeline/`

**Purpose**: Orchestrate the full execution flow.

| File | Responsibility |
|------|----------------|
| `orchestrator.py` | Runs stages in order, handles errors and logging. |
| `stages.py` | Defines each stage: `transcribe`, `extract_features`, `analyze`, `validate`, `aggregate`, `report`. |
| `checkpoint.py` | Saves/loads state for resume. |
| `cli.py` | CLI built with argparse/typer. |
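
A minimal sketch of stage ordering with checkpoint-based resume is shown below. The stage names come from the table above; everything else (the `run_pipeline` and `load_completed` functions, and a checkpoint file holding a `completed` list) is an assumption for illustration and not the actual format of the files under `data/.checkpoints/`.

```python
import json
from pathlib import Path
from typing import Callable

STAGE_ORDER = ["transcribe", "extract_features", "analyze", "validate", "aggregate", "report"]


def load_completed(checkpoint_path: Path) -> list[str]:
    """Return the stages recorded as finished, or an empty list on a fresh run."""
    if not checkpoint_path.exists():
        return []
    return json.loads(checkpoint_path.read_text(encoding="utf-8"))["completed"]


def run_pipeline(stages: dict[str, Callable[[], None]], checkpoint_path: Path) -> list[str]:
    """Run stages in order, skipping those already completed. Returns the stages executed."""
    completed = load_completed(checkpoint_path)
    executed = []
    for name in STAGE_ORDER:
        if name in completed:
            continue
        stages[name]()  # if this raises, the checkpoint still points at the last good stage
        completed.append(name)
        checkpoint_path.write_text(json.dumps({"completed": completed}), encoding="utf-8")
        executed.append(name)
    return executed
```

Writing the checkpoint after every stage means a failed run can be restarted without repeating the expensive STT and LLM calls that already succeeded.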

---

### 📁 `src/utils/`

**Purpose**: Shared helper functions.

| File | Responsibility |
|------|----------------|
| `file_io.py` | Reading/writing JSON, CSV, audio. Glob patterns. |
| `logging_config.py` | Structured logging setup (console + file). |
| `validators.py` | Audio file validation (format, duration). |

---

## Data Model (Output Artifacts)

### Minimum required structure of `labels.json`

Every `{call_id}_labels.json` file **ALWAYS** includes these fields:

```json
{
  "_meta": {
    "schema_version": "1.0.0",      // REQUIRED - schema version
    "prompt_version": "v1.2",       // REQUIRED - version of the prompt used
    "model_id": "gpt-4o-mini",      // REQUIRED - LLM model used
    "processed_at": "2024-01-15T10:35:00Z"
  },
  "call_id": "c001",                // REQUIRED

  "observed": {                     // REQUIRED - data from STT/features
    "duration_seconds": 245,
    "agent_talk_pct": 0.45,
    "customer_talk_pct": 0.55,
    "silence_total_seconds": 38,
    "hold_events": [...],
    "transfer_count": 0
  },

  "inferred": {                     // REQUIRED - data from the LLM
    "intent": { "code": "...", "confidence": 0.91, "evidence_spans": [...] },
    "outcome": { "code": "...", "confidence": 0.85, "evidence_spans": [...] },
    "lost_sale_driver": { ... } | null,
    "poor_cx_driver": { ... } | null,
    "sentiment": { ... },
    "agent_quality": { ... },
    "summary": "..."
  },

  "events": [                       // REQUIRED - structured timeline
    {"type": "CALL_START", "t": "00:00", "source": "observed"},
    {"type": "HOLD_START", "t": "02:14", "source": "observed"},
    {"type": "PRICE_OBJECTION", "t": "03:55", "source": "inferred"},
    ...
  ]
}
```

### About `events[]`

`events[]` is a **structured list of normalized events**, NOT free text.

Each event has:
- `type`: Enum code (`HOLD_START`, `TRANSFER`, `ESCALATION_REQUEST`, `NEGATIVE_SENTIMENT_PEAK`, etc.)
- `t`: Timestamp in `MM:SS` or `HH:MM:SS` format
- `source`: `"observed"` (comes from STT/features) or `"inferred"` (comes from the LLM)

Valid event types are defined in `config/rca_taxonomy.yaml`:
```yaml
journey_event_types:
  observed:
    - CALL_START
    - CALL_END
    - HOLD_START
    - HOLD_END
    - TRANSFER
    - SILENCE
    - CROSSTALK
  inferred:
    - INTENT_STATED
    - PRICE_OBJECTION
    - COMPETITOR_MENTION
    - NEGATIVE_SENTIMENT_PEAK
    - RESOLUTION_ATTEMPT
    - SOFT_DECLINE
    - ESCALATION_REQUEST
```
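
Checking an event against these rules could be sketched as follows. The event types mirror the YAML above but are hard-coded here for self-containment instead of being loaded from `config/rca_taxonomy.yaml`; the `parse_timestamp` and `validate_event` helpers are hypothetical names for this example.

```python
EVENT_TYPES = {
    "observed": {"CALL_START", "CALL_END", "HOLD_START", "HOLD_END", "TRANSFER", "SILENCE", "CROSSTALK"},
    "inferred": {
        "INTENT_STATED", "PRICE_OBJECTION", "COMPETITOR_MENTION", "NEGATIVE_SENTIMENT_PEAK",
        "RESOLUTION_ATTEMPT", "SOFT_DECLINE", "ESCALATION_REQUEST",
    },
}


def parse_timestamp(t: str) -> int:
    """Convert MM:SS or HH:MM:SS into seconds from call start."""
    parts = [int(p) for p in t.split(":")]  # int() raises ValueError on non-numeric parts
    if len(parts) not in (2, 3) or any(p < 0 for p in parts) or any(p >= 60 for p in parts[1:]):
        raise ValueError(f"invalid timestamp: {t!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


def validate_event(event: dict) -> None:
    """Raise ValueError if the event's source, type, or timestamp is invalid."""
    allowed = EVENT_TYPES.get(event["source"])
    if allowed is None:
        raise ValueError(f"unknown source: {event['source']!r}")
    if event["type"] not in allowed:
        raise ValueError(f"{event['type']!r} is not a valid {event['source']} event")
    parse_timestamp(event["t"])
```

Tying each type to its source catches a common labeling mistake: an LLM-derived event such as `PRICE_OBJECTION` being tagged as `"observed"`.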

---

## Data Flow Between Modules

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              DATA FLOW                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   data/raw/audio/*.mp3                                                      │
│           │                                                                 │
│           ▼                                                                 │
│   ┌───────────────┐                                                        │
│   │ transcription │ → data/transcripts/raw/*.json                          │
│   │   (STT only)  │ → data/transcripts/compressed/*.json                   │
│   └───────────────┘                                                        │
│           │                                                                 │
│           ▼                                                                 │
│   ┌───────────────┐                                                        │
│   │   features    │ → data/features/*_features.json                        │
│   │  (OBSERVED)   │   (turn_metrics + detected_events)                     │
│   └───────────────┘                                                        │
│           │                                                                 │
│           ▼                                                                 │
│   ┌───────────────┐                                                        │
│   │   inference   │ → data/processed/*_labels.json                         │
│   │  (INFERRED)   │   (observed + inferred + events)                       │
│   └───────────────┘                                                        │
│           │                                                                 │
│           ▼                                                                 │
│   ┌───────────────┐                                                        │
│   │  validation   │ → rejects labels without evidence_spans                │
│   │ (quality gate)│ → flags low_confidence                                 │
│   └───────────────┘                                                        │
│           │                                                                 │
│           ▼                                                                 │
│   ┌───────────────┐                                                        │
│   │  aggregation  │ → data/outputs/aggregated_stats.json                   │
│   │(DETERMINISTIC)│ → data/outputs/rca_*.json                              │
│   └───────────────┘ → data/outputs/emergent_drivers_review.json            │
│           │                                                                 │
│           ▼                                                                 │
│   ┌───────────────┐                                                        │
│   │ visualization │ → data/outputs/executive_summary.pdf                   │
│   │(PRESENTATION) │ → data/outputs/full_analysis.xlsx                      │
│   └───────────────┘ → http://localhost:8501 (dashboard)                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Separation of Responsibilities (Summary)

| Layer | Module | Logic Type | Uses LLM |
|-------|--------|------------|----------|
| STT | `transcription/` | Audio→text conversion | No |
| Text | `transcription/compressor.py` | Text reduction | No |
| Features | `features/` | Deterministic extraction | No |
| Analysis | `inference/analyzer.py` | Classification + evidence | **Yes** |
| Narrative | `inference/rca_synthesizer.py` | Text synthesis (optional) | **Yes** |
| Validation | `validation/` | Quality rules | No |
| Aggregation | `aggregation/` | Statistics + RCA tree | No |
| Presentation | `visualization/` | Reports + dashboard | No |

---

## Code Conventions

### Naming

- **Files**: `snake_case.py`
- **Classes**: `PascalCase`
- **Functions/methods**: `snake_case`
- **Constants**: `UPPER_SNAKE_CASE`

### Type hints

Use type hints on all public functions. Use Pydantic for data validation.

### Example module structure

```python
# src/features/turn_metrics.py

"""Deterministic extraction of turn-based metrics from transcripts."""

from __future__ import annotations

import logging
from dataclasses import dataclass

from src.transcription.models import TranscriptContract

logger = logging.getLogger(__name__)


@dataclass
class TurnMetrics:
    """Observed metrics extracted from transcript turns."""
    agent_talk_pct: float
    customer_talk_pct: float
    silence_total_seconds: float
    interruption_count: int
    avg_turn_duration_seconds: float


class TurnMetricsExtractor:
    """Extracts turn metrics from transcript. 100% deterministic, no LLM."""

    def extract(self, transcript: TranscriptContract) -> TurnMetrics:
        """Extract turn metrics from transcript utterances."""
        utterances = transcript.observed.utterances
        # ... deterministic calculations ...
        return TurnMetrics(...)
```