feat: Add Streamlit dashboard with Blueprint compliance (v2.1.0)

Dashboard Features:
- 8 navigation sections: Overview, Outcomes, Poor CX, FCR, Churn, Agent, Call Explorer, Export
- Beyond Brand Identity styling (colors #6D84E3, Outfit font)
- RCA Sankey diagram (Driver → Outcome → Churn Risk flow)
- Correlation heatmaps (driver co-occurrence, driver-outcome)
- Outcome Deep Dive (root causes, correlation, duration analysis)
- Export functionality (Excel, HTML, JSON)

Blueprint Compliance:
- FCR: 4 categories (Primera Llamada/Rellamada × Sin/Con Riesgo de Fuga)
- Churn: Binary view (Sin Riesgo de Fuga / En Riesgo de Fuga)
- Agent: Talento Para Replicar / Oportunidades de Mejora
- Fixed FCR rate calculation (only FIRST_CALL counts as success)

Technical:
- Streamlit + Plotly for interactive visualizations
- Light theme configuration (.streamlit/config.toml)
- Fixed Plotly colorbar titlefont deprecation

Documentation:
- Updated PROJECT_CONTEXT.md, TODO.md, CHANGELOG.md
- Added 4 new technical decisions (TD-014 to TD-017)
- Created TROUBLESHOOTING.md with 10 common issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 16:27:30 +01:00
commit 75e7b9da3d
110 changed files with 28247 additions and 0 deletions
--- a/docs/TECH_STACK.md
+++ b/docs/TECH_STACK.md
@@ -0,0 +1,579 @@
+# CXInsights - Tech Stack
+
+## Decision Summary
+
+| Component | Choice | Supported Alternatives |
+|------------|----------|-------------------------|
+| **STT (Speech-to-Text)** | AssemblyAI (default) | Whisper, Google STT, AWS Transcribe (via adapter) |
+| **LLM** | OpenAI GPT-4o-mini | Claude 3.5 Sonnet (fallback) |
+| **Data Processing** | pandas + DuckDB | - |
+| **Visualization** | Streamlit (internal dashboard) | - |
+| **PDF Generation** | ReportLab | - |
+| **Config Management** | Pydantic Settings | - |
+| **PII Handling** | Presidio (optional) + pre-LLM redaction | - |
+
+---
+
+## 1. Speech-to-Text: Adapter Architecture
+
+### Decision: **AssemblyAI (default)** + alternatives via STT Provider Adapter
+
+The system uses an **abstract `Transcriber` interface** that allows switching providers without modifying the pipeline code.
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    STT PROVIDER ADAPTER                         │
+├─────────────────────────────────────────────────────────────────┤
+│  Interface: Transcriber                                         │
+│  └─ transcribe(audio) → TranscriptContract                     │
+│                                                                 │
+│  Implementations:                                               │
+│  ├─ AssemblyAITranscriber (DEFAULT - best Spanish quality)      │
+│  ├─ WhisperTranscriber (local, offline, $0)                    │
+│  ├─ GoogleSTTTranscriber (cloud alternative)                   │
+│  └─ AWSTranscribeTranscriber (cloud alternative)               │
+│                                                                 │
+│  Config: STT_PROVIDER=assemblyai|whisper|google|aws            │
+└─────────────────────────────────────────────────────────────────┘
+```
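
A minimal sketch of how the adapter described above could be laid out. Only `Transcriber`, `transcribe`, `TranscriptContract`, the implementation class names and the `STT_PROVIDER` values come from this document; the fields of `TranscriptContract`, the registry and `get_transcriber` are assumptions made for illustration, and the provider bodies are placeholders rather than real SDK calls.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TranscriptContract:
    """Hypothetical stand-in for the pipeline's transcript contract."""
    text: str
    provider: str


class Transcriber(ABC):
    """Provider-agnostic interface; the pipeline depends only on this."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> TranscriptContract:
        ...


class AssemblyAITranscriber(Transcriber):
    def transcribe(self, audio_path: str) -> TranscriptContract:
        # Placeholder: a real implementation would call the AssemblyAI SDK.
        return TranscriptContract(f"<transcript of {audio_path}>", "assemblyai")


class WhisperTranscriber(Transcriber):
    def transcribe(self, audio_path: str) -> TranscriptContract:
        # Placeholder: a real implementation would run the local model.
        return TranscriptContract(f"<transcript of {audio_path}>", "whisper")


# Maps STT_PROVIDER values to implementations (assumed registry).
_REGISTRY = {
    "assemblyai": AssemblyAITranscriber,
    "whisper": WhisperTranscriber,
}


def get_transcriber(provider: str = "assemblyai") -> Transcriber:
    """Return the transcriber selected by the STT_PROVIDER setting."""
    try:
        return _REGISTRY[provider.lower()]()
    except KeyError:
        raise ValueError(f"Unknown STT_PROVIDER: {provider!r}") from None
```

With this shape, pipeline code would call `get_transcriber(settings.stt_provider).transcribe(path)` and never import a provider SDK directly, which is what makes the provider swappable through configuration.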
+
+### Provider Comparison
+
+| Criterion | AssemblyAI | Whisper (local) | Google STT | AWS Transcribe |
+|----------|------------|-----------------|------------|----------------|
+| **Spanish quality** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
+| **Speaker diarization** | ✅ Included | ❌ Requires pyannote | ✅ Included | ✅ Included |
+| **Cost/minute** | $0.015 | $0 (local GPU) | $0.016 | $0.015 |
+| **Setup complexity** | Low (API key) | High (GPU, models) | Medium | Medium |
+| **Batch processing** | ✅ Native async | Manual | ✅ | ✅ |
+| **Latency** | ~0.3x realtime | ~1x realtime | ~0.2x realtime | ~0.3x realtime |
+
+### Why AssemblyAI as the Default
+
+1. **Best model for Spanish**: AssemblyAI Best performs very well on Latin American and Castilian Spanish
+2. **Speaker diarization included**: Critical for separating agent from customer without extra code
+3. **Simple API**: Well-documented Python SDK with native async
+4. **Batch processing**: Configurable concurrency, polling for results
+5. **No infrastructure**: No GPU to provision and no models to maintain
+
+### When to Use the Alternatives
+
+| Alternative | Use when... |
+|-------------|----------------|
+| **Whisper local** | $0 budget, you have a GPU (RTX 3080+), highly sensitive data (offline) |
+| **Google STT** | You already use GCP, you need minimal latency |
+| **AWS Transcribe** | You already use AWS, S3 integration |
+
+### STT Cost Estimate (AHT = 7 min)
+
+```
+AssemblyAI pricing: $0.015/minute
+
+5,000 calls × 7 min = 35,000 min
+├─ Low estimate (no retries):    $525
+├─ Medium estimate:              $550
+└─ High estimate (+10% retries): $580
+
+20,000 calls × 7 min = 140,000 min
+├─ Low estimate:    $2,100
+├─ Medium estimate: $2,200
+└─ High estimate:   $2,400
+
+TOTAL STT RANGE:
+├─ 5K calls:  $525 - $580
+└─ 20K calls: $2,100 - $2,400
+```
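
The arithmetic behind these figures can be reproduced with a small helper. This is a sketch only: the function name and the `retry_overhead` parameter are assumptions, while the $0.015/minute rate and the 7-minute AHT come from the estimate above.

```python
def stt_cost(calls: int, aht_min: float = 7.0, rate_per_min: float = 0.015,
             retry_overhead: float = 0.0) -> float:
    """Estimated STT cost in USD.

    retry_overhead is the fraction of audio re-submitted because of
    retries (0.10 models the "+10% retries" case).
    """
    minutes = calls * aht_min
    return round(minutes * rate_per_min * (1 + retry_overhead), 2)


low_5k = stt_cost(5_000)                         # no retries
high_5k = stt_cost(5_000, retry_overhead=0.10)   # +10% retries
low_20k = stt_cost(20_000)
```

Under these assumptions the 5K low case is 35,000 min × $0.015 = $525 and the +10% case is $577.50, which the table rounds to $580.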
+
+---
+
+## 2. LLM: OpenAI GPT-4o-mini
+
+### Decision: **GPT-4o-mini** (primary) + **Claude 3.5 Sonnet** (fallback)
+
+### Comparison
+
+| Criterion | GPT-4o-mini | GPT-4o | Claude 3.5 Sonnet |
+|----------|-------------|--------|-------------------|
+| **Input cost** | $0.15/1M tokens | $2.50/1M tokens | $3.00/1M tokens |
+| **Output cost** | $0.60/1M tokens | $10.00/1M tokens | $15.00/1M tokens |
+| **Spanish quality** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| **JSON structured** | ✅ Excellent | ✅ Excellent | ✅ Very good |
+| **Context window** | 128K | 128K | 200K |
+| **Rate limits** | Depends on tier | Depends on tier | Depends on tier |
+
+### Rate Limits and Throttling
+
+**Rate limits depend on your OpenAI account tier:**
+
+| Tier | RPM (requests/min) | TPM (tokens/min) |
+|------|-------------------|------------------|
+| Tier 1 | 500 | 200K |
+| Tier 2 | 5,000 | 2M |
+| Tier 3 | 5,000 | 4M |
+| Tier 4+ | 10,000 | 10M |
+
+**Mandatory requirements in the code:**
+- Implement throttling with a configurable rate (`LLM_REQUESTS_PER_MINUTE`)
+- Exponential backoff on 429 errors (rate limit exceeded)
+- Retry with jitter to avoid a thundering herd
+- Log rate limit warnings
+
+```python
+# Recommended (conservative) configuration
+LLM_REQUESTS_PER_MINUTE=300  # Start low, scale up according to tier
+LLM_BACKOFF_BASE=2.0         # Base seconds for backoff
+LLM_BACKOFF_MAX=60.0         # Maximum backoff
+LLM_MAX_RETRIES=5
+```
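
To illustrate how `LLM_BACKOFF_BASE` and `LLM_BACKOFF_MAX` could feed an exponential backoff with jitter, here is a minimal sketch. It uses the "full jitter" variant (a uniform draw between zero and the capped exponential ceiling), which is one common choice and not necessarily the one the project uses; the function name and the `rng` parameter are assumptions.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles on every attempt and is capped at `cap`; the
    actual delay is drawn uniformly below the ceiling so that clients
    that failed together do not all retry at the same instant.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Ceilings for attempts 0..4 with the defaults: 2, 4, 8, 16, 32 seconds.
delays = [backoff_delay(a) for a in range(5)]
```

Randomizing the whole interval trades a slightly longer average wait for much less synchronized retry traffic, which is the thundering-herd concern listed in the requirements.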
+
+### LLM Cost Estimate per Call
+
+**IMPORTANT**: These calculations assume **prior transcript compression** (Module 2).
+
+#### Scenario A: With compression (RECOMMENDED)
+
+```
+Compressed transcript: ~1,200-1,800 input tokens
+Prompt template: ~400-600 tokens
+Expected output: ~250-400 tokens
+
+Total per call (compressed):
+├─ Input: ~2,000 tokens × $0.15/1M = $0.0003
+├─ Output: ~350 tokens × $0.60/1M = $0.0002
+└─ Total: $0.0004 - $0.0006 per call
+
+RANGE (5K calls): $2 - $3
+RANGE (20K calls): $8 - $12
+```
+
+#### Scenario B: Without compression (full transcript)
+
+```
+Full transcript: ~4,000-8,000 input tokens (x3-x6)
+Prompt template: ~400-600 tokens
+Expected output: ~250-400 tokens
+
+Total per call (full transcript):
+├─ Input: ~6,000 tokens × $0.15/1M = $0.0009
+├─ Output: ~350 tokens × $0.60/1M = $0.0002
+└─ Total: $0.0010 - $0.0020 per call
+
+RANGE (5K calls): $5 - $10
+RANGE (20K calls): $20 - $40
+
+⚠️ RECOMMENDATION: Always use compression; it shrinks transcript input 3-6x and cuts LLM cost roughly 2-4x
+```
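
The per-call figures in both scenarios follow from the same formula, sketched below. The function name is an assumption; the token counts and the GPT-4o-mini prices ($0.15 and $0.60 per 1M tokens) are the ones used in this section.

```python
def llm_cost_per_call(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 0.15,
                      output_price_per_m: float = 0.60) -> float:
    """Cost in USD of one LLM call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


compressed = llm_cost_per_call(2_000, 350)   # Scenario A midpoint
full = llm_cost_per_call(6_000, 350)         # Scenario B midpoint

batch_5k_compressed = 5_000 * compressed
batch_5k_full = 5_000 * full
```

At these midpoints a call costs about $0.0005 compressed and about $0.0011 uncompressed, so a 5K batch lands inside the $2-$3 and $5-$10 ranges quoted above.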
+
+### Why GPT-4o-mini
+
+1. **Cost-effectiveness**: ~17x cheaper than GPT-4o, with sufficient quality for classification
+2. **Structured outputs**: Native JSON mode, reduces parsing errors
+3. **Consistency**: Very consistent responses with well-designed prompts
+
+### When to Escalate to GPT-4o
+
+- Analysis that requires complex reasoning
+- Edge cases with ambiguous transcripts
+- Final synthesis of RCA trees (few calls, marginal cost)
+
+### Claude 3.5 Sonnet as Fallback
+
+Use when:
+- OpenAI has downtime
+- You need a second opinion on difficult cases
+- Very long context (>100K tokens, close to GPT-4o-mini's 128K window)
+
+---
+
+## 3. Data Processing: pandas + DuckDB
+
+### Decision: **pandas** (manipulation) + **DuckDB** (analytical queries)
+
+### Why This Combination
+
+| Component | Use | Rationale |
+|------------|-----|---------------|
+| **pandas** | Load/transform JSON, merge data | De facto standard, excellent for semi-structured data |
+| **DuckDB** | SQL queries over data, aggregations | Serverless analytical SQL, integrates with pandas |
+
+### Why NOT Polars
+
+- Polars is faster, but pandas is sufficient for 20K rows
+- pandas has the larger ecosystem and more documentation
+- The team probably already knows pandas
+
+### Why NOT SQLite/PostgreSQL
+
+- DuckDB is columnar, optimized for analytics
+- No server or connection required
+- Standard SQL syntax
+- Reads/writes Parquet natively
+
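+### Usage Example
+
+```python
+import glob
+import json
+
+import duckdb
+import pandas as pd
+
+# Load all labels. pd.read_json does not expand glob patterns, so read each
+# file explicitly (assumes one JSON object per file).
+records = []
+for path in sorted(glob.glob("data/processed/*.json")):
+    with open(path, encoding="utf-8") as f:
+        records.append(json.load(f))
+labels = pd.DataFrame(records)
+
+# Analytical query with DuckDB, which can query the `labels` DataFrame by name
+result = duckdb.sql("""
+    SELECT
+        lost_sale_driver,
+        COUNT(*) as count,
+        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as pct
+    FROM labels
+    WHERE outcome = 'no_sale'
+    GROUP BY lost_sale_driver
+    ORDER BY count DESC
+""").df()
+```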
+
+---
+
+## 4. Visualization: Streamlit
+
+### Decision: **Streamlit** (internal dashboard)
+
+### Scope and Limitations
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    STREAMLIT - SCOPE                            │
+├─────────────────────────────────────────────────────────────────┤
+│  ✅ IS:                                                         │
+│  ├─ Internal dashboard for the analysis team                    │
+│  ├─ Visualization of processed batch results                    │
+│  ├─ Drill-down into individual calls                            │
+│  └─ Export to PDF/Excel                                         │
+│                                                                 │
+│  ❌ IS NOT:                                                     │
+│  ├─ Multi-tenant enterprise portal                              │
+│  ├─ Production application with an SLA                          │
+│  ├─ Dashboard for >50 concurrent users                          │
+│  └─ System with complex authentication                          │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Comparison
+
+| Criterion | Streamlit | Plotly Dash | FastAPI+React |
+|----------|-----------|-------------|---------------|
+| **Setup time** | 1 hour | 4 hours | 2-3 days |
+| **Interactivity** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| **Learning curve** | Low | Medium | High |
+| **Customization** | Limited | High | Full |
+| **Concurrent users** | ~10-50 | ~50-100 | No fixed limit |
+
+### Deploy
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    DEPLOYMENT OPTIONS                           │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  OPTION 1: Local (development/personal analysis)                │
+│  $ streamlit run src/visualization/dashboard.py                │
+│  → http://localhost:8501                                       │
+│                                                                 │
+│  OPTION 2: VM/internal server (small team)                      │
+│  $ streamlit run dashboard.py --server.port 8501               │
+│  → No auth, access via VPN/internal network                     │
+│                                                                 │
+│  OPTION 3: Proxy + basic auth (recommended for production)      │
+│  Nginx/Caddy → Basic Auth → Streamlit                          │
+│  → Auth configurable via .htpasswd or OAuth proxy               │
+│                                                                 │
+│  OPTION 4: Streamlit Cloud (demos/POC)                          │
+│  → Free, but data is public (not for production)                │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Auth Configuration (optional)
+
+```nginx
+# nginx.conf - Basic Auth for Streamlit
+server {
+    listen 443 ssl;
+    server_name dashboard.internal.company.com;
+
+    auth_basic "CXInsights Dashboard";
+    auth_basic_user_file /etc/nginx/.htpasswd;
+
+    location / {
+        proxy_pass http://localhost:8501;
+        proxy_http_version 1.1;
+        proxy_set_header Upgrade $http_upgrade;
+        proxy_set_header Connection "upgrade";
+    }
+}
+```
+
+### Future Alternative
+
+If you need an enterprise dashboard:
+- Migrate to a FastAPI backend + React frontend
+- Reuse the aggregation logic
+- Add auth, multi-tenancy, RBAC
+
+---
+
+## 5. PII Handling
+
+### Decision: **Mandatory pre-LLM redaction** + controlled retention
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    PII HANDLING STRATEGY                        │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  PRINCIPLE: Minimize PII sent to external APIs                  │
+│                                                                 │
+│  1. PRE-LLM REDACTION (mandatory)                               │
+│     ├─ Names → [NOMBRE]                                         │
+│     ├─ Phone numbers → [TELEFONO]                               │
+│     ├─ Emails → [EMAIL]                                        │
+│     ├─ DNI/NIE → [DOCUMENTO]                                   │
+│     ├─ Cards → [TARJETA]                                        │
+│     └─ Addresses → [DIRECCION]                                  │
+│                                                                 │
+│  2. RETENTION PER BATCH                                         │
+│     ├─ Raw transcripts: delete after 30 days or project end     │
+│     ├─ Compressed transcripts: delete after processing          │
+│     ├─ Labels (no PII): retain for analysis                     │
+│     └─ Aggregated stats: retain indefinitely                    │
+│                                                                 │
+│  3. LOGS                                                        │
+│     ├─ NEVER log the full transcript                            │
+│     ├─ Log only: call_id, timestamps, errors                    │
+│     └─ Logs on a separate volume, 7-day rotation                │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Implementation
+
+```python
+# Option 1: Basic regex (minimum viable)
+# Patterns are applied in insertion order, so the most specific come first.
+REDACTION_PATTERNS = {
+    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b': '[EMAIL]',
+    r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b': '[TARJETA]',
+    r'\b\d{8}[A-Z]\b': '[DOCUMENTO]',               # DNI: 8 digits + letter
+    r'\b[XYZ]\d{7}[A-Z]\b': '[DOCUMENTO]',          # NIE: X/Y/Z + 7 digits + letter
+    r'\b[6-9]\d{8}\b': '[TELEFONO]',                # Spanish phone: 9 digits, starts with 6-9
+}
+
+# Option 2: Presidio (recommended for production)
+# More precise, supports Spanish, context-aware detection
+from presidio_analyzer import AnalyzerEngine
+from presidio_anonymizer import AnonymizerEngine
+```
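
A minimal sketch of how regex-based redaction could be applied before a transcript is sent to the LLM. The `redact` function and the exact patterns are illustrative assumptions: they cover Spanish DNI/NIE formats and 9-digit phone numbers starting with 6-9, and they do not attempt names or addresses, which need an NER-based tool such as Presidio.

```python
import re

# Ordered list: specific patterns run before more general ones.
PATTERNS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"), "[TARJETA]"),
    (re.compile(r"\b(?:\d{8}|[XYZ]\d{7})[A-Z]\b"), "[DOCUMENTO]"),
    (re.compile(r"\b[6-9]\d{8}\b"), "[TELEFONO]"),
]


def redact(text: str) -> str:
    """Replace each PII match with its placeholder, in pattern order."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "DNI 12345678Z, tel 612345678, mail ana@example.com, tarjeta 4111 1111 1111 1111"
print(redact(sample))
# DNI [DOCUMENTO], tel [TELEFONO], mail [EMAIL], tarjeta [TARJETA]
```

Using an ordered list rather than relying on dictionary order makes the precedence explicit, which matters because a bare 9-digit pattern would otherwise compete with the document-number pattern.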
+
+---
+
+## 6. Python Dependencies
+
+### Core Dependencies
+
+```toml
+[project]
+dependencies = [
+    # STT
+    "assemblyai>=0.26.0",
+
+    # LLM
+    "openai>=1.40.0",
+    "anthropic>=0.34.0",  # fallback
+
+    # Data Processing
+    "pandas>=2.2.0",
+    "duckdb>=1.0.0",
+    "pydantic>=2.8.0",
+
+    # Visualization
+    "streamlit>=1.38.0",
+    "plotly>=5.24.0",
+    "matplotlib>=3.9.0",
+
+    # PDF/Excel Export
+    "reportlab>=4.2.0",
+    "openpyxl>=3.1.0",
+    "xlsxwriter>=3.2.0",
+
+    # Config & Utils
+    "pydantic-settings>=2.4.0",
+    "python-dotenv>=1.0.0",
+    "pyyaml>=6.0.0",
+    "tqdm>=4.66.0",
+    "tenacity>=8.5.0",  # retry logic
+
+    # JSON (performance + validation)
+    "orjson>=3.10.0",             # Fast JSON serialization
+    "jsonschema>=4.23.0",         # Schema validation
+
+    # Async
+    "aiofiles>=24.1.0",
+    "httpx>=0.27.0",
+]
+
+[project.optional-dependencies]
+# PII detection (optional but recommended)
+pii = [
+    "presidio-analyzer>=2.2.0",
+    "presidio-anonymizer>=2.2.0",
+    "spacy>=3.7.0",
+    "es-core-news-sm @ https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.7.0/es_core_news_sm-3.7.0-py3-none-any.whl",
+]
+
+dev = [
+    "pytest>=8.3.0",
+    "pytest-asyncio>=0.24.0",
+    "pytest-cov>=5.0.0",
+    "ruff>=0.6.0",
+    "mypy>=1.11.0",
+]
+```
+
+### Rationale for Each Dependency
+
+| Dependency | Purpose | Why this one |
+|-------------|-----------|--------------|
+| `assemblyai` | Official STT SDK | Best integration, native async |
+| `openai` | Official GPT SDK | Structured outputs, streaming |
+| `anthropic` | Official Claude SDK | Fallback LLM |
+| `pandas` | Data manipulation | Industry standard |
+| `duckdb` | SQL queries | Serverless analytics |
+| `pydantic` | Schema validation | Type safety, JSON parsing |
+| `streamlit` | Dashboard | Fast, Python-only |
+| `plotly` | Interactive charts | Best for web |
+| `matplotlib` | Static charts | PNG export |
+| `reportlab` | PDF generation | Mature, flexible |
+| `openpyxl` | Excel read/write | Pandas integration |
+| `pydantic-settings` | Config management | .env + validation |
+| `tqdm` | Progress bars | CLI UX |
+| `tenacity` | Retry logic | Rate limits, API errors |
+| `orjson` | JSON serialization | Up to ~10x faster than the stdlib `json` |
+| `jsonschema` | Schema validation | Validate LLM outputs |
+| `httpx` | Async HTTP client | Async support, which `requests` lacks |
+| `presidio-*` | PII detection | Accuracy in Spanish, context-aware |
+
+---
+
+## 7. Python Versions
+
+### Decision: **Python 3.11+**
+
+### Rationale
+
+- 3.11: 10-60% faster than 3.10
+- 3.11: Better error messages
+- 3.12: Some libraries are not yet compatible
+- Structural pattern matching (`match`, 3.10+) is useful for parsing
+
+---
+
+## 8. Security Considerations
+
+### API Keys
+
+```bash
+# .env (NEVER in git)
+ASSEMBLYAI_API_KEY=xxx
+OPENAI_API_KEY=sk-xxx
+ANTHROPIC_API_KEY=sk-ant-xxx  # optional
+```
+
+### Rate Limiting (mandatory implementation)
+
+```python
+# src/inference/client.py
+from openai import RateLimitError
+from tenacity import (
+    retry,
+    retry_if_exception_type,
+    stop_after_attempt,
+    wait_exponential,
+)
+
+
+class LLMClient:  # illustrative class name
+    @retry(
+        wait=wait_exponential(multiplier=2, min=1, max=60),
+        stop=stop_after_attempt(5),
+        retry=retry_if_exception_type(RateLimitError),
+    )
+    async def call_llm(self, prompt: str) -> str:
+        # Throttle requests (assumes self.rate_limiter is created in __init__)
+        await self.rate_limiter.acquire()
+        # ... API call
+```
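
The snippet above awaits `self.rate_limiter.acquire()` without showing the limiter itself. A minimal sketch of one possible limiter follows; the class name and the fixed-interval strategy are assumptions, and a token bucket would be an equally valid choice if short bursts should be allowed.

```python
import asyncio
import time


class IntervalRateLimiter:
    """Spaces calls at least 60 / requests_per_minute seconds apart."""

    def __init__(self, requests_per_minute: float) -> None:
        self._interval = 60.0 / requests_per_minute
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time when the next call may start

    async def acquire(self) -> None:
        # The lock serializes waiters so each one gets its own time slot.
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            if wait > 0:
                await asyncio.sleep(wait)
                now = time.monotonic()
            self._next_slot = max(now, self._next_slot) + self._interval


async def _demo() -> float:
    limiter = IntervalRateLimiter(requests_per_minute=6000)  # 10 ms apart
    start = time.monotonic()
    for _ in range(3):
        await limiter.acquire()
    return time.monotonic() - start


elapsed = asyncio.run(_demo())
```

With `LLM_REQUESTS_PER_MINUTE=300` the interval would be 0.2 seconds, so throughput stays under the configured rate regardless of how many coroutines are waiting.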
+
+### Security Checklist
+
+- [ ] API keys in .env, never in code
+- [ ] .env in .gitignore
+- [ ] PII redacted before the LLM
+- [ ] Logs without full transcripts
+- [ ] Rate limiting implemented
+- [ ] Exponential backoff on 429 errors
+
+---
+
+## 9. Discarded Alternatives
+
+### Whisper Local
+- **Pro**: Free, offline, suits sensitive data
+- **Con**: Needs a GPU, no native diarization, slower
+- **Decision**: Supported via adapter, not the default
+
+### LangChain
+- **Pro**: Useful abstractions, chains
+- **Con**: Unnecessary overhead and complexity for this use case
+- **Decision**: Direct SDK calls are sufficient
+
+### PostgreSQL/MySQL
+- **Pro**: Persistence, complex queries
+- **Con**: Requires a server, overkill for batch
+- **Decision**: DuckDB + JSON/Parquet files
+
+### Celery/Redis
+- **Pro**: Distributed job queue
+- **Con**: Additional infrastructure
+- **Decision**: asyncio + checkpointing is sufficient
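
To make the "asyncio + checkpointing" decision concrete, here is a minimal sketch of a resumable batch runner. The function name, the single JSON checkpoint file and the `worker` callable are assumptions for illustration; the project's actual checkpoint format is not described in this document.

```python
import asyncio
import json
from pathlib import Path
from typing import Awaitable, Callable, Iterable


async def process_batch(call_ids: Iterable[str],
                        worker: Callable[[str], Awaitable[dict]],
                        checkpoint: Path,
                        concurrency: int = 5) -> dict:
    """Run `worker` over call_ids, skipping ids already in the checkpoint."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    sem = asyncio.Semaphore(concurrency)  # caps in-flight calls

    async def run_one(call_id: str) -> None:
        async with sem:
            done[call_id] = await worker(call_id)
            # Persist after every call so a crash loses only in-flight work.
            checkpoint.write_text(json.dumps(done))

    await asyncio.gather(*(run_one(c) for c in call_ids if c not in done))
    return done
```

Re-running the same batch after an interruption only processes the ids missing from the checkpoint, which is the property a job queue would otherwise provide. Rewriting the whole file on every call is fine for a sketch, though an append-only log would scale better for 20K calls.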
+
+---
+
+## 10. Cost Summary
+
+### Base Parameters
+
+- **AHT (Average Handle Time)**: 7 minutes
+- **Transcript compression**: Assumed (~60% token reduction)
+
+### Per 5,000 calls
+
+| Service | Calculation | Range |
+|----------|---------|-------|
+| AssemblyAI STT | 35,000 min × $0.015/min | $525 - $580 |
+| OpenAI LLM (compressed) | 5,000 × $0.0005 | $2 - $3 |
+| OpenAI RCA synthesis | ~10 calls × $0.02 | $0.20 |
+| **TOTAL** | | **$530 - $590** |
+
+### Per 20,000 calls
+
+| Service | Calculation | Range |
+|----------|---------|-------|
+| AssemblyAI STT | 140,000 min × $0.015/min | $2,100 - $2,400 |
+| OpenAI LLM (compressed) | 20,000 × $0.0005 | $8 - $12 |
+| OpenAI RCA synthesis | ~10 calls × $0.02 | $0.20 |
+| **TOTAL** | | **$2,110 - $2,420** |
+
+### Without compression (pessimistic scenario)
+
+| Volume | STT | LLM (full transcript) | Total |
+|---------|-----|----------------------|-------|
+| 5,000 calls | $525-580 | $5-10 | **$530 - $590** |
+| 20,000 calls | $2,100-2,400 | $20-40 | **$2,120 - $2,440** |
+
+### Infrastructure Cost
+
+| Option | Cost |
+|--------|-------|
+| Local (your machine) | $0 |
+| VM cloud (processing) | $20-50/month |
+| Streamlit Cloud (demos) | Free |
+| VM + Nginx (production) | $30-80/month |