BeyondCX_Insights/docs/DEPLOYMENT.md

# CXInsights - Deployment Guide

## Deployment Model

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         DEPLOYMENT MODEL                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CXInsights is designed to run as LONG-RUNNING BATCH JOBS                   │
│  on a dedicated server (physical or VM), NOT as an elastic microservice.    │
│                                                                             │
│  ✅ Primary model: dedicated server, run via tmux/systemd                  │
│  ⚠️ Secondary model: Cloud VM (same architecture, different hosting)       │
│  📦 Optional: Docker (for portability, not for orchestration)              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Prerequisites

### Required software

| Software | Version | Purpose |
|----------|---------|---------|
| Python | 3.11+ | Runtime |
| Git | 2.40+ | Version control |
| ffmpeg | 6.0+ | Audio validation (optional) |
| tmux | 3.0+ | Persistent sessions for batch jobs |

### Accounts and API Keys

| Service | URL | Needed for |
|---------|-----|------------|
| AssemblyAI | https://assemblyai.com | STT transcription |
| OpenAI | https://platform.openai.com | LLM analysis |
| Anthropic | https://console.anthropic.com | Backup LLM (optional) |

---

## Capacity Planning (Static Sizing)

### Hardware Requirements

Sizing is **static**: provision once for the maximum expected volume. There is no auto-scaling.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CAPACITY PLANNING                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  VOLUME: 5,000 calls / batch                                                │
│  ├─ CPU: 4 cores (transcription is I/O bound, not CPU bound)                │
│  ├─ RAM: 8 GB                                                              │
│  ├─ Disk: 50 GB SSD (audio + transcripts + outputs)                         │
│  └─ Network: 100 Mbps (audio upload to the STT API)                         │
│                                                                             │
│  VOLUME: 20,000 calls / batch                                               │
│  ├─ CPU: 4-8 cores                                                         │
│  ├─ RAM: 16 GB                                                             │
│  ├─ Disk: 200 GB SSD                                                        │
│  └─ Network: 100+ Mbps                                                      │
│                                                                             │
│  NOTE: The bottleneck is the rate limit of the external APIs,               │
│        not local hardware. More cores do not speed up the pipeline.         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
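
To see why the rate limit, rather than the core count, sets the floor on runtime, the arithmetic can be sketched as follows. This is a minimal sketch: `min_stage_minutes` is a hypothetical helper, and one LLM request per call is an assumption, not a documented property of the pipeline.

```python
def min_stage_minutes(num_calls: int, requests_per_minute: int,
                      requests_per_call: int = 1) -> float:
    """Lower bound on wall-clock minutes for a stage capped at a request rate.

    requests_per_call=1 is an assumption; the real pipeline may issue more.
    """
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return num_calls * requests_per_call / requests_per_minute


# 20,000 calls at an internal cap of 200 requests per minute
print(min_stage_minutes(20_000, 200))  # 100.0
```

Adding cores leaves this bound unchanged; only a higher rate limit or fewer requests per call lowers it.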

### Disk space estimate

```
Per 1,000 calls (AHT = 7 min):
├─ Original audio:         ~2-4 GB (depends on bitrate)
├─ Raw transcripts:        ~100 MB
├─ Compressed transcripts: ~40 MB
├─ Features:               ~20 MB
├─ Labels (processed):     ~50 MB
├─ Final outputs:          ~10 MB
└─ TOTAL:                  ~2.2-4.2 GB per 1,000 calls

Recommendation (includes headroom):
├─ 5K calls:  50 GB available
└─ 20K calls: 200 GB available
```
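
The per-item figures can be turned into a quick estimate for any batch size. The sketch below is illustrative only: `estimate_disk_gb` is a hypothetical helper, it uses the line items listed above, and it treats 1 GB as 1,000 MB.

```python
# Per-1,000-call sizes in MB, taken from the breakdown in this section.
AUDIO_MB_PER_1K = (2_000, 4_000)  # (low, high); depends on bitrate
DERIVED_MB_PER_1K = {
    "transcripts_raw": 100,
    "transcripts_compressed": 40,
    "features": 20,
    "labels": 50,
    "outputs": 10,
}


def estimate_disk_gb(num_calls: int) -> tuple:
    """Return a (low, high) disk estimate in GB for a batch of num_calls."""
    scale = num_calls / 1000
    derived = sum(DERIVED_MB_PER_1K.values())  # 220 MB per 1,000 calls
    low = (AUDIO_MB_PER_1K[0] + derived) * scale / 1000
    high = (AUDIO_MB_PER_1K[1] + derived) * scale / 1000
    return (round(low, 1), round(high, 1))


print(estimate_disk_gb(5_000))   # (11.1, 21.1)
print(estimate_disk_gb(20_000))  # (44.4, 84.4)
```

Both results sit well below the recommended 50 GB and 200 GB, which leaves room for retries, temporary files, and retained batches.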

---

## Standard Deployment (Dedicated Server)

### 1. Prepare the server

```bash
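# Ubuntu 22.04 LTS (or similar)
# Note: the stock 22.04 repositories may ship older Git and ffmpeg builds than the
# versions listed under Prerequisites; check with `git --version` and `ffmpeg -version`.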
sudo apt update
sudo apt install -y python3.11 python3.11-venv git ffmpeg tmux
```

### 2. Clone the repository

```bash
# Recommended location: /opt/cxinsights or ~/cxinsights
# (writing to /opt usually requires sudo or a directory owned by your user)
cd /opt
git clone https://github.com/tu-org/cxinsights.git
cd cxinsights
```

### 3. Create the virtual environment

```bash
python3.11 -m venv .venv
source .venv/bin/activate
```

### 4. Install dependencies

```bash
# Base install
pip install -e .

# With PII detection (recommended)
pip install -e ".[pii]"

# With development tools
pip install -e ".[dev]"
```

### 5. Configure environment variables

```bash
cp .env.example .env
nano .env
```

Contents of `.env`:

```bash
# === API KEYS ===
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=sk-your_openai_key_here
ANTHROPIC_API_KEY=sk-ant-your_anthropic_key_here  # Optional

# === THROTTLING (tune manually based on your tier and testing) ===
# These are INTERNAL LIMITS, not guarantees from the APIs
MAX_CONCURRENT_TRANSCRIPTIONS=30    # AssemblyAI: start conservative
LLM_REQUESTS_PER_MINUTE=200         # OpenAI: depends on your tier
LLM_BACKOFF_BASE=2.0                # Base delay in seconds for retries
LLM_BACKOFF_MAX=60.0                # Maximum backoff in seconds
LLM_MAX_RETRIES=5

# === LOGGING ===
LOG_LEVEL=INFO
LOG_DIR=./data/logs

# === PATHS ===
DATA_DIR=./data
CONFIG_DIR=./config
```
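
As an illustration of how the throttling variables above could be read and sanity-checked before a long run, here is a minimal sketch. `ThrottleSettings` and `load_throttle_settings` are hypothetical names, not part of the CXInsights code; the defaults simply mirror the sample `.env`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ThrottleSettings:
    max_concurrent_transcriptions: int
    llm_requests_per_minute: int
    llm_backoff_base: float
    llm_backoff_max: float
    llm_max_retries: int


def _value(env: Mapping[str, str], key: str, default: str) -> str:
    """Return the raw setting with any trailing inline '# comment' removed."""
    return env.get(key, default).split("#", 1)[0].strip()


def load_throttle_settings(env: Optional[Mapping[str, str]] = None) -> ThrottleSettings:
    """Build settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    settings = ThrottleSettings(
        max_concurrent_transcriptions=int(_value(env, "MAX_CONCURRENT_TRANSCRIPTIONS", "30")),
        llm_requests_per_minute=int(_value(env, "LLM_REQUESTS_PER_MINUTE", "200")),
        llm_backoff_base=float(_value(env, "LLM_BACKOFF_BASE", "2.0")),
        llm_backoff_max=float(_value(env, "LLM_BACKOFF_MAX", "60.0")),
        llm_max_retries=int(_value(env, "LLM_MAX_RETRIES", "5")),
    )
    if settings.max_concurrent_transcriptions < 1 or settings.llm_requests_per_minute < 1:
        raise ValueError("concurrency and requests-per-minute must be at least 1")
    if settings.llm_backoff_base <= 0 or settings.llm_backoff_max < settings.llm_backoff_base:
        raise ValueError("backoff base must be positive and no larger than the maximum")
    return settings
```

Failing fast on a malformed value is cheaper than discovering it hours into a batch.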

### 6. Create the persistent data structure

```bash
# Initialization script (run once)
./scripts/init_data_structure.sh
```

Or manually:

```bash
mkdir -p data/{raw/audio,raw/metadata}
mkdir -p data/{transcripts/raw,transcripts/compressed}
mkdir -p data/features
mkdir -p data/processed
mkdir -p data/outputs
mkdir -p data/logs
mkdir -p data/.checkpoints
```

### 7. Verify the installation

```bash
python -m cxinsights.pipeline.cli --help
```

---

## Throttling Configuration

### Key concept

The `MAX_CONCURRENT_*` and `*_REQUESTS_PER_MINUTE` parameters are **internal throttles** that you tune manually based on:
1. Your tier on each API (OpenAI, AssemblyAI)
2. Real-world test runs
3. Observed 429 errors

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THROTTLING CONFIGURATION                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ASSEMBLYAI:                                                                │
│  ├─ Default: 100 concurrent transcriptions (per the docs)                   │
│  ├─ Initial recommendation: 30 (conservative)                               │
│  └─ Adjust based on observed errors                                         │
│                                                                             │
│  OPENAI:                                                                    │
│  ├─ Tier 1: 500 RPM → set 200 RPM internally                                │
│  ├─ Tier 2: 5000 RPM → set 2000 RPM internally                              │
│  ├─ Tier 3+: 5000+ RPM → set as needed                                      │
│  ├─ Limits vary by model; check your account's limits page                  │
│  └─ ALWAYS leave headroom (set 40-50% of the real limit)                    │
│                                                                             │
│  If you see 429 errors:                                                     │
│  1. Lower *_REQUESTS_PER_MINUTE                                             │
│  2. Exponential backoff will absorb short spikes                            │
│  3. Log the errors and adjust for the next batch                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
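
The interaction between the headroom rule and the backoff settings can be sketched as below. This assumes a plain capped exponential schedule, `min(LLM_BACKOFF_MAX, LLM_BACKOFF_BASE * 2^attempt)`; the pipeline's actual retry formula is not documented here, and both function names are hypothetical.

```python
def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def internal_rpm(tier_limit_rpm: int, percent: int = 40) -> int:
    """Internal requests-per-minute cap as an integer percentage of the tier limit."""
    return tier_limit_rpm * percent // 100


# With LLM_BACKOFF_BASE=2.0 and LLM_BACKOFF_MAX=60.0:
print([backoff_delay(n) for n in range(6)])  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
print(internal_rpm(500))        # 200
print(internal_rpm(5000, 50))   # 2500
```

Under these assumptions, five retries wait 62 seconds in total, so a sustained 429 condition exhausts the retries quickly and is better fixed by lowering the internal cap.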

---

## Running Batch Jobs

### Execution model: long-running batch jobs

CXInsights runs **long-lived processes** (6-24+ hours). Use tmux or systemd so a job keeps running after you disconnect.

### Option A: tmux (recommended for manual operation)

```bash
# Create a tmux session
tmux new-session -s cxinsights

# Inside tmux, run the pipeline
source .venv/bin/activate
python -m cxinsights.pipeline.cli run \
    --input ./data/raw/audio/batch_2024_01 \
    --batch-id batch_2024_01

# Detach from tmux: Ctrl+B, then D
# Re-attach: tmux attach -t cxinsights

# Watch logs in another tmux window
# Ctrl+B, then C (new window)
tail -f data/logs/pipeline_*.log
```

### Option B: systemd (recommended for scheduled execution)

```ini
# /etc/systemd/system/cxinsights-batch.service
[Unit]
Description=CXInsights Batch Processing
After=network.target

[Service]
Type=simple
User=cxinsights
WorkingDirectory=/opt/cxinsights
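# Keep the system directories on PATH so tools such as ffmpeg remain reachable
Environment="PATH=/opt/cxinsights/.venv/bin:/usr/local/bin:/usr/bin:/bin"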
ExecStart=/opt/cxinsights/.venv/bin/python -m cxinsights.pipeline.cli run \
    --input /opt/cxinsights/data/raw/audio/current_batch \
    --batch-id current_batch
Restart=no
StandardOutput=append:/opt/cxinsights/data/logs/systemd.log
StandardError=append:/opt/cxinsights/data/logs/systemd.log

[Install]
WantedBy=multi-user.target
```

```bash
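# Reload unit files and start the job
sudo systemctl daemon-reload
sudo systemctl start cxinsights-batch

# Check status
sudo systemctl status cxinsights-batch
journalctl -u cxinsights-batch -f

# Pipeline output is appended to data/logs/systemd.log (see StandardOutput in the unit),
# so follow that file for the job's own messages
tail -f /opt/cxinsights/data/logs/systemd.log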
```

### Basic command

```bash
python -m cxinsights.pipeline.cli run \
    --input ./data/raw/audio/batch_2024_01 \
    --batch-id batch_2024_01
```

### Available options

```bash
python -m cxinsights.pipeline.cli run --help

# Options:
#   --input PATH          Folder containing audio files [required]
#   --output PATH         Output folder [default: ./data]
#   --batch-id TEXT       Batch identifier [required]
#   --config PATH         Configuration file [default: ./config/settings.yaml]
#   --stages TEXT         Stages to run (comma-separated) [default: all]
#   --skip-transcription  Skip transcription (use existing transcripts)
#   --skip-inference      Skip inference (use existing labels)
#   --dry-run             Show what would be done without running it
#   --verbose             Detailed logging
```

### Running by stage (useful for debugging)

```bash
# Transcription only
python -m cxinsights.pipeline.cli run \
    --input ./data/raw/audio/batch_01 \
    --batch-id batch_01 \
    --stages transcription

# Features only (requires transcripts)
python -m cxinsights.pipeline.cli run \
    --batch-id batch_01 \
    --stages features

# Inference only (requires transcripts + features)
python -m cxinsights.pipeline.cli run \
    --batch-id batch_01 \
    --stages inference

# Aggregation and reports (requires labels)
python -m cxinsights.pipeline.cli run \
    --batch-id batch_01 \
    --stages aggregation,visualization
```

### Resume from checkpoint

```bash
# If the pipeline failed or was interrupted
python -m cxinsights.pipeline.cli resume --batch-id batch_01

# The system automatically detects:
# - Completed transcriptions
# - Extracted features
# - Labels already generated
# and continues from where it left off
```
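
Conceptually, resuming means comparing the inputs with the artifacts that already exist and processing only the difference. The sketch below illustrates that idea under an assumed layout in which each finished artifact shares its file stem with the source audio (for example `call_001.mp3` and `call_001.json`). The real checkpoint format under `data/.checkpoints` may differ, and both function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def pending_call_ids(audio_names: Iterable[str], done_names: Iterable[str]) -> list:
    """Call IDs (file stems) that have audio but no finished artifact yet."""
    audio_ids = {Path(n).stem for n in audio_names if not n.startswith(".")}
    done_ids = {Path(n).stem for n in done_names}
    return sorted(audio_ids - done_ids)


def pending_on_disk(audio_dir: Path, done_dir: Path) -> list:
    """Same comparison, reading file names from two directories."""
    audio = [p.name for p in audio_dir.iterdir() if p.is_file()]
    done = [p.name for p in done_dir.iterdir() if p.is_file()] if done_dir.is_dir() else []
    return pending_call_ids(audio, done)


print(pending_call_ids(["call_001.mp3", "call_002.mp3", ".gitkeep"], ["call_001.json"]))
# ['call_002']
```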

### Cost estimate before running

```bash
python -m cxinsights.pipeline.cli estimate --input ./data/raw/audio/batch_01

# Output:
# ┌─────────────────────────────────────────────────┐
# │           COST ESTIMATION (AHT=7min)            │
# ├─────────────────────────────────────────────────┤
# │ Files found:           5,234                    │
# │ Total duration:        ~611 hours               │
# │ Avg duration/call:     7.0 min                  │
# ├─────────────────────────────────────────────────┤
# │ Transcription (STT):   $540 - $600              │
# │ Inference (LLM):       $2.50 - $3.50            │
# │ TOTAL ESTIMATED:       $543 - $604              │
# └─────────────────────────────────────────────────┘
# Proceed? [y/N]:
```
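
The arithmetic behind an estimate of this kind is simple enough to check by hand. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and the two rates passed in the example are placeholder values chosen to land inside the sample ranges above, not published prices.

```python
def estimate_cost(num_files: int, avg_minutes: float,
                  stt_rate_per_hour: float, llm_cost_per_call: float) -> dict:
    """Rough batch cost: audio hours times an STT rate, plus a flat LLM cost per call."""
    hours = num_files * avg_minutes / 60
    stt = hours * stt_rate_per_hour
    llm = num_files * llm_cost_per_call
    return {"hours": hours, "stt": stt, "llm": llm, "total": stt + llm}


# Placeholder rates (assumptions): $0.90 per audio hour, $0.0005 per call
result = estimate_cost(5_234, 7.0, stt_rate_per_hour=0.90, llm_cost_per_call=0.0005)
print(round(result["hours"]))  # 611
print(f"STT ${result['stt']:.2f}  LLM ${result['llm']:.2f}  total ${result['total']:.2f}")
```

Transcription dominates the total, so audio duration is the number to verify before approving a run.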

---

## Log and Retention Policy

### Log structure

```
data/logs/
├── pipeline_2024_01_15_103000.log    # Main batch log
├── pipeline_2024_01_15_103000.err    # Errors, kept separately
├── transcription_2024_01_15.log      # STT detail
├── inference_2024_01_15.log          # LLM detail
└── systemd.log                       # If you use systemd
```

### Retention policy

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    RETENTION POLICY                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LOGS:                                                                      │
│  ├─ Pipeline logs: 30 days                                                  │
│  ├─ Error logs: 90 days                                                     │
│  └─ Rotation: daily, gzip compression after 7 days                          │
│                                                                             │
│  DATA:                                                                      │
│  ├─ Raw audio: delete after successful processing (or keep 30 days)         │
│  ├─ Raw transcripts: delete after 30 days                                   │
│  ├─ Compressed transcripts: delete after LLM processing                     │
│  ├─ Features: keep while labels exist                                       │
│  ├─ Labels (processed): keep indefinitely (no PII)                          │
│  ├─ Outputs (stats, RCA): keep indefinitely                                 │
│  └─ Checkpoints: delete after the batch completes                           │
│                                                                             │
│  IMPORTANT: Logs NEVER contain full transcripts.                            │
│             Only: call_id, timestamps, errors, metrics.                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### logrotate configuration (Linux)

```bash
# /etc/logrotate.d/cxinsights
/opt/cxinsights/data/logs/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 cxinsights cxinsights
}
```

### Manual cleanup script

```bash
#!/bin/bash
# scripts/cleanup_old_data.sh
# Run periodically (for example, from a weekly cron job)

DATA_DIR="/opt/cxinsights/data"
RETENTION_DAYS=30

echo "Cleaning data older than $RETENTION_DAYS days..."

# Old logs
find "$DATA_DIR/logs" -name "*.log" -mtime +"$RETENTION_DAYS" -delete
find "$DATA_DIR/logs" -name "*.gz" -mtime +90 -delete

# Old raw transcripts
find "$DATA_DIR/transcripts/raw" -name "*.json" -mtime +"$RETENTION_DAYS" -delete

# Checkpoints of completed batches (manual review recommended)
echo "Review and delete completed checkpoints manually:"
ls -la "$DATA_DIR/.checkpoints/"

echo "Cleanup complete."
```
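
Before letting a cleanup job delete anything, it can help to list what would be removed. The following is a minimal dry-run sketch in Python; `files_older_than` is a hypothetical helper, and its age test (modification time older than `days` times 24 hours) only approximates the whole-day rounding that `find -mtime` applies.

```python
import time
from pathlib import Path
from typing import Optional


def files_older_than(root: Path, pattern: str, days: int,
                     now: Optional[float] = None) -> list:
    """List files under `root` matching `pattern` whose mtime is older than `days` days.

    Nothing is deleted; the caller decides what to do with the result.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86_400
    return sorted(
        p for p in root.rglob(pattern)
        if p.is_file() and p.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Hypothetical usage: preview which logs a 30-day policy would remove
    for path in files_older_than(Path("data/logs"), "*.log", days=30):
        print("would delete:", path)
```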

---

## Dashboard (Visualization)

```bash
# Launch the dashboard
streamlit run src/visualization/dashboard.py -- --batch-id batch_2024_01

# Open: http://localhost:8501
# Or, on a remote server: http://servidor:8501 (replace "servidor" with your host)
```

### With authentication (nginx proxy)

See the "Streamlit - Deploy" section of TECH_STACK.md for the nginx basic-auth configuration.

---

## Output Structure

After the pipeline runs:

```
data/outputs/batch_2024_01/
├── aggregated_stats.json           # Consolidated statistics
├── call_matrix.csv                 # All calls with their labels
├── rca_lost_sales.json             # RCA tree for lost sales
├── rca_poor_cx.json                # RCA tree for poor CX
├── emergent_drivers_review.json    # OTHER_EMERGENT items for review
├── validation_report.json          # Quality gate result
├── executive_summary.pdf           # Executive report
├── full_analysis.xlsx              # Excel workbook with drill-down
└── figures/
    ├── rca_tree_lost_sales.png
    ├── rca_tree_poor_cx.png
    └── ...
```
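
A quick way to confirm that a batch finished cleanly is to check that every top-level file in the listing above exists. The sketch below does only that; `missing_outputs` is a hypothetical helper, and it makes no assumption about the contents or schema of any file.

```python
from pathlib import Path

# Top-level files from the listing above.
EXPECTED_OUTPUTS = (
    "aggregated_stats.json",
    "call_matrix.csv",
    "rca_lost_sales.json",
    "rca_poor_cx.json",
    "emergent_drivers_review.json",
    "validation_report.json",
    "executive_summary.pdf",
    "full_analysis.xlsx",
)


def missing_outputs(batch_dir: Path) -> list:
    """Names of expected output files that are absent from batch_dir."""
    return [name for name in EXPECTED_OUTPUTS if not (batch_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs(Path("data/outputs/batch_2024_01"))
    print("all outputs present" if not missing else f"missing: {missing}")
```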

---

## Deployment Script (deploy.sh)

Script for the initial setup of the persistent environment.

```bash
#!/bin/bash
# deploy.sh - Initial setup of the persistent environment
# Run ONCE when installing on a new server, from the repository root

set -e

INSTALL_DIR="${INSTALL_DIR:-/opt/cxinsights}"
PYTHON_VERSION="python3.11"

echo "======================================"
echo "CXInsights - Initial Deployment"
echo "======================================"
echo "Install directory: $INSTALL_DIR"
echo ""

# 1. Check Python
if ! command -v "$PYTHON_VERSION" &> /dev/null; then
    echo "ERROR: $PYTHON_VERSION not found"
    echo "Install with: sudo apt install python3.11 python3.11-venv"
    exit 1
fi
echo "✓ Python: $($PYTHON_VERSION --version)"

# 2. Check that we are in the right directory
if [ ! -f "pyproject.toml" ]; then
    echo "ERROR: pyproject.toml not found. Run from repository root."
    exit 1
fi
echo "✓ Repository structure verified"

# 3. Create the virtual environment (if it does not exist)
if [ ! -d ".venv" ]; then
    echo "Creating virtual environment..."
    $PYTHON_VERSION -m venv .venv
fi
source .venv/bin/activate
echo "✓ Virtual environment: .venv"

# 4. Install dependencies
echo "Installing dependencies..."
pip install -q --upgrade pip
pip install -q -e .
echo "✓ Dependencies installed"

# 5. Set up .env (if it does not exist)
if [ ! -f ".env" ]; then
    if [ -f ".env.example" ]; then
        cp .env.example .env
        echo "⚠ Created .env from template - CONFIGURE API KEYS"
    else
        echo "ERROR: .env.example not found"
        exit 1
    fi
else
    echo "✓ .env exists"
fi

# 6. Create the persistent data structure (idempotent)
echo "Creating data directory structure..."
mkdir -p data/raw/audio
mkdir -p data/raw/metadata
mkdir -p data/transcripts/raw
mkdir -p data/transcripts/compressed
mkdir -p data/features
mkdir -p data/processed
mkdir -p data/outputs
mkdir -p data/logs
mkdir -p data/.checkpoints

# Create .gitkeep files to preserve the structure in git
touch data/raw/audio/.gitkeep
touch data/raw/metadata/.gitkeep
touch data/transcripts/raw/.gitkeep
touch data/transcripts/compressed/.gitkeep
touch data/features/.gitkeep
touch data/processed/.gitkeep
touch data/outputs/.gitkeep
touch data/logs/.gitkeep

echo "✓ Data directories created"

# 7. Check API keys in .env
source .env
if [ -z "$ASSEMBLYAI_API_KEY" ] || [ "$ASSEMBLYAI_API_KEY" = "your_assemblyai_key_here" ]; then
    echo ""
    echo "⚠ WARNING: ASSEMBLYAI_API_KEY not configured in .env"
fi
if [ -z "$OPENAI_API_KEY" ] || [ "$OPENAI_API_KEY" = "sk-your_openai_key_here" ]; then
    echo "⚠ WARNING: OPENAI_API_KEY not configured in .env"
fi

# 8. Verify the installation
echo ""
echo "Verifying installation..."
# Test the command directly: under `set -e`, a separate `$?` check never runs on failure
if python -m cxinsights.pipeline.cli --help > /dev/null 2>&1; then
    echo "✓ CLI verification passed"
else
    echo "ERROR: CLI verification failed"
    exit 1
fi

echo ""
echo "======================================"
echo "Deployment complete!"
echo "======================================"
echo ""
echo "Next steps:"
echo "  1. Configure API keys in .env"
echo "  2. Copy audio files to data/raw/audio/your_batch/"
echo "  3. Start tmux session: tmux new -s cxinsights"
echo "  4. Activate venv: source .venv/bin/activate"
echo "  5. Run pipeline:"
echo "     python -m cxinsights.pipeline.cli run \\"
echo "         --input ./data/raw/audio/your_batch \\"
echo "         --batch-id your_batch"
echo ""
```

```bash
# Usage:
chmod +x deploy.sh
./deploy.sh
```

---

## Docker (Optional)

Docker is an option for **portability**, not the primary deployment path.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DOCKER - DISCLAIMER                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Docker is OPTIONAL and is provided for:                                    │
│  ├─ Environments where Python cannot be installed directly                  │
│  ├─ Exact reproducibility of the environment                                │
│  └─ Integration with existing CI/CD systems                                 │
│                                                                             │
│  Docker is NOT needed for:                                                  │
│  ├─ Normal execution on a dedicated server                                  │
│  ├─ Better performance                                                      │
│  └─ Horizontal scaling (does not apply to this workload)                    │
│                                                                             │
│  The standard deployment (venv + tmux/systemd) is preferred.                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# System dependencies
RUN apt-get update && \
    apt-get install -y ffmpeg && \
    rm -rf /var/lib/apt/lists/*

# Copy the code
COPY pyproject.toml .
COPY src/ src/
COPY config/ config/

# Install Python dependencies
RUN pip install --no-cache-dir -e .

# Volume for persistent data
VOLUME ["/app/data"]

ENTRYPOINT ["python", "-m", "cxinsights.pipeline.cli"]
```

### Usage

```bash
# Build
docker build -t cxinsights:latest .

# Run (mount the data volume)
docker run -it \
    -v /path/to/data:/app/data \
    --env-file .env \
    cxinsights:latest run \
    --input /app/data/raw/audio/batch_01 \
    --batch-id batch_01
```

---

## Cloud VM (Secondary Option)

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CLOUD VM - DISCLAIMER                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Use a Cloud VM (AWS EC2, GCP Compute, Azure VM) when:                      │
│  ├─ You have no physical server available                                   │
│  ├─ You need remote access from multiple locations                          │
│  └─ You want to delegate hardware maintenance                               │
│                                                                             │
│  The architecture is IDENTICAL to the dedicated server:                     │
│  ├─ Same static sizing (no auto-scaling)                                    │
│  ├─ Same execution model (long-running batch)                               │
│  ├─ Same manual throttling configuration                                    │
│  └─ Only the server's location changes                                      │
│                                                                             │
│  ADDITIONAL COST: $30-100/month for the VM (depending on specs)             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Setup on a Cloud VM

```bash
# 1. Create the VM (AWS example)
# - Ubuntu 22.04 LTS
# - t3.xlarge (4 vCPU, 16 GB RAM) for 20K calls
# - 200 GB gp3 SSD
# - Security group: SSH (22), optionally 8501 for the dashboard

# 2. Connect
ssh -i key.pem ubuntu@vm-ip

# 3. Follow the "Standard Deployment" steps above
# (identical to the dedicated server)
```

---

## Troubleshooting

### Error: Invalid API key

```
Error: AssemblyAI authentication failed
```

**Solution**: Check `ASSEMBLYAI_API_KEY` in `.env`

### Error: Rate limit exceeded (429)

```
Error: OpenAI rate limit exceeded
```

**Solution**:
1. Lower `LLM_REQUESTS_PER_MINUTE` in `.env`
2. Automatic backoff will absorb temporary spikes
3. Check your tier in the OpenAI dashboard

### Error: Out of memory

```
MemoryError: Unable to allocate array
```

**Solution**:
- Process in smaller batches
- Increase the server's RAM
- Use `--stages` to run the pipeline in parts

### Error: Transcription failed

```
Error: Transcription failed for call_xxx.mp3
```

**Solution**:
- Check the file: `ffprobe call_xxx.mp3`
- Check that it does not exceed 5 hours (the AssemblyAI limit assumed in this guide; confirm it against the current AssemblyAI documentation)
- The pipeline continues with the remaining calls
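
The duration check can be scripted rather than read off `ffprobe` output by eye. The sketch below is one way to do it, assuming `ffprobe` is on the PATH; the function names are hypothetical, and the 5-hour default simply mirrors the limit quoted in this guide.

```python
import subprocess


def parse_duration_seconds(ffprobe_output: str) -> float:
    """Parse the single number printed by the ffprobe command used below."""
    return float(ffprobe_output.strip())


def probe_duration_seconds(path: str) -> float:
    """Ask ffprobe for the container duration; raises if ffprobe fails."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return parse_duration_seconds(result.stdout)


def exceeds_limit(duration_seconds: float, max_hours: float = 5.0) -> bool:
    """True when the file is longer than the limit cited in this guide."""
    return duration_seconds > max_hours * 3600


if __name__ == "__main__":
    seconds = probe_duration_seconds("call_xxx.mp3")
    print(f"{seconds / 60:.1f} min, over limit: {exceeds_limit(seconds)}")
```

A non-zero exit from `ffprobe` surfaces as `subprocess.CalledProcessError`, which usually indicates a corrupt or unsupported file.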

### Viewing detailed logs

```bash
# Main pipeline log
tail -f data/logs/pipeline_*.log

# Verbose mode
python -m cxinsights.pipeline.cli run ... --verbose

# If you use systemd (job output goes to data/logs/systemd.log; journalctl shows unit events)
journalctl -u cxinsights-batch -f
```

---

## Pre-Run Checklist

```
SERVER:
[ ] Python 3.11+ installed
[ ] tmux installed
[ ] Enough disk space (see Capacity Planning)
[ ] Stable network connectivity

APPLICATION:
[ ] Repository cloned
[ ] Virtual environment created and activated
[ ] Dependencies installed (pip install -e .)
[ ] .env configured with API keys
[ ] Throttling configured for your tier

DATA:
[ ] Audio files in data/raw/audio/batch_id/
[ ] Cost estimate reviewed (estimate command)
[ ] Directory structure created

EXECUTION:
[ ] tmux session started (or systemd configured)
[ ] Logs can be monitored
```
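
Several of the checklist items can be verified automatically. The sketch below covers the Python version, tmux, the two required API keys, and free disk space; `preflight` is a hypothetical helper, not a CXInsights command, and the injectable `which` and `python_version` parameters exist only to make it easy to test.

```python
import os
import shutil
import sys
from typing import Callable, Mapping, Optional, Tuple


def preflight(data_dir: str, min_free_gb: float,
              env: Optional[Mapping[str, str]] = None,
              which: Callable[[str], Optional[str]] = shutil.which,
              python_version: Optional[Tuple[int, int]] = None) -> list:
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    version = python_version or tuple(sys.version_info[:2])
    problems = []
    if version < (3, 11):
        problems.append("Python 3.11+ is required")
    if which("tmux") is None:
        problems.append("tmux not found on PATH")
    for key in ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY"):
        if not env.get(key):
            problems.append(f"{key} is not set")
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, need {min_free_gb} GB")
    return problems


if __name__ == "__main__":
    # Hypothetical usage: 50 GB matches the 5K-call recommendation
    for problem in preflight(".", min_free_gb=50):
        print("FAIL:", problem)
```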

---

## Makefile (Useful commands)

```makefile
.PHONY: install install-pii dev test test-cov lint format run estimate resume \
        dashboard status logs clean-logs clean-checkpoints

# Installation
install:
	pip install -e .

install-pii:
	pip install -e ".[pii]"

dev:
	pip install -e ".[dev]"

# Testing
test:
	pytest tests/ -v

test-cov:
	pytest tests/ --cov=src --cov-report=html

# Linting
lint:
	ruff check src/
	mypy src/

format:
	ruff format src/

# Execution
run:
	python -m cxinsights.pipeline.cli run --input $(INPUT) --batch-id $(BATCH)

estimate:
	python -m cxinsights.pipeline.cli estimate --input $(INPUT)

resume:
	python -m cxinsights.pipeline.cli resume --batch-id $(BATCH)

dashboard:
	streamlit run src/visualization/dashboard.py -- --batch-id $(BATCH)

# Monitoring
status:
	@echo "=== Pipeline Status ==="
	@ls -la data/.checkpoints/ 2>/dev/null || echo "No active checkpoints"
	@echo ""
	@echo "=== Recent Logs ==="
	@(ls -lt data/logs/*.log 2>/dev/null || echo "No logs found") | head -5

logs:
	tail -f data/logs/pipeline_*.log

# Cleanup (CAREFUL: do not delete production data)
clean-logs:
	find data/logs -name "*.log" -mtime +30 -delete
	find data/logs -name "*.gz" -mtime +90 -delete

clean-checkpoints:
	@echo "Review before deleting:"
	@ls -la data/.checkpoints/
	@printf "Delete all checkpoints? [y/N] "; read confirm; \
	if [ "$$confirm" = "y" ]; then rm -rf data/.checkpoints/*; fi
```

Usage:

```bash
make install
make run INPUT=./data/raw/audio/batch_01 BATCH=batch_01
make logs
make status
make dashboard BATCH=batch_01
```