CXInsights - Deployment Guide

Deployment Model

┌─────────────────────────────────────────────────────────────────────────────┐
│                         DEPLOYMENT MODEL                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CXInsights is designed to run as LONG-RUNNING BATCH JOBS on a dedicated    │
│  server (physical or VM), NOT as an elastic microservice.                   │
│                                                                             │
│  ✅ Primary model: dedicated server, run via tmux/systemd                  │
│  ⚠️ Secondary model: cloud VM (same architecture, different hosting)       │
│  📦 Optional: Docker (for portability, not for orchestration)              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Prerequisites

Required software

Software	Version	Purpose
Python	3.11+	Runtime
Git	2.40+	Version control
ffmpeg	6.0+	Audio validation (optional)
tmux	3.0+	Persistent sessions for batch jobs

Accounts and API Keys

Service	URL	Needed for
AssemblyAI	https://assemblyai.com	STT transcription
OpenAI	https://platform.openai.com	LLM analysis
Anthropic	https://console.anthropic.com	Backup LLM (optional)

Capacity Planning (Static Sizing)

Hardware Requirements

Sizing is static and based on the maximum expected volume. There is no auto-scaling.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CAPACITY PLANNING                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  VOLUME: 5,000 calls / batch                                                │
│  ├─ CPU: 4 cores (transcription is I/O bound, not CPU bound)                │
│  ├─ RAM: 8 GB                                                              │
│  ├─ Disk: 50 GB SSD (audio + transcripts + outputs)                         │
│  └─ Network: 100 Mbps (audio upload to the STT API)                         │
│                                                                             │
│  VOLUME: 20,000 calls / batch                                               │
│  ├─ CPU: 4-8 cores                                                         │
│  ├─ RAM: 16 GB                                                             │
│  ├─ Disk: 200 GB SSD                                                        │
│  └─ Network: 100+ Mbps                                                      │
│                                                                             │
│  NOTE: The bottleneck is the rate limit of the external APIs,               │
│        not local hardware. More cores do not speed up the pipeline.         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Disk space estimate

Per 1,000 calls (AHT = 7 min):
├─ Original audio:     ~2-4 GB (depends on bitrate)
├─ Raw transcripts:    ~100 MB
├─ Compressed transcripts: ~40 MB
├─ Features:           ~20 MB
├─ Labels (processed): ~50 MB
├─ Final outputs:      ~10 MB
└─ TOTAL:              ~2.5-4.5 GB per 1,000 calls

Recommendation:
├─ 5K calls:  50 GB available
└─ 20K calls: 200 GB available
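
The per-1,000-call total above scales linearly with batch size, so the recommendations can be sanity-checked with a few lines of Python. This is a minimal sketch: the function name `estimate_disk_gb` and the `headroom` multiplier are assumptions made for illustration and are not part of CXInsights.

```python
import math

# Range per 1,000 calls, taken from the TOTAL line of the estimate above.
GB_PER_1000_CALLS = (2.5, 4.5)


def estimate_disk_gb(num_calls: int, headroom: float = 2.0) -> dict:
    """Return the expected disk usage range in GB and a size to provision.

    `headroom` is a hypothetical safety multiplier, not a value from the guide.
    """
    low = num_calls / 1000 * GB_PER_1000_CALLS[0]
    high = num_calls / 1000 * GB_PER_1000_CALLS[1]
    return {
        "low_gb": low,
        "high_gb": high,
        "provision_gb": math.ceil(high * headroom),
    }


if __name__ == "__main__":
    for calls in (5_000, 20_000):
        print(calls, estimate_disk_gb(calls))
```

For 5,000 and 20,000 calls the upper bounds come out at 22.5 GB and 90 GB, so the recommended 50 GB and 200 GB leave roughly twice the expected peak usage free.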

Standard Deployment (Dedicated Server)

1. Prepare the server

# Ubuntu 22.04 LTS (or similar)
sudo apt update
sudo apt install -y python3.11 python3.11-venv git ffmpeg tmux

2. Clone the repository

# Recommended location: /opt/cxinsights or ~/cxinsights
cd /opt
git clone https://github.com/tu-org/cxinsights.git
cd cxinsights

3. Create the virtual environment

python3.11 -m venv .venv
source .venv/bin/activate

4. Install dependencies

# Base install
pip install -e .

# With PII detection (recommended)
pip install -e ".[pii]"

# With development tools
pip install -e ".[dev]"

5. Configure environment variables

cp .env.example .env
nano .env

Contents of .env:

# === API KEYS ===
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=sk-your_openai_key_here
ANTHROPIC_API_KEY=sk-ant-your_anthropic_key_here  # Optional

# === THROTTLING (tune manually based on your tier and testing) ===
# These are INTERNAL LIMITS, not guarantees from the APIs
MAX_CONCURRENT_TRANSCRIPTIONS=30    # AssemblyAI: start conservative
LLM_REQUESTS_PER_MINUTE=200         # OpenAI: depends on your tier
LLM_BACKOFF_BASE=2.0                # Base seconds for retry
LLM_BACKOFF_MAX=60.0                # Maximum backoff
LLM_MAX_RETRIES=5

# === LOGGING ===
LOG_LEVEL=INFO
LOG_DIR=./data/logs

# === PATHS ===
DATA_DIR=./data
CONFIG_DIR=./config
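
To make the throttling variables concrete, here is a minimal sketch of how such values could be read and validated in Python. How CXInsights actually loads its configuration is not documented in this guide, so `ThrottleSettings` and `load_throttle_settings` are hypothetical names; only the variable names and default values come from the .env template above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ThrottleSettings:
    max_concurrent_transcriptions: int
    llm_requests_per_minute: int
    llm_backoff_base: float
    llm_backoff_max: float
    llm_max_retries: int


def load_throttle_settings(env=None) -> ThrottleSettings:
    """Read throttling values from a mapping (defaults to os.environ).

    Missing variables fall back to the values shown in the .env template.
    """
    env = os.environ if env is None else env
    settings = ThrottleSettings(
        max_concurrent_transcriptions=int(env.get("MAX_CONCURRENT_TRANSCRIPTIONS", "30")),
        llm_requests_per_minute=int(env.get("LLM_REQUESTS_PER_MINUTE", "200")),
        llm_backoff_base=float(env.get("LLM_BACKOFF_BASE", "2.0")),
        llm_backoff_max=float(env.get("LLM_BACKOFF_MAX", "60.0")),
        llm_max_retries=int(env.get("LLM_MAX_RETRIES", "5")),
    )
    # Fail early on values that would make the throttle meaningless.
    if settings.llm_backoff_base <= 0 or settings.llm_backoff_max < settings.llm_backoff_base:
        raise ValueError("backoff values must satisfy 0 < base <= max")
    if min(settings.max_concurrent_transcriptions, settings.llm_requests_per_minute) < 1:
        raise ValueError("concurrency and RPM limits must be at least 1")
    return settings


if __name__ == "__main__":
    print(load_throttle_settings())
```

Validating at startup is a design choice worth considering for jobs that run for many hours: a typo in .env surfaces immediately instead of partway through a batch.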

6. Create the persistent data structure

# Initialization script (run only once)
./scripts/init_data_structure.sh

Or manually:

mkdir -p data/{raw/audio,raw/metadata}
mkdir -p data/{transcripts/raw,transcripts/compressed}
mkdir -p data/features
mkdir -p data/processed
mkdir -p data/outputs
mkdir -p data/logs
mkdir -p data/.checkpoints

7. Verify the installation

python -m cxinsights.pipeline.cli --help

Throttling Configuration

Key concept

The MAX_CONCURRENT_* and *_REQUESTS_PER_MINUTE parameters are internal throttles that you tune manually based on:

Your tier on the APIs (OpenAI, AssemblyAI)
Behavior observed in real tests
429 errors observed

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THROTTLING CONFIGURATION                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ASSEMBLYAI:                                                                │
│  ├─ Default: 100 concurrent transcriptions (per the docs)                   │
│  ├─ Initial recommendation: 30 (conservative)                               │
│  └─ Adjust based on observed errors                                         │
│                                                                             │
│  OPENAI:                                                                    │
│  ├─ Tier 1 (free): 500 RPM → set 200 RPM internally                         │
│  ├─ Tier 2: 5000 RPM → set 2000 RPM internally                              │
│  ├─ Tier 3+: 5000+ RPM → set as needed                                      │
│  └─ ALWAYS leave headroom (configure 40-50% of the real limit)              │
│                                                                             │
│  If you see 429 errors:                                                     │
│  1. Reduce *_REQUESTS_PER_MINUTE                                            │
│  2. Exponential backoff will absorb spikes                                  │
│  3. Log it and adjust for the next batch                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
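
The two rules in the box (configure a fraction of the provider limit, and back off exponentially on 429s) can be illustrated as follows. This is a sketch under stated assumptions: the guide names the backoff base, cap and retry count but does not give the formula, so the doubling schedule below is an assumption, and `internal_rpm` and `backoff_delays` are hypothetical helpers.

```python
def internal_rpm(tier_limit_rpm: int, fraction: float = 0.4) -> int:
    """Internal requests-per-minute cap as a fraction of the provider limit."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(tier_limit_rpm * fraction)


def backoff_delays(base: float = 2.0, max_delay: float = 60.0, max_retries: int = 5) -> list:
    """Seconds to wait before each retry: base, 2*base, 4*base, ... capped at max_delay.

    The doubling schedule is assumed; the guide only calls the backoff "exponential".
    """
    return [min(base * 2 ** attempt, max_delay) for attempt in range(max_retries)]


if __name__ == "__main__":
    print(internal_rpm(500))    # 200, matching the Tier 1 example
    print(internal_rpm(5000))   # 2000, matching the Tier 2 example
    print(backoff_delays())     # [2.0, 4.0, 8.0, 16.0, 32.0]
```

With the template defaults (base 2.0, cap 60.0, 5 retries) the cap is never reached; it only comes into play if LLM_MAX_RETRIES is raised to 6 or more.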

Running Batch Jobs

Execution model: Long-running batch jobs

CXInsights runs long-lived processes (6-24+ hours). Use tmux or systemd for persistence.

Option A: tmux (recommended for manual operation)

# Create a tmux session
tmux new-session -s cxinsights

# Inside tmux, run the pipeline
source .venv/bin/activate
python -m cxinsights.pipeline.cli run \
    --input ./data/raw/audio/batch_2024_01 \
    --batch-id batch_2024_01

# Detach from tmux: Ctrl+B, then D
# Re-attach: tmux attach -t cxinsights

# View logs in another tmux window
# Ctrl+B, then C (new window)
tail -f data/logs/pipeline_*.log

Option B: systemd (recommended for scheduled execution)

# /etc/systemd/system/cxinsights-batch.service
[Unit]
Description=CXInsights Batch Processing
After=network.target

[Service]
Type=simple
User=cxinsights
WorkingDirectory=/opt/cxinsights
Environment="PATH=/opt/cxinsights/.venv/bin:/usr/local/bin:/usr/bin:/bin"
ExecStart=/opt/cxinsights/.venv/bin/python -m cxinsights.pipeline.cli run \
    --input /opt/cxinsights/data/raw/audio/current_batch \
    --batch-id current_batch
Restart=no
StandardOutput=append:/opt/cxinsights/data/logs/systemd.log
StandardError=append:/opt/cxinsights/data/logs/systemd.log

[Install]
WantedBy=multi-user.target

# Reload unit files and start the job
sudo systemctl daemon-reload
sudo systemctl start cxinsights-batch

# Check status
sudo systemctl status cxinsights-batch
# The journal shows unit start/stop events; pipeline output is appended to systemd.log
journalctl -u cxinsights-batch -f

Basic command

python -m cxinsights.pipeline.cli run \
    --input ./data/raw/audio/batch_2024_01 \
    --batch-id batch_2024_01

Available options

python -m cxinsights.pipeline.cli run --help

# Options:
#   --input PATH          Folder with audio files [required]
#   --output PATH         Output folder [default: ./data]
#   --batch-id TEXT       Batch identifier [required]
#   --config PATH         Configuration file [default: ./config/settings.yaml]
#   --stages TEXT         Stages to run (comma-separated) [default: all]
#   --skip-transcription  Skip transcription (use existing)
#   --skip-inference      Skip inference (use existing)
#   --dry-run             Show what would be done without running it
#   --verbose             Detailed logging

Running by stage (useful for debugging)

# Transcription only
python -m cxinsights.pipeline.cli run \
    --input ./data/raw/audio/batch_01 \
    --batch-id batch_01 \
    --stages transcription

# Features only (requires transcripts)
python -m cxinsights.pipeline.cli run \
    --batch-id batch_01 \
    --stages features

# Inference only (requires transcripts + features)
python -m cxinsights.pipeline.cli run \
    --batch-id batch_01 \
    --stages inference

# Aggregation and reports (requires labels)
python -m cxinsights.pipeline.cli run \
    --batch-id batch_01 \
    --stages aggregation,visualization

Resuming from a checkpoint

# If the pipeline failed or was interrupted
python -m cxinsights.pipeline.cli resume --batch-id batch_01

# The system automatically detects:
# - Completed transcriptions
# - Extracted features
# - Labels already generated
# - Continues from where it left off
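
The idea behind resuming can be illustrated by checking which per-call artifacts already exist on disk. The real checkpoint format under data/.checkpoints is not described in this guide, so the following is only a sketch: it assumes one `<call_id>.json` file per call in each stage directory, and `STAGE_DIRS` and `pending_calls` are hypothetical names.

```python
from pathlib import Path

# Stage -> directory under DATA_DIR. The directories come from the data layout
# in this guide; the one-JSON-file-per-call convention is an assumption.
STAGE_DIRS = {
    "transcription": "transcripts/raw",
    "features": "features",
    "inference": "processed",
}


def pending_calls(data_dir: Path, call_ids: list) -> dict:
    """Return, per stage, the call IDs that have no output file yet."""
    pending = {}
    for stage, subdir in STAGE_DIRS.items():
        done = {path.stem for path in (data_dir / subdir).glob("*.json")}
        pending[stage] = [call_id for call_id in call_ids if call_id not in done]
    return pending


if __name__ == "__main__":
    print(pending_calls(Path("./data"), ["call_001", "call_002"]))
```

Deriving the remaining work from existing outputs keeps a resume idempotent: running it twice does not repeat calls that already finished.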

Cost estimate before running

python -m cxinsights.pipeline.cli estimate --input ./data/raw/audio/batch_01

# Output:
# ┌─────────────────────────────────────────────────┐
# │           COST ESTIMATION (AHT=7min)            │
# ├─────────────────────────────────────────────────┤
# │ Files found:           5,234                    │
# │ Total duration:        ~611 hours               │
# │ Avg duration/call:     7.0 min                  │
# ├─────────────────────────────────────────────────┤
# │ Transcription (STT):   $540 - $600              │
# │ Inference (LLM):       $2.50 - $3.50            │
# │ TOTAL ESTIMATED:       $543 - $604              │
# └─────────────────────────────────────────────────┘
# Proceed? [y/N]:
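
The figures in the sample output are internally consistent, which the short check below reproduces. It is a sketch only: `summarize_estimate` is a hypothetical helper, all inputs are copied from the sample output, and the per-hour STT rate is merely what those figures imply, not a published price.

```python
def summarize_estimate(files: int, avg_minutes: float, stt_range: tuple, llm_range: tuple) -> dict:
    """Recompute total duration and total cost range from the sample figures."""
    total_hours = files * avg_minutes / 60
    return {
        "total_hours": total_hours,
        "total_low": stt_range[0] + llm_range[0],
        "total_high": stt_range[1] + llm_range[1],
        # Rate implied by the sample output, in dollars per audio hour.
        "implied_stt_per_hour": (stt_range[0] / total_hours, stt_range[1] / total_hours),
    }


if __name__ == "__main__":
    result = summarize_estimate(5_234, 7.0, (540.0, 600.0), (2.50, 3.50))
    print(f"Total duration: ~{result['total_hours']:.0f} hours")
    print(f"Total cost: ${result['total_low']} - ${result['total_high']}")
```

Transcription accounts for almost all of the estimated cost, so the total duration of the audio is the number to verify before confirming the prompt.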

Log and Retention Policy

Log structure

data/logs/
├── pipeline_2024_01_15_103000.log    # Main batch log
├── pipeline_2024_01_15_103000.err    # Errors, kept separately
├── transcription_2024_01_15.log      # STT detail
├── inference_2024_01_15.log          # LLM detail
└── systemd.log                       # If you use systemd

Retention policy

┌─────────────────────────────────────────────────────────────────────────────┐
│                    RETENTION POLICY                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LOGS:                                                                      │
│  ├─ Pipeline logs: 30 days                                                  │
│  ├─ Error logs: 90 days                                                     │
│  └─ Rotation: daily, gzip compression after 7 days                          │
│                                                                             │
│  DATA:                                                                      │
│  ├─ Raw audio: delete after successful processing (or retain 30 days)       │
│  ├─ Raw transcripts: delete after 30 days                                   │
│  ├─ Compressed transcripts: delete after LLM processing                     │
│  ├─ Features: retain while labels exist                                     │
│  ├─ Labels (processed): retain indefinitely (no PII)                        │
│  ├─ Outputs (stats, RCA): retain indefinitely                               │
│  └─ Checkpoints: delete after the batch completes                           │
│                                                                             │
│  IMPORTANT: Logs NEVER contain full transcripts                             │
│             Only: call_id, timestamps, errors, metrics                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

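logrotate configuration (Linux)

# /etc/logrotate.d/cxinsights
# Note: delaycompress postpones compression by one rotation cycle (one day
# with "daily"), not by the 7 days stated in the retention policy.
/opt/cxinsights/data/logs/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 cxinsights cxinsights
}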

Manual cleanup script

#!/bin/bash
# scripts/cleanup_old_data.sh
# Run periodically (weekly cron)

DATA_DIR="/opt/cxinsights/data"
RETENTION_DAYS=30

echo "Cleaning data older than $RETENTION_DAYS days..."

# Old logs (error logs follow the 90-day retention policy)
find "$DATA_DIR/logs" -name "*.log" -mtime +"$RETENTION_DAYS" -delete
find "$DATA_DIR/logs" -name "*.err" -mtime +90 -delete
find "$DATA_DIR/logs" -name "*.gz" -mtime +90 -delete

# Old raw transcripts
find "$DATA_DIR/transcripts/raw" -name "*.json" -mtime +"$RETENTION_DAYS" -delete

# Checkpoints of completed batches (manual review recommended)
echo "Review and delete completed checkpoints manually:"
ls -la "$DATA_DIR/.checkpoints/"

echo "Cleanup complete."
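
Because the cleanup script deletes files, it can help to preview what would be removed first. The following dry-run sketch lists files older than a cutoff without deleting anything. `files_older_than` is a hypothetical helper; it compares modification times against an exact cutoff, which only roughly mirrors how `find -mtime +N` counts whole days.

```python
import time
from pathlib import Path
from typing import Optional


def files_older_than(directory: Path, pattern: str, days: int, now: Optional[float] = None) -> list:
    """List files matching `pattern` whose mtime is more than `days` days old.

    Nothing is deleted; the caller decides what to do with the result.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    return sorted(
        path
        for path in directory.glob(pattern)
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    logs = Path("/opt/cxinsights/data/logs")
    for path in files_older_than(logs, "*.log", 30):
        print("would delete:", path)
```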

Dashboard (Visualization)

# Launch the dashboard
streamlit run src/visualization/dashboard.py -- --batch-id batch_2024_01

# Access at: http://localhost:8501
# Or, on a remote server: http://server:8501

With authentication (nginx proxy)

See the "Streamlit - Deploy" section of TECH_STACK.md for the nginx configuration with basic auth.

Output Structure

After running the pipeline:

data/outputs/batch_2024_01/
├── aggregated_stats.json           # Consolidated statistics
├── call_matrix.csv                 # All calls with labels
├── rca_lost_sales.json             # Lost-sales RCA tree
├── rca_poor_cx.json                # Poor-CX RCA tree
├── emergent_drivers_review.json    # OTHER_EMERGENT for review
├── validation_report.json          # Quality gate result
├── executive_summary.pdf           # Executive report
├── full_analysis.xlsx              # Excel with drill-down
└── figures/
    ├── rca_tree_lost_sales.png
    ├── rca_tree_poor_cx.png
    └── ...
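
A quick way to confirm that a batch finished completely is to check the output folder against the file list above. This is a minimal sketch: `missing_outputs` is a hypothetical helper, it does not inspect file contents, and it ignores the figures/ folder because the guide does not list its full contents.

```python
from pathlib import Path

# Top-level files listed in the output structure above.
EXPECTED_OUTPUTS = [
    "aggregated_stats.json",
    "call_matrix.csv",
    "rca_lost_sales.json",
    "rca_poor_cx.json",
    "emergent_drivers_review.json",
    "validation_report.json",
    "executive_summary.pdf",
    "full_analysis.xlsx",
]


def missing_outputs(batch_dir: Path) -> list:
    """Return the expected output files that are absent from batch_dir."""
    return [name for name in EXPECTED_OUTPUTS if not (batch_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs(Path("data/outputs/batch_2024_01"))
    print("All outputs present" if not missing else f"Missing: {missing}")
```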

Deployment Script (deploy.sh)

Script for the initial setup of the persistent environment.

#!/bin/bash
# deploy.sh - Initial setup of the persistent environment
# Run ONCE when installing on a new server

set -e

INSTALL_DIR="${INSTALL_DIR:-/opt/cxinsights}"
PYTHON_VERSION="python3.11"

echo "======================================"
echo "CXInsights - Initial Deployment"
echo "======================================"
echo "Install directory: $INSTALL_DIR"
echo ""

# 1. Check Python
if ! command -v $PYTHON_VERSION &> /dev/null; then
    echo "ERROR: $PYTHON_VERSION not found"
    echo "Install with: sudo apt install python3.11 python3.11-venv"
    exit 1
fi
echo "✓ Python: $($PYTHON_VERSION --version)"

# 2. Check that we are in the right directory
if [ ! -f "pyproject.toml" ]; then
    echo "ERROR: pyproject.toml not found. Run from repository root."
    exit 1
fi
echo "✓ Repository structure verified"

# 3. Create the virtual environment (if it does not exist)
if [ ! -d ".venv" ]; then
    echo "Creating virtual environment..."
    $PYTHON_VERSION -m venv .venv
fi
source .venv/bin/activate
echo "✓ Virtual environment: .venv"

# 4. Install dependencies
echo "Installing dependencies..."
pip install -q --upgrade pip
pip install -q -e .
echo "✓ Dependencies installed"

# 5. Configure .env (if it does not exist)
if [ ! -f ".env" ]; then
    if [ -f ".env.example" ]; then
        cp .env.example .env
        echo "⚠ Created .env from template - CONFIGURE API KEYS"
    else
        echo "ERROR: .env.example not found"
        exit 1
    fi
else
    echo "✓ .env exists"
fi

# 6. Create the persistent data structure (idempotent)
echo "Creating data directory structure..."
mkdir -p data/raw/audio
mkdir -p data/raw/metadata
mkdir -p data/transcripts/raw
mkdir -p data/transcripts/compressed
mkdir -p data/features
mkdir -p data/processed
mkdir -p data/outputs
mkdir -p data/logs
mkdir -p data/.checkpoints

# Create .gitkeep files to preserve the structure in git
touch data/raw/audio/.gitkeep
touch data/raw/metadata/.gitkeep
touch data/transcripts/raw/.gitkeep
touch data/transcripts/compressed/.gitkeep
touch data/features/.gitkeep
touch data/processed/.gitkeep
touch data/outputs/.gitkeep
touch data/logs/.gitkeep

echo "✓ Data directories created"

# 7. Check API keys in .env
source .env
if [ -z "$ASSEMBLYAI_API_KEY" ] || [ "$ASSEMBLYAI_API_KEY" = "your_assemblyai_key_here" ]; then
    echo ""
    echo "⚠ WARNING: ASSEMBLYAI_API_KEY not configured in .env"
fi
if [ -z "$OPENAI_API_KEY" ] || [ "$OPENAI_API_KEY" = "sk-your_openai_key_here" ]; then
    echo "⚠ WARNING: OPENAI_API_KEY not configured in .env"
fi

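# 8. Verify the installation
echo ""
echo "Verifying installation..."
# Test the command inside the if: with set -e, a bare failing command would
# abort the script before the error message is printed.
if python -m cxinsights.pipeline.cli --help > /dev/null 2>&1; then
    echo "✓ CLI verification passed"
else
    echo "ERROR: CLI verification failed"
    exit 1
fi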

echo ""
echo "======================================"
echo "Deployment complete!"
echo "======================================"
echo ""
echo "Next steps:"
echo "  1. Configure API keys in .env"
echo "  2. Copy audio files to data/raw/audio/your_batch/"
echo "  3. Start tmux session: tmux new -s cxinsights"
echo "  4. Activate venv: source .venv/bin/activate"
echo "  5. Run pipeline:"
echo "     python -m cxinsights.pipeline.cli run \\"
echo "         --input ./data/raw/audio/your_batch \\"
echo "         --batch-id your_batch"
echo ""

# Usage:
chmod +x deploy.sh
./deploy.sh

Docker (Optional)

Docker is an option for portability, not the main deployment path.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DOCKER - DISCLAIMER                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Docker is OPTIONAL and is provided for:                                    │
│  ├─ Environments where Python cannot be installed directly                  │
│  ├─ Exact reproducibility of the environment                                │
│  └─ Integration with existing CI/CD systems                                 │
│                                                                             │
│  Docker is NOT needed for:                                                  │
│  ├─ Normal execution on a dedicated server                                  │
│  ├─ Better performance                                                      │
│  └─ Horizontal scaling (does not apply to this workload)                    │
│                                                                             │
│  The standard deployment (venv + tmux/systemd) is preferred.                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Dockerfile

FROM python:3.11-slim

WORKDIR /app

# System dependencies
RUN apt-get update && \
    apt-get install -y ffmpeg && \
    rm -rf /var/lib/apt/lists/*

# Copy code
COPY pyproject.toml .
COPY src/ src/
COPY config/ config/

# Install Python dependencies
RUN pip install --no-cache-dir -e .

# Persistent data volume
VOLUME ["/app/data"]

ENTRYPOINT ["python", "-m", "cxinsights.pipeline.cli"]

Usage

# Build
docker build -t cxinsights:latest .

# Run (mount the data volume)
docker run -it \
    -v /path/to/data:/app/data \
    --env-file .env \
    cxinsights:latest run \
    --input /app/data/raw/audio/batch_01 \
    --batch-id batch_01

Cloud VM (Secondary Option)

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CLOUD VM - DISCLAIMER                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Use a Cloud VM (AWS EC2, GCP Compute, Azure VM) when:                      │
│  ├─ You have no physical server available                                   │
│  ├─ You need remote access from multiple locations                          │
│  └─ You want to delegate hardware maintenance                               │
│                                                                             │
│  The architecture is IDENTICAL to the dedicated server:                     │
│  ├─ Same static sizing (no auto-scaling)                                    │
│  ├─ Same execution model (long-running batch)                               │
│  ├─ Same manual throttling configuration                                    │
│  └─ Only the server's location changes                                      │
│                                                                             │
│  ADDITIONAL COST: $30-100/month for the VM (depending on specs)             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Setup on a Cloud VM

# 1. Create the VM (AWS example)
# - Ubuntu 22.04 LTS
# - t3.xlarge (4 vCPU, 16 GB RAM) for 20K calls
# - 200 GB gp3 SSD
# - Security group: SSH (22), optional HTTP (8501 for the dashboard)

# 2. Connect
ssh -i key.pem ubuntu@vm-ip

# 3. Follow the "Standard Deployment" steps above
# (identical to the dedicated server)

Troubleshooting

Error: Invalid API key

Error: AssemblyAI authentication failed

Solution: Check ASSEMBLYAI_API_KEY in .env

Error: Rate limit exceeded (429)

Error: OpenAI rate limit exceeded

Solution:

Reduce LLM_REQUESTS_PER_MINUTE in .env
The automatic backoff will handle temporary spikes
Check your tier in the OpenAI dashboard

Error: Out of memory

MemoryError: Unable to allocate array

Solution:

Process in smaller batches
Increase the server's RAM
Use --stages to run the pipeline in parts

Error: Transcription failed

Error: Transcription failed for call_xxx.mp3

Solution:

Check the file: ffprobe call_xxx.mp3
Check that it does not exceed 5 hours (AssemblyAI limit)
The pipeline continues with the remaining calls
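
The duration check can be scripted so that oversized files are caught before a batch starts. This is a sketch, assuming ffprobe is on PATH; `probe_duration_seconds` and `exceeds_limit` are hypothetical helpers, and the 5-hour figure is the limit quoted in this guide rather than a value verified against the AssemblyAI documentation.

```python
import subprocess

# 5-hour limit as quoted in this guide.
MAX_DURATION_SECONDS = 5 * 60 * 60


def probe_duration_seconds(path: str) -> float:
    """Ask ffprobe for the container duration of an audio file, in seconds."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if ffprobe cannot read the file
    )
    return float(result.stdout.strip())


def exceeds_limit(duration_seconds: float, limit: float = MAX_DURATION_SECONDS) -> bool:
    """True when a file is longer than the configured duration limit."""
    return duration_seconds > limit


if __name__ == "__main__":
    duration = probe_duration_seconds("call_xxx.mp3")
    print(f"{duration:.0f} s, too long: {exceeds_limit(duration)}")
```

A file that ffprobe cannot read raises an exception here, which also covers the corrupt-file case from the first step of the solution.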

View detailed logs

# Main pipeline log
tail -f data/logs/pipeline_*.log

# Verbose mode
python -m cxinsights.pipeline.cli run ... --verbose

# If you use systemd
journalctl -u cxinsights-batch -f

Pre-Run Checklist

SERVER:
[ ] Python 3.11+ installed
[ ] tmux installed
[ ] Enough disk space (see Capacity Planning)
[ ] Stable network connectivity

APPLICATION:
[ ] Repository cloned
[ ] Virtual environment created and activated
[ ] Dependencies installed (pip install -e .)
[ ] .env configured with API keys
[ ] Throttling configured for your tier

DATA:
[ ] Audio files in data/raw/audio/batch_id/
[ ] Cost estimate reviewed (estimate command)
[ ] Directory structure created

EXECUTION:
[ ] tmux session started (or systemd configured)
[ ] Logs can be monitored
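
Several checklist items can be verified automatically. The sketch below covers the Python version, tmux, free disk space and the two required API keys; `preflight` is a hypothetical helper that is not part of the CXInsights CLI, and the 50 GB default is the guide's recommendation for a 5K-call batch.

```python
import os
import shutil
import sys


def preflight(data_dir: str = ".", min_free_gb: float = 50.0, env=None) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python 3.11+ is required")
    if shutil.which("tmux") is None:
        problems.append("tmux not found on PATH")
    free_gb = shutil.disk_usage(data_dir).free / 1024 ** 3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, {min_free_gb:.0f} GB recommended")
    for key in ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY"):
        if not env.get(key):
            problems.append(f"{key} is not set")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("FAIL:", issue)
    sys.exit(1 if issues else 0)
```

Returning a list instead of stopping at the first failure lets an operator fix everything in one pass before starting a long run.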

Makefile (Useful commands)

.PHONY: install install-pii dev test test-cov lint format run estimate resume dashboard status logs clean-logs clean-checkpoints

# Installation
install:
	pip install -e .

install-pii:
	pip install -e ".[pii]"

dev:
	pip install -e ".[dev]"

# Testing
test:
	pytest tests/ -v

test-cov:
	pytest tests/ --cov=src --cov-report=html

# Linting
lint:
	ruff check src/
	mypy src/

format:
	ruff format src/

# Execution
run:
	python -m cxinsights.pipeline.cli run --input $(INPUT) --batch-id $(BATCH)

estimate:
	python -m cxinsights.pipeline.cli estimate --input $(INPUT)

resume:
	python -m cxinsights.pipeline.cli resume --batch-id $(BATCH)

dashboard:
	streamlit run src/visualization/dashboard.py -- --batch-id $(BATCH)

# Monitoring
status:
	@echo "=== Pipeline Status ==="
	@ls -la data/.checkpoints/ 2>/dev/null || echo "No active checkpoints"
	@echo ""
	@echo "=== Recent Logs ==="
	@ls -lt data/logs/*.log 2>/dev/null | head -5 || echo "No logs found"

logs:
	tail -f data/logs/pipeline_*.log

# Cleanup (CAREFUL: do not delete production data)
clean-logs:
	find data/logs -name "*.log" -mtime +30 -delete
	find data/logs -name "*.gz" -mtime +90 -delete

clean-checkpoints:
	@echo "Review before deleting:"
	@ls -la data/.checkpoints/
	@printf "Delete all checkpoints? [y/N] "; read confirm; if [ "$$confirm" = "y" ]; then rm -rf data/.checkpoints/*; fi

Usage:

make install
make run INPUT=./data/raw/audio/batch_01 BATCH=batch_01
make logs
make status
make dashboard BATCH=batch_01
