
DATA_CONTRACTS.md

Schemas for all data that flows through the system


Golden rule

If you change a schema, update this doc FIRST, then implement the code.


Schema: Transcript

File: src/transcription/models.py

@dataclass
class SpeakerTurn:
    speaker: Literal["agent", "customer"]
    text: str
    start_time: float  # seconds
    end_time: float    # seconds
    confidence: float = 1.0

@dataclass
class TranscriptMetadata:
    audio_duration_sec: float
    language: str = "es"
    provider: str = "assemblyai"
    job_id: str | None = None
    created_at: datetime = field(default_factory=datetime.now)

@dataclass
class Transcript:
    call_id: str
    turns: list[SpeakerTurn]
    metadata: TranscriptMetadata
    detected_events: list[Event] = field(default_factory=list)
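
As a rough illustration of how these three dataclasses fit together, the sketch below builds a two-turn Transcript and sums talk time per speaker. The definitions mirror the schema above, except that `detected_events` is typed as a plain `list` so the snippet stays self-contained. `talk_time_by_speaker` and the sample call ID are hypothetical and not taken from the codebase.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Literal


@dataclass
class SpeakerTurn:
    speaker: Literal["agent", "customer"]
    text: str
    start_time: float  # seconds
    end_time: float    # seconds
    confidence: float = 1.0


@dataclass
class TranscriptMetadata:
    audio_duration_sec: float
    language: str = "es"
    provider: str = "assemblyai"
    job_id: str | None = None
    created_at: datetime = field(default_factory=datetime.now)


@dataclass
class Transcript:
    call_id: str
    turns: list[SpeakerTurn]
    metadata: TranscriptMetadata
    detected_events: list = field(default_factory=list)  # list[Event] in the real schema


def talk_time_by_speaker(transcript: Transcript) -> dict[str, float]:
    """Hypothetical helper: total seconds spoken by each speaker."""
    totals: dict[str, float] = {}
    for turn in transcript.turns:
        duration = turn.end_time - turn.start_time
        totals[turn.speaker] = totals.get(turn.speaker, 0.0) + duration
    return totals


transcript = Transcript(
    call_id="call-001",  # illustrative ID
    turns=[
        SpeakerTurn("agent", "Buenos dias, en que puedo ayudarle?", 0.0, 3.0),
        SpeakerTurn("customer", "Quiero cancelar mi contrato.", 3.5, 6.0),
    ],
    metadata=TranscriptMetadata(audio_duration_sec=6.0),
)
talk_time = talk_time_by_speaker(transcript)
```

Because every defaulted field comes after the required ones, a caller only has to supply `call_id`, `turns`, and `metadata`.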

Schema: Event

File: src/models/call_analysis.py

class EventType(str, Enum):
    HOLD_START = "hold_start"
    HOLD_END = "hold_end"
    TRANSFER = "transfer"
    ESCALATION = "escalation"
    SILENCE = "silence"
    INTERRUPTION = "interruption"

@dataclass
class Event:
    event_type: EventType
    timestamp: float  # seconds from call start
    duration_sec: float | None = None
    metadata: dict = field(default_factory=dict)
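
One way these events might be consumed is to pair each `HOLD_START` with the following `HOLD_END` and sum the gaps. The sketch below does that under the assumption that hold events arrive as start/end pairs; the document does not state how hold time is actually derived, and `total_hold_time` is a hypothetical helper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class EventType(str, Enum):
    HOLD_START = "hold_start"
    HOLD_END = "hold_end"
    TRANSFER = "transfer"
    ESCALATION = "escalation"
    SILENCE = "silence"
    INTERRUPTION = "interruption"


@dataclass
class Event:
    event_type: EventType
    timestamp: float  # seconds from call start
    duration_sec: float | None = None
    metadata: dict = field(default_factory=dict)


def total_hold_time(events: list[Event]) -> float:
    """Hypothetical helper: seconds between each HOLD_START and the next HOLD_END."""
    total = 0.0
    hold_started_at = None
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.event_type is EventType.HOLD_START and hold_started_at is None:
            hold_started_at = event.timestamp
        elif event.event_type is EventType.HOLD_END and hold_started_at is not None:
            total += event.timestamp - hold_started_at
            hold_started_at = None
    return total  # an unmatched HOLD_START is ignored in this sketch


events = [
    Event(EventType.HOLD_START, timestamp=30.0),
    Event(EventType.HOLD_END, timestamp=75.0),
    Event(EventType.TRANSFER, timestamp=80.0),
    Event(EventType.HOLD_START, timestamp=90.0),
    Event(EventType.HOLD_END, timestamp=100.0),
]
hold_time = total_hold_time(events)
transfers = sum(1 for e in events if e.event_type is EventType.TRANSFER)
```

Since `EventType` subclasses `str`, its members compare equal to their string values, which is convenient when events are read back from JSON.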

Schema: CompressedTranscript

File: src/compression/models.py

@dataclass
class CustomerIntent:
    intent_type: IntentType  # CANCEL, INQUIRY, COMPLAINT, etc.
    text: str
    timestamp: float
    confidence: float = 0.8

@dataclass
class AgentOffer:
    offer_type: OfferType  # DISCOUNT, UPGRADE, RETENTION, etc.
    text: str
    timestamp: float

@dataclass
class CustomerObjection:
    objection_type: ObjectionType  # PRICE, SERVICE, COMPETITOR, etc.
    text: str
    timestamp: float

@dataclass
class CompressedTranscript:
    call_id: str
    customer_intents: list[CustomerIntent]
    agent_offers: list[AgentOffer]
    objections: list[CustomerObjection]
    resolutions: list[ResolutionStatement]
    key_moments: list[KeyMoment]
    compression_ratio: float = 0.0  # tokens_after / tokens_before
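
The comment on `compression_ratio` defines it as tokens after compression divided by tokens before. A minimal sketch of that arithmetic follows. How tokens are counted is not specified here, so the function takes the counts as inputs, and returning 0.0 for an empty input is an assumption chosen to match the field's default.

```python
def compression_ratio(tokens_before: int, tokens_after: int) -> float:
    """tokens_after / tokens_before, as described in the schema comment."""
    if tokens_before <= 0:
        # Assumption: fall back to the schema default when nothing was measured.
        return 0.0
    return tokens_after / tokens_before


# Illustrative numbers only: a 2000-token transcript compressed to 500 tokens.
ratio = compression_ratio(tokens_before=2000, tokens_after=500)
```

Under this definition a lower value means stronger compression, and a value of 1.0 means no reduction at all.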

Schema: CallAnalysis

File: src/models/call_analysis.py

@dataclass
class EvidenceSpan:
    text: str
    start_time: float | None = None
    end_time: float | None = None

@dataclass
class RCALabel:
    driver_code: str  # From rca_taxonomy.yaml
    confidence: float  # 0.0-1.0
    evidence_spans: list[EvidenceSpan]  # Min 1 required!
    reasoning: str | None = None

@dataclass
class ObservedFeatures:
    audio_duration_sec: float
    agent_talk_ratio: float | None = None
    customer_talk_ratio: float | None = None
    hold_time_total_sec: float | None = None
    transfer_count: int = 0
    silence_count: int = 0

@dataclass
class Traceability:
    schema_version: str
    prompt_version: str
    model_id: str
    processed_at: datetime = field(default_factory=datetime.now)

class CallOutcome(str, Enum):
    SALE_COMPLETED = "sale_completed"
    SALE_LOST = "sale_lost"
    INQUIRY_RESOLVED = "inquiry_resolved"
    INQUIRY_UNRESOLVED = "inquiry_unresolved"
    COMPLAINT_RESOLVED = "complaint_resolved"
    COMPLAINT_UNRESOLVED = "complaint_unresolved"

class ProcessingStatus(str, Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"

@dataclass
class CallAnalysis:
    call_id: str
    batch_id: str
    status: ProcessingStatus
    observed: ObservedFeatures
    outcome: CallOutcome | None = None
    lost_sales_drivers: list[RCALabel] = field(default_factory=list)
    poor_cx_drivers: list[RCALabel] = field(default_factory=list)
    traceability: Traceability | None = None
    error_message: str | None = None

Schema: BatchAggregation

File: src/aggregation/models.py

@dataclass
class DriverFrequency:
    driver_code: str
    category: Literal["lost_sales", "poor_cx"]
    total_occurrences: int
    calls_affected: int
    total_calls_in_batch: int
    occurrence_rate: float  # total_occurrences / total_calls_in_batch
    call_rate: float        # calls_affected / total_calls_in_batch
    avg_confidence: float
    min_confidence: float
    max_confidence: float

class ImpactLevel(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class DriverSeverity:
    driver_code: str
    category: Literal["lost_sales", "poor_cx"]
    base_severity: float
    frequency_factor: float
    confidence_factor: float
    co_occurrence_factor: float
    severity_score: float  # 0-100
    impact_level: ImpactLevel

@dataclass
class RCATree:
    batch_id: str
    total_calls: int
    calls_with_lost_sales: int
    calls_with_poor_cx: int
    calls_with_both: int
    top_lost_sales_drivers: list[str]
    top_poor_cx_drivers: list[str]
    nodes: list[RCANode] = field(default_factory=list)

@dataclass
class BatchAggregation:
    batch_id: str
    total_calls_processed: int
    successful_analyses: int
    failed_analyses: int
    lost_sales_frequencies: list[DriverFrequency]
    poor_cx_frequencies: list[DriverFrequency]
    lost_sales_severities: list[DriverSeverity]
    poor_cx_severities: list[DriverSeverity]
    rca_tree: RCATree | None = None
    emergent_patterns: list[dict] = field(default_factory=list)
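
The difference between `occurrence_rate` and `call_rate` is easiest to see with numbers. The sketch below fills in `DriverFrequency` from a per-call list of (driver code, confidence) pairs. `compute_frequencies`, the input shape, and the driver codes are all assumptions made for illustration; the real aggregation code may work differently.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Literal


@dataclass
class DriverFrequency:
    driver_code: str
    category: Literal["lost_sales", "poor_cx"]
    total_occurrences: int
    calls_affected: int
    total_calls_in_batch: int
    occurrence_rate: float  # total_occurrences / total_calls_in_batch
    call_rate: float        # calls_affected / total_calls_in_batch
    avg_confidence: float
    min_confidence: float
    max_confidence: float


def compute_frequencies(
    labels_per_call: dict[str, list[tuple[str, float]]],
    category: Literal["lost_sales", "poor_cx"],
) -> list[DriverFrequency]:
    """Hypothetical helper: aggregate (driver_code, confidence) pairs per call."""
    total_calls = len(labels_per_call)
    confidences = defaultdict(list)  # driver_code -> every confidence seen
    calls = defaultdict(set)         # driver_code -> distinct call IDs
    for call_id, labels in labels_per_call.items():
        for driver_code, confidence in labels:
            confidences[driver_code].append(confidence)
            calls[driver_code].add(call_id)

    result = [
        DriverFrequency(
            driver_code=code,
            category=category,
            total_occurrences=len(values),
            calls_affected=len(calls[code]),
            total_calls_in_batch=total_calls,
            occurrence_rate=len(values) / total_calls,
            call_rate=len(calls[code]) / total_calls,
            avg_confidence=sum(values) / len(values),
            min_confidence=min(values),
            max_confidence=max(values),
        )
        for code, values in confidences.items()
    ]
    return sorted(result, key=lambda f: f.total_occurrences, reverse=True)


# Illustrative batch of four calls; the driver codes are made up.
frequencies = compute_frequencies(
    {
        "c1": [("PRICE_TOO_HIGH", 0.9), ("PRICE_TOO_HIGH", 0.7)],
        "c2": [("PRICE_TOO_HIGH", 0.8)],
        "c3": [("NO_FOLLOW_UP", 0.6)],
        "c4": [],
    },
    category="lost_sales",
)
top = frequencies[0]
```

In this example the top driver occurs three times across two of the four calls, so its `occurrence_rate` is 0.75 while its `call_rate` is 0.5. The first can exceed 1.0 when a driver is labeled several times per call; the second cannot.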

Schema: PipelineManifest

File: src/pipeline/models.py

class PipelineStage(str, Enum):
    TRANSCRIPTION = "transcription"
    FEATURE_EXTRACTION = "feature_extraction"
    COMPRESSION = "compression"
    INFERENCE = "inference"
    AGGREGATION = "aggregation"
    EXPORT = "export"

class StageStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"

@dataclass
class StageManifest:
    stage: PipelineStage
    status: StageStatus = StageStatus.PENDING
    started_at: datetime | None = None
    completed_at: datetime | None = None
    total_items: int = 0
    processed_items: int = 0
    failed_items: int = 0
    errors: list[dict] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

@dataclass
class PipelineManifest:
    batch_id: str
    created_at: datetime = field(default_factory=datetime.now)
    status: StageStatus = StageStatus.PENDING
    current_stage: PipelineStage | None = None
    total_audio_files: int = 0
    stages: dict[PipelineStage, StageManifest] = field(default_factory=dict)
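
A manifest shaped like this could be used to decide where an interrupted batch should resume. The sketch below walks the stages in enum order and returns the first one that is neither completed nor skipped. `next_pending_stage` is a hypothetical helper, and `StageManifest` is trimmed to the fields it needs; the actual resume logic is not described in this document.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class PipelineStage(str, Enum):
    TRANSCRIPTION = "transcription"
    FEATURE_EXTRACTION = "feature_extraction"
    COMPRESSION = "compression"
    INFERENCE = "inference"
    AGGREGATION = "aggregation"
    EXPORT = "export"


class StageStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class StageManifest:
    # Trimmed stand-in: the real dataclass also tracks timestamps, counts and errors.
    stage: PipelineStage
    status: StageStatus = StageStatus.PENDING


@dataclass
class PipelineManifest:
    batch_id: str
    created_at: datetime = field(default_factory=datetime.now)
    status: StageStatus = StageStatus.PENDING
    current_stage: PipelineStage | None = None
    total_audio_files: int = 0
    stages: dict[PipelineStage, StageManifest] = field(default_factory=dict)


def next_pending_stage(manifest: PipelineManifest) -> PipelineStage | None:
    """Hypothetical helper: first stage, in enum order, not COMPLETED or SKIPPED."""
    done = {StageStatus.COMPLETED, StageStatus.SKIPPED}
    for stage in PipelineStage:
        entry = manifest.stages.get(stage)
        if entry is None or entry.status not in done:
            return stage
    return None  # every stage has finished


manifest = PipelineManifest(batch_id="batch-001")  # illustrative ID
manifest.stages[PipelineStage.TRANSCRIPTION] = StageManifest(
    PipelineStage.TRANSCRIPTION, StageStatus.COMPLETED
)
manifest.stages[PipelineStage.FEATURE_EXTRACTION] = StageManifest(
    PipelineStage.FEATURE_EXTRACTION, StageStatus.SKIPPED
)
resume_from = next_pending_stage(manifest)
```

Iterating over `PipelineStage` relies on Enum members being yielded in definition order, which matches the order in which the stages are listed above.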

Validation Rules

RCALabel

  • evidence_spans MUST have at least 1 element
  • driver_code MUST be in rca_taxonomy.yaml OR be "OTHER_EMERGENT"
  • confidence MUST be between 0.0 and 1.0
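
The three RCALabel rules above can be checked mechanically. The following is a minimal sketch, assuming the taxonomy codes have already been loaded from rca_taxonomy.yaml into a set (the file's layout is not shown here, so loading is left out). `validate_rca_label` and the sample driver codes are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EvidenceSpan:
    text: str
    start_time: float | None = None
    end_time: float | None = None


@dataclass
class RCALabel:
    driver_code: str
    confidence: float
    evidence_spans: list[EvidenceSpan]
    reasoning: str | None = None


def validate_rca_label(label: RCALabel, taxonomy_codes: set[str]) -> list[str]:
    """Return one message per violated rule; an empty list means the label is valid."""
    errors = []
    if len(label.evidence_spans) < 1:
        errors.append("evidence_spans MUST have at least 1 element")
    if label.driver_code not in taxonomy_codes and label.driver_code != "OTHER_EMERGENT":
        errors.append(f"unknown driver_code: {label.driver_code}")
    if not 0.0 <= label.confidence <= 1.0:
        errors.append(f"confidence out of range: {label.confidence}")
    return errors


# Illustrative taxonomy; real codes come from rca_taxonomy.yaml.
taxonomy = {"PRICE_TOO_HIGH", "NO_RETENTION_OFFER"}

good = RCALabel("PRICE_TOO_HIGH", 0.85, [EvidenceSpan("Es demasiado caro")])
emergent = RCALabel("OTHER_EMERGENT", 0.5, [EvidenceSpan("No me llamaron")])
bad = RCALabel("UNKNOWN_CODE", 1.4, [])

good_errors = validate_rca_label(good, taxonomy)
emergent_errors = validate_rca_label(emergent, taxonomy)
bad_errors = validate_rca_label(bad, taxonomy)
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a label at once.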

CallAnalysis

  • traceability MUST be present
  • If status == SUCCESS, outcome MUST be present
  • If outcome == SALE_LOST, lost_sales_drivers SHOULD have entries
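
The CallAnalysis rules mix MUST and SHOULD, which suggests treating the first two as errors and the third as a warning. The sketch below takes that reading. It uses a trimmed stand-in for `CallAnalysis` with only the fields the rules touch, and `validate_call_analysis` is a hypothetical helper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class CallOutcome(str, Enum):
    SALE_COMPLETED = "sale_completed"
    SALE_LOST = "sale_lost"


class ProcessingStatus(str, Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"


@dataclass
class CallAnalysis:
    # Trimmed stand-in for the real dataclass: only the fields the rules touch.
    call_id: str
    status: ProcessingStatus
    outcome: CallOutcome | None = None
    lost_sales_drivers: list = field(default_factory=list)
    traceability: object | None = None


def validate_call_analysis(analysis: CallAnalysis) -> tuple[list[str], list[str]]:
    """Return (errors, warnings): MUST rules yield errors, SHOULD rules warnings."""
    errors, warnings = [], []
    if analysis.traceability is None:
        errors.append("traceability MUST be present")
    if analysis.status is ProcessingStatus.SUCCESS and analysis.outcome is None:
        errors.append("outcome MUST be present when status is SUCCESS")
    if analysis.outcome is CallOutcome.SALE_LOST and not analysis.lost_sales_drivers:
        warnings.append("outcome is SALE_LOST but lost_sales_drivers is empty")
    return errors, warnings


# A lost sale with no traceability and no drivers: one error, one warning.
lost = CallAnalysis("call-001", ProcessingStatus.SUCCESS, outcome=CallOutcome.SALE_LOST)
lost_errors, lost_warnings = validate_call_analysis(lost)

# A successful analysis with traceability but no outcome: one error, no warning.
no_outcome = CallAnalysis("call-002", ProcessingStatus.SUCCESS, traceability={"v": "1"})
no_outcome_errors, no_outcome_warnings = validate_call_analysis(no_outcome)
```

Note that the schema types `traceability` as optional with a default of `None`, so the first rule can only be enforced by a check like this one, not by the dataclass itself.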

BatchAggregation

  • total_calls_processed == successful_analyses + failed_analyses
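
The BatchAggregation invariant is a simple sum check. A minimal sketch, with `batch_totals_consistent` as a hypothetical helper name; how calls with status PARTIAL are counted is not specified here, so the function only compares the three totals it is given.

```python
def batch_totals_consistent(
    total_calls_processed: int,
    successful_analyses: int,
    failed_analyses: int,
) -> bool:
    """True when total_calls_processed == successful_analyses + failed_analyses."""
    return total_calls_processed == successful_analyses + failed_analyses


# Illustrative numbers only.
ok = batch_totals_consistent(100, 93, 7)
mismatch = batch_totals_consistent(100, 93, 5)
```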

Last updated: 2026-01-19