# DATA_CONTRACTS.md

> Schemas for all data that flows through the system

---

## Golden rule

> If you change a schema, update this doc FIRST, then implement the code.

---

## Schema: Transcript

**File**: `src/transcription/models.py`

```python
@dataclass
class SpeakerTurn:
    speaker: Literal["agent", "customer"]
    text: str
    start_time: float  # seconds
    end_time: float  # seconds
    confidence: float = 1.0

@dataclass
class TranscriptMetadata:
    audio_duration_sec: float
    language: str = "es"
    provider: str = "assemblyai"
    job_id: str | None = None
    created_at: datetime = field(default_factory=datetime.now)

@dataclass
class Transcript:
    call_id: str
    turns: list[SpeakerTurn]
    metadata: TranscriptMetadata
    detected_events: list[Event] = field(default_factory=list)
```

---

## Schema: Event

**File**: `src/models/call_analysis.py`

```python
class EventType(str, Enum):
    HOLD_START = "hold_start"
    HOLD_END = "hold_end"
    TRANSFER = "transfer"
    ESCALATION = "escalation"
    SILENCE = "silence"
    INTERRUPTION = "interruption"

@dataclass
class Event:
    event_type: EventType
    timestamp: float  # seconds from call start
    duration_sec: float | None = None
    metadata: dict = field(default_factory=dict)
```

---

## Schema: CompressedTranscript

**File**: `src/compression/models.py`

```python
@dataclass
class CustomerIntent:
    intent_type: IntentType  # CANCEL, INQUIRY, COMPLAINT, etc.
    text: str
    timestamp: float
    confidence: float = 0.8

@dataclass
class AgentOffer:
    offer_type: OfferType  # DISCOUNT, UPGRADE, RETENTION, etc.
    text: str
    timestamp: float

@dataclass
class CustomerObjection:
    objection_type: ObjectionType  # PRICE, SERVICE, COMPETITOR, etc.
    text: str
    timestamp: float

@dataclass
class CompressedTranscript:
    call_id: str
    customer_intents: list[CustomerIntent]
    agent_offers: list[AgentOffer]
    objections: list[CustomerObjection]
    resolutions: list[ResolutionStatement]
    key_moments: list[KeyMoment]
    compression_ratio: float = 0.0  # tokens_after / tokens_before
```

---

## Schema: CallAnalysis

**File**: `src/models/call_analysis.py`

```python
@dataclass
class EvidenceSpan:
    text: str
    start_time: float | None = None
    end_time: float | None = None

@dataclass
class RCALabel:
    driver_code: str  # From rca_taxonomy.yaml
    confidence: float  # 0.0-1.0
    evidence_spans: list[EvidenceSpan]  # Min 1 required!
    reasoning: str | None = None

@dataclass
class ObservedFeatures:
    audio_duration_sec: float
    agent_talk_ratio: float | None = None
    customer_talk_ratio: float | None = None
    hold_time_total_sec: float | None = None
    transfer_count: int = 0
    silence_count: int = 0

@dataclass
class Traceability:
    schema_version: str
    prompt_version: str
    model_id: str
    processed_at: datetime = field(default_factory=datetime.now)

class CallOutcome(str, Enum):
    SALE_COMPLETED = "sale_completed"
    SALE_LOST = "sale_lost"
    INQUIRY_RESOLVED = "inquiry_resolved"
    INQUIRY_UNRESOLVED = "inquiry_unresolved"
    COMPLAINT_RESOLVED = "complaint_resolved"
    COMPLAINT_UNRESOLVED = "complaint_unresolved"

class ProcessingStatus(str, Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"

@dataclass
class CallAnalysis:
    call_id: str
    batch_id: str
    status: ProcessingStatus
    observed: ObservedFeatures
    outcome: CallOutcome | None = None
    lost_sales_drivers: list[RCALabel] = field(default_factory=list)
    poor_cx_drivers: list[RCALabel] = field(default_factory=list)
    traceability: Traceability | None = None
    error_message: str | None = None
```

---

## Schema: BatchAggregation

**File**: `src/aggregation/models.py`

```python
@dataclass
class DriverFrequency:
    driver_code: str
    category: Literal["lost_sales", "poor_cx"]
    total_occurrences: int
    calls_affected: int
    total_calls_in_batch: int
    occurrence_rate: float  # occurrences / total_calls
    call_rate: float  # calls_affected / total_calls
    avg_confidence: float
    min_confidence: float
    max_confidence: float

class ImpactLevel(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class DriverSeverity:
    driver_code: str
    category: Literal["lost_sales", "poor_cx"]
    base_severity: float
    frequency_factor: float
    confidence_factor: float
    co_occurrence_factor: float
    severity_score: float  # 0-100
    impact_level: ImpactLevel

@dataclass
class RCATree:
    batch_id: str
    total_calls: int
    calls_with_lost_sales: int
    calls_with_poor_cx: int
    calls_with_both: int
    top_lost_sales_drivers: list[str]
    top_poor_cx_drivers: list[str]
    nodes: list[RCANode] = field(default_factory=list)

@dataclass
class BatchAggregation:
    batch_id: str
    total_calls_processed: int
    successful_analyses: int
    failed_analyses: int
    lost_sales_frequencies: list[DriverFrequency]
    poor_cx_frequencies: list[DriverFrequency]
    lost_sales_severities: list[DriverSeverity]
    poor_cx_severities: list[DriverSeverity]
    rca_tree: RCATree | None = None
    emergent_patterns: list[dict] = field(default_factory=list)
```

---

## Schema: PipelineManifest

**File**: `src/pipeline/models.py`

```python
class PipelineStage(str, Enum):
    TRANSCRIPTION = "transcription"
    FEATURE_EXTRACTION = "feature_extraction"
    COMPRESSION = "compression"
    INFERENCE = "inference"
    AGGREGATION = "aggregation"
    EXPORT = "export"

class StageStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"

@dataclass
class StageManifest:
    stage: PipelineStage
    status: StageStatus = StageStatus.PENDING
    started_at: datetime | None = None
    completed_at: datetime | None = None
    total_items: int = 0
    processed_items: int = 0
    failed_items: int = 0
    errors: list[dict] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

@dataclass
class PipelineManifest:
    batch_id: str
    created_at: datetime = field(default_factory=datetime.now)
    status: StageStatus = StageStatus.PENDING
    current_stage: PipelineStage | None = None
    total_audio_files: int = 0
    stages: dict[PipelineStage, StageManifest] = field(default_factory=dict)
```

---

## Validation Rules

### RCALabel

- `evidence_spans` MUST have at least 1 element
- `driver_code` MUST be in `rca_taxonomy.yaml` OR be `"OTHER_EMERGENT"`
- `confidence` MUST be between 0.0 and 1.0 (inclusive)

### CallAnalysis

- `traceability` MUST be present
- If `status == SUCCESS`, `outcome` MUST be present
- If `outcome == SALE_LOST`, `lost_sales_drivers` SHOULD have entries

### BatchAggregation

- `total_calls_processed` MUST equal `successful_analyses + failed_analyses`

---

**Last updated**: 2026-01-19
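
The MUST rules listed under Validation Rules could be checked with small helper functions like the ones below. This is a minimal sketch, not the project's actual validator. The function names, the trimmed-down stand-in classes, and the `KNOWN_DRIVER_CODES` set (a placeholder for whatever is loaded from `rca_taxonomy.yaml`) are all assumptions made for illustration. The advisory SHOULD rule for `SALE_LOST` is left out on purpose.

```python
from dataclasses import dataclass
from enum import Enum


class ProcessingStatus(str, Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"


# Trimmed stand-ins for the documented schemas (only the validated fields).
@dataclass
class EvidenceSpan:
    text: str


@dataclass
class RCALabel:
    driver_code: str
    confidence: float
    evidence_spans: list[EvidenceSpan]


# Hypothetical codes; the real set would come from rca_taxonomy.yaml.
KNOWN_DRIVER_CODES = {"PRICE_TOO_HIGH", "LONG_HOLD"}


def validate_rca_label(label: RCALabel, known_codes: set[str]) -> list[str]:
    """Return rule violations for one label (empty list means valid)."""
    errors = []
    if len(label.evidence_spans) < 1:
        errors.append("evidence_spans must have at least 1 element")
    if label.driver_code not in known_codes and label.driver_code != "OTHER_EMERGENT":
        errors.append(f"unknown driver_code: {label.driver_code}")
    if not 0.0 <= label.confidence <= 1.0:
        errors.append(f"confidence out of range: {label.confidence}")
    return errors


def validate_call_analysis(status, outcome, traceability) -> list[str]:
    """Check the two MUST rules for CallAnalysis."""
    errors = []
    if traceability is None:
        errors.append("traceability must be present")
    if status == ProcessingStatus.SUCCESS and outcome is None:
        errors.append("outcome must be present when status is SUCCESS")
    return errors


def validate_batch_totals(total: int, successful: int, failed: int) -> list[str]:
    """Check that processed calls equal successful plus failed analyses."""
    if total != successful + failed:
        return [f"total_calls_processed={total} != {successful} + {failed}"]
    return []
```

Returning a list of messages rather than raising on the first failure lets a caller record every violation for a call in one pass, which fits a batch pipeline that keeps going after individual failures.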