Knowledge Base
Data Unification
A complete data unification pipeline — golden records with entity graphs, confidence scoring, probabilistic dedup, field-level merge with conflict resolution, and a worker infrastructure for processing at scale.
Mithrilis
Overview
In logistics, the same shipment generates data across dozens of systems — carrier APIs, TMS platforms, ERPs, email confirmations, manual spreadsheets. Each source has a partial, sometimes conflicting, view of the truth. Data Unification takes all of that fragmented data and produces a single, reliable golden record for every shipment.
This is more than simple deduplication. The unification pipeline handles conflicting fields, scores the reliability of each data source, merges records with confidence tracking, and gives your team tools to resolve the edge cases where automation alone isn't enough.
Golden Record and Entity Graph
Each unified shipment gets a dedicated detail page that serves as the single source of truth.
Golden Record View
The golden record page shows the best-known values for every field, assembled from all contributing data sources. The view includes:
- Hero card — Key fields and current status at a glance
- Tabbed interface — Navigate between different aspects of the record (details, sources, confidence, history)
- Field provenance — Every field shows which data source it came from and when it was last updated
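The provenance model is easy to picture as data: each field stores its best-known value alongside the source system and sync time it came from. A minimal sketch, where the `FieldValue` shape and the field names are illustrative rather than the product's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FieldValue:
    """One golden-record field plus its provenance."""
    value: object
    source: str           # system that contributed the winning value
    updated_at: datetime  # when that source last synced it

golden_record = {
    "delivery_date": FieldValue("2024-05-01", "carrier_api",
                                datetime(2024, 4, 28, tzinfo=timezone.utc)),
    "weight_kg": FieldValue(412.5, "tms",
                            datetime(2024, 4, 27, tzinfo=timezone.utc)),
}

# Every field can answer "where did this come from, and when?"
print(golden_record["delivery_date"].source)  # carrier_api
```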
Entity Graph
The entity graph is a visual map of how data flows into the golden record. It supports two viewing modes:
- Data sources view — See which systems contributed data, with edges showing the data flow from each source to the golden record
- Relationship view — A simplified view showing how this shipment relates to other entities (orders, carriers, routes)
Toggle between views depending on whether you're investigating a data quality issue (use sources view) or understanding the broader context of a shipment (use relationship view).
Sources Accordion
Expand each contributing source to see exactly what data it provided. Each source panel uses a minimalistic card design showing:
- The source system name and connection type
- Timestamp of the last data sync
- Every field the source contributed, with its raw value
- Whether the field was accepted into the golden record or overridden by a higher-confidence source
Investigating data discrepancies
When a field value looks wrong in the golden record, expand the Sources Accordion to see what each system reported. The Confidence tab will tell you why one source's value was chosen over another.
Confidence Scoring
Not all data sources are equally reliable. The confidence system scores every field based on the trustworthiness of its source and the level of agreement across sources.
How Confidence Works
Each data source has a reliability score based on:
- Historical accuracy — How often this source's data has been correct in past merges
- Schema completeness — How many required fields the source typically provides
- Timeliness — How fresh the data is (more recent data generally gets higher confidence)
When multiple sources provide the same field, the system uses Bayesian reliability scoring to determine which value to trust. If three sources agree on a delivery date and one disagrees, the majority value gets high confidence. If two sources disagree with no tiebreaker, the field is flagged for manual review.
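The voting logic can be sketched as reliability-weighted majority voting. This is a simplification of the Bayesian scoring described above; the `margin` parameter and the return shape are assumptions for illustration:

```python
def resolve_field(candidates, margin=0.2):
    """Pick a field value by reliability-weighted voting.

    candidates: list of (value, source_reliability) pairs.
    Returns (winner, status), where status is "conflicted" when the
    top two values score within `margin` of each other and the field
    should be flagged for manual review.
    (Illustrative sketch; not the product's actual Bayesian model.)
    """
    scores = {}
    for value, reliability in candidates:
        scores[value] = scores.get(value, 0.0) + reliability
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return ranked[0][0], "conflicted"   # too close to call
    return ranked[0][0], "resolved"

# Three sources agree on a delivery date, one disagrees: majority wins.
print(resolve_field([("May 1", 0.9), ("May 1", 0.7),
                     ("May 1", 0.6), ("May 3", 0.8)]))  # ('May 1', 'resolved')
```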
Confidence Tab
The confidence tab on the golden record shows field-level confidence details:
- A confidence indicator for each field (High, Medium, Low, Conflicted)
- The contributing sources and their individual reliability scores
- An explanation of why the system chose the winning value
Conflict Resolution
When sources disagree and the system can't auto-resolve, the conflict is surfaced for human review.
Conflict Detail Page
Each conflict gets a dedicated page with:
- Diff table — Side-by-side comparison of the conflicting values from each source
- Source context — Full detail about each contributing source, including reliability scores and timestamps
- Resolution history — If this conflict was previously resolved and has recurred, the history is shown
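Assembling a diff table from the raw source payloads amounts to pivoting records by field and keeping only the fields where sources disagree. A sketch, with illustrative record shapes:

```python
def diff_table(records):
    """Build a field-by-field comparison across sources.

    records: {source_name: {field: value}}.
    Returns {field: {source: value}}, limited to fields where the
    sources actually disagree.
    """
    fields = {}
    for source, rec in records.items():
        for field, value in rec.items():
            fields.setdefault(field, {})[source] = value
    return {f: vals for f, vals in fields.items()
            if len(set(vals.values())) > 1}

conflicts = diff_table({
    "carrier_api": {"delivery_date": "May 1", "weight_kg": 410},
    "tms":         {"delivery_date": "May 3", "weight_kg": 410},
})
# weight_kg agrees across sources, so only delivery_date surfaces:
print(conflicts)  # {'delivery_date': {'carrier_api': 'May 1', 'tms': 'May 3'}}
```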
Resolution Options
Your team has several ways to resolve conflicts:
- Field-level merge — Choose the correct value for each field individually. For example, accept the delivery date from Source A but the weight from Source B.
- Merge-to-different — If the system grouped two records together that actually represent different shipments, split them apart by merging one into a different target record.
- Post-hoc merge — If the system kept two records separate that should have been unified, combine them after the fact.
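The first option, a field-level merge, boils down to a per-field lookup against the reviewer's choices. A sketch using hypothetical sources A and B, matching the delivery-date/weight example above:

```python
def apply_field_merge(sources, choices):
    """Field-level merge: take each field's value from the chosen source.

    sources: {source_name: {field: value}}
    choices: {field: source_name}, the reviewer's per-field picks.
    (Illustrative; the product records these picks through its UI.)
    """
    return {field: sources[src][field] for field, src in choices.items()}

merged = apply_field_merge(
    {"A": {"delivery_date": "May 1", "weight_kg": 410},
     "B": {"delivery_date": "May 3", "weight_kg": 415}},
    {"delivery_date": "A", "weight_kg": "B"},  # date from A, weight from B
)
print(merged)  # {'delivery_date': 'May 1', 'weight_kg': 415}
```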
Auto-resolve audit trail
When the system automatically resolves a conflict (because one source has significantly higher confidence), the full reasoning is logged. You can review these auto-resolutions at any time to verify the system is making good decisions.
Merge Quality Scoring
Every merge decision — whether manual or automatic — is tracked with a quality score. A calibration logger monitors scoring accuracy over time, so you can see whether the system's confidence scores are well-calibrated or need adjustment.
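Calibration checking itself is simple to sketch: bucket past decisions by predicted confidence, then compare each bucket's average prediction with how often those decisions turned out correct. The bucket edges here are arbitrary, and this is only an illustration of the idea:

```python
def calibration_report(decisions, buckets=(0.5, 0.75, 1.01)):
    """Compare predicted confidence against observed accuracy.

    decisions: list of (predicted_confidence, was_correct) pairs.
    Returns per-bucket (mean prediction, observed accuracy, count).
    Well-calibrated scores put the first two numbers close together.
    """
    report, lo = [], 0.0
    for hi in buckets:
        in_bucket = [(p, ok) for p, ok in decisions if lo <= p < hi]
        if in_bucket:
            mean_pred = sum(p for p, _ in in_bucket) / len(in_bucket)
            accuracy = sum(ok for _, ok in in_bucket) / len(in_bucket)
            report.append((round(mean_pred, 2), round(accuracy, 2),
                           len(in_bucket)))
        lo = hi
    return report

# High-confidence merges were always right; mid-confidence ones were
# right only half the time, so the mid band is overconfident.
report = calibration_report([(0.9, True), (0.9, True),
                             (0.6, True), (0.6, False)])
print(report)  # [(0.6, 0.5, 2), (0.9, 1.0, 2)]
```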
Processing Pipeline
The backend infrastructure handles unification at scale, processing thousands of records per minute without blocking your team's workflow.
Worker Architecture
The pipeline uses BullMQ workers for async processing:
- Staging queue — Incoming data from all sources lands here first. Records are batched and prioritized before entering the merge pipeline.
- SLA worker — Monitors unification jobs against processing time targets. If a job is taking too long, it's flagged or re-prioritized.
- Notification worker — Sends alerts when conflicts need manual review or when processing errors occur.
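The staging queue's batch-and-prioritize step can be illustrated with an in-memory priority queue. The real pipeline runs on BullMQ and Redis, so this is only a sketch of the ordering logic (lower priority number means processed first, mirroring BullMQ's convention):

```python
import heapq

class StagingQueue:
    """In-memory sketch of staging-queue batching and prioritization."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserves insertion order

    def enqueue(self, record, priority=10):
        heapq.heappush(self._heap, (priority, self._seq, record))
        self._seq += 1

    def next_batch(self, size):
        """Drain up to `size` records, highest priority first."""
        batch = []
        while self._heap and len(batch) < size:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

q = StagingQueue()
q.enqueue("routine-sync", priority=10)
q.enqueue("sla-breach-retry", priority=1)  # bumped by the SLA worker
print(q.next_batch(2))  # ['sla-breach-retry', 'routine-sync']
```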
Dead Letter Queue
Failed processing jobs are captured in a dead letter queue (DLQ) with a dedicated admin UI:
- Inspect the full job payload and error details
- Replay failed jobs after fixing the underlying issue
- Discard jobs that are no longer relevant
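Stripped of the admin UI, the DLQ's capture/replay/discard lifecycle looks roughly like this; the entry shape and method names are illustrative:

```python
class DeadLetterQueue:
    """Minimal DLQ sketch: capture failed jobs, then replay or discard."""

    def __init__(self):
        self.entries = []  # each entry keeps the payload and the error

    def capture(self, job, error):
        self.entries.append({"job": job, "error": error})

    def replay(self, index, handler):
        """Re-run a failed job; drop it from the DLQ only on success."""
        entry = self.entries[index]
        handler(entry["job"])  # raises if the job fails again
        self.entries.pop(index)

    def discard(self, index):
        """Drop a job that is no longer relevant."""
        self.entries.pop(index)

dlq = DeadLetterQueue()
dlq.capture({"shipment_id": "SHP-1"}, "schema validation failed")
dlq.replay(0, lambda job: None)  # issue fixed upstream, replay succeeds
print(len(dlq.entries))  # 0
```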
Probabilistic Dedup
Before merging, the pipeline runs probabilistic deduplication to identify records that likely refer to the same shipment. The dedup engine includes:
- Fuzzy matching on key identifiers (tracking numbers, reference IDs, addresses)
- VIP sender filters to prevent false matches on high-priority accounts where similar-looking records are actually distinct
- Configurable thresholds so you can tune the sensitivity based on your data quality
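A minimal sketch of the fuzzy-matching and VIP-filter ideas, using a generic string-similarity ratio; the threshold, field names, and the exact-match rule for VIP senders are illustrative assumptions, not the engine's actual model:

```python
from difflib import SequenceMatcher

def likely_same_shipment(a, b, threshold=0.85, vip_senders=()):
    """Fuzzy-match two records on their tracking numbers.

    For VIP senders, similar-looking records are often genuinely
    distinct, so we require an exact match instead of a fuzzy one.
    """
    if a.get("sender") in vip_senders or b.get("sender") in vip_senders:
        return a["tracking"] == b["tracking"]
    score = SequenceMatcher(None, a["tracking"], b["tracking"]).ratio()
    return score >= threshold

r1 = {"tracking": "1Z999AA10123456784", "sender": "acme"}
r2 = {"tracking": "1Z999AA10123456734", "sender": "acme"}  # one digit off
print(likely_same_shipment(r1, r2))                        # True
print(likely_same_shipment(r1, r2, vip_senders={"acme"}))  # False
```

Raising `threshold` trades recall for precision: fewer false merges, but more genuinely duplicate records left separate for post-hoc merging.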
Data Foundation
The foundational layer (Phase -1) handles the core infrastructure that everything else builds on:
- Schema definition and validation
- Data ingestion from connected sources
- Record linking and entity resolution
- History tracking for every field change
- Conflict detection and flagging
- Email extraction for unstructured logistics communications
Getting Started
Connect your data sources
The unification pipeline works with data sources configured in the Knowledge Base. Ensure your key systems (carrier APIs, TMS, ERP) are connected before enabling unification.
Review initial golden records
Once data starts flowing, golden records are created automatically. Review the first batch to verify that the system is correctly identifying and merging related records.
Resolve conflicts
Check the Conflicts queue for any records where sources disagree. Use the diff table to understand the discrepancy and choose the correct values. Your resolutions train the system's confidence scoring over time.
Monitor pipeline health
Use the DLQ admin UI to monitor for processing failures. The SLA worker ensures unification jobs complete within target timeframes. Review the merge quality dashboard to track scoring accuracy.