unify API
Entity deduplication and resolution across messy datasets (Arrow IPC only).
npm install @rowops/unify
Worker Factory
createUnifyWorker
Creates a web worker for unification.
import { createUnifyWorker } from "@rowops/unify";
const worker = createUnifyWorker();
Types
UnifyConfig
Configuration for entity resolution.
interface UnifyConfig {
  /** Exact match key fields */
  keys?: string[];
  /** Fuzzy matching rules */
  fuzzy: UnifyFuzzyRule[];
  /** Similarity threshold (0.0 - 1.0) */
  threshold: number;
}
UnifyFuzzyRule
interface UnifyFuzzyRule {
  field: string;
  weight: number;
  method?: "jaro" | "levenshtein" | "exact";
}
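The reference does not spell out how per-field similarities and weights are combined into the single score that is compared against `threshold`. The sketch below assumes a plain weighted average, which is one common choice; `combineScores` is a hypothetical helper, not part of `@rowops/unify`, and the type is redeclared locally so the snippet runs standalone.

```typescript
// Local copy of the documented type so the sketch is self-contained.
interface UnifyFuzzyRule {
  field: string;
  weight: number;
  method?: "jaro" | "levenshtein" | "exact";
}

// Hypothetical helper. ASSUMPTION: the overall score is a weighted average of
// per-field similarities (each in 0..1). The real engine may aggregate differently.
function combineScores(
  rules: UnifyFuzzyRule[],
  similarities: Record<string, number>,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const rule of rules) {
    const sim = similarities[rule.field] ?? 0; // a missing field counts as no match
    weighted += rule.weight * sim;
    totalWeight += rule.weight;
  }
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}

const rules: UnifyFuzzyRule[] = [
  { field: "first_name", weight: 1.5, method: "jaro" },
  { field: "last_name", weight: 2.0, method: "jaro" },
  { field: "company", weight: 1.0, method: "levenshtein" },
];

// Illustrative similarities for one candidate pair (made-up values).
const score = combineScores(rules, { first_name: 0.9, last_name: 1.0, company: 0.6 });
const isMatch = score >= 0.85; // (1.35 + 2.0 + 0.6) / 4.5 ≈ 0.878, so this pair matches
```

Under this assumption, a higher `weight` lets a strong match on one field compensate for a weaker match on a lower-weighted field.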
UnifyCluster
Worker protocol cluster output.
interface UnifyCluster {
  id: string;
  record_ids: string[];
  scores: Record<string, number>;
}
EntityCluster
Cluster metadata emitted by the hero API.
interface EntityCluster {
  id: string;
  recordIds: string[];
  score: number;
  reasons: UnifyReason[];
}
UnifyCanonicalRecord
Canonical record metadata (no row materialization).
interface UnifyCanonicalRecord {
  entityId: string;
  sourceIds: string[];
  confidence: number;
}
Worker Protocol
StreamingUnifyConfig
interface StreamingUnifyConfig {
  keys?: string[];
  fuzzy: UnifyFuzzyRule[];
  threshold: number;
  tierGate?: TierGateInit;
  idColumn?: string;
}
StreamUnifyChunkResult
interface StreamUnifyChunkResult {
  type: "STREAM_UNIFY_CHUNK_RESULT";
  jobId: string;
  chunkIndex: number;
  originalIpcBytes: Uint8Array;
  clusterMappings: Array<{ recordId: string; clusterId: string; confidence: number }>;
  stats: { inputRecords: number; clusteredRecords: number; newClusters: number };
}
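Because `clusterMappings` arrives one chunk at a time, a consumer that wants the full membership of each cluster has to accumulate it. The sketch below shows one way to do that; `addMappings` is a hypothetical helper, the sample mappings are invented, and it assumes that a given `clusterId` stays stable across chunks of the same job.

```typescript
// Shape of one entry in StreamUnifyChunkResult.clusterMappings, redeclared locally.
interface ClusterMapping {
  recordId: string;
  clusterId: string;
  confidence: number;
}

// Hypothetical helper: fold one chunk's mappings into a running
// clusterId -> recordIds index.
function addMappings(groups: Map<string, string[]>, mappings: ClusterMapping[]): void {
  for (const { recordId, clusterId } of mappings) {
    const ids = groups.get(clusterId);
    if (ids) {
      ids.push(recordId);
    } else {
      groups.set(clusterId, [recordId]);
    }
  }
}

const groups = new Map<string, string[]>();

// Chunk 0 (invented sample data)
addMappings(groups, [
  { recordId: "r1", clusterId: "c1", confidence: 1.0 },
  { recordId: "r2", clusterId: "c1", confidence: 0.92 },
]);

// Chunk 1 (invented sample data)
addMappings(groups, [
  { recordId: "r3", clusterId: "c2", confidence: 1.0 },
  { recordId: "r4", clusterId: "c1", confidence: 0.88 },
]);

// groups now maps c1 -> [r1, r2, r4] and c2 -> [r3]
```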
StreamUnifyFinalSummary
interface StreamUnifyFinalSummary {
  totalClusters: number;
  totalRecords: number;
  dedupRatio: number;
  canonicals: UnifyCanonicalRecord[];
  clusters: UnifyCluster[];
}
Usage with Importer
import { RowOpsImporter } from "@rowops/importer";
<RowOpsImporter
  projectId="proj_xxx"
  schemaId="contacts"
  publishableKey="pk_xxx"
  enableUnify={true}
  unifyConfig={{
    keys: ["email"],
    fuzzy: [
      { field: "first_name", weight: 2.0, method: "jaro" },
      { field: "last_name", weight: 2.0, method: "jaro" },
      { field: "phone", weight: 1.0, method: "exact" },
    ],
    threshold: 0.85,
  }}
  onUnifyComplete={(result) => {
    console.log(`Run: ${result.runMeta.importRunId}`);
    console.log(`Clusters: ${result.totalClusters}`);
    console.log(`Records: ${result.totalRecords}`);
    console.log(`Dedup ratio: ${(result.dedupRatio * 100).toFixed(1)}%`);
  }}
/>
Configuration Examples
Email-based Dedup
const config: UnifyConfig = {
  keys: ["email"], // Exact match on email
  fuzzy: [],
  threshold: 1.0,
};
Name Fuzzy Matching
const config: UnifyConfig = {
  fuzzy: [
    { field: "first_name", weight: 1.5, method: "jaro" },
    { field: "last_name", weight: 2.0, method: "jaro" },
    { field: "company", weight: 1.0, method: "levenshtein" },
  ],
  threshold: 0.85,
};
Composite Key + Fuzzy
const config: UnifyConfig = {
  keys: ["customer_id"], // Exact match first
  fuzzy: [
    { field: "name", weight: 2.0, method: "jaro" },
    { field: "address", weight: 1.0, method: "levenshtein" },
  ],
  threshold: 0.80,
};
Matching Methods
| Method | Best For | Speed |
|---|---|---|
| exact | IDs, emails, codes | Fast |
| jaro | Names, short strings | Medium |
| levenshtein | Addresses, typos | Slower |
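For intuition about what the three methods measure, here are textbook versions of each as similarity functions in the 0..1 range. These are standalone sketches, not the library's implementations: the function names are hypothetical, and the actual engine may differ in details such as case folding, whitespace handling, or how Levenshtein distance is normalized (dividing by the longer length is an assumption here).

```typescript
// Exact: 1 when the strings are identical, otherwise 0.
function exact(a: string, b: string): number {
  return a === b ? 1 : 0;
}

// Jaro similarity: rewards characters shared within a sliding window and
// penalizes shared characters that appear in a different order.
function jaro(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  const window = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array<boolean>(a.length).fill(false);
  const bMatched = new Array<boolean>(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(b.length - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Walk the matched characters in order; each out-of-order pair is half a transposition.
  let k = 0;
  let halfTranspositions = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) halfTranspositions++;
    k++;
  }
  const t = halfTranspositions / 2;
  return (matches / a.length + matches / b.length + (matches - t) / matches) / 3;
}

// Levenshtein distance: minimum number of single-character edits (two-row DP).
function levenshteinDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// ASSUMPTION: distance is normalized by the longer string's length.
function levenshteinSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshteinDistance(a, b) / longest;
}

const jaroExample = jaro("MARTHA", "MARHTA"); // 17/18 ≈ 0.944
const editExample = levenshteinDistance("kitten", "sitting"); // 3
```

The nested loops in `levenshteinDistance` do work proportional to the product of the two string lengths, which is consistent with it being listed as the slower method for long values such as addresses.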
Threshold Guidelines
| Use Case | Threshold |
|---|---|
| Strict dedup (same person) | 0.90+ |
| Moderate dedup (likely same) | 0.80-0.89 |
| Aggressive dedup (maybe same) | 0.70-0.79 |
Canonical Record Selection
When multiple records match, the canonical record is selected by:
- Most complete: fewest null fields
- Most recent: if a timestamp field exists
- First seen: fallback to row order
You can customize this via the canonicalStrategy option (Scale tier and above).
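The default selection rules can be sketched as a tie-breaking cascade. This reading is an assumption: it treats the three criteria as ordered priorities, counts only `null` and `undefined` as null fields, and takes the timestamp column name as a parameter. `pickCanonical` and `isBetter` are hypothetical names, and the sample rows are invented.

```typescript
type Row = Record<string, unknown>;

// Count fields that are null or undefined.
function countNulls(row: Row): number {
  return Object.values(row).filter((v) => v === null || v === undefined).length;
}

// True when `candidate` should replace `current` as the canonical row.
function isBetter(candidate: Row, current: Row, timestampField?: string): boolean {
  // 1. Most complete: fewer null fields wins.
  const nullDiff = countNulls(candidate) - countNulls(current);
  if (nullDiff !== 0) return nullDiff < 0;
  // 2. Most recent: later timestamp wins, when both rows have a parseable one.
  if (timestampField) {
    const a = Date.parse(String(candidate[timestampField]));
    const b = Date.parse(String(current[timestampField]));
    if (!Number.isNaN(a) && !Number.isNaN(b) && a !== b) return a > b;
  }
  // 3. First seen: on a full tie, keep the earlier row.
  return false;
}

// Hypothetical helper: scan rows in their original order and keep the best one.
function pickCanonical(rows: Row[], timestampField?: string): Row | undefined {
  let best: Row | undefined;
  for (const row of rows) {
    if (best === undefined || isBetter(row, best, timestampField)) best = row;
  }
  return best;
}

// Invented sample cluster.
const cluster: Row[] = [
  { id: "a", email: "x@example.com", phone: null, updated_at: "2024-01-01" },
  { id: "b", email: "x@example.com", phone: "555", updated_at: "2024-01-05" },
  { id: "c", email: "x@example.com", phone: "555", updated_at: "2024-03-01" },
  { id: "d", email: "x@example.com", phone: "555", updated_at: "2024-03-01" },
];

const canonical = pickCanonical(cluster, "updated_at"); // row "c"
```

Row "a" loses on completeness, "b" loses on recency, and "c" beats "d" only because it appears first.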
Tier Restrictions
| Feature | Free | Pro | Scale | Enterprise |
|---|---|---|---|---|
| Exact key dedup | No | Yes | Yes | Yes |
| Fuzzy matching | No | No | Yes | Yes |
| Weighted fields | No | No | No | Yes |
| Custom canonical strategy | No | No | Yes | Yes |
| Max rows per unify | - | 100,000 | 1,000,000 | Unlimited |
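The table above can be encoded as data to check a config before submitting a job. This is a client-side sketch only: `TIER_LIMITS` and `checkTier` are hypothetical, the types are redeclared locally, and two readings are assumptions. "Weighted fields" is taken to mean any fuzzy rule whose `weight` differs from 1, and the `-` in the Free column is treated as a row limit of 0. The custom canonical strategy row is omitted because `canonicalStrategy` does not appear in the documented `UnifyConfig` type.

```typescript
// Local copies of the documented types so the sketch runs standalone.
interface UnifyFuzzyRule {
  field: string;
  weight: number;
  method?: "jaro" | "levenshtein" | "exact";
}
interface UnifyConfig {
  keys?: string[];
  fuzzy: UnifyFuzzyRule[];
  threshold: number;
}

type Tier = "free" | "pro" | "scale" | "enterprise";

interface TierLimits {
  exactKeys: boolean;
  fuzzy: boolean;
  weightedFields: boolean;
  maxRows: number;
}

// Transcribed from the tier table; Free's "-" is modeled as 0 rows (assumption).
const TIER_LIMITS: Record<Tier, TierLimits> = {
  free: { exactKeys: false, fuzzy: false, weightedFields: false, maxRows: 0 },
  pro: { exactKeys: true, fuzzy: false, weightedFields: false, maxRows: 100000 },
  scale: { exactKeys: true, fuzzy: true, weightedFields: false, maxRows: 1000000 },
  enterprise: { exactKeys: true, fuzzy: true, weightedFields: true, maxRows: Infinity },
};

// Hypothetical helper: list the features a config uses that the tier does not include.
function checkTier(config: UnifyConfig, tier: Tier, rowCount: number): string[] {
  const limits = TIER_LIMITS[tier];
  const problems: string[] = [];
  if ((config.keys?.length ?? 0) > 0 && !limits.exactKeys) problems.push("exact key dedup");
  if (config.fuzzy.length > 0 && !limits.fuzzy) problems.push("fuzzy matching");
  // ASSUMPTION: a weight other than 1 counts as using weighted fields.
  if (config.fuzzy.some((r) => r.weight !== 1) && !limits.weightedFields) {
    problems.push("weighted fields");
  }
  if (rowCount > limits.maxRows) problems.push("row limit");
  return problems;
}

const emailOnly: UnifyConfig = { keys: ["email"], fuzzy: [], threshold: 1.0 };
const weightedNames: UnifyConfig = {
  fuzzy: [{ field: "last_name", weight: 2.0, method: "jaro" }],
  threshold: 0.85,
};

const proOk = checkTier(emailOnly, "pro", 50000); // []
const scaleIssues = checkTier(weightedNames, "scale", 500); // ["weighted fields"]
```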