Skip to main content

unify API

Entity deduplication and resolution across messy datasets (Arrow IPC only).

npm install @rowops/unify

Worker Factory

createUnifyWorker

Creates a web worker for unification.

import { createUnifyWorker } from "@rowops/unify";

const worker = createUnifyWorker();

Types

UnifyConfig

Configuration for entity resolution.

interface UnifyConfig {
/** Exact match key fields */
keys?: string[];

/** Fuzzy matching rules */
fuzzy: UnifyFuzzyRule[];

/** Similarity threshold (0.0 - 1.0) */
threshold: number;
}

UnifyFuzzyRule

interface UnifyFuzzyRule {
field: string;
weight: number;
method?: "jaro" | "levenshtein" | "exact";
}

UnifyCluster

Worker protocol cluster output.

interface UnifyCluster {
id: string;
record_ids: string[];
scores: Record<string, number>;
}

EntityCluster

Cluster metadata emitted by the hero API.

interface EntityCluster {
id: string;
recordIds: string[];
score: number;
reasons: UnifyReason[];
}

UnifyCanonicalRecord

Canonical record metadata (no row materialization).

interface UnifyCanonicalRecord {
entityId: string;
sourceIds: string[];
confidence: number;
}

Worker Protocol

StreamingUnifyConfig

interface StreamingUnifyConfig {
keys?: string[];
fuzzy: UnifyFuzzyRule[];
threshold: number;
tierGate?: TierGateInit;
idColumn?: string;
}

StreamUnifyChunkResult

interface StreamUnifyChunkResult {
type: "STREAM_UNIFY_CHUNK_RESULT";
jobId: string;
chunkIndex: number;
originalIpcBytes: Uint8Array;
clusterMappings: Array<{ recordId: string; clusterId: string; confidence: number }>;
stats: { inputRecords: number; clusteredRecords: number; newClusters: number };
}

StreamUnifyFinalSummary

interface StreamUnifyFinalSummary {
totalClusters: number;
totalRecords: number;
dedupRatio: number;
canonicals: UnifyCanonicalRecord[];
clusters: UnifyCluster[];
}

Usage with Importer

import { RowOpsImporter } from "@rowops/importer";

<RowOpsImporter
projectId="proj_xxx"
schemaId="contacts"
publishableKey="pk_xxx"
enableUnify={true}
unifyConfig={{
keys: ["email"],
fuzzy: [
{ field: "first_name", weight: 2.0, method: "jaro" },
{ field: "last_name", weight: 2.0, method: "jaro" },
{ field: "phone", weight: 1.0, method: "exact" },
],
threshold: 0.85,
}}
onUnifyComplete={(result) => {
console.log(`Run: ${result.runMeta.importRunId}`);
console.log(`Clusters: ${result.totalClusters}`);
console.log(`Records: ${result.totalRecords}`);
console.log(`Dedup ratio: ${(result.dedupRatio * 100).toFixed(1)}%`);
}}
/>

Configuration Examples

Email-based Dedup

const config: UnifyConfig = {
keys: ["email"], // Exact match on email
fuzzy: [],
threshold: 1.0,
};

Name Fuzzy Matching

const config: UnifyConfig = {
fuzzy: [
{ field: "first_name", weight: 1.5, method: "jaro" },
{ field: "last_name", weight: 2.0, method: "jaro" },
{ field: "company", weight: 1.0, method: "levenshtein" },
],
threshold: 0.85,
};

Composite Key + Fuzzy

const config: UnifyConfig = {
keys: ["customer_id"], // Exact match first
fuzzy: [
{ field: "name", weight: 2.0, method: "jaro" },
{ field: "address", weight: 1.0, method: "levenshtein" },
],
threshold: 0.80,
};

Matching Methods

MethodBest ForSpeed
exactIDs, emails, codesFast
jaroNames, short stringsMedium
levenshteinAddresses, typosSlower

Threshold Guidelines

Use CaseThreshold
Strict dedup (same person)0.90+
Moderate dedup (likely same)0.80-0.89
Aggressive dedup (maybe same)0.70-0.79

Canonical Record Selection

When multiple records match, the canonical record is selected by:

  1. Most complete - Fewest null fields
  2. Most recent - If timestamp field exists
  3. First seen - Fallback to row order

You can customize this via the canonicalStrategy option (Scale+ tier).


Tier Restrictions

FeatureFreeProScaleEnterprise
Exact key dedupNoYesYesYes
Fuzzy matchingNoNoYesYes
Weighted fieldsNoNoNoYes
Custom canonical strategyNoNoYesYes
Max rows per unify-100,0001,000,000Unlimited

See Also