unify API

Entity deduplication and resolution across messy datasets (Arrow IPC only).

npm install @rowops/unify

Worker Factory

createUnifyWorker

Creates a web worker for unification.

import { createUnifyWorker } from "@rowops/unify";

const worker = createUnifyWorker();

Types

UnifyConfig

Configuration for entity resolution.

interface UnifyConfig {
  /** Exact match key fields */
  keys?: string[];

  /** Fuzzy matching rules */
  fuzzy: UnifyFuzzyRule[];

  /** Similarity threshold (0.0 - 1.0) */
  threshold: number;
}

UnifyFuzzyRule

interface UnifyFuzzyRule {
  field: string;
  weight: number;
  method?: "jaro" | "levenshtein" | "exact";
}
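
This page does not spell out how per-field similarities are combined into the single score that is compared with threshold. The sketch below assumes a plain weighted average; combinedScore and the fieldSimilarity input are hypothetical names for illustration, not part of @rowops/unify.

```typescript
// Local copy of the rule shape documented above.
interface UnifyFuzzyRule {
  field: string;
  weight: number;
  method?: "jaro" | "levenshtein" | "exact";
}

// Hypothetical helper (an assumption, not the library's algorithm):
// weighted average of per-field similarities, each expected in 0..1.
// A field with no similarity value counts as 0.
function combinedScore(
  rules: UnifyFuzzyRule[],
  fieldSimilarity: Record<string, number>,
): number {
  const totalWeight = rules.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight <= 0) return 0;
  const weighted = rules.reduce(
    (sum, r) => sum + r.weight * (fieldSimilarity[r.field] ?? 0),
    0,
  );
  return weighted / totalWeight;
}

const exampleRules: UnifyFuzzyRule[] = [
  { field: "first_name", weight: 2.0, method: "jaro" },
  { field: "last_name", weight: 2.0, method: "jaro" },
  { field: "phone", weight: 1.0, method: "exact" },
];

// (2.0 * 0.9 + 2.0 * 1.0 + 1.0 * 0) / 5.0 = 0.76
const exampleScore = combinedScore(exampleRules, {
  first_name: 0.9,
  last_name: 1.0,
  phone: 0,
});
const exampleIsMatch = exampleScore >= 0.85;
```

Under this assumption a mismatch on a low-weight field (phone) can still pull a pair below a 0.85 threshold, which is worth keeping in mind when choosing weights.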

UnifyCluster

Worker protocol cluster output.

interface UnifyCluster {
  id: string;
  record_ids: string[];
  scores: Record<string, number>;
}

EntityCluster

Cluster metadata emitted by the hero API.

interface EntityCluster {
  id: string;
  recordIds: string[];
  score: number;
  reasons: UnifyReason[];
}

UnifyCanonicalRecord

Canonical record metadata (no row materialization).

interface UnifyCanonicalRecord {
  entityId: string;
  sourceIds: string[];
  confidence: number;
}

Worker Protocol

StreamingUnifyConfig

interface StreamingUnifyConfig {
  keys?: string[];
  fuzzy: UnifyFuzzyRule[];
  threshold: number;
  tierGate?: TierGateInit;
  idColumn?: string;
}

StreamUnifyChunkResult

interface StreamUnifyChunkResult {
  type: "STREAM_UNIFY_CHUNK_RESULT";
  jobId: string;
  chunkIndex: number;
  originalIpcBytes: Uint8Array;
  clusterMappings: Array<{ recordId: string; clusterId: string; confidence: number }>;
  stats: { inputRecords: number; clusteredRecords: number; newClusters: number };
}
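
Because clusterMappings arrive one chunk at a time, a consumer usually has to accumulate them. The following is a minimal sketch of that bookkeeping; isChunkResult and ClusterIndex are hypothetical helpers, and it assumes the worker delivers these objects as ordinary message payloads.

```typescript
// Local copy of the message shape documented above.
interface StreamUnifyChunkResult {
  type: "STREAM_UNIFY_CHUNK_RESULT";
  jobId: string;
  chunkIndex: number;
  originalIpcBytes: Uint8Array;
  clusterMappings: Array<{ recordId: string; clusterId: string; confidence: number }>;
  stats: { inputRecords: number; clusteredRecords: number; newClusters: number };
}

// Type guard that narrows an unknown message by its discriminant.
function isChunkResult(msg: unknown): msg is StreamUnifyChunkResult {
  return (
    typeof msg === "object" &&
    msg !== null &&
    (msg as { type?: unknown }).type === "STREAM_UNIFY_CHUNK_RESULT"
  );
}

// Hypothetical accumulator: groups record IDs by cluster ID across chunks.
class ClusterIndex {
  readonly members = new Map<string, string[]>();

  add(chunk: StreamUnifyChunkResult): void {
    for (const { recordId, clusterId } of chunk.clusterMappings) {
      const ids = this.members.get(clusterId) ?? [];
      ids.push(recordId);
      this.members.set(clusterId, ids);
    }
  }
}

// Two made-up chunks for illustration.
const sampleChunks: unknown[] = [
  {
    type: "STREAM_UNIFY_CHUNK_RESULT",
    jobId: "job_1",
    chunkIndex: 0,
    originalIpcBytes: new Uint8Array(0),
    clusterMappings: [
      { recordId: "r1", clusterId: "c1", confidence: 1.0 },
      { recordId: "r2", clusterId: "c2", confidence: 0.9 },
    ],
    stats: { inputRecords: 2, clusteredRecords: 2, newClusters: 2 },
  },
  {
    type: "STREAM_UNIFY_CHUNK_RESULT",
    jobId: "job_1",
    chunkIndex: 1,
    originalIpcBytes: new Uint8Array(0),
    clusterMappings: [{ recordId: "r3", clusterId: "c1", confidence: 0.88 }],
    stats: { inputRecords: 1, clusteredRecords: 1, newClusters: 0 },
  },
  { type: "SOMETHING_ELSE" },
];

const clusterIndex = new ClusterIndex();
for (const msg of sampleChunks) {
  if (isChunkResult(msg)) clusterIndex.add(msg);
}
```

In a browser the same guard could sit inside a worker.onmessage handler and be applied to event.data, assuming the worker posts these results directly.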

StreamUnifyFinalSummary

interface StreamUnifyFinalSummary {
  totalClusters: number;
  totalRecords: number;
  dedupRatio: number;
  canonicals: UnifyCanonicalRecord[];
  clusters: UnifyCluster[];
}
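
As an illustration of reading the final summary, the sketch below pulls out the clusters that actually merged records and counts how many records were redundant. duplicateClusters and redundantRecordCount are hypothetical helpers; the exact formula behind dedupRatio is not stated on this page, so the sketch does not try to reproduce it.

```typescript
// Local copies of the shapes documented above.
interface UnifyCluster {
  id: string;
  record_ids: string[];
  scores: Record<string, number>;
}

interface UnifyCanonicalRecord {
  entityId: string;
  sourceIds: string[];
  confidence: number;
}

interface StreamUnifyFinalSummary {
  totalClusters: number;
  totalRecords: number;
  dedupRatio: number;
  canonicals: UnifyCanonicalRecord[];
  clusters: UnifyCluster[];
}

// Hypothetical helper: clusters that contain more than one record.
function duplicateClusters(summary: StreamUnifyFinalSummary): UnifyCluster[] {
  return summary.clusters.filter((c) => c.record_ids.length > 1);
}

// Hypothetical helper: records beyond the first in each cluster.
function redundantRecordCount(summary: StreamUnifyFinalSummary): number {
  return summary.clusters.reduce(
    (sum, c) => sum + Math.max(0, c.record_ids.length - 1),
    0,
  );
}

// Made-up summary for illustration; dedupRatio is a placeholder value.
const sampleSummary: StreamUnifyFinalSummary = {
  totalClusters: 2,
  totalRecords: 4,
  dedupRatio: 0.5,
  canonicals: [
    { entityId: "c1", sourceIds: ["r1", "r2", "r3"], confidence: 0.92 },
    { entityId: "c2", sourceIds: ["r4"], confidence: 1.0 },
  ],
  clusters: [
    { id: "c1", record_ids: ["r1", "r2", "r3"], scores: { r2: 0.95, r3: 0.89 } },
    { id: "c2", record_ids: ["r4"], scores: {} },
  ],
};

const merged = duplicateClusters(sampleSummary);
const redundant = redundantRecordCount(sampleSummary);
```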

Usage with Importer

import { RowOpsImporter } from "@rowops/importer";

<RowOpsImporter
  projectId="proj_xxx"
  schemaId="contacts"
  publishableKey="pk_xxx"
  enableUnify={true}
  unifyConfig={{
    keys: ["email"],
    fuzzy: [
      { field: "first_name", weight: 2.0, method: "jaro" },
      { field: "last_name", weight: 2.0, method: "jaro" },
      { field: "phone", weight: 1.0, method: "exact" },
    ],
    threshold: 0.85,
  }}
  onUnifyComplete={(result) => {
    console.log(`Run: ${result.runMeta.importRunId}`);
    console.log(`Clusters: ${result.totalClusters}`);
    console.log(`Records: ${result.totalRecords}`);
    console.log(`Dedup ratio: ${(result.dedupRatio * 100).toFixed(1)}%`);
  }}
/>

Configuration Examples

Email-based Dedup

const config: UnifyConfig = {
  keys: ["email"], // Exact match on email
  fuzzy: [],
  threshold: 1.0,
};

Name Fuzzy Matching

const config: UnifyConfig = {
  fuzzy: [
    { field: "first_name", weight: 1.5, method: "jaro" },
    { field: "last_name", weight: 2.0, method: "jaro" },
    { field: "company", weight: 1.0, method: "levenshtein" },
  ],
  threshold: 0.85,
};

Composite Key + Fuzzy

const config: UnifyConfig = {
  keys: ["customer_id"], // Exact match first
  fuzzy: [
    { field: "name", weight: 2.0, method: "jaro" },
    { field: "address", weight: 1.0, method: "levenshtein" },
  ],
  threshold: 0.80,
};
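
Before handing a config to the importer it can help to sanity-check it. The sketch below is a hypothetical validateUnifyConfig helper: the 0.0 to 1.0 threshold range comes from the UnifyConfig type above, while the positive-weight and "at least one rule" checks are assumptions about sensible input rather than documented library behaviour.

```typescript
// Local copies of the shapes documented above.
interface UnifyFuzzyRule {
  field: string;
  weight: number;
  method?: "jaro" | "levenshtein" | "exact";
}

interface UnifyConfig {
  keys?: string[];
  fuzzy: UnifyFuzzyRule[];
  threshold: number;
}

// Hypothetical helper: returns a list of problems, empty when the config looks sane.
function validateUnifyConfig(config: UnifyConfig): string[] {
  const problems: string[] = [];

  // Documented range for threshold is 0.0 - 1.0 (this also rejects NaN).
  if (!(config.threshold >= 0 && config.threshold <= 1)) {
    problems.push("threshold must be between 0.0 and 1.0");
  }

  // Assumption: a config with neither keys nor fuzzy rules cannot match anything.
  if ((config.keys?.length ?? 0) === 0 && config.fuzzy.length === 0) {
    problems.push("config needs at least one key or fuzzy rule");
  }

  // Assumption: weights are expected to be positive numbers.
  for (const rule of config.fuzzy) {
    if (!(rule.weight > 0)) {
      problems.push(`fuzzy rule "${rule.field}" needs a positive weight`);
    }
  }

  return problems;
}

const okProblems = validateUnifyConfig({
  keys: ["customer_id"],
  fuzzy: [{ field: "name", weight: 2.0, method: "jaro" }],
  threshold: 0.8,
});

const badProblems = validateUnifyConfig({
  fuzzy: [{ field: "name", weight: 0 }],
  threshold: 1.5,
});
```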

Matching Methods

Method	Best For	Speed
`exact`	IDs, emails, codes	Fast
`jaro`	Names, short strings	Medium
`levenshtein`	Addresses, typos	Slower
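
For a feel of what the jaro method measures, here is a minimal sketch of the textbook Jaro similarity. It is included only for illustration; the implementation inside @rowops/unify may differ (for example in normalization or case handling), and jaroSimilarity is a hypothetical name.

```typescript
// Textbook Jaro similarity between two strings, in the range 0..1.
function jaroSimilarity(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;

  // Characters match only if they are equal and within this distance of each other.
  const matchWindow = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array<boolean>(a.length).fill(false);
  const bMatched = new Array<boolean>(b.length).fill(false);

  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - matchWindow);
    const hi = Math.min(b.length - 1, i + matchWindow);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Walk the matched characters of both strings in order; each position where
  // they disagree is half a transposition.
  let k = 0;
  let halfTranspositions = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) halfTranspositions++;
    k++;
  }
  const transpositions = halfTranspositions / 2;

  return (
    (matches / a.length + matches / b.length + (matches - transpositions) / matches) / 3
  );
}

// Classic example: 6 matches, 1 transposition -> (1 + 1 + 5/6) / 3 ≈ 0.944
const marthaScore = jaroSimilarity("MARTHA", "MARHTA");
```

A single swapped letter pair in a short name still scores well above the 0.85 threshold used in the examples, which is why Jaro suits names and other short strings.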

Threshold Guidelines

Use Case	Threshold
Strict dedup (same person)	0.90+
Moderate dedup (likely same)	0.80-0.89
Aggressive dedup (maybe same)	0.70-0.79
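
The bands in the table can be encoded directly, for example to label a config in logs or a settings screen. This is a small sketch; bandForThreshold and the "below-guidelines" label for values under 0.70 are hypothetical additions.

```typescript
type MatchBand = "strict" | "moderate" | "aggressive" | "below-guidelines";

// Maps a threshold to the guideline band from the table above.
function bandForThreshold(threshold: number): MatchBand {
  if (threshold >= 0.9) return "strict";
  if (threshold >= 0.8) return "moderate";
  if (threshold >= 0.7) return "aggressive";
  // Not covered by the table; the label is an assumption.
  return "below-guidelines";
}

const bandForDefaultExample = bandForThreshold(0.85);
```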

Canonical Record Selection

When multiple records match, the canonical record is selected by the following criteria, applied in order:

1. Most complete: fewest null fields
2. Most recent: used if a timestamp field exists
3. First seen: fallback to row order

You can customize this via the canonicalStrategy option (Scale+ tier).
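
The default selection order described above can be sketched as a comparison over plain row objects. This is an illustration only: pickCanonical and its timestampField parameter are hypothetical, timestamps are assumed to be ISO strings or epoch milliseconds, and the library itself works on Arrow data without materializing rows.

```typescript
type Row = Record<string, unknown>;

// Counts null or undefined values in a row.
function nullCount(row: Row): number {
  return Object.values(row).filter((v) => v === null || v === undefined).length;
}

// Converts a timestamp value to epoch milliseconds; unusable values sort last.
function toTime(value: unknown): number {
  if (typeof value === "number") return value;
  if (typeof value === "string") {
    const parsed = Date.parse(value);
    return Number.isNaN(parsed) ? -Infinity : parsed;
  }
  return -Infinity;
}

// True when candidate should replace current as the canonical row.
function isBetter(candidate: Row, current: Row, timestampField?: string): boolean {
  // 1. Most complete: fewer null fields wins.
  const nullDiff = nullCount(candidate) - nullCount(current);
  if (nullDiff !== 0) return nullDiff < 0;

  // 2. Most recent: only when a timestamp field is available.
  if (timestampField !== undefined) {
    const candidateTime = toTime(candidate[timestampField]);
    const currentTime = toTime(current[timestampField]);
    if (candidateTime !== currentTime) return candidateTime > currentTime;
  }

  // 3. First seen: on a full tie the earlier row is kept.
  return false;
}

// Hypothetical helper: picks the canonical row from a cluster, in row order.
function pickCanonical(rows: Row[], timestampField?: string): Row | undefined {
  let best: Row | undefined;
  for (const row of rows) {
    if (best === undefined || isBetter(row, best, timestampField)) best = row;
  }
  return best;
}

// Made-up cluster for illustration.
const clusterRows: Row[] = [
  { id: "a", email: null, updated_at: "2024-01-01T00:00:00Z" },
  { id: "b", email: "x@example.com", updated_at: "2023-01-01T00:00:00Z" },
  { id: "c", email: "x@example.com", updated_at: "2024-06-01T00:00:00Z" },
  { id: "d", email: "x@example.com", updated_at: "2024-06-01T00:00:00Z" },
];

const withTimestamp = pickCanonical(clusterRows, "updated_at");
const withoutTimestamp = pickCanonical(clusterRows);
```

With the timestamp field, row c wins (as complete as b and d, newer than b, seen before d); without it, row b wins as the first fully populated row.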

Tier Restrictions

Feature	Free	Pro	Scale	Enterprise
Exact key dedup	No	Yes	Yes	Yes
Fuzzy matching	No	No	Yes	Yes
Weighted fields	No	No	No	Yes
Custom canonical strategy	No	No	Yes	Yes
Max rows per unify	-	100,000	1,000,000	Unlimited
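
The table above can be mirrored as a lookup for a client-side pre-check. The sketch below is an assumption-laden illustration: Tier, UnifyUsage, and tierProblems are hypothetical names, the Free tier's "-" row limit is encoded as 0, and actual enforcement is presumably done by the service (see tierGate in StreamingUnifyConfig), not by code like this.

```typescript
type Tier = "free" | "pro" | "scale" | "enterprise";

interface TierLimits {
  exactKeys: boolean;
  fuzzy: boolean;
  weightedFields: boolean;
  customCanonical: boolean;
  maxRows: number;
}

// Values transcribed from the table above.
const TIER_LIMITS: Record<Tier, TierLimits> = {
  free: { exactKeys: false, fuzzy: false, weightedFields: false, customCanonical: false, maxRows: 0 },
  pro: { exactKeys: true, fuzzy: false, weightedFields: false, customCanonical: false, maxRows: 100_000 },
  scale: { exactKeys: true, fuzzy: true, weightedFields: false, customCanonical: true, maxRows: 1_000_000 },
  enterprise: { exactKeys: true, fuzzy: true, weightedFields: true, customCanonical: true, maxRows: Infinity },
};

// Hypothetical description of what a unify run intends to use.
interface UnifyUsage {
  usesKeys: boolean;
  usesFuzzy: boolean;
  usesWeights: boolean;
  usesCustomCanonical: boolean;
  rowCount: number;
}

// Returns the features or limits the given tier does not cover.
function tierProblems(tier: Tier, usage: UnifyUsage): string[] {
  const limits = TIER_LIMITS[tier];
  const problems: string[] = [];
  if (usage.usesKeys && !limits.exactKeys) problems.push("exact key dedup");
  if (usage.usesFuzzy && !limits.fuzzy) problems.push("fuzzy matching");
  if (usage.usesWeights && !limits.weightedFields) problems.push("weighted fields");
  if (usage.usesCustomCanonical && !limits.customCanonical) problems.push("custom canonical strategy");
  if (usage.rowCount > limits.maxRows) problems.push("row limit");
  return problems;
}

const proCheck = tierProblems("pro", {
  usesKeys: true,
  usesFuzzy: true,
  usesWeights: false,
  usesCustomCanonical: false,
  rowCount: 250_000,
});
```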
