Pipeline Overview

Why This Pipeline Exists

RowOps processes tabular data through a sequence of discrete stages. The pipeline separates concerns (parsing, validation, masking, transformation, and profiling) into individually configurable steps. Each stage operates on the output of the previous one, and the pipeline aims to produce deterministic results given identical inputs and configuration.


Canonical Pipeline Stages

Data flows through the pipeline in this order:

Parse → Validate → Mask → Transform → Profile
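
As a rough illustration of how the stages chain together, the sketch below wires up hypothetical stage functions in TypeScript. The names (parseFile, validateRows, maskRows, transformRows, profileRows, runPipeline) and shapes are illustrative assumptions, not the actual RowOps API:

    // Hypothetical stage signatures; names and shapes are illustrative,
    // not the actual RowOps API.
    type Row = Record<string, unknown>;

    declare function parseFile(bytes: Uint8Array): { headers: string[]; rows: Row[] };
    declare function validateRows(rows: Row[], schema: object): { valid: Row[]; invalid: object[] };
    declare function maskRows(rows: Row[], maskConfig: object): Row[];
    declare function transformRows(rows: Row[], transformConfig: object): Row[];
    declare function profileRows(rows: Row[]): object[];

    // Each stage consumes the previous stage's output; only valid rows
    // continue past validation (see "Valid vs Invalid Row Handling").
    function runPipeline(
      bytes: Uint8Array,
      schema: object,
      maskConfig: object,
      transformConfig: object,
    ) {
      const parsed = parseFile(bytes);
      const { valid, invalid } = validateRows(parsed.rows, schema);
      const masked = maskRows(valid, maskConfig);
      const transformed = transformRows(masked, transformConfig);
      const profiles = profileRows(transformed);
      return { transformed, profiles, invalid };
    }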

Parse

Purpose: Convert file bytes into tabular row data.

Inputs: Binary file streams (CSV, XLSX, and other supported formats).

Outputs: Parsed tabular representations containing column headers and row data.

What this stage does not do:

  • Does not validate field values against a schema
  • Does not apply masking or transformations
  • Does not persist row data to any server

Validate

Purpose: Apply schema-defined validation rules to each row.

Inputs: Parsed rows and a schema containing field definitions (type, required, regex, enumValues).

Outputs: A validation result containing three arrays:

  • valid[]: Rows that passed all validation rules
  • invalid[]: Validation errors with row index, field, code, and message
  • validated[]: Combined result set with per-row status

What this stage does not do:

  • Does not silently drop invalid rows
  • Does not automatically correct values
  • Does not modify the original row data
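
The shapes below are a minimal sketch of a field definition and a validation result consistent with the fields named above (type, required, regex, enumValues; valid[], invalid[], validated[]). The TypeScript names and properties are assumptions; the actual RowOps types may differ:

    // Sketch of schema and result shapes inferred from this page;
    // actual RowOps type names and properties may differ.
    type Row = Record<string, unknown>;

    interface FieldDefinition {
      type: 'string' | 'number' | 'boolean' | 'date';
      required?: boolean;
      regex?: string;        // pattern a string value must match
      enumValues?: string[]; // allowed values for enum-like fields
    }

    interface ValidationError {
      rowIndex: number; // which row failed
      field: string;    // which field failed
      code: string;     // machine-readable error code
      message: string;  // human-readable description
    }

    interface ValidationResult {
      valid: Row[];               // rows that passed all rules
      invalid: ValidationError[]; // structured errors, one per violation
      validated: Array<{ row: Row; status: 'valid' | 'invalid' }>; // every row, with status
    }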

Mask

Purpose: Apply masking strategies to sensitive fields before downstream processing or export.

Inputs: Validated rows and a MaskConfig defining masking intent per field or data type.

Outputs: Rows with masked values according to the configured strategy.

What this stage does not do:

  • Does not detect PII automatically (PII detection is a separate, partially implemented module)
  • Does not guarantee identical enforcement across all execution modes
  • Does not provide reversible encryption (masking is destructive by design)

Note: Masking behavior is implemented in the core execution engine. Configuration support exists, but runtime enforcement should be validated per execution mode.
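
As an illustration only, a per-field mask configuration might look like the sketch below. The strategy names (redact, hash, partial) and the MaskConfig shape are assumptions for this example, not documented RowOps options:

    // Hypothetical MaskConfig shape; strategy names are illustrative.
    interface MaskConfig {
      fields: {
        [fieldName: string]: {
          strategy: 'redact' | 'hash' | 'partial'; // assumed strategy names
          keepLast?: number; // e.g. keep the last N chars for 'partial'
        };
      };
    }

    const maskConfig: MaskConfig = {
      fields: {
        email: { strategy: 'hash' },                 // irreversible by design
        ssn:   { strategy: 'partial', keepLast: 4 }, // e.g. "***-**-6789"
        notes: { strategy: 'redact' },               // replace the entire value
      },
    };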


Transform

Purpose: Apply schema-driven transformations to reshape or derive column values.

Inputs: Rows (typically post-masking) and a TransformPipelineConfig defining operations.

Outputs: Transformed rows with derived or modified values.

What this stage does not do:

  • Does not guarantee all DSL operations are uniformly enforced across execution modes
  • Does not support row-filtering transforms in headless mode (explicitly disallowed)

Note: Transform behavior is implemented in the core execution engine. Accepted configuration does not guarantee identical runtime enforcement across execution modes.
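
The sketch below shows what a transform pipeline configuration could look like. The operation names (rename, cast, derive) and the TransformPipelineConfig shape are assumptions for illustration; no row-filtering operation appears, since row-filtering transforms are disallowed in headless mode:

    // Hypothetical TransformPipelineConfig; operation names are illustrative.
    type TransformOp =
      | { op: 'rename'; from: string; to: string }
      | { op: 'cast'; field: string; to: 'string' | 'number' | 'date' }
      | { op: 'derive'; target: string; expression: string }; // DSL expression

    interface TransformPipelineConfig {
      operations: TransformOp[]; // applied in order to each row
    }

    const transformConfig: TransformPipelineConfig = {
      operations: [
        { op: 'rename', from: 'dob', to: 'dateOfBirth' },
        { op: 'cast', field: 'amount', to: 'number' },
        { op: 'derive', target: 'fullName', expression: "concat(firstName, ' ', lastName)" },
      ],
    };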


Profile

Purpose: Generate statistical metadata about column contents.

Inputs: Rows from the previous stage.

Outputs: ColumnProfile[] containing derived statistical metadata such as:

  • Total, non-null, and null counts
  • Distinct value count
  • Top values by frequency
  • Min/max for numeric columns
  • Inferred type

What this stage does not do:

  • Does not modify row data
  • Does not persist profile results to any server
  • Does not provide all features at all tiers (some statistics are tier-gated)
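
A minimal sketch of a ColumnProfile shape consistent with the statistics listed above; the property names are assumptions:

    // Sketch of a ColumnProfile based on the statistics listed above;
    // actual property names may differ, and some fields are tier-gated.
    interface ColumnProfile {
      column: string;
      totalCount: number;
      nonNullCount: number;
      nullCount: number;
      distinctCount: number;
      topValues: Array<{ value: unknown; count: number }>; // by frequency
      min?: number; // numeric columns only
      max?: number; // numeric columns only
      inferredType: 'string' | 'number' | 'boolean' | 'date' | 'mixed';
    }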

Valid vs Invalid Row Handling

Invalid rows are identified during the Validate stage based on schema rules. The observed behavior is:

  1. Invalid rows are not silently dropped. The validation result includes both valid and invalid rows with explicit status markers.

  2. Invalid rows are accessible after validation. The validated[] array contains every row with its validation status, allowing downstream consumers to inspect or export error rows.

  3. Invalid rows do not proceed through Mask/Transform by default. In observed pipeline flows, only valid rows continue to masking and transformation stages. Invalid rows remain available for inspection or error export.

  4. Users can export invalid rows separately. In observed flows, invalid rows can be exported on their own, separate from the valid set (see the sketch after this list).
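
A sketch of how a downstream consumer might split rows after validation, assuming the validated[] shape described in the Validate stage (the names here are illustrative, not the actual RowOps API):

    // Illustrative consumer code; assumes the ValidationResult shape
    // sketched in the Validate section.
    type Row = Record<string, unknown>;

    interface ValidatedRow { row: Row; status: 'valid' | 'invalid' }

    function splitForExport(validated: ValidatedRow[]) {
      // Only valid rows continue to Mask/Transform by default.
      const toProcess = validated.filter(v => v.status === 'valid').map(v => v.row);
      // Invalid rows remain available for inspection or error export.
      const toExport = validated.filter(v => v.status === 'invalid').map(v => v.row);
      return { toProcess, toExport };
    }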

What users cannot do:

  • Automatically "fix" invalid rows within the pipeline
  • Configure the pipeline to apply transforms to invalid rows
  • Suppress validation errors without modifying the schema

Determinism and Replay

Design Intent

The pipeline is designed to produce identical outputs given identical inputs, schema, and configuration. This supports replay scenarios where the same file can be re-processed to yield the same results.
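
One way a caller could exercise this intent is to run the pipeline twice over identical inputs and compare a fingerprint of the serialized output. The sketch below uses Node's built-in crypto module; runPipeline is a hypothetical entry point, not the actual RowOps API:

    // Replay check sketch: hash the serialized output of two runs over
    // identical inputs and configuration. runPipeline is hypothetical.
    import { createHash } from 'node:crypto';

    declare function runPipeline(bytes: Uint8Array, config: object): object;

    function outputFingerprint(bytes: Uint8Array, config: object): string {
      const result = runPipeline(bytes, config);
      // Stable field ordering in outputs is what makes this comparison meaningful.
      return createHash('sha256').update(JSON.stringify(result)).digest('hex');
    }

    // Identical inputs and config should yield identical fingerprints:
    //   outputFingerprint(fileBytes, config) === outputFingerprint(fileBytes, config)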

Observed Behavior

  • In tested headless paths, the pipeline aims for stable field ordering in outputs.
  • Masking and transform failures throw errors and halt execution rather than producing partial results.

What Breaks Replay

Replay consistency is not asserted as a universal guarantee. The following may produce different results:

  • Schema changes between runs (added/removed fields, modified rules)
  • Configuration changes (different mask strategies, transform operations)
  • Tier changes (features gated by tier may alter available operations)
  • Version updates to the processing engine

What This Pipeline Does Not Attempt

The following are explicit non-goals of this pipeline:

  • Per-cell editing: The pipeline operates on rows, not individual cells. There is no cell-level correction interface.

  • Spreadsheet-style workflows: This is not an interactive editor. Users cannot drag, merge, or manually adjust values within the pipeline.

  • Automatic "fix everything" behavior: Invalid data produces errors. The system does not guess corrections or apply heuristics to repair values.

  • Background scheduling or orchestration: The pipeline executes when invoked. There is no built-in scheduler, queue, or retry orchestration for failed imports.

  • Row-level data persistence to server: No evidence found that row-level datasets are persisted server-side. Metadata and configuration are persisted; row content is not.


Failure Modes and Constraints

Where Failures Can Occur

Stage     | Failure Type                               | Observed Behavior
----------|--------------------------------------------|---------------------------------------------
Parse     | Malformed file, encoding errors            | Pipeline halts with parse error
Validate  | Schema violation                           | Row marked invalid; pipeline continues
Mask      | Configuration error, unsupported strategy  | MaskingFailedError thrown; pipeline halts
Transform | Invalid expression, type coercion failure  | TransformFailedError thrown; pipeline halts
Profile   | Unsupported data type                      | Partial profile output may be returned depending on data and configuration

How Failures Are Surfaced

  • Client-side errors: Thrown as typed exceptions (MaskingFailedError, TransformFailedError)
  • Validation errors: Returned in the invalid[] array with structured error objects
  • Console output: Prefixed with [RowOps] for client-side messages
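
Callers can branch on the typed exceptions named above. This sketch assumes MaskingFailedError and TransformFailedError are exported error classes and that runPipeline is the entry point; both declarations are stand-ins for illustration:

    // Error-handling sketch; runPipeline and the error declarations stand in
    // for whatever RowOps actually exports.
    declare class MaskingFailedError extends Error {}
    declare class TransformFailedError extends Error {}
    declare function runPipeline(bytes: Uint8Array, config: object): object;

    function runWithReporting(bytes: Uint8Array, config: object) {
      try {
        return runPipeline(bytes, config);
      } catch (err) {
        if (err instanceof MaskingFailedError) {
          // Fail-closed: no fallback to unmasked data.
          console.error('[RowOps] masking failed:', err.message);
        } else if (err instanceof TransformFailedError) {
          // Fail-closed: no fallback to untransformed data.
          console.error('[RowOps] transform failed:', err.message);
        } else {
          throw err; // parse/engine failures are not recovered from
        }
        return null;
      }
    }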

What the System Does Not Attempt to Recover From

  • Parse failures on malformed files (no partial parse recovery)
  • Masking failures (fail-closed; no fallback to unmasked data)
  • Transform failures (fail-closed; no fallback to untransformed data)
  • Engine initialization failures (pipeline cannot proceed)