Pipeline Overview

Why This Pipeline Exists

RowOps processes tabular data through a sequence of discrete stages. The pipeline separates concerns (parsing, validation, masking, transformation, and profiling) into individually configurable steps. Each stage operates on the output of the previous one, and the pipeline aims to produce deterministic results given identical inputs and configuration.


Canonical Pipeline Stages

Data flows through the pipeline in this order:

Parse → Validate → Mask → Transform → Profile
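
As a rough illustration of how the stages chain together, the sketch below wires up hypothetical stage functions in TypeScript. The names (parseFile, validateRows, maskRows, transformRows, profileRows, runPipeline) and shapes are illustrative assumptions, not the actual RowOps API:

    // Hypothetical stage signatures; names and shapes are illustrative,
    // not the actual RowOps API.
    type Row = Record<string, unknown>;

    declare function parseFile(bytes: Uint8Array): { headers: string[]; rows: Row[] };
    declare function validateRows(rows: Row[], schema: object): { valid: Row[]; invalid: object[] };
    declare function maskRows(rows: Row[], maskConfig: object): Row[];
    declare function transformRows(rows: Row[], transformConfig: object): Row[];
    declare function profileRows(rows: Row[]): object[];

    // Each stage consumes the previous stage's output; only valid rows
    // continue past validation (see "Valid vs Invalid Row Handling").
    function runPipeline(
      bytes: Uint8Array,
      schema: object,
      maskConfig: object,
      transformConfig: object,
    ) {
      const parsed = parseFile(bytes);
      const { valid, invalid } = validateRows(parsed.rows, schema);
      const masked = maskRows(valid, maskConfig);
      const transformed = transformRows(masked, transformConfig);
      const profiles = profileRows(transformed);
      return { transformed, profiles, invalid };
    }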

Parse

Purpose: Convert file bytes into tabular row data.

Inputs: Binary file streams (CSV, XLSX, and other supported formats).

Outputs: Parsed tabular representations containing column headers and row data.

What this stage does not do:

  • Does not validate field values against a schema
  • Does not apply masking or transformations
  • Does not persist row data to any server

Validate

Purpose: Apply schema-defined validation rules to each row.

Inputs: Parsed rows and a schema containing field definitions (type, required, regex, enumValues).

Outputs: A validation result containing three arrays:

  • valid[]: Rows that passed all validation rules
  • invalid[]: Validation errors with row index, field, code, and message
  • validated[]: Combined result set with per-row status

What this stage does not do:

  • Does not silently drop invalid rows
  • Does not automatically correct values
  • Does not modify the original row data
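
The shapes below are a minimal sketch of a field definition and a validation result consistent with the fields named above (type, required, regex, enumValues; valid[], invalid[], validated[]). The TypeScript names and properties are assumptions; the actual RowOps types may differ:

    // Sketch of schema and result shapes inferred from this page;
    // actual RowOps type names and properties may differ.
    type Row = Record<string, unknown>;

    interface FieldDefinition {
      type: 'string' | 'number' | 'boolean' | 'date';
      required?: boolean;
      regex?: string;        // pattern a string value must match
      enumValues?: string[]; // allowed values for enum-like fields
    }

    interface ValidationError {
      rowIndex: number; // which row failed
      field: string;    // which field failed
      code: string;     // machine-readable error code
      message: string;  // human-readable description
    }

    interface ValidationResult {
      valid: Row[];               // rows that passed all rules
      invalid: ValidationError[]; // structured errors, one per violation
      validated: Array<{ row: Row; status: 'valid' | 'invalid' }>; // every row, with status
    }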

Mask

Purpose: Apply masking strategies to sensitive fields before downstream processing or export.

Inputs: Validated rows and a MaskConfig defining masking intent per field or data type.

Outputs: Rows with masked values according to the configured strategy.

What this stage does not do:

  • Does not detect PII automatically (PII detection is a separate, partially implemented module)
  • Does not guarantee identical enforcement across all execution modes
  • Does not provide reversible encryption (masking is destructive by design)

Note: Masking behavior is implemented in the core execution engine. Configuration support exists, but runtime enforcement should be validated per execution mode.
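
As an illustration only, a per-field mask configuration might look like the sketch below. The strategy names (redact, hash, partial) and the MaskConfig shape are assumptions for this example, not documented RowOps options:

    // Hypothetical MaskConfig shape; strategy names are illustrative.
    interface MaskConfig {
      fields: {
        [fieldName: string]: {
          strategy: 'redact' | 'hash' | 'partial'; // assumed strategy names
          keepLast?: number; // e.g. keep the last N chars for 'partial'
        };
      };
    }

    const maskConfig: MaskConfig = {
      fields: {
        email: { strategy: 'hash' },                 // irreversible by design
        ssn:   { strategy: 'partial', keepLast: 4 }, // e.g. "***-**-6789"
        notes: { strategy: 'redact' },               // replace the entire value
      },
    };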


Transform

Purpose: Apply schema-driven transformations to reshape or derive column values.

Inputs: Rows (typically post-masking) and a TransformPipelineConfig defining operations.

Outputs: Transformed rows with derived or modified values.

What this stage does not do:

  • Does not guarantee all DSL operations are uniformly enforced across execution modes
  • Does not support row-filtering transforms in headless mode (explicitly disallowed)

Note: Transform behavior is implemented in the core execution engine. Accepted configuration does not guarantee identical runtime enforcement across execution modes.
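
The sketch below shows what a transform pipeline configuration could look like. The operation names (rename, cast, derive) and the TransformPipelineConfig shape are assumptions for illustration; no row-filtering operation appears, since row-filtering transforms are disallowed in headless mode:

    // Hypothetical TransformPipelineConfig; operation names are illustrative.
    type TransformOp =
      | { op: 'rename'; from: string; to: string }
      | { op: 'cast'; field: string; to: 'string' | 'number' | 'date' }
      | { op: 'derive'; target: string; expression: string }; // DSL expression

    interface TransformPipelineConfig {
      operations: TransformOp[]; // applied in order to each row
    }

    const transformConfig: TransformPipelineConfig = {
      operations: [
        { op: 'rename', from: 'dob', to: 'dateOfBirth' },
        { op: 'cast', field: 'amount', to: 'number' },
        { op: 'derive', target: 'fullName', expression: "concat(firstName, ' ', lastName)" },
      ],
    };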


Profile

Purpose: Generate statistical metadata about column contents.

Inputs: Rows from the previous stage.

Outputs: ColumnProfile[] containing derived statistical metadata such as:

  • Total, non-null, and null counts
  • Distinct value count
  • Top values by frequency
  • Min/max for numeric columns
  • Inferred type

What this stage does not do:

  • Does not modify row data
  • Does not persist profile results to any server
  • Does not provide all features at all tiers (some statistics are tier-gated)
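
A minimal sketch of a ColumnProfile shape consistent with the statistics listed above; the property names are assumptions:

    // Sketch of a ColumnProfile based on the statistics listed above;
    // actual property names may differ, and some fields are tier-gated.
    interface ColumnProfile {
      column: string;
      totalCount: number;
      nonNullCount: number;
      nullCount: number;
      distinctCount: number;
      topValues: Array<{ value: unknown; count: number }>; // by frequency
      min?: number; // numeric columns only
      max?: number; // numeric columns only
      inferredType: 'string' | 'number' | 'boolean' | 'date' | 'mixed';
    }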

Valid vs Invalid Row Handling

Invalid rows are identified during the Validate stage based on schema rules. The observed behavior is:

  1. Invalid rows are not silently dropped. The validation result includes both valid and invalid rows with explicit status markers.

  2. Invalid rows are accessible after validation. The validated[] array contains every row with its validation status, allowing downstream consumers to inspect or export error rows.

  3. Invalid rows do not proceed through Mask/Transform by default. In observed pipeline flows, only valid rows continue to masking and transformation stages. Invalid rows remain available for inspection or error export.

  4. Users can export invalid rows separately. In observed flows, invalid rows can be exported on their own, separate from the valid set (see the sketch after this list).
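
A sketch of how a downstream consumer might split rows after validation, assuming the validated[] shape described in the Validate stage (the names here are illustrative, not the actual RowOps API):

    // Illustrative consumer code; assumes the ValidationResult shape
    // sketched in the Validate section.
    type Row = Record<string, unknown>;

    interface ValidatedRow { row: Row; status: 'valid' | 'invalid' }

    function splitForExport(validated: ValidatedRow[]) {
      // Only valid rows continue to Mask/Transform by default.
      const toProcess = validated.filter(v => v.status === 'valid').map(v => v.row);
      // Invalid rows remain available for inspection or error export.
      const toExport = validated.filter(v => v.status === 'invalid').map(v => v.row);
      return { toProcess, toExport };
    }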

What users cannot do:

  • Automatically "fix" invalid rows within the pipeline
  • Configure the pipeline to apply transforms to invalid rows
  • Suppress validation errors without modifying the schema

Determinism and Replay

Design Intent

The pipeline is designed to produce identical outputs given identical inputs, schema, and configuration. This supports replay scenarios where the same file can be re-processed to yield the same results.
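
One way a caller could exercise this intent is to run the pipeline twice over identical inputs and compare a fingerprint of the serialized output. The sketch below uses Node's built-in crypto module; runPipeline is a hypothetical entry point, not the actual RowOps API:

    // Replay check sketch: hash the serialized output of two runs over
    // identical inputs and configuration. runPipeline is hypothetical.
    import { createHash } from 'node:crypto';

    declare function runPipeline(bytes: Uint8Array, config: object): object;

    function outputFingerprint(bytes: Uint8Array, config: object): string {
      const result = runPipeline(bytes, config);
      // Stable field ordering in outputs is what makes this comparison meaningful.
      return createHash('sha256').update(JSON.stringify(result)).digest('hex');
    }

    // Identical inputs and config should yield identical fingerprints:
    //   outputFingerprint(fileBytes, config) === outputFingerprint(fileBytes, config)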

Observed Behavior

  • In tested headless paths, the pipeline aims for stable field ordering in outputs.
  • Masking and transform failures throw errors and halt execution rather than producing partial results.

What Breaks Replay

Replay consistency is not asserted as a universal guarantee. The following may produce different results:

  • Schema changes between runs (added/removed fields, modified rules)
  • Configuration changes (different mask strategies, transform operations)
  • Tier changes (features gated by tier may alter available operations)
  • Version updates to the processing engine

What This Pipeline Does Not Attempt

The following are explicit non-goals of this pipeline:

  • Per-cell editing: The pipeline operates on rows, not individual cells. There is no cell-level correction interface.

  • Spreadsheet-style workflows: This is not an interactive editor. Users cannot drag, merge, or manually adjust values within the pipeline.

  • Automatic "fix everything" behavior: Invalid data produces errors. The system does not guess corrections or apply heuristics to repair values.

  • Background scheduling or orchestration: The pipeline executes when invoked. There is no built-in scheduler, queue, or retry orchestration for failed imports.

  • Row-level data persistence to server: No evidence found that row-level datasets are persisted server-side. Metadata and configuration are persisted; row content is not.


Failure Modes and Constraints

Where Failures Can Occur

Stage     | Failure Type                               | Observed Behavior
----------|--------------------------------------------|---------------------------------------------
Parse     | Malformed file, encoding errors            | Pipeline halts with parse error
Validate  | Schema violation                           | Row marked invalid; pipeline continues
Mask      | Configuration error, unsupported strategy  | MaskingFailedError thrown; pipeline halts
Transform | Invalid expression, type coercion failure  | TransformFailedError thrown; pipeline halts
Profile   | Unsupported data type                      | Partial profile output may be returned depending on data and configuration

How Failures Are Surfaced

  • Client-side errors: Thrown as typed exceptions (MaskingFailedError, TransformFailedError)
  • Validation errors: Returned in the invalid[] array with structured error objects
  • Console output: Prefixed with [RowOps] for client-side messages
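
Callers can branch on the typed exceptions named above. This sketch assumes MaskingFailedError and TransformFailedError are exported error classes and that runPipeline is the entry point; both declarations are stand-ins for illustration:

    // Error-handling sketch; runPipeline and the error declarations stand in
    // for whatever RowOps actually exports.
    declare class MaskingFailedError extends Error {}
    declare class TransformFailedError extends Error {}
    declare function runPipeline(bytes: Uint8Array, config: object): object;

    function runWithReporting(bytes: Uint8Array, config: object) {
      try {
        return runPipeline(bytes, config);
      } catch (err) {
        if (err instanceof MaskingFailedError) {
          // Fail-closed: no fallback to unmasked data.
          console.error('[RowOps] masking failed:', err.message);
        } else if (err instanceof TransformFailedError) {
          // Fail-closed: no fallback to untransformed data.
          console.error('[RowOps] transform failed:', err.message);
        } else {
          throw err; // parse/engine failures are not recovered from
        }
        return null;
      }
    }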

What the System Does Not Attempt to Recover From

  • Parse failures on malformed files (no partial parse recovery)
  • Masking failures (fail-closed; no fallback to unmasked data)
  • Transform failures (fail-closed; no fallback to untransformed data)
  • Engine initialization failures (pipeline cannot proceed)