Parse Module

The Parse module converts binary file streams into tabular data structures. It is the first stage in the data ingestion pipeline.


Purpose

Convert file bytes into tabular row data, extracting column headers and row values from supported file formats.


When It Runs

Pipeline Position: First stage (entry point)

**Parse** → Validate → Mask → Transform → Profile

Parsing is the initial transformation from file bytes to structured data.


Inputs

| Input | Type | Description |
| --- | --- | --- |
| File stream | Binary stream | File bytes (CSV, XLSX, or other supported format) |
| Config | Format-specific config | Optional parsing configuration |

Outputs

| Output | Type | Description |
| --- | --- | --- |
| IPC stream | Arrow IPC | RecordBatch stream + table boundary events |

The canonical output is an IPC event stream:

```ts
for await (const event of parseToArrowIpc(file, { datasetNonce: 42 })) {
  if (event.type === "PARSE_CHUNK") {
    // event.ipcBytes is Arrow IPC (RecordBatch stream)
  }
}
```

Each emitted row includes `__ro_row_id` for stable identity.
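
To decode the chunk payloads downstream, any Arrow IPC reader works. A minimal sketch, assuming the apache-arrow JavaScript package (this module does not prescribe a decoding library):

```ts
import { tableFromIPC } from "apache-arrow";

for await (const event of parseToArrowIpc(file, { datasetNonce: 42 })) {
  if (event.type === "PARSE_CHUNK") {
    // Decode the RecordBatch stream bytes into an Arrow Table.
    const table = tableFromIPC(event.ipcBytes);

    // __ro_row_id carries each row's stable identity across stages.
    const rowIds = table.getChild("__ro_row_id");
    console.log(table.numRows, rowIds?.get(0));
  }
}
```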


Configuration

Format-specific configuration objects are available for supported formats:

| Format | Config Type | Purpose |
| --- | --- | --- |
| CSV/TSV | `ParseConfig` | Base parse options (`maxRows`, `trimStrings`, etc.) |
| XLSX | `{ sheetName?: string; config?: ParseConfig }` | Sheet selection + base options |
| PDF | `PdfParseConfig` | Text-only extraction (experimental) |
| DOCX | `DocxParseConfig` | Header handling |
| ZIP | `ZipParseConfig` | Entry constraints + nested format config |
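
As an illustration of how these configs might be passed, here is a hedged sketch: the option names (`maxRows`, `trimStrings`, `sheetName`) come from the table above, but the exact shape of the second argument to `parseToArrowIpc` is an assumption:

```ts
// CSV: base ParseConfig options. How the config nests inside the
// options argument is an assumption; field names are from the table.
const csvEvents = parseToArrowIpc(csvFile, {
  datasetNonce: 1,
  config: { maxRows: 10_000, trimStrings: true },
});

// XLSX: sheet selection wrapping a base ParseConfig, per the
// { sheetName?: string; config?: ParseConfig } shape.
const xlsxEvents = parseToArrowIpc(xlsxFile, {
  datasetNonce: 2,
  config: { sheetName: "Q3", config: { maxRows: 10_000 } },
});
```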

Supported Formats

Actively Supported

| Format | Status | Notes |
| --- | --- | --- |
| CSV | Active | Streaming parser |
| TSV | Active | Tab-delimited parsing |
| XLSX | Active | Full support |

Available via RowOpsParse (parse-react)

The following formats are available via the standalone RowOpsParse UI:

| Format | Status |
| --- | --- |
| JSON | Available via RowOpsParse |
| XML | Available via RowOpsParse |
| HTML | Available via RowOpsParse |
| Fixed-Width | Available via RowOpsParse |
| DOCX | Available via RowOpsParse (tables only) |
| ZIP | Available via RowOpsParse (multi-table, schema match optional; see the sketch below) |
| PDF | Experimental (text-only, no OCR) |

PDF parsing is experimental, text-only (no OCR), and uses pdfjs-dist in a worker.
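
For multi-table sources such as ZIP, the output stream interleaves the table boundary events (listed under Outputs) with row chunks. The event name `TABLE_BOUNDARY` below is hypothetical; only `PARSE_CHUNK` and `ROW_LIMIT_REACHED` are confirmed event types:

```ts
// Hypothetical multi-table consumption for a ZIP archive.
// "TABLE_BOUNDARY" and routeChunk are assumed names, not confirmed API.
let tableIndex = 0;
for await (const event of parseToArrowIpc(zipFile, { datasetNonce: 3 })) {
  if (event.type === "TABLE_BOUNDARY") {
    tableIndex += 1; // a new archive entry / table begins
  } else if (event.type === "PARSE_CHUNK") {
    routeChunk(tableIndex, event.ipcBytes); // hypothetical consumer
  }
}
```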


What This Module Does Not Do

- Does not validate field values against a schema: parsing extracts data; validation is a separate stage
- Does not apply masking or transformations: data is parsed as-is
- Does not persist row data to any server: parsing occurs client-side
- Does not infer schema types: type inference occurs during profiling or validation

Constraints

Client-Side Execution

All parsing occurs in the client environment (browser or Node.js). File bytes never leave the client during parsing.

Memory

Large files are processed in streaming chunks where the format supports it. However, some formats (for example XLSX, which is a ZIP container) may require the complete file content to reside in memory.

Encoding

CSV parsing assumes UTF-8 encoding by default. Other encodings may require explicit configuration.
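
When explicit configuration is not available, one workaround is to transcode to UTF-8 before parsing, using the standard `TextDecoder`/`TextEncoder` APIs. A sketch (the re-wrapping into a `File` is illustrative, not part of this module's API):

```ts
// Transcode a legacy-encoded CSV (e.g. Windows-1252) to UTF-8 so it
// satisfies the parser's default encoding assumption.
async function toUtf8(file: File, encoding = "windows-1252"): Promise<File> {
  const bytes = await file.arrayBuffer();
  const text = new TextDecoder(encoding).decode(bytes);
  // TextEncoder always emits UTF-8.
  return new File([new TextEncoder().encode(text)], file.name, {
    type: file.type,
  });
}
```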


Failure Modes

| Failure | Behavior |
| --- | --- |
| Malformed file | Pipeline halts with parse error |
| Encoding errors | Pipeline halts or produces garbled data |
| Unsupported format | Error thrown before parsing begins |
| Memory exhaustion | Browser may crash on very large files |

Parse failures are fatal for malformed or unsupported inputs. Row-limit enforcement defaults to `stop_with_warning`, which returns partial results and emits `ROW_LIMIT_REACHED` with skipped-row metadata.
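
A sketch of consuming the stream defensively under these failure modes; the `skippedRows` field is an assumed shape for the skipped-row metadata mentioned above:

```ts
try {
  for await (const event of parseToArrowIpc(file, { datasetNonce: 4 })) {
    if (event.type === "PARSE_CHUNK") {
      consume(event.ipcBytes); // hypothetical downstream consumer
    } else if (event.type === "ROW_LIMIT_REACHED") {
      // Default stop_with_warning policy: partial results were emitted.
      // The metadata shape on this event is an assumption.
      console.warn("Row limit reached:", event.skippedRows);
    }
  }
} catch (err) {
  // Malformed or unsupported inputs are fatal; the pipeline halts here.
  reportParseFailure(err); // hypothetical error handler
}
```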


Observed Status

Partial implementation. CSV and XLSX parsing is actively used in importer flows. Additional formats are exposed via RowOpsParse (parse-react).