Parse Module

The Parse module converts binary file streams into tabular data structures. It is the first stage in the data ingestion pipeline.


Purpose

Convert file bytes into tabular row data, extracting column headers and row values from supported file formats.


When It Runs

Pipeline Position: First stage (entry point)

**Parse** → Validate → Mask → Transform → Profile

Parsing is the initial transformation from file bytes to structured data.


Inputs

| Input | Type | Description |
| --- | --- | --- |
| File stream | Binary stream | File bytes (CSV, XLSX, or other supported format) |
| Config | Format-specific config | Optional parsing configuration |

Outputs

| Output | Type | Description |
| --- | --- | --- |
| IPC stream | Arrow IPC | RecordBatch stream + table boundary events |

The canonical output is an IPC event stream:

```ts
for await (const event of parseToArrowIpc(file, { datasetNonce: 42 })) {
  if (event.type === "PARSE_CHUNK") {
    // event.ipcBytes is Arrow IPC (RecordBatch stream)
  }
}
```

Each emitted row includes `__ro_row_id` for stable identity.
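
To decode the chunk payloads downstream, any Arrow IPC reader works. A minimal sketch, assuming the apache-arrow JavaScript package (this module does not prescribe a decoding library):

```ts
import { tableFromIPC } from "apache-arrow";

for await (const event of parseToArrowIpc(file, { datasetNonce: 42 })) {
  if (event.type === "PARSE_CHUNK") {
    // Decode the RecordBatch stream bytes into an Arrow Table.
    const table = tableFromIPC(event.ipcBytes);

    // __ro_row_id carries each row's stable identity across stages.
    const rowIds = table.getChild("__ro_row_id");
    console.log(table.numRows, rowIds?.get(0));
  }
}
```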


Configuration

Format-specific configuration objects are available for supported formats:

| Format | Config Type | Purpose |
| --- | --- | --- |
| CSV/TSV | `ParseConfig` | Base parse options (`maxRows`, `trimStrings`, etc.) |
| XLSX | `{ sheetName?: string; config?: ParseConfig }` | Sheet selection + base options |
| PDF | `PdfParseConfig` | Text-only extraction (experimental) |
| DOCX | `DocxParseConfig` | Header handling |
| ZIP | `ZipParseConfig` | Entry constraints + nested format config |
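
As an illustration of how these configs might be passed, here is a hedged sketch: the option names (`maxRows`, `trimStrings`, `sheetName`) come from the table above, but the exact shape of the second argument to `parseToArrowIpc` is an assumption:

```ts
// CSV: base ParseConfig options. How the config nests inside the
// options argument is an assumption; field names are from the table.
const csvEvents = parseToArrowIpc(csvFile, {
  datasetNonce: 1,
  config: { maxRows: 10_000, trimStrings: true },
});

// XLSX: sheet selection wrapping a base ParseConfig, per the
// { sheetName?: string; config?: ParseConfig } shape.
const xlsxEvents = parseToArrowIpc(xlsxFile, {
  datasetNonce: 2,
  config: { sheetName: "Q3", config: { maxRows: 10_000 } },
});
```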

Supported Formats

Actively Supported

| Format | Status | Notes |
| --- | --- | --- |
| CSV | Active | Streaming parser |
| TSV | Active | Tab-delimited parsing |
| XLSX | Active | Full support |

Available via RowOpsParse (parse-react)

The following formats are available via the standalone RowOpsParse UI:

| Format | Status |
| --- | --- |
| JSON | Available via RowOpsParse |
| XML | Available via RowOpsParse |
| HTML | Available via RowOpsParse |
| Fixed-Width | Available via RowOpsParse |
| DOCX | Available via RowOpsParse (tables only) |
| ZIP | Available via RowOpsParse (multi-table, schema match optional; see the sketch below) |
| PDF | Experimental (text-only, no OCR) |

PDF parsing is experimental, text-only (no OCR), and uses pdfjs-dist in a worker.
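
For multi-table sources such as ZIP, the output stream interleaves the table boundary events (listed under Outputs) with row chunks. The event name `TABLE_BOUNDARY` below is hypothetical; only `PARSE_CHUNK` and `ROW_LIMIT_REACHED` are confirmed event types:

```ts
// Hypothetical multi-table consumption for a ZIP archive.
// "TABLE_BOUNDARY" and routeChunk are assumed names, not confirmed API.
let tableIndex = 0;
for await (const event of parseToArrowIpc(zipFile, { datasetNonce: 3 })) {
  if (event.type === "TABLE_BOUNDARY") {
    tableIndex += 1; // a new archive entry / table begins
  } else if (event.type === "PARSE_CHUNK") {
    routeChunk(tableIndex, event.ipcBytes); // hypothetical consumer
  }
}
```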


What This Module Does Not Do

- Does not validate field values against a schema: parsing extracts data; validation is a separate stage
- Does not apply masking or transformations: data is parsed as-is
- Does not persist row data to any server: parsing occurs client-side
- Does not infer schema types: type inference occurs during profiling or validation

Constraints

Client-Side Execution

All parsing occurs in the client environment (browser or Node.js). File bytes never leave the client during parsing.

Memory

Large files are processed in streaming chunks where the format supports it. However, some formats (for example XLSX, which is a ZIP container) may require the complete file content to reside in memory.

Encoding

CSV parsing assumes UTF-8 encoding by default. Other encodings may require explicit configuration.
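
When explicit configuration is not available, one workaround is to transcode to UTF-8 before parsing, using the standard `TextDecoder`/`TextEncoder` APIs. A sketch (the re-wrapping into a `File` is illustrative, not part of this module's API):

```ts
// Transcode a legacy-encoded CSV (e.g. Windows-1252) to UTF-8 so it
// satisfies the parser's default encoding assumption.
async function toUtf8(file: File, encoding = "windows-1252"): Promise<File> {
  const bytes = await file.arrayBuffer();
  const text = new TextDecoder(encoding).decode(bytes);
  // TextEncoder always emits UTF-8.
  return new File([new TextEncoder().encode(text)], file.name, {
    type: file.type,
  });
}
```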


Failure Modes

| Failure | Behavior |
| --- | --- |
| Malformed file | Pipeline halts with parse error |
| Encoding errors | Pipeline halts or produces garbled data |
| Unsupported format | Error thrown before parsing begins |
| Memory exhaustion | Browser may crash on very large files |

Parse failures are fatal for malformed or unsupported inputs. Row-limit enforcement defaults to `stop_with_warning`, which returns partial results and emits `ROW_LIMIT_REACHED` with skipped-row metadata.
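
A sketch of consuming the stream defensively under these failure modes; the `skippedRows` field is an assumed shape for the skipped-row metadata mentioned above:

```ts
try {
  for await (const event of parseToArrowIpc(file, { datasetNonce: 4 })) {
    if (event.type === "PARSE_CHUNK") {
      consume(event.ipcBytes); // hypothetical downstream consumer
    } else if (event.type === "ROW_LIMIT_REACHED") {
      // Default stop_with_warning policy: partial results were emitted.
      // The metadata shape on this event is an assumption.
      console.warn("Row limit reached:", event.skippedRows);
    }
  }
} catch (err) {
  // Malformed or unsupported inputs are fatal; the pipeline halts here.
  reportParseFailure(err); // hypothetical error handler
}
```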


Observed Status

Partial implementation. CSV and XLSX parsing is actively used in importer flows. Additional formats are exposed via RowOpsParse (parse-react).