Parse Module
The Parse module converts binary file streams into tabular data structures. It is the first stage in the data ingestion pipeline.
Purpose
Convert file bytes into tabular row data, extracting column headers and row values from supported file formats.
When It Runs
Pipeline Position: First stage (entry point)
**Parse** → Validate → Mask → Transform → Profile
Parsing is the initial transformation from file bytes to structured data.
Inputs
| Input | Type | Description |
|---|---|---|
| File stream | Binary stream | File bytes (CSV, XLSX, or other supported format) |
| Config | Format-specific config | Optional parsing configuration |
Outputs
| Output | Type | Description |
|---|---|---|
| IPC stream | Arrow IPC | RecordBatch stream + table boundary events |
The canonical output is an IPC event stream:
```ts
for await (const event of parseToArrowIpc(file, { datasetNonce: 42 })) {
  if (event.type === "PARSE_CHUNK") {
    // event.ipcBytes is Arrow IPC (RecordBatch stream)
  }
}
```
Each emitted row includes `__ro_row_id` for stable identity.
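To materialize rows from those IPC bytes, a consumer can decode each chunk with the `apache-arrow` package. This is a minimal sketch, not part of the Parse API itself: `tableFromIPC` comes from `apache-arrow`, and the event shape follows the example above.

```ts
import { tableFromIPC } from "apache-arrow";

for await (const event of parseToArrowIpc(file, { datasetNonce: 42 })) {
  if (event.type === "PARSE_CHUNK") {
    // Decode the RecordBatch stream into an Arrow table.
    const table = tableFromIPC(event.ipcBytes);
    for (const row of table) {
      // Each row carries the synthetic __ro_row_id column.
      console.log(row["__ro_row_id"], row.toJSON());
    }
  }
}
```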
Configuration
Format-specific configuration objects are available for supported formats (a usage sketch follows the table):
| Format | Config Type | Purpose |
|---|---|---|
| CSV/TSV | ParseConfig | Base parse options (maxRows, trimStrings, etc.) |
| XLSX | { sheetName?: string; config?: ParseConfig } | Sheet selection + base options |
| PDF | PdfParseConfig | Text-only extraction (experimental) |
| DOCX | DocxParseConfig | Header handling |
| ZIP | ZipParseConfig | Entry constraints + nested format config |
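As a usage illustration, the sketch below builds a CSV `ParseConfig` and an XLSX config object from the shapes in the table. The field names `maxRows` and `trimStrings` come from the table above; passing the config through the options bag of `parseToArrowIpc` is an assumption for illustration, not a confirmed signature.

```ts
// ParseConfig fields from the table above; the exact type lives in the parse package.
const csvConfig = {
  maxRows: 100_000, // past this, stop_with_warning returns partial results
  trimStrings: true,
};

// XLSX wraps the base options with sheet selection: { sheetName?, config? }
const xlsxConfig = {
  sheetName: "Sheet1",
  config: { maxRows: 50_000 },
};

// Assumption: the config rides along in the second argument of parseToArrowIpc.
for await (const event of parseToArrowIpc(csvFile, {
  datasetNonce: 42,
  config: csvConfig,
})) {
  // handle PARSE_CHUNK events as shown above
}
```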
Supported Formats
Actively Supported
| Format | Status | Notes |
|---|---|---|
| CSV | Active | Streaming parser |
| TSV | Active | Tab-delimited parsing |
| XLSX | Active | Full support |
Available via RowOpsParse (parse-react)
The following formats are available via the standalone RowOpsParse UI:
| Format | Status |
|---|---|
| JSON | Available via RowOpsParse |
| XML | Available via RowOpsParse |
| HTML | Available via RowOpsParse |
| Fixed-Width | Available via RowOpsParse |
| DOCX | Available via RowOpsParse (tables only) |
| ZIP | Available via RowOpsParse (multi-table, schema match optional) |
| PDF | Experimental (text-only, no OCR) |
PDF parsing is experimental, text-only (no OCR), and uses `pdfjs-dist` in a worker.
What This Module Does Not Do
- Does not validate field values against a schema: Parsing extracts data; validation is a separate stage
- Does not apply masking or transformations: Data is parsed as-is
- Does not persist row data to any server: Parsing occurs client-side
- Does not infer schema types: Type inference occurs during profiling or validation
Constraints
Client-Side Execution
All parsing occurs in the client environment (browser or Node.js). File bytes never leave the client during parsing.
Memory
Large files are processed in streaming chunks where supported. However, some formats (for example XLSX, which is a ZIP container) may require the complete file content to reside in memory.
Encoding
CSV parsing assumes UTF-8 encoding by default. Other encodings may require explicit configuration.
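When a file is not UTF-8, one option is to transcode it before handing it to the parser. This is a minimal sketch using the standard `TextDecoder`/`TextEncoder` APIs; the encoding label and the idea of re-wrapping the bytes in a `File` are illustrative assumptions, not part of the Parse API.

```ts
// Sketch: transcode a Windows-1252 CSV to UTF-8 before parsing.
async function toUtf8File(file: File, encoding = "windows-1252"): Promise<File> {
  const bytes = await file.arrayBuffer();
  const text = new TextDecoder(encoding).decode(bytes); // decode source encoding
  const utf8 = new TextEncoder().encode(text);          // re-encode as UTF-8
  return new File([utf8], file.name, { type: file.type });
}
```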
Failure Modes
| Failure | Behavior |
|---|---|
| Malformed file | Pipeline halts with parse error |
| Encoding errors | Pipeline halts or produces garbled data |
| Unsupported format | Error thrown before parsing begins |
| Memory exhaustion | Browser may crash on very large files |
Parse failures are fatal for malformed or unsupported inputs. Row-limit enforcement defaults to `stop_with_warning`, which returns partial results and emits a `ROW_LIMIT_REACHED` event with metadata about the skipped rows.
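A consumer might handle both the fatal and the partial-result paths like this. The sketch reuses the event loop from the Outputs section; the `ROW_LIMIT_REACHED` event name comes from the text above, while its payload field (`skippedRows`) is a hypothetical name for the skipped-row metadata.

```ts
// Sketch: distinguish fatal parse errors from row-limit truncation.
try {
  for await (const event of parseToArrowIpc(file, { datasetNonce: 42 })) {
    if (event.type === "PARSE_CHUNK") {
      // consume event.ipcBytes as usual
    } else if (event.type === "ROW_LIMIT_REACHED") {
      // stop_with_warning: partial results were already emitted.
      // `skippedRows` is a hypothetical field name for illustration.
      console.warn("Row limit reached; rows skipped:", event.skippedRows);
    }
  }
} catch (err) {
  // Malformed or unsupported input: the pipeline halts with a parse error.
  console.error("Parse failed:", err);
}
```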
Observed Status
Partial implementation. CSV and XLSX parsing is actively used in importer flows. Additional formats are exposed via RowOpsParse (parse-react).