# Profile Module
The Profile module generates statistical metadata about column contents, supporting data quality assessment, schema inference, and semantic type detection.
## Purpose
Generate column-level statistics including counts, type inference, value distributions, numeric ranges, quality scores, and semantic type detection. Profiling supports data quality review before downstream delivery.
## When It Runs

**Pipeline Position:** After Transform (optional)
Parse → Validate → Mask → Transform → **Profile**
Profiling can also run independently on any tabular data, not only as the final pipeline stage.
## Inputs
| Input | Type | Description |
|---|---|---|
| Data | Arrow Table / Rows | Tabular data to profile |
| Sample Size | number | Optional row limit for large datasets (defaults to 1000 on Scale+ tiers; no sampling on Free/Pro) |
## Outputs
| Output | Type | Description |
|---|---|---|
| ProfileReport | ProfileReport | Dataset-level profile with column statistics |
## ProfileReport Structure

```ts
{
  tier: string,             // Current tier (free, pro, scale, enterprise)
  topN: number,             // Top values limit
  profiles: ColumnProfile[] // Per-column statistical metadata
}
```
## ColumnProfile Structure

Each column profile contains derived statistical metadata:

```ts
{
  field: string,               // Column name
  totalCount: number,          // Total row count
  nonNullCount: number,        // Non-null value count
  nullCount: number,           // Null/empty value count
  distinctCount: number,       // Unique value count
  inferredType: string,        // Detected data type
  topValues: TopValue[],       // Most frequent values

  // Numeric columns
  minNumeric?: number,         // Minimum value
  maxNumeric?: number,         // Maximum value
  mean?: number,               // Mean value
  median?: number,             // Median value (Scale+ tier)
  stddev?: number,             // Standard deviation

  // String columns (Scale+ tier)
  minLength?: number,          // Minimum string length
  maxLength?: number,          // Maximum string length

  // Enhanced stats (Scale+ tier)
  numericStats?: NumericStats, // Detailed numeric statistics
  stringStats?: StringStats,   // Detailed string statistics
  outliers?: OutlierStats,     // Outlier detection results
  histogram?: HistogramBin[],  // Value distribution

  // Data quality (Scale+ tier)
  quality?: QualityGrade,      // Quality grade (A+ to F)
  semanticType?: SemanticType  // Detected semantic type
}
```
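As a sketch of how the basic count fields above can be derived, consider the following. The helper name `basicProfile` is hypothetical; the module's actual WASM implementation is not shown here.

```ts
type TopValue = { value: string; count: number };

// Hypothetical helper illustrating the count fields of ColumnProfile.
// Null, undefined, and empty strings are treated as "null/empty" values.
function basicProfile(field: string, values: unknown[], topN = 5) {
  const counts = new Map<string, number>();
  let nonNull = 0;
  for (const v of values) {
    if (v === null || v === undefined || v === "") continue;
    nonNull++;
    const key = String(v);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Most frequent values first, truncated to the tier's topN limit
  const topValues: TopValue[] = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([value, count]) => ({ value, count }));
  return {
    field,
    totalCount: values.length,
    nonNullCount: nonNull,
    nullCount: values.length - nonNull,
    distinctCount: counts.size,
    topValues,
  };
}
```

Note that `nullCount` here includes empty strings, matching the "Null/empty value count" comment above.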
### NumericStats Structure (Scale+ Tier)

```ts
{
  min: number,
  max: number,
  mean: number,
  median: number,
  stdDev: number,
  mode?: number // Most frequent value (rounded)
}
```
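The `median` and `mode` fields can be sketched as follows; `medianAndMode` is a hypothetical helper, and `mode` follows the comment above (most frequent value after rounding):

```ts
// Hypothetical sketch of the median/mode computation, not the module's exact code.
function medianAndMode(values: number[]): { median: number; mode: number } {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = sorted.length >> 1;
  // Even-length arrays average the two middle values
  const median =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  // Mode: most frequent value after rounding to the nearest integer
  const freq = new Map<number, number>();
  for (const v of values) {
    const r = Math.round(v);
    freq.set(r, (freq.get(r) ?? 0) + 1);
  }
  let mode = NaN;
  let best = 0;
  for (const [v, c] of freq) {
    if (c > best) { best = c; mode = v; }
  }
  return { median, mode };
}
```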
### StringStats Structure (Scale+ Tier)

```ts
{
  minLength: number,
  maxLength: number,
  avgLength: number
}
```
### OutlierStats Structure (Scale+ Tier)

```ts
{
  count: number,          // IQR-based outlier count
  q1: number,             // First quartile
  q3: number,             // Third quartile
  iqr: number,            // Interquartile range
  lowerBound: number,     // Q1 - 1.5*IQR
  upperBound: number,     // Q3 + 1.5*IQR
  zScoreOutliers: number  // Z-score based outliers (|z| > 3)
}
```
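A minimal sketch of the IQR and z-score rules described above. The function name `outlierStats` is hypothetical, and the linear quantile interpolation is an assumption; the module's exact quartile method is not documented here.

```ts
// Hypothetical sketch of OutlierStats. Assumes at least one value.
function outlierStats(values: number[]) {
  const sorted = [...values].sort((a, b) => a - b);
  // Linear-interpolation quantile (an assumption, not the module's exact method)
  const quantile = (p: number) => {
    const idx = (sorted.length - 1) * p;
    const lo = Math.floor(idx);
    const hi = Math.ceil(idx);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
  };
  const q1 = quantile(0.25);
  const q3 = quantile(0.75);
  const iqr = q3 - q1;
  const lowerBound = q1 - 1.5 * iqr;
  const upperBound = q3 + 1.5 * iqr;
  const count = values.filter((v) => v < lowerBound || v > upperBound).length;

  // Z-score rule: |z| > 3 relative to the population mean/stddev
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const stdDev = Math.sqrt(
    values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length
  );
  const zScoreOutliers =
    stdDev === 0
      ? 0
      : values.filter((v) => Math.abs((v - mean) / stdDev) > 3).length;

  return { count, q1, q3, iqr, lowerBound, upperBound, zScoreOutliers };
}
```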
## QualityGrade
Quality grades indicate overall column data quality:
| Grade | Score Range | Description |
|---|---|---|
| A+ | ≥ 95% | Excellent quality |
| A | 85-94% | Good quality |
| B | 75-84% | Acceptable quality |
| C | 60-74% | Needs improvement |
| D | 40-59% | Poor quality |
| F | < 40% | Critical issues |
Quality score is computed using a weighted formula:
- Completeness (40%): Non-null ratio
- Type confidence (25%): Consistent type inference
- Uniqueness (20%): Distinct value ratio
- Outlier penalty (10%): Reduction for outliers
- Semantic score (5%): Bonus for detected semantic types
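The weighted formula can be sketched as follows. Only the weights and grade boundaries come from the tables above; the exact sub-score definitions (e.g. how `typeConfidence` is measured) are assumptions.

```ts
// Hypothetical sketch of the weighted quality score. All inputs are 0..1.
function qualityScore(parts: {
  completeness: number;   // non-null ratio
  typeConfidence: number; // share of values matching the inferred type
  uniqueness: number;     // distinct-value ratio
  outlierRatio: number;   // outliers / non-null values (applied as a penalty)
  semanticScore: number;  // 1 if a semantic type was detected, else 0
}): number {
  const score =
    0.40 * parts.completeness +
    0.25 * parts.typeConfidence +
    0.20 * parts.uniqueness +
    0.10 * (1 - parts.outlierRatio) +
    0.05 * parts.semanticScore;
  return score * 100; // percentage
}

// Grade boundaries from the QualityGrade table
function grade(score: number): string {
  if (score >= 95) return "A+";
  if (score >= 85) return "A";
  if (score >= 75) return "B";
  if (score >= 60) return "C";
  if (score >= 40) return "D";
  return "F";
}
```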
## SemanticType
Detected semantic types for columns:
| Type | Description |
|---|---|
| email | Email addresses |
| phone | Phone numbers |
| url | URLs/web addresses |
| uuid | UUIDs |
| date | Date values |
| currency | Currency values |
| ssn | Social Security Numbers |
| ip_address | IP addresses (v4/v6) |
| zipcode | ZIP/postal codes |
| creditcard | Credit card numbers |
Semantic detection uses pattern matching: a semantic type is assigned when at least 80% of the examined values match its pattern.
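A sketch of this threshold-based detection follows. The regexes here are simplified stand-ins for two of the types above, not the module's actual patterns, and `detectSemanticType` is a hypothetical name.

```ts
// Simplified illustrative patterns; real detection patterns are more thorough.
const PATTERNS: Record<string, RegExp> = {
  email: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
  uuid: /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i,
};

// Returns the first type whose pattern matches >= 80% of non-empty values.
function detectSemanticType(values: string[]): string | undefined {
  const nonEmpty = values.filter((v) => v.length > 0);
  if (nonEmpty.length === 0) return undefined;
  for (const [type, re] of Object.entries(PATTERNS)) {
    const matches = nonEmpty.filter((v) => re.test(v)).length;
    if (matches / nonEmpty.length >= 0.8) return type; // 80% confidence threshold
  }
  return undefined;
}
```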
## Configuration
Profiling behavior is influenced by tier-based feature access. Sample size defaults vary by tier:
| Tier | Default Sample Size |
|---|---|
| Free | Full dataset (no sampling) |
| Pro | Full dataset (no sampling) |
| Scale | 1000 rows |
| Enterprise | 1000 rows |
## React Component

The @rowops/profile-react package provides a standalone React component for displaying profile results:

```tsx
import { RowOpsProfile } from '@rowops/profile-react';

function ProfileView({ profile, tier }) {
  return (
    <RowOpsProfile
      profile={profile}
      tier={tier}
      displayMode="detailed"
      showQualityGrades={true}
      showSemanticTypes={true}
      showOutliers={true}
    />
  );
}
```
### Display Modes
| Mode | Description |
|---|---|
| compact | Column name, type, basic counts |
| detailed | Adds completeness/uniqueness bars |
| full | All statistics including outliers, semantic types |
## What This Module Does Not Do

- **Does not modify row data:** Profiling is read-only
- **Does not persist profile results to any server:** Profiles are returned to the client
- **Does not provide all features at all tiers:** Enhanced statistics are tier-gated
- **Does not send data externally:** All computation happens in WASM on the client
## Constraints

### Tier-Gated Features

| Feature | Free | Pro | Scale | Enterprise |
|---|---|---|---|---|
| Basic counts | ✓ | ✓ | ✓ | ✓ |
| Distinct count | ✓ | ✓ | ✓ | ✓ |
| Type inference | ✓ | ✓ | ✓ | ✓ |
| Top values | 5 | 10 | 20 | 50 |
| Median/Mode | | | ✓ | ✓ |
| String stats | | | ✓ | ✓ |
| Outlier detection | | | ✓ | ✓ |
| Quality grades | | | ✓ | ✓ |
| Semantic types | | | ✓ | ✓ |
| Histograms | | | ✓ | ✓ |
### Performance

- Streaming scans over the data (a fixed number of passes)
- O(n) complexity in the number of rows
- WASM-accelerated computation
### Memory
Large datasets may require significant memory during profiling. Sampling is recommended for datasets exceeding 10,000 rows.
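One common way to cap memory when sampling is reservoir sampling, which keeps a fixed-size uniform sample in a single pass. This is an illustrative sketch under that assumption; the module's actual sampling strategy is not documented here.

```ts
// Hypothetical single-pass reservoir sampler: keeps at most k rows in memory.
// `rand` is injectable for deterministic testing.
function reservoirSample<T>(
  rows: Iterable<T>,
  k: number,
  rand: () => number = Math.random
): T[] {
  const sample: T[] = [];
  let seen = 0;
  for (const row of rows) {
    seen++;
    if (sample.length < k) {
      sample.push(row); // fill the reservoir first
    } else {
      // Replace an existing element with probability k/seen
      const j = Math.floor(rand() * seen);
      if (j < k) sample[j] = row;
    }
  }
  return sample;
}
```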
## Failure Modes
| Failure | Behavior |
|---|---|
| Unsupported data type | Partial profile output may be returned |
| Memory pressure | Profile may fail on very large datasets |
| Empty dataset | Profile returns zero counts, no error |
| Invalid numeric values | Excluded from statistics |
## Observed Status
Used in importer flows for column analysis. Actively exercised in dashboard-assisted mode for data quality preview and semantic type detection.