# Profile Module
The Profile module generates statistical metadata about column contents, supporting data quality assessment, schema inference, and semantic type detection.
## Purpose
Generate column-level statistics including counts, type inference, value distributions, numeric ranges, quality scores, and semantic type detection. Profiling supports data quality review before downstream delivery.
## When It Runs

**Pipeline Position:** After Transform (optional)
Parse → Validate → Mask → Transform → **Profile**
Profiling can also run independently on any tabular data, not only as the final pipeline stage.
## Inputs
| Input | Type | Description |
|---|---|---|
| Data | Arrow Table / Rows | Tabular data to profile |
| Sample Size | number | Optional row limit for large datasets (defaults to 1000 on Scale+ tiers; no sampling on Free/Pro) |
## Outputs
| Output | Type | Description |
|---|---|---|
| ProfileReport | ProfileReport | Dataset-level profile with column statistics |
## ProfileReport Structure

```ts
{
  tier: string,             // Current tier (free, pro, scale, enterprise)
  topN: number,             // Top values limit
  profiles: ColumnProfile[] // Per-column statistical metadata
}
```
## ColumnProfile Structure

Each column profile contains derived statistical metadata:

```ts
{
  field: string,               // Column name
  totalCount: number,          // Total row count
  nonNullCount: number,        // Non-null value count
  nullCount: number,           // Null/empty value count
  distinctCount: number,       // Unique value count
  inferredType: string,        // Detected data type
  topValues: TopValue[],       // Most frequent values

  // Numeric columns
  minNumeric?: number,         // Minimum value
  maxNumeric?: number,         // Maximum value
  mean?: number,               // Mean value
  median?: number,             // Median value (Scale+ tier)
  stddev?: number,             // Standard deviation

  // String columns (Scale+ tier)
  minLength?: number,          // Minimum string length
  maxLength?: number,          // Maximum string length

  // Enhanced stats (Scale+ tier)
  numericStats?: NumericStats, // Detailed numeric statistics
  stringStats?: StringStats,   // Detailed string statistics
  outliers?: OutlierStats,     // Outlier detection results
  histogram?: HistogramBin[],  // Value distribution

  // Data quality (Scale+ tier)
  quality?: QualityGrade,      // Quality grade (A+ to F)
  semanticType?: SemanticType  // Detected semantic type
}
```
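As a sketch of how the basic count fields above can be derived, consider the following. The helper name `basicProfile` is hypothetical; the module's actual WASM implementation is not shown here.

```ts
type TopValue = { value: string; count: number };

// Hypothetical helper illustrating the count fields of ColumnProfile.
// Null, undefined, and empty strings are treated as "null/empty" values.
function basicProfile(field: string, values: unknown[], topN = 5) {
  const counts = new Map<string, number>();
  let nonNull = 0;
  for (const v of values) {
    if (v === null || v === undefined || v === "") continue;
    nonNull++;
    const key = String(v);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Most frequent values first, truncated to the tier's topN limit
  const topValues: TopValue[] = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([value, count]) => ({ value, count }));
  return {
    field,
    totalCount: values.length,
    nonNullCount: nonNull,
    nullCount: values.length - nonNull,
    distinctCount: counts.size,
    topValues,
  };
}
```

Note that `nullCount` here includes empty strings, matching the "Null/empty value count" comment above.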
### NumericStats Structure (Scale+ Tier)

```ts
{
  min: number,
  max: number,
  mean: number,
  median: number,
  stdDev: number,
  mode?: number // Most frequent value (rounded)
}
```
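The `median` and `mode` fields can be sketched as follows; `medianAndMode` is a hypothetical helper, and `mode` follows the comment above (most frequent value after rounding):

```ts
// Hypothetical sketch of the median/mode computation, not the module's exact code.
function medianAndMode(values: number[]): { median: number; mode: number } {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = sorted.length >> 1;
  // Even-length arrays average the two middle values
  const median =
    sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  // Mode: most frequent value after rounding to the nearest integer
  const freq = new Map<number, number>();
  for (const v of values) {
    const r = Math.round(v);
    freq.set(r, (freq.get(r) ?? 0) + 1);
  }
  let mode = NaN;
  let best = 0;
  for (const [v, c] of freq) {
    if (c > best) { best = c; mode = v; }
  }
  return { median, mode };
}
```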
### StringStats Structure (Scale+ Tier)

```ts
{
  minLength: number,
  maxLength: number,
  avgLength: number
}
```
### OutlierStats Structure (Scale+ Tier)

```ts
{
  count: number,          // IQR-based outlier count
  q1: number,             // First quartile
  q3: number,             // Third quartile
  iqr: number,            // Interquartile range
  lowerBound: number,     // Q1 - 1.5*IQR
  upperBound: number,     // Q3 + 1.5*IQR
  zScoreOutliers: number  // Z-score based outliers (|z| > 3)
}
```
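A minimal sketch of the IQR and z-score rules described above. The function name `outlierStats` is hypothetical, and the linear quantile interpolation is an assumption; the module's exact quartile method is not documented here.

```ts
// Hypothetical sketch of OutlierStats. Assumes at least one value.
function outlierStats(values: number[]) {
  const sorted = [...values].sort((a, b) => a - b);
  // Linear-interpolation quantile (an assumption, not the module's exact method)
  const quantile = (p: number) => {
    const idx = (sorted.length - 1) * p;
    const lo = Math.floor(idx);
    const hi = Math.ceil(idx);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
  };
  const q1 = quantile(0.25);
  const q3 = quantile(0.75);
  const iqr = q3 - q1;
  const lowerBound = q1 - 1.5 * iqr;
  const upperBound = q3 + 1.5 * iqr;
  const count = values.filter((v) => v < lowerBound || v > upperBound).length;

  // Z-score rule: |z| > 3 relative to the population mean/stddev
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const stdDev = Math.sqrt(
    values.reduce((s, v) => s + (v - mean) ** 2, 0) / values.length
  );
  const zScoreOutliers =
    stdDev === 0
      ? 0
      : values.filter((v) => Math.abs((v - mean) / stdDev) > 3).length;

  return { count, q1, q3, iqr, lowerBound, upperBound, zScoreOutliers };
}
```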
## QualityGrade
Quality grades indicate overall column data quality:
| Grade | Score Range | Description |
|---|---|---|
| A+ | ≥ 95% | Excellent quality |
| A | 85-94% | Good quality |
| B | 75-84% | Acceptable quality |
| C | 60-74% | Needs improvement |
| D | 40-59% | Poor quality |
| F | < 40% | Critical issues |
Quality score is computed using a weighted formula:
- Completeness (40%): Non-null ratio
- Type confidence (25%): Consistent type inference
- Uniqueness (20%): Distinct value ratio
- Outlier penalty (10%): Reduction for outliers
- Semantic score (5%): Bonus for detected semantic types
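The weighted formula can be sketched as follows. Only the weights and grade boundaries come from the tables above; the exact sub-score definitions (e.g. how `typeConfidence` is measured) are assumptions.

```ts
// Hypothetical sketch of the weighted quality score. All inputs are 0..1.
function qualityScore(parts: {
  completeness: number;   // non-null ratio
  typeConfidence: number; // share of values matching the inferred type
  uniqueness: number;     // distinct-value ratio
  outlierRatio: number;   // outliers / non-null values (applied as a penalty)
  semanticScore: number;  // 1 if a semantic type was detected, else 0
}): number {
  const score =
    0.40 * parts.completeness +
    0.25 * parts.typeConfidence +
    0.20 * parts.uniqueness +
    0.10 * (1 - parts.outlierRatio) +
    0.05 * parts.semanticScore;
  return score * 100; // percentage
}

// Grade boundaries from the QualityGrade table
function grade(score: number): string {
  if (score >= 95) return "A+";
  if (score >= 85) return "A";
  if (score >= 75) return "B";
  if (score >= 60) return "C";
  if (score >= 40) return "D";
  return "F";
}
```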
## SemanticType
Detected semantic types for columns:
| Type | Description |
|---|---|
| email | Email addresses |
| phone | Phone numbers |
| url | URLs/web addresses |
| uuid | UUIDs |
| date | Date values |
| currency | Currency values |
| ssn | Social Security Numbers |
| ip_address | IP addresses (v4/v6) |
| zipcode | ZIP/postal codes |
| creditcard | Credit card numbers |
Semantic detection uses pattern matching: a semantic type is assigned when at least 80% of the examined values match its pattern.
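A sketch of this threshold-based detection follows. The regexes here are simplified stand-ins for two of the types above, not the module's actual patterns, and `detectSemanticType` is a hypothetical name.

```ts
// Simplified illustrative patterns; real detection patterns are more thorough.
const PATTERNS: Record<string, RegExp> = {
  email: /^[^\s@]+@[^\s@]+\.[^\s@]+$/,
  uuid: /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i,
};

// Returns the first type whose pattern matches >= 80% of non-empty values.
function detectSemanticType(values: string[]): string | undefined {
  const nonEmpty = values.filter((v) => v.length > 0);
  if (nonEmpty.length === 0) return undefined;
  for (const [type, re] of Object.entries(PATTERNS)) {
    const matches = nonEmpty.filter((v) => re.test(v)).length;
    if (matches / nonEmpty.length >= 0.8) return type; // 80% confidence threshold
  }
  return undefined;
}
```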
## Configuration
Profiling behavior is influenced by tier-based feature access. Sample size defaults vary by tier:
| Tier | Default Sample Size |
|---|---|
| Free | Full dataset (no sampling) |
| Pro | Full dataset (no sampling) |
| Scale | 1000 rows |
| Enterprise | 1000 rows |
## React Component

The @rowops/profile-react package provides a standalone React component for displaying profile results:

```tsx
import { RowOpsProfile } from '@rowops/profile-react';

function ProfileView({ profile, tier }) {
  return (
    <RowOpsProfile
      profile={profile}
      tier={tier}
      displayMode="detailed"
      showQualityGrades={true}
      showSemanticTypes={true}
      showOutliers={true}
    />
  );
}
```
### Display Modes
| Mode | Description |
|---|---|
| compact | Column name, type, basic counts |
| detailed | Adds completeness/uniqueness bars |
| full | All statistics including outliers, semantic types |
## What This Module Does Not Do

- **Does not modify row data:** Profiling is read-only
- **Does not persist profile results to any server:** Profiles are returned to the client
- **Does not provide all features at all tiers:** Enhanced statistics are tier-gated
- **Does not send data externally:** All computation happens in WASM on the client
## Constraints

### Tier-Gated Features

| Feature | Free | Pro | Scale | Enterprise |
|---|---|---|---|---|
| Basic counts | ✓ | ✓ | ✓ | ✓ |
| Distinct count | ✓ | ✓ | ✓ | ✓ |
| Type inference | ✓ | ✓ | ✓ | ✓ |
| Top values | 5 | 10 | 20 | 50 |
| Median/Mode | | | ✓ | ✓ |
| String stats | | | ✓ | ✓ |
| Outlier detection | | | ✓ | ✓ |
| Quality grades | | | ✓ | ✓ |
| Semantic types | | | ✓ | ✓ |
| Histograms | | | ✓ | ✓ |
### Performance

- Streaming scans over the data (a fixed number of passes)
- O(n) complexity in the number of rows
- WASM-accelerated computation
### Memory
Large datasets may require significant memory during profiling. Sampling is recommended for datasets exceeding 10,000 rows.
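One common way to cap memory when sampling is reservoir sampling, which keeps a fixed-size uniform sample in a single pass. This is an illustrative sketch under that assumption; the module's actual sampling strategy is not documented here.

```ts
// Hypothetical single-pass reservoir sampler: keeps at most k rows in memory.
// `rand` is injectable for deterministic testing.
function reservoirSample<T>(
  rows: Iterable<T>,
  k: number,
  rand: () => number = Math.random
): T[] {
  const sample: T[] = [];
  let seen = 0;
  for (const row of rows) {
    seen++;
    if (sample.length < k) {
      sample.push(row); // fill the reservoir first
    } else {
      // Replace an existing element with probability k/seen
      const j = Math.floor(rand() * seen);
      if (j < k) sample[j] = row;
    }
  }
  return sample;
}
```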
## Failure Modes
| Failure | Behavior |
|---|---|
| Unsupported data type | Partial profile output may be returned |
| Memory pressure | Profile may fail on very large datasets |
| Empty dataset | Profile returns zero counts, no error |
| Invalid numeric values | Excluded from statistics |
## Observed Status
Used in importer flows for column analysis. Actively exercised in dashboard-assisted mode for data quality preview and semantic type detection.