Profile Module

The Profile module generates statistical metadata about column contents, supporting data quality assessment, schema inference, and semantic type detection.


Purpose

Generates column-level statistics: counts, inferred types, value distributions, numeric ranges, quality scores, and semantic types. Profiling supports data quality review before downstream delivery.


When It Runs

Pipeline Position: After Transform (optional)

Parse → Validate → Mask → Transform → **Profile**

Profiling can also run independently on any tabular data, not only as the final pipeline stage.


Inputs

| Input | Type | Description |
| --- | --- | --- |
| Data | Arrow Table / Rows | Tabular data to profile |
| Sample Size | number | Optional row limit for large datasets (default: 1000 for Scale+ tiers) |

Outputs

| Output | Type | Description |
| --- | --- | --- |
| ProfileReport | ProfileReport | Dataset-level profile with column statistics |

ProfileReport Structure

{
  tier: string,              // Current tier (free, pro, scale, enterprise)
  topN: number,              // Top values limit
  profiles: ColumnProfile[]  // Per-column statistical metadata
}

ColumnProfile Structure

Each column profile contains derived statistical metadata:

{
  field: string,                // Column name
  totalCount: number,           // Total row count
  nonNullCount: number,         // Non-null value count
  nullCount: number,            // Null/empty value count
  distinctCount: number,        // Unique value count
  inferredType: string,         // Detected data type
  topValues: TopValue[],        // Most frequent values

  // Numeric columns
  minNumeric?: number,          // Minimum value
  maxNumeric?: number,          // Maximum value
  mean?: number,                // Mean value
  median?: number,              // Median value (Scale+ tier)
  stddev?: number,              // Standard deviation

  // String columns (Scale+ tier)
  minLength?: number,           // Minimum string length
  maxLength?: number,           // Maximum string length

  // Enhanced stats (Scale+ tier)
  numericStats?: NumericStats,  // Detailed numeric statistics
  stringStats?: StringStats,    // Detailed string statistics
  outliers?: OutlierStats,      // Outlier detection results
  histogram?: HistogramBin[],   // Value distribution

  // Data quality (Scale+ tier)
  quality?: QualityGrade,       // Quality grade (A+ to F)
  semanticType?: SemanticType   // Detected semantic type
}
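
The basic count fields above can be derived in a single pass over a column. The following is a minimal sketch, not the module's implementation: `profileColumn`, `BasicColumnProfile`, the shape of `TopValue`, and the rule that `null`, `undefined`, and the empty string all count as null are assumptions made for illustration.

```typescript
// Hypothetical types for illustration; the real TopValue shape is not documented here.
interface TopValue {
  value: string;
  count: number;
}

interface BasicColumnProfile {
  field: string;
  totalCount: number;
  nonNullCount: number;
  nullCount: number;
  distinctCount: number;
  topValues: TopValue[];
}

type Cell = string | number | boolean | null | undefined;

// Assumption: null, undefined, and "" are all treated as null/empty.
function isNullish(cell: Cell): boolean {
  return cell === null || cell === undefined || cell === "";
}

function profileColumn(field: string, cells: Cell[], topN: number): BasicColumnProfile {
  // Frequency of each non-null value, keyed by its string form
  // (so the number 1 and the string "1" share a key in this sketch).
  const frequencies = new Map<string, number>();
  let nullCount = 0;

  for (const cell of cells) {
    if (isNullish(cell)) {
      nullCount += 1;
      continue;
    }
    const key = String(cell);
    frequencies.set(key, (frequencies.get(key) ?? 0) + 1);
  }

  // Most frequent first; ties are broken alphabetically for stable output.
  const topValues = Array.from(frequencies.entries())
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count || a.value.localeCompare(b.value))
    .slice(0, topN);

  return {
    field,
    totalCount: cells.length,
    nonNullCount: cells.length - nullCount,
    nullCount,
    distinctCount: frequencies.size,
    topValues,
  };
}

// Seven cells, two of them empty, three distinct values.
const countryProfile = profileColumn(
  "country",
  ["US", "DE", "US", null, "", "FR", "US"],
  2,
);
```

In this sketch an empty input array yields zero for every count and an empty `topValues` list, which lines up with the empty-dataset behavior described under Failure Modes.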

NumericStats Structure (Scale+ Tier)

{
  min: number,
  max: number,
  mean: number,
  median: number,
  stdDev: number,
  mode?: number  // Most frequent value (rounded)
}

StringStats Structure (Scale+ Tier)

{
  minLength: number,
  maxLength: number,
  avgLength: number
}

OutlierStats Structure (Scale+ Tier)

{
  count: number,          // IQR-based outlier count
  q1: number,             // First quartile
  q3: number,             // Third quartile
  iqr: number,            // Interquartile range
  lowerBound: number,     // Q1 - 1.5*IQR
  upperBound: number,     // Q3 + 1.5*IQR
  zScoreOutliers: number  // Z-score based outliers (|z| > 3)
}
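
To make the two outlier rules concrete, here is a rough sketch of how the `OutlierStats` fields could be computed from a numeric column. The function names, the linear-interpolation quantile, and the use of the population standard deviation are assumptions; the module may use different conventions for any of these.

```typescript
interface OutlierStats {
  count: number;
  q1: number;
  q3: number;
  iqr: number;
  lowerBound: number;
  upperBound: number;
  zScoreOutliers: number;
}

// Quantile by linear interpolation over a sorted array
// (one common convention, assumed here).
function quantile(sorted: number[], p: number): number {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function detectOutliers(values: number[]): OutlierStats | undefined {
  // Drop NaN and infinities so they do not distort the statistics.
  const sorted = values
    .filter((v) => Number.isFinite(v))
    .sort((a, b) => a - b);
  if (sorted.length === 0) return undefined;

  // IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier.
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lowerBound = q1 - 1.5 * iqr;
  const upperBound = q3 + 1.5 * iqr;
  const count = sorted.filter((v) => v < lowerBound || v > upperBound).length;

  // Z-score rule: |z| > 3, using the population standard deviation (assumption).
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  const variance =
    sorted.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / sorted.length;
  const stdDev = Math.sqrt(variance);
  const zScoreOutliers =
    stdDev === 0
      ? 0
      : sorted.filter((v) => Math.abs((v - mean) / stdDev) > 3).length;

  return { count, q1, q3, iqr, lowerBound, upperBound, zScoreOutliers };
}

const outlierExample = detectOutliers([1, 2, 3, 4, 5, 6, 7, 8, 100]);
```

For this input the IQR rule flags the value 100 (bounds of -3 and 13), while the z-score rule flags nothing: in a sample this small, a single extreme value inflates the standard deviation enough to keep its own z-score below 3. This is one reason to report both counts.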

QualityGrade

Quality grades indicate overall column data quality:

| Grade | Score Range | Description |
| --- | --- | --- |
| A+ | ≥ 95% | Excellent quality |
| A | 85-94% | Good quality |
| B | 75-84% | Acceptable quality |
| C | 60-74% | Needs improvement |
| D | 40-59% | Poor quality |
| F | < 40% | Critical issues |

Quality score is computed using a weighted formula:

  • Completeness (40%): Non-null ratio
  • Type confidence (25%): Consistent type inference
  • Uniqueness (20%): Distinct value ratio
  • Outlier penalty (10%): Reduction for outliers
  • Semantic score (5%): Bonus for detected semantic types
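
The weights and grade ranges above can be combined as in the following sketch. Only the weights and grade cutoffs come from this page. The names `QualityInputs`, `qualityScore`, and `toGrade`, the normalization of each component to the range 0 to 1, and the treatment of the outlier penalty as `1 - outlierRatio` are assumptions for illustration.

```typescript
type QualityGrade = "A+" | "A" | "B" | "C" | "D" | "F";

// Hypothetical inputs, each assumed to be normalized to the range 0..1.
interface QualityInputs {
  completeness: number;   // non-null ratio
  typeConfidence: number; // share of values consistent with the inferred type
  uniqueness: number;     // distinct value ratio
  outlierRatio: number;   // share of values flagged as outliers
  semanticScore: number;  // 1 if a semantic type was detected, else 0
}

// Weighted sum using the documented weights; returns a percentage (0..100).
function qualityScore(q: QualityInputs): number {
  const score =
    0.4 * q.completeness +
    0.25 * q.typeConfidence +
    0.2 * q.uniqueness +
    0.1 * (1 - q.outlierRatio) + // assumption: fewer outliers earns more of the 10%
    0.05 * q.semanticScore;
  return score * 100;
}

// Maps a percentage to a grade using the cutoffs from the grade table.
function toGrade(scorePercent: number): QualityGrade {
  if (scorePercent >= 95) return "A+";
  if (scorePercent >= 85) return "A";
  if (scorePercent >= 75) return "B";
  if (scorePercent >= 60) return "C";
  if (scorePercent >= 40) return "D";
  return "F";
}

// 90% complete, fully type-consistent, half the values distinct,
// no outliers, no semantic type detected: roughly 81%, grade B.
const exampleScore = qualityScore({
  completeness: 0.9,
  typeConfidence: 1,
  uniqueness: 0.5,
  outlierRatio: 0,
  semanticScore: 0,
});
const exampleGrade = toGrade(exampleScore);
```

Because uniqueness carries 20% of the weight in this formula, a low-cardinality column such as a status flag cannot reach A+ even when it is complete and clean.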

SemanticType

Detected semantic types for columns:

| Type | Description |
| --- | --- |
| email | Email addresses |
| phone | Phone numbers |
| url | URLs/web addresses |
| uuid | UUIDs |
| date | Date values |
| currency | Currency values |
| ssn | Social Security Numbers |
| ip_address | IP addresses (v4/v6) |
| zipcode | ZIP/postal codes |
| creditcard | Credit card numbers |

Semantic detection uses pattern matching with an 80% confidence threshold.
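
One plausible reading of the threshold is that a column receives a semantic type when at least 80% of its non-empty values match that type's pattern. The sketch below illustrates that reading for four of the listed types. The regular expressions are deliberately simplified, and both they and the `detectSemanticType` function are assumptions, not the module's actual rules.

```typescript
type SemanticType = "email" | "uuid" | "ip_address" | "zipcode";

// Simplified illustrative patterns (assumptions, not the module's rules).
const PATTERNS: Array<[SemanticType, RegExp]> = [
  ["email", /^[^\s@]+@[^\s@]+\.[^\s@]+$/],
  ["uuid", /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i],
  ["ip_address", /^(\d{1,3}\.){3}\d{1,3}$/], // IPv4 shape only, no range check
  ["zipcode", /^\d{5}(-\d{4})?$/],           // US ZIP and ZIP+4 only
];

// Returns the first type whose pattern matches at least `threshold`
// of the non-empty values, or undefined if none qualifies.
function detectSemanticType(
  values: string[],
  threshold: number = 0.8,
): SemanticType | undefined {
  const nonEmpty = values.map((v) => v.trim()).filter((v) => v !== "");
  if (nonEmpty.length === 0) return undefined;

  for (const [type, pattern] of PATTERNS) {
    const matches = nonEmpty.filter((v) => pattern.test(v)).length;
    if (matches / nonEmpty.length >= threshold) return type;
  }
  return undefined;
}

// Five of six values look like email addresses, so the column qualifies.
const detected = detectSemanticType([
  "a@example.com",
  "b@example.org",
  "c@example.net",
  "d@example.com",
  "e@example.org",
  "not-an-email",
]);
```

A threshold below 100% lets a column keep its semantic type despite a few malformed entries, which is usually what a data quality preview needs.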


Configuration

Profiling behavior is influenced by tier-based feature access. Sample size defaults vary by tier:

| Tier | Default Sample Size |
| --- | --- |
| Free | Full dataset (no sampling) |
| Pro | Full dataset (no sampling) |
| Scale | 1000 rows |
| Enterprise | 1000 rows |
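
The defaults in this table could be resolved as in the sketch below. The helper names are hypothetical, and two behaviors are assumed rather than documented: that an explicit sample size overrides the tier default, and that sampling takes the first N rows.

```typescript
type Tier = "free" | "pro" | "scale" | "enterprise";

// undefined means "profile the full dataset".
const DEFAULT_SAMPLE_SIZE: Record<Tier, number | undefined> = {
  free: undefined,
  pro: undefined,
  scale: 1000,
  enterprise: 1000,
};

// Assumption: an explicit sample size takes precedence over the tier default.
function resolveSampleSize(tier: Tier, requested?: number): number | undefined {
  return requested ?? DEFAULT_SAMPLE_SIZE[tier];
}

// Assumption: sampling keeps the first `sampleSize` rows.
function takeSample<T>(rows: T[], sampleSize?: number): T[] {
  return sampleSize === undefined ? rows : rows.slice(0, sampleSize);
}

// 1,500 synthetic rows profiled under the Scale tier default.
const rows = Array.from({ length: 1500 }, (_, i) => ({ id: i }));
const sampled = takeSample(rows, resolveSampleSize("scale"));
```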

React Component

The @rowops/profile-react package provides a standalone React component for displaying profile results:

import { RowOpsProfile } from '@rowops/profile-react';

function ProfileView({ profile, tier }) {
  return (
    <RowOpsProfile
      profile={profile}
      tier={tier}
      displayMode="detailed"
      showQualityGrades={true}
      showSemanticTypes={true}
      showOutliers={true}
    />
  );
}

Display Modes

| Mode | Description |
| --- | --- |
| compact | Column name, type, basic counts |
| detailed | Adds completeness/uniqueness bars |
| full | All statistics including outliers, semantic types |

What This Module Does Not Do

  • Does not modify row data: Profiling is read-only
  • Does not persist profile results to any server: Profiles are returned to the client
  • Does not provide all features at all tiers: Enhanced statistics are tier-gated
  • Does not send data externally: All computation happens in WASM on the client

Constraints

Tier-Gated Features

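| Feature | Free | Pro | Scale | Enterprise |
| --- | --- | --- | --- | --- |
| Basic counts | Yes | Yes | Yes | Yes |
| Distinct count | Yes | Yes | Yes | Yes |
| Type inference | Yes | Yes | Yes | Yes |
| Top values | 5 | 10 | 20 | 50 |
| Median/Mode | No | No | Yes | Yes |
| String stats | No | No | Yes | Yes |
| Outlier detection | No | No | Yes | Yes |
| Quality grades | No | No | Yes | Yes |
| Semantic types | No | No | Yes | Yes |
| Histograms | No | No | Yes | Yes |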

Performance

  • Multi-pass streaming scans over data
  • O(n) complexity over rows
  • WASM-accelerated computation

Memory

Large datasets may require significant memory during profiling. Sampling is recommended for datasets exceeding 10,000 rows.


Failure Modes

| Failure | Behavior |
| --- | --- |
| Unsupported data type | Partial profile output may be returned |
| Memory pressure | Profile may fail on very large datasets |
| Empty dataset | Profile returns zero counts, no error |
| Invalid numeric values | Excluded from statistics |

Observed Status

Used in importer flows for column analysis. Actively exercised in dashboard-assisted mode for data quality preview and semantic type detection.