Purpose
This guide provides validation rules for detecting data quality issues after CSV parsing. Apply these checks to the parsed data before analysis to catch problems early and report them with actionable detail.
Validation Checks
Missing Value Detection
Scan each column for null, empty string, or whitespace-only values. Report as: "Column '[name]' has [N] missing values ([X]% of rows) in rows [list]." When more than 5 rows are affected, show the first 5 and note "and [N] more." Flag columns with >50% missing values as potentially unusable for analysis.
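A minimal pandas sketch of this check, assuming the parsed CSV is already loaded into a DataFrame; the function name, the 5-row truncation parameter, and the use of the DataFrame index as the row reference are illustrative assumptions rather than anything this guide prescribes.

```python
import pandas as pd

def find_missing_values(df: pd.DataFrame, max_listed: int = 5) -> list[dict]:
    """Flag null, empty-string, and whitespace-only values in each column."""
    issues = []
    for col in df.columns:
        text = df[col].astype(str).str.strip()        # normalize for the string checks
        missing = df[col].isna() | text.eq("")        # NaN, "", or whitespace-only
        if not missing.any():
            continue
        rows = missing[missing].index.tolist()        # DataFrame index as the row reference (assumption)
        pct = 100 * len(rows) / len(df)
        extra = f" and {len(rows) - max_listed} more" if len(rows) > max_listed else ""
        issues.append({
            "column": col,
            "count": len(rows),
            "percent": round(pct, 1),
            "potentially_unusable": pct > 50,         # >50% missing: flag as potentially unusable
            "message": (f"Column '{col}' has {len(rows)} missing values "
                        f"({pct:.0f}% of rows) in rows {rows[:max_listed]}{extra}."),
        })
    return issues
```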
Date Format Validation
For columns identified as dates, attempt to parse each value. Valid formats include ISO (YYYY-MM-DD), US (MM/DD/YYYY), European (DD/MM/YYYY), and natural language (Jan 15, 2024). Report unparseable dates as: "Column '[name]' has [N] invalid dates in rows [list]. Expected formats: YYYY-MM-DD, MM/DD/YYYY, or 'Jan 15, 2024'." Flag future dates (more than 7 days ahead) as suspicious unless the column name suggests forecasts or expected dates.
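A sketch of per-value parsing against the accepted formats, using only the standard library plus pandas for the column itself. Note that MM/DD/YYYY and DD/MM/YYYY are ambiguous when the day is 12 or less; this sketch simply accepts the first format that matches, which a real implementation may want to resolve differently.

```python
from datetime import datetime, timedelta
import pandas as pd

# ISO, US, European, and natural-language ("Jan 15, 2024") formats from the rule above.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%b %d, %Y")

def check_dates(series: pd.Series, is_forecast_column: bool = False) -> dict:
    """Collect rows with unparseable dates and rows with suspicious future dates."""
    invalid_rows, future_rows = [], []
    cutoff = datetime.now() + timedelta(days=7)       # "future" = more than 7 days ahead
    for idx, raw in series.dropna().items():
        value = str(raw).strip()
        if not value:
            continue                                  # blanks belong to the missing-value check
        parsed = None
        for fmt in DATE_FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
                break
            except ValueError:
                continue
        if parsed is None:
            invalid_rows.append(idx)
        elif parsed > cutoff and not is_forecast_column:
            future_rows.append(idx)                   # suspicious unless the column is a forecast
    return {"invalid_rows": invalid_rows, "future_rows": future_rows}
```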
Duplicate Detection
If an ID-like column exists (a name containing "id", "key", or "number", or values that look like unique identifiers), check for duplicate values. Report as: "Found [N] duplicate IDs: [ID1] (rows [list]), [ID2] (rows [list])." Show up to 5 duplicates; if more exist, note "and [N] more duplicate IDs." Ask the user how to handle them: keep the first occurrence, keep the latest (by date column), sum amounts, or remove entirely.
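One way this check might look in pandas; the ID_HINTS tuple and the returned structure are illustrative assumptions, and deciding how to resolve the duplicates is still left to the user.

```python
import pandas as pd

ID_HINTS = ("id", "key", "number")   # name fragments that suggest an identifier column

def find_duplicate_ids(df: pd.DataFrame, max_listed: int = 5) -> list[dict]:
    """For each ID-like column, report which values repeat and in which rows."""
    findings = []
    for col in df.columns:
        if not any(hint in col.lower() for hint in ID_HINTS):
            continue
        dupes = df[col][df[col].notna() & df[col].duplicated(keep=False)]
        if dupes.empty:
            continue
        groups = dupes.groupby(dupes).groups          # value -> Index of rows sharing it
        listed = list(groups.items())[:max_listed]    # show up to 5 duplicate IDs
        findings.append({
            "column": col,
            "duplicate_ids": len(groups),
            "details": {str(value): list(rows) for value, rows in listed},
            "more": max(0, len(groups) - max_listed),
        })
    return findings
```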
Suspicious Value Flags
Check for values that are technically valid but likely erroneous (a combined sketch of these checks follows the list):
- Negative amounts in columns that should be positive (revenue, counts, quantities): "Column '[name]' has [N] negative values in rows [list]. Verify these are intentional."
- Future dates in historical columns (created_date, close_date): "Column '[name]' has [N] dates in the future in rows [list]."
- Percentages outside 0-100 (unless explicitly a multiplier): "Column '[name]' has values outside 0-100% range in rows [list]."
- Outliers more than 3 standard deviations from mean in numeric columns: "Column '[name]' has [N] extreme outliers in rows [list]. Largest: [value], smallest: [value]."
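A combined sketch of the numeric checks above; the expect_positive and is_percentage flags are hypothetical parameters the caller would set from the column name, since this guide does not prescribe an API. The future-date check can reuse check_dates from the date validation sketch.

```python
import pandas as pd

def flag_suspicious_numeric(series: pd.Series, expect_positive: bool = False,
                            is_percentage: bool = False) -> dict:
    """Flag technically valid but likely erroneous values in one numeric column."""
    values = pd.to_numeric(series, errors="coerce").dropna()
    flags: dict = {}
    if expect_positive:
        flags["negative_rows"] = values[values < 0].index.tolist()
    if is_percentage:
        flags["out_of_range_rows"] = values[(values < 0) | (values > 100)].index.tolist()
    # Outliers: more than 3 standard deviations from the column mean.
    if len(values) > 1 and values.std() > 0:
        z = (values - values.mean()) / values.std()
        outliers = values[z.abs() > 3]
        if not outliers.empty:
            flags["outlier_rows"] = outliers.index.tolist()
            flags["outlier_range"] = (float(outliers.min()), float(outliers.max()))
    return flags
```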
Error Message Templates
Use these templates for consistent, actionable error reporting; a small formatting helper follows the table:
| Issue Type | Template |
|---|---|
| Missing values | "Column '[column]' has [N] missing values ([X]% of rows) in rows [first 5 rows]. [Add 'and N more' if >5]" |
| Invalid dates | "Column '[column]' has [N] invalid dates in rows [list]. Expected: YYYY-MM-DD, MM/DD/YYYY, or 'Jan 15, 2024'" |
| Duplicates | "Found [N] duplicate IDs: [ID] (rows [list]), [ID] (rows [list]). [Add 'and N more' if >5]" |
| Negative amounts | "Column '[column]' has [N] negative values in rows [list]. Verify intentional." |
| Future dates | "Column '[column]' has [N] future dates in rows [list]. Expected historical data." |
| Percentage range | "Column '[column]' has [N] values outside 0-100% in rows [list]." |
| Outliers | "Column '[column]' has [N] extreme outliers (>3σ) in rows [list]. Range: [min] to [max]." |
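A small helper, sketched here for the missing-values template only, showing how the row-list truncation and the "and N more" suffix might be rendered; the function names are illustrative.

```python
def format_rows(rows: list[int], max_listed: int = 5) -> str:
    """Render a row list as 'r1, r2, ... and N more' per the templates above."""
    shown = ", ".join(str(r) for r in rows[:max_listed])
    extra = len(rows) - max_listed
    return f"{shown} and {extra} more" if extra > 0 else shown

def missing_values_message(column: str, rows: list[int], total_rows: int) -> str:
    pct = 100 * len(rows) / total_rows
    return (f"Column '{column}' has {len(rows)} missing values "
            f"({pct:.0f}% of rows) in rows {format_rows(rows)}")

# e.g. missing_values_message("region", [3, 7, 12, 18, 21, 40], 100)
# -> "Column 'region' has 6 missing values (6% of rows) in rows 3, 7, 12, 18, 21 and 1 more"
```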
Output Format
Add a "validation" object to the interpreted CSV output:
{
  "validation": {
    "status": "clean | warnings | errors",
    "issues": [
      {
        "type": "missing_values | invalid_dates | duplicates | suspicious_values",
        "column": "column name",
        "severity": "warning | error",
        "count": N,
        "rows": [row numbers],
        "message": "Human-readable message using templates above"
      }
    ],
    "summary": "Brief overall assessment"
  }
}

Set status to "errors" if any issue would prevent meaningful analysis (>50% missing in a key column, all dates invalid). Set it to "warnings" for issues that should be noted but don't block analysis. Set it to "clean" only if no issues are found.
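A sketch of assembling this object, assuming each collected issue already carries its severity; the status is derived exactly as described above.

```python
import json

def build_validation(issues: list[dict]) -> dict:
    """Wrap collected issues in the validation object and derive the overall status."""
    if not issues:
        status, summary = "clean", "No data quality issues found."
    elif any(issue["severity"] == "error" for issue in issues):
        status = "errors"
        summary = f"{len(issues)} issue(s) found; at least one blocks meaningful analysis."
    else:
        status = "warnings"
        summary = f"{len(issues)} issue(s) found; analysis can proceed with caution."
    return {"validation": {"status": status, "issues": issues, "summary": summary}}

# Example: a single warning-level issue yields status "warnings".
print(json.dumps(build_validation([{
    "type": "missing_values", "column": "region", "severity": "warning",
    "count": 3, "rows": [4, 9, 17],
    "message": "Column 'region' has 3 missing values (3% of rows) in rows 4, 9, 17.",
}]), indent=2))
```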
Severity Guidelines
Errors (block analysis or require user decision):
- ID column has duplicates (affects counts and aggregations)
- Required column has >50% missing values
- All values in a date column are unparseable
Warnings (note in output, proceed with caution):
- <20% missing values in non-critical columns
- A few suspicious values (negatives, outliers)
- A few unparseable dates when most values parse correctly
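One possible mapping from issue to severity, assuming each issue dict carries the type and count defined in the output format; whether a mostly-empty column counts as "required" is left to the caller, since this guide does not define that here.

```python
def classify_severity(issue: dict, total_rows: int, column_is_required: bool = True) -> str:
    """Map one issue onto 'error' or 'warning' following the guidelines above."""
    share = issue["count"] / total_rows if total_rows else 0.0
    if issue["type"] == "duplicates":
        return "error"                      # duplicate IDs distort counts and aggregations
    if issue["type"] == "missing_values" and column_is_required and share > 0.5:
        return "error"                      # required column has >50% missing values
    if issue["type"] == "invalid_dates" and share >= 1.0:
        return "error"                      # every value in the date column is unparseable
    return "warning"                        # note in output, proceed with caution
```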
Handling Decisions
When validation finds issues requiring user input, present the options clearly (a sketch of applying the duplicate-handling choices follows this list):
- Duplicates: "Keep first occurrence, keep latest by [date column], sum amounts, or exclude duplicates?"
- Missing values: "Exclude rows with missing [column], treat as zero, or proceed with partial data?"
- Invalid dates: "Exclude rows with invalid dates, or proceed without date-based analysis?"
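As one illustration, applying the duplicate-handling choices in pandas might look like the sketch below; the choice strings and column parameters are assumptions, not names defined by this guide.

```python
import pandas as pd

def apply_duplicate_choice(df: pd.DataFrame, id_col: str, choice: str,
                           date_col: str | None = None,
                           amount_col: str | None = None) -> pd.DataFrame:
    """Apply the user's decision for duplicate IDs."""
    if choice == "keep_first":
        return df.drop_duplicates(subset=id_col, keep="first")
    if choice == "keep_latest" and date_col:
        return df.sort_values(date_col).drop_duplicates(subset=id_col, keep="last")
    if choice == "sum_amounts" and amount_col:
        return df.groupby(id_col, as_index=False)[amount_col].sum()
    if choice == "exclude":
        return df[~df[id_col].duplicated(keep=False)]
    raise ValueError(f"Unsupported or incomplete choice: {choice!r}")
```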