Purpose
This guide provides validation rules for detecting data quality issues after CSV parsing. Apply these checks to the parsed data before analysis to catch problems early and report them with actionable detail.
Validation Checks
Missing Value Detection
Scan each column for null, empty string, or whitespace-only values. Report as: "Column '[name]' has [N] missing values ([X]% of rows) in rows [list]." When more than 5 rows are affected, show the first 5 and note "and [N] more." Flag columns with >50% missing values as potentially unusable for analysis.
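A minimal pandas sketch of this check, assuming the parsed CSV is already loaded into a DataFrame; the function name, the 5-row truncation parameter, and the use of the DataFrame index as the row reference are illustrative assumptions rather than anything this guide prescribes.

```python
import pandas as pd

def find_missing_values(df: pd.DataFrame, max_listed: int = 5) -> list[dict]:
    """Flag null, empty-string, and whitespace-only values in each column."""
    issues = []
    for col in df.columns:
        text = df[col].astype(str).str.strip()        # normalize for the string checks
        missing = df[col].isna() | text.eq("")        # NaN, "", or whitespace-only
        if not missing.any():
            continue
        rows = missing[missing].index.tolist()        # DataFrame index as the row reference (assumption)
        pct = 100 * len(rows) / len(df)
        extra = f" and {len(rows) - max_listed} more" if len(rows) > max_listed else ""
        issues.append({
            "column": col,
            "count": len(rows),
            "percent": round(pct, 1),
            "potentially_unusable": pct > 50,         # >50% missing: flag as potentially unusable
            "message": (f"Column '{col}' has {len(rows)} missing values "
                        f"({pct:.0f}% of rows) in rows {rows[:max_listed]}{extra}."),
        })
    return issues
```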
Date Format Validation
For columns identified as dates, attempt to parse each value. Valid formats include ISO (YYYY-MM-DD), US (MM/DD/YYYY), European (DD/MM/YYYY), and natural language (Jan 15, 2024). Report unparseable dates as: "Column '[name]' has [N] invalid dates in rows [list]. Expected formats: YYYY-MM-DD, MM/DD/YYYY, or 'Jan 15, 2024'." Flag future dates (more than 7 days ahead) as suspicious unless the column name suggests forecasts or expected dates.
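A sketch of per-value parsing against the accepted formats, using only the standard library plus pandas for the column itself. Note that MM/DD/YYYY and DD/MM/YYYY are ambiguous when the day is 12 or less; this sketch simply accepts the first format that matches, which a real implementation may want to resolve differently.

```python
from datetime import datetime, timedelta
import pandas as pd

# ISO, US, European, and natural-language ("Jan 15, 2024") formats from the rule above.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%b %d, %Y")

def check_dates(series: pd.Series, is_forecast_column: bool = False) -> dict:
    """Collect rows with unparseable dates and rows with suspicious future dates."""
    invalid_rows, future_rows = [], []
    cutoff = datetime.now() + timedelta(days=7)       # "future" = more than 7 days ahead
    for idx, raw in series.dropna().items():
        value = str(raw).strip()
        if not value:
            continue                                  # blanks belong to the missing-value check
        parsed = None
        for fmt in DATE_FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
                break
            except ValueError:
                continue
        if parsed is None:
            invalid_rows.append(idx)
        elif parsed > cutoff and not is_forecast_column:
            future_rows.append(idx)                   # suspicious unless the column is a forecast
    return {"invalid_rows": invalid_rows, "future_rows": future_rows}
```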
Duplicate Detection
If an ID-like column exists (a name containing "id", "key", or "number", or values that look like unique identifiers), check for duplicate values. Report as: "Found [N] duplicate IDs: [ID1] (rows [list]), [ID2] (rows [list])." Show up to 5 duplicates; if more exist, note "and [N] more duplicate IDs." Ask the user how to handle them: keep the first occurrence, keep the latest (by date column), sum amounts, or remove entirely.
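One way this check might look in pandas; the ID_HINTS tuple and the returned structure are illustrative assumptions, and deciding how to resolve the duplicates is still left to the user.

```python
import pandas as pd

ID_HINTS = ("id", "key", "number")   # name fragments that suggest an identifier column

def find_duplicate_ids(df: pd.DataFrame, max_listed: int = 5) -> list[dict]:
    """For each ID-like column, report which values repeat and in which rows."""
    findings = []
    for col in df.columns:
        if not any(hint in col.lower() for hint in ID_HINTS):
            continue
        dupes = df[col][df[col].notna() & df[col].duplicated(keep=False)]
        if dupes.empty:
            continue
        groups = dupes.groupby(dupes).groups          # value -> Index of rows sharing it
        listed = list(groups.items())[:max_listed]    # show up to 5 duplicate IDs
        findings.append({
            "column": col,
            "duplicate_ids": len(groups),
            "details": {str(value): list(rows) for value, rows in listed},
            "more": max(0, len(groups) - max_listed),
        })
    return findings
```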
Suspicious Value Flags
Check for values that are technically valid but likely erroneous (a combined sketch of these checks follows the list):
- Negative amounts in columns that should be positive (revenue, counts, quantities): "Column '[name]' has [N] negative values in rows [list]. Verify these are intentional."
- Future dates in historical columns (created_date, close_date): "Column '[name]' has [N] dates in the future in rows [list]."
- Percentages outside 0-100 (unless explicitly a multiplier): "Column '[name]' has values outside 0-100% range in rows [list]."
- Outliers more than 3 standard deviations from mean in numeric columns: "Column '[name]' has [N] extreme outliers in rows [list]. Largest: [value], smallest: [value]."
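A combined sketch of the numeric checks above; the expect_positive and is_percentage flags are hypothetical parameters the caller would set from the column name, since this guide does not prescribe an API. The future-date check can reuse check_dates from the date validation sketch.

```python
import pandas as pd

def flag_suspicious_numeric(series: pd.Series, expect_positive: bool = False,
                            is_percentage: bool = False) -> dict:
    """Flag technically valid but likely erroneous values in one numeric column."""
    values = pd.to_numeric(series, errors="coerce").dropna()
    flags: dict = {}
    if expect_positive:
        flags["negative_rows"] = values[values < 0].index.tolist()
    if is_percentage:
        flags["out_of_range_rows"] = values[(values < 0) | (values > 100)].index.tolist()
    # Outliers: more than 3 standard deviations from the column mean.
    if len(values) > 1 and values.std() > 0:
        z = (values - values.mean()) / values.std()
        outliers = values[z.abs() > 3]
        if not outliers.empty:
            flags["outlier_rows"] = outliers.index.tolist()
            flags["outlier_range"] = (float(outliers.min()), float(outliers.max()))
    return flags
```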
Error Message Templates
Use these templates for consistent, actionable error reporting; a small formatting helper follows the table:
| Issue Type | Template |
|---|---|
| Missing values | "Column '[column]' has [N] missing values ([X]% of rows) in rows [first 5 rows]. [Add 'and N more' if >5]" |
| Invalid dates | "Column '[column]' has [N] invalid dates in rows [list]. Expected: YYYY-MM-DD, MM/DD/YYYY, or 'Jan 15, 2024'" |
| Duplicates | "Found [N] duplicate IDs: [ID] (rows [list]), [ID] (rows [list]). [Add 'and N more' if >5]" |
| Negative amounts | "Column '[column]' has [N] negative values in rows [list]. Verify intentional." |
| Future dates | "Column '[column]' has [N] future dates in rows [list]. Expected historical data." |
| Percentage range | "Column '[column]' has [N] values outside 0-100% in rows [list]." |
| Outliers | "Column '[column]' has [N] extreme outliers (>3σ) in rows [list]. Range: [min] to [max]." |
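A small helper, sketched here for the missing-values template only, showing how the row-list truncation and the "and N more" suffix might be rendered; the function names are illustrative.

```python
def format_rows(rows: list[int], max_listed: int = 5) -> str:
    """Render a row list as 'r1, r2, ... and N more' per the templates above."""
    shown = ", ".join(str(r) for r in rows[:max_listed])
    extra = len(rows) - max_listed
    return f"{shown} and {extra} more" if extra > 0 else shown

def missing_values_message(column: str, rows: list[int], total_rows: int) -> str:
    pct = 100 * len(rows) / total_rows
    return (f"Column '{column}' has {len(rows)} missing values "
            f"({pct:.0f}% of rows) in rows {format_rows(rows)}")

# e.g. missing_values_message("region", [3, 7, 12, 18, 21, 40], 100)
# -> "Column 'region' has 6 missing values (6% of rows) in rows 3, 7, 12, 18, 21 and 1 more"
```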
Output Format
Add a "validation" object to the interpreted CSV output:
{
  "validation": {
    "status": "clean | warnings | errors",
    "issues": [
      {
        "type": "missing_values | invalid_dates | duplicates | suspicious_values",
        "column": "column name",
        "severity": "warning | error",
        "count": N,
        "rows": [row numbers],
        "message": "Human-readable message using templates above"
      }
    ],
    "summary": "Brief overall assessment"
  }
}

Set status to "errors" if any issue would prevent meaningful analysis (>50% missing in a key column, all dates invalid). Set it to "warnings" for issues that should be noted but don't block analysis. Set it to "clean" only if no issues are found.
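A sketch of assembling this object, assuming each collected issue already carries its severity; the status is derived exactly as described above.

```python
import json

def build_validation(issues: list[dict]) -> dict:
    """Wrap collected issues in the validation object and derive the overall status."""
    if not issues:
        status, summary = "clean", "No data quality issues found."
    elif any(issue["severity"] == "error" for issue in issues):
        status = "errors"
        summary = f"{len(issues)} issue(s) found; at least one blocks meaningful analysis."
    else:
        status = "warnings"
        summary = f"{len(issues)} issue(s) found; analysis can proceed with caution."
    return {"validation": {"status": status, "issues": issues, "summary": summary}}

# Example: a single warning-level issue yields status "warnings".
print(json.dumps(build_validation([{
    "type": "missing_values", "column": "region", "severity": "warning",
    "count": 3, "rows": [4, 9, 17],
    "message": "Column 'region' has 3 missing values (3% of rows) in rows 4, 9, 17.",
}]), indent=2))
```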
Severity Guidelines
Errors (block analysis or require user decision):
- ID column has duplicates (affects counts and aggregations)
- Required column has >50% missing values
- All values in a date column are unparseable
Warnings (note in output, proceed with caution):
- <20% missing values in non-critical columns
- A few suspicious values (negatives, outliers)
- A few unparseable dates when most values parse correctly
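One possible mapping from issue to severity, assuming each issue dict carries the type and count defined in the output format; whether a mostly-empty column counts as "required" is left to the caller, since this guide does not define that here.

```python
def classify_severity(issue: dict, total_rows: int, column_is_required: bool = True) -> str:
    """Map one issue onto 'error' or 'warning' following the guidelines above."""
    share = issue["count"] / total_rows if total_rows else 0.0
    if issue["type"] == "duplicates":
        return "error"                      # duplicate IDs distort counts and aggregations
    if issue["type"] == "missing_values" and column_is_required and share > 0.5:
        return "error"                      # required column has >50% missing values
    if issue["type"] == "invalid_dates" and share >= 1.0:
        return "error"                      # every value in the date column is unparseable
    return "warning"                        # note in output, proceed with caution
```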
Handling Decisions
When validation finds issues requiring user input, present the options clearly (a sketch of applying the duplicate-handling choices follows this list):
- Duplicates: "Keep first occurrence, keep latest by [date column], sum amounts, or exclude duplicates?"
- Missing values: "Exclude rows with missing [column], treat as zero, or proceed with partial data?"
- Invalid dates: "Exclude rows with invalid dates, or proceed without date-based analysis?"
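As one illustration, applying the duplicate-handling choices in pandas might look like the sketch below; the choice strings and column parameters are assumptions, not names defined by this guide.

```python
import pandas as pd

def apply_duplicate_choice(df: pd.DataFrame, id_col: str, choice: str,
                           date_col: str | None = None,
                           amount_col: str | None = None) -> pd.DataFrame:
    """Apply the user's decision for duplicate IDs."""
    if choice == "keep_first":
        return df.drop_duplicates(subset=id_col, keep="first")
    if choice == "keep_latest" and date_col:
        return df.sort_values(date_col).drop_duplicates(subset=id_col, keep="last")
    if choice == "sum_amounts" and amount_col:
        return df.groupby(id_col, as_index=False)[amount_col].sum()
    if choice == "exclude":
        return df[~df[id_col].duplicated(keep=False)]
    raise ValueError(f"Unsupported or incomplete choice: {choice!r}")
```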