Purpose
This guide helps interpret CSV columns semantically—understanding what data they contain based on meaning, not keyword matching. A column named "Customer_Rating" is a score column even though it doesn't contain the word "score."
Interpretation Process
For each column, examine three signals:
- Header name — What does the name suggest? Consider synonyms, abbreviations, domain conventions
- Sample values — What does the actual data look like? Numbers, dates, text patterns?
- Context hints — What column types were requested? Match against those categories
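The three signals can be gathered in one pass before any classification happens. The sketch below is a minimal illustration, not a prescribed implementation: the function name `collect_signals`, the `sample_size` parameter, and the shape of the returned records are all assumptions made for this example.

```python
import csv
import io


def collect_signals(csv_text, requested_types, sample_size=5):
    """Gather header, sample values, and requested types for every column.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    samples = {h: [] for h in headers}
    for row in reader:
        for h in headers:
            value = (row.get(h) or "").strip()
            # Keep only non-empty values, up to sample_size per column.
            if value and len(samples[h]) < sample_size:
                samples[h].append(value)
    return [
        {"header": h, "samples": samples[h], "requested": list(requested_types)}
        for h in headers
    ]


signals = collect_signals(
    "CSAT_Score,Account\n4,Acme Corp\n5,Widget Inc\n",
    requested_types=["score", "customer"],
)
```

Each record then carries everything needed to judge one column: the name, a few real values, and the categories the caller asked for.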
Common Column Categories
Score Columns
Headers suggesting ratings, satisfaction, or numeric evaluations:
- Direct: score, rating, nps, csat, satisfaction, stars
- Indirect: sentiment, feedback_score, review_rating, customer_rating, health_score
- Patterns: typically 1-10, 1-5, or 0-100 scales; sometimes -100 to 100 (aggregate NPS; individual NPS survey responses use 0-10)
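The value patterns for scores can be checked mechanically once the header has suggested a score column. This is a rough sketch under the assumption that samples arrive as strings; `SCALES` and `matching_scale` are hypothetical names, and a match only confirms that the values fit a scale, not that the column is a score.

```python
# Scales listed from narrowest to widest so the tightest fit is reported.
SCALES = [
    ("1-5", 1, 5),
    ("1-10", 1, 10),
    ("0-100", 0, 100),
    ("-100 to 100", -100, 100),
]


def matching_scale(values):
    """Return the narrowest scale that contains every sample, or None."""
    try:
        numbers = [float(v) for v in values]
    except (TypeError, ValueError):
        return None  # at least one sample is not numeric
    if not numbers:
        return None
    for name, low, high in SCALES:
        if all(low <= n <= high for n in numbers):
            return name
    return None
```

Small integers such as 4 or 5 also fit counts and quantities, so treat the result as one signal alongside the header and the requested types.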
Customer/Account Columns
Headers identifying entities being measured:
- Direct: customer, account, company, client, organization, tenant
- Indirect: customer_name, account_id, company_name, org, business_name
- Patterns: identifiers, names, or codes; the same value often repeats across rows
Date/Time Columns
Headers indicating when something occurred:
- Direct: date, time, timestamp, datetime, created, updated
- Indirect: created_at, submitted_on, closed_date, renewal_date, start_date, end_date
- Patterns: ISO dates, US dates (MM/DD/YYYY), epoch timestamps, relative dates
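A sample value can be tested against each of these date patterns in turn. The sketch below is illustrative only: `date_pattern` and the `RELATIVE` expression are assumptions, and the relative-date phrases it recognizes are a small hand-picked set rather than a complete list.

```python
import re
from datetime import datetime

# Assumed, non-exhaustive set of relative-date phrasings.
RELATIVE = re.compile(
    r"today|yesterday|tomorrow|\d+ (day|week|month|year)s? ago",
    re.IGNORECASE,
)

FORMATS = (
    ("iso", "%Y-%m-%d"),
    ("iso", "%Y-%m-%dT%H:%M:%S"),
    ("us", "%m/%d/%Y"),
)


def date_pattern(value):
    """Return 'iso', 'us', 'epoch', 'relative', or None for one sample."""
    text = value.strip()
    # 10 digits ~ epoch seconds, 13 digits ~ epoch milliseconds.
    # Caution: numeric IDs of the same length look identical.
    if re.fullmatch(r"\d{10}|\d{13}", text):
        return "epoch"
    for label, fmt in FORMATS:
        try:
            datetime.strptime(text, fmt)
            return label
        except ValueError:
            continue
    if RELATIVE.fullmatch(text):
        return "relative"
    return None
```

Because a ten-digit number may be an ID rather than a timestamp, an `epoch` result deserves a second look at the header before the column is called a date.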
Amount/Value Columns
Headers indicating monetary or quantity values:
- Direct: amount, value, revenue, price, cost, total, sum
- Indirect: deal_value, arr, mrr, contract_value, order_total, spend
- Patterns: numbers, often with currency symbols ($, €) and thousands separators, or with large magnitudes
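Amount values usually need light cleanup before they can be compared as numbers. The following is a minimal sketch, assuming US-style formatting where the comma is a thousands separator; `parse_amount` is a hypothetical helper, and European formats such as `1.234,56` are not handled.

```python
import re

CURRENCY_SYMBOLS = "$€"


def parse_amount(value):
    """Return (number or None, had_currency_symbol) for one sample.

    Assumes commas are thousands separators (US-style formatting).
    """
    text = value.strip()
    had_symbol = any(symbol in text for symbol in CURRENCY_SYMBOLS)
    # Drop currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[$€,\s]", "", text)
    try:
        return float(cleaned), had_symbol
    except ValueError:
        return None, had_symbol
```

A currency symbol is a strong hint toward the amount type; a bare number such as `42` parses just as well but says nothing about whether it is an amount, a score, or a count.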
Category/Type Columns
Headers indicating classification or grouping:
- Direct: category, type, status, priority, tier, segment, stage
- Indirect: issue_type, ticket_category, deal_stage, customer_tier, severity
- Patterns: limited set of repeated string values (low cardinality)
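Low cardinality can be estimated from the sample values alone. This sketch assumes two thresholds, `max_unique` and `max_ratio`, that are illustrative defaults rather than values taken from this guide; tune them to the size of the sample.

```python
def is_low_cardinality(values, max_unique=10, max_ratio=0.5):
    """True when few distinct values cover many rows.

    Both thresholds are assumed defaults for illustration.
    """
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return False
    unique = len(set(non_empty))
    return unique <= max_unique and unique / len(non_empty) <= max_ratio


statuses = ["Open", "Closed", "Pending", "Open", "Open", "Closed"]
names = ["Acme Corp", "Widget Inc", "Globex"]
```

Here `statuses` has 3 distinct values across 6 rows and passes, while `names` has no repeats and fails. With very small samples the ratio is unreliable, so a larger sample gives a more trustworthy answer.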
Person/Rep Columns
Headers identifying people (employees, owners, assignees):
- Direct: rep, owner, assignee, agent, manager, employee
- Indirect: sales_rep, account_owner, assigned_to, created_by, handled_by
- Patterns: names or email addresses, limited unique values
Campaign/Source Columns
Headers indicating origin or attribution:
- Direct: campaign, source, channel, medium, referrer
- Indirect: utm_source, lead_source, marketing_campaign, acquisition_channel
- Patterns: campaign names, channel codes, UTM parameters
Ticket/Issue Columns
Headers related to support cases:
- Direct: ticket, issue, case, incident, request
- Indirect: ticket_id, case_number, issue_description, support_request
- Patterns: IDs (often numeric) or free-text descriptions
Usage/Activity Columns
Headers indicating engagement or usage metrics:
- Direct: usage, logins, sessions, active, engagement
- Indirect: login_count, daily_active, feature_usage, page_views, api_calls
- Patterns: numeric counts, often integers
Confidence Levels
Assign confidence based on signal strength:
- High: Header clearly indicates type AND sample values confirm it
- Medium: Header suggests type OR sample values match, but not both
- Low: Weak signals, ambiguous header, values could fit multiple types
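The three levels reduce to a small decision rule. The sketch below is one possible encoding, assuming the header and value checks have already been boiled down to booleans; `assign_confidence` and its parameter names are hypothetical.

```python
def assign_confidence(header_supports, values_support, fits_multiple_types=False):
    """Map signal strength to 'high', 'medium', or 'low'."""
    if fits_multiple_types:
        # Values that fit several types make the column ambiguous.
        return "low"
    if header_supports and values_support:
        return "high"
    if header_supports or values_support:
        return "medium"
    return "low"
```

For example, a `CSAT_Score` column with values 4, 5, 3 has both signals and rates high, while a `Value` column with 42, 87, 15 fits both amount and score and rates low.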
Output Format
Update the parsed CSV output file with interpretation metadata:
{
  "source": "original file path",
  "rowCount": N,
  "columns": ["header1", "header2", ...],
  "interpretations": {
    "header1": {
      "type": "score|customer|date|amount|category|person|campaign|ticket|usage|text",
      "confidence": "high|medium|low",
      "reasoning": "Brief explanation of why this classification"
    }
  },
  "detected": {
    "scoreColumns": ["columns identified as scores"],
    "customerColumns": ["columns identified as customers"],
    "dateColumns": ["columns identified as dates"]
  },
  "possibleAnalyses": ["what analyses this data supports based on detected columns"],
  "data": [all rows from parsed CSV]
}
The detected object should include keys matching the column types requested in requirements.
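The detected object can be derived from the interpretations rather than written by hand. This is a minimal sketch that assumes the key naming follows the `<type>Columns` pattern seen in the template (`scoreColumns`, `customerColumns`, `dateColumns`); `build_detected` is a hypothetical helper.

```python
def build_detected(interpretations, requested_types):
    """Group headers by interpreted type, one key per requested type.

    Assumes keys are named '<type>Columns', as in the template above.
    """
    detected = {f"{t}Columns": [] for t in requested_types}
    for header, info in interpretations.items():
        key = f"{info['type']}Columns"
        # Types that were not requested are left out of 'detected'.
        if key in detected:
            detected[key].append(header)
    return detected


interpretations = {
    "CSAT_Score": {"type": "score", "confidence": "high", "reasoning": "..."},
    "Account": {"type": "customer", "confidence": "high", "reasoning": "..."},
    "Notes": {"type": "text", "confidence": "medium", "reasoning": "..."},
}
detected = build_detected(interpretations, ["score", "customer", "date"])
```

A requested type with no matching column still gets an empty list, which makes it easy for downstream steps to see that, for instance, no date column was found.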
Handling Ambiguity
When a column could fit multiple types:
- Prefer the type explicitly mentioned in requirements
- Let sample values break ties (dates look like dates, scores look like scores)
- If still ambiguous, note the ambiguity in the reasoning and pick the most likely type based on context
- Flag low confidence so downstream analysis can ask for clarification
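These tie-breaking steps can be applied in order. The sketch below is an illustration under stated assumptions: `resolve_type` is a hypothetical function, `candidates` is assumed to be ordered from most to least likely, and `value_matches` is assumed to list the types the sample values are consistent with.

```python
def resolve_type(candidates, requested_types, value_matches):
    """Pick one type from several candidates and flag leftover ambiguity."""
    if not candidates:
        raise ValueError("at least one candidate type is required")
    # Step 1: prefer types explicitly mentioned in requirements.
    requested = [t for t in candidates if t in requested_types]
    pool = requested or list(candidates)
    # Step 2: let sample values break ties.
    confirmed = [t for t in pool if t in value_matches]
    pool = confirmed or pool
    # Step 3: take the most likely remaining type (first in the list).
    result = {"type": pool[0], "ambiguous": len(pool) > 1}
    if result["ambiguous"]:
        # Step 4: flag low confidence and record the alternatives.
        result["confidence"] = "low"
        result["reasoning"] = "Could also be: " + ", ".join(pool[1:])
    return result


with_hint = resolve_type(["amount", "score"], ["score"], ["amount", "score"])
no_hint = resolve_type(["amount", "score"], [], ["amount", "score"])
```

With a score column requested, a `Value` column resolves cleanly to score; with no hint, it stays ambiguous and is flagged low so a later step can ask for clarification.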
Examples
| Header | Sample Values | Likely Type | Confidence |
|---|---|---|---|
| CSAT_Score | 4, 5, 3, 5, 4 | score | high |
| Customer_Rating | 8.5, 9.0, 7.2 | score | high |
| Account | "Acme Corp", "Widget Inc" | customer | high |
| CustomerID | 10042, 10043, 10044 | customer | high |
| Created | 2024-01-15, 2024-01-16 | date | high |
| Status | "Open", "Closed", "Pending" | category | high |
| Owner | "John Smith", "Jane Doe" | person | high |
| Amount | $50,000, $75,000 | amount | high |
| Notes | "Customer requested..." | text | medium |
| Value | 42, 87, 15 | amount OR score | low (needs context) |