Profile Extraction Guidelines
How to extract persistent user insights from writing samples without overfitting
# Extracting User Profile from Writing Samples
When analyzing writing samples to build a user profile, the goal is to capture **persistent patterns** that reflect who the user is—not transient activity or current projects.
## What to Extract
### Work Domain
Infer the user's field or industry from recurring themes across documents:
- "Works in software development" ✓
- "Working on Project X" ✗ (too specific, transient)
Look for: technical vocabulary, domain concepts, types of problems discussed.
### Interests
Topics or subjects that appear meaningfully across multiple documents:
- Must appear in 2+ documents to suggest genuine interest; this is a minimum, not proof (see The Overfitting Problem)
- Should be specific enough to be meaningful ("machine learning" not "technology")
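The 2+ document rule can be sketched as a per-document count. This is a minimal illustration, not a prescribed implementation: it assumes topics have already been extracted into one list per document, and `candidate_interests` and `min_docs` are hypothetical names.

```python
from collections import Counter


def candidate_interests(doc_topics, min_docs=2):
    """Return topics that occur in at least `min_docs` distinct documents.

    `doc_topics` is assumed to be a list with one list of topic strings
    per document (a hypothetical upstream extraction step produces it).
    """
    counts = Counter()
    for topics in doc_topics:
        # set() so repeated mentions inside one document count only once
        counts.update(set(topics))
    return sorted(topic for topic, n in counts.items() if n >= min_docs)


docs = [
    ["machine learning", "meetings"],
    ["machine learning", "machine learning", "gardening"],
    ["databases"],
]
print(candidate_interests(docs))  # ['machine learning']
```

Counting distinct documents rather than raw mentions keeps one long document from qualifying a topic on its own.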
## What NOT to Extract
- **Specific project names** - These are transient
- **Current tasks or deadlines** - These change constantly
- **Topics from single documents** - Could be one-off research
- **Generic topics** - "meetings", "updates", "notes" tell us nothing
- **Collaborator names** - Privacy concern, not about the user
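One way to apply part of this exclusion list is a simple filter. The sketch below assumes that project and collaborator names are already known from some other source; `GENERIC_TOPICS` and `drop_excluded` are hypothetical names, and the generic set only contains the three examples given above.

```python
# Generic topics named in the list above; a real list would likely be longer.
GENERIC_TOPICS = {"meetings", "updates", "notes"}


def drop_excluded(topics, project_names=(), collaborator_names=()):
    """Remove generic topics, project names, and collaborator names.

    Matching is case-insensitive and exact; how the project and
    collaborator names are detected is outside this sketch.
    """
    excluded = set(GENERIC_TOPICS)
    excluded.update(name.lower() for name in project_names)
    excluded.update(name.lower() for name in collaborator_names)
    return [t for t in topics if t.lower() not in excluded]


topics = ["machine learning", "Meetings", "Project X", "Dana"]
kept = drop_excluded(topics, project_names=["Project X"], collaborator_names=["Dana"])
print(kept)  # ['machine learning']
```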
## The Overfitting Problem
Writing samples capture a snapshot in time. Someone researching a topic for one week might produce 5 documents about it, but that alone does not make it a core interest.
**Signals of persistence:**
- Topic appears across documents with different dates
- Topic relates to their apparent work domain
- Topic shows depth, not just mentions
**Signals of transience:**
- All mentions clustered in time
- Topic unrelated to other themes
- Surface-level mentions only
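The first signal in each list (spread over time versus clustering) can be approximated by the span between the earliest and latest mention. This is a rough sketch: the 30-day threshold and the name `looks_persistent` are assumptions, and the other signals (relation to work domain, depth) are not modeled.

```python
from datetime import date


def looks_persistent(mention_dates, min_span_days=30):
    """Return True if mentions of a topic are spread over enough time.

    `mention_dates` holds the dates of documents mentioning the topic.
    The 30-day default is an illustrative assumption, not a tested value.
    """
    if len(mention_dates) < 2:
        return False
    span = (max(mention_dates) - min(mention_dates)).days
    return span >= min_span_days


# Five documents in one week: many mentions, but clustered in time.
burst = [date(2024, 3, d) for d in (4, 5, 6, 7, 8)]
# Three documents spread over several months.
spread = [date(2024, 1, 10), date(2024, 3, 2), date(2024, 6, 20)]

print(looks_persistent(burst))   # False
print(looks_persistent(spread))  # True
```

The one-week burst fails the check even though it has more documents, which is the overfitting case described above.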
## Output Guidelines
When writing to the user profile:
- Use natural language, not keywords
- Be specific but not overly detailed
- When uncertain, don't write anything—false positives are worse than missing data
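The "write nothing when uncertain" rule can be expressed as a function that returns `None` unless the evidence is sufficient. This is only a sketch: `profile_line`, its parameters, and the sentence template are hypothetical, and the evidence checks are assumed to be computed elsewhere.

```python
def profile_line(topic, doc_count, persistent, min_docs=2):
    """Return a natural-language profile entry, or None when uncertain.

    `doc_count` is the number of documents mentioning the topic and
    `persistent` is the result of some persistence check; both are
    assumed inputs.
    """
    if doc_count < min_docs or not persistent:
        # Abstain: a missing entry is preferred over a wrong one.
        return None
    return f"Has a sustained interest in {topic}."


print(profile_line("machine learning", doc_count=4, persistent=True))
# Has a sustained interest in machine learning.
print(profile_line("gardening", doc_count=5, persistent=False))
# None
```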