Context Slice

Profile Extraction Guidelines

How to extract persistent user insights from writing samples without overfitting

# Extracting User Profile from Writing Samples

When analyzing writing samples to build a user profile, the goal is to capture **persistent patterns** that reflect who the user is—not transient activity or current projects.

## What to Extract

### Work Domain

Infer the user's field or industry from recurring themes across documents:

- "Works in software development" ✓
- "Working on Project X" ✗ (too specific, transient)

Look for: technical vocabulary, domain concepts, types of problems discussed.

### Interests

Topics or subjects that appear meaningfully across multiple documents:

- Must appear in at least two documents to count as a candidate; this is a minimum bar, not proof of genuine interest
- Should be specific enough to be meaningful ("machine learning" not "technology")
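
The two criteria above can be sketched as a small filter. This is a minimal illustration, not a prescribed implementation: the `doc_topics` input shape, the `candidate_interests` name, and the contents of `GENERIC_TOPICS` are all assumptions made for the example, and how topics get tagged per document is left out.

```python
from collections import defaultdict

# Hypothetical stop-list of topics too broad to be meaningful.
GENERIC_TOPICS = {"technology", "meetings", "updates", "notes"}

def candidate_interests(doc_topics, min_docs=2):
    """Return topics found in at least `min_docs` distinct documents."""
    docs_by_topic = defaultdict(set)
    for doc_id, topics in doc_topics.items():
        for topic in topics:
            # Normalize case so "Machine Learning" and "machine learning" match.
            docs_by_topic[topic.lower()].add(doc_id)
    return sorted(
        topic for topic, docs in docs_by_topic.items()
        if len(docs) >= min_docs and topic not in GENERIC_TOPICS
    )

docs = {
    "a.md": ["machine learning", "technology"],
    "b.md": ["Machine Learning", "gardening"],
    "c.md": ["technology"],
}
print(candidate_interests(docs))  # ['machine learning']
```

Here "technology" clears the document count but is dropped as too generic, and "gardening" is dropped because it appears in only one document.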

## What NOT to Extract

- **Specific project names** - These are transient
- **Current tasks or deadlines** - These change constantly
- **Topics from a single document** - These could be one-off research
- **Generic topics** - "meetings", "updates", "notes" tell us nothing
- **Collaborator names** - Privacy concern, not about the user
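
One way to apply the exclusions above is a single predicate run over every candidate before anything is written. The sketch below assumes each candidate has already been labeled with a `kind`; the `Candidate` type, the label names, and `should_skip` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels; how a candidate gets its `kind` is outside this sketch.
EXCLUDED_KINDS = {"project_name", "task", "deadline", "collaborator_name"}
GENERIC_TOPICS = {"meetings", "updates", "notes"}

@dataclass
class Candidate:
    text: str
    kind: str        # e.g. "topic", "project_name", "collaborator_name"
    doc_count: int   # number of distinct documents mentioning it

def should_skip(c: Candidate) -> bool:
    """True if the candidate falls under any of the exclusion rules."""
    return (
        c.kind in EXCLUDED_KINDS
        or c.doc_count < 2                    # single-document topics
        or c.text.lower() in GENERIC_TOPICS   # generic topics
    )

print(should_skip(Candidate("Project X", "project_name", 4)))      # True
print(should_skip(Candidate("distributed systems", "topic", 3)))   # False
```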

## The Overfitting Problem

Writing samples capture a snapshot in time. Someone researching a topic for one week might produce five documents about it, but that alone does not make it a core interest.

**Signals of persistence:**

- Topic appears across documents with different dates
- Topic relates to their apparent work domain
- Topic shows depth, not just mentions

**Signals of transience:**

- All mentions clustered in time
- Topic unrelated to other themes
- Surface-level mentions only
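
The signals above can be combined into a rough check. The guidelines do not say how to weight them, so the rule used here (temporal spread is required, plus at least one of the other two signals) and the 30-day threshold are assumptions chosen for illustration, as is the `looks_persistent` name.

```python
from datetime import date

def looks_persistent(mention_dates, related_to_domain, has_depth, min_span_days=30):
    """Heuristic: mentions must be spread over time AND show one other signal."""
    if not mention_dates:
        return False
    # Days between the earliest and latest document mentioning the topic.
    span_days = (max(mention_dates) - min(mention_dates)).days
    spread_over_time = span_days >= min_span_days
    return spread_over_time and (related_to_domain or has_depth)

# A one-week burst of documents: clustered in time, so treated as transient.
burst = [date(2024, 3, 4), date(2024, 3, 5), date(2024, 3, 8)]
print(looks_persistent(burst, related_to_domain=True, has_depth=True))   # False

# Mentions months apart that also relate to the work domain.
spread = [date(2024, 1, 10), date(2024, 4, 2)]
print(looks_persistent(spread, related_to_domain=True, has_depth=False)) # True
```

Requiring temporal spread reflects the snapshot concern directly: a burst of documents never qualifies on volume alone.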

## Output Guidelines

When writing to the user profile:

- Use natural language, not keywords
- Be specific but not overly detailed
- When uncertain, don't write anything—false positives are worse than missing data
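
As a sketch of the last rule, a writer function can return nothing at all when confidence is low, and a full sentence otherwise. The `profile_line` name, the `confidence` score, the 0.8 threshold, and the sentence templates are all hypothetical; the guidelines do not define how confidence is measured.

```python
from typing import Optional

def profile_line(kind: str, value: str, confidence: float,
                 threshold: float = 0.8) -> Optional[str]:
    """Return a natural-language profile sentence, or None when unsure."""
    if confidence < threshold:
        return None  # prefer missing data over a false positive
    templates = {
        "work_domain": "Works in {value}.",
        "interest": "Has a sustained interest in {value}.",
    }
    template = templates.get(kind)
    # Unknown kinds also produce nothing rather than a guessed sentence.
    return template.format(value=value) if template else None

print(profile_line("work_domain", "software development", 0.9))
# Works in software development.
print(profile_line("interest", "machine learning", 0.5))
# None
```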