Gmail Profile Extraction
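You have two data sources: the categorized email discoveries described first, and the writing samples described under Source 2. Extract different things from each. Each discovery category has a tier field (1-5) indicating confidence level, with Tier 1 the highest.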
Extraction Transparency
When analyzing, mentally track what you're doing:
- Extracted: What was added and why (include in source attribution)
- Rejected: What was skipped and why (marketing, ambiguous, etc.)
- Low confidence: Potentially valid but uncertain - flag for user
This helps ensure quality and lets you explain decisions if asked.
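One way to picture this tracking is a small log with three buckets. This is only a sketch: the names `Decision` and `ExtractionLog`, their fields, and the sample entries are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Decision:
    category: str      # discovery category, e.g. "pets"
    value: str         # the candidate fact or snippet
    reason: str        # why it was kept, skipped, or flagged
    source: str = ""   # hypothetical source attribution


@dataclass
class ExtractionLog:
    extracted: List[Decision] = field(default_factory=list)
    rejected: List[Decision] = field(default_factory=list)
    low_confidence: List[Decision] = field(default_factory=list)

    def summary(self) -> str:
        # Short recap that can be given if the user asks what happened.
        return (f"{len(self.extracted)} extracted, "
                f"{len(self.rejected)} rejected, "
                f"{len(self.low_confidence)} flagged")


log = ExtractionLog()
log.extracted.append(
    Decision("pets", "dog named Biscuit", "user wrote it", "sent email"))
log.rejected.append(Decision("children", "Kids sale!", "marketing"))
log.low_confidence.append(
    Decision("location", "Austin, TX", "single shipping address"))
```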
Discovery Tiers and Categories
Tier 1: High-Confidence Personal Facts
These come from the user's own sent emails or from verification codes, so they carry the highest confidence.
| Category | What to Extract |
|---|---|
| children | Names, ages, schools mentioned by user |
| partner | Name, relationship type (user wrote about them) |
| pets | Names, species/breed (user mentioned) |
| phone_numbers | Phone numbers user shared ("my number is...") |
| birthday | Birth date from birthday greetings received |
| location | City, state, zip from shipping confirmations |
| whatsapp, signal_app, telegram, slack | Account confirmation (proves platform usage) |
Tier 2: Tool Usage (High Signal)
Shows what the user actively works with.
| Category | What to Extract |
|---|---|
| saas_trials | Tools they're actively exploring |
| receipts | Services they pay for (high commitment signal) |
| project_mgmt | PM tools in active use (Linear, Jira, Asana) |
| password_mgr | Security tool preference |
| video_platform | Preferred video call platform |
| subscriptions | Tools/services they signed up for |
Tier 3: Professional Interests
| Category | What to Extract |
|---|---|
| conferences | Events they registered for (professional interests) |
| newsletters | Content they follow (Substack, etc.) |
| certifications | Skills they're developing |
| instagram, linkedin, twitter, github | Platform presence (usernames if visible) |
Tier 4: Infrastructure (Tech Users)
| Category | What to Extract |
|---|---|
| hosting | Deployment platforms (Vercel, Netlify, etc.) |
| domains | Domain ownership (side projects/business) |
| cloud_storage | Sharing patterns |
Tier 5: Lifestyle Signals
| Category | What to Extract |
|---|---|
| amazon | Shopping patterns |
| banking, investments | Institution names only (never account numbers) |
| travel | Frequent destinations, airlines |
| health | Healthcare providers (never conditions) |
| spotify, netflix, discord | Entertainment platforms |
Confidence Levels for Extracted Facts
When writing facts, mentally categorize by confidence:
High Confidence — Extract Freely
- Tier 1 categories (user wrote it or verification code)
- User's own sent emails mentioning personal details
- Receipts/confirmations with user's name in To: field
Medium Confidence — Extract with Context
- Tier 2-3 categories (tool signups, subscriptions)
- Shipping addresses (could be gifts — note uncertainty)
- Single mentions of tools/interests
Low Confidence — Flag or Skip
- Generic service emails with no specific details
- Promotional content that mentions personal concepts
- Ambiguous signals that could go either way
When uncertain, add context like "possibly" or "appears to use" rather than stating as fact.
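As a rough illustration of how confidence can shape the wording of a written fact, the sketch below states high-confidence facts plainly, qualifies medium-confidence ones, and skips low-confidence ones. The `Confidence` enum and `render_fact` helper are hypothetical, and the choice of "possibly" as the qualifier is just one of the phrasings suggested above.

```python
from enum import Enum
from typing import Optional


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


def render_fact(label: str, detail: str,
                confidence: Confidence) -> Optional[str]:
    """Return the line to write, or None when the fact should be skipped."""
    if confidence is Confidence.HIGH:
        return f"{label}: {detail}"
    if confidence is Confidence.MEDIUM:
        # Qualify rather than state as fact.
        return f"{label}: possibly {detail}"
    # Low confidence: do not write; flag for the user instead.
    return None
```

For example, a shipping address seen once would go through the medium branch and be written with the qualifier, while a generic promotional mention would return None.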
Recency Matters
Use the timeAgo field to weight information:
- Location: Prefer recent addresses (2mo ago > 3y ago).
- Job/Tools: Recent usage overrides old patterns
- Partner/children: Older mentions are fine (stable facts)
- Phone numbers: Prefer recent (numbers change)
- Platform accounts: Any age confirms existence
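The recency preference can be sketched as follows. The format of timeAgo is an assumption here: the only examples in this document are "2mo ago" and "3y ago", so the parser below accepts a number followed by d, w, mo, or y, and the day counts per unit are approximations. `age_in_days` and `most_recent` are hypothetical helper names.

```python
import re
from typing import List, Optional, Tuple

# Approximate days per unit; "d" and "w" are assumed units.
_UNIT_DAYS = {"d": 1, "w": 7, "mo": 30, "y": 365}
_PATTERN = re.compile(r"^\s*(\d+)\s*(d|w|mo|y)\s+ago\s*$")


def age_in_days(time_ago: str) -> Optional[int]:
    """Convert a string like '2mo ago' to an approximate age in days."""
    match = _PATTERN.match(time_ago)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_DAYS[match.group(2)]


def most_recent(candidates: List[Tuple[str, str]]) -> Optional[str]:
    """Pick the value with the smallest age from (value, timeAgo) pairs.

    Candidates whose timeAgo cannot be parsed are ignored.
    """
    dated = []
    for value, time_ago in candidates:
        age = age_in_days(time_ago)
        if age is not None:
            dated.append((age, value))
    if not dated:
        return None
    return min(dated, key=lambda pair: pair[0])[1]
```

This kind of selection suits location, job/tools, and phone numbers. For stable facts such as partner or children, and for platform accounts, age would not be used to discard a candidate.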
Ignore These Patterns
Even with targeted queries, noise gets through. Skip:
Marketing/Promotional:
- Emails with "unsubscribe" in footer but no personal info
- "Your husband will love this!" — not about THEIR husband
- Generic "dear customer" or "dear member" emails
False Positives:
| Category | Ignore |
|---|---|
| children | "Kids sale!" "For your kids" (marketing) |
| partner | "Gift for your wife", "business partner", "design partner" |
| location | Gift shipping addresses (name doesn't match user) |
| conferences | Spam conference invites (not actual registrations) |
Signal vs Noise:
- Signal: Specific names, dates, confirmation language, user in To: field
- Noise: Generic language, bulk sender patterns, promotional tone
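A crude text heuristic for the noise patterns above might look like the sketch below. It is deliberately simplistic and the phrase lists are assumptions chosen for illustration; real judgment should also weigh sender patterns, tone, and whether the user is in the To: field.

```python
GENERIC_GREETINGS = ("dear customer", "dear member")

# Assumed examples of confirmation language; not an exhaustive list.
CONFIRMATION_PHRASES = (
    "order confirmation",
    "verification code",
    "your registration is confirmed",
    "receipt",
)


def is_probable_noise(body: str) -> bool:
    """Return True when an email body looks promotional or generic."""
    text = body.lower()
    if any(greeting in text for greeting in GENERIC_GREETINGS):
        return True
    has_confirmation = any(p in text for p in CONFIRMATION_PHRASES)
    # An unsubscribe footer alone is not disqualifying: receipts have them
    # too. Treat it as noise only when no confirmation language is present.
    if "unsubscribe" in text and not has_confirmation:
        return True
    return False
```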
Source 2: Writing Samples
The writing samples contain sent email content. Extract persistent patterns, not transient activity.
Extract:
- Work domain: Infer field/industry from recurring themes, technical vocabulary
- Interests: Topics appearing meaningfully across multiple emails (2+ mentions)
Do NOT extract:
- Specific project names (transient)
- Current tasks or deadlines (changes constantly)
- Topics from single emails (could be one-off)
- Collaborator names (privacy concern)
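The "2+ mentions" rule for interests can be illustrated by counting in how many distinct emails each topic appears. The sketch assumes topics have already been identified per email by some earlier step; `recurring_topics` and the sample topics are hypothetical.

```python
from collections import Counter
from typing import List


def recurring_topics(emails: List[List[str]], min_emails: int = 2) -> List[str]:
    """Return topics that appear in at least `min_emails` different emails."""
    counts: Counter = Counter()
    for topics in emails:
        # Count each topic once per email so repetition inside a single
        # email does not look like a persistent pattern.
        counts.update(set(topics))
    return sorted(t for t, n in counts.items() if n >= min_emails)


sample = [
    ["cycling", "kubernetes"],
    ["kubernetes", "kubernetes"],
    ["cycling"],
    ["sourdough"],
]
persistent = recurring_topics(sample)
```

Here "sourdough" is dropped because it appears in a single email, which matches the rule that one-off topics should not be extracted.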
Output
Write to User Profile Facts:
- personal.md — location, family, pets, birthday
- interests.md — platforms with usernames, tools, topics of genuine interest
- goals.md — only if explicit evidence (certifications in progress, conferences)
Write to Work Overview:
- Work domain, industry, role (only if a clear pattern emerges from email signatures or calendar invites)
Rules: Only write facts with clear evidence. Skip weak signals. Never write sensitive financial details. When uncertain, add qualifying language or skip entirely—false positives are worse than missing data.
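The file routing described in this section can be sketched as a category-to-file lookup. The mapping below is an assumption that covers only categories the section names explicitly (treating children and partner as "family"); anything unmapped is skipped rather than guessed. `FILE_FOR_CATEGORY` and `route` are illustrative names.

```python
from typing import Dict, List, Tuple

# Assumed mapping from discovery category to target file.
FILE_FOR_CATEGORY: Dict[str, str] = {
    "location": "personal.md",
    "children": "personal.md",
    "partner": "personal.md",
    "pets": "personal.md",
    "birthday": "personal.md",
    "github": "interests.md",
    "linkedin": "interests.md",
    "newsletters": "interests.md",
    "project_mgmt": "interests.md",
    "certifications": "goals.md",
    "conferences": "goals.md",
}


def route(facts: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (category, fact text) pairs by the file they belong in."""
    by_file: Dict[str, List[str]] = {}
    for category, text in facts:
        target = FILE_FOR_CATEGORY.get(category)
        if target is None:
            # No clear destination: skip instead of writing a weak signal.
            continue
        by_file.setdefault(target, []).append(text)
    return by_file
```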