Context Slice

Gmail Profile Extraction

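You have two data sources: discovery categories from targeted Gmail queries (Source 1) and writing samples from sent email (Source 2). Extract different things from each. Each discovery category has a tier field (1-5) indicating confidence level, with Tier 1 the highest.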

Extraction Transparency

When analyzing, mentally track what you're doing:

  • Extracted: What was added and why (include in source attribution)
  • Rejected: What was skipped and why (marketing, ambiguous, etc.)
  • Low confidence: Potentially valid but uncertain - flag for user

This helps ensure quality and lets you explain decisions if asked.

Discovery Tiers and Categories

Tier 1: High-Confidence Personal Facts

These come from the user's own sent emails or from verification codes, so they carry the highest confidence.

| Category | What to Extract |
|---|---|
| children | Names, ages, schools mentioned by user |
| partner | Name, relationship type (user wrote about them) |
| pets | Names, species/breed (user mentioned) |
| phone_numbers | Phone numbers user shared ("my number is...") |
| birthday | Birth date from birthday greetings received |
| location | City, state, zip from shipping confirmations |
| whatsapp, signal_app, telegram, slack | Account confirmation (proves platform usage) |

Tier 2: Tool Usage (High Signal)

Shows what the user actively works with.

| Category | What to Extract |
|---|---|
| saas_trials | Tools they're actively exploring |
| receipts | Services they pay for (high commitment signal) |
| project_mgmt | PM tools in active use (Linear, Jira, Asana) |
| password_mgr | Security tool preference |
| video_platform | Preferred video call platform |
| subscriptions | Tools/services they signed up for |

Tier 3: Professional Interests

| Category | What to Extract |
|---|---|
| conferences | Events they registered for (professional interests) |
| newsletters | Content they follow (Substack, etc.) |
| certifications | Skills they're developing |
| instagram, linkedin, twitter, github | Platform presence (usernames if visible) |

Tier 4: Infrastructure (Tech Users)

| Category | What to Extract |
|---|---|
| hosting | Deployment platforms (Vercel, Netlify, etc.) |
| domains | Domain ownership (side projects/business) |
| cloud_storage | Sharing patterns |

Tier 5: Lifestyle Signals

| Category | What to Extract |
|---|---|
| amazon | Shopping patterns |
| banking, investments | Institution names only (never account numbers) |
| travel | Frequent destinations, airlines |
| health | Healthcare providers (never conditions) |
| spotify, netflix, discord | Entertainment platforms |
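
One way to make the tier field concrete is a lookup from category name to tier. The sketch below is illustrative only: the names `TIER_BY_CATEGORY` and `tier_of` are hypothetical, and the mapping simply restates the five tier lists above.

```python
from typing import Dict, List, Optional

# Category names copied from the tier lists above, grouped by tier number.
_CATEGORIES_BY_TIER: Dict[int, List[str]] = {
    1: ["children", "partner", "pets", "phone_numbers", "birthday", "location",
        "whatsapp", "signal_app", "telegram", "slack"],
    2: ["saas_trials", "receipts", "project_mgmt", "password_mgr",
        "video_platform", "subscriptions"],
    3: ["conferences", "newsletters", "certifications",
        "instagram", "linkedin", "twitter", "github"],
    4: ["hosting", "domains", "cloud_storage"],
    5: ["amazon", "banking", "investments", "travel", "health",
        "spotify", "netflix", "discord"],
}

# Invert the grouping so a category name resolves directly to its tier.
TIER_BY_CATEGORY: Dict[str, int] = {
    name: tier for tier, names in _CATEGORIES_BY_TIER.items() for name in names
}


def tier_of(category: str) -> Optional[int]:
    """Return the tier (1-5) for a known category, or None if it is not listed."""
    return TIER_BY_CATEGORY.get(category)
```

Returning `None` for unlisted categories leaves the decision about unknown inputs to the caller rather than guessing a tier.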

Confidence Levels for Extracted Facts

When writing facts, mentally categorize by confidence:

High Confidence — Extract Freely

  • Tier 1 categories (user wrote it or verification code)
  • User's own sent emails mentioning personal details
  • Receipts/confirmations with user's name in To: field

Medium Confidence — Extract with Context

  • Tier 2-3 categories (tool signups, subscriptions)
  • Shipping addresses (could be gifts — note uncertainty)
  • Single mentions of tools/interests

Low Confidence — Flag or Skip

  • Generic service emails with no specific details
  • Promotional content that mentions personal concepts
  • Ambiguous signals that could go either way

When uncertain, add context like "possibly" or "appears to use" rather than stating as fact.
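
The hedging rule above can be sketched as a small helper that picks wording from a confidence label. This is a minimal illustration, not a prescribed format: the function name `describe_tool_usage` and the label strings "high", "medium", and "low" are assumptions, and the medium-confidence phrasing reuses the "appears to use" example from the text.

```python
from typing import Optional


def describe_tool_usage(tool: str, confidence: str) -> Optional[str]:
    """Return a fact string worded to match its confidence, or None to skip it."""
    if confidence == "high":
        # High confidence: state as fact.
        return f"Uses {tool}"
    if confidence == "medium":
        # Medium confidence: add qualifying language.
        return f"Appears to use {tool}"
    if confidence == "low":
        # Low confidence: skip here; the caller may flag it for the user instead.
        return None
    raise ValueError(f"unknown confidence label: {confidence!r}")
```

For example, `describe_tool_usage("Linear", "medium")` yields "Appears to use Linear", while a low-confidence signal produces nothing to write.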

Recency Matters

Use the timeAgo field to weight information:

  • Location: Prefer recent addresses (2mo ago outweighs 3y ago)
  • Job/Tools: Recent usage overrides old patterns
  • Partner/children: Older mentions are fine (stable facts)
  • Phone numbers: Prefer recent (numbers change)
  • Platform accounts: Any age confirms existence
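
As a rough sketch of the "prefer recent" rule for location and phone numbers, the helper below converts a timeAgo string to an approximate age and picks the newest candidate. The string format is an assumption: only "2mo ago" and "3y ago" appear in this document, so the `d` and `w` units, the 30-day month, and the 365-day year are guesses, and `age_in_days` and `most_recent` are hypothetical names.

```python
import re
from typing import Dict, List

# Approximate day counts per unit; "d" and "w" are assumed, not documented.
_UNIT_DAYS = {"d": 1, "w": 7, "mo": 30, "y": 365}
_TIME_AGO = re.compile(r"^(\d+)\s*(d|w|mo|y)\s+ago$")


def age_in_days(time_ago: str) -> int:
    """Convert a string such as '2mo ago' into an approximate age in days."""
    match = _TIME_AGO.match(time_ago.strip())
    if match is None:
        raise ValueError(f"unrecognized timeAgo value: {time_ago!r}")
    count, unit = match.groups()
    return int(count) * _UNIT_DAYS[unit]


def most_recent(candidates: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick the candidate with the smallest age; each needs a 'timeAgo' key."""
    return min(candidates, key=lambda c: age_in_days(c["timeAgo"]))
```

This applies only to facts that change over time. For stable facts such as partner or children, and for platform accounts, the age of the mention should not be used to discard it.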

Ignore These Patterns

Even with targeted queries, noise gets through. Skip:

Marketing/Promotional:

  • Emails with "unsubscribe" in footer but no personal info
  • "Your husband will love this!" — not about THEIR husband
  • Generic "dear customer" or "dear member" emails

False Positives:

| Category | Ignore |
|---|---|
| children | "Kids sale!", "For your kids" (marketing) |
| partner | "Gift for your wife", "business partner", "design partner" |
| location | Gift shipping addresses (name doesn't match user) |
| conferences | Spam conference invites (not actual registrations) |

Signal vs Noise:

  • Signal: Specific names, dates, confirmation language, user in To: field
  • Noise: Generic language, bulk sender patterns, promotional tone
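
The phrase-level false positives listed above could be checked with a simple substring filter. The sketch below is a heuristic illustration only: `FALSE_POSITIVE_PHRASES` and `is_false_positive` are hypothetical names, the phrase list covers just the quoted examples for children and partner, and real filtering would still need the judgment described in this section.

```python
from typing import Dict, Tuple

# Lowercased marketing or off-topic phrases, taken from the examples above.
FALSE_POSITIVE_PHRASES: Dict[str, Tuple[str, ...]] = {
    "children": ("kids sale", "for your kids"),
    "partner": ("gift for your wife", "business partner", "design partner"),
}


def is_false_positive(category: str, text: str) -> bool:
    """Return True if the text contains a known false-positive phrase for the category."""
    lowered = text.lower()
    phrases = FALSE_POSITIVE_PHRASES.get(category, ())
    return any(phrase in lowered for phrase in phrases)
```

A match is a reason to reject the email for that category; the absence of a match does not by itself make the email a valid signal.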

Source 2: Writing Samples

The writing samples contain sent email content. Extract persistent patterns, not transient activity.

Extract:

  • Work domain: Infer field/industry from recurring themes, technical vocabulary
  • Interests: Topics appearing meaningfully across multiple emails (2+ mentions)

Do NOT extract:

  • Specific project names (transient)
  • Current tasks or deadlines (changes constantly)
  • Topics from single emails (could be one-off)
  • Collaborator names (privacy concern)
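
The "2+ mentions" rule for interests can be sketched as counting how many distinct emails mention each topic. This assumes topics have already been extracted per email by some earlier step that is not specified here; `recurring_topics` and its `min_emails` parameter are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List


def recurring_topics(topics_per_email: Iterable[Iterable[str]],
                     min_emails: int = 2) -> List[str]:
    """Return topics that appear in at least min_emails separate emails."""
    email_counts: Counter = Counter()
    for topics in topics_per_email:
        # Use a set so repeated mentions inside one email count only once.
        email_counts.update(set(topics))
    return sorted(topic for topic, n in email_counts.items() if n >= min_emails)
```

Counting per email rather than per mention keeps a single long email about a one-off topic from passing the threshold.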

Output

Write to User Profile Facts:

  • personal.md — location, family, pets, birthday
  • interests.md — platforms with usernames, tools, topics of genuine interest
  • goals.md — only if explicit evidence (certifications in progress, conferences)
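
The file layout above might be expressed as a routing table from category to target file. The mapping below is an assumption inferred from the file descriptions: "family" is read as the children and partner categories, and interests.md is limited to the four platform categories for brevity. `PROFILE_FILES` and `target_file` are hypothetical names.

```python
from typing import Dict, Optional, Set

# Assumed routing, inferred from the file descriptions above.
PROFILE_FILES: Dict[str, Set[str]] = {
    "personal.md": {"location", "children", "partner", "pets", "birthday"},
    "interests.md": {"instagram", "linkedin", "twitter", "github"},
    "goals.md": {"certifications", "conferences"},
}


def target_file(category: str) -> Optional[str]:
    """Return the profile file a category is routed to, or None if unmapped."""
    for filename, categories in PROFILE_FILES.items():
        if category in categories:
            return filename
    return None
```

Unmapped categories return `None`, so the caller has to decide explicitly whether a fact belongs anywhere rather than writing it by default.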

Write to Work Overview:

  • Work domain, industry, role (only if a clear pattern emerges from email signatures or calendar invites)

Rules: Only write facts with clear evidence. Skip weak signals. Never write sensitive financial details. When uncertain, add qualifying language or skip entirely—false positives are worse than missing data.