Experiment Design Framework
Elements of a Good Experiment
- Clear hypothesis — What do you believe and why?
- Measurable outcome — What metric will change?
- Defined segments — Who sees what?
- Statistical rigor — Sample size and duration
- Decision criteria — What results mean what action?
Hypothesis Formula
Template: "If we [change], then [metric] will [direction] by [amount], because [rationale]."
Example: "If we add progress indicators to onboarding, then completion rate will increase by 15%, because users will have clearer expectations and motivation to continue."
Experiment Design Template
# A/B Test: [Test Name]
## Overview
**Owner:** [Who's responsible]
**Status:** [Proposed / Running / Completed]
**Duration:** [Start date] to [End date]
## Hypothesis
If we [specific change], then [metric] will [increase/decrease] by [expected amount], because [reasoning based on user behavior or prior evidence].
## Changes Being Tested
### Control (A)
[Description of current experience]
### Variant (B)
[Description of changed experience]
[Include mockups or screenshots if available]
## Success Metrics
### Primary Metric
**Metric:** [Name]
**Current baseline:** [Value]
**Target:** [Value or % change]
**Why this metric:** [Connection to hypothesis]
### Secondary Metrics
| Metric | Baseline | Watch for |
|--------|----------|-----------|
| [Metric 1] | [Value] | [Expected direction] |
| [Metric 2] | [Value] | [Expected direction] |
### Guardrail Metrics
Metrics that must not degrade while the experiment runs:
- [Metric] — Acceptable range: [Range]
- [Metric] — Acceptable range: [Range]
## Experiment Setup
### Traffic Allocation
- Control: [X%]
- Variant: [X%]
### User Segments
**Included:** [Who's in the experiment]
**Excluded:** [Who's excluded and why]
### Sample Size & Duration
- **Minimum sample size:** [N per variant]
- **Estimated duration:** [Days/weeks to reach significance]
- **Statistical significance threshold:** [Usually 95%]
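To fill in the sample-size row, a quick estimate helps. A minimal sketch using only Python's standard library (normal approximation for a two-proportion test; the function name and the 30% baseline / +4.5pp lift figures are illustrative, not from any specific product):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, absolute_lift, alpha=0.05, power=0.8):
    """Approximate minimum n per variant to detect an absolute lift in a
    conversion rate, using the normal approximation for two proportions."""
    p1, p2 = baseline, baseline + absolute_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Illustrative: 30% baseline completion, hoping for a 15% relative lift (+4.5pp)
n = sample_size_per_variant(0.30, 0.045)
# Divide n by expected daily entrants per variant to estimate duration in days.
```

Dividing the per-variant n by daily traffic per variant gives the duration estimate for the template.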
## Decision Framework
| Result | Action |
|--------|--------|
| Variant wins significantly | Ship to 100% |
| Variant wins marginally | Consider extending test or iterating |
| No significant difference | Evaluate cost; may ship simpler version |
| Control wins | Don't ship; analyze why hypothesis was wrong |
| Guardrails violated | Stop test, investigate |
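The "wins significantly" rows presume a significance test behind the scenes. A minimal sketch of a pooled two-proportion z-test (stdlib only; the 95% threshold matches the template default, and the counts in the usage lines are made up):

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.
    Returns (z, p_value); p_value < 0.05 corresponds to 95% significance."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical readout: 300/1000 conversions in control vs 360/1000 in variant
z, p = two_proportion_z_test(300, 1000, 360, 1000)
```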
## Risks & Mitigations
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| [Risk 1] | [H/M/L] | [How to address] |
| [Risk 2] | [H/M/L] | [How to address] |
## Post-Test Analysis Plan
- [What additional analysis we'll do]
- [Segments to investigate]
- [Follow-up experiments to consider]
Multi-Variant Tests
When testing more than one variant:
## Variants
### Control (A): [Name]
[Description]
### Variant B: [Name]
[Description]
### Variant C: [Name]
[Description]
## Traffic Split
- Control (A): [X%]
- Variant B: [X%]
- Variant C: [X%]
## Comparison Plan
- Compare B vs A (primary comparison)
- Compare C vs A
- Compare B vs C (if both beat control)
- Adjust for multiple comparisons (e.g., a Bonferroni correction), since testing several variants against control inflates false-positive risk
Experimentation Tips
Before Running
- Validate tracking — Can you actually measure what you need?
- Check for conflicts — Other tests running on same users?
- Document baseline — Know your starting point precisely
- Align stakeholders — Everyone agrees on decision criteria?
While Running
- Don't peek too often — Multiple looks increase false positives
- Watch for bugs — Variant errors can invalidate results
- Monitor guardrails — Stop if something breaks
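The "don't peek" advice can be made concrete with an A/A simulation: both arms have the identical conversion rate, so every "significant" reading is a false positive. A sketch with all sizes and rates chosen purely for illustration:

```python
import random
from statistics import NormalDist

def peeking_vs_single_look(trials=400, peeks=10, n_per_peek=200,
                           alpha=0.05, seed=1):
    """A/A simulation: both arms convert at 30%, so any significant result
    is a false positive. Returns (false-positive rate when stopping at any
    significant checkpoint, rate with a single look at the end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    any_look_fp = final_look_fp = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        significant_now = significant_ever = False
        for _ in range(peeks):
            for _ in range(n_per_peek):
                conv_a += rng.random() < 0.30  # control conversions
                conv_b += rng.random() < 0.30  # variant conversions
            n += n_per_peek
            pool = (conv_a + conv_b) / (2 * n)
            se = (pool * (1 - pool) * (2 / n)) ** 0.5
            significant_now = se > 0 and abs(conv_b - conv_a) / n / se > z_crit
            significant_ever = significant_ever or significant_now
        any_look_fp += significant_ever
        final_look_fp += significant_now
    return any_look_fp / trials, final_look_fp / trials
```

With ten interim looks, the "stop on first significant result" policy fires well above the nominal 5% rate, while the single final look stays near it.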
After Running
- Segment analysis — Did it work differently for different users?
- Learn from losses — Failed tests teach more than wins
- Document everything — Future you will thank past you
Common Pitfalls
- Underpowered tests — Not enough traffic to detect real effects
- Too many metrics — With enough metrics, something will be "significant"
- Stopping early — That early winner might regress to the mean
- Ignoring segments — Average hides important differences
- No baseline — Can't measure change without a starting point
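The "too many metrics" pitfall is just arithmetic: with k independent metrics each tested at significance level alpha, the chance that at least one crosses the threshold by luck is 1 − (1 − alpha)^k. A tiny illustration (the function name is ours):

```python
def chance_of_a_fluke(num_metrics, alpha=0.05):
    """Probability that at least one of k independent metrics looks
    'significant' purely by chance at level alpha."""
    return 1 - (1 - alpha) ** num_metrics

# With 20 secondary metrics at 95% significance, odds of a spurious
# "win" somewhere in the dashboard are roughly 64%.
print(round(chance_of_a_fluke(20), 3))
```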
After the Experiment
Once results are in, export your experiment data as CSV and use the Product Data Analyzer skill to interpret results. It can calculate statistical significance, analyze segment effects, and provide ship/no-ship recommendations based on your data.