Experimentation Playbook: 10 A/B Tests a Month as a Solo Founder

Most founders run one A/B test a month — maybe two — and wonder why nothing compounds. The math is brutal: at a 20% win rate, that’s 2 wins per year. Barely a rounding error. But 10 tests a month? That’s 24 wins, each stacking on the last. I’ve built a system that gets there in 3.5 hours of monthly effort. Here’s the full framework: hypothesis generation, templated process, and the right tools.

Why 10 Tests a Month Instead of 2

Experimentation works through volume. One test per month is 12 hypotheses per year. At an average win rate of 15–20% (industry norm), that’s 2 successful tests. Two improvements over 12 months don’t compound into anything.

10 tests per month is 120 hypotheses. At the same win rate, that's 18–24 wins. Each win delivers a 3–10% lift in a metric. 20 wins at an average 5% improvement compound to 165% cumulative growth over the year (1.05^20 ≈ 2.65), provided the lifts land on the same metric and stack multiplicatively. Not linear: exponential.
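The compounding arithmetic is quick to reproduce. The sketch below is illustrative only: the function names are made up for this example, and it assumes every win lifts the same metric and that lifts multiply cleanly, which real experiments rarely do.

```javascript
// Expected number of winning tests per year for a given cadence and win rate.
function expectedWins(testsPerMonth, winRate) {
  return testsPerMonth * 12 * winRate
}

// Cumulative multiplier when each win lifts the metric by `liftPerWin`
// and the lifts stack multiplicatively (an assumption of this sketch).
function compoundedMultiplier(wins, liftPerWin) {
  return Math.pow(1 + liftPerWin, wins)
}

const multiplier = compoundedMultiplier(20, 0.05)
console.log(multiplier.toFixed(2))                                        // 2.65
console.log(((multiplier - 1) * 100).toFixed(0) + '% cumulative growth')  // 165% cumulative growth
```

Swapping in 2 wins (the one-test-a-month scenario) gives a multiplier of about 1.10, which is the gap the rest of this playbook is trying to close.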

The bottleneck for a solo founder isn’t tools, it’s time. Without a templated process, one experiment from hypothesis to analysis takes 45–90 minutes of hands-on work. With a framework, it takes 15–20 minutes. 10 tests × 20 minutes = 200 minutes per month, just under three and a half hours. The tools handle the rest.

Hypothesis Generation Framework

The most common mistake is testing what “feels right.” Your intuition is valuable but biased. Systematic hypothesis generation covers the blind spots you can’t see.

Four Sources of Hypotheses

1. Data (quantitative analysis). The conversion funnel tells you where users drop off. If 60% leave at the payment page — that’s not a “UX problem.” That’s a specific point that spawns 3–5 hypotheses: price, payment methods, trust signals, form length, guest checkout.

2. Feedback (qualitative analysis). Reviews, support tickets, session recordings (Hotjar, PostHog Recordings). One complaint pattern = one hypothesis. “I don’t understand what I’m paying for” maps directly to a hypothesis about the value proposition on the pricing page. If you’ve already identified your product’s magic number for retention, test the activation flow that drives users toward it.

3. Competitors. Not copying — systematic observation. A competitor added social proof to their landing page? Hypothesis: testimonials on the homepage will lift sign-up conversion by X%.

4. LLM assistant. The prompt below generates 10–15 hypotheses in a single call. Filter by ICE scoring. (If your product itself uses LLMs, you can also A/B test the prompts themselves — same framework, different unit of change.)

Hypothesis Generation Prompt

Context:
- Product: [product description, target audience]
- Current metrics: [sign-up conversion X%, activation rate Y%, churn Z%]
- Last 3 tests: [brief results]
- Known issues: [from analytics and feedback]

Task: Generate 15 A/B test hypotheses.

Format for each hypothesis:
1. Hypothesis: "If [change], then [metric] will [increase/decrease] by [expected effect], because [rationale]"
2. Metric: primary metric to measure
3. ICE Score: Impact (1-10), Confidence (1-10), Ease (1-10), Total
4. Minimum sample size to detect the effect

Requirements:
- Variety: cover onboarding, activation, retention, monetization, referral
- At least 5 hypotheses with Ease >= 7 (implementable in 1-2 hours)
- Flag hypotheses that contradict each other

The output is a prioritized list. Top 10 by ICE score go into the queue. The rest go into backlog for next month.

ICE Scoring: Quick Prioritization

Parameter	What It Measures	Example (High Score)
Impact	Potential effect on the metric	Changing the main CTA
Confidence	How sure you are the change will work	Data confirms the problem
Ease	How little effort implementation takes	Changing button text

Total = I × C × E. Tests with Total ≥ 343 (the equivalent of scoring 7 on all three): launch immediately. 125–342: current sprint. Below 125: backlog. If you’re familiar with RICE scoring for backlog prioritization, ICE is the same idea stripped down for speed: no “Reach” component, because experiments target a defined audience segment anyway.
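The scoring and bucketing rules fit in a few lines. This is a minimal sketch: the function names and the sample hypotheses are hypothetical, and the thresholds are the ones stated above.

```javascript
// ICE total is the product of the three 1-10 scores.
function iceTotal({ impact, confidence, ease }) {
  return impact * confidence * ease
}

// Buckets: 343 and up launches now, 125 and up goes into the sprint, the rest is backlog.
function iceBucket(total) {
  if (total >= 343) return 'launch'
  if (total >= 125) return 'sprint'
  return 'backlog'
}

// Rank hypotheses by ICE total (descending) and split into queue and backlog.
function prioritize(hypotheses, queueSize = 10) {
  const ranked = hypotheses
    .map((h) => ({ ...h, total: iceTotal(h) }))
    .sort((a, b) => b.total - a.total)
  return { queue: ranked.slice(0, queueSize), backlog: ranked.slice(queueSize) }
}

// Hypothetical example data; totals are 392, 270 and 108.
const sample = [
  { name: 'Annual plan toggle', impact: 9, confidence: 4, ease: 3 },
  { name: 'Shorter checkout form', impact: 8, confidence: 7, ease: 7 },
  { name: 'New CTA copy', impact: 6, confidence: 5, ease: 9 }
]
const { queue } = prioritize(sample, 2)
console.log(queue.map((h) => `${h.name}: ${h.total} (${iceBucket(h.total)})`))
```

Note that a multiplicative total lets one very low score sink an otherwise strong idea, which is usually the behavior you want for a fast filter.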

Experiment Doc Template

Every experiment gets documented before it launches. No doc, no experiment. This isn’t bureaucracy — it’s insurance against p-hacking and lost context three weeks later when you can’t remember why you ran the test.

# Experiment: [EXP-2026-03-001] [Title]

## Hypothesis
If [specific change], then [metric] will increase/decrease by [X%],
because [rationale based on data or qualitative analysis].

## Metrics
- Primary: [one metric on which the decision is made]
- Secondary: [1-2 metrics to monitor for side effects]
- Guardrail: [metric that must NOT degrade]

## Design
- Type: A/B | A/B/C | feature flag
- Audience: [all users | new users | segment]
- Split: [50/50 | 90/10]
- Duration: [7-14 days or until N events are reached]
- Minimum sample size: [calculate before launch]

## Variants
- Control (A): [current behavior]
- Treatment (B): [what changes, screenshot or description]

## Decision Criteria (set BEFORE launch)
- Ship (B wins): primary metric + X% at p < 0.05
- Kill (B loses): primary metric - Y% or guardrail degraded
- Inconclusive: insufficient data → extend or increase traffic

## Results (fill in after)
- Start date / End date:
- Sample size: Control N / Treatment N
- Primary metric: Control X% → Treatment Y% (delta Z%, p-value, CI)
- Decision: Ship / Kill / Iterate
- Learnings: [what we learned, even if the test failed]

The “Decision Criteria” section is the most important part. Fill it in before the test runs — not after. Without pre-committed criteria, you’ll rationalize whatever the data shows.
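One way to make the pre-commitment hard to wriggle out of is to encode the criteria as data before launch and let a function return the verdict. The sketch below is an illustration, not part of any tool: every field name is hypothetical, and requiring significance for a Kill verdict on the primary metric is an assumption of this example.

```javascript
// Criteria are written down before launch; results are filled in afterwards.
// All field names are hypothetical and only mirror the template.
function decide(criteria, results) {
  const { shipLift, killDrop, alpha, minSamplePerVariant } = criteria
  const { controlRate, treatmentRate, pValue, samplePerVariant, guardrailDegraded } = results

  // A degraded guardrail kills the test regardless of the primary metric.
  if (guardrailDegraded) return 'Kill'
  // Below the pre-computed sample size there is no verdict on the primary metric.
  if (samplePerVariant < minSamplePerVariant) return 'Inconclusive'

  const relativeLift = (treatmentRate - controlRate) / controlRate
  const significant = pValue < alpha

  if (significant && relativeLift >= shipLift) return 'Ship'
  if (significant && relativeLift <= -killDrop) return 'Kill'
  return 'Inconclusive'
}

const criteria = { shipLift: 0.1, killDrop: 0.1, alpha: 0.05, minSamplePerVariant: 3000 }
const results = {
  controlRate: 0.05,
  treatmentRate: 0.058,
  pValue: 0.03,
  samplePerVariant: 3200,
  guardrailDegraded: false
}
console.log(decide(criteria, results)) // Ship
```

Because the criteria object exists before any data does, changing a threshold after the fact becomes a visible edit rather than a quiet rationalization.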

A/B Test Statistics: The Minimum You Need to Make Decisions

Sample Size

Formula: n = (Z_{α/2} + Z_β)² × 2 × p(1-p) / δ²

Where:

p — current conversion rate (e.g., 5%)
δ — minimum detectable effect (MDE), e.g., 1 percentage point
Z_{α/2} = 1.96 (for α = 0.05)
Z_β = 0.84 (for power = 80%)

Practical table:

Base Conversion	MDE (absolute)	Sample per variant (α = 0.05, power 80%)
2%	0.5 pp	12,300
5%	1 pp	7,450
10%	2 pp	3,530
20%	3 pp	2,790
50%	5 pp	1,570

For a product with 100 DAU, 7,450 users per variant means about 149 days at a 50/50 split (14,900 users in total at 100 per day), and that assumes every daily user is new to the test. That’s the reality of a small product. You have two options: raise the MDE (only detect large effects) or switch to sequential testing.
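The sample size formula is easy to evaluate directly before each launch. The sketch below applies it as written, with the z-values from the text as defaults. The function names are illustrative, and `daysToComplete` assumes that every daily user is new to the experiment and that traffic splits evenly, which overstates the speed for products with many returning users.

```javascript
// Per-variant sample size for a two-sided test on a conversion rate:
//   n = (z_alpha/2 + z_beta)^2 * 2 * p * (1 - p) / delta^2
// Defaults correspond to alpha = 0.05 and 80% power.
function sampleSizePerVariant(baseRate, mdeAbsolute, zAlpha = 1.96, zBeta = 0.84) {
  const z = zAlpha + zBeta
  const n = (z * z * 2 * baseRate * (1 - baseRate)) / (mdeAbsolute * mdeAbsolute)
  return Math.ceil(n)
}

// Days needed to fill every variant when `newUsersPerDay` fresh users
// enter the experiment daily and are split evenly across variants.
function daysToComplete(nPerVariant, newUsersPerDay, variants = 2) {
  return Math.ceil((nPerVariant * variants) / newUsersPerDay)
}

// 5% base conversion, 1 percentage point MDE, 100 new users per day.
const needed = sampleSizePerVariant(0.05, 0.01)
console.log(needed, 'users per variant')
console.log(daysToComplete(needed, 100), 'days at a 50/50 split')
```

Halving the MDE quadruples the required sample, so the cheapest lever at low traffic is testing bolder changes.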

Sequential Testing vs. Fixed-Horizon

Fixed-horizon: define sample size upfront, wait for full collection, analyze once. Classic and reliable, but slow.

Sequential testing: analyze as data arrives, with corrections for multiple looks. PostHog and Statsig support it out of the box. A test that’d take 30 days under fixed-horizon can stop on day 15 if the effect is clear enough.

Errors That Kill Validity

Peeking. Checking results daily and stopping at the first p < 0.05 is a reliable path to false positives. With α = 0.05 and daily checks over 14 days, the actual false positive rate climbs to roughly 20–25%, four to five times the nominal 5%. Sequential testing handles this mathematically.
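The inflation from peeking can be checked with a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It assumes 200 users per arm per day and a 10% conversion rate, and compares three policies: one look on day 14, a naive look every day, and daily looks with the threshold divided by the number of looks. That last policy is a plain Bonferroni split, the crudest valid correction, and is not the method PostHog or Statsig implement. All names are illustrative.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation (7.1.26).
function normalCdf(x) {
  const t = 1 / (1 + (0.3275911 * Math.abs(x)) / Math.SQRT2)
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))))
  const erf = 1 - poly * Math.exp(-(x * x) / 2)
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf)
}

// Two-sided p-value of a pooled two-proportion z-test.
function twoSidedPValue(convA, nA, convB, nB) {
  const pooled = (convA + convB) / (nA + nB)
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB))
  if (se === 0) return 1
  const z = (convB / nB - convA / nA) / se
  return 2 * (1 - normalCdf(Math.abs(z)))
}

// Share of A/A tests declared "significant" under a given looking policy.
function simulateAA({ days = 14, usersPerDayPerArm = 200, rate = 0.1, pThreshold = 0.05, peekDaily = true, runs = 2000, seed = 42 } = {}) {
  const rand = mulberry32(seed)
  let falsePositives = 0
  for (let r = 0; r < runs; r++) {
    let convA = 0
    let convB = 0
    let n = 0
    for (let d = 1; d <= days; d++) {
      for (let u = 0; u < usersPerDayPerArm; u++) {
        if (rand() < rate) convA++
        if (rand() < rate) convB++
      }
      n += usersPerDayPerArm
      const isLook = peekDaily || d === days
      if (isLook && twoSidedPValue(convA, n, convB, n) < pThreshold) {
        falsePositives++
        break // stop at the first "significant" look
      }
    }
  }
  return falsePositives / runs
}

console.log('single look on day 14:', simulateAA({ peekDaily: false }))
console.log('naive daily peeking:  ', simulateAA({ peekDaily: true }))
console.log('daily, alpha / 14:    ', simulateAA({ peekDaily: true, pThreshold: 0.05 / 14 }))
```

The single-look rate should land near the nominal 5%, the naive daily rate well above it, and the corrected rate at or below 5%. Proper sequential methods spend the error budget less wastefully than the Bonferroni split, which is why they can stop earlier.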

Multiple metrics. A test measuring 10 metrics at α = 0.05 will produce at least one “significant” result by pure chance about 40% of the time. Bonferroni correction divides α by the number of metrics. Or just pick one primary metric and commit to it.
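The 40% figure and the Bonferroni fix both follow from one line of arithmetic. A minimal sketch, assuming the metrics are independent (correlated metrics inflate the error less, so this is the worst case); the function names are illustrative.

```javascript
// Probability of at least one false positive across `k` independent metrics.
function familyWiseErrorRate(alpha, k) {
  return 1 - Math.pow(1 - alpha, k)
}

// Bonferroni: per-metric threshold that keeps the family-wise rate at or below alpha.
function bonferroniAlpha(alpha, k) {
  return alpha / k
}

console.log(familyWiseErrorRate(0.05, 10).toFixed(3))                       // 0.401
console.log(bonferroniAlpha(0.05, 10).toFixed(3))                           // 0.005
console.log(familyWiseErrorRate(bonferroniAlpha(0.05, 10), 10).toFixed(3))  // 0.049
```

The corrected threshold of 0.005 per metric also shows the cost: each metric now needs a much stronger signal, which is the practical argument for a single primary metric.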

Insufficient duration. The minimum is one full business cycle. For B2C SaaS that’s 7 days — weekday and weekend behavior differ. For B2B it’s 14 days, since activity patterns shift between the start and end of the month.

PostHog: Setting Up Experiments from Scratch

PostHog is an open-source product analytics and A/B testing platform. The free tier gives you 1 million events per month — more than enough for an early-stage product.

Integration in 10 Minutes

// posthog-js — client library
import posthog from 'posthog-js'

posthog.init('phc_YOUR_KEY', {
  api_host: 'https://us.i.posthog.com', // US Cloud; use 'https://eu.i.posthog.com' for EU Cloud, or your own host if self-hosting
  autocapture: true,        // automatic tracking of clicks, pageviews
  capture_pageview: true,
  capture_pageleave: true,
  persistence: 'localStorage'
})

Autocapture handles basic events without extra code. For A/B tests, you’ll want custom events at the key funnel points (if you haven’t built a tracking plan yet, see how to create an event taxonomy from scratch):

// Sign-up
posthog.capture('user_signed_up', {
  method: 'email',          // or 'google', 'github'
  source: 'landing_page'
})

// Activation (user completed a key action)
posthog.capture('activation_completed', {
  time_to_activate_seconds: 340,
  steps_completed: 3
})

// Conversion to paid subscription
posthog.capture('subscription_started', {
  plan: 'pro',
  price: 29,
  trial_days: 14
})

Creating an Experiment in PostHog

PostHog builds experiments on top of feature flags. Here’s the setup:

Create a feature flag with variants (control + treatment)
Attach the flag to an experiment
Set the target metric and minimum sample size
Launch

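// Flags load asynchronously, so check the variant once they are available:
posthog.onFeatureFlags(() => {
  if (posthog.getFeatureFlag('exp-new-pricing-page') === 'test') {
    renderNewPricingPage()
  } else {
    renderCurrentPricingPage()   // control, or flag not resolved
  }
})

// Calling getFeatureFlag() records exposure automatically:
// a $feature_flag_called event with properties
// $feature_flag ('exp-new-pricing-page') and $feature_flag_response ('test' or 'control')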

PostHog calculates significance automatically. The experiment dashboard shows conversion by variant, credible interval, and a plain-English recommendation: ship, continue, or discard.

PostHog’s Bayesian Approach

PostHog doesn’t use p-values — it runs Bayesian statistics. Results read like: “Probability that variant B is better than A — 94.3%.” The decision threshold is 95% (configurable).

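That framing is more intuitive, and daily checks are less hazardous than with naive p-values. They aren’t free, though: stopping the moment the probability first crosses the threshold still inflates false positives, so keep the pre-committed minimum sample and duration.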
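To see where a number like "probability that B is better than A" comes from, here is a rough sketch. It puts a uniform Beta(1, 1) prior on each conversion rate and approximates both posteriors as normal distributions, which is reasonable at a few hundred users per arm. This is an illustration of the idea only: PostHog's actual computation may differ, and all names and figures below are hypothetical.

```javascript
// Standard normal CDF via the Abramowitz-Stegun erf approximation (7.1.26).
function normalCdf(x) {
  const t = 1 / (1 + (0.3275911 * Math.abs(x)) / Math.SQRT2)
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))))
  const erf = 1 - poly * Math.exp(-(x * x) / 2)
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf)
}

// Mean and variance of the Beta posterior under a uniform Beta(1, 1) prior.
function betaPosterior(conversions, users) {
  const a = conversions + 1
  const b = users - conversions + 1
  const mean = a / (a + b)
  const variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
  return { mean, variance }
}

// P(B > A), treating both posteriors as approximately normal.
function probabilityBBeatsA(control, treatment) {
  const a = betaPosterior(control.conversions, control.users)
  const b = betaPosterior(treatment.conversions, treatment.users)
  const z = (b.mean - a.mean) / Math.sqrt(a.variance + b.variance)
  return normalCdf(z)
}

// Hypothetical data: 5.0% vs 6.5% conversion at 1,000 users per arm.
const pBetter = probabilityBBeatsA(
  { conversions: 50, users: 1000 },
  { conversions: 65, users: 1000 }
)
console.log(pBetter.toFixed(3)) // roughly 0.92
console.log(pBetter >= 0.95 ? 'ship' : 'keep collecting')
```

A 30% relative lift at 1,000 users per arm still falls short of the 95% threshold in this example, which is a useful calibration for how much data "clear" effects actually need.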

Statsig: For Those Who Need Speed

Statsig is a managed experimentation platform. The free Developer tier covers 2 million events per month — generous for an early-stage product. Its edge over PostHog is setup speed and automatic test termination.

Differences from PostHog

Parameter	PostHog	Statsig
Hosting	Self-hosted or Cloud	Cloud only
Statistics	Bayesian	Frequentist + sequential
Auto-termination	No	Yes (upon reaching significance)
Warehouse-native	No	Yes (Snowflake, BigQuery, Redshift)
Open source	Yes	No
Free plan	1M events	2M events

Statsig stops an experiment automatically once significance is reached. If you don’t have time to check dashboards daily — and you don’t — this matters. Set it up, walk away, get a notification when it’s done.

Statsig Integration

import statsig from 'statsig-js'

await statsig.initialize('client-YOUR_KEY', {
  userID: user.id,
  email: user.email,
  custom: { plan: user.plan, signup_date: user.created_at }
})

// Check the variant
const experiment = statsig.getExperiment('new_onboarding_flow')
const variant = experiment.get('variant', 'control')

if (variant === 'streamlined') {
  renderStreamlinedOnboarding()
} else {
  renderCurrentOnboarding()
}

// Log an event
// Metadata values are passed as strings
statsig.logEvent('onboarding_completed', null, {
  steps_shown: '3',
  time_seconds: '180'
})

Warehouse-Native Experiments

Statsig Warehouse Native runs statistical analysis directly against Snowflake or BigQuery. Events stay in your warehouse — Statsig connects and computes results without a separate data pipeline.

If you’ve already got analytics in BigQuery, this removes the duplication entirely. Feature flags still run through the Statsig SDK, but metrics come from your own data.

The Process: 10 Tests in 3.5 Hours a Month

Week 1: Generation and Prioritization (60 minutes)

Collect last month’s data: funnel, retention, feedback (20 min)
Run the LLM prompt for hypothesis generation (10 min)
ICE scoring: evaluate each hypothesis (15 min)
Select the top 10, fill in experiment docs (15 min)

Week 1-2: Launch (50 minutes)

5 tests × 10 minutes each for feature flag setup and events. Running tests in parallel is fine as long as audiences don’t overlap or the tests touch different parts of the product. PostHog and Statsig both support mutual exclusion groups — one user enters only one experiment.
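Conceptually, mutual exclusion is deterministic bucketing: hash the user ID once to pick an experiment, then hash it again with the experiment key to pick a variant. The sketch below illustrates that idea only. It is not how PostHog or Statsig implement it, and the hash choice (FNV-1a over UTF-16 code units plus a murmur3-style finalizer), the salt, and all names are assumptions of this example.

```javascript
// Deterministic 32-bit string hash: FNV-1a, then a murmur3-style finalizer
// so that the low bits used by the modulo below are well mixed.
function hash32(str) {
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  h ^= h >>> 16
  h = Math.imul(h, 0x85ebca6b)
  h ^= h >>> 13
  h = Math.imul(h, 0xc2b2ae35)
  h ^= h >>> 16
  return h >>> 0
}

// Each user lands in exactly one experiment of the group, then in one variant of it.
function assign(userId, experiments, groupSalt = 'exclusion-group-1') {
  const experiment = experiments[hash32(`${groupSalt}:${userId}`) % experiments.length]
  const variant = experiment.variants[hash32(`${experiment.key}:${userId}`) % experiment.variants.length]
  return { experiment: experiment.key, variant }
}

// Hypothetical experiments for one batch.
const batch = [
  { key: 'exp-new-pricing-page', variants: ['control', 'test'] },
  { key: 'exp-short-onboarding', variants: ['control', 'test'] }
]
console.log(assign('user-123', batch))
console.log(assign('user-123', batch)) // same user, same assignment every time
```

The trade-off is traffic: five mutually exclusive experiments each see one fifth of your users, so each one takes five times longer to reach its sample size than it would alone.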

Week 2-3: Monitoring (20 minutes)

Two checkpoints: day 7 and day 14. Look for bugs (zero conversions usually means a broken variant), confirm guardrail metrics are healthy, verify the sample is accumulating. Don’t interpret results yet.

Week 3-4: Second Batch Launch (50 minutes)

Another 5 tests using the same scheme.

End of Month: Analysis and Documentation (30 minutes)

Close completed tests. Fill in the “Results” section in each experiment doc. Ship the winners. Move inconclusive tests to backlog. Update your knowledge base — what worked, what didn’t, which patterns keep showing up. If you’re tracking results across experiments, a dedicated experimentation dashboard saves time over checking each tool separately.

Total: 210 minutes = 3.5 hours per month

At 10 tests, that’s 21 minutes per test from idea to decision. The time savings come from two places: templates (filling in an experiment doc takes 2 minutes) and automation (the tools do the statistics).

Connecting to Observability

Experiments produce data. Data needs observability. If your product makes LLM calls — chatbot, AI features — you’ll want to correlate A/B results with model response quality. A conversion drop in variant B might not be a design problem. It could be model degradation.

LLM observability via Langfuse solves this: LLM call traces are tagged with the feature flag variant, and the dashboard shows response quality per experiment variant.

Automation: Experiment Pipeline

Once you’re hitting 10 tests consistently, you’ll want a pipeline. Minimal automation looks like this:

Notion/Linear (hypothesis backlog)
    ↓ ICE Total ≥ 125
PostHog/Statsig (feature flags + experiments)
    ↓ automatic analysis
Slack/Telegram notification (result ready)
    ↓ decision: ship/kill
Feature flag → permanent / removed
    ↓ documentation
Experiment doc → updated

Statsig sends a webhook when an experiment closes. PostHog integrates with Slack. The notification arrives without manual polling. Open the dashboard, read the result, decide in 2 minutes.

Common Mistakes and How to Avoid Them

Testing minor details. Button color, font size, placeholder text. These produce effects below 1% — undetectable at low traffic. Test structural changes instead: a different onboarding flow, a different pricing model, a different value proposition.

Launching without a minimum sample size. A test with 50 users per variant shows nothing but noise. Calculate the required sample before launch. If you don’t have the traffic, raise the MDE or replace the test with qualitative research — interviews, usability testing.

Not documenting failed tests. A test that “didn’t work” still tells you something. Writing down losing tests prevents repeating the same hypotheses and builds a record of what doesn’t move users.

Stopping tests early. “B is clearly winning” on day three is noise. Wait for the minimum sample size, or use sequential testing to stop early with statistical backing.

No guardrail metric. A lift in sign-up conversion means nothing if retention drops 20%. A guardrail protects you from optimizing one metric by destroying another.

Pre-Launch Checklist

Before every experiment:

Experiment doc is complete (hypothesis, metrics, decision criteria)
Minimum sample size is calculated
Primary metric is singular
Guardrail metric is defined
Feature flag is configured and tested (both branches work)
Events are tracking correctly (verified in debug mode)
Decision criteria are fixed before launch
Duration is at least one full business cycle
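If your experiment docs live somewhere structured, the checklist can run as a function instead of a mental pass. A minimal sketch, assuming a hypothetical doc shape; rename the fields to match your own template.

```javascript
// Returns the unmet checklist items for an experiment doc.
// The doc shape is hypothetical; an empty array means ready to launch.
function preLaunchIssues(doc, businessCycleDays = 7) {
  const issues = []
  if (!doc.hypothesis) issues.push('hypothesis missing')
  if (!Array.isArray(doc.primaryMetrics) || doc.primaryMetrics.length !== 1) {
    issues.push('exactly one primary metric required')
  }
  if (!doc.guardrailMetric) issues.push('guardrail metric missing')
  if (!doc.decisionCriteria) issues.push('decision criteria not fixed')
  if (!(doc.minSamplePerVariant > 0)) issues.push('minimum sample size not calculated')
  if (!doc.flagTestedBothBranches) issues.push('feature flag branches not tested')
  if (!doc.eventsVerified) issues.push('event tracking not verified')
  if (!(doc.durationDays >= businessCycleDays)) issues.push('duration shorter than one business cycle')
  return issues
}

// Hypothetical doc with two problems: two primary metrics and a 5-day duration.
const draft = {
  hypothesis: 'If we shorten the checkout form, conversion will rise by 10%',
  primaryMetrics: ['checkout_conversion', 'revenue_per_user'],
  guardrailMetric: 'refund_rate',
  decisionCriteria: { shipLift: 0.1, killDrop: 0.1, alpha: 0.05 },
  minSamplePerVariant: 7450,
  flagTestedBothBranches: true,
  eventsVerified: true,
  durationDays: 5
}
console.log(preLaunchIssues(draft))
```

Pass `businessCycleDays = 14` for B2B products, in line with the duration guidance earlier.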

When This Framework Doesn’t Work

Pre-product-market fit. If you’re still searching for what to build, running 10 A/B tests on conversion details is wasted effort. You need qualitative discovery — interviews, prototypes, landing page tests — not statistically rigorous experiments on incremental changes.

Sub-50 DAU products. The math doesn’t close. At 50 daily active users, even a generous MDE of 20% requires months per test. Below this threshold, do usability testing and ship based on judgment. Come back to A/B testing when you hit 200+ DAU.

High-stakes irreversible decisions. Pricing model changes, market pivots, major feature bets — these aren’t A/B testable in the traditional sense. You can test elements around them (pricing page layout, onboarding for a new feature), but the core decision requires strategy, not statistics.

Products with long feedback loops. B2B enterprise with 90-day sales cycles won’t produce meaningful A/B test results in 14 days. The framework works best for B2C and self-serve B2B SaaS where the feedback loop (signup → activation → conversion) fits within 2–4 weeks.

Ten experiments per month isn’t a feat and it’s not a growth hack. It’s a repeatable system: a hypothesis framework, a documentation template, two tools, and three and a half hours. The output is decisions grounded in data, not intuition — and growth that compounds instead of spiking once and flattening.

Need help building an experimentation system? I help startups build AI products and automate processes — belov.works.

FAQ

How do you run A/B tests when your product has fewer than 500 daily active users?

With under 500 DAU, standard statistical significance requires months per test, which defeats the purpose. Two practical adjustments: first, raise the Minimum Detectable Effect to 10–20% and only test changes that could plausibly produce large impacts (structural changes to onboarding, pricing page layout, core CTA). Second, switch to Bayesian sequential testing in PostHog, which can surface a large effect in 7–14 days at low traffic volumes, though small effects will still come back inconclusive. A third option: replace A/B tests with qualitative research (user interviews, session recordings) until you have enough traffic. Usability testing with 5 users often reveals more than a statistically underpowered experiment.

Can you run A/B tests on the same users across multiple experiments simultaneously?

Yes, with one critical safeguard: experiments must not interact with each other. PostHog and Statsig both support mutual exclusion groups — a mechanism that guarantees each user is assigned to only one experiment at a time. Without mutual exclusion, a user in the “new onboarding flow” experiment who is also in the “new pricing page” experiment creates interaction effects that corrupt both results. The practical rule: tests on different pages or features with no functional overlap can run in parallel; tests that share a user flow should be sequenced or use mutual exclusion.

What’s the difference between a guardrail metric and a secondary metric in an experiment doc?

A secondary metric tracks potential side effects you care about but won’t use to make the ship/kill decision (e.g., session duration when you’re testing sign-up conversion). A guardrail metric is a hard constraint: if it degrades beyond a threshold, the experiment is killed regardless of primary metric performance. For example, a test that lifts sign-up conversion by 15% but drops 7-day retention by 25% would be killed by the guardrail. Secondary metrics inform future hypotheses. Guardrail metrics protect the business from optimization that’s locally positive but globally destructive.