A/B testing with feature flags: what works and what doesn’t
A feature flag with two variants can run an A/B test. For many tests, that is all you need. For some, it’s not enough. Knowing which is which saves you from building the wrong thing.
The simple case: copy and UI tests
You want to test two headlines on the pricing page. Set up a variant flag:
const headline = flagify.getVariant('pricing-headline', 'control')
const copy = {
  control: 'Start for free',
  'variant-a': 'Get started in 30 seconds',
  'variant-b': 'Try it free, no credit card',
}[headline]
Configure the flag to split 50/50 (or three ways if you have three variants). Send the variant to your analytics tool as a user property:
analytics.identify({
  experiment_pricing_headline: headline,
})
analytics.track('signup_started', { variant: headline })
Run for a week or two. Check which variant has higher conversion in your analytics tool. Pick a winner. Remove the flag.
This works great for:
- Copy changes (headlines, button text)
- Visual changes (color, layout)
- Feature order or surface (which tab is first)
- Onboarding flow variations
The harder case: measuring statistical significance
If you want confidence that the winner actually won (and isn’t just noise), you need more than a flag.
At low traffic (a few hundred conversions per variant), differences of 5-10% are usually noise. At higher traffic, smaller differences become real. Figuring out where that line is requires statistical calculation.
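To see where that line falls, you can run a two-proportion z-test on the raw counts. A minimal sketch (the function name and the numbers are illustrative, not from any particular tool):

```javascript
// Two-proportion z-test: is the gap between two conversion rates
// larger than what random noise alone would produce?
function twoProportionZ(conversionsA, usersA, conversionsB, usersB) {
  const pA = conversionsA / usersA
  const pB = conversionsB / usersB
  // Pooled rate under the null hypothesis (no real difference)
  const pooled = (conversionsA + conversionsB) / (usersA + usersB)
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / usersA + 1 / usersB)
  )
  return (pB - pA) / standardError
}

// 500 users per variant: a 10% -> 11% lift is indistinguishable from noise
console.log(twoProportionZ(50, 500, 55, 500).toFixed(2)) // ~0.52, well below 1.96

// 50,000 users per variant: the same lift is clearly real
console.log(twoProportionZ(5000, 50000, 5500, 50000).toFixed(2)) // ~5.16, significant
```

A |z| above roughly 1.96 corresponds to p < 0.05 for a two-sided test, which is why the same 1% absolute lift is noise at 500 users and a real effect at 50,000.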
Flags alone don’t tell you when to stop. They don’t tell you the p-value. They don’t do power calculations. They don’t catch Simpson’s paradox when you slice by segment.
For experiments where rigor matters (pricing, checkout flow, core funnel changes), use a dedicated experimentation tool: Statsig, Eppo, GrowthBook, or Amplitude Experiments. They handle variant assignment, metric calculation, and significance testing in one place.
The middle ground: analytics + flags
Many teams run flag-based experiments and measure outcomes in their existing analytics tool (Mixpanel, Amplitude, PostHog). This is fine if:
- You’re running 1-2 experiments at a time, not dozens
- You have enough traffic to eyeball the difference (usually 1000+ conversions per variant)
- You’re not making a decision that hinges on a 3% improvement
- Your analytics tool supports segmentation by user property
PostHog has built-in experimentation. Amplitude has Experiment. Mixpanel has Experiments. All of them integrate with flag tools via webhooks or events.
Patterns that avoid common mistakes
Use consistent variant assignment
If user Alice sees variant A on day 1 and variant B on day 2, your metrics are garbage. Feature flags with deterministic hashing (based on user ID) keep each user on the same variant for the duration of the test. We wrote about why deterministic hashing matters.
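The idea can be sketched with a simple hash (FNV-1a here; real flag SDKs use something similar, but the details below are illustrative):

```javascript
// Deterministic assignment: hash userId + flag key into a bucket
// from 0-99. Same inputs always produce the same bucket, so Alice
// stays on the same variant for the whole test.
function bucketFor(userId, flagKey) {
  // FNV-1a hash over the combined string
  let hash = 0x811c9dc5
  for (const ch of userId + ':' + flagKey) {
    hash ^= ch.charCodeAt(0)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0) % 100
}

function assignVariant(userId, flagKey) {
  // 50/50 split: buckets 0-49 get control, 50-99 get the variant
  return bucketFor(userId, flagKey) < 50 ? 'control' : 'variant-a'
}
```

Because assignment is a pure function of user ID and flag key, no stored state is needed: every server and every SDK evaluation agrees on Alice's variant.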
Don’t A/B test known degradations
If a test makes the product objectively worse for some users (e.g., slower checkout, fewer features), either don’t run it, or run it for a short time with a small percentage. You are not “experimenting” — you are degrading UX for measurement.
Watch for the novelty effect
Week 1 of a test, users react to the change because it’s new. Week 3, they’ve adjusted and the effect settles. If you stop at week 1, you measure novelty. Run tests long enough for behavior to stabilize, usually 2-4 weeks.
Define the metric before the test
Pick one primary metric and a couple of guardrails. “Conversion rate” is a primary metric. “Revenue per user” is a primary metric. “Revenue per user without affecting latency by more than 5%” is primary + guardrail.
Picking a metric after the test ends is how you p-hack your way to a false positive.
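One way to make that commitment concrete is to write the metrics down as data before the flag goes live. The shape below is illustrative, not a Flagify API:

```javascript
// Declared before the experiment starts -- changing this mid-test
// is exactly the p-hacking the warning above describes.
const experiment = {
  flag: 'pricing-headline',
  primaryMetric: 'signup_conversion_rate',
  guardrails: [
    // Ship only if these don't regress beyond their thresholds
    { metric: 'checkout_latency_p95_ms', maxRegression: 0.05 },
    { metric: 'revenue_per_user', maxRegression: 0.0 },
  ],
}

// A simple ship gate: the winner must improve the primary metric
// and stay within every guardrail.
function canShip(primaryLift, guardrailRegressions) {
  return (
    primaryLift > 0 &&
    experiment.guardrails.every(
      (g) => (guardrailRegressions[g.metric] ?? 0) <= g.maxRegression
    )
  )
}
```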
Don’t stack experiments
If you run 5 experiments at once on overlapping user sets, each experiment’s results depend on the state of the others. Ideally, randomize experiment assignment so users are in exactly one. If you can’t, space experiments out or use a tool that handles multi-variant design.
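If you do randomize so each user lands in exactly one experiment, a single "layer" hash is one way to do it (a sketch; the hash and experiment names are illustrative):

```javascript
// Mutually exclusive assignment: hash each user once into a layer
// slot, and give each concurrent experiment a disjoint slice.
const EXPERIMENTS = ['pricing-headline', 'checkout-redesign', 'onboarding-flow']

function layerSlot(userId) {
  // FNV-1a hash of the user ID
  let hash = 0x811c9dc5
  for (const ch of userId) {
    hash ^= ch.charCodeAt(0)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0) % EXPERIMENTS.length
}

// Each user is eligible for exactly one experiment; for the
// other two they simply see control.
function experimentFor(userId) {
  return EXPERIMENTS[layerSlot(userId)]
}
```

The cost is traffic: each experiment only sees a third of your users, which is the usual trade against interaction effects.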
When a flag is all you need
You’re testing whether the new version of a feature is better than the old version. You’re shipping it anyway — you just want to know which to ship. Traffic is reasonable. Signal is expected to be strong (the new version is meaningfully different, not a 1% tweak).
Here, a flag + your existing analytics is fine.
When you need more
You’re making a pricing change. A checkout redesign. An onboarding flow change. Something where the answer matters enough to justify rigor. Traffic is lower and you need statistical confidence at smaller effect sizes. You’re running multiple experiments at once.
Get a real experimentation tool.
What Flagify supports
Flagify has variant flags with deterministic hashing. You can split 50/50, 33/33/33, or whatever weights you want. Each user is consistently assigned to the same variant.
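Arbitrary weights reduce to the same bucket idea: walk cumulative weight ranges until the user's bucket (0-99, from a deterministic hash) falls inside one. A sketch, not the actual Flagify implementation:

```javascript
// Map a bucket from 0-99 onto weighted variants using cumulative
// ranges: [0,34) -> control, [34,67) -> variant-a, [67,100) -> variant-b.
const variants = [
  { name: 'control',   weight: 34 },
  { name: 'variant-a', weight: 33 },
  { name: 'variant-b', weight: 33 },
]

function variantForBucket(bucket) {
  let cumulative = 0
  for (const v of variants) {
    cumulative += v.weight
    if (bucket < cumulative) return v.name
  }
  // Unreachable when weights sum to 100; fall back to the last variant
  return variants[variants.length - 1].name
}
```

Because the bucket is stable per user, changing weights only moves users at the boundary between ranges, which is what keeps a 10% -> 50% ramp from reshuffling everyone.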
We don’t do statistical analysis. We don’t show p-values. We aren’t trying to be Statsig. If you need that, use Flagify for the flag infrastructure and pair it with an experimentation platform for the analysis.
For most teams getting started with A/B testing, that combination costs less and gives you more control than a full experimentation suite.
See feature flag best practices for lifecycle rules, or percentage rollouts for the hashing details. Flagify supports variant flags out of the box — read the SDK docs for getVariant() usage.
Start for free — no credit card required.