How-To Guide

Incrementality Testing for GEO Investment

How to run geographic lift tests on AI visibility inputs to measure the causal revenue effect of GEO spend. Test design, sample sizing, analysis, and common pitfalls.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 17, 2026

Step 1: Pick the Right Intervention

The first design decision is what to pause. AI visibility is driven by inputs (PR, content publishing, syndication, structured data updates, Wikipedia work), not by direct spend. A clean test pauses one or more of these inputs in matched geographic regions, observes the AI visibility series in those regions versus controls, and then observes downstream business outcomes.

Most tests pause regional PR placement (the cleanest geographically targetable input) or syndicated content distribution. Site-wide changes like robots.txt edits affect AI visibility nationally and cannot be geo-tested.

Step 2: Match Markets

Pair test regions with control regions that have correlated pre-period business outcomes. Use 52 weeks of pre-period data and select matched pairs with high correlation and similar levels on the primary outcome (typically branded search volume, direct traffic, or conversions).
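A minimal sketch of the pair-matching screen described above, assuming the weekly pre-period KPI series are already loaded as plain lists. The function name `rank_controls`, the region names, and both cutoffs (`min_corr`, `max_level_gap`) are illustrative assumptions rather than values from any tool. The toy series has six points for readability; a real run would use 52.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_controls(test_series, candidates, min_corr=0.8, max_level_gap=0.25):
    """Keep candidates that are both highly correlated and similar in level.

    min_corr and max_level_gap are hypothetical cutoffs; tune them per KPI.
    """
    test_level = mean(test_series)
    ranked = []
    for name, series in candidates.items():
        r = pearson(test_series, series)
        level_gap = abs(mean(series) - test_level) / test_level
        if r >= min_corr and level_gap <= max_level_gap:
            ranked.append((name, r, level_gap))
    ranked.sort(key=lambda row: row[1], reverse=True)  # best correlation first
    return ranked


# Toy pre-period data (hypothetical regions, weekly conversions)
test_series = [100, 110, 105, 120, 115, 130]
candidates = {
    "region_a": [98, 108, 104, 118, 116, 128],   # tracks the test region closely
    "region_b": [130, 120, 125, 110, 112, 100],  # similar level, moves the opposite way
    "region_c": [300, 330, 315, 360, 345, 390],  # perfectly correlated, three times the level
}
ranked = rank_controls(test_series, candidates)
print(ranked)
```

Only `region_a` survives both filters: `region_b` fails on correlation and `region_c` fails on level, which is why the step asks for both criteria rather than correlation alone.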

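Modern practice uses synthetic control rather than strict pair matching. Tools like Google CausalImpact and Meta GeoLift build the counterfactual from a pool of donor regions instead of a single match: GeoLift fits augmented synthetic control weights, and CausalImpact fits a Bayesian structural time-series model that uses control regions as covariates. A pooled counterfactual usually approximates the test region's pre-period trajectory better than any single match, which typically produces tighter confidence intervals. It is the default for serious lift testing.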

Step 3: Power the Test

Run a power calculation before the test starts. Inputs: expected effect size (typically 5 to 15 percent on the primary KPI for a meaningful AI visibility intervention), pre-period variance of the KPI, target power (usually 80 percent), and significance level (usually 5 percent). The calculation outputs the required test duration and number of test markets.

Common error: running the test for "as long as we can afford" rather than the duration the power calculation requires. Underpowered tests produce null results that get misread as "AI visibility does not work" when in fact the test was too small to detect a real effect.
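As a rough illustration of the power calculation in this step, the sketch below uses the standard normal-approximation sample size formula. It assumes the analysis compares a weekly test-minus-counterfactual gap (in percent of baseline), and that gaps are independent across weeks and markets. That independence assumption is optimistic for autocorrelated KPIs, so treat the output as a lower bound; simulation on your own pre-period data is the safer route. The function name and example inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_weeks(effect_pct, weekly_sd_pct, n_test_markets, alpha=0.05, power=0.80):
    """Weeks needed to detect effect_pct with a two-sided test.

    effect_pct: expected effect on the KPI, in percent of baseline.
    weekly_sd_pct: pre-period standard deviation of the weekly gap for one
        market, in the same units.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Total market-weeks of observation needed under the independence assumption
    n_obs = ((z_alpha + z_power) * weekly_sd_pct / effect_pct) ** 2
    return ceil(n_obs / n_test_markets)


# Hypothetical inputs: 4 test markets, weekly gap SD of 12 percent
weeks_for_5pct = required_weeks(effect_pct=5, weekly_sd_pct=12, n_test_markets=4)
weeks_for_10pct = required_weeks(effect_pct=10, weekly_sd_pct=12, n_test_markets=4)
print(weeks_for_5pct, weeks_for_10pct)
```

With these inputs a 5 percent effect needs 12 weeks while a 10 percent effect needs 3, because required duration scales with the inverse square of the effect size. Halving the expected effect quadruples the test length, which is why "as long as we can afford" is rarely the right duration.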

Step 4: Confirm First-Stage Movement

Before analyzing business outcomes, confirm that the intervention actually moved the AI visibility series in test regions and not in controls. If the AI signal did not move, no downstream test of business effect is meaningful because the test never happened in the way the design intended.

This is where region-segmented AI visibility data is essential. Presenc AI exports weekly SOV at DMA level so the first-stage check is mechanical: test regions should show declining SOV during the holdout window, controls should show stable SOV.
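The first-stage check can be sketched as a simple difference-in-differences on weekly SOV. The data layout (a dict of region name to weekly SOV list), the DMA names, and the function name are assumptions made for illustration; this is not the Presenc AI export format.

```python
from statistics import mean


def first_stage_did(sov, test_regions, holdout_start):
    """Compare the pre-to-holdout change in SOV for test vs control regions.

    sov: dict mapping region -> list of weekly SOV values (percent).
    holdout_start: index of the first holdout week in each list.
    Returns (test_change, control_change, difference_in_differences).
    """
    def change(series):
        return mean(series[holdout_start:]) - mean(series[:holdout_start])

    test_change = mean(change(s) for r, s in sov.items() if r in test_regions)
    control_change = mean(change(s) for r, s in sov.items() if r not in test_regions)
    return test_change, control_change, test_change - control_change


# Hypothetical weekly SOV: 4 pre-period weeks, then 4 holdout weeks
sov = {
    "dma_test_1": [20, 21, 20, 19, 16, 15, 14, 13],
    "dma_test_2": [30, 30, 31, 29, 27, 26, 25, 24],
    "dma_ctrl_1": [22, 22, 23, 21, 22, 23, 22, 21],
    "dma_ctrl_2": [18, 19, 18, 17, 18, 18, 19, 17],
}
test_change, control_change, did = first_stage_did(
    sov, test_regions={"dma_test_1", "dma_test_2"}, holdout_start=4
)
print(test_change, control_change, did)
```

In this toy data the test regions lose 5 points of SOV while the controls stay flat, so the first stage passed. If the difference were near zero, the business-outcome analysis should not proceed.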

Step 5: Analyze With Synthetic Control

Fit a synthetic control model on the pre-period, using donor regions to construct a counterfactual for each test region. Project the counterfactual through the test window. The gap between observed test-region outcomes and the synthetic counterfactual, summed over the test window, is the estimated lift (or, in a holdout design like this one, the loss from pausing AI visibility inputs).
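To make the mechanics concrete, here is a bare-bones synthetic control sketch in plain Python. It finds non-negative donor weights that sum to one by repeatedly shifting weight between donor pairs, which is a simple solver chosen for readability. It is not the augmented method GeoLift uses or the Bayesian model in CausalImpact, and it produces no confidence interval. All names and numbers are hypothetical.

```python
from itertools import combinations


def fit_weights(target_pre, donors_pre, sweeps=500):
    """Donor weights (>= 0, summing to 1) minimising squared pre-period error."""
    names = list(donors_pre)
    w = {n: 1.0 / len(names) for n in names}
    n_weeks = len(target_pre)
    pred = [sum(w[n] * donors_pre[n][t] for n in names) for t in range(n_weeks)]
    for _ in range(sweeps):
        for i, j in combinations(names, 2):
            d = [a - b for a, b in zip(donors_pre[i], donors_pre[j])]
            dd = sum(x * x for x in d)
            if dd == 0:
                continue  # identical donors: shifting weight changes nothing
            resid = [p - y for p, y in zip(pred, target_pre)]
            # Best shift of weight from j to i, clamped so both stay non-negative
            delta = -sum(r * x for r, x in zip(resid, d)) / dd
            delta = max(-w[i], min(w[j], delta))
            w[i] += delta
            w[j] -= delta
            pred = [p + delta * x for p, x in zip(pred, d)]
    return w


def cumulative_gap_pct(target_test, donors_test, w):
    """Observed minus counterfactual over the test window, as percent of counterfactual."""
    counterfactual = [
        sum(w[n] * donors_test[n][t] for n in w) for t in range(len(target_test))
    ]
    return 100 * (sum(target_test) - sum(counterfactual)) / sum(counterfactual)


# Toy pre-period: the test region behaves like 60% donor_a + 40% donor_b
donors_pre = {
    "donor_a": [100, 104, 98, 110, 107, 115],
    "donor_b": [80, 78, 85, 83, 90, 88],
    "donor_c": [50, 60, 40, 70, 45, 65],
}
target_pre = [92.0, 93.6, 92.8, 99.2, 100.2, 104.2]
weights = fit_weights(target_pre, donors_pre)

# Toy test window: observed outcomes sit 5% below the counterfactual
donors_test = {"donor_a": [112, 118], "donor_b": [92, 86], "donor_c": [55, 60]}
target_test = [98.8, 99.94]
gap_pct = cumulative_gap_pct(target_test, donors_test, weights)
print(weights, gap_pct)
```

The solver recovers weights close to 0.6, 0.4, and 0, and the cumulative gap comes out near minus 5 percent. A production analysis would use an established package and derive the interval from placebo tests or the model's posterior.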

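Report the point estimate with the confidence interval. A test that shows a 4.2 percent loss with a 95 percent CI of 1.1 to 7.3 percent is a positive result with meaningful uncertainty: the interval excludes zero, but the true effect could be anywhere in that range. A test that shows a 4.2 percent loss with a 95 percent CI of -2.1 to 10.5 percent is inconclusive because the interval includes zero, which usually means the test was underpowered.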

Step 6: Translate Lift Into Causal ROI

Take the percentage lift, apply it to the annualized baseline business outcome in the test regions, and scale to the national footprint. Divide by the annualized cost of the AI visibility inputs that were paused. The result is a causal ROI estimate for AI visibility spend that survives finance scrutiny in a way that attribution-based ROI does not.
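The arithmetic in this step can be sketched as follows. The baseline revenue, test-region share, and input cost are made-up numbers; the 4.2 percent estimate and its 1.1 to 7.3 percent interval are the example figures from Step 5. Two assumptions are baked in and worth stating to finance: the regional effect scales linearly to the national footprint, and the cost figure is the annualized national cost of the paused inputs.

```python
def causal_roi(loss_pct, test_baseline_annual, test_share_of_national, annual_input_cost):
    """Return on the paused inputs implied by a holdout loss.

    loss_pct: estimated loss in the test regions, in percent.
    test_baseline_annual: annualized baseline outcome (e.g. revenue) in test regions.
    test_share_of_national: test regions' share of the national baseline (0 to 1).
    annual_input_cost: annualized national cost of the inputs that were paused.
    """
    test_region_effect = loss_pct / 100 * test_baseline_annual
    national_effect = test_region_effect / test_share_of_national  # linear scaling assumption
    return national_effect / annual_input_cost


# Hypothetical business inputs
baseline = 10_000_000  # annualized revenue in test regions
share = 0.10           # test regions are 10% of the national baseline
cost = 1_200_000       # annualized cost of the paused inputs

roi_point = causal_roi(4.2, baseline, share, cost)
roi_low = causal_roi(1.1, baseline, share, cost)
roi_high = causal_roi(7.3, baseline, share, cost)
print(round(roi_point, 2), round(roi_low, 2), round(roi_high, 2))
```

Carrying the interval bounds through the same formula gives a range of roughly 0.9 to 6.1 around the point estimate of 3.5. Reporting that range alongside the point estimate keeps the uncertainty from Step 5 visible in the ROI number.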

Step 7: Calibrate the MMM

Compare the causal ROI from the test to what the MMM coefficient on the AI variable implies for the same intervention. If the two agree within the confidence interval, the MMM is calibrated. If they disagree materially, the test is ground truth and the MMM spec needs to be revisited, typically by adjusting adstock priors or the saturation curve.
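A small sketch of the comparison, assuming both the MMM and the test have been expressed as an ROI for the same intervention. The function name, the returned ratio, and the example figures are hypothetical. The ratio is only a diagnostic for how far the MMM sits from the test; it is not a prescribed correction.

```python
def calibration_check(mmm_roi, test_roi, test_ci_low, test_ci_high):
    """Flag whether the MMM-implied ROI falls inside the test's confidence interval."""
    return {
        "calibrated": test_ci_low <= mmm_roi <= test_ci_high,
        "mmm_to_test_ratio": mmm_roi / test_roi,
    }


# Hypothetical test result: ROI 3.5 with interval 0.9 to 6.1
agrees = calibration_check(mmm_roi=4.0, test_roi=3.5, test_ci_low=0.9, test_ci_high=6.1)
disagrees = calibration_check(mmm_roi=7.0, test_roi=3.5, test_ci_low=0.9, test_ci_high=6.1)
print(agrees, disagrees)
```

In the second case the MMM implies twice the return the test observed, which is the kind of material disagreement that should send the modeler back to the adstock priors or the saturation curve.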

Common Pitfalls

Spillover: If test and control regions are geographically adjacent, or the paused input has national reach (a PR campaign in major outlets), the holdout leaks into controls and the measured effect shrinks. Use non-adjacent matched regions.

Seasonality contamination: AI visibility tests run during a major category seasonal event will measure seasonality, not AI effect. Run tests in stable periods.

Short tests: AI carryover is weeks to months. Tests shorter than eight weeks systematically undermeasure AI effect because the carryover from the pre-period continues to deliver outcomes during the early weeks of the holdout.
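The short-test pitfall can be illustrated with a geometric adstock model, which is an assumption made here for simplicity and not a claim about how AI carryover actually decays. Under that model, if a fraction `retention` of the effect carries over each week, then in holdout week k the paused inputs still deliver `retention ** k` of their steady-state effect. The retention value of 0.8 is hypothetical.

```python
def measured_share_of_loss(holdout_weeks, retention):
    """Average share of the steady-state weekly loss that a holdout observes.

    Assumes geometric carryover and a linear response: in holdout week k
    (k = 1 is the first week), retention ** k of the pre-period effect remains.
    """
    remaining = [retention ** k for k in range(1, holdout_weeks + 1)]
    return 1 - sum(remaining) / holdout_weeks


for weeks in (4, 8, 12):
    print(weeks, round(measured_share_of_loss(weeks, retention=0.8), 2))
```

At a weekly retention of 0.8, a four-week holdout observes about 41 percent of the full effect on average, an eight-week holdout about 58 percent, and a twelve-week holdout about 69 percent. Slower decay makes the shortfall worse, so the lift estimate from a short test should be read as a floor.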

How Presenc AI Helps

Presenc AI provides DMA-level AI visibility data that powers the first-stage check and the donor-region selection for synthetic control. Without geographic AI signal, lift tests on AI visibility are flying blind. Presenc also publishes prompt-set governance metadata so the AI variable in the test analysis is documented to the same standard as the rest of the measurement stack.

Frequently Asked Questions

How long should a test run?

Eight to twelve weeks for most categories. Shorter tests undermeasure AI effect because of carryover from the pre-period; longer tests start to absorb seasonality and external shocks. The power calculation done before the test starts specifies the exact duration for your category.

Can the test be run nationally, without geographic holdouts?

Not properly. A national pause of AI visibility inputs has no counterfactual; you cannot tell whether changes in outcomes are due to the pause or to category-level events. Geographic holdouts produce the matched comparison that makes the test interpretable.

What should be tested when the main input, such as PR, is national?

Test the inputs that can be targeted by region: regionally syndicated content, regional event sponsorship, region-targeted paid amplification. National PR is hard to geo-test, but the inputs around it (which amplify PR's AI visibility effect) usually are testable. Aryma's AdstockITSA and similar synthetic-control techniques can also extract causal estimates from national interventions in some cases.

How often should tests be run?

Once every two to three quarters on a rolling basis. The purpose is to calibrate the MMM, not to run continuous experiments. Over-frequent testing produces noise and is expensive (each test reduces AI visibility activity in test regions for the duration). Under-frequent testing lets the MMM drift away from causal ground truth.
