Release Impact

Overview

Release Impact is an analysis that tests how product changes—typically new versions or feature releases—affect selected KPIs (including conversion, engagement metrics, and revenue). While it serves the same purpose as an A/B test, this analysis doesn't randomly allocate users to variants and control groups. Instead, it assumes the entire population experiences the change. The analysis creates a synthetic control group to compare with the post-release period, estimating impact through a proprietary Loops algorithm. This algorithm uses data from before and after the release, accounting for user mix changes and complex trend patterns. The methodology has been extensively tested against A/B test benchmarks and proven highly accurate.

While this method effectively mimics A/B test estimations, it involves more complex statistics and requires specific assumptions. Though we employ various procedures to test these assumptions, best practice dictates verifying them before running the analysis.

Analysis assumptions

The following assumptions are necessary to estimate the release effect:

  1. The effect must be constant over time (appearing as a step function). Version releases typically follow this pattern, but brief changes like KPI "spikes" aren't suitable. Gradual releases—where some users experience the change before others—are supported but require clear tracking of which users were treated at each release stage.
  2. No other step-like events (such as holidays or major events) should occur on the release date. While major marketing strategy changes can be partially controlled for through user mix changes, a release during a significant holiday makes it impossible to separate the holiday's effect from the release's impact.
  3. Any other step-like events (such as other releases) during the analyzed period—typically one month before and after the release date—must be specified in the analysis parameters. This allows us to model these events appropriately and exclude them from tests that assume no other changes occurred.
  4. The user segment mix must not be affected by the release. In other words, we assume all segments are determined before the KPI's reference date (typically the join date), such as demographic or first-device data. We rely on this assumption because we use segment-mix changes to separate the product change's impact from the impact of marketing actions that often accompany new releases and affect the segment mix. For example, behavioral segments that rely on user actions taken after the KPI's reference date can bias the data.

Best practices

Release Impact is designed to estimate the impact of product changes. When A/B testing is viable (i.e., when true randomization and allocation mechanisms exist), it's preferable since it's simpler and more robust, requiring fewer assumptions. However, Release Impact is a good alternative when A/B testing is impractical, costly, or impossible.

Release Impact is not suitable for estimating marketing campaign impacts because the analysis is designed to detect and eliminate the effects of user-mix shifts from the total population change.

Due to its assumptions and limitations, avoid running Release Impact analyses during weeks with holidays, major events, or other releases that affect a large portion of your user population. If such events occur within the analysis range but not during the same week, you may proceed with the analysis by indicating these events in the setup (Advanced parameters → Other important dates).

For gradual releases (where users are exposed to changes on different dates), ensure this is registered in the event and properly configured in the Measurement Timeline section (Gradual Release).

Finally, note that the analysis works best with complete data (no missing dates), dense data (few zero-value dates), sufficiently long data (typically a month before and at least 1 week after the release), sufficient volume (typically >1K users per day), and ideally with ample segment data.

Interface

The analysis consists of three types of pages:

  1. Overview - presents a summary of the main results across all analyzed KPIs.
  2. Tested KPIs - a deep dive into each KPI's results.
  3. Appendix - additional results that one might find interesting.

Overview

The overview section shows 2 main modules:

  1. The first module is a summary table that displays each KPI's value alongside the synthetic control calculated by our algorithm. A trophy icon marks the winning "variant" when its KPI value is higher (for "higher is better" KPIs) and the result is statistically significant (confidence > 95%). We run multiple validation tests to verify the previously mentioned assumptions and data integrity; the results appear for each analyzed KPI, and clicking the validation label opens a pop-up window with detailed test information. Above the summary table, a toggle lets you switch between two comparisons: the post-release period versus our calculated synthetic control (the causal result) or versus the pre-release KPI (the naive result). We don't recommend relying on the naive result, as it can be significantly biased; the toggle exists to demonstrate the difference between causal and naive calculations.
  2. The second module displays two graphs. The first graph shows the confidence level (the statistical significance of the causal result) over time, starting from day 7 after the release. While this graph defaults to showing the main KPI's confidence trend, you can view any secondary KPI by selecting it from the tabs above. The second graph displays both the actual KPI and its synthetic control across the same time period, with confidence bands for each.

Tested KPIs

This section provides a detailed analysis of specific KPIs (with a separate page for each KPI). Each page displays the summary table along with the Actual KPI and simulated control, including user numbers for both pre- and post-release periods.

The next module displays changes across all segments between pre- and post-release periods. These are naive results (comparing average KPI post vs. pre) and should not be considered reliable estimates of the effect within each segment. However, they provide useful insights into which segments changed during the release and by how much.

The next module reveals the model's inner workings that led to the causal result, breaking down its different components. The graphs demonstrate how the simulated control is calculated and how different model steps affect the result:

  1. The first step shows the naive effect (pre vs. post average KPI).
  2. The second step shows the estimated effect based solely on modeled seasonality and trends in the KPI. This step ignores segment data completely. A significant result here indicates a detected step function impact from the data.
  3. The third step shows the estimated effect based only on segment changes (without modeling trends and seasonality). A significant result here indicates a substantial change in user segment mix that explains the difference in pre vs. post average KPI. This step controls for external changes unrelated to the release, such as marketing campaign changes (assuming the release cannot cause changes in user mix).
  4. The fourth and final step combines the previous two steps to produce the full causal estimate of the effect.

Appendix

The appendix displays additional metrics around the release date, including:

  1. Changes in feature adoption rates across all KPIs' relevant time ranges.
  2. Changes in the median time for entities to reach each KPI.

These metrics, combined with the main results and segment changes, provide a comprehensive view of all changes surrounding the release date.

Methodology

The Release Impact Analysis uses advanced statistical methods to estimate how product changes affect specific KPIs. It overcomes the major biases found in simple pre-versus-post comparisons (naive estimations) by accounting for temporal patterns (like complex trends and seasonality) and external factors such as marketing campaigns that coincide with the release.

Why Naive approaches fail

The simplest way to measure a product change's effect on a KPI would be to compare average values between users who joined after the release ("post-period") and those who joined before ("pre-period"). However, this approach—while straightforward—is strongly discouraged because it leads to misleading results due to significant biases. The main issues are:

  1. A change in user mix of the business: The mix of users after a release can differ significantly from those before it. This may happen when releases coincide with marketing campaigns or due to natural variations in user acquisition. Consequently, a product release that had no real impact might appear to have a significant effect simply because of changes in the user population.
  2. The dynamics of time: Simple before-after comparisons ignore important time-based patterns like trends and seasonality. For example, if a release occurs during an upward trend, the post-release average will appear higher than the pre-release average, suggesting a positive impact even when the release itself had no effect (or even a negative one), as the sketch below illustrates.
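To make the second bias concrete, here is a minimal, self-contained sketch (illustrative only, not Loops code) showing how an upward trend alone produces a positive naive pre-vs-post difference even when the true release effect is exactly zero:

```python
# Illustrative sketch: an upward trend produces a spurious "naive" release effect.
import numpy as np

rng = np.random.default_rng(0)

days = np.arange(-28, 28)                            # 4 weeks before / after a release at day 0
trend = 0.30 + 0.002 * days                          # KPI drifts upward over time
kpi = trend + rng.normal(0, 0.01, size=days.size)    # the true release effect is exactly 0

naive_effect = kpi[days >= 0].mean() - kpi[days < 0].mean()
print(f"Naive post-minus-pre difference: {naive_effect:.4f}")  # ~ +0.056, despite no real effect
```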

To address these biases, our methodology combines two statistical approaches:

  • Treatment Propensity Stratification: to account for changes in user mix.
  • Regression discontinuity analysis with smooth terms (generalized additive models) and seasonality: to account for temporal trends around the release date.

Step 1: Addressing changes in user mix with propensity stratification

The problem

Users in the pre- and post-release periods can differ significantly, whether due to marketing efforts around the release or natural variations. These differences manifest in changing segment distributions—for example, shifts in the proportion of Android users, users from the US, CA, and UK, or users from web sources. Since KPI levels often vary substantially between segments, these distributional shifts can explain changes in the overall KPI. We assume segment assignments are independent of the release since they are determined before the KPI's reference date (typically the join date). Therefore, we need to eliminate KPI changes caused by shifting user mix to isolate the true causal impact of the release. Without accounting for these shifts (also called confounders), any observed difference in KPI between the pre- and post-release populations may reflect these underlying differences rather than a true causal relationship.

How Propensity Stratification Works

For a release analysis, treatment simply means whether a user is in the post-release or pre-release period. We use a decision tree to group segments based on their likelihood of being in the post period, then analyze each group separately.

We group users into segments with similar treatment propensities using a decision tree. These segments are defined based on both:

  • Demographics (e.g., country, device, marketing source).
  • Behavioral Features (feature adoptions that occurred before the reference date, if applicable).

Within each segment, pre-release and post-release users become more comparable, which reduces confounding effects. We use statistical methods throughout this process to prevent overfitting and optimize the tree structure.
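For intuition, here is a minimal sketch of the idea using scikit-learn's DecisionTreeClassifier. The column names (`country`, `device`, `source`, `is_post_release`) and tree settings are assumptions for illustration; Loops' production tree uses its own splitting criteria and overfitting controls.

```python
# Illustrative sketch: group users into strata with a similar propensity of
# belonging to the post-release period, using a shallow decision tree.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def propensity_strata(users: pd.DataFrame) -> pd.Series:
    """Return a stratum label per user based on segment columns.

    `users` is assumed to have a boolean `is_post_release` column plus
    categorical segment columns such as `country`, `device`, and `source`.
    """
    X = pd.get_dummies(users[["country", "device", "source"]])
    y = users["is_post_release"]

    # Shallow tree with a minimum leaf size to avoid overfitting the propensity model.
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=500, random_state=0)
    tree.fit(X, y)

    # Each leaf is a stratum: users in the same leaf share a similar probability
    # of being in the post-release period, so pre/post users are more comparable
    # within it.
    return pd.Series(tree.apply(X), index=users.index, name="stratum")

# Downstream, the release effect is estimated separately within each stratum and
# the per-stratum estimates are combined (e.g., weighted by stratum size).
```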

Step 2: Addressing time dynamics

The problem

As demonstrated in the graph below, trends in the data can severely bias the release estimation in a simple pre vs. post analysis. Moreover, KPIs often exhibit strong weekly seasonality; if this is not taken into account, it can further increase the noise and bias in the estimation. To account for both, we use a form of regression discontinuity analysis.

How Regression Discontinuity Works

Within each segment detected in step 1, we fit a generalized additive model that includes:

  • A smooth trend - a general curve that fits the data's trend, taking noise and user counts into account to optimize the curve construction and avoid overfitting.
  • Weekly seasonality - modeled by adding day-of-week fixed effects (i.e., separate intercepts).
  • Tenure - an important predictor of the KPI, included as a smooth term so the modeled effect is more flexible.
  • Release effect - a step function capturing the impact of the release after accounting for trend and seasonality (remember: segment mix changes are handled by the first step).
  • Number of new users joining on the same day as the user - this helps account for marketing campaigns and similar external influences.

Using a BIC score, we select or remove features from those listed above to prevent overfitting. The model is fit across all pre- and post-release data. For each analyzed segment, the impact estimate comes from the model's release-effect term. The model determines the optimal fit by incorporating the trend, seasonality, release-impact step, and tenure effect. The synthetic control is then created by predicting the same data with one key difference: the release feature is set to "pre-release" mode for the post-release period.
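The sketch below illustrates this structure using the open-source pygam library. The feature layout and column names are assumptions for illustration only; the production model additionally performs BIC-based term selection and is fit separately per stratum.

```python
# Illustrative sketch: a GAM with a smooth trend, day-of-week effects, a tenure
# smooth, a daily new-user count, and a release step; the synthetic control is
# built by "switching off" the release indicator in the post-release period.
import pandas as pd
from pygam import LinearGAM, s, f, l

def fit_release_gam(daily: pd.DataFrame):
    """`daily` is assumed to have columns: day_index, day_of_week, tenure,
    new_users, is_post_release (0/1), and the target column `kpi`."""
    X = daily[["day_index", "day_of_week", "tenure", "new_users", "is_post_release"]].to_numpy()
    y = daily["kpi"].to_numpy()

    gam = LinearGAM(
        s(0)        # smooth trend over time
        + f(1)      # day-of-week fixed effects (weekly seasonality)
        + s(2)      # smooth tenure effect
        + l(3)      # linear term for daily new-user volume
        + l(4)      # release step function
    ).fit(X, y)

    # Synthetic control: same data, but the release indicator forced to "pre-release".
    X_control = X.copy()
    X_control[:, 4] = 0
    synthetic_control = gam.predict(X_control)

    # The estimated release impact is the fitted series minus the synthetic
    # control over the post-release period.
    fitted = gam.predict(X)
    post_mask = daily["is_post_release"].to_numpy() == 1
    return gam, synthetic_control, (fitted - synthetic_control)[post_mask].mean()
```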

To visualize this, consider the following real example. The graph below shows the actual KPI (black line) around the release date (dashed green line). The red line shows the fitted model, which detected a downward trend, weekly seasonality, and a significant step change at release. The dashed blue line represents the synthetic control, which continues the pre-release trend and seasonality pattern. The estimated impact of the release is shown by the difference between the fitted line (red) and the synthetic control (dashed blue) in the post-release period.

Many people mistakenly assume that the synthetic control in Release Impact analysis is created by training a model solely on pre-release data to forecast post-release outcomes (such as with the Prophet algorithm). However, this approach would be less effective at detecting changes because forecasting is inherently more complex than model fitting. It would require much more data and stricter assumptions to achieve the same statistical power and avoid bias.

Validation checks

While Release Impact applies advanced statistical methods to mimic A/B tests, it is inherently more susceptible to bias and requires more assumptions to work properly. To mitigate this risk, we apply extensive validation checks designed to provide greater confidence in the release result. These validations include 13 different checks that feed a confidence scoring mechanism, which determines the level of confidence in the analysis assumptions.

KPI Data validations

We perform three types of validations on the KPI data before running it through the model.

  1. Percent missing dates:

    Problem: Dates with missing KPI data, especially around the release date, can bias the result. Test: We check whether there are too many missing dates in the week before and after the release, and if there are, we indicate the specific dates on which data are missing. Possible resolution: Users are encouraged to check for data validity issues around the indicated dates.
  2. Sparse KPI:

    Problem: A combination of small daily samples with a very sparse KPI (e.g., a conversion rate of 0.1%) can result in a considerable number of days where the KPI value is 0. This situation is not ideal for the model. Test: We check the percentage of 0-value KPI dates in the analysis period and warn when it is unusually high (>15% of dates). Possible resolution: We encourage users to either remove any population limitations (to increase the sample size) or select a different, less sparse KPI.

  3. Enough post-release dates:

    Problem: The trend model works best when there is at least one week of data after the release date. Note that if the KPI has a long bound (e.g., conversion within 2 weeks), at least 3 weeks must pass from the release before there is fully matured KPI data for the first week. Test: Check whether there is less than a full week of fully matured KPI data. Possible resolution: Either wait for more data to accumulate and mature, or select a shorter bound for the KPI so it matures earlier. (A simplified sketch of all three checks appears after this list.)
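To make these checks concrete, here is a simplified sketch of the three KPI data validations. The thresholds follow the descriptions above; the exact production logic may differ.

```python
# Illustrative sketch: basic KPI data checks before modeling.
import pandas as pd

def validate_kpi_data(daily: pd.DataFrame, release_date: pd.Timestamp,
                      kpi_bound_days: int = 0) -> dict:
    """`daily` is assumed to be indexed by a sorted DatetimeIndex with a `kpi` column."""
    # 1. Missing dates in the week before and after the release.
    week_around = daily.loc[release_date - pd.Timedelta(days=7):
                            release_date + pd.Timedelta(days=7)]
    expected_days = pd.date_range(week_around.index.min(), week_around.index.max(), freq="D")
    missing_dates = expected_days.difference(week_around.index)

    # 2. Sparse KPI: share of zero-value dates in the analysis period.
    zero_share = (daily["kpi"] == 0).mean()

    # 3. A KPI with a bound of N days only fully matures N days after the activity
    #    date, so a full post-release week needs release + bound + 7 days of data.
    matured_post_days = (daily.index.max() - release_date).days - kpi_bound_days

    return {
        "missing_dates_near_release": list(missing_dates),
        "sparse_kpi": zero_share > 0.15,              # warn when >15% of dates are zero-valued
        "enough_post_release_data": matured_post_days >= 7,
    }
```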

Null days validation tests

To validate that our detected effect represents a true signal rather than random noise, we compare it to earlier periods before the analysis window. Even if the analysis detects statistically significant effects on earlier dates, we check that the effect on the release date is more significant than the effect found on at least 90% of the other dates. An important assumption for the validity of this test is that there are no other releases or major influential events in the pre-release period (if there are such events, the user should clearly specify them in the analysis settings).
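Conceptually, this is a placebo-style test. The sketch below assumes a hypothetical `estimate_effect(daily, candidate_date)` helper that reruns the model with a candidate release date; the actual implementation differs, but the percentile logic is the same idea.

```python
# Illustrative sketch: check that the release-date effect beats at least 90% of
# placebo ("null") dates in the pre-release period.
import numpy as np
import pandas as pd

def null_days_percentile(daily, release_date, estimate_effect, excluded_dates=()):
    """Return the percentile of the release-date effect among placebo effects."""
    pre_dates = [
        d for d in daily.loc[:release_date - pd.Timedelta(days=7)].index
        if d not in set(excluded_dates)               # skip known events / other releases
    ]
    placebo_effects = np.array([abs(estimate_effect(daily, d)) for d in pre_dates])
    release_effect = abs(estimate_effect(daily, release_date))

    percentile = (placebo_effects < release_effect).mean() * 100
    return percentile, percentile >= 90               # passes when release beats >=90% of null dates
```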

  1. Ran Null days validation test:

    Problem: The null days validation test sometimes cannot run, for example when there isn't enough historical data before the release. This reduces our confidence in the results since this validation is fairly important. Test: Verify that the null days validation test executed successfully. Possible resolution: Ensure KPI and segment data extends several months before the release date.

  2. Percentile in null test:

    Problem: When the effect on the release date is not at least in the 90th percentile of the effects on pre-release null dates, this reduces our confidence in the result. Test: Check the percentile of the release-date effect relative to pre-release null dates (i.e., dates that were not indicated as other release or major-event dates). Possible resolution: Check whether there are other release dates or major influential events in the pre-release period, and indicate them in the analysis parameters.

  3. Outlier null dates:

    Problem: The null days test assumes the comparison periods were free from major events or other releases. Specific null dates with an unusually large effect suggest these assumptions might not hold, which could invalidate our other tests' results. Test: Test for outliers in the null-date effects. Possible resolution: Check whether there are other release dates or major influential events in the pre-release period, and indicate them in the analysis parameters.

Segments data validations

We run three more analyses specifically focused on segment data richness and its weekly changes. These tests are crucial for properly handling external confounding effects (such as marketing campaigns—see step 1 of methodology) and for identifying other potential problems.

  1. Detected segments with moderately-high changes around release:

    Problem: In the best practices section, we encourage users to avoid including segment groups that are affected by the release, such as usage of a feature introduced by the release. Since such segment groups can severely bias the result, we apply a procedure that automatically detects and removes highly suspicious segments (those showing extreme changes around the release). When these changes are clearly extreme, we have high confidence in this process. However, things become trickier with segments showing moderately high changes. In these cases, while we still remove the segment from the analysis, we have less confidence in the result since there's a higher chance we've removed an informative segment. Test: We keep track of all the segment groups we automatically excluded due to overly large differences between the pre- and post-release periods, and measure the degree of difference. Out of all the excluded segment groups, we examine the one with the least amount of change. If the level of change of that group falls in a certain range (large enough to exclude but low enough to raise doubts), the test fails. Possible resolution: Manually exclude segment groups that are not suitable for the analysis (because they might be causally affected by the release).

  2. Number of informative segments:

    Problem: The model needs rich segment data to account for user mix changes around the release (as explained in step one of the methodology). Without enough segment data, we may fail to detect and account for potential biases.

    Test: We check if there are at least 5 segment groups that passed the segment validity checks (meaning they didn't show extreme changes between pre- and post-release periods).

    Possible resolution: Add more segments to the analysis.

  3. Outlier weeks in segment distribution changes:

    Problem: As mentioned before, the null days test assumes the comparison periods were free from major events or other releases. Specific weeks with unusually large segment changes suggest these assumptions might not hold, which could invalidate our other tests' results. Test: We quantify weekly changes in the mix of every segment group and then test whether any group exhibits weeks with an unusually high level of change in segment mix (see the sketch after this list). If we find any, we indicate in which segment group and which specific weeks these changes were observed. Possible resolution: Make sure no additional releases or holidays happened in the specified weeks, and if so, indicate these in the analysis parameters.
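As a rough illustration of the weekly mix check in item 3, the sketch below quantifies week-over-week shifts in a segment group's distribution and flags outlier weeks. The column names and the simple z-score rule are assumptions for illustration, not the production logic.

```python
# Illustrative sketch: flag weeks with unusually large shifts in a segment group's mix.
import pandas as pd

def outlier_mix_weeks(users: pd.DataFrame, segment_col: str, z_threshold: float = 3.0):
    """`users` is assumed to have a datetime `join_date` column and a categorical segment column."""
    weekly_mix = (
        users.assign(week=users["join_date"].dt.to_period("W"))
             .groupby("week")[segment_col]
             .value_counts(normalize=True)
             .unstack(fill_value=0.0)
    )
    # Total variation distance between consecutive weeks' segment distributions.
    tvd = 0.5 * weekly_mix.diff().abs().sum(axis=1, min_count=1).dropna()

    z_scores = (tvd - tvd.mean()) / tvd.std()
    return tvd.index[z_scores > z_threshold].tolist()  # weeks with unusually large mix changes
```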

Model results validations

Finally, we also run validation tests on different stages of the model results, looking for red flags.

  1. Causal tree splits:

    Problem: The basis of stage 1 of the model (propensity stratification) is a tree algorithm that splits the population into different segments. In some cases, no statistically significant splits are detected, which might indicate there is not enough data to debias the results. Test: Check whether the tree model found any splits for the population. Possible resolution: Add more meaningful segment groups to the data.

  2. High variability in effect between groups:

    Problem: In stage 2 of the model, we estimate the effect separately for each tree segment found in stage 1. Sometimes, the effects in these segments point in opposite directions. When a significant portion of the population shows contradictory effects, this raises a red flag (even though it's technically possible due to heterogeneous treatment effects). Test: Check what portion of the population resides in segments with statistically significant contradictory effects. Possible resolution: Get support from the Loops team.

  3. Causal equal naive:

    Problem: When not enough data is available for the analysis, the causal effect will equal the naive post-vs-pre-release effect. This means there are both no causal tree splits and no trend and seasonality model, which implies a low signal-to-noise ratio. Test: Check whether the causal effect equals the naive one. Possible resolution: Collect more data.

  4. Unstable confidence:

    Problem: While the estimated effect using the most recent data could appear statistically significant, this could be due to random noise in the post-release dates. An effect that is not only significant but also stable across time is stronger evidence of an actual impact. Test: The release effect is estimated on all available lengths of the post-release period from 7 days onward (i.e., 7 days, 8 days, 9 days, etc.). We run a test to see whether the effect has converged to a high enough confidence level, meaning it is both high and stable (see the sketch after this list). Possible resolution: Keep collecting data for a few more days.
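As a rough illustration of the convergence check in item 4, the sketch below assumes a hypothetical `effect_confidence(daily, n_post_days)` helper that returns the confidence level for a given post-release window, and tests whether confidence is both high and stable across expanding windows.

```python
# Illustrative sketch: check that confidence has converged above a threshold
# across the most recent expanding post-release windows.
import numpy as np

def confidence_converged(daily, max_post_days, effect_confidence,
                         threshold=0.95, stable_last_n=3):
    # Confidence for post-release windows of 7 days, 8 days, ..., max_post_days.
    confidences = np.array([
        effect_confidence(daily, n) for n in range(7, max_post_days + 1)
    ])
    recent = confidences[-stable_last_n:]
    # "Converged" here means the last few windows all sit above the threshold.
    return bool(len(recent) == stable_last_n and (recent >= threshold).all())
```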

Each validation test carries a different weight in our general confidence score. This scoring system helps us assess our confidence level in each result and provides clear steps for improvement when validation tests fail. We developed this process based on numerous client escalations and analyst feedback to replicate our internal debugging procedures. The scoring mechanism has been refined through extensive training on cases that our domain experts manually evaluated for confidence levels.

Results

Since Loops includes an A/B testing platform, we can benchmark our methodology against these gold-standard tests and improve our algorithms to match their results. Using dozens of A/B tests across different KPIs from various clients, our current algorithm achieves 90% accuracy—meaning the release impact results (statistically significant positive/negative, or non-significant) match the A/B test results 90% of the time. In comparison, naive pre-post calculations only achieve 50% accuracy against the same A/B test data.

Summary

Release Impact Analysis is a specialized methodology that measures how product changes—particularly new versions or feature launches—affect key performance indicators (KPIs) such as conversion, engagement, and revenue. Like an A/B test, this method compares performance before and after changes, but with one key difference: it analyzes the entire user population instead of splitting users into control and test groups. Release Impact creates a synthetic control group, maintaining the scientific rigor of randomized experiments while avoiding their complexity and cost.

The approach requires specific conditions to work effectively. It needs a clear, sustained impact (a "step function" effect) rather than temporary spikes. It also requires the absence of other major events—such as holidays, marketing campaigns, or additional releases—during the analysis period. When such events occur, users must specify them so the algorithm can account for their influence and isolate the true release impact. Segment data is crucial: the method assumes that user characteristics and behaviors used for segmentation were established before the product change, preventing confusion between release effects and marketing-driven changes. Through statistical techniques, we adjust for changes in user mix and temporal patterns, significantly improving accuracy (from roughly 50% to over 90% when compared to A/B test benchmarks).

Finally, comprehensive validations ensure reliability at every step, identifying potential issues like sparse data, unstable trends, or unaccounted events. By uniting methodological rigor with practical safeguards, Release Impact serves as a robust alternative when A/B testing isn't feasible.

Still need help? Contact Us