Staggered Difference-in-Differences#
This notebook demonstrates the StaggeredDifferenceInDifferences estimator in CausalPy, which handles staggered adoption settings where different units receive treatment at different times.
The Staggered Adoption Problem#
In many real-world settings, treatment is not applied to all units at the same time. Instead, different cohorts of units adopt treatment at different times. This creates a “staggered adoption” pattern.
Why Standard Two-Way Fixed Effects (TWFE) Can Fail#
The standard TWFE regression with a single treatment indicator:

\[
y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it}
\]

where \(\alpha_i\) is a unit fixed effect, \(\lambda_t\) is a time fixed effect, and \(D_{it}\) indicates whether unit \(i\) is treated at time \(t\), can produce biased estimates in staggered settings [Goodman-Bacon, 2021] because:
Already-treated units serve as implicit controls for later-treated units
The estimate is a weighted average of treatment effects that can include negative weights
Dynamic treatment effects (effects that vary over time since treatment) are not properly accounted for
Note
Notice that \(\tau\) in the TWFE formulation above is a scalar—a single number meant to summarize the treatment effect across all units and time periods. This is a fundamental limitation: it assumes the treatment effect is constant, regardless of when a unit was treated or how long they’ve been treated.
Event-study and modern staggered DiD approaches address this by treating treatment effects as dynamic—allowing \(\tau\) to vary by event-time (time relative to treatment). This captures realistic patterns like effects that build up over time, decay, or differ across treatment cohorts.
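For concreteness, a standard event-study regression (shown here as a sketch of the general idea, not the exact model CausalPy fits, which relies on imputation instead) replaces the scalar \(\tau\) with event-time-specific coefficients:

\[
y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \tau_k \, \mathbb{1}[t - G_i = k] + \varepsilon_{it}
\]

where \(G_i\) is the period in which unit \(i\) first receives treatment, so \(\tau_k\) measures the effect \(k\) periods relative to treatment.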
The Imputation-Based Solution#
CausalPy’s StaggeredDifferenceInDifferences uses an imputation-based approach inspired by Borusyak et al. [2024], proceeding in four steps (sketched in code after the list):
Fit a model on untreated observations only - using pre-treatment periods for eventually-treated units plus all periods for never-treated units
Predict counterfactual outcomes for all observations
Compute treatment effects as the difference between observed and predicted outcomes
Aggregate effects by event-time (time relative to treatment) for an event-study curve
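To make the mechanics concrete, here is a minimal hand-rolled sketch of these four steps using plain pandas and scikit-learn. This is an illustration of the logic, not CausalPy's actual implementation, and it assumes a panel with columns named unit, time, treated, treatment_time, and y, as in the simulated data below:

import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_att_by_event_time(panel: pd.DataFrame) -> pd.Series:
    """Toy imputation estimator: average effect by event-time."""
    # Steps 1-2: fit two-way fixed effects on untreated rows only,
    # then predict counterfactual (untreated) outcomes for every row
    X = pd.get_dummies(panel[["unit", "time"]].astype(str), drop_first=True)
    untreated = panel["treated"] == 0
    model = LinearRegression().fit(X[untreated], panel.loc[untreated, "y"])
    y0_hat = model.predict(X)
    # Step 3: treatment effect = observed minus predicted counterfactual
    effect = panel["y"] - y0_hat
    # Step 4: average over treated rows by event-time (time since treatment)
    event_time = panel["time"] - panel["treatment_time"]
    treated = panel["treated"] == 1
    return effect[treated].groupby(event_time[treated]).mean()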
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import causalpy as cp
from causalpy.data.simulate_data import generate_staggered_did_data
warnings.filterwarnings("ignore", category=FutureWarning)
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'
Generate Synthetic Staggered Panel Data#
We’ll create synthetic data with:
50 units observed over 20 time periods
3 treatment cohorts adopting at times 5, 10, and 15
Some never-treated units
A known dynamic treatment effect that we can verify
# Define the true treatment effects (dynamic effects that ramp up then stabilize)
# This is used both for data generation and for validating our estimates
TRUE_EFFECTS = {0: 1.0, 1: 1.5, 2: 2.0, 3: 2.5}
# Generate synthetic data with known treatment effects
df = generate_staggered_did_data(
    n_units=50,
    n_time_periods=20,
    treatment_cohorts={5: 12, 10: 12, 15: 12},  # 3 cohorts, 14 never-treated
    treatment_effects=TRUE_EFFECTS,
    unit_fe_scale=2.0,
    time_fe_scale=1.0,
    sigma=0.3,
    seed=42,
)
df.head()
|   | unit | time | treated | treatment_time | y | y0 | tau |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 15.0 | 0.618468 | 0.898554 | 0.0 |
| 1 | 0 | 1 | 0 | 15.0 | 1.179091 | 1.240722 | 0.0 |
| 2 | 0 | 2 | 0 | 15.0 | -1.132728 | -0.847722 | 0.0 |
| 3 | 0 | 3 | 0 | 15.0 | 0.188053 | 0.289763 | 0.0 |
| 4 | 0 | 4 | 0 | 15.0 | 0.391154 | 0.139062 | 0.0 |
print(
    f"Panel dimensions: {df['unit'].nunique()} units × {df['time'].nunique()} periods"
)
print(f"Total observations: {len(df)}")
print("\nTreatment cohorts:")
print(df.groupby("treatment_time")["unit"].nunique())
Panel dimensions: 50 units × 20 periods
Total observations: 1000
Treatment cohorts:
treatment_time
5.0 12
10.0 12
15.0 12
inf 14
Name: unit, dtype: int64
Visualize the Staggered Treatment Pattern#
from matplotlib.ticker import MultipleLocator

# Create a heatmap of treatment status
fig, ax = plt.subplots(figsize=(12, 6))

# Pivot to get a unit × time matrix of treatment status
treatment_matrix = df.pivot(index="unit", columns="time", values="treated")

# Sort units by treatment time for better visualization
unit_treatment_times = df.groupby("unit")["treatment_time"].first().sort_values()
treatment_matrix = treatment_matrix.loc[unit_treatment_times.index]

im = ax.imshow(
    treatment_matrix.values,
    aspect="auto",
    cmap="Greys",
    interpolation="nearest",
    vmin=0,
    vmax=1,
)
ax.set_xlabel("Time Period")
ax.set_ylabel("Unit (sorted by treatment time)")
ax.set_title(
    "Staggered Treatment Adoption Pattern\n(White = Untreated, Black = Treated)"
)
ax.yaxis.set_minor_locator(MultipleLocator(1))
plt.colorbar(im, ax=ax, label="Treatment Status")
plt.tight_layout()
plt.show()
Fit the Staggered DiD Model#
We’ll use a model with unit and time fixed effects, which is the baseline counterfactual model for imputation.
The formula defines a model of untreated potential outcomes. Crucially, this model is fitted only on observations that are not yet treated: pre-treatment periods for units that eventually receive treatment, plus all periods for never-treated units. The fitted model is then used to predict what the treated units would have experienced in the absence of treatment (the counterfactual). Treatment effects are computed as the difference between observed outcomes and these counterfactual predictions.
The formula y ~ 1 + C(unit) + C(time) specifies a two-way fixed effects model, but you’re not limited to this specification. If you have additional covariates that help explain variation in the outcome—such as weather conditions, seasonality indicators, economic indicators, or any other time-varying controls—you can include them in the formula. For example, y ~ 1 + C(unit) + C(time) + temperature + holiday would add temperature and holiday effects to the model. Including relevant covariates can improve the precision of your treatment effect estimates and strengthen the plausibility of the parallel trends assumption.
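As a concrete (hypothetical) sketch of the mechanics, the snippet below adds a covariate x1 to the formula. Note that x1 is synthetic noise created here purely for illustration, and the scikit-learn backend is used for speed:

from sklearn.linear_model import LinearRegression

# Add a synthetic time-varying covariate (illustration only; x1 is pure
# noise and is not part of the generated dataset above)
rng = np.random.default_rng(0)
df_cov = df.assign(x1=rng.normal(size=len(df)))

result_cov = cp.StaggeredDifferenceInDifferences(
    df_cov,
    formula="y ~ 1 + C(unit) + C(time) + x1",
    unit_variable_name="unit",
    time_variable_name="time",
    treated_variable_name="treated",
    treatment_time_variable_name="treatment_time",
    model=LinearRegression(),
)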
# Fit the staggered DiD model with PyMC
result = cp.StaggeredDifferenceInDifferences(
    df,
    formula="y ~ 1 + C(unit) + C(time)",
    unit_variable_name="unit",
    time_variable_name="time",
    treated_variable_name="treated",
    treatment_time_variable_name="treatment_time",
    model=cp.pymc_models.LinearRegression(
        sample_kwargs={
            "progressbar": True,
            "random_seed": 42,
        }
    ),
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, y_hat_sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
Sampling: [beta, y_hat, y_hat_sigma]
Sampling: [y_hat]
Sampling: [y_hat]
Event-Study Plot#
The event-study plot shows estimates at each event-time (time relative to treatment). Key features:
Pre-treatment placebo estimates (event-time < 0, gray squares): These should be close to zero if the parallel trends assumption holds. They are computed as residuals (observed minus predicted) for eventually-treated units before they receive treatment. These are not treatment effects—they are fit diagnostics.
Post-treatment ATT estimates (event-time ≥ 0, blue circles): These are the actual Average Treatment effect on the Treated (ATT) estimates showing the dynamic treatment effect over time since treatment.
Error bars: 94% Highest Density Intervals (HDI) from the Bayesian posterior.
Gray shaded region: Pre-treatment period (placebo check zone).
Note
Since we generated synthetic data with known treatment effects, we can overlay the true effects on the plot to validate the estimator’s performance. In real applications, the true effects are unknown.
fig, ax = result.plot()

# Overlay true treatment effects (only possible because we simulated the data).
# The true effect plateaus at 2.5 after event-time 3, hence the .get() default.
att_et = result.att_event_time_
post_treatment = att_et[att_et["event_time"] >= 0]
true_vals = [TRUE_EFFECTS.get(e, 2.5) for e in post_treatment["event_time"]]
ax[0].scatter(
    post_treatment["event_time"],
    true_vals,
    color="red",
    marker="x",
    s=100,
    linewidths=2,
    zorder=5,
    label="True Effect",
)
ax[0].legend()
plt.show()
View Summary Statistics#
The n_obs column in the event-time ATT table shows the number of treated unit-time observations contributing to each event-time estimate. This varies across event-times because different cohorts have different lengths of post-treatment history. For example, units treated early in the panel contribute observations at all event-times, while units treated later only contribute to earlier event-times (e.g., event-time 0, 1, 2) before the panel ends.
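As a quick sanity check, these post-treatment counts can be reproduced directly from the raw panel:

# Count treated unit-time observations at each event-time
treated_rows = df[df["treated"] == 1]
event_time = treated_rows["time"] - treated_rows["treatment_time"]
print(event_time.value_counts().sort_index())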
result.summary(round_to=2)
======================Staggered Difference in Differences=======================
Formula: y ~ 1 + C(unit) + C(time)
Number of units: 50
Number of time periods: 20
Treatment cohorts: [np.float64(5.0), np.float64(10.0), np.float64(15.0)]
Never-treated units: 14
Event-time estimates:
event_time type att att_lower att_upper n_obs
-15 placebo -0.089727 -0.182651 0.001678 12
-14 placebo -0.076510 -0.165531 0.015464 12
-13 placebo -0.117002 -0.208039 -0.026421 12
-12 placebo 0.030745 -0.059782 0.121330 12
-11 placebo 0.074539 -0.016959 0.164344 12
-10 placebo -0.076136 -0.143237 -0.012652 24
-9 placebo 0.081386 0.014587 0.148938 24
-8 placebo -0.013679 -0.080336 0.051946 24
-7 placebo -0.032442 -0.098791 0.033575 24
-6 placebo 0.004729 -0.060527 0.071197 24
-5 placebo 0.021254 -0.042810 0.082547 36
-4 placebo 0.018158 -0.044094 0.081336 36
-3 placebo 0.051156 -0.011460 0.112558 36
-2 placebo 0.024409 -0.037726 0.085558 36
-1 placebo -0.033889 -0.096769 0.030393 36
0 ATT 1.074842 0.990362 1.159714 36
1 ATT 1.477406 1.395102 1.559364 36
2 ATT 1.936514 1.854290 2.016924 36
3 ATT 2.491325 2.412439 2.572188 36
4 ATT 2.494277 2.412555 2.577045 36
5 ATT 2.439444 2.327826 2.554720 24
6 ATT 2.508034 2.391311 2.625448 24
7 ATT 2.495111 2.382095 2.607191 24
8 ATT 2.554547 2.444958 2.666289 24
9 ATT 2.504438 2.390342 2.620304 24
10 ATT 2.699058 2.515626 2.879858 12
11 ATT 2.379267 2.192838 2.562129 12
12 ATT 2.441211 2.260847 2.617471 12
13 ATT 2.580254 2.405220 2.759980 12
14 ATT 2.625220 2.452779 2.803349 12
Model coefficients:
Intercept 0.78, 94% HDI [0.63, 0.95]
C(unit)[T.1] -2.6, 94% HDI [-2.9, -2.3]
C(unit)[T.2] 0.92, 94% HDI [0.62, 1.2]
C(unit)[T.3] 1.4, 94% HDI [1.2, 1.6]
C(unit)[T.4] -4.4, 94% HDI [-4.6, -4.2]
C(unit)[T.5] -3.2, 94% HDI [-3.4, -3]
C(unit)[T.6] -0.37, 94% HDI [-0.58, -0.18]
C(unit)[T.7] -1.1, 94% HDI [-1.3, -0.94]
C(unit)[T.8] -0.43, 94% HDI [-0.73, -0.13]
C(unit)[T.9] -2.3, 94% HDI [-2.6, -2.1]
C(unit)[T.10] 1.4, 94% HDI [1.2, 1.5]
C(unit)[T.11] 1, 94% HDI [0.73, 1.3]
C(unit)[T.12] -0.37, 94% HDI [-0.58, -0.18]
C(unit)[T.13] 1.8, 94% HDI [1.6, 2]
C(unit)[T.14] 0.34, 94% HDI [0.12, 0.54]
C(unit)[T.15] -2.3, 94% HDI [-2.6, -2.1]
C(unit)[T.16] 0.21, 94% HDI [-0.088, 0.51]
C(unit)[T.17] -2.4, 94% HDI [-2.6, -2.2]
C(unit)[T.18] 1.2, 94% HDI [0.99, 1.5]
C(unit)[T.19] -0.7, 94% HDI [-0.89, -0.5]
C(unit)[T.20] -0.9, 94% HDI [-1.1, -0.7]
C(unit)[T.21] -1.7, 94% HDI [-2, -1.4]
C(unit)[T.22] 2, 94% HDI [1.8, 2.2]
C(unit)[T.23] -0.9, 94% HDI [-1.1, -0.72]
C(unit)[T.24] -1.3, 94% HDI [-1.6, -1.1]
C(unit)[T.25] -1.3, 94% HDI [-1.6, -0.95]
C(unit)[T.26] 0.51, 94% HDI [0.27, 0.74]
C(unit)[T.27] 0.18, 94% HDI [-0.021, 0.37]
C(unit)[T.28] 0.23, 94% HDI [0.015, 0.43]
C(unit)[T.29] 0.18, 94% HDI [-0.025, 0.39]
C(unit)[T.30] 3.7, 94% HDI [3.5, 4]
C(unit)[T.31] -1.3, 94% HDI [-1.6, -1.1]
C(unit)[T.32] -1.4, 94% HDI [-1.6, -1.2]
C(unit)[T.33] -2.1, 94% HDI [-2.3, -1.9]
C(unit)[T.34] 0.65, 94% HDI [0.35, 0.96]
C(unit)[T.35] 1.8, 94% HDI [1.5, 2.1]
C(unit)[T.36] -0.81, 94% HDI [-1, -0.61]
C(unit)[T.37] -2.1, 94% HDI [-2.3, -1.8]
C(unit)[T.38] -2.3, 94% HDI [-2.5, -2.1]
C(unit)[T.39] 0.68, 94% HDI [0.48, 0.87]
C(unit)[T.40] 0.91, 94% HDI [0.66, 1.1]
C(unit)[T.41] 0.7, 94% HDI [0.47, 0.93]
C(unit)[T.42] -2.1, 94% HDI [-2.3, -1.9]
C(unit)[T.43] -0.16, 94% HDI [-0.37, 0.045]
C(unit)[T.44] -0.2, 94% HDI [-0.41, 0.0086]
C(unit)[T.45] -0.0012, 94% HDI [-0.29, 0.28]
C(unit)[T.46] 1.1, 94% HDI [0.91, 1.4]
C(unit)[T.47] -0.23, 94% HDI [-0.52, 0.066]
C(unit)[T.48] 0.81, 94% HDI [0.59, 1]
C(unit)[T.49] -0.54, 94% HDI [-0.84, -0.25]
C(time)[T.1] 0.38, 94% HDI [0.26, 0.49]
C(time)[T.2] -1.7, 94% HDI [-1.8, -1.6]
C(time)[T.3] -0.56, 94% HDI [-0.68, -0.44]
C(time)[T.4] -0.77, 94% HDI [-0.89, -0.66]
C(time)[T.5] -0.96, 94% HDI [-1.1, -0.84]
C(time)[T.6] -0.51, 94% HDI [-0.64, -0.38]
C(time)[T.7] 1.3, 94% HDI [1.2, 1.4]
C(time)[T.8] -1, 94% HDI [-1.1, -0.89]
C(time)[T.9] 0.72, 94% HDI [0.59, 0.84]
C(time)[T.10] -1.9, 94% HDI [-2.1, -1.8]
C(time)[T.11] -0.55, 94% HDI [-0.7, -0.42]
C(time)[T.12] -0.028, 94% HDI [-0.17, 0.11]
C(time)[T.13] 0.38, 94% HDI [0.24, 0.53]
C(time)[T.14] 0.51, 94% HDI [0.37, 0.65]
C(time)[T.15] 0.45, 94% HDI [0.28, 0.63]
C(time)[T.16] -0.61, 94% HDI [-0.79, -0.42]
C(time)[T.17] -0.69, 94% HDI [-0.87, -0.5]
C(time)[T.18] 0.55, 94% HDI [0.37, 0.73]
C(time)[T.19] -0.55, 94% HDI [-0.73, -0.37]
y_hat_sigma 0.31, 94% HDI [0.29, 0.33]
Effect Summary#
Get a prose summary of the causal effects. The summary includes the average post-treatment effect, and if pre-treatment placebo effects are available, it reports on whether the parallel trends assumption appears to be satisfied:
effect_summary = result.effect_summary()
print(effect_summary.text)
Staggered DiD analysis: The average post-treatment effect across event-times was 2.31 (average 94% HDI [2.19, 2.44]). Pre-treatment placebo check: Average pre-treatment effect was -0.01, consistent with parallel trends assumption. Analysis includes 3 treatment cohort(s).
Pre-Treatment Placebo Check#
The att_event_time_ table includes pre-treatment event times (negative values). These represent the average residuals (observed - predicted) for eventually-treated units before they receive treatment.
Important
Pre-treatment estimates are not ATT (Average Treatment effect on the Treated). They are placebo/fit diagnostics that validate the counterfactual model and parallel trends assumption. If these values are close to zero, it suggests the model fits well and parallel trends is plausible.
# Separate pre- and post-treatment effects
att_et = result.att_event_time_
pre_treatment = att_et[att_et["event_time"] < 0]
post_treatment = att_et[att_et["event_time"] >= 0]

print("Pre-treatment (placebo) effects:")
print(f"  Mean: {pre_treatment['att'].mean():.3f}")
print("  Should be close to zero if parallel trends holds")
print()
print("Post-treatment effects:")
print(f"  Mean: {post_treatment['att'].mean():.3f}")
# Note: TRUE_EFFECTS only lists event-times 0-3; the true effect plateaus at
# 2.5 thereafter, so this simple mean (1.75) understates the true average over
# all 15 post-treatment event-times (~2.3). The estimated mean above is
# therefore expected to exceed it.
print(f"  True average effect: {np.mean(list(TRUE_EFFECTS.values())):.3f}")
Pre-treatment (placebo) effects:
Mean: -0.009
Should be close to zero if parallel trends holds
Post-treatment effects:
Mean: 2.313
True average effect: 1.750
Examine the Event-Time ATT Table#
The att_event_time_ attribute provides direct access to the underlying event-time ATT estimates as a pandas DataFrame. This is the same data visualized in the event-study plot, but in tabular form.
When to use this table:
Reporting precise estimates: When you need exact point estimates and credible intervals for a paper or presentation, rather than reading approximate values from a plot.
Custom analysis: When you want to perform additional calculations on the estimates, such as computing cumulative effects, testing specific hypotheses, or comparing effects at particular event-times.
Debugging and validation: When checking that the model is behaving as expected, or comparing estimates across different model specifications.
Exporting results: When you need to save estimates to a file or integrate them into a larger analysis pipeline.
The table includes event_time (periods relative to treatment), the point estimate (att), credible/confidence interval bounds (att_lower, att_upper), and n_obs (the number of treated observations at each event-time); the summary() output above additionally labels each row as placebo or ATT. An example of a custom calculation follows the table.
result.att_event_time_
|   | event_time | att | att_lower | att_upper | n_obs |
|---|---|---|---|---|---|
| 0 | -15 | -0.089727 | -0.182651 | 0.001678 | 12 |
| 1 | -14 | -0.076510 | -0.165531 | 0.015464 | 12 |
| 2 | -13 | -0.117002 | -0.208039 | -0.026421 | 12 |
| 3 | -12 | 0.030745 | -0.059782 | 0.121330 | 12 |
| 4 | -11 | 0.074539 | -0.016959 | 0.164344 | 12 |
| 5 | -10 | -0.076136 | -0.143237 | -0.012652 | 24 |
| 6 | -9 | 0.081386 | 0.014587 | 0.148938 | 24 |
| 7 | -8 | -0.013679 | -0.080336 | 0.051946 | 24 |
| 8 | -7 | -0.032442 | -0.098791 | 0.033575 | 24 |
| 9 | -6 | 0.004729 | -0.060527 | 0.071197 | 24 |
| 10 | -5 | 0.021254 | -0.042810 | 0.082547 | 36 |
| 11 | -4 | 0.018158 | -0.044094 | 0.081336 | 36 |
| 12 | -3 | 0.051156 | -0.011460 | 0.112558 | 36 |
| 13 | -2 | 0.024409 | -0.037726 | 0.085558 | 36 |
| 14 | -1 | -0.033889 | -0.096769 | 0.030393 | 36 |
| 15 | 0 | 1.074842 | 0.990362 | 1.159714 | 36 |
| 16 | 1 | 1.477406 | 1.395102 | 1.559364 | 36 |
| 17 | 2 | 1.936514 | 1.854290 | 2.016924 | 36 |
| 18 | 3 | 2.491325 | 2.412439 | 2.572188 | 36 |
| 19 | 4 | 2.494277 | 2.412555 | 2.577045 | 36 |
| 20 | 5 | 2.439444 | 2.327826 | 2.554720 | 24 |
| 21 | 6 | 2.508034 | 2.391311 | 2.625448 | 24 |
| 22 | 7 | 2.495111 | 2.382095 | 2.607191 | 24 |
| 23 | 8 | 2.554547 | 2.444958 | 2.666289 | 24 |
| 24 | 9 | 2.504438 | 2.390342 | 2.620304 | 24 |
| 25 | 10 | 2.699058 | 2.515626 | 2.879858 | 12 |
| 26 | 11 | 2.379267 | 2.192838 | 2.562129 | 12 |
| 27 | 12 | 2.441211 | 2.260847 | 2.617471 | 12 |
| 28 | 13 | 2.580254 | 2.405220 | 2.759980 | 12 |
| 29 | 14 | 2.625220 | 2.452779 | 2.803349 | 12 |
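As one example of the custom analyses mentioned above, cumulative and average post-treatment effects can be computed directly from this table:

att_et = result.att_event_time_
post = att_et[att_et["event_time"] >= 0]
print(f"Cumulative post-treatment effect: {post['att'].sum():.2f}")
print(f"Average post-treatment effect: {post['att'].mean():.2f}")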
Group-Time ATT Table#
The att_group_time_ attribute provides the most granular level of treatment effect estimates: effects for each combination of treatment cohort (G) and calendar time (t). This is the “building block” data from which event-time effects are aggregated.
Understanding the difference from event-time effects:
Event-time ATT (att_event_time_): Aggregates effects across cohorts at the same relative time since treatment (e.g., “effect 2 periods after treatment” averages over all cohorts).
Group-time ATT (att_group_time_): Keeps cohort and calendar time separate, showing effects for specific cohort-time pairs (e.g., “effect for cohort treated at t=5, observed at t=7”).
When to use this table:
Cohort heterogeneity analysis: When you suspect treatment effects differ across cohorts (e.g., early adopters vs late adopters respond differently to treatment).
Calendar time effects: When you want to check if treatment effects vary with calendar time, not just time since treatment (e.g., macroeconomic conditions may amplify or dampen effects).
Diagnostics: When event-time effects look suspicious and you want to trace the issue back to specific cohort-time combinations.
Custom aggregation: When you want to compute alternative summary measures (e.g., cohort-specific average effects, or effects weighted by cohort size).
The table includes cohort (treatment time G), time (calendar time t), and treatment effect estimates with uncertainty intervals (an example aggregation follows the table):
result.att_group_time_.head(20)
|   | cohort | time | att | att_lower | att_upper |
|---|---|---|---|---|---|
| 0 | 5.0 | 5 | 1.105831 | 0.981070 | 1.234096 |
| 1 | 5.0 | 6 | 1.313366 | 1.191865 | 1.438411 |
| 2 | 5.0 | 7 | 1.874254 | 1.738034 | 2.002803 |
| 3 | 5.0 | 8 | 2.324531 | 2.195446 | 2.449806 |
| 4 | 5.0 | 9 | 2.495319 | 2.365605 | 2.621871 |
| 5 | 5.0 | 10 | 2.330355 | 2.188604 | 2.468263 |
| 6 | 5.0 | 11 | 2.469498 | 2.326770 | 2.611914 |
| 7 | 5.0 | 12 | 2.452538 | 2.311010 | 2.599357 |
| 8 | 5.0 | 13 | 2.521460 | 2.381104 | 2.661465 |
| 9 | 5.0 | 14 | 2.492747 | 2.348853 | 2.636524 |
| 10 | 5.0 | 15 | 2.699058 | 2.515626 | 2.879858 |
| 11 | 5.0 | 16 | 2.379267 | 2.192838 | 2.562129 |
| 12 | 5.0 | 17 | 2.441211 | 2.260847 | 2.617471 |
| 13 | 5.0 | 18 | 2.580254 | 2.405220 | 2.759980 |
| 14 | 5.0 | 19 | 2.625220 | 2.452779 | 2.803349 |
| 15 | 10.0 | 10 | 0.995009 | 0.864157 | 1.123532 |
| 16 | 10.0 | 11 | 1.493974 | 1.367916 | 1.621203 |
| 17 | 10.0 | 12 | 1.948346 | 1.815420 | 2.081204 |
| 18 | 10.0 | 13 | 2.485107 | 2.350294 | 2.616990 |
| 19 | 10.0 | 14 | 2.378103 | 2.246150 | 2.507052 |
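As an example of the custom aggregation described above, cohort-specific average post-treatment effects are a one-liner on this table:

# Average the group-time effects within each treatment cohort
print(result.att_group_time_.groupby("cohort")["att"].mean().round(2))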
Using scikit-learn Models#
For faster analysis, you can use scikit-learn models instead of PyMC. The error bars represent approximate 95% confidence intervals (±1.96 standard errors).
from sklearn.linear_model import LinearRegression

result_ols = cp.StaggeredDifferenceInDifferences(
    df,
    formula="y ~ 1 + C(unit) + C(time)",
    unit_variable_name="unit",
    time_variable_name="time",
    treated_variable_name="treated",
    treatment_time_variable_name="treatment_time",
    model=LinearRegression(),
)

# Plot the event-study results
fig, ax = result_ols.plot()
plt.show()
The OLS approach produces similar point estimates to the Bayesian model but runs much faster. However, the uncertainty quantification differs: OLS uses asymptotic standard errors while PyMC provides full posterior distributions.
Key Takeaways#
Staggered adoption requires special handling - standard TWFE can produce biased estimates
The imputation approach fits a model on untreated observations and predicts counterfactuals
Event-study curves show dynamic treatment effects and allow for parallel trends checks
Pre-treatment “placebo” estimates (event-time < 0) are not treatment effects—they are fit diagnostics. Values near zero support the parallel trends assumption.
Post-treatment ATT estimates (event-time ≥ 0) are the actual Average Treatment effect on the Treated
CausalPy supports both Bayesian and OLS approaches for flexibility
References#
Andrew Goodman-Bacon. Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254–277, 2021.
Kirill Borusyak, Xavier Jaravel, and Jann Spiess. Revisiting event-study designs: robust and efficient estimation. Review of Economic Studies, 91(6):3253–3285, 2024.