There are only 3 functions in this package:

SimDiD()
: This function simulates data.

DiDge()
: This function estimates DiD for a single cohort and a single event time.

DiD()
: This function estimates DiD for all available cohorts and event times.

We now demonstrate the simplest application of the 3 functions.
Detailed documentation for each of these functions is available from the Reference tab above.
The package provides a simple data simulator, SimDiD(), which is used as follows:
sim = SimDiD(sample_size = 400, seed=123)
# true ATTs in the simulation
print(sim$true_ATT)
#> cohort event ATTge
#> 1: 2007 0 1.000000
#> 2: 2007 1 2.000000
#> 3: 2007 2 3.000000
#> 4: 2007 3 4.000000
#> 5: 2007 4 5.000000
#> 6: 2007 5 6.000000
#> 7: 2007 6 7.000000
#> 8: 2010 0 1.500000
#> 9: 2010 1 2.500000
#> 10: 2010 2 3.500000
#> 11: 2010 3 4.500000
#> 12: 2012 0 2.000000
#> 13: 2012 1 3.000000
#> 14: Average 0 1.501672
#> 15: Average 1 2.501672
#> 16: Average 2 3.251256
#> 17: Average 3 4.251256
#> 18: Average 4 5.000000
#> 19: Average 5 6.000000
#> 20: Average 6 7.000000
# simulated data
simdata = sim$simdata
print(simdata)
#> id year cohort Y
#> 1: 1 2003 2010 8.773933
#> 2: 1 2004 2010 9.846116
#> 3: 1 2005 2010 9.963274
#> 4: 1 2006 2010 9.997385
#> 5: 1 2007 2010 10.060080
#> ---
#> 4396: 400 2009 2007 8.035127
#> 4397: 400 2010 2007 14.438798
#> 4398: 400 2011 2007 11.973035
#> 4399: 400 2012 2007 13.033367
#> 4400: 400 2013 2007 13.552533
Your real data needs to have this “long” format, i.e., there need to be variables for the individual identifier (e.g. id), the time variable (e.g. year), the cohort at which treatment begins (e.g. cohort), and the outcome variable (e.g. Y). No other variables are required. These variables can have any names you prefer.
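If your data are stored in a "wide" layout instead, with one outcome column per year, they can be reshaped first. The following is a minimal sketch in base R; the toy data frame and its column names (Y_2009, Y_2010) are invented for illustration and are not part of the package.

```r
# Hypothetical wide data: one row per individual, one outcome column per year.
wide = data.frame(
  id     = 1:2,
  cohort = c(2010, 2012),
  Y_2009 = c(1.0, 2.0),
  Y_2010 = c(1.5, 2.1)
)

# Stack the yearly outcome columns into a single Y column with a year variable.
long = reshape(
  wide,
  direction = "long",
  varying   = c("Y_2009", "Y_2010"),
  v.names   = "Y",
  timevar   = "year",
  times     = c(2009, 2010),
  idvar     = "id"
)

# Order by individual and year, and keep the four required variables.
long = long[order(long$id, long$year), c("id", "year", "cohort", "Y")]
rownames(long) = NULL
print(long)
```

The result has one row per individual and year, which is the layout of simdata shown above.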
Before going to the estimation, we need to prepare a list of the variable names:
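A minimal sketch of such a list, using the column names of the simulated data above. The element names time_name, outcome_name, cohort_name, and id_name are assumed here to be the ones the package expects; check the reference for DiDge() and DiD() before relying on them.

```r
# Map each role to the corresponding column name in the data.
# The element names on the left are an assumption about what the package expects.
varnames = list()
varnames$time_name    = "year"    # time variable
varnames$outcome_name = "Y"       # outcome variable
varnames$cohort_name  = "cohort"  # period in which treatment begins
varnames$id_name      = "id"      # individual identifier
```

For your own data, only the strings on the right-hand side would change.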
We choose an event time (+3) and a cohort of treated units (2010), then estimate DiD:
did_2010 = DiDge(inputdata = simdata, varnames = varnames,
cohort_time = 2010, event_postperiod = 3)
print(did_2010)
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol Ntreated
#> 1: 2010 3 -1 2013 4.629839 0.1962355 101 100
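Comparing this estimate (4.63, standard error 0.20) to the true ATT of 4.5 for cohort 2010 at event time 3 above, we see that the estimation performed well: the estimate lies within one standard error of the truth.
Note that it used -1 as the base event time by default (the BaseEvent column). This is easy to change.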
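To make the estimated quantity concrete, the following base-R sketch computes a single cohort-by-event-time DiD by hand on a tiny invented panel. It illustrates the idea only and is not the package's implementation. Coding never-treated units as cohort = Inf and using units not yet treated by the post year as controls are assumptions of this sketch.

```r
# Toy long-format panel: 4 individuals observed in 2009 and 2013 (invented values).
toy = data.frame(
  id     = rep(1:4, each = 2),
  year   = rep(c(2009, 2013), times = 4),
  cohort = rep(c(2010, 2010, Inf, Inf), each = 2),  # Inf = never treated (assumption)
  Y      = c(10, 16,  11, 17,  9, 11,  10, 12)
)

cohort_time = 2010
event_post  = 3    # event time of interest
base_event  = -1   # base event time

post_year = cohort_time + event_post  # 2013
base_year = cohort_time + base_event  # 2009

# Within-individual change in Y from the base year to the post year.
pre  = toy[toy$year == base_year, ]
post = toy[toy$year == post_year, ]
m = merge(pre, post, by = c("id", "cohort"), suffixes = c("_pre", "_post"))
m$dY = m$Y_post - m$Y_pre

# Treated: the chosen cohort. Control: not yet treated by the post year.
treated = m$cohort == cohort_time
control = m$cohort > post_year

# DiD = mean change among treated minus mean change among controls.
att_hat = mean(m$dY[treated]) - mean(m$dY[control])
print(att_hat)
```

In this toy panel the treated individuals gain 6 and the controls gain 2, so the hand-computed DiD is 4.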
Suppose we want to estimate the ATT at each event time from -3 to +5. We can do so as follows:
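A call along the following lines should do this. It is a sketch that assumes min_event and max_event are the arguments bounding the event-time window, and that the variable-name list uses the element names time_name, outcome_name, cohort_name, and id_name; consult the reference for DiD() to confirm both.

```r
# Guarded so the sketch also runs (as a no-op) where the package is not installed.
if (requireNamespace("DiDforBigData", quietly = TRUE)) {
  library(DiDforBigData)

  # Same simulated data as above.
  sim = SimDiD(sample_size = 400, seed = 123)
  simdata = sim$simdata

  # Variable-name list (element names are an assumption, see the lead-in).
  varnames = list(time_name = "year", outcome_name = "Y",
                  cohort_name = "cohort", id_name = "id")

  # Estimate DiD for all cohorts at event times -3 through +5.
  did_all = DiD(inputdata = simdata, varnames = varnames,
                min_event = -3, max_event = 5)
}
```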
The output of DiD() is a list. One object in the list is results_average, which includes the average ATT across cohorts:
print(did_all$results_average)
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> 1: -3 -1 -0.03472821 0.10802340 603 299
#> 2: -2 -1 -0.06416254 0.09847063 603 299
#> 3: -1 -1 0.00000000 0.00000000 603 299
#> 4: 0 -1 1.44852075 0.10387376 603 299
#> 5: 1 -1 2.67299583 0.09964407 603 299
#> 6: 2 -1 3.17946138 0.12477922 402 199
#> 7: 3 -1 4.27349270 0.12596253 302 199
#> 8: 4 -1 4.98423853 0.17470913 201 99
#> 9: 5 -1 5.66743134 0.21029573 101 99
The other output from DiD() is results_cohort, which includes all combinations of event times and cohorts. It is too large to print here, so let’s just print the results for event times 1 and 2:
print(did_all$results_cohort[EventTime==1 | EventTime==2])
#> Cohort EventTime BaseEvent CalendarTime ATTge ATTge_SE Ncontrol Ntreated
#> 1: 2007 1 -1 2008 2.263430 0.1498733 301 99
#> 2: 2007 2 -1 2009 3.083096 0.1666782 301 99
#> 3: 2010 1 -1 2011 2.474058 0.1733037 201 100
#> 4: 2010 2 -1 2012 3.274863 0.1863323 101 100
#> 5: 2012 1 -1 2013 3.277404 0.2117916 101 100
Note: the simulated data ends in 2013, so event time 2 is not available for treatment cohort 2012.
To take an average across multiple event times, use the Esets argument. It accepts a list, in which each item is a vector of event times over which to average:
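As a sketch, the call below requests two averages: one over event times 1, 2, and 3, and one over event times 1 and 3. The two sets are chosen only for illustration. The min_event and max_event arguments and the element names in varnames are assumptions, as is the object name did_esets; the name of the list element that holds the averaged estimates is not assumed, so the sketch simply lists what is returned.

```r
# Guarded so the sketch also runs (as a no-op) where the package is not installed.
if (requireNamespace("DiDforBigData", quietly = TRUE)) {
  library(DiDforBigData)

  # Same simulated data and variable-name list as above.
  sim = SimDiD(sample_size = 400, seed = 123)
  simdata = sim$simdata
  varnames = list(time_name = "year", outcome_name = "Y",
                  cohort_name = "cohort", id_name = "id")

  # Each list item is one set of event times to average over.
  did_esets = DiD(inputdata = simdata, varnames = varnames,
                  min_event = -3, max_event = 5,
                  Esets = list(c(1, 2, 3), c(1, 3)))

  # Inspect which result tables were returned.
  print(names(did_esets))
}
```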