Here, we use regression analysis to estimate the impact of receiving a monetary incentive on the decision to learn the result of an HIV test. Visit my previous blog to learn more about the theoretical aspects discussed in this section.
We use data from “The Demand for, and Impact of, Learning HIV Status” study in Malawi. The study is a randomized controlled trial (RCT) in which individuals were offered monetary incentives of varying amounts to learn their HIV status after taking an HIV test.
Study: Thornton, Rebecca L. 2008. “The Demand for, and Impact of, Learning HIV Status.” American Economic Review, 98 (5): 1829-63.
Data file: Click here
Detailed description of the intervention: Click here
We use the “Thornton HIV Testing Data.dta” for the analysis.
Importing and describing the data
Execution in R
The data file is a Stata (.dta) file. To import the dataset in R, we will need to install the haven package in R and use the read_dta() function. Run the following code in R to install the haven package:
install.packages("haven")
Now, import the dataset and check the list of variables and number of observations. When you download the data file, it comes with a readme file. Please read the readme file to learn more about the variables.
The str() function in R provides the structure of the dataset. However, we only use the names() and dim() functions here to keep the output short. See the Stata execution section for a detailed description of the variables.
library(haven)
# import the .dta file
data <- read_dta("C:/Data analysis/Thornton data/Data/Thornton HIV Testing Data.dta")
# List of variables
names(data)
[1] "site" "rumphi" "balaka"
[4] "villnum" "m1out" "m2out"
[7] "survey2004" "got" "zone"
[10] "distvct" "tinc" "Ti"
[13] "any" "under" "over"
[16] "simaverage" "age" "age2"
[19] "male" "mar" "educ2004"
[22] "timeshadsex_s" "hadsex12" "eversex"
[25] "usecondom04" "tb" "thinktreat"
[28] "a8" "land2004" "T_consentsti"
[31] "T_consenthiv" "T_final_trichresult" "T_final_result_ct"
[34] "T_final_result_gc" "hiv2004" "test2004"
[37] "followup_tested" "followupsurvey" "havesex_fo"
[40] "numsex_fo" "likelihoodhiv_fo" "numcond"
[43] "anycond" "bought"
# dimensions of the dataset
dim(data)
[1] 4820 44
There are 44 variables and 4,820 observations.
Execution in Stata
Use the cd command to set the working directory and the use command to load the dataset. The describe command provides a list of variables with their types and labels.
cd "C://Data analysis"
use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
describe
. cd "C://Data analysis"
C:\Data analysis
. use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
. describe
Contains data from Thornton data/Data/Thornton HIV Testing Data.dta
obs: 4,820
vars: 44 12 Mar 2008 11:10
size: 785,660 (_dta has notes)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
site float %9.0g 1=Mchinji 2=Balaka 3=Rumphi
rumphi float %9.0g Rumphi
balaka float %9.0g Balaka
villnum double %9.0g VILLNUM
m1out float %9.0g Survey outcome in 1998
m2out float %9.0g Survey outcome in 2001
survey2004 float %9.0g completed baseline survey
got float %9.0g Got HIV results
zone float %9.0g VCT zone
distvct float %9.0g Distance in km
tinc float %9.0g Total value of the incentive
(kwacha)
Ti float %9.0g Value of incentive (kwacha)
discrete
any float %9.0g Received any incentive
under float %9.0g under 1.5 km
over float %9.0g over 1.5 km
simaverage float %9.0g (mean) simaverage
age float %10.0g Age
age2 float %9.0g Age squared
male float %9.0g Gender
mar float %9.0g Married at baseline
educ2004 float %9.0g Yrs of completed education
timeshadsex_s byte %8.0g Times per month had sex
(subsample)
hadsex12 float %9.0g Had sex in past 12 months
(baseline)
eversex float %9.0g Ever had sex at baseline
usecondom04 float %9.0g Used a condom during last year at
baseline
tb float %9.0g HIV Test before baseline
thinktreat float %9.0g Think there will be ARV treatment
in the future
a8 byte %8.0g A8 Likelihood of HIV infection
land2004 float %9.0g Owned any land at baseline
T_consentsti long %8.0g yesno consent to sti test
T_consenthiv long %8.0g yesno consent to hiv test
T_final_trich~t float %9.0g res final trich results
T_final_resul~t float %23.0g otherres final CT results
T_final_resul~c float %23.0g otherres final GC results
hiv2004 float %9.0g HIV results
test2004 float %9.0g HIV test in 2004
followup_tested byte %8.0g Different HIV testing sample.
Drop from analysis
followupsurvey float %9.0g Was interviewed at follow-up
havesex_fo byte %10.0g Had sex between baseline and
follow-up
numsex_fo byte %10.0g Num partners between baseline and
follow-up
likelihoodhiv~o int %10.0g Likelihood of infection at
follow-up
numcond float %9.0g Number of condoms purchased at
follow-up
anycond float %9.0g Any condoms purchased at the
follow-up
bought float %9.0g Bought condoms on own at
follow-up
-------------------------------------------------------------------------------
Sorted by:
Regression Analysis
Here, we analyze the impact of receiving any monetary incentive on study participants' decision to obtain their HIV test results. The tinc variable records the amount of the monetary incentive (in kwacha) received by each participant. We tabulate tinc to see the range of incentives offered.
Execution in R
library(dplyr)
# ensure that all rows are displayed when printing tibbles
options(tibble.print_max = Inf)
# tabulate tinc
data |> filter(!is.na(tinc))|> # remove NA
select(tinc) |> # select tinc from dataset
group_by(tinc) |> # group by tinc
summarize(count=n()) |> # create table with frequency
mutate(percent = count/sum(count)*100) |> # create percent variable
round(digits = 2) # round to 2 decimal places
# A tibble: 27 × 3
tinc count percent
<dbl> <dbl> <dbl>
1 0 679 23.4
2 10 58 2
3 20 154 5.31
4 30 81 2.79
5 40 64 2.21
6 50 205 7.07
7 60 37 1.28
8 70 40 1.38
9 80 7 0.24
10 90 8 0.28
11 100 492 17.0
12 110 14 0.48
13 120 82 2.83
14 130 9 0.31
15 140 42 1.45
16 150 43 1.48
17 160 28 0.97
18 170 8 0.28
19 180 9 0.31
20 200 431 14.9
21 210 36 1.24
22 220 48 1.65
23 230 30 1.03
24 240 2 0.07
25 250 68 2.34
26 260 3 0.1
27 300 223 7.69
Execution in Stata
cd "C://Data analysis"
use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
tabulate tinc
. cd "C://Data analysis"
C:\Data analysis
. use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
. tabulate tinc
Total value |
of the |
incentive |
(kwacha) | Freq. Percent Cum.
------------+-----------------------------------
0 | 679 23.41 23.41
10 | 58 2.00 25.41
20 | 154 5.31 30.71
30 | 81 2.79 33.51
40 | 64 2.21 35.71
50 | 205 7.07 42.78
60 | 37 1.28 44.05
70 | 40 1.38 45.43
80 | 7 0.24 45.67
90 | 8 0.28 45.95
100 | 492 16.96 62.91
110 | 14 0.48 63.39
120 | 82 2.83 66.22
130 | 9 0.31 66.53
140 | 42 1.45 67.98
150 | 43 1.48 69.46
160 | 28 0.97 70.42
170 | 8 0.28 70.70
180 | 9 0.31 71.01
200 | 431 14.86 85.87
210 | 36 1.24 87.11
220 | 48 1.65 88.76
230 | 30 1.03 89.80
240 | 2 0.07 89.87
250 | 68 2.34 92.21
260 | 3 0.10 92.31
300 | 223 7.69 100.00
------------+-----------------------------------
Total | 2,901 100.00
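Stata's tabulate output includes a cumulative-percent column that the dplyr pipeline above does not compute. As a minimal base-R sketch, the same columns can be built with table(), prop.table(), and cumsum(). The vector tinc_demo is made up for illustration and is not the Thornton data.

```r
# Made-up incentive amounts for illustration only (not the Thornton data)
tinc_demo <- c(0, 0, 0, 50, 50, 100, 100, 100, 100, NA)

# table() drops NA by default, like tabulate in Stata
freq <- table(tinc_demo)
percent <- as.numeric(prop.table(freq)) * 100

tab <- data.frame(
  tinc    = as.numeric(names(freq)),   # distinct incentive values
  count   = as.integer(freq),          # frequency of each value
  percent = round(percent, 2),         # share of non-missing rows
  cum     = round(cumsum(percent), 2)  # cumulative percent
)
print(tab)
```

The last value in the cum column should be 100, which is a quick check that no category was lost.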
Running the Regression
Here, we focus only on the effect of receiving any financial incentive. Thus, we create a factor variable indicating whether or not the respondent received an incentive. Once we create the treatment variable, we run a regression to analyze the impact of receiving a financial incentive on the decision to obtain HIV results. The variable got indicates whether or not the respondent received the HIV result. In R, we use the lm() function to run the regression; in Stata, we use the regress command. Observations with a missing value of got are excluded in both cases, leaving 2,834 observations.
Execution in R
data_1 <- data |>
filter(!is.na(tinc)) |> #remove na in tinc
mutate(treatment = ifelse(tinc > 0, 1, 0)) # create treatment variable
data_1$treatment <- factor(data_1$treatment,
levels = c(0, 1),
labels = c("Control", "Treatment"))
reg <- lm(got ~ treatment, data = data_1) #run the regression
summary(reg)
Call:
lm(formula = got ~ treatment, data = data_1)
Residuals:
Min 1Q Median 3Q Max
-0.7892 -0.3387 0.2108 0.2108 0.6613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.33868 0.01696 19.97 <2e-16 ***
treatmentTreatment 0.45055 0.01920 23.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4232 on 2832 degrees of freedom
(67 observations deleted due to missingness)
Multiple R-squared: 0.1628, Adjusted R-squared: 0.1625
F-statistic: 550.8 on 1 and 2832 DF, p-value: < 2.2e-16
Execution in Stata
cd "C://Data analysis"
use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
drop if missing(tinc) | missing(got)
generate treatment = cond(tinc>0, 1, 0)
label define treatment 0 "Control" 1 "Treatment"
label val treatment treatment
regress got treatment
. cd "C://Data analysis"
C:\Data analysis
. use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
. drop if missing(tinc) | missing(got)
(1,986 observations deleted)
. generate treatment = cond(tinc>0, 1, 0)
. label define treatment 0 "Control" 1 "Treatment"
. label val treatment treatment
. regress got treatment
Source | SS df MS Number of obs = 2,834
-------------+---------------------------------- F(1, 2832) = 550.78
Model | 98.6657682 1 98.6657682 Prob > F = 0.0000
Residual | 507.321529 2,832 .179138958 R-squared = 0.1628
-------------+---------------------------------- Adj R-squared = 0.1625
Total | 605.987297 2,833 .213903035 Root MSE = .42325
------------------------------------------------------------------------------
got | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | .4505519 .019198 23.47 0.000 .4129083 .4881954
_cons | .3386838 .0169571 19.97 0.000 .3054343 .3719333
------------------------------------------------------------------------------
The treatment effect of receiving a financial incentive is 0.4506, or about 45 percentage points: about 79 percent of those who received an incentive obtained their results, compared with about 34 percent in the control group. The treatment effect is statistically significant (p < 0.001; Stata displays the p-value as 0.000).
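A regression on a single binary treatment variable is equivalent to a comparison of group means: the intercept is the control-group mean and the treatment coefficient is the difference in means. The sketch below checks this on simulated data. The variable names mirror the ones used above, but the sample size and probabilities are assumptions chosen for illustration.

```r
# Simulated data for illustration only (not the Thornton data)
set.seed(1)
n <- 1000
treatment <- rbinom(n, 1, 0.7)                               # 1 = incentive
got <- rbinom(n, 1, ifelse(treatment == 1, 0.8, 0.35))       # 1 = got results

fit <- lm(got ~ treatment)

control_mean   <- mean(got[treatment == 0])
treatment_mean <- mean(got[treatment == 1])

# Intercept equals the control mean; slope equals the difference in means
print(all.equal(unname(coef(fit)[1]), control_mean))
print(all.equal(unname(coef(fit)[2]), treatment_mean - control_mean))
```

Both comparisons should print TRUE, up to floating-point tolerance.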
Robust Standard Errors
When comparing the distribution of outcomes between two groups, we assume that the two groups have the same variance even if their means differ. This is the homoskedasticity assumption. When the variances in the treatment and control groups differ, the assumption is violated, i.e., the error terms are heteroskedastic. In such cases, we use robust standard errors to account for heteroskedasticity. Robust standard errors do not change the coefficient estimates; they change only the standard errors, which are often (though not always) larger than the unadjusted ones, making the confidence intervals wider.
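With a binary outcome such as got, unequal group means imply unequal group variances, because the variance of a 0/1 variable with mean p is p(1 - p). A short worked calculation in R, plugging in the rounded estimates from the regression above (0.3387 for the control mean and 0.4506 for the treatment effect), illustrates the point:

```r
# Rounded estimates taken from the regression output above
p_control   <- 0.3387             # constant term: control-group mean of got
p_treatment <- 0.3387 + 0.4506    # constant + treatment coefficient

# Variance of a 0/1 outcome with mean p is p * (1 - p)
var_control   <- p_control * (1 - p_control)
var_treatment <- p_treatment * (1 - p_treatment)

print(round(c(control = var_control, treatment = var_treatment), 3))
```

The implied variance is higher in the control group than in the treatment group, so some heteroskedasticity is expected here before running any formal test.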
To test for heteroskedasticity, we run the Breusch-Pagan / Cook-Weisberg test. It tests the null hypothesis of homoskedasticity against the alternative hypothesis of heteroskedasticity. In R, we need to install the lmtest package and run the bptest() function on the fitted model.
Execution in R
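As a minimal sketch on simulated data (not the Thornton data), the Breusch-Pagan statistic can be computed by hand in base R. The version below is the original, non-studentized form of the test, which is the one Stata's estat hettest reports by default. With lmtest installed, bptest(reg, studentize = FALSE) on the model fitted earlier should correspond to it; note that bptest() studentizes by default, so its default output will generally differ from Stata's. The sample size and probabilities are assumptions for illustration.

```r
# Simulated data for illustration only (not the Thornton data)
set.seed(42)
n <- 2000
treatment <- rbinom(n, 1, 0.75)
got <- rbinom(n, 1, ifelse(treatment == 1, 0.8, 0.35))

fit <- lm(got ~ treatment)

# Step 1: squared residuals scaled by the ML variance estimate (RSS / n)
e2     <- resid(fit)^2
sigma2 <- sum(e2) / n
g      <- e2 / sigma2            # has mean exactly 1

# Step 2: auxiliary regression of the scaled squared residuals on treatment
aux <- lm(g ~ treatment)

# Step 3: statistic = half the explained sum of squares, chi-squared with
# 1 degree of freedom under the null of constant variance
ess     <- sum((fitted(aux) - mean(g))^2)
bp_stat <- ess / 2
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(chi2 = bp_stat, p = p_value))

# Packaged equivalent, assuming lmtest is installed:
# lmtest::bptest(fit, studentize = FALSE)
```

A small p-value leads to rejecting the null hypothesis of homoskedasticity.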
Execution in Stata
We use the estat hettest command in Stata to test for heteroskedasticity.
cd "C://Data analysis"
use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
drop if missing(tinc) | missing(got)
generate treatment = cond(tinc>0, 1, 0)
label define treatment 0 "Control" 1 "Treatment"
label val treatment treatment
regress got treatment
estat hettest
. cd "C://Data analysis"
C:\Data analysis
. use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
. drop if missing(tinc) | missing(got)
(1,986 observations deleted)
. generate treatment = cond(tinc>0, 1, 0)
. label define treatment 0 "Control" 1 "Treatment"
. label val treatment treatment
. regress got treatment
Source | SS df MS Number of obs = 2,834
-------------+---------------------------------- F(1, 2832) = 550.78
Model | 98.6657682 1 98.6657682 Prob > F = 0.0000
Residual | 507.321529 2,832 .179138958 R-squared = 0.1628
-------------+---------------------------------- Adj R-squared = 0.1625
Total | 605.987297 2,833 .213903035 Root MSE = .42325
------------------------------------------------------------------------------
got | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | .4505519 .019198 23.47 0.000 .4129083 .4881954
_cons | .3386838 .0169571 19.97 0.000 .3054343 .3719333
------------------------------------------------------------------------------
. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of got
chi2(1) = 25.19
Prob > chi2 = 0.0000
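The Breusch-Pagan test detects only heteroskedasticity that is linearly related to the tested variables (here, the fitted values of got), so it may not capture heteroskedasticity in all instances.
The low p-value (chi2(1) = 25.19, p < 0.001) means that we can reject the null hypothesis of homoskedasticity. In this case, it is better to use robust standard errors instead of unadjusted standard errors.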
Running Regression with Robust Standard Errors
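To use robust standard errors in R, we need to install the sandwich package and pass its vcovHC() function to the vcov argument of coeftest() from the lmtest package. The standard errors in the output below match those from Stata's robust option, which corresponds to type = "HC1" (the vcovHC() default is "HC3"), for example coeftest(reg, vcov = vcovHC(reg, type = "HC1")).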
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.338684 0.018968 17.856 < 2.2e-16 ***
treatmentTreatment 0.450552 0.020858 21.601 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
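To show what the HC1 robust variance estimator does, the sketch below computes it by hand in base R using the sandwich formula (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k). It runs on simulated data, not the Thornton data, and the sample size and probabilities are assumptions for illustration. With the sandwich package installed, vcovHC(fit, type = "HC1") should return the same matrix.

```r
# Simulated data for illustration only (not the Thornton data)
set.seed(7)
n <- 2000
treatment <- rbinom(n, 1, 0.75)
got <- rbinom(n, 1, ifelse(treatment == 1, 0.8, 0.35))

fit <- lm(got ~ treatment)

X <- model.matrix(fit)        # n x k design matrix (intercept, treatment)
e <- resid(fit)               # OLS residuals
k <- ncol(X)

bread <- solve(crossprod(X))  # (X'X)^-1
meat  <- crossprod(X * e)     # X' diag(e^2) X

# HC1: HC0 sandwich times the degrees-of-freedom correction n / (n - k)
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

robust_se  <- sqrt(diag(vcov_hc1))
classic_se <- sqrt(diag(vcov(fit)))

print(rbind(classic = classic_se, robust = robust_se))

# Packaged equivalent, assuming sandwich is installed:
# sandwich::vcovHC(fit, type = "HC1")
```

The coefficient estimates themselves are untouched; only the variance matrix, and hence the standard errors, t-statistics, and confidence intervals, change.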
Execution in Stata
To run a regression with robust standard errors, we run the regress command with the robust option.
cd "C://Data analysis"
use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
drop if missing(tinc) | missing(got)
generate treatment = cond(tinc>0, 1, 0)
label define treatment 0 "Control" 1 "Treatment"
label val treatment treatment
regress got treatment, robust
. cd "C://Data analysis"
C:\Data analysis
. use "Thornton data/Data/Thornton HIV Testing Data.dta", clear
. drop if missing(tinc) | missing(got)
(1,986 observations deleted)
. generate treatment = cond(tinc>0, 1, 0)
. label define treatment 0 "Control" 1 "Treatment"
. label val treatment treatment
. regress got treatment, robust
Linear regression Number of obs = 2,834
F(1, 2832) = 466.60
Prob > F = 0.0000
R-squared = 0.1628
Root MSE = .42325
------------------------------------------------------------------------------
| Robust
got | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment | .4505519 .020858 21.60 0.000 .4096535 .4914502
_cons | .3386838 .0189675 17.86 0.000 .3014922 .3758754
------------------------------------------------------------------------------
The coefficients on the treatment variable and the constant term are unchanged, but the standard errors of both are larger: for the treatment coefficient, 0.0209 instead of 0.0192.
To learn about measuring impact in an experiment with partial compliance, click here