SEM Estimation by 2SLS

Overview

The goal is to:

  1. Determine the correct setup for the first stage regression, and
  2. Provide intuition for why that is the case.

Microeconomic Theory

Suppose the true supply function for some product depends on price and one (or more) exogenous supply shifters \(Z_{supply}\). To make this concrete, consider beef and steak products, where exogenous supply shifters would be input costs such as the price of land, cattle feed, and so on.

\[\begin{aligned} Q_{supply} &= f(P_{supply}, \, Z_{supply}) \end{aligned}\]

and similarly, demand is given by:

\[\begin{aligned} Q_{demand} &= f(P_{demand}, \, Z_{demand}) \end{aligned}\]

If we observe transaction data in a competitive market, we will assume that prices and quantities are in equilibrium, thus implying:

\[P_{supply} = P_{demand} = P^*\]

and

\[Q_{supply} = Q_{demand} = Q^*\]

Together, these conditions let us find the reduced form equation for the equilibrium price:

\[\begin{aligned} Q_{demand} &= Q_{supply} \\ f(P^*, \, Z_{demand}) &= f(P^*, \, Z_{supply}) \\ \\ P^* &= f(Z_{demand}, Z_{supply}) \end{aligned}\]

Similarly, we can solve for equilibrium quantity as:

\[\begin{aligned} f(Q^*, \, Z_{demand}) &= f(Q^*, \, Z_{supply}) \\ \\ Q^* &= f(Z_{demand}, Z_{supply}) \end{aligned}\]


\(\textcolor{red}{NOTE}\): In equilibrium, price and quantity only depend on the exogenous supply and demand shifters.

Simulation

Using this theory as a guide, we can build a simulation dataset with the following linear functional forms:

Setup

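\[\begin{aligned} P_{supply} &= Q_{supply} + 2Z_{supply} - u_{supply} \end{aligned}\]

(The supply shock enters with a negative sign, so that a positive \(u_{supply}\) raises the quantity supplied at a given price. This matches the simulation code below.)

Similarly, demand is given by: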

\[\begin{aligned} P_{demand} &= 100 - Q_{demand} + 2Z_{demand} + u_{demand} \end{aligned}\]

This gives us the reduced form relationship for price:

\[\begin{aligned} P^* = 50 + Z_{supply} + Z_{demand} + v \end{aligned}\]

where \(v = \frac{u_{demand} - u_{supply}}{2}\) combines the errors from both supply and demand.
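To make the algebra explicit, here is a sketch of the derivation. It writes both equations solved for quantity, with the error terms signed as they are coded in the simulation below, and then sets quantity demanded equal to quantity supplied:

\[\begin{aligned} Q_{demand} &= 100 - P + 2Z_{demand} + u_{demand} \\ Q_{supply} &= P - 2Z_{supply} + u_{supply} \\ \\ 100 - P^* + 2Z_{demand} + u_{demand} &= P^* - 2Z_{supply} + u_{supply} \\ 2P^* &= 100 + 2Z_{supply} + 2Z_{demand} + u_{demand} - u_{supply} \\ P^* &= 50 + Z_{supply} + Z_{demand} + \frac{u_{demand} - u_{supply}}{2} \end{aligned}\]

Substituting \(P^*\) back into either equation gives the matching reduced form for quantity:

\[\begin{aligned} Q^* &= 50 - Z_{supply} + Z_{demand} + \frac{u_{demand} + u_{supply}}{2} \end{aligned}\]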

Build Dataset

Using this relationship, we can create a simulation dataset as follows:

  1. Generate random values for \(u_{demand}\), \(u_{supply}\), \(Z_{demand}\), and \(Z_{supply}\).
  2. Use these to find equilibrium prices.
  3. Back out equilibrium quantities from the supply or demand equation (and verify that they are in fact the same).
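  4. Finally, we only keep positive values for price and quantity, since a firm would simply shut down otherwise.

# packages used below
library(tibble)    # tibble()
library(dplyr)     # %>% and filter()
library(ggplot2)   # ggplot() and geoms
library(ggthemes)  # theme_wsj()

# fix the random seed so the simulation is reproducible
# (the seed value is arbitrary and not part of the original setup)
set.seed(1)

# number of observations to generate
N <- 100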

# ud and us are exogenous ~ N(0, 1)
ud <- rnorm(N) 
us <- rnorm(N) 
v <- (ud - us) / 2

# supply and demand shifters are exogenous normal
# although I believe the distributional assumptions 
# can be relaxed
z1 <- rnorm(N) * 2 + 2  # supply shifter ~ N(mean = 2, sd = 2)
z2 <- rnorm(N) + 3      # demand shifter ~ N(mean = 3, sd = 1)

# price in equilibrium using reduced form equation
p <- 50 + z1 + z2 + v 

# verify quantities are the same in equilibrium!
qd <- 100 - p + 2*z2 + ud
qs <- p - 2*z1 + us
q  <- qd

# compare element-wise, allowing a small tolerance for floating-point error
if (any(abs(qd - qs) > 1e-10)) {
    stop("Supply and demand not in equilibrium")
}

# make dataset 
# we only keep values where quantity and price 
# are greater than zero. Our simulated dataset may 
# contain negative values since the non-negativity 
# conditions are economic assumptions that the 
# simulation itself does not enforce. 
data <- tibble(
        "Q" = q, 
        "P" = p, 
        "Z_Supply" = z1, 
        "Z_Demand" = z2
    ) %>% 
    filter(
        Q > 0 & P > 0  # element-wise `&`; `&&` is not vectorized
    )
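As a quick sanity check on the data-generating process, the sketch below regenerates the same simulation in base R and regresses equilibrium price and quantity on the two shifters only. The helper `simulate_market`, the larger sample size, and the seed are assumptions made for illustration and are not part of the original setup.

```r
# Illustrative sketch: check the reduced form on a larger simulated sample.
# Assumptions not in the original: the helper `simulate_market`, the sample
# size, and the seed.
simulate_market <- function(n) {
  ud <- rnorm(n)                      # demand shock ~ N(0, 1)
  us <- rnorm(n)                      # supply shock ~ N(0, 1)
  zs <- rnorm(n, mean = 2, sd = 2)    # supply shifter
  zd <- rnorm(n, mean = 3, sd = 1)    # demand shifter
  p  <- 50 + zs + zd + (ud - us) / 2  # equilibrium price (reduced form)
  q  <- 100 - p + 2 * zd + ud         # equilibrium quantity from demand
  data.frame(Q = q, P = p, Z_Supply = zs, Z_Demand = zd, U_Demand = ud)
}

set.seed(123)
sim_check <- simulate_market(10000)

# Regress each endogenous variable on the exogenous shifters only
rf_price    <- coef(lm(P ~ Z_Supply + Z_Demand, data = sim_check))
rf_quantity <- coef(lm(Q ~ Z_Supply + Z_Demand, data = sim_check))

print(round(rbind(rf_price, rf_quantity), 2))
```

If the simulation is set up as described, the price row should come out close to (50, 1, 1) and the quantity row close to (50, -1, 1), up to sampling noise.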

Graph

After generating a dataset according to those rules, we can view market prices and quantities in a scatter plot:

ymin <- min(data$P) - 5
ymax <- max(data$P) + 5
data %>% 
    ggplot(aes(Q, P)) + 
    geom_point() + 
    ylim(ymin, ymax) + 
    theme_wsj() 

[Figure: scatter plot of equilibrium price (P) against quantity (Q)]

If we lock in some value of \(Z_{demand}\) and \(Z_{supply}\), we can easily see how supply and demand intersect.

z1_fixed1 <- data$Z_Supply[1]
z1_fixed2 <- data$Z_Supply[2]
z1_fixed3 <- data$Z_Supply[3]

z2_fixed1 <- data$Z_Demand[1]
z2_fixed2 <- data$Z_Demand[2]
z2_fixed3 <- data$Z_Demand[3]

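# grid of quantities (1, 2, ..., n) along which to draw the curves
data$Qs <- seq_len(nrow(data))
# supply curves (errors set to zero) at the three fixed supply shifters
data$Ps1 <- data$Qs + 2 * z1_fixed1
data$Ps2 <- data$Qs + 2 * z1_fixed2
data$Ps3 <- data$Qs + 2 * z1_fixed3

# demand curves (errors set to zero) on the same quantity grid
data$Qd <- data$Qs
data$Pd1 <- 100 - data$Qd + 2 * z2_fixed1
data$Pd2 <- 100 - data$Qd + 2 * z2_fixed2
data$Pd3 <- 100 - data$Qd + 2 * z2_fixed3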

xmin <- min(data$Q)
xmax <- max(data$Q)
ymin <- min(data$P) - 5
ymax <- max(data$P) + 5
data %>% 
    ggplot(aes(Q, P)) + 
    geom_point() + 
    geom_line(aes(Qs, Ps1), color = "red") + 
    geom_line(aes(Qd, Pd1), color = "blue") + 
    xlim(xmin, xmax) + 
    ylim(ymin, ymax) + 
    theme_wsj()

[Figure: scatter plot of P against Q with one supply curve (red) and one demand curve (blue)]

Intuition for First Stage

If we try to estimate the demand curve with the naive OLS regression

\[P = \beta_1 Q + \beta_2 Z_{demand} + e\]

we will get a biased estimate for \(\beta_1\). Intuitively, this is because price and quantity are determined jointly in equilibrium, so \(Q\) is correlated with the demand error. In other words, conditional on the exogenous supply and demand shifters (\(Z_{demand}\) and \(Z_{supply}\)), the only variation left in price or quantity comes from the error terms.
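The bias can be illustrated numerically. The sketch below regenerates the simulation with a much larger sample and runs the naive regression. The helper `simulate_market`, the sample size, and the seed are assumptions introduced for illustration. Under the variances used in this simulation, the large-sample OLS slope works out to roughly \(-0.89\) rather than the true \(-1\).

```r
# Illustrative sketch: naive OLS of price on quantity and the demand shifter.
# Assumptions not in the original: the helper `simulate_market`, the sample
# size, and the seed.
simulate_market <- function(n) {
  ud <- rnorm(n)                      # demand shock ~ N(0, 1)
  us <- rnorm(n)                      # supply shock ~ N(0, 1)
  zs <- rnorm(n, mean = 2, sd = 2)    # supply shifter
  zd <- rnorm(n, mean = 3, sd = 1)    # demand shifter
  p  <- 50 + zs + zd + (ud - us) / 2  # equilibrium price (reduced form)
  q  <- 100 - p + 2 * zd + ud         # equilibrium quantity from demand
  data.frame(Q = q, P = p, Z_Supply = zs, Z_Demand = zd, U_Demand = ud)
}

set.seed(123)
sim_ols <- simulate_market(100000)

# Naive OLS treats Q as if it were exogenous
naive_ols <- coef(lm(P ~ Q + Z_Demand, data = sim_ols))
print(round(naive_ols, 3))

# Source of the problem: Q is correlated with the demand error
q_error_cor <- cor(sim_ols$Q, sim_ols$U_Demand)
print(q_error_cor)
```

In real data the demand error is unobserved, so the correlation in the last step can only be inspected in a simulation like this one.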

Instead, we will rely on exogenous variation that only impacts quantity through supply, in this case the supply shifters \(Z_{supply}\). We can think of this as using exogenous shifts in the \(\textcolor{red}{red}\) line to help determine the slope of the \(\textcolor{blue}{blue}\) line.

data %>% 
    ggplot(aes(Q, P)) + 
    geom_point() + 
    geom_line(aes(Qs, Ps1), color = "red") + 
    geom_line(aes(Qs, Ps2), color = "red", linetype = "dashed") + 
    geom_line(aes(Qs, Ps3), color = "red", linetype = "dashed") + 
    geom_line(aes(Qd, Pd1), color = "blue") + 
    xlim(xmin, xmax) + 
    ylim(ymin, ymax) +
    theme_wsj() 

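[Figure: scatter plot of P against Q with three supply curves (red, two of them dashed) and one demand curve (blue)]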

Finally, why is it that we include the demand shifters in the first stage? Intuitively, this helps to isolate which level of the curves we are looking at. For example, for a given supply curve, we still need the demand factors to pinpoint where we expect the curves to intersect. In other words, we need to know the external factors that are affecting both the \(\textcolor{red}{red}\) and \(\textcolor{blue}{blue}\) lines. As shown above, in equilibrium, those are the only systematic determinants of price and quantity.

Regression Results

Finally, we test these ideas directly to see if we can recover our original “true” demand parameters.

Recall, we are looking to retrieve the “true” values from the simulation setup,

\[\begin{aligned} P_{demand} &= 100 - Q_{demand} + 2Z_{demand} + u_{demand} \end{aligned}\]

Only Supply Shifters in First Stage

If we only include supply shifters in the first stage regression, i.e.

\[Q = \alpha_1 Z_{supply} + u\]

and use the predicted values, \(\hat{Q}\), in the second stage regression:

\[P = \beta_1 \hat{Q} + \beta_2 Z_{demand} + e\]

then we obtain biased and inconsistent estimates. We can see the result of this estimation below:

# first stage 
first_stage_bias_fit <- lm(Q ~ Z_Supply, data = data)
data$`Q Predicted (only supply)` <- predict(first_stage_bias_fit)

# second stage 
fit_bias <- lm(P ~ `Q Predicted (only supply)` + Z_Demand, data = data)

We see here that these estimates are incorrect based on what we know to be true.

[Figure: second stage results with only supply shifters in the first stage]
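A rough way to see what goes wrong, assuming (as in this simulation) that \(Z_{supply}\) and \(Z_{demand}\) are independent, and ignoring sampling noise. Substituting the reduced form for price into the demand equation gives the reduced form for quantity, \(Q^* = 50 - Z_{supply} + Z_{demand} + \frac{u_{demand} + u_{supply}}{2}\). With only \(Z_{supply}\) in the first stage, the fitted values are approximately \(\hat{Q} = a - Z_{supply}\) for some constant \(a\) (a symbol introduced here for illustration), so \(Z_{supply} = a - \hat{Q}\) and the reduced form for price becomes:

\[\begin{aligned} P^* &= 50 + Z_{supply} + Z_{demand} + v \\ &= (50 + a) - \hat{Q} + Z_{demand} + v \end{aligned}\]

Under these assumptions, the second stage would be expected to return a coefficient near 1 on \(Z_{demand}\) rather than the true value of 2, along with a shifted intercept. The slope on \(\hat{Q}\) stays near \(-1\) here only because the shifters are independent. If they were correlated, the slope on \(\hat{Q}\) would generally be affected as well.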

All Exogenous Variables in First Stage

If we now include all exogenous variables in the first stage regression, i.e.

\[Q = \alpha_1 Z_{supply} + \alpha_2 Z_{demand} + u\]

and use these new predicted values, \(\widetilde{Q}\), in the second stage regression:

\[P = \beta_1 \widetilde{Q} + \beta_2 Z_{demand} + e\]

we can retrieve the known true parameter values.

# first stage 
first_stage_fit <- lm(Q ~ Z_Supply + Z_Demand, data = data)
data$`Q Predicted (all exogenous)` <- predict(first_stage_fit)

# second stage 
fit <- lm(P ~ `Q Predicted (all exogenous)` + Z_Demand, data = data)

[Figure: second stage results with all exogenous variables in the first stage]

In this case, using all exogenous variables in the first stage allows us to recover the “true” demand parameters on quantity and the demand shifter.
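To compare the two first stage specifications side by side with less sampling noise, the sketch below repeats both procedures on a much larger simulated sample. The helpers `simulate_market` and `two_stage`, the sample size, and the seed are assumptions made for illustration, not part of the original analysis.

```r
# Illustrative sketch: compare the two first-stage specifications.
# Assumptions not in the original: the helpers `simulate_market` and
# `two_stage`, the sample size, and the seed.
simulate_market <- function(n) {
  ud <- rnorm(n)                      # demand shock ~ N(0, 1)
  us <- rnorm(n)                      # supply shock ~ N(0, 1)
  zs <- rnorm(n, mean = 2, sd = 2)    # supply shifter
  zd <- rnorm(n, mean = 3, sd = 1)    # demand shifter
  p  <- 50 + zs + zd + (ud - us) / 2  # equilibrium price (reduced form)
  q  <- 100 - p + 2 * zd + ud         # equilibrium quantity from demand
  data.frame(Q = q, P = p, Z_Supply = zs, Z_Demand = zd, U_Demand = ud)
}

# Two-stage least squares by hand; only the first-stage formula changes.
two_stage <- function(first_stage_formula, data) {
  data$Q_hat <- fitted(lm(first_stage_formula, data = data))
  coef(lm(P ~ Q_hat + Z_Demand, data = data))
}

set.seed(123)
sim_big <- simulate_market(100000)

only_supply   <- two_stage(Q ~ Z_Supply, sim_big)
all_exogenous <- two_stage(Q ~ Z_Supply + Z_Demand, sim_big)

# True demand parameters: intercept 100, slope -1 on Q, 2 on Z_Demand
print(round(rbind(only_supply, all_exogenous), 2))
```

The point estimates from this two-step procedure coincide with 2SLS, but the standard errors reported by the second `lm` call are not valid because they ignore the first stage. A dedicated instrumental variables routine corrects them.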
