Regression Analysis: Introduction to Econometrics

Lecture 7

Learning Objectives

By the end of this chapter, you will be able to:

  1. ✅ Distinguish between descriptive statistics and econometric analysis
  2. ✅ Understand how regression fits a line by minimizing squared residuals
  3. ✅ Interpret regression coefficients and conduct hypothesis tests
  4. ✅ Identify and understand key regression assumptions
  5. ✅ Recognize problems from heteroskedasticity and multicollinearity
  6. ✅ Apply cross-sectional regression analysis to business decisions
  7. ✅ Understand when panel data methods are needed

1 Introduction: The $4.2 Million Question

It’s 8:45 AM on December 16, 2024. Jennifer Walsh, CEO of Sunburst Homes Construction, stands in her conference room reviewing twelve completed spec homes spread across El Paso. Her VP of Development, Michael Torres, has just presented his pricing analysis recommending they list the homes for a combined $6.1 million.

Torres’s report is confident: “Our regression analysis proves which features drive value. Each renovation adds $47,830 to sale price. Each bedroom adds $22,450. The model is 84.7% accurate. We should renovate all twelve homes before listing.”

Jennifer’s CFO, Margaret Chen, looks skeptical. “Michael, you’re recommending we spend $180,000 on renovations across all properties. You’re saying we’ll get back $574,000 in additional sale price. That’s a $394,000 net gain. Are you certain?”

Torres taps his report. “The regression coefficient is statistically significant at p < 0.001. The evidence is clear.”

Margaret turns to Jennifer. “I don’t buy it. Renovated homes might sell for more because they’re in better neighborhoods, not because of the renovation itself. His bedroom coefficient seems inflated—bigger homes naturally have more bedrooms, so how can we separate the effect? And his model assumes prediction accuracy is the same for a $250,000 home and a $400,000 home. That doesn’t make sense.”

Jennifer sets down the report. “Michael, I appreciate the analysis, but I’m not spending $180,000 based on a regression coefficient that might be confounded with neighborhood quality. Margaret’s right—we need to think more carefully about what this model actually tells us.”

She pauses. “I remember hearing about two EMBA students who helped TechFlow, PrecisionCast, BorderMed, DesertVine, and PixelPerfect with their statistical challenges. They’re supposed to be excellent at untangling these issues. Let’s get them to review this analysis.”

Sunburst's team reaches out to The University of Texas at El Paso EMBA program. Within hours, David Martinez and Maria Rodriguez are reviewing Torres’s report.

David reads through it carefully. “This is actually good news. Torres collected solid data—500 home sales with detailed characteristics. His regression model structure is sound. But he’s making classic interpretation errors that could cost Sunburst hundreds of thousands.”

Maria nods. “He’s treating associations as if they’re causal effects. He’s ignoring multicollinearity between square footage and bedrooms. He’s not accounting for heteroskedasticity in the luxury segment. And that renovation coefficient? Almost certainly biased upward by omitted variables.”

“Remember,” David says, “we’ve done descriptive statistics with TechFlow, probability with PrecisionCast, inference with BorderMed, ANOVA with DesertVine, and categorical analysis with PixelPerfect. But this is our first full econometric analysis. This is different. Regression isn’t just about fitting a line to data—it’s about understanding what we can and cannot conclude from observational data.”

The Cost of Causal Confusion

When TechFlow said “around $767,000,” they couldn’t set precise budgets.

When PrecisionCast said “tests are pretty good,” they risked shipping defective parts.

When Sunburst treats regression associations as causal effects and ignores endogeneity, they might waste $180,000 on renovations that don’t generate the expected returns—or miss the real drivers of home value.

In regression analysis, the difference between association and causation isn’t just academic. It’s the difference between good decisions and expensive mistakes.

The fix: Understand what regression can and cannot tell you. Test assumptions. Acknowledge limitations. Separate prediction from causation.

This chapter is about that difference. Between correlation and causation. Between fitting a model and making causal inferences. Between statistical significance and practical decision-making.

Welcome to econometrics. The art and science of learning from observational data while respecting its limitations.

2 From Description to Explanation: What Makes Regression Different?

Before David and Maria dive into Sunburst’s pricing challenge, they review how regression analysis differs from everything they’ve done before.

2.1 What We’ve Learned So Far

Over the past nine months, they’ve built a statistical toolkit:

Descriptive Statistics (TechFlow):
- Calculated means, medians, standard deviations
- Identified outliers and distribution shapes
- Measured correlation between variables
- Purpose: Summarize what happened in the data

Probability (PrecisionCast):
- Modeled defect rates and expected values
- Applied conditional probability and Bayes’ theorem
- Used distributions to predict future outcomes
- Purpose: Predict what might happen next

Statistical Inference (BorderMed):
- Tested hypotheses about population parameters
- Constructed confidence intervals
- Distinguished between statistical and practical significance
- Purpose: Make statements about populations from samples

ANOVA (DesertVine):
- Compared means across multiple groups
- Identified which factors matter
- Controlled for multiple testing
- Purpose: Determine if group differences are real

Categorical Data Analysis (PixelPerfect):
- Tested independence between categorical variables
- Analyzed contingency tables
- Understood Simpson’s paradox
- Purpose: Understand relationships in categorical data

2.2 What Regression Adds: The “Holding All Else Constant” Framework

Maria pulls up the Sunburst dataset on her laptop. “All these previous analyses asked ‘Are these variables related?’ or ‘Do these groups differ?’ Regression asks something more ambitious: ‘How much does X affect Y, controlling for everything else we can measure?’”

David continues, “When Torres says ‘each bedroom adds $22,450,’ he means: compare two homes that are identical in square footage, age, lot size, garage spaces, pool status, renovation status, and neighborhood—but one has an extra bedroom. On average, that home sells for $22,450 more.”

“That’s the key phrase,” Maria emphasizes. “‘Holding all else constant.’ We’re trying to isolate the independent effect of bedrooms by statistically controlling for other factors. This is fundamentally different from simple correlation.”

Key Distinction: Correlation vs. Regression

Correlation (from Lecture 1):

\[r = \frac{\text{Cov}(X,Y)}{s_X \cdot s_Y}\]

  • Measures strength of linear association
  • Symmetric: Cor(X,Y) = Cor(Y,X)
  • No “controlling for” other variables
  • Example: r = 0.76 between square footage and bedrooms
  • Interpretation: Larger homes tend to have more bedrooms

Regression (Lecture 7):

\[\text{Price} = \beta_0 + \beta_1(\text{sqft}) + \beta_2(\text{bedrooms}) + \cdots + \varepsilon\]

  • Measures effect of X on Y holding other variables constant
  • Asymmetric: We specify Y (dependent) and X (independent)
  • Controls for correlation between predictors
  • Example: β₂ = $22,450 (bedroom coefficient with sqft in model)
  • Interpretation: Adding a bedroom is associated with $22,450 higher price, comparing homes with the same square footage

The Critical Difference: Correlation can’t separate effects of correlated variables. Regression attempts to—but success depends on having good controls and meeting key assumptions.
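
To make the contrast concrete, here is a minimal sketch in Python on simulated data (all variable names and magnitudes are invented, not Sunburst's actual numbers). It shows how the simple bedroom-price slope differs from the bedroom coefficient once square footage is held constant:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: bedrooms and square footage are built to move together (numbers invented)
rng = np.random.default_rng(6)
sqft = rng.normal(2300, 450, 500)
bedrooms = np.round(sqft / 700 + rng.normal(0, 0.6, 500))
price = 50_000 + 125 * sqft + 15_000 * bedrooms + rng.normal(0, 25_000, 500)

print(np.corrcoef(sqft, bedrooms)[0, 1])        # correlation: symmetric, controls for nothing

simple = sm.OLS(price, sm.add_constant(bedrooms)).fit()
multiple = sm.OLS(price, sm.add_constant(np.column_stack([sqft, bedrooms]))).fit()
print(simple.params[1])    # bedroom slope with nothing held constant (inflated by sqft)
print(multiple.params[2])  # bedroom slope holding sqft constant (close to the true 15,000)
```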

2.3 The Econometric Challenge: Observational Data

“Here’s what makes this hard,” David says. “In a controlled experiment, BorderMed could randomly assign patients to treatment and control groups. Randomization ensures the groups are identical except for the treatment, so any difference in outcomes must be due to the treatment.”

“But Sunburst can’t randomly assign renovations to homes,” Maria continues. “Homeowners choose which homes to renovate. Those choices aren’t random—they renovate homes in up-and-coming neighborhoods, homes with good bones, homes where the investment will pay off. So when we see renovated homes selling for more, is it because of the renovation, or because those homes were already more valuable in ways we can’t measure?”

David nods. “That’s the fundamental challenge of econometrics. We’re trying to make causal inferences from observational data. Unlike an experiment where we control the treatment assignment, we’re working with data where decisions were made by people pursuing their own objectives. Those decisions create selection bias and endogeneity problems that correlation and simple regression can’t solve.”

Controlled Experiments vs. Observational Studies

Controlled Experiment (BorderMed’s Drug Trial):

  • Researchers randomly assign treatment
  • Randomization ensures groups are comparable
  • Can make strong causal inferences
  • “This drug causes blood pressure to decrease”

Observational Study (Sunburst’s Housing Data):

  • People choose actions based on unobserved factors
  • Groups may differ in unmeasured ways
  • Must carefully interpret associations
  • “Renovated homes are associated with higher prices, but this association may partly reflect unobserved neighborhood quality”

Regression with observational data can control for measured confounders but can’t account for unmeasured ones. This is why we must be cautious about causal language.

3 What Is Regression Actually Doing? The Mechanics

Before evaluating Torres’s analysis, David and Maria want to make sure they understand exactly what regression does mathematically.

3.1 The Goal: Minimizing Squared Prediction Errors

Maria draws a scatter plot of home prices vs. square footage on the whiteboard.

“Regression draws a line through this cloud of points. But what makes one line better than another?”

David points to several homes above and below any potential line. “For each home, we can calculate the prediction error—the vertical distance between the actual price and the predicted price from our line. Regression finds the line that makes these errors as small as possible.”

“More precisely,” Maria adds, “it minimizes the sum of squared errors. We square the errors so positive and negative errors don’t cancel out, and squaring penalizes large errors more than small ones.”

Definition: Ordinary Least Squares (OLS) Regression

The Model:

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i\]

where:

  • \(Y_i\) = outcome for observation i (e.g., home price)
  • \(X_{1i}, X_{2i}, \ldots, X_{ki}\) = predictor variables for observation i
  • \(\beta_0\) = intercept (predicted Y when all X’s = 0)
  • \(\beta_1, \beta_2, \ldots, \beta_k\) = coefficients (slopes)
  • \(\varepsilon_i\) = error term (residual)

What OLS Does:

Chooses coefficient estimates \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\) that minimize:

\[\text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \cdots - \hat{\beta}_k X_{ki})^2\]

where \(\hat{Y}_i\) is the predicted value for observation i.

In Plain English:

Find the line (or hyperplane in multiple dimensions) that makes predictions as close as possible to actual values, where “close” means minimizing squared vertical distances.


Excel:

Data Analysis → Regression, or use =LINEST() function
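
Outside Excel, the same estimation takes a few lines. The sketch below uses Python's statsmodels on simulated data (the variable names and magnitudes are invented for illustration); the point is only that OLS chooses the coefficients that minimize the sum of squared residuals:

```python
import numpy as np
import statsmodels.api as sm

# Simulated home data (names and magnitudes invented for illustration)
rng = np.random.default_rng(42)
n = 500
sqft = rng.normal(2300, 450, n)
bedrooms = np.round(sqft / 700 + rng.normal(0, 0.6, n))
age = rng.integers(0, 40, n)
price = 50_000 + 125 * sqft + 20_000 * bedrooms - 2_000 * age + rng.normal(0, 25_000, n)

# OLS picks the coefficients that minimize the sum of squared residuals
X = sm.add_constant(np.column_stack([sqft, bedrooms, age]))
results = sm.OLS(price, X).fit()

print(results.params)    # intercept and slope estimates
print(results.ssr)       # the minimized SSE
print(results.rsquared)  # share of price variance explained
```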

3.2 Visualizing the Fit: Predicted Values and Residuals

David sketches a simple example with just one predictor (square footage):

“Say our regression line is: Price = $52,840 + $125.40 × sqft”

“For a 2,000 sqft home:
Predicted price = $52,840 + $125.40(2,000) = $303,640”

“If the actual sale price was $315,000:
Residual = $315,000 - $303,640 = $11,360”

“The home sold for $11,360 more than our model predicted. That $11,360 is the part of the price our model can’t explain with square footage alone. It might be due to other factors: great location within the neighborhood, recent updates, nice views, or just randomness in negotiations.”

Maria adds, “OLS finds coefficients that make the sum of all these squared residuals as small as possible across all 500 homes. That’s why it’s called ‘least squares’—it minimizes the squares of the residuals.”
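
The arithmetic in David's example is simple enough to verify directly; a quick check using the coefficients quoted above looks like this:

```python
# Checking the worked example with the coefficients quoted above
b0, b1 = 52_840.0, 125.40
sqft, actual_price = 2_000, 315_000

predicted = b0 + b1 * sqft            # 52,840 + 125.40 * 2,000 = 303,640
residual = actual_price - predicted   # 315,000 - 303,640 = 11,360
print(f"predicted ${predicted:,.0f}, residual ${residual:,.0f}")
```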

Key Terms: Fitted Values, Residuals, and R²

Fitted (Predicted) Value:

\[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_k X_{ki}\]

What our model predicts for observation i based on its X values.

Residual:

\[e_i = Y_i - \hat{Y}_i\]

The prediction error. How far off our prediction was for observation i.

R-Squared (R²):

\[R^2 = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\text{SSE}}{\text{SST}}\]

  • Interpretation: Proportion of variance in Y explained by the model
  • Range: 0 to 1 (or 0% to 100%)
  • Common misinterpretation: Torres claims R² = 0.847 means “84.7% prediction accuracy.” Wrong. It means the model explains 84.7% of the variance in prices. Individual predictions can still have large errors.

The Correct Interpretation: Our model reduces prediction error by 84.7% compared to just predicting the mean price for every home. But for any individual home, the prediction interval is still quite wide (roughly ±$37,000).
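
A quick way to see why R² is not "prediction accuracy" is to compute both R² and a typical prediction error from the same residuals. The toy numbers below are invented purely for illustration:

```python
import numpy as np

# Toy outcomes and predictions (invented numbers) to contrast R² with typical error
y = np.array([250_000, 310_000, 275_000, 420_000, 198_000], dtype=float)
y_hat = np.array([262_000, 301_000, 280_000, 395_000, 205_000], dtype=float)

e = y - y_hat
sse = np.sum(e**2)
sst = np.sum((y - y.mean())**2)

r2 = 1 - sse / sst              # share of variance explained
rmse = np.sqrt(np.mean(e**2))   # typical prediction error, in dollars
print(round(r2, 3), round(rmse))  # high R², yet errors in the tens of thousands of dollars
```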

3.3 Multiple Regression: The “All Else Equal” Interpretation

“Here’s where it gets interesting,” David says. “With one predictor, the slope just tells us how Y changes with X. But with multiple predictors, each coefficient represents the change in Y for a one-unit change in that X, holding all other X’s constant.”

Maria pulls up Torres’s model:

\[\begin{aligned} \text{Price} = \beta_0 &+ \beta_1(\text{sqft}) + \beta_2(\text{bedrooms}) + \beta_3(\text{bathrooms}) \\ &+ \beta_4(\text{age}) + \beta_5(\text{lot\_size}) + \beta_6(\text{garage}) \\ &+ \beta_7(\text{pool}) + \beta_8(\text{renovated}) + \sum \beta_i(\text{neighborhood}_i) + \varepsilon \end{aligned}\]

“The bedroom coefficient (β₂ = $22,450) means: compare two homes with the same square footage, same age, same lot size, same number of bathrooms, same garage, same pool status, same renovation status, in the same neighborhood—but one has an extra bedroom. On average, that home sells for $22,450 more.”

“That ‘holding all else constant’ interpretation is powerful,” David emphasizes. “It lets us isolate effects. But—and this is crucial—it only works if we’ve measured all the important confounding variables and if our model meets certain assumptions. If we’ve missed something important, our coefficient estimates will be biased.”

4 Reading Regression Output: Coefficients and Hypothesis Tests

David pulls up Torres’s regression output. “Let’s make sure we can read this correctly before we critique it.”

4.1 The Regression Table

Maria projects Torres’s results:

| Variable | Coefficient | Std Error | t-statistic | p-value | 95% CI |
|---|---|---|---|---|---|
| Intercept | $52,840 | $18,420 | 2.87 | 0.004 | [$16,620, $89,060] |
| Square Feet | $125.40 | $8.60 | 14.58 | <0.001 | [$108.50, $142.30] |
| Bedrooms | $22,450 | $7,630 | 2.94 | 0.003 | [$7,470, $37,430] |
| Bathrooms | $8,220 | $4,480 | 1.83 | 0.067 | [-$580, $17,020] |
| Age (years) | -$2,180 | $320 | -6.81 | <0.001 | [-$2,810, -$1,550] |
| Lot Size | $1.85 | $0.73 | 2.53 | 0.012 | [$0.41, $3.29] |
| Garage Spaces | $6,840 | $2,990 | 2.29 | 0.023 | [$960, $12,720] |
| Pool (yes=1) | $18,900 | $9,210 | 2.05 | 0.041 | [$680, $37,120] |
| Renovated (yes=1) | $47,830 | $8,940 | 5.35 | <0.001 | [$30,270, $65,390] |

4.2 Interpreting Coefficients

“Let’s take square footage,” David says. “The coefficient is $125.40. This means that, holding all other variables constant, a home with one additional square foot is associated with a sale price that is, on average, $125.40 higher.”

“Note the language,” Maria emphasizes. “‘Is associated with,’ not ‘causes.’ We’re describing a pattern in the data, not making a causal claim. For square footage, the causal interpretation is probably reasonable—it’s hard to imagine major confounding. But for other variables, we need to be more careful.”

How to Read Regression Coefficients

Generic Form: “A one-unit increase in X is associated with a β change in Y, holding all other variables constant.”

For Continuous Variables:

  • Sqft coefficient ($125.40): A 1 sqft larger home is associated with $125.40 higher price, all else equal
  • Age coefficient (-$2,180): Each additional year of age is associated with $2,180 lower price, all else equal
  • Useful transformation: Multiply by meaningful unit change
  • “A 100 sqft larger home → +$12,540”
  • “A 10-year-older home → -$21,800”

For Binary Variables (0/1):

  • Pool coefficient ($18,900): Homes with pools sell for $18,900 more on average than homes without pools, holding other characteristics constant
  • Renovated coefficient ($47,830): Renovated homes sell for $47,830 more than non-renovated homes, all else equal (but see endogeneity discussion later!)

For The Intercept:

  • $52,840: Predicted price for a home with all X variables = 0
  • Usually not meaningful (what’s a zero sqft home in the base neighborhood?)
  • Important for predictions, but don’t over-interpret

4.3 Standard Errors and t-statistics

“The standard error tells us how precise our estimate is,” David explains. “For square footage, the standard error is $8.60. This means if we took many samples and estimated this coefficient each time, the estimates would typically vary by about $8.60 around the true value.”

“The t-statistic is just the coefficient divided by its standard error,” Maria adds. “For square footage: t = $125.40 / $8.60 = 14.58. Large t-statistics (typically |t| > 2) indicate the coefficient is statistically significantly different from zero.”

The Standard Hypothesis Test for Regression Coefficients

Null Hypothesis: \(H_0: \beta_j = 0\) (the variable has no effect)

Alternative Hypothesis: \(H_A: \beta_j \neq 0\) (the variable has an effect)

Test Statistic:

\[t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}\]

Decision Rule: Reject \(H_0\) if |t| > t-critical or if p-value < α

Interpretation:

  • Reject \(H_0\): We have evidence that this variable is associated with the outcome
  • Fail to reject \(H_0\): Insufficient evidence that this variable matters (after controlling for other variables)

Critical Note from BorderMed Chapter: “Failing to reject \(H_0\)” does not mean “accepting” that β = 0. It means we don’t have enough evidence to conclude β ≠ 0. The true effect could be zero, or it could be small and we lack power to detect it, or it could be confounded with other variables.
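
As a sketch of the mechanics, here is the test applied to the square-footage coefficient from the table above (the degrees of freedom are approximate, since the exact number of neighborhood dummies isn't shown in the output):

```python
from scipy import stats

# t-test for the square footage coefficient (values from the table above;
# k is a rough predictor count including neighborhood dummies -- an assumption)
beta_hat, se = 125.40, 8.60
n, k = 500, 13

t_stat = (beta_hat - 0) / se
df = n - k - 1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))   # two-sided p-value

print(round(t_stat, 2), p_value)   # t ≈ 14.58, p effectively 0
```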

4.4 The p-value: What It Actually Means

Maria points to the p-values in Torres’s output.

“Square footage has p < 0.001. This means: if the true coefficient were actually zero (i.e., if square footage had no effect on price), the probability of observing a coefficient as large as $125.40 (or larger) purely by chance would be less than 0.1%. Since that’s very unlikely, we conclude square footage is statistically significantly associated with price.”

“But look at bathrooms,” David adds. “The coefficient is $8,220 with p = 0.067. Using the conventional α = 0.05 threshold, this is not statistically significant. We cannot confidently conclude that bathrooms have an independent effect on price after controlling for square footage and bedrooms.”

“This doesn’t necessarily mean bathrooms don’t matter,” Maria clarifies. “It means we can’t distinguish their effect from zero with this data. Maybe they do matter but the effect is small. Or maybe their effect is confounded with square footage—bigger homes have more bathrooms, so it’s hard to separate the effects.”

Common p-value Misinterpretations

Wrong: “p = 0.003 means there’s a 99.7% chance the effect is real”

Right: “p = 0.003 means that if there were no true effect, we’d see a coefficient this large or larger by chance only 0.3% of the time”

Wrong: “p < 0.05 means the effect is large enough to matter for business decisions”

Right: “p < 0.05 means we have statistical evidence of an effect, but we must separately assess whether the magnitude is practically significant”

Wrong: “p = 0.067 means there’s no effect”

Right: “p = 0.067 means we lack sufficient evidence to conclude there’s an effect with 95% confidence. The true effect might be zero, small, or confounded.”

Remember from the BorderMed chapter: Statistical significance ≠ practical significance. A tiny effect can be statistically significant with large samples. A large effect can fail significance with small samples.

4.5 Confidence Intervals: Quantifying Uncertainty

“I actually prefer confidence intervals to p-values,” Maria says. “They give you the same information but also show the range of plausible values.”

“Look at the pool coefficient,” David points out. “It’s $18,900 with a 95% CI of [$680, $37,120]. Yes, it’s statistically significant (p = 0.041), but look at that range! The true effect could be as small as $680 or as large as $37,120. That’s massive uncertainty.”

“For business decisions,” Maria emphasizes, “we need to know both that an effect exists and how large it might be. The confidence interval tells us: we’re 95% confident the true effect of having a pool is somewhere between $680 and $37,120, holding everything else constant.”

Confidence Intervals for Regression Coefficients

Formula:

\[CI_{95\%} = \hat{\beta}_j \pm t_{critical} \times SE(\hat{\beta}_j)\]

where \(t_{critical}\) is from the t-distribution with (n - k - 1) degrees of freedom.

Interpretation: We are 95% confident that the true coefficient lies within this interval.

Connection to Hypothesis Test: If the 95% CI excludes zero, then p < 0.05 (reject \(H_0: \beta = 0\))

For Decision-Making:

  • Narrow CI: Precise estimate, more confidence in the magnitude
  • Wide CI: Imprecise estimate, less useful for planning
  • Example: Pool CI [$680, $37,120] is very wide—high uncertainty about true effect size

Better Than p-values?: Many statisticians prefer CIs because they convey both:

  1. Whether the effect is statistically significant (does the CI exclude 0?)
  2. What range of effect sizes is plausible (business relevance)
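
The same ingredients give the confidence interval. Rebuilding the pool coefficient's interval from the reported estimate and standard error (again with approximate degrees of freedom) reproduces the reported range to within rounding:

```python
from scipy import stats

# 95% CI for the pool coefficient, rebuilt from the reported estimate and SE
beta_hat, se = 18_900.0, 9_210.0
n, k = 500, 13                         # approximate degrees of freedom (assumption)
t_crit = stats.t.ppf(0.975, df=n - k - 1)

low, high = beta_hat - t_crit * se, beta_hat + t_crit * se
print(f"[{low:,.0f}, {high:,.0f}]")    # close to the reported [$680, $37,120]
```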

5 Regression Assumptions: What Could Go Wrong?

“Torres’s regression output looks technically correct,” David says. “Excel did the calculations right. But a regression model only gives reliable results if certain assumptions hold. Let’s check them.”

5.1 The Classical Linear Regression Assumptions

Maria writes out the key assumptions on the whiteboard:

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent of each other
  3. Homoskedasticity: Constant error variance across all levels of X
  4. Normality: Errors are normally distributed (mainly matters for small samples)
  5. No perfect multicollinearity: Predictors are not perfectly correlated
  6. Exogeneity: Predictors are not correlated with the error term (no omitted variable bias)

“The first four are technical assumptions about the model structure,” David explains. “They affect our standard errors and hypothesis tests. But they’re usually not fatal—we have remedies and robustness.”

“The last two are more serious,” Maria emphasizes. “Multicollinearity makes it hard to separate effects of correlated predictors. Endogeneity (violation of exogeneity) means our coefficient estimates are biased—they’re systematically wrong, not just imprecise. No amount of additional data will fix bias from endogeneity.”

The Big Three Problems in Torres’s Analysis

Based on our review of the data and Torres’s approach, we’ve identified three major issues:

1. Multicollinearity: Square footage and bedrooms are highly correlated (r = 0.76), making it difficult to separately identify their effects. The bedroom coefficient is unstable.

2. Heteroskedasticity: Price variance increases with home size. Small homes show tight clustering around predictions; luxury homes show much greater variability. This affects our confidence intervals and prediction intervals, especially for high-end properties.

3. Endogeneity: The renovation variable is correlated with unobserved neighborhood quality factors. Homeowners selectively renovate certain homes, creating omitted variable bias. The $47,830 coefficient overstates the causal effect of renovation.

We’ll examine each problem in detail.

6 Problem #1: Multicollinearity

6.1 What Is Multicollinearity?

David pulls up a scatter plot showing square footage vs. bedrooms.

“Look at this. Homes with 1,500 sqft have 2-3 bedrooms. Homes with 3,000 sqft have 4-5 bedrooms. The correlation is 0.76. They move together.”

“When predictors are highly correlated,” Maria explains, “regression struggles to figure out which one is really driving the outcome. Is price going up because of more bedrooms, or because of the larger square footage that comes with more bedrooms? The math can’t easily untangle them.”

Definition: Multicollinearity

What It Is: High correlation among predictor variables in a regression model.

Why It’s a Problem:

  • Coefficient estimates become unstable (sensitive to small data changes)
  • Standard errors inflate (reducing t-statistics and widening CIs)
  • Difficult to determine individual effects of correlated predictors
  • Model predictions are still reliable, but individual coefficient interpretation becomes questionable

How to Detect:

  1. Correlation Matrix: Look for high correlations among predictors (|r| > 0.7 is concerning)
  2. Variance Inflation Factor (VIF):

\[VIF_j = \frac{1}{1 - R^2_j}\]

where \(R^2_j\) is the R² from regressing predictor j on all other predictors.

VIF Interpretation:
- VIF = 1: No correlation with other predictors
- VIF = 1-4: Mild multicollinearity (usually acceptable)
- VIF = 4-10: Moderate multicollinearity (concerning)
- VIF > 10: Severe multicollinearity (problematic)


Excel: Use =CORREL() for correlation matrix; VIF requires running auxiliary regressions
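
In Python, VIFs come straight from the auxiliary-regression idea above. The sketch below uses statsmodels' variance_inflation_factor on simulated predictors (the column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors (hypothetical names); sqft and bedrooms are built to correlate
rng = np.random.default_rng(0)
sqft = rng.normal(2300, 450, 500)
predictors = pd.DataFrame({
    "sqft": sqft,
    "bedrooms": np.round(sqft / 700 + rng.normal(0, 0.6, 500)),
    "lot_size": rng.normal(7000, 1500, 500),
})

X = sm.add_constant(predictors)
for i, name in enumerate(X.columns[1:], start=1):   # skip the constant column
    print(name, round(variance_inflation_factor(X.values, i), 2))
```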

6.2 Multicollinearity in the Sunburst Data

Maria runs the diagnostics:

Correlation Matrix (Selected Variables):

|  | Sqft | Bedrooms | Bathrooms |
|---|---|---|---|
| Sqft | 1.000 | 0.759 | 0.879 |
| Bedrooms | 0.759 | 1.000 | 0.642 |
| Bathrooms | 0.879 | 0.642 | 1.000 |

VIF Values:

  • Square footage: VIF = 4.8
  • Bedrooms: VIF = 4.2
  • Bathrooms: VIF = 3.1

“Both square footage and bedrooms have VIFs above 4,” David observes. “That’s moderate multicollinearity. Not severe enough to break the model, but enough to make us question the bedroom coefficient.”

“Look at what happens,” Maria demonstrates. She removes bedrooms from the model and re-runs the regression. The square footage coefficient changes from $125.40 to $147.20. “See? When we remove bedrooms, square footage picks up part of the bedroom effect because they’re correlated. The coefficient is unstable.”

The Bedroom Coefficient: Proceed with Caution

Torres reports bedrooms add $22,450 to home value. But:

Evidence of Multicollinearity:
  • Correlation with sqft: r = 0.759
  • VIF = 4.2 (moderate multicollinearity)
  • Coefficient sensitivity: changes substantially when the model specification changes

What This Means:
  • We cannot reliably separate the “bedroom effect” from the “square footage effect”
  • The $22,450 estimate is imprecise (partly capturing correlated sqft)
  • Small changes in sample or specification could change this coefficient substantially

For Decision-Making:
  • Focus on square footage as the primary size metric (VIF still elevated but more stable)
  • Don’t make investment decisions based on bedroom count alone
  • Recognize that “4 bedrooms” and “3,000 sqft” are closely related—you usually get both or neither

Torres’s Mistake: He recommends “maximizing bedroom count” based on the $22,450 coefficient. But that coefficient is confounded with square footage. Adding a bedroom without adding square footage (e.g., subdividing existing space) might not generate anywhere near $22,450 in value.

6.3 What to Do About Multicollinearity

“So how do we handle this?” David asks.

Maria thinks carefully. “A few options, none perfect:”

Option 1: Accept It and Be Careful
“Acknowledge the multicollinearity in interpretation. Say ‘We can’t reliably separate bedroom and square footage effects, but we know size matters.’ Don’t make strong claims about bedroom coefficients.”

Option 2: Drop One of the Correlated Variables
“If bedrooms and square footage are highly correlated, keep the one that’s more reliable or more actionable. In this case, keep square footage—it’s more precisely measured and more stable.”

Option 3: Combine Variables
“Create a ‘home size’ index that combines square footage and bedrooms. This recognizes they’re measuring similar underlying constructs.”

Option 4: Collect More Data or Use Different Methods
“With panel data or instrumental variables, we might be able to better separate effects. But that’s beyond the scope of this project.”

“For Sunburst,” David concludes, “we’ll recommend Option 1 and Option 2. Acknowledge the multicollinearity, emphasize square footage over bedroom count, and avoid strong causal claims about bedrooms.”

7 Problem #2: Heteroskedasticity

7.1 What Is Heteroskedasticity?

“Homo-ske-das-ti-city,” David sounds out slowly. “That’s a mouthful. It means ‘equal variance.’ Hetero-skedasticity means ‘unequal variance.’”

Maria draws a diagram showing residuals vs. fitted values. “Homoskedasticity means the spread of residuals is constant across all levels of our predictor. Heteroskedasticity means the spread changes—often getting wider for larger predicted values.”

Definition: Heteroskedasticity

Homoskedasticity (The Assumption):
Error variance is constant: \(\text{Var}(\varepsilon_i) = \sigma^2\) for all i

Heteroskedasticity (The Violation):
Error variance changes: \(\text{Var}(\varepsilon_i) = \sigma^2_i\) varies with X or predicted Y

Why It Matters:

  • Standard errors are incorrect (usually too small)
  • Confidence intervals are wrong (usually too narrow)
  • Hypothesis tests can be misleading (may overstate significance)
  • Coefficient estimates remain unbiased, but we lose efficiency
  • Prediction intervals are more problematic for some observations than others

How to Detect:

  1. Visual: Plot residuals vs. fitted values; look for “fan shape”
  2. Breusch-Pagan Test: Formal test of constant variance
    • \(H_0\): Homoskedasticity (constant variance)
    • \(H_A\): Heteroskedasticity
  3. White Test: More general test allowing non-linear forms

Common Patterns in Business Data:
  • Larger firms have more variable performance
  • Wealthier customers show more diverse spending patterns
  • Luxury goods show more price variation than commodities
  • Real estate: Expensive homes show more price variance than modest homes
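
For a sense of the mechanics, here is a Breusch-Pagan check in Python on simulated data where the error spread is deliberately made to grow with home size (all numbers are invented):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated prices whose noise grows with home size (all numbers invented)
rng = np.random.default_rng(1)
sqft = rng.normal(2300, 450, 500)
price = 50_000 + 125 * sqft + rng.normal(0, 10, 500) * sqft   # error SD proportional to sqft

results = sm.OLS(price, sm.add_constant(sqft)).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print(round(lm_stat, 1), lm_pvalue)   # small p-value: reject constant variance

# Visual check: plot results.fittedvalues against results.resid and look for a fan shape
```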

7.2 Heteroskedasticity in the Sunburst Data

Maria shows David the residual plot. “Look at this. For homes with fitted values around $220,000, the residuals are tightly clustered—most within ±$15,000 of the prediction. But for homes with fitted values around $350,000, the residuals are all over the place—some ±$40,000 or more.”

David runs the Breusch-Pagan test:
- Test statistic: χ² = 28.4
- P-value = 0.002
- Conclusion: Reject null hypothesis of constant variance

“This makes sense intuitively,” Maria notes. “In the modest home market ($200k-$250k), homes are relatively standardized. Buyers care mainly about size and condition. In the luxury market ($350k+), buyers are more heterogeneous. Some value pools, others value lot size, others value specific neighborhoods. More buyer preference diversity means more price variability.”

Implications of Heteroskedasticity for Sunburst

Problem: Torres’s model assumes prediction uncertainty is the same for all homes. But our analysis shows:

  • Homes < $250k: Residual SD ≈ $12,000
  • Homes $250k-$325k: Residual SD ≈ $18,000
  • Homes > $325k: Residual SD ≈ $28,000

Torres’s Mistake: He reports a single “margin of error of ±$8,500” for all predictions. This dramatically understates uncertainty for luxury homes.

Correct Approach:
  • Small homes ($220k predicted): 95% prediction interval ≈ ±$24,000
  • Medium homes ($280k predicted): 95% prediction interval ≈ ±$35,000
  • Large homes ($360k predicted): 95% prediction interval ≈ ±$55,000

For Decision-Making:
  • Pricing luxury homes requires more judgment and wider ranges
  • Don’t over-rely on point predictions for high-end properties
  • The regression model is most useful for mid-market homes
  • For luxury properties, comparable sales analysis (non-regression) may be more informative

7.3 Remedies for Heteroskedasticity

“So what do we do?” David asks.

Maria lists the options:

Option 1: Use Robust Standard Errors
“We can calculate ‘heteroskedasticity-robust’ standard errors (White standard errors) that remain valid even with non-constant variance. This fixes hypothesis tests and confidence intervals. Excel doesn’t do this automatically, but statistical software does.”

Option 2: Transform the Dependent Variable
“Taking logs often helps. Instead of regressing Price on X, regress log(Price) on X. This can stabilize variance because percentage changes are more constant than absolute dollar changes. But interpretation becomes more complex—coefficients represent percentage effects.”

Option 3: Weighted Least Squares
“If we know how variance changes with X, we can weight observations inversely to their variance. This gives more weight to precise observations (small homes) and less weight to noisy ones (luxury homes).”

Option 4: Acknowledge It in Interpretation
“Be honest about prediction intervals varying across the range. Don’t report a single ±$8,500 margin—report different intervals for different home types.”
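
Options 1 and 2 are the ones most often implemented in statistical software. A minimal sketch, again on simulated data with invented magnitudes:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with non-constant variance (invented magnitudes)
rng = np.random.default_rng(2)
sqft = rng.normal(2300, 450, 500)
price = 50_000 + 125 * sqft + rng.normal(0, 10, 500) * sqft

X = sm.add_constant(sqft)

robust = sm.OLS(price, X).fit(cov_type="HC1")   # Option 1: heteroskedasticity-robust SEs
print(robust.bse)                               # standard errors valid under heteroskedasticity

logged = sm.OLS(np.log(price), X).fit()         # Option 2: model log(price) instead of price
print(logged.params[1])                         # roughly the proportional effect of one sqft
```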

“For Sunburst,” David decides, “we’ll go with Option 4. We’ll explain that prediction uncertainty increases with home value and provide separate confidence ranges for different market segments. That’s transparent and actionable.”

8 Problem #3: Endogeneity—The Most Serious Issue

8.1 What Is Endogeneity?

“This is the big one,” Maria says seriously. “Multicollinearity makes coefficients imprecise. Heteroskedasticity makes standard errors wrong. But endogeneity makes coefficient estimates biased—systematically wrong. And it’s often unfixable without better data or different methods.”

David nods. “One of our key regression assumptions is exogeneity: the predictors are uncorrelated with the error term. Formally: \(E(\varepsilon | X) = 0\). This means that conditional on our measured X variables, there are no systematic patterns in what’s left over.”

“When this assumption is violated,” Maria continues, “we have endogeneity. This happens when:

  1. Omitted variable bias: An important variable is missing from the model, and it’s correlated with both an included predictor and the outcome
  2. Simultaneity: Y affects X while X affects Y (supply and demand is the classic example)
  3. Measurement error: X is measured with error, and that error is related to the outcome”

Definition: Endogeneity and Omitted Variable Bias

Exogeneity (The Assumption):
\(E(\varepsilon | X_1, X_2, \ldots, X_k) = 0\)

In words: Conditional on our observed predictors, there are no other factors systematically affecting Y.

Endogeneity (The Violation):
\(E(\varepsilon | X_1, X_2, \ldots, X_k) \neq 0\)

Some predictor is correlated with unobserved factors in the error term.

Most Common Source: Omitted Variable Bias

Suppose the true model is: \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\]

But we only observe \(X_1\) and estimate: \[Y = \beta_0 + \beta_1 X_1 + u\]

where \(u = \beta_2 X_2 + \varepsilon\).

If \(X_1\) and \(X_2\) are correlated, then \(X_1\) is correlated with \(u\), violating exogeneity.

Result: Our estimate \(\hat{\beta}_1\) is biased. Even with infinite data, it won’t converge to the true \(\beta_1\). The bias is:

\[\text{Bias}(\hat{\beta}_1) = \beta_2 \cdot \frac{\text{Cov}(X_1, X_2)}{\text{Var}(X_1)}\]

Direction of Bias:
  • If \(X_2\) positively affects Y and is positively correlated with \(X_1\): upward bias
  • If \(X_2\) positively affects Y but is negatively correlated with \(X_1\): downward bias
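
A small simulation makes the bias formula tangible: generate data where \(X_2\) matters and is correlated with \(X_1\), omit \(X_2\), and the short regression's slope shifts by roughly \(\beta_2 \cdot \text{Cov}(X_1, X_2)/\text{Var}(X_1)\). All numbers below are made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Simulated illustration of omitted variable bias (every number is made up)
rng = np.random.default_rng(3)
n = 5_000

x2 = rng.normal(0, 1, n)                        # omitted variable (e.g., micro-location quality)
x1 = 0.7 * x2 + rng.normal(0, 1, n)             # included variable, correlated with x2
y = 2.0 * x1 + 3.0 * x2 + rng.normal(0, 1, n)   # true beta1 = 2.0

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(y, sm.add_constant(x1)).fit()

bias = 3.0 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)   # beta2 * Cov(x1, x2) / Var(x1)
print(full.params[1])    # ≈ 2.0: unbiased when x2 is in the model
print(short.params[1])   # ≈ 2.0 + bias: biased upward when x2 is omitted
print(2.0 + bias)        # matches the formula above, up to sampling noise
```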

8.2 Endogeneity in the Renovation Variable

Maria pulls up the summary statistics. “Look at this breakdown:

Renovation Rates by Neighborhood:
  • East Side: 12% of homes renovated
  • West Side: 17%
  • Central: 20%
  • Mission Valley: 16%
  • Northeast: 35%
  • Kern Place: 39%

Average Prices:
  • Renovated homes: $291,000
  • Non-renovated homes: $265,000
  • Difference: $26,000”

“Torres claims renovated homes sell for $47,830 more, controlling for neighborhood. But I’m skeptical,” David says. “Think about the decision process: when does a homeowner renovate?”

Maria lists the factors:

“They renovate when:

  1. The home is in a good micro-location within the neighborhood (specific street, view, proximity to parks)
  2. The home has good ‘bones’—solid structure, good layout
  3. The neighborhood is improving, so the investment will pay off
  4. The owner has high income and good taste
  5. The home is older and needed updates”

“Our model controls for neighborhood at a coarse level—East Side, Northeast, etc.,” David points out. “But within Northeast, there’s huge variation. Some blocks are near UTEP with student apartments; others are quiet residential streets with mountain views. Those micro-location factors affect both prices AND renovation decisions, but they’re not in our model.”

The Endogeneity Problem with Renovation

Torres’s Coefficient: Renovated = +$47,830 (p < 0.001)

Torres’s Interpretation: “Renovating a home causes a $47,830 increase in sale price”

The Problem: This assumes renovations are randomly assigned. They’re not. Owners selectively renovate homes that:
  • Are in better micro-locations (not captured by our coarse neighborhood dummies)
  • Have better underlying quality (not captured by age, sqft, bedrooms)
  • Are in improving sub-markets (not captured by time-constant neighborhood dummies)

The Result: The renovation coefficient is biased upward. It captures:
  1. True causal effect of renovation (what we want)
  2. Unobserved micro-location quality (bias)
  3. Unobserved home quality (bias)
  4. Unobserved neighborhood trends (bias)

Evidence of Bias:
  • Renovated homes are larger (2,487 sqft) than non-renovated (2,144 sqft), a sign that they differ systematically in ways the measured controls may only partly capture
  • Renovation rates vary 3x across neighborhoods (12% to 39%)—not random
  • The simple difference ($26,000) is much smaller than the regression coefficient ($47,830)—suggests confounding

What Torres Should Say:
“Renovated homes are associated with $47,830 higher prices, but this association likely reflects both the renovation itself AND unobserved quality factors that led to the renovation decision. We cannot interpret this as the causal effect of renovation.”

8.3 Why This Matters for Business Decisions

“Here’s why this is critical,” Maria emphasizes. “Torres recommends Sunburst spend $15,000 renovating each spec home, expecting to recoup $47,830 in higher sale price—a $32,830 net gain per home, or $394,000 total across 12 homes.”

“But that $47,830 coefficient is biased,” David responds. “It’s mixing the true renovation effect with unobserved location and quality factors. If Sunburst renovates a randomly selected spec home, they won’t get $47,830. They’ll get something less—maybe much less.”

“How much less?” Maria asks.

“Hard to say precisely without better data, but let’s reason through it. The raw difference between renovated and non-renovated home prices is $26,000, before controlling for anything. The regression coefficient is $47,830 after controlling for measured characteristics. The fact that the coefficient is HIGHER than the raw difference tells us the measured controls are soaking up factors that pull renovated homes’ prices down (renovated homes tend to be older, for instance). But the unmeasured factors, better micro-locations and better underlying quality, almost certainly push in the other direction and inflate the coefficient.”

“Looking at within-neighborhood patterns,” Maria suggests, “in Northeast, renovated homes average $291k vs. $273k for non-renovated—difference of $18k. In Kern Place, $342k vs. $315k—difference of $27k. These are more plausible estimates of the true effect because we’re comparing more similar homes.”

“So the true causal effect is probably in the $15,000-$25,000 range, not $47,830,” David concludes. “If renovations cost $15,000, the net benefit might be $0 to $10,000, not $32,830. Across 12 homes, that’s the difference between a $394,000 profit and maybe a $60,000 profit—or even a loss if renovation costs run over budget.”

Torres’s $394,000 Mistake

Torres’s Recommendation:
Renovate all 12 spec homes at $15,000 each = $180,000 total investment
Expected return: $47,830 × 12 = $573,960
Net profit: $393,960

Why It’s Wrong:
The $47,830 coefficient is biased upward by omitted variables. True causal effect is likely $15,000-$25,000.

More Realistic Scenario:
True renovation effect: ~$20,000
Expected return: $20,000 × 12 = $240,000
Net profit: $60,000

Worst Case:
True renovation effect: ~$15,000
Renovations cost more than expected: $18,000 each
Expected return: $15,000 × 12 = $180,000
Actual cost: $216,000
Net loss: -$36,000

Better Approach:
Use actual comparable sales of renovated vs. non-renovated homes in the specific neighborhoods where spec homes are located, rather than regression coefficients from observational data with endogeneity problems.

8.4 Can We Fix Endogeneity?

“So what do we do?” David asks.

Maria sighs. “Endogeneity is hard to fix. Really hard. Here are the standard approaches, but none are perfect:”

Approach 1: Measure the Omitted Variables
“If we could measure micro-location quality, school ratings, exact lot characteristics, etc., we could include them in the model. But that data isn’t available, or is expensive to collect.”

Approach 2: Instrumental Variables (IV)
“Find a variable that affects renovation decisions but doesn’t directly affect home prices (except through renovation). This is the gold standard in econometrics, but good instruments are rare. For Sunburst, we can’t identify one with the available data.”

Approach 3: Natural Experiments
“Find a setting where renovations were assigned for reasons unrelated to home quality—maybe a city program that randomly selected homes for renovation subsidies. Obviously not available here.”

Approach 4: Difference-in-Differences or Fixed Effects
“With panel data—observing the same homes over time—we could compare price changes for homes that got renovated vs. those that didn’t. This controls for time-invariant unobserved quality. But we only have cross-sectional data.”

Approach 5: Be Honest About Limitations
“Acknowledge the endogeneity, explain why the coefficient is likely biased, and recommend alternative evidence (comparable sales) for the decision.”

“For Sunburst,” David decides, “we go with Approach 5. We can’t fix the endogeneity with this data, so we need to be transparent about limitations and recommend they don’t rely on the regression coefficient for renovation ROI calculations.”

9 A Brief Note on Panel Data

Maria pauses the analysis. “Before we finish with Sunburst, I want to touch on something we keep mentioning: panel data. We have cross-sectional data—500 homes, each observed once. But many econometric applications use panel data—multiple entities observed over time.”

9.1 What Is Panel Data?

David explains: “Remember from Lecture 1, we talked about three types of data structures:

Cross-Sectional (what we have): Different homes, one time point
Time Series: One entity, many time points
Panel: Multiple entities, multiple time points”

“Panel data is incredibly powerful,” Maria adds, “because it lets you control for unobserved time-invariant characteristics. For example, if we observed each home multiple times over the years—maybe some got renovated mid-period—we could use ‘fixed effects’ models to control for all permanent home characteristics (location, lot quality, floor plan, etc.) and isolate the renovation effect.”

Panel Data and Fixed Effects: A Preview

Cross-Sectional Model (What We Used):

\[\text{Price}_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i\]

where i indexes homes.

Panel Data Model:

\[\text{Price}_{it} = \beta_0 + \beta_1 X_{1it} + \cdots + \beta_k X_{kit} + \alpha_i + u_{it}\]

where:

  • i indexes homes
  • t indexes time periods
  • \(\alpha_i\) is a home-specific fixed effect (captures all time-invariant characteristics)
  • \(u_{it}\) is the remaining error

Key Advantage: The fixed effect \(\alpha_i\) absorbs all permanent differences between homes—location quality, lot characteristics, architectural features, etc. This eliminates omitted variable bias from these factors.

Example: Suppose Home A is renovated between 2022 and 2023. We compare:
  • Home A’s price in 2023 (post-renovation) to its own 2022 price (pre-renovation)
  • Difference out the \(\alpha_i\) (all permanent quality factors)
  • What’s left is the renovation effect, free of bias from time-invariant omitted variables

Limitations:
- Need data on same homes over time (often unavailable for real estate)
- Can only identify effects of time-varying variables
- Doesn’t help with omitted variables that change over time
- More complex to implement and interpret

When to Use: Panel data methods are standard in labor economics (following workers over time), health economics (patients across periods), development economics (countries over years), and any setting where you observe the same entities repeatedly.

For This Course: Full panel data econometrics is beyond our scope, but you should know it exists and when to consult experts who specialize in these methods.
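
To preview how the within transformation works, here is a sketch on simulated panel data where renovation decisions are deliberately tied to unobserved home quality (every name and number is invented): pooled OLS overstates the renovation effect, while demeaning each home's data removes the home fixed effect and recovers the true value.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Sketch of a fixed-effects (within) estimator on simulated panel data
rng = np.random.default_rng(4)
homes, years = 200, 4

alpha = np.repeat(rng.normal(0, 40_000, homes), years)    # permanent home quality
renovate_prob = 1 / (1 + np.exp(-alpha / 40_000))         # better homes renovate more often
renovated = (rng.random(homes * years) < renovate_prob).astype(float)
price = 250_000 + alpha + 20_000 * renovated + rng.normal(0, 10_000, homes * years)

df = pd.DataFrame({"home": np.repeat(np.arange(homes), years),
                   "renovated": renovated, "price": price})

pooled = sm.OLS(df["price"], sm.add_constant(df["renovated"])).fit()
print(pooled.params["renovated"])    # biased upward: renovation correlates with alpha

# Within transformation: subtract each home's own means, which removes alpha_i
demeaned = df[["renovated", "price"]] - df.groupby("home")[["renovated", "price"]].transform("mean")
fe = sm.OLS(demeaned["price"], demeaned[["renovated"]]).fit()
print(fe.params["renovated"])        # close to the true 20,000 renovation effect
```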

9.2 Other Advanced Methods

“There are many other econometric techniques for dealing with observational data challenges,” David notes:

Difference-in-Differences (DiD): Compare changes over time for a treatment group vs. control group. Used for policy evaluation.

Regression Discontinuity (RD): Exploit sharp cutoffs in treatment assignment (e.g., test score thresholds for program eligibility).

Synthetic Controls: Create a weighted combination of control units that mimics the treatment unit pre-intervention, then compare post-intervention.

Matching Methods: Find comparable treatment and control observations based on measured characteristics.

Time Series Econometrics: Models for data with serial correlation and trends (ARIMA, VAR, cointegration).

“These are all tools for trying to make causal inferences from observational data,” Maria summarizes. “They’re not magic—they all require assumptions—but they’re often more credible than naive regression with endogeneity problems.”

“For your careers,” David adds, “you don’t need to implement these yourself, but you should know they exist. When you face a business question with observational data and potential confounding, consider consulting an econometrician who can help you choose the right method.”

10 The Corrected Analysis: David and Maria’s Report

After thorough review, David and Maria prepare their consulting report for Sunburst Homes. Their report acknowledges the challenges with Torres’s analysis while providing actionable guidance.

10.1 Key Corrections and Recommendations

1. On Multicollinearity

“We cannot reliably separate the effects of square footage and bedroom count due to high correlation (r = 0.76, VIF > 4 for both). The bedroom coefficient of $22,450 is imprecise and should not be used for investment decisions. Focus on total square footage as the primary size metric.”

2. On Heteroskedasticity

“Prediction uncertainty is not constant. Small homes (<$250k) can be priced with ±$24,000 prediction intervals. Luxury homes (>$325k) require ±$55,000 intervals. Don’t rely on a single ‘margin of error’ figure.”

3. On Endogeneity

“The renovation coefficient of $47,830 is likely biased upward by omitted variables. Homeowners selectively renovate homes in better micro-locations and with better underlying quality. We estimate the true causal effect at $15,000-$25,000 based on within-neighborhood comparables. Do not use the regression coefficient for ROI calculations.”

4. On Pricing Strategy

“Use regression to identify comparable homes with similar characteristics, then examine actual sales prices of those comparables. Apply qualitative adjustments for features the regression cannot capture. Construct prediction intervals that acknowledge uncertainty, particularly for higher-priced homes.”

5. On Renovation Decisions

“For spec homes in Northeast and Kern Place neighborhoods, comparable sales analysis suggests renovated homes sell for $18,000-$27,000 more than non-renovated comparables. If cosmetic renovations cost less than $15,000, they may generate positive ROI. For homes in other neighborhoods, the market evidence is weaker. In all cases, base decisions on comparable sales, not regression coefficients.”

10.2 The Correct Interpretation of R²

Maria addresses a final misconception: “Torres claimed R² = 0.847 means ‘84.7% accuracy.’ That’s incorrect. R² means the model explains 84.7% of the variance in home prices. For any individual home, prediction error can still be large.”

“Think of it this way,” David adds. “If we just predicted every home would sell for the average price ($269,552), we’d have huge prediction errors. The regression reduces those errors by 84.7% by incorporating home characteristics. But that still leaves substantial unexplained variance—about 15% of the total, which translates to roughly ±$37,000 prediction intervals for typical homes.”

R²: What It Is and Isn’t

What R² Measures:
Proportion of variance in Y explained by the model: \(R^2 = 1 - \frac{SSE}{SST}\)

What R² = 0.847 Means:
- The model reduces prediction error by 84.7% compared to just predicting the mean
- 84.7% of the variance in home prices is associated with variation in our predictors
- 15.3% of the variance remains unexplained

What R² = 0.847 Does NOT Mean:
- ❌ “84.7% of individual predictions are accurate”
- ❌ “Prediction error is only 15.3%”
- ❌ “We can predict prices with 84.7% confidence”

Why This Matters:
Even with high R², individual predictions can have substantial error. For business decisions, examine:
  • Standard error of the regression (typical prediction error)
  • Prediction intervals for specific cases (accounting for both coefficient uncertainty and residual variation)
  • Whether R² is high enough for your application (depends on context—predicting human behavior may only achieve R² = 0.30, which could still be valuable)

Better Metrics for Prediction Quality:
- Root Mean Squared Error (RMSE): Average prediction error in outcome units
- Mean Absolute Error (MAE): Average absolute prediction error
- Prediction intervals: Range that will contain true value X% of the time
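
In practice these quantities are easy to pull from a fitted model. A sketch on simulated data (illustrative only):

```python
import numpy as np
import statsmodels.api as sm

# Point prediction, prediction interval, and RMSE from a fitted model (simulated data)
rng = np.random.default_rng(5)
sqft = rng.normal(2300, 450, 500)
price = 50_000 + 125 * sqft + rng.normal(0, 25_000, 500)

results = sm.OLS(price, sm.add_constant(sqft)).fit()

new_home = np.array([[1.0, 2_000.0]])                    # constant + sqft of the home to price
frame = results.get_prediction(new_home).summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])   # point estimate and 95% prediction interval

rmse = np.sqrt(np.mean(results.resid**2))                # typical prediction error in dollars
print(round(rmse))
```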

11 Putting It All Together: Regression in Business Context

As David and Maria finalize their report, they reflect on the broader lessons.

11.1 What Regression Can Do

“Regression is incredibly useful,” Maria emphasizes. “It lets us:

  • Quantify associations between variables while controlling for confounders
  • Make predictions about outcomes based on observable characteristics
  • Identify which factors matter most (through coefficient magnitudes and significance)
  • Provide probabilistic ranges for uncertain outcomes
  • Test hypotheses about relationships in data”

11.2 What Regression Cannot Do (Without Strong Assumptions)

“But regression is not magic,” David cautions. “With observational data, it cannot:

  • Prove causation (only association)
  • Control for unmeasured confounders (endogeneity remains)
  • Solve selection bias or omitted variable bias without additional methods
  • Tell us what would happen under counterfactual interventions
  • Eliminate uncertainty or make decisions for us”

11.3 The Econometrician’s Mindset

“The biggest lesson,” Maria reflects, “is humility. Torres was too confident. He said the model ‘proves’ relationships, gives ‘84.7% accuracy,’ and provides ‘definitive guidance.’ Those claims are overreach.”

“Good econometric analysis,” David adds, “requires acknowledging limitations, testing assumptions, being transparent about what we can and cannot conclude, and recognizing that statistical models inform decisions but don’t make them.”

Principles for Applied Regression Analysis

1. Be Clear About Your Goal:
- Description? Prediction? Causal inference?
- Different goals require different approaches and different standards of evidence

2. Understand Your Data:
- Observational or experimental?
- Cross-sectional, time series, or panel?
- What variables are measured? What’s missing?

3. Test Assumptions:
- Check for multicollinearity (VIF)
- Test for heteroskedasticity (Breusch-Pagan)
- Think hard about endogeneity (likely omitted variables?)
- Examine residual plots

4. Use Careful Language:
- “Associated with” not “causes”
- “Holding other measured factors constant”
- “In this sample” or “Based on this model”
- Acknowledge limitations explicitly

5. Assess Practical Significance:
- Statistical significance ≠ business relevance
- Examine coefficient magnitudes and confidence intervals
- Consider effect sizes relative to decision thresholds

6. Validate Predictions:
- Check predictions against holdout data when possible
- Provide prediction intervals, not just point estimates
- Acknowledge where uncertainty is greatest

7. Know When to Get Help:
- Severe endogeneity? Consider IV, RD, DiD
- Panel data? Consider fixed effects
- Time series? Consider ARIMA, VAR
- Complex selection? Consider matching, propensity scores
- Don’t force simple regression to answer causal questions it can’t handle

12 Wrapping Up: The Path Forward

Jennifer Walsh reads David and Maria’s report carefully. “This is helpful. You’ve shown me that Michael’s analysis had the right data and the right model structure, but drew conclusions that were too strong.”

“So what should we do?” asks Margaret Chen, the CFO.

“Three things,” Jennifer decides. “First, price the spec homes using the comparable sales approach David and Maria recommended, not Michael’s regression formula. Second, skip the across-the-board renovations. Instead, let’s do modest cosmetic updates on the two Northeast homes and the one Kern Place home where comparable sales suggest it might pay off. Third, and most important—let’s bring David and Maria in for ongoing consulting on future projects. We need this kind of careful statistical thinking.”

Michael Torres looks chastened but appreciative. “I learned something important today. I thought I was being rigorous by running a regression and testing significance. But rigor means more than calculating p-values—it means thinking carefully about what the model can and cannot tell us.”

David and Maria pack up their materials. “Every company we’ve worked with—TechFlow, PrecisionCast, BorderMed, DesertVine, PixelPerfect, and now Sunburst—has taught us something about the gap between statistical analysis and business decision-making.”

“The statistics are important,” Maria adds. “But so is judgment, domain knowledge, humility about limitations, and honest communication about uncertainty. That’s what separates good quantitative analysis from merely technically correct analysis.”

As they leave Sunburst’s offices, David reflects: “You know what? This has been an incredible journey. We’ve gone from calculating means and standard deviations for TechFlow to conducting econometric analysis with discussions of endogeneity and panel data methods. We’ve learned not just the formulas, but how to think statistically.”

“And more importantly,” Maria adds, “we’ve learned that good statistical practice isn’t about impressing people with fancy methods. It’s about being clear, honest, and helpful. It’s about acknowledging what we don’t know as carefully as we present what we do know.”

“Statistics,” David concludes, “is ultimately about making better decisions under uncertainty. Not eliminating uncertainty—that’s impossible—but navigating it thoughtfully.”

They smile. After seven companies and seven statistical challenges, they’re ready for whatever quantitative problems their careers throw at them next.

13 Practice Problems

Problem 1: Interpreting Coefficients

A regression of employee salary on years of experience, education level (1-5 scale), and department (using dummies) yields:

\[\text{Salary} = 35{,}000 + 2{,}800(\text{Experience}) + 4{,}200(\text{Education}) + 8{,}500(\text{Dept\_Tech}) + \varepsilon\]

All coefficients are statistically significant at p < 0.01.

Questions:
1. Interpret the experience coefficient in plain English
2. How much more does a Tech department employee earn than a baseline department employee, holding experience and education constant?
3. Can you conclude that getting more education causes higher salary? Why or why not?
4. What are potential sources of omitted variable bias in this model?

Problem 2: Multicollinearity

You run a regression predicting house prices with: square footage, number of rooms, lot size, and age. The VIF values are:

  • Square footage: VIF = 8.2
  • Number of rooms: VIF = 7.8
  • Lot size: VIF = 1.4
  • Age: VIF = 1.2

Questions:
1. Which variables show problematic multicollinearity?
2. Why might square footage and number of rooms be highly correlated?
3. What would you recommend to address this issue?
4. Can you still use this model for price prediction? Why or why not?

Problem 3: Heteroskedasticity

You estimate a model of restaurant daily revenue as a function of weather, day of week, and local events. You plot residuals vs. fitted values and see a clear “fan shape”—residuals are tightly clustered for low predicted revenues but widely dispersed for high predicted revenues.

Questions:
1. What does this pattern indicate?
2. Why might restaurant revenue show this pattern?
3. How does heteroskedasticity affect your coefficient estimates?
4. How does it affect your hypothesis tests?
5. What would you recommend to address this issue?

Problem 4: Endogeneity

A consulting firm regresses client satisfaction (measured on a 0-100 scale) on: project cost, project duration, consultant experience, and whether the client received a “premium service package” (a binary variable). The premium package coefficient is +12.5 points and is statistically significant (p < 0.001).

Questions:
1. Can you interpret this as “the premium package causes 12.5 points higher satisfaction”?
2. What is the likely source of endogeneity here?
3. In what direction would this bias the coefficient? (upward or downward?)
4. How would you design a study to get an unbiased estimate of the premium package effect?

Problem 5: Prediction vs. Causation

A company uses regression to predict customer churn based on: usage frequency, customer service calls, contract length, and payment history. The model has R² = 0.73.

Management wants to reduce churn and asks: “Which variable should we intervene on? The model shows customer service calls have the strongest association with churn (coefficient = 0.28, p < 0.001). Should we reduce customer service call volume?”

Questions:
1. What’s wrong with this interpretation?
2. Explain why reducing customer service calls might not reduce churn
3. Distinguish between this model’s usefulness for prediction vs. for causal decision-making
4. What additional information would you need to make causal recommendations?

14 Excel Functions Quick Reference

Running Regression:

  1. Data → Data Analysis → Regression (requires Analysis ToolPak add-in)
  2. Input Y Range: Your dependent variable
  3. Input X Range: Your independent variable(s)
  4. Select output location
  5. Check boxes for: Confidence Interval, Residuals, Residual Plots

Reading Output:

  • Regression Statistics: R², Adjusted R², Standard Error
  • ANOVA Table: Overall F-test of model significance
  • Coefficients Table: β estimates, standard errors, t-statistics, p-values, confidence intervals
  • Residual Output: Predicted values, residuals for each observation

Alternative: LINEST Function:

=LINEST(Y_range, X_range, TRUE, TRUE)

Returns coefficients, standard errors, R², F-statistic in an array. More flexible but less user-friendly.

Diagnostics:

Correlation Matrix:

=CORREL(range1, range2)

Or use Data → Data Analysis → Correlation

For VIF, need to run auxiliary regressions (each X on all other X’s) and calculate: VIF = 1/(1-R²)

15 Glossary of Key Terms

Coefficient (β): Estimated effect of a one-unit change in X on Y, holding other variables constant

Confidence Interval: Range of plausible values for a parameter; quantifies uncertainty

Cross-Sectional Data: Multiple entities observed at one point in time

Endogeneity: Violation of exogeneity assumption; predictor correlated with error term, causing bias

Exogeneity: Assumption that predictors are uncorrelated with error term; required for unbiased estimates

Fitted Value: Predicted outcome from regression model for a given observation

Heteroskedasticity: Non-constant error variance; affects standard errors and inference

Homoskedasticity: Constant error variance across all levels of predictors; a key assumption

Instrumental Variable (IV): Variable used to identify causal effects when endogeneity is present

Multicollinearity: High correlation among predictors; makes it difficult to separate their effects

Observational Study: Data where treatment/exposure is not randomly assigned

Omitted Variable Bias: Bias in coefficient estimates due to excluding a relevant variable correlated with included predictors

Ordinary Least Squares (OLS): Method that minimizes sum of squared residuals

Panel Data: Multiple entities observed over multiple time periods

Prediction Interval: Range for a new observation’s outcome; wider than confidence interval

R-Squared (R²): Proportion of variance in Y explained by the model; ranges from 0 to 1

Residual: Difference between observed and predicted values; prediction error

Standard Error: Measure of uncertainty in coefficient estimate

t-statistic: Coefficient divided by its standard error; used for hypothesis testing

Time Series Data: One entity observed over multiple time periods

VIF (Variance Inflation Factor): Measure of multicollinearity; VIF > 4 indicates concern


In econometrics, the most important skill isn’t mastering complex techniques. It’s knowing what you can and cannot conclude from your data—and communicating that honestly.

Every coefficient tells a story. Make sure you’re telling the right one.