

Question

1. [70 points] You wish to predict the sale price of single-family residences in Massachusetts using property features (commonly called a “hedonic pricing model”).  You collect price and property features data on properties sold in the state for the year 2010 and obtain the following regression:

Price_i = 14407.60 - 759.92*houseage_i + 0.24*lotsize_i + 354.35*bldarea_i + 12015.61*rooms_i + µ_i

Standard errors: constant (6433.23), houseage (89.67), lotsize (0.115), bldarea (265.39), rooms (8516.47)

Observations = 2691         R2 = 0.49             F = 65.10

Where:

houseage = age of the house (in years)

lotsize = total square feet of the land

bldarea = total square feet of the interior of the house

rooms = total number of rooms in the house

[6 points] How would you categorize, or label, this dataset?  Defend your answer.

[6 points] What is the interpretation of the constant term in this regression?  Why is it included?

[8 points] How do we interpret the coefficient on lotsize? Why is the coefficient on lotsize nominally small if we expect it to have a large impact on the price of a house?

[6 points] What is the predicted price of a house that is 7 years old, with a lot size of 800 square feet, a building interior of 400 square feet, and 5 rooms?  Will this predicted price be close to the actual price?  Why or why not?

[6 points] Explain what is meant by R2 = .49.  What is one good reason and one bad reason to use R2 as a measure of the “goodness of fit” of a regression?

[10 points] Test the significance of each independent variable in the model using α = .05.  Are these findings expected?  Why or why not, and what could explain your findings?

[8 points] Construct a 95% confidence interval for houseage in the model above.  What is this measure telling us?  How will consistency in your OLS estimation affect your confidence intervals?

[10 points] After thinking about your model further, you wish to add median income as a variable in your regression.  You collect data on the median income of each census tract in Massachusetts in the year 2010, and match that to your housing data.  Assuming you believe that your original form of the model suffered from omitted variable bias, in what direction would you expect your estimates to change with the inclusion of median income?  Defend your answers.

[10 points] Suppose you ran the same model as above only using log(price) instead of price and obtain an R2 of 0.54 and an F-statistic of 68.17.  Based on this information, are we able to say which version (level or log) of the model is better?  Explain why or why not.

Explanation / Answer

1)    What is the interpretation of the constant term in this regression?  Why is it included?

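Sol: - The constant term is the intercept of the model: the predicted price, 14407.60, of a residence for which all the independent variables equal 0. Since no house has zero lot size, zero interior area and zero rooms, this value has little economic meaning on its own. It is included so that the regression line is not forced through the origin; without it the slope estimates would generally be biased and the residuals would not average to zero.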

2)    How do we interpret the coefficient on lotsize? Why is the coefficient on lotsize nominally small if we expect it to have a large impact on the price of a house?

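Sol: - Holding houseage, bldarea and rooms constant, an increase in lotsize of 1 square foot raises the predicted price by 0.24 units of price (0.24 dollars if price is measured in dollars). The coefficient is nominally small because of the unit of measurement: one square foot is a tiny change in a lot. Lots differ by thousands of square feet, so the total effect on price can still be large; a lot that is 10,000 square feet larger adds 0.24*10000 = 2400 to the predicted price.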

3)    What is the predicted price of a house that is 7 years old, with a lot size of 800 square feet, a building interior of 400 square feet, and 5 rooms?  Will this predicted price be close to the actual price?  Why or why not?

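Sol: - The predicted price of the house is given by

            Price = 14407.60 - 759.92*houseage + 0.24*lotsize + 354.35*bldarea + 12015.61*rooms

                        = 14407.60 - 759.92*7 + 0.24*800 + 354.35*400 + 12015.61*5

                        = 14407.60 - 5319.44 + 192.00 + 141740.00 + 60078.05

                        = 211098.21

The prediction is unlikely to be close to the actual price. An 800-square-foot lot and a 400-square-foot interior with 5 rooms are probably far outside the range of the sample, so the prediction is an extrapolation, and with R2 = 0.49 about half of the variation in price is left unexplained even within the sample.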
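
The arithmetic of the fitted value can be checked with a short script. This is only a sketch: the coefficients are copied from the regression in the question, while the names COEFFS and predict_price are made up for illustration.

```python
# Coefficients copied from the estimated regression in the question.
# COEFFS and predict_price are illustrative names, not part of the problem.
COEFFS = {
    "const": 14407.60,
    "houseage": -759.92,
    "lotsize": 0.24,
    "bldarea": 354.35,
    "rooms": 12015.61,
}


def predict_price(houseage, lotsize, bldarea, rooms):
    """Return the fitted price of one house from the estimated equation."""
    return (
        COEFFS["const"]
        + COEFFS["houseage"] * houseage
        + COEFFS["lotsize"] * lotsize
        + COEFFS["bldarea"] * bldarea
        + COEFFS["rooms"] * rooms
    )


# House from the question: 7 years old, 800 sq ft lot, 400 sq ft interior, 5 rooms.
predicted = predict_price(houseage=7, lotsize=800, bldarea=400, rooms=5)
print(round(predicted, 2))
```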

4)    Explain what is meant by R2 = .49.  What is one good reason and one bad reason to use R2 as a measure of the “goodness of fit” of a regression?

Sol: - R2 is the coefficient of determination of the regression. R2 = 0.49 means that the model explains 49% of the variation of price around its mean. A good reason to use it is that it summarizes in a single number how close the data are to the fitted regression line. A bad reason is that it never decreases when more variables are added to the model, whether or not they actually affect the dependent variable, so a high R2 can reflect the number of regressors rather than a better model.
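
One common way to offset the mechanical rise of R2 with extra regressors is the adjusted R2, which penalizes the number of variables. The question does not ask for it, so the sketch below is only an illustration; it uses n = 2691 and k = 4 from the regression output, and the function name adjusted_r2 is made up here.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Values from the regression output: R2 = 0.49, n = 2691, k = 4 regressors.
adj = adjusted_r2(0.49, 2691, 4)
print(round(adj, 4))
```

With a sample this large and only four regressors the penalty is tiny, so the adjusted value stays very close to 0.49.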

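5)    Test the significance of each independent variable in the model using α = .05.  Are these findings expected?  Why or why not, and what could explain your findings?

Sol: - Each coefficient is tested with H0: coefficient = 0 against a two-sided alternative, using t = estimate / standard error and the critical value t(0.025, 2686) ≈ 1.96:

            houseage: t = -759.92 / 89.67 = -8.47, significant
            lotsize: t = 0.24 / 0.115 = 2.09, significant
            bldarea: t = 354.35 / 265.39 = 1.34, not significant
            rooms: t = 12015.61 / 8516.47 = 1.41, not significant

The results for houseage and lotsize are expected. The insignificance of bldarea and rooms is not, since interior size should matter for price. A likely explanation is multicollinearity: bldarea and rooms are probably highly correlated, which inflates their standard errors. The F test only gives an overall result: it tells us whether at least one coefficient differs from zero, i.e. whether at least one independent variable affects the dependent variable. To know exactly which variables matter, each coefficient has to be tested individually.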
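
The individual t-tests the question asks for can be sketched as follows. The estimates and standard errors are taken from the regression output; the critical value 1.96 is the large-sample two-sided 5% value, used here as an assumption in place of an exact t-table lookup, and the names ESTIMATES, T_CRITICAL and t_ratio are illustrative.

```python
# (estimate, standard error) pairs from the regression output in the question.
ESTIMATES = {
    "houseage": (-759.92, 89.67),
    "lotsize": (0.24, 0.115),
    "bldarea": (354.35, 265.39),
    "rooms": (12015.61, 8516.47),
}

# Assumed two-sided 5% critical value (large-sample approximation).
T_CRITICAL = 1.96


def t_ratio(estimate, std_error):
    """t statistic for H0: coefficient = 0."""
    return estimate / std_error


results = {}
for name, (estimate, std_error) in ESTIMATES.items():
    t = t_ratio(estimate, std_error)
    significant = abs(t) > T_CRITICAL
    results[name] = (t, significant)
    print(f"{name:9s} t = {t:6.2f}  significant at 5%: {significant}")
```

Under these assumptions houseage and lotsize are flagged as significant, while bldarea and rooms are not.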

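6)    Construct a 95% confidence interval for houseage in the model above.  What is this measure telling us?  How will consistency in your OLS estimation affect your confidence intervals?

Sol: - The confidence interval is given by

            CI (Coeff) = Est(Coeff) ± t(α/2, n-k-1)*SE(Coeff)

So,       CI (houseage) = -759.92 ± t(0.05/2, 2691-4-1)*89.67

= -759.92 ± 1.960848*89.67

= (-935.7492, -584.0908)

The interval tells us that, in repeated sampling, 95 out of 100 intervals constructed this way would contain the true coefficient on houseage. Since the interval does not contain 0, the coefficient is significantly different from zero at the 5% level. If OLS is consistent, the estimate converges to the true coefficient as the sample grows and the standard error shrinks, so the interval becomes narrower and stays centered near the true value. If OLS is inconsistent, the interval narrows around the wrong value.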
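
The interval can be recomputed with a few lines of code, keeping the negative sign of the houseage estimate from the regression (-759.92). This is a minimal sketch: the critical value 1.960848 is the one quoted in the answer for 2686 degrees of freedom, and the function name confidence_interval is made up for illustration.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


# Critical value quoted in the answer: t(0.025, 2691 - 4 - 1).
T_CRIT = 1.960848

lower, upper = confidence_interval(-759.92, 89.67, T_CRIT)
print(f"({lower:.4f}, {upper:.4f})")
```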

7)    After thinking about your model further, you wish to add median income as a variable in your regression.  You collect data on the median income of each census tract in Massachusetts in the year 2010, and match that to your housing data.  Assuming you believe that your original form of the model suffered from omitted variable bias, in what direction would you expect your estimates to change with the inclusion of median income?  Defend your answers.

Sol: - The Gauss–Markov theorem states that OLS is the best linear unbiased estimator (BLUE) when the classical linear regression model assumptions hold. The relevant assumption here is that the error term is uncorrelated with the regressors.

Omitted variable bias violates this assumption, because the omitted variable sits in the error term and is correlated with the included regressors. The violation causes the OLS estimator to be biased and inconsistent. The direction of the bias depends on two signs: the sign of the omitted variable's effect on the dependent variable and the sign of its covariance with the included regressor. If the omitted variable is positively related both to a regressor and to the dependent variable, the OLS estimate of that regressor's coefficient is greater than the true value.

Median income very likely raises price and is plausibly positively correlated with lotsize, bldarea and rooms, so the estimates on those variables are expected to fall once median income is added. For houseage the direction depends on the sign of its correlation with median income: if higher-income tracts have newer houses (negative correlation), the original coefficient is biased downward and should move toward zero.
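
The direction argument can be written out for the textbook case of a single included regressor. The notation below is hypothetical and not from the question: x stands for one included regressor, z for median income, and the tilde marks the estimate from the regression that omits z. With several included regressors the covariance ratio is replaced by the coefficient on x from a regression of z on all included regressors, but the sign logic is the same.

```latex
% True model (hypothetical notation): x is an included regressor, z is median income.
\begin{equation}
  \text{Price}_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + u_i
\end{equation}
% When z is omitted, the OLS slope on x converges to the true slope plus a bias term.
\begin{equation}
  \operatorname{plim} \tilde{\beta}_1
    = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x_i, z_i)}{\operatorname{Var}(x_i)}
\end{equation}
% The bias is positive when \beta_2 and Cov(x, z) have the same sign,
% and negative when their signs differ.
```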

8)    Suppose you ran the same model as above only using log(price) instead of price and obtain an R2 of 0.54 and an F-statistic of 68.17.  Based on this information, are we able to say which version (level or log) of the model is better?  Explain why or why not.

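Sol: - No. R2 values are only comparable between models with the same dependent variable. In the level model R2 measures the share of the variation in price that is explained; in the log model it measures the share of the variation in log(price). Because the total variation being explained differs, the higher R2 (0.54 against 0.49) and the higher F-statistic do not by themselves show that the log model fits better. To compare the two, the predictions of the log model would have to be converted back to price levels, or the choice should rest on other grounds such as theory and the behavior of the residuals.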
