


Question

Data were collected at a large university on all first-year computer science majors in a particular year. The purpose of the study was to attempt to predict success in the early university years. One measure of success was the cumulative grade point average (GPA) after three semesters. Explanatory variables under study were average high school grades in mathematics (HSM), science (HSS), and English (HSE). We also include SAT mathematics (SATM) and SAT verbal (SATV) scores as explanatory variables. The SAS output below relates to this problem. (a) Write down the model under consideration, including all assumptions. (b) Describe how each assumption would be investigated and what course of action might be taken if an assumption were not met. (c) Based on the SAS output, indicate which explanatory variables seem to be significant predictors of GPA for this population of students. (d) (i) Briefly discuss the difference between the "Type I SS" and "Type III SS" (Sums of Squares) presented in the SAS output below. (ii) Hence explain why SATM (satm in the SAS output below) is significant under Type I Sums of Squares and not significant under Type III Sums of Squares.

Dependent Variable: gpa

Explanation / Answer

(a)

The model under consideration is

gpa = β0 + β1 satm + β2 satv + β3 hsm + β4 hss + β5 hse + ε, where the errors ε are independent N(0, σ²).

The fitted equation from the SAS output is

gpa = 0.326718739 + 0.0009435925 satm - 0.0004078495 satv + 0.1459610795 hsm + 0.0359053199 hss + 0.0552925813 hse

The assumptions for performing a linear regression are:

1. Model is linear in parameters
2. The data are a random sample of the population (the errors are statistically independent of one another)
3. The expected value of the errors is always zero
4. The independent variables are not too strongly collinear
5. The residuals have constant variance
6. The errors are normally distributed
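As a sketch, a model of the form in (a) can be fitted by ordinary least squares. Everything below is illustrative only: the data rows are invented (the real analysis used SAS on the full first-year cohort), and the solver is a bare-bones normal-equations routine rather than PROC GLM.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    k = len(X[0])
    # Build X'X and X'y.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

# Hypothetical rows: [1 (intercept), satm, satv, hsm, hss, hse]
X = [[1, 640, 530, 9, 8, 7],
     [1, 600, 590, 8, 9, 8],
     [1, 700, 610, 10, 9, 9],
     [1, 560, 480, 7, 7, 6],
     [1, 660, 550, 9, 10, 8],
     [1, 580, 500, 8, 8, 7],
     [1, 720, 620, 10, 10, 9]]
y = [2.8, 2.6, 3.4, 2.1, 3.0, 2.5, 3.5]

beta = ols(X, y)                                  # one coefficient per column
fitted = [sum(c * v for c, v in zip(beta, row)) for row in X]
```

With an intercept in the model, the residuals from a fit like this sum to zero, which is a quick sanity check on the solver.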

b)

1. Model is linear in parameters -

Diagnosis - The model will fit the data poorly, and the residuals will show a systematic pattern (e.g. curvature) when plotted against the fitted values.

Solution - Use non-linear regression

2. The data are a random sample of the population (The errors are statistically independent from one another).

Diagnosis - Look for correlation between the residuals and a variable not in the model; that is, check whether the residuals are dominated by an omitted variable, Z, which is not random with respect to the included independent variables.

Solution - Add the variable to the model

3. The expected value of the errors is always zero.

Diagnosis - Look at the curvature of the plot of observed vs. predicted Y. Ideally the plot should be linear.

Solution - Try transforming the independent variables


4. The independent variables are not too strongly collinear

Diagnosis - Look for correlations among independent variables. The correlations should not be high.

Solution - Remove statistically redundant variables
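The collinearity check can be sketched numerically: compute the pairwise correlation between predictors and flag values near ±1. The two columns below are hypothetical stand-ins for satm and hsm, not the study's data.

```python
from statistics import mean

satm = [640, 600, 700, 560, 660, 580, 720]   # hypothetical SAT math scores
hsm = [9, 8, 10, 7, 9, 8, 10]                # hypothetical HS math grades

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    ub, vb = mean(u), mean(v)
    num = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
    den = (sum((ui - ub) ** 2 for ui in u) *
           sum((vi - vb) ** 2 for vi in v)) ** 0.5
    return num / den

r = corr(satm, hsm)   # a value close to 1 signals strong collinearity
```

In this made-up example the two predictors are strongly correlated, so one of them would be a candidate for removal.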

5. The residuals have constant variance

Diagnosis - Plot residuals against fitted values. Ideally, there should be no pattern in the plot.

Solution - Transform the dependent variable


6. The errors are normally distributed

Diagnosis - examine QQ plot of residuals

Solution - Transform the dependent variable
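Two of the diagnostics above (constant variance and normality) can be sketched with Python's standard library; the original analysis used SAS. The data are hypothetical, and a single predictor stands in for the full model since the checks work the same way.

```python
from statistics import NormalDist, mean

x = [7, 8, 8, 9, 9, 10, 10, 6, 7, 9]                     # e.g. a HS grade
y = [2.2, 2.5, 2.4, 2.9, 3.0, 3.3, 3.4, 2.0, 2.3, 2.8]   # GPA

# Simple least-squares line y = a + b*x.
xb, yb = mean(x), mean(y)
b = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) /
     sum((xi - xb) ** 2 for xi in x))
a = yb - b * xb

fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    ub, vb = mean(u), mean(v)
    num = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
    den = (sum((ui - ub) ** 2 for ui in u) *
           sum((vi - vb) ** 2 for vi in v)) ** 0.5
    return num / den

# Assumption 5: residuals vs fitted should show no pattern. The linear part
# of any pattern is the correlation, which OLS forces to be exactly zero;
# the visual check looks for non-linear patterns (e.g. a funnel shape).
r_rf = corr(resid, fitted)

# Assumption 6: the numbers behind a QQ plot -- sorted residuals paired
# with theoretical normal quantiles. A roughly straight pairing supports
# normality.
n = len(resid)
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
qq_pairs = list(zip(sorted(resid), theoretical))
```

If the QQ pairing bends away from a straight line, or the residual spread changes with the fitted values, transforming the dependent variable is the usual first remedy, as noted above.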

c)

Looking at the regression output, Pr > |t| is less than 0.05 only for the variable hsm.

So, hsm is the only significant predictor of GPA in this model.

d)

(i)

The Type I (sequential) SS for each variable is the reduction in the error SS when that variable is added to a model already containing all variables listed before it, in the order they appear in the model statement.
The Type III (marginal) SS is the sum of squares that would be obtained for each variable if it were entered last into the model. That is, the effect of each variable is evaluated after all other variables have been accounted for.
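The sequential-vs-marginal distinction can be sketched numerically. The data below are made up, with two correlated predictors x1 and x2 playing the roles of satm and hsm; the Type I SS for x1 is the drop in error SS when x1 enters first, and the Type III SS is the drop when x1 enters last.

```python
def sse(cols, y):
    """Residual sum of squares after regressing y on the given predictor
    columns plus an intercept, via the normal equations."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):                      # Gaussian elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return sum((yi - sum(bc * xc for bc, xc in zip(beta, row))) ** 2
               for yi, row in zip(y, X))

x1 = [640, 600, 700, 560, 660, 580, 720]     # strongly correlated with x2
x2 = [9, 8, 10, 7, 9, 8, 10]
y = [2.8, 2.6, 3.4, 2.1, 3.0, 2.5, 3.5]

sse_none = sse([], y)                         # intercept-only error SS
type1_x1 = sse_none - sse([x1], y)            # x1 entered first
type3_x1 = sse([x2], y) - sse([x1, x2], y)    # x1 entered last
```

Because x1 and x2 are strongly correlated, the Type I SS for x1 is much larger than its Type III SS: most of the variation x1 explains on its own is already captured once x2 is in the model. This is exactly the pattern satm shows in the SAS output.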

The differences between the Type I and Type III SS for each variable are interpreted below.

For satm: the Type I SS is 8.5829 (the reduction in error SS when satm is the first variable entered), while the Type III SS is only 0.9280 (the reduction when satm is entered after satv, hsm, hss and hse). The difference, 8.5829 - 0.9280 = 7.6549, is variation in GPA that satm shares with the other explanatory variables.

For satv: the Type I SS is 0.0009 (satv entered after satm), while the Type III SS is 0.2326 (satv entered last). Here the Type III SS is larger (difference 0.0009 - 0.2326 = -0.2317), which can happen when a variable explains more of the response once the other variables are controlled for.

For hsm: the Type I SS is 17.726 (hsm entered after satm and satv), while the Type III SS is 6.7724 (hsm entered last). The difference, 17.726 - 6.7724 = 10.9536, is variation shared with the other explanatory variables.

For hss: the Type I SS is 1.3765 (hss entered after satm, satv and hsm), while the Type III SS is 0.4421 (hss entered last). The difference is 1.3765 - 0.4421 = 0.9344.

For hse: the Type I SS and the Type III SS are both 0.9568, so the difference is 0. This is expected: hse is the last variable in the model order, so its Type I SS (entered after all the others) and its Type III SS (entered last) are the same quantity by definition.

ii)

satm is significant under the Type I sums of squares because, entered first, it reduces the error SS by 8.5829, a large share of the variation in GPA. Under the Type III sums of squares, satm reduces the error SS by only 0.9280 once satv, hsm, hss and hse are already in the model.
About 7.6549/8.5829 = 89% of the variation that satm explains on its own is also explained by the other variables in the model. So after those variables are accounted for, satm's remaining contribution is too small to be significant under the Type III sums of squares.