
Use a statistical software package to determine the multiple regression equation


Question

(a) Use a statistical software package to determine the multiple regression equation. Discuss each of the variables. For example, are you surprised that the regression coefficient for ERA is negative? Is the number of wins affected by whether the team plays in the National or the American League?

(b) Find the coefficient of determination for this set of independent variables.

(c) Develop a correlation matrix. Which independent variables have strong or weak correlations with the dependent variable? Do you see any problems with multicollinearity?

(d) Conduct a global test on the set of independent variables. Interpret.

(e) Conduct a test of hypothesis on each of the independent variables. Would you consider deleting any of the variables? If so, which ones?

(f) Rerun the analysis until only significant net regression coefficients remain in the analysis. Identify these variables.

(g) Develop a histogram of the residuals from the final regression equation developed in part (f). Is it reasonable to conclude that the normality assumption has been met?

(h) Plot the residuals against the fitted values from the final regression equation developed in part (f). Plot the residuals on the vertical axis and the fitted values on the horizontal axis.

Data table (cell values not recoverable from the extracted text): one row per team, with columns Won, Lost, Runs, Hits, Doubles, Triples, Home Runs, Runs Batted In, Earned Run Average, Strike Outs, Walks. Teams listed: Arizona, Atlanta, Baltimore, Boston, Chicago Cubs, Chicago Sox, Cincinnati, Cleveland, Detroit, Florida, Houston, Kansas City, LA Angels, LA Dodgers, Milwaukee, Minnesota, NY Mets, NY Yankees, Oakland, Philadelphia, Pittsburgh, San Diego, San Francisco, Seattle, St. Louis, Tampa Bay, Texas.

Explanation / Answer

Let the number of games won be the dependent variable (Y). The independent variables are:

team batting average (BA), number of stolen bases (SB), number of errors committed (errors), team ERA, number of home runs (HR), and whether the team plays in the American or the National League (league).

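(a) The multiple regression equation obtained is

Y = 39.7100 + .0713(league) - 16.9762(ERA) + 392.363(BA) + .1145(HR) + .0189(SB) - .0991(errors)

Each coefficient is interpreted with the other variables held constant. Batting average is measured on a 0-to-1 scale, so the coefficient of 392.363 means that an increase of .010 in team BA adds about 3.9 wins. Each additional home run adds .1145 wins and each stolen base adds .0189 wins. Each additional error lowers the expected number of wins by .0991, because the coefficient is negative. An increase of one run in ERA lowers the expected number of wins by 16.9762; the negative sign is not surprising, because a higher ERA means the pitching staff allows more runs. The league coefficient of .0713 is very small, which suggests that the league a team plays in has almost no effect on wins.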
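To make the interpretation of the coefficients concrete, the fitted equation can be evaluated directly. The sketch below copies the coefficients from the equation above; the example team and its statistics are invented for illustration, and the choice of 1 for the league indicator is an assumption, since the source does not say which league is coded 1.

```python
# Coefficients copied from the fitted equation in part (a).
INTERCEPT = 39.7100
COEFFICIENTS = {
    "league": 0.0713,
    "ERA": -16.9762,
    "BA": 392.363,
    "HR": 0.1145,
    "SB": 0.0189,
    "errors": -0.0991,
}


def predict_wins(team):
    """Return predicted wins for a dict of team statistics."""
    return INTERCEPT + sum(COEFFICIENTS[name] * team[name] for name in COEFFICIENTS)


# Hypothetical team, invented for illustration only (not a row of the data set).
example_team = {"league": 1, "ERA": 4.00, "BA": 0.270, "HR": 180, "SB": 100, "errors": 100}
baseline = predict_wins(example_team)

# Raise BA by .010 while holding every other variable fixed.
better_hitting = dict(example_team, BA=0.280)
ba_effect = predict_wins(better_hitting) - baseline

print(round(baseline, 2))   # about 90.4 predicted wins
print(round(ba_effect, 2))  # about 3.92 extra wins from +.010 BA
```

The difference between the two predictions is just 392.363 × .010, which is why a realistic change in batting average is worth a few wins rather than hundreds.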

(b) Coefficient of determination for this set of independent variables

Coefficient of determination: R² = .866, meaning that 86.6% of the variation in the number of games won is explained by this set of independent variables.
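For reference, the coefficient of determination is one minus the ratio of the unexplained variation (SSE) to the total variation (SST). A minimal sketch of that calculation follows; the observed and fitted values are made up to keep the arithmetic easy to follow and are not taken from the baseball data.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(observed) / len(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in observed)             # total variation
    return 1 - sse / sst


# Made-up values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(observed, fitted)
print(round(r2, 3))  # 0.98
```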

(c) Correlation matrix


          Wins    League   ERA     BA      HR      SB      Errors
Wins      1
League    .049    1
ERA       -.681   .145     1
BA        .461    .224     .058    1
HR        .438    .114     .087    .317    1
SB        .034    .270     -.203   -.176   -.308   1
Errors    -.634   -.016    .480    -.166   -.279   -.133   1

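ERA (-.681) and Errors (-.634) have the strongest correlations with the dependent variable, Wins. BA (.461) and HR (.438) are moderately correlated with Wins, while League (.049) and SB (.034) are only weakly correlated with it. Among the independent variables, the largest correlation is between ERA and Errors (.480). That correlation is statistically significant, so some multicollinearity may be present, but it is below the common rule-of-thumb range of -.70 to .70, so it is unlikely to be a severe problem.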
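The reading of the correlation matrix can be checked mechanically. The sketch below stores the lower triangle reported above, ranks the independent variables by the absolute size of their correlation with Wins, and flags any pair of independent variables whose correlation falls outside the -.70 to .70 range. The .70 cutoff is a common rule of thumb rather than a formal test, and the variable names in the code are just labels chosen for this example.

```python
VARIABLES = ["Wins", "League", "ERA", "BA", "HR", "SB", "Errors"]

# Lower triangle of the correlation matrix, row i vs. columns 0..i.
LOWER_TRIANGLE = [
    [1],
    [0.049, 1],
    [-0.681, 0.145, 1],
    [0.461, 0.224, 0.058, 1],
    [0.438, 0.114, 0.087, 0.317, 1],
    [0.034, 0.270, -0.203, -0.176, -0.308, 1],
    [-0.634, -0.016, 0.480, -0.166, -0.279, -0.133, 1],
]

# Correlation of each independent variable with Wins, strongest first.
with_wins = sorted(
    ((VARIABLES[i], LOWER_TRIANGLE[i][0]) for i in range(1, len(VARIABLES))),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

# Every pair of independent variables (row and column both exclude Wins).
predictor_pairs = [
    (VARIABLES[i], VARIABLES[j], LOWER_TRIANGLE[i][j])
    for i in range(2, len(VARIABLES))
    for j in range(1, i)
]

THRESHOLD = 0.70  # rule-of-thumb cutoff, an assumption of this sketch
flagged = [pair for pair in predictor_pairs if abs(pair[2]) > THRESHOLD]
largest = max(predictor_pairs, key=lambda pair: abs(pair[2]))

print(with_wins)  # ERA and Errors first, League and SB last
print(flagged)    # empty list: no pair exceeds the cutoff
print(largest)    # ('Errors', 'ERA', 0.48)
```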

(d) Conduct a global test on the set of independent variables. Interpret

The null hypothesis is that all of the net regression coefficients are zero (H0: β1 = β2 = ... = β6 = 0). The alternate hypothesis is that at least one of them is not zero. The significance level is .05, and the decision rule is to reject H0 if the p-value of the F statistic is less than .05. From the ANOVA table, the p-value is reported as 0 (that is, smaller than the displayed precision), which is far below .05, so the null hypothesis is rejected. The global test indicates that at least one of the independent variables has a significant relationship with the number of wins, so the regression model as a whole is useful.
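The ANOVA table itself is not reproduced here, but the global F statistic can be recovered from R² when the sample size is known. The sketch below uses R² = .866 and k = 6 independent variables from the answer; n = 30 teams is an assumption, since the source does not state the sample size explicitly.

```python
def global_f_statistic(r2, n, k):
    """F statistic for the global test, computed from R-squared.

    r2: coefficient of determination
    n:  number of observations
    k:  number of independent variables
    """
    return (r2 / k) / ((1 - r2) / (n - k - 1))


R2 = 0.866  # from part (b)
K = 6       # league, ERA, BA, HR, SB, errors
N = 30      # assumed number of teams; not stated in the source

f_stat = global_f_statistic(R2, N, K)
print(round(f_stat, 2))  # about 24.77
```

Under these assumptions the statistic has 6 and 23 degrees of freedom. The .05 critical value from an F table is roughly 2.53, so a value near 24.8 leads to the same rejection of the null hypothesis as the p-value does.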

(e) Conduct a test of hypothesis on each of the independent variables. Would you consider deleting any of the variables? If so, which ones?

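For each independent variable, the null hypothesis is that its net regression coefficient is zero. The p-values for league, SB, and errors are greater than the .05 significance level, so these coefficients are not significantly different from zero, and these three variables should be considered for deletion.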
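The individual tests compare each coefficient with its standard error. The source reports only the conclusions, not the standard errors or p-values, so the sketch below uses hypothetical standard errors purely to show the mechanics. It also assumes n = 30 and k = 6, which gives 23 degrees of freedom and a two-tailed .05 critical value of about 2.069.

```python
T_CRITICAL = 2.069  # two-tailed, alpha = .05, 23 degrees of freedom (assumes n = 30, k = 6)


def t_statistic(coefficient, standard_error):
    """Test statistic for H0: the net regression coefficient is zero."""
    return coefficient / standard_error


def is_significant(coefficient, standard_error, t_critical=T_CRITICAL):
    """Reject H0 when the absolute t statistic exceeds the critical value."""
    return abs(t_statistic(coefficient, standard_error)) > t_critical


# Coefficients are from the fitted equation; the standard errors are
# HYPOTHETICAL placeholders, not values reported in the source.
era_significant = is_significant(-16.9762, 3.0)
league_significant = is_significant(0.0713, 1.2)

print(era_significant)     # True with this hypothetical standard error
print(league_significant)  # False with this hypothetical standard error
```

With real software output, the same decision is usually read straight from the p-value column, which is the approach taken in the answer.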

