Question

Your answers should include the code used to generate results, a total of ten graphs, and written responses to questions 1(d) and 2(e).
Question 1. In this question you are required to use Monte Carlo simulation to investigate the properties of F-tests in linear regression with homoskedastic residuals.

(a) In each of 1000 iterations of a Monte Carlo simulation, do the following with n = 4 and k = 3.

(i) Generate an n × k matrix X where every element of the first column is equal to one, and the remaining elements are independent standard normal random variables.

(ii) Generate an n × 1 vector u whose elements are independent standard normal random variables.

(iii) Set y = Xβ + u, where β is a k × 1 vector with the first element equal to one and the remaining elements equal to zero.

(iv) Compute the (non-robust) F-statistic for testing the null hypothesis that the last k-1 elements of β are equal to zero.

(v) Record whether the F-statistic exceeds the 95% quantile of the F(k-1, n-k) distribution.

(vi) Record whether (k-1) times the F-statistic exceeds the 95% quantile of the χ²(k-1) distribution.
Compute the proportion of Monte Carlo replications in which the F-statistic exceeded the 95% quantile of the F(k-1, n-k) distribution, and the proportion of Monte Carlo replications in which (k-1) times the F-statistic exceeded the 95% quantile of the χ²(k-1) distribution. Call these proportions rejF(n) and rejχ²(n) respectively.

(b) Repeat part (a) using n = 5, 6, 7, ..., 50. Graph rejF(n) and rejχ²(n) as functions of n. (Hint: if you have trouble getting Julia to produce a suitable graph, try exporting the data to be graphed to a spreadsheet program with suitable graphing capabilities.)

(c) Repeat parts (a) and (b) using Student t random variables with 5 degrees of freedom in place of standard normal random variables.

(d) Interpret the graphs produced in parts (b) and (c) in words, relating what you see to what we learned about the F-statistic in class.

Any help will be appreciated!

Explanation / Answer

options(warn=1)  # print warnings as they occur rather than at the end
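The walkthrough below works through density plots, sampling distributions, and confidence intervals in R rather than answering the F-test question directly. As a starting point for part 1(a), here is a minimal R sketch of a single iteration of the simulation. The assignment asks for Julia, so treat this purely as an illustration of the logic of steps (i)-(vi); the variable names and the residual-sum-of-squares route to the F-statistic are my own choices, and you would still need to wrap it in a 1000-iteration loop and record the two rejection indicators.

# One illustrative iteration of part 1(a), with n = 4 and k = 3 (sketch only)
n <- 4; k <- 3
X <- cbind(1, matrix(rnorm(n*(k-1)), n, k-1))  # step (i): intercept column plus standard normal regressors
u <- rnorm(n)                                  # step (ii): standard normal errors
beta <- c(1, rep(0, k-1))                      # step (iii): beta = (1, 0, ..., 0)'
y <- drop(X %*% beta + u)
fit  <- lm(y ~ X - 1)                          # unrestricted OLS (X already contains the intercept)
fit0 <- lm(y ~ 1)                              # restricted model: intercept only
RSS1 <- sum(resid(fit)^2)
RSS0 <- sum(resid(fit0)^2)
Fstat <- ((RSS0 - RSS1)/(k-1)) / (RSS1/(n-k))  # step (iv): non-robust F-statistic
Fstat > qf(0.95, k-1, n-k)                     # step (v): compare with the F(k-1, n-k) critical value
(k-1)*Fstat > qchisq(0.95, k-1)                # step (vi): compare (k-1)*F with the chi-squared(k-1) critical value
# For part (c), replace rnorm(...) with rt(..., 5) in steps (i) and (ii).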

Here is the code to produce the density plots. Notice that dnorm(x) and dt(x, df) produce the densities of the standard normal and the t distribution with df degrees of freedom respectively, evaluated at the values in x. Three values of df are used: 20, 4, and 1. The t-20 curve is quite close to the normal.

x<-(-30:30)/10   # grid of points from -3 to 3 in steps of 0.1
ynorm<-dnorm(x)  # standard normal density
yt20<-dt(x,20)   # t densities with 20, 4 and 1 degrees of freedom
yt4<-dt(x,4)
yt1<-dt(x,1)
plot(x,ynorm,type="l")  # solid line: normal
lines(x,yt20,lty=2)     # dashed: t-20
lines(x,yt4,lty=3)      # dotted: t-4
lines(x,yt1,lty=4)      # dot-dash: t-1
# legend(1.5,.4,legend=c("normal","t-20 df","t-4 df","t-1 df"),lty=1:4)
text(0.8, 0.34, labels="normal", cex= 0.9, pos=3)
text(-1, 0.35, labels="t-20 df", cex= 0.9, pos=3)
arrows(-0.8,0.37,-0.5,0.35, length=.1)
text(0, 0.3, labels="t-1 df", cex= 0.9, pos=1)
text(0, 0.38, labels="t-4 df", cex= 0.9, pos=1)

# Step 2: Plot histograms of random samples drawn from these four
# distributions

# Draw samples of 1000 observations (data points) each for these 4
# distributions. rnorm(n) and rt(n,df) randomly generate n data
# points from the normal and t-df distributions.

par(mfrow=c(2,2))
xnorm<-rnorm(1000)
hist(xnorm,main="data from normal distribution")
xt20<-rt(1000,20)
hist(xt20,main="data from t-20 distribution")
xt4<-rt(1000,4)
hist(xt4,main="data from t-4 distribution")
xt1<-rt(1000,1)
hist(xt1,main="data from t-1 distribution")

# In principle, if the number of data points is large enough,
# histograms should resemble the density function. While this is
# reasonably true for the normal and t-20 data, it is definitely
# not true for the t-1 (Cauchy) data. Some illumination can be
# obtained by looking at boxplots.
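# Before moving on to the boxplots, one way to make the histogram-vs-
# density comparison direct is to redraw a histogram on the density
# scale and overlay the theoretical curve. (This overlay is an extra
# illustration using the xnorm sample above, not part of the original code.)

hist(xnorm, freq=FALSE, main="normal data with N(0,1) density overlaid")
curve(dnorm(x), add=TRUE)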

# Step 3: look at boxplots

par(mfrow=c(2,2))
boxplot(xnorm,main="data from normal distribution")
boxplot(xt20,main="data from t-20 distribution")
boxplot(xt4,main="data from t-4 distribution")
boxplot(xt1,main="data from t-1 distribution")

# We see that for the t-1 data the majority of the data lies in a
# small band near zero. However the t-1 distribution produces
# extreme outliers. This causes a
# histogram of t-1 data to put almost all the data in one or two
# boxes near zero and then to have many outliers
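# To put numbers on this, a quick check on the xt1 sample drawn above
# (the exact values change from run to run, but the contrast between
# the bulk and the extremes should be dramatic):

quantile(xt1, c(0.25, 0.75))  # middle 50% of the t-1 sample, typically within a unit or two of zero
range(xt1)                    # the extremes, which can be in the hundreds or larger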

# Step 4

# We now start our Monte Carlo simulations with K=5000
# iterations. Each iteration consists of a sample of
# size n=3 from a t-20, samples of size n=3 and n=25
# from a t-4, and a sample of size n=1000 from a t-1.
# We save the values of Xbar and s (the sample standard
# deviation).

xbar3.20<-rep(NA,5000)
sig3.20<-rep(NA,5000)
xbar3.4<-rep(NA,5000)
sig3.4<-rep(NA,5000)
xbar25.4<-rep(NA,5000)
sig25.4<-rep(NA,5000)
xbar1000.1<-rep(NA,5000)
sig1000.1<-rep(NA,5000)

for (i in 1:5000) {
samp<-rt(3,20)
xbar3.20[i]<-mean(samp)
sig3.20[i]<-sqrt(var(samp))
samp<-rt(3,4)
xbar3.4[i]<-mean(samp)
sig3.4[i]<-sqrt(var(samp))
samp<-rt(25,4)
xbar25.4[i]<-mean(samp)
sig25.4[i]<-sqrt(var(samp))
samp<-rt(1000,1)
xbar1000.1[i]<-mean(samp)
sig1000.1[i]<-sqrt(var(samp))
}

# The variance of a t-df distribution is sigma^2=df/(df-2) when df>2.
# See t distribution in
# stat-techniques.pdf. Since
# mean(Xbar) = mu = 0 and Var(Xbar) = sigma^2/n, if
# n is large enough for the Central Limit theorem to
# hold, we expect a normal qq plot of Xbar to look like
# a straight line with y-intercept 0 and slope sigma/sqrt(n).
# We add to our qq plots this theoretical line.

var20<-20/18  # df/(df-2) with df = 20
var4<-2       # 4/(4-2) = 2


par(mfrow=c(2,2))
qqnorm(xbar3.20,main="n=3, t-20 dist.")
abline(0,sqrt(var20/3))
qqnorm(xbar3.4,main="n=3, t-4 dist.")
abline(0,sqrt(var4/3))
qqnorm(xbar25.4,main="n=25, t-4 dist.")
abline(0,sqrt(var4/25))
qqnorm(xbar1000.1,main="n=1000, t-1 dist.")  # no theoretical line: the t-1 (Cauchy) has no finite variance


###########################################

# We see that n=3 is good enough for the t-20
# distribution but not for the t-4. For the t-4, n=25
# works fairly well, except possibly in the extreme
# tails (in comparing the two plots for t-4, it is
# important to take into account the different
# vertical scales). The CLT doesn't seem to work for
# the t-1 distribution, even when n=1000!

# Checking mean(Xbar) = mu = 0 for a sample of size n=3 from a t-20
# distribution. Note we have 5000 values of
# Xbar, as opposed to the complete sampling distribution of
# Xbar, so exact equality is not expected.

mean(xbar3.20)
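# How close to zero should this come out? A rough yardstick (an extra
# check, not part of the original exercise) is the Monte Carlo standard
# error of an average of 5000 simulated Xbar values:

sqrt(var(xbar3.20)/5000)  # mean(xbar3.20) should typically land within about two of these of zero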

# Checking Var(Xbar) = sigma^2/n

sqrt(var(xbar3.20))
sqrt(var20/3)

# Checking E(s^2) = sigma^2

var20
mean(sig3.20^2)


# These three identities seem to hold. Repeating for samples
# of size n=3 from t-4 distribution:

mean(xbar3.4)
sqrt(var(xbar3.4))
sqrt(var4/3)
var4
mean(sig3.4^2)

# for n=25 from t-4 distribution:

mean(xbar25.4)
sqrt(var(xbar25.4))
sqrt(var4/25)
var4
mean(sig25.4^2)

# for n=1000 from t-1 distribution:
# The three identities are not valid for the t-1 distribution
# because for that distribution mu and sigma do not exist.

# Therefore the following values are computed for illustration only and are not meaningful estimates:
mean(xbar1000.1)
mean(sig1000.1^2)

# If the CLT approximation holds, we expect about 95% of the
# confidence intervals Xbar +/- t_{alpha/2,n-1} * s/sqrt(n) to contain the true
# value mu=0. We will check these for our 4 groups of 5000
# simulated confidence intervals.

# when n=3, use 2 degrees of freedom (n-1)
tcrit<-qt(.975,2)
mean(abs(xbar3.20)<tcrit*sig3.20/sqrt(3))
# Answer: [1] 0.947
mean(abs(xbar3.4)<tcrit*sig3.4/sqrt(3))

# explanation of above: computes the proportion of xbar values that
# lie in the range. Recall the population mean is 0.
# example of mean operation

test_case <-c(1,2,3,4)
sum(test_case>1) # answer is 3: i.e., 3 values in vector are > 1
mean(test_case>1) # divide sum (which is 3) by 4: answer = 0.75

# Therefore the mean operation above should be approximately 0.95 if the CLT approximation holds.

# Alternative way for understanding the code
# Try it again
mean(abs(xbar3.20)<tcrit*sig3.20/sqrt(3))
# Answer: [1] 0.947

# Alternative way: build the interval around the true mean mu = 0 explicitly
sd_xbar3.20 <- sig3.20/sqrt(3)        # estimated standard error of each Xbar
upper_limit <- 0 + tcrit*sd_xbar3.20  # interval centered at the true mean mu = 0
lower_limit <- 0 - tcrit*sd_xbar3.20
indexes_of_xbar3.20_in_range <- which((xbar3.20 < upper_limit)&(xbar3.20 > lower_limit))
number_within_range <- length(indexes_of_xbar3.20_in_range)
number_within_range/5000              # same proportion as the one-line check above

# for n=25
tcrit<-qt(.975,24)
mean(abs(xbar25.4)<tcrit*sig25.4/sqrt(25))

# for n=1000, use 999 degrees of freedom (n-1)
tcrit<-qt(.975,999)
mean(abs(xbar1000.1)<tcrit*sig1000.1/sqrt(1000))

# For the sample of size n=3 from t-20 and the samples of size
# n=3 and n=25 from t-4, we estimate the true confidence level
# (also called coverage probability) to be close to 95%.
# For the t-1 intervals the estimated coverage comes out noticeably
# higher. We can form a 95% confidence interval for the true coverage
# probability as follows: we have 5000 confidence intervals, each of
# which has an unknown probability p of containing the median value
# mu = 0 (recall the Cauchy distribution has no mean). Thus if Y is the
# number of confidence intervals which contain mu = 0, Y has a
# Binomial(5000, p) distribution. In this run, phat = Y/5000 = .9808,
# and a 95% confidence interval for p is

# phat +/- 1.96 * sqrt(phat*(1 - phat)/5000) = .9808 +/- .0038

0.9808*(1 -0.9808)
# Answer: [1] 0.01883136
0.9808*(1 -0.9808)/5000
# Answer: [1] 3.766272e-06
x= 0.9808*(1 -0.9808)/5000
sqrt(x)
# Answer: [1] 0.001940689
sqrt(x)*1.96
# Answer: [1] 0.003803750
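# As a cross-check on the hand calculation above, base R's binom.test()
# gives an exact (Clopper-Pearson) interval for the same proportion.
# Assuming the run above, Y = 0.9808*5000 = 4904 of the intervals covered
# zero; the exact interval will not match the normal-approximation
# interval exactly, but it should be close.

binom.test(4904, 5000)$conf.int  # exact 95% confidence interval for the coverage probability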

# Notice that when applying section 8.1 of the R book, n refers to the
# number of Bernoulli trials, which in this case is 5000, since we have
# 5000 confidence intervals. In any case, we can be confident that the
# true confidence level of these intervals is at least .9770.
# There is a theorem that if the underlying distribution is
# symmetric, even if the CLT approximation does not hold, the true
# confidence level is at least as large as the nominal (in this
# case 95%) level. Statisticians are inherently conservative: if
# we say that the confidence level is 95% we are happier if it is
# in fact larger than 95% than if it is smaller than 95%.

# We can visualize the confidence intervals by plotting them
# vertically. The horizontal line represents the true value mu =0.
# Here is the picture for the first 100 confidence intervals
# constructed from samples of size 25 from the t-4 distribution.

tcrit<-qt(.975,24)
llim<-xbar25.4[1:100]-tcrit*sig25.4[1:100]/sqrt(25)
ulim<-xbar25.4[1:100]+tcrit*sig25.4[1:100]/sqrt(25)
matplot(rbind(1:100,1:100),rbind(llim,ulim),type="l",lty=1,ylab="confidence interval")
abline(h=0)
title("95% large sample confidence intervals for population mean",
sub="sampled from t-4 dist.,n=25")

# Learning digression for matplot
matplot(c(1,2,3),c(4,5,6))
matplot(c(1,2,3),c(4,5,6),type="l")

# help(matplot) says "plot the columns of one matrix against the columns of another."
# create matrices

x = rbind(1:100,1:100)
is.matrix(x) # True
dim(x)
# Answer: 2 100 - 2 rows and 100 columns
y = rbind(llim, ulim) # llim and ulim are each vectors with 100 elements
is.vector(llim) # True
length(llim) # 100
dim(y) # 2 100 - so matplot(x, y, type="l") draws one vertical segment per column
