(1) R has probability functions available for use. Using one distribution to approximate another is not uncommon.
(1)(a) (6 points) The normal distribution may be used to approximate the binomial distribution if np > 5 and
np(1-p) > 5. Find the following binomial probabilities using dbinom() and pbinom() with a probability, p =
0.5, and n = 100. Then, estimate the same probabilities using the normal approximation with continuity
correction and pnorm().
(i) The probability of exactly 50 successes.
(ii) The probability of fewer than 40 successes.
(iii) The probability of 60 or more successes.
(1)(b) (4 points) With n = 100 and p = 0.02, use the binomial probabilities from dbinom() to calculate the
expected value and variance for this binomial distribution using the general formula for mean and variance of
a discrete distribution (To do this, you will need to use integer values from 0 to 100 as binomial outcomes
along with the corresponding binomial probability). Calculate the same using the formulae np and np(1-p).
n <- 100
p <- 0.02
n * p # expected value
##  2
n * p * (1 – p) # variance
##  1.96
(2) A recurring problem in statistics is the identification of outliers. This problem involves
plotting data to display outliers, and then classiying them.
(2)(a) (5 points) Generate a random sample, “x”, of 100 values using set.seed(123) and rexp(n = 100, rate =
1). Do not change this number. If you must draw another sample, start the process with set.seed(123) to
maintain comparability with the answer sheet. Present “x” in side-by-side box- and QQ-plots, using boxplot()
and qqnorm(), qqline(). Use boxplot.stats() and/or logical statements to identify the extreme outliers, if any.
(2)(b) (5 points) Transform the random sample, “x”, generated in (a), to form a different variable, designated
“y”, using the Box-Cox Transformation: y = 3*(xˆ(1/3)) – 1). Display the values for “y” as in (a) and identify
(3) Performing hypothesis tests using random samples is fundamental to statistical inference.
The first part of this problem involves comparing two different diets. Using “ChickWeight”
data available in the base R, “datasets” package, execute the following code to prepare a data
frame for analysis.
# load “ChickWeight” dataset
# Create T | F vector indicating observations with Time == 21 and Diet == “1” OR “3”
index <- ChickWeight$Time == 21 & (ChickWeight$Diet == “1” | ChickWeight$Diet == “3”)
# Create data frame, “result,” with the weight and Diet of those observations with “TRUE” “index”” values
result <- subset(ChickWeight[index, ], select = c(weight, Diet))
# Encode “Diet” as a factor
result$Diet <- factor(result$Diet)
## Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and ‘data.frame’: 26 obs. of 2 variables:
## $ weight: num 205 215 202 157 223 157 305 98 124 175 …
## $ Diet : Factor w/ 2 levels “1”,”3″: 1 1 1 1 1 1 1 1 1 1 …
The data frame, “result”, will have chick weights for two diets, identified as diet “1” and “3”.
Use the data frame, “result,” to complete the following item.
(3)(a) (4 points) Use the “weight” data for the two diets to test the null hypothesis of equal population
weights for the two diets. Test at the 95% confidence level with a two-sided t-test. This can be done using
t.test() in R. Assume equal variances. Display the results.
Working with paired data is another common statistical activity. The “ChickWeight” data
will be used to illustrate how the weight gain from week 20 to 21 may be analyzed. Use the
following code to prepare pre- and post-data from Diet == “3” for analysis.
# load “ChickWeight” dataset
# Create T | F vector indicating observations with Diet == “3”
index <- ChickWeight$Diet == “3”
# Create vector of “weight” for observations where Diet == “3” and Time == 20
pre <- subset(ChickWeight[index, ], Time == 20, select = weight)$weight
# Create vector of “weight” for observations where Diet == “3” and Time == 21
post <- subset(ChickWeight[index, ], Time == 21, select = weight)$weight
(3)(b) (6 points) Conduct a paired t-test and construct a two-sided, 95% confidence interval for the average
weight gain from week 20 to week 21. Do not use t.test(). Write the code for determination of the
confidence interval endpoints. Present the resulting interval.
(4) Statistical inference depends on using a sampling distribution for a statistic in order to make
confidence statements about unknown population parameters. The Central Limit Theorem is
used to justify use of the normal distribution as a sampling distribution for statistical inference.
Using Nile River flow data from 1871 to 1970, this problem demonstrates sampling distribution
convergence to normality. Use the code below to prepare the data.
(4)(a) (3 points) Using Nile River flow data and the “moments” package, calculate skewness and kurtosis.
Present side-by-side displays using qqnorm(), qqline() and boxplot(); i.e par(mfrow = c(1, 2)). Add features
to these displays as you choose.
(4)(b) (3 points) Using set.seed(124) and the Nile data, generate 1000 random samples of size n = 16, with
replacement. For each sample drawn, calculate and store the sample mean. This will require a for-loop and
use of the sample() function. Label the resulting 1000 mean values as “sample1”. Repeat these steps
using set.seed(127) – a different “seed” – and samples of size n = 64. Label these 1000 mean values
as “sample2”. Compute and present the mean value, sample standard deviation and sample variance for
“sample1” and “sample2”.
(4)(c) (4 points) Using “sample1” and “sample2”, present separate histograms with the normal density curve
superimposed (use par(mfrow = c(2, 1))). To prepare comparable histograms it will be necessary to use “freq
= FALSE” and to maintain the same x-axis with “xlim = c(750, 1050)”, and the same y-axis with “ylim
= c(0, 0.025).” To superimpose separate density functions, you will need to use the mean and
standard deviation for each “sample” – each histogram – separately.
(5) This problem deals with 2 x 2 contingency table analysis. This is an example of categorical
data analysis. The method shown in this problem can be used to
screen data for potential predictors that may be used in building a model.
The “Seatbelts” dataset contains monthly road casualties in Great Britain, 1969 to 1984. Use
the code below to organize the data and generate two factor variables: “killed” and “month”.
These variables will be used for contingency table analysis.
Seatbelts <- as.data.frame(Seatbelts)
Seatbelts$Month <- seq(from = 1, to = nrow(Seatbelts))
Seatbelts <- subset(Seatbelts, select = c(DriversKilled, Month))
## DriversKilled Month
## Min. : 60.0 Min. : 1.00
## 1st Qu.:104.8 1st Qu.: 48.75
## Median :118.5 Median : 96.50
## Mean :122.8 Mean : 96.50
## 3rd Qu.:138.0 3rd Qu.:144.25
## Max. :198.0 Max. :192.00
killed <- factor(Seatbelts$DriversKilled > 118.5, labels = c(“below”, “above”))
month <- factor(Seatbelts$Month > 96.5, labels = c(“below”, “above”))
(5)(a) (3 points) Using “Seatbelts,” generate a scatterplot of the variables DriversKilled versus Month. This
is a time series, and Seatbelts$Month should be on the horizontal axis. Show vertical and horizontal lines to
indicate the median of each variable. Label as desired.
(5)(b) (2 points) A chi-square test of independence will be used to test the null
hypothesis that the factor variables, “killed” and “month”, are independent. Use table() to generate a 2 x
2 contingency table showing the fatality count classified by “killed” and “month”. Use the uncorrected
chisq.test() to test the null hypothesis that “killed” and “month” are independent at the 95% confidence level.
Present these results.
(5)(c) (5 points) Write a function that computes the uncorrected Pearson Chi-squared statistic based on
the a 2 x 2 contingency table with margins added (check Davies, Section 11.1.1, pp. 216-219, and Kabacoff,
Section 20.1.3, pp. 473-474). Add margins to the contingency table from (b) using the function addmargins().
Submit this augmented table to the function you have written. Compare the result with (b). Your function
should duplicate and output the X-squared value (chi-squared) and p-value. Present both results.
The statements shown below calculate the expected value for each cell in an augmented contingency table
with margins added. Using these statements, the Pearson Chi-square statistic may be calculated. Other
approaches are acceptable.
e11 <- x[3, 1] * x[1, 3] / x[3, 3], e12 <- x[3, 2] * x[1, 3] / x[3, 3], e21 <- x[3, 1] * x[2, 3] / x[3, 3], e22 <- x[3,
2] * x[2, 3] / x[3, 3]
# Write function for computing uncorrected Pearson Chi-squared statistic and associated p-value
We value our customers and so we ensure that what we do is 100% original..
With us you are guaranteed of quality work done by our qualified experts.Your information and everything that you do with us is kept completely confidential.
You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.Read more
The Product ordered is guaranteed to be original. Orders are checked by the most advanced anti-plagiarism software in the market to assure that the Product is 100% original. The Company has a zero tolerance policy for plagiarism.Read more
The Free Revision policy is a courtesy service that the Company provides to help ensure Customer’s total satisfaction with the completed Order. To receive free revision the Company requires that the Customer provide the request within fourteen (14) days from the first completion date and within a period of thirty (30) days for dissertations.Read more
The Company is committed to protect the privacy of the Customer and it will never resell or share any of Customer’s personal information, including credit card data, with any third party. All the online transactions are processed through the secure and reliable online payment systems.Read more
By placing an order with us, you agree to the service we provide. We will endear to do all that it takes to deliver a comprehensive paper as per your requirements. We also count on your cooperation to ensure that we deliver on this mandate.Read more