denotes a vector with a single value (`2.114`), stored at index `1`. Since
there is no concept of a "scalar" in R, the command `x[[1]]` is equivalent to
`x[1:1]`, which is in turn the same as `x[1]`. In other programming languages,
`x[[1]]` ("the first element of `x`") and `x[1]` ("the subvector of `x`
starting and ending at index `1`") would be different; in R, they are
identical. Because of this, most people use single brackets rather than double
brackets when indexing vectors, writing `x[1]` and `x[5]` instead of `x[[1]]`
and `x[[5]]`.

--------------------------------------------------------------------------------

Plots and Descriptive Statistics
================================

Descriptive Statistics for Numeric Variables
--------------------------------------------

There are a variety of functions for computing descriptive statistics for the
values stored in a vector.

* Sum of values:

```{r}
sum(x)
```

* Measures of central tendency (sample mean and median):

```{r}
mean(x)
median(x)
```

* Measures of variability (sample standard deviation and sample variance):

```{r}
sd(x)
var(x)
```

* Extreme values (minimum and maximum):

```{r}
min(x)
max(x)
```

* Quantiles:

```{r}
quantile(x, .25)  # first quartile
quantile(x, .75)  # third quartile
quantile(x, .99)  # 99th percentile
```

Plots for Numeric Variables
---------------------------

We can use the `hist` command to make a histogram of the values stored in a
vector:

```{r}
hist(x)
```

By default, the output looks fine when printed in black and white, but it
isn't very pretty.
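All of the commands in this section only need a numeric vector. If you do not
have the `bikedata` set loaded, you can experiment on simulated data instead;
here is a minimal sketch (the simulated `x` is an assumption standing in for
the real passing-distance vector) that also confirms how the statistics above
relate to one another:

```{r}
set.seed(1)                        # simulated stand-in for the real data
x <- rnorm(100, mean = 1.5, sd = 0.4)

identical(x[1], x[[1]])            # single and double brackets agree
all.equal(mean(x), sum(x) / length(x))          # mean = sum / count
all.equal(unname(quantile(x, .50)), median(x))  # median = 50th percentile
hist(x)
```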
We can specify the bar color, change the axis labels, and omit the main title
by passing additional arguments to the `hist` function:

```{r, tidy=FALSE}
hist(x, col="steelblue",
     xlab="Passing Distance", ylab="Count",
     main="")
```

Use the `boxplot` and `qqnorm` commands to make boxplots and normal
probability plots, as in the following examples:

```{r, tidy=FALSE}
boxplot(x, border="darkred", ylab="Passing Distance")
qqnorm(x, col="darkgreen",
       xlab="Normal Quantiles",
       ylab="Passing Distance Quantiles",
       main="")
```

Scatterplots
------------

We can make a scatter plot using the `plot` command. For example, to plot
passing distance versus distance to the kerb, run the command

```{r, tidy=FALSE}
plot(bikedata$kerb, bikedata$passing.distance,
     xlab="Distance to Kerb",
     ylab="Passing Distance")
```

To connect the points with lines, use the `type="l"` argument. For example

```{r, tidy=FALSE}
t <- 1:10
plot(t, t^2, type="l")
```

We can use the `lty` and `col` arguments to change the style and color of the
line:

```{r, tidy=FALSE}
t <- 1:10
plot(t, t^2, type="l", lty=2, col="green")
```

Categorical Variables
---------------------

So far, we have seen how to use R to summarize and plot a numeric
(quantitative) variable. R also has very good support for categorical
(qualitative) variables, referred to as *factors*. To see the *levels*, the
set of possible values for a factor variable, use the `levels` function. For
example, to see the levels of the `colour` variable:

```{r}
levels(bikedata$colour)
```

To tabulate the values of the variable, use the `table` command, as in

```{r}
table(bikedata$colour)
```

Note: by default, the `table` command omits missing values.
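Here is a small self-contained illustration of that behaviour, using a
made-up `colour` factor (an assumption for illustration, not the real
variable):

```{r}
colour <- factor(c("Blue", "Red", "Blue", NA, "Red", "Blue"))
levels(colour)   # "Blue" "Red"
table(colour)    # counts sum to 5, not 6: the NA is silently dropped
```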
To include these values in the output, include `useNA="ifany"` in the call to
`table`:

```{r}
table(bikedata$colour, useNA="ifany")
```

We can present tabulated counts in a bar plot using the following commands:

```{r}
tab <- table(bikedata$colour, useNA="ifany")
barplot(tab)
```

Usually, it makes sense to arrange the table values in decreasing order. Here
is an example with sorted counts that adds axis labels and changes the bar
colors:

```{r, tidy=FALSE}
barplot(sort(tab, decreasing=TRUE),
        xlab="Colour", ylab="Count",
        col="steelblue")
```

--------------------------------------------------------------------------------

Inference
=========

Inference for a Population Mean
-------------------------------

We can use the `t.test` function to test a hypothesis about a population
mean.

```{r}
t.test(bikedata$passing.distance)
```

This reports the t statistic, the degrees of freedom, the p-value, and the
sample mean. The command also reports a 95% confidence interval for the
population mean. To change the confidence level, use the `conf.level`
argument, as in

```{r}
t.test(bikedata$passing.distance, conf.level=0.99)
```

By default, the null hypothesis is that the true (population) mean is equal
to 0, and the alternative hypothesis is that the true mean is not equal to 0.
To use a different null, pass the `mu` argument. To use a different
alternative, pass `alternative="less"` or `alternative="greater"`. For
example, to test the null hypothesis that the true mean is equal to 1.5
against the alternative that it is greater, run the command

```{r}
t.test(bikedata$passing.distance, alternative="greater", mu=1.5)
```

Note that for a one-sided alternative, the confidence interval is one-sided,
as well.

Inference for a Population Proportion
-------------------------------------

To perform a test on a population proportion, use the `prop.test` function.
This performs a test on a population proportion that is slightly different
from the one we cover in the core statistics course, but it will give you a
very similar answer. In the first argument, specify `x`, the number of
successes; in the second argument, specify `n`, the number of trials. By
default, the null value of the population proportion is `0.5`; to specify a
different value, use the `p` argument.

For example, to test the null hypothesis that the true proportion of blue
cars passing the rider on his route is exactly equal to 40%, we first
tabulate the `colour` variable:

```{r}
table(bikedata$colour)
```

In this instance, the number of "successes" is equal to the number of blue
cars, `636`. Recall that some of the values for the `colour` variable are
missing. If the missingness is unrelated to the actual color, then we can
safely ignore these values; in this case, the number of "trials" is equal to
the sum of the counts for all of the non-missing values.

```{r}
sum(table(bikedata$colour))
```

Now, to test the proportion, we run the command:

```{r}
prop.test(636, 2341, p=0.40)
```

As with the `t.test` function, we can use a one-sided alternative or specify
a different confidence level for the interval by using the `alternative` or
`conf.level` argument, respectively. Here is a test of the null that the true
proportion is equal to `0.5` against the alternative that it is less, along
with a one-sided 99% confidence interval:

```{r}
prop.test(636, 2341, p=0.50, alternative="less", conf.level=0.99)
```

--------------------------------------------------------------------------------

Linear Regression
=================

Model Fitting
-------------

We fit a linear regression model using the `lm` command. For example, suppose
we want to fit a model with response variable `sqrt(passing.distance)` and
predictors `helmet`, `vehicle`, and `kerb`.
We would use the following command:

```{r}
model <- lm(sqrt(passing.distance) ~ helmet + vehicle + kerb,
            data = bikedata)
```

The formula syntax `y ~ x1 + x2 + x3` means "use `y` as the response
variable; use `x1`, `x2`, and `x3` as predictor variables; and include an
intercept in the model."

We can either store all of the variables in a data frame and use the `data`
argument as above, or we can extract the variables first. For example, we
could run the following commands to fit an equivalent model:

```{r}
sqrt.passing.distance <- sqrt(bikedata$passing.distance)
helmet <- bikedata$helmet
vehicle <- bikedata$vehicle
kerb <- bikedata$kerb
model1 <- lm(sqrt.passing.distance ~ helmet + vehicle + kerb)
```

In the latter instance, there is no need to pass the `data` argument to `lm`;
the response and the predictor variables already exist in the environment.

Inference for Regression Parameters
-----------------------------------

Once we have a fitted model, we can get the coefficient estimates, their
standard errors, t statistics, and p-values with the `summary` command.

```{r}
summary(model)
```

This also reports some information about the residuals, along with the R^2
and the adjusted R^2 values.

To get a confidence interval for a population regression coefficient, use the
`confint` command. For example, we can get a 99% confidence interval for the
coefficient of `helmetY` with

```{r}
confint(model, "helmetY", 0.99)
```

Regression Diagnostics
----------------------

Plotting a fitted model shows us regression diagnostics:

```{r}
plot(model)
```

We can extract the raw or standardized residuals with the `residuals` or
`rstandard` command.
For example, to make a scatter plot of the residuals versus `kerb`, use

```{r}
plot(bikedata$kerb, residuals(model))
```

For standardized residuals, use

```{r}
plot(bikedata$kerb, rstandard(model))
```

Here is a boxplot of the standardized residuals versus `helmet`:

```{r}
boxplot(rstandard(model) ~ bikedata$helmet)
```

Forecasting
-----------

We use the `predict` command to forecast the response values for new
observations. To use this command, we first make a data frame with the
predictors for the new observations; then, we pass this data frame and the
fitted model to the `predict` command.

Suppose we want to get a prediction of `sqrt(passing.distance)` when riding
with a helmet, being passed by an SUV, and having a kerb distance of 1.2
meters. In this case, we run the commands

```{r}
newdata <- data.frame(helmet="Y", vehicle="SUV", kerb=1.2)
predict(model, newdata)
```

We can also ask for the standard error of the fit:

```{r}
predict(model, newdata, se.fit = TRUE)
```

We can get a 95% confidence interval for the mean with

```{r}
predict(model, newdata, interval = "confidence", level = 0.95)
```

Here, the reported confidence interval is approximately `(1.157, 1.209)`.

Finally, we can get a prediction interval for the response with

```{r}
predict(model, newdata, interval = "prediction", level = 0.95)
```

Here, the reported prediction interval is approximately `(0.897, 1.469)`.
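One caveat: because the model was fit to `sqrt(passing.distance)`, both
intervals are on the square-root scale. Since the square root is a monotone
transformation, squaring the endpoints of the prediction interval gives a
prediction interval in the original units (metres):

```{r}
pred.sqrt <- c(0.897, 1.469)  # prediction interval on the sqrt scale, from above
pred.sqrt^2                   # approximately (0.805, 2.158) metres
```

(The analogous back-transformation of the confidence interval describes the
mean of `sqrt(passing.distance)`, not the mean passing distance itself.)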