Homework Assignment 2:

1. [15 points] Prepare a line graph of the Amtrak ridership data from January 1991 to March 2004, with both axes labeled. Print your R command and the line graph. After observing the behavior of the ridership in your graph, answer the following questions using the relevant statistics:

In which year and month do the maximum and the minimum ridership occur?

What are the range and IQR of the ridership? Are there any outliers?
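
The statistics asked for in Question 1 can be sketched as follows. This is only an illustration under stated assumptions: it uses R's built-in AirPassengers monthly series as a stand-in (it is not the Amtrak data), and the helper `when` is a hypothetical name introduced here. With the Amtrak data you would first build a monthly series starting in January 1991.

```r
# Stand-in data: built-in monthly AirPassengers series (NOT the Amtrak data).
# With the real data, one would build the series with something like
# ts(values, start = c(1991, 1), frequency = 12).
rides <- AirPassengers

# Line graph with both axes labeled.
plot(rides, xlab = "Time", ylab = "Passengers (thousands)",
     main = "Monthly ridership (stand-in series)")

# Hypothetical helper: convert a position in a monthly series to year and month.
when <- function(x, i) {
  s <- start(x)                 # c(start year, start month)
  k <- (i - 1) + (s[2] - 1)     # months elapsed since January of the start year
  c(year = s[1] + k %/% 12, month = k %% 12 + 1)
}

max_when <- when(rides, which.max(rides))
min_when <- when(rides, which.min(rides))

# Spread of the series, plus the 1.5 * IQR rule that boxplots use for outliers.
ride_range <- range(rides)
ride_iqr   <- IQR(rides)
outliers   <- boxplot.stats(as.numeric(rides))$out

print(max_when)
print(min_when)
print(ride_range)
print(ride_iqr)
print(outliers)
```

An empty `outliers` vector means no observation lies more than 1.5 IQRs beyond the hinges.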

2. [15 points] Prepare a boxplot and a histogram of the ridership. Print both plots along with the R commands that generate them. What can you say about the outliers (are there any?) and about the distribution (is it symmetric or skewed and, if skewed, is it skewed to the right or to the left)?
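
A minimal sketch of the two plots in Question 2, again using the built-in AirPassengers series as a stand-in for the ridership data (an assumption made so the code runs without the course files). Comparing the mean with the median is only a rough numerical hint about skewness; the histogram itself is the better guide.

```r
# Stand-in data: built-in AirPassengers series (NOT the Amtrak data).
x <- as.numeric(AirPassengers)

# Boxplot: points drawn beyond the whiskers are flagged as outliers.
boxplot(x, horizontal = TRUE, xlab = "Passengers (thousands)",
        main = "Boxplot of ridership (stand-in series)")

# Histogram: shows the shape of the distribution.
h <- hist(x, xlab = "Passengers (thousands)",
          main = "Histogram of ridership (stand-in series)")

# Values that the boxplot would flag as outliers (may be empty).
out <- boxplot.stats(x)$out

# Rough skewness hint: a mean above the median suggests a longer right tail.
right_skew_hint <- mean(x) > median(x)

print(out)
print(right_skew_hint)
```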

3. [20 points] Use the Excel data set Pollution (its variables are explained below). Prepare a heatmap showing the correlation between every pair of numeric variables in Pollution, with the correlation coefficients printed in the cells. Print the heatmap and the R command(s) that produce it. Which variable pairs have the highest and lowest positive correlations, and which have the highest and lowest negative correlations? As a data analyst, how do you interpret the highest positive and the highest negative correlation, and do these correlations make sense?

The Pollution.xlsx data set includes regional climate, pollution, and population demographic statistics from 1960 in the United States. Below are the descriptions for these variables.

Variable   Description
PREC       Average annual precipitation in inches
JANT       Average January temperature in degrees F
JULT       Average July temperature in degrees F
OVR65      Percent population aged 65 or older
POPN       Average household size
EDUC       Median school years completed by those over 22
HOUS       Percent housing units, which are in good repair and with all facilities
DENS       Population per square mile in urbanized areas, 1960
NONW       Percent non-white population in urbanized areas, 1960
WWDRK      Percent employed in white collar occupations
POOR       Percent of families with income < $3000
HC         Relative hydrocarbon pollution potential
NOX        Relative nitric oxides pollution potential
SO2        Relative sulfur dioxide pollution potential
HUMID      Annual average % relative humidity at 1:00 pm
MORT       Total age-adjusted mortality rate per 100,000
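
The kind of heatmap Question 3 asks for can be sketched with base graphics alone. The assumptions here are loud ones: the built-in mtcars data (five of its columns) stands in for Pollution, and base `image` plus `text` is used instead of whichever heatmap function your textbook shows. The last lines list each variable pair once so the strongest positive and negative correlations are easy to read off.

```r
# Stand-in data: five numeric columns of built-in mtcars (NOT Pollution).
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
cm  <- cor(dat)
n   <- ncol(cm)

# Heatmap: row 1 of the matrix is drawn at the top, column 1 at the left.
image(1:n, 1:n, t(cm)[, n:1], zlim = c(-1, 1),
      col = colorRampPalette(c("blue", "white", "red"))(21),
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:n, labels = colnames(cm))
axis(2, at = 1:n, labels = rev(rownames(cm)), las = 1)

# Print each coefficient inside its cell.
text(x = col(cm), y = n + 1 - row(cm), labels = sprintf("%.2f", cm))

# List every pair once (upper triangle) to find the extreme correlations.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_df <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                       var2 = colnames(cm)[idx[, "col"]],
                       r    = cm[idx],
                       row.names = NULL, stringsAsFactors = FALSE)

top    <- pairs_df[which.max(pairs_df$r), ]  # strongest positive pair
bottom <- pairs_df[which.min(pairs_df$r), ]  # strongest negative pair
print(top)
print(bottom)
```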

4. [15 points] Select any five variables of Pollution and prepare a matrix scatter plot of them. Print the matrix scatter plot and the R command(s) that produce it. Interpret the diagonal and off-diagonal elements of this matrix. Are the correlation coefficients it shows in line with your answer to Question 3?

Hint: Install the GGally package first, then use the ggpairs command as in your textbook.
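
A hedged sketch of the hint, with five columns of the built-in mtcars data standing in for your five Pollution variables (an assumption). The fallback to base `pairs` is there only so the sketch still runs when GGally is not installed. For numeric variables, ggpairs by default draws density curves on the diagonal, scatter plots below it, and correlation coefficients above it.

```r
# Stand-in data: five numeric columns of built-in mtcars (NOT Pollution).
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

if (requireNamespace("GGally", quietly = TRUE)) {
  # Matrix scatter plot with densities and correlation coefficients.
  print(GGally::ggpairs(dat))
} else {
  # Base-R fallback: scatter plots only, no coefficients.
  pairs(dat)
}

# The same coefficients in numeric form, for comparison with the heatmap.
print(round(cor(dat), 2))
```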

5. [15 points] Use the Pollution data set for PCA. How many components are required to explain at least 99.9% of the variance? Provide a table from the output to support your claim.

Hint: Refer to the related lecture notes and textbook section. You can run the PCA with an R command like "pcs <- prcomp(data.frame(Pollution))".
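
To show where the supporting table comes from, here is a minimal sketch that assumes the built-in mtcars data as a stand-in for Pollution. The table printed by `summary` has a "Cumulative Proportion" row; because that row is rounded for display, the sketch recomputes the cumulative share from `pcs$sdev` before comparing it with the threshold.

```r
# Stand-in data: built-in mtcars (NOT Pollution).
dat <- mtcars
pcs <- prcomp(data.frame(dat))

# Table of standard deviation, proportion and cumulative proportion of variance.
print(summary(pcs))

# Unrounded cumulative share of variance explained by the first k components.
var_explained <- pcs$sdev^2 / sum(pcs$sdev^2)
cum_var       <- cumsum(var_explained)

# Smallest number of components reaching the 99.9% threshold.
k <- which(cum_var >= 0.999)[1]
print(k)
```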

6. [20 points] Run the PCA one more time with scaling, and print its output. How many components are required to explain at least 80% of the variance?

Hint: Modify the R command you used in Q5 to a line like "pcs <- prcomp(data.frame(Pollution), scale. = TRUE)".
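
The effect of scaling can be sketched as below, again assuming the built-in mtcars data as a stand-in for Pollution; `cum_share` is a hypothetical helper name. Without scaling, variables measured in large units dominate the first component; with `scale. = TRUE` each variable is standardized to unit variance, so the total variance equals the number of variables.

```r
# Stand-in data: built-in mtcars (NOT Pollution).
dat <- mtcars

pcs_raw    <- prcomp(dat)                  # unscaled, as in Q5
pcs_scaled <- prcomp(dat, scale. = TRUE)   # each variable standardized first

# Hypothetical helper: cumulative share of variance explained.
cum_share <- function(p) cumsum(p$sdev^2) / sum(p$sdev^2)

# Smallest number of components reaching 80% in each run.
k_raw    <- which(cum_share(pcs_raw)    >= 0.80)[1]
k_scaled <- which(cum_share(pcs_scaled) >= 0.80)[1]

print(summary(pcs_scaled))
print(c(raw = k_raw, scaled = k_scaled))

# First-component weights: compare which variables dominate in each run.
print(round(cbind(raw = pcs_raw$rotation[, 1],
                  scaled = pcs_scaled$rotation[, 1]), 2))
```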

7. [10 points, BONUS] Express the observations of Pollution in terms of its principal components by storing them in "scores", i.e. use a command like "scores = pcs$x". Print the first five observations in terms of the principal components, i.e. type an R command like "head(scores, 5)". Compute the correlation between any two columns of scores to verify that the principal components are uncorrelated, i.e. type an R command like "cor(scores[ ,1], scores[ ,2])". Keep in mind that a correlation coefficient such as "-5.360244e-17" in R is practically 0.
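
The commands in Question 7 can be tried out as follows on a stand-in data set (built-in mtcars, an assumption; it is not Pollution). The exact tiny value you get for the correlation depends on the data and machine, so the sketch only checks that it is negligibly small.

```r
# Stand-in data: built-in mtcars (NOT Pollution).
dat <- mtcars
pcs <- prcomp(data.frame(dat), scale. = TRUE)

# Observations expressed in principal-component coordinates.
scores <- pcs$x
print(head(scores, 5))

# Correlation between the first two components: zero up to rounding error.
r12 <- cor(scores[, 1], scores[, 2])
print(r12)

# Largest absolute correlation between any two different components.
cm <- cor(scores)
print(max(abs(cm[upper.tri(cm)])))
```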
