As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display technique for comparing two quantitative variables. We plot the response variable on the y-axis and the explanatory, or predictor, variable on the x-axis.

How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a study. One may even explore whether one variable causes the variation in another variable. For example, a popular research claim is that taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.

In summarizing the relationship between two quantitative variables, we need to consider the association or direction (positive or negative), the form (for example, linear), and the strength of the relationship.

We will refer to the Exam Data set (Final.MTW or Final.XLS), which consists of a random sample of 50 students who took Stat200 last semester. The data consist of their semester average on mastery quizzes and their score on the final exam. We construct a scatterplot showing the relationship between Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student performance on the mastery quizzes explains the variation in their final exam scores. That is, can mastery quiz performance be considered a predictor of final exam score? We create this graph using either Minitab or SPSS.

To create a scatterplot in SPSS:
This should result in the following scatterplot:

Association/Direction and Form

We can interpret from either graph that there is a positive association between Quiz Average and Final: low quiz averages are accompanied by lower final scores, and high quiz averages by higher final scores. If this relationship were reversed, with high quiz averages accompanying low final scores, the graph would have displayed a negative association. That is, the points in the graph would have fallen going from left to right. The scatterplot can also be used to describe the form of the relationship. In this example the relationship is linear: the points follow a roughly straight-line pattern, with no apparent change in direction.

Strength

To measure the strength of a linear relationship between two quantitative variables we use correlation. We calculate correlation in Minitab by (using the Exam Data):
The output gives us a Pearson correlation of 0.609.

Correlation Properties (NOTE: the symbol for correlation is r)
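As a rough illustration of how r is computed outside Minitab, the sketch below implements the standard Pearson formula in plain Python. The quiz and final scores are made-up values chosen for illustration, not the Exam Data, and the helper name `pearson_r` is hypothetical. It also checks two well-known properties of r: it lies between -1 and 1, and it does not change when a variable is rescaled to different units.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Made-up quiz averages and final scores (NOT the Exam Data).
quiz = [62.0, 70.5, 75.0, 81.0, 84.5, 90.0, 95.5]
final = [58.0, 74.0, 65.0, 70.0, 88.0, 79.0, 93.0]

r = pearson_r(quiz, final)
# Express quiz averages as proportions instead of percentages.
r_rescaled = pearson_r([q / 100 for q in quiz], final)

print(round(r, 3), round(r_rescaled, 3))
```

Both printed values agree, since dividing every quiz average by 100 changes the units but not the strength or direction of the linear relationship.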
Equations of Straight Lines: Review

The equation of a straight line is given by y = a + bx. When x = 0, y = a, so a is the intercept of the line. The slope of the line is b: it measures the change in y per unit change in x. Two examples:
For 'Data 1', the equation is y = 3 + 2x; the intercept is 3 and the slope is 2. The line slopes upward, indicating a positive relationship between x and y. For 'Data 2', the equation is y = 13 - 2x; the intercept is 13 and the slope is -2. The line slopes downward, indicating a negative relationship between x and y.
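To make the roles of the intercept and slope concrete, here is a small sketch that evaluates the two example lines at a few x values. The x values and the helper name `line` are assumptions made for illustration; only the two equations come from the text.

```python
def line(a, b):
    """Return the function x -> a + b*x for intercept a and slope b."""
    return lambda x: a + b * x

data1 = line(3, 2)     # y = 3 + 2x
data2 = line(13, -2)   # y = 13 - 2x

xs = [0, 1, 2, 3, 4]   # illustrative x values, not taken from the source
ys1 = [data1(x) for x in xs]
ys2 = [data2(x) for x in xs]

print(ys1)  # [3, 5, 7, 9, 11]
print(ys2)  # [13, 11, 9, 7, 5]

# The change in y for each one-unit step in x equals the slope.
steps1 = [ys1[i + 1] - ys1[i] for i in range(len(xs) - 1)]
steps2 = [ys2[i + 1] - ys2[i] for i in range(len(xs) - 1)]
print(steps1, steps2)  # [2, 2, 2, 2] [-2, -2, -2, -2]
```

At x = 0 each line returns its intercept, and every unit increase in x moves y by exactly the slope.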
The relationship between x and y is 'perfect' for these two examples: the points fall exactly on a straight line, so the value of y is determined exactly by the value of x. Our interest is in relationships between two variables that are not perfect. The correlation between x and y is r = 1.00 for the values of x and y on the left and r = -1.00 for the values of x and y on the right. Regression analysis is concerned with finding the 'best' fitting line for predicting the average value of a response variable y using a predictor variable x.

Least Squares Regression

Many relationships between two quantitative variables are best described by a straight line. In statistics, this line is referred to as a regression line. Historically, the term is associated with Sir Francis Galton, who in the late 1800s studied the phenomenon that children of tall parents tended to “regress” toward mediocrity, that is, toward the average height. Adapting the algebraic expression for a line, the regression line is written as:

\(\hat{y}=b_0+b_1 x\)

Here, \(b_0\) is the y-intercept and \(b_1\) is the slope of the regression line. Some questions to consider are:
By answering the third question we should gain insight into the first two questions. We use the regression line to predict a value \(\hat{y}\) for any given value of x. The “best” line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals, and these residuals are the errors in predicting y. As in any prediction or estimation process, we want these errors to be as small as possible. To accomplish this goal of minimum error, we use the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and the sum of squared residuals appear as follows:

Residuals: \(y-\hat{y}\)

Sum of squared residuals: \(\sum{(y-\hat{y})^2}\)

A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. The calculus solution results in the following formulas for \(b_0\) and \(b_1\):

\(b_1=r\frac{S_y}{S_x}\)

\(b_0=\bar{y}-b_1\bar{x}\)

Another way of looking at the least squares regression line is that when x takes its mean value, the predicted y also takes its mean value. That is, the regression line always passes through the point (\(\bar{x}\), \(\bar{y}\)). As to the other expressions in the slope equation, \(S_y\) is the sample standard deviation of y, which is built from the sum of squared deviations between the observed values of y and the mean of y; similarly, \(S_x\) is the sample standard deviation of x. Because both standard deviations use the same divisor, n - 1, the ratio \(S_y/S_x\) equals the ratio of the square roots of the two sums of squared deviations.

Example: Exam Data set (Final.MTW or Final.XLS)

To perform a regression on the Exam Data we can use either Minitab or SPSS:
In addition, the following are the first five rows of the data in the worksheet:

To perform a regression analysis in SPSS:
This should result in the following regression output:

WOW! This is quite a bit of output. We will take this output apart and you will see that the results are not too complicated. Also, if you hover your mouse over various parts of the output in Minitab, pop-ups will appear with explanations.

The Output

From the output we see:
NOTE: Remember that a positive value has both a positive and a negative square root (for example, 4 has square roots 2 and -2). Thus the sign of the correlation cannot be read from \(R^2\) alone; it matches the sign of the slope.

For example, if we substitute the first Quiz Average of 84.44 into the regression equation we get Final = 12.1 + 0.751*(84.44), which is approximately 75.5. The first value in the FITS column, 75.5598, is this fitted value computed from the unrounded coefficients. Using this value, we can compute the first residual under RESI by taking the difference between the observed y and this fitted \(\hat{y}\): 90 - 75.5598 = 14.4402. Similar calculations produce the remaining fitted values and residuals.

Coefficient of Determination, \(R^2\)

The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), in which we try to predict the value of y from the explanatory variable x. The proportion of variation in the response variable that can be explained (i.e. accounted for) by the explanatory variable is denoted by \(R^2\). In our Exam Data example this value is 37%, meaning that 37% of the variation in the Final scores can be explained (now you know why x is also referred to as an explanatory variable) by the Quiz Averages. Since this value is in the output and is related to the correlation, we mention \(R^2\) now; we will take a further look at this statistic in a future lesson.

Residuals or Prediction Error

As with most predictions, you expect there to be some error; that is, you expect the prediction not to be exactly correct (e.g. when predicting a final voting percentage, you would expect the prediction to be close to, but not necessarily equal to, the actual final percentage). Also, in regression, observations with the same x value usually do not all have the same y value, as we mentioned earlier: not every person with the same height (x-variable) has the same weight (y-variable). These errors in regression predictions are called prediction errors or residuals.
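The arithmetic for the first fitted value, the first residual, and \(R^2\) can be checked with a few lines of Python. This is only a sketch using the rounded numbers quoted in the text (12.1, 0.751, 84.44, 90, 75.5598, and r = 0.609); the variable names are made up for illustration. Because the coefficients are rounded, the fitted value computed here agrees with the FITS entry only approximately.

```python
# Rounded regression coefficients quoted in the text.
b0, b1 = 12.1, 0.751

# First observation: Quiz Average and observed Final score.
quiz_avg, observed_final = 84.44, 90

# Fitted value from the rounded coefficients (close to, not equal to, FITS).
fitted_rounded = b0 + b1 * quiz_avg

# First FITS entry as quoted in the text, and the matching residual.
fits_from_output = 75.5598
residual = observed_final - fits_from_output

# Squaring the Pearson correlation gives the coefficient of determination.
r = 0.609
r_squared = r ** 2

print(round(fitted_rounded, 2))   # 75.51
print(round(residual, 4))         # 14.4402
print(round(100 * r_squared))     # 37
```

The small gap between 75.51 and 75.5598 reflects rounding of the coefficients, and 0.609 squared gives the 37% figure.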
The residuals are calculated by taking each observed y-value minus its corresponding predicted y-value, or \(y-\hat{y}\). Therefore we have as many residuals as we have y observations. The goal in least squares regression is to select the line that minimizes the sum of the squared residuals: in essence, we create a best-fit line that has the least amount of error.
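The least squares recipe can be sketched in plain Python using the formulas \(b_1=r\frac{S_y}{S_x}\) and \(b_0=\bar{y}-b_1\bar{x}\). The data below are made-up values for illustration (not the Exam Data), and the function names are hypothetical. The sketch then compares the sum of squared residuals for the fitted line against two slightly altered lines, which should do worse.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b0, b1, r) using b1 = r * Sy / Sx and b0 = ybar - b1 * xbar."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / ((len(xs) - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x
    return b0, b1, r

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (y - y_hat)^2 for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up quiz averages and final scores (NOT the Exam Data).
quiz = [62.0, 70.5, 75.0, 81.0, 84.5, 90.0, 95.5]
final = [58.0, 74.0, 65.0, 70.0, 88.0, 79.0, 93.0]

b0, b1, r = least_squares_line(quiz, final)
residuals = [y - (b0 + b1 * x) for x, y in zip(quiz, final)]

sse_best = sum_squared_residuals(quiz, final, b0, b1)
sse_shifted = sum_squared_residuals(quiz, final, b0 + 1.0, b1)   # move intercept
sse_tilted = sum_squared_residuals(quiz, final, b0, b1 + 0.05)   # change slope

print(len(residuals) == len(final))                        # one residual per y
print(sse_best < sse_shifted and sse_best < sse_tilted)    # fitted line wins
```

Under these assumptions both checks print True: there is one residual per observation, and changing either the intercept or the slope of the fitted line increases the sum of squared residuals.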