Outliers and Influential Points
An outlier is a data point that stands out from the rest of a data set. It may reflect a mistake that produced an unusual value, or it may be a valid observation of an unusual individual or event.
When a data set contains one or more outliers, each of them may or may not be an influential point. This depends on whether including or excluding the outlier strongly affects the position of the least-squares regression line.
What to Do When Outliers are Present in a Data Set
- Compute the least-squares regression line both with and without each outlier to determine which outliers are influential.
- Report the equations of the least-squares regression line both with and without each influential point.
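The two steps above can be sketched in Python. This is only an illustration, not the course's own method: it assumes NumPy is available, the helper names `lsrl` and `compare_fits` are hypothetical, and the data values are made up for the demonstration.

```python
import numpy as np


def lsrl(x, y):
    """Return (slope, intercept) of the least-squares regression line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def compare_fits(x, y, outlier_indices):
    """Fit the line with all points, then once more with each outlier left out."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    full = lsrl(x, y)
    without = {}
    for i in outlier_indices:
        keep = np.arange(len(x)) != i  # boolean mask that drops only point i
        without[i] = lsrl(x[keep], y[keep])
    return full, without


# Made-up data: six points close to y = 2x, plus one suspected outlier (index 6).
x = [1, 2, 3, 4, 5, 6, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 15.0]

full, without = compare_fits(x, y, outlier_indices=[6])
print(f"with all points: y = {full[0]:.2f}x + {full[1]:.2f}")
for i, (slope, intercept) in without.items():
    print(f"without point {i}: y = {slope:.2f}x + {intercept:.2f}")
```

If the two equations differ a great deal, as they do for this made-up point, the outlier is influential and both equations should be reported.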
Influential Point
An influential point is a point that, when included in a data set, strongly affects the position of the least-squares regression line.
Example
(This is an example from the e-book)
Above is a scatterplot of farm area versus total land area for U.S. states. The blue solid line on the plot is the least-squares regression line computed for the 48 states other than Texas and Alaska. The red dashed line is the least-squares regression line for 49 states including Texas. Including Texas moves the line somewhat. The green dash-dot line is the least-squares regression line for 49 states including Alaska. Including Alaska causes a big shift in the position of the line. In this case, both Texas and Alaska would be considered outliers, but only Alaska would be considered an influential point, because only Alaska causes a large change in the least-squares regression line when it is included in the data set.
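The same contrast can be reproduced with a small sketch. The numbers below are invented for illustration and are not the state data from the plot; the labels "A" and "B" are hypothetical. Point A lies far to the right but close to the trend of the other points, while point B lies far to the right and well off the trend.

```python
import numpy as np

# Invented base data that roughly follow y = 0.5x.
base_x = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
base_y = np.array([6, 9, 16, 19, 26, 29, 36, 39], dtype=float)

# Two hypothetical outliers, each added to the base data on its own.
candidates = {
    "A": (260.0, 125.0),  # far out in x, near the existing trend
    "B": (570.0, 1.0),    # far out in x, far below the existing trend
}

base_slope, base_intercept = np.polyfit(base_x, base_y, deg=1)
print(f"base data only: slope = {base_slope:.3f}")

slopes = {}
for label, (px, py) in candidates.items():
    slope, intercept = np.polyfit(np.append(base_x, px), np.append(base_y, py), deg=1)
    slopes[label] = slope
    print(f"base data + {label}: slope = {slope:.3f}, change = {slope - base_slope:+.3f}")
```

In this sketch both added points are outliers, but only B changes the slope by a large amount, so only B would be called influential.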
Here is a video showing some more examples. Please keep in mind that r squared will be defined on the next page. (It comes up briefly in this video.)