## Simple Linear Regression

Introduction to simple linear regression: Article review

### Abstract

Linear regression is used to predict a trend in data, or to predict the value of one (dependent) variable from the value of another (independent) variable, by fitting a straight line through the data. Dallal (2000) examined how significant the linear regression equation is, how to use it to draw the best-fitting line through a scatter plot, and how important the best-fitting line is.

### Introduction to simple linear regression: Article review

Linear regression is used to predict a trend in data, or to predict the value of one (dependent) variable from the value of another (independent) variable, by fitting a straight line through the data. Linear regression represents a link between the independent (carrier) variable and the dependent (response) variable which, if graphed on X and Y coordinates, results in a straight line. It identifies the straight line that best represents, or predicts, the value of the response variable given the observed value of the carrier variable (Frey, 2006). This essay reviews the article "Introduction to simple linear regression" by Dallal (2000).

### Problem statement

Dallal (2000) assumed a relationship between body mass (the independent or carrier variable) and muscle strength (the dependent or response variable): the greater the body mass, the greater the muscle strength. However, this relationship is not without exceptions, which are reflected in the scatter plot of the regression model. Therefore, the author posed the question of how to draw the straight line that most accurately portrays the data, or predicts the value of the response variable.

### Research purpose statement

In the given example, most cases would show a perfect regression. However, a standardized procedure for fitting the straight line is necessary to improve communication and give common ground to analysts working on the same data. Further, from the example regression equation (Strength = -13.971 + 3.016 LBM, where LBM is lean body mass), one can draw two conclusions. First, predicted muscle strength equals LBM multiplied by 3.016, minus 13.971. Second, the predicted difference in muscle strength between two individuals is 3.016 multiplied by the difference in their LBM.
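The two readings of the equation can be checked with a few lines of Python. This is only a sketch: the coefficients are the ones quoted above, while the function names and the LBM values (50 and 60) are hypothetical and chosen purely for illustration.

```python
# Coefficients quoted from the regression equation in the text.
INTERCEPT = -13.971
SLOPE = 3.016


def predict_strength(lbm: float) -> float:
    """Predicted muscle strength for a given lean body mass (LBM)."""
    return INTERCEPT + SLOPE * lbm


def predicted_difference(lbm_a: float, lbm_b: float) -> float:
    """Predicted difference in strength between two individuals."""
    return SLOPE * (lbm_a - lbm_b)


# Hypothetical LBM values, not taken from the article.
print(round(predict_strength(50.0), 3))
print(round(predicted_difference(60.0, 50.0), 3))
```

Note that the intercept cancels out of the second calculation, which is why the difference between two individuals depends only on the slope.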

### Research questions

*Research question 1: Why do we need to fit a regression equation to a set of data?*

It is clear from the previous example that there are two reasons for fitting a regression equation to a set of data: 1) to describe the data, and 2) to predict a dependent (response) variable from an independent (carrier) one.

*Research question 2: What is the underlying principle of calculating a straight line?*

If the points representing the data in a scatter plot lie close to a line, the line represents the data well, that is, it gives a good fit. If not, the line to which most of the points lie closer than to any other is the one that gives the best fit. Further, if the line is used to predict values, these values should be close to the observed ones; in other words, the residuals (observed values minus predicted values) should be small.
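The idea of judging a line by its residuals can be sketched as follows. The data points and the two candidate lines are invented for illustration and do not come from the article; summing the squared residuals is one common way to condense them into a single number.

```python
def residuals(xs, ys, intercept, slope):
    """Observed minus predicted values for the line y = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, intercept, slope):
    """Single-number summary of how far the points lie from the line."""
    return sum(r ** 2 for r in residuals(xs, ys, intercept, slope))


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

close_line = sum_squared_residuals(xs, ys, 0.0, 2.0)  # line near the points
poor_line = sum_squared_residuals(xs, ys, 5.0, 0.5)   # line far from the points
print(close_line < poor_line)
```

The line that passes near the points leaves much smaller residuals than the one that does not, which is the sense in which it "fits" better.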

*Research question 3: How is the linear regression (least squares) equation used to illustrate the best-fitting line?*

The criterion used, as the name implies, is that the sum of squared residuals (observed minus predicted values) is minimal for the best-fitting line. This applies to a line fitted to a set of sample data, with the aim of generalizing to the population from which the sample was taken. For a population, however, there is a slightly different linear regression equation. It states that the output (dependent) variable on the Y-axis can be predicted from the input (independent) variable on the X-axis after adding a random error ($\varepsilon_i$).
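For a single predictor, the line that minimizes the sum of squared residuals has a well-known closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. A minimal sketch, using hypothetical data and a hypothetical function name (neither is from the article):

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and of cross-products of deviations.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical sample, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_least_squares(xs, ys)
print(round(intercept, 3), round(slope, 3))
```

The intercept and slope computed this way are sample estimates; the population equation differs in that its coefficients are unknown constants and each observation carries its own random error term.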

*Research question 4: Is the sample regression equation an accurate estimate of the population regression equation?*

This holds only with a reservation, which is expressed by the confidence bands around the regression line. The bands are interpreted in the same way as the standard error of the mean (the standard deviation of the sampling distribution of the mean), with one exception: the uncertainty in the estimated mean of the dependent variable grows as the value of the independent variable moves farther from its mean.
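The widening of the bands can be made concrete with the standard textbook formula for the standard error of the fitted mean at a point x0, namely s * sqrt(1/n + (x0 - mean of x)^2 / Sxx), where s is the residual standard deviation. The sketch below assumes that formula (it is not quoted from Dallal's article), uses hypothetical data and function names, and omits the t multiplier that would turn the standard error into an actual confidence band.

```python
import math


def fit_least_squares(xs, ys):
    """Return (intercept, slope, x_mean, sxx) for the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope, x_mean, sxx


def se_of_fitted_mean(xs, ys, x0):
    """Standard error of the estimated mean response at x0."""
    n = len(xs)
    intercept, slope, x_mean, sxx = fit_least_squares(xs, ys)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))  # residual standard deviation
    return s * math.sqrt(1.0 / n + (x0 - x_mean) ** 2 / sxx)


# Hypothetical sample, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# The mean of xs is 3.0; the standard error grows as x0 moves away from it.
for x0 in (3.0, 5.0, 10.0):
    print(x0, round(se_of_fitted_mean(xs, ys, x0), 4))
```

At the mean of X the expression reduces to s divided by the square root of n, which is the familiar standard error of a mean; the extra term is what bends the bands outward on either side.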

### Sources of data

The data Dallal (2000) describes in the second part of his article (linked to the main article) are cross-sectional. This type of data has the advantage of being usable when the sampling method is unweighted and/or unstratified. It can also be used if the researcher is concerned only with minor or small probabilities. Longitudinal data yield more statistical power; however, in repeated cross-sectional analysis, the new subjects added at each analysis compensate for the inherently lower statistical power (Yee and Niemeier, 1996).

### Data collection strategies and methods

A good data collection strategy should have two objectives. The first is having motivated respondents, whose motivation is affected by the time required, trust in statistics, the difficulty of the questionnaire, and the benefit offered. The second is having high-quality data, which requires collection tailored to the sampled individuals, a suitable sampling method, and good data collection instruments (Statistics Norway, 2007).

There are many methods of data collection, and the selection of a particular method depends on the available resources, reliability, the resources for analysis and reporting, and the skills and knowledge of the analyst. Some of these methods are case studies, behavior observation checklists, attitude and opinion surveys, and questionnaires distributed by mail, e-mail, or phone calls. Other methods include time series (evaluating one variable over a period of time, such as a week) and individual or group interviews (The Ohio State University Bulletin Extension, 2005).

### Conclusions

Dallal (2000) inferred that simple linear regression means we can predict a dependent variable from an independent one, so we need not start from the beginning each time we add information. The regression line is important because it makes the estimation of a dependent variable more accurate and allows the estimation of a response variable for individuals whose values of the carrier variable are not included in the data. The author also inferred that there are two ways of predicting a variable: from within the range of values of the independent variable in the given sample (interpolation), or from outside this range (extrapolation). The author recommended the first because it is safer, although concerns remain about how to demonstrate the linearity of the relationship between the two variables.
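The distinction between the two kinds of prediction comes down to whether the new value of the carrier variable falls inside the range observed in the sample. A minimal sketch, with a hypothetical helper name and hypothetical LBM values that are not taken from the article:

```python
def classify_prediction(x_new, observed_xs):
    """Label a prediction as interpolation or extrapolation.

    A new x inside the observed range (endpoints included) is
    interpolation; anything outside that range is extrapolation.
    """
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical observed LBM values, invented for illustration only.
observed_lbm = [38.0, 42.5, 47.0, 51.5, 56.0]

print(classify_prediction(45.0, observed_lbm))
print(classify_prediction(70.0, observed_lbm))
```

A check like this says nothing about whether the relationship is actually linear inside the range; it only flags predictions for which the sample offers no evidence either way.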

### References

Dallal, G. (2000). Introduction to simple linear regression. Retrieved January 14, 2008, from http://www.tufts.edu/~gdallal/slr.htm.

Frey, B. (2006). Statistics Hacks. Sebastopol, CA: O’Reilly Media Inc.

Statistics Norway (2007). Strategy for data collection. Retrieved 04/07/2008, from http://www.ssb.no/vis/english/about_ssb/strategy/strategy_data_collection.pdf

The Ohio State University (2005). Bulletin Extension – Step Four: Methods of Data Collection. Retrieved 04/07/2008, from http://www.ohioline.ag.ohio-state.edu

Yee, J. L., & Niemeier, D. (1996). Advantages and Disadvantages: Longitudinal vs. Repeated Cross-Section Survey-A Discussion Paper. Project Battelle, 94, 16-22.