Linear Regression - Boston House Price Data

Today, I will be doing a Linear Regression analysis using the Boston House Price dataset from the scikit-learn library. Linear Regression allows us to examine the relationship between inputs and a continuous output; in this case, the output is the value of the house price. We will be exploring the data to see whether we can learn anything about the relationships between the variables, and whether we can build a model that can predict house prices based on some of the inputs.

First, we import the relevant libraries.

[Code: importing the relevant libraries]
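The exact imports aren't reproduced here, but a set along these lines would cover everything used in this walkthrough (note that load_boston was removed from scikit-learn in version 1.2, so an older version is needed to run these snippets):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_boston        # removed in scikit-learn >= 1.2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
```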

As the data is part of the scikit-learn library, it requires very little manipulation to load it and begin using it.

[Code: loading the Boston dataset]

Before any statistical methods can be run, we need to get an understanding of how the data is presented. So, after loading the dataset, I check its type.

[Code: checking the dataset's type]
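As a sketch, loading the dataset and checking its type looks roughly like this:

```python
boston = load_boston()
print(type(boston))   # <class 'sklearn.utils.Bunch'>
```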

It is a Bunch, which is a dictionary-like object that allows attribute-style access. It has four keys: data, target, feature_names and DESCR. Accessing these keys reveals the structure of the data that we're interested in.

[Code: listing the Bunch keys]
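Listing the keys is a one-liner (newer scikit-learn versions also include a filename key):

```python
print(boston.keys())
# dict_keys(['data', 'target', 'feature_names', 'DESCR'])
```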

The data key contains the inputs and is a NumPy array 506 rows long and 13 columns wide. We will explore these in a bit more detail soon, in the hope that we can identify some reliable predictors.

[Code: inspecting the data array]
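A quick way to confirm those dimensions:

```python
print(boston.data.shape)    # (506, 13)
print(boston.target.shape)  # (506,)
```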

The target key contains the actual house prices; this is the output and the variable we will be trying to predict based on the inputs.

[Code: .info() on the target]

Quickly calling .info() on the target reveals that it is a column of numerical values, which suggests that this is indeed a regression problem we are dealing with. Calling .head() and looking at the description reveals that the target represents the median value of owner-occupied homes in $1000's for different suburbs of Boston. Understanding how the data is presented will better allow us to understand our performance measure, the root-mean-squared-error. More on this later!

[Code: .head() of the target]
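Since .info() and .head() are pandas methods, the target has presumably been wrapped in a DataFrame first; a sketch, with the column name MEDV being my own choice (it matches the name used in the dataset description):

```python
target = pd.DataFrame(boston.target, columns=['MEDV'])
target.info()
print(target.head())
```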

This plot shows the distribution of house prices. Apart from a few outliers at the top end, it resembles a normal distribution. We can suspect the outliers reflect a cut-off point, with prices capped at $50,000.

[Figure: distribution of house prices]
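One way to produce a plot like this, assuming seaborn 0.11 or later (older versions used distplot instead):

```python
sns.histplot(boston.target, kde=True)        # histogram of the target with a density estimate
plt.xlabel('Median house value ($1000s)')
plt.ylabel('Count')
plt.show()
```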

The feature_names key contains an array of 13 strings; these appear to be the column names of the inputs. Calling the DESCR key confirms this and elaborates on what each column represents; the full description can be found there.

[Code: feature_names and DESCR]
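Printing both is straightforward:

```python
print(boston.feature_names)   # the 13 column names
print(boston.DESCR)           # the full dataset description
```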

Now that we understand what we are looking at, it is time to pass this into a pandas DataFrame, which makes the data much easier to manipulate.

[Code: building the DataFrame]
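A sketch of the construction; the name df and the MEDV column are my own choices:

```python
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target    # append the target so everything lives in one DataFrame
df.head()
```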

Although it is clear from the description that there are no null or None values, it is good practice to check whether any of the data are missing.

[Code: checking for missing values]
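One common way to check:

```python
df.isnull().sum()   # count of missing values per column; all zeros for this dataset
```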

The .info() method is another useful one when looking at data, in particular with reference to data types.

[Code: .info() on the DataFrame]

The .describe() method in pandas produces a tranche of descriptive statistics about the DataFrame in question.

[Code: .describe() on the DataFrame]
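Both calls, for reference:

```python
df.info()              # column dtypes and non-null counts
print(df.describe())   # count, mean, std, min, quartiles and max for each column
```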

Before conducting a regression analysis, we want to establish whether or not the data provide evidence of an association between an input or inputs and house price. Without a relationship between the inputs and the house price, there is little point in conducting a regression analysis. If there is a relationship, we want to establish its strength. Perhaps more importantly, though in this case we know the answer, we want to establish whether or not this relationship is linear.

Linear Regression is a fairly inflexible model. It makes the assumption that there is a linear relationship between X and Y. If this condition is satisfied, there should be relatively little variance in the predictions if the model is trained on different data. This is because the model carries with it a fair amount of bias (the error introduced by approximating real-life problems with simple models), which is minimised the more the data reflects the preconditions for a Linear Regression. Although there is no magic bullet in model selection, ideally the correct model should minimise both the variance and the bias (for those interested, this is known as the bias-variance trade-off). By establishing that there is a linear relationship between X and Y, we can rely more confidently on the validity of our predictions.

This heatmap displays the correlation between the variables. As we would like to select those features with a high correlation to the target, this is helpful.

[Figure: correlation heatmap of the variables]
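A typical way to draw such a heatmap with seaborn:

```python
corr = df.corr()                  # pairwise correlations, including with MEDV
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```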

Two variables stick out as being highly correlated with the target: RM has a strong positive correlation (0.7) with the target, and LSTAT has a strong negative correlation (-0.74). Let's see if we get similar suggestions from scatterplots.

[Figures: scatterplots of CRIM, ZN, INDUS, CHAS, NOX, AGE, DIS, RAD, TAX, PTRATIO, B, RM and LSTAT against the target]
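A loop along these lines would generate one scatterplot per feature:

```python
for col in boston.feature_names:
    plt.scatter(df[col], df['MEDV'], s=10)
    plt.xlabel(col)
    plt.ylabel('MEDV ($1000s)')
    plt.show()
```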

The final two graphs represent the variables mentioned above. They both bear out the sort of correlation suggested by the heatmap.

LSTAT refers to the % of the population defined as lower status.

RM refers to the average number of rooms in a dwelling; perhaps it is not surprising that this is positively correlated with the median house price.

All of this indicates that a linear model is the correct one to pursue. This dataset was selected in the knowledge that a linear model would be a good fit, but 'in the wild', so to speak, this would be a good indication of the type of model we would want to fit.

Next, we can begin preparing the data for our model. We split the data, using scikit-learn's train_test_split() function, into training and test sets. The random_state parameter ensures that the same random split is produced each time we call the function. We can also confirm the size of the sets by checking .shape.

[Code: train/test split]
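A sketch of the split; the test_size and random_state values here are my own assumptions, as the originals aren't shown:

```python
X = df.drop('MEDV', axis=1)   # all 13 inputs
y = df['MEDV']                # the target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```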

With scikit-learn, it is as easy as instantiating the model, calling .fit() on the training sets, and then calling .predict() on X_train. We will then compare the difference between the actual values and the predicted values.

[Code: fitting the model and predicting]
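In code, roughly:

```python
lm = LinearRegression()
lm.fit(X_train, y_train)              # learn the coefficients from the training set
y_train_pred = lm.predict(X_train)    # predictions on the training inputs
```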

Here are the results. In order to evaluate the accuracy of the predictions, scikit-learn provides the mean squared error, or MSE. The MSE can be thought of as the average squared difference between predicted values and actual values, so the closer it is to 0, the better the model has performed at prediction. By taking the square root of the MSE, we produce the RMSE, which is measured in the same units as the quantity being investigated.

[Code: computing the RMSE]
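A sketch of the calculation, shown here on the training predictions from above:

```python
mse = mean_squared_error(y_train, y_train_pred)
rmse = np.sqrt(mse)
print(round(rmse, 2))
```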

So, what does an RMSE of 5.6 mean? As the RMSE measures the standard deviation of the errors the model makes, it means that, assuming the errors are roughly normally distributed, about 68% of the model's predictions fall within $5.6 thousand of the actual house price, and about 95% fall within $11.2 thousand of the actual value.

Whether or not this is an acceptable result depends entirely upon the purpose of running a regression analysis in the first place. When we consider that the mean house price in the data is $22.5 thousand (the data is from the 1970s, after all), $5.6 thousand accounts for about a quarter of that value. The target here refers to the median house prices of Boston suburbs, and thus one can imagine the model being helpful as a means of insight, though not as the sole criterion for the valuation of a property.

We could ask, though, whether this was the best potential model with which to predict house prices, given the data at hand. The problem that could arise here is that the model overfits to the data. Another way of expressing this would be to say that the model best predicts a pattern unique to the sample data that may not translate to the broader population. As the population data is not always available, or is too large to be used practically, it is standard practice to divide the data into training and test sets. The purpose of a test set is to act as a guard against overfitting. Ideally, there would be a further validation set, but given the size of this dataset that seems impractical. We won't get into bootstrapping here, though this is another means of combatting overfitting.

Assuming that correlation was the correct means by which to select the input variables, I will run the same model as above on RM and LSTAT separately.

First RM.

[Code: training RMSE for the RM-only model]

Then LSTAT.

[Code: training RMSE for the LSTAT-only model]
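Both single-feature runs follow the same pattern as the full model; a sketch, reusing the variables defined above:

```python
for feature in (['RM'], ['LSTAT']):
    model = LinearRegression().fit(X_train[feature], y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train[feature])))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test[feature])))
    print(feature, round(train_rmse, 2), round(test_rmse, 2))
```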

Based on this, I was correct in choosing the first model. However, on the test sets, the RM-only model produced the best RMSE at 4.98, the combined model was slightly worse at 5.1, and the LSTAT-only model produced 6.2. With such a small sample set, these fluctuations in rank are likely meaningless, and merely suggest that the usefulness of conducting this regression lies in the ability to gain a broad insight into house prices, and the variables deciding them, rather than being an exact means of predicting them.

I think this is a good place to leave off for now. Thank you for reading!