
Mark’s Courses: Simple Linear Regression Tutorial

I’ve decided to start building a data science course for my son so that when he’s old enough to be interested, I can offer him an education I can understand and explain. This is the first of hopefully many, and it focuses on linear regression over a data set with a large number of features. The data set is from Kaggle, and you can submit your result to the competition to see how you score!

I’ve put all files used in this exercise here:

Notebook Link

Test.csv

train.csv

Step 0:

If you haven’t already, install Python 3.11 and then use pip to install Jupyter, pandas, and scikit-learn.

Python can be installed from here: Download Python | Python.org

Note: Ensure you add Python to your PATH (check the box during install; it’s not enabled by default).

Then open command prompt and install the required libraries:

pip install jupyter

pip install pandas

pip install scikit-learn

pip install matplotlib

pip install seaborn

You’ll see a screen like this:

Step 1:

Sign up for Kaggle (www.kaggle.com)

Step 2:

We’ll be doing a simple regression on housing data to make predictions of home prices. The exercise can be found here:

House Prices – Advanced Regression Techniques | Kaggle

You can download the data here:

House Prices – Advanced Regression Techniques | Kaggle

Your folder should look like this:

Step 3:

Start Jupyter Notebook and create a notebook in the folder

Go to your command prompt (click Start and type cmd on Windows machines) and type jupyter notebook. You should see something like this:

Select one of the URLs at the bottom of the prompt, copy it, and paste it into a web browser:

Navigate to the folder you created for the data (note: Jupyter can only access files under the directory where you started it, so if your data lives outside the folder where you launched your notebook instance, Jupyter won’t have access):

Create a new notebook file:

If you’re new to notebooks, this is a great resource: How to Use Jupyter Notebook in 2020: A Beginner’s Tutorial (dataquest.io)

Step 4:

Now that you’re in your new notebook, rename it to something you like, such as “housing_prices”. I’ll share the full notebook at the end, but you can follow along by copying and pasting each code block as it appears.

This tutorial goes over a few key concepts:

  1. Exploratory Data Analysis (EDA)
  2. Simple Data Engineering
    1. Dropping unnecessary columns / features
    2. The creation and use of dummy variables

First things first let’s do our imports and load the data:

import pandas as pd  # for dataframes and associated methods
import seaborn  # to visualization correlation
import matplotlib.pyplot as mp  # to visualize correlation
from sklearn.model_selection import train_test_split  # to split training / testing data
from sklearn.linear_model import LinearRegression  # For linear regression model
from sklearn.metrics import r2_score  # to score output model

train_df = pd.read_csv('train.csv')  # pull in the data
test_df = pd.read_csv('test.csv')
Exploratory Data Analysis (EDA)

EDA is done to give a data scientist an understanding of the data set. That can mean a lot of things to a lot of people, but generally data scientists use statistics and visualizations to understand what’s happening. These could include, but aren’t limited to, the commands below.

Head is a command that shows the first five rows of data. There’s a tail command as well, which shows the last five rows. These are good commands for visually confirming that everything loaded OK.

print(train_df.head())  # check the data load

81 columns – that’s not a small data set. Don’t worry – you’ll see how powerful a good scripting language can be for cutting through this. Next, let’s learn what’s in the data through a command called describe:

train_df.describe()

Something that pops out above is that the target variable, SalePrice, has a mean of 180921 and a standard deviation of 79442. This means the target variable has a large spread of values.

Another command we can use, info, gives additional information on data types. We’ll use its output to understand which variables are text and which are numbers:

train_df.info()

We have a wide variety of numeric and text data. int64 is just an integer, and float64 indicates a decimal (which makes sense for LotFrontage, four lines down). The values listed as object are text-based and likely hold multiple distinct values. Let’s look at “Street”:

print(train_df['Street'].unique())

Ok, that makes sense. This column has two text values, paved and gravel. There are likely text columns with more than two.
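If you want to survey every text column at once, a short loop works. Here’s a sketch on a tiny made-up frame (not the real housing columns) so you can see the shape of the output:

```python
import pandas as pd

# Tiny stand-in for the housing data (made-up values)
df = pd.DataFrame({
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
    'LotShape': ['Reg', 'IR1', 'IR2', 'Reg'],
})

# For each text (object) column, count and list its distinct values
for col in df.select_dtypes(include='object').columns:
    print(col, df[col].nunique(), sorted(df[col].unique()))
# Street 2 ['Grvl', 'Pave']
# LotShape 3 ['IR1', 'IR2', 'Reg']
```

Running the same loop on train_df shows which columns have two values and which have many more.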

Finally we can visualize correlation using a visualization tool called Seaborn. Run this command:

plot = seaborn.heatmap(train_df.corr(numeric_only=True), cmap="YlGnBu", annot=False)
mp.show()

Some good information here. What can we do with this? If we wanted to shrink the feature set down manually, we could find variables that are highly collinear and use one to represent the group. Why? Look at YearBuilt and GarageCars – the correlation between the two is quite high. This makes sense – most houses built in the 1950s have at most two spots, whereas more modern houses have larger garages.


This means that any information carried by both year built and number of garage spots is partially duplicated, and one could possibly be dropped without significantly hampering the regression. Why would we do this?

Well, if you’re building a model across millions of data points, and you can drop a feature without hurting the model’s accuracy, then it’s something you can consider for performance.

What if you didn’t want to do this manually? Great question, but that’s beyond the scope of this tutorial. If you’re curious, look up principal component analysis (PCA) – it’s a way to take a high number of features and reduce them programmatically. In general this method is far better than manually removing features, and it takes far less work.
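A minimal sketch of that idea on synthetic data (made-up numbers, not the housing set): two strongly correlated features collapse into a single principal component with very little variance lost.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two correlated features, loosely modeled on year built vs. garage spots
rng = np.random.default_rng(0)
year = rng.uniform(1950, 2010, size=100)
garage = (year - 1940) / 30 + rng.normal(0, 0.1, size=100)
X = np.column_stack([year, garage])

# Standardize, then reduce the correlated pair to one component
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                    # (100, 1)
print(pca.explained_variance_ratio_[0])   # close to 1.0 for a correlated pair
```

On the real 80-column data set you’d pick n_components to keep most of the variance while shrinking the feature count.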

Summary of the results from the EDA:

Model building

Ok let’s focus on the model building aspect. For this we’ll want to build simple regressions. This choice has some pros and cons:

Pros:

Cons:

More commentary on the outlier issue: imagine this training data set is mostly 3-bedroom, 2-bathroom houses, and then a 20-bedroom mansion shows up for the model to price. The model will likely have no clue how to extend the relationship from 3 bedrooms to 20 bedrooms as a linear function of price.
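To make the point concrete, here’s a small sketch with made-up prices (not the housing data): fitting the same linear model with and without a single extreme point shows how much one outlier drags the slope.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Mostly 2-4 bedroom homes at roughly 100k per bedroom (made-up numbers)
bedrooms = np.array([[2], [3], [3], [4], [3], [2]])
prices = np.array([200_000, 300_000, 310_000, 400_000, 290_000, 210_000])

reg = LinearRegression().fit(bedrooms, prices)
print(reg.coef_[0])  # slope fit only on typical homes

# Add one 20-bedroom mansion priced far off the trend
bedrooms_o = np.vstack([bedrooms, [[20]]])
prices_o = np.append(prices, 5_000_000)

reg_o = LinearRegression().fit(bedrooms_o, prices_o)
print(reg_o.coef_[0])  # the single extreme point drags the slope way up
```

One point out of seven roughly triples the per-bedroom coefficient, which is why linear models are sensitive to outliers.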

The first step is to separate the numeric and non-numeric columns. We’ll need this later to convert the non-numeric columns to numbers:


numeric_columns = [c for c in train_df.columns if pd.api.types.is_numeric_dtype(train_df[c])]

def return_numeric_columns(df):
    return [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]

This line is a bit hard to parse if you’re new to Python. It’s called a list comprehension. Basically, it lets you populate a list with a single line of code. List comprehensions are very pythonic, and once you get used to them they’re pretty cool. In English, what it’s doing is: for each column, if the column’s type is numeric, add it to the list.
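Here’s the same pattern in isolation – a loop-plus-condition collapsed into one line:

```python
# Loop-and-append version
squares = []
for n in range(6):
    if n % 2 == 0:
        squares.append(n * n)

# Equivalent list comprehension: one line, same result
squares_lc = [n * n for n in range(6) if n % 2 == 0]

print(squares)     # [0, 4, 16]
print(squares_lc)  # [0, 4, 16]
```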

Note: we’re also defining a function version of each step; it’ll become clear later why these are needed to build a pipeline for data ingestion.

Let’s check to see if we got the right columns:

print(numeric_columns)

We did! Now we can do the same thing to get the string columns. Notice that the function we write for the string columns calls the numeric-column function – we can stack functions on top of each other to build more powerful ones.

string_columns =  [x for x in train_df.columns if x not in numeric_columns]

def return_string_columns(df):
    numeric_columns = return_numeric_columns(df)
    return [x for x in df.columns if x not in numeric_columns]
print(string_columns)

It worked!

The last thing to note is that the data has an “Id” column, which presumably runs from 1 to N. This will add noise to the regression, since an arbitrary row number has nothing to do with home price, so let’s drop it for our purposes.

def drop_id_column(df):
    if 'Id' in df:  # guard so this is safe to re-run in a pipeline
        df = df.drop(['Id'], axis=1)
    return df

""" drop the ID column """
train_df = drop_id_column(train_df)

"""check to see if Id is gone:"""

print('Id is in the dataframe: ', 'Id' in train_df)

We noticed above that some numerical columns have missing data. There are a few ways to address this, but the easiest is to interpolate (note: often not the best way). This basically says: ‘if a value is missing, plug in a value based on a linear interpolation from the surrounding numerical values’. Again, this should generally be avoided, because your regression then becomes a regression of a regression.
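For intuition, here’s what interpolate does on a tiny Series with gaps:

```python
import pandas as pd

# Linear interpolation fills each missing value from its neighbors
s = pd.Series([1.0, None, 3.0, None, 5.0])
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```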

for col in return_numeric_columns(train_df):
    train_df[col] = train_df[col].interpolate()

def interpolate_numerical_columns(df):
    for col in return_numeric_columns(df):
        df[col] = df[col].interpolate()
    return df

Remember when we noticed all of those text columns? Text is great, but a linear regression isn’t going to be able to consume it organically. One way to address this is with dummy variables. What is a dummy variable? Dummy variables pivot text data into columns with associated 1s and 0s. See below:

In the left data set, a linear regression can’t apply a coefficient to a string value, but on the right it can apply a coefficient to the integer value. So let’s say we were doing a regression for height, and being male was +3 inches while being female was -3 inches. A linear regression could do:

height = C (base height) + 3 * (Male) - 3 * (Female)
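A tiny example of what get_dummies produces (made-up frame; recent pandas versions emit boolean columns, older ones emit 0/1 integers):

```python
import pandas as pd

people = pd.DataFrame({'Sex': ['Male', 'Female', 'Male']})

# One new column per distinct value, named base-column_value
dummies = pd.get_dummies(people, columns=['Sex'])
print(dummies)  # columns Sex_Female and Sex_Male
```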

train_df = pd.get_dummies(train_df, columns = return_string_columns(train_df))

def create_dummy_variables(df):
    return pd.get_dummies(df, columns = return_string_columns(df))

Awesome! See how there’s 1s and 0s now, and the columns were automatically named based on the base column name?

Now we’re ready to run the regression.

  1. Separate the columns into dependent and independent variables
  2. Split the outcomes into test and train data sets
  3. Fit a regression on the train data set
  4. Score the regression on the test data set

You may be asking what train and test sets are (and best practice is to also hold out a validation set if you want to tune hyperparameters – way beyond the scope of this). Basically, you need enough data to train a good model while still holding some back to test how well the model performs on data it hasn’t seen. sklearn makes this easy with a single function call:

X = train_df.drop(['SalePrice'], axis=1)
y = train_df['SalePrice']

""" split the data set """
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
""" make a regression"""
reg = LinearRegression().fit(X_train,y_train)
"""create predictions based on the test data set"""
predictions = reg.predict(X_test)
""" compare the predictions to the actual values"""
r2_score(y_test, predictions)

An r^2 value of around 0.88 – not bad!

So how do we string this all together? The flow chart below shows what we’re about to do:

Let’s create a few functions that will return the regression when run:

def return_regression(df):
    X = df.drop(['SalePrice'], axis=1)
    y = df['SalePrice']

    """ split the data set """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
    """ make a regression"""
    reg = LinearRegression().fit(X_train,y_train)

    return reg


def run_regression(df, reg):
    predictions = reg.predict(df)
    return predictions

And now let’s create the model using a pipeline:

train_df = pd.read_csv('train.csv')

processed_train_df = train_df.pipe(drop_id_column) \
     .pipe(interpolate_numerical_columns) \
     .pipe(create_dummy_variables)

model = processed_train_df.pipe(return_regression)

One thing we’ll need to account for is that the training data may produce dummy columns that aren’t in the smaller test set. For example, if the “MiscFeature” training data has a value like “porch” that never appears in the test set, the test data won’t have that dummy column and the regression won’t know what to do. Let’s add a function that adds any columns missing from the test set and sets them to 0, since those values aren’t present:

def drop_sale_price(df):
    return df.drop('SalePrice', axis=1)

""" function that adds columns in train that aren't in test and sets the same order """
def add_missing_columns(train_df, test_df):
    missing_cols = list(set(list(train_df)) - set(list(test_df)))
    for col in missing_cols:
        test_df[col] = 0

    test_df = test_df[list(train_df)]
        
    return test_df
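As a sanity check, here’s the column-alignment idea on toy frames (a self-contained copy of the helper, with hypothetical column names):

```python
import pandas as pd

def add_missing_columns(train_df, test_df):
    # Add any column present in train but absent from test, filled with 0,
    # then reorder test's columns to match train's order
    missing_cols = set(train_df.columns) - set(test_df.columns)
    for col in missing_cols:
        test_df[col] = 0
    return test_df[list(train_df.columns)]

train = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
test = pd.DataFrame({'A': [4], 'C': [5]})

aligned = add_missing_columns(train, test)
print(list(aligned.columns))   # ['A', 'B', 'C']
print(aligned['B'].tolist())   # [0]
```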

Read the data in, run the regression, and print the results:

test_df = pd.read_csv('test.csv')

processed_test_df = test_df.pipe(drop_id_column) \
.pipe(interpolate_numerical_columns) \
.pipe(create_dummy_variables)

processed_test_df = add_missing_columns(processed_train_df, processed_test_df)
processed_test_df = drop_sale_price(processed_test_df)

""" run the regression """
results = run_regression(processed_test_df, model)

print(results)

We have data! Now the last step is to package it up for Kaggle and submit!

"""create a dataframe from the output list and output to a file"""
id_list = list(range(1461, 2920))  # the test set's Ids run from 1461 to 2919
result_df = pd.DataFrame(list(zip(id_list, results)), columns=['Id', 'SalePrice'])

result_df.to_csv('results.csv', index=False)

We can navigate to submit predictions:

And submit:

And we get a score!

We finished 3847th on a list of 4498. Much to improve!
