Implementing Linear Regression

This is my implementation of least-squares linear regression, along with several experiments using LASSO regularization.
Author

Kent Canonigo

Published

March 27, 2023

Linear Regression Implementation

Here is the link to my source code: Linear Regression

Least-Squares Linear Regression

For this blog post, I’ve implemented least-squares linear regression in two ways: using the analytical method and an efficient gradient descent.

Analytical Method

The analytical method is derived from solving the equation: \[0 = X^{T}(X \hat{w} - y)\]

where \(X\) is our padded feature matrix. By setting the gradient equal to \(0\), we can solve for \(\hat{w}\) to get an explicit solution for the optimized weight vector. Solving this equation requires \(X^{T}X\) to be invertible, which in turn requires \(X\) to have at least as many rows as columns (full column rank). Thus, our final solution for \(\hat{w}\) is

\[\hat{w} = (X^{T}X)^{-1} X^{T}y\]

In my fit_analytic method, I used numpy's transpose and np.linalg.inv alongside matrix multiplication to calculate the optimized weight vector \(\hat{w}\).
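Below is a minimal sketch of this computation (illustrative only; the actual implementation lives in the fit_analytic method of linear.py, and the function name here is hypothetical):

import numpy as np

# Illustrative sketch of the analytic solution: pad X with a constant column,
# then apply w_hat = (X^T X)^{-1} X^T y.
def fit_analytic_sketch(X, y):
    X_pad = np.append(X, np.ones((X.shape[0], 1)), 1)     # pad with a constant column
    w_hat = np.linalg.inv(X_pad.T @ X_pad) @ X_pad.T @ y  # explicit least-squares solution
    return w_hat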

Efficient Gradient Descent

To implement an efficient gradient descent for least-squares linear regression, instead of computing the original gradient equation at each iteration of an epoch:

\[\nabla{L(w)} = X^{T}(Xw - y)\]

I precomputed \(P = X^{T}X\) and \(q = X^{T}y\) once; the matrix multiplication \(X^{T}X\) costs \(O(np^2)\) and \(X^{T}y\) costs \(O(np)\), so these costs are paid only once rather than at every iteration. The gradient equation then simplifies to:

\[\nabla{L(w)} = Pw - q\]

reduces the time complexity of calculating the gradient to \(O(p^2)\) per iteration, which is significantly faster! In my fit_gradient method, I first initialized a random weight vector whose length matches the number of columns of the padded feature matrix \(X\). Then, I computed \(P\) and \(q\), which I used inside my for-loop to update my weight vector self.w with the gradient. At each epoch, I calculated the score of the current weight vector and appended it to score_history.
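Below is a hedged sketch of this loop with illustrative names (the real fit_gradient in linear.py also computes the score at each epoch and appends it to score_history):

import numpy as np

# Illustrative sketch of the efficient gradient-descent fit.
def fit_gradient_sketch(X, y, alpha=0.01, max_iter=100):
    X_pad = np.append(X, np.ones((X.shape[0], 1)), 1)  # pad with a constant column
    w = np.random.rand(X_pad.shape[1])                  # random initial weight vector
    P = X_pad.T @ X_pad                                 # computed once: O(np^2)
    q = X_pad.T @ y                                     # computed once: O(np)
    for _ in range(max_iter):
        grad = P @ w - q                                # O(p^2) per iteration
        w = w - alpha * grad                            # gradient-descent update
    return w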

Experiments

Demonstration

Shown below, I've generated a set of data using the LR_data helper function. Then, I fit the data using both the analytic and gradient descent methods; both should yield a similar optimized weight vector \(w\).

%load_ext autoreload
%autoreload 2
from linear import LinearRegression 
import numpy as np 
from matplotlib import pyplot as plt

LR = LinearRegression()

def pad(X):
    return np.append(X, np.ones((X.shape[0], 1)), 1)

def LR_data(n_train = 100, n_val = 100, p_features = 1, noise = .1, w = None):
    if w is None: 
        w = np.random.rand(p_features + 1) + .2
    
    X_train = np.random.rand(n_train, p_features)
    y_train = pad(X_train)@w + noise*np.random.randn(n_train)

    X_val = np.random.rand(n_val, p_features)
    y_val = pad(X_val)@w + noise*np.random.randn(n_val)
    
    return X_train, y_train, X_val, y_val
np.random.seed(1)

n_train = 100
n_val = 100
p_features = 1
noise = 0.2


# create some data
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)

# plot it
fig, axarr = plt.subplots(1, 2, sharex = True, sharey = True)
axarr[0].scatter(X_train, y_train)
axarr[1].scatter(X_val, y_val)
labs = axarr[0].set(title = "Training", xlabel = "x", ylabel = "y")
labs = axarr[1].set(title = "Validation", xlabel = "x")
plt.tight_layout()


LR.fit_analytic(X_train, y_train) 
print(f"Training score = {LR.score(X_train, y_train).round(4)}")
print(f"Validation score = {LR.score(X_val, y_val).round(4)}")
Training score = 0.4765
Validation score = 0.4931
LR2 = LinearRegression()

LR2.fit_gradient(X_train, y_train, alpha = 0.01, max_iter = int(1e2))

Then, I can plot the score_history of the gradient descent to see how the score evolved up to the maximum number of iterations. By observation, the score evolves monotonically since we're not using stochastic gradient descent.

plt.plot(LR2.score_history)
labels = plt.gca().set(title = "Evolution of Training Score", xlabel = "Iteration", ylabel = "Score")

Experiment 1: Increasing p_features with constant n_train

For this first experiment, I've chosen to increase the number of p_features to \(10\), then \(50\), and finally to n_train - 1.

np.random.seed(4)

n_train = 100
n_val = 100
p_features = 10
noise = 0.2

# create some data
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
# Use our fit_analytic method to get training and validation score
LR1 = LinearRegression()

LR1.fit_analytic(X_train, y_train) 

LR1_train_score = LR1.score(X_train, y_train).round(4)
LR1_validation_score = LR1.score(X_val, y_val).round(4)
np.random.seed(2)

p_features = 50
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
# Use our fit_analytic method to get training and validation score
LR2 = LinearRegression()

LR2.fit_analytic(X_train, y_train) 

LR2_train_score = LR2.score(X_train, y_train).round(4)
LR2_validation_score = LR2.score(X_val, y_val).round(4)
np.random.seed(3)

p_features = n_train - 1 
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
# Use our fit_analytic method to get training and validation score
LR3 = LinearRegression()

LR3.fit_analytic(X_train, y_train) 

LR3_train_score = LR3.score(X_train, y_train).round(4)
LR3_validation_score = LR3.score(X_val, y_val).round(4)

We can visualize the training and validation scores to observe how they differ in each experiment as we increase the number of p_features.

# Code from https://www.geeksforgeeks.org/bar-plot-in-matplotlib/
bar_width = 0.2

training_scores = [LR1_train_score, LR2_train_score, LR3_train_score]
validation_scores = [LR1_validation_score, LR2_validation_score, LR3_validation_score]

# Set position of bar on X axis
br1 = np.arange(len(training_scores))
br2 = [x + bar_width for x in br1]

# Make the plot
plt.bar(br1, training_scores, width = bar_width,
        edgecolor ='grey', label ='training score')
plt.bar(br2, validation_scores, width = bar_width,
        edgecolor ='grey', label ='validation score')

plt.title('Number of p_features vs score with constant n_train')

# Adding Xticks
plt.xlabel('Number of p_features', fontsize = 12)
plt.ylabel('Score', fontsize = 12)
plt.xticks([r + 0.1 for r in range(len(training_scores))],
        ['10', '50', 'n_train - 1'])

legend = plt.legend()
plt.show()

As shown in the graph above, as the number of p_features increases up to n_train - 1, the fitted model's training score also increases while the validation score decreases. The large gap between training and validation scores for the model with n_train - 1 p_features indicates overfitting: the model has fit the noise in the training data almost exactly, so the optimized weight vector \(w\) generalizes poorly when evaluated on the validation data.

Experiment 2: LASSO Regularization

Using LASSO regularization, we modify the original loss function to add a regularization term:

\[L(w) = ||Xw - y||^{2}_{2} + \alpha||w'||_{1}\]

The regularization term penalizes large weights, shrinking the weight vector \(w\) and tending to force some of its entries to be exactly zero, which effectively performs feature selection.
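As a quick illustrative check (not part of my original experiments; the data and names below are made up for demonstration), we can fit sklearn's Lasso on synthetic data where only one feature matters and count how many coefficients it drives exactly to zero:

from sklearn.linear_model import Lasso
import numpy as np

# Synthetic data: only the first of 20 features actually influences y.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 20))
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.standard_normal(100)

lasso_demo = Lasso(alpha = 0.01).fit(X_demo, y_demo)
print((lasso_demo.coef_ == 0).sum(), "of", lasso_demo.coef_.size, "coefficients are exactly zero")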

Varying Degrees of Alpha

For this experiment, we can choose varying degrees of alpha while increasing the number of p_features of our data.

from sklearn.linear_model import Lasso
# Alpha of 0.001
L1 = Lasso(alpha = 0.001)
p_features = 10

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L1.fit(X_train, y_train)
L1_validation_1 = L1.score(X_val, y_val)
p_features = 50

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L1.fit(X_train, y_train)
L1_validation_2 = L1.score(X_val, y_val)
p_features = n_train - 1 

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L1.fit(X_train, y_train)
L1_validation_3 = L1.score(X_val, y_val)

After fitting the Lasso model with an alpha of \(0.001\), we can plot its validation scores in contrast to the standard linear regression.

bar_width = 0.2

lasso_validation = [L1_validation_1, L1_validation_2, L1_validation_3]
linear_validation = [LR1_validation_score, LR2_validation_score, LR3_validation_score]

# Set position of bar on X axis
br1 = np.arange(len(lasso_validation))
br2 = [x + bar_width for x in br1]

# Make the plot
plt.bar(br2, linear_validation, width = bar_width,
        edgecolor ='grey', label ='standard linear regression')
plt.bar(br1, lasso_validation, width = bar_width,
        edgecolor ='grey', label ='lasso regularization')

plt.title('Number of p_features vs Validation Score for Lasso Regularization and Standard Linear Regression')

# Adding Xticks
plt.xlabel('Number of p_features', fontsize = 12)
plt.ylabel('Validation Score', fontsize = 12)
plt.xticks([r + 0.1 for r in range(len(lasso_validation))],
        ['10', '50', 'n_train - 1'])

legend = plt.legend()
plt.show()

# Alpha of 0.01
L2 = Lasso(alpha = 0.01)
p_features = 10

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L2.fit(X_train, y_train)
L2_validation_1 = L2.score(X_val, y_val)
p_features = 50

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L2.fit(X_train, y_train)
L2_validation_2 = L2.score(X_val, y_val)
p_features = n_train - 1

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L2.fit(X_train, y_train)
L2_validation_3 = L2.score(X_val, y_val)

After fitting the Lasso model with an alpha of \(0.01\), we can plot its validation scores in contrast to the standard linear regression.

bar_width = 0.2

lasso_validation = [L2_validation_1, L2_validation_2, L2_validation_3]
linear_validation = [LR1_validation_score, LR2_validation_score, LR3_validation_score]

# Set position of bar on X axis
br1 = np.arange(len(lasso_validation))
br2 = [x + bar_width for x in br1]

# Make the plot
plt.bar(br2, linear_validation, width = bar_width,
        edgecolor ='grey', label ='standard linear regression')
plt.bar(br1, lasso_validation, width = bar_width,
        edgecolor ='grey', label ='lasso regularization score')

plt.title('Number of p_features vs Validation Score for Lasso Regularization and Standard Linear Regression')

# Adding Xticks
plt.xlabel('Number of p_features', fontsize = 12)
plt.ylabel('Validation Score', fontsize = 12)
plt.xticks([r + 0.1 for r in range(len(lasso_validation))],
        ['10', '50', 'n_train - 1'])

legend = plt.legend()
plt.show()

# Alpha of 0.1
L3 = Lasso(alpha = 0.1)
p_features = 10

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L3.fit(X_train, y_train)
L3_validation_1 = L3.score(X_val, y_val)
p_features = 50

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L3.fit(X_train, y_train)
L3_validation_2 = L3.score(X_val, y_val)
p_features = n_train - 1

# Fit our model
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L3.fit(X_train, y_train)
L3_validation_3 = L3.score(X_val, y_val)

After fitting the Lasso model with an alpha of \(0.1\), we can plot its validation scores in contrast to the standard linear regression.

bar_width = 0.2

lasso_validation = [L3_validation_1, L3_validation_2, L3_validation_3]
linear_validation = [LR1_validation_score, LR2_validation_score, LR3_validation_score]

# Set position of bar on X axis
br1 = np.arange(len(lasso_validation))
br2 = [x + bar_width for x in br1]

# Make the plot
plt.bar(br2, linear_validation, width = bar_width,
        edgecolor ='grey', label ='standard linear regression')
plt.bar(br1, lasso_validation, width = bar_width,
        edgecolor ='grey', label ='lasso regularization')

plt.title('Number of p_features vs Validation Scores for Lasso Regularization and Standard Linear Regression')

# Adding Xticks
plt.xlabel('Number of p_features', fontsize = 12)
plt.ylabel('Validation Score', fontsize = 12)
plt.xticks([r + 0.1 for r in range(len(lasso_validation))],
        ['10', '50', 'n_train - 1'])

legend = plt.legend()

# Plotting all three alpha levels

bar_width = 0.2

lasso_validation_1 = [L1_validation_1, L1_validation_2, L1_validation_3]
lasso_validation_2 = [L2_validation_1, L2_validation_2, L2_validation_3]
lasso_validation_3 = [L3_validation_1, L3_validation_2, L3_validation_3]

# Set position of bar on X axis
br1 = np.arange(len(lasso_validation_1))
br2 = [x + bar_width for x in br1]
br3 = [x + bar_width for x in br2]

# Make the plot
plt.bar(br1, lasso_validation_1, width = bar_width,
        edgecolor ='grey', label ='alpha = 0.001')
plt.bar(br2, lasso_validation_2, width = bar_width,
        edgecolor ='grey', label ='alpha = 0.01')
plt.bar(br3, lasso_validation_3, width = bar_width,
        edgecolor ='grey', label ='alpha = 0.1')

plt.title('Number of p_features vs Validation Score for Lasso Regularization with Varying Alpha Degrees')

# Adding Xticks
plt.xlabel('Number of p_features', fontsize = 12)
plt.ylabel('Validation Score', fontsize = 12)
plt.xticks([r + 0.1 for r in range(len(lasso_validation_1))],
        ['10', '50', 'n_train - 1'])

legend = plt.legend()

After plotting all three values of alpha with the number of p_features increasing up to n_train - 1, I found that a small alpha (alpha = \(0.001\)) still retains a moderately high validation score even when the number of p_features reaches n_train - 1. However, as I increase alpha to \(0.01\) and \(0.1\), the gap in validation score between LASSO regularization and standard linear regression becomes significantly larger. As the strength of the regularizer increases, the validation score for LASSO regularization approaches zero, and the model is no longer accurate in predicting the true labels.

In conclusion, LASSO regularization with a small alpha can improve a model's validation score relative to standard linear regression, even with up to n_train - 1 p_features. However, as the strength of the regularization increases, the validation score decreases significantly and the model's predictions degrade.

Bikeshare Data Set

We can use our LinearRegression class to perform a case study over the data set related to the Capital Bikeshare system in Washington DC.

Shown below is a preview of what the bikeshare data set looks like.

import pandas as pd
from sklearn.model_selection import train_test_split
bikeshare = pd.read_csv("https://philchodrow.github.io/PIC16A/datasets/Bike-Sharing-Dataset/day.csv")

bikeshare.head()
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 6 0 2 0.344167 0.363625 0.805833 0.160446 331 654 985
1 2 2011-01-02 1 0 1 0 0 0 2 0.363478 0.353739 0.696087 0.248539 131 670 801
2 3 2011-01-03 1 0 1 0 1 1 1 0.196364 0.189405 0.437273 0.248309 120 1229 1349
3 4 2011-01-04 1 0 1 0 2 1 1 0.200000 0.212122 0.590435 0.160296 108 1454 1562
4 5 2011-01-05 1 0 1 0 3 1 1 0.226957 0.229270 0.436957 0.186900 82 1518 1600

Here, we can plot the number of casual users’ bike usage per day.

import datetime
fig, ax = plt.subplots(1, figsize = (7, 3))
ax.plot(pd.to_datetime(bikeshare['dteday']), bikeshare['casual'])
ax.set(xlabel = "Day", ylabel = "# of casual users")
l = plt.tight_layout()

We can reduce the data set to a subset of columns and convert the month column into dummy variables, which we can later use in our predictions.

cols = ["casual", 
        "mnth", 
        "weathersit", 
        "workingday",
        "yr",
        "temp", 
        "hum", 
        "windspeed",
        "holiday"]

bikeshare = bikeshare[cols]

bikeshare = pd.get_dummies(bikeshare, columns = ['mnth'], drop_first = "if_binary")
bikeshare
casual weathersit workingday yr temp hum windspeed holiday mnth_2 mnth_3 mnth_4 mnth_5 mnth_6 mnth_7 mnth_8 mnth_9 mnth_10 mnth_11 mnth_12
0 331 2 0 0 0.344167 0.805833 0.160446 0 0 0 0 0 0 0 0 0 0 0 0
1 131 2 0 0 0.363478 0.696087 0.248539 0 0 0 0 0 0 0 0 0 0 0 0
2 120 1 1 0 0.196364 0.437273 0.248309 0 0 0 0 0 0 0 0 0 0 0 0
3 108 1 1 0 0.200000 0.590435 0.160296 0 0 0 0 0 0 0 0 0 0 0 0
4 82 1 1 0 0.226957 0.436957 0.186900 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
726 247 2 1 1 0.254167 0.652917 0.350133 0 0 0 0 0 0 0 0 0 0 0 1
727 644 2 1 1 0.253333 0.590000 0.155471 0 0 0 0 0 0 0 0 0 0 0 1
728 159 2 0 1 0.253333 0.752917 0.124383 0 0 0 0 0 0 0 0 0 0 0 1
729 364 1 0 1 0.255833 0.483333 0.350754 0 0 0 0 0 0 0 0 0 0 0 1
730 439 2 1 1 0.215833 0.577500 0.154846 0 0 0 0 0 0 0 0 0 0 0 1

731 rows × 19 columns

Now we can do a train-test split using the dummified data set:

train, test = train_test_split(bikeshare, test_size = .2, shuffle = False)

X_train = train.drop(["casual"], axis = 1)
y_train = train["casual"]

X_test = test.drop(["casual"], axis = 1)
y_test = test["casual"]

After splitting the data set into training and testing data, we can now use our LinearRegression class to score the model on the test data:

X_train.shape
LR = LinearRegression()

LR.fit_analytic(X_train, y_train)
LR.score(X_test, y_test)
0.6967732383931238

We observe that our LinearRegression model does moderately well when validated on the test data, achieving a score of about \(0.697\).

Next, we can plot the predicted casual ridership against the actual casual ridership. While the predictions do not match the actual ridership exactly on any given day, the predicted ridership follows the same pattern of peaks and dips as the actual ridership.

bike_pred = LR.predict(X_test)

num_steps = len(X_test)

# Plot bike_pred vs actual ridership
plt.plot(np.arange(num_steps) + 1, bike_pred, label = "predicted ridership")
plt.plot(np.arange(num_steps) + 1, y_test, label = "actual ridership")

plt.xlabel("Day")
plt.ylabel("Total Casual Users")
plt.title("Predicted and Actual Casual Users vs Day")

legend = plt.legend()

By looking at the entries of the model's weight vector w corresponding to X_train.columns, we can assess which positive and negative coefficients most strongly impact casual ridership per day.

import pandas as pd

coefficients = np.array(LR.w[0:18])
features = np.array(X_train.columns)

dataset = pd.DataFrame({'feature': features, 'coefficients': coefficients}, columns = ['feature', 'coefficients']) 
dataset.sort_values(by=['coefficients'], ascending = False)
feature coefficients
3 temp 1498.715113
10 mnth_5 537.301886
9 mnth_4 518.408753
15 mnth_10 437.600848
14 mnth_9 371.503854
8 mnth_3 369.271956
11 mnth_6 360.807998
2 yr 280.586927
16 mnth_11 252.433004
13 mnth_8 241.316412
12 mnth_7 228.881481
17 mnth_12 90.821460
7 mnth_2 -3.354397
0 weathersit -108.371136
6 holiday -235.879349
4 hum -490.100340
1 workingday -791.690549
5 windspeed -1242.800381

I've created a sorted dataframe pairing the entries of \(w\) with the corresponding feature names from X_train.columns. We see that the most impactful feature is temp: its coefficient of about 1500 means that moving from the lowest to the highest normalized temperature adds roughly 1500 casual riders. Additionally, nearly all of the month dummies have positive coefficients, with the spring and fall months contributing most, which also boosts ridership. In contrast, weathersit, workingday, and holiday contribute negatively to casual ridership, with windspeed being the most negative of all.