from gradient import LogisticRegression
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
all='ignore')
np.seterr(
4)
np.random.seed(
# Generate a non-linearly separable data
= 3
p_features = make_blobs(n_samples = 200, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])
X, y
= plt.scatter(X[:,0], X[:,1], c = y)
fig = plt.xlabel("Feature 1")
xlab = plt.ylabel("Feature 2") ylab
Gradient Descent Algorithm
Here is the link to my source code: Gradient Descent
Gradient Descent Implementation
To implement gradient descent, I created a gradient_empirical
method which would calculate the gradient of the empirical risk formula. This method uses several other methods: predict
, and loss_deriv
. The predict
method is simply a way of generating our predicted \(y_{hat}\) by dotting the padded feature matrix \(X\) with the current weight vector \(w\). The loss_deriv
method returns the derivative of our loss function by taking our predicted \(y_{hat}\) and computing it on the \(sigmoid\) function subtracted by the true label \(y\) vector.
Next, in my fit
method, I used gradient_empirical
on the padded feature matrix \(X\) with the true label vector \(y\) for each iteration. Using the result of my gradient empirical
method, I can update the next weight vector \(w^{k+1}\) to be equal to \(w^{k} - \alpha * gradient\). These updates continue until either the max_epochs
have been reached or the loss
of \(w^{k}\) and \(w^{k+1}\) no longer shows any computational difference.
Stochastic Gradient Descent with Momentum Implementation
To implement stochastic gradient descent with momentum, I implemented a new fit method called fit_stochastic
. Inside this method, I create an optional parameter called momentum
which when set to \(True\), initializes \(\beta\) to \(0.8\), otherwise, sets \(\beta\) to \(0\) which is just the regular stochastic gradient descent. Stochastic gradient is implemented by grabbing a subset of our data to compute the gradient onto with a parameter batch_size
to determine the size of the subset. For every iteration of an epoch
, we compute the gradient for that batch of data and update \(w\) by computing \(w^{k} - \alpha(gradient) + \beta(w^{k} - w^{k-1})\). I did this by storing the previous and current \(w\) in separate variables, initial_w
and curr_w
and assign initial_w = curr_w
after each update of the current \(w\) vector. At the end of an epoch, I compute the loss and append to it to the model’s loss_history
.
Testing the Gradient Descent Algorithm
# Testing all the gradient descent evolutions
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 2000,
max_epochs = True,
momentum = 10,
batch_size = .05)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "stochastic gradient (momentum)")
plt.plot(np.arange(num_steps)
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 2000,
max_epochs = False,
momentum = 10,
batch_size = .05)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "stochastic gradient")
plt.plot(np.arange(num_steps)
= LogisticRegression()
LR = .05, max_epochs = 2000)
LR.fit(X, y, alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "gradient")
plt.plot(np.arange(num_steps)
plt.loglog()
= plt.legend() legend
Testing the Logistic Regression class, we see that the Stochastic Gradient with Momentum converges to a minimizer the fastest with regular Stochastic Gradient converging the second fastest. Lastly, regular Gradient Descent eventually converged to a minimizer at the slowest rate taking more epochs than both Stochastic Gradient methods.
Experiments
Experiment 1
For our \(\textbf{first experiment}\), we will demonstrate for when the learning rate is too large, that the gradient descent method will not converge to a minimizer. Below, I’ve generated data that is at least 10 feature dimensions for this experiment.
from gradient import LogisticRegression
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
all='ignore')
np.seterr(
901)
np.random.seed(
# Generate a non-linearly separable data
= 11
p_features = make_blobs(n_samples = 200, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])
X, y
= np.append(X, np.ones((X.shape[0], 1)), 1) X_
# Testing the gradient descent with learning rate = 88
901)
np.random.seed(
= LogisticRegression()
LR
88, 1000)
LR.fit(X, y, = LR.empirical_risk(X_, y)
loss
= plt.subplots(1, 2)
fig, axarr
0].scatter(X[:,0], X[:,1], c = y)
axarr[0].set(xlabel = "Feature 1", ylabel = "Feature 2", title = f"Loss = {loss}")
axarr[
= np.linspace(-3, 3, 101)
f1
= axarr[0].plot(f1, (LR.w[2] - f1*LR.w[0])/LR.w[1], color = "black")
p
1].plot(LR.loss_history)
axarr[1].set(xlabel = "Iteration number", ylabel = "Empirical Risk")
axarr[ plt.tight_layout()
# Testing the gradient descent with learning rate = 89
901)
np.random.seed(
= LogisticRegression()
LR
89, 1000)
LR.fit(X, y, = LR.empirical_risk(X_, y)
loss
= plt.subplots(1, 2)
fig, axarr
0].scatter(X[:,0], X[:,1], c = y)
axarr[0].set(xlabel = "Feature 1", ylabel = "Feature 2", title = f"Loss = {loss}")
axarr[
= np.linspace(-3, 3, 101)
f1
= axarr[0].plot(f1, (LR.w[2] - f1*LR.w[0])/LR.w[1], color = "black")
p
1].plot(LR.loss_history)
axarr[1].set(xlabel = "Iteration number", ylabel = "Empirical Risk")
axarr[ plt.tight_layout()
Setting the gradient descent learning rate to be 88 vs 89 to be fitted for 1000 epochs for a datatset with at least 10 features illustrated that there is a threshold for when a minimized loss is found. A possible explanation to this phenomenon could be that a large learning rate overshoots the model to miss the minimizer. Thus, the model would continue going down the direction of the gradient and may oscillate trying to find the minimizer.
Experiment 2
For our \(\textbf{ second experiment}\), we will demonstrate the speed of convergence for different choices of batch size for the stochastic gradient
algorithm.
# Experimenting stochastic gradient models with different batch sizes
901)
np.random.seed(
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 1000,
max_epochs = False,
momentum = 5,
batch_size = .05)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "batch size of 5")
plt.plot(np.arange(num_steps)
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 1000,
max_epochs = False,
momentum = 20,
batch_size = .05)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "batch size of 20")
plt.plot(np.arange(num_steps)
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 1000,
max_epochs = False,
momentum = 50,
batch_size = .05)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "batch size of 50")
plt.plot(np.arange(num_steps)
plt.loglog()
= plt.legend() legend
Shown above displays the loss evolution of three stochastic gradient models of batch sized 5, 20, and 50. By observation, we can see that as the batch size increases, the model requires more epochs to converge to some minimizer solution. In this particular example, the model with batch size 5 converged the fastest, followed the model with batch size 20, and lastly, the model with batch size 50.
Experiment 3
For our \(\textbf{third experiment}\), we will demonstrate the case in which the use of momentum significantly speeds up convergence.
# Experimenting stochastic gradient models with and without momentum
901)
np.random.seed(
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 1000,
max_epochs = False,
momentum = 5,
batch_size = .05)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "stochastic gradient without momentum")
plt.plot(np.arange(num_steps)
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 1000,
max_epochs = True,
momentum = 5,
batch_size = .05)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "stochastic gradient with momentum")
plt.plot(np.arange(num_steps)
plt.loglog()
= plt.legend() legend
Shown above displays the loss evolution of two stochastic models: one with momentum and one without momentum. By observation, we can conclude that the stochastic gradient model with momentum significantly speeds up convergence to some minimizer solution in comparison to the model without momentum.