Gradient Descent Algorithm

Here is the link to my source code: Gradient Descent

Gradient Descent Implementation

To implement gradient descent, I created a gradient_empirical method which would calculate the gradient of the empirical risk formula. This method uses several other methods: predict, and loss_deriv. The predict method is simply a way of generating our predicted \(y_{hat}\) by dotting the padded feature matrix \(X\) with the current weight vector \(w\). The loss_deriv method returns the derivative of our loss function by taking our predicted \(y_{hat}\) and computing it on the \(sigmoid\) function subtracted by the true label \(y\) vector.

Next, in my fit method, I used gradient_empirical on the padded feature matrix \(X\) with the true label vector \(y\) for each iteration. Using the result of my gradient empirical method, I can update the next weight vector \(w^{k+1}\) to be equal to \(w^{k} - \alpha * gradient\). These updates continue until either the max_epochs have been reached or the loss of \(w^{k}\) and \(w^{k+1}\) no longer shows any computational difference.

Stochastic Gradient Descent with Momentum Implementation

To implement stochastic gradient descent with momentum, I implemented a new fit method called fit_stochastic. Inside this method, I create an optional parameter called momentum which when set to \(True\), initializes \(\beta\) to \(0.8\), otherwise, sets \(\beta\) to \(0\) which is just the regular stochastic gradient descent. Stochastic gradient is implemented by grabbing a subset of our data to compute the gradient onto with a parameter batch_size to determine the size of the subset. For every iteration of an epoch, we compute the gradient for that batch of data and update \(w\) by computing \(w^{k} - \alpha(gradient) + \beta(w^{k} - w^{k-1})\). I did this by storing the previous and current \(w\) in separate variables, initial_w and curr_w and assign initial_w = curr_w after each update of the current \(w\) vector. At the end of an epoch, I compute the loss and append to it to the model’s loss_history.

Testing the Gradient Descent Algorithm

from gradient import LogisticRegression 
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
np.seterr(all='ignore') 

np.random.seed(4)

# Generate a non-linearly separable data
p_features = 3
X, y = make_blobs(n_samples = 200, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])

fig = plt.scatter(X[:,0], X[:,1], c = y)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

# Testing all the gradient descent evolutions

LR = LogisticRegression()
LR.fit_stochastic(X, y, 
                  max_epochs = 2000, 
                  momentum = True, 
                  batch_size = 10, 
                  alpha = .05) 

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient (momentum)")

LR = LogisticRegression()
LR.fit_stochastic(X, y, 
                  max_epochs = 2000, 
                  momentum = False, 
                  batch_size = 10, 
                  alpha = .05)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient")

LR = LogisticRegression()
LR.fit(X, y, alpha = .05, max_epochs = 2000)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "gradient")

plt.loglog()

legend = plt.legend()

Testing the Logistic Regression class, we see that the Stochastic Gradient with Momentum converges to a minimizer the fastest with regular Stochastic Gradient converging the second fastest. Lastly, regular Gradient Descent eventually converged to a minimizer at the slowest rate taking more epochs than both Stochastic Gradient methods.

Experiments

Experiment 1

For our \(\textbf{first experiment}\), we will demonstrate for when the learning rate is too large, that the gradient descent method will not converge to a minimizer. Below, I’ve generated data that is at least 10 feature dimensions for this experiment.

from gradient import LogisticRegression 
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
np.seterr(all='ignore') 

np.random.seed(901)

# Generate a non-linearly separable data
p_features = 11
X, y = make_blobs(n_samples = 200, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])

X_ = np.append(X, np.ones((X.shape[0], 1)), 1)

# Testing the gradient descent with learning rate = 88

np.random.seed(901)

LR = LogisticRegression()

LR.fit(X, y, 88, 1000)
loss = LR.empirical_risk(X_, y)

fig, axarr = plt.subplots(1, 2)

axarr[0].scatter(X[:,0], X[:,1], c = y)
axarr[0].set(xlabel = "Feature 1", ylabel = "Feature 2", title = f"Loss = {loss}")

f1 = np.linspace(-3, 3, 101)

p = axarr[0].plot(f1, (LR.w[2] - f1*LR.w[0])/LR.w[1], color = "black")

axarr[1].plot(LR.loss_history)
axarr[1].set(xlabel = "Iteration number", ylabel = "Empirical Risk")
plt.tight_layout()

# Testing the gradient descent with learning rate = 89

np.random.seed(901)

LR = LogisticRegression()

LR.fit(X, y, 89, 1000)
loss = LR.empirical_risk(X_, y)

fig, axarr = plt.subplots(1, 2)

axarr[0].scatter(X[:,0], X[:,1], c = y)
axarr[0].set(xlabel = "Feature 1", ylabel = "Feature 2", title = f"Loss = {loss}")

f1 = np.linspace(-3, 3, 101)

p = axarr[0].plot(f1, (LR.w[2] - f1*LR.w[0])/LR.w[1], color = "black")

axarr[1].plot(LR.loss_history)
axarr[1].set(xlabel = "Iteration number", ylabel = "Empirical Risk")
plt.tight_layout()

Setting the gradient descent learning rate to be 88 vs 89 to be fitted for 1000 epochs for a datatset with at least 10 features illustrated that there is a threshold for when a minimized loss is found. A possible explanation to this phenomenon could be that a large learning rate overshoots the model to miss the minimizer. Thus, the model would continue going down the direction of the gradient and may oscillate trying to find the minimizer.

Experiment 2

For our \(\textbf{ second experiment}\), we will demonstrate the speed of convergence for different choices of batch size for the stochastic gradient algorithm.

# Experimenting stochastic gradient models with different batch sizes

np.random.seed(901)

LR = LogisticRegression()
LR.fit_stochastic(X, y, 
                  max_epochs = 1000, 
                  momentum = False, 
                  batch_size = 5, 
                  alpha = .05)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "batch size of 5")

LR = LogisticRegression()
LR.fit_stochastic(X, y, 
                  max_epochs = 1000, 
                  momentum = False, 
                  batch_size = 20, 
                  alpha = .05)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "batch size of 20")

LR = LogisticRegression()
LR.fit_stochastic(X, y, 
                  max_epochs = 1000, 
                  momentum = False, 
                  batch_size = 50,
                  alpha = .05) 

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "batch size of 50")

plt.loglog()

legend = plt.legend()

Shown above displays the loss evolution of three stochastic gradient models of batch sized 5, 20, and 50. By observation, we can see that as the batch size increases, the model requires more epochs to converge to some minimizer solution. In this particular example, the model with batch size 5 converged the fastest, followed the model with batch size 20, and lastly, the model with batch size 50.

Experiment 3

For our \(\textbf{third experiment}\), we will demonstrate the case in which the use of momentum significantly speeds up convergence.

# Experimenting stochastic gradient models with and without momentum

np.random.seed(901)

LR = LogisticRegression()
LR.fit_stochastic(X, y, 
                  max_epochs = 1000, 
                  momentum = False, 
                  batch_size = 5, 
                  alpha = .05)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient without momentum")

LR = LogisticRegression()
LR.fit_stochastic(X, y, 
                  max_epochs = 1000, 
                  momentum = True, 
                  batch_size = 5, 
                  alpha = .05)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient with momentum")

plt.loglog()

legend = plt.legend()

Shown above displays the loss evolution of two stochastic models: one with momentum and one without momentum. By observation, we can conclude that the stochastic gradient model with momentum significantly speeds up convergence to some minimizer solution in comparison to the model without momentum.