HW4. Backpropagation

42 minute read

Topics: Backpropagation, Multi-layer perceptron


import pickle
import numpy as np
import matplotlib.pyplot as plt
import os
import math
from torchvision.datasets import CIFAR10
download = not os.path.isdir('cifar-10-batches-py')
dset_train = CIFAR10(root='.', download=download)

import torch

Problem 1: Understanding Backpropagation

1.(a) Implement the forward and backward pass 1

Implement the forward and backward pass for \(f(\overset{\rightarrow}{x}, \overset{\rightarrow}{w}) = \frac{1}{1 + \exp\left(-(w_0 x_0 + w_1 x_1 - w_2/x_2 + w_3)\right)}\).

def forward_helper(x0, x1, x2, w0, w1, w2, w3):
    sum_part = -(w0 * x0 + w1 * x1 - w2 / x2 + w3)
    denom = 1 + np.exp(sum_part)
    return 1 / denom
def f_1(x0, x1, x2, w0, w1, w2, w3):
    """
    Computes the forward and backward pass through the computational graph 
    of (a)

    Inputs:
    - x0, x1, x2, w0, w1, w2, w3: Python floats

    Returns a tuple of:
    - L: The output of the graph
    - grads: A tuple (grad_x0, grad_x1, grad_x2, grad_w0, grad_w1, grad_w2, 
      grad_w3)
    giving the derivative of the output L with respect to each input.
    """
    ###########################################################################
    # TODO: Implement the forward pass for the computational graph for (a) and#
    # store the output of this graph as L                                     #
    ###########################################################################
    s0 = x0 * w0
    s1 = x1 * w1
    a = 1 / x2
    s2 = a * w2
    b = s0 + s1
    c = b - s2
    d = c + w3
    e = -1 * d
    f = np.exp(e)
    g = f + 1
    L = 1 / g


    ###########################################################################
    #                              END OF YOUR CODE                           #
    ###########################################################################
    
    ###########################################################################
    # TODO: Implement the backward pass for the computational graph for (a)   #
    # Store the gradients for each input                                      #
    ###########################################################################
    grad_L = 1
    grad_g = -1 / (g**2)
    grad_f = grad_g
    grad_e = np.exp(e) * grad_f
    grad_d = -1 * grad_e
    grad_c = grad_d
    grad_w3 = grad_d
    grad_b = grad_c
    grad_s2 = -1 * grad_c  # c = b - s2, so this gradient picks up a minus sign
    grad_s0 = grad_b
    grad_s1 = grad_b
    grad_x0 = w0 * grad_s0
    grad_w0 = x0 * grad_s0
    grad_x1 = w1 * grad_s1
    grad_w1 = x1 * grad_s1 
    grad_a = w2 * grad_s2
    grad_w2 = a * grad_s2
    grad_x2 = -1 / (x2**2) * grad_a
    
    ###########################################################################
    #                              END OF YOUR CODE                           #
    ###########################################################################

    grads = (grad_x0, grad_x1, grad_x2, grad_w0, grad_w1, grad_w2, grad_w3)
    return L, grads
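
A quick way to verify the hand-derived gradients is to compare them against central finite differences. This is a minimal sketch of my own (the _numeric_grads helper and the test point are illustrative choices, not part of the assignment):

def _numeric_grads(func, args, h=1e-6):
    # Central-difference estimate of the derivative of func w.r.t. each scalar input.
    grads = []
    for i in range(len(args)):
        plus, minus = list(args), list(args)
        plus[i] += h
        minus[i] -= h
        grads.append((func(*plus) - func(*minus)) / (2 * h))
    return grads

# Arbitrary test point (x0, x1, x2, w0, w1, w2, w3); x2 must be nonzero.
args = (0.5, -1.0, 2.0, 1.5, -0.5, 0.3, 0.1)
L, analytic = f_1(*args)
numeric = _numeric_grads(lambda *a: f_1(*a)[0], args)
print(np.max(np.abs(np.array(analytic) - np.array(numeric))))  # should be tiny, around 1e-9 or smaller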

1.(b) Implement the forward and backward pass 2

Implement the forward and backward pass for the computation graph shown below.

[Figure: computational graph for part 1.(b)]

def f_2(w, x, y, z):
    """
    Computes the forward and backward pass through the computational graph 
    of (c)

    Inputs:
    - w, x, y, z: Python floats

    Returns a tuple of:
    - L: The output of the graph
    - grads: A tuple (grad_w, grad_x, grad_y, grad_z)
    giving the derivative of the output L with respect to each input.
    """
    ###########################################################################
    # TODO: Implement the forward pass for the computational graph for (c) and#
    # store the output of this graph as L                                     #
    ###########################################################################
    a = w**(-1)
    b = -x
    e = a**b 
    c = np.exp(y)
    d = np.exp(z)
    c1 = c 
    c2 = c
    d1 = d
    d2 = d
    p = c1 * d1
    p1 = p
    p2 = p 
    f = c2 + p1
    g = d2 / p2
    m = e - f
    n = m / g
    L = n**2

    ###########################################################################
    #                              END OF YOUR CODE                           #
    ###########################################################################

    ###########################################################################
    # TODO: Implement the backward pass for the computational graph for (c)   #
    # Store the gradients for each input                                      #
    ###########################################################################
    grad_L = 1
    grad_n = 2 * n * grad_L
    grad_m = (1 / g) * grad_n
    grad_g = (-m / g**2) * grad_n
    grad_e = grad_m
    grad_f = -grad_m
    grad_a = b * (a**(b - 1)) * grad_e
    grad_b = (a**b) * np.log(a) * grad_e
    grad_w = (-1 / w**2) * grad_a
    grad_x = -grad_b 
    grad_c2 = grad_f
    grad_p1 = grad_f
    grad_d2 = (1 / p2) * grad_g
    grad_p2 = (-d / p2**2) * grad_g
    grad_p = grad_p1 + grad_p2
    grad_c1 = d1 * grad_p
    grad_d1 = c1 * grad_p
    grad_c = grad_c1 + grad_c2
    grad_d = grad_d1 + grad_d2
    grad_y = np.exp(y) * grad_c
    grad_z = np.exp(z) * grad_d
   
    ###########################################################################
    #                              END OF YOUR CODE                           #
    ###########################################################################

    grads = (grad_w, grad_x, grad_y, grad_z)
    return L, grads
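
The same finite-difference comparison applies here. The snippet below assumes the _numeric_grads helper sketched in 1.(a); the test values are arbitrary, chosen so that every intermediate quantity (in particular a**b and log(a)) is well-defined:

# Arbitrary test point (w, x, y, z), with w > 0 so that a = 1/w stays positive.
args = (2.0, 1.5, 0.3, -0.2)
L, analytic = f_2(*args)
numeric = _numeric_grads(lambda *a: f_2(*a)[0], args)
print(np.max(np.abs(np.array(analytic) - np.array(numeric))))  # should be small, roughly 1e-8 or below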

Problem 2: Softmax Classifier with Two Layer Neural Network

In this problem you will develop a two-layer neural network with fully-connected layers to perform classification, and test it out on the CIFAR-10 dataset.

We train the network with a softmax loss function; L2 regularization on the weight matrices is added later in 2.(g). The network uses a ReLU nonlinearity after the first fully connected layer. In other words, the network has the following architecture:

input - fully connected layer - ReLU - fully connected layer - softmax

The outputs of the second fully-connected layer are the scores for each class.

You cannot use any deep learning libraries such as PyTorch in this part.

2.(a) Layers

In this problem, implement the fully-connected layer, ReLU, and the softmax loss. Filling in all the TODOs in the skeleton code is sufficient.

def fc_forward(X, W, b):
    """
    Computes the forward pass for a fully-connected layer.
    
    The input X has shape (N, Din) and contains a minibatch of N
    examples, where each example x[i] has shape (Din,).
    
    Inputs:
    - X: A numpy array containing input data, of shape (N, Din)
    - W: A numpy array of weights, of shape (Din, Dout)
    - b: A numpy array of biases, of shape (Dout,)
    
    Returns a tuple of:
    - out: output, of shape (N, Dout)
    - cache: (X, W, b)
    """
    ###########################################################################
    # TODO: Implement the forward pass. Store the result in out.              #
    ###########################################################################
    out = np.matmul(X, W) + b

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (X, W, b)
    return out, cache


def fc_backward(dout, cache):
    """
    Computes the backward pass for a fully-connected layer.
    
    Inputs:
    - dout: Upstream derivative, of shape (N, Dout)
    - cache: returned by your forward function. Tuple of:
      - X: Input data, of shape (N, Din)
      - W: Weights, of shape (Din, Dout)
      - b: Biases, of shape (Dout,)
      
    Returns a tuple of:
    - dX: Gradient with respect to X, of shape (N, Din)
    - dW: Gradient with respect to W, of shape (Din, Dout)
    - db: Gradient with respect to b, of shape (Dout,)
    """
    X, W, b = cache
    dX, dW, db = None, None, None
    ###########################################################################
    # TODO: Implement the affine backward pass.                               #
    ###########################################################################
    dX = np.matmul(dout, np.transpose(W))
    dW = np.matmul(np.transpose(X), dout)
    db = np.sum(dout, axis = 0)

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dX, dW, db

def relu_forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    out = x.copy()
    ###########################################################################
    # TODO: Implement the ReLU forward pass.                                  #
    ###########################################################################
    out[out < 0] = 0

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = x

    return out, cache


def relu_backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).

    Input:
    - dout: Upstream derivatives, of any shape
    - cache: returned by your forward function. Input x, of same shape as dout

    Returns:
    - dx: Gradient with respect to x
    """
    dx, x = dout.copy(), cache
    ###########################################################################
    # TODO: Implement the ReLU backward pass.                                 #
    ###########################################################################
    dx[x < 0] = 0

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx


def softmax_loss(X, y):
    """
    Computes the loss and gradient for softmax classification.

    Inputs:
    - X: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for X[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dX: Gradient of the loss with respect to x
    """
    loss, dX = None, None

    dX = np.exp(X - np.max(X, axis=1, keepdims=True))
    dX /= np.sum(dX, axis=1, keepdims=True)
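    # dX now holds the softmax probabilities, computed in a numerically stable way
    # (the row-wise max was subtracted before exponentiating).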
    loss = -np.sum(np.log(dX[np.arange(X.shape[0]), y])) / X.shape[0]
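    # Gradient of the averaged cross-entropy w.r.t. the scores: the softmax
    # probabilities, with 1 subtracted at the correct class, divided by N.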
    dX[np.arange(X.shape[0]), y] -= 1
    dX /= X.shape[0]


    return loss, dX
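
The layers above can be sanity-checked numerically before wiring them into a network. The sketch below is my own (not part of the assignment): it compares the analytic gradient of a single weight entry against a central finite difference of the composed fc - ReLU - softmax loss.

np.random.seed(0)
N, Din, Dout = 4, 5, 3
X = np.random.randn(N, Din)
W = np.random.randn(Din, Dout)
b = np.random.randn(Dout)
y = np.random.randint(Dout, size=N)

def loss_of_W(W_):
    # Forward pass fc -> ReLU -> softmax loss as a function of the weights only.
    out, _ = fc_forward(X, W_, b)
    r, _ = relu_forward(out)
    loss, _ = softmax_loss(r, y)
    return loss

# Analytic gradient via the backward passes implemented above.
out, cache_fc = fc_forward(X, W, b)
r, cache_relu = relu_forward(out)
loss, dscores = softmax_loss(r, y)
dr = relu_backward(dscores, cache_relu)
dX, dW, db = fc_backward(dr, cache_fc)

# Central finite difference for one arbitrary weight entry.
h, i, j = 1e-6, 2, 1
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += h
Wm[i, j] -= h
numeric = (loss_of_W(Wp) - loss_of_W(Wm)) / (2 * h)
print(dW[i, j], numeric)  # the two values should agree to several decimal places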

2.(b) Two Layer Softmax Classifier

In this problem, implement the two-layer softmax classifier described above.

class SoftmaxClassifier(object):
    """
    A fully-connected neural network with
    softmax loss that uses a modular layer design. We assume an input dimension
    of D, a hidden dimension of H, and perform classification over C classes.

    The architecture should be fc - relu - fc - softmax with one hidden layer

    The learnable parameters of the model are stored in the dictionary
    self.params that maps parameter names to numpy arrays.
    """

    def __init__(self, input_dim=3072, hidden_dim=300, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input
        - hidden_dim: An integer giving the size of the hidden layer, None
          if there's no hidden layer.
        - num_classes: An integer giving the number of classes to classify
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        """
        self.params = {}
        self.reg = reg
        ############################################################################
        # TODO: Initialize the weights and biases of the two-layer net. Weights    #
        # should be initialized from a Gaussian centered at 0.0 with               #
        # standard deviation equal to weight_scale, and biases should be           #
        # initialized to zero. All weights and biases should be stored in the      #
        # dictionary self.params, with fc weights and biases using the keys        #
        # 'W' and 'b', i.e., W1, b1 for the weights and bias in the first linear   #
        # layer, W2, b2 for the weights and bias in the second linear layer.       #
        ############################################################################
        W1 = np.random.normal(0.0,  weight_scale, (input_dim, hidden_dim))
        b1 = np.zeros((hidden_dim,))

        W2 = np.random.normal(0.0,  weight_scale, (hidden_dim, num_classes))
        b2 = np.zeros((num_classes,)) 

        self.params["W1"] = W1 
        self.params["b1"] = b1 
        self.params["W2"] = W2 
        self.params["b2"] = b2 

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################


    def forwards_backwards(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: Array of input data of shape (N, Din)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass. And
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the two-layer net, computing the    #
        # class scores for X and storing them in the scores variable.              #
        ############################################################################
        W1 = self.params["W1"]
        b1 = self.params["b1"]
        W2 = self.params["W2"]
        b2 = self.params["b2"]

        a_1, cache_fc1 = fc_forward(X, W1, b1)
        s_1, cache_relu = relu_forward(a_1)
        scores, cache_fc2 = fc_forward(s_1, W2, b2)
      
        
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the loss  #
        # in the loss variable and gradients in the grads dictionary. Compute data #
        # loss using softmax, and make sure that grads[k] holds the gradients for  #
        # self.params[k].                                                          # 
        ############################################################################
        loss, grad_L = softmax_loss(scores, y)
        ds_1, dW2, db2 = fc_backward(grad_L, cache_fc2)
        da_1 = relu_backward(ds_1, cache_relu)
        _, dW1, db1 = fc_backward(da_1, cache_fc1)

        grads["W1"] = dW1
        grads["b1"] = db1
        grads["W2"] = dW2
        grads["b2"] = db2


        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        ############################################################################
        # TODO: 2.(g) (EECS 504 only) Add L2 regularization                        #
        ############################################################################
        loss += self.reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
        grads["W1"] += 2 * self.reg * W1
        grads["W2"] += 2 * self.reg * W2


        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return loss, grads
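
Before training on CIFAR-10, a quick sanity check on random data can confirm the shapes and the initial loss. This is a sketch of my own (toy_model, X_toy, and y_toy are illustrative names); with near-zero initial scores and 10 classes, the softmax loss should start near ln(10) ≈ 2.30.

np.random.seed(0)
toy_model = SoftmaxClassifier(input_dim=3072, hidden_dim=300, num_classes=10)
X_toy = np.random.randn(8, 3072)
y_toy = np.random.randint(10, size=8)

scores = toy_model.forwards_backwards(X_toy)           # test-time forward pass
print(scores.shape)                                    # (8, 10)

loss, grads = toy_model.forwards_backwards(X_toy, y_toy)
print(loss)                                            # roughly 2.3 at initialization
print({k: v.shape for k, v in grads.items()})          # gradient shapes match self.params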

  

2.(c) Training

In this problem, you need to preprocess the images and set up the model hyperparameters. Note that adjusting the training/validation split is optional.

def unpickle(file):
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding="latin1")
    return batch

def load_cifar10():
    data = {}
    meta = unpickle("cifar-10-batches-py/batches.meta")
    batch1 = unpickle("cifar-10-batches-py/data_batch_1")
    batch2 = unpickle("cifar-10-batches-py/data_batch_2")
    batch3 = unpickle("cifar-10-batches-py/data_batch_3")
    batch4 = unpickle("cifar-10-batches-py/data_batch_4")
    batch5 = unpickle("cifar-10-batches-py/data_batch_5")
    test_batch = unpickle("cifar-10-batches-py/test_batch")
    X_train = np.vstack((batch1['data'], batch2['data'], batch3['data'],\
                         batch4['data'], batch5['data']))
    Y_train = np.array(batch1['labels'] + batch2['labels'] + batch3['labels'] + 
                       batch4['labels'] + batch5['labels'])
    X_test = test_batch['data']
    Y_test = np.array(test_batch['labels'])
    
    #Preprocess images here                                     
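    # Per-image standardization: zero mean and unit variance over each image's 3072 pixel values.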
    X_train = (X_train-np.mean(X_train,axis=1,keepdims=True))/np.std(X_train,axis=1,keepdims=True)
    X_test = (X_test-np.mean(X_test,axis=1,keepdims=True))/np.std(X_test,axis=1,keepdims=True)

    data['X_train'] = X_train[:40000]
    data['y_train'] = Y_train[:40000]
    data['X_val'] = X_train[40000:]
    data['y_val'] = Y_train[40000:]
    data['X_test'] = X_test
    data['y_test'] = Y_test
    return data

def testNetwork(model, X, y, num_samples=None, batch_size=100):
    """
    Check accuracy of the model on the provided data.

    Inputs:
    - model: Image classifier
    - X: Array of data, of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,)
    - num_samples: If not None, subsample the data and only test the model
      on num_samples datapoints.
    - batch_size: Split X and y into batches of this size to avoid using
      too much memory.

    Returns:
    - acc: Scalar giving the fraction of instances that were correctly
      classified by the model.
    """

    # Subsample the data
    N = X.shape[0]
    if num_samples is not None and N > num_samples:
        mask = np.random.choice(N, num_samples)
        N = num_samples
        X = X[mask]
        y = y[mask]

    # Compute predictions in batches
    num_batches = N // batch_size
    if N % batch_size != 0:
        num_batches += 1
    y_pred = []
    for i in range(num_batches):
        start = i * batch_size
        end = (i + 1) * batch_size
        scores = model.forwards_backwards(X[start:end])
        y_pred.append(np.argmax(scores, axis=1))
    y_pred = np.hstack(y_pred)
    acc = np.mean(y_pred == y)

    return acc

def SGD(W,dW, learning_rate=1e-3):
    """ Apply a gradient descent step on weight W 
    Inputs:
        W : Weight matrix
        dW : gradient of weight, same shape as W
        learning_rate : Learning rate. Defaults to 1e-3.
    Returns:
        new_W: Updated weight matrix
    """

    # Apply a gradient descent step on weight W using the gradient dW and the specified learning rate.
    new_W = W - learning_rate * dW

    return new_W

def trainNetwork(model, data, **kwargs):
    """
     Required arguments:
    - model: Image classifier
    - data: A dictionary of training and validation data containing:
      'X_train': Array, shape (N_train, d_1, ..., d_k) of training images
      'X_val': Array, shape (N_val, d_1, ..., d_k) of validation images
      'y_train': Array, shape (N_train,) of labels for training images
      'y_val': Array, shape (N_val,) of labels for validation images

    Optional arguments:
    - learning_rate: A scalar for initial learning rate.
    - lr_decay: A scalar for learning rate decay; after each epoch the
      learning rate is multiplied by this value.
    - batch_size: Size of minibatches used to compute loss and gradient
      during training.
    - num_epochs: The number of epochs to run for during training.
    - print_every: Integer; training losses will be printed every
      print_every iterations.
    - verbose: Boolean; if set to false then no output will be printed
      during training.
    - num_train_samples: Number of training samples used to check training
      accuracy; default is 1000; set to None to use entire training set.
    - num_val_samples: Number of validation samples to use to check val
      accuracy; default is None, which uses the entire validation set.
    - optimizer: Choice of using either 'SGD' or 'SGD_Momentum' for updating weights; default is SGD.
    """
    
    
    learning_rate =  kwargs.pop('learning_rate', 1e-3)
    lr_decay = kwargs.pop('lr_decay', 1.0)
    batch_size = kwargs.pop('batch_size', 100)
    num_epochs = kwargs.pop('num_epochs', 10)
    num_train_samples = kwargs.pop('num_train_samples', 1000)
    num_val_samples = kwargs.pop('num_val_samples', None)
    print_every = kwargs.pop('print_every', 10)   
    verbose = kwargs.pop('verbose', True)
    optimizer = kwargs.pop('optimizer', 'SGD')
    
    epoch = 0
    best_val_acc = 0
    best_params = {}
    loss_history = []
    train_acc_history = []
    val_acc_history = []
    
    
    num_train = data['X_train'].shape[0]
    iterations_per_epoch = max(num_train // batch_size, 1)
    num_iterations = num_epochs * iterations_per_epoch
    
    #Initialize velocity dictionary if optimizer is SGD_Momentum
    if optimizer == 'SGD_Momentum':
      velocity_dict = {p:np.zeros(w.shape) for p,w in model.params.items()}
      
    for t in range(num_iterations): 
        # Make a minibatch of training data
        batch_mask = np.random.choice(num_train, batch_size)
        X_batch = data['X_train'][batch_mask]
        y_batch = data['y_train'][batch_mask]
        
        # Compute loss and gradient
        loss, grads = model.forwards_backwards(X_batch, y_batch)
        loss_history.append(loss)

        # Perform a parameter update
        if optimizer == 'SGD':
          for p, w in model.params.items(): 
              model.params[p] = SGD(w,grads[p], learning_rate=learning_rate)

        elif optimizer == 'SGD_Momentum':
          for p, w in model.params.items():
              model.params[p], velocity_dict[p] = SGD_Momentum(w, grads[p], velocity_dict[p], beta=0.5, learning_rate=learning_rate)
        else:
          raise NotImplementedError
        # Print training loss
        if verbose and t % print_every == 0:
            print('(Iteration %d / %d) loss: %f' % (
                   t + 1, num_iterations, loss_history[-1]))
         
        # At the end of every epoch, increment the epoch counter and decay
        # the learning rate.
        epoch_end = (t + 1) % iterations_per_epoch == 0
        if epoch_end:
            epoch += 1
            learning_rate *= lr_decay
        
        # Check train and val accuracy on the first iteration, the last
        # iteration, and at the end of each epoch.
        first_it = (t == 0)
        last_it = (t == num_iterations - 1)
        if first_it or last_it or epoch_end:
            train_acc = testNetwork(model, data['X_train'], data['y_train'],
                num_samples= num_train_samples)
            val_acc = testNetwork(model, data['X_val'], data['y_val'],
                num_samples=num_val_samples)
            train_acc_history.append(train_acc)
            val_acc_history.append(val_acc)

            if verbose:
                print('(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                       epoch, num_epochs, train_acc, val_acc))

            # Keep track of the best model
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                best_params = {}
                for k, v in model.params.items():
                    best_params[k] = v.copy()
        
    model.params = best_params
        
    return model, train_acc_history, val_acc_history
        

# load data
data = load_cifar10() 
train_data = { k: data[k] for k in ['X_train', 'y_train', 
                                    'X_val', 'y_val']}

# initialize model
model_SGD = SoftmaxClassifier(hidden_dim = 300, weight_scale=1e-2)

#######################################################################
# TODO: Set up model hyperparameters for SGD                          #
#######################################################################

# set the hyperparameter
learning_rate_SGD = 0.05
lr_decay_SGD = 0.15
batch_size_SGD = 100

# start training using SGD
model_SGD, train_acc_history_SGD, val_acc_history_SGD = trainNetwork(
    model_SGD, train_data, 
    learning_rate = learning_rate_SGD,
    lr_decay=lr_decay_SGD, 
    batch_size=batch_size_SGD,
    num_epochs=10, 
    print_every=1000, optimizer = 'SGD')
#######################################################################
#                         END OF YOUR CODE                            #
#######################################################################

(Iteration 1 / 4000) loss: 2.310917
(Epoch 0 / 10) train acc: 0.156000; val_acc: 0.159400
(Epoch 1 / 10) train acc: 0.445000; val_acc: 0.435200
(Epoch 2 / 10) train acc: 0.494000; val_acc: 0.465600
(Iteration 1001 / 4000) loss: 1.344627
(Epoch 3 / 10) train acc: 0.519000; val_acc: 0.471000
(Epoch 4 / 10) train acc: 0.516000; val_acc: 0.471600
(Epoch 5 / 10) train acc: 0.512000; val_acc: 0.471800
(Iteration 2001 / 4000) loss: 1.488484
(Epoch 6 / 10) train acc: 0.516000; val_acc: 0.471900
(Epoch 7 / 10) train acc: 0.515000; val_acc: 0.471900
(Iteration 3001 / 4000) loss: 1.341380
(Epoch 8 / 10) train acc: 0.497000; val_acc: 0.471900
(Epoch 9 / 10) train acc: 0.502000; val_acc: 0.471900
(Epoch 10 / 10) train acc: 0.519000; val_acc: 0.471900

2.(d) Training with SGD_Momentum

The model above was trained with plain SGD. Now implement SGD with momentum.

def SGD_Momentum(W, dW, velocity, beta=0.5, learning_rate=1e-3):
    """ Apply a gradient descent with momentum update on weight W
    Inputs:
        W : Weight matrix
        dW : gradient of weight, same shape as W
        velocity : velocity matrix, same shape as W
        beta : scalar value in the range [0, 1] weighting the velocity matrix. Setting it to 0 makes SGD_Momentum the same as SGD.
               Defaults to 0.5.
        learning_rate : Learning rate. Defaults to 1e-3.
    Returns:
        new_W: Updated weight matrix
        new_velocity: Updated velocity matrix
    """
    #######################################################################
    # TODO: Apply a gradient descent step on weight W using the gradient dW and the specified learning rate.
    # 1. Calculate the new velocity by using the velocity of last iteration (input velocity) and gradient
    # 2. Update the weights using the new_velocity
    #######################################################################
    new_velocity = beta * velocity + learning_rate * dW
    new_W = W - new_velocity

    #######################################################################
    #                         END OF YOUR CODE                            #
    #######################################################################
    return new_W, new_velocity
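
As a quick consistency check (a sketch of my own, not required by the assignment), setting beta to 0 should make the momentum update reproduce the plain SGD step:

W_check = np.random.randn(3, 4)
dW_check = np.random.randn(3, 4)
v_check = np.zeros_like(W_check)

W_sgd = SGD(W_check, dW_check, learning_rate=1e-2)
W_mom, _ = SGD_Momentum(W_check, dW_check, v_check, beta=0.0, learning_rate=1e-2)
print(np.allclose(W_sgd, W_mom))  # True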

# initialize model
model_SGD_Momentum = SoftmaxClassifier(hidden_dim = 300, weight_scale=1e-2)

# start training 
# using SGD_Momentum as the optimizer
model_SGD_Momentum, train_acc_history_SGD_Momentum, val_acc_history_SGD_Momentum = trainNetwork(
    model_SGD_Momentum, train_data, 
    learning_rate = learning_rate_SGD,
    lr_decay=lr_decay_SGD, 
    batch_size=batch_size_SGD,
    num_epochs=10, 
    print_every=1000, optimizer = 'SGD_Momentum')
(Iteration 1 / 4000) loss: 2.297836
(Epoch 0 / 10) train acc: 0.174000; val_acc: 0.158800
(Epoch 1 / 10) train acc: 0.524000; val_acc: 0.454200
(Epoch 2 / 10) train acc: 0.573000; val_acc: 0.493500
(Iteration 1001 / 4000) loss: 1.495447
(Epoch 3 / 10) train acc: 0.549000; val_acc: 0.500900
(Epoch 4 / 10) train acc: 0.579000; val_acc: 0.501700
(Epoch 5 / 10) train acc: 0.579000; val_acc: 0.502400
(Iteration 2001 / 4000) loss: 1.325367
(Epoch 6 / 10) train acc: 0.584000; val_acc: 0.502600
(Epoch 7 / 10) train acc: 0.562000; val_acc: 0.502600
(Iteration 3001 / 4000) loss: 1.322262
(Epoch 8 / 10) train acc: 0.563000; val_acc: 0.502600
(Epoch 9 / 10) train acc: 0.580000; val_acc: 0.502600
(Epoch 10 / 10) train acc: 0.586000; val_acc: 0.502600

2.(e) Report Accuracy

Run the given code and report the accuracy of model_SGD and model_SGD_Momentum on the test set.

# report test accuracy
acc = testNetwork(model_SGD, data['X_test'], data['y_test'])
print("Test accuracy of model_SGD: {}".format(acc))
# report test accuracy
acc = testNetwork(model_SGD_Momentum, data['X_test'], data['y_test'])
print("Test accuracy of model_SGD_Momentum: {}".format(acc))
Test accuracy of model_SGD: 0.4775
Test accuracy of model_SGD_Momentum: 0.4987

2.(f) Plot

Using train_acc_history and val_acc_history, plot the train and validation accuracy versus epochs on one plot, using SGD and SGD_Momentum as optimizers.

#######################################################################
# Your Code here
#######################################################################
plt.plot(train_acc_history_SGD, label = "train accuracy using SGD")
plt.plot(val_acc_history_SGD, label = "validation accuracy using SGD")
plt.plot(train_acc_history_SGD_Momentum, label = "train accuracy using SGD Momentum")
plt.plot(val_acc_history_SGD_Momentum, label = "validation accuracy using SGD Momentum")
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.legend()
plt.show()

#######################################################################
#                         END OF YOUR CODE                            #
#######################################################################

[Figure: train and validation accuracy versus epochs for SGD and SGD_Momentum]

Report your observation here: The model trained with the SGD momentum optimizer learns more quickly than the model trained with plain SGD. Its training accuracy climbs to roughly 0.55 to 0.6, while plain SGD reaches only about 0.5. The validation accuracies show the same trend: about 0.50 with momentum versus roughly 0.47 with plain SGD. For both optimizers there is a gap between training and validation accuracy, with validation accuracy lower than training accuracy.

2.(g) Adding L2 regularization (EECS 504 Only)

Add L2 regularization to the softmax classifier in 2(b)
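
For reference, the convention followed in the implementation of 2.(b) above is to add \(\lambda \left( \lVert W_1 \rVert_2^2 + \lVert W_2 \rVert_2^2 \right)\) to the data loss, where \(\lambda\) is self.reg, and to add \(2\lambda W_1\) and \(2\lambda W_2\) to the corresponding weight gradients; the biases are not regularized.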

2.(h) Training with L2 regularization (EECS 504 Only)

Train the model again with L2 regularization, using SGD as the optimizer.

# initialize model (remember to set the L2 regularization weight > 0)
model_SGD_L2 = SoftmaxClassifier(hidden_dim = 300, weight_scale=1e-2, reg=0.01)

# start training using SGD. The training hyperparameters you choose should be the same as in 2.(c)
model_SGD_L2, train_acc_history_SGD_L2, val_acc_history_SGD_L2 = trainNetwork(
    model_SGD_L2, train_data, 
    learning_rate = learning_rate_SGD,
    lr_decay=lr_decay_SGD, 
    batch_size=batch_size_SGD,
    num_epochs=10, 
    print_every=1000, optimizer = 'SGD')
(Iteration 1 / 4000) loss: 2.319169
(Epoch 0 / 10) train acc: 0.143000; val_acc: 0.132900
(Epoch 1 / 10) train acc: 0.445000; val_acc: 0.433300
(Epoch 2 / 10) train acc: 0.511000; val_acc: 0.462600
(Iteration 1001 / 4000) loss: 1.300620
(Epoch 3 / 10) train acc: 0.519000; val_acc: 0.470700
(Epoch 4 / 10) train acc: 0.503000; val_acc: 0.471100
(Epoch 5 / 10) train acc: 0.476000; val_acc: 0.471100
(Iteration 2001 / 4000) loss: 1.353539
(Epoch 6 / 10) train acc: 0.519000; val_acc: 0.471100
(Epoch 7 / 10) train acc: 0.532000; val_acc: 0.471100
(Iteration 3001 / 4000) loss: 1.495751
(Epoch 8 / 10) train acc: 0.518000; val_acc: 0.471100
(Epoch 9 / 10) train acc: 0.488000; val_acc: 0.471100
(Epoch 10 / 10) train acc: 0.510000; val_acc: 0.471100
# report test accuracy
acc = testNetwork(model_SGD_L2, data['X_test'], data['y_test'])
print("Test accuracy of model_SGD_L2: {}".format(acc))
Test accuracy of model_SGD_L2: 0.4762

2.(i) Plot

Using train_acc_history and val_acc_history, plot the train and validation accuracy versus epochs on one plot for the model with and without L2 regularization, using SGD as the optimizer.

#######################################################################
# TODO: Your Code here
#######################################################################
plt.plot(train_acc_history_SGD, label = "train accuracy using SGD")
plt.plot(val_acc_history_SGD, label = "validation accuracy using SGD")
plt.plot(train_acc_history_SGD_L2, label = "train accuracy using SGD and L2 regularization")
plt.plot(val_acc_history_SGD_L2, label = "validation accuracy using SGD and L2 regularization")
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.legend()
plt.show()

#######################################################################
#                         END OF YOUR CODE                            #
#######################################################################

[Figure: train and validation accuracy versus epochs with and without L2 regularization (SGD)]

After adding L2 regularization, the gap between training accuracy and validation accuracy is somewhat smaller than before.

2.(j) Tune your own model

Feel free to tune any hyperparameters and choose any optimizer you want – just train the best model you can!

#######################################################################
# TODO: Train your own model                                          #
#######################################################################

# initialize model (set reg=0.0 for EECS 442 students)
your_model = SoftmaxClassifier(hidden_dim = 300, weight_scale = 1e-2,  reg = 1e-2)

# train your model
your_model, train_acc_history_your_model, val_acc_history_your_model = trainNetwork(
    your_model, train_data, 
    learning_rate = 0.05,
    lr_decay= 0.5, 
    batch_size= 100,
    num_epochs= 20, 
    print_every=1000, optimizer = "SGD_Momentum")
#######################################################################
#                         END OF YOUR CODE                            #
#######################################################################
(Iteration 1 / 8000) loss: 2.305209
(Epoch 0 / 20) train acc: 0.134000; val_acc: 0.123800
(Epoch 1 / 20) train acc: 0.541000; val_acc: 0.445000
(Epoch 2 / 20) train acc: 0.564000; val_acc: 0.491600
(Iteration 1001 / 8000) loss: 1.372331
(Epoch 3 / 20) train acc: 0.632000; val_acc: 0.500700
(Epoch 4 / 20) train acc: 0.621000; val_acc: 0.518600
(Epoch 5 / 20) train acc: 0.618000; val_acc: 0.524400
(Iteration 2001 / 8000) loss: 1.207773
(Epoch 6 / 20) train acc: 0.638000; val_acc: 0.525400
(Epoch 7 / 20) train acc: 0.671000; val_acc: 0.526500
(Iteration 3001 / 8000) loss: 1.141951
(Epoch 8 / 20) train acc: 0.637000; val_acc: 0.526500
(Epoch 9 / 20) train acc: 0.666000; val_acc: 0.529600
(Epoch 10 / 20) train acc: 0.666000; val_acc: 0.528600
(Iteration 4001 / 8000) loss: 0.943535
(Epoch 11 / 20) train acc: 0.641000; val_acc: 0.529600
(Epoch 12 / 20) train acc: 0.653000; val_acc: 0.528400
(Iteration 5001 / 8000) loss: 1.219422
(Epoch 13 / 20) train acc: 0.668000; val_acc: 0.528600
(Epoch 14 / 20) train acc: 0.659000; val_acc: 0.528600
(Epoch 15 / 20) train acc: 0.637000; val_acc: 0.528500
(Iteration 6001 / 8000) loss: 1.241045
(Epoch 16 / 20) train acc: 0.645000; val_acc: 0.528600
(Epoch 17 / 20) train acc: 0.651000; val_acc: 0.528600
(Iteration 7001 / 8000) loss: 1.226419
(Epoch 18 / 20) train acc: 0.663000; val_acc: 0.528600
(Epoch 19 / 20) train acc: 0.662000; val_acc: 0.528500
(Epoch 20 / 20) train acc: 0.647000; val_acc: 0.528500
# report test accuracy 
# (Run this code only once, after you have obtained the model with the highest validation accuracy!)
acc = testNetwork(your_model, data['X_test'], data['y_test'])
print("Test accuracy of your model: {}".format(acc))
Test accuracy of your model: 0.5184