Deep Learning / Spring 1399, Iran University of Science and Technology
Please pay attention to these notes:
########################################
# Put your implementation here #
########################################
Assignment Page: https://iust-deep-learning.github.io/982/assignments/01_Multilayer_Perceptron
Course Forum: https://groups.google.com/forum/#!forum/dl982/
Fill your information here & run the cell
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = 0#@param {type:"integer"}
student_name = "" #@param {type:"string"}
Your_Github_account_Email = "" #@param {type:"string"}
print("your student id:", student_id)
print("your name:", student_name)
from pathlib import Path
ASSIGNMENT_PATH = Path('asg01')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)
In this assignment, you will explore and implement the properties of a primary deep learning model called the multilayer perceptron (MLP). Basically, the goal of an MLP is to learn a non-linear mapping from inputs to outputs. We can write this mapping as $y = f(x; \theta)$, where $x$ is the input and $\theta$ is the vector of all the parameters in the network, which we are trying to learn.
As you can see in the figure, every MLP consists of an input layer, an output layer, and one or more hidden layers in between. Each layer is made up of one or more cells called neurons. Every neuron computes a dot product between its inputs and a weight vector; the result then goes through a non-linear function (an activation function, e.g. $\tanh$ or sigmoid), which gives the output of the neuron.
Throughout this assignment, inputs will be matrices of shape $b \times M$, where $b$ is the batch size and $M$ is the number of input features.
As for the equations, let's compute the output of the $i$th layer:
$$A^i = f(A^{i-1}w^i + b^i)$$
Imagine that the $(i-1)$th and $i$th layers have sizes $n$ and $p$, respectively. The dimensions of the weight and bias will then be:
$$w^i \in \mathbb{R}^{n\times p}, \qquad b^i \in \mathbb{R}^{1\times p}$$
NumPy is the only package you're allowed to use for implementing your MLP in this assignment, so let's import it in the cell below!
import numpy as np
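To make the layer equation above concrete, here is a minimal shape-checking sketch (the sizes b, n, and p are arbitrary and chosen only for illustration; this is not part of the assignment code):
### shape check for a single layer: A^i = f(A^{i-1} w^i + b^i)
b, n, p = 4, 3, 2                         # batch size, previous layer size, current layer size
A_prev = np.random.normal(size=(b, n))    # activations of the previous layer
w = np.random.normal(size=(n, p))         # weight matrix of shape (n, p)
bias = np.zeros((1, p))                   # bias of shape (1, p), broadcast over the batch
A = np.tanh(A_prev @ w + bias)            # activations of the current layer
print(A.shape)                            # (4, 2)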
Now let's implement some activation functions! Linear, ReLU, and sigmoid are the functions we'll need in this assignment. Note that you should also implement their derivatives, since you'll need them later for back-propagation.
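For reference, these are the standard definitions (and derivatives) behind the functions you're asked to implement; the ReLU derivative at $x = 0$ is conventionally taken to be 0:
$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$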
## We've implemented the Linear activation function for you
def linear(x, deriv=False):
return x if not deriv else np.ones_like(x)
def relu(x, deriv=False):
"""
Args:
x: A numpy array of any shape
deriv: True or False. If True, return the derivative of the function instead of the function itself.
Returns:
relu_out: A numpy array of the same shape as x.
The ReLU function (or its derivative) applied element-wise to x
"""
########################################
# Put your implementation here #
########################################
return relu_out
def sigmoid(x, deriv=False):
"""
Args:
x: A numpy array of any shape
deriv: True or False. If True, return the derivative of the function instead of the function itself.
Returns:
sig_out: A numpy array of the same shape as x.
The sigmoid function (or its derivative) applied element-wise to x
"""
########################################
# Put your implementation here #
########################################
return sig_out
# Test your implementation
!wget -q https://github.com/iust-deep-learning/982/raw/master/static_files/assignments/asg01_assets/act_test.npy
x_act, relu_out, sig_out = np.load('act_test.npy', allow_pickle=True)
assert np.allclose( relu_out[0], relu(x_act, deriv=True), atol=1e-6, rtol=1e-5) and np.allclose(relu_out[1], relu(x_act, deriv=False), atol=1e-6, rtol=1e-5)
assert np.allclose(sig_out[0], sigmoid(x_act, deriv=True), atol=1e-6, rtol=1e-5) and np.allclose(sig_out[1], sigmoid(x_act, deriv=False), atol=1e-6, rtol=1e-5)
Question: Why do activation functions have to be non-linear? Could any non-linear function be used as an activation function?
Write your answers here
Now let's implement our MLP class. This class handles adding layers and doing the forward propagation. Here are the attributes of this class:
- parameters: A list of dictionaries of the form {'w': weight, 'b': bias}, where weight and bias are the weight matrix and bias vector of a layer.
- act_funcs: A list of the activation functions used in the corresponding layers.
- activations: A list of matrices, each corresponding to the output of a layer.
- weighted_ins: A list of matrices, each corresponding to the weighted input of a layer. The weighted input, as the name suggests, is the layer's input multiplied by the layer's weights plus the layer's bias; it then goes through the layer's activation function to produce the layer's activations (outputs).
Note that we store weighted inputs and outputs of the layers because we'll need them later for implementing the back-propagation algorithm.
You only need to complete the feed_forward function in the MLP class. This function performs forward propagation on the input.
class MLP:
def __init__(self, input_dim):
"""
Args:
input_dim: An integer determining the input dimension of the MLP
"""
self.input_dim = input_dim
self.parameters = []
self.act_funcs = []
self.activations = []
self.weighted_ins = []
def add_layer(self, layer_size, act_func=linear):
"""
Add layers to the MLP using this function
Args:
layer_size: An integer determining the number of neurons in the layer
act_func: A function applied to the units in the layer
"""
### Size of the previous layer of mlp
prev_size = self.input_dim if not self.parameters else self.parameters[-1]['w'].shape[-1]
### Weight scale used in He initialization
weight_scale = np.sqrt(2/prev_size)
### initializing the weights and bias of the layer
weight = np.random.normal(size=(prev_size, layer_size))*weight_scale
bias = np.ones(layer_size) *0.1
### Add weights and bias of the layer to the parameters of the MLP
self.parameters.append({'w': weight, 'b': bias})
### Add the layer's activation function
self.act_funcs.append(act_func)
def feed_forward(self, X):
"""
Propagate the inputs forward using this function
Args:
X: A numpy array of shape (b, input_dim) where b is the batch size and input_dim is the dimension of the input
Returns:
mlp_out: A numpy array of shape (b, out_dim) where b is the batch size and out_dim is the dimension of the output
Hint: Don't forget to store weighted inputs and outputs of each layer in self.weighted_ins and self.activations respectively
"""
self.activations = []
self.weighted_ins = []
mlp_out = X
########################################
# Put your implementation here #
########################################
return mlp_out
# Test your implementation
import pickle
!wget -q https://github.com/iust-deep-learning/982/raw/master/static_files/assignments/asg01_assets/mlptest.pkl
x = np.random.normal(size=(512, 100))
mlp = MLP(100)
mlp.add_layer(64, relu)
mlp.add_layer(32, relu)
out = mlp.feed_forward(x)
assert len(mlp.parameters) == 2
assert mlp.activations[0].shape == tuple([512, 64]) and mlp.weighted_ins[0].shape == tuple([512, 64])
assert mlp.activations[1].shape == tuple([512, 32]) and mlp.weighted_ins[1].shape == tuple([512, 32])
assert out.shape == tuple([512, 32])
assert np.array_equal(mlp.activations[-1], out)
x, out, parameters = pickle.load(open('mlptest.pkl', 'rb'))
mlp.parameters = parameters
assert np.allclose( out, mlp.feed_forward(x), atol=1e-6, rtol=1e-5)
Question: In the add_layer function of the MLP class, we used a method called He initialization to initialize the weights. Explain how this method helps with training an MLP.
Write your answers here
In the previous sections, we implemented an MLP that accepts an input $x$, propagates it forward, and produces an output $\hat{y}$. The next step in implementing our MLP is to see how good our network's output $\hat{y}$ is compared to the target output $y$! This is where the loss function comes in. This function takes $y$ and $\hat{y}$ as its inputs and returns a scalar as its output. This scalar indicates how good the current parameters of the network are.
The choice of this function depends on the task, e.g. regression or binary classification. Since you'll be doing multiclass classification later in this assignment, let's implement the cross-entropy function. Cross-entropy is the loss most commonly used for classification tasks, but to use it in a multiclass setting, the network's outputs must be passed through a softmax activation function and the target outputs must be one-hot encoded.
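For reference, softmax turns a vector of raw scores $z \in \mathbb{R}^{C}$ into a probability distribution over the $C$ classes (standard definition):
$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}$$
Subtracting $\max_k z_k$ from every component before exponentiating leaves the result unchanged but avoids numerical overflow, so it is good practice inside your implementation.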
def softmax(y_hat):
"""
Apply softmax to the inputs
Args:
y_hat: A numpy array of shape (b, out_dim) where b is the batch size and out_dim is the output dimension of the network (number of classes)
Returns:
soft_out: A numpy array of shape (b, out_dim)
"""
########################################
# Put your implementation here #
########################################
return soft_out
# Test your implementation
y_hat = np.random.normal(size=(100, 5))
y_soft = softmax(y_hat)
assert y_hat.shape == y_soft.shape
assert all([abs(y - 1.) < 1e-5 for y in np.sum(y_soft, axis=1)])
y_hat = np.array([[10,10,10,10], [0,0,0,0]])
assert np.allclose( softmax(y_hat), np.array([[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]), atol=1e-6, rtol=1e-5)
Now implement the categorical cross-entropy function ("categorical" refers to multiclass classification). Note that the inputs are in batches, so the loss of a batch of samples will be the average of losses of samples in the batch.
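Written out, for a batch of $b$ samples over $C$ classes with one-hot targets $y$ and softmax outputs $\hat{y}^{soft}$, the quantity to compute is the standard averaged cross-entropy:
$$Loss = -\frac{1}{b}\sum_{i=1}^{b}\sum_{j=1}^{C} y_{ij}\,\log \hat{y}^{soft}_{ij}$$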
def categorical_cross_entropy(y, y_soft):
"""
Compute the categorical cross entropy loss
Args:
y: A numpy array of shape (b, out_dim). Target labels of network.
y_soft: A numpy array of shape (b, out_dim). Output of the softmax activation function
Returns:
loss: A scalar of type float. Average loss over a batch.
Hint: Use np.mean to compute average loss of a batch
"""
########################################
# Put your implementation here #
########################################
return loss
# Test your implementation
y = np.array([[1,0,0], [0,0,1], [1,0,0], [0,1,0]])
y_hat = np.array([[10,1,1], [0,-1,9], [100,-9,9], [0.1,12,10]])
y_soft = softmax(y_hat)
assert round(categorical_cross_entropy(y, y_soft), 3) == 0.032
Great! You have implemented both the softmax and categorical cross-entropy functions. Now, instead of applying a softmax activation function to the output layer of the MLP and then using categorical cross-entropy as the loss function, we can merge these two steps into a single softmax categorical cross-entropy loss and use a linear activation function in the output layer! The reason is that the gradient of the softmax categorical cross-entropy loss with respect to the MLP's output can be computed efficiently as
$$\nabla_{\hat{y}}\, Loss = \mathrm{softmax}(\hat{y}) - y$$
for a single sample. Here $\hat{y}$ is the MLP's output and $y$ is the target output (labels).
Now let's implement the softmax categorical cross-entropy function!
def softmax_categorical_cross_entropy(y, y_hat, return_grad=False):
"""
Compute the softmax categorical cross entropy loss
Args:
y: A numpy array of shape (b, out_dim). Target labels of network.
y_hat: A numpy array of shape (b, out_dim). Output of the output layer of the network
return_grad: If True return gradient of the loss with respect to y_hat. If False just return the loss
Returns:
loss: A scalar of type float. Average loss over a batch.
"""
y_soft = softmax(y_hat)
if not return_grad:
loss = categorical_cross_entropy(y, y_soft)
return loss
else:
loss_grad = (y_soft - y)/y.shape[0]
return loss_grad
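Once softmax and categorical_cross_entropy are filled in, you can optionally sanity-check the gradient above against a finite-difference estimate. The snippet below is only a sketch (the batch size, class count, and eps are arbitrary) and is not part of the required assignment code:
### optional gradient check: analytic gradient vs. centered finite differences
eps = 1e-6
y_chk = np.eye(4)                                     # 4 one-hot targets over 4 classes
y_hat_chk = np.random.normal(size=(4, 4))             # fake network outputs
analytic = softmax_categorical_cross_entropy(y_chk, y_hat_chk, return_grad=True)
numeric = np.zeros_like(y_hat_chk)
for i in range(y_hat_chk.shape[0]):
    for j in range(y_hat_chk.shape[1]):
        plus, minus = y_hat_chk.copy(), y_hat_chk.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        ### centered difference of the scalar loss w.r.t. one output entry
        numeric[i, j] = (softmax_categorical_cross_entropy(y_chk, plus)
                         - softmax_categorical_cross_entropy(y_chk, minus)) / (2 * eps)
print("max abs difference:", np.abs(analytic - numeric).max())   # should be very small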
After calculating the loss of the MLP, we need to propagate this loss back to the hidden layers in order to calculate the gradient of the loss with respect to the weights and biases of the network. The algorithm used to calculate these gradients is called back-propagation, or simply backprop. Backprop uses the chain rule to compute the gradients of the network parameters. Now let's go over the steps of this algorithm (this is the fully matrix-based version):
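In the notation used above ($Z^i = A^{i-1}w^i + b^i$ is the weighted input stored in weighted_ins, $A^0 = x$, and $\odot$ denotes element-wise multiplication), the matrix-based recursion can be sketched as follows; treat it as a reference for the loop you'll write below. Starting from the gradient of the loss with respect to the network output,
$$g = \nabla_{\hat{y}}\, Loss,$$
repeat for every layer $i$ from the last down to the first:
$$\delta^i = g \odot f_i'(Z^i)$$
$$\nabla_{w^i} Loss = (A^{i-1})^\top \delta^i, \qquad \nabla_{b^i} Loss = \textstyle\sum_{\text{batch}} \delta^i$$
$$g \leftarrow \delta^i (w^i)^\top$$
where the bias gradient sums $\delta^i$ over the batch dimension.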
Check this for a detailed explanation of the back-propagation algorithm.
Now implement the back-propagation algorithm!
def mlp_gradients(mlp, loss_function, x, y):
"""
Compute the gradient of loss with respect to mlp's weights and biases
Args:
mlp: An object of MLP class
loss_function: A function used as loss function of the MLP
x: A numpy array of shape (batch_size, input_dim). The MLP's input
y: A numpy array of shape (batch_size, num_classes). Target labels
Returns:
gradients: A list of dictionaries {'w': dw, 'b': db} corresponding to the dictionaries in mlp.parameters
dw is the gradient of loss with respect to the weights of the layer
db is the gradient of loss with respect to the bias of the layer
"""
gradients = []
### get the output of the network
y_hat = mlp.activations[-1]
num_layers = len(mlp.parameters)
### compute gradient of the loss with respect to network output
g = loss_function(y, y_hat, return_grad=True)
### You'll need the input in the last step of backprop so let's make a new list with x in the beginning
activations = [x] + mlp.activations
for i in reversed(range(num_layers)):
########################################
# Put your implementation here #
########################################
return gradients
# Test your implementation
import pickle
!wget -q https://github.com/iust-deep-learning/982/raw/master/static_files/assignments/asg01_assets/grad_test.zip
!unzip grad_test.zip
x = np.load('grad_x.npy')
y = np.load('grad_y.npy')
mlp = pickle.load(open('grad_mlp_test.pkl', 'rb'))
expected_grads = pickle.load(open('grads', 'rb'))
mlp.feed_forward(x)
grads = mlp_gradients(mlp, softmax_categorical_cross_entropy, x, y)
assert all([np.allclose(eg['w'], g['w'], atol=1e-6, rtol=1e-5) and
np.allclose(eg['b'], g['b'], atol=1e-6, rtol=1e-5)
for eg, g in zip(expected_grads, grads)])
Now that we've computed the gradients of the parameters of our MLP, we should optimize these parameters using the gradients in order for the network to produce better outputs.
Gradient descent is an optimization method that iteratively moves the parameters in the opposite direction of their gradients. Below is the update rule for gradient descent:
$$ w \leftarrow w - \alpha \nabla_wLoss$$
Where $\alpha$ is the learning rate hyperparameter.
There are three main variants of gradient descent: stochastic gradient descent, mini-batch gradient descent, and batch gradient descent.
Mini-batch gradient descent is the variant most commonly used in practice, and it's what we'll use in this assignment.
Let's perform a step of gradient descent on a simple MLP!
x = np.random.normal(size=(16, 10))
y = np.eye(16)
lr = 0.1
### Define the mlp
mlp = MLP(x.shape[-1])
mlp.add_layer(16)
mlp.add_layer(8)
mlp.add_layer(y.shape[-1])
### compute mlp's output
y_hat = mlp.feed_forward(x)
### print current loss
print("loss before gradient descent: ", softmax_categorical_cross_entropy(y, y_hat))
### Compute gradients of the mlp's parameters
grads = mlp_gradients(mlp, softmax_categorical_cross_entropy, x, y)
### perform gradient descent
mlp.parameters = [{'w':p['w']-lr*g['w'], 'b':p['b']-lr*g['b']} for g, p in zip(grads, mlp.parameters)]
### compute mlp's output again after gradient descent
y_hat = mlp.feed_forward(x)
### print loss after gradient descent
print("loss after gradient descent: ", softmax_categorical_cross_entropy(y, y_hat))
Question: Do gradient descent steps always decrease the loss? Why? (Hint: play with the learning rate in the example above!)
Write your answers here
Instead of plain gradient descent, we'll be using an extension of it called gradient descent with momentum. Instead of updating the parameters based only on the current gradients, we also take into account the gradients from previous steps! This way, parameter updates have lower variance and convergence is faster and smoother.
$$ v \leftarrow \gamma v - \alpha \nabla_wLoss$$
$$ w \leftarrow w + v$$
where $w$ denotes the MLP's weights and $v$ is called the velocity, which is essentially an exponentially weighted average of all previous gradients.
Here $\gamma$ determines how quickly the effects of previous gradients fade, and $\alpha$ is the learning rate.
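To see the update rule in action, here is a toy sketch for a single scalar parameter (the numbers are arbitrary, and this is not part of the SGD class you'll implement below):
### toy example: momentum update for one scalar parameter
w, v = 1.0, 0.0
gamma, alpha = 0.9, 0.1
for grad in [2.0, 2.0, 2.0]:        # pretend the gradient stays constant
    v = gamma * v - alpha * grad    # velocity: decayed history plus the new gradient step
    w = w + v                       # the parameter moves along the velocity
    print(round(w, 4), round(v, 4))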
Now let's implement the SGD class!
class SGD:
def __init__(self, lr=0.01, momentum=0.9):
"""
Args:
lr: learning rate of the SGD optimizer
momentum: momentum of the SGD optimizer
Hint: velocity should be a list of dictionaries just like mlp.parameters
"""
self.lr = lr
self.momentum = momentum
### initialize velocity
self.velocity = []
def step(self, parameters, grads):
"""
Perform a gradient descent step
Args:
parameters: A list of dictionaries {'w': weights , 'b': bias}. MLP's parameters.
grads: A list of dictionaries {'w': dw, 'b': db}. Gradients of the MLP's parameters; basically the output of the "mlp_gradients" function you implemented!
Returns:
Updated_parameters: A list of dictionaries {'w': weights, 'b': bias}. The MLP's parameters after performing a step of gradient descent.
"""
########################################
# Put your implementation here #
########################################
return Updated_parameters
In this part of the assignment, you'll use the MLP you implemented in the first part to classify Kannada handwritten digits!
This dataset consists of 60000 images of handwritten digits in Kannada script.
You can check this GitHub repository for more information about the dataset.
Let's download the dataset:
!wget -q https://github.com/iust-deep-learning/982/raw/master/static_files/assignments/asg01_assets/kannada.zip
!unzip kannada.zip
import pandas as pd
import matplotlib.pyplot as plt
train = pd.read_csv('train.csv')
train.head()
As you can see, the first column of the dataframe is the label, and the rest of the columns are the pixels. Let's put the dataset into numpy arrays. We must also normalize the pixel values to the [0, 1] range to help the convergence of our MLP model.
x = train.values[:, 1:]/255.
y = train.values[:, 0]
plt.imshow(x[10000].reshape(28, 28))
As we are doing multiclass classification, the labels must be one-hot encoded.
def one_hot_encoder(y):
y = y.reshape(-1)
num_samples = y.shape[0]
max_label = np.max(y)
one_hot = np.zeros((num_samples, max_label+1))
one_hot[np.arange(num_samples),y] = 1
return one_hot
Now let's transform the labels into one-hot encoded format!
y = one_hot_encoder(y)
We've implemented the get_mini_batches function below. This function splits the dataset into multiple batches. We need this function because we'll be doing mini-batch gradient descent.
import math
def get_mini_batches(x, y, batch_size, shuffle=True):
idx = list(range(len(x)))
### only shuffle the sample order if requested
if shuffle:
    np.random.shuffle(idx)
steps = math.ceil(len(x)/batch_size)
x, y = x[idx, :], y[idx, :]
for i in range(steps):
yield (x[i*batch_size: (i+1)*batch_size], y[i*batch_size: (i+1)*batch_size])
Evaluation metrics are used to measure the performance of a model after training. The choice of metric depends on factors like the nature of the task (e.g. classification or regression) or the dataset's characteristics (e.g. class imbalance). For multiclass classification with balanced classes, accuracy is a reasonable choice.
We've implemented the accuracy function in the cell below:
def accuracy(y, y_hat):
return np.mean(np.argmax(y, axis=-1)==np.argmax(y_hat, axis=-1))
Now let's split the dataset into train and validation sets:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y, stratify=y)
Everything is now ready for training our MLP! Create your MLP model in the cell below. The choice of the number of layers, their sizes, and their activation functions is up to you.
mlp = MLP(x_train.shape[-1])
########################################
# Put your implementation here #
########################################
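### For reference, one possible configuration (a suggestion only, with arbitrary layer
### sizes; the last layer is linear because softmax is applied inside the loss function):
# mlp.add_layer(256, relu)
# mlp.add_layer(128, relu)
# mlp.add_layer(y_train.shape[-1], linear)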
Let's set some hyper-parameters. Feel free to change these hyper-parameters however you see fit!
epochs = 10
Batch_size = 1024
sgd_lr = 0.1
sgd_momentum = 0.9
Now let's train the network!
from tqdm import tqdm_notebook
### Defining an optimizer
optimizer = SGD(lr=sgd_lr, momentum=sgd_momentum)
train_loss, val_loss, train_accs, val_accs = [], [], [], []
for i in range(epochs):
mini_batches = get_mini_batches(x_train, y_train, Batch_size)
for xx, yy in tqdm_notebook(mini_batches, desc='epoch {}'.format(i+1)):
### forward propagation
mlp.feed_forward(xx)
### compute gradients
grads = mlp_gradients(mlp, softmax_categorical_cross_entropy, xx, yy)
### optimization
mlp.parameters = optimizer.step(mlp.parameters, grads)
y_hat = mlp.feed_forward(x_train)
y_hat_val = mlp.feed_forward(x_val)
val_loss.append(softmax_categorical_cross_entropy(y_val, y_hat_val))
train_loss.append(softmax_categorical_cross_entropy(y_train, y_hat))
train_acc = accuracy(y_train, y_hat)*100
val_acc = accuracy(y_val, y_hat_val)*100
train_accs.append(train_acc)
val_accs.append(val_acc)
print("training acc: {:.2f} %".format(train_acc))
print("test acc: {:.2f} %".format(val_acc))
Let's visualize accuracy and loss for train and validation sets during training:
plt.plot(list(range(len(train_loss))), train_loss, label='train')
plt.plot(list(range(len(val_loss))), val_loss, label='val')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
plt.plot(list(range(len(train_accs))), train_accs, label='train')
plt.plot(list(range(len(val_accs))), val_accs, label='val')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
Question: Looking at loss and accuracy plots, how would you describe your model in terms of bias and variance?
Write your answers here
Congratulations! You finished the assignment & you're ready to submit your work. Please follow the instructions:
Run the cell below to create your submission archive (named dl_asg01__xx__xx.zip) and upload it via https://forms.gle/2dogVcZhfBvBC1aM6
Note: We need your GitHub token to create a new repository (if it doesn't already exist) to store the learned model data. The Google Drive token enables us to download the current notebook & create the submission. If you are interested, feel free to check our code.
#@title
! pip install -U --quiet PyDrive > /dev/null
! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz
import os
import time
import yaml
import json
from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
asg_name = 'assignment_1'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
Jupyter.notebook.save_checkpoint();
});
'''
repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg01__%s__%s.zip'%(student_id, student_name.lower().replace(' ', '_'))
! tar xf hub-linux-amd64-2.10.0.tgz
! cd hub-linux-amd64-2.10.0/ && chmod a+x install && ./install
! hub config --global hub.protocol https
! hub config --global user.email "$Your_Github_account_Email"
! hub config --global user.name "$student_name"
! hub api --flat -X GET /user
! hub api -F affiliation=owner -X GET /user/repos > repos.json
repos = json.load(open('repos.json'))
repo_names = [r['name'] for r in repos]
has_repository = repo_name in repo_names
if not has_repository:
get_ipython().system_raw('! hub api -X POST -F name=%s /user/repos > repo_info.json' % repo_name)
repo_info = json.load(open('repo_info.json'))
repo_url = repo_info['clone_url']
else:
for r in repos:
if r['name'] == repo_name:
repo_url = r['clone_url']
stream = open("/root/.config/hub", "r")
token = list(yaml.load_all(stream))[0]['github.com'][0]['oauth_token']
repo_url_with_token = 'https://'+token+"@" +repo_url.split('https://')[1]
! git clone "$repo_url_with_token"
! cp -r "$ASSIGNMENT_PATH" "$repo_name"/
! cd "$repo_name" && git add -A
! cd "$repo_name" && git commit -m "Add assignment 01 results"
! cd "$repo_name" && git push -u origin master
sub_info = {
'student_id': student_id,
'student_name': student_name,
'repo_url': repo_url,
'asg_dir_contents': os.listdir(str(ASSIGNMENT_PATH)),
'dateime': str(time.time()),
'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))
Javascript(script_save)
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name)
! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null
print("##########################################")
print("Done! Submisson created, Please download using the bellow cell!")
files.download(submission_file_name)