Assignment #1

Deep Learning / Fall 1398, Iran University of Science and Technology


Please pay attention to these notes:


  • Assignment Due: 1398/08/18 23:59
  • If you need any additional information, please review the assignment page on the course website.
  • The items you need to answer are highlighted in red and the coding parts you need to implement are denoted by:
    ########################################
    #     Put your implementation here     #
    ########################################
  • We encourage cooperation and group discussion on assignments. However, each student must complete all the questions on their own. If our matching system detects any copying, you will be responsible for the consequences. If you worked with a teammate, please mention their name.
  • Students who audit this course should submit their assignments like other students to be qualified for attending the rest of the sessions.
  • Any detected copying will zero the grade for that assignment and will also count as two negative assignments toward your final score.
  • When you are ready to submit, please follow the instructions at the end of this notebook.
  • If you have any questions about this assignment, feel free to drop us a line. You may also post your questions on the course Forum page.
  • You must run this notebook on the Google Colab platform, because some of its dependencies rely on the Google Colab VM.
  • Before starting to work on the assignment, please fill in your name in the next section AND remember to RUN the cell.


Assignment Page: https://iust-deep-learning.github.io/981/assignments/01_mlp_and_preprocessing

Course Forum: https://groups.google.com/forum/#!forum/dl981/


Fill your information here & run the cell

In [0]:
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = 0 #@param {type:"integer"}
student_name = "" #@param {type:"string"}
Your_Github_account_Email = "" #@param {type:"string"}

print("your student id:", student_id)
print("your name:", student_name)


from pathlib import Path

ASSIGNMENT_PATH = Path('asg01')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)

1. MLP

In class, we studied the MLP. In this part, you have to implement your own MLP, then train and test it on the Iris dataset.

Iris dataset


The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

You can see this link for more details.

Let's get this simple dataset and see some samples of it.

In [0]:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris['data'][:10])
print(iris['target'][:10])

Implementation


Before going any further, we have to import some prerequisites:

In [0]:
import numpy as np

If you want to import some modules or implement some helper functions or classes you can do it in this cell.

In [0]:
 

Now, implement your MLP from scratch.

In [0]:
class MLP(object):
  
  def train(self, x, y):
    """
    train MLP model on train data

    Args:
      x: 2d numpy array or list of train data
      y: 1d or 2d numpy array or list of train data labels
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return True
  
  def test(self, x, y):
    """
    test MLP model on test data

    Args:
      x: 2d numpy array or list of test data
      y: 1d or 2d numpy array or list of test data labels

    Returns:
      acc: in the simplest case, the ratio of the number of correct predictions to the 
           number of all test data
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return acc
  
  def predict(self, x):
    """
    predict output of MLP model on input data

    Args:
      x: 1d or 2d numpy array or list of input data

    Returns:
      pred: 1d numpy array or list or integer that represent output predicted 
            from MLP
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return pred
  
  def save_model(self, model_path):
    """
    save model to disk

    Args:
      model_path: path of model
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return True
  
  def load_model(self, model_path):
    """
    load model from disk

    Args:
      model_path: path of model
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return True
In [0]:
def initialize_model():
  """
  initialize an MLP model that classifies the Iris dataset
  
  Returns:
    model: A MLP object
               
  Hint: Consider the number of features in the Iris dataset and the number of its classes 
        and initialize weights.
  """
  
  ########################################
  #     Put your implementation here     #
  ########################################
  
  return model
In [0]:
def split_train_test(x, y):
  """
  split input data and labels to train and test sections.
  
  Args:
    x: 2d numpy array or list of input data
    y: 1d or 2d numpy array or list of data labels
    
  Returns:
    train_data: 2d numpy array or list of train_data
    train_labels: 1d or 2d numpy array or list of train data labels
    test_data: 2d numpy array or list of test_data
    test_labels: 1d or 2d numpy array or list of test data labels
  """
  
  ########################################
  #     Put your implementation here     #
  ########################################
  
  return train_data, train_labels, test_data, test_labels

Test your implementation (don't change this cell):

In [0]:
mlp = initialize_model()
train_data, train_labels, test_data, test_labels = split_train_test(iris['data'], iris['target'])
mlp.train(train_data, train_labels)
mlp.save_model(ASSIGNMENT_PATH / 'my_model.h5')
del mlp
new_mlp = initialize_model()
new_mlp.load_model(ASSIGNMENT_PATH / 'my_model.h5')
print('your model accuracy on test data is: %s' % (new_mlp.test(test_data, test_labels)))

In class, we studied the mathematics behind back-propagation when the activation function of the last layer is ReLU. Now write the equations for the softmax activation function and derive the delta formulas for all layers.

$\color{red}{\text{Write your answer here}}$

2. Text classification

In class, we studied how to build a basic dense model. Now we want to learn how to prepare a text dataset to feed into a provided model. First, we start with a simple dataset and then, we try a harder example.

Sentiment Analysis on Movie Reviews

This small dataset is freely available through NLTK. You can learn how to install the movie_reviews dataset here.

In [0]:
from keras.layers import Activation, Input, Dropout
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import Adam
from keras import backend as K

  ########################################
  #     Put your implementation here     #
  ########################################

In every deep learning task, we need to divide our dataset into train and test sets. The train set is used to train the model, and the test set is used to evaluate the trained model. There is no fixed formula for the train/test proportion and the choice is up to you, but the majority of the dataset should go to the train set.
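
One way to realize such a split, sketched under the assumption that the data fits in memory as two parallel lists: shuffle the indices with a fixed seed and cut at a chosen fraction. The helper name `shuffled_split`, the 80/20 default, and the toy data are illustrative choices, not part of the assignment template.

```python
import numpy as np

def shuffled_split(items, labels, train_fraction=0.8, seed=0):
    """Shuffle items and labels together, then cut them into train and test parts."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    # A fixed seed keeps the split reproducible between runs
    order = np.random.RandomState(seed).permutation(len(items))
    n_train = int(round(train_fraction * len(items)))
    train_idx, test_idx = order[:n_train], order[n_train:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(labels, train_idx),
            pick(items, test_idx), pick(labels, test_idx))

# Toy data (hypothetical): ten items labelled 0..9
toy_train_x, toy_train_y, toy_test_x, toy_test_y = shuffled_split(
    list("abcdefghij"), list(range(10)), train_fraction=0.8)
```

Shuffling before cutting matters when the data is stored class by class, since a plain slice would otherwise leave one class out of the train or test part.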

In [0]:
'''
     Split the documents into train and test datasets
'''
document = {'train': [], 'test': []} # Put the documents here
labels = {'train': [], 'test': []} # Put the labels here

  ########################################
  #     Put your implementation here     #
  ########################################

Encoding the text data

To feed the text data into a deep model, we must convert the strings to numerical data. A variety of approaches are available for this purpose, and we use two of them for this task: One-Hot and TF-IDF encodings.

One-Hot encoding

A one-hot vector is a group of bits in which the only legal combinations of values are those with a single high (1) bit and all the others low (0). In our case, each word is converted to an array in which exactly one cell is 1: the cell that represents that specific word. Then, to represent a document as a vector, we sum the vectors of all the words in the document.
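
As a rough illustration of this idea (not the required solution), the sketch below builds a vocabulary from a toy corpus and sums the per-word one-hot vectors into one vector per document. The names `build_vocab` and `one_hot_document` and the toy sentences are made up for this example, and splitting on whitespace is an assumed tokenization.

```python
import numpy as np

def build_vocab(tokenized_docs):
    """Map each distinct word to a column index (sorted for reproducibility)."""
    words = sorted({w for doc in tokenized_docs for w in doc})
    return {w: i for i, w in enumerate(words)}

def one_hot_document(tokens, vocab):
    """Sum the one-hot vectors of all in-vocabulary tokens of one document."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for w in tokens:
        idx = vocab.get(w)
        if idx is not None:  # words missing from the vocabulary are skipped
            vec[idx] += 1.0
    return vec

# Toy corpus (hypothetical), tokenized by whitespace
toy_docs = [s.split() for s in ["good good movie", "bad movie"]]
toy_vocab = build_vocab(toy_docs)
toy_matrix = np.stack([one_hot_document(d, toy_vocab) for d in toy_docs])
```

Because the vectors are summed, a word that occurs twice contributes 2 in its cell, so the document vector is effectively a word-count vector.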

In [0]:
'''
     Encode documents to One-Hot representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here

  ########################################
  #     Put your implementation here     #
  ########################################

As we studied in the TA class, for classification tasks we need to convert the labels into the one-hot format.
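
A minimal sketch of that conversion, assuming the labels arrive as strings and that the two classes are named "neg" and "pos" (an assumption about how you stored the labels). The helper name `labels_to_one_hot` is hypothetical.

```python
import numpy as np

def labels_to_one_hot(label_list, class_names):
    """Return a (n_samples, n_classes) array with a single 1 per row."""
    index = {name: i for i, name in enumerate(class_names)}
    out = np.zeros((len(label_list), len(class_names)), dtype=np.float32)
    for row, name in enumerate(label_list):
        out[row, index[name]] = 1.0
    return out

# Toy labels (hypothetical); column 0 stands for "neg", column 1 for "pos"
toy_ys = labels_to_one_hot(["pos", "neg", "pos"], ["neg", "pos"])
```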

In [0]:
'''
     Convert labels into One-Hot representation.
'''
ys = {'train': [], 'test': []} # Put the label vectors here

  ########################################
  #     Put your implementation here     #
  ########################################

Now we build and train the model, and then visualize the results.

In [0]:
def recall(y_true, y_pred):
    """
    Recall metric.

    Only computes a batch-wise average of recall.

    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall


def precision(y_true, y_pred):
    """
    Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.

    Source
    ------
    https://github.com/fchollet/keras/issues/5400#issuecomment-314747992
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision


def f1(y_true, y_pred):
    """Calculate the F1 score."""
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    # K.epsilon() avoids a division by zero when both precision and recall are 0
    return 2 * ((p * r) / (p + r + K.epsilon()))

def create_model(nb_classes, input_shape):
    """Create a MLP model."""
    input_ = Input(shape=input_shape)
    x = input_
    x = Dense(16, activation='relu')(x)
    x = Dense(16, activation='relu')(x)
    x = Dense(nb_classes)(x)
    x = Activation('sigmoid')(x)
    model = Model(inputs=input_, outputs=x)
    return model
In [0]:
data = {'x_train': xs['train'], 'y_train': ys['train'],
        'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
              batch_size=32,
              epochs=20,
              validation_data=(data['x_test'], data['y_test']))
In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

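history_dict = history.history
# Older Keras versions log accuracy as 'acc'; newer ones use 'accuracy'
acc_key = 'acc' if 'acc' in history_dict else 'accuracy'
acc = history_dict[acc_key]
val_acc = history_dict['val_' + acc_key]

epochs = range(1, len(acc) + 1)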

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()

plt.show()

TF-IDF encoding

TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus [1]. It combines the frequency of a word in a document (TF) with the inverse document frequency (IDF), which measures how common the word is across documents. You can learn more about this approach here to implement it. Note that you need to provide a vector for each document with the same shape as the One-Hot vector but with different values.
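
TF-IDF has several variants; the sketch below uses one common choice, assumed here for illustration: TF is the word count divided by the document length, and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the word. The function names and the toy corpus are hypothetical.

```python
import numpy as np

def count_matrix(tokenized_docs, vocab):
    """Count how often each vocabulary word occurs in each document."""
    counts = np.zeros((len(tokenized_docs), len(vocab)), dtype=np.float64)
    for row, doc in enumerate(tokenized_docs):
        for w in doc:
            if w in vocab:
                counts[row, vocab[w]] += 1.0
    return counts

def fit_idf(train_counts):
    """IDF = log(N / df); df is clamped to 1 to avoid dividing by zero."""
    n_docs = train_counts.shape[0]
    df = (train_counts > 0).sum(axis=0)
    return np.log(n_docs / np.maximum(df, 1))

def tfidf(counts, idf):
    """TF (count / document length) multiplied by the fitted IDF weights."""
    doc_len = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    return (counts / doc_len) * idf

# Toy corpus (hypothetical), tokenized by whitespace
tfidf_vocab = {"bad": 0, "good": 1, "movie": 2}
tfidf_docs = [s.split() for s in ["good good movie", "bad movie"]]
tfidf_counts = count_matrix(tfidf_docs, tfidf_vocab)
tfidf_vectors = tfidf(tfidf_counts, fit_idf(tfidf_counts))
```

In this toy corpus "movie" occurs in every document, so its IDF is log(1) = 0 and its column vanishes. In practice the IDF weights would be fitted on the training documents only and reused for the test documents, so both sets share one vector space.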

In [0]:
'''
     Encode documents to TF-IDF representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here

  ########################################
  #     Put your implementation here     #
  ########################################

Now we train and visualize our model again. Note that the result may vary depending on the preprocessing you do or the tokenizer you use to split your data.

In [0]:
data = {'x_train': xs['train'], 'y_train': ys['train'],
        'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
              batch_size=32,
              epochs=20,
              validation_data=(data['x_test'], data['y_test']))
In [0]:
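history_dict = history.history
# Older Keras versions log accuracy as 'acc'; newer ones use 'accuracy'
acc_key = 'acc' if 'acc' in history_dict else 'accuracy'
acc = history_dict[acc_key]
val_acc = history_dict['val_' + acc_key]

epochs = range(1, len(acc) + 1)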

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()

plt.show()

Sentiment Analysis on IMDB

Working with this dataset is a bit tricky. Download the dataset from here, then use the Training set as your whole dataset. If you run into RAM problems, you can use a sample of 12500 reviews, but remember to include negative and positive reviews in equal numbers.

In [0]:
'''
     Import necessary modules, download and prepare the requested dataset
'''
  ########################################
  #     Put your implementation here     #
  ########################################
In [0]:
'''
     Split the documents into train and test datasets
'''
document = {'train': [], 'test': []} # Put the documents here
labels = {'train': [], 'test': []} # Put the labels here

  ########################################
  #     Put your implementation here     #
  ########################################

Now you train the dense model on this dataset. Use one of the encoding approaches you used for the prior dataset and then feed the preprocessed data into the model.

In [0]:
'''
     Encode documents to a vector representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
ys = {'train': [], 'test': []} # Put the label vectors here

  ########################################
  #     Put your implementation here     #
  ########################################
In [0]:
data = {'x_train': xs['train'], 'y_train': ys['train'],
        'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
              batch_size=32,
              epochs=20,
              validation_data=(data['x_test'], data['y_test']))
In [0]:
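history_dict = history.history
# Older Keras versions log accuracy as 'acc'; newer ones use 'accuracy'
acc_key = 'acc' if 'acc' in history_dict else 'accuracy'
acc = history_dict[acc_key]
val_acc = history_dict['val_' + acc_key]

epochs = range(1, len(acc) + 1)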

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()

plt.show()

Word Embeddings

In this section, we want to use a pre-trained word embedding to encode the reviews. To do so, we leverage the Google News Word2Vec model, which provides a vector of 300 semantic features for each word. These features are learned by considering the position of each training word and its adjacent words in the training data (Google News). More details will be discussed in class later.

You can download the pre-trained model from here, and you may want to use gensim to load the file. Next, represent each document by the average of the vectors of its words that are available in the W2V model. Use a weighted average, so that the frequency of a word counts as well as its presence.
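
The frequency-weighted average can be sketched as follows. To keep the example self-contained, a tiny hand-written dictionary with 3-dimensional vectors stands in for the 300-dimensional Word2Vec model; the name `document_vector` and the toy vectors are assumptions made for illustration.

```python
from collections import Counter
import numpy as np

def document_vector(tokens, embeddings, dim):
    """Average the vectors of known words, weighted by how often each occurs."""
    counts = Counter(w for w in tokens if w in embeddings)
    total = sum(counts.values())
    if total == 0:
        # No known word in the document: fall back to a zero vector
        return np.zeros(dim, dtype=np.float32)
    vec = np.zeros(dim, dtype=np.float32)
    for w, c in counts.items():
        vec += c * np.asarray(embeddings[w], dtype=np.float32)
    return vec / total

# Toy stand-in for the pre-trained model (hypothetical values)
toy_embeddings = {"good": [1.0, 0.0, 0.0], "movie": [0.0, 1.0, 0.0]}
toy_doc_vec = document_vector("good good movie unknownword".split(), toy_embeddings, dim=3)
toy_empty_vec = document_vector(["unknownword"], toy_embeddings, dim=3)
```

With gensim, the real vectors could be loaded with `KeyedVectors.load_word2vec_format(path, binary=True)`. The resulting object supports `word in kv` and `kv[word]`, so it could be passed in place of the toy dictionary with `dim=300`.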

In [0]:
'''
     Encode documents to a vector representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
ys = {'train': [], 'test': []} # Put the label vectors here

  ########################################
  #     Put your implementation here     #
  ########################################
In [0]:
data = {'x_train': xs['train'], 'y_train': ys['train'],
        'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
              batch_size=32,
              epochs=20,
              validation_data=(data['x_test'], data['y_test']))
In [0]:
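history_dict = history.history
# Older Keras versions log accuracy as 'acc'; newer ones use 'accuracy'
acc_key = 'acc' if 'acc' in history_dict else 'accuracy'
acc = history_dict[acc_key]
val_acc = history_dict['val_' + acc_key]

epochs = range(1, len(acc) + 1)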

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()

plt.show()

3. Image classification

In this part, we want to classify animal images according to their species (frog vs. penguin).

First, we should download the dataset.

In [0]:
# Download the dataset
! wget -q http://iust-deep-learning.github.io/981/static_files/assignments/asg01_assets/data.zip
  
# Then, extract it
! unzip data.zip -d .
! cat frog_url.txt

As you can see, two files contain the URLs of the images, so you should download the images and save them in appropriate folders. Do it in this cell:

In [0]:
 

As a suggestion, it is better to view some of the images at first. To do so, modify this code:

In [0]:
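import cv2
import matplotlib.pyplot as plt

img_path = ''  # set this to the path of one of the downloaded images
img = cv2.imread(img_path)
# OpenCV loads images as BGR, while matplotlib expects RGB
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))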

Before going any further, we have to import some prerequisites:

In [0]:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.applications.vgg16 import decode_predictions
import numpy as np

In every deep learning task, we need to divide our dataset into train and test sets. The train set is used to train the model, and the test set is used to evaluate the trained model. There is no fixed formula for the train/test proportion and the choice is up to you, but the majority of the dataset should go to the train set.

In [0]:
'''
     Split the images into train and test datasets
'''
images = {'train': [], 'test': []} # Put the images here
labels = {'train': [], 'test': []} # Put the labels here

  ########################################
  #     Put your implementation here     #
  ########################################

Now we change images to numeric feature vectors to feed them into the network.

To do so, we leverage the vgg16 model. It is a CNN model; these models will be discussed in the future.

In [0]:
vgg16_model = VGG16(weights='imagenet', include_top=False)
vgg16_model.summary()

To prepare images to feed them into the network, some preprocessing is required. Implement this in this cell. For example, you can normalize images.
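
One possible normalization, sketched here as an assumption rather than the expected answer, is to subtract the per-channel ImageNet means that VGG16 was trained with. The sketch assumes the images were read with cv2.imread, so the channels are in B, G, R order, and that the input is a batch of shape (1, height, width, 3). The names `IMAGENET_BGR_MEAN` and `normalize_bgr_batch` are hypothetical.

```python
import numpy as np

# ImageNet per-channel means in B, G, R order, as commonly used with VGG16
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def normalize_bgr_batch(batch):
    """Return a float32 copy of a BGR image batch with the channel means removed."""
    out = np.asarray(batch, dtype=np.float32).copy()
    out -= IMAGENET_BGR_MEAN  # broadcasts over the last (channel) axis
    return out

# Toy input (hypothetical): one uniformly gray 224x224 image
demo_batch = normalize_bgr_batch(np.full((1, 224, 224, 3), 128, dtype=np.uint8))
```

Converting to float32 before subtracting matters, because uint8 arithmetic would wrap around for pixels darker than the mean.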

In [0]:
def preprocess_image(image):
    """
    preprocess input image

    Args:
      image: numpy array holding the input image (in this notebook, a batch of shape (1, 224, 224, 3))

    Returns:
      img: numpy array holding the preprocessed image (same layout as the input)
    """
    img = image.copy()
    ########################################
    #     Put your implementation here     #
    ########################################
    return img

Now, you must first preprocess the images, then convert/encode them into feature vectors.

In [0]:
xs = {'train': [], 'test': []}
for image in images['train']:
    img = # first read image
    img = cv2.resize(img, (224, 224))
    img = np.expand_dims(img, axis=0)
    img = preprocess_image(img)
    features = vgg16_model.predict(img)
    ff = features.flatten()
    xs['train'].append(ff)  # store the flattened vector, not the 4d feature map

for image in images['test']:
    img = # first read image
    img = cv2.resize(img, (224, 224))
    img = np.expand_dims(img, axis=0)
    img = preprocess_image(img)
    features = vgg16_model.predict(img)
    ff = features.flatten()
    xs['test'].append(ff)  # store the flattened vector, not the 4d feature map

If you need to convert the labels into another format, you can do so by deleting the last two lines and implementing your own code.

In [0]:
ys = {'train': [], 'test': []}
ys['train'] = labels['train'][:]
ys['test'] = labels['test'][:]

Now implement an MLP model for this task to separate frog images from penguin images.

If you want to import some modules or implement some helper functions or classes you can do it in this cell.

In [0]:
 

Now, implement your MLP from scratch.

In [0]:
class MLP(object):
  
  def train(self, x, y):
    """
    train MLP model on train data

    Args:
      x: 2d numpy array or list of train data
      y: 1d or 2d numpy array or list of train data labels
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return True
  
  def test(self, x, y):
    """
    test MLP model on test data

    Args:
      x: 2d numpy array or list of test data
      y: 1d or 2d numpy array or list of test data labels

    Returns:
      acc: in the simplest case, the ratio of the number of correct predictions to the 
           number of all test data
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return acc
  
  def predict(self, x):
    """
    predict output of MLP model on input data

    Args:
      x: 1d or 2d numpy array or list of input data

    Returns:
      pred: 1d numpy array or list or integer that represent output predicted 
            from MLP
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return pred
  
  def save_model(self, model_path):
    """
    save model to disk

    Args:
      model_path: path of model
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return True
  
  def load_model(self, model_path):
    """
    load model from disk

    Args:
      model_path: path of model
    """

    ########################################
    #     Put your implementation here     #
    ########################################
    
    return True
In [0]:
def initialize_model():
  """
  initialize an MLP model that classifies frog and penguin images
  
  Returns:
    model: A MLP object
               
  Hint: Consider the number of features extracted by VGG16 for each image and the number 
        of classes, and initialize weights.
  """
  
  ########################################
  #     Put your implementation here     #
  ########################################
  
  return model

Evaluate your model (don't change this cell):

In [0]:
mlp = initialize_model()
mlp.train(xs['train'], ys['train'])
print('your model accuracy on test data is: %s' % (mlp.test(xs['test'], ys['test'])))
mlp.save_model(ASSIGNMENT_PATH / 'topvgg16_model.h5')

Submission

Congratulations! You finished the assignment and you're ready to submit your work. Please follow these instructions:

  1. Check and review your answers. Make sure all of the cell outputs are what you want.
  2. Select File > Save.
  3. Run the Create Submission cell. It may take several minutes, and it may ask you for your credentials.
  4. Run the Download Submission cell to obtain your submission as a zip file.
  5. Grab the downloaded file (dl_asg01__xx__xx.zip) and submit it via https://forms.gle/3srwTZhBbc4KfXaR8.

Note: We need your GitHub token to create a new repository (if it doesn't already exist) to store the learned model data. Also, the Google Drive token enables us to download the current notebook and create a submission. If you are interested, feel free to check our code.

Create Submission (Run the cell)

In [0]:
#@title
! pip install -U --quiet PyDrive > /dev/null
! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz 
  
import os
import time
import yaml
import json

from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

asg_name = 'assignment_01'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
    Jupyter.notebook.save_checkpoint();
});
'''
repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg01__%s__%s.zip'%(student_id, student_name.lower().replace(' ',  '_'))

! tar xf hub-linux-amd64-2.10.0.tgz
! cd hub-linux-amd64-2.10.0/ && chmod a+x install && ./install
! hub config --global hub.protocol https
! hub config --global user.email "$Your_Github_account_Email"
! hub config --global user.name "$student_name"
! hub api --flat -X GET /user
! hub api -F affiliation=owner -X GET /user/repos > repos.json

repos = json.load(open('repos.json'))
repo_names = [r['name'] for r in repos]
has_repository = repo_name in repo_names
if not has_repository:
  get_ipython().system_raw('! hub api -X POST -F name=%s /user/repos > repo_info.json' % repo_name)
  repo_info = json.load(open('repo_info.json')) 
  repo_url = repo_info['clone_url']
else:
  for r in repos:
    if r['name'] == repo_name:
      repo_url = r['clone_url']
  
# safe_load_all parses plain YAML without needing an explicit Loader argument
with open("/root/.config/hub", "r") as stream:
  token = list(yaml.safe_load_all(stream))[0]['github.com'][0]['oauth_token']
repo_url_with_token = 'https://'+token+"@" +repo_url.split('https://')[1]

! git clone "$repo_url_with_token"
! cp -r "$ASSIGNMENT_PATH" "$repo_name"/
! cd "$repo_name" && git add -A
! cd "$repo_name" && git commit -m "Add assignment 01 results"
! cd "$repo_name" && git push -u origin master

sub_info = {
    'student_id': student_id,
    'student_name': student_name, 
    'repo_url': repo_url,
    'asg_dir_contents': os.listdir(str(ASSIGNMENT_PATH)),
    'dateime': str(time.time()),
    'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))

Javascript(script_save)

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name) 

! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null

print("##########################################")
print("Done! Submission created. Please download it using the cell below!")

Download Submission (Run the cell)

In [0]:
#@title
files.download(submission_file_name)

If that cell raises an error when running, you can download the file dl_asg01__your_student_id__your_name.zip from the Files section of the left panel by right-clicking on it and choosing Download.

Special Thanks

Special thanks to Amirhossein Kazemnejad and Kiamehr Razaee for creating the template of deep learning course assignments.