Please pay attention to these notes:
Assignment Page: https://iust-deep-learning.github.io/981/assignments/01_mlp_and_preprocessing
Course Forum: https://groups.google.com/forum/#!forum/dl981/
Fill your information here & run the cell
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = 0 #@param {type:"integer"}
student_name = "" #@param {type:"string"}
Your_Github_account_Email = "" #@param {type:"string"}
print("your student id:", student_id)
print("your name:", student_name)
from pathlib import Path
ASSIGNMENT_PATH = Path('asg01')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)
In class, we studied the MLP. In this part, you have to implement your own MLP and train and test it on the Iris dataset.
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
You can see this link for more details.
Let's get this simple dataset and see some samples of it.
from sklearn.datasets import load_iris
iris = load_iris()
print(iris['data'][:10])
print(iris['target'][:10])
For implementing an MLP from scratch for this part and Part 3, please see this.
import numpy as np
If you want to import additional modules or implement helper functions or classes, you can do so in this cell.
Now, implement your MLP from scratch.
class MLP(object):
def train(self, x, y):
"""
train MLP model on train data
Args:
x: 2d numpy array or list of train data
y: 1d or 2d numpy array or list of train data labels
"""
########################################
# Put your implementation here #
########################################
return True
def test(self, x, y):
"""
test MLP model on test data
Args:
x: 2d numpy array or list of test data
y: 1d or 2d numpy array or list of test data labels
Returns:
acc: in the simplest form, the ratio of the number of correct predictions to the total number of test samples
"""
########################################
# Put your implementation here #
########################################
return acc
def predict(self, x):
"""
predict output of MLP model on input data
Args:
x: 1d or 2d numpy array or list of input data
Returns:
pred: 1d numpy array, list, or integer representing the output predicted by the MLP
"""
########################################
# Put your implementation here #
########################################
return pred
def save_model(self, model_path):
"""
save model to disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def load_model(self, model_path):
"""
load model from disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def initialize_model():
"""
Initialize an MLP model that classifies the Iris dataset.
Returns:
model: A MLP object
Hint: Consider the number of features in the Iris dataset and the number of its classes, and initialize the weights accordingly.
"""
########################################
# Put your implementation here #
########################################
return model
def split_train_test(x, y):
"""
split input data and labels to train and test sections.
Args:
x: 2d numpy array or list of input data
y: 1d or 2d numpy array or list of data labels
Returns:
train_data: 2d numpy array or list of train_data
train_labels: 1d or 2d numpy array or list of train data labels
test_data: 2d numpy array or list of test_data
test_labels: 1d or 2d numpy array or list of test data labels
"""
########################################
# Put your implementation here #
########################################
return train_data, train_labels, test_data, test_labels
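As a point of comparison (not the required implementation), here is a minimal sketch of one possible split_train_test, using a shuffled 80/20 split; the ratio and the helper name are free choices:
def split_train_test_example(x, y, test_ratio=0.2):
    # Shuffle the indices, then cut off the last test_ratio fraction as the test set
    x, y = np.asarray(x), np.asarray(y)
    idx = np.random.permutation(len(x))
    n_test = int(len(x) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]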
Test your implementation (don't change this cell):
mlp = initialize_model()
train_data, train_labels, test_data, test_labels = split_train_test(iris['data'], iris['target'])
mlp.train(train_data, train_labels)
mlp.save_model(ASSIGNMENT_PATH / 'my_model.h5')
del mlp
new_mlp = initialize_model()
new_mlp.load_model(ASSIGNMENT_PATH / 'my_model.h5')
print('your model accuracy on test data is: %s' % (new_mlp.test(test_data, test_labels)))
In class, we studied the mathematics behind back-propagation when the activation function of the last layer is ReLU. Now write the corresponding equations for the softmax activation function and derive the delta formulas for all layers.
Please see this.
$\color{red}{\text{Write your answer here}}$
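As a sanity check for your derivation (a sketch, assuming the softmax output layer is paired with a cross-entropy loss): with $a^L_i = \frac{e^{z^L_i}}{\sum_k e^{z^L_k}}$ and $E = -\sum_i y_i \log a^L_i$, the softmax Jacobian is $\frac{\partial a^L_i}{\partial z^L_j} = a^L_i(\delta_{ij} - a^L_j)$, so the output-layer delta collapses to $\delta^L = \frac{\partial E}{\partial z^L} = a^L - y$, while the hidden-layer deltas keep the usual form $\delta^{l} = \big((W^{l+1})^\top \delta^{l+1}\big) \odot \sigma'(z^{l})$.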
In class, we studied how to build a basic dense model. Now we want to learn how to prepare a text dataset to feed into a provided model. First, we start with a simple dataset, and then we try a harder example.
from keras.layers import Activation, Input, Dropout
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import Adam
from keras import backend as K
import nltk
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')
In every deep learning task, we need to divide our dataset into train and test sets. The train set is used to train the model, and the test set is used to evaluate the trained model. There is no fixed formula for the train/test proportion; it is up to you, but the majority of the dataset should go to the train set.
from sklearn.model_selection import train_test_split
'''
Split the documents into train and test datasets
'''
document = {'train': [], 'test': []} # Put the documents here
labels = {'train': [], 'test': []} # Put the labels here
train, test = train_test_split(movie_reviews.fileids(),test_size=0.33,shuffle=True)
document['train'] = [movie_reviews.raw(doc_id) for doc_id in train]
document['test'] = [movie_reviews.raw(doc_id) for doc_id in test]
labels['train'] = [movie_reviews.categories(doc_id) for doc_id in train]
labels['test'] = [movie_reviews.categories(doc_id) for doc_id in test]
To feed the text data into a deep model, we must convert the strings to numerical data. A variety of approaches are available for this purpose, and we use two of them in this task: One-Hot and TF-IDF encodings.
A one-hot vector is a group of bits among which the only legal combinations of values are those with a single high (1) bit and all the others low (0). So, in our case, we should convert each word to a vector in which exactly one cell is 1, namely the one that represents that specific word. Then, to represent a document as a vector, we sum the vectors of all the words in the document.
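For intuition, here is a tiny, self-contained example with made-up sentences; summing the per-word one-hot vectors of a document yields exactly the count vector that CountVectorizer produces below:
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat sat on the cat"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs).toarray()
# Column order of the vocabulary: ['cat', 'on', 'sat', 'the']
print(sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get))
print(toy_counts)  # e.g. the second document becomes [2, 1, 1, 2]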
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
'''
Encode documents to One-Hot representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
count_vect = CountVectorizer()
xs['train'] = count_vect.fit_transform(document['train']).toarray()
xs['test'] = count_vect.transform(document['test']).toarray()
As we studied in the TA class, for classification tasks we need to convert the labels into the one-hot format.
from sklearn.preprocessing import MultiLabelBinarizer
'''
Convert labels into One-Hot representation.
'''
ys = {'train': [], 'test': []} # Put the label vectors here
mlb = MultiLabelBinarizer()
ys['train'] = mlb.fit_transform(labels['train'])
ys['test'] = mlb.transform(labels['test'])
Now we build and train the model, and then visualize the results.
def recall(y_true, y_pred):
"""
Recall metric.
Only computes a batch-wise average of recall.
Computes the recall, a metric for multi-label classification of
how many relevant items are selected.
"""
true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
recall = true_positives / (possible_positives + K.epsilon())
return recall
def precision(y_true, y_pred):
"""
Precision metric.
Only computes a batch-wise average of precision.
Computes the precision, a metric for multi-label classification of
how many selected items are relevant.
Source
------
https://github.com/fchollet/keras/issues/5400#issuecomment-314747992
"""
true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
precision = true_positives / (predicted_positives + K.epsilon())
return precision
def f1(y_true, y_pred):
"""Calculate the F1 score."""
p = precision(y_true, y_pred)
r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))  # epsilon avoids division by zero when p + r == 0
def create_model(nb_classes, input_shape):
"""Create a MLP model."""
input_ = Input(shape=input_shape)
x = input_
x = Dense(16, activation='relu')(x)
x = Dense(16, activation='relu')(x)
x = Dense(nb_classes)(x)
x = Activation('sigmoid')(x)
model = Model(inputs=input_, outputs=x)
return model
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
%matplotlib inline
import matplotlib.pyplot as plt
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus [1]. TF-IDF considers both the frequency of a word in the document (term frequency) and the inverse document frequency, which measures how common the word is across documents. You can learn more about this approach here in order to implement it. Note that you need to provide a vector for each document with the same shape as the One-Hot vector but with different values.
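As a reminder of the idea (the exact formula varies by implementation; scikit-learn's TfidfVectorizer uses a smoothed IDF and then L2-normalizes each document vector): for a word $t$ in document $d$, with $N$ documents in total and $\mathrm{df}(t)$ of them containing $t$,
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$
so a word that appears in almost every document gets an IDF close to zero, while a rare word that is frequent in $d$ gets a large weight.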
from sklearn.feature_extraction.text import TfidfVectorizer
'''
Encode documents to TF-IDF representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
vectorizer = TfidfVectorizer()
xs['train'] = vectorizer.fit_transform(document['train']).toarray()
xs['test'] = vectorizer.transform(document['test']).toarray()
Now we train and visualize our model again. Note that the results may vary depending on the preprocessing you do or the tokenizer you use to split your data.
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
'''
Import necessary modules, download and prepare the requested dataset
'''
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xzf /content/aclImdb_v1.tar.gz
!cd aclImdb && mkdir movie_data
!cd aclImdb && for split in train test; do for sentiment in pos neg; do for file in $split/$sentiment/*; do cat $file >> movie_data/full_${split}.txt; echo >> movie_data/full_${split}.txt; done; done; done;
reviews_train = []
for line in open('/content/aclImdb/movie_data/full_train.txt', 'r'):
reviews_train.append(line.strip())
reviews_test = []
for line in open('/content/aclImdb/movie_data/full_test.txt', 'r'):
reviews_test.append(line.strip())
'''
Split the documents into train and test datasets
'''
document = {'train': [], 'test': []} # Put the documents here
labels = {'train': [], 'test': []} # Put the labels here
first = 6250
last = 18750
target = [1 if i < 12500 else 0 for i in range(first,last)]
documents = reviews_train[first:last]
document['train'], document['test'], labels['train'], labels['test'] = train_test_split(
np.asarray(documents), target, train_size = 0.75
)
Now you will train the dense model on this dataset. Use one of the encoding approaches you used for the prior dataset and then feed the preprocessed data into the model.
from keras.utils import to_categorical
'''
Encode documents to a vector representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
ys = {'train': [], 'test': []} # Put the label vectors here
vectorizer = TfidfVectorizer()
xs['train'] = vectorizer.fit_transform(document['train']).toarray()
xs['test'] = vectorizer.transform(document['test']).toarray()
ys['train'] = to_categorical(labels['train'])
ys['test'] = to_categorical(labels['test'])
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
In this section, we want to use a pre-trained word embedding to encode the reviews. To do so, we leverage the Google News Word2Vec model, which provides 300 semantic features for each word. These features are learned from each word's context, i.e., from the adjacent words in the training data (Google News). More details will be discussed later in the class.
You can download the pre-trained model from here, and you may want to use gensim to load the file. Next, you need to replace each document vector with the average of the word vectors that are available in the W2V model. Use a weighted average so that the frequency of a word, not just its presence, is taken into account; a sketch of one possible weighting is shown below.
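A minimal sketch of one possible weighted average follows; `w2v` and `idf_weight` are hypothetical placeholders for the KeyedVectors model loaded in the next cells and for any per-word weight you choose (for example, the IDF values of a fitted TfidfVectorizer):
import numpy as np

def weighted_doc_vector(text, w2v, idf_weight, dim=300):
    # Average the available word vectors, weighted by each word's importance
    vecs, weights = [], []
    for word in text.split():
        if word in w2v and word in idf_weight:
            vecs.append(w2v[word])
            weights.append(idf_weight[word])
    if not vecs:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.average(np.array(vecs), axis=0, weights=weights)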
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
!gunzip GoogleNews-vectors-negative300.bin.gz
from gensim.models import KeyedVectors
'''
Encode documents to a vector representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
ys = ys # Put the label vectors here
words = {}
for rev in reviews_train[first:last]:
for word in rev.split():
words[word]=1
word_vecs = {}
model = KeyedVectors.load_word2vec_format("/content/GoogleNews-vectors-negative300.bin", binary=True)
for word in words:
try:
word_vecs[word] = model.get_vector(word)
except KeyError:
# Word not in the vocabulary
pass
for rev in document['train']:
    tmp = [word_vecs[word] for word in rev.split() if word in word_vecs]
    # If no word of the review is in the W2V vocabulary, fall back to a zero vector
    mean = np.array(tmp).mean(axis=0) if tmp else np.zeros(300)
    xs['train'].append(mean)
for rev in document['test']:
    tmp = [word_vecs[word] for word in rev.split() if word in word_vecs]
    mean = np.array(tmp).mean(axis=0) if tmp else np.zeros(300)
    xs['test'].append(mean)
xs['train'] = np.asarray(xs['train'])
xs['test'] = np.asarray(xs['test'])
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
In this part, we want to classify animal images according to their species (frog vs. penguin).
First, we should download the dataset.
# Download the dataset
! wget -q http://iust-deep-learning.github.io/981/static_files/assignments/asg01_assets/data.zip
# Then, extract it
! unzip data.zip -d .
! cat frog_url.txt
As you can see, two files contain the URLs of the images, so you should download the images and save them in appropriate folders. Do it in this cell:
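A possible sketch of the download step is shown below; frog_url.txt is shown above, while the penguin URL file name (penguin_url.txt) and the output folder names are assumptions:
import os
import urllib.request

for species, url_file in [('frog', 'frog_url.txt'), ('penguin', 'penguin_url.txt')]:
    os.makedirs(species, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            # Save each image as e.g. frog/frog_0.jpg
            urllib.request.urlretrieve(url, os.path.join(species, '%s_%d.jpg' % (species, i)))
        except Exception:
            pass  # skip broken or dead links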
As a suggestion, it is better to view some of the images first. To do so, modify this code:
import cv2
import matplotlib.pyplot as plt
img_path = ''
img = cv2.imread(img_path)
# OpenCV loads images in BGR order; convert to RGB so matplotlib shows the true colors
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
Before going any further, we have to import some prerequisites:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.applications.vgg16 import decode_predictions
import numpy as np
In every deep learning task, we need to divide our dataset into train and test sets. The train set is used to train the model, and the test set is used to evaluate the trained model. There is no fixed formula for the train/test proportion; it is up to you, but the majority of the dataset should go to the train set.
'''
Split the images into train and test datasets
'''
images = {'train': [], 'test': []} # Put the images here
labels = {'train': [], 'test': []} # Put the labels here
########################################
# Put your implementation here #
########################################
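One possible way to fill the cell above is sketched here, assuming the images were saved under the frog/ and penguin/ folders (as in the earlier download sketch) and that we keep file paths rather than loaded arrays:
import glob
from sklearn.model_selection import train_test_split

frog_paths = glob.glob('frog/*')
penguin_paths = glob.glob('penguin/*')
all_paths = frog_paths + penguin_paths
all_labels = [0] * len(frog_paths) + [1] * len(penguin_paths)
images['train'], images['test'], labels['train'], labels['test'] = train_test_split(
    all_paths, all_labels, test_size=0.25, shuffle=True)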
Now we convert the images to numeric feature vectors to feed them into the network.
To do so, we leverage the VGG16 model. It is a CNN; such models will be discussed later in the course.
vgg16_model = VGG16(weights='imagenet', include_top=False)
vgg16_model.summary()
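As a quick sanity check (a sketch): for a single 224x224 RGB input, the convolutional part of VGG16 (include_top=False) outputs a 7x7x512 feature map, i.e. 25088 values once flattened.
# Feed a dummy batch of one image through the network and inspect the output shape
dummy = np.zeros((1, 224, 224, 3), dtype='float32')
print(vgg16_model.predict(dummy).shape)  # expected: (1, 7, 7, 512)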
To prepare the images to be fed into the network, some preprocessing is required. Implement it in this cell; for example, you can normalize the images.
def preprocess_image(image):
"""
preprocess input image
Args:
image: numpy array containing the input image (here, a batch of one image)
Returns:
img: numpy array containing the preprocessed image
"""
img = image.copy()
########################################
# Put your implementation here #
########################################
return img
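A minimal sketch of one possible preprocessing, using the preprocess_input helper imported above (it performs the ImageNet mean subtraction that the VGG16 weights expect); plain scaling to [0, 1] would be another valid choice:
def preprocess_image_example(image):
    # Cast to float and apply the VGG16 ImageNet preprocessing
    img = image.astype('float32')
    return preprocess_input(img)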
Now, you must first preprocess the images, then convert/encode them into feature vectors.
xs = {'train': [], 'test': []}
for img_path in images['train']:
    img = cv2.imread(img_path)        # read the image (assuming images[...] holds file paths)
    img = cv2.resize(img, (224, 224))
    img = np.expand_dims(img, axis=0)
    img = preprocess_image(img)
    features = vgg16_model.predict(img)
    ff = features.flatten()           # flatten the 7x7x512 feature map to a single vector
    xs['train'].append(ff)
for img_path in images['test']:
    img = cv2.imread(img_path)
    img = cv2.resize(img, (224, 224))
    img = np.expand_dims(img, axis=0)
    img = preprocess_image(img)
    features = vgg16_model.predict(img)
    ff = features.flatten()
    xs['test'].append(ff)
If you need to convert the labels into another format, you can do so by deleting the last two lines and implementing your own code.
ys = {'train': [], 'test': []}
ys['train'] = labels['train'][:]
ys['test'] = labels['test'][:]
Now implement an MLP model for this task to separate frog images from penguin images.
If you want to import additional modules or implement helper functions or classes, you can do so in this cell.
Now, implement your MLP from scratch.
class MLP(object):
def train(self, x, y):
"""
train MLP model on train data
Args:
x: 2d numpy array or list of train data
y: 1d or 2d numpy array or list of train data labels
"""
########################################
# Put your implementation here #
########################################
return True
def test(self, x, y):
"""
test MLP model on test data
Args:
x: 2d numpy array or list of test data
y: 1d or 2d numpy array or list of test data labels
Returns:
acc: in the simplest form, the ratio of the number of correct predictions to the total number of test samples
"""
########################################
# Put your implementation here #
########################################
return acc
def predict(self, x):
"""
predict output of MLP model on input data
Args:
x: 1d or 2d numpy array or list of input data
Returns:
pred: 1d numpy array, list, or integer representing the output predicted by the MLP
"""
########################################
# Put your implementation here #
########################################
return pred
def save_model(self, model_path):
"""
save model to disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def load_model(self, model_path):
"""
load model from disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def initialize_model():
"""
Initialize an MLP model that classifies the frog/penguin image features.
Returns:
model: A MLP object
Hint: Consider the length of the flattened VGG16 feature vectors and the number of classes (two), and initialize the weights accordingly.
"""
########################################
# Put your implementation here #
########################################
return model
Evaluate your model (don't change this cell):
mlp = initialize_model()
mlp.train(xs['train'], ys['train'])
print('your model accuracy on test data is: %s' % (mlp.test(xs['test'], ys['test'])))
mlp.save_model(ASSIGNMENT_PATH / 'topvgg16_model.h5')
Congratulations! You finished the assignment & you're ready to submit your work. Please follow these instructions:
Run the submission cell below. It will create a zip file named dl_asg01__<student_id>__<student_name>.zip; download it and submit it via https://forms.gle/3srwTZhBbc4KfXaR8. Note: We need your GitHub token to create a new repository (if it doesn't already exist) to store the learned model data. The Google Drive token enables us to download the current notebook & create the submission. If you are interested, feel free to check our code.
#@title
! pip install -U --quiet PyDrive > /dev/null
! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz
import os
import time
import yaml
import json
from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
asg_name = 'assignment_01'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
Jupyter.notebook.save_checkpoint();
});
'''
repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg01__%s__%s.zip'%(student_id, student_name.lower().replace(' ', '_'))
! tar xf hub-linux-amd64-2.10.0.tgz
! cd hub-linux-amd64-2.10.0/ && chmod a+x install && ./install
! hub config --global hub.protocol https
! hub config --global user.email "$Your_Github_account_Email"
! hub config --global user.name "$student_name"
! hub api --flat -X GET /user
! hub api -F affiliation=owner -X GET /user/repos > repos.json
repos = json.load(open('repos.json'))
repo_names = [r['name'] for r in repos]
has_repository = repo_name in repo_names
if not has_repository:
get_ipython().system_raw('! hub api -X POST -F name=%s /user/repos > repo_info.json' % repo_name)
repo_info = json.load(open('repo_info.json'))
repo_url = repo_info['clone_url']
else:
for r in repos:
if r['name'] == repo_name:
repo_url = r['clone_url']
stream = open("/root/.config/hub", "r")
token = list(yaml.load_all(stream))[0]['github.com'][0]['oauth_token']
repo_url_with_token = 'https://'+token+"@" +repo_url.split('https://')[1]
! git clone "$repo_url_with_token"
! cp -r "$ASSIGNMENT_PATH" "$repo_name"/
! cd "$repo_name" && git add -A
! cd "$repo_name" && git commit -m "Add assignment 02 results"
! cd "$repo_name" && git push -u origin master
sub_info = {
'student_id': student_id,
'student_name': student_name,
'repo_url': repo_url,
'asg_dir_contents': os.listdir(str(ASSIGNMENT_PATH)),
'dateime': str(time.time()),
'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))
Javascript(script_save)
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name)
! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null
print("##########################################")
print("Done! Submisson created, Please download using the bellow cell!")
#@title
files.download(submission_file_name)
If that cell raises an error when run, you can download the file dl_asg01__<your_student_id>__<your_name>.zip from the Files section in the left panel by right-clicking on it and choosing Download.
Special thanks to Amirhossein Kazemnejad and Kiamehr Razaee for creating the template of deep learning course assignments.