Please pay attention to these notes:
Assignment Page: https://iust-deep-learning.github.io/981/assignments/01_mlp_and_preprocessing
Course Forum: https://groups.google.com/forum/#!forum/dl981/
Fill your information here & run the cell
#@title Enter your information & "RUN the cell!!" { run: "auto" }
student_id = 0 #@param {type:"integer"}
student_name = "" #@param {type:"string"}
Your_Github_account_Email = "" #@param {type:"string"}
print("your student id:", student_id)
print("your name:", student_name)
from pathlib import Path
ASSIGNMENT_PATH = Path('asg01')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)
In class, we studied the MLP. In this part, you have to implement your own MLP and train and test it on the Iris dataset.
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
You can see this link for more details.
Let's get this simple dataset and see some samples of it.
from sklearn.datasets import load_iris
iris = load_iris()
print(iris['data'][:10])
print(iris['target'][:10])
For implementing an MLP from scratch for this part and Part 3, please see this.
import numpy as np
If you want to import additional modules or implement helper functions or classes, you can do so in this cell.
Now, implement your MLP from scratch.
class MLP(object):
def train(self, x, y):
"""
train MLP model on train data
Args:
x: 2d numpy array or list of train data
y: 1d or 2d numpy array or list of train data labels
"""
########################################
# Put your implementation here #
########################################
return True
def test(self, x, y):
"""
test MLP model on test data
Args:
x: 2d numpy array or list of test data
y: 1d or 2d numpy array or list of test data labels
Returns:
acc: in the simplest form, the ratio of the number of correct predictions to the total number of test samples
"""
########################################
# Put your implementation here #
########################################
return acc
def predict(self, x):
"""
predict output of MLP model on input data
Args:
x: 1d or 2d numpy array or list of input data
Returns:
pred: 1d numpy array, list, or integer representing the output predicted by the MLP
"""
########################################
# Put your implementation here #
########################################
return pred
def save_model(self, model_path):
"""
save model to disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def load_model(self, model_path):
"""
load model from disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def initialize_model():
"""
Initialize an MLP model that classifies the Iris dataset.
Returns:
model: A MLP object
Hint: Consider the number of features in the Iris dataset and the number of its classes, and initialize the weights accordingly.
"""
########################################
# Put your implementation here #
########################################
return model
def split_train_test(x, y):
"""
split input data and labels to train and test sections.
Args:
x: 2d numpy array or list of input data
y: 1d or 2d numpy array or list of data labels
Returns:
train_data: 2d numpy array or list of train_data
train_labels: 1d or 2d numpy array or list of train data labels
test_data: 2d numpy array or list of test_data
test_labels: 1d or 2d numpy array or list of test data labels
"""
########################################
# Put your implementation here #
########################################
return train_data, train_labels, test_data, test_labels
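As a point of comparison (not the required implementation), here is a minimal sketch of one possible split_train_test, using a shuffled 80/20 split; the ratio and the helper name are free choices:
def split_train_test_example(x, y, test_ratio=0.2):
    # Shuffle the indices, then cut off the last test_ratio fraction as the test set
    x, y = np.asarray(x), np.asarray(y)
    idx = np.random.permutation(len(x))
    n_test = int(len(x) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]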
Test your implementation (don't change this cell):
mlp = initialize_model()
train_data, train_labels, test_data, test_labels = split_train_test(iris['data'], iris['target'])
mlp.train(train_data, train_labels)
mlp.save_model(ASSIGNMENT_PATH / 'my_model.h5')
del mlp
new_mlp = initialize_model()
new_mlp.load_model(ASSIGNMENT_PATH / 'my_model.h5')
print('your model accuracy on test data is: %s' % (new_mlp.test(test_data, test_labels)))
In class, we studied the mathematics behind back-propagation when the activation function of the last layer is ReLU. Now write the corresponding equations for the softmax activation function and derive the delta formulas for all layers.
Please see this.
$\color{red}{\text{Write your answer here}}$
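As a sanity check for your derivation (a sketch, assuming the softmax output layer is paired with a cross-entropy loss): with $a^L_i = \frac{e^{z^L_i}}{\sum_k e^{z^L_k}}$ and $E = -\sum_i y_i \log a^L_i$, the softmax Jacobian is $\frac{\partial a^L_i}{\partial z^L_j} = a^L_i(\delta_{ij} - a^L_j)$, so the output-layer delta collapses to $\delta^L = \frac{\partial E}{\partial z^L} = a^L - y$, while the hidden-layer deltas keep the usual form $\delta^{l} = \big((W^{l+1})^\top \delta^{l+1}\big) \odot \sigma'(z^{l})$.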
In class, we studied how to build a basic dense model. Now we want to learn how to prepare a text dataset to feed into a provided model. First, we start with a simple dataset, and then we try a harder example.
from keras.layers import Activation, Input, Dropout
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import Adam
from keras import backend as K
import nltk
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')
In every deep learning task, we need to divide our dataset into train and test sets. The train set is used to train the model, and the test set is used to evaluate the trained model. There is no fixed formula for the train/test proportion; it is up to you, but the majority of the dataset should go to the train set.
from sklearn.model_selection import train_test_split
'''
Split the documents into train and test datasets
'''
document = {'train': [], 'test': []} # Put the documents here
labels = {'train': [], 'test': []} # Put the labels here
train, test = train_test_split(movie_reviews.fileids(),test_size=0.33,shuffle=True)
document['train'] = [movie_reviews.raw(doc_id) for doc_id in train]
document['test'] = [movie_reviews.raw(doc_id) for doc_id in test]
labels['train'] = [movie_reviews.categories(doc_id) for doc_id in train]
labels['test'] = [movie_reviews.categories(doc_id) for doc_id in test]
To feed the text data into a deep model, we must convert the strings to numerical data. A variety of approaches are available for this purpose, and we use two of them in this task: One-Hot and TF-IDF encodings.
A one-hot vector is a group of bits among which the only legal combinations of values are those with a single high (1) bit and all the others low (0). So, in our case, we should convert each word to a vector in which exactly one cell is 1, namely the one that represents that specific word. Then, to represent a document as a vector, we sum the vectors of all the words in the document.
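For intuition, here is a tiny, self-contained example with made-up sentences; summing the per-word one-hot vectors of a document yields exactly the count vector that CountVectorizer produces below:
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat sat on the cat"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs).toarray()
# Column order of the vocabulary: ['cat', 'on', 'sat', 'the']
print(sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get))
print(toy_counts)  # e.g. the second document becomes [2, 1, 1, 2]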
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
'''
Encode documents to One-Hot representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
count_vect = CountVectorizer()
xs['train'] = count_vect.fit_transform(document['train']).toarray()
xs['test'] = count_vect.transform(document['test']).toarray()
As we studied in the TA class, for classification tasks we need to convert the labels into the one-hot format.
from sklearn.preprocessing import MultiLabelBinarizer
'''
Convert labels into One-Hot representation.
'''
ys = {'train': [], 'test': []} # Put the label vectors here
mlb = MultiLabelBinarizer()
ys['train'] = mlb.fit_transform(labels['train'])
ys['test'] = mlb.transform(labels['test'])
Now we build and train the model, and then visualize the results.
def recall(y_true, y_pred):
"""
Recall metric.
Only computes a batch-wise average of recall.
Computes the recall, a metric for multi-label classification of
how many relevant items are selected.
"""
true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
recall = true_positives / (possible_positives + K.epsilon())
return recall
def precision(y_true, y_pred):
"""
Precision metric.
Only computes a batch-wise average of precision.
Computes the precision, a metric for multi-label classification of
how many selected items are relevant.
Source
------
https://github.com/fchollet/keras/issues/5400#issuecomment-314747992
"""
true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
precision = true_positives / (predicted_positives + K.epsilon())
return precision
def f1(y_true, y_pred):
"""Calculate the F1 score."""
p = precision(y_true, y_pred)
r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))  # epsilon avoids division by zero when p + r == 0
def create_model(nb_classes, input_shape):
"""Create a MLP model."""
input_ = Input(shape=input_shape)
x = input_
x = Dense(16, activation='relu')(x)
x = Dense(16, activation='relu')(x)
x = Dense(nb_classes)(x)
x = Activation('sigmoid')(x)
model = Model(inputs=input_, outputs=x)
return model
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
%matplotlib inline
import matplotlib.pyplot as plt
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus [1]. TF-IDF considers both the frequency of a word in the document (term frequency) and the inverse document frequency, which measures how common the word is across documents. You can learn more about this approach here in order to implement it. Note that you need to provide a vector for each document with the same shape as the One-Hot vector but with different values.
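As a reminder of the idea (the exact formula varies by implementation; scikit-learn's TfidfVectorizer uses a smoothed IDF and then L2-normalizes each document vector): for a word $t$ in document $d$, with $N$ documents in total and $\mathrm{df}(t)$ of them containing $t$,
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$
so a word that appears in almost every document gets an IDF close to zero, while a rare word that is frequent in $d$ gets a large weight.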
from sklearn.feature_extraction.text import TfidfVectorizer
'''
Encode documents to TF-IDF representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
vectorizer = TfidfVectorizer()
xs['train'] = vectorizer.fit_transform(document['train']).toarray()
xs['test'] = vectorizer.transform(document['test']).toarray()
Now we train and visualize our model again. Note that the results may vary depending on the preprocessing you do or the tokenizer you use to split your data.
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
'''
Import necessary modules, download and prepare the requested dataset
'''
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xzf /content/aclImdb_v1.tar.gz
!cd aclImdb && mkdir movie_data
!cd aclImdb && for split in train test; do for sentiment in pos neg; do for file in $split/$sentiment/*; do cat $file >> movie_data/full_${split}.txt; echo >> movie_data/full_${split}.txt; done; done; done;
reviews_train = []
for line in open('/content/aclImdb/movie_data/full_train.txt', 'r'):
reviews_train.append(line.strip())
reviews_test = []
for line in open('/content/aclImdb/movie_data/full_test.txt', 'r'):
reviews_test.append(line.strip())
'''
Split the documents into train and test datasets
'''
document = {'train': [], 'test': []} # Put the documents here
labels = {'train': [], 'test': []} # Put the labels here
first = 6250
last = 18750
target = [1 if i < 12500 else 0 for i in range(first,last)]
documents = reviews_train[first:last]
document['train'], document['test'], labels['train'], labels['test'] = train_test_split(
np.asarray(documents), target, train_size = 0.75
)
Now you will train the dense model on this dataset. Use one of the encoding approaches you used for the prior dataset and then feed the preprocessed data into the model.
from keras.utils import to_categorical
'''
Encode documents to a vector representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
ys = {'train': [], 'test': []} # Put the label vectors here
vectorizer = TfidfVectorizer()
xs['train'] = vectorizer.fit_transform(document['train']).toarray()
xs['test'] = vectorizer.transform(document['test']).toarray()
ys['train'] = to_categorical(labels['train'])
ys['test'] = to_categorical(labels['test'])
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
In this section, we want to use a pre-trained word embedding to encode the reviews. To do so, we leverage the Google News Word2Vec model, which provides 300 semantic features for each word. These features are learned from each word's context, i.e., from the adjacent words in the training data (Google News). More details will be discussed later in the class.
You can download the pre-trained model from here, and you may want to use gensim to load the file. Next, you need to replace each document vector with the average of the word vectors that are available in the W2V model. Use a weighted average so that the frequency of a word, not just its presence, is taken into account; a sketch of one possible weighting is shown below.
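A minimal sketch of one possible weighted average follows; `w2v` and `idf_weight` are hypothetical placeholders for the KeyedVectors model loaded in the next cells and for any per-word weight you choose (for example, the IDF values of a fitted TfidfVectorizer):
import numpy as np

def weighted_doc_vector(text, w2v, idf_weight, dim=300):
    # Average the available word vectors, weighted by each word's importance
    vecs, weights = [], []
    for word in text.split():
        if word in w2v and word in idf_weight:
            vecs.append(w2v[word])
            weights.append(idf_weight[word])
    if not vecs:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.average(np.array(vecs), axis=0, weights=weights)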
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
!gunzip GoogleNews-vectors-negative300.bin.gz
from gensim.models import KeyedVectors
'''
Encode documents to a vector representation.
'''
xs = {'train': [], 'test': []} # Put the document vectors here
ys = ys # Put the label vectors here
words = {}
for rev in reviews_train[first:last]:
for word in rev.split():
words[word]=1
word_vecs = {}
model = KeyedVectors.load_word2vec_format("/content/GoogleNews-vectors-negative300.bin", binary=True)
for word in words:
try:
word_vecs[word] = model.get_vector(word)
except KeyError:
# Word not in the vocabulary
pass
for rev in document['train']:
    tmp = [word_vecs[word] for word in rev.split() if word in word_vecs]
    # If no word of the review is in the W2V vocabulary, fall back to a zero vector
    mean = np.array(tmp).mean(axis=0) if tmp else np.zeros(300)
    xs['train'].append(mean)
for rev in document['test']:
    tmp = [word_vecs[word] for word in rev.split() if word in word_vecs]
    mean = np.array(tmp).mean(axis=0) if tmp else np.zeros(300)
    xs['test'].append(mean)
xs['train'] = np.asarray(xs['train'])
xs['test'] = np.asarray(xs['test'])
data = {'x_train': xs['train'], 'y_train': ys['train'],
'x_test': xs['test'], 'y_test': ys['test']}
model = create_model(2, (data['x_train'].shape[1], ))
model.compile(loss='binary_crossentropy',optimizer="adam", metrics=["accuracy",f1,recall,precision])
history = model.fit(data['x_train'], data['y_train'],
batch_size=32,
epochs=20,
validation_data=(data['x_test'], data['y_test']))
history_dict = history.history
acc = history_dict['acc']
epochs = range(1, len(acc) + 1)
val_acc = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend()
plt.show()
In this part, we want to classify animal images according to their species (frog vs. penguin).
First, we should download the dataset.
# Download the dataset
! wget -q http://iust-deep-learning.github.io/981/static_files/assignments/asg01_assets/data.zip
# Then, extract it
! unzip data.zip -d .
! cat frog_url.txt
As you can see, two files contain the URLs of the images, so you should download the images and save them in appropriate folders. Do it in this cell:
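A possible sketch of the download step is shown below; frog_url.txt is shown above, while the penguin URL file name (penguin_url.txt) and the output folder names are assumptions:
import os
import urllib.request

for species, url_file in [('frog', 'frog_url.txt'), ('penguin', 'penguin_url.txt')]:
    os.makedirs(species, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            # Save each image as e.g. frog/frog_0.jpg
            urllib.request.urlretrieve(url, os.path.join(species, '%s_%d.jpg' % (species, i)))
        except Exception:
            pass  # skip broken or dead links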
As a suggestion, it is better to view some of the images first. To do so, modify this code:
import cv2
import matplotlib.pyplot as plt
img_path = ''
img = cv2.imread(img_path)
# OpenCV loads images in BGR order; convert to RGB so matplotlib shows the true colors
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
Before going any further, we have to import some prerequisites:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.applications.vgg16 import decode_predictions
import numpy as np
In every deep learning task, we need to divide our dataset into train and test sets. The train set is used to train the model, and the test set is used to evaluate the trained model. There is no fixed formula for the train/test proportion; it is up to you, but the majority of the dataset should go to the train set.
'''
Split the images into train and test datasets
'''
images = {'train': [], 'test': []} # Put the images here
labels = {'train': [], 'test': []} # Put the labels here
########################################
# Put your implementation here #
########################################
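One possible way to fill the cell above is sketched here, assuming the images were saved under the frog/ and penguin/ folders (as in the earlier download sketch) and that we keep file paths rather than loaded arrays:
import glob
from sklearn.model_selection import train_test_split

frog_paths = glob.glob('frog/*')
penguin_paths = glob.glob('penguin/*')
all_paths = frog_paths + penguin_paths
all_labels = [0] * len(frog_paths) + [1] * len(penguin_paths)
images['train'], images['test'], labels['train'], labels['test'] = train_test_split(
    all_paths, all_labels, test_size=0.25, shuffle=True)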
Now we convert the images to numeric feature vectors to feed them into the network.
To do so, we leverage the VGG16 model. It is a CNN; such models will be discussed later in the course.
vgg16_model = VGG16(weights='imagenet', include_top=False)
vgg16_model.summary()
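As a quick sanity check (a sketch): for a single 224x224 RGB input, the convolutional part of VGG16 (include_top=False) outputs a 7x7x512 feature map, i.e. 25088 values once flattened.
# Feed a dummy batch of one image through the network and inspect the output shape
dummy = np.zeros((1, 224, 224, 3), dtype='float32')
print(vgg16_model.predict(dummy).shape)  # expected: (1, 7, 7, 512)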
To prepare the images to be fed into the network, some preprocessing is required. Implement it in this cell; for example, you can normalize the images.
def preprocess_image(image):
"""
preprocess input image
Args:
image: numpy array containing the input image (here, a batch of one image)
Returns:
img: numpy array containing the preprocessed image
"""
img = image.copy()
########################################
# Put your implementation here #
########################################
return img
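A minimal sketch of one possible preprocessing, using the preprocess_input helper imported above (it performs the ImageNet mean subtraction that the VGG16 weights expect); plain scaling to [0, 1] would be another valid choice:
def preprocess_image_example(image):
    # Cast to float and apply the VGG16 ImageNet preprocessing
    img = image.astype('float32')
    return preprocess_input(img)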
Now, you must first preprocess the images, then convert/encode them into feature vectors.
xs = {'train': [], 'test': []}
for img_path in images['train']:
    img = cv2.imread(img_path)        # read the image (assuming images[...] holds file paths)
    img = cv2.resize(img, (224, 224))
    img = np.expand_dims(img, axis=0)
    img = preprocess_image(img)
    features = vgg16_model.predict(img)
    ff = features.flatten()           # flatten the 7x7x512 feature map to a single vector
    xs['train'].append(ff)
for img_path in images['test']:
    img = cv2.imread(img_path)
    img = cv2.resize(img, (224, 224))
    img = np.expand_dims(img, axis=0)
    img = preprocess_image(img)
    features = vgg16_model.predict(img)
    ff = features.flatten()
    xs['test'].append(ff)
If you need to convert the labels into another format, you can do so by deleting the last two lines and implementing your own code.
ys = {'train': [], 'test': []}
ys['train'] = labels['train'][:]
ys['test'] = labels['test'][:]
Now implement an MLP model for this task to separate frog images from penguin images.
If you want to import additional modules or implement helper functions or classes, you can do so in this cell.
Now, implement your MLP from scratch.
class MLP(object):
def train(self, x, y):
"""
train MLP model on train data
Args:
x: 2d numpy array or list of train data
y: 1d or 2d numpy array or list of train data labels
"""
########################################
# Put your implementation here #
########################################
return True
def test(self, x, y):
"""
test MLP model on test data
Args:
x: 2d numpy array or list of test data
y: 1d or 2d numpy array or list of test data labels
Returns:
acc: in the simplest form, the ratio of the number of correct predictions to the total number of test samples
"""
########################################
# Put your implementation here #
########################################
return acc
def predict(self, x):
"""
predict output of MLP model on input data
Args:
x: 1d or 2d numpy array or list of input data
Returns:
pred: 1d numpy array, list, or integer representing the output predicted by the MLP
"""
########################################
# Put your implementation here #
########################################
return pred
def save_model(self, model_path):
"""
save model to disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def load_model(self, model_path):
"""
load model from disk
Args:
model_path: path of model
"""
########################################
# Put your implementation here #
########################################
return True
def initialize_model():
"""
Initialize an MLP model that classifies the frog/penguin image features.
Returns:
model: A MLP object
Hint: Consider the length of the flattened VGG16 feature vectors and the number of classes (two), and initialize the weights accordingly.
"""
########################################
# Put your implementation here #
########################################
return model
Evaluate your model (don't change this cell):
mlp = initialize_model()
mlp.train(xs['train'], ys['train'])
print('your model accuracy on test data is: %s' % (mlp.test(xs['test'], ys['test'])))
mlp.save_model(ASSIGNMENT_PATH / 'topvgg16_model.h5')
Congratulations! You finished the assignment & you're ready to submit your work. Please follow these instructions:
Run the submission cell below. It will create a zip file named dl_asg01__<student_id>__<student_name>.zip; download it and submit it via https://forms.gle/3srwTZhBbc4KfXaR8. Note: We need your GitHub token to create a new repository (if it doesn't already exist) to store the learned model data. The Google Drive token enables us to download the current notebook & create the submission. If you are interested, feel free to check our code.
#@title
! pip install -U --quiet PyDrive > /dev/null
! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz
import os
import time
import yaml
import json
from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
asg_name = 'assignment_01'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
Jupyter.notebook.save_checkpoint();
});
'''
repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg01__%s__%s.zip'%(student_id, student_name.lower().replace(' ', '_'))
! tar xf hub-linux-amd64-2.10.0.tgz
! cd hub-linux-amd64-2.10.0/ && chmod a+x install && ./install
! hub config --global hub.protocol https
! hub config --global user.email "$Your_Github_account_Email"
! hub config --global user.name "$student_name"
! hub api --flat -X GET /user
! hub api -F affiliation=owner -X GET /user/repos > repos.json
repos = json.load(open('repos.json'))
repo_names = [r['name'] for r in repos]
has_repository = repo_name in repo_names
if not has_repository:
get_ipython().system_raw('! hub api -X POST -F name=%s /user/repos > repo_info.json' % repo_name)
repo_info = json.load(open('repo_info.json'))
repo_url = repo_info['clone_url']
else:
for r in repos:
if r['name'] == repo_name:
repo_url = r['clone_url']
stream = open("/root/.config/hub", "r")
token = list(yaml.load_all(stream))[0]['github.com'][0]['oauth_token']
repo_url_with_token = 'https://'+token+"@" +repo_url.split('https://')[1]
! git clone "$repo_url_with_token"
! cp -r "$ASSIGNMENT_PATH" "$repo_name"/
! cd "$repo_name" && git add -A
! cd "$repo_name" && git commit -m "Add assignment 02 results"
! cd "$repo_name" && git push -u origin master
sub_info = {
'student_id': student_id,
'student_name': student_name,
'repo_url': repo_url,
'asg_dir_contents': os.listdir(str(ASSIGNMENT_PATH)),
'dateime': str(time.time()),
'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))
Javascript(script_save)
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name)
! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null
print("##########################################")
print("Done! Submisson created, Please download using the bellow cell!")
#@title
files.download(submission_file_name)
If that cell raises an error when run, you can download the file dl_asg01__<your_student_id>__<your_name>.zip from the Files section in the left panel by right-clicking on it and choosing Download.
Special thanks to Amirhossein Kazemnejad and Kiamehr Razaee for creating the template of deep learning course assignments.