Please pay attention to these notes:
The coding parts you need to implement are denoted by blocks like:
########################################
# Put your implementation here #
########################################
Assignment Page: https://iust-deep-learning.github.io/972/assignments/04_nlp_intro
Course Forum: https://groups.google.com/forum/#!forum/dl972/
Fill your information here & run the cell
#@title Enter your information & "RUN the cell!!"
student_id = 0 #@param {type:"integer"}
student_name = "" #@param {type:"string"}
Your_Github_account_Email = "" #@param {type:"string"}
print("your student id:", student_id)
print("your name:", student_name)
from pathlib import Path
ASSIGNMENT_PATH = Path('asg04')
ASSIGNMENT_PATH.mkdir(parents=True, exist_ok=True)
In any NLP task involving neural networks, we need a numerical representation of our input (which mainly consists of words). A naive solution would be to use a huge one-hot vector with the same size as our vocabulary, each element representing one word. But this sparse representation is a poor use of a huge multidimensional space, as it does not carry any useful information about the meaning and semantics of a word. This is where word embeddings come in handy.
Embeddings are another way of representing vocabulary in a lower-dimensional (compared to the one-hot representation) continuous space. The goal is to have similar vectors for words with similar meanings (so that the elements of the vector actually carry some information about the meaning of the word). The question is, how are we going to achieve such representations? The idea is simple but elegant: words appearing in the same context are likely to have similar meanings.
So how can we use this idea to learn word vectors?
We are going to train a simple neural network with a single hidden layer to perform a certain task, but then we’re not actually going to use that neural network for the task we trained it on! Instead, the goal is actually just to learn the weights of the hidden layer and use this hidden layer as our word representation vector.
So let's talk about this "fake" task. We're going to train the neural network to do the following: given a specific word (the input word), the network will tell us, for every word in our vocabulary, the probability of it being near the given word (i.e., being one of its context words). So the network is going to look something like this (assuming our vocabulary size is 10,000):
By training the network on this task, words which appear in similar contexts are forced to have similar values in the hidden layer, since they have to produce similar outputs; so we can use these hidden-layer values as our word representations.
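For concreteness, here is a minimal Keras sketch of such a network; the 10,000-word vocabulary and 300-dimensional hidden layer are assumed sizes for illustration only, and the hidden layer's weight matrix is exactly the embedding table we are after.
# A minimal sketch of the "fake task" network, not a tuned implementation.
# Input: one-hot vector of the center word; output: probability of every
# vocabulary word appearing in its context.
from keras.models import Sequential
from keras.layers import Dense

VOCAB = 10000   # assumed vocabulary size
EMBED = 300     # assumed embedding dimension

sketch = Sequential([
    # No activation and no bias: multiplying a one-hot vector by this weight
    # matrix simply selects one row, i.e. the word's embedding.
    Dense(EMBED, input_shape=(VOCAB,), use_bias=False, name='embedding'),
    # Softmax over the whole vocabulary (the expensive part discussed below).
    Dense(VOCAB, activation='softmax', name='context_probabilities'),
])
sketch.compile(optimizer='adam', loss='categorical_crossentropy')
sketch.summary()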
$\color{red}{\text{Write your answer here}}$
Softmax is a very handy tool when it comes to predicting probability distributions, but it has its downsides when the number of output nodes grows too large. Let's look at the softmax activation in our output layer:
$$ \mathbf{S_{ij}} = \frac{e^{W_{j}^T Y_{i-1}}}{\sum_{k=1}^{N} e^{W_{k}^T Y_{i-1}}} $$
As you can see, every single output depends on all the other outputs, so in order to compute the derivative with respect to any weight, all the other weights play a role! For an output size of 10,000 this results in millions of mathematical operations for a single weight update, which is not practical at all!
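To make this explicit, the gradient of a single softmax output with respect to any weight column (a standard result, stated here without derivation and using the notation above) is
$$ \frac{\partial \mathbf{S_{ij}}}{\partial W_{k}} = \mathbf{S_{ij}} \left( \delta_{jk} - \mathbf{S_{ik}} \right) Y_{i-1} $$
so updating even a single weight column requires $\mathbf{S_{ik}}$, and hence the normalization sum over all $N$ outputs.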
$\color{red}{\text{Write your answer here}}$
$\color{red}{\text{Write your answer here}}$
There is a very good Python library called gensim for working with word2vec. You can train your own word vectors on your own corpora or use available pre-trained models. For example, the following model contains word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset, with a vector length of 300 features:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gunzip GoogleNews-vectors-negative300.bin.gz
Let's load this model in Python:
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print ("# of words", len(model.vocab))
print ("# of vectors", len(model.vectors))
print ("the first 10 elements of embedding vector for the word king:",
model.vectors[model.vocab["king"].index][:10])
As you can see it requires a huge amount of memory!
target_words = ["king", "horse", "blue", "apple",
"computer", "lion", "rome", "tehran",
"orange", "red", "army", "cat",
"asia", "mouse"]
########################################
# Put your implementation here #
########################################
You can find the cosine similarity between two word vectors using the similarity method:
print ('logitech', '/', 'cat', '->', model.similarity('logitech', 'cat'))
print ('black', '/', 'criminal', '->', model.similarity('black', 'criminal'))
print ('white', '/', 'criminal', '->', model.similarity('white', 'criminal'))
print ('black', '/', 'offensive', '->', model.similarity('black', 'offensive'))
print ('white', '/', 'offensive', '->', model.similarity('white', 'offensive'))
$\color{red}{\text{Write your answer here}}$
$\color{red}{\text{Write your answer here}}$
########################################
# Put your implementation here #
########################################
Word vectors have some other cool properties. For example, we know the relation between the meanings of the words "man" and "woman" is similar to the relation between "king" and "queen". So we expect $e_{queen} - e_{king} = e_{woman} - e_{man}$, or equivalently $e_{queen} = e_{king} + e_{woman} - e_{man}$.
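As an illustration, gensim's most_similar method can check this relation directly; the snippet below is only a quick sketch using the model loaded above, not the implementation you are asked to write in the next cell.
# king + woman - man should land near "queen" in the embedding space.
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))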
########################################
# Put your implementation here #
########################################
From the previous section, we saw that word vectors can store a lot of semantic information. But can we solve an NLP task by just feeding them through a simple neural network? Assume we want to find all named entities in a given sentence (a.k.a. Named Entity Recognition). For example, in "I bought 300 shares of Apple Corp. in the last year", we want to locate the word "Apple" and categorize it as an Organization entity.
Obviously, a neural network cannot guess the type based entirely on a single word. We need to provide an extra piece of information to help the decision. This piece of information is called "context". We can decide whether the word "Apple" refers to the company or the fruit by seeing it in a sentence (its context). However, feeding a complete sentence through a network is inefficient, as it makes the input layer really big even for a 10-word sentence (10 * 300 = 3000, assuming an embedding size of 300).
To make training such a network feasible, we build the input from only the K surrounding neighbor words. Hence, "Apple" can easily be classified as a company by looking at the context window [the, Apple, corporation].
In a window-based classifier, every input sentence $X = [\mathbf{x^{(1)}}, ... , \mathbf{x^{(T)}}]$ with a label sequence $Y = [\mathbf{y^{(1)}}, ..., \mathbf{y^{(T)}}]$ is split into $T$ <context window, center word label> data points. We create a context window $\mathbf{w^{(t)}}$ for every token $\mathbf{x^{(t)}}$ in the original sentence by concatenating its $k$ surrounding neighbors: $\mathbf{w^{(t)}} = [\mathbf{x^{(t-k)}}; ...; \mathbf{x^{(t)}}; ...; \mathbf{x^{(t+k)}}]$; therefore, each new data point is $\langle \mathbf{w^{(t)}} , \mathbf{y^{(t)}} \rangle$.
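For concreteness, here is a toy illustration of the window construction for k = 1; the padding token and variable names are hypothetical, and the provided convert_to_window_based helper does the real work later.
# Build <context window, center word label> pairs for a toy sentence (k = 1).
sentence = ["I", "bought", "shares", "of", "Apple"]
labels   = ["O", "O", "O", "O", "ORG"]
k = 1
padded = ["<pad>"] * k + sentence + ["<pad>"] * k
windows = [(padded[t:t + 2 * k + 1], labels[t]) for t in range(len(sentence))]
# e.g. windows[4] == (["of", "Apple", "<pad>"], "ORG")
print(windows)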
Having word-case information might also help the neural network to find named entities with higher confidence. To incorporate casing, every token $\mathbf{x^{(t)}}$ is augmented with a feature vector $\mathbf{c^{(t)}}$ representing this information: $\mathbf{x^{(t)}} = [\mathbf{e^{(t)}};\mathbf{c^{(t)}}]$, where $\mathbf{e^{(t)}}$ is the corresponding embedding.
In this section, we aim to build a window-based feedforward neural network for the NER task, and then analyze its limitations through a case study.
Let's import some dependencies.
! wget -q https://iust-deep-learning.github.io/972/static_files/assignments/asg04_assets/data.tar.gz
! tar xvfz data.tar.gz > /dev/null
from IPython.display import SVG
from pprint import pprint
import numpy as np
import keras
from keras.models import Model
from keras.utils.vis_utils import model_to_dot
from keras.utils import to_categorical
from ner_util import read_dataset, convert_to_window_based, preprocess, LBLS, \
UNK_TOK, plot_confusion_matrix, visualize_loss_and_acc
from ag_news_util import read_ag_news, AG_NEWS_LBLS, create_model_input, create_vocab
! pip install -q tqdm flair
from tqdm import tqdm
from flair.data import Sentence
from flair.models import SequenceTagger
And define the model's hyperparameters:
NUM_NEIGHBORS = 1
WINDOW_SIZE = 2 * NUM_NEIGHBORS + 1
VOCAB_SIZE = 10*1000
EMBEDDING_DIM = 300
NUM_CLASSES = 5
BATCH_SIZE = 512
As discussed earlier, we want to include the word-casing information. Below is the skeleton of the function that encodes the casing details in a d-dimensional vector. The words "Hello", "hello", "HELLO" and "hEllO" have four different casings. Your encoding should support all of them; in other words, the implemented function must return 4 different vectors for these inputs, but the same output for "Bye" and "Hello", "bye" and "hello", "bYe" and "hEllo", etc.
# The Default dimension for the casing vector.
# You can change it to match your desired encoding.
CASING_DIM = 4
def get_casing(word):
"""
Return the casing information in a numpy array.
Args:
word(str): input word, E.g. Hello
Returns:
np.array(shape=(CASING_DIM,)): encoded casing
Hint: You might find the one-hot encoding useful.
"""
casing = np.zeros(shape=(CASING_DIM,))
########################################
# Put your implementation here #
########################################
assert casing.shape == (CASING_DIM,)
return casing
print("case(hello) =", get_casing('hello'))
print("case(Hello) =", get_casing('Hello'))
print("case(HELLO) =", get_casing('HELLO'))
print("case(hEllO) =", get_casing('hEllO'))
Describe two other features that would help the window-based model to perform better (apart from word casing).
$\color{red}{\text{Write your answer here}}$
CoNLL-2003 [1] is a classic NER dataset; it has five tags per word: [PER, ORG, LOC, MISC, O], where the label O is used for words that are not part of any named entity. We use this dataset to train our window-based model. Note that our split is different from the original one.
# First read the dataset
train, valid, vocab = read_dataset(VOCAB_SIZE)
print("# Dataset sample")
print("valid[0] = ", end='')
pprint((' '.join(valid[0][0]), ' '.join(valid[0][1])))
# Convert to window-based data points
wtrain = convert_to_window_based(train, n=NUM_NEIGHBORS)
wvalid = convert_to_window_based(valid, n=NUM_NEIGHBORS)
print("\n# Window based dataset sample")
print("wvalid[:7] = ")
pprint(wvalid[:len(valid[0][1])])
# Create a dictionary to lookup word ids
tok2id = {w:i for i, w in enumerate(vocab)}
# Process windowed dataset
(w_train, c_train), y_train = preprocess(wtrain, tok2id, get_casing)
(w_valid, c_valid), y_valid = preprocess(wvalid, tok2id, get_casing)
print("\n# Pre precessed dataset stats")
print("w_train.shape, c_train.shape, y_train.shape =", w_train.shape, c_train.shape, y_train.shape)
print("\n# Pre precessed sample")
print("w_valid[0] =", w_valid[0])
print("c_valid[0] =", c_valid[0])
print("y_valid[0] =", y_valid[0])
Download GloVe word vectors and construct the pre-trained embedding matrix.
! wget "http://nlp.stanford.edu/data/glove.6B.zip" -O glove.6B.zip && unzip glove.6B.zip
word2vec = {}
with open('glove.6B.300d.txt') as f:
for line in tqdm(f, total=400000):
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
word2vec[word] = coefs
print('Found %s word vectors.' % len(word2vec))
# It is a good practice to initialize out-of-vocabulary tokens
# with the embeddings' mean
mean_embed = np.mean(np.array(list(word2vec.values())), axis=0)
# Create the embedding matrix according to our vocabulary
embedding_matrix = np.zeros((len(tok2id), EMBEDDING_DIM))
for word, i in tok2id.items():
embedding_matrix[i] = word2vec.get(word, mean_embed)
print("embedding_matrix.shape =", embedding_matrix.shape)
Let's build the model. We recommend the Keras functional API. The number of layers as well as their dimensions is totally up to you.
from keras.layers import Input  # Input is used below but was not imported earlier
def get_window_based_ner_model():
window = Input(shape=(WINDOW_SIZE,), dtype='int64', name='window')
casing = Input(shape=(WINDOW_SIZE * CASING_DIM,), dtype='float32', name='casing')
########################################
# Put your implementation here #
########################################
output = None
model = Model([window, casing], output)
return model
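In case you want a reference point, here is a minimal sketch of one possible architecture (frozen pre-trained embeddings, a single hidden layer); all layer sizes are arbitrary choices and this is not the required solution.
from keras.layers import Input, Embedding, Flatten, Concatenate, Dense

def window_ner_sketch():
    window = Input(shape=(WINDOW_SIZE,), dtype='int64', name='window')
    casing = Input(shape=(WINDOW_SIZE * CASING_DIM,), dtype='float32', name='casing')
    # Look up frozen pre-trained vectors: (?, WINDOW_SIZE, EMBEDDING_DIM)
    emb = Embedding(len(tok2id), EMBEDDING_DIM,
                    weights=[embedding_matrix], trainable=False)(window)
    emb = Flatten()(emb)
    # Concatenate the word features with the casing features
    h = Concatenate()([emb, casing])
    h = Dense(128, activation='relu')(h)
    output = Dense(NUM_CLASSES, activation='softmax')(h)
    return Model([window, casing], output)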
# Let's create and visualize the NER model
ner_model = get_window_based_ner_model()
ner_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',metrics=['acc'])
ner_model.summary()
SVG(model_to_dot(ner_model,show_shapes=True).create(prog='dot', format='svg'))
# Train the model and visualize the training at the end
ner_model_hist = ner_model.fit(
[w_train, c_train], y_train,
epochs=10,
batch_size=BATCH_SIZE,
validation_data=([w_valid, c_valid], y_valid)
)
visualize_loss_and_acc(ner_model_hist)
# Don't forget to run this cell.
# This is a deliverable item of your assignment.
ner_model.save(str(ASSIGNMENT_PATH / 'window_based_ner.h5'))
Now, it's time to analyze the model's behavior. Here is an interactive shell that will enable us to explore the model's limitations and capabilities. Note that sentences should be entered with spaces between tokens; e.g., use "do n't" instead of "don't".
#@title Interactive Shell
import sys
from ner_util import print_sentence  # used below; assumed to be provided by ner_util
input_sentence = "I bought 300 shares of Apple Corp in the last year ."#@param {type:"string"}
tokens = input_sentence.strip().split(" ")
input_example = [(tokens, ["O"] * len(tokens))]
winput = convert_to_window_based(input_example)
(w_pred, c_pred), _ = preprocess(winput, tok2id, get_casing)
predictions = ner_model.predict([w_pred, c_pred])
predictions = [LBLS[np.argmax(l)] for l in predictions]
print_sentence(sys.stdout, tokens, None, predictions)
To further understand and analyze mistakes made by the model, let's see the confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix
y_pred = ner_model.predict([w_valid, c_valid])
y_pred_id = np.argmax(y_pred, axis=1)
y_valid_id = np.argmax(y_valid, axis=1)
print("\n# Classification Report")
print(classification_report(y_valid_id, y_pred_id, target_names=LBLS))
print("# Confusion Matrix")
cm = confusion_matrix(y_valid_id, y_pred_id)
plot_confusion_matrix(cm, LBLS, normalize=False)
Describe the window-based network modeling limitations by exploring its outputs. You need to support your conclusion by showing us the errors your model makes. You can either use validation set samples or a manually entered sentence to force the model to make an error. Remember to copy and paste input/output from the interactive shell here.
$\color{red}{\text{Write your answer here}}$
We have shown that arithmetic relations are present in the embedding space, for example $e_{queen} = e_{king} + e_{woman} - e_{man}$. But are they strong enough for building a rich representation of a sentence? Can we classify a sentence according to the mean of its words' embeddings? In this section, we will find the answers to the above questions.
Assume sentence $X = [\mathbf{x^{(1)}}, ..., \mathbf{x^{(N)}}]$ is given, then a sentence representation $\mathbf{R}$ can be calculated as following:
$$ \mathbf{R} = \frac{1}{N} \sum_{i=1}^{N} e_{x^{(i)}}, \qquad \mathbf{R} \in \mathbb{R}^d $$
where $e_{x^{(i)}}$ is the embedding vector for the token $x^{(i)}$.
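As a quick toy example of this averaging (with made-up numbers, independent of the helper functions used below):
# Average-of-embeddings sentence representation R on a toy 5-word vocabulary.
import numpy as np
toy_embeddings = np.random.rand(5, 3)          # vocabulary of 5 words, d = 3
sentence_ids = np.array([1, 3, 4])             # token ids of a 3-token sentence
R = toy_embeddings[sentence_ids].mean(axis=0)  # shape (3,)
print("R =", R)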
Having such a simple model will enable us to analyze and understand its capabilities more easily. In addition, we will try one of the state-of-the-art text processing tools, Flair, which can run on the GPU. The task is text classification on the AG News corpus, which consists of news articles from more than 2,000 news sources. Our split has 110K samples for training and 10K for validation. Dataset examples are labeled with 4 major labels: {World, Sports, Business, Sci/Tech}.
Often, datasets in NLP come with unprocessed sentences. As a deep learning expert, you should be familiar with popular text processing tools such as NLTK, spaCy, Stanford CoreNLP, and Flair. Generally, text pre-processing in deep learning includes tokenization, vocabulary creation, and padding. But here we want to do one more step: NER replacement. Basically, we want to replace named entities with their corresponding tags. For example, "George Washington went to New York" will be converted to "<PERSON> went to <GPE>".
The purpose of this step is to reduce the vocabulary size and support more words. This strategy proves most beneficial when the dataset contains a large number of named entities, e.g., a news dataset.
Most of the pre-processing is implemented for you. You only need to fill in the following function. Be sure to read the Flair documentation first.
def get_tagged_string(sentence):
"""
Join tokens and replace named enitites
Args:
sentence(flair.data.Sentence): An input sentence, containing list of tokens and their NER tag
Returns:
output(str): A String of sentence tokens separated by spaces and
each named enitity is replaced by its Tag
Hint: Check out flair tutorials, https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md
sentence.get_spans('ner'), sentence.tokens, token.idx and entity.tag might be helpful.
"""
output = ""
########################################
# Put your implementation here #
########################################
return output
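In case it helps, here is one possible sketch (assuming the flair API mentioned in the hint: sentence.get_spans('ner'), span.tokens, token.idx, token.text and entity.tag); it is not necessarily the only correct solution.
def get_tagged_string_sketch(sentence):
    # Map every token index that is covered by a named entity to that entity
    idx2ent = {}
    for entity in sentence.get_spans('ner'):
        for token in entity.tokens:
            idx2ent[token.idx] = entity

    pieces, emitted = [], set()
    for token in sentence.tokens:
        ent = idx2ent.get(token.idx)
        if ent is None:
            pieces.append(token.text)            # ordinary token: keep as is
        elif id(ent) not in emitted:
            pieces.append('<%s>' % ent.tag)      # first token of an entity: emit its tag once
            emitted.add(id(ent))
        # remaining tokens of an already-emitted entity are dropped
    return ' '.join(pieces)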
Test your implementation:
tagger = SequenceTagger.load('ner-ontonotes')
s = Sentence('Chad asks the IMF for a loan to pay for looking after more than 100,000 refugees from conflict-torn Darfur in western Sudan.', use_tokenizer=True)
tagger.predict(s)
s_ner = get_tagged_string(s)
assert s_ner == '<PERSON> asks the <ORG> for a loan to pay for looking after <CARDINAL> refugees from conflict-torn <GPE> in western <GPE> .'
Define the model's hyperparameters:
VOCAB_SIZE = 10*1000
EMBEDDING_DIM = 300
NUM_CLASSES = 4
BATCH_SIZE = 512
MAX_LEN = 30
Process the entire corpus. It will take approximately 50 minutes, so please be patient. You may want to move on to the next sections in the meantime.
TAGGER_BATCH_SIZE = 512
if 'tagger' not in dir() or tagger is None:
tagger = SequenceTagger.load('ner-ontonotes')
def preprocess_sents(lst):
output = []
for i in tqdm(range(0, len(lst), TAGGER_BATCH_SIZE)):
batch = [Sentence(x, use_tokenizer=True) for x in lst[i:i + TAGGER_BATCH_SIZE]]
tagger.predict(batch, mini_batch_size=TAGGER_BATCH_SIZE, verbose=False)
batch = [get_tagged_string(s).lower() for s in batch]
output += batch
return output
print("# Download and read dataset")
(train_sents, train_lbls), (valid_sents, valid_lbls) = read_ag_news()
print("\n# Replace named entities with their corresponding tags")
# We need to free the gpu memory due to some unknown bug in flair library
del tagger; tagger = SequenceTagger.load('ner-ontonotes')
import torch; torch.cuda.empty_cache()
train_sents_ner = preprocess_sents(train_sents)
torch.cuda.empty_cache()
del tagger
tagger = SequenceTagger.load('ner-ontonotes')
torch.cuda.empty_cache()
valid_sents_ner = preprocess_sents(valid_sents)
torch.cuda.empty_cache()
assert len(train_sents_ner) == len(train_lbls)
assert len(valid_sents_ner) == len(valid_lbls)
del tagger
tagger = SequenceTagger.load('ner-ontonotes')
torch.cuda.empty_cache()
del tagger
print("# Processed dataset sample")
print("train_sents[0] =", train_sents[0])
print("train_sents_ner[0] =", train_sents_ner[0])
Create the embedding matrix
# First create the vocabulary
vocab = create_vocab(train_sents_ner, VOCAB_SIZE)
tok2id = {w:i for i, w in enumerate(vocab)}
# It is a good practice to initialize out-of-vocabulary tokens
# with the embedding matrix mean
mean_embed = np.mean(np.array(list(word2vec.values())), axis=0)
# Create the embedding matrix according to the vocabulary
embedding_matrix = np.zeros((len(tok2id), EMBEDDING_DIM))
for word, i in tok2id.items():
embedding_matrix[i] = word2vec.get(word, mean_embed)
# Fill index 0 with zero values: padding word vector
embedding_matrix[0] = np.zeros(shape=(EMBEDDING_DIM, ))
# Prepare the model input
x_train, y_train = create_model_input(train_sents_ner, tok2id, MAX_LEN), to_categorical(train_lbls, NUM_CLASSES)
x_valid, y_valid = create_model_input(valid_sents_ner, tok2id, MAX_LEN), to_categorical(valid_lbls, NUM_CLASSES)
Let's build the model. The skeleton below uses the Keras model subclassing API; the number of layers as well as their dimensionality is totally up to you.
import keras
from keras import backend as K
class BowModel(keras.Model):
def __init__(self):
super(BowModel, self).__init__(name='bow')
########################################
# Put your implementation here #
########################################
def call(self, words):
"""
Args:
words(Tensor): An input tensor for word ids with shape (?, MAX_LEN)
"""
########################################
# Put your implementation here #
########################################
return output
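A minimal sketch of one possible body (frozen pre-trained embeddings, padding masked out before averaging, then a softmax classifier); treat it as an assumption about one reasonable design, not the required answer.
class BowModelSketch(keras.Model):
    def __init__(self):
        super(BowModelSketch, self).__init__(name='bow_sketch')
        self.embed = keras.layers.Embedding(len(tok2id), EMBEDDING_DIM,
                                            weights=[embedding_matrix],
                                            trainable=False)
        self.classifier = keras.layers.Dense(NUM_CLASSES, activation='softmax')

    def call(self, words):
        e = self.embed(words)                              # (?, MAX_LEN, EMBEDDING_DIM)
        mask = K.cast(K.not_equal(words, 0), 'float32')    # 1 for real tokens, 0 for padding
        summed = K.sum(e * K.expand_dims(mask, -1), axis=1)
        count = K.maximum(K.sum(mask, axis=1, keepdims=True), 1.0)
        return self.classifier(summed / count)             # mean embedding -> class probabilities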
# Let's create and compile the BoW model
bow_model = BowModel()
bow_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',metrics=['acc'])
# Train and visualize training
bow_model_hist = bow_model.fit(
x_train, y_train,
batch_size=BATCH_SIZE, epochs=10,
validation_data=(x_valid, y_valid)
)
visualize_loss_and_acc(bow_model_hist)
bow_model.summary()
# Don't forget to run this cell.
# This is a deliverable item of your assignment.
bow_model.save_weights(str(ASSIGNMENT_PATH / 'bow_model.h5'))
As in the previous section, an interactive shell is provided; you can enter an input sentence to get the predicted label. The preprocessing functions handle the tokenization, so don't worry about spacing.
#@title Interactive Shell
if 'tagger' not in dir() or tagger is None:
tagger = SequenceTagger.load('ner-ontonotes')
input_text = "Chad asks the IMF for a loan to pay for looking after more than 100,000 refugees from conflict-torn Darfur in western Sudan."#@param {type:"string"}
input_sents_ner = preprocess_sents([input_text])
input_tensor = create_model_input(input_sents_ner, tok2id, MAX_LEN)
pred_label = bow_model.predict(input_tensor)
print("\n-----\n\n x: ", input_text)
print("x_ner: ", input_sents_ner[0])
print("\n y': ", AG_NEWS_LBLS[np.argmax(pred_label[0])])
It is always helpful to see the confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix
yp_valid = bow_model.predict(x_valid)
yp_valid_ids = np.argmax(yp_valid, axis=1)
y_valid_ids = np.argmax(y_valid, axis=1)
print("\n# Classification Report")
print(classification_report(y_valid_ids, yp_valid_ids, target_names=AG_NEWS_LBLS))
print("# Confusion Matrix")
cm = confusion_matrix(y_valid_ids, yp_valid_ids)
plot_confusion_matrix(cm, AG_NEWS_LBLS, normalize=False)
Obviously, this is a relatively simple model, and hence it has limited modeling capabilities. Now it's time to find its mistakes. Can you fool the model by feeding it a tricky example? Can you see the bag-of-words effect in its behavior? Write down the model's limitations and your answers to the above questions, and keep in mind that you need to support each of your claims with an input/output example.
$\color{red}{\text{Write your answer here}}$
Up to now, we've investigated window-based neural networks and the bag-of-words model. Given their simple architectures, the representation power of these models relies mainly on the pre-trained embeddings. For example, a window-based model cannot see the previous token's label, which makes it struggle to identify multi-word entities. Likewise, adding a single word such as "not" can entirely change the meaning of a sentence, yet the BoW model is not sensitive to this, as it ignores word order and computes an average embedding (in which a single word does not play a big role).
In contrast, RNNs read sentences word by word. At each step, the softmax classifier is forced to predict the label using not only the input word but also its context information. If we view this context information as the RNN's working memory, it is interesting to find out what kind of information is stored there while the network parses a sentence.
To visualize an RNN's memory, we will train a language model on a huge chunk of text and use the validation set to analyze its "brain". Then we will watch each context neuron's activation to see whether it shows a meaningful pattern while the model goes through a sentence. The following figure illustrates a random neuron in the memory that captures the concept of line length: it gradually turns off as the line approaches its end. Our model probably uses this neuron to decide when to generate "\n".
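A quick way to see the averaging effect is the toy check below; it assumes the word2vec dictionary of GloVe vectors loaded earlier is still in memory.
# Two sentences that differ only in "not" end up with nearly identical
# average embeddings, so a BoW classifier can hardly tell them apart.
s1 = "the movie was good".split()
s2 = "the movie was not good".split()
r1 = np.mean([word2vec[w] for w in s1 if w in word2vec], axis=0)
r2 = np.mean([word2vec[w] for w in s2 if w in word2vec], axis=0)
cos = np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))
print("cosine similarity:", cos)   # expected to be very close to 1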
Here is another neuron which activates when the model is inside a quote.
Here, our goal is to find other meaningful patterns in the RNN hidden states. There is an open-source library called LSTMVis which provides pre-trained models and a great visualization tool. First, watch its tutorial and then answer the following questions:
For each model, find at least two meaningful patterns, and support your hypothesis with screenshots of LSTMVis.
$\color{red}{\text{Write your answer here}}$
$\color{red}{\text{Write your answer here}}$
3- Can you spot the difference between a character-based and a word-based language model?
$\color{red}{\text{Write your answer here}}$
Congratulations! You have finished the assignment and you're ready to submit your work. Please follow the instructions below:
Run the cell below to create your submission file (dl_asg04__xx__xx.zip) and submit it via https://forms.gle/xTkMG2XhDgxTPwUf8. Note: We need your GitHub token to create a new repository (if it doesn't already exist) in order to store your learned model data. The Google Drive token also enables us to download the current notebook and create the submission. If you are interested, feel free to check our code.
#@title
! pip install -U --quiet PyDrive > /dev/null
! wget -q https://github.com/github/hub/releases/download/v2.10.0/hub-linux-amd64-2.10.0.tgz
import os
import time
import yaml
import json
from google.colab import files
from IPython.display import Javascript
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
asg_name = 'assignment_04'
script_save = '''
require(["base/js/namespace"],function(Jupyter) {
Jupyter.notebook.save_checkpoint();
});
'''
repo_name = 'iust-deep-learning-assignments'
submission_file_name = 'dl_asg04__%s__%s.zip'%(student_id, student_name.lower().replace(' ', '_'))
course_url = 'https://iust-deep-learning.github.io/972/'
! tar xf hub-linux-amd64-2.10.0.tgz
! cd hub-linux-amd64-2.10.0/ && chmod a+x install && ./install
! hub config --global hub.protocol https
! hub config --global user.email "$Your_Github_account_Email"
! hub config --global user.name "$student_name"
! hub api -X GET /user
! hub api -X GET /user > user_info.json
! hub api -F affiliation=owner -X GET /user/repos > repos.json
user_info = json.load(open('user_info.json'))
repos = json.load(open('repos.json'))
repo_names = [r['name'] for r in repos]
has_repository = repo_name in repo_names
if not has_repository:
get_ipython().system_raw('! hub api -X POST -F name=%s /user/repos homepage="%s" > repo_info.json' % (repo_name, course_url))
repo_info = json.load(open('repo_info.json'))
repo_url = repo_info['clone_url']
else:
username = user_info['login']
! hub api -F homepage="$course_url" -X PATCH /repos/$username/$repo_name
for r in repos:
if r['name'] == repo_name:
repo_url = r['clone_url']
stream = open("/root/.config/hub", "r")
token = list(yaml.load_all(stream))[0]['github.com'][0]['oauth_token']
repo_url_with_token = 'https://'+token+"@" +repo_url.split('https://')[1]
! git clone "$repo_url_with_token"
! cp -r "$ASSIGNMENT_PATH" "$repo_name"/
! cd "$repo_name" && git add -A
! cd "$repo_name" && git commit -m "Add assignment 02 results"
! cd "$repo_name" && git push -u origin master
sub_info = {
'student_id': student_id,
'student_name': student_name,
'repo_url': repo_url,
'asg_dir_contents': os.listdir(str(ASSIGNMENT_PATH)),
'datetime': str(time.time()),
'asg_name': asg_name
}
json.dump(sub_info, open('info.json', 'w'))
Javascript(script_save)
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
file_id = drive.ListFile({'q':"title='%s.ipynb'"%asg_name}).GetList()[0]['id']
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('%s.ipynb'%asg_name)
! jupyter nbconvert --to script "$asg_name".ipynb > /dev/null
! jupyter nbconvert --to html "$asg_name".ipynb > /dev/null
! zip "$submission_file_name" "$asg_name".ipynb "$asg_name".html "$asg_name".txt info.json > /dev/null
print("##########################################")
print("Done! Submisson created, Please download using the bellow cell!")
#@title
files.download(submission_file_name)