Problem Set 8 Review & Transfer Learning with word2vec

Import various modules that we need for this notebook (now using Keras 1.0.0)

In [1]:
%pylab inline

import copy

import numpy as np
import pandas as pd
import sys
import os
import re

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, RMSprop
from keras.layers.normalization import BatchNormalization
from keras.layers.wrappers import TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN, LSTM, GRU

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from gensim.models import word2vec
Populating the interactive namespace from numpy and matplotlib
Using TensorFlow backend.

I. Problem Set 8, Part 1

Let's work through a solution to the first part of problem set 8, where you applied various techniques to the STL-10 dataset.

In [2]:
dir_in = "../../../class_data/stl10/"
X_train = np.genfromtxt(dir_in + 'X_train_new.csv', delimiter=',')
Y_train = np.genfromtxt(dir_in + 'Y_train.csv', delimiter=',')
X_test = np.genfromtxt(dir_in + 'X_test_new.csv', delimiter=',')
Y_test = np.genfromtxt(dir_in + 'Y_test.csv', delimiter=',')

And construct a flattened, integer-coded version of the response, which the linear models below expect:

In [3]:
Y_train_flat = np.zeros(Y_train.shape[0])
Y_test_flat  = np.zeros(Y_test.shape[0])
for i in range(10):
    Y_train_flat[Y_train[:,i] == 1] = i
    Y_test_flat[Y_test[:,i] == 1]   = i
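
As an aside, the same flattening can be written in one step with numpy's argmax, since each row of the one-hot matrix contains a single 1; a minimal equivalent sketch:

# the column position of the 1 in each one-hot row is the class label
Y_train_flat = np.argmax(Y_train, axis=1)
Y_test_flat  = np.argmax(Y_test, axis=1)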

(1) neural network

We now build and evaluate a neural network.

In [4]:
model = Sequential()

model.add(Dense(1024, input_shape = (X_train.shape[1],)))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(10))
model.add(Activation('softmax'))

rms = RMSprop()
model.compile(loss='categorical_crossentropy', optimizer=rms,
              metrics=['accuracy'])
In [5]:
model.fit(X_train, Y_train, batch_size=32, nb_epoch=5, verbose=1)
Epoch 1/5
5000/5000 [==============================] - 29s - loss: 0.6513 - acc: 0.8248    
Epoch 2/5
5000/5000 [==============================] - 29s - loss: 0.3446 - acc: 0.9052    
Epoch 3/5
5000/5000 [==============================] - 29s - loss: 0.2453 - acc: 0.9326    
Epoch 4/5
5000/5000 [==============================] - 29s - loss: 0.1787 - acc: 0.9470    
Epoch 5/5
5000/5000 [==============================] - 29s - loss: 0.1196 - acc: 0.9600    
Out[5]:
<keras.callbacks.History at 0x1f4030eb8>
In [6]:
test_rate = model.evaluate(X_test, Y_test)[1]
print("Test classification rate %0.05f" % test_rate)
8000/8000 [==============================] - 5s     
Test classification rate 0.90900
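
To see where the remaining errors fall, we can cross-tabulate predicted against true classes. A quick sketch, assuming the Keras 1.0 predict_classes method on Sequential models and the flattened labels from above:

# predicted class labels on the test set
pred = model.predict_classes(X_test, verbose=0)

# rows are predicted classes, columns are true classes
print(pd.crosstab(pred, Y_test_flat))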

(2) support vector machine

And now, a basic linear support vector machine.

In [7]:
svc_obj = SVC(kernel='linear', C=1)
svc_obj.fit(X_train, Y_train_flat)

pred = svc_obj.predict(X_test)
pd.crosstab(pred, Y_test_flat)
c_rate = sum(pred == Y_test_flat) / len(pred)
print("Test classification rate %0.05f" % c_rate)
Test classification rate 0.94088
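
Fitting SVC with a linear kernel on this many samples is fairly slow because it runs through libsvm; sklearn's LinearSVC (liblinear) is usually much faster for the same kind of model. A hedged alternative sketch; note that its defaults (one-vs-rest, squared hinge loss) differ slightly, so the number won't match SVC exactly:

from sklearn.svm import LinearSVC

# liblinear-based linear SVM; typically much faster than SVC(kernel='linear')
# when there are thousands of samples and high-dimensional features
lsvc = LinearSVC(C=1)
lsvc.fit(X_train, Y_train_flat)

pred = lsvc.predict(X_test)
print("Test classification rate %0.05f" % np.mean(pred == Y_test_flat))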

(3) penalized logistic model

And finally, an L1-penalized logistic regression:

In [8]:
lr = LogisticRegression(penalty = 'l1')
lr.fit(X_train, Y_train_flat)

pred = lr.predict(X_test)
pd.crosstab(pred, Y_test_flat)
c_rate = sum(pred == Y_test_flat) / len(pred)
print("Test classification rate %0.05f" % c_rate)
Test classification rate 0.93712
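
The inverse regularization strength C is left at sklearn's default of 1 above; if we wanted to tune it, a small cross-validated grid search would look roughly like this (a sketch; the grid values are arbitrary and this will be slow on the full training set):

from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer releases

# search a few orders of magnitude of C for the L1-penalized model
grid = GridSearchCV(LogisticRegression(penalty='l1'),
                    param_grid={'C': [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X_train, Y_train_flat)
print(grid.best_params_, grid.best_score_)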

II. Problem Set 8, Part 2

Now, let's read in the Chicago crime dataset and see how well we can get a neural network to perform on it.

In [9]:
dir_in = "../../../class_data/chi_python/"
X_train = np.genfromtxt(dir_in + 'chiCrimeMat_X_train.csv', delimiter=',')
Y_train = np.genfromtxt(dir_in + 'chiCrimeMat_Y_train.csv', delimiter=',')
X_test = np.genfromtxt(dir_in + 'chiCrimeMat_X_test.csv', delimiter=',')
Y_test = np.genfromtxt(dir_in + 'chiCrimeMat_Y_test.csv', delimiter=',')

Now, build a neural network for this task:

In [10]:
model = Sequential()

model.add(Dense(1024, input_shape = (434,)))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Dense(1024))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Dense(5))
model.add(Activation('softmax'))

rms = RMSprop()
model.compile(loss='categorical_crossentropy', optimizer=rms,
              metrics=['accuracy'])
In [11]:
# downsample, if need be:
num_sample = X_train.shape[0]

model.fit(X_train[:num_sample], Y_train[:num_sample], batch_size=32,
          nb_epoch=10, verbose=1)
Epoch 1/10
337619/337619 [==============================] - 972s - loss: 0.7771 - acc: 0.7334   
Epoch 2/10
337619/337619 [==============================] - 984s - loss: 0.7164 - acc: 0.7506   
Epoch 3/10
337619/337619 [==============================] - 961s - loss: 0.7043 - acc: 0.7547   
Epoch 4/10
337619/337619 [==============================] - 953s - loss: 0.6943 - acc: 0.7577   
Epoch 5/10
337619/337619 [==============================] - 953s - loss: 0.6880 - acc: 0.7589   
Epoch 6/10
337619/337619 [==============================] - 956s - loss: 0.6825 - acc: 0.7603   
Epoch 7/10
337619/337619 [==============================] - 959s - loss: 0.6785 - acc: 0.7611   
Epoch 8/10
337619/337619 [==============================] - 958s - loss: 0.6745 - acc: 0.7620   
Epoch 9/10
337619/337619 [==============================] - 958s - loss: 0.6722 - acc: 0.7631   
Epoch 10/10
337619/337619 [==============================] - 961s - loss: 0.6699 - acc: 0.7631   
Out[11]:
<keras.callbacks.History at 0x123f31400>
In [12]:
test_rate = model.evaluate(X_test, Y_test)[1]
print("Test classification rate %0.05f" % test_rate)
174320/174320 [==============================] - 85s    
Test classification rate 0.76034
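
The num_sample variable in the fit cell above was left at the full training size; if training time becomes an issue, drawing a random subset is safer than taking the first rows, since the CSV may be ordered. A sketch (the 50,000 figure is an arbitrary choice):

# shuffle once, then keep a random subset of the training rows
num_sample = 50000
idx = np.random.permutation(X_train.shape[0])[:num_sample]

model.fit(X_train[idx], Y_train[idx], batch_size=32, nb_epoch=10, verbose=1)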

III. Transfer Learning: IMDB Sentiment Analysis

Now, let's use pretrained word2vec embeddings on the IMDB sentiment analysis corpus. Because the embeddings were learned on a much larger external corpus, we can afford a significantly larger vocabulary of words than before. I'll start by reading the IMDB corpus in again from the raw text.

In [13]:
path = "../../../class_data/aclImdb/"

ff = [path + "train/pos/" + x for x in os.listdir(path + "train/pos")] + \
     [path + "train/neg/" + x for x in os.listdir(path + "train/neg")] + \
     [path + "test/pos/" + x for x in os.listdir(path + "test/pos")] + \
     [path + "test/neg/" + x for x in os.listdir(path + "test/neg")]

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
    
input_label = ([1] * 12500 + [0] * 12500) * 2
input_text  = []

for f in ff:
    with open(f) as fin:
        input_text += [remove_tags(" ".join(fin.readlines()))]
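
A quick sanity check that all 50,000 reviews were read and that the labels line up with the directory order (12,500 positive then 12,500 negative, for train and then test):

# expect 50000 texts and 50000 labels, with 12500 positives in each half
print(len(input_text), len(input_label))
print(sum(input_label[:25000]), sum(input_label[25000:]))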

I'll fit a significantly larger vocabulary this time, as the embeddings are essentially given to us.

In [14]:
num_words = 5000
max_len = 400
tok = Tokenizer(num_words)
tok.fit_on_texts(input_text[:25000])
In [15]:
X_train = tok.texts_to_sequences(input_text[:25000])
X_test  = tok.texts_to_sequences(input_text[25000:])
y_train = input_label[:25000]
y_test  = input_label[25000:]

X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_test  = sequence.pad_sequences(X_test,  maxlen=max_len)
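
Because the Tokenizer was built with num_words set to 5000, texts_to_sequences drops words outside the most frequent num_words, and pad_sequences truncates or zero-pads each review to 400 terms. A quick shape check (a sketch):

# both matrices should be reviews x max_len, with indices in [0, num_words)
print(X_train.shape, X_test.shape)
print(X_train.max(), X_test.max())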
In [16]:
# invert the word index to recover the num_words most frequent tokens, in index
# order (tokenizer index 1 is the most frequent word)
index_to_word = {value: key for key, value in tok.word_index.items()}
words = [index_to_word[i] for i in range(1, num_words + 1)]
In [17]:
loc = "/Users/taylor/files/word2vec_python/GoogleNews-vectors-negative300.bin"
w2v = word2vec.Word2Vec.load_word2vec_format(loc, binary=True)
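
Before using the vectors it's worth a quick qualitative check that they loaded correctly; gensim's most_similar is handy for this (the exact neighbours will depend on the GoogleNews vectors):

# each vector is 300-dimensional; nearest neighbours should look sensible
print(w2v['movie'].shape)
print(w2v.most_similar('movie', topn=5))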
In [18]:
# initial embedding matrix: row idx holds the word2vec vector for words[idx];
# words missing from the GoogleNews vocabulary keep a zero vector
weights = np.zeros((num_words, 300))
for idx, w in enumerate(words):
    try:
        weights[idx, :] = w2v[w]
    except KeyError:
        pass
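
Not every word in our vocabulary appears in the GoogleNews model; those rows stay at zero. A quick check of how much of the 5,000-word vocabulary actually received a pretrained vector:

# count rows that were filled in with a (non-zero) word2vec vector
found = np.sum(np.any(weights != 0, axis=1))
print("%d of %d vocabulary words found in word2vec" % (found, num_words))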
In [19]:
model = Sequential()

model.add(Embedding(num_words, 300, input_length=max_len))
model.add(Dropout(0.5))

model.add(GRU(16,activation='relu'))

model.add(Dense(128))
model.add(Dropout(0.5))
model.add(Activation('relu'))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.layers[0].set_weights([weights])
model.layers[0].trainable = False

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
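
Setting the pretrained weights and freezing the layer after construction works; the same intent can also be expressed when the layer is created, which I believe Keras 1.0 supports through the weights and trainable arguments (untested here, so treat it as a sketch):

# alternative: initialize from word2vec and keep the embedding fixed, in one step
model.add(Embedding(num_words, 300, input_length=max_len,
                    weights=[weights], trainable=False))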
In [22]:
model.fit(X_train, y_train, batch_size=32, nb_epoch=5, verbose=1,
          validation_data=(X_test, y_test))
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 1349s - loss: 0.1542 - acc: 0.9439 - val_loss: 0.2739 - val_acc: 0.9000
Epoch 2/5
25000/25000 [==============================] - 1347s - loss: 0.1450 - acc: 0.9486 - val_loss: 0.3019 - val_acc: 0.8990
Epoch 3/5
25000/25000 [==============================] - 1355s - loss: 0.1371 - acc: 0.9499 - val_loss: 0.2853 - val_acc: 0.8960
Epoch 4/5
25000/25000 [==============================] - 1359s - loss: 0.1303 - acc: 0.9527 - val_loss: 0.3348 - val_acc: 0.8966
Epoch 5/5
25000/25000 [==============================] - 1383s - loss: 0.1214 - acc: 0.9565 - val_loss: 0.3109 - val_acc: 0.8980
Out[22]:
<keras.callbacks.History at 0x17f5a3f60>
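
To score new text with the trained model, the same preprocessing pipeline has to be applied: tokenize with the fitted Tokenizer, pad to max_len, and read the sigmoid output as the probability of a positive review. A minimal sketch with a made-up review:

# a hypothetical new review, run through the same tokenizer and padding
new_text = ["a surprisingly touching film with wonderful performances"]
new_seq  = sequence.pad_sequences(tok.texts_to_sequences(new_text), maxlen=max_len)
print(model.predict(new_seq))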
In [ ]: