<h1>Keras + Universal Sentence Encoder = Transfer Learning for text data</h1>
<p><img alt="tf-hub-meets-keras" height="441" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/tf-hub-meets-keras.png" width="688"/></p>
<p>We are going to build a Keras model that leverages the pre-trained "Universal Sentence Encoder" to classify a given question into one of six categories.</p>
<p>TensorFlow Hub modules can be applied to a variety of transfer learning tasks and datasets, whether images or text. The "Universal Sentence Encoder" is one of many newly published TensorFlow Hub reusable modules: a self-contained piece of TensorFlow graph with pre-trained weights included.</p>
<p><em>A runnable <a href="https://colab.research.google.com/drive/1Odry08Jm0f_YALhAt4vp9qa5w8prUzDY">Colab notebook</a> is available, so you can experiment with the code as you read on.</em></p>
<h2>What the Universal Sentence Encoder is and how it was trained</h2>
<p>While you can treat all TensorFlow Hub modules as black boxes, agnostic of what happens inside, and still build a functional transfer learning model, it helps to develop a deeper understanding of what each module is capable of, what its constraints are, and how good the transfer learning result could potentially be.</p>
<h3>Universal Sentence Encoder vs. word embeddings</h3>
<p>If you recall the GloVe word embeddings from our <a href="https://www.dlology.com/blog/simple-stock-sentiment-analysis-with-news-data-in-keras/">previous tutorial</a>, which turn a word into a 50-dimensional vector, the Universal Sentence Encoder is much more powerful: it can embed not only words but phrases and sentences. That is, it takes variable-length English text as input and outputs a 512-dimensional vector. Handling variable-length input sounds great, but the catch is that the longer a sentence gets, counted in words, the more diluted the embedding can become. And since the model was trained at the word level, it will likely find typos and rare words challenging to process. For more on the difference between word-level and character-level language models, you can read my <a href="https://www.dlology.com/blog/how-to-train-a-keras-model-to-generate-colors/">previous tutorial</a>.</p>
<p>There are two Universal Sentence Encoders to choose from, with different encoder architectures that target distinct design goals: one, based on the Transformer architecture, targets high accuracy at the cost of greater model complexity and resource consumption; the other, based on a deep averaging network (DAN), targets efficient inference at the cost of slightly reduced accuracy.</p>
<p>Below is a side-by-side comparison of the Transformer and DAN sentence encoder architectures.</p>
<p><img alt="dan-and-transformer" height="795" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/dan-and-transformer.png" width="665"/></p>
<p>The original <a href="https://arxiv.org/pdf/1706.03762.pdf">Transformer</a> model consists of an encoder and a decoder, but here we only use its encoder part.</p>
<p>The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. The authors also employ a residual connection around each of the two sub-layers, followed by layer normalization. Since the model contains no recurrence and no convolution, for it to make use of the order of the sequence, it must inject some information about the relative or absolute position of the tokens in the sequence; that is what the "positional encodings" do. The Transformer-based encoder achieves the best overall transfer task performance, but this comes at the cost of compute time and memory usage that scale dramatically with sentence length.</p>
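<p>For concreteness, the sinusoidal positional encodings described in the Transformer paper can be sketched in a few lines of NumPy. This is my own illustration of the published formula, not code from this tutorial:</p>
<div class="highlight">
<pre>import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal encodings from "Attention Is All You Need":
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle).
    pos = np.arange(max_len)[:, np.newaxis]   # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]     # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])      # even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])      # odd dimensions
    return pe
</pre>
</div>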
<p>The deep averaging network (DAN) is much simpler: input embeddings for words and bi-grams are first averaged together and then passed through a feed-forward deep neural network (DNN) to produce sentence embeddings. The primary advantage of the DAN encoder is that its compute time is linear in the length of the input sequence.</p>
<p>The type of training data and the chosen training metric can have a significant impact on the transfer learning result.</p>
<p>Both models were trained on the Stanford Natural Language Inference (SNLI) corpus. The <a href="https://nlp.stanford.edu/pubs/snli_paper.pdf">SNLI corpus</a> is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). Essentially, the models were trained to learn the semantic similarity between sentence pairs.</p>
<p><span class="fontstyle0">With that in mind, </span><span>the sentence embeddings can be trivially used to compute sentence-level semantic similarity scores.</span></p>
<p><img alt="semantic-similarity" height="571" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/semantic-similarity.png" width="678"/></p>
<p>The source code to generate the similarity heat map is available both in my Colab notebook and in the GitHub repo. Each cell is colored based on the inner product of the encodings for the corresponding pair of sentences: the more similar two sentences are, the darker the color.</p>
<p>Loading the Universal Sentence Encoder and computing the embeddings for some text can be as easy as this:</p>
<div class="highlight">
<pre>import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# Compute a representation for each message, showing various lengths supported.
messages = ["That band rocks!", "That song is really cool."]

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages))
</pre>
</div>
<p>Loading the module for the first time can take a while since it downloads the weight files.</p>
<p>The value of <code>message_embeddings</code> is two arrays corresponding to the two sentences' embeddings, each an array of 512 floating point numbers.</p>
<pre>array([[ 0.06587551,  0.02066354, -0.01454356, ...,  0.06447642,
         0.01654527, -0.04688655],
       [ 0.06909196,  0.01529877,  0.03278331, ...,  0.01220771,
         0.03000253, -0.01277521]], dtype=float32)</pre>
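<p>With those embeddings in hand, the pairwise scores behind the similarity heat map reduce to inner products. A minimal sketch, assuming <code>message_embeddings</code> from the snippet above:</p>
<div class="highlight">
<pre>import numpy as np

# Inner products between every pair of sentence embeddings;
# higher values mean more semantically similar sentences.
similarity = np.inner(message_embeddings, message_embeddings)
print(similarity)  # a 2x2 matrix for our two messages
</pre>
</div>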
<h2><span>Question classification task and data preprocessing</span></h2>
<p>To respond correctly to a question given a large collection of texts, classifying questions into fine-grained classes is crucial for question answering as a retrieval task. Our goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process. For example, consider the question <strong>Q</strong>: <span style="text-decoration: underline;"><em>What Canadian city has the largest population?</em></span> The hope is to classify this question as having answer type <strong>location</strong>, implying that only candidate answers that are locations need consideration.</p>
<p>The dataset we use is the <a href="http://cogcomp.org/Data/QA/QC/">TREC Question Classification dataset</a>. There are 5,452 training and 500 test samples, i.e. 5,452 + 500 questions, each categorized into one of six labels:</p>
<ol>
<li><span><strong>ABBR - 'abbreviation'</strong>: expression abbreviated, etc.</span></li>
<li><span><strong>DESC - 'description and abstract concepts'</strong>: manner of an action, description of sth. etc.</span></li>
<li><span><strong>ENTY - 'entities'</strong>: animals, colors, events, food, etc.</span></li>
<li><span><strong>HUM - 'human beings'</strong>: a group or organization of persons, an individual, etc.</span></li>
<li><span><strong>LOC - 'locations'</strong>: cities, countries, etc.</span></li>
<li><span><strong>NUM - 'numeric values'</strong>: postcodes, dates, speed, temperature, etc.</span></li>
</ol>
<p>We want our model to be a multiclass classifier that takes strings as input and outputs a probability for each of the six class labels. With this in mind, you know how to prepare the training and testing data for it.</p>
<p>The first step is to turn the raw text file into a pandas DataFrame and set the "label" column to be a categorical column, so that we can later access each label as a numeric value.</p>
<div class="highlight">
<pre>import re
import numpy as np
import pandas as pd

def get_dataframe(filename):
    lines = open(filename, 'r').read().splitlines()
    data = []
    for i in range(0, len(lines)):
        label = lines[i].split(' ')[0]
        label = label.split(":")[0]
        text = ' '.join(lines[i].split(' ')[1:])
        text = re.sub('[^A-Za-z0-9 ,\?\'\"-._\+\!/\`@=;:]+', '', text)
        data.append([label, text])
    df = pd.DataFrame(data, columns=['label', 'text'])
    df.label = df.label.astype('category')
    return df

df_train = get_dataframe('train_5500.txt')
df_train.head()
</pre>
</div>
<p>The first 5 training samples look like this.</p>
<p><img alt="df_train_head" height="192" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/df_train_head.png" width="452"/></p>
<p>Next, we prepare the input/output data for the model: the input is a list of question strings, and the output is a list of one-hot encoded labels. If you are not familiar with one-hot encoding yet, I have you covered in my <a href="https://www.dlology.com/blog/how-to-train-a-keras-model-to-generate-colors/">previous post</a>.</p>
<div class="highlight">
<pre>train_text = df_train['text'].tolist()
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = np.asarray(pd.get_dummies(df_train.label), dtype=np.int8)</pre>
</div>
<p>If you take a peek at the value of <code>train_label</code>, you will see it in one-hot encoded form.</p>
<pre>array([[0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       ...], dtype=int8)</pre>
<p>Now we are ready to build the model.</p>
<h2><span>Keras meets Universal Sentence Encoder</span></h2>
<p>We have previously loaded the Universal Sentence Encoder into the variable <code>embed</code>. To have it work nicely with Keras, we wrap it in a Keras Lambda layer and explicitly cast its input to string.</p>
<div class="highlight">
<pre>def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)),
                 signature="default", as_dict=True)["default"]
</pre>
</div>
<p>Then we build the Keras model with its standard <a href="https://keras.io/getting-started/functional-api-guide/">Functional API</a>.</p>
<div class="highlight">
<pre>from keras import layers
from keras.models import Model

embed_size = 512      # dimension of the Universal Sentence Encoder output
category_counts = 6   # number of question type labels

input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(UniversalEmbedding,
                          output_shape=(embed_size,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(category_counts, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
</pre>
</div>
<p>We can view the model summary and see that only the Keras layers are trainable; that is how the transfer learning works here, keeping the Universal Sentence Encoder weights untouched.</p>
<pre>_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 1)                 0
_________________________________________________________________
lambda_1 (Lambda)            (None, 512)               0
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 1542
=================================================================
Total params: 132,870
Trainable params: 132,870
Non-trainable params: 0
_________________________________________________________________</pre>
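<p>The parameter counts are easy to verify: the 256-unit Dense layer on top of the 512-dimensional embedding has 512 × 256 + 256 = 131,328 parameters, and the six-way output layer has 256 × 6 + 6 = 1,542.</p>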
<p>In the next step, we train the model with the training dataset and validate its performance at the end of each training epoch using the test dataset.</p>
<div class="highlight">
<pre>with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    history = model.fit(train_text,
                        train_label,
                        validation_data=(test_text, test_label),
                        epochs=10,
                        batch_size=32)
    model.save_weights('./model.h5')
</pre>
</div>
<p>The final validation results show the accuracy reaching about 97% after training for 10 epochs.</p>
<p>After we have the model trained and its weights saved to a file, it is ready to make predictions on new questions.</p>
<p>Here we come up with 3 new questions for the model to classify.</p>
<div class="highlight">
<pre>new_text = ["In what year did the titanic sink ?",
            "What is the highest peak in California ?",
            "Who invented the light bulb ?"]
new_text = np.array(new_text, dtype=object)[:, np.newaxis]

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./model.h5')
    predicts = model.predict(new_text, batch_size=32)

categories = df_train.label.cat.categories.tolist()
predict_logits = predicts.argmax(axis=1)
predict_labels = [categories[logit] for logit in predict_logits]
print(predict_labels)
</pre>
</div>
<p><span>The classification results look decent.</span></p>
<pre><span>['NUM', 'LOC', 'HUM']</span></pre>
<h2>Conclusion and further reading</h2>
<p>Congratulations! You have built a Keras text transfer learning model powered by the Universal Sentence Encoder and achieved a great result on the question classification task. The Universal Sentence Encoder can embed longer paragraphs, so feel free to experiment with other datasets and tasks like news topic classification, sentiment analysis, etc.</p>
<p>Here are some related resources you might find useful:</p>
<p><a href="https://www.tensorflow.org/hub/">TensorFlow Hub</a></p>
<p><a href="https://github.com/tensorflow/hub/tree/master/examples/colab">TensorFlow Hub example notebooks</a></p>
<p>For an intro to using Google Colab notebooks, you can read the first section of my post - <a href="https://www.dlology.com/blog/how-to-run-object-detection-and-segmentation-on-video-fast-for-free/">How to run Object Detection and Segmentation on a Video Fast for Free</a>.</p>
<p>The source code is available in <a href="https://github.com/Tony607/Keras-Text-Transfer-Learning">my GitHub repo</a>, along with a runnable <a href="https://colab.research.google.com/drive/1Odry08Jm0f_YALhAt4vp9qa5w8prUzDY">Colab notebook</a>.</p>
<h1>Simple Stock Sentiment Analysis with news data in Keras</h1>
<p><img alt="stock" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/stock.jpg"/></p>
<p>Have you ever wondered what impact everyday news might have on the stock market? In this tutorial, we are going to explore and build a model that reads the top <a href="https://www.reddit.com/r/worldnews/?hl=">25 voted world news headlines from Reddit</a> users and predicts whether the Dow Jones will go up or down on a given day.</p>
<p>After reading this post, you will learn:</p>
<ul>
<li>How to pre-process text data for a deep learning sequence model.</li>
<li>How to use pre-trained GloVe embedding vectors to initialize a Keras Embedding layer.</li>
<li>How to build a GRU model that processes word sequences and takes word order into account.</li>
</ul>
<p>Now let's get started, and read till the end, since there will be a secret bonus.</p>
<h3>Text data pre-processing</h3>
<p><img alt="reddit-news" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/reddit-news.png"/></p>
<p>For the input text, we concatenate all 25 news headlines into one long string for each day.</p>
<p>After that, we convert all sentences to lower case and remove characters, such as numbers and punctuation, that cannot be represented by the GloVe embeddings later.</p>
<p>The next step is to convert all your training sentences into lists of indices, then zero-pad all those lists so that their length is the same.</p>
<p>It is helpful to visualize the length distribution across all input samples before deciding the maximum sequence length.</p>
<p><img alt="sentences-length" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/sentences-length.png"/></p>
<p>Keep in mind that the longer the maximum <span>length we pick, the longer it will take to train the model. So instead of choosing the longest sequence length in our dataset, which is around 700, we pick 500 as a tradeoff: it covers the majority of the text across all samples while keeping the training time relatively short.</span></p>
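<p>A minimal sketch of how one might plot that length distribution, assuming a hypothetical list <code>texts</code> holding the concatenated daily news strings:</p>
<div class="highlight">
<pre>import matplotlib.pyplot as plt

# Hypothetical: `texts` holds one concatenated news string per day.
lengths = [len(t.split()) for t in texts]
plt.hist(lengths, bins=50)
plt.xlabel('words per sample')
plt.ylabel('number of samples')
plt.show()
</pre>
</div>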
<h3>The embedding layer</h3>
<p>In Keras, the embedding matrix is represented as a "layer" that maps positive integers (indices corresponding to words) into dense vectors of fixed size (the embedding vectors). It can be trained or initialized with a pre-trained embedding. In this part, you will learn how to create an Embedding layer in Keras and initialize it with the GloVe 50-dimensional vectors. Because our training set is quite small, we will not update the word embeddings but will instead leave their values fixed. I will show you how Keras allows you to set whether the embedding is trainable or not.</p>
<p>The <code>Embedding()</code> layer takes an integer matrix of size (batch size, max input length) as input; this corresponds to sentences converted into lists of indices (integers), as shown in the figure below.</p>
<p><img alt="embedding" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/embedding.jpg"/></p>
<p>The following function handles the first step of converting sentence strings to an array of indices. The word-to-index mapping is taken from the GloVe embedding file so we can seamlessly convert indices to word vectors later.</p>
<div class="highlight">
<pre>import numpy as np

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`.

    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary containing each word mapped to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this.

    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    m = X.shape[0]  # number of training examples
    # Initialize X_indices as a numpy matrix of zeros with the correct shape
    X_indices = np.zeros((m, max_len), dtype=int)
    for i in range(m):  # loop over training examples
        # Convert the ith training sentence to lower case and split it into a list of words
        sentence_words = [w.lower() for w in X[i].split()]
        # Initialize j to 0
        j = 0
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word
            if w in word_to_index:
                X_indices[i, j] = word_to_index[w]
            # Increment j to j + 1
            j += 1
            if j >= max_len:
                break
    return X_indices
</pre>
</div>
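<p>For instance, a hypothetical call converting the training samples (assuming <code>X_train</code> holds the news strings as a numpy array) might look like this:</p>
<div class="highlight">
<pre># Hypothetical usage: convert every training sample to a list of 500 indices.
X_train_indices = sentences_to_indices(X_train, word_to_index, max_len=500)
</pre>
</div>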
<p>After that, we can implement the pre-trained embedding layer like so:</p>
<ul>
<li>Initialize the embedding matrix as a numpy array of zeros with the correct shape: (vocab_len, dimension of word vectors).</li>
<li><span></span>Fill the embedding matrix with all the word embeddings.</li>
<li>Define the Keras embedding layer and make it non-trainable by setting <code>trainable</code> to False.</li>
<li>Set the weight of the embedding layer to the embedding matrix.</li>
</ul>
<div class="highlight">
<pre>from keras.layers.embeddings import Embedding

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    vocab_len = len(word_to_index) + 1  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]  # dimensionality of your GloVe word vectors (= 50)
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    # Set each row "index" of the embedding matrix to the word vector of the "index"th vocabulary word
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    # Define the Keras embedding layer with the correct output/input sizes. Make sure to set trainable=False.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    # Build the embedding layer; this is required before setting its weights. Do not modify the "None".
    embedding_layer.build((None,))
    # Set the weights of the embedding layer to the embedding matrix. The layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
</pre>
</div>
<p>Let's have a quick check of the embedding layer by asking for the vector representation of the word "cat".</p>
<div class="highlight">
<pre>embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
embedding_layer.get_weights()[0][word_to_index['cat']]
# array([ 0.45281 , -0.50108 , ... 0.71278 , 0.23782 ], dtype=float32)
</pre>
</div>
<p>The result is a 50-dimensional array. You can further explore the word vectors, measure similarity using cosine similarity, or solve word analogy problems such as: Man is to Woman as King is to __.</p>
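<p>Cosine similarity itself is a one-liner. A minimal sketch (the function here is my own illustration, not from this tutorial):</p>
<div class="highlight">
<pre>import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors: close to 1 means very similar.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
</pre>
</div>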
<h3>Build and evaluate the model</h3>
<p>The task for the model is to take the news string sequence and make a binary classification: whether the Dow Jones close will rise or fall compared to the previous close. It outputs "1" if the value rose or stayed the same, and "0" when the value decreased.</p>
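<p>A minimal sketch of generating such labels, assuming a hypothetical pandas DataFrame <code>djia</code> with one row per trading day and a <code>Close</code> column:</p>
<div class="highlight">
<pre>import pandas as pd

# Hypothetical: label is 1 when the close rose or stayed the same vs. the previous day.
djia['label'] = (djia['Close'] &gt;= djia['Close'].shift(1)).astype(int)
</pre>
</div>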
<p>We build a simple model containing two stacked GRU layers after the pre-trained embedding layer, with a Dense layer generating the final output through a sigmoid activation. GRU is a type of recurrent network that processes sequences and takes their order into account; it is similar to LSTM in functionality and performance, but less computationally expensive to train.</p>
<div class="highlight">
<pre>from keras.models import Sequential
from keras.layers import GRU, Dense, Activation

model = Sequential()
model.add(pretrained_embedding_layer(word_to_vec_map, word_to_index))
model.add(GRU(128, dropout=0.2, return_sequences=True))
model.add(GRU(128, dropout=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
</pre>
</div>
<p>Next, we can train and evaluate the model.</p>
<div class="highlight">
<pre>history = model.fit(X_train_indices, Y_train, batch_size=batch_size, epochs=10,
                    validation_data=(X_test_indices, Y_test))
model.save("./model.h5")
score, acc = model.evaluate(X_test_indices, Y_test,
                            batch_size=batch_size)
</pre>
</div>
<p>It is also helpful to generate the ROC curve for our binary classifier to assess its performance visually.</p>
<p><img alt="ROC" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/roc.png"/></p>
<p>Our model is about 2.8% better than a random guess of the market trend.</p>
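<p>A minimal sketch of computing the curve and AUC with scikit-learn, assuming <code>X_test_indices</code> and <code>Y_test</code> from above:</p>
<div class="highlight">
<pre>from sklearn.metrics import roc_curve, auc

# Predicted probabilities for the positive ("market rose") class.
probs = model.predict(X_test_indices).ravel()
fpr, tpr, thresholds = roc_curve(Y_test, probs)
print('AUC: %.3f' % auc(fpr, tpr))
</pre>
</div>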
<p>For more information about ROC and AUC, you can read my other blog post - <a href="https://www.dlology.com/blog/simple-guide-on-how-to-generate-roc-plot-for-keras-classifier/">Simple guide on how to generate ROC plot for Keras classifier</a>.</p>
<h3>Conclusion and further thoughts</h3>
<p>In this post, we introduced a quick and simple way to build a Keras model with an Embedding layer initialized with pre-trained GloVe embeddings. Some things you can try after reading this post:</p>
<ul>
<li>Make the Embedding layer weights trainable, train the model from scratch, then compare the results.</li>
<li>Increase the maximum sequence length and see how that might impact the model performance and training time.</li>
<li>Incorporate other input to form a multi-input Keras model, since other factors might correlate with stock index fluctuation. For example, there are <a href="https://www.investopedia.com/terms/m/macd.asp">MACD</a>(Moving Average Convergence/Divergence oscillator), <a href="https://www.investopedia.com/terms/m/momentum.asp">Momentum </a>indicator for your consideration. To have multi-input, you can use the <a href="https://keras.io/getting-started/functional-api-guide/">Keras functional API</a>.</li>
</ul>
<p>Any ideas to improve the model? Comment and share your thoughts.</p>
<p>You can find the full source code and training data here in <a href="https://github.com/Tony607/SentimentStock">my Github repo</a>.</p>
<h3>Bonus for investors</h3>
<p><img alt="stock_ticket" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/e6466d70837a591d2da764509942cf20c2b0f93c/images/finance/stock_ticket.png"/></p>
<p>If you are new to the whole investment world, as I was years ago, you may wonder where to start, preferably investing for free with zero commissions. By learning how to trade stocks for free, you'll not only save money, but your investments will potentially compound at a faster rate. <a href="https://www.forbes.com/companies/robinhood/">Robinhood</a>, one of the best investing apps, does just that: whether you are buying one share or 100 shares, there are no commissions. It was built from the ground up to be as efficient as possible by cutting out the fat and passing the savings on to the customers. Join Robinhood, and we'll both get a stock like Apple, Ford, or Sprint for free. Make sure you use my <a href="https://share.robinhood.com/chengwz1">shared link</a>.</p>
<h1>How to generate realistic yelp restaurant reviews with Keras</h1>
<p><img alt="restaurant reviews" src="https://gitcdn.xyz/repo/Tony607/blog_statics/master/images/rnn/restaurant.jpg"/></p>
<p>TL;DR</p>
<p>After reading this article, you will be able to build a model that generates 5-star Yelp reviews like these.</p>
<p><span style="text-decoration: underline;"><em>Samples of generated review text (unmodified)</em></span></p>
<pre><SOR>I had the steak, mussels with a side of chicken parmesan. All were very good. We will be back.<EOR><br/><SOR>The food, service, atmosphere, and service are excellent. I would recommend it to all my friends<EOR><br/><SOR>Good atmosphere, amazing food and great service.Service is also pretty good. Give them a try!<EOR></pre>
<p>I will show you how to:</p>
<ul>
<li>Acquire and prepare the training data.</li>
<li>Build the character-level language model.</li>
<li>Apply some tips when training the model.</li>
<li>Generate random reviews.</li>
</ul>
<p>Training the model could easily take a couple of days, even on a GPU. Luckily, the pre-trained model weights are available, so we can jump directly to the fun part and generate reviews.</p>
<h2>Getting the Data ready</h2>
<p>The <a href="https://www.yelp.com/dataset/challenge">Yelp Dataset </a>is freely available in JSON format.</p>
<p>After downloading and extracting, you will find 2 files we need in the <strong>dataset</strong> folder,</p>
<ul>
<li>review.json</li>
<li>business.json</li>
</ul>
<p>Those two files are quite large, especially the <strong>review.json</strong> file (3.7 GB).</p>
<p>Each line of the <strong>review.json</strong> file is one review as a JSON string. The two files do not have the JSON start and end square brackets "[ ]", so the content of each file as a whole is not a valid JSON string. Plus, it might be difficult to fit the whole <strong>review.json</strong> file content into memory. So let's first convert them to CSV format line by line with our helper script.</p>
<pre>python json_converter.py ./dataset/review.json<br/>python json_converter.py ./dataset/business.json</pre>
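<p>The helper script lives in the repo; as a rough idea of what it does, a minimal line-by-line JSON-to-CSV converter could be sketched like this (my own simplification, not the repo's actual script):</p>
<div class="highlight">
<pre>import csv
import json
import sys

# Stream the file line by line so we never hold the 3.7 GB file in memory at once.
path = sys.argv[1]
with open(path) as fin, open(path.replace('.json', '.csv'), 'w', newline='') as fout:
    writer = None
    for line in fin:
        record = json.loads(line)  # each line is one standalone JSON object
        if writer is None:
            writer = csv.DictWriter(fout, fieldnames=record.keys())
            writer.writeheader()
        writer.writerow(record)
</pre>
</div>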
<p>After that, you will find those two files in <strong>dataset</strong> folder,</p>
<ul>
<li>review.csv</li>
<li>business.csv</li>
</ul>
<p>Those two are valid CSV files we can open by <strong>pandas</strong> library.</p>
<p>Here is what we are going to do: we only extract <strong>5-star</strong> review texts from businesses that have the '<strong>Restaurant</strong>' tag in their categories.</p>
<div class="highlight">
<pre>import pandas as pd

# Read the two CSV files into pandas dataframes
df_business = pd.read_csv('../dataset/business.csv')
df_review = pd.read_csv('../dataset/review.csv')
# Filter 'Restaurants' businesses
restaurants = df_business[df_business['categories'].str.contains('Restaurants')]
# Filter 5-star reviews
five_star = df_review[df_review['stars'] == 5]
# Merge the reviews with restaurants by key 'business_id'.
# This keeps only 5-star restaurant reviews.
combo = pd.merge(restaurants, five_star, on='business_id')
# Keep only the review text column
rnn_fivestar_reviews_only = combo[['text']]
</pre>
</div>
<p>Next, let's remove the new line characters in reviews and any duplicated reviews.</p>
<div class="highlight">
<pre># Remove new line characters
rnn_fivestar_reviews_only = rnn_fivestar_reviews_only.replace({r'\n+': ''}, regex=True)
# Remove duplicated reviews
final = rnn_fivestar_reviews_only.drop_duplicates()
</pre>
</div>
<p>To show the model where a review starts and ends, we need to add special markers to the review texts.</p>
<p>One line in the final prepared reviews will look like this:</p>
<pre>"<SOR>Hummus is amazing and fresh! Loved the falafels. I will definitely be back. Great owner, friendly staff<EOR>"</pre>
<h2>Build the model</h2>
<p>The model we are building here is a <strong>character-level language model</strong>, meaning the minimum distinguishable symbol is a character. You may also come across word-level models, where the input is word tokens.</p>
<p>There are some pros and cons for the <strong>character-level language model</strong>.</p>
<p><strong>Pro:</strong></p>
<ul>
<li><span>Don’t have to worry about unknown vocabulary.</span></li>
<li><span>Able to handle a large vocabulary, since any word can be spelled out character by character.</span></li>
</ul>
<p><strong>Con:</strong></p>
<ul>
<li><span>We end up with very long sequences, so the model is not as good as word-level language models at capturing <strong>long-range dependencies</strong>, i.e. how the earlier parts of a sentence affect its later parts.</span></li>
<li><span>Character-level models are also more <strong>computationally expensive</strong> to train.</span></li>
</ul>
<p>The model is quite similar to the official <a href="https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py"><strong class="final-path">lstm_text_generation.py </strong>demo code</a>, except that we stack RNN layers, which lets the hidden states between the input and output layers store more information and generates more realistic Yelp reviews.</p>
<p>Before showing the code for the model, let's peek a little deeper on how stacking RNN works.</p>
<p>You may have seen how layers are stacked in a standard neural network (the <strong>Dense</strong> layers in Keras).</p>
<p>The first layer takes the input <strong>x</strong> and computes the activation value <strong>a<sup>[1]</sup></strong>; the next stacked layer then computes the next activation value <strong>a<sup>[2]</sup></strong>.</p>
<p><img alt="stack dense" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/stack_dense.svg"/></p>
<p>Stacking RNNs is a bit like the standard neural network "unrolled in time".</p>
<p>In this notation, <strong>a<sup>[l]<t></sup></strong> means the activation value for <strong>layer l</strong> at <strong>timestep t</strong>.</p>
<p><img alt="stack rnn" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/stack_rnn.svg"/></p>
<p>Let's take a look at how an activation value is computed.</p>
<p>To compute <strong>a<sup>[2]<3></sup></strong>, there are two inputs: <strong>a<sup>[2]<2></sup></strong><span> and <strong>a<sup>[1]<3></sup></strong></span></p>
<p><span>g is the activation function, w<sub>a</sub><sup>[2]</sup> and b<sub>a</sub><sup>[2]</sup> are the layer 2 parameters.</span></p>
<p><img alt="activation a23" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/a23.png"/></p>
<p>As we can see, to stack RNNs, the previous RNN layer needs to return all the timestep activations a<sup><t></sup> to the subsequent RNN layer.</p>
<p>By default, an <strong>RNN</strong> layer such as <strong>LSTM</strong> in Keras only returns the last timestep's activation value a<sup><T></sup>. To return the activation values for all timesteps, we set the <code>return_sequences</code> parameter to <code>True</code>.</p>
<p>So here is how we build the model in Keras. Each input sample is a one-hot representation of 60 characters, and there are 95 possible characters in total.</p>
<p>Each output is a list of 95 predicted probabilities, one for each possible character.</p>
<div class="highlight">
<pre><span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">layers</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">LSTM</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">60</span><span class="p">,</span> <span class="mi">95</span><span class="p">),</span><span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">LSTM</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">60</span><span class="p">,</span> <span class="mi">95</span><span class="p">)))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">95</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">))</span>
</pre>
</div>
<p>And here is the graphical model structure to help you visualize it.</p>
<p><img alt="model structure" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/model_yelp.svg"/></p>
<h2><span>Training the model</span></h2>
<p><span>The idea behind training the model is simple: we train it with input/output pairs. Each input is 60 characters, and the corresponding output is the immediately following character.</span></p>
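<p>Conceptually, one training pair is cut from the text like this (a sketch):</p>
<pre># One input/output pair starting at position i
input_x  = text[i : i + 60]   # 60 characters
output_y = text[i + 60]       # the very next character</pre>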
<p><span>In the data preparation step, we created a list of clean 5-star review texts, 1,214,016 reviews in total. To simplify training, we only train on reviews of 250 characters or fewer, which leaves 418,955 reviews.</span></p>
<p>Then we shuffle the order of the reviews, so we don't train on 100 reviews for the same restaurant in a row.</p>
<p><span>We read all reviews as one long text string, then create a Python dictionary (i.e., a hash table) to map each character to an index from 0-94 (95 unique characters in total).</span></p>
<div class="highlight">
<pre><span class="c1"># List of unique characters in the corpus</span>
<span class="n">chars</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Unique characters:'</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">))</span>
<span class="c1"># Dictionary mapping unique characters to their index in `chars`</span>
<span class="n">char_indices</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">((</span><span class="n">char</span><span class="p">,</span> <span class="n">chars</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">char</span><span class="p">))</span> <span class="k">for</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">chars</span><span class="p">)</span>
</pre>
</div>
<p>The text corpus has a total of 72,662,807 characters, which is hard to process as a whole, so let's break it down into chunks of 90,000 characters each.</p>
<p>For each chunk, we generate pairs of inputs and outputs by sliding a pointer from the beginning to the end of the chunk, one character at a time when <code>step</code> is set to 1.</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="nf">getDataFromChunk</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">step</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="n">sentences</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">next_chars</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">)</span> <span class="o">-</span> <span class="n">maxlen</span><span class="p">,</span> <span class="n">step</span><span class="p">):</span>
<span class="n">sentences</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">[</span><span class="n">i</span> <span class="p">:</span> <span class="n">i</span> <span class="o">+</span> <span class="n">maxlen</span><span class="p">])</span>
<span class="n">next_chars</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">maxlen</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'nb sequences:'</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Vectorization...'</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">),</span> <span class="n">maxlen</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">bool</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">bool</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sentences</span><span class="p">):</span>
<span class="k">for</span> <span class="n">t</span><span class="p">,</span> <span class="n">char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sentence</span><span class="p">):</span>
<span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">char_indices</span><span class="p">[</span><span class="n">char</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">char_indices</span><span class="p">[</span><span class="n">next_chars</span><span class="p">[</span><span class="n">i</span><span class="p">]]]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="p">[</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span>
</pre>
</div>
<p>Training one chunk for one epoch takes 219 seconds on GPU (GTX1070), so training the full corpus will take about 2 days.</p>
<pre>72662807 / 90000 * 219 / 60 / 60 / 24 ≈ 2.0 days</pre>
<p>Two Keras callbacks come in handy: <strong>ModelCheckpoint</strong> and <strong>ReduceLROnPlateau</strong>.</p>
<p><strong>ModelCheckpoint </strong>helps us save the weights every time the monitored loss improves.</p>
<p>The <strong>ReduceLROnPlateau</strong> callback automatically reduces the learning rate when the <strong>loss</strong> metric stops decreasing. Its main benefit is that we don't need to tune the learning rate manually; its main weakness is that the learning rate only ever decreases and never recovers.</p>
<div class="highlight">
<pre><span class="c1"># this saves the weights everytime they improve so you can let it train. Also learning rate decay</span>
<span class="n">filepath</span><span class="o">=</span><span class="s">"Feb-22-all-{epoch:02d}-{loss:.4f}.hdf5"</span>
<span class="n">checkpoint</span> <span class="o">=</span> <span class="n">ModelCheckpoint</span><span class="p">(</span><span class="n">filepath</span><span class="p">,</span> <span class="n">monitor</span><span class="o">=</span><span class="s">'loss'</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">save_best_only</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">'min'</span><span class="p">)</span>
<span class="n">reduce_lr</span> <span class="o">=</span> <span class="n">ReduceLROnPlateau</span><span class="p">(</span><span class="n">monitor</span><span class="o">=</span><span class="s">'loss'</span><span class="p">,</span> <span class="n">factor</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">patience</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">min_lr</span><span class="o">=</span><span class="mf">0.00001</span><span class="p">)</span>
<span class="n">callbacks_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">checkpoint</span><span class="p">,</span> <span class="n">reduce_lr</span><span class="p">]</span>
</pre>
</div>
<p>Code to train the model for 20 epochs looks like this.</p>
<div class="highlight">
<pre><span class="k">for</span> <span class="n">iteration</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Iteration'</span><span class="p">,</span> <span class="n">iteration</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"../dataset/short_reviews_shuffle.txt"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">iter</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">90000</span><span class="p">),</span> <span class="s">""</span><span class="p">):</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">getDataFromChunk</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">callbacks</span><span class="o">=</span><span class="n">callbacks_list</span><span class="p">)</span>
</pre>
</div>
<p>As you might guess, the full run will take a month or so. But training for about 2 hours already produced some promising results in my case. Feel free to give it a try.</p>
<h2>Generate 5-star reviews</h2>
<p>Whether you jumped right to this section or read through the previous ones, here is the fun part!</p>
<p><span>With the pre-trained model weights, or weights you trained yourself, we can generate some interesting Yelp reviews.</span></p>
<p><span>Here is the idea: we "seed" the model with an initial 60 characters and ask it to predict the very next character.</span></p>
<p><img alt="generate sample" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/sample.svg"/></p>
<p><span>The "sampling index" process will add some variety to the final result </span><span>by generating some randomness with the given prediction.</span></p>
<p><span>If the temperature is very small, it will always pick the index with highest predicted probability.</span></p>
<div class="highlight">
<pre><span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Generate some randomness with the given preds</span>
<span class="sd"> which is a list of numbers, if the temperature</span>
<span class="sd"> is very small, it will always pick the index</span>
<span class="sd"> with highest pred value</span>
<span class="sd"> '''</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float64'</span><span class="p">)</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span> <span class="o">/</span> <span class="n">temperature</span>
<span class="n">exp_preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">exp_preds</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">exp_preds</span><span class="p">)</span>
<span class="n">probas</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">multinomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">preds</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">probas</span><span class="p">)</span>
</pre>
</div>
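<p>Before generating, we need a 60-character seed and a temperature value; here is a minimal sketch (picking a random window from the corpus is just one option):</p>
<pre>import sys
# Pick a random 60-character window from the corpus as the seed
start = np.random.randint(0, len(text) - maxlen - 1)
generated_text = text[start : start + maxlen]
temperature = 0.8  # example value; higher means more surprising output</pre>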
<p><span>To generate 300 characters, we run the following code.</span></p>
<div class="highlight">
<pre><span class="c1"># We generate 300 characters</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">300</span><span class="p">):</span>
<span class="n">sampled</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">maxlen</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">)))</span>
<span class="c1"># Turn each char to char index.</span>
<span class="k">for</span> <span class="n">t</span><span class="p">,</span> <span class="n">char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">generated_text</span><span class="p">):</span>
<span class="n">sampled</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">char_indices</span><span class="p">[</span><span class="n">char</span><span class="p">]]</span> <span class="o">=</span> <span class="mf">1.</span>
<span class="c1"># Predict next char probabilities</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">sampled</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># Add some randomness by sampling given probabilities.</span>
<span class="n">next_index</span> <span class="o">=</span> <span class="n">sample</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">temperature</span><span class="p">)</span>
<span class="c1"># Turn char index to char.</span>
<span class="n">next_char</span> <span class="o">=</span> <span class="n">chars</span><span class="p">[</span><span class="n">next_index</span><span class="p">]</span>
<span class="c1"># Append char to generated text string</span>
<span class="n">generated_text</span> <span class="o">+=</span> <span class="n">next_char</span>
<span class="c1"># Pop the first char in generated text string.</span>
<span class="n">generated_text</span> <span class="o">=</span> <span class="n">generated_text</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
<span class="c1"># Print the new generated char.</span>
<span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">next_char</span><span class="p">)</span>
<span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">generated_text</span><span class="p">)</span>
</pre>
</div>
<h2>Summary and Further reading</h2>
<p>In this post, you learned how to build and train a character-level text generation model from beginning to end. The source code is available on my <a href="https://github.com/Tony607/Yelp_review_generation">GitHub repo</a>, as well as the pre-trained model to play with.</p>
<p>The model shown here is trained in a many-to-one fashion. There is also an alternative implementation in a many-to-many fashion: consider an input sequence of 7 characters, <strong>"The cak"</strong>, where the expected output is <strong>"he cake"</strong>. You can check it out here: <a href="https://github.com/mineshmathew/char_rnn_karpathy_keras">char_rnn_karpathy_keras</a>.</p>
<p>For a reference to building a word-level model, check out my other blog: <a href="https://www.dlology.com/blog/simple-stock-sentiment-analysis-with-news-data-in-keras/">Simple Stock Sentiment Analysis with news data in Keras</a>.</p>
How to teach AI to suggest product prices to online sellers2017-12-23T13:09:19+00:002024-03-15T16:57:07+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/how-to-teach-ai-to-suggest-product-prices-to-online-sellers/<p><img alt="price prediction" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/master/images/kaggle_price/price.png"/></p>
<p>Given a product's brand, name, category, and a sentence or two of short description, can we predict its price? Could it be that simple?</p>
<p>In <a href="https://www.kaggle.com/c/mercari-price-suggestion-challenge">this</a> Kaggle competition, we are doing exactly that. Developers around the world are competing for the "Mercari Prize: Price Suggestion Challenge", with a total prize pool of $100,000 (first place: $60,000, second place: $30,000, third place: $10,000).</p>
<p>In this post, I will walk you through building a simple model to tackle the challenge in the deep learning library <a href="https://keras.io">Keras</a>.</p>
<p>If you are new to Kaggle, in order to download the datasets, you need to register an account, totally pain-free. Once you have the account, go to the <a href="https://www.kaggle.com/c/mercari-price-suggestion-challenge/data">"Data" tab</a> in the Mercari Price Suggestion Challenge.</p>
<p>Download all three files to your local computer, extract and save them to a folder named "<strong>input</strong>", and at the root folder create a folder named "<strong>scripts</strong>" where we will start coding.</p>
<p>Right now you should have your directories structured similarly to this.</p>
<pre>./Pricing_Challenge
|-- <strong>input</strong>
|   |-- test.tsv
|   |-- sample_submission.csv
|   `-- train.tsv
`-- <strong>scripts</strong></pre>
<h1>Preparing the data</h1>
<p>First, let's take some time to understand the datasets at hand.</p>
<p><strong>train.tsv, test.tsv</strong></p>
<p>The files consist of a list of product listings. These files are tab-delimited.</p>
<ul>
<li><code>train_id</code><span> </span>or<span> </span><code>test_id</code><span> </span>- the id of the listing</li>
<li><code>name</code><span> </span>- the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as<span> </span><code>[rm]</code></li>
<li><code>item_condition_id</code><span> </span>- the condition of the items provided by the seller</li>
<li><code>category_name</code><span> </span>- category of the listing</li>
<li><code>brand_name</code></li>
<li><code>price</code><span> </span>- the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in<span> </span><code>test.tsv</code><span> </span>since that is what you will predict.</li>
<li><code>shipping</code><span> </span>- 1 if shipping fee is paid by seller and 0 by a buyer</li>
<li><code>item_description</code><span> </span>- the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as<span> </span><code>[rm]</code></li>
</ul>
<div class="table-responsive">
<h3><span>First 5 rows of train datasets</span></h3>
<table class="table table-striped">
<thead>
<tr>
<th>train_id</th>
<th>name</th>
<th>item_condition_id</th>
<th>category_name</th>
<th>brand_name</th>
<th>price</th>
<th>shipping</th>
<th>item_description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>MLB Cincinnati Reds T Shirt Size XL</td>
<td>3</td>
<td>Men/Tops/T-shirts</td>
<td>NaN</td>
<td>10.0</td>
<td>1</td>
<td>No description yet</td>
</tr>
<tr>
<td>1</td>
<td>Razer BlackWidow Chroma Keyboard</td>
<td>3</td>
<td>Electronics/Computers & Tablets/Components & P...</td>
<td>Razer</td>
<td>52.0</td>
<td>0</td>
<td>This keyboard is in great condition and works ...</td>
</tr>
<tr>
<td>2</td>
<td>AVA-VIV Blouse</td>
<td>1</td>
<td>Women/Tops & Blouses/Blouse</td>
<td>Target</td>
<td>10.0</td>
<td>1</td>
<td>Adorable top with a hint of lace and a key hol...</td>
</tr>
<tr>
<td>3</td>
<td>Leather Horse Statues</td>
<td>1</td>
<td>Home/Home Décor/Home Décor Accents</td>
<td>NaN</td>
<td>35.0</td>
<td>1</td>
<td>New with tags. Leather horses. Retail for [rm]...</td>
</tr>
<tr>
<td>4</td>
<td>24K GOLD plated rose</td>
<td>1</td>
<td>Women/Jewelry/Necklaces</td>
<td>NaN</td>
<td>44.0</td>
<td>0</td>
<td>Complete with certificate of authenticity</td>
</tr>
</tbody>
</table>
</div>
<p>For the price column, it would be problematic to feed the neural network values that take wildly different ranges. The network might be able to adapt to such heterogeneous data automatically, but it would definitely make learning more difficult.</p>
<p>A widespread best practice for dealing with such data is normalization: here we apply <span><code>log(x+1)</code> to the price column.</span></p>
<div class="highlight">
<pre><span class="n">train</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log1p</span><span class="p">(</span><span class="n">train</span><span class="p">[</span><span class="s">'price'</span><span class="p">])</span>
</pre>
</div>
<div class="highlight"></div>
<p>Let's take a look at the distribution of the new 'target' column.</p>
<p><img alt="target" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/master/images/kaggle_price/target.png"/></p>
<h2>Text preprocessing</h2>
<h3>Replace contractions</h3>
<p>We will replace contraction pairs like those below; the purpose is to unify the vocabulary and make the model easier to train.</p>
<pre>"what's" → "what is",<br/>"who'll"<span> →</span> "who will",<br/>"wouldn't"<span> →</span> "would not",<br/>"you'd"<span> →</span> "you would",<br/>"you're"<span> →</span> "you are"</pre>
<p>Before doing this, let's count how many rows contain any of the contractions in the "item_description" column.</p>
<p>The most frequent ones are listed below, which is no surprise.</p>
<pre>can't - 6136<br/>won't - 4867<br/>that's - 4806<br/>it's - 26444<br/>don't - 32645<br/>doesn't - 8520</pre>
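<p>The counts above can be reproduced with something like this (a sketch over the train set loaded earlier):</p>
<pre>for c in ["can't", "won't", "that's", "it's", "don't", "doesn't"]:
    count = train['item_description'].str.contains(c, regex=False, na=False).sum()
    print(c, '-', count)</pre>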
<p>Here is the code to remove contractions from both the 'item_description' and 'name' columns in the train and test datasets.</p>
<div class="highlight">
<pre><span class="n">contractions</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"ain't"</span><span class="p">:</span> <span class="s">"am not"</span><span class="p">,</span>
<span class="s">"aren't"</span><span class="p">:</span> <span class="s">"are not"</span><span class="p">,</span>
<span class="s">"can't"</span><span class="p">:</span> <span class="s">"cannot"</span><span class="p">,</span>
<span class="s">"can't've"</span><span class="p">:</span> <span class="s">"cannot have"</span><span class="p">,</span>
<span class="c1"># more contractions pairs</span>
<span class="o">...</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">contraction</span> <span class="ow">in</span> <span class="n">contractions</span><span class="p">:</span>
<span class="n">train</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
<span class="n">test</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span> <span class="o">=</span> <span class="n">test</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
<span class="n">train</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
<span class="n">test</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">test</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
</pre>
</div>
<h3>Handle missing values</h3>
<p>The concept of missing values is important to understand in order to successfully manage data. If missing values are not handled properly, the researcher may end up drawing inaccurate inferences from the data.</p>
<p>First, let's take a look at how much data is missing and in which columns.</p>
<div class="highlight">
<pre><span class="n">train</span><span class="o">.</span><span class="n">isnull</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
</pre>
</div>
<p>From the output, we can see that about 42% of "brand_name" values are missing, while the "category_name" and "item_description" columns are each missing less than 1% of their data.</p>
<pre>train_id 0.000000
name 0.000000
item_condition_id 0.000000
<strong>category_name 0.004268</strong>
<strong>brand_name 0.426757</strong>
price 0.000000
shipping 0.000000
<strong>item_description 0.000003</strong>
dtype: float64</pre>
<p>Suppose the number of missing values is extremely small; then an expert researcher may drop or omit those values from the analysis. In statistical language, if the number of cases is less than 5% of the sample, the researcher can drop them. In our case, we could drop rows where the <span>"<strong>category_name</strong>" or "<strong>item_description</strong>" column has a missing value.</span></p>
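<p>If we wanted to drop those rows, one line would do it (a sketch; we won't use this here):</p>
<pre>train = train.dropna(subset=['category_name', 'item_description'])</pre>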
<p><span>But for simplicity, let's replace all missing text values with the string "<strong>missing</strong>".</span></p>
<div class="highlight">
<pre><span class="c1">#HANDLE MISSING VALUES</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Handling missing values..."</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">handle_missing</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">"missing"</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">brand_name</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">"missing"</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">"missing"</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">return</span> <span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
<span class="n">train</span> <span class="o">=</span> <span class="n">handle_missing</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">handle_missing</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</pre>
</div>
<h3>Create categorical columns</h3>
<p>There are two text columns with special meanings:</p>
<ul>
<li>category_name</li>
<li>brand_name</li>
</ul>
<p>Different products can have the same category name or brand name, so it is helpful to create categorical columns from them.</p>
<p>We will use sklearn's LabelEncoder for this purpose. After the transform, we will have two new columns, "category" and "brand", with integer types.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">LabelEncoder</span>
<span class="n">le</span> <span class="o">=</span> <span class="n">LabelEncoder</span><span class="p">()</span>
<span class="n">le</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="p">,</span> <span class="n">test</span><span class="o">.</span><span class="n">category_name</span><span class="p">]))</span>
<span class="n">train</span><span class="p">[</span><span class="s">'category'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="p">)</span>
<span class="n">test</span><span class="p">[</span><span class="s">'category'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">category_name</span><span class="p">)</span>
<span class="n">le</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">train</span><span class="o">.</span><span class="n">brand_name</span><span class="p">,</span> <span class="n">test</span><span class="o">.</span><span class="n">brand_name</span><span class="p">]))</span>
<span class="n">train</span><span class="p">[</span><span class="s">'brand'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">brand_name</span><span class="p">)</span>
<span class="n">test</span><span class="p">[</span><span class="s">'brand'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">brand_name</span><span class="p">)</span>
</pre>
</div>
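<p>For intuition, here is what <code>LabelEncoder</code> does on a few made-up brand values:</p>
<pre>from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['Razer', 'Target', 'missing', 'Razer'])
le.transform(['Target', 'missing'])   # array([1, 2])</pre>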
<h3>Tokenize - texts to sequences</h3>
<p>For each unique word in the vocabulary, we will use one integer to represent it. So one sentence becomes a list of integers.</p>
<p>First, we need to gather the vocabulary from the text columns we are going to tokenize, i.e. these 3 columns:</p>
<ul>
<li>category_name</li>
<li>item_description</li>
<li>name</li>
</ul>
<p>And we will use Keras' text processing <strong>Tokenizer </strong>class.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="n">raw_text</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">(),</span>
<span class="n">train</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">(),</span>
<span class="n">train</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">()])</span>
<span class="c1"># Tokenize</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Tokenizing!"</span><span class="p">)</span>
<span class="n">tok_raw</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">()</span>
<span class="n">tok_raw</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">" Transforming text to seq..."</span><span class="p">)</span>
<span class="n">train</span><span class="p">[</span><span class="s">"seq_category_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">test</span><span class="p">[</span><span class="s">"seq_category_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">train</span><span class="p">[</span><span class="s">"seq_item_description"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">test</span><span class="p">[</span><span class="s">"seq_item_description"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">train</span><span class="p">[</span><span class="s">"seq_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">test</span><span class="p">[</span><span class="s">"seq_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
</pre>
</div>
<p>The <code>fit_on_texts()</code> method will train the Tokenizer to generate the vocabulary word index mapping. And the<code> texts_to_sequences()</code> method will actually make the sequences from texts.</p>
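<p>A tiny illustration of the two methods on made-up strings:</p>
<pre>from keras.preprocessing.text import Tokenizer
tok = Tokenizer()
tok.fit_on_texts(["the cake is great", "great cake"])
print(tok.word_index)
# e.g. {'cake': 1, 'great': 2, 'the': 3, 'is': 4}
print(tok.texts_to_sequences(["the great cake"]))
# [[3, 2, 1]]</pre>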
<h3>Padding sequences</h3>
<p>The word sequences generated in the previous step have different lengths, but the first layer in our network for those sequences is an <code>Embedding</code> layer, and each <code>Embedding</code> layer takes as input a 2D tensor of integers of shape <code>(samples, sequence_length)</code>.</p>
<p><span>All sequences in a batch must have the same length, since we need to pack them into a single tensor. So sequences that are shorter than others are padded with zeros, and sequences that are longer are truncated.</span></p>
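<p>For example, with a maximum length of 4 (by default, both padding and truncation happen on the left):</p>
<pre>from keras.preprocessing.sequence import pad_sequences
pad_sequences([[5, 8], [3, 9, 4, 7, 1]], maxlen=4)
# array([[0, 0, 5, 8],
#        [9, 4, 7, 1]], dtype=int32)</pre>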
<p><span>To begin with, we need to choose the <code>sequence_length</code> for each of our sequence columns. If it is too long, model training will take forever. If it is too short, we risk truncating important information. It helps to visualize the sequence length distribution before making this decision.</span></p>
<p><span>This line of code plots the sequence length distribution for the "seq_item_description" column as a histogram.</span></p>
<div class="highlight">
<pre><span class="n">train</span><span class="o">.</span><span class="n">seq_item_description</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</pre>
</div>
<p><span><img alt="sequence lengths histogram" src="https://gitcdn.link/cdn/Tony607/blog_statics/master/images/kaggle_price/length_hist.png"/></span></p>
<p>Let's pick 60 for the max sequence length, since it covers the majority of sequences.</p>
<p>In the code below, we are using <span><span> </span>Keras' sequence processing <code>pad_sequences()</code> method to pad sequences to be the same length for each column.</span></p>
<div class="highlight">
<pre><span class="c1">#KERAS DATA DEFINITION</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="k">def</span> <span class="nf">get_keras_data</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
<span class="n">X</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'name'</span><span class="p">:</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">seq_name</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">MAX_NAME_SEQ</span><span class="p">)</span>
<span class="p">,</span><span class="s">'item_desc'</span><span class="p">:</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">seq_item_description</span>
<span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">MAX_ITEM_DESC_SEQ</span><span class="p">)</span>
<span class="p">,</span><span class="s">'brand'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">brand</span><span class="p">)</span>
<span class="p">,</span><span class="s">'category'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">category</span><span class="p">)</span>
<span class="p">,</span><span class="s">'category_name'</span><span class="p">:</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">seq_category_name</span>
<span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">MAX_CATEGORY_NAME_SEQ</span><span class="p">)</span>
<span class="p">,</span><span class="s">'item_condition'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">item_condition_id</span><span class="p">)</span>
<span class="p">,</span><span class="s">'shipping'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="p">[[</span><span class="s">"shipping"</span><span class="p">]])</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">X</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">get_keras_data</span><span class="p">(</span><span class="n">dtrain</span><span class="p">)</span>
<span class="n">X_valid</span> <span class="o">=</span> <span class="n">get_keras_data</span><span class="p">(</span><span class="n">dvalid</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">get_keras_data</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>
</pre>
</div>
<h1>Build the model</h1>
<p>This will be a multi-input model with the following inputs:</p>
<ul>
<li>'<strong>name</strong>': texts converted to <strong>sequences</strong></li>
<li>'<strong>item_desc</strong>': <span>texts converted to </span><strong>sequences</strong></li>
<li>'<strong>brand</strong>': texts converted to <strong>integers</strong></li>
<li>'<strong>category</strong>': <span>texts converted to </span><strong>integers</strong></li>
<li>'<strong>category_name</strong>': <span>texts converted to </span><strong>sequences</strong></li>
<li>'<strong>item_condition</strong>': <strong>integers</strong></li>
<li>'<strong>shipping</strong>': <strong>integers</strong> 1 or 0</li>
</ul>
<p>All inputs except “shipping” will first go through an embedding layer.</p>
<p>For those sequence inputs, we feed them to <code>Embedding</code> layers. <code>Embedding</code> layers turn <span>integer indices (which stand for specific words) into dense vectors. An Embedding layer takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It's effectively a dictionary lookup.</span></p>
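<p>For intuition, the shapes involved look like this (the sizes here are arbitrary):</p>
<pre>from keras.layers import Embedding
# 1000 possible integer indices, each mapped to an 8-dimensional vector;
# input shape (samples, sequence_length) -> output shape (samples, sequence_length, 8)
emb = Embedding(input_dim=1000, output_dim=8)</pre>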
<p><span>The embedded sequences are then fed to <code>GRU</code> layers; like other types of recurrent networks, a GRU is good at learning patterns in sequences of data.</span></p>
<p><span>The embedding outputs for non-sequential inputs are simply flattened to 2 dimensions by the <code>Flatten</code> layer.</span></p>
<p>All layers, including “shipping”, are then concatenated into one big two-dimensional tensor.</p>
<p>This is followed by several Dense layers; the final output Dense layer uses a “<strong>linear</strong>” activation to regress to arbitrary price values, which is the same as specifying <strong>None</strong> for the <code>activation</code> parameter.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">Dense</span><span class="p">,</span> \
<span class="n">concatenate</span><span class="p">,</span> <span class="n">GRU</span><span class="p">,</span> <span class="n">Embedding</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">optimizers</span>
<span class="k">def</span> <span class="nf">get_model</span><span class="p">():</span>
<span class="c1">#Inputs</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">name</span><span class="o">=</span><span class="s">"name"</span><span class="p">)</span>
<span class="n">item_desc</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"item_desc"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">name</span><span class="o">=</span><span class="s">"item_desc"</span><span class="p">)</span>
<span class="n">brand</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">"brand"</span><span class="p">)</span>
<span class="n">category</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">"category"</span><span class="p">)</span>
<span class="n">category_name</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"category_name"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span>
<span class="n">name</span><span class="o">=</span><span class="s">"category_name"</span><span class="p">)</span>
<span class="n">item_condition</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">"item_condition"</span><span class="p">)</span>
<span class="n">shipping</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"shipping"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">name</span><span class="o">=</span><span class="s">"shipping"</span><span class="p">)</span>
<span class="c1">#Embeddings layers</span>
<span class="n">emb_size</span> <span class="o">=</span> <span class="mi">60</span>
<span class="n">emb_name</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_TEXT</span><span class="p">,</span> <span class="n">emb_size</span><span class="o">//</span><span class="mi">3</span><span class="p">)(</span><span class="n">name</span><span class="p">)</span>
<span class="n">emb_item_desc</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_TEXT</span><span class="p">,</span> <span class="n">emb_size</span><span class="p">)(</span><span class="n">item_desc</span><span class="p">)</span>
<span class="n">emb_category_name</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_TEXT</span><span class="p">,</span> <span class="n">emb_size</span><span class="o">//</span><span class="mi">3</span><span class="p">)(</span><span class="n">category_name</span><span class="p">)</span>
<span class="n">emb_brand</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_BRAND</span><span class="p">,</span> <span class="mi">10</span><span class="p">)(</span><span class="n">brand</span><span class="p">)</span>
<span class="n">emb_category</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_CATEGORY</span><span class="p">,</span> <span class="mi">10</span><span class="p">)(</span><span class="n">category</span><span class="p">)</span>
<span class="n">emb_item_condition</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_CONDITION</span><span class="p">,</span> <span class="mi">5</span><span class="p">)(</span><span class="n">item_condition</span><span class="p">)</span>
<span class="n">rnn_layer1</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span> <span class="p">(</span><span class="n">emb_item_desc</span><span class="p">)</span>
<span class="n">rnn_layer2</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">emb_category_name</span><span class="p">)</span>
<span class="n">rnn_layer3</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">emb_name</span><span class="p">)</span>
<span class="c1">#main layer</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">concatenate</span><span class="p">([</span>
<span class="n">Flatten</span><span class="p">()</span> <span class="p">(</span><span class="n">emb_brand</span><span class="p">)</span>
<span class="p">,</span> <span class="n">Flatten</span><span class="p">()</span> <span class="p">(</span><span class="n">emb_category</span><span class="p">)</span>
<span class="p">,</span> <span class="n">Flatten</span><span class="p">()</span> <span class="p">(</span><span class="n">emb_item_condition</span><span class="p">)</span>
<span class="p">,</span> <span class="n">rnn_layer1</span>
<span class="p">,</span> <span class="n">rnn_layer2</span>
<span class="p">,</span> <span class="n">rnn_layer3</span>
<span class="p">,</span> <span class="n">shipping</span>
<span class="p">])</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dr</span><span class="p">)(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">))</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dr</span><span class="p">)(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">))</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dr</span><span class="p">)(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">))</span>
<span class="c1">#output</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">"linear"</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">)</span>
<span class="c1">#model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">([</span><span class="n">name</span><span class="p">,</span> <span class="n">item_desc</span><span class="p">,</span> <span class="n">brand</span>
<span class="p">,</span> <span class="n">category</span><span class="p">,</span> <span class="n">category_name</span>
<span class="p">,</span> <span class="n">item_condition</span><span class="p">,</span> <span class="n">shipping</span><span class="p">],</span> <span class="n">output</span><span class="p">)</span>
<span class="c1">#optimizer = optimizers.RMSprop()</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optimizers</span><span class="o">.</span><span class="n">Adam</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">"mse"</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="n">optimizer</span><span class="p">)</span>
<span class="k">return</span> <span class="n">model</span>
</pre>
</div>
<p>I like visualization, so I plot the model structure as well.</p>
<p></p>
<p><img alt="model plot" src="https://gitcdn.xyz/repo/Tony607/blog_statics/master/images/kaggle_price/model.png"/></p>
<p>It can be done with the two lines of code below if you are curious.</p>
<p>You need to install the <a href="http://www.graphviz.org/download/">Graphviz executable</a> and pip install the <strong>graphviz</strong> and <strong>pydot</strong> packages before trying to plot.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">keras.utils</span> <span class="kn">import</span> <span class="n">plot_model</span>
<span class="n">plot_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">to_file</span><span class="o">=</span><span class="s">'model.png'</span><span class="p">,</span> <span class="n">show_shapes</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre>
</div>
<p>Training the model is easy; let's train it for 3 epochs. <code>X_train</code> is the dictionary we created earlier, mapping input names to Numpy arrays.</p>
<div class="highlight">
<pre><span class="n">epochs</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">512</span> <span class="o">*</span> <span class="mi">3</span>
<span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">dtrain</span><span class="o">.</span><span class="n">target</span>
<span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">epochs</span>
<span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span>
<span class="p">,</span> <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.01</span>
<span class="p">)</span>
</pre>
</div>
<h1>Evaluate the model</h1>
<p>The Kaggle challenge page has chosen "Root Mean Squared Logarithmic Error" (RMSLE) as the evaluation metric.</p>
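<p>For reference, with predicted prices p_i, actual prices a_i, and n samples, the metric is:</p>
<pre>RMSLE = sqrt( (1/n) * sum_i ( log(p_i + 1) - log(a_i + 1) )^2 )</pre>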
<p>The following code will take our trained model and compute the loss value given the validation data.</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="nf">rmsle</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">):</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">y_pred</span><span class="p">)</span>
<span class="n">to_sum</span> <span class="o">=</span> <span class="p">[(</span><span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">y_pred</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span> <span class="o">**</span> <span class="mf">2.0</span> \
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">pred</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">y_pred</span><span class="p">)]</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">to_sum</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)))</span> <span class="o">**</span> <span class="mf">0.5</span>
<span class="k">def</span> <span class="nf">eval_model</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="n">val_preds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">)</span>
<span class="n">val_preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expm1</span><span class="p">(</span><span class="n">val_preds</span><span class="p">)</span>
<span class="n">y_true</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dvalid</span><span class="o">.</span><span class="n">price</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">val_preds</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">v_rmsle</span> <span class="o">=</span> <span class="n">rmsle</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">" RMSLE error on dev test: "</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">v_rmsle</span><span class="p">))</span>
<span class="k">return</span> <span class="n">v_rmsle</span>
<span class="n">v_rmsle</span> <span class="o">=</span> <span class="n">eval_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</pre>
</div>
<h1>Generate file for submission</h1>
<p>If you are planning to generate actual prices for the test dataset and try your luck on Kaggle, this block of code will reverse the feature normalization process we discussed previously and write the prices to a CSV file.</p>
<div class="highlight">
<pre><span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">)</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expm1</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span>
<span class="n">submission</span> <span class="o">=</span> <span class="n">test</span><span class="p">[[</span><span class="s">"test_id"</span><span class="p">]][:</span><span class="n">test_len</span><span class="p">]</span>
<span class="n">submission</span><span class="p">[</span><span class="s">"price"</span><span class="p">]</span> <span class="o">=</span> <span class="n">preds</span><span class="p">[:</span><span class="n">test_len</span><span class="p">]</span>
<span class="n">submission</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"./submission.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</pre>
</div>
<h1>Summary</h1>
<p>We walked through how to predict prices given multiple input features: how to preprocess the text data, deal with missing data, and finally build, train, and evaluate the model.</p>
<p>Full source code posted on my <a href="https://github.com/Tony607/Pricing_Challenge">GitHub</a>.</p>
<p></p>How to do multi-class multi-label classification for news categories2017-11-18T07:03:47+00:002024-03-19T07:42:28+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/how-to-do-multi-class-multi-label-classification-for-news-categories/<p><img alt="classify" src="https://www.dlology.com/static/media/uploads/news_categories/classified.jpg"/></p>
<p>My <a href="https://www.dlology.com/blog/how-to-choose-last-layer-activation-and-loss-function/">previous post</a> shows how to choose the last-layer activation and loss function for different tasks. In this post, we focus on multi-class multi-label classification.</p>
<h2>Overview of the task</h2>
<p>We are going to use the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/">Reuters-21578</a> news dataset. Given a news article, our task is to assign it one or more tags. The dataset is divided into five main categories:</p>
<ul>
<li>Topics</li>
<li>Places</li>
<li>People</li>
<li>Organizations</li>
<li>Exchanges</li>
</ul>
<p>For example, one given news article could have these three tags, belonging to two categories:</p>
<ul>
<li>Places: <strong>USA</strong>, <strong>China</strong></li>
<li>Topics: <strong>trade</strong></li>
</ul>
<h2>Structure of the code</h2>
<ul>
<li>
<h3>Prepare documents and categories</h3>
<ol>
<li>Read the category files to acquire all 672 available tags from those 5 categories.</li>
<li>Read all the news files and find the 20 most common tags out of the 672; those are the ones we will use for classification (a counting sketch follows the table). Here is the list of those 20 tags. Each one is prefixed with its category for clarity. For instance, <strong>"pl_usa"</strong> means tag <strong>"Places: USA"</strong>,<strong> "to_trade"</strong> is<strong> "Topics: trade"</strong>, etc.
<table border="1" class="dataframe">
<thead>
<tr>
<th>Name</th>
<th>Type</th>
<th>Newslines</th>
</tr>
</thead>
<tbody>
<tr>
<th>619</th>
<td>pl_usa</td>
<td>Places</td>
<td>12542</td>
</tr>
<tr>
<th>35</th>
<td>to_earn</td>
<td>Topics</td>
<td>3987</td>
</tr>
<tr>
<th>0</th>
<td>to_acq</td>
<td>Topics</td>
<td>2448</td>
</tr>
<tr>
<th>616</th>
<td>pl_uk</td>
<td>Places</td>
<td>1489</td>
</tr>
<tr>
<th>542</th>
<td>pl_japan</td>
<td>Places</td>
<td>1138</td>
</tr>
<tr>
<th>489</th>
<td>pl_canada</td>
<td>Places</td>
<td>1104</td>
</tr>
<tr>
<th>73</th>
<td>to_money-fx</td>
<td>Topics</td>
<td>801</td>
</tr>
<tr>
<th>28</th>
<td>to_crude</td>
<td>Topics</td>
<td>634</td>
</tr>
<tr>
<th>45</th>
<td>to_grain</td>
<td>Topics</td>
<td>628</td>
</tr>
<tr>
<th>625</th>
<td>pl_west-germany</td>
<td>Places</td>
<td>567</td>
</tr>
<tr>
<th>126</th>
<td>to_trade</td>
<td>Topics</td>
<td>552</td>
</tr>
<tr>
<th>55</th>
<td>to_interest</td>
<td>Topics</td>
<td>513</td>
</tr>
<tr>
<th>514</th>
<td>pl_france</td>
<td>Places</td>
<td>469</td>
</tr>
<tr>
<th>412</th>
<td>or_ec</td>
<td>Organizations</td>
<td>349</td>
</tr>
<tr>
<th>481</th>
<td>pl_brazil</td>
<td>Places</td>
<td>332</td>
</tr>
<tr>
<th>130</th>
<td>to_wheat</td>
<td>Topics</td>
<td>306</td>
</tr>
<tr>
<th>108</th>
<td>to_ship</td>
<td>Topics</td>
<td>305</td>
</tr>
<tr>
<th>468</th>
<td>pl_australia</td>
<td>Places</td>
<td>270</td>
</tr>
<tr>
<th>19</th>
<td>to_corn</td>
<td>Topics</td>
<td>254</td>
</tr>
<tr>
<th>495</th>
<td>pl_china</td>
<td>Places</td>
<td>223</td>
</tr>
</tbody>
</table>
</li>
</ol>
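<p>As mentioned above, here is a minimal sketch of that counting step; <code>news_tags</code> (one list of tags per news article) is a hypothetical name, not necessarily the one used in the notebook:</p>
<div class="highlight">
<pre>from collections import Counter

# news_tags: hypothetical, e.g. [["pl_usa", "to_trade"], ["to_earn"], ...]
tag_counter = Counter(tag for tags in news_tags for tag in tags)
selected_categories = [tag for tag, _ in tag_counter.most_common(20)]
</pre>
</div>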
</li>
<li>
<h3>Clean up the data for model</h3>
</li>
</ul>
<p>In the previous step, we read the news contents and stored them in a list.</p>
<p>One news article looks like this:</p>
<pre>average yen cd rates fall in latest week
tokyo, feb 27 - average interest rates on yen certificates
of deposit, cd, fell to 4.27 pct in the week ended february 25
from 4.32 pct the previous week, the bank of japan said.
new rates (previous in brackets), were -
average cd rates all banks 4.27 pct (4.32)
money market certificate, mmc, ceiling rates for the week
starting from march 2 3.52 pct (3.57)
average cd rates of city, trust and long-term banks
less than 60 days 4.33 pct (4.32)
60-90 days 4.13 pct (4.37)
average cd rates of city, trust and long-term banks
90-120 days 4.35 pct (4.30)
120-150 days 4.38 pct (4.29)
150-180 days unquoted (unquoted)
180-270 days 3.67 pct (unquoted)
over 270 days 4.01 pct (unquoted)
average yen bankers' acceptance rates of city, trust and
long-term banks
30 to less than 60 days unquoted (4.13)
60-90 days unquoted (unquoted)
90-120 days unquoted (unquoted)
reuter</pre>
<p>We start the cleanup by (see the sketch after this list):</p>
<ul>
<li>keeping only characters in A-Za-z0-9</li>
<li>removing stop words (words like "in", "on", "from" that don't carry much specific information)</li>
<li>lemmatizing (e.g. turning the word "rates" into "rate")</li>
</ul>
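<p>A minimal sketch of these three steps, assuming NLTK's stopword list and WordNet lemmatizer (the notebook's exact implementation may differ):</p>
<div class="highlight">
<pre>import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the NLTK "stopwords", "wordnet" and "punkt" data packages
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_news(text):
    # Keep only A-Za-z0-9 and spaces; newlines are deleted rather than
    # replaced, which is why some words in the example below appear fused
    text = re.sub(r"[^A-Za-z0-9 ]+", "", text.lower())
    # Remove stop words such as "in", "on", "from"
    words = [w for w in word_tokenize(text) if w not in stop_words]
    # Lemmatize, e.g. "rates" becomes "rate"
    return " ".join(lemmatizer.lemmatize(w) for w in words)
</pre>
</div>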
<p>After this, our news text looks much "friendlier" to our model; each word is separated by a space.</p>
<pre>average yen cd rate fall latest week tokyo feb 27 average interest rate yen certificatesof deposit cd fell 427 pct week ended february 25from 432 pct previous week bank japan said new rate previous bracket average cd rate bank 427 pct 432 money market certificate mmc ceiling rate weekstarting march 2 352 pct 357 average cd rate city trust longterm bank le 60 day 433 pct 432 6090 day 413 pct 437 average cd rate city trust longterm bank 90120 day 435 pct 430 120150 day 438 pct 429 150180 day unquoted unquoted 180270 day 367 pct unquoted 270 day 401 pct unquoted average yen banker acceptance rate city trust andlongterm bank 30 le 60 day unquoted 413 6090 day unquoted unquoted 90120 day unquoted unquoted reuter</pre>
<p>Since a small portion of the news articles are quite long even after the cleanup, let's limit the maximum input sequence to 88 words; this covers 70% of all news in full length. We could have set a larger input sequence limit to cover more news, but that would also increase the model training time.</p>
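<p>A quick way to verify that coverage (a sketch, assuming <code>totalX</code> holds the cleaned news strings at this point):</p>
<div class="highlight">
<pre>import numpy as np

# Word counts per article; per the paragraph above, the 70th percentile
# lands near 88 words on this dataset
lengths = np.array([len(news.split()) for news in totalX])
print(np.percentile(lengths, 70))
maxLength = 88
</pre>
</div>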
<p>Lastly, we will turn the words into ids and pad each sequence to the input limit (88) if it is shorter.</p>
<p>Keras text processing makes this trivial.</p>
<div class="highlight">
<pre><span></span><span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="n">max_vocab_size</span> <span class="o">=</span> <span class="mi">200000</span>
<span class="n">input_tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">max_vocab_size</span><span class="p">)</span>
<span class="n">input_tokenizer</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">totalX</span><span class="p">)</span>
<span class="n">input_vocab_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">word_index</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">print</span><span class="p">(</span><span class="s2">"input_vocab_size:"</span><span class="p">,</span><span class="n">input_vocab_size</span><span class="p">)</span> <span class="c1"># input_vocab_size: 167135</span>
<span class="n">totalX</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">pad_sequences</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">totalX</span><span class="p">),</span>
<span class="n">maxlen</span><span class="o">=</span><span class="n">maxLength</span><span class="p">))</span>
</pre>
</div>
<p>The same news article now looks like this; each number represents a unique word in the vocabulary.</p>
<pre>array([ 6943, 5, 5525, 177, 22, 699, 13146, 1620, 32,
35130, 7, 130, 6482, 5, 8473, 301, 1764, 32,
364, 458, 794, 11, 442, 546, 131, 7180, 5,
5525, 18247, 131, 7451, 5, 8088, 301, 1764, 32,
364, 458, 794, 11, 21414, 131, 7452, 5, 4009,
35131, 131, 4864, 5, 6712, 35132, 131, 3530, 3530,
26347, 131, 5526, 5, 3530, 2965, 131, 7181, 5,
3530, 301, 149, 312, 1922, 32, 364, 458, 9332,
11, 76, 442, 546, 131, 3530, 7451, 18247, 131,
3530, 3530, 21414, 131, 3530, 3530, 3])</pre>
<p></p>
<ul>
<li>
<h3>Create and train model</h3>
</li>
</ul>
<ul>
<li>An Embedding layer turns each word id into a vector of size 256</li>
<li>GRU layers (recurrent layers) process the sequence data</li>
<li>A Dense layer outputs the classification result over the 20 tags; since one article can carry several tags at once, it uses a per-tag sigmoid with binary cross-entropy rather than softmax</li>
</ul>
<div class="highlight">
<pre><span></span><span class="n">embedding_dim</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Embedding</span><span class="p">(</span><span class="n">input_vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span><span class="n">input_length</span> <span class="o">=</span> <span class="n">maxLength</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="n">num_categories</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'sigmoid'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s1">'binary_crossentropy'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="s1">'adam'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">])</span>
<span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">totalX</span><span class="p">,</span> <span class="n">totalY</span><span class="p">,</span> <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre>
</div>
<ul>
<li>
<h3>Visualize the training performance</h3>
</li>
</ul>
<p>After training our model for 10 epochs in about 5 minutes, we have achieved the following result.</p>
<pre>loss: 0.1062 - acc: 0.9650 - val_loss: 0.0961 - val_acc: 0.9690</pre>
<p>The following code will generate a nice graph to visualize the progress of each training epochs.</p>
<div class="highlight">
<pre><span></span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="n">acc</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'acc'</span><span class="p">]</span>
<span class="n">val_acc</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'val_acc'</span><span class="p">]</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'loss'</span><span class="p">]</span>
<span class="n">val_loss</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'val_loss'</span><span class="p">]</span>
<span class="n">epochs</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">acc</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">acc</span><span class="p">,</span> <span class="s1">'bo'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Training acc'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">val_acc</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Validation acc'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Training and validation accuracy'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="s1">'bo'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Training loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Validation loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Training and validation loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre>
</div>
<p><img alt="visualize" src="https://www.dlology.com/static/media/uploads/news_categories/visualize_training.png"/></p>
<ul>
<li>
<h3>Make a prediction</h3>
</li>
</ul>
<p>Feed one cleaned-up news article (each word separated by a space) to the same input tokenizer to turn it into ids.</p>
<p>Call the model's <strong>predict</strong> method; the output will be a list of 20 floats representing the probabilities for those 20 tags. For demo purposes, let's take any tag with a probability larger than 0.2.</p>
<div class="highlight">
<pre><span></span><span class="n">textArray</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">pad_sequences</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">([</span><span class="n">input_x_220</span><span class="p">]),</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">maxLength</span><span class="p">))</span>
<span class="n">predicted</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">textArray</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">prob</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">predicted</span><span class="p">):</span>
<span class="k">if</span> <span class="n">prob</span> <span class="o">></span> <span class="mf">0.2</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">selected_categories</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
</pre>
</div>
<p>This produces three tags</p>
<pre>pl_uk
pl_japan
to_money-fx</pre>
<p>The ground truth is</p>
<pre>pl_japan
to_money-fx
to_interest</pre>
<p><span>The model got 2 out of 3 right for the given news.</span></p>
<h2><span>Summary</span></h2>
<p>We started by cleaning up the raw news data for model input, built a Keras model to do multi-class multi-label classification, and visualized the training result before making a prediction. Further improvements could be made:</p>
<ul>
<li>Clean up the data better</li>
<li>Use a longer input sequence limit</li>
<li>Train for more epochs</li>
</ul>
<p>The source code for the <a href="https://github.com/Tony607/Text_multi-class_multi-label_Classification">jupyter notebook is available on my GitHub</a> repo if you are interested.</p>How to Summarize Amazon Reviews with Tensorflow2017-10-14T12:56:49+00:002024-03-14T14:43:42+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/tutorial-summarizing-text-with-amazon-reviews/<div><img alt="reviews" src="https://gitcdn.link/cdn/Tony607/blog_statics/4d40dcea14a9ec03c4453219776ca258afc27f73/images/reviews.jpg"/></div>
<p>The objective of this project is to build a model that can create relevant summaries for reviews written about fine foods sold on Amazon. This dataset contains over 500,000 reviews and is hosted on <a href="https://www.kaggle.com/snap/amazon-fine-food-reviews">Kaggle</a>.</p>
<p>Here are two examples to show what the data looks like:</p>
<pre>Review # 1
<strong>Good Quality Dog Food</strong>
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
Review # 2
<strong>Not as Advertised</strong>
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".</pre>
<p>To build our model, we will use a two-layer bidirectional RNN with LSTMs on the input data, and two LSTM layers with Bahdanau attention on the target data.</p>
<p>The sections of this project are:<br/>1. Inspecting the Data<br/>2. Preparing the Data<br/>3. Building the Model<br/>4. Training the Model<br/>5. Making Our Own Summaries</p>
<p>This project was inspired by the post <a href="https://medium.com/towards-data-science/text-summarization-with-amazon-reviews-41801c2210b">Text Summarization with Amazon Reviews</a>, with a few improvements and updates to work with the latest TensorFlow version 1.3; those improvements yield better accuracy.</p>
<h2>Summary of improvements</h2>
<h3>1. Tokenize the sentence better</h3>
<p>The original code tokenizes the words with <strong>text.split()</strong>, which is not foolproof;</p>
<p>e.g. with words followed by punctuation, "Are you kidding?I think you are." would be incorrectly tokenized as</p>
<p>['Are', 'you', 'kidding?I', 'think', 'you', 'are.']</p>
<p>We use this line instead</p>
<div class="highlight">
<pre><span></span><span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s2">"[\w']+"</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</pre>
</div>
<p>which will correctly generate words list</p>
<p>['Are', 'you', 'kidding', 'I', 'think', 'you', 'are']</p>
<p></p>
<h3>2. Increased data preparation filter and sort speed</h3>
<p>The original author uses two for loops to sort and filter the data; here we use Python's built-in <strong>sort</strong> and <strong>filter</strong> functions to do the same thing much faster.</p>
<p><strong>Filter</strong> for the length limits and the number of <code><UNK></code>s.</p>
<p><strong>Sort</strong> the summaries and texts by the length of the element in <strong>texts</strong>, from shortest to longest.</p>
<div class="highlight">
<pre><span></span><span class="n">max_text_length</span> <span class="o">=</span> <span class="mi">83</span> <span class="c1"># This will cover up to 89.5% of lengths</span>
<span class="n">max_summary_length</span> <span class="o">=</span> <span class="mi">13</span> <span class="c1"># This will cover up to 99% of lengths</span>
<span class="n">min_length</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">unk_text_limit</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># text can contain up to 1 UNK word</span>
<span class="n">unk_summary_limit</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># Summary should not contain any UNK word</span>
<span class="k">def</span> <span class="nf">filter_condition</span><span class="p">(</span><span class="n">item</span><span class="p">):</span>
<span class="n">int_summary</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">int_text</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">if</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">int_summary</span><span class="p">)</span> <span class="o">>=</span> <span class="n">min_length</span> <span class="ow">and</span>
<span class="nb">len</span><span class="p">(</span><span class="n">int_summary</span><span class="p">)</span> <span class="o"><=</span> <span class="n">max_summary_length</span> <span class="ow">and</span>
<span class="nb">len</span><span class="p">(</span><span class="n">int_text</span><span class="p">)</span> <span class="o">>=</span> <span class="n">min_length</span> <span class="ow">and</span>
<span class="nb">len</span><span class="p">(</span><span class="n">int_text</span><span class="p">)</span> <span class="o"><=</span> <span class="n">max_text_length</span> <span class="ow">and</span>
<span class="n">unk_counter</span><span class="p">(</span><span class="n">int_summary</span><span class="p">)</span> <span class="o"><=</span> <span class="n">unk_summary_limit</span> <span class="ow">and</span>
<span class="n">unk_counter</span><span class="p">(</span><span class="n">int_text</span><span class="p">)</span> <span class="o"><=</span> <span class="n">unk_text_limit</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="n">int_text_summaries</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">int_summaries</span> <span class="p">,</span> <span class="n">int_texts</span><span class="p">))</span>
<span class="n">int_text_summaries_filtered</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">filter</span><span class="p">(</span><span class="n">filter_condition</span><span class="p">,</span> <span class="n">int_text_summaries</span><span class="p">))</span>
<span class="n">sorted_int_text_summaries</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">int_text_summaries_filtered</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">item</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
<span class="n">sorted_int_text_summaries</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">sorted_int_text_summaries</span><span class="p">))</span>
<span class="n">sorted_summaries</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">sorted_int_text_summaries</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">sorted_texts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">sorted_int_text_summaries</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="c1"># Delete those temporary varaibles</span>
<span class="k">del</span> <span class="n">int_text_summaries</span><span class="p">,</span> <span class="n">sorted_int_text_summaries</span><span class="p">,</span> <span class="n">int_text_summaries_filtered</span>
<span class="c1"># Compare lengths to ensure they match</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">sorted_summaries</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">sorted_texts</span><span class="p">))</span>
</pre>
</div>
<h3><span lang="EN-US">3. Connects RNN layers in encoder</span></h3>
<p>The original code is missing the line below, which is how we connect layers: by feeding the current layer's output to the next layer's input. Without it, the original code behaves like a single bidirectional RNN layer in the encoder.</p>
<div class="highlight">
<pre><span></span><span class="n">rnn_inputs</span> <span class="o">=</span> <span class="n">enc_output</span></pre>
</div>
<div class="highlight">
<pre><span></span><span class="k">def</span> <span class="nf">encoding_layer</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span> <span class="n">sequence_length</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">,</span> <span class="n">rnn_inputs</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">):</span>
<span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s1">'encoder_{}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">layer</span><span class="p">)):</span>
<span class="n">cell_fw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span>
<span class="n">initializer</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">random_uniform_initializer</span><span class="p">(</span><span class="o">-</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="n">cell_fw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">DropoutWrapper</span><span class="p">(</span><span class="n">cell_fw</span><span class="p">,</span>
<span class="n">input_keep_prob</span> <span class="o">=</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="n">cell_bw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span>
<span class="n">initializer</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">random_uniform_initializer</span><span class="p">(</span><span class="o">-</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="n">cell_bw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">DropoutWrapper</span><span class="p">(</span><span class="n">cell_bw</span><span class="p">,</span>
<span class="n">input_keep_prob</span> <span class="o">=</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="n">enc_output</span><span class="p">,</span> <span class="n">enc_state</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">bidirectional_dynamic_rnn</span><span class="p">(</span><span class="n">cell_fw</span><span class="p">,</span>
<span class="n">cell_bw</span><span class="p">,</span>
<span class="n">rnn_inputs</span><span class="p">,</span>
<span class="n">sequence_length</span><span class="p">,</span>
<span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">enc_output</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">enc_output</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="c1"># original code is missing this line below, that is how we connect layers </span>
<span class="c1"># by feeding the current layer's output to next layer's input</span>
<span class="n">rnn_inputs</span> <span class="o">=</span> <span class="n">enc_output</span>
<span class="k">return</span> <span class="n">enc_output</span><span class="p">,</span> <span class="n">enc_state</span>
</pre>
</div>
<h3>4. Decoding layers use<strong> MultiRNNCell</strong></h3>
<p>The original author uses a for loop to connect <code>num_layers</code> LSTMCells together; here we use <strong>MultiRNNCell</strong>, which is composed sequentially of multiple simple cells (<strong>BasicLSTMCell</strong>), to simplify the code.</p>
<div class="highlight">
<pre><span></span><span class="k">def</span> <span class="nf">lstm_cell</span><span class="p">(</span><span class="n">lstm_size</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">):</span>
<span class="n">cell</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">BasicLSTMCell</span><span class="p">(</span><span class="n">lstm_size</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">DropoutWrapper</span><span class="p">(</span><span class="n">cell</span><span class="p">,</span> <span class="n">input_keep_prob</span> <span class="o">=</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">decoding_layer</span><span class="p">(</span><span class="n">dec_embed_input</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">enc_output</span><span class="p">,</span> <span class="n">enc_state</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">text_length</span><span class="p">,</span> <span class="n">summary_length</span><span class="p">,</span>
<span class="n">max_summary_length</span><span class="p">,</span> <span class="n">rnn_size</span><span class="p">,</span> <span class="n">vocab_to_int</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">):</span>
<span class="sd">'''Create the decoding cell and attention for the training and inference decoding layers'''</span>
<span class="n">dec_cell</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">MultiRNNCell</span><span class="p">([</span><span class="n">lstm_cell</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">)])</span>
<span class="c1"># ......</span>
</pre>
</div>
<p></p>
<h2>The training result</h2>
<p>After 2 hours of training with a GPU, the loss went below 1 and settled at 0.707.</p>
<p>Here are some summaries generated with the trained model.</p>
<pre>- Review:
The coffee tasted great and was at such a good price! I highly recommend this to everyone!
- Summary:
great great coffee
- Review:
love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
- Summary:
great taste</pre>
<p>Check out the full source code on my <a href="https://github.com/Tony607/Summarizing_Text_Amazon_Reviews">GitHub</a>.</p>How to triage patient queries with Keras (1 minute training)2017-10-08T13:51:15+00:002024-03-19T08:12:55+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/tutorial-medical-triage-with-patient-query/<div><img alt="triage" src="https://gitcdn.link/repo/Tony607/blog_statics/master/images/triage.jpg"/></div>
<p>In this tutorial, we are going to build a model that triages patients based on their query text data.</p>
<p>For example</p>
<table width="299">
<tbody>
<tr>
<td width="193"><strong>query (input)</strong></td>
<td width="106"><strong>triage (output)</strong></td>
</tr>
<tr>
<td>Skin is quite itchy.</td>
<td>dermatology</td>
</tr>
<tr>
<td>Sore throat fever fatigue.</td>
<td>mouthface</td>
</tr>
<tr>
<td>Lower back hurt, so painful.</td>
<td>back</td>
</tr>
</tbody>
</table>
<p>We are going to use Keras with the TensorFlow (version 1.3.0) backend to build the model.</p>
<p>For source code and dataset used in this tutorial, check out my <a href="https://github.com/Tony607/Medical_Triage">GitHub repo</a>.</p>
<h2>Dependencies</h2>
<p>Python 3.5, numpy, pickle, keras, tensorflow, nltk, pandas</p>
<p></p>
<h2>About the data</h2>
<p>1261 patient queries, <strong>phrases_embed.csv</strong>, came from the <a href="https://blog.babylonhealth.com/how-the-chatbot-understands-sentences-fe6c5deb6e81">Babylon blog "How the chatbot understands sentences"</a>.</p>
<p>Check out the data visualization <a href="http://s3-eu-west-1.amazonaws.com/nils-demo/phrases.html">here</a>.</p>
<div><img alt="data visualization" src="https://cdn-images-1.medium.com/max/1600/1*sodiusH7tbwyPAfTfzoamw.png"/></div>
<h2>Preparing the data</h2>
<p>We will be doing the following steps to prepare data for training the model.</p>
<p>1. Read the data from the CSV file into a Pandas data frame, keeping only the two columns "Disease" and "class".</p>
<div><img alt="triage_csv" src="https://www.dlology.com/static/media/uploads/triage_csv.png"/></div>
<p>2. Convert the Pandas data frame into pairs of numpy arrays (a sketch of steps 1 and 2 follows below):</p>
<p>"Disease" columns ==> documents</p>
<p>"class" columns ==> body_positions</p>
<p>3. Clean up the data</p>
<p>For each sentence, we convert all letters to lower case, keep only English letters and numbers, and remove stop words, as shown below.</p>
<div class="highlight">
<pre><span></span><span class="n">strip_special_chars</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"[^A-Za-z0-9 ]+"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">cleanUpSentence</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">stop_words</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">"<br />"</span><span class="p">,</span> <span class="s2">" "</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">strip_special_chars</span><span class="p">,</span> <span class="s2">""</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="k">if</span> <span class="n">stop_words</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">words</span> <span class="o">=</span> <span class="n">word_tokenize</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="n">filtered_sentence</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
<span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stop_words</span><span class="p">:</span>
<span class="n">filtered_sentence</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">w</span><span class="p">)</span>
<span class="k">return</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">filtered_sentence</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">r</span>
</pre>
</div>
<p>4. The input <strong>tokenizer</strong> converts input words to ids and pads each input sequence to the max input length if it is shorter.</p>
<p>Save the input tokenizer since we need to use the same one to tokenize any new input data during prediction.</p>
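<p>A sketch of that bookkeeping, assuming the tokenizer is pickled (the variable and file names are assumptions):</p>
<div class="highlight">
<pre>import pickle

# Persist the fitted tokenizer so new queries are tokenized identically
# at prediction time
with open('input_tokenizer.p', 'wb') as f:
    pickle.dump(input_tokenizer, f)
</pre>
</div>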
<p>5. Convert output words to ids, then to categories (one-hot vectors).</p>
<p>7. Make a <strong>target_reverse_word_index</strong> to turn the predicted class ids back into text (a sketch of steps 5 and 7 follows below).</p>
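<p>A sketch of steps 5 and 7, reusing a Keras <code>Tokenizer</code> for the labels (an assumption; the notebook may encode the labels differently):</p>
<div class="highlight">
<pre>from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

target_tokenizer = Tokenizer()
target_tokenizer.fit_on_texts(body_positions)
# Keras word ids start at 1, so shift down by 1 before one-hot encoding
target_ids = [seq[0] - 1 for seq in target_tokenizer.texts_to_sequences(body_positions)]
totalY = to_categorical(target_ids)
# Reverse index to turn a predicted class id back into triage text
target_reverse_word_index = {v - 1: k for k, v in target_tokenizer.word_index.items()}
</pre>
</div>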
<h2>Build the model</h2>
<p>The model structure will look like this</p>
<pre>_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 18, 256) 232960
_________________________________________________________________
gru_1 (GRU) (None, 18, 256) 393984
_________________________________________________________________
gru_2 (GRU) (None, 256) 393984
_________________________________________________________________
dense_1 (Dense) (None, 19) 4883
=================================================================</pre>
<p>The <strong>embedding</strong> layer transforms word ids into their corresponding word embeddings; each output from the embedding layer has a size of (18 x 256), which is the <strong>maximum input sequence padding length</strong> times the <strong>embedding dimension</strong>.</p>
<p>The data is then passed to a recurrent layer to process the input sequence; we are using <strong>GRU</strong> here, but you can also try LSTM.</p>
<p>All the intermediate outputs are collected and then passed on to the <strong>second GRU</strong> layer.</p>
<p>The output is then sent to a <strong>fully connected layer</strong> that would give us our final prediction classes. We are using "<strong>softmax</strong>" <strong>activation</strong> to give us a probability for each class.</p>
<p>Use the standard <strong>'categorical_crossentropy'</strong> loss function for multiclass classification.</p>
<p>Use the <strong>"adam"</strong> optimizer since it adapts the learning rate.</p>
<div class="highlight">
<pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Embedding</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span><span class="n">input_length</span> <span class="o">=</span> <span class="n">maxLength</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="n">output_dimen</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'softmax'</span><span class="p">))</span>
<span class="n">tbCallBack</span> <span class="o">=</span> <span class="n">TensorBoard</span><span class="p">(</span><span class="n">log_dir</span><span class="o">=</span><span class="s1">'./Graph/medical_triage'</span><span class="p">,</span> <span class="n">histogram_freq</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">write_graph</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">write_images</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s1">'categorical_crossentropy'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="s1">'adam'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">])</span>
</pre>
</div>
<p>We will then train the model and save it for later prediction.</p>
<h2>Predict new data</h2>
<p>1. Load the model we save earlier.</p>
<p>2. Load the input tokenizer and tokenize a new patient query text, pad the sequence to max length</p>
<p>3. Feed the sequence to model, the model will output the class id along with the probability, we use "target_reverse_word_index" to turn the class id to actual triage result text.</p>
<p>Here are some predicted result</p>
<div><img alt="triage_predict" src="https://www.dlology.com/static/media/uploads/triage/triage_predict.png"/></div>
<p></p>
<h2>Summary</h2>
<p><span>Keras trained for 40 epochs, takes less than 1 minute with GPU (GTX 1070) final </span><span>acc:0.9146</span></p>
<div><img alt="triage result" src="https://www.dlology.com/static/media/uploads/triage/triage_acc.png"/></div>
<p><span>The training data size is relatively small, having larger datasets might increase the final accuracy.</span></p>
<p><span>Check out my <a href="https://github.com/Tony607/Medical_Triage">GitHub repo</a> for the Jupyter notebook source code and dataset.</span></p>An easy guide to Chinese Sentiment analysis with hotel review data2017-09-26T04:45:50+00:002024-03-18T08:36:17+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/tutorial-chinese-sentiment-analysis-with-hotel-review-data/<div><img alt="good_bad" src="https://gitcdn.link/repo/Tony607/blog_statics/master/images/good_bad.jpg"/></div>
<h6><span>For source code and dataset used in this tutorial, check out my </span><a href="https://github.com/Tony607/Chinese_sentiment_analysis"><g class="gr_ gr_39 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="39" id="39">github</g> repo</a><span>.</span></h6>
<h2>Dependencies<a class="anchor-link" href="http://localhost:8888/notebooks/chinese_sentiment_analysis.ipynb#Dependencies"></a></h2>
<p>Python 3.5, numpy, pickle, keras, <g class="gr_ gr_42 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="42" id="42">tensorflow</g>,<span> </span><a href="https://github.com/fxsjy/jieba" target="_blank"><g class="gr_ gr_43 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="43" id="43">jieba</g></a></p>
<p></p>
<h2>About the data</h2>
<p>Customer hotel reviews, including</p>
<p>2916 positive reviews and 3000 negative reviews</p>
<h3>Optional for plotting<a class="anchor-link" href="http://localhost:8888/notebooks/chinese_sentiment_analysis.ipynb#Optional-for-plotting"></a></h3>
<p><g class="gr_ gr_49 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling" data-gr-id="49" id="49">pylab</g>, <g class="gr_ gr_50 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling" data-gr-id="50" id="50">scipy</g></p>
<p></p>
<h2><span lang="EN-US">Key difference compared to English dataset</span></h2>
<h3><span lang="EN-US">File Encoding</span></h3>
<p><span lang="EN-US"><span>Some data files contain abnormal encoding characters which encoding GB2312 will complain about. Solution: read as bytes then decode as GB2312 line by line, skip lines with abnormal encodings. We also convert any traditional Chinese characters to simplified Chinese characters.</span></span></p>
<div class="highlight">
<pre><span></span><span class="n">documents</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">positiveFiles</span><span class="p">:</span>
<span class="n">text</span> <span class="o">=</span> <span class="s2">""</span>
<span class="k">with</span> <span class="n">codecs</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s2">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">doc_file</span><span class="p">:</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">doc_file</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s2">"GB2312"</span><span class="p">)</span>
<span class="k">except</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">text</span><span class="o">+=</span><span class="n">Converter</span><span class="p">(</span><span class="s1">'zh-hans'</span><span class="p">)</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">"</span><span class="se">\r</span><span class="s2">"</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span>
<span class="n">documents</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">text</span><span class="p">,</span> <span class="s2">"pos"</span><span class="p">))</span>
</pre>
</div>
<p></p>
<h3>Convert from traditional to simplified Chinese (繁体转简体)</h3>
<p><span>Have those two files download from</span></p>
<p><span><a href="https://github.com/skydark/nstools/blob/master/zhtools/langconv.py">langconv</a>.py</span></p>
<p><span><a href="https://github.com/skydark/nstools/blob/master/zhtools/zh_wiki.py">zh_wiki</a>.py</span></p>
<p><span>those two lines below will convert string "<strong>line"</strong> from traditional to simplified Chinese.</span></p>
<div class="highlight">
<pre><span></span><span class="kn">from</span> <span class="nn">langconv</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">Converter</span><span class="p">(</span><span class="s1">'zh-hans'</span><span class="p">)</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</pre>
</div>
<h3><span lang="EN-US">Tokenize</span></h3>
<p><span lang="EN-US">Use <a href="https://github.com/fxsjy/jieba" target="_blank"><g class="gr_ gr_47 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="47" id="47">jieba</g></a> to tokenize <g class="gr_ gr_46 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="46" id="46">chinese</g> sentences, then join the list of tokens <g class="gr_ gr_48 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="48" id="48">seperated</g> by spaces.</span></p>
<p><span lang="EN-US">We then feed the string to Keras Tokenizer which <g class="gr_ gr_51 gr-alert gr_gramm gr_inline_cards gr_run_anim Grammar multiReplace" data-gr-id="51" id="51">expect</g> each sentence with words tokens <g class="gr_ gr_41 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="41" id="41">seperated</g> by spaces.</span></p>
<div class="highlight">
<pre><span></span><span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="kn">import</span> <span class="nn">jieba</span>
<span class="n">seg_list</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cut_all</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">seg_list</span><span class="p">)</span>
<span class="c1"># totalX = [text , .....]</span>
<span class="c1"># maxLength is the sentence words length to keep</span>
<span class="n">input_tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="mi">30000</span><span class="p">)</span>
<span class="n">input_tokenizer</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">totalX</span><span class="p">)</span>
<span class="n">input_vocab_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">word_index</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">totalX</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">pad_sequences</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">totalX</span><span class="p">),</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">maxLength</span><span class="p">))</span>
</pre>
</div>
<h3><span lang="EN-US">Chinese stop words</span></h3>
<p><span>First get a list of stop words from the file <strong><g class="gr_ gr_56 gr-alert gr_gramm gr_hide gr_inline_cards gr_run_anim Style multiReplace replaceWithoutSep replaceWithoutSep" data-gr-id="56" id="56">chinese_stop_words.txt</g></strong><g class="gr_ gr_56 gr-alert gr_gramm gr_hide gr_inline_cards gr_disable_anim_appear Style multiReplace replaceWithoutSep replaceWithoutSep" data-gr-id="56" id="56"> ,</g> then check each tokenized Chinese words against this list</span></p>
<div class="highlight">
<pre><span></span><span class="n">stopwords</span> <span class="o">=</span> <span class="p">[</span> <span class="n">line</span><span class="o">.</span><span class="n">rstrip</span><span class="p">()</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'./data/chinese_stop_words.txt'</span><span class="p">,</span><span class="s2">"r"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">"utf-8"</span><span class="p">)</span> <span class="p">]</span>
<span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">documents</span><span class="p">:</span>
<span class="n">seg_list</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">doc</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">cut_all</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">final</span> <span class="o">=</span><span class="p">[]</span>
<span class="n">seg_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">seg_list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">seg</span> <span class="ow">in</span> <span class="n">seg_list</span><span class="p">:</span>
<span class="k">if</span> <span class="n">seg</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">:</span>
<span class="n">final</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">seg</span><span class="p">)</span>
</pre>
</div>
<h3>Result</h3>
<p><span>Keras trained for 20 epochs, takes 7 minutes 14 seconds with GPU (GTX 1070)</span></p>
<p><span>acc:0.9726</span></p>
<div><img alt="result" src="https://www.dlology.com/static/media/uploads/ch_sentiment_result.png"/></div>
<p><span></span></p>
<h4 id="Try-some-new-comments,-feel-free-to-try-your-own">Try some new comments</h4>
<div><img alt="prediction" src="https://www.dlology.com/static/media/uploads/ch_sentiment_predict.png"/></div>
<p><span> </span></p>
<p>For the Python Jupyter notebook source code and dataset, check out my <a href="https://github.com/Tony607/Chinese_sentiment_analysis"><g class="gr_ gr_40 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="40" id="40">github</g> repo</a>.</p>
<p>For an updated word-level English model, <span>check out my other blog: </span><a href="https://www.dlology.com/blog/simple-stock-sentiment-analysis-with-news-data-in-keras/">Simple Stock Sentiment Analysis with news data in Keras</a><span>.</span></p>