Keras + Universal Sentence Encoder = Transfer Learning for text data



We are going to build a Keras model that leverages the pre-trained "Universal Sentence Encoder" to classify a given question into one of six categories.

TensorFlow Hub modules can be applied to a variety of transfer learning tasks and datasets, whether images or text. The "Universal Sentence Encoder" is one of the many newly published TensorFlow Hub reusable modules: a self-contained piece of a TensorFlow graph with its pre-trained weights included.

A runnable Colab notebook is available, so you can experiment with the code while reading on.

What is the Universal Sentence Encoder and how was it trained

You can choose to treat all TensorFlow Hub modules as black boxes, agnostic of what happens inside, and still build a functional transfer learning model. However, a deeper understanding gives you a new perspective on what each module is capable of, what its constraints are, and how good the transfer learning result could potentially be.

Universal Sentence Encoder vs. word embeddings

If you recall the GloVe word embedding vectors from our previous tutorial, which turn a word into a 50-dimensional vector, the Universal Sentence Encoder is much more powerful: it can embed not only words but also phrases and sentences. That is, it takes variable-length English text as input and outputs a 512-dimensional vector. Handling variable-length text input sounds great, but the catch is that the longer a sentence gets, counted in words, the more diluted the embedding is likely to be. And since the model was trained at the word level, it will likely find typos and difficult words challenging to process. For more on the difference between word-level and character-level language models, you can read my previous tutorial.

There are two Universal Sentence Encoders to choose from, with different encoder architectures that serve distinct design goals. One is based on the Transformer architecture and targets high accuracy at the cost of greater model complexity and resource consumption. The other uses a deep averaging network (DAN) and targets efficient inference with slightly reduced accuracy.

Side-by-side comparison of the model architectures of the Transformer and DAN sentence encoders.


The original Transformer model consists of an encoder and a decoder, but here we only use its encoder part.

The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. Since the model contains no recurrence and no convolution, it must inject some information about the relative or absolute position of the tokens in order to make use of the order of the sequence; that is what the "positional encodings" do. The Transformer-based encoder achieves the best overall transfer task performance. However, this comes at the cost of compute time and memory usage scaling dramatically with sentence length.
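
To make the idea of positional encodings more concrete, here is a minimal NumPy sketch of the sinusoidal scheme described in the original Transformer paper. It is only an illustration: nothing above says which positional scheme the Universal Sentence Encoder uses internally, and the function name `sinusoidal_positional_encoding` and the sizes are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) matrix of sinusoidal encodings."""
    positions = np.arange(num_positions)[:, np.newaxis]   # shape (P, 1)
    dims = np.arange(d_model)[np.newaxis, :]               # shape (1, D)
    # Each pair of dimensions (2i, 2i + 1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                       # shape (P, D)
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

# Hypothetical sizes, chosen only to keep the example small.
pe = sinusoidal_positional_encoding(num_positions=10, d_model=16)
print(pe.shape)
```

Each row is added to the embedding of the token at that position, so two identical tokens at different positions end up with different inputs to the attention layers.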

The Deep Averaging Network (DAN) is much simpler: input embeddings for words and bi-grams are first averaged together and then passed through a feedforward deep neural network (DNN) to produce sentence embeddings. The primary advantage of the DAN encoder is that compute time is linear in the length of the input sequence.
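
The averaging step can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the actual encoder: the weights are random, the layer sizes and the tanh activation are made up, and the helper name `dan_encode` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dan_encode(token_embeddings, weights, biases):
    """Average the token embeddings, then apply a small feedforward stack."""
    hidden = token_embeddings.mean(axis=0)   # a single pass over the tokens
    for w, b in zip(weights, biases):
        hidden = np.tanh(hidden @ w + b)     # activation choice is an assumption
    return hidden

embed_dim, hidden_dim, out_dim = 8, 16, 4    # toy sizes
weights = [rng.normal(size=(embed_dim, hidden_dim)),
           rng.normal(size=(hidden_dim, out_dim))]
biases = [np.zeros(hidden_dim), np.zeros(out_dim)]

short_sentence = rng.normal(size=(3, embed_dim))    # 3 token embeddings
long_sentence = rng.normal(size=(30, embed_dim))    # 30 token embeddings

short_vec = dan_encode(short_sentence, weights, biases)
long_vec = dan_encode(long_sentence, weights, biases)
print(short_vec.shape, long_vec.shape)
```

Both inputs produce a vector of the same size, and only the averaging step depends on the sentence length. Note that a plain average ignores token order, which is one reason bi-gram embeddings are averaged in alongside the word embeddings.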

The type of training data and the chosen training metric can have a significant impact on the transfer learning result.

Both models were trained with the Stanford Natural Language Inference (SNLI) corpus. The SNLI corpus is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). Essentially, the models were trained to learn the semantic similarity between the sentence pairs.

With that in mind, the sentence embeddings can be trivially used to compute sentence-level semantic similarity scores.


The source code to generate the similarity heat map is available both in my Colab notebook and in the GitHub repo. Each cell is colored based on the inner product of the encodings of the two sentences, so the more similar two sentences are, the darker the color is.
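
The matrix behind such a heat map is just the table of pairwise inner products. The sketch below uses three hand-made unit vectors as stand-ins for real 512-dimensional sentence embeddings, and the helper name `similarity_matrix` is hypothetical.

```python
import numpy as np

def similarity_matrix(embeddings):
    """Pairwise inner products between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    return np.inner(embeddings, embeddings)

# Toy unit-length vectors standing in for sentence embeddings.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],   # close to the first vector
    [0.0, 0.0, 1.0],   # orthogonal to the first vector
])

sim = similarity_matrix(toy_embeddings)
print(np.round(sim, 2))
```

Entry (i, j) is the score for sentences i and j, so the matrix is symmetric. With unit-length vectors the diagonal is 1 and the inner product equals the cosine similarity.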

Loading Universal Sentence Encoder and computing the embeddings for some text can be as easy as below. 

import tensorflow as tf
import tensorflow_hub as hub

# Set this to the TF Hub URL of the Universal Sentence Encoder module to load.
module_url = ""
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# Compute a representation for each message, showing various lengths supported.
messages = ["That band rocks!", "That song is really cool."]

with tf.Session() as session:
  # Initialize the module's variables and lookup tables before running it
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  message_embeddings = session.run(embed(messages))

The first time you load the module, it can take a while since it downloads the weight files.

The value of message_embeddings is two arrays, one per sentence, each holding 512 floating point numbers.

array([[ 0.06587551, 0.02066354, -0.01454356, ..., 0.06447642, 0.01654527, -0.04688655], [ 0.06909196, 0.01529877, 0.03278331, ..., 0.01220771, 0.03000253, -0.01277521]], dtype=float32)

Question classification task and data preprocessing

To respond correctly to a question given a large collection of texts, classifying questions into fine-grained classes is crucial when question answering is treated as a retrieval task. Our goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process. For example, consider the question "What Canadian city has the largest population?" The hope is to classify this question as having answer type location, implying that only candidate answers that are locations need consideration.

The dataset we use is the TREC Question Classification dataset. It contains 5452 training and 500 test samples, that is, 5452 + 500 questions, each categorized into one of six labels.

  1. ABBR - 'abbreviation': expression abbreviated, etc.
  2. DESC - 'description and abstract concepts': manner of an action, description of something, etc.
  3. ENTY - 'entities': animals, colors, events, food, etc.
  4. HUM - 'human beings': a group or organization of persons, an individual, etc.
  5. LOC - 'locations': cities, countries, etc.
  6. NUM - 'numeric values': postcodes, dates, speed, temperature, etc.

We want our model to be a multiclass classification model that takes strings as input and outputs a probability for each of the 6 class labels. With this in mind, you know how to prepare the training and testing data for it.

The first step is to turn the raw text file into a pandas DataFrame and set the "label" column to be a categorical column, so that we can later access each label as a numeric value.

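import re
import pandas as pd

def get_dataframe(filename):
    # Each line holds a label token followed by the question text
    with open(filename, 'r') as f:
        lines = f.read().splitlines()
    data = []
    for line in lines:
        # Keep only the coarse label, i.e. the part before the colon
        label = line.split(' ')[0].split(":")[0]
        text = ' '.join(line.split(' ')[1:])
        # Drop characters outside a whitelist of letters, digits and punctuation
        text = re.sub(r'[^A-Za-z0-9 ,\?\'\"-._\+\!/\`@=;:]+', '', text)
        data.append([label, text])

    df = pd.DataFrame(data, columns=['label', 'text'])
    # A categorical column gives every label a numeric code
    df.label = df.label.astype('category')
    return df

df_train = get_dataframe('train_5500.txt')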
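
To see what the label parsing does to a single record, here is a small standalone sketch. The sample line is made up for illustration (it reuses the question from earlier and assumes a "COARSE:fine question text" layout, as the split on ":" suggests), and the helper name `parse_trec_line` is hypothetical.

```python
def parse_trec_line(line):
    """Split 'COARSE:fine question text' into (coarse_label, text)."""
    tag, _, text = line.partition(' ')    # the first token is the label tag
    coarse_label = tag.split(':')[0]      # keep the part before the colon
    return coarse_label, text

# Illustrative record, not copied from the dataset file.
sample = "LOC:city What Canadian city has the largest population ?"
label, text = parse_trec_line(sample)
print(label, '|', text)
```

Only the coarse label survives, which is why the model ends up with six output classes rather than one per fine-grained sub-label.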

The first 5 training samples look like this.


In the next step, we prepare the input/output data for the model: the input as a list of question strings, and the output as a list of one-hot encoded labels. If you are not yet familiar with one-hot encoding, I've got you covered in part of my previous post.

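import numpy as np

train_text = df_train['text'].tolist()
# Shape (num_samples, 1): one string per row, matching the model's input shape
train_text = np.array(train_text, dtype=object)[:, np.newaxis]

# One-hot encode the categorical labels
train_label = np.asarray(pd.get_dummies(df_train.label), dtype=np.int8)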

If you take a peek at the value of train_label, you will see it in one-hot encoded form.

array([[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0] ...], dtype=int8)
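
If the mapping between labels and one-hot rows feels opaque, the following NumPy-only sketch reproduces the idea on four toy labels. It assumes the one-hot columns follow the sorted order of the distinct labels, and it is an illustration rather than the code used for the real training data.

```python
import numpy as np

labels = np.array(['NUM', 'LOC', 'HUM', 'LOC'])   # toy labels

# np.unique returns the sorted distinct labels and, for every sample,
# the index of its label within that sorted list.
categories, codes = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(categories), dtype=np.int8)[codes]
print(categories.tolist())
print(one_hot)

# argmax along each row recovers the class index, and so the label.
recovered = categories[one_hot.argmax(axis=1)]
print(recovered.tolist())
```

The same argmax trick is what turns the model's predicted probabilities back into label names later on.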

Now we are ready to build the model.

Keras meets Universal Sentence Encoder

We have previously loaded the Universal Sentence Encoder as the variable "embed". To have it work nicely with Keras, it is necessary to wrap it in a Keras Lambda layer and explicitly cast its input as a string.

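def UniversalEmbedding(x):
    # Squeeze only the feature axis so that (batch, 1) becomes (batch,),
    # which also holds when the batch contains a single sample.
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=[1]),
                 signature="default", as_dict=True)["default"]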
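
The squeeze step exists because the Keras input has shape (batch, 1) while the encoder expects a flat batch of strings. The sketch below uses NumPy's `squeeze` as a stand-in for `tf.squeeze` to show the shapes involved; the sample questions are made up.

```python
import numpy as np

batch = np.array([["What is Keras ?"], ["Who wrote it ?"]], dtype=object)
single = batch[:1]   # a batch that holds one sample, shape (1, 1)

squeezed_batch = np.squeeze(batch, axis=1)     # drop only the feature axis
squeezed_single = np.squeeze(single, axis=1)   # still a batch of one
squeezed_all = np.squeeze(single)              # drops every size-1 axis

print(squeezed_batch.shape, squeezed_single.shape, squeezed_all.shape)
```

Squeezing without an axis collapses a one-sample batch to a scalar, so naming the axis explicitly is the safer choice when batches of size 1 can occur.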

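Then we build the Keras model with its standard Functional API.

from keras import layers
from keras.models import Model

category_counts = len(df_train.label.cat.categories)  # 6 classes
input_text = layers.Input(shape=(1,), dtype=tf.string)
# The encoder returns one 512-dimensional vector per input string
embedding = layers.Lambda(UniversalEmbedding, output_shape=(512,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(category_counts, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])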

We can view the model summary and see that only the Keras layers are trainable. That is how the transfer learning task works: the Universal Sentence Encoder weights are left untouched.

Layer (type)            Output Shape    Param #
input_1 (InputLayer)    (None, 1)       0
lambda_1 (Lambda)       (None, 512)     0
dense_1 (Dense)         (None, 256)     131328
dense_2 (Dense)         (None, 6)       1542
Total params: 132,870
Trainable params: 132,870
Non-trainable params: 0
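
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per unit. A quick sketch of the arithmetic, where `dense_params` is a hypothetical helper:

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

hidden = dense_params(512, 256)   # 512-d sentence embedding -> 256 units
output = dense_params(256, 6)     # 256 units -> 6 classes
total = hidden + output
print(hidden, output, total)      # 131328 1542 132870
```

The Lambda layer contributes 0 to the count because the encoder's weights live inside the TensorFlow Hub module rather than in Keras layers.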

In the next step, we train the model with the training dataset and validate its performance at the end of each training epoch with the test dataset.

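from keras import backend as K
with tf.Session() as session:
  K.set_session(session)
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  # test_text and test_label are built from the test set like the training arrays
  history = model.fit(train_text, train_label, epochs=10,
            validation_data=(test_text, test_label))
  model.save_weights('./model.h5')  # file name is illustrative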

The final validation result shows that the highest accuracy reaches around 97% after training for 10 epochs.
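
For reference, the accuracy reported during validation is the fraction of samples whose most probable predicted class matches the one-hot label. Here is a small NumPy sketch with made-up numbers; the helper name `categorical_accuracy` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def categorical_accuracy(one_hot_labels, probabilities):
    """Share of rows where the argmax of the prediction matches the label."""
    true_idx = np.argmax(one_hot_labels, axis=1)
    pred_idx = np.argmax(probabilities, axis=1)
    return float(np.mean(true_idx == pred_idx))

# Toy data: 4 samples, 3 classes.
toy_labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.4, 0.3],   # wrong: predicts class 1, label is 2
                      [0.2, 0.5, 0.3]])

acc = categorical_accuracy(toy_labels, toy_probs)
print(acc)   # 0.75
```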

After we have the model trained and its weights saved to a file, it is ready to make predictions on new questions.

Here we come up with 3 new questions for the model to classify.

new_text = ["In what year did the titanic sink ?", 
            "What is the highest peak in California ?", 
            "Who invented the light bulb ?"]

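from keras import backend as K

new_text = np.array(new_text, dtype=object)[:, np.newaxis]
with tf.Session() as session:
  K.set_session(session)
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  # Path of the weights file saved after training (name is illustrative)
  model.load_weights('./model.h5')
  predicts = model.predict(new_text, batch_size=32)

# Category order matches the one-hot columns produced by pd.get_dummies
categories = df_train.label.cat.categories.tolist()
# argmax gives the index of the most probable class for each question
predict_logits = predicts.argmax(axis=1)
predict_labels = [categories[logit] for logit in predict_logits]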

The classification results look decent.

['NUM', 'LOC', 'HUM']

Conclusion and further reading

Congratulations! You have built a Keras text transfer learning model powered by the Universal Sentence Encoder and achieved a great result on the question classification task. The Universal Sentence Encoder can embed longer paragraphs, so feel free to experiment with other datasets, such as news topic classification or sentiment analysis.

Some related resources you might find useful.

TensorFlow Hub

TensorFlow Hub example notebooks

For an intro to using Google Colab notebooks, you can read the first section of my post, How to run Object Detection and Segmentation on a Video Fast for Free.

The source code is available in my GitHub repo and as a runnable Colab notebook.
