How to Summarize Amazon Reviews with Tensorflow



The objective of this project is to build a model that can create relevant summaries for reviews written about fine foods sold on Amazon. The dataset contains more than 500,000 reviews and is hosted on Kaggle.

Here are two examples to show what the data looks like:

Review # 1
Good Quality Dog Food
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.

Review # 2
Not as Advertised
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".

To build our model we will use a two-layered bidirectional RNN with LSTMs on the input data and two layers, each with an LSTM using Bahdanau attention on the target data.

The sections of this project are:
1. Inspecting the Data
2. Preparing the Data
3. Building the Model
4. Training the Model
5. Making Our Own Summaries

This project was inspired by the post Text Summarization with Amazon Reviews, with a few improvements and updates to work with the latest TensorFlow version (1.3); these improvements yield better accuracy.

Summary of improvements

1. Tokenize the sentence better

The original code tokenizes the text with text.split(), which is not foolproof.

For example, a word followed directly by punctuation, as in "Are you kidding?I think you are.", would be incorrectly tokenized as

['Are', 'you', 'kidding?I', 'think', 'you', 'are.']

We use this line instead:

text = re.findall(r"[\w']+", text)

which correctly generates the word list

['Are', 'you', 'kidding', 'I', 'think', 'you', 'are']
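The difference is easy to see side by side; a minimal sketch (the tokenize helper name is ours, just for illustration):

```python
import re

def tokenize(text):
    # Extract runs of word characters and apostrophes, dropping punctuation
    return re.findall(r"[\w']+", text)

sentence = "Are you kidding?I think you are."
print(sentence.split())   # ['Are', 'you', 'kidding?I', 'think', 'you', 'are.']
print(tokenize(sentence)) # ['Are', 'you', 'kidding', 'I', 'think', 'you', 'are']
```

Note that the apostrophe is included in the character class, so contractions like "don't" survive as single tokens.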

2. Increased data preparation filter and sort speed

The original author uses two for loops to sort and filter the data; here we use Python's built-in sorted and filter functions to do the same thing much faster.

Filter by length limits and by the number of <UNK> tokens.

Sort the summaries and texts by the length of the texts, from shortest to longest.

max_text_length = 83 # Covers up to 89.5% of text lengths
max_summary_length = 13 # Covers up to 99% of summary lengths
min_length = 2
unk_text_limit = 1 # A text can contain up to 1 <UNK> word
unk_summary_limit = 0 # A summary should not contain any <UNK> words

def unk_counter(sentence):
    '''Count the number of <UNK> tokens in a sentence of word ids'''
    return sum(1 for word in sentence if word == vocab_to_int["<UNK>"])

def filter_condition(item):
    int_summary, int_text = item
    # Keep only pairs within the length limits and <UNK> limits
    return (min_length <= len(int_summary) <= max_summary_length and
            min_length <= len(int_text) <= max_text_length and
            unk_counter(int_summary) <= unk_summary_limit and
            unk_counter(int_text) <= unk_text_limit)

int_text_summaries = list(zip(int_summaries , int_texts))
int_text_summaries_filtered = list(filter(filter_condition, int_text_summaries))
sorted_int_text_summaries = sorted(int_text_summaries_filtered, key=lambda item: len(item[1]))
sorted_int_text_summaries = list(zip(*sorted_int_text_summaries))
sorted_summaries = list(sorted_int_text_summaries[0])
sorted_texts = list(sorted_int_text_summaries[1])
# Delete those temporary variables
del int_text_summaries, sorted_int_text_summaries, int_text_summaries_filtered
# Compare lengths to ensure they match
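The whole filter-and-sort step can be exercised on toy data; a minimal self-contained sketch, assuming the summaries and texts are already converted to lists of integer ids and (purely for this example) that id 0 stands for <UNK>:

```python
UNK = 0  # stand-in for vocab_to_int["<UNK>"] in this toy example
max_text_length, max_summary_length, min_length = 83, 13, 2
unk_text_limit, unk_summary_limit = 1, 0

def unk_counter(sentence):
    # Count <UNK> ids in a sentence of word ids
    return sum(1 for word in sentence if word == UNK)

def filter_condition(item):
    int_summary, int_text = item
    return (min_length <= len(int_summary) <= max_summary_length and
            min_length <= len(int_text) <= max_text_length and
            unk_counter(int_summary) <= unk_summary_limit and
            unk_counter(int_text) <= unk_text_limit)

# Toy data: the second pair has an <UNK> in its summary, so it is dropped
int_summaries = [[5, 6], [7, 0], [8, 9]]
int_texts = [[1, 2, 3, 4], [1, 2], [3, 4, 5]]

pairs = list(filter(filter_condition, zip(int_summaries, int_texts)))
pairs.sort(key=lambda item: len(item[1]))  # shortest text first
sorted_summaries = [s for s, _ in pairs]
sorted_texts = [t for _, t in pairs]
print(sorted_texts)  # [[3, 4, 5], [1, 2, 3, 4]]
```

Because filter and sorted/list.sort run in C, this avoids the overhead of the original nested Python loops.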

3. Connect the RNN layers in the encoder

The original code is missing the line below; that assignment connects the layers by feeding the current layer's output into the next layer's input. Without it, the original code behaves like a single bidirectional RNN layer in the encoder.

rnn_inputs = enc_output
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw,
                                                    input_keep_prob=keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw,
                                                    input_keep_prob=keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw,
                                                                    cell_bw,
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
            # Concatenate the forward and backward outputs
            enc_output = tf.concat(enc_output, 2)
            # The original code is missing this line: feed the current
            # layer's output into the next layer's input
            rnn_inputs = enc_output
    return enc_output, enc_state
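The effect of that one assignment can be illustrated without TensorFlow at all. A toy sketch (toy_layer is a stand-in of our own invention, not a real RNN): without the assignment, every loop iteration re-reads the original inputs, so the "layers" run in parallel on the same data instead of being stacked.

```python
def toy_layer(x, layer_id):
    # Stand-in for one bidirectional RNN layer: a simple transform of its input
    return [v + layer_id for v in x]

def stacked(inputs, num_layers):
    rnn_inputs = inputs
    for layer in range(num_layers):
        output = toy_layer(rnn_inputs, layer + 1)
        rnn_inputs = output  # the fix: this layer's output feeds the next layer
    return output

def not_stacked(inputs, num_layers):
    for layer in range(num_layers):
        output = toy_layer(inputs, layer + 1)  # bug: always reads the original inputs
    return output

print(stacked([0], 2))      # [3] -- both layers applied (0 + 1 + 2)
print(not_stacked([0], 2))  # [2] -- only the last layer's transform survives
```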

4. Decoding layers use MultiRNNCell

The original author uses a for loop to connect num_layers LSTM cells; here we use MultiRNNCell, which composes multiple simple cells (BasicLSTMCell) sequentially, to simplify the code.

def lstm_cell(lstm_size, keep_prob):
    cell = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    return tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob = keep_prob)

def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length,
                   max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    dec_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell(rnn_size, keep_prob) for _ in range(num_layers)])
    # ......
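Conceptually, MultiRNNCell just chains cells so that one call to the composite runs the input through each cell in turn. A toy sketch of that composition pattern (the ToyCell classes are illustrative stand-ins, ignoring RNN state):

```python
class ToyCell:
    def __init__(self, weight):
        self.weight = weight
    def __call__(self, x):
        # Stand-in for one LSTM cell step
        return x * self.weight

class ToyMultiCell:
    def __init__(self, cells):
        self.cells = cells
    def __call__(self, x):
        # Run the input through each cell in sequence
        for cell in self.cells:
            x = cell(x)
        return x

dec_cell = ToyMultiCell([ToyCell(w) for w in (2, 3)])
print(dec_cell(1))  # 6
```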

Training results

After 2 hours of training on a GPU, the loss dropped below 1 and settled at 0.707.

Here are some summaries generated with the trained model.

- Review:
 The coffee tasted great and was at such a good price! I highly recommend this to everyone!
- Summary:
 great great coffee

- Review:
 love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
- Summary:
 great taste

Check out the full source code on my GitHub.
