The objective of this project is to build a model that can generate relevant summaries for reviews of fine foods sold on Amazon. The dataset contains more than 500,000 reviews and is hosted on Kaggle.
Here are two examples to show what the data looks like:

Review #1
Good Quality Dog Food
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.

Review #2
Not as Advertised
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
To build our model we will use a two-layer bidirectional RNN with LSTMs on the input data, and two LSTM layers with Bahdanau attention on the target data.
The sections of this project are:
1. Inspecting the Data
2. Preparing the Data
3. Building the Model
4. Training the Model
5. Making Our Own Summaries
Inspired by the post Text Summarization with Amazon Reviews, with a few improvements and updates to work with TensorFlow 1.3; these improvements yield better accuracy.
The original code tokenizes the words with text.split(), which is not foolproof:
words followed by punctuation, e.g. "Are you kidding?I think you are.", would be incorrectly tokenized as
['Are', 'you', 'kidding?I', 'think', 'you', 'are.']
We use this line instead
text = re.findall(r"[\w']+", text)
which will correctly generate the word list
['Are', 'you', 'kidding', 'I', 'think', 'you', 'are']
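To see the difference, a quick self-contained check of the two tokenization approaches:

```python
import re

text = "Are you kidding?I think you are."

# Naive whitespace tokenization keeps punctuation glued to words
print(text.split())
# ['Are', 'you', 'kidding?I', 'think', 'you', 'are.']

# Regex tokenization matches runs of word characters (and apostrophes),
# so punctuation no longer merges adjacent words
print(re.findall(r"[\w']+", text))
# ['Are', 'you', 'kidding', 'I', 'think', 'you', 'are']
```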
The original author uses two for loops to sort and filter the data; here we use Python's built-in sorted and filter functions to do the same thing much faster.
Filter for length limits and the number of <UNK>s.
Sort the summaries and texts by the length of each element in texts, from shortest to longest.
max_text_length = 83     # This will cover up to 89.5% of text lengths
max_summary_length = 13  # This will cover up to 99% of summary lengths
min_length = 2
unk_text_limit = 1       # A text can contain up to 1 <UNK> word
unk_summary_limit = 0    # A summary should not contain any <UNK> words

def filter_condition(item):
    int_summary = item[0]
    int_text = item[1]
    return (min_length <= len(int_summary) <= max_summary_length and
            min_length <= len(int_text) <= max_text_length and
            unk_counter(int_summary) <= unk_summary_limit and
            unk_counter(int_text) <= unk_text_limit)

int_text_summaries = list(zip(int_summaries, int_texts))
int_text_summaries_filtered = list(filter(filter_condition, int_text_summaries))
sorted_int_text_summaries = sorted(int_text_summaries_filtered, key=lambda item: len(item[1]))
sorted_int_text_summaries = list(zip(*sorted_int_text_summaries))
sorted_summaries = list(sorted_int_text_summaries[0])
sorted_texts = list(sorted_int_text_summaries[1])

# Delete the temporary variables
del int_text_summaries, sorted_int_text_summaries, int_text_summaries_filtered

# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))
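The filter relies on an unk_counter helper defined earlier in the project; a minimal sketch of what it does, assuming sentences are lists of word ids and the <UNK> id is looked up in vocab_to_int (the sample mapping below is an assumption for illustration):

```python
# Assumed toy mapping; in the project, vocab_to_int is built from the corpus
vocab_to_int = {'<UNK>': 0, 'great': 1, 'coffee': 2}

def unk_counter(sentence):
    '''Count how many <UNK> ids appear in a sentence of word ids.'''
    return sum(1 for word in sentence if word == vocab_to_int['<UNK>'])

print(unk_counter([0, 1, 2, 0]))  # 2
```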
The original code is missing the line below, which is how we connect the layers: by feeding the current layer's output into the next layer's input. Without it, the original code behaves like a single bidirectional RNN layer in the encoder.
rnn_inputs = enc_output
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                          initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw,
                          input_keep_prob=keep_prob)
            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                          initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw,
                          input_keep_prob=keep_prob)
            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw,
                                                                    cell_bw,
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
            enc_output = tf.concat(enc_output, 2)
            # The original code is missing this line; it connects the layers
            # by feeding the current layer's output into the next layer's input
            rnn_inputs = enc_output
    return enc_output, enc_state
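Stripped of TensorFlow, the fix is just threading each layer's output back in as the next layer's input; a minimal sketch with hypothetical layer functions standing in for the bidirectional RNN layers:

```python
def make_layer(offset):
    # Hypothetical stand-in for one RNN layer's transformation
    return lambda xs: [x + offset for x in xs]

layers = [make_layer(1), make_layer(10)]

inputs = [0, 1, 2]
for layer in layers:
    # Feed the current layer's output into the next layer's input;
    # without this reassignment every layer would see the raw inputs
    inputs = layer(inputs)

print(inputs)  # [11, 12, 13]
```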
The original author uses a for loop to connect num_layers LSTMCells; here we use MultiRNNCell, which composes multiple simple cells (BasicLSTMCell) sequentially, to simplify the code.
def lstm_cell(lstm_size, keep_prob):
    cell = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    return tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)

def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length,
                   max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    dec_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell(rnn_size, keep_prob) for _ in range(num_layers)])
    # ......
After two hours of training on a GPU, the loss dropped below 1 and settled at 0.707.
Here are some summaries generated with the trained model.
- Review: The coffee tasted great and was at such a good price! I highly recommend this to everyone!
  Summary: great great coffee
- Review: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
  Summary: great taste
Check out the full source code on my GitHub.