How to generate realistic Yelp restaurant reviews with Keras


TL;DR

After reading this article, you will be able to build a model that generates 5-star Yelp reviews like these.

Samples of generated review text (unmodified)

<SOR>I had the steak, mussels with a side of chicken parmesan. All were very good. We will be back.<EOR>
<SOR>The food, service, atmosphere, and service are excellent. I would recommend it to all my friends<EOR>
<SOR>Good atmosphere, amazing food and great service.Service is also pretty good. Give them a try!<EOR>

I will show you how to:

  • Acquire and prepare the training data.
  • Build the character-level language model.
  • Train the model, with some practical tips.
  • Generate random reviews.

Training the model could easily take a couple of days, even on a GPU. Luckily, pre-trained model weights are available, so we can jump directly to the fun part and generate reviews.

Getting the data ready

The Yelp Dataset is freely available in JSON format.

After downloading and extracting, you will find the two files we need in the dataset folder:

  • review.json
  • business.json

Those two files are quite large, especially the review.json file (3.7 GB).

Each line of review.json is a JSON string representing one review. The two files do not have the opening and closing square brackets "[ ]", so the file contents as a whole are not valid JSON. On top of that, the whole review.json file may not fit into memory. So let's first convert them to CSV format line by line with our helper script.

python json_converter.py ./dataset/review.json
python json_converter.py ./dataset/business.json
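For reference, here is a minimal sketch of what such a converter could look like (the actual json_converter.py in the repo may differ). It streams the file line by line, so it never holds the full 3.7 GB in memory:

import csv
import json
import sys

def json_to_csv(json_path):
    """Convert a line-delimited JSON file to CSV, one record at a time."""
    csv_path = json_path.replace('.json', '.csv')
    with open(json_path, encoding='utf8') as fin, \
         open(csv_path, 'w', newline='', encoding='utf8') as fout:
        first = json.loads(fin.readline())  # the first record defines the columns
        writer = csv.DictWriter(fout, fieldnames=first.keys())
        writer.writeheader()
        writer.writerow(first)
        for line in fin:
            writer.writerow(json.loads(line))

if __name__ == '__main__':
    json_to_csv(sys.argv[1])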

After that, you will find these two files in the dataset folder:

  • review.csv
  • business.csv

These two are valid CSV files we can open with the pandas library.

Here is the plan: we extract only the 5-star review texts from businesses that have the 'Restaurants' tag in their categories.

import pandas as pd

# Read the two CSV files into pandas DataFrames
df_business = pd.read_csv('../dataset/business.csv')
df_review = pd.read_csv('../dataset/review.csv')
# Keep only businesses tagged 'Restaurants'
restaurants = df_business[df_business['categories'].str.contains('Restaurants', na=False)]
# Keep only 5-star reviews
five_star = df_review[df_review['stars'] == 5]
# Merge the reviews with the restaurants on the 'business_id' key.
# This keeps only 5-star restaurant reviews.
combo = pd.merge(restaurants, five_star, on='business_id')
# Keep only the review text column
rnn_fivestar_reviews_only = combo[['text']]

Next, let's remove the newline characters inside the reviews, as well as any duplicated reviews.

# Remove newline characters
rnn_fivestar_reviews_only = rnn_fivestar_reviews_only.replace({r'\n+': ''}, regex=True)
# Remove duplicated reviews
final = rnn_fivestar_reviews_only.drop_duplicates()

To show the model where a review starts and ends, we need to add special markers to the review texts.

So one line in the final prepared reviews will look like this:

"<SOR>Hummus is amazing and fresh! Loved the falafels. I will definitely be back. Great owner, friendly staff<EOR>"

Build the model

The model we are building here is a character-level language model, meaning the minimum distinguishable symbol is a character. You may also come across word-level models, where the input is a sequence of word tokens.

There are some pros and cons for the character-level language model.

Pros:

  • Don’t have to worry about unknown vocabulary.
  • Able to learn large vocabulary.

Cons:

  • It ends up with very long sequences, so it is not as good as word-level models at capturing long-range dependencies, i.e., how the earlier part of a sentence affects the later part.
  • Character-level models are also more computationally expensive to train.

The model is quite similar to the official lstm_text_generation.py demo code, except that we stack RNN layers. Stacking allows the model to store more information in the hidden states between the input and output layers, and it generates more realistic Yelp reviews.

Before showing the code for the model, let's look a little deeper at how stacking RNNs works.

You may have seen stacking in a standard neural network (that is, Dense layers in Keras).

The first layer takes the input x and computes the activation value a[1]; the next stacked layer takes a[1] and computes the next activation value a[2].

[Figure: stacked Dense layers]

Stacking RNNs is a bit like stacking a standard neural network, then "unrolling it in time".

In this notation, a[l]<t> means the activation value of layer l at timestep t.

[Figure: stacked RNN layers unrolled in time]

Let's take a look at how an activation value is computed.

To compute a[2]<3>, there are two inputs: a[2]<2> and a[1]<3>.

Here g is the activation function, and Wa[2] and ba[2] are the layer 2 parameters:

a[2]<3> = g(Wa[2] [a[2]<2>, a[1]<3>] + ba[2])
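In numpy terms, the computation above looks like this (a toy illustration; the layer size and random values are made up):

import numpy as np

n = 4                            # hidden units in layer 2 (made-up size)
g = np.tanh                      # the activation function
Wa2 = np.random.randn(n, 2 * n)  # layer 2 weights
ba2 = np.random.randn(n)         # layer 2 bias
a2_2 = np.random.randn(n)        # a[2]<2>: layer 2 activation at timestep 2
a1_3 = np.random.randn(n)        # a[1]<3>: layer 1 activation at timestep 3

# a[2]<3> = g(Wa[2] [a[2]<2>, a[1]<3>] + ba[2])
a2_3 = g(Wa2 @ np.concatenate([a2_2, a1_3]) + ba2)
print(a2_3.shape)  # (4,)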

As we can see, to stack RNNs, the previous RNN layer needs to return all of its timestep activations a<t> to the subsequent RNN layer.

By default, an RNN layer such as LSTM in Keras only returns the last timestep activation value a<T>. To return all timesteps' activation values, we set the return_sequences parameter to True.
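A quick way to see the difference is to compare output shapes (a toy check with a small LSTM; only the shapes matter here):

from keras import layers, models
import numpy as np

last_only = models.Sequential([layers.LSTM(8, input_shape=(60, 95))])
all_steps = models.Sequential([layers.LSTM(8, input_shape=(60, 95), return_sequences=True)])
x = np.zeros((1, 60, 95))
print(last_only.predict(x).shape)  # (1, 8): only the last timestep a<T>
print(all_steps.predict(x).shape)  # (1, 60, 8): activations for all 60 timesteps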

So here is how we build the model in Keras. Each input sample is a one-hot representation of 60 characters, and there are 95 possible characters in total.

Each output is a list of 95 predicted probabilities, one for each character.

import keras
from keras import layers

model = keras.models.Sequential()
# The first LSTM returns the full sequence so the second LSTM can consume it
model.add(layers.LSTM(1024, input_shape=(60, 95), return_sequences=True))
model.add(layers.LSTM(1024))
model.add(layers.Dense(95, activation='softmax'))
# Compile so that model.fit in the training section works;
# the optimizer choice here is an assumption
optimizer = keras.optimizers.RMSprop(lr=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

And here is the graphical model structure to help you visualize it.

[Figure: model structure]

Training the model

The idea is simple: we train the model with input/output pairs, where each input is 60 characters and the corresponding output is the immediately following character.

In the data preparation step, we created a list of clean 5-star review texts, 1,214,016 reviews in total. To simplify training, we only train on reviews that are 250 characters or less, which leaves 418,955 reviews.

Then we shuffle the order of the reviews, so we don't train on 100 reviews for the same restaurant in a row.
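As a sketch, the filtering, shuffling, and writing out could look like this (the file name matches the training loop below; the exact original script may differ):

# Keep reviews of 250 characters or less, then shuffle the rows
short = final[final['text'].str.len() <= 250].sample(frac=1, random_state=42)
# Write one review per line (the newline separator is an assumption)
with open('../dataset/short_reviews_shuffle.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(short['text']))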

We read all reviews as one long text string, then create a Python dictionary (i.e., a hash table) to map each character to an index from 0 to 94 (95 unique characters in total).

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

The text corpus has 72,662,807 characters in total, which is hard to process as a whole, so let's break it into chunks of 90,000 characters each.

For each chunk of the corpus, we generate pairs of inputs and outputs by sliding a pointer from the beginning to the end of the chunk, one character at a time when step is set to 1.

import numpy as np

def getDataFromChunk(txtChunk, maxlen=60, step=1):
    """Cut a text chunk into (60-char input, next-char target) pairs and one-hot encode them."""
    sentences = []
    next_chars = []
    # Slide a window of `maxlen` characters over the chunk, one character at a time
    for i in range(0, len(txtChunk) - maxlen, step):
        sentences.append(txtChunk[i : i + maxlen])
        next_chars.append(txtChunk[i + maxlen])
    print('nb sequences:', len(sentences))
    print('Vectorization...')
    # One-hot encode inputs and targets; `chars` and `char_indices` are the globals defined above
    X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(sentences), len(chars)), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            X[i, t, char_indices[char]] = 1
        y[i, char_indices[next_chars[i]]] = 1
    return [X, y]

Training one chunk for one epoch takes 219 seconds on a GPU (GTX 1070), so one pass over the full corpus takes about 2 days:

72,662,807 / 90,000 chunks × 219 s/chunk ≈ 176,813 s ≈ 2.0 days

Two Keras callbacks come in handy here: ModelCheckpoint and ReduceLROnPlateau.

ModelCheckpoint saves the weights every time the loss improves.

The ReduceLROnPlateau callback automatically reduces the learning rate when the loss metric stops decreasing. Its main benefit is that we don't need to tune the learning rate manually; its main weakness is that the learning rate only ever decreases, it is never raised again.

from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# Save the weights every time the loss improves, and decay the learning rate on plateaus
filepath = "Feb-22-all-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,
                             save_best_only=True, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5,
                              patience=1, min_lr=0.00001)
callbacks_list = [checkpoint, reduce_lr]

The code to train the model for 20 passes over the corpus looks like this:

for iteration in range(1, 21):  # 20 passes over the corpus
    print('Iteration', iteration)
    with open("../dataset/short_reviews_shuffle.txt") as f:
        # Read the corpus in 90,000-character chunks until EOF
        for chunk in iter(lambda: f.read(90000), ""):
            X, y = getDataFromChunk(chunk)
            model.fit(X, y, batch_size=128, epochs=1, callbacks=callbacks_list)

As you might guess, the full 20 passes would take a month or so. But in my case, training for about 2 hours already produced some promising results, so feel free to give it a try.

Generate 5-star reviews

Whether you jumped right to this section or read through the previous ones, here is the fun part!

With the pre-trained model weights, or weights you trained yourself, we can generate some interesting Yelp reviews.

Here is the idea: we "seed" the model with an initial 60 characters and ask it to predict the very next character. We then append the sampled character to the seed, drop the seed's first character, and repeat.

[Figure: generating a sample, one character at a time]

The "sampling index" process will add some variety to the final result by generating some randomness with the given prediction.

If the temperature is very small, it will always pick the index with highest predicted probability.

def sample(preds, temperature=1.0):
    '''
    Sample a character index from `preds`, a list of
    predicted probabilities. A very small temperature
    almost always picks the index with the highest
    predicted probability; a high one adds randomness.
    '''
    preds = np.asarray(preds).astype('float64')
    # Reweight the distribution by the temperature
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample from the reweighted distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

Before generating, we need a 60-character seed for generated_text. A simple approach (a sketch; any 60-character snippet works) is to cut one at random from the training text:
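import random
import sys  # used by the generation loop below

# Cut a random 60-character seed out of the corpus (assumes `text` and
# `maxlen` from the training section; the temperature value is an example)
start_index = random.randint(0, len(text) - maxlen - 1)
generated_text = text[start_index: start_index + maxlen]
temperature = 0.5
print('--- Generating with seed: "%s"' % generated_text)

With the seed in place, the following loop generates 300 characters: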

# We generate 300 characters
for i in range(300):
    sampled = np.zeros((1, maxlen, len(chars)))
    # Turn each char to char index.
    for t, char in enumerate(generated_text):
        sampled[0, t, char_indices[char]] = 1.
    # Predict next char probabilities
    preds = model.predict(sampled, verbose=0)[0]
    # Add some randomness by sampling given probabilities.
    next_index = sample(preds, temperature)
    # Turn char index to char.
    next_char = chars[next_index]
    # Append char to generated text string
    generated_text += next_char
    # Pop the first char in generated text string.
    generated_text = generated_text[1:]
    # Print the new generated char.
    sys.stdout.write(next_char)
    sys.stdout.flush()
print(generated_text)

Summary and Further reading

In this post, you learned how to build and train a character-level text generation model from beginning to end. The source code is available on my GitHub repo, as well as the pre-trained model to play with.

The model shown here is trained in a many-to-one fashion. There is also an alternative implementation in a many-to-many fashion: consider the input sequence as the 7 characters "The cak", with the expected output "he cake". You can check it out here: char_rnn_karpathy_keras.
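The input/target shift in the many-to-many setup is easy to see in code (a tiny illustration, not a full model):

seq = "The cake"
x = seq[:-1]  # "The cak": the input sequence
y = seq[1:]   # "he cake": the target is the same sequence shifted by one character
print(x, '->', y)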

For a reference on building a word-level model, check out my other blog post: Simple Stock Sentiment Analysis with news data in Keras.
