Simple Stock Sentiment Analysis with news data in Keras


Have you ever wondered what impact everyday news might have on the stock market? In this tutorial, we are going to explore and build a model that reads the top 25 most up-voted world news posts from Reddit users and predicts whether the Dow Jones will go up or down on a given day.

After reading this post, you will learn:

  • How to pre-process text data for a deep learning sequence model.
  • How to use pre-trained GloVe embedding vectors to initialize a Keras Embedding layer.
  • How to build a GRU model that processes word sequences and takes word order into account.

Now let's get started. Read till the end, since there will be a secret bonus.

Text data pre-processing


For the input text, we are going to concatenate all 25 daily news headlines into one long string for each day.

After that, we are going to convert all sentences to lower case and remove characters, such as numbers and punctuation, that cannot be represented by the GloVe embeddings later.
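As a rough sketch, assuming the news/DJIA data is loaded into a pandas DataFrame with headline columns Top1 through Top25 (the file name and column layout here are assumptions; adjust them to your own dataset), the concatenation and cleaning could look like this:

import re
import pandas as pd

# Assumed file name and column layout; adjust to your own dataset.
df = pd.read_csv("Combined_News_DJIA.csv")
headline_cols = ["Top{}".format(i) for i in range(1, 26)]

# Concatenate the 25 headlines of each day into one long string.
df["combined_news"] = df[headline_cols].astype(str).agg(" ".join, axis=1)

# Lower-case and strip everything that is not a letter or whitespace,
# since numbers and punctuation are not covered by the GloVe vocabulary we use.
df["combined_news"] = (df["combined_news"]
                       .str.lower()
                       .str.replace(r"[^a-z\s]", " ", regex=True))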

The next step is to convert all your training sentences into lists of indices, then zero-pad all those lists so that their length is the same.

It is helpful to visualize the length distribution across all input samples before deciding the maximum sequence length.

[Figure: distribution of sentence lengths across all input samples]

Keep in mind that the longer the maximum length we pick, the longer it will take to train the model. So instead of choosing the longest sequence length in our dataset, which is around 700, we are going to pick 500 as a tradeoff: it covers the majority of the text across all samples while keeping training time relatively short.
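For reference, a quick way to produce such a length plot is a simple histogram of word counts (the variable name combined_news is assumed to hold the concatenated daily news strings from the earlier step):

import matplotlib.pyplot as plt

# combined_news: iterable of the concatenated daily news strings (assumed name).
lengths = [len(text.split()) for text in combined_news]

plt.hist(lengths, bins=50)
plt.xlabel("Number of words per sample")
plt.ylabel("Number of samples")
plt.title("Distribution of sequence lengths")
plt.show()

max_len = 500  # the tradeoff chosen from the plot above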

The embedding layer

In Keras, the embedding matrix is represented as a "layer" that maps positive integers (indices corresponding to words) into dense vectors of fixed size (the embedding vectors). It can be trained or initialized with a pre-trained embedding. In this part, you will learn how to create an Embedding layer in Keras and initialize it with the GloVe 50-dimensional vectors. Because our training set is quite small, we will not update the word embeddings but will instead leave their values fixed. I will show you how Keras allows you to set whether the embedding is trainable or not.

The Embedding() layer takes an integer matrix of size (batch size, max input length) as input; this corresponds to sentences converted into lists of indices (integers), as shown in the figure below.

[Figure: the Embedding layer maps lists of word indices to sequences of word vectors]

The following function handles the first step of converting sentence strings into arrays of indices. The word-to-index mapping is taken from the GloVe embedding file so we can seamlessly convert the indices to word vectors later.

import numpy as np

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`. 
    
    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary containing the each word mapped to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this. 
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    
    m = X.shape[0]                                   # number of training examples
    
    # Initialize X_indices as a numpy matrix of zeros and the correct shape
    X_indices = np.zeros((m, max_len), dtype=int)
    
    for i in range(m):                               # loop over training examples
        
        # Convert the ith training sentence to lower case and split it into words. You should get a list of words.
        sentence_words = [w.lower() for w in X[i].split()]
        
        # Initialize j to 0
        j = 0
        
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            if w in word_to_index:
                X_indices[i, j] = word_to_index[w]
                # Increment j to j + 1
                j += 1
                if j >= max_len:
                    break
            
    return X_indices
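A quick sanity check of the function might look like this (the example sentence is only illustrative, and the exact indices you see depend on the GloVe word-to-index mapping you loaded):

X1 = np.array(["the market rose today"])
X1_indices = sentences_to_indices(X1, word_to_index, max_len=10)
print(X1_indices.shape)   # (1, 10)
print(X1_indices[0])      # word indices followed by zero padding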

After that, we can implement the pre-trained embedding layer like so.

  • Initialize the embedding matrix as a numpy array of zeros with the correct shape. (vocab_len, dimension of word vectors)
  • Fill the embedding matrix with all the word embeddings.
  • Define the Keras embedding layer and make it non-trainable by setting trainable to False.
  • Set the weights of the embedding layer to the embedding matrix.

from keras.layers.embeddings import Embedding

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define Keras embedding layer with the correct input/output sizes and make it non-trainable by setting trainable=False.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

Let's have a quick check of the embedding layer by asking for the vector representation of the word "cat".

embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
embedding_layer.get_weights()[0][word_to_index['cat']]
# array([ 0.45281 , -0.50108 , ... 0.71278 ,  0.23782 ], dtype=float32)

The result is a 50-dimensional array. You can further explore the word vectors and measure similarity using cosine similarity, or solve word analogy problems such as "Man is to Woman as King is to __".
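As a small illustration, cosine similarity can be computed directly from word_to_vec_map (the similarity values you get depend on the embedding file you loaded; the example words are arbitrary):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors: values close to 1 mean very similar words.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(word_to_vec_map["stock"], word_to_vec_map["market"]))
print(cosine_similarity(word_to_vec_map["stock"], word_to_vec_map["banana"]))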

Build and evaluate the model

The task for the model is to take the news string sequence and make a binary classification of whether the Dow Jones close value rose or fell compared to the previous close. It outputs "1" if the value rose or stayed the same, and "0" when the value decreased.
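If you were deriving this label yourself from raw daily close prices (the dataset used here already provides a label column, so this is just a sketch assuming a DataFrame of DJIA closes sorted by date):

# djia: assumed DataFrame with a "Close" column sorted by date.
# Label is 1 when the close rose or stayed the same, 0 when it decreased.
djia["Label"] = (djia["Close"].diff() >= 0).astype(int)
# The first row has no previous close; drop it or handle it separately.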

We are building a simple model that contains two stacked GRU layers after the pre-trained embedding layer, with a Dense layer generating the final output through a sigmoid activation. GRU is a type of recurrent network that processes sequences and takes word order into account; it is similar to LSTM in functionality and performance but less computationally expensive to train.

from keras.models import Sequential
from keras.layers import GRU, Dense, Activation

model = Sequential()
model.add(pretrained_embedding_layer(word_to_vec_map, word_to_index))
model.add(GRU(128, dropout=0.2, return_sequences=True)) 
model.add(GRU(128, dropout=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Next, we can train and evaluate the model.

batch_size = 32  # assumed value; adjust as needed

history = model.fit(X_train_indices, Y_train, batch_size=batch_size, epochs=10,
          validation_data=(X_test_indices, Y_test))

model.save("./model.h5")
score, acc = model.evaluate(X_test_indices, Y_test,
                            batch_size=batch_size)

It is also helpful to generate the ROC curve for our binary classifier to assess its performance visually.

[Figure: ROC curve of the GRU model]

Our model is about 2.8% better than a random guess of the market trend.

For more information about ROC and AUC, you can read my other blog - Simple guide on how to generate ROC plot for Keras classifier.
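A minimal sketch for plotting the ROC curve with scikit-learn, assuming the test indices and labels from the training step above:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Predicted probabilities of the positive class ("market rose or stayed the same").
y_pred_prob = model.predict(X_test_indices).ravel()

fpr, tpr, thresholds = roc_curve(Y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label="GRU model (AUC = %0.3f)" % roc_auc)
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc="lower right")
plt.show()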

Conclusion and further thoughts

In this post, we introduced a quick and simple way to build a Keras model with an Embedding layer initialized with pre-trained GloVe embeddings. Here are some things you can try after reading this post:

  • Make the Embedding layer weights trainable, train the model from scratch, then compare the results.
  • Increase the maximum sequence length and see how that impacts model performance and training time.
  • Incorporate other inputs to form a multi-input Keras model, since other factors might correlate with stock index fluctuations; for example, the MACD (Moving Average Convergence/Divergence oscillator) or the Momentum indicator. To build a multi-input model, you can use the Keras functional API, as shown in the sketch after this list.
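As a rough illustration of the multi-input idea (the technical-indicator input, its dimensionality, and the branch names here are made up for the example), a functional API version could look like this:

from keras.models import Model
from keras.layers import Input, GRU, Dense, concatenate

# Text branch: same pre-trained embedding + GRU stack as before.
news_input = Input(shape=(max_len,), dtype="int32", name="news")
x = pretrained_embedding_layer(word_to_vec_map, word_to_index)(news_input)
x = GRU(128, dropout=0.2)(x)

# Second branch: hypothetical technical indicators such as MACD and momentum.
indicators_input = Input(shape=(2,), name="indicators")

merged = concatenate([x, indicators_input])
output = Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[news_input, indicators_input], outputs=output)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])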

Any ideas to improve the model? Comment and share your thoughts.

You can find the full source code and training data in my GitHub repo.

Bonus for investors


If you are new to the whole investment world, like I was years ago, you may wonder where to start, preferably investing for free with zero commissions. By learning how to trade stocks for free, you'll not only save money, but your investments will potentially compound at a faster rate. Robinhood, one of the best investing apps, does just that: whether you are buying one share or 100, there are no commissions. It was built from the ground up to be as efficient as possible by cutting out the fat and passing the savings on to customers. Join Robinhood, and we'll both get a stock like Apple, Ford, or Sprint for free. Make sure you use my shared link.
