Have you ever wondered what impact everyday news might have on the stock market? In this tutorial, we are going to explore and build a model that reads the top 25 voted world news items from Reddit users and predicts whether the Dow Jones will go up or down on a given day.
After reading this post, you will learn,
Now let's get started. Read till the end, since there is a secret bonus.
For the input text, we are going to concatenate all 25 news items into one long string for each day.
After that, we are going to convert all sentences to lower case and remove characters such as numbers and punctuation that cannot be represented by the GloVe embeddings later.
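The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the exact code from the full source: the helper names `clean_text` and `combine_daily_news` and the sample news items are made up for this example, and it assumes only the letters a-z should be kept.

```python
import re

def clean_text(text):
    """Lower-case the text and keep only the letters a-z and single spaces."""
    text = text.lower()
    # Replace digits, punctuation and any other non-letter character with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

def combine_daily_news(news_items):
    """Join one day's news items into a single cleaned string."""
    return clean_text(" ".join(news_items))

# Two made-up news items standing in for the 25 items of one day.
day = ["Oil prices hit $50 a barrel!", "U.N. talks resume in Geneva"]
combined = combine_daily_news(day)
print(combined)  # oil prices hit a barrel u n talks resume in geneva
```

Replacing unwanted characters with a space rather than deleting them keeps neighbouring words from being glued together, at the cost of splitting abbreviations such as "U.N." into single letters.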
The next step is to convert all your training sentences into lists of indices, then zero-pad all those lists so that their length is the same.
It is helpful to visualize the length distribution across all input samples before deciding the maximum sequence length.
Keep in mind that the longer the maximum length we pick, the longer it will take to train the model. So instead of choosing the longest sequence length in our dataset, which is around 700, we are going to pick 500 as a tradeoff that covers the majority of the text across all samples while keeping the training time relatively short.
In Keras, the embedding matrix is represented as a "layer" and maps positive integers (indices corresponding to words) into dense vectors of fixed size (the embedding vectors). It can be trained or initialized with a pre-trained embedding. In this part, you will learn how to create an Embedding layer in Keras and initialize it with GloVe 50-dimensional vectors. Because our training set is quite small, we will not update the word embeddings but will instead leave their values fixed. I will show you how Keras allows you to set whether the embedding is trainable or not.
The following function handles the first step of converting sentence strings to an array of indices. The word-to-index mapping is taken from the GloVe embedding file so we can seamlessly convert indices to word vectors later.
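One way to check a candidate maximum length is to measure what fraction of samples it covers. The sketch below uses toy strings instead of the real daily news text, and the helper names `sequence_lengths` and `coverage` are hypothetical.

```python
def sequence_lengths(texts):
    """Number of whitespace-separated words in each text."""
    return [len(t.split()) for t in texts]

def coverage(lengths, max_len):
    """Fraction of samples whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Toy data standing in for the concatenated daily news strings.
texts = ["a b c", "a b", "a b c d e f", "a"]
lengths = sequence_lengths(texts)
print(max(lengths))          # 6
print(coverage(lengths, 3))  # 0.75
```

On the real data, calling `coverage(lengths, 500)` would show how many days fit entirely within the chosen length; samples above it are truncated.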
import numpy as np

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`.

    Arguments:
    X -- array of sentences (strings), of shape (m,)
    word_to_index -- a dictionary containing each word mapped to its index
    max_len -- maximum number of words in a sentence. Longer sentences are truncated.

    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    m = X.shape[0]  # number of training examples

    # Initialize X_indices as a numpy matrix of zeros with the correct shape
    X_indices = np.zeros((m, max_len), dtype=int)

    for i in range(m):  # loop over training examples
        # Convert the ith training sentence to lower case and split it into a list of words.
        sentence_words = [w.lower() for w in X[i].split()]
        j = 0
        for w in sentence_words:
            # Words missing from the vocabulary are skipped.
            if w in word_to_index:
                # Set the (i, j)th entry of X_indices to the index of the word.
                X_indices[i, j] = word_to_index[w]
                j += 1
                # Stop once the row is full.
                if j >= max_len:
                    break

    return X_indices
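The `word_to_index` and `word_to_vec_map` dictionaries could be built from a GloVe text file roughly as follows. This is a sketch under assumptions: the helper name `load_glove` is hypothetical, each line is assumed to hold a word followed by its vector values, and indices are assumed to start at 1 so that 0 stays free for zero-padding.

```python
import io
import numpy as np

def load_glove(lines):
    """Build word_to_index and word_to_vec_map from GloVe-formatted lines.

    Each line is expected to look like: "word v1 v2 ... vN".
    """
    word_to_vec_map = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word_to_vec_map[parts[0]] = np.array([float(v) for v in parts[1:]])
    # Assumption: indices start at 1 so that 0 is reserved for padding.
    word_to_index = {w: i for i, w in enumerate(sorted(word_to_vec_map), start=1)}
    return word_to_index, word_to_vec_map

# Tiny in-memory stand-in for a GloVe file, with made-up 3-dimensional vectors.
fake_file = io.StringIO("cat 0.1 0.2 0.3\napple 0.4 0.5 0.6\n")
word_to_index, word_to_vec_map = load_glove(fake_file)
print(word_to_index)  # {'apple': 1, 'cat': 2}
```

With a real file, the same function can be given an open file handle, since iterating over a text file yields its lines.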
After that, we can implement the pre-trained embedding layer like so.
import numpy as np
from keras.layers import Embedding

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained Keras layer instance
    """
    vocab_len = len(word_to_index) + 1  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]  # dimensionality of the GloVe word vectors (= 50)

    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))

    # Set row "index" of the embedding matrix to the vector of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer with the correct input/output sizes.
    # trainable=False keeps the pre-trained vectors fixed during training.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)

    # Build the embedding layer; this is required before setting its weights. Do not modify the "None".
    embedding_layer.build((None,))

    # Set the weights of the embedding layer to the embedding matrix. The layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])

    return embedding_layer
Let's have a quick check of the embedding layer by asking for the vector representation of the word "cat".
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
# get_weights() returns a list holding one array, the embedding matrix.
embedding_layer.get_weights()[0][word_to_index['cat']]
# array([ 0.45281 , -0.50108 , ... 0.71278 , 0.23782 ], dtype=float32)
The result is a 50-dimensional array. You can further explore the word vectors and measure similarity using cosine similarity, or solve word analogy problems such as Man is to Woman as King is to __.
The task for the model is to take the news string sequence and make a binary classification of whether the Dow Jones close value will rise or fall compared to the previous close value. It outputs "1" if the value rises or stays the same, and "0" if the value decreases.
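Cosine similarity can be illustrated in a few lines. The vectors below are made-up 3-dimensional values, not real GloVe entries, and `toy_vectors` is a hypothetical stand-in for `word_to_vec_map`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
toy_vectors = {
    "cat": [1.0, 2.0, 0.0],
    "dog": [2.0, 4.0, 0.0],
    "car": [0.0, 0.0, 3.0],
}
sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(sim_cat_dog, sim_cat_car)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0. The same function accepts the numpy arrays stored in `word_to_vec_map`, since it only iterates over the values.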
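The labeling rule can be written down directly. This sketch assumes a plain list of daily close values in chronological order; the helper name `make_labels` and the prices are hypothetical, and the training data in the full source may already ship with labels.

```python
def make_labels(closes):
    """Return 1 when a day's close is >= the previous close, else 0.

    The first day has no previous close, so the result has one entry
    fewer than the input.
    """
    return [1 if today >= prev else 0 for prev, today in zip(closes, closes[1:])]

# Made-up close values for four consecutive days.
closes = [100.0, 101.5, 101.5, 99.0]
labels = make_labels(closes)
print(labels)  # [1, 1, 0]
```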
We are building a simple model that contains two stacked GRU layers after the pre-trained embedding layer. A Dense layer generates the final output with sigmoid activation. GRU is a type of recurrent network that processes sequences and takes their order into account. It is similar to LSTM in functionality and performance but less computationally expensive to train.
from keras.models import Sequential
from keras.layers import GRU, Dense, Activation

model = Sequential()
model.add(pretrained_embedding_layer(word_to_vec_map, word_to_index))
model.add(GRU(128, dropout=0.2, return_sequences=True))
model.add(GRU(128, dropout=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Next, we can train and evaluate the model.
# batch_size must be set before this point; its value is not shown in this post.
history = model.fit(X_train_indices, Y_train,
                    batch_size=batch_size, epochs=10,
                    validation_data=(X_test_indices, Y_test))
model.save("./model.h5")
score, acc = model.evaluate(X_test_indices, Y_test, batch_size=batch_size)
It is also helpful to generate the ROC curve for our binary classifier to assess its performance visually.
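To make clear what the ROC curve measures, here is a small hand-rolled sketch that computes the curve points and the area under them. It is for illustration only and uses made-up labels and scores; in practice, `roc_curve` and `auc` from `sklearn.metrics` are the usual choice. The function names `roc_points` and `trapezoid_auc` are hypothetical, and the code assumes both classes appear in `y_true`.

```python
def roc_points(y_true, y_score):
    """Return false-positive and true-positive rates over all score thresholds."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so ties share one threshold.
        is_last = rank == len(order) - 1
        if is_last or y_score[order[rank + 1]] != y_score[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score)
area = trapezoid_auc(fpr, tpr)
print(area)  # 0.75
```

For the model above, `y_score` would be the flattened output of `model.predict(X_test_indices)` and `y_true` would be `Y_test`. An area of 0.5 corresponds to random guessing.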
Our model is about 2.8% better than a random guess of the market trend.
For more information about ROC and AUC, you can read my other blog post on the topic.
In this post, we introduced a quick and simple way to build a Keras model with an Embedding layer initialized with pre-trained GloVe embeddings. Something you can try after reading this post,
Any ideas to improve the model? Comment and share your thoughts.
You can find the full source code and training data here in my GitHub repo.
If you are new to the whole investment world, like I was years ago, you may wonder where to start, preferably investing for free with zero commissions. By learning how to trade stocks for free, you'll not only save money, but your investments will potentially compound at a faster rate. Robinhood, one of the best investing apps, does just that: whether you are buying one share or 100, there are no commissions. It was built from the ground up to be as efficient as possible by cutting out the fat and passing the savings on to customers. Join Robinhood, and we'll both get a stock like Apple, Ford, or Sprint for free. Make sure you use my shared link.