<h1>Keras + Universal Sentence Encoder = Transfer Learning for text data</h1>
<p><img alt="tf-hub-meets-keras" height="441" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/tf-hub-meets-keras.png" width="688"/></p>
<p>We are going to build a Keras model that leverages the pre-trained "Universal Sentence Encoder" to classify a given question into one of six categories.</p>
<p>TensorFlow Hub modules can be applied to a variety of transfer learning tasks and datasets, whether images or text. The "Universal Sentence Encoder" is one of many newly published TensorFlow Hub reusable modules: a self-contained piece of TensorFlow graph with pre-trained weights included.</p>
<p><em>A runnable <a href="https://colab.research.google.com/drive/1Odry08Jm0f_YALhAt4vp9qa5w8prUzDY">Colab notebook</a> is available, so you can experiment with the code as you read on.</em></p>
<h2>What the Universal Sentence Encoder is and how it was trained</h2>
<p>While you can treat all TensorFlow Hub modules as black boxes, agnostic of what happens inside, and still build a functional transfer learning model, it helps to develop a deeper understanding of what each module is capable of, what its constraints are, and how good the transfer learning result could potentially be.</p>
<h3>Universal Sentence Encoder vs. word embeddings</h3>
<p>If you recall the GloVe word embeddings from our <a href="https://www.dlology.com/blog/simple-stock-sentiment-analysis-with-news-data-in-keras/">previous tutorial</a>, which turn a word into a 50-dimensional vector, the Universal Sentence Encoder is much more powerful: it can embed not only words but phrases and sentences. That is, it takes variable-length English text as input and outputs a 512-dimensional vector. Handling variable-length input sounds great, but the catch is that the longer a sentence gets, counted in words, the more diluted the embedding can become. And since the model was trained at the word level, it will likely find typos and rare words challenging to process. For more on the difference between word-level and character-level language models, you can read my <a href="https://www.dlology.com/blog/how-to-train-a-keras-model-to-generate-colors/">previous tutorial</a>.</p>
<p>There are two Universal Sentence Encoders to choose from, with different encoder architectures that target distinct design goals: one, based on the Transformer architecture, targets high accuracy at the cost of greater model complexity and resource consumption; the other, based on a deep averaging network (DAN), targets efficient inference at the cost of slightly reduced accuracy.</p>
<p>Below is a side-by-side comparison of the Transformer and DAN sentence encoder architectures.</p>
<p><img alt="dan-and-transformer" height="795" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/dan-and-transformer.png" width="665"/></p>
<p>The original <a href="https://arxiv.org/pdf/1706.03762.pdf">Transformer</a> model consists of an encoder and a decoder, but here we only use its encoder part.</p>
<p>The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. The authors also employ a residual connection around each of the two sub-layers, followed by layer normalization. Since the model contains no recurrence and no convolution, for it to make use of the order of the sequence, it must inject some information about the relative or absolute position of the tokens in the sequence; that is what the "positional encodings" do. The Transformer-based encoder achieves the best overall transfer task performance, but this comes at the cost of compute time and memory usage that scale dramatically with sentence length.</p>
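<p>For concreteness, the sinusoidal positional encodings described in the Transformer paper can be sketched in a few lines of NumPy. This is my own illustration of the published formula, not code from this tutorial:</p>
<div class="highlight">
<pre>import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal encodings from "Attention Is All You Need":
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle).
    pos = np.arange(max_len)[:, np.newaxis]   # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]     # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])      # even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])      # odd dimensions
    return pe
</pre>
</div>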
<p>The deep averaging network (DAN) is much simpler: input embeddings for words and bi-grams are first averaged together and then passed through a feed-forward deep neural network (DNN) to produce sentence embeddings. The primary advantage of the DAN encoder is that its compute time is linear in the length of the input sequence.</p>
<p>The type of training data and the chosen training metric can have a significant impact on the transfer learning result.</p>
<p>Both models were trained on the Stanford Natural Language Inference (SNLI) corpus. The <a href="https://nlp.stanford.edu/pubs/snli_paper.pdf">SNLI corpus</a> is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). Essentially, the models were trained to learn the semantic similarity between sentence pairs.</p>
<p><span class="fontstyle0">With that in mind, </span><span>the sentence embeddings can be trivially used to compute sentence-level semantic similarity scores.</span></p>
<p><img alt="semantic-similarity" height="571" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/semantic-similarity.png" width="678"/></p>
<p>The source code to generate the similarity heat map is available both in my Colab notebook and in the GitHub repo. Each cell is colored based on the inner product of the encodings for the corresponding pair of sentences: the more similar two sentences are, the darker the color.</p>
<p>Loading the Universal Sentence Encoder and computing the embeddings for some text can be as easy as this:</p>
<div class="highlight">
<pre>import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# Compute a representation for each message, showing various lengths supported.
messages = ["That band rocks!", "That song is really cool."]

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages))
</pre>
</div>
<p>Loading the module for the first time can take a while since it downloads the weight files.</p>
<p>The value of <code>message_embeddings</code> is two arrays corresponding to the two sentences' embeddings, each an array of 512 floating point numbers.</p>
<pre>array([[ 0.06587551,  0.02066354, -0.01454356, ...,  0.06447642,
         0.01654527, -0.04688655],
       [ 0.06909196,  0.01529877,  0.03278331, ...,  0.01220771,
         0.03000253, -0.01277521]], dtype=float32)</pre>
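<p>With those embeddings in hand, the pairwise scores behind the similarity heat map reduce to inner products. A minimal sketch, assuming <code>message_embeddings</code> from the snippet above:</p>
<div class="highlight">
<pre>import numpy as np

# Inner products between every pair of sentence embeddings;
# higher values mean more semantically similar sentences.
similarity = np.inner(message_embeddings, message_embeddings)
print(similarity)  # a 2x2 matrix for our two messages
</pre>
</div>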
<h2><span>Question classification task and data preprocessing</span></h2>
<p>To respond correctly to a question given a large collection of texts, classifying questions into fine-grained classes is crucial for question answering as a retrieval task. Our goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process. For example, consider the question <strong>Q</strong>: <span style="text-decoration: underline;"><em>What Canadian city has the largest population?</em></span> The hope is to classify this question as having answer type <strong>location</strong>, implying that only candidate answers that are locations need consideration.</p>
<p>The dataset we use is the <a href="http://cogcomp.org/Data/QA/QC/">TREC Question Classification dataset</a>. There are 5,452 training and 500 test samples, i.e. 5,452 + 500 questions, each categorized into one of six labels:</p>
<ol>
<li><span><strong>ABBR - 'abbreviation'</strong>: expression abbreviated, etc.</span></li>
<li><span><strong>DESC - 'description and abstract concepts'</strong>: manner of an action, description of sth. etc.</span></li>
<li><span><strong>ENTY - 'entities'</strong>: animals, colors, events, food, etc.</span></li>
<li><span><strong>HUM - 'human beings'</strong>: a group or organization of persons, an individual, etc.</span></li>
<li><span><strong>LOC - 'locations'</strong>: cities, countries, etc.</span></li>
<li><span><strong>NUM - 'numeric values'</strong>: postcodes, dates, speed, temperature, etc.</span></li>
</ol>
<p>We want our model to be a multiclass classifier that takes strings as input and outputs a probability for each of the six class labels. With this in mind, you know how to prepare the training and testing data for it.</p>
<p>The first step is to turn the raw text file into a pandas DataFrame and set the "label" column to be a categorical column, so that we can later access each label as a numeric value.</p>
<div class="highlight">
<pre>import re
import numpy as np
import pandas as pd

def get_dataframe(filename):
    lines = open(filename, 'r').read().splitlines()
    data = []
    for i in range(0, len(lines)):
        label = lines[i].split(' ')[0]
        label = label.split(":")[0]
        text = ' '.join(lines[i].split(' ')[1:])
        text = re.sub('[^A-Za-z0-9 ,\?\'\"-._\+\!/\`@=;:]+', '', text)
        data.append([label, text])
    df = pd.DataFrame(data, columns=['label', 'text'])
    df.label = df.label.astype('category')
    return df

df_train = get_dataframe('train_5500.txt')
df_train.head()
</pre>
</div>
<p>The first 5 training samples look like this.</p>
<p><img alt="df_train_head" height="192" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/149ddc2a3845b95b821898f42f351497480be68b/images/hub/df_train_head.png" width="452"/></p>
<p>Next, we prepare the input/output data for the model: the input is a list of question strings, and the output is a list of one-hot encoded labels. If you are not familiar with one-hot encoding yet, I have you covered in my <a href="https://www.dlology.com/blog/how-to-train-a-keras-model-to-generate-colors/">previous post</a>.</p>
<div class="highlight">
<pre>train_text = df_train['text'].tolist()
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = np.asarray(pd.get_dummies(df_train.label), dtype=np.int8)</pre>
</div>
<p>If you take a peek at the value of <code>train_label</code>, you will see it in one-hot encoded form.</p>
<pre>array([[0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       ...], dtype=int8)</pre>
<p>Now we are ready to build the model.</p>
<h2><span>Keras meets Universal Sentence Encoder</span></h2>
<p>We have previously loaded the Universal Sentence Encoder into the variable <code>embed</code>. To have it work nicely with Keras, we wrap it in a Keras Lambda layer and explicitly cast its input to string.</p>
<div class="highlight">
<pre>def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)),
                 signature="default", as_dict=True)["default"]
</pre>
</div>
<p>Then we build the Keras model with its standard <a href="https://keras.io/getting-started/functional-api-guide/">Functional API</a>.</p>
<div class="highlight">
<pre>from keras import layers
from keras.models import Model

embed_size = 512      # dimension of the Universal Sentence Encoder output
category_counts = 6   # number of question type labels

input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(UniversalEmbedding,
                          output_shape=(embed_size,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(category_counts, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
</pre>
</div>
<p>We can view the model summary and see that only the Keras layers are trainable; that is how the transfer learning works here, keeping the Universal Sentence Encoder weights untouched.</p>
<pre>_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 1)                 0
_________________________________________________________________
lambda_1 (Lambda)            (None, 512)               0
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 1542
=================================================================
Total params: 132,870
Trainable params: 132,870
Non-trainable params: 0
_________________________________________________________________</pre>
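<p>The parameter counts are easy to verify: the 256-unit Dense layer on top of the 512-dimensional embedding has 512 × 256 + 256 = 131,328 parameters, and the six-way output layer has 256 × 6 + 6 = 1,542.</p>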
<p>In the next step, we train the model with the training dataset and validate its performance at the end of each training epoch using the test dataset.</p>
<div class="highlight">
<pre>with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    history = model.fit(train_text,
                        train_label,
                        validation_data=(test_text, test_label),
                        epochs=10,
                        batch_size=32)
    model.save_weights('./model.h5')
</pre>
</div>
<p>The final validation results show the accuracy reaching about 97% after training for 10 epochs.</p>
<p>After we have the model trained and its weights saved to a file, it is ready to make predictions on new questions.</p>
<p>Here we come up with 3 new questions for the model to classify.</p>
<div class="highlight">
<pre>new_text = ["In what year did the titanic sink ?",
            "What is the highest peak in California ?",
            "Who invented the light bulb ?"]
new_text = np.array(new_text, dtype=object)[:, np.newaxis]

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./model.h5')
    predicts = model.predict(new_text, batch_size=32)

categories = df_train.label.cat.categories.tolist()
predict_logits = predicts.argmax(axis=1)
predict_labels = [categories[logit] for logit in predict_logits]
print(predict_labels)
</pre>
</div>
<p><span>The classification results look decent.</span></p>
<pre><span>['NUM', 'LOC', 'HUM']</span></pre>
<h2>Conclusion and further reading</h2>
<p>Congratulations! You have built a Keras text transfer learning model powered by the Universal Sentence Encoder and achieved a great result on the question classification task. The Universal Sentence Encoder can embed longer paragraphs, so feel free to experiment with other datasets and tasks like news topic classification, sentiment analysis, etc.</p>
<p>Here are some related resources you might find useful:</p>
<p><a href="https://www.tensorflow.org/hub/">TensorFlow Hub</a></p>
<p><a href="https://github.com/tensorflow/hub/tree/master/examples/colab">TensorFlow Hub example notebooks</a></p>
<p>For an intro to using Google Colab notebooks, you can read the first section of my post - <a href="https://www.dlology.com/blog/how-to-run-object-detection-and-segmentation-on-video-fast-for-free/">How to run Object Detection and Segmentation on a Video Fast for Free</a>.</p>
<p>The source code is available in <a href="https://github.com/Tony607/Keras-Text-Transfer-Learning">my GitHub repo</a>, along with a runnable <a href="https://colab.research.google.com/drive/1Odry08Jm0f_YALhAt4vp9qa5w8prUzDY">Colab notebook</a>.</p>
<h1>Simple Stock Sentiment Analysis with news data in Keras</h1>
<p><img alt="stock" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/stock.jpg"/></p>
<p>Have you ever wondered what impact everyday news might have on the stock market? In this tutorial, we are going to explore and build a model that reads the top <a href="https://www.reddit.com/r/worldnews/?hl=">25 voted world news headlines from Reddit</a> users and predicts whether the Dow Jones will go up or down on a given day.</p>
<p>After reading this post, you will learn:</p>
<ul>
<li>How to pre-process text data for a deep learning sequence model.</li>
<li>How to use pre-trained GloVe embedding vectors to initialize a Keras Embedding layer.</li>
<li>How to build a GRU model that processes word sequences and takes word order into account.</li>
</ul>
<p>Now let's get started, and read till the end, since there will be a secret bonus.</p>
<h3>Text data pre-processing</h3>
<p><img alt="reddit-news" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/reddit-news.png"/></p>
<p>For the input text, we concatenate all 25 news headlines into one long string for each day.</p>
<p>After that, we convert all sentences to lower case and remove characters, such as numbers and punctuation, that cannot be represented by the GloVe embeddings later.</p>
<p>The next step is to convert all your training sentences into lists of indices, then zero-pad all those lists so that their length is the same.</p>
<p>It is helpful to visualize the length distribution across all input samples before deciding the maximum sequence length.</p>
<p><img alt="sentences-length" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/sentences-length.png"/></p>
<p>Keep in mind that the longer the maximum <span>length we pick, the longer it will take to train the model. So instead of choosing the longest sequence length in our dataset, which is around 700, we pick 500 as a tradeoff: it covers the majority of the text across all samples while keeping the training time relatively short.</span></p>
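<p>A minimal sketch of how one might plot that length distribution, assuming a hypothetical list <code>texts</code> holding the concatenated daily news strings:</p>
<div class="highlight">
<pre>import matplotlib.pyplot as plt

# Hypothetical: `texts` holds one concatenated news string per day.
lengths = [len(t.split()) for t in texts]
plt.hist(lengths, bins=50)
plt.xlabel('words per sample')
plt.ylabel('number of samples')
plt.show()
</pre>
</div>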
<h3>The embedding layer</h3>
<p>In Keras, the embedding matrix is represented as a "layer" that maps positive integers (indices corresponding to words) into dense vectors of fixed size (the embedding vectors). It can be trained or initialized with a pre-trained embedding. In this part, you will learn how to create an Embedding layer in Keras and initialize it with the GloVe 50-dimensional vectors. Because our training set is quite small, we will not update the word embeddings but will instead leave their values fixed. I will show you how Keras allows you to set whether the embedding is trainable or not.</p>
<p>The <code>Embedding()</code> layer takes an integer matrix of size (batch size, max input length) as input; this corresponds to sentences converted into lists of indices (integers), as shown in the figure below.</p>
<p><img alt="embedding" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/embedding.jpg"/></p>
<p>The following function handles the first step of converting sentence strings to an array of indices. The word-to-index mapping is taken from the GloVe embedding file so we can seamlessly convert indices to word vectors later.</p>
<div class="highlight">
<pre>import numpy as np

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`.

    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary containing each word mapped to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this.

    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    m = X.shape[0]  # number of training examples
    # Initialize X_indices as a numpy matrix of zeros with the correct shape
    X_indices = np.zeros((m, max_len), dtype=int)
    for i in range(m):  # loop over training examples
        # Convert the ith training sentence to lower case and split it into a list of words
        sentence_words = [w.lower() for w in X[i].split()]
        # Initialize j to 0
        j = 0
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word
            if w in word_to_index:
                X_indices[i, j] = word_to_index[w]
            # Increment j to j + 1
            j += 1
            if j >= max_len:
                break
    return X_indices
</pre>
</div>
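<p>For instance, a hypothetical call converting the training samples (assuming <code>X_train</code> holds the news strings as a numpy array) might look like this:</p>
<div class="highlight">
<pre># Hypothetical usage: convert every training sample to a list of 500 indices.
X_train_indices = sentences_to_indices(X_train, word_to_index, max_len=500)
</pre>
</div>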
<p>After that, we can implement the pre-trained embedding layer like so:</p>
<ul>
<li>Initialize the embedding matrix as a numpy array of zeros with the correct shape: (vocab_len, dimension of word vectors).</li>
<li><span></span>Fill the embedding matrix with all the word embeddings.</li>
<li>Define the Keras embedding layer and make it non-trainable by setting <code>trainable</code> to False.</li>
<li>Set the weight of the embedding layer to the embedding matrix.</li>
</ul>
<div class="highlight">
<pre>from keras.layers.embeddings import Embedding

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    vocab_len = len(word_to_index) + 1  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]  # dimensionality of your GloVe word vectors (= 50)
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    # Set each row "index" of the embedding matrix to the word vector of the "index"th vocabulary word
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    # Define the Keras embedding layer with the correct output/input sizes. Make sure to set trainable=False.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    # Build the embedding layer; this is required before setting its weights. Do not modify the "None".
    embedding_layer.build((None,))
    # Set the weights of the embedding layer to the embedding matrix. The layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
</pre>
</div>
<p>Let's have a quick check of the embedding layer by asking for the vector representation of the word "cat".</p>
<div class="highlight">
<pre>embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
embedding_layer.get_weights()[0][word_to_index['cat']]
# array([ 0.45281 , -0.50108 , ... 0.71278 , 0.23782 ], dtype=float32)
</pre>
</div>
<p>The result is a 50-dimensional array. You can further explore the word vectors, measure similarity using cosine similarity, or solve word analogy problems such as: Man is to Woman as King is to __.</p>
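<p>Cosine similarity itself is a one-liner. A minimal sketch (the function here is my own illustration, not from this tutorial):</p>
<div class="highlight">
<pre>import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors: close to 1 means very similar.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
</pre>
</div>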
<h3>Build and evaluate the model</h3>
<p>The task for the model is to take the news string sequence and make a binary classification: whether the Dow Jones close will rise or fall compared to the previous close. It outputs "1" if the value rose or stayed the same, and "0" when the value decreased.</p>
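<p>A minimal sketch of generating such labels, assuming a hypothetical pandas DataFrame <code>djia</code> with one row per trading day and a <code>Close</code> column:</p>
<div class="highlight">
<pre>import pandas as pd

# Hypothetical: label is 1 when the close rose or stayed the same vs. the previous day.
djia['label'] = (djia['Close'] &gt;= djia['Close'].shift(1)).astype(int)
</pre>
</div>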
<p>We build a simple model containing two stacked GRU layers after the pre-trained embedding layer, with a Dense layer generating the final output through a sigmoid activation. GRU is a type of recurrent network that processes sequences and takes their order into account; it is similar to LSTM in functionality and performance, but less computationally expensive to train.</p>
<div class="highlight">
<pre>from keras.models import Sequential
from keras.layers import GRU, Dense, Activation

model = Sequential()
model.add(pretrained_embedding_layer(word_to_vec_map, word_to_index))
model.add(GRU(128, dropout=0.2, return_sequences=True))
model.add(GRU(128, dropout=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
</pre>
</div>
<p>Next, we can train and evaluate the model.</p>
<div class="highlight">
<pre>history = model.fit(X_train_indices, Y_train, batch_size=batch_size, epochs=10,
                    validation_data=(X_test_indices, Y_test))
model.save("./model.h5")
score, acc = model.evaluate(X_test_indices, Y_test,
                            batch_size=batch_size)
</pre>
</div>
<p>It is also helpful to generate the ROC curve for our binary classifier to assess its performance visually.</p>
<p><img alt="ROC" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/98959cf5c4fa5fed5084f65fee304c054edad088/images/finance/roc.png"/></p>
<p>Our model is about 2.8% better than a random guess of the market trend.</p>
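<p>A minimal sketch of computing the curve and AUC with scikit-learn, assuming <code>X_test_indices</code> and <code>Y_test</code> from above:</p>
<div class="highlight">
<pre>from sklearn.metrics import roc_curve, auc

# Predicted probabilities for the positive ("market rose") class.
probs = model.predict(X_test_indices).ravel()
fpr, tpr, thresholds = roc_curve(Y_test, probs)
print('AUC: %.3f' % auc(fpr, tpr))
</pre>
</div>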
<p>For more information about ROC and AUC, you can read my other blog post - <a href="https://www.dlology.com/blog/simple-guide-on-how-to-generate-roc-plot-for-keras-classifier/">Simple guide on how to generate ROC plot for Keras classifier</a>.</p>
<h3>Conclusion and further thoughts</h3>
<p>In this post, we introduced a quick and simple way to build a Keras model with an Embedding layer initialized with pre-trained GloVe embeddings. Some things you can try after reading this post:</p>
<ul>
<li>Make the Embedding layer weights trainable, train the model from scratch, then compare the results.</li>
<li>Increase the maximum sequence length and see how that might impact the model performance and training time.</li>
<li>Incorporate other input to form a multi-input Keras model, since other factors might correlate with stock index fluctuation. For example, there are <a href="https://www.investopedia.com/terms/m/macd.asp">MACD</a>(Moving Average Convergence/Divergence oscillator), <a href="https://www.investopedia.com/terms/m/momentum.asp">Momentum </a>indicator for your consideration. To have multi-input, you can use the <a href="https://keras.io/getting-started/functional-api-guide/">Keras functional API</a>.</li>
</ul>
<p>Any ideas to improve the model? Comment and share your thoughts.</p>
<p>You can find the full source code and training data here in <a href="https://github.com/Tony607/SentimentStock">my Github repo</a>.</p>
<h3>Bonus for investors</h3>
<p><img alt="stock_ticket" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/e6466d70837a591d2da764509942cf20c2b0f93c/images/finance/stock_ticket.png"/></p>
<p>If you are new to the whole investment world, as I was years ago, you may wonder where to start, preferably investing for free with zero commissions. By learning how to trade stocks for free, you'll not only save money, but your investments will potentially compound at a faster rate. <a href="https://www.forbes.com/companies/robinhood/">Robinhood</a>, one of the best investing apps, does just that: whether you are buying one share or 100 shares, there are no commissions. It was built from the ground up to be as efficient as possible by cutting out the fat and passing the savings on to the customers. Join Robinhood, and we'll both get a stock like Apple, Ford, or Sprint for free. Make sure you use my <a href="https://share.robinhood.com/chengwz1">shared link</a>.</p>
<h1>How to generate realistic yelp restaurant reviews with Keras</h1>
<p><img alt="restaurant reviews" src="https://gitcdn.xyz/repo/Tony607/blog_statics/master/images/rnn/restaurant.jpg"/></p>
<p>TL;DR</p>
<p>After reading this article, you will be able to build a model that generates 5-star Yelp reviews like these.</p>
<p><span style="text-decoration: underline;"><em>Samples of generated review text (unmodified)</em></span></p>
<pre><SOR>I had the steak, mussels with a side of chicken parmesan. All were very good. We will be back.<EOR><br/><SOR>The food, service, atmosphere, and service are excellent. I would recommend it to all my friends<EOR><br/><SOR>Good atmosphere, amazing food and great service.Service is also pretty good. Give them a try!<EOR></pre>
<p>I will show you how to:</p>
<ul>
<li>Acquire and prepare the training data.</li>
<li>Build the character-level language model.</li>
<li>Apply some tips when training the model.</li>
<li>Generate random reviews.</li>
</ul>
<p>Training the model could easily take a couple of days, even on a GPU. Luckily, the pre-trained model weights are available, so we can jump directly to the fun part and generate reviews.</p>
<h2>Getting the Data ready</h2>
<p>The <a href="https://www.yelp.com/dataset/challenge">Yelp Dataset </a>is freely available in JSON format.</p>
<p>After downloading and extracting, you will find 2 files we need in the <strong>dataset</strong> folder,</p>
<ul>
<li>review.json</li>
<li>business.json</li>
</ul>
<p>Those two files are quite large, especially the <strong>review.json</strong> file (3.7 GB).</p>
<p>Each line of the <strong>review.json</strong> file is one review as a JSON string. The two files do not have the JSON start and end square brackets "[ ]", so the content of each file as a whole is not a valid JSON string. Plus, it might be difficult to fit the whole <strong>review.json</strong> file content into memory. So let's first convert them to CSV format line by line with our helper script.</p>
<pre>python json_converter.py ./dataset/review.json<br/>python json_converter.py ./dataset/business.json</pre>
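<p>The helper script lives in the repo; as a rough idea of what it does, a minimal line-by-line JSON-to-CSV converter could be sketched like this (my own simplification, not the repo's actual script):</p>
<div class="highlight">
<pre>import csv
import json
import sys

# Stream the file line by line so we never hold the 3.7 GB file in memory at once.
path = sys.argv[1]
with open(path) as fin, open(path.replace('.json', '.csv'), 'w', newline='') as fout:
    writer = None
    for line in fin:
        record = json.loads(line)  # each line is one standalone JSON object
        if writer is None:
            writer = csv.DictWriter(fout, fieldnames=record.keys())
            writer.writeheader()
        writer.writerow(record)
</pre>
</div>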
<p>After that, you will find those two files in <strong>dataset</strong> folder,</p>
<ul>
<li>review.csv</li>
<li>business.csv</li>
</ul>
<p>Those two are valid CSV files we can open by <strong>pandas</strong> library.</p>
<p>Here is what we are going to do: we only extract <strong>5-star</strong> review texts from businesses that have the '<strong>Restaurant</strong>' tag in their categories.</p>
<div class="highlight">
<pre>import pandas as pd

# Read the two CSV files into pandas dataframes
df_business = pd.read_csv('../dataset/business.csv')
df_review = pd.read_csv('../dataset/review.csv')
# Filter 'Restaurants' businesses
restaurants = df_business[df_business['categories'].str.contains('Restaurants')]
# Filter 5-star reviews
five_star = df_review[df_review['stars'] == 5]
# Merge the reviews with restaurants by key 'business_id'.
# This keeps only 5-star restaurant reviews.
combo = pd.merge(restaurants, five_star, on='business_id')
# Keep only the review text column
rnn_fivestar_reviews_only = combo[['text']]
</pre>
</div>
<p>Next, let's remove the new line characters in reviews and any duplicated reviews.</p>
<div class="highlight">
<pre># Remove new line characters
rnn_fivestar_reviews_only = rnn_fivestar_reviews_only.replace({r'\n+': ''}, regex=True)
# Remove duplicated reviews
final = rnn_fivestar_reviews_only.drop_duplicates()
</pre>
</div>
<p>To show the model where a review starts and ends, we need to add special markers to the review texts.</p>
<p>One line in the final prepared reviews will look like this:</p>
<pre>"<SOR>Hummus is amazing and fresh! Loved the falafels. I will definitely be back. Great owner, friendly staff<EOR>"</pre>
<h2>Build the model</h2>
<p>The model we are building here is a <strong>character-level language model</strong>, meaning the minimum distinguishable symbol is a character. You may also come across word-level models, where the input is word tokens.</p>
<p>There are some pros and cons for the <strong>character-level language model</strong>.</p>
<p><strong>Pro:</strong></p>
<ul>
<li><span>Don’t have to worry about unknown vocabulary.</span></li>
<li><span>Able to handle a large vocabulary, since any word can be spelled out character by character.</span></li>
</ul>
<p><strong>Con:</strong></p>
<ul>
<li><span>We end up with very long sequences, so the model is not as good as word-level language models at capturing <strong>long-range dependencies</strong>, i.e. how the earlier parts of a sentence affect its later parts.</span></li>
<li><span>Character-level models are also more <strong>computationally expensive</strong> to train.</span></li>
</ul>
<p>The model is quite similar to the official <a href="https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py"><strong class="final-path">lstm_text_generation.py </strong>demo code</a>, except that we stack RNN layers, which lets the hidden states between the input and output layers store more information and generates more realistic Yelp reviews.</p>
<p>Before showing the code for the model, let's peek a little deeper on how stacking RNN works.</p>
<p>You may have seen how layers are stacked in a standard neural network (the <strong>Dense</strong> layers in Keras).</p>
<p>The first layer takes the input <strong>x</strong> and computes the activation value <strong>a<sup>[1]</sup></strong>; the next stacked layer then computes the next activation value <strong>a<sup>[2]</sup></strong>.</p>
<p><img alt="stack dense" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/stack_dense.svg"/></p>
<p>Stacking RNNs is a bit like the standard neural network "unrolled in time".</p>
<p>In this notation, <strong>a<sup>[l]<t></sup></strong> means the activation value for <strong>layer l</strong> at <strong>timestep t</strong>.</p>
<p><img alt="stack rnn" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/stack_rnn.svg"/></p>
<p>Let's take a look at how an activation value is computed.</p>
<p>To compute <strong>a<sup>[2]<3></sup></strong>, there are two inputs: <strong>a<sup>[2]<2></sup></strong><span> and <strong>a<sup>[1]<3></sup></strong></span></p>
<p><span>g is the activation function, w<sub>a</sub><sup>[2]</sup> and b<sub>a</sub><sup>[2]</sup> are the layer 2 parameters.</span></p>
<p><img alt="activation a23" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/a23.png"/></p>
<p>As we can see, to stack RNNs, the previous RNN layer needs to return all the timestep activations a<sup><t></sup> to the subsequent RNN layer.</p>
<p>By default, an <strong>RNN</strong> layer such as <strong>LSTM</strong> in Keras only returns the last timestep's activation value a<sup><T></sup>. To return the activation values for all timesteps, we set the <code>return_sequences</code> parameter to <code>True</code>.</p>
<p>So here is how we build the model in Keras. Each input sample is a one-hot representation of 60 characters, and there are 95 possible characters in total.</p>
<p>Each output is a list of 95 predicted probabilities, one for each possible character.</p>
<div class="highlight">
<pre><span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">layers</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">LSTM</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">60</span><span class="p">,</span> <span class="mi">95</span><span class="p">),</span><span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">LSTM</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">60</span><span class="p">,</span> <span class="mi">95</span><span class="p">)))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">95</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">))</span>
</pre>
</div>
<p>And here is the graphical model structure to help you visualize it.</p>
<p><img alt="model structure" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/model_yelp.svg"/></p>
<h2><span>Training the model</span></h2>
<p><span>The idea behind training the model is simple: we train it with input/output pairs. Each input is 60 characters, and the corresponding output is the immediately following character.</span></p>
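<p>Conceptually, one training pair is cut from the text like this (a sketch):</p>
<pre># One input/output pair starting at position i
input_x  = text[i : i + 60]   # 60 characters
output_y = text[i + 60]       # the very next character</pre>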
<p><span>In the data preparation step, we created a list of clean 5-star review texts, 1,214,016 reviews in total. To simplify training, we only train on reviews of 250 characters or fewer, which leaves 418,955 reviews.</span></p>
<p>Then we shuffle the order of the reviews, so we don't train on 100 reviews for the same restaurant in a row.</p>
<p><span>We read all reviews as one long text string, then create a Python dictionary (i.e., a hash table) to map each character to an index from 0-94 (95 unique characters in total).</span></p>
<div class="highlight">
<pre><span class="c1"># List of unique characters in the corpus</span>
<span class="n">chars</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Unique characters:'</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">))</span>
<span class="c1"># Dictionary mapping unique characters to their index in `chars`</span>
<span class="n">char_indices</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">((</span><span class="n">char</span><span class="p">,</span> <span class="n">chars</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">char</span><span class="p">))</span> <span class="k">for</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">chars</span><span class="p">)</span>
</pre>
</div>
<p>The text corpus has a total of 72,662,807 characters, which is hard to process as a whole, so let's break it down into chunks of 90,000 characters each.</p>
<p>For each chunk, we generate pairs of inputs and outputs by sliding a pointer from the beginning to the end of the chunk, one character at a time when <code>step</code> is set to 1.</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="nf">getDataFromChunk</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">step</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="n">sentences</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">next_chars</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">)</span> <span class="o">-</span> <span class="n">maxlen</span><span class="p">,</span> <span class="n">step</span><span class="p">):</span>
<span class="n">sentences</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">[</span><span class="n">i</span> <span class="p">:</span> <span class="n">i</span> <span class="o">+</span> <span class="n">maxlen</span><span class="p">])</span>
<span class="n">next_chars</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">txtChunk</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">maxlen</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'nb sequences:'</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Vectorization...'</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">),</span> <span class="n">maxlen</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">bool</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">sentences</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">bool</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sentences</span><span class="p">):</span>
<span class="k">for</span> <span class="n">t</span><span class="p">,</span> <span class="n">char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sentence</span><span class="p">):</span>
<span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">char_indices</span><span class="p">[</span><span class="n">char</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">char_indices</span><span class="p">[</span><span class="n">next_chars</span><span class="p">[</span><span class="n">i</span><span class="p">]]]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="p">[</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">]</span>
</pre>
</div>
<p>Training one chunk for one epoch takes 219 seconds on GPU (GTX1070), so training the full corpus will take about 2 days.</p>
<pre>72662807 / 90000 * 219 / 60 / 60 / 24 ≈ 2.0 days</pre>
<p>Two Keras callbacks come in handy: <strong>ModelCheckpoint</strong> and <strong>ReduceLROnPlateau</strong>.</p>
<p><strong>ModelCheckpoint </strong>helps us save the weights every time the monitored loss improves.</p>
<p>The <strong>ReduceLROnPlateau</strong> callback automatically reduces the learning rate when the <strong>loss</strong> metric stops decreasing. Its main benefit is that we don't need to tune the learning rate manually; its main weakness is that the learning rate only ever decreases and never recovers.</p>
<div class="highlight">
<pre><span class="c1"># this saves the weights everytime they improve so you can let it train. Also learning rate decay</span>
<span class="n">filepath</span><span class="o">=</span><span class="s">"Feb-22-all-{epoch:02d}-{loss:.4f}.hdf5"</span>
<span class="n">checkpoint</span> <span class="o">=</span> <span class="n">ModelCheckpoint</span><span class="p">(</span><span class="n">filepath</span><span class="p">,</span> <span class="n">monitor</span><span class="o">=</span><span class="s">'loss'</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">save_best_only</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">'min'</span><span class="p">)</span>
<span class="n">reduce_lr</span> <span class="o">=</span> <span class="n">ReduceLROnPlateau</span><span class="p">(</span><span class="n">monitor</span><span class="o">=</span><span class="s">'loss'</span><span class="p">,</span> <span class="n">factor</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">patience</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">min_lr</span><span class="o">=</span><span class="mf">0.00001</span><span class="p">)</span>
<span class="n">callbacks_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">checkpoint</span><span class="p">,</span> <span class="n">reduce_lr</span><span class="p">]</span>
</pre>
</div>
<p>Code to train the model for 20 epochs looks like this.</p>
<div class="highlight">
<pre><span class="k">for</span> <span class="n">iteration</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Iteration'</span><span class="p">,</span> <span class="n">iteration</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"../dataset/short_reviews_shuffle.txt"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">iter</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="mi">90000</span><span class="p">),</span> <span class="s">""</span><span class="p">):</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">getDataFromChunk</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">callbacks</span><span class="o">=</span><span class="n">callbacks_list</span><span class="p">)</span>
</pre>
</div>
<p>As you might guess, the full run will take a month or so. But training for about 2 hours already produced some promising results in my case. Feel free to give it a try.</p>
<h2>Generate 5-star reviews</h2>
<p>Whether you jumped right to this section or read through the previous ones, here is the fun part!</p>
<p><span>With the pre-trained model weights, or weights you trained yourself, we can generate some interesting Yelp reviews.</span></p>
<p><span>Here is the idea: we "seed" the model with an initial 60 characters and ask it to predict the very next character.</span></p>
<p><img alt="generate sample" src="https://gitcdn.link/cdn/Tony607/blog_statics/482cbeb7d2abf3b1295885e901431b07e87ec8d3/images/rnn/sample.svg"/></p>
<p><span>The "sampling index" process will add some variety to the final result </span><span>by generating some randomness with the given prediction.</span></p>
<p><span>If the temperature is very small, it will always pick the index with highest predicted probability.</span></p>
<div class="highlight">
<pre><span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Generate some randomness with the given preds</span>
<span class="sd"> which is a list of numbers, if the temperature</span>
<span class="sd"> is very small, it will always pick the index</span>
<span class="sd"> with highest pred value</span>
<span class="sd"> '''</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float64'</span><span class="p">)</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span> <span class="o">/</span> <span class="n">temperature</span>
<span class="n">exp_preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">exp_preds</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">exp_preds</span><span class="p">)</span>
<span class="n">probas</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">multinomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">preds</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">probas</span><span class="p">)</span>
</pre>
</div>
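<p>Before generating, we need a 60-character seed and a temperature value; here is a minimal sketch (picking a random window from the corpus is just one option):</p>
<pre>import sys
# Pick a random 60-character window from the corpus as the seed
start = np.random.randint(0, len(text) - maxlen - 1)
generated_text = text[start : start + maxlen]
temperature = 0.8  # example value; higher means more surprising output</pre>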
<p><span>To generate 300 characters, we run the following code.</span></p>
<div class="highlight">
<pre><span class="c1"># We generate 300 characters</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">300</span><span class="p">):</span>
<span class="n">sampled</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="n">maxlen</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">chars</span><span class="p">)))</span>
<span class="c1"># Turn each char to char index.</span>
<span class="k">for</span> <span class="n">t</span><span class="p">,</span> <span class="n">char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">generated_text</span><span class="p">):</span>
<span class="n">sampled</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">char_indices</span><span class="p">[</span><span class="n">char</span><span class="p">]]</span> <span class="o">=</span> <span class="mf">1.</span>
<span class="c1"># Predict next char probabilities</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">sampled</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># Add some randomness by sampling given probabilities.</span>
<span class="n">next_index</span> <span class="o">=</span> <span class="n">sample</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">temperature</span><span class="p">)</span>
<span class="c1"># Turn char index to char.</span>
<span class="n">next_char</span> <span class="o">=</span> <span class="n">chars</span><span class="p">[</span><span class="n">next_index</span><span class="p">]</span>
<span class="c1"># Append char to generated text string</span>
<span class="n">generated_text</span> <span class="o">+=</span> <span class="n">next_char</span>
<span class="c1"># Pop the first char in generated text string.</span>
<span class="n">generated_text</span> <span class="o">=</span> <span class="n">generated_text</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
<span class="c1"># Print the new generated char.</span>
<span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">next_char</span><span class="p">)</span>
<span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">generated_text</span><span class="p">)</span>
</pre>
</div>
<h2>Summary and Further reading</h2>
<p>In this post, you learned how to build and train a character-level text generation model from beginning to end. The source code is available on my <a href="https://github.com/Tony607/Yelp_review_generation">GitHub repo</a>, as well as the pre-trained model to play with.</p>
<p>The model shown here is trained in a many-to-one fashion. There is also an alternative implementation in a many-to-many fashion: consider an input sequence of 7 characters, <strong>"The cak"</strong>, where the expected output is <strong>"he cake"</strong>. You can check it out here: <a href="https://github.com/mineshmathew/char_rnn_karpathy_keras">char_rnn_karpathy_keras</a>.</p>
<p>For a reference to building a word-level model, check out my other blog: <a href="https://www.dlology.com/blog/simple-stock-sentiment-analysis-with-news-data-in-keras/">Simple Stock Sentiment Analysis with news data in Keras</a>.</p>
How to teach AI to suggest product prices to online sellers2017-12-23T13:09:19+00:002024-03-15T16:57:07+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/how-to-teach-ai-to-suggest-product-prices-to-online-sellers/<p><img alt="price prediction" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/master/images/kaggle_price/price.png"/></p>
<p>Given a product's brand, name, category, and a sentence or two of short description, can we predict its price? Could it be that simple?</p>
<p>In <a href="https://www.kaggle.com/c/mercari-price-suggestion-challenge">this</a> Kaggle competition, we are doing exactly that. Developers around the world are competing for the "Mercari Prize: Price Suggestion Challenge", with a total prize pool of $100,000 (first place: $60,000, second place: $30,000, third place: $10,000).</p>
<p>In this post, I will walk you through building a simple model to tackle the challenge in the deep learning library <a href="https://keras.io">Keras</a>.</p>
<p>If you are new to Kaggle, in order to download the datasets, you need to register an account, totally pain-free. Once you have the account, go to the <a href="https://www.kaggle.com/c/mercari-price-suggestion-challenge/data">"Data" tab</a> in the Mercari Price Suggestion Challenge.</p>
<p>Download all three files to your local computer, extract and save them to a folder named "<strong>input</strong>", and at the root folder create a folder named "<strong>scripts</strong>" where we will start coding.</p>
<p>Right now you should have your directories structured similarly to this.</p>
<pre>./Pricing_Challenge
|-- <strong>input</strong>
|   |-- test.tsv
|   |-- sample_submission.csv
|   `-- train.tsv
`-- <strong>scripts</strong></pre>
<h1>Preparing the data</h1>
<p>First, let's take some time to understand the datasets at hand.</p>
<p><strong>train.tsv, test.tsv</strong></p>
<p>The files consist of a list of product listings. These files are tab-delimited.</p>
<ul>
<li><code>train_id</code><span> </span>or<span> </span><code>test_id</code><span> </span>- the id of the listing</li>
<li><code>name</code><span> </span>- the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as<span> </span><code>[rm]</code></li>
<li><code>item_condition_id</code><span> </span>- the condition of the items provided by the seller</li>
<li><code>category_name</code><span> </span>- category of the listing</li>
<li><code>brand_name</code></li>
<li><code>price</code><span> </span>- the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in<span> </span><code>test.tsv</code><span> </span>since that is what you will predict.</li>
<li><code>shipping</code><span> </span>- 1 if shipping fee is paid by seller and 0 by a buyer</li>
<li><code>item_description</code><span> </span>- the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as<span> </span><code>[rm]</code></li>
</ul>
<div class="table-responsive">
<h3><span>First 5 rows of train datasets</span></h3>
<table class="table table-striped">
<thead>
<tr>
<th>train_id</th>
<th>name</th>
<th>item_condition_id</th>
<th>category_name</th>
<th>brand_name</th>
<th>price</th>
<th>shipping</th>
<th>item_description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>MLB Cincinnati Reds T Shirt Size XL</td>
<td>3</td>
<td>Men/Tops/T-shirts</td>
<td>NaN</td>
<td>10.0</td>
<td>1</td>
<td>No description yet</td>
</tr>
<tr>
<td>1</td>
<td>Razer BlackWidow Chroma Keyboard</td>
<td>3</td>
<td>Electronics/Computers & Tablets/Components & P...</td>
<td>Razer</td>
<td>52.0</td>
<td>0</td>
<td>This keyboard is in great condition and works ...</td>
</tr>
<tr>
<td>2</td>
<td>AVA-VIV Blouse</td>
<td>1</td>
<td>Women/Tops & Blouses/Blouse</td>
<td>Target</td>
<td>10.0</td>
<td>1</td>
<td>Adorable top with a hint of lace and a key hol...</td>
</tr>
<tr>
<td>3</td>
<td>Leather Horse Statues</td>
<td>1</td>
<td>Home/Home Décor/Home Décor Accents</td>
<td>NaN</td>
<td>35.0</td>
<td>1</td>
<td>New with tags. Leather horses. Retail for [rm]...</td>
</tr>
<tr>
<td>4</td>
<td>24K GOLD plated rose</td>
<td>1</td>
<td>Women/Jewelry/Necklaces</td>
<td>NaN</td>
<td>44.0</td>
<td>0</td>
<td>Complete with certificate of authenticity</td>
</tr>
</tbody>
</table>
</div>
<p>For the price column, it would be problematic to feed the neural network values that take wildly different ranges. The network might be able to adapt to such heterogeneous data automatically, but it would definitely make learning more difficult.</p>
<p>A widespread best practice for dealing with such data is normalization: here we apply <span><code>log(x+1)</code> to the price column.</span></p>
<div class="highlight">
<pre><span class="n">train</span><span class="p">[</span><span class="s">'target'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">log1p</span><span class="p">(</span><span class="n">train</span><span class="p">[</span><span class="s">'price'</span><span class="p">])</span>
</pre>
</div>
<div class="highlight"></div>
<p>Let's take a look at the distribution of the new 'target' column.</p>
<p><img alt="target" src="https://gitcdn.xyz/cdn/Tony607/blog_statics/master/images/kaggle_price/target.png"/></p>
<h2>Text preprocessing</h2>
<h3>Replace contractions</h3>
<p>We will replace contraction pairs like those below; the purpose is to unify the vocabulary and make the model easier to train.</p>
<pre>"what's" → "what is",<br/>"who'll"<span> →</span> "who will",<br/>"wouldn't"<span> →</span> "would not",<br/>"you'd"<span> →</span> "you would",<br/>"you're"<span> →</span> "you are"</pre>
<p>Before doing this, let's count how many rows contain any of the contractions in the "item_description" column.</p>
<p>The most frequent ones are listed below, which is no surprise.</p>
<pre>can't - 6136<br/>won't - 4867<br/>that's - 4806<br/>it's - 26444<br/>don't - 32645<br/>doesn't - 8520</pre>
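<p>The counts above can be reproduced with something like this (a sketch over the train set loaded earlier):</p>
<pre>for c in ["can't", "won't", "that's", "it's", "don't", "doesn't"]:
    count = train['item_description'].str.contains(c, regex=False, na=False).sum()
    print(c, '-', count)</pre>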
<p>Here is the code to remove contractions from both the 'item_description' and 'name' columns in the train and test datasets.</p>
<div class="highlight">
<pre><span class="n">contractions</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"ain't"</span><span class="p">:</span> <span class="s">"am not"</span><span class="p">,</span>
<span class="s">"aren't"</span><span class="p">:</span> <span class="s">"are not"</span><span class="p">,</span>
<span class="s">"can't"</span><span class="p">:</span> <span class="s">"cannot"</span><span class="p">,</span>
<span class="s">"can't've"</span><span class="p">:</span> <span class="s">"cannot have"</span><span class="p">,</span>
<span class="c1"># more contractions pairs</span>
<span class="o">...</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">contraction</span> <span class="ow">in</span> <span class="n">contractions</span><span class="p">:</span>
<span class="n">train</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
<span class="n">test</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span> <span class="o">=</span> <span class="n">test</span><span class="p">[</span><span class="s">'item_description'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
<span class="n">train</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
<span class="n">test</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">test</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">contraction</span><span class="p">,</span> <span class="n">contractions</span><span class="p">[</span><span class="n">contraction</span><span class="p">])</span>
</pre>
</div>
<h3>Handle missing values</h3>
<p>The concept of missing values is important to understand in order to successfully manage data. If missing values are not handled properly, the researcher may end up drawing inaccurate inferences from the data.</p>
<p>First, let's take a look at how much data is missing and in which columns.</p>
<div class="highlight">
<pre><span class="n">train</span><span class="o">.</span><span class="n">isnull</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
</pre>
</div>
<p>From the output, we can see that about 42% of "brand_name" values are missing, while the "category_name" and "item_description" columns are each missing less than 1% of their data.</p>
<pre>train_id 0.000000
name 0.000000
item_condition_id 0.000000
<strong>category_name 0.004268</strong>
<strong>brand_name 0.426757</strong>
price 0.000000
shipping 0.000000
<strong>item_description 0.000003</strong>
dtype: float64</pre>
<p>Suppose the number of missing values is extremely small; then an expert researcher may drop or omit those values from the analysis. In statistical language, if the number of cases is less than 5% of the sample, the researcher can drop them. In our case, we could drop rows where the <span>"<strong>category_name</strong>" or "<strong>item_description</strong>" column has a missing value.</span></p>
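<p>If we wanted to drop those rows, one line would do it (a sketch; we won't use this here):</p>
<pre>train = train.dropna(subset=['category_name', 'item_description'])</pre>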
<p><span>But for simplicity, let's replace all missing text values with the string "<strong>missing</strong>".</span></p>
<div class="highlight">
<pre><span class="c1">#HANDLE MISSING VALUES</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Handling missing values..."</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">handle_missing</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">"missing"</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">brand_name</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">"missing"</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">"missing"</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">return</span> <span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
<span class="n">train</span> <span class="o">=</span> <span class="n">handle_missing</span><span class="p">(</span><span class="n">train</span><span class="p">)</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">handle_missing</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</pre>
</div>
<h3>Create categorical columns</h3>
<p>There are two text columns with special meanings:</p>
<ul>
<li>category_name</li>
<li>brand_name</li>
</ul>
<p>Different products can have the same category name or brand name, so it is helpful to create categorical columns from them.</p>
<p>We will use sklearn's LabelEncoder for this purpose. After the transform, we will have two new columns, "category" and "brand", with integer types.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">LabelEncoder</span>
<span class="n">le</span> <span class="o">=</span> <span class="n">LabelEncoder</span><span class="p">()</span>
<span class="n">le</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="p">,</span> <span class="n">test</span><span class="o">.</span><span class="n">category_name</span><span class="p">]))</span>
<span class="n">train</span><span class="p">[</span><span class="s">'category'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="p">)</span>
<span class="n">test</span><span class="p">[</span><span class="s">'category'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">category_name</span><span class="p">)</span>
<span class="n">le</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">train</span><span class="o">.</span><span class="n">brand_name</span><span class="p">,</span> <span class="n">test</span><span class="o">.</span><span class="n">brand_name</span><span class="p">]))</span>
<span class="n">train</span><span class="p">[</span><span class="s">'brand'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">brand_name</span><span class="p">)</span>
<span class="n">test</span><span class="p">[</span><span class="s">'brand'</span><span class="p">]</span> <span class="o">=</span> <span class="n">le</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">brand_name</span><span class="p">)</span>
</pre>
</div>
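<p>For intuition, here is what <code>LabelEncoder</code> does on a few made-up brand values:</p>
<pre>from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['Razer', 'Target', 'missing', 'Razer'])
le.transform(['Target', 'missing'])   # array([1, 2])</pre>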
<h3>Tokenize - texts to sequences</h3>
<p>For each unique word in the vocabulary, we will use one integer to represent it. So one sentence becomes a list of integers.</p>
<p>First, we need to gather the vocabulary from the text columns we are going to tokenize, i.e. these 3 columns:</p>
<ul>
<li>category_name</li>
<li>item_description</li>
<li>name</li>
</ul>
<p>And we will use Keras' text processing <strong>Tokenizer </strong>class.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="n">raw_text</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">(),</span>
<span class="n">train</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">(),</span>
<span class="n">train</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">()])</span>
<span class="c1"># Tokenize</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Tokenizing!"</span><span class="p">)</span>
<span class="n">tok_raw</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">()</span>
<span class="n">tok_raw</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">raw_text</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">" Transforming text to seq..."</span><span class="p">)</span>
<span class="n">train</span><span class="p">[</span><span class="s">"seq_category_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">test</span><span class="p">[</span><span class="s">"seq_category_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">category_name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">train</span><span class="p">[</span><span class="s">"seq_item_description"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">test</span><span class="p">[</span><span class="s">"seq_item_description"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">item_description</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">train</span><span class="p">[</span><span class="s">"seq_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">train</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">test</span><span class="p">[</span><span class="s">"seq_name"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tok_raw</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">test</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
</pre>
</div>
<p>The <code>fit_on_texts()</code> method will train the Tokenizer to generate the vocabulary word index mapping. And the<code> texts_to_sequences()</code> method will actually make the sequences from texts.</p>
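<p>A tiny illustration of the two methods on made-up strings:</p>
<pre>from keras.preprocessing.text import Tokenizer
tok = Tokenizer()
tok.fit_on_texts(["the cake is great", "great cake"])
print(tok.word_index)
# e.g. {'cake': 1, 'great': 2, 'the': 3, 'is': 4}
print(tok.texts_to_sequences(["the great cake"]))
# [[3, 2, 1]]</pre>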
<h3>Padding sequences</h3>
<p>The word sequences generated in the previous step have different lengths, but the first layer in our network for those sequences is an <code>Embedding</code> layer, and each <code>Embedding</code> layer takes as input a 2D tensor of integers of shape <code>(samples, sequence_length)</code>.</p>
<p><span>All sequences in a batch must have the same length, since we need to pack them into a single tensor. So sequences that are shorter than others are padded with zeros, and sequences that are longer are truncated.</span></p>
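<p>For example, with a maximum length of 4 (by default, both padding and truncation happen on the left):</p>
<pre>from keras.preprocessing.sequence import pad_sequences
pad_sequences([[5, 8], [3, 9, 4, 7, 1]], maxlen=4)
# array([[0, 0, 5, 8],
#        [9, 4, 7, 1]], dtype=int32)</pre>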
<p><span>To begin with, we need to choose the <code>sequence_length</code> for each of our sequence columns. If it is too long, model training will take forever. If it is too short, we risk truncating important information. It helps to visualize the sequence length distribution before making this decision.</span></p>
<p><span>This line of code plots the sequence length distribution for the "seq_item_description" column as a histogram.</span></p>
<div class="highlight">
<pre><span class="n">train</span><span class="o">.</span><span class="n">seq_item_description</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</pre>
</div>
<p><span><img alt="sequence lengths histogram" src="https://gitcdn.link/cdn/Tony607/blog_statics/master/images/kaggle_price/length_hist.png"/></span></p>
<p>Let's pick 60 for the max sequence length, since it covers the majority of sequences.</p>
<p>In the code below, we are using <span><span> </span>Keras' sequence processing <code>pad_sequences()</code> method to pad sequences to be the same length for each column.</span></p>
<div class="highlight">
<pre><span class="c1">#KERAS DATA DEFINITION</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="k">def</span> <span class="nf">get_keras_data</span><span class="p">(</span><span class="n">dataset</span><span class="p">):</span>
<span class="n">X</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'name'</span><span class="p">:</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">seq_name</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">MAX_NAME_SEQ</span><span class="p">)</span>
<span class="p">,</span><span class="s">'item_desc'</span><span class="p">:</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">seq_item_description</span>
<span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">MAX_ITEM_DESC_SEQ</span><span class="p">)</span>
<span class="p">,</span><span class="s">'brand'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">brand</span><span class="p">)</span>
<span class="p">,</span><span class="s">'category'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">category</span><span class="p">)</span>
<span class="p">,</span><span class="s">'category_name'</span><span class="p">:</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">seq_category_name</span>
<span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">MAX_CATEGORY_NAME_SEQ</span><span class="p">)</span>
<span class="p">,</span><span class="s">'item_condition'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">item_condition_id</span><span class="p">)</span>
<span class="p">,</span><span class="s">'shipping'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dataset</span><span class="p">[[</span><span class="s">"shipping"</span><span class="p">]])</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">X</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">get_keras_data</span><span class="p">(</span><span class="n">dtrain</span><span class="p">)</span>
<span class="n">X_valid</span> <span class="o">=</span> <span class="n">get_keras_data</span><span class="p">(</span><span class="n">dvalid</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">get_keras_data</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>
</pre>
</div>
<h1>Build the model</h1>
<p>This will be a multi-input model with the following inputs:</p>
<ul>
<li>'<strong>name</strong>': texts converted to <strong>sequences</strong></li>
<li>'<strong>item_desc</strong>': <span>texts converted to </span><strong>sequences</strong></li>
<li>'<strong>brand</strong>': texts converted to <strong>integers</strong></li>
<li>'<strong>category</strong>': <span>texts converted to </span><strong>integers</strong></li>
<li>'<strong>category_name</strong>': <span>texts converted to </span><strong>sequences</strong></li>
<li>'<strong>item_condition</strong>': <strong>integers</strong></li>
<li>'<strong>shipping</strong>': <strong>integers</strong> 1 or 0</li>
</ul>
<p>All inputs except “shipping” will first go through an embedding layer.</p>
<p>For those sequence inputs, we feed them to <code>Embedding</code> layers. <code>Embedding</code> layers turn <span>integer indices (which stand for specific words) into dense vectors. An Embedding layer takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It's effectively a dictionary lookup.</span></p>
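<p>For intuition, the shapes involved look like this (the sizes here are arbitrary):</p>
<pre>from keras.layers import Embedding
# 1000 possible integer indices, each mapped to an 8-dimensional vector;
# input shape (samples, sequence_length) -> output shape (samples, sequence_length, 8)
emb = Embedding(input_dim=1000, output_dim=8)</pre>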
<p><span>The embedded sequences are then fed to <code>GRU</code> layers; like other types of recurrent networks, a GRU is good at learning patterns in sequences of data.</span></p>
<p><span>The embedding outputs for non-sequential inputs are simply flattened to 2 dimensions by the <code>Flatten</code> layer.</span></p>
<p>All layers, including “shipping”, are then concatenated into one big two-dimensional tensor.</p>
<p>This is followed by several Dense layers; the final output Dense layer uses a “<strong>linear</strong>” activation to regress to arbitrary price values, which is the same as specifying <strong>None</strong> for the <code>activation</code> parameter.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">Dense</span><span class="p">,</span> \
<span class="n">concatenate</span><span class="p">,</span> <span class="n">GRU</span><span class="p">,</span> <span class="n">Embedding</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">optimizers</span>
<span class="k">def</span> <span class="nf">get_model</span><span class="p">():</span>
<span class="c1">#Inputs</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">name</span><span class="o">=</span><span class="s">"name"</span><span class="p">)</span>
<span class="n">item_desc</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"item_desc"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">name</span><span class="o">=</span><span class="s">"item_desc"</span><span class="p">)</span>
<span class="n">brand</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">"brand"</span><span class="p">)</span>
<span class="n">category</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">"category"</span><span class="p">)</span>
<span class="n">category_name</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"category_name"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span>
<span class="n">name</span><span class="o">=</span><span class="s">"category_name"</span><span class="p">)</span>
<span class="n">item_condition</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">"item_condition"</span><span class="p">)</span>
<span class="n">shipping</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="n">X_train</span><span class="p">[</span><span class="s">"shipping"</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">name</span><span class="o">=</span><span class="s">"shipping"</span><span class="p">)</span>
<span class="c1">#Embeddings layers</span>
<span class="n">emb_size</span> <span class="o">=</span> <span class="mi">60</span>
<span class="n">emb_name</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_TEXT</span><span class="p">,</span> <span class="n">emb_size</span><span class="o">//</span><span class="mi">3</span><span class="p">)(</span><span class="n">name</span><span class="p">)</span>
<span class="n">emb_item_desc</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_TEXT</span><span class="p">,</span> <span class="n">emb_size</span><span class="p">)(</span><span class="n">item_desc</span><span class="p">)</span>
<span class="n">emb_category_name</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_TEXT</span><span class="p">,</span> <span class="n">emb_size</span><span class="o">//</span><span class="mi">3</span><span class="p">)(</span><span class="n">category_name</span><span class="p">)</span>
<span class="n">emb_brand</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_BRAND</span><span class="p">,</span> <span class="mi">10</span><span class="p">)(</span><span class="n">brand</span><span class="p">)</span>
<span class="n">emb_category</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_CATEGORY</span><span class="p">,</span> <span class="mi">10</span><span class="p">)(</span><span class="n">category</span><span class="p">)</span>
<span class="n">emb_item_condition</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">MAX_CONDITION</span><span class="p">,</span> <span class="mi">5</span><span class="p">)(</span><span class="n">item_condition</span><span class="p">)</span>
<span class="n">rnn_layer1</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span> <span class="p">(</span><span class="n">emb_item_desc</span><span class="p">)</span>
<span class="n">rnn_layer2</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">emb_category_name</span><span class="p">)</span>
<span class="n">rnn_layer3</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span> <span class="p">(</span><span class="n">emb_name</span><span class="p">)</span>
<span class="c1">#main layer</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">concatenate</span><span class="p">([</span>
<span class="n">Flatten</span><span class="p">()</span> <span class="p">(</span><span class="n">emb_brand</span><span class="p">)</span>
<span class="p">,</span> <span class="n">Flatten</span><span class="p">()</span> <span class="p">(</span><span class="n">emb_category</span><span class="p">)</span>
<span class="p">,</span> <span class="n">Flatten</span><span class="p">()</span> <span class="p">(</span><span class="n">emb_item_condition</span><span class="p">)</span>
<span class="p">,</span> <span class="n">rnn_layer1</span>
<span class="p">,</span> <span class="n">rnn_layer2</span>
<span class="p">,</span> <span class="n">rnn_layer3</span>
<span class="p">,</span> <span class="n">shipping</span>
<span class="p">])</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dr</span><span class="p">)(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">))</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dr</span><span class="p">)(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">))</span>
<span class="n">main_l</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">(</span><span class="n">dr</span><span class="p">)(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">))</span>
<span class="c1">#output</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="n">activation</span><span class="o">=</span><span class="s">"linear"</span><span class="p">)</span> <span class="p">(</span><span class="n">main_l</span><span class="p">)</span>
<span class="c1">#model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">([</span><span class="n">name</span><span class="p">,</span> <span class="n">item_desc</span><span class="p">,</span> <span class="n">brand</span>
<span class="p">,</span> <span class="n">category</span><span class="p">,</span> <span class="n">category_name</span>
<span class="p">,</span> <span class="n">item_condition</span><span class="p">,</span> <span class="n">shipping</span><span class="p">],</span> <span class="n">output</span><span class="p">)</span>
<span class="c1">#optimizer = optimizers.RMSprop()</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optimizers</span><span class="o">.</span><span class="n">Adam</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">"mse"</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="n">optimizer</span><span class="p">)</span>
<span class="k">return</span> <span class="n">model</span>
</pre>
</div>
<p>I like visualization, so I plot the model structure as well.</p>
<p></p>
<p><img alt="model plot" src="https://gitcdn.xyz/repo/Tony607/blog_statics/master/images/kaggle_price/model.png"/></p>
<p>It can be done with the two lines of code below if you are curious.</p>
<p>You need to install the <a href="http://www.graphviz.org/download/">Graphviz executable</a> and pip install the <strong>graphviz</strong> and <strong>pydot</strong> packages before trying to plot.</p>
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">keras.utils</span> <span class="kn">import</span> <span class="n">plot_model</span>
<span class="n">plot_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">to_file</span><span class="o">=</span><span class="s">'model.png'</span><span class="p">,</span> <span class="n">show_shapes</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre>
</div>
<p>Training the model is easy; let's train it for 3 epochs. <code>X_train</code> is the dictionary we created earlier, mapping input names to Numpy arrays.</p>
<div class="highlight">
<pre><span class="n">epochs</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">512</span> <span class="o">*</span> <span class="mi">3</span>
<span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">dtrain</span><span class="o">.</span><span class="n">target</span>
<span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">epochs</span>
<span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span>
<span class="p">,</span> <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.01</span>
<span class="p">)</span>
</pre>
</div>
<h1>Evaluate the model</h1>
<p>The Kaggle challenge page has chosen "Root Mean Squared Logarithmic Error" (RMSLE) as the evaluation metric.</p>
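<p>For reference, with predicted prices p_i, actual prices a_i, and n samples, the metric is:</p>
<pre>RMSLE = sqrt( (1/n) * sum_i ( log(p_i + 1) - log(a_i + 1) )^2 )</pre>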
<p>The following code will take our trained model and compute the loss value given the validation data.</p>
<div class="highlight">
<pre><span class="k">def</span> <span class="nf">rmsle</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">):</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">y_pred</span><span class="p">)</span>
<span class="n">to_sum</span> <span class="o">=</span> <span class="p">[(</span><span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">y_pred</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span> <span class="o">**</span> <span class="mf">2.0</span> \
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">pred</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">y_pred</span><span class="p">)]</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">to_sum</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)))</span> <span class="o">**</span> <span class="mf">0.5</span>
<span class="k">def</span> <span class="nf">eval_model</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="n">val_preds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">)</span>
<span class="n">val_preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expm1</span><span class="p">(</span><span class="n">val_preds</span><span class="p">)</span>
<span class="n">y_true</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">dvalid</span><span class="o">.</span><span class="n">price</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">val_preds</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">v_rmsle</span> <span class="o">=</span> <span class="n">rmsle</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">" RMSLE error on dev test: "</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">v_rmsle</span><span class="p">))</span>
<span class="k">return</span> <span class="n">v_rmsle</span>
<span class="n">v_rmsle</span> <span class="o">=</span> <span class="n">eval_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</pre>
</div>
<h1>Generate file for submission</h1>
<p>If you are planning to generate actual prices for the test dataset and try your luck on Kaggle, this block of code will reverse the feature normalization process we discussed previously and write the prices to a CSV file.</p>
<div class="highlight">
<pre><span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">BATCH_SIZE</span><span class="p">)</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expm1</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span>
<span class="n">submission</span> <span class="o">=</span> <span class="n">test</span><span class="p">[[</span><span class="s">"test_id"</span><span class="p">]][:</span><span class="n">test_len</span><span class="p">]</span>
<span class="n">submission</span><span class="p">[</span><span class="s">"price"</span><span class="p">]</span> <span class="o">=</span> <span class="n">preds</span><span class="p">[:</span><span class="n">test_len</span><span class="p">]</span>
<span class="n">submission</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"./submission.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</pre>
</div>
<h1>Summary</h1>
<p>We walked through how to predict prices given multiple input features: how to preprocess the text data, deal with missing data, and finally build, train, and evaluate the model.</p>
<p>Full source code posted on my <a href="https://github.com/Tony607/Pricing_Challenge">GitHub</a>.</p>
<p></p>How to do multi-class multi-label classification for news categories2017-11-18T07:03:47+00:002024-03-19T07:42:28+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/how-to-do-multi-class-multi-label-classification-for-news-categories/<p><img alt="classify" src="https://www.dlology.com/static/media/uploads/news_categories/classified.jpg"/></p>
<p>My <a href="https://www.dlology.com/blog/how-to-choose-last-layer-activation-and-loss-function/">previous post</a> shows how to choose the last-layer activation and loss function for different tasks. In this post, we focus on multi-class multi-label classification.</p>
<h2>Overview of the task</h2>
<p>We are going to use the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/">Reuters-21578</a> news dataset. Given a news article, our task is to assign it one or more tags. The dataset is divided into five main categories:</p>
<ul>
<li>Topics</li>
<li>Places</li>
<li>People</li>
<li>Organizations</li>
<li>Exchanges</li>
</ul>
<p>For example, one given news article could have these three tags, belonging to two categories:</p>
<ul>
<li>Places: <strong>USA</strong>, <strong>China</strong></li>
<li>Topics: <strong>trade</strong></li>
</ul>
<h2>Structure of the code</h2>
<ul>
<li>
<h3>Prepare documents and categories</h3>
<ol>
<li>Read the category files to acquire all 672 available tags from those 5 categories.</li>
<li>Read all the news files and find the 20 most common tags out of the 672; those are the ones we will use for classification (a counting sketch follows the table). Here is the list of those 20 tags. Each one is prefixed with its category for clarity. For instance, <strong>"pl_usa"</strong> means tag <strong>"Places: USA"</strong>,<strong> "to_trade"</strong> is<strong> "Topics: trade"</strong>, etc.
<table border="1" class="dataframe">
<thead>
<tr>
<th>Name</th>
<th>Type</th>
<th>Newslines</th>
</tr>
</thead>
<tbody>
<tr>
<th>619</th>
<td>pl_usa</td>
<td>Places</td>
<td>12542</td>
</tr>
<tr>
<th>35</th>
<td>to_earn</td>
<td>Topics</td>
<td>3987</td>
</tr>
<tr>
<th>0</th>
<td>to_acq</td>
<td>Topics</td>
<td>2448</td>
</tr>
<tr>
<th>616</th>
<td>pl_uk</td>
<td>Places</td>
<td>1489</td>
</tr>
<tr>
<th>542</th>
<td>pl_japan</td>
<td>Places</td>
<td>1138</td>
</tr>
<tr>
<th>489</th>
<td>pl_canada</td>
<td>Places</td>
<td>1104</td>
</tr>
<tr>
<th>73</th>
<td>to_money-fx</td>
<td>Topics</td>
<td>801</td>
</tr>
<tr>
<th>28</th>
<td>to_crude</td>
<td>Topics</td>
<td>634</td>
</tr>
<tr>
<th>45</th>
<td>to_grain</td>
<td>Topics</td>
<td>628</td>
</tr>
<tr>
<th>625</th>
<td>pl_west-germany</td>
<td>Places</td>
<td>567</td>
</tr>
<tr>
<th>126</th>
<td>to_trade</td>
<td>Topics</td>
<td>552</td>
</tr>
<tr>
<th>55</th>
<td>to_interest</td>
<td>Topics</td>
<td>513</td>
</tr>
<tr>
<th>514</th>
<td>pl_france</td>
<td>Places</td>
<td>469</td>
</tr>
<tr>
<th>412</th>
<td>or_ec</td>
<td>Organizations</td>
<td>349</td>
</tr>
<tr>
<th>481</th>
<td>pl_brazil</td>
<td>Places</td>
<td>332</td>
</tr>
<tr>
<th>130</th>
<td>to_wheat</td>
<td>Topics</td>
<td>306</td>
</tr>
<tr>
<th>108</th>
<td>to_ship</td>
<td>Topics</td>
<td>305</td>
</tr>
<tr>
<th>468</th>
<td>pl_australia</td>
<td>Places</td>
<td>270</td>
</tr>
<tr>
<th>19</th>
<td>to_corn</td>
<td>Topics</td>
<td>254</td>
</tr>
<tr>
<th>495</th>
<td>pl_china</td>
<td>Places</td>
<td>223</td>
</tr>
</tbody>
</table>
</li>
</ol>
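<p>As mentioned above, here is a minimal sketch of that counting step; <code>news_tags</code> (one list of tags per news article) is a hypothetical name, not necessarily the one used in the notebook:</p>
<div class="highlight">
<pre>from collections import Counter

# news_tags: hypothetical, e.g. [["pl_usa", "to_trade"], ["to_earn"], ...]
tag_counter = Counter(tag for tags in news_tags for tag in tags)
selected_categories = [tag for tag, _ in tag_counter.most_common(20)]
</pre>
</div>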
</li>
<li>
<h3>Clean up the data for model</h3>
</li>
</ul>
<p>In the previous step, we read the news contents and stored them in a list.</p>
<p>One news article looks like this:</p>
<pre>average yen cd rates fall in latest week
tokyo, feb 27 - average interest rates on yen certificates
of deposit, cd, fell to 4.27 pct in the week ended february 25
from 4.32 pct the previous week, the bank of japan said.
new rates (previous in brackets), were -
average cd rates all banks 4.27 pct (4.32)
money market certificate, mmc, ceiling rates for the week
starting from march 2 3.52 pct (3.57)
average cd rates of city, trust and long-term banks
less than 60 days 4.33 pct (4.32)
60-90 days 4.13 pct (4.37)
average cd rates of city, trust and long-term banks
90-120 days 4.35 pct (4.30)
120-150 days 4.38 pct (4.29)
150-180 days unquoted (unquoted)
180-270 days 3.67 pct (unquoted)
over 270 days 4.01 pct (unquoted)
average yen bankers' acceptance rates of city, trust and
long-term banks
30 to less than 60 days unquoted (4.13)
60-90 days unquoted (unquoted)
90-120 days unquoted (unquoted)
reuter</pre>
<p>We start the cleanup by (see the sketch after this list):</p>
<ul>
<li>keeping only characters in A-Za-z0-9</li>
<li>removing stop words (words like "in", "on", "from" that don't carry much specific information)</li>
<li>lemmatizing (e.g. turning the word "rates" into "rate")</li>
</ul>
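<p>A minimal sketch of these three steps, assuming NLTK's stopword list and WordNet lemmatizer (the notebook's exact implementation may differ):</p>
<div class="highlight">
<pre>import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the NLTK "stopwords", "wordnet" and "punkt" data packages
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_news(text):
    # Keep only A-Za-z0-9 and spaces; newlines are deleted rather than
    # replaced, which is why some words in the example below appear fused
    text = re.sub(r"[^A-Za-z0-9 ]+", "", text.lower())
    # Remove stop words such as "in", "on", "from"
    words = [w for w in word_tokenize(text) if w not in stop_words]
    # Lemmatize, e.g. "rates" becomes "rate"
    return " ".join(lemmatizer.lemmatize(w) for w in words)
</pre>
</div>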
<p>After this, our news text looks much "friendlier" to our model; each word is separated by a space.</p>
<pre>average yen cd rate fall latest week tokyo feb 27 average interest rate yen certificatesof deposit cd fell 427 pct week ended february 25from 432 pct previous week bank japan said new rate previous bracket average cd rate bank 427 pct 432 money market certificate mmc ceiling rate weekstarting march 2 352 pct 357 average cd rate city trust longterm bank le 60 day 433 pct 432 6090 day 413 pct 437 average cd rate city trust longterm bank 90120 day 435 pct 430 120150 day 438 pct 429 150180 day unquoted unquoted 180270 day 367 pct unquoted 270 day 401 pct unquoted average yen banker acceptance rate city trust andlongterm bank 30 le 60 day unquoted 413 6090 day unquoted unquoted 90120 day unquoted unquoted reuter</pre>
<p>Since a small portion of the news articles are quite long even after the cleanup, let's limit the maximum input sequence to 88 words; this covers 70% of all news in full length. We could have set a larger input sequence limit to cover more news, but that would also increase the model training time.</p>
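<p>A quick way to verify that coverage (a sketch, assuming <code>totalX</code> holds the cleaned news strings at this point):</p>
<div class="highlight">
<pre>import numpy as np

# Word counts per article; per the paragraph above, the 70th percentile
# lands near 88 words on this dataset
lengths = np.array([len(news.split()) for news in totalX])
print(np.percentile(lengths, 70))
maxLength = 88
</pre>
</div>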
<p>Lastly, we will turn the words into ids and pad each sequence to the input limit (88) if it is shorter.</p>
<p>Keras text processing makes this trivial.</p>
<div class="highlight">
<pre><span></span><span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="n">max_vocab_size</span> <span class="o">=</span> <span class="mi">200000</span>
<span class="n">input_tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">max_vocab_size</span><span class="p">)</span>
<span class="n">input_tokenizer</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">totalX</span><span class="p">)</span>
<span class="n">input_vocab_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">word_index</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">print</span><span class="p">(</span><span class="s2">"input_vocab_size:"</span><span class="p">,</span><span class="n">input_vocab_size</span><span class="p">)</span> <span class="c1"># input_vocab_size: 167135</span>
<span class="n">totalX</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">pad_sequences</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">totalX</span><span class="p">),</span>
<span class="n">maxlen</span><span class="o">=</span><span class="n">maxLength</span><span class="p">))</span>
</pre>
</div>
<p>The same news article now looks like this; each number represents a unique word in the vocabulary.</p>
<pre>array([ 6943, 5, 5525, 177, 22, 699, 13146, 1620, 32,
35130, 7, 130, 6482, 5, 8473, 301, 1764, 32,
364, 458, 794, 11, 442, 546, 131, 7180, 5,
5525, 18247, 131, 7451, 5, 8088, 301, 1764, 32,
364, 458, 794, 11, 21414, 131, 7452, 5, 4009,
35131, 131, 4864, 5, 6712, 35132, 131, 3530, 3530,
26347, 131, 5526, 5, 3530, 2965, 131, 7181, 5,
3530, 301, 149, 312, 1922, 32, 364, 458, 9332,
11, 76, 442, 546, 131, 3530, 7451, 18247, 131,
3530, 3530, 21414, 131, 3530, 3530, 3])</pre>
<p></p>
<ul>
<li>
<h3>Create and train model</h3>
</li>
</ul>
<ul>
<li>An Embedding layer turns each word id into a vector of size 256</li>
<li>GRU layers (recurrent layers) process the sequence data</li>
<li>A Dense layer outputs the classification result over the 20 tags; since one article can carry several tags at once, it uses a per-tag sigmoid with binary cross-entropy rather than softmax</li>
</ul>
<div class="highlight">
<pre><span></span><span class="n">embedding_dim</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Embedding</span><span class="p">(</span><span class="n">input_vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span><span class="n">input_length</span> <span class="o">=</span> <span class="n">maxLength</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="n">num_categories</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'sigmoid'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s1">'binary_crossentropy'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="s1">'adam'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">])</span>
<span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">totalX</span><span class="p">,</span> <span class="n">totalY</span><span class="p">,</span> <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre>
</div>
<ul>
<li>
<h3>Visualize the training performance</h3>
</li>
</ul>
<p>After training our model for 10 epochs in about 5 minutes, we have achieved the following result.</p>
<pre>loss: 0.1062 - acc: 0.9650 - val_loss: 0.0961 - val_acc: 0.9690</pre>
<p>The following code will generate a nice graph to visualize the progress of each training epochs.</p>
<div class="highlight">
<pre><span></span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="n">acc</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'acc'</span><span class="p">]</span>
<span class="n">val_acc</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'val_acc'</span><span class="p">]</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'loss'</span><span class="p">]</span>
<span class="n">val_loss</span> <span class="o">=</span> <span class="n">history</span><span class="o">.</span><span class="n">history</span><span class="p">[</span><span class="s1">'val_loss'</span><span class="p">]</span>
<span class="n">epochs</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">acc</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">acc</span><span class="p">,</span> <span class="s1">'bo'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Training acc'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">val_acc</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Validation acc'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Training and validation accuracy'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="s1">'bo'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Training loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">val_loss</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Validation loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Training and validation loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre>
</div>
<p><img alt="visualize" src="https://www.dlology.com/static/media/uploads/news_categories/visualize_training.png"/></p>
<ul>
<li>
<h3>Make a prediction</h3>
</li>
</ul>
<p>Feed one cleaned-up news article (each word separated by a space) to the same input tokenizer to turn it into ids.</p>
<p>Call the model's <strong>predict</strong> method; the output will be a list of 20 floats representing the probabilities for those 20 tags. For demo purposes, let's take any tag with a probability larger than 0.2.</p>
<div class="highlight">
<pre><span></span><span class="n">textArray</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">pad_sequences</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">([</span><span class="n">input_x_220</span><span class="p">]),</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">maxLength</span><span class="p">))</span>
<span class="n">predicted</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">textArray</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">prob</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">predicted</span><span class="p">):</span>
<span class="k">if</span> <span class="n">prob</span> <span class="o">></span> <span class="mf">0.2</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">selected_categories</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
</pre>
</div>
<p>This produces three tags</p>
<pre>pl_uk
pl_japan
to_money-fx</pre>
<p>The ground truth is</p>
<pre>pl_japan
to_money-fx
to_interest</pre>
<p><span>The model got 2 out of 3 right for the given news.</span></p>
<h2><span>Summary</span></h2>
<p>We started by cleaning up the raw news data for model input, built a Keras model to do multi-class multi-label classification, and visualized the training result before making a prediction. Further improvements could be made:</p>
<ul>
<li>Clean up the data better</li>
<li>Use a longer input sequence limit</li>
<li>Train for more epochs</li>
</ul>
<p>The source code for the <a href="https://github.com/Tony607/Text_multi-class_multi-label_Classification">jupyter notebook is available on my GitHub</a> repo if you are interested.</p>How to Summarize Amazon Reviews with Tensorflow2017-10-14T12:56:49+00:002024-03-14T14:43:42+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/tutorial-summarizing-text-with-amazon-reviews/<div><img alt="reviews" src="https://gitcdn.link/cdn/Tony607/blog_statics/4d40dcea14a9ec03c4453219776ca258afc27f73/images/reviews.jpg"/></div>
<p>The objective of this project is to build a model that can create relevant summaries for reviews written about fine foods sold on Amazon. This dataset contains over 500,000 reviews and is hosted on <a href="https://www.kaggle.com/snap/amazon-fine-food-reviews">Kaggle</a>.</p>
<p>Here are two examples to show what the data looks like:</p>
<pre>Review # 1
<strong>Good Quality Dog Food</strong>
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
Review # 2
<strong>Not as Advertised</strong>
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".</pre>
<p>To build our model, we will use a two-layer bidirectional RNN with LSTMs on the input data, and two LSTM layers with Bahdanau attention on the target data.</p>
<p>The sections of this project are:<br/>1. Inspecting the Data<br/>2. Preparing the Data<br/>3. Building the Model<br/>4. Training the Model<br/>5. Making Our Own Summaries</p>
<p>This project was inspired by the post <a href="https://medium.com/towards-data-science/text-summarization-with-amazon-reviews-41801c2210b">Text Summarization with Amazon Reviews</a>, with a few improvements and updates to work with the latest TensorFlow version 1.3; those improvements yield better accuracy.</p>
<h2>Summary of improvements</h2>
<h3>1. Tokenize the sentence better</h3>
<p>The original code tokenizes the words with <strong>text.split()</strong>, which is not foolproof;</p>
<p>e.g. with words followed by punctuation, "Are you kidding?I think you are." would be incorrectly tokenized as</p>
<p>['Are', 'you', 'kidding?I', 'think', 'you', 'are.']</p>
<p>We use this line instead</p>
<div class="highlight">
<pre><span></span><span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s2">"[\w']+"</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</pre>
</div>
<p>which will correctly generate words list</p>
<p>['Are', 'you', 'kidding', 'I', 'think', 'you', 'are']</p>
<p></p>
<h3>2. Increased data preparation filter and sort speed</h3>
<p>The original author uses two for loops to sort and filter the data; here we use Python's built-in <strong>sort</strong> and <strong>filter</strong> functions to do the same thing much faster.</p>
<p><strong>Filter</strong> for the length limits and the number of <code><UNK></code>s.</p>
<p><strong>Sort</strong> the summaries and texts by the length of the element in <strong>texts</strong>, from shortest to longest.</p>
<div class="highlight">
<pre><span></span><span class="n">max_text_length</span> <span class="o">=</span> <span class="mi">83</span> <span class="c1"># This will cover up to 89.5% of lengths</span>
<span class="n">max_summary_length</span> <span class="o">=</span> <span class="mi">13</span> <span class="c1"># This will cover up to 99% of lengths</span>
<span class="n">min_length</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">unk_text_limit</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># text can contain up to 1 UNK word</span>
<span class="n">unk_summary_limit</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># Summary should not contain any UNK word</span>
<span class="k">def</span> <span class="nf">filter_condition</span><span class="p">(</span><span class="n">item</span><span class="p">):</span>
<span class="n">int_summary</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">int_text</span> <span class="o">=</span> <span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">if</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">int_summary</span><span class="p">)</span> <span class="o">>=</span> <span class="n">min_length</span> <span class="ow">and</span>
<span class="nb">len</span><span class="p">(</span><span class="n">int_summary</span><span class="p">)</span> <span class="o"><=</span> <span class="n">max_summary_length</span> <span class="ow">and</span>
<span class="nb">len</span><span class="p">(</span><span class="n">int_text</span><span class="p">)</span> <span class="o">>=</span> <span class="n">min_length</span> <span class="ow">and</span>
<span class="nb">len</span><span class="p">(</span><span class="n">int_text</span><span class="p">)</span> <span class="o"><=</span> <span class="n">max_text_length</span> <span class="ow">and</span>
<span class="n">unk_counter</span><span class="p">(</span><span class="n">int_summary</span><span class="p">)</span> <span class="o"><=</span> <span class="n">unk_summary_limit</span> <span class="ow">and</span>
<span class="n">unk_counter</span><span class="p">(</span><span class="n">int_text</span><span class="p">)</span> <span class="o"><=</span> <span class="n">unk_text_limit</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">False</span>
<span class="n">int_text_summaries</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">int_summaries</span> <span class="p">,</span> <span class="n">int_texts</span><span class="p">))</span>
<span class="n">int_text_summaries_filtered</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">filter</span><span class="p">(</span><span class="n">filter_condition</span><span class="p">,</span> <span class="n">int_text_summaries</span><span class="p">))</span>
<span class="n">sorted_int_text_summaries</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">int_text_summaries_filtered</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">item</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
<span class="n">sorted_int_text_summaries</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">sorted_int_text_summaries</span><span class="p">))</span>
<span class="n">sorted_summaries</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">sorted_int_text_summaries</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">sorted_texts</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">sorted_int_text_summaries</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="c1"># Delete those temporary varaibles</span>
<span class="k">del</span> <span class="n">int_text_summaries</span><span class="p">,</span> <span class="n">sorted_int_text_summaries</span><span class="p">,</span> <span class="n">int_text_summaries_filtered</span>
<span class="c1"># Compare lengths to ensure they match</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">sorted_summaries</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">sorted_texts</span><span class="p">))</span>
</pre>
</div>
<h3><span lang="EN-US">3. Connects RNN layers in encoder</span></h3>
<p>The original code is missing the line below, which is how we connect layers: by feeding the current layer's output to the next layer's input. Without it, the original code behaves like a single bidirectional RNN layer in the encoder.</p>
<div class="highlight">
<pre><span></span><span class="n">rnn_inputs</span> <span class="o">=</span> <span class="n">enc_output</span></pre>
</div>
<div class="highlight">
<pre><span></span><span class="k">def</span> <span class="nf">encoding_layer</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span> <span class="n">sequence_length</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">,</span> <span class="n">rnn_inputs</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">):</span>
<span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s1">'encoder_{}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">layer</span><span class="p">)):</span>
<span class="n">cell_fw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span>
<span class="n">initializer</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">random_uniform_initializer</span><span class="p">(</span><span class="o">-</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="n">cell_fw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">DropoutWrapper</span><span class="p">(</span><span class="n">cell_fw</span><span class="p">,</span>
<span class="n">input_keep_prob</span> <span class="o">=</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="n">cell_bw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span>
<span class="n">initializer</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">random_uniform_initializer</span><span class="p">(</span><span class="o">-</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="n">cell_bw</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">DropoutWrapper</span><span class="p">(</span><span class="n">cell_bw</span><span class="p">,</span>
<span class="n">input_keep_prob</span> <span class="o">=</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="n">enc_output</span><span class="p">,</span> <span class="n">enc_state</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">bidirectional_dynamic_rnn</span><span class="p">(</span><span class="n">cell_fw</span><span class="p">,</span>
<span class="n">cell_bw</span><span class="p">,</span>
<span class="n">rnn_inputs</span><span class="p">,</span>
<span class="n">sequence_length</span><span class="p">,</span>
<span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">enc_output</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">enc_output</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="c1"># original code is missing this line below, that is how we connect layers </span>
<span class="c1"># by feeding the current layer's output to next layer's input</span>
<span class="n">rnn_inputs</span> <span class="o">=</span> <span class="n">enc_output</span>
<span class="k">return</span> <span class="n">enc_output</span><span class="p">,</span> <span class="n">enc_state</span>
</pre>
</div>
<h3>4. Decoding layers use<strong> MultiRNNCell</strong></h3>
<p>The original author uses a for loop to connect <code>num_layers</code> LSTMCells together; here we use <strong>MultiRNNCell</strong>, which is composed sequentially of multiple simple cells (<strong>BasicLSTMCell</strong>), to simplify the code.</p>
<div class="highlight">
<pre><span></span><span class="k">def</span> <span class="nf">lstm_cell</span><span class="p">(</span><span class="n">lstm_size</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">):</span>
<span class="n">cell</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">BasicLSTMCell</span><span class="p">(</span><span class="n">lstm_size</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">DropoutWrapper</span><span class="p">(</span><span class="n">cell</span><span class="p">,</span> <span class="n">input_keep_prob</span> <span class="o">=</span> <span class="n">keep_prob</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">decoding_layer</span><span class="p">(</span><span class="n">dec_embed_input</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">,</span> <span class="n">enc_output</span><span class="p">,</span> <span class="n">enc_state</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">text_length</span><span class="p">,</span> <span class="n">summary_length</span><span class="p">,</span>
<span class="n">max_summary_length</span><span class="p">,</span> <span class="n">rnn_size</span><span class="p">,</span> <span class="n">vocab_to_int</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">):</span>
<span class="sd">'''Create the decoding cell and attention for the training and inference decoding layers'''</span>
<span class="n">dec_cell</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">contrib</span><span class="o">.</span><span class="n">rnn</span><span class="o">.</span><span class="n">MultiRNNCell</span><span class="p">([</span><span class="n">lstm_cell</span><span class="p">(</span><span class="n">rnn_size</span><span class="p">,</span> <span class="n">keep_prob</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">)])</span>
<span class="c1"># ......</span>
</pre>
</div>
<p></p>
<h2>The training result</h2>
<p>After 2 hours of training with a GPU, the loss went below 1 and settled at 0.707.</p>
<p>Here are some summaries generated with the trained model.</p>
<pre>- Review:
The coffee tasted great and was at such a good price! I highly recommend this to everyone!
- Summary:
great great coffee
- Review:
love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
- Summary:
great taste</pre>
<p>Check out the full source code on my <a href="https://github.com/Tony607/Summarizing_Text_Amazon_Reviews">GitHub</a>.</p>How to triage patient queries with Keras (1 minute training)2017-10-08T13:51:15+00:002024-03-19T08:12:55+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/tutorial-medical-triage-with-patient-query/<div><img alt="triage" src="https://gitcdn.link/repo/Tony607/blog_statics/master/images/triage.jpg"/></div>
<p>In this tutorial, we are going to build a model that triages patients based on their query text data.</p>
<p>For example</p>
<table width="299">
<tbody>
<tr>
<td width="193"><strong>query (input)</strong></td>
<td width="106"><strong>triage (output)</strong></td>
</tr>
<tr>
<td>Skin is quite itchy.</td>
<td>dermatology</td>
</tr>
<tr>
<td>Sore throat fever fatigue.</td>
<td>mouthface</td>
</tr>
<tr>
<td>Lower back hurt, so painful.</td>
<td>back</td>
</tr>
</tbody>
</table>
<p>We are going to use Keras with the TensorFlow (version 1.3.0) backend to build the model.</p>
<p>For source code and dataset used in this tutorial, check out my <a href="https://github.com/Tony607/Medical_Triage">GitHub repo</a>.</p>
<h2>Dependencies</h2>
<p>Python 3.5, numpy, pickle, keras, tensorflow, nltk, pandas</p>
<p></p>
<h2>About the data</h2>
<p>1261 patient queries, <strong>phrases_embed.csv</strong>, came from the <a href="https://blog.babylonhealth.com/how-the-chatbot-understands-sentences-fe6c5deb6e81">Babylon blog "How the chatbot understands sentences"</a>.</p>
<p>Check out the data visualization <a href="http://s3-eu-west-1.amazonaws.com/nils-demo/phrases.html">here</a>.</p>
<div><img alt="data visualization" src="https://cdn-images-1.medium.com/max/1600/1*sodiusH7tbwyPAfTfzoamw.png"/></div>
<h2>Preparing the data</h2>
<p>We will be doing the following steps to prepare data for training the model.</p>
<p>1. Read the data from the CSV file into a Pandas data frame, keeping only the two columns "Disease" and "class".</p>
<div><img alt="triage_csv" src="https://www.dlology.com/static/media/uploads/triage_csv.png"/></div>
<p>2. Convert the Pandas data frame into pairs of numpy arrays (a sketch of steps 1 and 2 follows below):</p>
<p>"Disease" columns ==> documents</p>
<p>"class" columns ==> body_positions</p>
<p>3. Clean up the data</p>
<p>For each sentence, we convert all letters to lower case, keep only English letters and numbers, and remove stop words, as shown below.</p>
<div class="highlight">
<pre><span></span><span class="n">strip_special_chars</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"[^A-Za-z0-9 ]+"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">cleanUpSentence</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">stop_words</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">"<br />"</span><span class="p">,</span> <span class="s2">" "</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="n">strip_special_chars</span><span class="p">,</span> <span class="s2">""</span><span class="p">,</span> <span class="n">r</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="k">if</span> <span class="n">stop_words</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">words</span> <span class="o">=</span> <span class="n">word_tokenize</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="n">filtered_sentence</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
<span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stop_words</span><span class="p">:</span>
<span class="n">filtered_sentence</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">w</span><span class="p">)</span>
<span class="k">return</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">filtered_sentence</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">r</span>
</pre>
</div>
<p>4. The input <strong>tokenizer</strong> converts input words to ids and pads each input sequence to the max input length if it is shorter.</p>
<p>Save the input tokenizer since we need to use the same one to tokenize any new input data during prediction.</p>
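<p>A sketch of that bookkeeping, assuming the tokenizer is pickled (the variable and file names are assumptions):</p>
<div class="highlight">
<pre>import pickle

# Persist the fitted tokenizer so new queries are tokenized identically
# at prediction time
with open('input_tokenizer.p', 'wb') as f:
    pickle.dump(input_tokenizer, f)
</pre>
</div>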
<p>5. Convert output words to ids, then to categories (one-hot vectors).</p>
<p>7. Make a <strong>target_reverse_word_index</strong> to turn the predicted class ids back into text (a sketch of steps 5 and 7 follows below).</p>
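<p>A sketch of steps 5 and 7, reusing a Keras <code>Tokenizer</code> for the labels (an assumption; the notebook may encode the labels differently):</p>
<div class="highlight">
<pre>from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

target_tokenizer = Tokenizer()
target_tokenizer.fit_on_texts(body_positions)
# Keras word ids start at 1, so shift down by 1 before one-hot encoding
target_ids = [seq[0] - 1 for seq in target_tokenizer.texts_to_sequences(body_positions)]
totalY = to_categorical(target_ids)
# Reverse index to turn a predicted class id back into triage text
target_reverse_word_index = {v - 1: k for k, v in target_tokenizer.word_index.items()}
</pre>
</div>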
<h2>Build the model</h2>
<p>The model structure will look like this</p>
<pre>_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 18, 256) 232960
_________________________________________________________________
gru_1 (GRU) (None, 18, 256) 393984
_________________________________________________________________
gru_2 (GRU) (None, 256) 393984
_________________________________________________________________
dense_1 (Dense) (None, 19) 4883
=================================================================</pre>
<p>The <strong>embedding</strong> layer transforms word ids into their corresponding word embeddings; each output from the embedding layer has a size of (18 x 256), which is the <strong>maximum input sequence padding length</strong> times the <strong>embedding dimension</strong>.</p>
<p>The data is then passed to a recurrent layer to process the input sequence; we are using <strong>GRU</strong> here, but you can also try LSTM.</p>
<p>All the intermediate outputs are collected and then passed on to the <strong>second GRU</strong> layer.</p>
<p>The output is then sent to a <strong>fully connected layer</strong> that would give us our final prediction classes. We are using "<strong>softmax</strong>" <strong>activation</strong> to give us a probability for each class.</p>
<p>Use the standard <strong>'categorical_crossentropy'</strong> loss function for multiclass classification.</p>
<p>Use the <strong>"adam"</strong> optimizer since it adapts the learning rate.</p>
<div class="highlight">
<pre><span></span><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Embedding</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span><span class="n">input_length</span> <span class="o">=</span> <span class="n">maxLength</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">GRU</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.9</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="n">output_dimen</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'softmax'</span><span class="p">))</span>
<span class="n">tbCallBack</span> <span class="o">=</span> <span class="n">TensorBoard</span><span class="p">(</span><span class="n">log_dir</span><span class="o">=</span><span class="s1">'./Graph/medical_triage'</span><span class="p">,</span> <span class="n">histogram_freq</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">write_graph</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">write_images</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s1">'categorical_crossentropy'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="s1">'adam'</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">])</span>
</pre>
</div>
<p>We will then train the model and save it for later prediction.</p>
<h2>Predict new data</h2>
<p>1. Load the model we save earlier.</p>
<p>2. Load the input tokenizer and tokenize a new patient query text, pad the sequence to max length</p>
<p>3. Feed the sequence to model, the model will output the class id along with the probability, we use "target_reverse_word_index" to turn the class id to actual triage result text.</p>
<p>Here are some predicted result</p>
<div><img alt="triage_predict" src="https://www.dlology.com/static/media/uploads/triage/triage_predict.png"/></div>
<p></p>
<h2>Summary</h2>
<p><span>Keras trained for 40 epochs, takes less than 1 minute with GPU (GTX 1070) final </span><span>acc:0.9146</span></p>
<div><img alt="triage result" src="https://www.dlology.com/static/media/uploads/triage/triage_acc.png"/></div>
<p><span>The training data size is relatively small, having larger datasets might increase the final accuracy.</span></p>
<p><span>Check out my <a href="https://github.com/Tony607/Medical_Triage">GitHub repo</a> for the Jupyter notebook source code and dataset.</span></p>An easy guide to Chinese Sentiment analysis with hotel review data2017-09-26T04:45:50+00:002024-03-18T08:36:17+00:00Chengweihttps://www.dlology.com/blog/author/Chengwei/https://www.dlology.com/blog/tutorial-chinese-sentiment-analysis-with-hotel-review-data/<div><img alt="good_bad" src="https://gitcdn.link/repo/Tony607/blog_statics/master/images/good_bad.jpg"/></div>
<h6><span>For source code and dataset used in this tutorial, check out my </span><a href="https://github.com/Tony607/Chinese_sentiment_analysis"><g class="gr_ gr_39 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="39" id="39">github</g> repo</a><span>.</span></h6>
<h2>Dependencies<a class="anchor-link" href="http://localhost:8888/notebooks/chinese_sentiment_analysis.ipynb#Dependencies"></a></h2>
<p>Python 3.5, numpy, pickle, keras, <g class="gr_ gr_42 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="42" id="42">tensorflow</g>,<span> </span><a href="https://github.com/fxsjy/jieba" target="_blank"><g class="gr_ gr_43 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="43" id="43">jieba</g></a></p>
<p></p>
<h2>About the data</h2>
<p>Customer hotel reviews, including</p>
<p>2916 positive reviews and 3000 negative reviews</p>
<h3>Optional for plotting<a class="anchor-link" href="http://localhost:8888/notebooks/chinese_sentiment_analysis.ipynb#Optional-for-plotting"></a></h3>
<p><g class="gr_ gr_49 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling" data-gr-id="49" id="49">pylab</g>, <g class="gr_ gr_50 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling" data-gr-id="50" id="50">scipy</g></p>
<p></p>
<h2><span lang="EN-US">Key difference compared to English dataset</span></h2>
<h3><span lang="EN-US">File Encoding</span></h3>
<p><span lang="EN-US"><span>Some data files contain abnormal encoding characters which encoding GB2312 will complain about. Solution: read as bytes then decode as GB2312 line by line, skip lines with abnormal encodings. We also convert any traditional Chinese characters to simplified Chinese characters.</span></span></p>
<div class="highlight">
<pre><span></span><span class="n">documents</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">positiveFiles</span><span class="p">:</span>
<span class="n">text</span> <span class="o">=</span> <span class="s2">""</span>
<span class="k">with</span> <span class="n">codecs</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s2">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">doc_file</span><span class="p">:</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">doc_file</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s2">"GB2312"</span><span class="p">)</span>
<span class="k">except</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">text</span><span class="o">+=</span><span class="n">Converter</span><span class="p">(</span><span class="s1">'zh-hans'</span><span class="p">)</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">"</span><span class="se">\r</span><span class="s2">"</span><span class="p">,</span> <span class="s2">""</span><span class="p">)</span>
<span class="n">documents</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">text</span><span class="p">,</span> <span class="s2">"pos"</span><span class="p">))</span>
</pre>
</div>
<p></p>
<h3>Convert from traditional to simplified Chinese (繁体转简体)</h3>
<p><span>Have those two files download from</span></p>
<p><span><a href="https://github.com/skydark/nstools/blob/master/zhtools/langconv.py">langconv</a>.py</span></p>
<p><span><a href="https://github.com/skydark/nstools/blob/master/zhtools/zh_wiki.py">zh_wiki</a>.py</span></p>
<p><span>those two lines below will convert string "<strong>line"</strong> from traditional to simplified Chinese.</span></p>
<div class="highlight">
<pre><span></span><span class="kn">from</span> <span class="nn">langconv</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">Converter</span><span class="p">(</span><span class="s1">'zh-hans'</span><span class="p">)</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</pre>
</div>
<h3><span lang="EN-US">Tokenize</span></h3>
<p><span lang="EN-US">Use <a href="https://github.com/fxsjy/jieba" target="_blank"><g class="gr_ gr_47 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="47" id="47">jieba</g></a> to tokenize <g class="gr_ gr_46 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="46" id="46">chinese</g> sentences, then join the list of tokens <g class="gr_ gr_48 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="48" id="48">seperated</g> by spaces.</span></p>
<p><span lang="EN-US">We then feed the string to Keras Tokenizer which <g class="gr_ gr_51 gr-alert gr_gramm gr_inline_cards gr_run_anim Grammar multiReplace" data-gr-id="51" id="51">expect</g> each sentence with words tokens <g class="gr_ gr_41 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="41" id="41">seperated</g> by spaces.</span></p>
<div class="highlight">
<pre><span></span><span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
<span class="kn">import</span> <span class="nn">jieba</span>
<span class="n">seg_list</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">cut_all</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">seg_list</span><span class="p">)</span>
<span class="c1"># totalX = [text , .....]</span>
<span class="c1"># maxLength is the sentence words length to keep</span>
<span class="n">input_tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="mi">30000</span><span class="p">)</span>
<span class="n">input_tokenizer</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="n">totalX</span><span class="p">)</span>
<span class="n">input_vocab_size</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">word_index</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">totalX</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">pad_sequences</span><span class="p">(</span><span class="n">input_tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">totalX</span><span class="p">),</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">maxLength</span><span class="p">))</span>
</pre>
</div>
<h3><span lang="EN-US">Chinese stop words</span></h3>
<p><span>First get a list of stop words from the file <strong><g class="gr_ gr_56 gr-alert gr_gramm gr_hide gr_inline_cards gr_run_anim Style multiReplace replaceWithoutSep replaceWithoutSep" data-gr-id="56" id="56">chinese_stop_words.txt</g></strong><g class="gr_ gr_56 gr-alert gr_gramm gr_hide gr_inline_cards gr_disable_anim_appear Style multiReplace replaceWithoutSep replaceWithoutSep" data-gr-id="56" id="56"> ,</g> then check each tokenized Chinese words against this list</span></p>
<div class="highlight">
<pre><span></span><span class="n">stopwords</span> <span class="o">=</span> <span class="p">[</span> <span class="n">line</span><span class="o">.</span><span class="n">rstrip</span><span class="p">()</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'./data/chinese_stop_words.txt'</span><span class="p">,</span><span class="s2">"r"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">"utf-8"</span><span class="p">)</span> <span class="p">]</span>
<span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">documents</span><span class="p">:</span>
<span class="n">seg_list</span> <span class="o">=</span> <span class="n">jieba</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">doc</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">cut_all</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">final</span> <span class="o">=</span><span class="p">[]</span>
<span class="n">seg_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">seg_list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">seg</span> <span class="ow">in</span> <span class="n">seg_list</span><span class="p">:</span>
<span class="k">if</span> <span class="n">seg</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">:</span>
<span class="n">final</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">seg</span><span class="p">)</span>
</pre>
</div>
<h3>Result</h3>
<p><span>Keras trained for 20 epochs, takes 7 minutes 14 seconds with GPU (GTX 1070)</span></p>
<p><span>acc:0.9726</span></p>
<div><img alt="result" src="https://www.dlology.com/static/media/uploads/ch_sentiment_result.png"/></div>
<p><span></span></p>
<h4 id="Try-some-new-comments,-feel-free-to-try-your-own">Try some new comments</h4>
<div><img alt="prediction" src="https://www.dlology.com/static/media/uploads/ch_sentiment_predict.png"/></div>
<p><span> </span></p>
<p>For the Python Jupyter notebook source code and dataset, check out my <a href="https://github.com/Tony607/Chinese_sentiment_analysis"><g class="gr_ gr_40 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" data-gr-id="40" id="40">github</g> repo</a>.</p>
<p>For an updated word-level English model, <span>check out my other blog: </span><a href="https://www.dlology.com/blog/simple-stock-sentiment-analysis-with-news-data-in-keras/">Simple Stock Sentiment Analysis with news data in Keras</a><span>.</span></p>