Given a product's brand, name, category, and a sentence or two of description, can we predict its price? Could it really be that simple?
In this Kaggle competition, the "Mercari Price Suggestion Challenge", developers around the world are competing to do exactly that. The total prize pool is $100,000 (first place: $60,000, second place: $30,000, third place: $10,000).
In this post, I will walk you through building a simple model to tackle the challenge in the deep learning library Keras.
If you are new to Kaggle, in order to download the datasets, you need to register an account, totally pain-free. Once you have the account, go to the "Data" tab in the Mercari Price Suggestion Challenge.
Download all three files to your local computer, extract and save them to a folder named "input", and at the root folder create a folder named "scripts" where we will start coding.
Right now you should have your directories structured similarly to this.
./Pricing_Challenge
|-- input
| |-- test.tsv
| |-- sample_submission.csv
| `-- train.tsv
`-- scripts
First, let's take some time to understand the datasets at hand.
train.tsv, test.tsv
The files consist of a list of product listings. These files are tab-delimited.
train_id or test_id - the id of the listing
name - the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
item_condition_id - the condition of the item as provided by the seller
category_name - category of the listing
brand_name - the brand of the item
price - the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in test.tsv since that is what you will predict.
shipping - 1 if the shipping fee is paid by the seller and 0 if by the buyer
item_description - the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
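The rest of the post assumes the two TSV files have been loaded into pandas DataFrames named train and test. Here is a minimal sketch, using the directory layout above and run from the "scripts" folder:

import numpy as np
import pandas as pd

# The listings are tab-delimited, so pass sep='\t'
train = pd.read_csv('../input/train.tsv', sep='\t')
test = pd.read_csv('../input/test.tsv', sep='\t')

train.head()

The first few rows of train look like this: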
train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
---|---|---|---|---|---|---|---|
0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
For the price column, it would be problematic to feed values with wildly different ranges into a neural network. The network might be able to adapt to such heterogeneous data on its own, but it would definitely make learning more difficult. A widespread practice for dealing with such a skewed target is to normalize it; here we apply log(x + 1) to the price.
train['target'] = np.log1p(train['price'])
Let's take a look at the distribution of the new 'target' column.
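A quick way to plot it, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Histogram of the log-transformed prices
train['target'].hist(bins=50)
plt.xlabel('log1p(price)')
plt.ylabel('count')
plt.show()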
We will expand contraction pairs like the ones below; the purpose is to unify the vocabulary and make the model easier to train.
"what's" → "what is",
"who'll" → "who will",
"wouldn't" → "would not",
"you'd" → "you would",
"you're" → "you are"
Before doing this, let's count how many rows in the "item_description" column contain any of the contractions.
The most frequent ones are listed below, which is no surprise (a sketch of how to reproduce the counts follows the list).
can't - 6136
won't - 4867
that's - 4806
it's - 26444
don't - 32645
doesn't - 8520
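The counts above can be reproduced with something like the following. This is a sketch rather than the exact code from the post; the numbers will vary slightly with casing and with how missing descriptions are handled.

# Rough count of rows whose description contains each contraction
for c in ["it's", "don't", "doesn't", "can't", "won't", "that's"]:
    n = train['item_description'].str.contains(c, regex=False, na=False).sum()
    print(c, '-', n)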
Here is the code to expand contractions in both the 'item_description' and 'name' columns of the train and test datasets.
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
# more contractions pairs
...
}
for contraction in contractions:
    train['item_description'] = train['item_description'].str.replace(contraction, contractions[contraction])
    test['item_description'] = test['item_description'].str.replace(contraction, contractions[contraction])
    train['name'] = train['name'].str.replace(contraction, contractions[contraction])
    test['name'] = test['name'].str.replace(contraction, contractions[contraction])
The concept of missing values is important to understand in order to manage data successfully. If missing values are not handled properly, we may end up drawing inaccurate inferences from the data.
First, let's take a look at how much data is missing, and in which columns.
train.isnull().sum()/len(train)
From the output we can see that about 42% of the brand_name values are missing, while the "category_name" and "item_description" columns are each missing less than 1% of their values.
train_id             0.000000
name                 0.000000
item_condition_id    0.000000
category_name        0.004268
brand_name           0.426757
price                0.000000
shipping             0.000000
item_description     0.000003
dtype: float64
If the number of missing values is extremely small, an expert researcher may simply drop or omit those rows from the analysis. In statistical terms, if the affected cases make up less than 5% of the sample, dropping them is acceptable. In our case, we could drop the rows where "category_name" or "item_description" is missing (a one-line sketch follows).
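If you wanted to go that route, the drop is a one-liner; this is shown only as a sketch and is not used in the rest of the post.

# Drop the few rows with a missing category_name or item_description (not used below)
train = train.dropna(subset=['category_name', 'item_description'])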
But for simplicity, let's replace all missing text values with the string "missing".
#HANDLE MISSING VALUES
print("Handling missing values...")

def handle_missing(dataset):
    dataset.category_name.fillna(value="missing", inplace=True)
    dataset.brand_name.fillna(value="missing", inplace=True)
    dataset.item_description.fillna(value="missing", inplace=True)
    return dataset

train = handle_missing(train)
test = handle_missing(test)
print(train.shape)
print(test.shape)
Two of the text columns have a special meaning: "category_name" and "brand_name".
Different products can share the same category name or brand name, so it will be helpful to create categorical columns from them.
We will use sklearn's LabelEncoder for this purpose. After the transform, we will have two new integer-typed columns, "category" and "brand".
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(np.hstack([train.category_name, test.category_name]))
train['category'] = le.transform(train.category_name)
test['category'] = le.transform(test.category_name)
le.fit(np.hstack([train.brand_name, test.brand_name]))
train['brand'] = le.transform(train.brand_name)
test['brand'] = le.transform(test.brand_name)
For each unique word in the vocabulary, we will use one integer to represent it, so each sentence becomes a list of integers.
First, we need to gather the vocabulary from the text columns we are going to tokenize, i.e. the three columns "name", "category_name" and "item_description".
And we will use Keras' text processing Tokenizer class.
from keras.preprocessing.text import Tokenizer
raw_text = np.hstack([train.category_name.str.lower(),
train.item_description.str.lower(),
train.name.str.lower()])
# Tokenize
print("Tokenizing!")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print(" Transforming text to seq...")
train["seq_category_name"] = tok_raw.texts_to_sequences(train.category_name.str.lower())
test["seq_category_name"] = tok_raw.texts_to_sequences(test.category_name.str.lower())
train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())
The fit_on_texts() method trains the Tokenizer and builds the vocabulary's word-to-index mapping, and the texts_to_sequences() method then turns the texts into integer sequences.
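For intuition, here is a toy example; the exact integer indices depend on word frequency and order in the fitted texts.

toy = Tokenizer()
toy.fit_on_texts(["red cotton shirt", "blue shirt"])
print(toy.word_index)                                 # e.g. {'shirt': 1, 'red': 2, 'cotton': 3, 'blue': 4}
print(toy.texts_to_sequences(["blue cotton shirt"]))  # -> [[4, 3, 1]] with the index above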
The word sequences generated in the previous step have different lengths. The first layer in our network for those sequences is an Embedding layer, and each Embedding layer takes as input a 2D tensor of integers of shape (samples, sequence_length).
All sequences in a batch must have the same length since we need to pack them into a single tensor, so sequences that are shorter than others are padded with zeros and sequences that are longer are truncated.
To begin with, we need to choose the sequence_length for each of our sequence columns. If it is too long, model training will take forever; if it is too short, we risk truncating important information. It helps to visualize the sequence length distribution before making this decision.
This line of code will plot sequence length distribution for the "seq_item_description" column in a histogram.
train.seq_item_description.apply(lambda x: len(x)).hist(bins=30)
Let's pick 60 for the max sequence length since it covers the majority of the sequences.
In the code below, we are using Keras' sequence processing pad_sequences()
method to pad sequences to be the same length for each column.
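The padding code below, and the model training later on, refer to a few names that are defined outside the snippets shown so far. Here is a minimal sketch of how they might be set up; 60 for the description follows the discussion above, while the other two maximum lengths and the split size are assumed values.

from sklearn.model_selection import train_test_split

# Maximum sequence lengths per column (10 and 20 are assumed; 60 matches the histogram above)
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 60
MAX_CATEGORY_NAME_SEQ = 20

# Hold out a small validation set; dtrain and dvalid are used below and when evaluating the model
dtrain, dvalid = train_test_split(train, random_state=123, train_size=0.99)
print(dtrain.shape, dvalid.shape)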
#KERAS DATA DEFINITION
from keras.preprocessing.sequence import pad_sequences

def get_keras_data(dataset):
    X = {
        'name': pad_sequences(dataset.seq_name, maxlen=MAX_NAME_SEQ),
        'item_desc': pad_sequences(dataset.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ),
        'brand': np.array(dataset.brand),
        'category': np.array(dataset.category),
        'category_name': pad_sequences(dataset.seq_category_name, maxlen=MAX_CATEGORY_NAME_SEQ),
        'item_condition': np.array(dataset.item_condition_id),
        'shipping': np.array(dataset[["shipping"]]),
    }
    return X

X_train = get_keras_data(dtrain)
X_valid = get_keras_data(dvalid)
X_test = get_keras_data(test)
This will be a multi-input model, with one input per feature listed above.
All inputs except "shipping" first go through an Embedding layer.
For the sequence inputs, we feed them to Embedding layers. An Embedding layer turns integer indices (which stand for specific words) into dense vectors: it takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It's effectively a dictionary lookup.
Each embedded sequence is then fed into a GRU layer, which, like other recurrent networks, is good at learning patterns in sequences of data.
The embeddings of the non-sequential inputs are simply flattened to two dimensions by a Flatten layer.
All layers including the “shipping” will then be concatenated to a big two-dimensional tensor.
These are followed by several Dense layers; the final output Dense layer uses a "linear" activation so the regression can produce arbitrary price values, which is the same as passing None for the activation parameter.
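The model below also refers to a few size constants and a dropout rate that were defined outside the snippets shown so far. Here is a minimal sketch: the embedding input sizes are derived from the data, while dr is an assumed value.

# Vocabulary / category sizes for the Embedding layers (each must exceed the largest index it will see)
MAX_TEXT = len(tok_raw.word_index) + 1
MAX_BRAND = np.max([train.brand.max(), test.brand.max()]) + 1
MAX_CATEGORY = np.max([train.category.max(), test.category.max()]) + 1
MAX_CONDITION = np.max([train.item_condition_id.max(), test.item_condition_id.max()]) + 1

dr = 0.25  # dropout rate for the dense layers (assumed value)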
from keras.layers import Input, Dropout, Dense, \
concatenate, GRU, Embedding, Flatten
from keras.models import Model
from keras import optimizers
def get_model():
    #Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand = Input(shape=[1], name="brand")
    category = Input(shape=[1], name="category")
    category_name = Input(shape=[X_train["category_name"].shape[1]],
                          name="category_name")
    item_condition = Input(shape=[1], name="item_condition")
    shipping = Input(shape=[X_train["shipping"].shape[1]], name="shipping")

    #Embeddings layers
    emb_size = 60
    emb_name = Embedding(MAX_TEXT, emb_size//3)(name)
    emb_item_desc = Embedding(MAX_TEXT, emb_size)(item_desc)
    emb_category_name = Embedding(MAX_TEXT, emb_size//3)(category_name)
    emb_brand = Embedding(MAX_BRAND, 10)(brand)
    emb_category = Embedding(MAX_CATEGORY, 10)(category)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)

    #Recurrent layers over the embedded sequences
    rnn_layer1 = GRU(16)(emb_item_desc)
    rnn_layer2 = GRU(8)(emb_category_name)
    rnn_layer3 = GRU(8)(emb_name)

    #main layer: concatenate everything, including "shipping"
    main_l = concatenate([
        Flatten()(emb_brand),
        Flatten()(emb_category),
        Flatten()(emb_item_condition),
        rnn_layer1,
        rnn_layer2,
        rnn_layer3,
        shipping,
    ])
    main_l = Dropout(dr)(Dense(512, activation='relu')(main_l))
    main_l = Dropout(dr)(Dense(64, activation='relu')(main_l))
    main_l = Dropout(dr)(Dense(32, activation='relu')(main_l))

    #output
    output = Dense(1, activation="linear")(main_l)

    #model
    model = Model([name, item_desc, brand,
                   category, category_name,
                   item_condition, shipping], output)

    #optimizer = optimizers.RMSprop()
    optimizer = optimizers.Adam()
    model.compile(loss="mse", optimizer=optimizer)
    return model
I like visualization, so I plotted the model structure as well.
It can be done with the two lines of code below if you are curious.
You need to install the Graphviz executable and pip install the graphviz and pydot packages before trying to plot.
from keras.utils import plot_model
plot_model(model, to_file='model.png', show_shapes=True)
Training the model is easy; let's train it for a few epochs. X_train is the dictionary we created earlier, mapping input names to NumPy arrays (remember to build the model with get_model() first).
model = get_model()

epochs = 3
BATCH_SIZE = 512 * 3
history = model.fit(X_train, dtrain.target,
                    epochs=epochs,
                    batch_size=BATCH_SIZE,
                    validation_split=0.01)
The Kaggle challenge page uses "Root Mean Squared Logarithmic Error" (RMSLE) as the evaluation metric.
The following code takes our trained model and computes this error on the validation data.
import math

def rmsle(y, y_pred):
    assert len(y) == len(y_pred)
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0
              for i, pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0 / len(y))) ** 0.5
def eval_model(model):
    val_preds = model.predict(X_valid)
    val_preds = np.expm1(val_preds)  # undo the log1p transform
    y_true = np.array(dvalid.price.values)
    y_pred = val_preds[:, 0]
    v_rmsle = rmsle(y_true, y_pred)
    print(" RMSLE error on dev test: " + str(v_rmsle))
    return v_rmsle
v_rmsle = eval_model(model)
If you are planning to generate actual prices for the test dataset and try your luck on Kaggle, the block of code below will reverse the log transform we applied earlier and write the prices to a CSV file (test_len is assumed here to simply be the length of the test set).
preds = model.predict(X_test, batch_size=BATCH_SIZE)
preds = np.expm1(preds)  # reverse the log1p transform

test_len = len(test)  # assumed: keep every test row
submission = test[["test_id"]][:test_len]
submission["price"] = preds[:test_len]
submission.to_csv("./submission.csv", index=False)
We walked through how to predict prices given multiple input features: how to preprocess the text data, deal with missing values, and finally build, train, and evaluate the model.
Full source code posted on my GitHub.