How to teach AI to suggest product prices to online sellers



Given a product's brand, name, category, and a sentence or two of description, can we predict its price? Could it be that simple?

In this Kaggle competition, we are asked to do exactly that. Developers around the world are competing in the "Mercari Prize: Price Suggestion Challenge", with a total of $100,000 in prize money (first place: $60,000, second place: $30,000, third place: $10,000).

In this post, I will walk you through building a simple model to tackle the challenge using the deep learning library Keras.

If you are new to Kaggle, you need to register an account in order to download the datasets; it is totally pain-free. Once you have the account, go to the "Data" tab on the Mercari Price Suggestion Challenge page.

Download all three files to your local computer, extract them, and save them to a folder named "input". At the root folder, create a folder named "scripts" where we will start coding.

Right now you should have your directories structured similarly to this.

./Pricing_Challenge
|-- input
|    |-- test.tsv
|    |-- sample_submission.csv
|    `-- train.tsv
`-- scripts

Preparing the data

First, let's take some time to understand the datasets at hand.

train.tsv, test.tsv

The files consist of a list of product listings. These files are tab-delimited.

  • train_id or test_id - the id of the listing
  • name - the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
  • item_condition_id - the condition of the items provided by the seller
  • category_name - category of the listing
  • brand_name
  • price - the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in test.tsv since that is what you will predict.
  • shipping - 1 if the shipping fee is paid by the seller, 0 if by the buyer
  • item_description - the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]

First 5 rows of the train dataset:

train_id name item_condition_id category_name brand_name price shipping item_description
0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity

 

For the price column, it would be problematic to feed the neural network values that span wildly different ranges. The network might be able to adapt to such heterogeneous data on its own, but it would certainly make learning more difficult.

A widespread practice for dealing with such data is to transform it to a more uniform scale: here we apply log(x+1) to the price column, which compresses the range and brings the distribution closer to normal.

train['target'] = np.log1p(train['price'])
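
Note that np.expm1() is the exact inverse of np.log1p(); we will use it later to turn predicted log-prices back into dollar prices. A quick round-trip check:

import numpy as np

price = np.array([10.0, 52.0, 35.0])
target = np.log1p(price)   # what the model will be trained to predict
print(np.expm1(target))    # [10. 52. 35.] -- back to the original prices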

Let's take a look at the distribution of the new 'target' column.

[Figure: histogram of the log-transformed 'target' column]

Text preprocessing

Replace contractions

We will replace contraction pairs like those below; the purpose is to unify the vocabulary and make the model easier to train.

"what's" → "what is",
"who'll" → "who will",
"wouldn't" → "would not",
"you'd" → "you would",
"you're" → "you are"

Before doing this, let's count how many rows contain each of the contractions in the "item_description" column.

The most frequent ones are listed below, which is no surprise (a sketch of how to compute these counts follows the list).

can't - 6136
won't - 4867
that's - 4806
it's - 26444
don't - 32645
doesn't - 8520
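
For reference, here is a minimal sketch of how such counts can be computed with pandas (assuming train is already loaded):

# count rows whose item_description contains each contraction
for contraction in ["can't", "won't", "that's", "it's", "don't", "doesn't"]:
    count = train['item_description'].str.contains(contraction, case=False,
                                                    regex=False, na=False).sum()
    print(contraction, '-', count)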

Here is the code to remove contractions from both the 'item_description' and 'name' columns in the train and test datasets.

contractions = { 
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    # more contractions pairs
    ...
    }

for contraction in contractions:
    train['item_description'] = train['item_description'].str.replace(contraction, contractions[contraction])
    test['item_description'] = test['item_description'].str.replace(contraction, contractions[contraction])
    train['name'] = train['name'].str.replace(contraction, contractions[contraction])
    test['name'] = test['name'].str.replace(contraction, contractions[contraction])

Handle missing values

The concept of missing values is important to understand in order to manage data successfully. If missing values are not handled properly, you may end up drawing inaccurate conclusions from the data.

First, take a look at how much data is missing, and in which columns.

train.isnull().sum()/len(train)

From the output we can see that about 43% of brand_name values are missing, while the "category_name" and "item_description" columns are each missing less than 1% of their values.

train_id             0.000000
name                 0.000000
item_condition_id    0.000000
category_name        0.004268
brand_name           0.426757
price                0.000000
shipping             0.000000
item_description     0.000003
dtype: float64

Suppose the number of missing values is extremely small; then an analyst may simply drop or omit those rows from the analysis. As a rule of thumb, if the missing cases make up less than 5% of the sample, they can be dropped. In our case, we could drop the rows with a missing "category_name" or "item_description" value.
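
If you prefer that route, a minimal sketch of the alternative (not what we do below) could look like this:

# Alternative (not used here): drop the small number of rows that are
# missing category_name or item_description
train = train.dropna(subset=['category_name', 'item_description'])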

But for simplicity, let's replace all missing text values with the string "missing".

#HANDLE MISSING VALUES
print("Handling missing values...")
def handle_missing(dataset):
    dataset.category_name.fillna(value="missing", inplace=True)
    dataset.brand_name.fillna(value="missing", inplace=True)
    dataset.item_description.fillna(value="missing", inplace=True)
    return (dataset)

train = handle_missing(train)
test = handle_missing(test)
print(train.shape)
print(test.shape)

Create categorical columns

There are two text columns with special meanings:

  • category_name
  • brand_name

Different products can share the same category name or brand name, so it will be helpful to create categorical columns from them.

We will use sklearn's LabelEncoder for this purpose. After the transform, we will have two new columns, "category" and "brand", of integer type.

import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

le.fit(np.hstack([train.category_name, test.category_name]))
train['category'] = le.transform(train.category_name)
test['category'] = le.transform(test.category_name)

le.fit(np.hstack([train.brand_name, test.brand_name]))
train['brand'] = le.transform(train.brand_name)
test['brand'] = le.transform(test.brand_name)

Tokenize - texts to sequences

For each unique word in the vocabulary, we will assign an integer to represent it, so one sentence becomes a list of integers.

First, we need to gather the vocabulary from the text columns we are going to tokenize, i.e. these three columns:

  • category_name
  • item_description
  • name

And we will use Keras' text processing Tokenizer class.

from keras.preprocessing.text import Tokenizer
raw_text = np.hstack([train.category_name.str.lower(), 
                      train.item_description.str.lower(), 
                      train.name.str.lower()])
# Tokenize
print("Tokenizing!")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)
print("   Transforming text to seq...")
train["seq_category_name"] = tok_raw.texts_to_sequences(train.category_name.str.lower())
test["seq_category_name"] = tok_raw.texts_to_sequences(test.category_name.str.lower())
train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())

The fit_on_texts() method trains the Tokenizer and builds the vocabulary word-index mapping, and the texts_to_sequences() method actually turns the texts into sequences.
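
To make this concrete, here is a tiny, self-contained example (separate from the tokenizer fitted above):

from keras.preprocessing.text import Tokenizer

toy = Tokenizer()
toy.fit_on_texts(["adorable top with lace", "leather horse statues"])
print(toy.word_index)                                     # word -> integer index mapping
print(toy.texts_to_sequences(["adorable leather top"]))   # one list of integers per text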

Padding sequences

The word sequences generated in the previous step have different lengths. The first layer in our network for these sequences will be an Embedding layer, and each Embedding layer takes as input a 2D tensor of integers of shape (samples, sequence_length).

 All sequences in a batch must have the same length since we need to pack them into a single tensor. So sequences that are shorter than others should be padded with zeros, and sequences that are longer should be truncated.

To begin with, we need to choose the sequence_length for each of our sequence columns. If it is too long, model training will take forever; if it is too short, we risk truncating important information. It is helpful to visualize the sequence length distribution before making this decision.

This line of code will plot sequence length distribution for the "seq_item_description" column in a histogram.

train.seq_item_description.apply(lambda x: len(x)).hist(bins=30)

[Figure: histogram of item_description sequence lengths]

Let's pick 60 for the max sequence length, since it covers the majority of sequences.
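
The padding code below also assumes a few maximum-length constants and a train/validation split have been defined. Here is a minimal sketch; apart from MAX_ITEM_DESC_SEQ = 60, which follows from the histogram above, the values and the split ratio are assumptions you can tune:

from sklearn.model_selection import train_test_split

MAX_NAME_SEQ = 10            # assumed: listing names are short
MAX_ITEM_DESC_SEQ = 60       # from the histogram above
MAX_CATEGORY_NAME_SEQ = 20   # assumed: category paths are fairly short

# hold out a small validation set, used later to evaluate the model
dtrain, dvalid = train_test_split(train, random_state=123, train_size=0.99)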

In the code below, we use Keras' sequence-processing pad_sequences() method to pad the sequences in each column to the same length.

#KERAS DATA DEFINITION
from keras.preprocessing.sequence import pad_sequences

def get_keras_data(dataset):
    X = {
        'name': pad_sequences(dataset.seq_name, maxlen=MAX_NAME_SEQ)
        ,'item_desc': pad_sequences(dataset.seq_item_description
                                    , maxlen=MAX_ITEM_DESC_SEQ)
        ,'brand': np.array(dataset.brand)
        ,'category': np.array(dataset.category)
        ,'category_name': pad_sequences(dataset.seq_category_name
                                        , maxlen=MAX_CATEGORY_NAME_SEQ)
        ,'item_condition': np.array(dataset.item_condition_id)
        ,'shipping': np.array(dataset[["shipping"]])
    }
    return X

X_train = get_keras_data(dtrain)
X_valid = get_keras_data(dvalid)
X_test = get_keras_data(test)


Build the model

This will be a multi-input model with the following inputs:

  • 'name': texts converted to sequences
  • 'item_desc': texts converted to sequences
  • 'brand': texts converted to integers
  • 'category': texts converted to integers
  • 'category_name': texts converted to sequences
  • 'item_condition': integers
  • 'shipping': integers 1 or 0

All inputs except “shipping” will first go through an Embedding layer.

For the sequence inputs, we feed them into Embedding layers. An Embedding layer turns integer indices (which stand for specific words) into dense vectors: it takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It is effectively a dictionary lookup.

The embedded sequences are then fed into GRU layers, which, like other types of recurrent networks, are good at learning patterns in sequences of data.

The embedding layers for the non-sequential inputs are simply flattened to two dimensions by a Flatten layer.

All of these, together with “shipping”, are then concatenated into one big two-dimensional tensor.

This is followed by several Dense layers; the final output Dense layer uses a “linear” activation so it can regress to arbitrary price values, which is the same as specifying None for the activation parameter.
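
The model code below also relies on vocabulary-size constants and a dropout rate dr. One way to set them (a sketch; each embedding's input dimension only needs to be larger than the largest integer index it will receive):

MAX_TEXT = len(tok_raw.word_index) + 1   # tokenizer indices start at 1
MAX_BRAND = np.max([train.brand.max(), test.brand.max()]) + 1
MAX_CATEGORY = np.max([train.category.max(), test.category.max()]) + 1
MAX_CONDITION = np.max([train.item_condition_id.max(),
                        test.item_condition_id.max()]) + 1
dr = 0.25                                # dropout rate (assumed value)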

from keras.layers import Input, Dropout, Dense, \
    concatenate, GRU, Embedding, Flatten
from keras.models import Model
from keras import optimizers

def get_model():
    
    #Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand = Input(shape=[1], name="brand")
    category = Input(shape=[1], name="category")
    category_name = Input(shape=[X_train["category_name"].shape[1]], 
                          name="category_name")
    item_condition = Input(shape=[1], name="item_condition")
    shipping = Input(shape=[X_train["shipping"].shape[1]], name="shipping")
    
    #Embeddings layers
    emb_size = 60
    emb_name = Embedding(MAX_TEXT, emb_size//3)(name)
    emb_item_desc = Embedding(MAX_TEXT, emb_size)(item_desc)
    emb_category_name = Embedding(MAX_TEXT, emb_size//3)(category_name)
    emb_brand = Embedding(MAX_BRAND, 10)(brand)
    emb_category = Embedding(MAX_CATEGORY, 10)(category)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)
    
    rnn_layer1 = GRU(16) (emb_item_desc)
    rnn_layer2 = GRU(8) (emb_category_name)
    rnn_layer3 = GRU(8) (emb_name)
    
    #main layer
    main_l = concatenate([
        Flatten() (emb_brand)
        , Flatten() (emb_category)
        , Flatten() (emb_item_condition)
        , rnn_layer1
        , rnn_layer2
        , rnn_layer3
        , shipping
    ])
    main_l = Dropout(dr)(Dense(512,activation='relu') (main_l))
    main_l = Dropout(dr)(Dense(64,activation='relu') (main_l))
    main_l = Dropout(dr)(Dense(32,activation='relu') (main_l))
    
    #output
    output = Dense(1,activation="linear") (main_l)
    
    #model
    model = Model([name, item_desc, brand
                   , category, category_name
                   , item_condition, shipping], output)
    #optimizer = optimizers.RMSprop()
    optimizer = optimizers.Adam()
    model.compile(loss="mse", 
                  optimizer=optimizer)
    return model

I like visualization, so I plot the model structure as well.

[Figure: plot of the model structure]

It can be done with these two lines of code, if you are curious.

You need to install the Graphviz executable, and pip install the graphviz and pydot packages, before trying to plot.

from keras.utils import plot_model
plot_model(model, to_file='model.png', show_shapes=True)

Training the model is easy; let's train it for 3 epochs. X_train is the dictionary we created earlier, mapping input names to Numpy arrays.

epochs = 3
BATCH_SIZE = 512 * 3

model = get_model()
history = model.fit(X_train, dtrain.target
                    , epochs=epochs
                    , batch_size=BATCH_SIZE
                    , validation_split=0.01
                    )
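
The returned history object records the training and validation loss for each epoch; a quick way to inspect it:

# history.history maps metric names to per-epoch values
for epoch, (loss, val_loss) in enumerate(zip(history.history['loss'],
                                             history.history['val_loss']),
                                         start=1):
    print("epoch %d: loss=%.4f  val_loss=%.4f" % (epoch, loss, val_loss))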

Evaluate the model

The Kaggle challenge page uses "Root Mean Squared Logarithmic Error" (RMSLE) as the evaluation metric.

The following code will take our trained model and compute the loss value given the validation data.

def rmsle(y, y_pred):
    import math
    assert len(y) == len(y_pred)
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 \
              for i, pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5

def eval_model(model):
    val_preds = model.predict(X_valid)
    val_preds = np.expm1(val_preds)
    
    y_true = np.array(dvalid.price.values)
    y_pred = val_preds[:, 0]
    v_rmsle = rmsle(y_true, y_pred)
    print(" RMSLE error on dev test: "+str(v_rmsle))
    return v_rmsle

v_rmsle = eval_model(model)

Generate file for submission

If you are planning to generate actual prices for the test dataset and try your luck on Kaggle, this block of code will reverse the target transformation we discussed previously and write the prices to a CSV file.

preds = model.predict(X_test, batch_size=BATCH_SIZE)
preds = np.expm1(preds)
test_len = len(test)  # assumed: the number of rows in the original test set
submission = test[["test_id"]][:test_len]
submission["price"] = preds[:test_len]
submission.to_csv("./submission.csv", index=False)

Summary

We walked through how to predict prices given multiple input features: how to preprocess the text data, deal with missing values, and finally build, train, and evaluate the model.

Full source code posted on my GitHub.

