How to choose Last-layer activation and loss function

Without further ado, here are the combinations of last-layer activation and loss function for different tasks.

Last-layer activation and loss function combinations

| Problem type | Last-layer activation | Loss function | Example |
| --- | --- | --- | --- |
| Binary classification | sigmoid | binary_crossentropy | Dog vs. cat; sentiment analysis (pos/neg) |
| Multi-class, single-label classification | softmax | categorical_crossentropy | MNIST: 10 classes, single label (each prediction is one digit) |
| Multi-class, multi-label classification | sigmoid | binary_crossentropy | News tags classification: one blog post can have multiple tags |
| Regression to arbitrary values | None | mse | Predicting a house price (an integer or floating-point value) |
| Regression to values between 0 and 1 | sigmoid | mse or binary_crossentropy | Engine health assessment, where 0 is broken and 1 is new |

Binary classification - Dog VS Cat

This Kaggle competition asks you to write an algorithm that classifies whether an image contains a dog or a cat. It is a binary classification task: the model outputs a single number between 0 and 1, where a lower value indicates the image is more "cat"-like and a higher value indicates the model thinks the image is more "dog"-like.

Here is the code for the last fully connected layer and the loss function used for the model:

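# Dog vs. cat last Dense layer (assumes `model` is an existing Sequential model)
from keras import layers, optimizers

model.add(layers.Dense(1, activation='sigmoid'))
# Recent Keras releases name this argument `learning_rate`; older ones used `lr`
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=1e-4),
              metrics=['accuracy'])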

If you are interested in the full source code for this dog vs cat task, take a look at this awesome tutorial on GitHub.
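
To make the sigmoid and binary_crossentropy pairing concrete, here is a minimal sketch in plain Python (no Keras) of what the two pieces compute for a single image. The function names, the raw score of 2.0, and the 0 = cat / 1 = dog label encoding are assumptions for illustration; Keras computes the same quantities on whole batches.

```python
import math

def sigmoid(z):
    """Squash a raw score (logit) into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # this branch avoids overflow for very negative z
    return e / (1.0 + e)

def binary_crossentropy(y_true, p, eps=1e-7):
    """Cross-entropy for one example; y_true is 0 (cat) or 1 (dog)."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Hypothetical raw output of the last Dense unit for one image
p_dog = sigmoid(2.0)

# The loss is small when the label agrees with the prediction, large otherwise
loss_if_dog = binary_crossentropy(1, p_dog)
loss_if_cat = binary_crossentropy(0, p_dog)
```

With a score of 2.0 the model leans towards "dog", so the loss is low if the image really is a dog and much higher if it is a cat.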

Multi-class single-label classification - MNIST

The task is to classify grayscale images of handwritten digits (28 pixels by 28 pixels) into their 10 categories (0 to 9). The dataset ships with the Keras package, so it is very easy to try.

The last layer uses "softmax" activation, which means it returns an array of 10 probability scores (summing to 1). Each score is the probability that the current digit image belongs to the corresponding one of our 10 digit classes.

# MNIST last Dense layer
model.add(layers.Dense(10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Again the full source code for MNIST classification is provided on GitHub.
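
The softmax and categorical_crossentropy pairing can be sketched in plain Python as follows. This is an illustration only: the logits are made up, and the helper names are hypothetical rather than Keras functions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    """Encode an integer label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(y_true, probs, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, probs))

# Made-up raw outputs of the 10-unit Dense layer, favouring digit 7
logits = [0.0] * 10
logits[7] = 3.0

probs = softmax(logits)
predicted_digit = max(range(10), key=lambda i: probs[i])
loss = categorical_crossentropy(one_hot(7, 10), probs)
```

Note that categorical_crossentropy expects one-hot encoded targets, as built by `one_hot` above. If the labels are kept as plain integers, Keras provides sparse_categorical_crossentropy for the same purpose.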

Multi-class, multi-label classification - News tags classification

Reuters-21578 is a collection of about 20K news lines categorized with 672 labels. The labels are divided into five main categories:

  • Topics
  • Places
  • People
  • Organizations
  • Exchanges

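For example, one news item can have 3 tags:

  • Places: USA, China
  • Topics: trade

Because the tags are not mutually exclusive, the last layer applies a sigmoid to each output unit independently, and binary_crossentropy treats every tag as its own yes/no decision.

# News tags classification last Dense layer
model.add(Dense(num_categories, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])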

You can take a look at the source code for this task on my GitHub.

I also wrote another blog post covering this task in detail; check it out if you are interested.
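
As a rough sketch of how targets and predictions look in the multi-label setting, the plain-Python snippet below encodes a set of tags as a multi-hot vector and decodes sigmoid outputs back into tags. The four-tag vocabulary, the output values, and the 0.5 threshold are all assumptions chosen for illustration.

```python
# Hypothetical tag vocabulary; the real dataset has many more labels
TAGS = ["usa", "china", "trade", "grain"]

def multi_hot(active_tags, vocabulary):
    """Encode a set of tags as a 0/1 vector with one slot per known tag."""
    active = set(active_tags)
    return [1.0 if tag in active else 0.0 for tag in vocabulary]

def decode(probs, vocabulary, threshold=0.5):
    """Keep every tag whose independent probability reaches the threshold."""
    return [tag for tag, p in zip(vocabulary, probs) if p >= threshold]

# Training target for a news item tagged with USA, China and trade
target = multi_hot(["usa", "china", "trade"], TAGS)

# Made-up sigmoid outputs; unlike softmax they do not need to sum to 1
sigmoid_outputs = [0.91, 0.78, 0.66, 0.04]
predicted_tags = decode(sigmoid_outputs, TAGS)
```

Several entries of the target can be 1 at once, which is exactly what a softmax output could not represent.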

Regression to arbitrary values - Boston Housing price prediction

The goal is to predict a single continuous value, the house price, from the given data, rather than a discrete label.

The network ends with a Dense layer without any activation, because an activation function would restrict the range of the output. Sigmoid, for example, would constrain the value to between 0 and 1, and we don't want that to happen.

The mse (mean squared error) loss function computes the mean of the squared differences between the predictions and the targets. It is a widely used loss function for regression tasks.

# predict house price last Dense layer
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

Full source code can be found in the same GitHub repo.
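
To show what the mse loss and the mae metric in the snippet above measure, here is a small plain-Python sketch. The three target and predicted values are made up, and the function names simply mirror the Keras string identifiers.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: the average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions; the last prediction is far off
prices_true = [20.0, 30.0, 50.0]
prices_pred = [22.0, 29.0, 40.0]

loss_value = mse(prices_true, prices_pred)    # (4 + 1 + 100) / 3
metric_value = mae(prices_true, prices_pred)  # (2 + 1 + 10) / 3
```

Squaring makes the single large miss dominate the loss, while mae stays in the same units as the target, which is one reason it is convenient as a metric to monitor.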

Regression to values between 0 and 1

Consider a task like assessing the health condition of a jet engine given recordings from several sensors. We want the output to be a continuous value between 0 and 1, where 0 means the engine needs to be replaced, 1 means it is in perfect condition, and a value in between means some degree of maintenance is needed. Compared to the previous regression problem, we apply the "sigmoid" activation to the last dense layer to constrain the value to between 0 and 1.

# Jet engine health assessment last Dense layer
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
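
Either mse or binary_crossentropy can serve as the loss when the target lies between 0 and 1. The sketch below, in plain Python with a made-up health score of 0.7 and a coarse grid of candidate predictions, suggests why: both losses are smallest when the prediction equals the target.

```python
import math

def binary_crossentropy(y, p, eps=1e-7):
    """Cross-entropy for a target y that may be any value in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def squared_error(y, p):
    """Squared difference between target and prediction."""
    return (y - p) ** 2

health = 0.7  # hypothetical target: engine needs some maintenance

# Try predictions 0.01, 0.02, ..., 0.99 and keep the one with the lowest loss
candidates = [i / 100 for i in range(1, 100)]
best_bce = min(candidates, key=lambda p: binary_crossentropy(health, p))
best_mse = min(candidates, key=lambda p: squared_error(health, p))

# Cross-entropy does not reach zero for a soft target, even at its minimum
bce_at_best = binary_crossentropy(health, best_bce)
```

One practical difference is that the cross-entropy value stays above zero for targets strictly between 0 and 1, so its absolute value is harder to interpret than mse.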

Leave a comment if you have any questions.
