
In this tutorial, I will show you how to build a deep learning model to find defects on a surface, a popular application in many industrial inspection scenarios.

Courtesy of Nvidia
We will apply U-Net as the deep learning model for 2D industrial defect inspection. U-Net is a good choice when labeled data is scarce and fast inference is needed. The basic architecture is an encoder-decoder pair with skip connections that combine low-level feature maps with higher-level ones. To verify the effectiveness of our model, we will use the DAGM dataset. Another benefit of U-Net is that it contains no dense layers, so trained models are typically scale invariant, meaning they do not need to be retrained to handle different input sizes.
Here is the model structure.

As you can see, we use four 2x2 max pool operations for downsampling, each of which halves the resolution. On the right side, 2x2 Conv2DTranspose layers (also called deconvolution) upsample the feature maps back to the original resolution. For the downsampling and upsampling to line up, the image resolution must be divisible by 16 (that is, 2^4), which is why we resized the input images and masks from the original DAGM size of 500x500 to 512x512.
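To make the divisibility requirement concrete, here is a small sketch in plain Python (no Keras involved) that tracks one side of the image through four 2x2 poolings and four 2x upsamplings. The helper names `pooled_sizes` and `restored_size` are made up for this illustration, and the sketch assumes pooling rounds down, as Keras `MaxPooling2D` does with its default `valid` padding.

```python
def pooled_sizes(size, num_pools=4):
    """Side length before and after each 2x2, stride-2 max pool (rounds down)."""
    sizes = [size]
    for _ in range(num_pools):
        size //= 2
        sizes.append(size)
    return sizes


def restored_size(size, num_pools=4):
    """Side length after pooling num_pools times, then doubling num_pools times."""
    return pooled_sizes(size, num_pools)[-1] * 2 ** num_pools


for side in (500, 512):
    print(side, pooled_sizes(side), "->", restored_size(side))
# 500 [500, 250, 125, 62, 31] -> 496
# 512 [512, 256, 128, 64, 32] -> 512
```

With a 500-pixel side, the odd size 125 is rounded down to 62, so the upsampled decoder maps (124 at that level) no longer match the encoder maps (125) they would be concatenated with. At 512 every level lines up.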
These skip connections from earlier layers in the network (prior to a downsampling operation) provide the detail needed to reconstruct accurate shapes for segmentation boundaries. Indeed, adding them lets us recover more fine-grained detail.
It is simple to compose such a model with the Keras functional API.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, Lambda, Conv2DTranspose, concatenate
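# Input resolution: DAGM images and masks are resized to 512x512 (see above)
img_rows, img_cols = 512, 512
def get_small_unet():
    inputs = Input((img_rows, img_cols, 1))
    # Scale raw 8-bit pixel values from [0, 255] to [-1, 1]; drop this layer
    # if the images are already normalized before being fed to the model
    inputs_norm = Lambda(lambda x: x/127.5 - 1.)(inputs)
    conv1 = Conv2D(16, (3, 3), activation='relu', padding='same')(inputs_norm)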
    conv1 = Conv2D(16, (3, 3), activation='relu', padding='same')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)
    conv2 = Conv2D(32, (3, 3), activation='relu', padding='same')(pool1)
    conv2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)
    conv3 = Conv2D(64, (3, 3), activation='relu', padding='same')(pool2)
    conv3 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv3)
    pool3 = MaxPooling2D(pool_size=(2, 2))(conv3)
    conv4 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool3)
    conv4 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv4)
    pool4 = MaxPooling2D(pool_size=(2, 2))(conv4)
    conv5 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool4)
    conv5 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv5)
    up6 = concatenate([Conv2DTranspose(64, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv5), conv4], axis=3)
    conv6 = Conv2D(128, (3, 3), activation='relu', padding='same')(up6)
    conv6 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv6)
    up7 = concatenate([Conv2DTranspose(32, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv6), conv3], axis=3)
    conv7 = Conv2D(64, (3, 3), activation='relu', padding='same')(up7)
    conv7 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv7)
    up8 = concatenate([Conv2DTranspose(16, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv7), conv2], axis=3)
    conv8 = Conv2D(32, (3, 3), activation='relu', padding='same')(up8)
    conv8 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv8)
    up9 = concatenate([Conv2DTranspose(8, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv8), conv1], axis=3)
    conv9 = Conv2D(16, (3, 3), activation='relu', padding='same')(up9)
    conv9 = Conv2D(16, (3, 3), activation='relu', padding='same')(conv9)
    conv10 = Conv2D(1, (1, 1), activation='sigmoid')(conv9)
    model = Model(inputs=inputs, outputs=conv10)
    return model
model = get_small_unet()
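As a rough way to check how the skip connections fit together, the sketch below traces the spatial size and channel count of each stage using only the filter numbers from `get_small_unet`. It is plain Python rather than Keras, and the function name `trace_small_unet` and its parameters are hypothetical names chosen for this illustration.

```python
def trace_small_unet(side=512,
                     encoder_filters=(16, 32, 64, 128),
                     bottleneck_filters=256,
                     up_filters=(64, 32, 16, 8)):
    """Return (stage, side length, channels) tuples for the small U-Net."""
    rows, skips = [], []
    # Encoder: each block keeps the size, then a 2x2 max pool halves it
    for filters in encoder_filters:
        rows.append(("conv (enc)", side, filters))
        skips.append((side, filters))
        side //= 2
    rows.append(("bottleneck", side, bottleneck_filters))
    # Decoder: upsample by 2, concatenate with the matching encoder block
    for up, (skip_side, skip_filters) in zip(up_filters, reversed(skips)):
        side *= 2
        if side != skip_side:
            raise ValueError(f"size mismatch: {side} vs {skip_side}")
        rows.append(("concat", side, up + skip_filters))
        rows.append(("conv (dec)", side, skip_filters))
    return rows


for stage, size, channels in trace_small_unet():
    print(f"{stage:<11} {size:>3}x{size:<3} {channels:>3} channels")
```

For example, the first concatenation joins the 64 channels produced by `Conv2DTranspose` with the 128 channels of `conv4`, giving 192 channels at 64x64 before `conv6` reduces them to 128.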
The most commonly used loss function for the task of image segmentation is a pixel-wise cross-entropy loss. This loss examines each pixel individually, comparing the class predictions (depth-wise pixel vector) to our one-hot encoded target vector.
Because the cross-entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, every pixel in the image carries equal weight during learning. This can be a problem if the classes have unbalanced representation in the image, as training can be dominated by the most prevalent class. In our case, the imbalance is between foreground (defect) and background pixels.
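The effect of this imbalance can be seen on a toy example. The sketch below is plain Python, not the Keras loss itself, and the 100-pixel mask with 2 defect pixels is invented for illustration. It computes the mean binary cross-entropy for a prediction that confidently marks every pixel as background.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean pixel-wise binary cross-entropy over flat lists of pixels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical toy mask: 2 defect pixels out of 100
y_true = [1] * 2 + [0] * 98
all_background = [0.01] * 100  # misses every defect pixel
uninformed = [0.5] * 100       # no opinion on any pixel

print(binary_cross_entropy(y_true, all_background))  # about 0.10
print(binary_cross_entropy(y_true, uninformed))      # about 0.69
```

Even though the all-background prediction misses every defect pixel, its loss is low because the 98 background pixels dominate the average.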
Another popular loss function for image segmentation tasks is based on the Dice coefficient, which is essentially a measure of overlap between two samples. This measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete overlap. The Dice coefficient was originally developed for binary data, and can be calculated as:

Dice = 2|A∩B| / (|A| + |B|)

where |A∩B| represents the common elements between sets A and B, and |A| represents the number of elements in set A (and likewise for set B).
With respect to the neural network output, the numerator is concerned with the common activations between our prediction and target mask, whereas the denominator is concerned with the number of activations in each mask separately. This has the effect of normalizing our loss according to the size of the target mask such that the soft Dice loss does not struggle learning from classes with lesser spatial representation in an image.
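Here is a minimal sketch of the soft Dice coefficient on a toy example, written in plain Python rather than with the Keras backend and without any smoothing term. The function name `dice_coefficient` and the 100-pixel mask with 2 defect pixels are assumptions made for this illustration.

```python
def dice_coefficient(y_true, y_pred):
    """Soft Dice: 2 * overlap / (sum of target + sum of prediction)."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return 2.0 * intersection / (sum(y_true) + sum(y_pred))


# Hypothetical toy mask: 2 defect pixels out of 100
y_true = [1] * 2 + [0] * 98
all_background = [0.01] * 100       # misses both defect pixels
good = [0.9] * 2 + [0.01] * 98      # finds both defect pixels

print(dice_coefficient(y_true, y_true))          # 1.0
print(dice_coefficient(y_true, good))            # about 0.75
print(dice_coefficient(y_true, all_background))  # about 0.01
```

Because the score is normalized by the size of the masks rather than by the number of pixels, a prediction that misses the small defect region scores close to 0, no matter how many background pixels it gets right.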
Here we use add-one (Laplace) smoothing, which simply adds one to each count. Add-one smoothing can be interpreted as a uniform prior, which reduces overfitting and makes it easier for the model to converge.
The smoothed Dice coefficient loss can be implemented as follows.
from tensorflow.keras import backend as K
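def smooth_dice_coeff(smooth=1.):
    """Build a smoothed Dice coefficient metric and the matching loss."""
    smooth = float(smooth)
    # Smoothed Dice coefficient (named IOU_calc here, although it is Dice, not IoU)
    def IOU_calc(y_true, y_pred):
        y_true_f = K.flatten(y_true)
        y_pred_f = K.flatten(y_pred)
        # Soft overlap between the target mask and the predicted mask
        intersection = K.sum(y_true_f * y_pred_f)
        # Note: with smooth > 0 this value can exceed 1 (it is 2 when both masks are empty)
        return 2*(intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    # Negate the coefficient so that minimizing the loss maximizes the overlap
    def IOU_calc_loss(y_true, y_pred):
        return -IOU_calc(y_true, y_pred)
    return IOU_calc, IOU_calc_loss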
IOU_calc, IOU_calc_loss = smooth_dice_coeff(1)
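To see how the smoothing term behaves, the formula used in `IOU_calc` can be mirrored in plain Python and evaluated on toy masks. This is only a sketch: `smoothed_dice_post` copies the expression from the code above, while `smoothed_dice_variant` is a commonly used alternative that is not from this post and is shown purely for comparison. Both names and the toy masks are made up for this illustration.

```python
def smoothed_dice_post(y_true, y_pred, smooth=1.0):
    """Same expression as IOU_calc above, on flat Python lists."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return 2 * (intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


def smoothed_dice_variant(y_true, y_pred, smooth=1.0):
    """Common alternative (an assumption, not from the post): stays at or below 1."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return (2 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


empty = [0] * 100              # defect-free image or empty prediction
defect = [1] * 2 + [0] * 98    # 2 defect pixels out of 100

print(smoothed_dice_post(empty, empty))       # 2.0
print(smoothed_dice_post(defect, defect))     # 1.2
print(smoothed_dice_variant(empty, empty))    # 1.0
print(smoothed_dice_variant(defect, defect))  # 1.0
```

In both versions the smoothing term keeps the denominator non-zero when the target and the prediction are both empty, which would otherwise be a division by zero. The version from the post can return values above 1, which does not matter when the negated value is used as a training loss but is worth keeping in mind when reading it as a metric.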
Here we compare the performance of the binary cross-entropy loss and the smoothed Dice coefficient loss.


As you can see, the model trained with the Dice coefficient loss converged faster and achieved a better final IOU accuracy. In the final test predictions, it also delivered sharper segmentation edges than the model trained with cross-entropy loss.
In this quick tutorial, you have learned how to build a deep learning model that can be trained end to end to detect defects in industrial applications. The DAGM dataset used in this post is relatively simple, which makes it well suited to fast prototyping and verification. Real-world image data, however, may contain much richer context that requires a deeper and more complex model. One simple way to get there is to experiment with an increased number of kernels in the convolutional layers. There are other options as well: in this paper, the author proposed replacing each CNN block with a dense block, which can be more capable of learning complex contextual features.
You can reproduce the results of this post by running this notebook on Google Colab with a free GPU.
Source code available on my GitHub.