A Beginner’s Tutorial on Building an AI Image Classifier using PyTorch

This is a step-by-step guide to building an image classifier: an AI model that learns to label images. I use Python and PyTorch.

Step 1: Import libraries

When we write a program, manually coding every small action is a huge hassle. Often we want to use packages of code that other people have already written. We add these packaged routines to our program by importing libraries and then referencing them later in the code.

We usually import all the libraries at the beginning of the program:

# Import Libraries
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
import torch.nn as nn
import torch.optim as optim
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

Step 2: Define transforms

The next logical step would be to import the picture data we want our AI model to learn from. But before that, we need to specify the alterations we want to perform on these pictures — since the same command that imports them also transforms the data.

These transforms come from the torchvision.transforms library. The best way to understand them is to read the torchvision.transforms documentation, but here is a brief summary of what each command does.

  • transforms.Compose lets us compose multiple transforms together.
  • transforms.Resize(255) resizes the image so its shortest side is 255 pixels long. The other side is scaled to maintain the aspect ratio of the image.
  • transforms.CenterCrop(224) crops the center of the image to a 224 by 224 pixel square.
  • We do these last two steps so that all the images going into our AI model have the same size (images in a batch must all have the same size).
  • transforms.ToTensor() converts our image into numbers. It separates the three colors that make up every pixel of our picture: red, green, and blue. This gives us three single-color images (channels). Each pixel in a channel is the brightness of that color from 0 to 255, and these values are scaled to a range from 0 to 1 by dividing by 255. Our image is now a Torch tensor (a data structure that stores lots of numbers).
  • transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) subtracts the mean from each value and then divides by the standard deviation. We will be using a pre-trained model, so we need to use the means and standard deviations that PyTorch specifies for its pre-trained models. There are three values in the mean and in the standard deviation, one for each RGB channel.
# Specify transforms using the torchvision.transforms library
transformations = transforms.Compose([
    transforms.Resize(255),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
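
To make the arithmetic behind ToTensor and Normalize concrete, here is a small NumPy sketch that applies the same two steps by hand to a made-up 2 by 2 image. The pixel values and the helper name to_normalized_array are assumptions for illustration. This is not the torchvision implementation, only the same math.

```python
import numpy as np

# Per-channel statistics quoted in the tutorial.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def to_normalized_array(image_hwc):
    """Mimic ToTensor followed by Normalize on a height x width x 3 uint8 array."""
    # ToTensor step: move the channels first and scale 0..255 to 0..1.
    chw = image_hwc.transpose((2, 0, 1)).astype(np.float64) / 255
    # Normalize step: subtract the mean and divide by the std, per channel.
    return (chw - MEAN[:, None, None]) / STD[:, None, None]

# A made-up 2 x 2 image where every pixel is (R, G, B) = (128, 64, 255).
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[..., 0] = 128
image[..., 1] = 64
image[..., 2] = 255

result = to_normalized_array(image)
print(result.shape)      # channels, height, width
print(result[:, 0, 0])   # normalized R, G, B of the top-left pixel
```

For the red channel, for example, the value is (128/255 - 0.485) / 0.229.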

Step 3: Import our data and put it into a DataLoader

Finally, we can import our pictures into the program. We use the torchvision.datasets library.

Read about it in the torchvision.datasets documentation.

We specify two different data sets, one for the images that the AI learns from (the training set) and the other for the dataset we use to test the AI model (the validation set).

The datasets.ImageFolder() command expects our data to be organized in the following way: root/label/picture.png. In other words, images are sorted into folders. For example, all the pictures of bees should be in one folder, all the pictures of ants should be in another etc.

We give the command the path to all the folders and we also give it the transforms that we specified in the last step.

# Load in each dataset and apply transformations using
# the torchvision.datasets as datasets library
train_set = datasets.ImageFolder("root/label/train", transform = transformations)
val_set = datasets.ImageFolder("root/label/valid", transform = transformations)
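
The folder names become the labels. As a rough sketch of how that mapping comes about, the standard-library snippet below sorts the class folders alphabetically and numbers them from 0, which is my understanding of what ImageFolder does. The helper find_classes and the temporary folders are assumptions for illustration. On a real dataset the same information is available as train_set.classes and train_set.class_to_idx.

```python
import os
import tempfile

def find_classes(root):
    """Return sorted class names and a name-to-index mapping for a folder of class folders."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    class_to_idx = {name: index for index, name in enumerate(classes)}
    return classes, class_to_idx

# Build a throwaway directory tree: <root>/bees, <root>/ants, <root>/flies.
with tempfile.TemporaryDirectory() as root:
    for label in ("bees", "ants", "flies"):
        os.makedirs(os.path.join(root, label))
    classes, class_to_idx = find_classes(root)

print(classes)
print(class_to_idx)
```

This mapping is what lets us turn the class index the model predicts back into a readable label.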

Then we put our imported images into a DataLoader, which hands out our data in batches. We specify how many images we want at once with batch_size, so 32 means we get 32 images at a time. We also shuffle our images so they are fed into our AI model in random order.

Read about the DataLoader in the torch.utils.data documentation.

# Put into a DataLoader using torch library
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32, shuffle=True)
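
To see what batching does without needing any image files, here is a minimal sketch that wraps random tensors in a DataLoader. The stand-in data (100 tiny random "images" and random labels) is an assumption for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 100 random "images" of shape 3 x 8 x 8 with labels from 3 classes.
images = torch.randn(100, 3, 8, 8)
labels = torch.randint(0, 3, (100,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Collect how many images each batch contains.
batch_sizes = [batch_images.shape[0] for batch_images, _ in loader]
print(len(loader))    # number of batches per epoch
print(batch_sizes)    # the last batch holds the leftover images
```

With 100 images and a batch size of 32, the final batch is smaller than the others, which is why the training loop later weights each batch loss by inputs.size(0).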

Step 4: Creating our model

AI models need to be trained on a lot of data to be effective. Since we don’t have that much data, we want to take a pre-trained model (a model that has been previously trained on many images) but tailor it to recognize our specific images.

This process is called transfer learning. An image recognition model has two parts: the convolutional part and the classifier. We want to keep the pre-trained convolutional part but put in our own classifier.

The convolution/pooling section of our model recognizes the features inside an image. It first identifies edges, then uses the edges to identify shapes, and then uses the shapes to identify objects. We want to keep the pre-trained convolutional layers because they have been trained to identify these features very well. There are also pooling layers between the convolutional layers that shrink the feature maps, which reduces the amount of computation in the later layers.

The last part of the model is the classifier. The classifier takes all the information extracted from the photo in the convolution part, and uses it to identify the image. This is the part of the pre-trained model we want to replace and to train on our own images, so that the model is tailored to identify the images we give it.

We use the torchvision.models library to download a pre-trained model. There are many different models we can download, and the torchvision.models documentation lists them. I chose a model called densenet161 and specified that we want it to be pre-trained by setting pretrained=True (newer torchvision releases replace this flag with a weights argument).

Then, we make sure we don’t train this model (since we only want to train the classifier we will put in next). Calculating gradients is only necessary for training, so we tell the model not to calculate the gradients of any parameter.

# Get pretrained model using torchvision.models as models library
model = models.densenet161(pretrained=True)
# Turn off training for their parameters
for param in model.parameters():
    param.requires_grad = False

Now we want to replace the default classifier of the model with our own classifier. Classifiers are fully connected neural networks. A neural network is just a bunch of numbers that interact in particular ways. In this case, it takes the features of the image that were highlighted by the convolution section and uses them to determine how likely it is that the image has a certain label.

The first thing we want to do is determine the number of inputs to our neural network. It must equal the number of inputs of the model’s default classifier.

Then, we want to determine the number of outputs. This number should match how many types of images you have. The model will give you a list of percentages, one per label, each expressing how confident the model is that the picture has that label. So if you have images of bees, ants, and flies, there are 3 labels and there should be 3 numbers in the output layer.

Once we have those details, we use the torch.nn library to create the classifier. The torch.nn documentation has more information.

  • nn.Sequential can help us group multiple modules together.
  • nn.Linear specifies the interaction between two layers. We give it 2 numbers, specifying the number of nodes in each layer. So in the first command, the first layer is the input layer, and we can choose how many numbers we want in the second layer (I went with 1024).
  • nn.ReLU is an activation function for hidden layers. Activation functions help the model learn complex relationships between the input and the output. We use ReLU on every layer except the output.

We repeat this for as many hidden layers as we want, with as many nodes as we want in each layer.

  • nn.LogSoftmax is the activation function for the output. The softmax function turns the numbers into probabilities across the labels, and taking the log makes the computation more numerically stable and matches the nn.NLLLoss error function we use in training. Each row of the output holds one image and each column holds one label, so we set dim=1 to apply the softmax across the labels.
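
As a worked example of what the output layer computes, the plain-Python sketch below applies softmax and log-softmax to three made-up scores. The scores and the helper names are assumptions for illustration, and PyTorch's own implementation may differ in detail.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(scores):
    """Log of the softmax, computed without exponentiating large numbers."""
    highest = max(scores)
    log_total = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_total for s in scores]

# Made-up raw scores for three labels, for example (ants, bees, flies).
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
log_probs = log_softmax(scores)

# The negative log-likelihood is minus the log probability of the true label.
true_label = 0
nll = -log_probs[true_label]

print([round(p, 3) for p in probs])
print(round(nll, 3))
```

Taking math.exp of each log probability recovers the ordinary probabilities, which is the same trick the evaluation code uses with torch.exp.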

After creating our own classifier, we replace the model’s default one.

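# Create new classifier for model using torch.nn as nn library
classifier_input = model.classifier.in_features
# One output per label (ImageFolder exposes the label names as .classes)
num_labels = len(train_set.classes)
classifier = nn.Sequential(nn.Linear(classifier_input, 1024),
                           nn.ReLU(),
                           nn.Linear(1024, 512),
                           nn.ReLU(),
                           nn.Linear(512, num_labels),
                           nn.LogSoftmax(dim=1))
# Replace default classifier with new classifier
model.classifier = classifier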
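
To check that freezing and replacing the classifier behaves as described, without downloading densenet161, here is a minimal sketch on a tiny stand-in model. TinyNet, its layer sizes, and num_labels = 3 are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for a pre-trained model with a .classifier attribute."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(16, 8)    # plays the role of the convolutional part
        self.classifier = nn.Linear(8, 10)  # default classifier to be replaced

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyNet()

# Freeze every existing parameter.
for param in model.parameters():
    param.requires_grad = False

# Read the input size from the old classifier, then swap in a new one.
classifier_input = model.classifier.in_features
num_labels = 3
model.classifier = nn.Sequential(nn.Linear(classifier_input, 4),
                                 nn.ReLU(),
                                 nn.Linear(4, num_labels),
                                 nn.LogSoftmax(dim=1))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
output = model(torch.randn(5, 16))
print(trainable, frozen, output.shape)
```

Only the new classifier's parameters are trainable, because newly created layers have requires_grad set to True by default.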

Now my model is created! Next, I just need to train it.

Step 5: Training and Evaluating our model

Training a model on a GPU is a lot faster than on a CPU. To determine which device is available, we use Torch to check. If there is a compatible GPU, we set the variable to the GPU; if not, it falls back to the CPU. We then move our model to this device.

# Find the device available to use using torch library
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move model to the device specified above
model.to(device)

While training, we need to determine how “off” our model is. To evaluate the amount of error our model has, we use nn.NLLLoss. This function takes in the output of our model, for which we used the nn.LogSoftmax function.

To train our model, we take our error and work out how to adjust the weights so that the error becomes as small as possible. The method that calculates these adjustments and applies them to the weights is called Adam. We get it from the torch.optim library and give it the parameters of our classifier.

# Set the error function using torch.nn as nn library
criterion = nn.NLLLoss()
# Set the optimizer function using torch.optim as optim library
optimizer = optim.Adam(model.classifier.parameters())

Now we train. We want our model to go through the entire dataset multiple times, so we use a for loop. Every time it has gone over the entire set of images, it is called an epoch. In one epoch we want the model to go through both the training set and the validation set.

We start with the training set.

We first set the model to training mode and use a for loop to go through every batch of images. After moving the images and the labels to the appropriate device, we clear the gradients left over from the previous batch by calling optimizer.zero_grad(). We can then compute the output of our model given our images, and how “off” our model is given its output and the correct answers. Then we find the adjustments that decrease this error by calling loss.backward() and use our optimizer to adjust the weights by calling optimizer.step().
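
The five calls described above can be sketched on toy data, which keeps the example fast and free of image files. The two-feature dataset, the one-layer model, and the learning rate of 0.05 are assumptions chosen for a quick demonstration, not the settings used in this tutorial.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data: 64 points with 2 features, labelled by whether the first feature is positive.
inputs = torch.randn(64, 2)
labels = (inputs[:, 0] > 0).long()

model = nn.Sequential(nn.Linear(2, 2), nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.05)  # lr chosen for a quick demo

losses = []
model.train()
for step in range(100):
    optimizer.zero_grad()              # clear gradients from the previous step
    output = model(inputs)             # forward pass
    loss = criterion(output, labels)   # how "off" the model is
    loss.backward()                    # compute gradients
    optimizer.step()                   # adjust the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss at the end should be lower than at the start, which is the signal we watch for when training the real model.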

As we train, we want to know how things are going, so we keep track of the total errors we calculated and print out the progress of the training.

We move on to the validation set.

We set our model to evaluation mode and use a for loop to iterate over all the images in our set. We repeat the steps we took for the training set to get the output of our model and how much our model is “off” from the real labels.

Our model outputs log probabilities because of the LogSoftmax function, but now we want the real probabilities. So we use torch.exp to reverse the log function. We then want to see which class the model guessed for our images. .topk gives us the top class that was guessed and the probability the model assigned to it; we only care about the class, so we can ignore the probability.

To determine how many images it guessed right, we check which guessed classes are equal to the real classes. Then we can average over the entire batch to determine the accuracy of our model (how many images it guessed right divided by the total amount of images).
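
Here is a small sketch of that accuracy calculation on a hand-made batch of four images and three classes. The probabilities and labels are invented for illustration.

```python
import torch

# Pretend model output: log probabilities for 4 images over 3 classes.
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                    [0.1, 0.8, 0.1],
                                    [0.3, 0.3, 0.4],
                                    [0.6, 0.3, 0.1]]))
labels = torch.tensor([0, 1, 2, 1])

# Reverse the log to get probabilities, then take the most likely class per image.
probs = torch.exp(log_probs)
top_p, top_class = probs.topk(1, dim=1)

# Compare guesses with the true labels (reshaped to match top_class).
equals = top_class == labels.view(*top_class.shape)
batch_accuracy = equals.float().mean().item()
print(top_class.view(-1).tolist(), batch_accuracy)
```

The model guesses classes 0, 1, 2, and 0, so three of the four images are correct.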

After going through both the training and validation set, we want to print the errors for both and the accuracy of the validation set.

epochs = 10
for epoch in range(epochs):
    train_loss = 0
    val_loss = 0
    accuracy = 0

    # Training the model
    model.train()
    counter = 0
    for inputs, labels in train_loader:
        # Move to device
        inputs, labels = inputs.to(device), labels.to(device)
        # Clear the gradients from the previous batch
        optimizer.zero_grad()
        # Forward pass
        output = model(inputs)
        # Loss
        loss = criterion(output, labels)
        # Calculate gradients (backpropagation)
        loss.backward()
        # Adjust parameters based on gradients
        optimizer.step()
        # Add the loss to the training set's running loss
        train_loss += loss.item()*inputs.size(0)

        # Print the progress of our training
        counter += 1
        print(counter, "/", len(train_loader))

    # Evaluating the model
    model.eval()
    counter = 0
    # Tell torch not to calculate gradients
    with torch.no_grad():
        for inputs, labels in val_loader:
            # Move to device
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward pass
            output = model(inputs)
            # Calculate Loss
            valloss = criterion(output, labels)
            # Add loss to the validation set's running loss
            val_loss += valloss.item()*inputs.size(0)

            # Since our model outputs a LogSoftmax, find the real
            # percentages by reversing the log function
            output = torch.exp(output)
            # Get the top class of the output
            top_p, top_class = output.topk(1, dim=1)
            # See how many of the classes were correct
            equals = top_class == labels.view(*top_class.shape)
            # Calculate the mean (get the accuracy for this batch)
            # and add it to the running accuracy for this epoch
            accuracy += torch.mean(equals.type(torch.FloatTensor)).item()

            # Print the progress of our evaluation
            counter += 1
            print(counter, "/", len(val_loader))

    # Get the average loss for the entire epoch
    train_loss = train_loss/len(train_loader.dataset)
    valid_loss = val_loss/len(val_loader.dataset)
    # Print out the information
    print('Accuracy: ', accuracy/len(val_loader))
    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(epoch, train_loss, valid_loss))

Step 6: Actually Using our model

That was great! You’ve just built an AI image classifier. But, now we want to actually use it — we want to give it a random image and see which label it thinks it is.

First, we set the model to evaluation mode.

# Set the model to evaluation mode
model.eval()

Then we create a function that can process the image so it can be inputted into our model. We open the image, resize it by keeping the aspect ratio but making the shortest side only 255 px, and crop the center 224px by 224px. We then turn the picture into an array and make sure that the number of color channels is the first dimension instead of the last dimension by transposing the array. Lastly, we convert each value between 0 and 1 by dividing by 255 and normalize the values by subtracting the mean and dividing by the standard deviation. We then convert the array into a Torch tensor and convert the values to float.

These steps are the same steps we specified in Step 2, but this time we must manually code the commands instead of relying on the transforms library.

# Process our image
def process_image(image_path):
    # Load Image (force 3 color channels, as ImageFolder does during training)
    img = Image.open(image_path).convert("RGB")

    # Get the dimensions of the image
    width, height = img.size

    # Resize by keeping the aspect ratio, but changing the dimension
    # so the shortest size is 255px
    img = img.resize((255, int(255*(height/width))) if width < height else (int(255*(width/height)), 255))

    # Get the dimensions of the new image size
    width, height = img.size

    # Set the coordinates to do a center crop of 224 x 224
    left = (width - 224)/2
    top = (height - 224)/2
    right = (width + 224)/2
    bottom = (height + 224)/2
    img = img.crop((left, top, right, bottom))

    # Turn image into numpy array
    img = np.array(img)

    # Make the color channel dimension first instead of last
    img = img.transpose((2, 0, 1))

    # Make all values between 0 and 1
    img = img/255

    # Normalize based on the preset mean and standard deviation
    img[0] = (img[0] - 0.485)/0.229
    img[1] = (img[1] - 0.456)/0.224
    img[2] = (img[2] - 0.406)/0.225

    # Add a fourth dimension to the beginning to indicate batch size
    img = img[np.newaxis,:]

    # Turn into a torch tensor
    image = torch.from_numpy(img)
    image = image.float()
    return image
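
To make the resize and crop arithmetic easier to follow, here is a small sketch that computes only the sizes and crop coordinates, with no image involved. The helper names and the 300 by 600 example size are assumptions for illustration.

```python
def resized_size(width, height, shortest=255):
    """Size after scaling so the shortest side equals `shortest`."""
    if width < height:
        return shortest, int(shortest * (height / width))
    return int(shortest * (width / height)), shortest

def center_crop_box(width, height, crop=224):
    """(left, top, right, bottom) of a centered crop x crop square."""
    left = (width - crop) / 2
    top = (height - crop) / 2
    return left, top, left + crop, top + crop

# A made-up 300 x 600 portrait image.
new_width, new_height = resized_size(300, 600)
box = center_crop_box(new_width, new_height)
print((new_width, new_height))
print(box)
```

The portrait image is scaled to 255 by 510, and the crop box is 224 pixels wide and 224 pixels tall, centered in both directions.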

After processing the image, we can build a function to use our model to predict the label. We input the image into our model and obtain the output. We then reverse the log in the LogSoftmax function that we applied in the output layer and return the top class the model predicted and how certain it is of its guess.

# Using our model to predict the label
def predict(image, model):
    # Move the image to the same device as the model
    image = image.to(next(model.parameters()).device)

    # Pass the image through our model (no gradients needed here)
    with torch.no_grad():
        output = model(image)

    # Reverse the log function in our output
    output = torch.exp(output)

    # Get the top predicted class, and the output percentage for
    # that class
    probs, classes = output.topk(1, dim=1)
    return probs.item(), classes.item()

Lastly, we want to display the image. We turn the image back into an array, and un-normalize it by multiplying by the standard deviation and adding back the mean. We then use the matplotlib.pyplot library to plot the picture.

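# Show Image
def show_image(image):
    # Convert image to numpy (copy so the input tensor is left unchanged)
    image = image.cpu().numpy().copy()

    # Un-normalize the image using one approximate standard deviation
    # and mean for all three channels, keeping values between 0 and 1
    image[0] = np.clip(image[0] * 0.226 + 0.445, 0, 1)

    # Print the image (color channels moved back to the last dimension)
    fig = plt.figure(figsize=(25, 4))
    plt.imshow(np.transpose(image[0], (1, 2, 0)))
    plt.show()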
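
The display code uses a single approximate standard deviation and mean for all three channels. As a hedged comparison, the NumPy sketch below shows that un-normalizing with the exact per-channel values recovers the original pixels, while the single-value shortcut is only slightly off. The random test image and helper names are assumptions for illustration.

```python
import numpy as np

# Per-channel statistics, shaped to broadcast over a channels x height x width array.
MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalize(image_chw):
    return (image_chw - MEAN) / STD

def unnormalize(image_chw):
    return image_chw * STD + MEAN

# A made-up 3 x 4 x 4 image with values between 0 and 1.
rng = np.random.default_rng(0)
original = rng.random((3, 4, 4))

restored = unnormalize(normalize(original))
approximate = normalize(original) * 0.226 + 0.445

print(np.abs(restored - original).max())      # exact round trip
print(np.abs(approximate - original).max())   # error of the single-value shortcut
```

The shortcut is fine for a quick look at the picture, but the per-channel version is the one to use if the colors need to be faithful.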

Now, we can use all these functions to print our model’s guess and how sure it was!

# Process Image
image = process_image("root/image1234.jpg")
# Give image to model to predict output
top_prob, top_class = predict(image, model)
# Show the image
show_image(image)
# Print the results
print("The model is ", top_prob*100, "% certain that the image has a predicted class of ", top_class )

That’s it!

Original. Reposted with permission.

Opinions expressed by AI Time Journal contributors are their own.

About Alexander Wu

Contributor AI & Deep Learning enthusiast | Student at Minerva
