Jan 15, 2025 - 09:24
Binary classification with Machine Learning: Neural Networks for classifying Chihuahuas and Muffins

[Article by Elia Togni]

Introduction to Image Classification

Image Classification is one of the most fundamental and studied topics in the field of machine learning. It refers to the ability to understand and categorize an image as a whole under a specific label.

In image classification, the goal is to predict the class to which an input image belongs. While humans excel at this task, mimicking human perception is challenging for machines. Traditional computer vision techniques rely on local descriptors (algorithms or methods that capture information about the local appearance or features of specific regions in an image) to find similarities between images. However, advances in technology have shifted the focus toward Deep Learning, which automatically extracts representative features and patterns from images.

One of the most prominent tools in modern image classification is the Convolutional Neural Network (CNN), a specialized class of neural networks designed for visual data analysis. CNNs are characterized by their ability to apply the convolution operation to extract features from small portions of input images, making them ideal for tasks such as image recognition, object detection, and more.

Chihuahuas vs. Muffins

This article explores the challenge of binary classification, which involves predicting one of two possible outcomes. Specifically, the task is to classify images as either chihuahuas or muffins. While this might seem trivial for humans, it poses a unique challenge for machine learning models due to the visual similarities between the two categories.

We compare the performance of a Multi-Layer Neural Network (MLNN) and a Convolutional Neural Network (CNN), focusing on their suitability for image classification and their ability to generalize well to unseen data.

Dataset Overview

This project is based on the Chihuahua vs. Muffin dataset from Kaggle. The dataset is split into two folders, one for training and one for testing, each containing images of both classes. After combining and inspecting these folders, the dataset was found to include:

  • 3199 chihuahua images
  • 2718 muffin images

Data Cleaning

A manual inspection of the dataset revealed several issues, such as mislabeled or irrelevant images. During inspection, each image was sorted into one of three categories:

  • Correctly labeled chihuahua images
  • Incorrectly labeled images, such as muffins in the chihuahua folder
  • Unrelated images, such as drawings or entirely unrelated objects

Some examples of images removed include:

Chihuahua Visual Inspection

Muffin Visual Inspection

After cleaning, the dataset was restructured to ensure only relevant images for each class were included. Images containing both chihuahuas and muffins or completely unrelated content were discarded.

Addressing Data Imbalance

Classifiers perform best when the dataset is balanced. To assess balance, the Imbalance Ratio was calculated:

$$\text{Imbalance Ratio: } \rho = \frac{\text{Number of Instances in Majority Class}}{\text{Number of Instances in Minority Class}}$$

For this dataset, $\rho = 1.22$, which is sufficiently close to 1, indicating no significant imbalance.
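As a quick sketch, the ratio can be computed directly from the class counts; the raw counts listed above give roughly 1.18, so the reported 1.22 presumably refers to the counts after cleaning:

# Imbalance ratio from per-class counts (raw, pre-cleaning figures from above).
n_chihuahua = 3199
n_muffin = 2718

rho = max(n_chihuahua, n_muffin) / min(n_chihuahua, n_muffin)
print(f"Imbalance ratio: {rho:.2f}")  # ~1.18 on the raw counts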

Preprocessing

Preprocessing Pipeline

The following steps were applied to preprocess the images:

  1. Image Resizing: All images were resized to $128 \times 128$ to ensure uniform input dimensions.
  2. Cropping and Padding: Large white borders were removed to help the model focus on relevant features. Zero-padding was used to maintain aspect ratios.
  3. Normalization: Pixel values were scaled to $[0, 1]$ to improve convergence.
  4. Data Augmentation: Techniques such as rotation, flipping, and brightness adjustments were used to increase dataset variability and reduce overfitting.
  5. Segmentation: Simple Linear Iterative Clustering (SLIC) was employed to isolate key regions in the image, removing irrelevant details.
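
A minimal sketch of how such a pipeline might look with TensorFlow and scikit-image is shown below; the exact parameters (rotation range, brightness delta, SLIC settings) are illustrative assumptions, not the values used in the project:

import tensorflow as tf
from tensorflow.keras import layers
from skimage.segmentation import slic

# Steps 1-3: resize with zero-padding to 128x128, then scale pixels to [0, 1].
def resize_and_normalize(image):
    image = tf.image.resize_with_pad(image, 128, 128)  # zero-pads to preserve aspect ratio
    return tf.cast(image, tf.float32) / 255.0

# Step 4: augmentation via random flipping, rotation, and brightness changes.
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),                        # rotation range is an assumption
    layers.RandomBrightness(0.2, value_range=(0, 1)),  # inputs assumed already in [0, 1]
])

# Step 5: SLIC superpixel segmentation (operates on a NumPy array).
def segment(image_np, n_segments=100):
    return slic(image_np, n_segments=n_segments, compactness=10.0)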

Preprocessing Pipeline

Chihuahua Pre and Post Processing

Dataset Variants

From the preprocessing steps, three dataset variants were created:

  1. Original RGB Images
  2. Augmented RGB Images
  3. Segmented RGB Images
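
Assuming each variant lives in its own directory with one subfolder per class (the paths below are hypothetical), a variant can be loaded with Keras' directory utility:

import tensorflow as tf

# Hypothetical layout: data/<variant>/{chihuahua,muffin}/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/original_rgb",   # or "data/augmented_rgb", "data/segmented_rgb"
    label_mode="binary",   # 0/1 labels, matching the sigmoid output and binary cross-entropy
    image_size=(128, 128),
    batch_size=32,
)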

Models

Multi-Layer Neural Network (MLNN)

The MLNN model consists of fully connected dense layers, batch normalization, dropout, and activation functions. It was configured to test various architectures:

Code Implementation

from tensorflow.keras import layers, models, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, AUC

def MLNN_model(input_size, num_classes, hidden_layer_units, hidden_activation,
                output_activation, dropout_perc, n_channels, loss, learning_rate=0.001):
    """
    Builds a Multi-Layer Neural Network (MLNN).
    """
    # Flatten each image into a vector and rescale pixel values to [0, 1].
    input_layer = Input(shape=(input_size[0], input_size[1], n_channels))
    flatten_layer = layers.Flatten()(input_layer)
    hidden_layer = layers.Rescaling(1./255)(flatten_layer)

    # One block per entry in hidden_layer_units:
    # batch normalization, activation, dropout, then a dense layer.
    for units in hidden_layer_units:
        hidden_layer = layers.BatchNormalization()(hidden_layer)
        hidden_layer = layers.Activation(hidden_activation)(hidden_layer)
        hidden_layer = layers.Dropout(dropout_perc)(hidden_layer)
        hidden_layer = layers.Dense(units, activation=hidden_activation)(hidden_layer)

    # Single output unit for binary classification.
    output_layer = layers.Dense(1, activation=output_activation)(hidden_layer)
    model = models.Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss=loss,
                  metrics=[BinaryAccuracy(), Precision(), Recall(), AUC()])

    return model
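
As a usage sketch, the 512_256_128 configuration reported in the results can be built as follows; the activation and dropout values here are illustrative assumptions, not the exact experimental settings:

mlnn = MLNN_model(
    input_size=(128, 128),
    num_classes=2,
    hidden_layer_units=[512, 256, 128],  # the "512_256_128" variant from the results
    hidden_activation="relu",            # assumption
    output_activation="sigmoid",
    dropout_perc=0.2,                    # assumption
    n_channels=3,
    loss="binary_crossentropy",
)
mlnn.summary()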

MLNN Training Summary

MLNN Summary

Convolutional Neural Network (CNN)

CNNs are designed to exploit the spatial structure of images, making them more effective than MLNNs for image classification. This baseline CNN includes convolutional layers, pooling layers, and a dense layer for classification.

Code Implementation

from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, AUC

def CNN_model(input_size=(128, 128), n_channels=3, conv_filters=(32, 64, 128), kernel_size=(3, 3),
              pool_size=(2, 2), dense_units=128, output_activation='sigmoid', loss='binary_crossentropy',
              learning_rate=0.001):
    """
    Builds a Convolutional Neural Network (CNN).
    """
    model = models.Sequential()
    # Rescale pixel values to [0, 1] directly inside the model.
    model.add(layers.Rescaling(1./255, input_shape=(input_size[0], input_size[1], n_channels)))

    # Stack of convolution + max-pooling blocks with increasing filter counts.
    for filters in conv_filters:
        model.add(layers.Conv2D(filters, kernel_size, activation='relu'))
        model.add(layers.MaxPooling2D(pool_size))

    # Classification head: flatten, one dense layer, then a single sigmoid unit.
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation='relu'))
    model.add(layers.Dense(1, activation=output_activation))
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss=loss,
                  metrics=[BinaryAccuracy(), Precision(), Recall(), AUC()])

    return model
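
Since the defaults match the 32_64_128 variant from the results table, building that model is a one-liner:

cnn = CNN_model()  # defaults give the 32_64_128 filter stack
cnn.summary()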

CNN Training Summary

CNN Summary

TogNet

TogNet is a custom CNN designed to address overfitting. Dropout layers were added to improve generalization.

Code Implementation

from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, AUC

def TogNet_model(input_size=(128, 128), n_channels=3, conv_filters=(32, 64, 64), kernel_size=(3, 3),
                 pool_size=(2, 2), dense_units=128, hidden_activation='relu', output_activation='sigmoid',
                 loss='binary_crossentropy', learning_rate=0.001, dropout_rate=0.2):
    """
    Builds the TogNet CNN.
    """
    model = models.Sequential()
    model.add(layers.Rescaling(1./255, input_shape=(input_size[0], input_size[1], n_channels)))

    # Convolutional blocks with dropout after each pooling layer to curb overfitting.
    for filters in conv_filters:
        model.add(layers.Conv2D(filters, kernel_size, activation=hidden_activation))
        model.add(layers.MaxPooling2D(pool_size))
        model.add(layers.Dropout(dropout_rate))

    # Dense head, also regularized with dropout.
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation=hidden_activation))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(1, activation=output_activation))

    model.compile(optimizer=Adam(learning_rate=learning_rate), loss=loss,
                  metrics=[BinaryAccuracy(), Precision(), Recall(), AUC()])

    return model
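
Training then follows the standard Keras pattern. A minimal sketch, reusing a dataset loaded as in the earlier snippet (val_ds and the epoch count are assumptions):

tognet = TogNet_model()
history = tognet.fit(
    train_ds,                # e.g. loaded with image_dataset_from_directory above
    validation_data=val_ds,  # hypothetical held-out split
    epochs=20,               # assumption; the article does not state the epoch count
)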

TogNet Training Summary

TogNet Summary

Results

Performance Metrics

Model              Dataset    Binary Accuracy  Precision  Recall  AUC
MLNN 512_256_128   RGB        80.38%           78.02%     80.08%  88.01%
MLNN 512_256_128   Augmented  82.92%           89.06%     80.95%  91.54%
CNN 32_64_128      RGB        92.88%           91.73%     95.48%  97.01%
CNN 32_64_128      Augmented  93.61%           94.18%     92.27%  96.60%
TogNet             RGB        92.88%           92.45%     94.36%  97.37%
TogNet             Augmented  91.53%           89.40%     92.86%  97.03%

Comparative Visualizations

Loss and Accuracy Trends for TogNet

Metrics Comparisons for MLNN Variants

Conclusion

This study demonstrates that CNNs, particularly the custom TogNet architecture, outperform MLNNs in binary image classification. Data augmentation improved model performance, while segmentation showed mixed results. Future work will explore advanced architectures and larger datasets to further enhance classification accuracy.

All the code is available on GitHub.