Binary classification with Machine Learning: Neural Networks for classifying Chihuahuas and Muffins
[Article by Elia Togni]
Introduction to Image Classification
Image Classification is one of the most fundamental and studied topics in the field of machine learning. It refers to the ability to understand and categorize an image as a whole under a specific label.
In image classification, the goal is to predict the class to which an input image belongs. While humans excel at this task, mimicking human perception is challenging for machines. Traditional computer vision techniques rely on local descriptors (algorithms or methods that capture information about the local appearance or features of specific regions in an image) to find similarities between images. However, advances in technology have shifted the focus toward Deep Learning, which automatically extracts representative features and patterns from images.
One of the most prominent tools in modern image classification is the Convolutional Neural Network (CNN), a specialized class of neural networks designed for visual data analysis. CNNs are characterized by their ability to apply the convolution operation to extract features from small portions of input images, making them ideal for tasks such as image recognition, object detection, and more.
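To make the convolution operation concrete, here is a minimal pure-Python sketch, not taken from the project's code. It slides a small kernel over a toy image and records one response per position. The toy image, the kernel values, and the function name `conv2d_valid` are illustrative assumptions. As in most deep learning libraries, the kernel is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) without padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the small image patch under the kernel.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical kernel that responds to a left-to-right increase in brightness.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The non-zero column in the output marks the vertical edge. A CNN learns many such kernels from data instead of having them specified by hand.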
Chihuahuas vs. Muffins
This article explores the challenge of binary classification, which involves predicting one of two possible outcomes. Specifically, the task is to classify images as either chihuahuas or muffins. While this might seem trivial for humans, it poses a unique challenge for machine learning models due to the visual similarities between the two categories.
We compare the performance of a Multi-Layer Neural Network (MLNN) and a Convolutional Neural Network (CNN), focusing on their suitability for image classification and their ability to generalize well to unseen data.
Dataset Overview
This project is based on the Chihuahua vs. Muffin dataset from Kaggle. The dataset is split into two folders, one for training and one for testing, each containing images of both classes. After combining and inspecting these folders, the dataset was found to include:
- 3199 chihuahua images
- 2718 muffin images
Data Cleaning
A manual inspection of the dataset revealed several issues, such as mislabeled or irrelevant images. These images were classified into:
- Correctly labeled chihuahua images
- Incorrectly labeled images, such as muffins in the chihuahua folder
- Unrelated images, such as drawings or entirely unrelated objects
[Figure: examples of removed images]
After cleaning, the dataset was restructured to ensure only relevant images for each class were included. Images containing both chihuahuas and muffins or completely unrelated content were discarded.
Addressing Data Imbalance
Classifiers perform best when the dataset is balanced. To assess balance, the Imbalance Ratio was calculated:
$$\text{Imbalance Ratio } \rho = \frac{\text{Number of Instances in Majority Class}}{\text{Number of Instances in Minority Class}}$$
For this dataset, $\rho = 1.22$, which is sufficiently close to 1, indicating no significant imbalance.
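The ratio can be computed in a few lines. The sketch below is illustrative only: the function name `imbalance_ratio` is hypothetical, and the counts are the pre-cleaning totals listed in the Dataset Overview, since the per-class counts after cleaning are not given in this article.

```python
def imbalance_ratio(class_counts):
    """Majority-class count divided by minority-class count."""
    counts = list(class_counts.values())
    return max(counts) / min(counts)


# Pre-cleaning counts from the Dataset Overview section.
raw_counts = {"chihuahua": 3199, "muffin": 2718}
print(round(imbalance_ratio(raw_counts), 2))  # 1.18
```

The raw counts give a ratio of about 1.18, slightly below the reported 1.22. Presumably the reported value was computed on the cleaned dataset, but that is an assumption.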
Preprocessing
Preprocessing Pipeline
The following steps were applied to preprocess the images:
- Image Resizing: All images were resized to $128 \times 128$ to ensure uniform input dimensions.
- Cropping and Padding: Large white borders were removed to help the model focus on relevant features. Zero-padding was used to maintain aspect ratios.
- Normalization: Pixel values were scaled to $[0, 1]$ to improve convergence.
- Data Augmentation: Techniques such as rotation, flipping, and brightness adjustments were used to increase dataset variability and reduce overfitting.
- Segmentation: Simple Linear Iterative Clustering (SLIC) was employed to isolate key regions in the image, removing irrelevant details.
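The padding, resizing, and normalization steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names are hypothetical, padding is applied before resizing so that the aspect ratio is preserved, and a simple nearest-neighbor resize stands in for whatever library routine was actually used.

```python
import numpy as np


def pad_to_square(image: np.ndarray) -> np.ndarray:
    """Zero-pad an H x W x C image so that H == W, keeping the content centered."""
    height, width = image.shape[:2]
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    padded = np.zeros((side, side, image.shape[2]), dtype=image.dtype)
    padded[top:top + height, left:left + width] = image
    return padded


def resize_nearest(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Nearest-neighbor resize to size x size (a stand-in for a library resize)."""
    height, width = image.shape[:2]
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return image[rows][:, cols]


def preprocess(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Pad to a square, resize, and scale pixel values from [0, 255] to [0, 1]."""
    squared = pad_to_square(image)
    resized = resize_nearest(squared, size)
    return resized.astype(np.float32) / 255.0


# Synthetic 200 x 300 white image in place of a real photo.
example = np.full((200, 300, 3), 255, dtype=np.uint8)
result = preprocess(example)
print(result.shape, result.dtype)
```

One caveat: the model definitions in this article also contain a `Rescaling(1./255)` layer. If images were already normalized at this stage, they would be scaled a second time. The article does not say at which stage normalization was actually applied.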
Dataset Variants
From the preprocessing steps, three dataset variants were created:
- Original RGB Images
- Augmented RGB Images
- Segmented RGB Images
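As a rough idea of how the augmented variant could be produced, the sketch below applies a random flip, a rotation, and a brightness change with NumPy. It is an assumption-laden illustration rather than the project's augmentation code: the function name `augment` is hypothetical, rotation is restricted to multiples of 90 degrees for simplicity, and the brightness range of 0.8 to 1.2 is a made-up value.

```python
import numpy as np


def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly flip, rotate by a multiple of 90 degrees, and adjust brightness.

    Expects a square float image in [0, 1] with shape (H, W, C).
    """
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate in the H-W plane
    factor = rng.uniform(0.8, 1.2)  # hypothetical brightness range
    return np.clip(out * factor, 0.0, 1.0)


rng = np.random.default_rng(seed=0)
image = rng.random((128, 128, 3))  # synthetic stand-in for a real photo
augmented = augment(image, rng)
print(augmented.shape)
```

Each call draws new random parameters, so the same source image can yield many distinct training samples.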
Models
Multi-Layer Neural Network (MLNN)
The MLNN model consists of fully connected dense layers, batch normalization, dropout, and activation functions. It was configured to test various architectures:
Code Implementation
from tensorflow.keras import layers, models, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, AUC


def MLNN_model(input_size, num_classes, hidden_layer_units, hidden_activation,
               output_activation, dropout_perc, n_channels, loss, learning_rate=0.001):
    """
    Builds a Multi-Layer Neural Network (MLNN).

    Note: num_classes is currently unused; the output layer is a single unit.
    """
    input_layer = Input(shape=(input_size[0], input_size[1], n_channels))
    # Flatten the image into a vector, then scale pixel values by 1/255.
    flatten_layer = layers.Flatten()(input_layer)
    hidden_layer = layers.Rescaling(1./255)(flatten_layer)
    # One block per entry in hidden_layer_units.
    for units in hidden_layer_units:
        hidden_layer = layers.BatchNormalization()(hidden_layer)
        hidden_layer = layers.Activation(hidden_activation)(hidden_layer)
        hidden_layer = layers.Dropout(dropout_perc)(hidden_layer)
        hidden_layer = layers.Dense(units, activation=hidden_activation)(hidden_layer)
    # Single output unit for binary classification.
    output_layer = layers.Dense(1, activation=output_activation)(hidden_layer)
    model = models.Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss=loss,
                  metrics=[BinaryAccuracy(), Precision(), Recall(), AUC()])
    return model
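To get a feel for the size of such a network, the sketch below counts the weights and biases of the Dense layers only. It rests on two assumptions: the hidden layer sizes (512, 256, 128) are inferred from the model name "MLNN 512_256_128" in the results table, and the input is a flattened 128 x 128 RGB image. Batch normalization parameters are ignored, and the helper name is hypothetical.

```python
def dense_parameter_counts(input_dim, hidden_units, output_units=1):
    """Return weights plus biases for each Dense layer, in order."""
    counts = []
    previous = input_dim
    for units in list(hidden_units) + [output_units]:
        counts.append(previous * units + units)  # weight matrix + bias vector
        previous = units
    return counts


flattened_inputs = 128 * 128 * 3  # 49,152 values per image
counts = dense_parameter_counts(flattened_inputs, (512, 256, 128))
print(counts)       # [25166336, 131328, 32896, 129]
print(sum(counts))  # 25330689
```

Almost all of the roughly 25 million parameters sit in the first Dense layer, because every pixel value is connected to every hidden unit. This is one reason fully connected networks scale poorly to images.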
MLNN Training Summary
Convolutional Neural Network (CNN)
CNNs are designed to handle spatial data and are more effective for image classification. This baseline CNN includes convolutional layers, pooling layers, and a dense layer for classification.
Code Implementation
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, AUC


def CNN_model(input_size=(128, 128), n_channels=3, conv_filters=(32, 64, 128), kernel_size=(3, 3),
              pool_size=(2, 2), dense_units=128, output_activation='sigmoid', loss='binary_crossentropy',
              learning_rate=0.001):
    """
    Builds a Convolutional Neural Network (CNN).
    """
    model = models.Sequential()
    # Scale pixel values by 1/255.
    model.add(layers.Rescaling(1./255, input_shape=(input_size[0], input_size[1], n_channels)))
    # One convolution + max-pooling stage per entry in conv_filters.
    for filters in conv_filters:
        model.add(layers.Conv2D(filters, kernel_size, activation='relu'))
        model.add(layers.MaxPooling2D(pool_size))
    # Classification head: flatten, one dense layer, single output unit.
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation='relu'))
    model.add(layers.Dense(1, activation=output_activation))
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss=loss,
                  metrics=[BinaryAccuracy(), Precision(), Recall(), AUC()])
    return model
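The following sketch traces how the spatial size of the feature maps shrinks through the convolution and pooling stages. It assumes the Keras defaults that the code above relies on: `Conv2D` with no padding and stride 1, and `MaxPooling2D` with a stride equal to the pool size. The helper name is hypothetical.

```python
def conv_stack_output_shapes(input_size=128, conv_filters=(32, 64, 128),
                             kernel_size=3, pool_size=2):
    """Return the (height, width, channels) shape after each conv + pool stage."""
    size = input_size
    shapes = []
    for filters in conv_filters:
        size = size - kernel_size + 1  # Conv2D without padding, stride 1
        size = size // pool_size       # MaxPooling2D, stride equal to pool size
        shapes.append((size, size, filters))
    return shapes


shapes = conv_stack_output_shapes()
print(shapes)  # [(63, 63, 32), (30, 30, 64), (14, 14, 128)]

height, width, channels = shapes[-1]
print(height * width * channels)  # 25088 values enter the Flatten layer
```

With a 128 x 128 input, the dense layer receives 25,088 features instead of the 49,152 raw pixel values a fully connected network would see.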
CNN Training Summary
TogNet
TogNet is a custom CNN designed to address overfitting. Dropout layers were added to improve generalization.
Code Implementation
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, AUC


def TogNet_model(input_size=(128, 128), n_channels=3, conv_filters=(32, 64, 64), kernel_size=(3, 3),
                 pool_size=(2, 2), dense_units=128, hidden_activation='relu', output_activation='sigmoid',
                 loss='binary_crossentropy', learning_rate=0.001, dropout_rate=0.2):
    """
    Builds the TogNet CNN.
    """
    model = models.Sequential()
    # Scale pixel values by 1/255.
    model.add(layers.Rescaling(1./255, input_shape=(input_size[0], input_size[1], n_channels)))
    # One convolution + max-pooling + dropout stage per entry in conv_filters.
    for filters in conv_filters:
        model.add(layers.Conv2D(filters, kernel_size, activation=hidden_activation))
        model.add(layers.MaxPooling2D(pool_size))
        model.add(layers.Dropout(dropout_rate))
    # Classification head with dropout before the single output unit.
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation=hidden_activation))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(1, activation=output_activation))
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss=loss,
                  metrics=[BinaryAccuracy(), Precision(), Recall(), AUC()])
    return model
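For readers unfamiliar with dropout, the sketch below shows the idea behind it in plain Python. It is a simplified illustration, not Keras internals: during training each activation is zeroed with probability `rate`, and the surviving values are scaled by 1 / (1 - rate) so that the expected sum stays the same. The function name and the example values are hypothetical.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10
dropped = inverted_dropout(activations, rate=0.2, rng=rng)
print(dropped)  # each entry is either 0.0 or 1.25
```

At inference time dropout is switched off and activations pass through unchanged. Because a different random subset of units is silenced at every training step, the network cannot rely on any single feature, which tends to reduce overfitting.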
TogNet Training Summary
Results
Performance Metrics
Model | Dataset | Binary Accuracy | Precision | Recall | AUC |
---|---|---|---|---|---|
MLNN 512_256_128 | RGB | 80.38% | 78.02% | 80.08% | 88.01% |
MLNN 512_256_128 | Augmented | 82.92% | 89.06% | 80.95% | 91.54% |
CNN 32_64_128 | RGB | 92.88% | 91.73% | 95.48% | 97.01% |
CNN 32_64_128 | Augmented | 93.61% | 94.18% | 92.27% | 96.60% |
TogNet | RGB | 92.88% | 92.45% | 94.36% | 97.37% |
TogNet | Augmented | 91.53% | 89.40% | 92.86% | 97.03% |
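As a reminder of what the first three columns measure, the sketch below computes binary accuracy, precision, and recall from predicted probabilities using a 0.5 threshold. The labels and probabilities are made-up toy values, not project results, and treating label 1 as the positive class is an assumption. AUC is omitted because it requires sweeping over thresholds.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute binary accuracy, precision, and recall at a fixed threshold."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy data: 1 is the (assumed) positive class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4]
metrics = binary_metrics(y_true, y_prob)
print(metrics)  # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```

Precision answers "of the images predicted positive, how many were correct?", while recall answers "of the truly positive images, how many were found?".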
Comparative Visualizations
Loss and Accuracy Trends for TogNet
Metrics Comparisons for MLNN Variants
Conclusion
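This study demonstrates that CNNs outperform MLNNs in binary image classification: the baseline CNN and the custom TogNet architecture reached between 91.53% and 93.61% binary accuracy, against 80.38% to 82.92% for the MLNN. TogNet on RGB images achieved the highest AUC (97.37%), while the baseline CNN on augmented images achieved the highest binary accuracy (93.61%). Data augmentation improved the accuracy of the MLNN and the baseline CNN but not of TogNet, and segmentation showed mixed results. Future work will explore advanced architectures and larger datasets to further enhance classification accuracy.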
All the code is available on GitHub.