Effective ML with Limited Data: Where to Start!


Jan 17, 2025 - 19:16

A launch pad for tackling projects with small datasets

Photo by Google DeepMind: https://www.pexels.com/photo/an-artist-s-illustration-of-artificial-intelligence-ai-this-image-depicts-how-ai-can-help-humans-to-understand-the-complexity-of-biology-it-was-created-by-artist-khyati-trehan-as-part-17484975/

Machine Learning (ML) has driven remarkable breakthroughs in computer vision, natural language processing, and speech recognition, largely due to the abundance of data in these fields. However, many challenges — especially those tied to specific product features or scientific research — suffer from limited data quality and quantity. This guide provides a roadmap for tackling small data problems based on your data constraints, and offers potential solutions, guiding your decision making early on.

1. Difficulty with Gathering more Data

Raw data is rarely a blocker for ML projects. High-quality labels, on the other hand, are often prohibitively expensive and laborious to collect, because obtaining an expert-labelled “ground truth” requires domain expertise, intensive fieldwork, or specialised knowledge. For instance, your problem might focus on rare events such as endangered species monitoring, extreme climate events, or unusual manufacturing defects. Other times, business-specific or scientific questions might be too specialised for off-the-shelf large-scale datasets. Ultimately, this means many projects fail because label acquisition is too expensive.

2. Core Challenges of Small Datasets

With only a small dataset, any new project starts off with inherent risks. How much of the true variability does your dataset capture? The smaller your dataset gets, the harder this question is to answer, which makes testing and validation increasingly difficult and leaves a great deal of uncertainty about how well your model actually generalises. Your model doesn’t know what your data doesn’t capture. With potentially only a few hundred samples, both the richness of the features you can extract and the number of features you can use are limited if you want to avoid a significant risk of overfitting (which in many cases you can’t measure). This often leaves you limited to classical ML algorithms (Random Forest, SVM, etc.) or heavily regularised deep learning methods. Class imbalance will only exacerbate these problems, and small datasets are also more sensitive to noise: a few incorrect labels or faulty measurements can cause havoc and headaches.

3. Questions to Get you Started

For me, working the problem starts with asking a few simple questions about the data, labelling process, and end goals. By framing your problem with a “checklist”, we can clarify the constraints of your data. Have a go at answering the questions below:

Is your dataset fully, partially, or mostly unlabelled?

  • Fully labeled: You have labels for (nearly) all samples in your dataset.
  • Partially labelled: A portion of the dataset has labels, but there’s a large portion of unlabelled data.
  • Mostly unlabelled: You have very few (or no) labeled data points.

How reliable are the labels you do have?

  • Highly reliable: Multiple annotators agree on the labels, or the labels are confirmed by trusted experts or well-established protocols.
  • Noisy or weak: Labels may be crowd-sourced, generated automatically, or prone to human or sensor error.

Are you solving one problem, or do you have multiple (related) tasks?

  • Single-task: A singular objective, such as a binary classification or a single regression target.
  • Multi-task: Multiple outputs or multiple objectives.

Are you dealing with rare events or heavily imbalanced classes?

  • Yes: Positive examples are very scarce (e.g., “equipment failure,” “adverse drug reactions,” or “financial fraud”).
  • No: Classes are somewhat balanced, or your task doesn’t involve highly skewed distributions.

Do you have expert knowledge available, and if so, in what form?

  • Human experts: You can periodically query domain experts to label new data or verify predictions.
  • Model-based experts: You have access to well-established simulation or physical models (e.g., fluid dynamics, chemical kinetics) that can inform or constrain your ML model.
  • No: No relevant domain expertise available to guide or correct the model.

Is labelling new data possible, and at what cost?

  • Feasible and affordable: You can acquire more labeled examples if necessary.
  • Difficult or expensive: Labelling is time-intensive, costly, or requires specialised domain knowledge (e.g., medical diagnosis, advanced scientific measurements).

Do you have prior knowledge or access to pre-trained models relevant to your data?

  • Yes: There exist large-scale models or datasets in your domain (e.g., ImageNet for images, BERT for text).
  • No: Your domain is niche or specialised, and there aren’t obvious pre-trained resources.

4. Matching Questions to Techniques

With your answers to the questions above ready, we can move towards establishing a list of potential techniques for tackling your problem. In practice, small dataset problems require careful, nuanced experimentation, so before implementing the techniques below, give yourself a solid foundation: start with a simple model, get a full pipeline working as quickly as possible, and always cross-validate. This gives you a baseline against which to iteratively apply new techniques based on your error analysis, while keeping experiments small in scale. It also helps you avoid building an overly complicated pipeline that’s never properly validated. With a baseline in place, chances are your dataset will evolve rapidly. Tools like DVC or MLflow help track dataset versions and ensure reproducibility. In a small-data scenario, even a handful of new labelled examples can significantly change model performance, and version control helps you manage that systematically.
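To make the "simple model plus cross-validation" advice concrete, here is a minimal sketch of a stratified k-fold loop wrapped around a majority-class baseline. It uses only the Python standard library so it runs anywhere; the function names (`stratified_kfold`, `cross_validate`, `fit_majority`) and the toy data are my own illustrative assumptions, and in a real project you would more likely reach for an existing library implementation.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold sees every class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx


def cross_validate(fit, predict, X, y, k=5, seed=0):
    """Return the per-fold accuracy for any fit/predict pair."""
    scores = []
    for train_idx, test_idx in stratified_kfold(y, k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


def fit_majority(X_train, y_train):
    """Baseline 'model': remember the most common training label."""
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


# Toy imbalanced dataset (hypothetical): 40 negatives, 10 positives.
y = [0] * 40 + [1] * 10
X = [[float(i)] for i in range(len(y))]
scores = cross_validate(fit_majority, predict_majority, X, y, k=5)
print(scores)
```

Any technique you try later has to beat these per-fold numbers. Looking at the spread across folds, not just the mean, is a cheap way to see how much of an apparent improvement is simply noise from a small test split.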

With that in mind, here’s how your answers to the questions above point towards specific strategies described later in this post:

Fully Labeled + Single Task + Sufficiently Reliable Labels:

  • Data Augmentation (Section 5.7) to increase effective sample size.
  • Ensemble Methods (Section 5.9) if you can afford multiple model training cycles.
  • Transfer Learning (Section 5.1) if a pre-trained model in your domain (or a related domain) is available.

Partially Labeled + Labelling is Reliable or Achievable:

  • Semi-Supervised Learning (Section 5.3) to leverage a larger pool of unlabelled data.
  • Active Learning (Section 5.6) if you have a human expert who can label the most informative samples.
  • Data Augmentation (Section 5.7) where possible.

Rarely Labeled or Mostly Unlabelled + Expert Knowledge Available:

  • Active Learning (Section 5.6) to selectively query an expert (especially if the expert is a person).
  • Process-Aware (Hybrid) Models (Section 5.10) if your “expert” is a well-established simulation or model.

Rarely Labeled or Mostly Unlabelled + No Expert / No Additional Labels:

  • Self-Supervised Learning (Section 5.2) to exploit inherent structure in unlabelled data.
  • Few-Shot or Zero-Shot Learning (Section 5.4) if you can rely on meta-learning or textual descriptions to handle novel classes.
  • Weakly Supervised Learning (Section 5.5) if your labels exist but are imprecise or high-level.

Multiple Related Tasks:

  • Multitask Learning (Section 5.8) to share representations between tasks, effectively pooling “signal” across the entire dataset.

Dealing with Noisy or Weak Labels:

  • Weakly Supervised Learning (Section 5.5) which explicitly handles label noise.
  • Combine with Active Learning or a small “gold standard” subset to clean up the worst labelling errors.

Highly Imbalanced / Rare Events:

  • Data Augmentation (Section 5.7) targeting minority classes (e.g., synthetic minority oversampling).
  • Active Learning (Section 5.6) to specifically label more of the rare cases.
  • Process-Aware Models (Section 5.10) or domain expertise to confirm rare cases, if possible.

Have a Pre-Trained Model or Domain-Specific Knowledge:

  • Transfer Learning (Section 5.1) is often the quickest win.
  • Process-Aware Models (Section 5.10) if combining your domain knowledge with ML can reduce data requirements.

5. Strategies for Tackling Small Data

Hopefully, the above has provided a starting point for solving your small data problem. It’s worth noting that many of the techniques discussed are complex and resource intensive, so keep in mind that you’ll likely need to get buy-in from your team and project managers before starting. This is best done through clear, concise communication of the potential value these techniques might provide. Frame experiments as strategic, foundational work that can be reused, refined, and leveraged for future projects. Focus on demonstrating clear, measurable impact from a short, tightly-scoped pilot.

Despite the relatively simple picture painted of each technique below, it’s important to keep in mind that there’s no one-size-fits-all solution. Applying these techniques isn’t like stacking Lego bricks, nor do they work out of the box. To get you started, I’ve provided a brief overview of each technique. This is by no means exhaustive, but it should offer a starting point for your own research.

5.1 Transfer Learning

Transfer learning is about reusing existing models to solve new related problems. By starting with pre-trained weights, you leverage representations learned from large, diverse datasets and fine-tune the model on your smaller, target dataset.

Why it helps:

  • Leverages powerful features learnt from larger, often diverse datasets.
  • Fine-tuning pre-trained models typically leads to higher accuracy, even with limited samples, while reducing training time.
  • Ideal when compute resources or project timelines prevent training a model from scratch.

Tips:

  • Select a model aligned with your problem domain or a large general-purpose “foundation model” like Mistral (language) or CLIP/SAM (vision), accessible on platforms like Hugging Face. These models often outperform domain-specific pre-trained models due to their general-purpose capabilities.
  • Freeze layers that capture general features while fine-tuning only a few layers on top.
  • To counter the risk of overfitting to your small dataset, try pruning: less important weights or connections are removed, reducing the number of trainable parameters and increasing inference speed.
  • If interpretability is required, large black-box models may not be ideal.
  • Without access to the pre-trained model’s source dataset, you risk reinforcing sampling biases during fine-tuning.
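The "freeze the general layers, fine-tune a small head" idea from the tips above can be sketched without any deep learning framework. In the toy example below, `FROZEN_WEIGHTS` is a hypothetical stand-in for a pre-trained backbone (in practice it would come from a published checkpoint), and only a logistic-regression head is trained on the small dataset. All names and numbers are assumptions for illustration.

```python
import math

# Hypothetical stand-in for pre-trained weights; these are never updated.
FROZEN_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def frozen_features(x):
    """Frozen layer: ReLU(W x), with W held constant during fine-tuning."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(X, y, epochs=200, lr=0.1):
    """Train only a logistic-regression head on top of the frozen features."""
    feats = [frozen_features(x) for x in X]  # computed once, since the base is frozen
    weights = [0.0] * len(feats[0])
    bias = 0.0
    for _ in range(epochs):
        for f, target in zip(feats, y):
            pred = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)
            error = pred - target  # gradient of the log-loss w.r.t. the logit
            weights = [w - lr * error * fi for w, fi in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(head, x):
    weights, bias = head
    f = frozen_features(x)
    return int(sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) >= 0.5)


# Tiny labelled target dataset (hypothetical).
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]
y_train = [0, 0, 1, 1]
head = train_head(X_train, y_train)
print([predict(head, x) for x in X_train])
```

Because the base is frozen, the only trainable parameters are the head's three weights and one bias, which is what keeps the risk of overfitting manageable with so few samples. With a real framework the equivalent step is to disable gradient updates for the backbone's parameters and optimise only the new layers.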

A nice example of transfer learning is described in the following paper, where leveraging a pre-trained ResNet model enabled better classification of chest X-ray images for detecting COVID-19. Supported by dropout and batch normalisation, the researchers froze the initial layers of the ResNet base model while fine-tuning the later layers to capture task-specific, high-level features. This proved to be a cost-effective method for achieving high accuracy with a small dataset.

5.2 Self-Supervised Learning

Self-supervised learning is a pre-training technique where artificial tasks (“pretext tasks”) are created to learn representations from broad unlabelled data. Examples include predicting masked tokens for text, or rotation prediction and colourisation for images. The result is general-purpose representations that you can later pair with transfer learning (Section 5.1) or semi-supervised learning (Section 5.3) and fine-tune with your smaller dataset.

Why it helps:

  • Pre-trained models serve as a strong initialisation point, reducing the risk of future overfitting.
  • Learns to represent data in a way that captures intrinsic patterns and structures (e.g., spatial, temporal, or semantic relationships), making them more effective for downstream tasks.

Tips:

  • Pretext tasks built on transformations like cropping, rotation, colour jitter, or noise injection work well for visual data. However, it’s a balance, as excessive augmentation can distort the distribution of small data.
  • Ensure unlabelled data is representative of the small dataset’s distribution to help the model learn features that generalise well.
  • Self-supervised methods can be compute-intensive, often requiring both a large computation budget and enough unlabelled data to truly benefit.
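To show how a pretext task manufactures labels for free, here is a small sketch of rotation prediction: each unlabelled image is rotated by a random multiple of 90 degrees, and the rotation index becomes the training target. Images are plain nested lists here, and `rotate90` and `make_rotation_pretext` are hypothetical helper names, not part of any library.

```python
import random


def rotate90(grid, times):
    """Rotate a 2D list clockwise by 90 degrees, `times` times."""
    for _ in range(times % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return [list(row) for row in grid]


def make_rotation_pretext(images, seed=0):
    """Turn unlabelled images into (rotated_image, rotation_class) pairs."""
    rng = random.Random(seed)
    pairs = []
    for image in images:
        k = rng.randrange(4)  # class 0, 1, 2, 3 = 0, 90, 180, 270 degrees
        pairs.append((rotate90(image, k), k))
    return pairs


# Two tiny unlabelled "images" (hypothetical data).
unlabelled = [
    [[1, 2], [3, 4]],
    [[0, 0, 1], [0, 1, 0], [1, 0, 0]],
]
pretext_pairs = make_rotation_pretext(unlabelled)
for rotated, rotation_class in pretext_pairs:
    print(rotation_class, rotated)
```

A network trained to predict `rotation_class` has to pick up on object orientation and shape, and those learned features are what you carry over to the small labelled task.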

LEGAL-BERT is a prominent example of self-supervised learning. LEGAL-BERT is a domain-specific variant of the BERT language model, pre-trained on a large dataset of legal documents to improve its understanding of legal language, terminology, and context. The key is the use of unlabelled data: techniques such as masked language modelling (the model learns to predict masked words) and next sentence prediction (learning the relationships between sentences, and determining if one follows another) remove the requirement for labelling. This text embedding model can then be used for more specific legal ML tasks.

5.3 Semi-Supervised Learning

Semi-supervised learning leverages a small labelled dataset in addition to a larger unlabelled set. The model iteratively refines its predictions on the unlabelled data, generating task-specific predictions that can be used as “pseudo-labels” in further iterations.

Why it helps:

  • Labeled data guides the task-specific objective, while the unlabelled data is used to improve generalisation (e.g., through pseudo-labelling, consistency regularisation, or other techniques).
  • Improves decision boundaries and can boost generalisation.

Tips:

  • Consistency regularisation is a method that assumes model predictions should be consistent across small perturbations (noise, augmentations) made to unlabelled data. The idea is to “smooth” the decision boundary in sparsely populated regions of a high-dimensional space.
  • Pseudo-labelling allows you to train an initial model with a small dataset and use its predictions on unlabelled data as “pseudo” labels for further training, with the aim of generalising better and reducing overfitting.
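The pseudo-labelling loop described in the tips can be sketched in a few lines. This version uses a nearest-centroid classifier and a softmax over negative distances as its confidence score. Both are my own simplifying assumptions, chosen so the example runs with the standard library alone, and the 0.8 threshold is arbitrary.

```python
import math


def centroids(X, y):
    """Fit a nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_with_confidence(model, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    dists = {label: math.dist(x, c) for label, c in model.items()}
    weights = {label: math.exp(-d) for label, d in dists.items()}
    label = min(dists, key=dists.get)
    return label, weights[label] / sum(weights.values())


def pseudo_label(X_lab, y_lab, X_unlab, threshold=0.8, rounds=5):
    """Iteratively adopt confident predictions on unlabelled points as labels."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(rounds):
        model = centroids(X_lab, y_lab)
        remaining = []
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:
                X_lab.append(x)
                y_lab.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):  # nothing new was confident enough
            break
        pool = remaining
    return centroids(X_lab, y_lab), pool


# Two labelled points and five unlabelled ones (hypothetical data).
X_labelled, y_labelled = [[0.0, 0.0], [4.0, 4.0]], [0, 1]
X_unlabelled = [[0.5, 0.2], [3.8, 4.1], [1.0, 0.8], [3.0, 3.2], [2.0, 2.0]]
model, still_unlabelled = pseudo_label(X_labelled, y_labelled, X_unlabelled)
print(model, still_unlabelled)
```

The point sitting halfway between the two classes never clears the confidence threshold and stays unlabelled, which is the behaviour you want: wrong pseudo-labels get reinforced in later rounds, so it usually pays to be strict about what you accept.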

Financial fraud detection is a problem that naturally lends itself to semi-supervised learning, with very little real labelled data (confirmed fraud cases) and a large set of unlabelled transaction data. The following paper proposes a neat solution: transactions, users, and devices are modelled as nodes in a graph, where edges represent relationships such as shared accounts or devices. The small set of labelled fraudulent data is then used to train the model by propagating fraud signals across the graph to the unlabelled nodes. For example, if a fraudulent transaction (labelled node) is linked to multiple unlabelled nodes (e.g., related users or devices), the model learns patterns and connections that might indicate fraud.

5.4 Few-Shot and Zero-Shot Learning

Few-shot and zero-shot learning refer to a broad collection of techniques designed to tackle very small datasets head-on. Generally, these methods train a model to identify “novel” classes unseen during training, with a small labelled dataset used primarily for testing.

Why it helps:

  • These approaches enable models to quickly adapt to new tasks or classes without extensive retraining.
  • Useful for domains with rare or unique categories, such as rare diseases or niche object detection.

Tips:

  • Probably the most common technique, known as similarity-based learning, trains a model to compare pairs of items and decide if they belong to the same class. By learning a similarity or distance measure, the model can generalise to unseen classes by comparing new instances to class prototypes (built from your small set of labelled data) at test time. This approach requires a good representation of the input (an embedding), often produced by Siamese neural networks or similar models.
  • Optimisation-based meta-learning aims to train a model to quickly adapt to new tasks or classes using only a small amount of training data. A popular example is model-agnostic meta-learning (MAML), where a “meta-learner” is trained on many small tasks, each with its own training and testing examples. The goal is to teach the model to start from a good initial state, so that when it encounters a new task it can quickly learn and adjust with minimal additional training. These are not simple methods to implement.
  • A more classical technique, one-class classification, trains a classifier (such as a one-class SVM) on data from only one class, so that it learns to detect outliers during testing.
  • Zero-shot approaches, such as CLIP or large language models with prompt engineering, enable classification or detection of unseen categories using textual cues (e.g., “a photo of a new product type”).
  • In zero-shot cases, combine with active learning (human in the loop) to label the most informative examples.
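As a rough illustration of the similarity-based approach from the tips above, the sketch below averages a handful of support embeddings into one prototype per class and assigns a query to the most similar prototype by cosine similarity. The embeddings are hard-coded here; in practice they would come from a trained embedding model, and the class names and vectors are purely hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_prototypes(support):
    """support: {class_name: [embedding, ...]} with a few examples per class."""
    prototypes = {}
    for name, embeddings in support.items():
        dim = len(embeddings[0])
        prototypes[name] = [
            sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)
        ]
    return prototypes


def classify(query, prototypes):
    """Assign the query embedding to the most similar class prototype."""
    return max(prototypes, key=lambda name: cosine(query, prototypes[name]))


# Two "shots" per class; the vectors stand in for the output of an embedding model.
support_set = {
    "clownfish": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "angelfish": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.1]],
}
prototypes = build_prototypes(support_set)
print(classify([0.7, 0.3, 0.0], prototypes))
```

Adding a new class only requires adding a few embeddings to `support_set` and rebuilding the prototypes, with no retraining of the embedding model. All of the difficulty is hidden in producing embeddings where same-class items really do end up close together.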

It’s important to maintain realistic expectations when implementing few-shot and zero-shot techniques. Often, the aim is to achieve usable or “good enough” performance. As a direct comparison with traditional deep-learning (DL) methods, the following study evaluates both DL and few-shot learning (FSL) for classifying 20 coral reef fish species from underwater images, with applications for detecting rare species with limited available data. It should come as no surprise that the best model tested was a DL model based on ResNet. With ~3500 examples for each species, the model achieved an accuracy of 78%. However, collecting this volume of data for rare species is impractical. When the number of samples was reduced to 315 per species, the accuracy dropped to 42%. In contrast, the FSL model achieved comparable results with as few as 5 labelled images per species, and better performance beyond 10 shots. Here, the Reptile algorithm was used, which is a meta-learning-based FSL approach. The model was trained by repeatedly solving small classification problems (e.g., distinguishing a few classes) drawn from the MiniImageNet dataset (a useful benchmark dataset for FSL). During fine-tuning, it was then trained using a few labelled examples (1 to 30 shots per species).

5.5 Weakly Supervised Learning

Weakly supervised learning describes a set of techniques for building models when the sources used to label large quantities of data are noisy, inaccurate, or restricted. We can split the topic into three: incomplete, inexact, and inaccurate supervision, distinguished by the confidence in the labels. Incomplete supervision occurs when only a subset of examples has ground-truth labels. Inexact supervision involves coarse-grained labels, like labelling an MRI image as “lung cancer” without specifying detailed attributes. Inaccurate supervision arises when labels are biased or incorrect due to human error.

Why it helps:

  • Partial or inaccurate data is often simpler and cheaper to get hold of.
  • Enables models to learn from a larger pool of information without the need for extensive manual labelling.
  • Focuses on extracting meaningful patterns or features from data, that can amplify the value of any existing well labeled examples.

Tips:

  • Use a small subset of high-quality labels (or an ensemble) to correct systematic labelling errors.
  • For scenarios where coarse-grained labels are available (e.g., image-level labels but not detailed instance-level labels), multi-instance learning can be employed. It focuses on bag-level classification, where instance-level inaccuracies are less impactful.
  • Label filtering, correction, and inference techniques can mitigate label noise and minimise reliance on expensive manual labels.
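To make the multi-instance idea above a little more tangible, here is a minimal sketch of bag-level prediction under one common assumption: a bag (for example, an image) is positive if at least one of its instances (for example, patches) is positive. The instance scorer is a hypothetical placeholder for whatever model produces per-instance scores.

```python
def predict_bag(instance_scores, threshold=0.5):
    """Max pooling: a bag is as positive as its most positive instance."""
    return int(max(instance_scores) >= threshold)


def bag_accuracy(bags, bag_labels, score_instance, threshold=0.5):
    """Evaluate against coarse bag-level labels only; no instance labels needed."""
    correct = 0
    for bag, label in zip(bags, bag_labels):
        scores = [score_instance(x) for x in bag]
        correct += predict_bag(scores, threshold) == label
    return correct / len(bags)


def score_instance(patch):
    """Hypothetical scorer: here each patch is already a score in [0, 1]."""
    return patch


# Three bags of patches with only bag-level labels (hypothetical data).
bags = [[0.1, 0.2, 0.9], [0.1, 0.3], [0.4, 0.45, 0.2]]
bag_labels = [1, 0, 0]
print(bag_accuracy(bags, bag_labels, score_instance))
```

Only the bag label is ever compared with a prediction, so a mislabelled or ambiguous individual patch does not directly count as an error. Other pooling rules (mean, attention-weighted) are possible, and max pooling is just the simplest choice.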

The primary goal of this technique is to estimate more informative or higher dimensional data with limited information. As an example, this paper presents a weakly supervised learning approach to estimating 3D human poses. The method relies on 2D pose annotations, avoiding the need for expensive 3D ground-truth data. Using an adversarial reprojection network (RepNet), the model predicts 3D poses and reprojects them into 2D views to compare with 2D annotations, minimising reprojection error. This approach leverages adversarial training to enforce plausibility of 3D poses and showcases the potential of weakly supervised methods for complex tasks like 3D pose estimation with limited labelled data.

5.6 Active Learning

Active learning seeks to optimise labelling efforts by identifying unlabelled samples that, once labeled, will provide the model with the most informative data. A common approach is uncertainty sampling, which selects samples where the model’s predictions are least certain. This uncertainty is often quantified using measures such as entropy or margin sampling. This is highly iterative; each round influences the model’s next set of predictions.
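The two uncertainty measures mentioned above are simple to compute from predicted class probabilities. The sketch below ranks an unlabelled pool by entropy (higher means more uncertain) or by margin (a smaller gap between the top two classes means more uncertain) and returns the indices to send to your expert. The probabilities are made up for illustration, and `select_for_labelling` is a hypothetical helper name.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin(probs):
    """Gap between the two most likely classes; small gaps mean uncertainty."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_for_labelling(pool_probs, budget, strategy="entropy"):
    """Return the indices of the `budget` most uncertain pool samples."""
    indices = range(len(pool_probs))
    if strategy == "entropy":
        ranked = sorted(indices, key=lambda i: entropy(pool_probs[i]), reverse=True)
    elif strategy == "margin":
        ranked = sorted(indices, key=lambda i: margin(pool_probs[i]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:budget]


# Model predictions for four unlabelled samples over three classes (hypothetical).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # spread across all classes
    [0.50, 0.49, 0.01],  # torn between two classes
    [0.70, 0.20, 0.10],
]
print(select_for_labelling(pool_probs, budget=2, strategy="entropy"))
print(select_for_labelling(pool_probs, budget=2, strategy="margin"))
```

The two strategies pick different samples here: entropy favours the prediction spread across all three classes, while margin favours the one torn between two. That difference is one reason it can help to mix selection methods instead of relying on a single score.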

Why it helps:

  • Optimises expert time; you label fewer samples overall.
  • Quickly identifies edge cases that improve model robustness.

Tips:

  • Diversity sampling is an alternative selection approach that focuses on diverse areas of the feature space. For instance, clustering can be used to select a few representative samples from each cluster.
  • Try to use multiple selection methods to avoid introducing bias.
  • Introducing an expert human in the loop can be logistically difficult: you have to manage their availability, and the labelling workflow can be slow and expensive.

This technique has been extensively used in chemical analysis and materials research, where large databases of real and simulated molecular structures and their properties have been collected over decades. These databases are particularly useful for drug discovery, where simulations like docking are used to predict how small molecules (e.g., potential drugs) interact with targets such as proteins or enzymes. However, the computational cost of performing these types of calculations over millions of molecules makes brute force studies impractical. This is where active learning comes in. One such study showed that by training a predictive model on an initial subset of docking results and iteratively selecting the most uncertain molecules for further simulations, researchers were able to drastically reduce the number of molecules tested while still identifying the best candidates.

5.7 Data Augmentation

Artificially increase your dataset by applying transformations to existing examples — such as flipping or cropping images, translation or synonym replacement for text and time shifts or random cropping for time-series. Alternatively, upsample underrepresented data with ADASYN (Adaptive Synthetic Sampling) and SMOTE (Synthetic Minority Over-sampling Technique).

Why it helps:

  • The model focuses on more general and meaningful features rather than specific details tied to the training set.
  • Instead of collecting and labelling more data, augmentation provides a cost-effective alternative.
  • Improves generalisation by increasing the diversity of training data, helping learn robust and invariant features rather than overfitting to specific patterns.

Tips:

  • Keep transformations domain-relevant (e.g., flipping images vertically might make sense for flower images, less so for medical X-rays).
  • Make sure your augmentations don’t distort the original data distribution and that they preserve the underlying patterns.
  • Explore GANs, VAEs, or diffusion models to produce synthetic data — but this often requires careful tuning, domain-aware constraints, and enough initial data.
  • Synthetic oversampling (like SMOTE) can introduce noise or spurious correlations if the classes or feature space are complex and not well understood.
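To show what synthetic oversampling actually does to your data, here is a simplified SMOTE-style sketch: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is an illustration of the interpolation idea only, not the reference algorithm or any library's implementation, and `smote_like` and the data are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority samples to the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        partner = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to partner
        synthetic.append([b + t * (q - b) for b, q in zip(base, partner)])
    return synthetic


# Four rare-class samples in a 2D feature space (hypothetical data).
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_like(minority, n_new=6)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a segment between two real minority samples. This is also why the warning above matters: if the minority class is made up of several separate clusters, or the features interact in complicated ways, those in-between points may describe examples that could never occur.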

Data augmentation is an incredibly broad topic, with numerous surveys exploring the current state-of-the-art across various fields, including computer vision (review paper), natural language processing (review paper), and time-series data (review paper). It has become an integral component of most machine learning pipelines due to its ability to enhance model generalisation. This is particularly critical for small datasets, where augmenting input data by introducing variations, such as transformations or noise, and removing redundant or irrelevant features can significantly improve a model’s robustness and performance.

5.8 Multitask Learning

Here we train one model to solve several tasks simultaneously. This improves how well models perform by encouraging them to find patterns or solutions that work well for multiple goals at the same time. Lower layers capture general features that benefit all tasks, even if you have limited data for some.

Why it helps:

  • Shared representations are learned across tasks, effectively increasing sample size.
  • The model is less likely to overfit, as it must account for patterns relevant to all tasks, not just one.
  • Knowledge learned from one task can provide insights that improve performance on another.

Tips:

  • Tasks need some overlap or synergy to meaningfully share representations; otherwise this method can hurt performance.
  • Adjust per-task weights carefully to avoid letting one task dominate training.
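The per-task weighting mentioned above usually comes down to a weighted sum of per-task losses. Below is a minimal sketch that also skips missing labels, which is common when not every sample has a target for every task. The task names, weights, and values are hypothetical, and mean squared error is just one possible choice of loss.

```python
def multitask_loss(predictions, targets, task_weights):
    """Weighted sum of per-task mean squared errors, skipping missing labels (None)."""
    total = 0.0
    per_task = {}
    for task, weight in task_weights.items():
        pairs = [
            (p, t) for p, t in zip(predictions[task], targets[task]) if t is not None
        ]
        if not pairs:
            continue  # no labelled samples for this task in the batch
        loss = sum((p - t) ** 2 for p, t in pairs) / len(pairs)
        per_task[task] = loss
        total += weight * loss
    return total, per_task


# Two regression tasks sharing the same three samples (hypothetical values).
predictions = {"tensile": [1.0, 2.0, 3.0], "glass_temp": [10.0, 20.0, 30.0]}
targets = {"tensile": [1.5, 2.0, 2.0], "glass_temp": [12.0, None, None]}
task_weights = {"tensile": 1.0, "glass_temp": 0.1}

total, per_task = multitask_loss(predictions, targets, task_weights)
print(total, per_task)
```

Without the smaller weight, the `glass_temp` task would contribute roughly ten times more to the total than `tensile`, simply because its values are on a larger scale. Normalising targets per task is another common way to deal with the same problem.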

The scarcity of data for many practical applications of ML makes sharing both data and models across tasks an attractive proposition. This is enabled by multitask learning (MTL), where tasks benefit from shared knowledge and correlations in overlapping domains. However, it requires a large, diverse dataset that integrates multiple related properties. Polymer design is one example where this has been successful. Here, a hybrid dataset of 36 properties across 13,000 polymers, covering a mix of mechanical, thermal, and chemical characteristics, was used to train a deep-learning-based MTL architecture. The multitask model outperformed single-task models for every polymer property, particularly for underrepresented properties.

5.9 Ensemble Learning

Ensembles aggregate predictions from several base models to improve robustness. Generally, ML algorithms can be limited in a variety of ways: high variance, high bias, and low accuracy. This manifests as different uncertainty distributions for different models across predictions. Ensemble methods limit the variance and bias errors associated with a single model; for example, bagging reduces variance without increasing the bias, while boosting reduces bias.

Why it helps:

  • Diversifies “opinions” across different model architectures.
  • Reduces variance, mitigating overfitting risk.

Tips:

  • Avoid complex base models which can easily overfit small datasets. Instead, use regularised models such as shallow trees or linear models with added constraints to control complexity.
  • Bootstrap aggregating (bagging) methods like Random Forest can be particularly useful for small datasets. By training multiple models on bootstrapped subsets of the data, you can reduce overfitting while increasing robustness. This is effective for algorithms prone to high variance, such as decision trees.
  • Combine different base model types (e.g., SVM, tree-based models, and logistic regression), using a simple meta-model like logistic regression to merge their predictions.
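Bagging, as described in the tips above, is straightforward to sketch: train many deliberately simple models on bootstrap resamples of the data and let them vote. The example below uses decision stumps (single-threshold rules) as the base model. It is a toy implementation for illustration with hypothetical data, not a replacement for a library Random Forest.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Pick the (feature, threshold, polarity) rule with the fewest training errors."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (0, 1):
                errors = sum(
                    (int(x[feature] > threshold) ^ polarity) != label
                    for x, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, polarity)
    return best[1:]


def predict_stump(stump, x):
    feature, threshold, polarity = stump
    return int(x[feature] > threshold) ^ polarity


def fit_bagging(X, y, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, x):
    """Majority vote across all base models."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Six labelled samples with one feature (hypothetical data).
X = [[0.1], [0.3], [0.4], [0.6], [0.7], [0.9]]
y = [0, 0, 0, 1, 1, 1]
ensemble = fit_bagging(X, y)
print(predict_bagging(ensemble, [0.2]), predict_bagging(ensemble, [0.8]))
```

Each stump sees a slightly different resample and so settles on a slightly different threshold. An unlucky resample can produce a poor stump, but the vote averages those out, which is where the variance reduction comes from.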

As an example, the following paper highlights ensemble learning as a method to improve the classification of cervical cytology images. In this case, three pre-trained neural networks (Inception v3, Xception, and DenseNet-169) were used. The diversity of these base models ensured the ensemble benefited from each model’s unique strengths and feature extraction capabilities. This, combined with the fusion of model confidences via a method that rewards confident, accurate predictions while penalising uncertain ones, maximised the utility of the limited data. Combined with transfer learning, the final predictions were robust to the errors of any particular model, despite the small dataset used.

5.10 Process-Aware (Hybrid) Models

Integrate domain-specific knowledge or physics-based constraints into ML models. This embeds prior knowledge, reducing the model’s reliance on large data to infer patterns. For example, using partial differential equations alongside neural networks for fluid dynamics.

Why it helps:

  • Reduces the data needed to learn patterns that are already well understood.
  • Acts as a form of regularisation, guiding the model to plausible solutions even when the data is sparse or noisy.
  • Improves interpretability and trust in domain-critical contexts.

Tips:

  • Continually verify that model outputs make physical/biological sense, not just numerical sense.
  • Keep domain constraints separate from the learned components, and feed them in as inputs or as penalty terms in your model’s loss function.
  • Be careful to balance domain-based constraints with your model’s ability to learn new phenomena.
  • In practice, bridging domain-specific knowledge with data-driven methods often involves serious collaboration, specialised code, or hardware.
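One way to read "constraints in your model’s loss function" is as an extra penalty term that measures how badly the model’s predictions violate a known equation. The sketch below assumes the known physics is simple exponential decay, dy/dt = -k·y, and compares two candidate models that fit the same two observations equally well. The decay law, the candidate models, and all names are my own assumptions for illustration.

```python
import math


def physics_informed_loss(model, observations, grid, decay_rate, weight=1.0):
    """Data misfit plus a penalty for violating dy/dt = -decay_rate * y."""
    data_loss = sum((model(t) - y) ** 2 for t, y in observations) / len(observations)
    residuals = []
    for t0, t1 in zip(grid, grid[1:]):
        dy_dt = (model(t1) - model(t0)) / (t1 - t0)  # finite-difference slope
        y_mid = 0.5 * (model(t0) + model(t1))        # value at the interval midpoint
        residuals.append(dy_dt + decay_rate * y_mid)  # zero if the law holds exactly
    physics_loss = sum(r * r for r in residuals) / len(residuals)
    return data_loss + weight * physics_loss


DECAY_RATE = 0.5  # assumed known from domain knowledge
observations = [(0.0, 1.0), (1.0, math.exp(-DECAY_RATE))]  # only two measurements
grid = [i * 0.1 for i in range(41)]  # points where the constraint is checked


def physical_model(t):
    return math.exp(-DECAY_RATE * t)


def linear_model(t):
    # A straight line through both observations: perfect on the data alone.
    return 1.0 + (math.exp(-DECAY_RATE) - 1.0) * t


loss_physical = physics_informed_loss(physical_model, observations, grid, DECAY_RATE)
loss_linear = physics_informed_loss(linear_model, observations, grid, DECAY_RATE)
print(loss_physical, loss_linear)
```

Both candidates have essentially zero data loss, so two measurements alone cannot tell them apart. The physics term can, because it is evaluated on a grid that extends well beyond the observed points. The `weight` argument is the knob for balancing the constraint against the model’s freedom to fit the data.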

Constraining a model in this way requires a deep understanding of your problem domain, and is often applied to problems where the environment the model operates in is well understood, such as physical systems. An example of this is lithium-ion battery modelling, where domain knowledge of battery dynamics is integrated into the ML process. This allows the model to capture complex behaviours and uncertainties missed by traditional physical models, ensuring physically consistent predictions and improved performance under real-world conditions like battery aging.

6. Conclusion

For me, projects constrained by limited data are some of the most interesting projects to work on — despite the higher risk of failure, they offer an opportunity to explore the state-of-the-art and experiment. These are tough problems! However, systematically applying the strategies covered in this post can greatly improve your odds of delivering a robust, effective model. Embrace the iterative nature of these problems: refine labels, employ augmentations, and analyse errors in quick cycles. Short pilot experiments help validate each technique’s impact before you invest further.


Effective ML with Limited Data: Where to Start! was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.