Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning

Reinforcement finetuning uses reward signals to guide the large language model toward desirable behavior. This method sharpens the model’s ability to produce logical and structured outputs by reinforcing correct responses. Yet, the challenge persists in ensuring that these models also know when not to respond—particularly when faced with incomplete or misleading questions that don’t have […] The post Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning appeared first on MarkTechPost.

Jun 6, 2025 - 08:50

Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning

Reinforcement finetuning uses reward signals to guide the large language model toward desirable behavior. This method sharpens the model’s ability to produce logical and structured outputs by reinforcing correct responses. Yet, the challenge persists in ensuring that these models also know when not to respond—particularly when faced with incomplete or misleading questions that don’t have a definite answer.

The problem arises when language models, after reinforcement finetuning, begin to lose their ability to refuse to answer unclear or ambiguous queries. Instead of signaling uncertainty, the models tend to produce confidently stated but incorrect responses. This phenomenon, identified in the paper as the “hallucination tax,” highlights a growing risk. As models are trained to perform better, they may also become more likely to hallucinate answers in situations where silence would be more appropriate. This is especially hazardous in domains that require high trust and precision.

Tools currently used in training large language models often overlook the importance of refusal behavior. Reinforcement finetuning frameworks tend to reward only correct answers while penalizing incorrect ones, ignoring cases where a valid response should be no answer at all. The reward systems in use do not sufficiently reinforce refusal, resulting in overconfident models. For instance, the paper shows that refusal rates dropped to near zero across multiple models after standard RFT, demonstrating that current training fails to address hallucination properly.

Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset. SUM introduces implicitly unanswerable math problems by modifying existing questions through criteria such as missing key information or creating logical inconsistencies. The researchers used DeepScaleR as the base dataset and employed the o3-mini model to generate high-quality unanswerable questions. This synthetic dataset aims to teach models to recognize when a problem lacks sufficient information and respond accordingly.

SUM’s core technique is to mix answerable and unanswerable problems during training. Questions are modified to become ambiguous or unsolvable while maintaining plausibility. The training prompts instruct models to say “I don’t know” for unanswerable inputs. By introducing only 10% of the SUM data into reinforcement finetuning, models begin to leverage inference-time reasoning to evaluate uncertainty. This structure allows them to refuse answers more appropriately without impairing their performance on solvable problems.

Performance analysis shows significant improvements. After training with SUM, the Qwen2.5-7B model increased its refusal rate from 0.01 to 0.73 on the SUM benchmark and from 0.01 to 0.81 on the UMWP benchmark. On the SelfAware dataset, refusal accuracy rose dramatically from 0.01 to 0.94. Llama-3.1-8B-Instruct showed a similar trend, with refusal rates improving from 0.00 to 0.75 on SUM and from 0.01 to 0.79 on UMWP. Despite these gains in refusal behavior, accuracy on answerable datasets, such as GSM8K and MATH-500, remained stable, with most changes ranging from 0.00 to -0.05. The minimal drop indicates that refusal training can be introduced without major sacrifices in task performance.

This study outlines a clear trade-off between improved reasoning and trustworthiness. Reinforcement finetuning, while powerful, tends to suppress cautious behavior. The SUM dataset corrects this by teaching models to recognize what they cannot solve. With only a small addition to training data, language models become better at identifying the boundaries of their knowledge. This approach marks a significant step in making AI systems not just smarter but also more careful and honest.

Check out the Paper and Dataset on Hugging Face. All credit for this research goes to the researchers of this project.