Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

Security teams are increasingly exploring whether large language models can automatically detect vulnerabilities in source code — a task with serious consequences if done poorly. This paper delivers a sobering assessment: even fine-tuned models that score well on benchmarks may be learning surface-level patterns rather than genuine security reasoning. Using carefully curated Linux kernel samples with a strict temporal split to prevent data leakage, the authors show that fine-tuning shifts output calibration without changing underlying decision logic. The implications are significant for any organization considering LLM-assisted code review, penetration testing, or automated vulnerability triage in production systems.
