ICSE

2025

Vulnerability Detection with Code Language Models: How Far Are We?

SOTA limitations:

- Datasets and benchmarks
  1. Noisy labels: labels derived automatically from vulnerability-fixing commits are noisy.
  2. Small size: accurately human-labeled datasets are small; SVEN, for example, contains only 1.6k samples.
  3. Data duplication: 18.9% of test samples are leaked from the training set.
- Evaluation metrics
  1. Accuracy: because vulnerable samples are rare, a model that always predicts "not vulnerable" scores high accuracy yet is useless in practice.
  2. F1: a high F1 score can still hide a high false positive or false negative rate.
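The two metric pitfalls can be checked with a little arithmetic. The sketch below is only an illustration: the `Confusion` class and all counts are made up for this note and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts; "positive" means "vulnerable"."""
    tp: int
    fp: int
    tn: int
    fn: int

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        vulnerable = self.tp + self.fn
        return self.tp / vulnerable if vulnerable else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def fpr(self) -> float:
        benign = self.fp + self.tn
        return self.fp / benign if benign else 0.0


# Hypothetical imbalanced test set: 30 vulnerable, 970 benign functions.
# A model that always answers "not vulnerable" finds nothing.
always_benign = Confusion(tp=0, fp=0, tn=970, fn=30)

# Hypothetical balanced test set: 500 vulnerable, 500 benign functions.
# The model flags most vulnerable code but also half of the benign code.
trigger_happy = Confusion(tp=450, fp=250, tn=250, fn=50)

print(f"always_benign: accuracy={always_benign.accuracy():.2f}, "
      f"recall={always_benign.recall():.2f}")
print(f"trigger_happy: F1={trigger_happy.f1():.2f}, "
      f"FPR={trigger_happy.fpr():.2f}")
```

With these made-up counts, the first model reaches 0.97 accuracy with zero recall, and the second reaches an F1 of 0.75 while raising a false alarm on half of the benign functions.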

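Solution:

1. PrimeVul: a dataset whose labels come from two labeling techniques, PrimeVul-NvdCheck and PrimeVul-OneFunc.
2. VD-S (Vulnerability Detection Score): the false negative rate measured while the false positive rate is held at or below a fixed tolerance.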
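A minimal sketch of how a VD-S-style score could be computed, assuming it is read as "the lowest false negative rate reachable while the false positive rate stays at or below a tolerance". The function name `vd_s`, the default `max_fpr=0.005`, and the toy scores are assumptions made for this note, not the authors' reference implementation.

```python
from typing import Sequence


def vd_s(scores: Sequence[float], labels: Sequence[int],
         max_fpr: float = 0.005) -> float:
    """Lowest FNR over all decision thresholds whose FPR is <= max_fpr.

    scores: model confidence that a sample is vulnerable (higher = more likely).
    labels: 1 for vulnerable, 0 for benign.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one vulnerable and one benign sample")

    # A threshold above every score flags nothing: FPR = 0 and FNR = 1.
    best_fnr = 1.0
    for threshold in sorted(set(scores)):
        # A sample is flagged as vulnerable when its score reaches the threshold.
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if fp / negatives <= max_fpr:
            best_fnr = min(best_fnr, fn / positives)
    return best_fnr


# Toy data: 4 vulnerable and 6 benign samples (made up for illustration).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.6, 0.2, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

print(vd_s(scores, labels))               # strict tolerance
print(vd_s(scores, labels, max_fpr=0.2))  # looser tolerance
```

On this toy data the strict tolerance allows no false alarms at all, so half of the vulnerable samples are missed (0.5); relaxing the tolerance to 0.2 lowers the miss rate to 0.25. Lower is better, and the score makes the cost of keeping false alarms rare explicit.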