🏆 Code2Bench-2505 benchmark 🏆

Code2Bench: Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Code

Our source code is available on GitHub.

📝 Notes

  1. This page presents the experimental setup used to evaluate state-of-the-art large language models (LLMs) on the CODE2BENCH-2505 dataset.
  2. For code completion, we adopted greedy decoding (temperature = 0) and report Pass@1, following standard practice for deterministic code generation established in prior studies (see the first sketch after this list).
  3. Open-source models with fewer than 32 billion parameters were hosted and served locally with vLLM for efficient inference, while larger open-source models and proprietary models were accessed through their official APIs (see the second sketch after this list).
  4. For details about the specific projects used to build CODE2BENCH-2505, please click here.
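
As a minimal illustration of note 2: with greedy decoding there is exactly one completion per task, so Pass@1 reduces to the fraction of tasks whose single completion passes all tests. The function below is a sketch of that computation, not the benchmark's actual harness.

```python
# Sketch of Pass@1 under greedy decoding (note 2): one deterministic sample per
# task, so Pass@1 is simply the fraction of tasks whose sample passes all tests.
def pass_at_1(passed: list[bool]) -> float:
    """passed[i] is True iff the single greedy completion for task i passes its tests."""
    return sum(passed) / len(passed)

# Example: 3 of 4 tasks solved -> Pass@1 = 0.75
print(pass_at_1([True, True, False, True]))
```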
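And a hedged sketch of local inference with vLLM's offline API for note 3; the model identifier and prompt are placeholders, not models actually used in the benchmark:

```python
# Sketch of local greedy-decoding inference with vLLM (note 3).
# The model ID and prompt below are placeholders.
from vllm import LLM, SamplingParams

# temperature = 0 gives deterministic (greedy) generation for Pass@1.
sampling = SamplingParams(temperature=0, max_tokens=512)

llm = LLM(model="example-org/example-coder-7b")  # placeholder model ID
outputs = llm.generate(["def fibonacci(n):"], sampling)
print(outputs[0].outputs[0].text)
```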

🙏 Acknowledgements

We thank the EvalPlus team for providing the leaderboard template.