🏆 Code2Bench-2505 benchmark 🏆
Code2Bench: Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Code
Our source code is available on GitHub.
📝 Notes
- This section describes the experimental setup used to evaluate state-of-the-art large language models (LLMs) on the CODE2BENCH-2505 dataset.
- For code completion, we adopted greedy decoding (temperature = 0) and report Pass@1, following standard practice for deterministic code generation (see the sketches after this list).
- Open-source models with fewer than 32 billion parameters were served locally with vLLM for efficient inference; larger open-source models and proprietary models were accessed through their official APIs.
- For details on the projects used to build CODE2BENCH-2505, please click here.
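
A minimal sketch of the local serving and greedy-decoding setup described above, using vLLM's `LLM` and `SamplingParams` interfaces. The model name and prompt are placeholders, not part of CODE2BENCH-2505:

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint: any locally hosted open-source model under 32B would do.
llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct")

# Greedy decoding: temperature=0 makes generation deterministic,
# matching the Pass@1 protocol above.
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ['def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n']
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```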
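
Under single-sample greedy decoding, Pass@1 reduces to the fraction of problems whose one completion passes all tests. A sketch of that reduction; the `passed` flags are assumed to come from a test harness not shown here:

```python
def pass_at_1(passed: list[bool]) -> float:
    """Pass@1 with one deterministic sample per problem:
    the share of problems whose single completion passes all tests."""
    return sum(passed) / len(passed) if passed else 0.0

# Example: 3 of 4 problems solved -> Pass@1 = 0.75
print(pass_at_1([True, True, False, True]))
```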