🏆 Code2Bench-2505 benchmark 🏆

Code2Bench: Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Code

Our source code is available on GitHub.

📝 Notes

  1. This page presents the experimental setup used to evaluate state-of-the-art large language models (LLMs) on the CODE2BENCH-2505 dataset.
  2. For code completion, we adopted greedy decoding (temperature = 0) and report Pass@1, following standard practice for deterministic code generation established in prior studies (see the first sketch after this list).
  3. Open-source models with fewer than 32 billion parameters were hosted and served locally with vLLM for efficient inference, while larger open-source models and proprietary models were accessed through their official APIs (see the second sketch after this list).
  4. For details about the specific projects used to build CODE2BENCH-2505, please click here.
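
As a minimal illustration of note 2: with greedy decoding there is exactly one completion per task, so Pass@1 reduces to the fraction of tasks whose single completion passes all tests. The function below is a sketch of that computation, not the benchmark's actual harness.

```python
# Sketch of Pass@1 under greedy decoding (note 2): one deterministic sample per
# task, so Pass@1 is simply the fraction of tasks whose sample passes all tests.
def pass_at_1(passed: list[bool]) -> float:
    """passed[i] is True iff the single greedy completion for task i passes its tests."""
    return sum(passed) / len(passed)

# Example: 3 of 4 tasks solved -> Pass@1 = 0.75
print(pass_at_1([True, True, False, True]))
```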
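And a hedged sketch of local inference with vLLM's offline API for note 3; the model identifier and prompt are placeholders, not models actually used in the benchmark:

```python
# Sketch of local greedy-decoding inference with vLLM (note 3).
# The model ID and prompt below are placeholders.
from vllm import LLM, SamplingParams

# temperature = 0 gives deterministic (greedy) generation for Pass@1.
sampling = SamplingParams(temperature=0, max_tokens=512)

llm = LLM(model="example-org/example-coder-7b")  # placeholder model ID
outputs = llm.generate(["def fibonacci(n):"], sampling)
print(outputs[0].outputs[0].text)
```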

🙏 Acknowledgements

We thank the EvalPlus team for providing the leaderboard template.