🏆 Code2Bench-2509 benchmark 🏆

Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

Our source code is available on GitHub

📝 Notes

  1. SC-Python: Strongly Constrained Python tasks (217 problems).
  2. WSC-Python: Weakly Structured Constrained Python tasks (194 problems).
  3. SC-Java: Strongly Constrained Java tasks (249 problems).
  4. Completions were generated with greedy decoding (temperature = 0), and results are reported as Pass@1.
  5. Open-source models with fewer than 32B parameters were hosted locally using vLLM; larger/proprietary models were accessed via official APIs.
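
With greedy decoding there is exactly one completion per problem, so Pass@1 reduces to the fraction of problems whose completion passes all of its tests. The snippet below is a minimal sketch of that calculation, not the benchmark's actual evaluation harness; the function name `pass_at_1` and the task identifiers are hypothetical.

```python
from typing import Mapping


def pass_at_1(results: Mapping[str, bool]) -> float:
    """Fraction of problems whose single greedy completion passed all tests.

    `results` maps a problem id to True if the completion passed, else False.
    """
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


# Hypothetical outcomes for four problems (True = all tests passed).
example = {
    "task_001": True,
    "task_002": False,
    "task_003": True,
    "task_004": True,
}
print(f"Pass@1 = {pass_at_1(example):.1%}")
```

Because the temperature is 0, repeated runs on the same model should give the same completions, so a single sample per problem is enough for this metric.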

🙏 Acknowledgements

We thank the EvalPlus team for providing the leaderboard template.