"Evaluation as Science": The First Survey on Large Language Model Evaluation, One Article to Help You Understand Large (English)

Generated: 2026-06-23 06:53:13

---

Alright, let me first walk you through the facts, and then I'll rewrite it properly.

A few things need to be corrected:

About "A Survey on Evaluation of Large Language Models" being the first survey in large model evaluation: that's too absolute, and hard to verify. Note that this title is itself the Chang et al. (2023) survey, so that paper can't be cited as an earlier one. Better to say "one of the earlier systematic surveys in this field."
"Our open-source project maintains a bunch of evaluation benchmarks: AlpacaEval, HELM, Big-Bench, and our own PromptBench" — you don't actually maintain those benchmarks in your project; you cited and compiled them in your survey. It's more accurate to say "we have curated many evaluation benchmarks in our open-source project."
"Our team's PromptBench" — PromptBench originally came from Microsoft and Peking University. If you weren't a core author but contributed, you could say "I've also worked on PromptBench" or keep it vague as "the PromptBench benchmark