Meituan LLM Algorithm Interview, Round Two: The Function Call Triple Threat!

Generated: 2026-06-24 00:08:19

---

That Meituan Interview Almost Made Me Cry

I’d barely sat down, my back not even touching the chair.

The interviewer glanced at my resume, and when his eyes landed on the “tool calling” project experience, they lit right up. Then he hit me with three questions—each one more brutal than the last. I felt like I was standing in a storm, raindrops smacking me in the face, and I still had to force a smile.

This was the second round at Meituan, for the basic research platform’s agent team. The job description clearly said: “Optimize function call, multi-agent coordination.” The interviewer had obviously worked on real projects—every question cut straight to the core.

Later, when I talked to other candidates, I found out those three questions had already earned a reputation in the community: “The Function Call Triple Threat.” Few people online could explain them clearly, but Meituan interviewers loved digging into them.

Today I'm spilling all the blood and tears. You might want to buckle up.

---

Question One: How do you actually train Function Call?

Let me ask you something first—do you also think that just because you’ve called a few commercial large model APIs and used tool_calls, you already understand Function Call?

Way too optimistic.

What the interviewer really wanted to see was whether you had the full training chain in your head.

Take a look at the Llama 3 technical report and you’ll get it: tool calling ability isn’t learned during pre-training—it’s crammed in during post-training. How? Through repeated SFT and DPO iterations, polished little by little.
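A quick note for anyone who hasn't met DPO. In its standard form, DPO trains the policy directly on preference pairs and needs no separate reward model. The textbook objective is below:

- $x$ is the dialogue context.
- $y_w$ and $y_l$ are the preferred and rejected replies.
- $\pi_{\mathrm{ref}}$ is a frozen reference model, usually the SFT checkpoint.
- $\beta$ controls how far the policy may drift from that reference.
- $\sigma$ is the logistic sigmoid.

Llama 3's exact variant may add its own modifications, which this sketch does not cover.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
```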

A few points that go against common belief:

The annotators only score the assistant’s reasoning process; they never touch the tool information. Why? Judging whether a tool call is correct is a purely technical job, and ordinary annotators would add too much noise. So correctness is left to rules or automatic validation, while humans only judge how natural and helpful the reply is.
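Here is a minimal sketch of what rule-based correctness checking could look like. It assumes a gold reference call is available for each sample. The check parses the predicted call as JSON and requires the same function name and the same arguments. `tool_call_matches` and the exact-match rule are illustrative, not Llama 3's actual validator. A real pipeline might accept equivalent argument values, or execute the call and compare results.

```python
import json


def tool_call_matches(predicted: str, gold: dict) -> bool:
    """Return True if `predicted` is a JSON tool call with the same name and arguments as `gold`."""
    try:
        call = json.loads(predicted)
    except json.JSONDecodeError:
        return False  # malformed JSON is an automatic failure
    if not isinstance(call, dict):
        return False  # valid JSON, but not a call object
    # Dict equality ignores key order, so argument order does not matter
    return call.get("name") == gold["name"] and call.get("arguments") == gold["arguments"]


gold = {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}
print(tool_call_matches('{"name": "func_add", "arguments": {"x2": 234519, "x1": 125679}}', gold))  # True
print(tool_call_matches('{"name": "func_add", "arguments": {"x1": "125679", "x2": 234519}}', gold))  # False: string, not number
print(tool_call_matches('func_add(125679, 234519)', gold))  # False: not JSON
```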

Also, they don’t use rejection sampling. Many teams swear by it, believing it boosts tool-use performance. But Llama 3’s internal tests showed no significant gain, so they dropped it. That’s the big-company attitude: no superstition about methods, only results.

And the data difficulty ramps up gradually. First come annotated single-turn tool-use dialogues, so the model gets familiar with simple scenarios. Then come multi-turn dialogues mixed with tool calls. The last layer is the most brutal: multi-step tool calls combined with data analysis. It’s like a video game: you level up in the beginner zone before taking on the boss.
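One simple way to reflect this ramp in a training run is to order the data mix by stage. This sketch assumes each sample carries a hypothetical `stage` label. The label names are made up to mirror the three layers above.

```python
from typing import Dict, List

# Hypothetical stage labels, ordered from easiest to hardest
STAGES = ["single_turn_tool", "multi_turn_dialog", "multi_step_analysis"]


def curriculum_order(samples: List[Dict]) -> List[Dict]:
    """Sort samples so every easier stage comes before any harder one.

    Python's sort is stable, so the original order is kept within each stage.
    """
    rank = {name: i for i, name in enumerate(STAGES)}
    return sorted(samples, key=lambda s: rank[s["stage"]])


data = [
    {"id": 1, "stage": "multi_step_analysis"},
    {"id": 2, "stage": "single_turn_tool"},
    {"id": 3, "stage": "multi_turn_dialog"},
    {"id": 4, "stage": "single_turn_tool"},
]
print([s["id"] for s in curriculum_order(data)])  # [2, 4, 3, 1]
```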

Here’s how I handled my own project at the time.

I fine-tuned a Qwen2.5-7B model, following basically the same logic. My data mainly came from Glaive’s glaive-function-calling-v2-sharegpt. Its coverage is pretty good, but there was a big problem: most of the tool descriptions were in English. For our domestic Chinese scenario, we needed Chinese descriptions and local APIs.

No choice but to create my own.

I hand-wrote a batch of SFT data, each sample with four core fields:

system: a list of tools in JSON Schema format. Not a single field could be wrong.
user: a real user instruction, like “Help me check today’s weather in Beijing, then send an email to Mr. Zhang telling him.”
assistant: reason first, then call tools, with the tool_calls embedded in the reply.
tool: the result returned after the tool executed.
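Here is a sketch of what one such sample might look like as a Python structure, following the four fields above. Several details are illustrative assumptions, not necessarily the exact layout used in the project:

- the tool names (`get_weather`, `send_email`)
- the OpenAI-style `messages` / `tool_calls` layout
- the returned weather values

Only the first tool round is shown.

```python
import json

# Illustrative tool list in JSON Schema form (names and fields are assumptions)
tools = [
    {"type": "function", "function": {
        "name": "get_weather",
        "description": "Return current temperature and weather condition for a given city, NOT air quality.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string", "description": "City name, e.g. Beijing"}},
                       "required": ["city"]}}},
    {"type": "function", "function": {
        "name": "send_email",
        "description": "Send a plain-text email to one recipient.",
        "parameters": {"type": "object",
                       "properties": {"to": {"type": "string"}, "body": {"type": "string"}},
                       "required": ["to", "body"]}}},
]

sample = {"messages": [
    {"role": "system", "content": "You have access to the following tools:\n" + json.dumps(tools, ensure_ascii=False)},
    {"role": "user", "content": "Help me check today's weather in Beijing, then send an email to Mr. Zhang telling him."},
    # Assistant turn: a short thought first, then the structured tool call
    {"role": "assistant", "content": "Let me check the weather in Beijing first.",
     "tool_calls": [{"name": "get_weather", "arguments": {"city": "Beijing"}}]},
    # Tool turn: the raw result returned by the executed tool (placeholder values)
    {"role": "tool", "content": json.dumps({"temperature_c": 26, "condition": "sunny"})},
]}

print([m["role"] for m in sample["messages"]])  # ['system', 'user', 'assistant', 'tool']
```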

The pitfalls I stepped into? Let me tell you, it was all tears.

First pitfall: the tool description was too vague. I had a get_weather function whose description just said “Get weather.” Later, the model started calling that function for air-quality requests too. The user would ask, “What’s the air like in Beijing today?” and it would fire off a weather call. Unacceptable. After that I made it a rule: every description must clearly state the inputs, the outputs, and when to use the tool. It should even tell the model what the tool is not for, e.g. “This tool returns current temperature and weather condition for a given city, NOT air quality.” You have to spell out what it shouldn’t do as well.
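The difference between the vague and the tightened description can be made concrete. This sketch pairs them with a tiny lint that catches the vague kind before it reaches the training data. `lint_tool` is a crude, hypothetical heuristic, not a standard tool. So are its 40-character threshold and its check for a literal "NOT".

```python
from typing import Dict, List

# The vague version that caused the air-quality confusion
vague = {"name": "get_weather", "description": "Get weather.",
         "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}

# Tightened version: inputs, outputs, when to use it, and what it does NOT do
precise = {
    "name": "get_weather",
    "description": ("Input: a city name. Output: the current temperature and weather condition for that city. "
                    "Use it when the user asks about current weather. "
                    "This tool returns current temperature and weather condition for a given city, NOT air quality."),
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string", "description": "City name, e.g. Beijing"}},
                   "required": ["city"]},
}


def lint_tool(tool: Dict, min_desc_len: int = 40) -> List[str]:
    """Flag tool definitions that are likely too vague (arbitrary heuristic thresholds)."""
    problems = []
    desc = tool.get("description", "")
    if len(desc) < min_desc_len:
        problems.append("description too short")
    if "NOT" not in desc:
        problems.append("no explicit out-of-scope statement")
    for name, spec in tool["parameters"]["properties"].items():
        if not spec.get("description"):
            problems.append(f"parameter '{name}' has no description")
    return problems


print(lint_tool(vague))    # three problems flagged
print(lint_tool(precise))  # []
```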

Second pitfall: parameter type conflicts. Qwen was especially fragile with nested JSON. A field in properties would declare "type": "string", but the model would pass a number, and the whole output would fall apart. I ended up adding a hard JSON Schema validation step in post-processing: if a call failed validation, it was regenerated.
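Here is a minimal sketch of validate-then-retry. It assumes a hypothetical `generate` callable that returns the model's raw tool-call JSON. The checker is a hand-rolled subset of JSON Schema covering required keys and top-level types. It also rejects unknown keys, which plain JSON Schema only does with `additionalProperties: false`. In practice, a full validator such as the `jsonschema` package's `validate` function could replace `valid_arguments`.

```python
import json
from typing import Callable, Dict, Optional

# JSON Schema type names mapped to Python types (minimal subset)
TYPE_MAP = {"string": str, "number": (int, float), "integer": int,
            "boolean": bool, "object": dict, "array": list}


def valid_arguments(args: Dict, schema: Dict) -> bool:
    """Check required keys, unknown keys, and top-level property types."""
    props = schema.get("properties", {})
    if any(key not in args for key in schema.get("required", [])):
        return False
    for key, value in args.items():
        if key not in props:
            return False  # reject hallucinated parameters
        declared = props[key]["type"]
        # bool is a subclass of int in Python, so rule it out for non-boolean fields
        if isinstance(value, bool) and declared != "boolean":
            return False
        if not isinstance(value, TYPE_MAP[declared]):
            return False
    return True


def call_with_retry(generate: Callable[[], str], schema: Dict, max_tries: int = 3) -> Optional[Dict]:
    """Regenerate until the call parses and validates; return None after max_tries failures."""
    for _ in range(max_tries):
        try:
            call = json.loads(generate())
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and isinstance(call.get("arguments"), dict) \
                and valid_arguments(call["arguments"], schema):
            return call
    return None


schema = {"type": "object",
          "properties": {"x1": {"type": "number"}, "x2": {"type": "number"}},
          "required": ["x1", "x2"]}
outputs = iter(['{"name": "func_add", "arguments": {"x1": "125679", "x2": 234519}}',  # string where number expected
                '{"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}'])
print(call_with_retry(lambda: next(outputs), schema))
# {'name': 'func_add', 'arguments': {'x1': 125679, 'x2': 234519}}
```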

Third pitfall: in multi-turn dialogues, the tool-call history leaked like a bucket with a hole in it. If you didn’t stitch the tool call and the tool response back into the history, the model lost its memory on the next turn. It forgot what it had already called and started making random calls again. I eventually stuck strictly to the ChatML format, keeping a <|tool_call|> marker in the assistant’s reply on every turn. That finally stabilized things.
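The fix boils down to never dropping either half of a tool round when building the next turn's history. This sketch uses an OpenAI-style `messages` list. `record_tool_round` is a hypothetical helper name.

```python
import json
from typing import Dict, List


def record_tool_round(history: List[Dict], assistant_text: str, call: Dict, result: Dict) -> None:
    """Append the assistant's tool call AND the tool's result to the dialogue history.

    Dropping either message is the "leaky bucket": on the next turn the model
    no longer sees what it already called or what came back.
    """
    history.append({"role": "assistant", "content": assistant_text, "tool_calls": [call]})
    history.append({"role": "tool", "content": json.dumps(result, ensure_ascii=False)})


history = [{"role": "user", "content": "What is 125679 plus 234519?"}]
record_tool_round(history, "Let me calculate that.",
                  {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}},
                  {"ans": 360198})
history.append({"role": "assistant", "content": "The result is 360198."})
print([m["role"] for m in history])  # ['user', 'assistant', 'tool', 'assistant']
```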

---

Question Two: The text format for Function Call—more nuanced than you think

This one is interesting.

Believe it or not, this is exactly where most people trip up. They think all you have to do is stuff the tool schema into the system prompt: quick, crude, and foolproof.

Wrong.

Different models have their own “input dialects,” just as people from different regions speak Mandarin with different accents.

The format I used in my project was based on ChatML and looked something like this:


```text
<|im_start|>system
You have access to the following tools:
[{"type":"function","function":{"name":"func_add","description":"Compute the sum of two numbers","parameters":{"type":"object","properties":{"x1":{"type":"number","description":"The first number"},"x2":{"type":"number","description":"The second number"}},"required":["x1","x2"]}}}]
<|im_end|>
<|im_start|>user
Help me calculate: what is 125679 plus 234519?
<|im_end|>
<|im_start|>assistant
No problem, let me calculate that for you.
<|tool_call|>{"name":"func_add","arguments":{"x1":125679,"x2":234519}}<|tool_end|>
<|im_end|>
<|im_start|>tool
{"ans":360198}
<|im_end|>
<|im_start|>assistant
The result is 360198.
<|im_end|>
```
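To keep training data and inference prompts byte-for-byte consistent, it helps to generate this text from structured messages rather than writing it by hand. The same markers can then be used to parse tool calls back out. This sketch assumes `<|tool_call|>` and `<|tool_end|>` are plain marker strings, as in the example. Whether they are registered special tokens in your tokenizer is something to check separately. `render` and `parse_tool_calls` are hypothetical helper names.

```python
import json
import re
from typing import Dict, List


def render(messages: List[Dict]) -> str:
    """Serialize messages into the ChatML-style text format, tool calls included."""
    parts = []
    for m in messages:
        body = m["content"]
        for call in m.get("tool_calls", []):
            # The tool call goes right after the assistant's natural-language content
            body += "\n<|tool_call|>" + json.dumps(call, ensure_ascii=False, separators=(",", ":")) + "<|tool_end|>"
        parts.append(f"<|im_start|>{m['role']}\n{body}\n<|im_end|>")
    return "\n".join(parts)


TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|tool_end\|>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict]:
    """Extract every JSON tool call between <|tool_call|> and <|tool_end|>."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]


msgs = [
    {"role": "user", "content": "What is 125679 plus 234519?"},
    {"role": "assistant", "content": "No problem, let me calculate that for you.",
     "tool_calls": [{"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}]},
    {"role": "tool", "content": '{"ans":360198}'},
]
text = render(msgs)
print(parse_tool_calls(text))  # [{'name': 'func_add', 'arguments': {'x1': 125679, 'x2': 234519}}]
```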

Looks simple? There are a few details I stayed up several nights figuring out.

The tool call must come right after the assistant’s content. If the model says something first and then calls a tool, the format has to line up exactly. Many resources treat `toolc

Cael Lee
