Zero-Shot vs Few-Shot vs Chain of Thought
Three prompting techniques, three use cases, and the decision rule for picking between them. With real examples you can run in the playground.
TL;DR
- Zero-shot is the default: you just ask. Works for tasks the model has seen thousands of times (basic classification, translation, summarization). Fails for custom formats or specific labels.
- Few-shot means showing 2-5 input→output examples before the real question. Best for pinning down an exact output format the model hasn't been trained on.
- Chain of thought (CoT) means asking the model to "think step by step" before answering. Best for multi-step math, logic puzzles, and any task where the answer needs intermediate reasoning.
- The techniques combine. Few-shot + CoT is the standard pattern for complex reasoning with a specific output format. Don't treat them as mutually exclusive.
- Cost-wise, zero-shot is cheapest (fewer input tokens), few-shot is the most input-token-heavy, and CoT is the most output-token-heavy because reasoning traces are long.
You typed `Classify the sentiment of this text: "I hated this movie."` into a Gemini prompt. The model returned `Negative`. Perfect. Then you tried `Classify: "The queue was 40 minutes but the dosa was worth it."` The model returned three paragraphs about mixed emotional valence and nuance.
You wanted one word. You got an essay.
That’s the gap where prompting techniques matter. Zero-shot, few-shot, and chain of thought are the three techniques you’ll reach for most often when a direct prompt doesn’t give you what you need. They solve different problems, they cost different amounts, and they combine in useful ways.
This post walks through each technique with examples you can run in the TinkerLLM playground, a decision table for picking between them, and the token-cost tradeoffs that matter at scale.
Zero-Shot: The Default You Already Use
Zero-shot prompting is the name for what you’ve been doing since your first ChatGPT conversation. You ask the model to do something. No examples. No demonstrations. Just the task.
Classify the sentiment of this text: "I hated this movie."
The model returns Negative. That’s zero-shot. Exercise 8-1 in TinkerLLM Lesson 8 is exactly this prompt.
It works because modern models have seen sentiment classification tens of thousands of times in their training data. The task is familiar. The model knows what “classify sentiment” means without you explaining. Same for translation (exercise 8-2: `Translate to Hindi: "Where is the library?"`), basic summarization, and most common writing tasks.
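Outside the playground, a zero-shot call is a single request with no scaffolding. A minimal sketch using the google-genai Python SDK (the API key and model name are placeholders; any instruction-tuned model behaves the same way):

```python
from google import genai

# A zero-shot call: just the task, no examples, no reasoning scaffold.
client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder model name
    contents='Classify the sentiment of this text: "I hated this movie."',
)
print(response.text)  # typically: Negative
```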
When zero-shot works well:
- Tasks the model has seen at scale: translation, summarization, sentiment, paraphrasing
- Open-ended writing where you don’t care about exact format
- Conversational Q&A where natural phrasing is fine
- Code generation for common languages and patterns
When zero-shot fails:
- You want a specific output format the model hasn’t seen (e.g., `Name: X | Role: Y | Confidence: 0.8`)
- Your labels are non-standard (`Spicy`, `Mild`, `No-spice` instead of `Positive`, `Negative`)
- The task involves multi-step reasoning where the answer is more than lookup
- Edge cases you haven’t anticipated (the mixed-sentiment dosa review is a classic example)
The failure mode is usually one of two things. The model ignores your format request and returns prose. Or it invents a format of its own, close to what you asked for but not quite. Both are solved by showing examples.
Try it yourself: Open the TinkerLLM playground, run exercise 8-1 as written, and get the clean single-word answer. Then change the text to "The queue was 40 minutes but the dosa was worth it." and run it again. Note how the response changes. That’s zero-shot hitting an edge case. The fix is in the next section.
Few-Shot: Teaching by Example
Few-shot prompting means showing the model a few examples of input-output pairs before asking it the real question. The model picks up on the pattern and follows it.
Text: "I loved it." -> Sentiment: Positive
Text: "It was awful." -> Sentiment: Negative
Text: "Best day ever." -> Sentiment: Positive
Text: "I am extremely angry." -> Sentiment:
That’s exercise 9-1. The model completes the pattern with `Negative`. It also honors the exact `Sentiment:` format, because you just showed it three times what that format looks like.
Few-shot works because of how transformer models pattern-match. You’re not teaching it sentiment analysis (it already knew). You’re teaching it the output shape, the label vocabulary, the separator style, and any edge-case handling you demonstrated. All in three lines.
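In code, few-shot is nothing more than string assembly: the examples live next to the prompt as data. A sketch, with the example list and separator style purely illustrative:

```python
# Few-shot examples double as the output-format specification:
# same label vocabulary, same 'Text: ... -> Sentiment: ...' shape.
EXAMPLES = [
    ('"I loved it."', "Positive"),
    ('"It was awful."', "Negative"),
    ('"Best day ever."', "Positive"),
]

def few_shot_prompt(text: str) -> str:
    lines = [f"Text: {t} -> Sentiment: {label}" for t, label in EXAMPLES]
    lines.append(f'Text: "{text}" -> Sentiment:')
    return "\n".join(lines)

print(few_shot_prompt("I am extremely angry."))
```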
The two-example minimum: One example is ambiguous. Two or three establish a pattern; that’s the sweet spot for most tasks. Five-plus is usually overkill, and diminishing returns set in fast.
Useful in practice for:
- Custom output formats. JSON structures, CSV rows, labeled key-value pairs, your internal taxonomy.
- Style transfer. Exercise 9-2 converts `Robert → Rob, William → Will, Jonathan → ?` and the model returns `Johnny` or `John`. Zero-shot wouldn’t know you wanted the short form.
- Emoji or symbol maps. Exercise 9-3 converts emotions to single emojis. The model adapts to whichever emoji style your examples used.
- Domain-specific labels. Classifying support tickets by your company’s own category names. The model has never seen your taxonomy; two or three examples show it what they look like.
The trap: your examples set the pattern, for better or worse. If your examples have a subtle bias or error, the model learns that too. I’ve seen teams ship few-shot prompts with typos in the example outputs and discover the model was dutifully reproducing the typo in production.
Before using few-shot in production: read your examples out loud. Run them past a teammate. The examples are the specification, and the spec ships with every call.
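A cheap guardrail is to lint the examples like config before they ship. A hypothetical sketch; `ALLOWED_LABELS` and the example set are made up for illustration:

```python
# Hypothetical lint pass: every few-shot example must use an allowed label,
# because the model will faithfully reproduce whatever the examples contain.
ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}

EXAMPLES = [
    ('"I loved it."', "Positive"),
    ('"It was awful."', "Negative"),
    ('"Meh."', "Netural"),  # typo: exactly the kind of bug this catches
]

for text, label in EXAMPLES:
    assert label in ALLOWED_LABELS, f"bad label {label!r} in example: {text}"
```

Run it and the third example trips the assert. Without the check, that typo ships in every single call.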
Try it yourself: In the TinkerLLM playground, run exercise 9-1 as written. Then add one more example before the final query, but change the style: Text: "Meh" -> Sentiment: Neutral. Submit again. Watch the model now consider Neutral as a valid output for ambiguous cases. You changed what categories exist just by adding one line.
Chain of Thought: Asking the Model to Think
Chain of thought (CoT) prompting means adding something like `Think step by step` or `Let's reason through this` to the prompt. Instead of giving you the answer directly, the model writes out its reasoning first, then the answer.
The canonical result is from Wei et al.’s 2022 paper, which named the technique. With a direct prompt, many models get simple multi-step arithmetic wrong. When the prompt elicits the intermediate reasoning first, accuracy jumps significantly on the same model.
TinkerLLM Lesson 10 demonstrates this directly. Exercise 10-1 (The Logic Trap) uses the system instruction `You do not think. You only give the answer directly.` and asks `I have 3 apples. I eat 2. I buy 4 more. How many do I have?` Small models sometimes fail because they guess before reasoning.
Exercise 10-2 is the same question with `Think step by step` appended. The model writes out:
First I had 3 apples.
I ate 2, so I had 3 - 2 = 1.
I bought 4 more, so I had 1 + 4 = 5.
Answer: 5.
Reasoning visible. Answer reliable.
Why does this work? LLMs generate one token at a time without planning the full sentence in advance. When you force them to write the reasoning before the answer, each intermediate step becomes part of the context for the next step. The reasoning tokens act as a scratchpad. The final answer is conditioned on the scratchpad, not on a single-token guess.
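The mechanical difference between the direct prompt and the CoT prompt is one appended line. A sketch of the comparison, again with the google-genai SDK (key and model name are placeholders):

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
QUESTION = "I have 3 apples. I eat 2. I buy 4 more. How many do I have?"

# Direct: the answer is close to a single-token guess.
direct = client.models.generate_content(
    model="gemini-2.0-flash", contents=QUESTION
)

# CoT: the reasoning tokens become context that the final answer
# is conditioned on -- the scratchpad effect described above.
cot = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=QUESTION + "\nThink step by step, then give the answer.",
)
print(direct.text, cot.text, sep="\n---\n")
```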
Use CoT for:
- Multi-step arithmetic and math word problems
- Logic puzzles and constraint satisfaction
- Ranking and comparison tasks where the reasoning matters
- Any decision where you want to see why the model picked something
Don’t bother with CoT for:
- Single-step tasks (translation, classification, extraction). It just makes the output longer without changing the answer.
- Creative writing where you want the output, not a reasoning trace about the output.
- Tasks where the model is already reliable at the direct version. The extra tokens are wasted.
Thinking Budget: CoT Built In
Modern reasoning models like Gemini 2.5 Pro and Gemini 2.0 Flash Thinking do this internally. You don’t need to write `Think step by step`. The model runs its reasoning in a “thinking budget” before producing the visible answer.
In the Gemini API this is exposed as `thinkingConfig.thinkingBudget`, a token allowance for internal reasoning. The user never sees those tokens. They don’t appear in the output. But you pay for them, and they cause the accuracy jump on hard tasks.
For anything where you’d otherwise write `Think step by step`, use a thinking model instead. It’s cleaner, and the reasoning doesn’t pollute your output. For models that don’t have a thinking mode, the explicit CoT prompt is still the right move.
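In the google-genai Python SDK, the budget is one config field per request, if my reading of the current SDK is right. The model name and budget value below are placeholders:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder: a model with a thinking mode
    contents="A bat and a ball cost $1.10 total. The bat costs $1.00 more "
             "than the ball. How much does the ball cost?",
    config=types.GenerateContentConfig(
        # Token allowance for internal reasoning: billed, but never shown.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```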
How to Choose
The decision usually comes down to two questions: do I need a specific output format, and does the task require reasoning?
| Your situation | Use |
|---|---|
| Common task, any reasonable output format is fine | Zero-shot |
| Common task, but I need a specific output format | Few-shot |
| Custom labels or taxonomy the model has never seen | Few-shot |
| Multi-step arithmetic or logic | CoT (or a thinking model) |
| Custom format + multi-step reasoning | Few-shot + CoT combined |
| Code generation with a specific pattern | Few-shot |
| Long-form writing, open-ended | Zero-shot |
| Summarization with a strict word count | Zero-shot, validated with output length check |
Two questions, and you’ve picked the technique. In practice I run this decision mentally in maybe two seconds. It’s not a deep deliberation.
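The table collapses into a tiny decision function. A sketch that encodes the two questions (the return strings are this post’s vocabulary, nothing standard):

```python
def pick_technique(needs_exact_format: bool, needs_reasoning: bool) -> str:
    """The two-question decision rule from the table above."""
    if needs_exact_format and needs_reasoning:
        return "few-shot + CoT"
    if needs_exact_format:
        return "few-shot"
    if needs_reasoning:
        return "CoT (or a thinking model)"
    return "zero-shot"

print(pick_technique(needs_exact_format=True, needs_reasoning=False))  # few-shot
```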
The Combinations
These techniques stack. The most common production pattern is few-shot plus CoT plus a system instruction, in that order.
Few-shot + CoT example:
System: You are a math tutor. Always show reasoning.
Q: If a train travels 60 km/h for 2.5 hours, how far does it go?
A: Reasoning: distance = speed × time. 60 × 2.5 = 150. Answer: 150 km.
Q: If a tap fills a tank in 20 minutes, how much does it fill in 7 minutes?
A: Reasoning: rate = 1/20 per minute. In 7 minutes: 7/20 = 0.35. Answer: 35% of the tank.
Q: If a book has 240 pages and you read 30 pages a day, how many days?
A:
The few-shot examples establish both the reasoning format and the answer format. The system instruction reinforces the expectation. The model completes the pattern: a reasoning line, then an answer line.
This is the standard pattern for any production LLM feature that needs structured reasoning. Write the examples once, test them on edge cases, and reuse them.
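In production, that prompt is usually assembled from data rather than pasted as a literal. A sketch of one way to build it; the helper name is made up:

```python
# Few-shot + CoT: the examples carry both the reasoning format and the
# answer format, so every new question inherits the whole structure.
SYSTEM = "You are a math tutor. Always show reasoning."  # goes in the API's system-instruction field

SHOTS = [
    ("If a train travels 60 km/h for 2.5 hours, how far does it go?",
     "Reasoning: distance = speed × time. 60 × 2.5 = 150. Answer: 150 km."),
    ("If a tap fills a tank in 20 minutes, how much does it fill in 7 minutes?",
     "Reasoning: rate = 1/20 per minute. In 7 minutes: 7/20 = 0.35. "
     "Answer: 35% of the tank."),
]

def build_prompt(question: str) -> str:
    parts = [f"Q: {q}\nA: {a}" for q, a in SHOTS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_prompt("If a book has 240 pages and you read 30 pages a day, "
                   "how many days?"))
```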
Token Cost, Honestly
Each technique has a different cost profile. At scale this matters.
Zero-shot is the cheapest. One prompt, one answer. Minimal input tokens, no padding.
Few-shot adds input tokens for every call. Three examples of 50 tokens each is 150 extra input tokens on every single request. At 10,000 calls per day, that’s 1.5 million extra input tokens daily, for the same answer you’d get from zero-shot if zero-shot worked.
Chain of thought is output-token-heavy. A direct answer is 10-20 tokens. A reasoning trace plus answer is often 200-500 tokens. Output tokens cost more than input tokens on most pricing tiers, so CoT adds up fast.
Thinking models bill separately for thinking tokens. You don’t see them, but you pay for them. The Gemini pricing page shows the thinking-mode rates for Pro and Flash.
At meaningful scale, one of the cheapest optimizations is checking whether you’ve over-engineered the prompt. Is a few-shot prompt earning its extra input tokens? Is CoT earning its longer output? Run your prompt with the technique and without, on a test set of 50 examples, and compare accuracy. If the accuracy difference is small, use the cheaper version. More on the token math in Tokens Explained.
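A minimal harness for that comparison, assuming you already have a small labeled test set and a `call_model` function wrapping whichever SDK you use (both are placeholders here):

```python
# Hypothetical A/B check: same labeled test set, prompt with and without
# the technique, compare accuracy before paying for the extra tokens.
TEST_SET = [
    ("I hated this movie.", "Negative"),
    ("Best day ever.", "Positive"),
    # ... ~50 labeled examples in practice
]

def accuracy(make_prompt, call_model) -> float:
    hits = sum(
        call_model(make_prompt(text)).strip() == label
        for text, label in TEST_SET
    )
    return hits / len(TEST_SET)

# If accuracy(few_shot_prompt, call_model) barely beats
# accuracy(zero_shot_prompt, call_model), ship the cheaper prompt.
```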
Temperature and Prompting Techniques
Each technique has a temperature sweet spot.
| Technique | Recommended temperature | Why |
|---|---|---|
| Zero-shot classification | 0.0-0.2 | Deterministic labels |
| Zero-shot creative writing | 0.7-1.0 | Variety and natural tone |
| Few-shot structured output | 0.0-0.3 | The pattern is strict |
| Few-shot style transfer | 0.3-0.7 | Some creative latitude |
| Chain of thought | 0.0-0.3 | Reasoning should be consistent |
Higher temperatures are rarely useful with few-shot or CoT. The pattern you showed the model is the point. You want it followed, not reinvented. More on this in What Temperature Actually Does in LLMs.
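Temperature is a single field in the generation config. A sketch with the google-genai SDK, using the strict few-shot setting from the table (key and model name are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Strict few-shot structured output: temperature near zero so the model
# follows the demonstrated pattern instead of improvising.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=(
        'Text: "I loved it." -> Sentiment: Positive\n'
        'Text: "It was awful." -> Sentiment: Negative\n'
        'Text: "I am extremely angry." -> Sentiment:'
    ),
    config=types.GenerateContentConfig(temperature=0.1),
)
print(response.text)
```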
Try It Yourself
TinkerLLM Lessons 8, 9, and 10 cover these three techniques with hands-on exercises. About 25 minutes to run all of them.
Try it yourself: In the TinkerLLM playground, do this three-exercise sequence back to back:
- Run exercise 8-1 (zero-shot sentiment). Observe the direct answer.
- Run exercise 9-1 (few-shot sentiment with a specific format). Observe how the format is pinned down by the examples.
- Run exercise 10-1 (direct answer, no CoT). Then run 10-2 (same question plus `Think step by step`). Compare the two outputs on the same question.
Ten minutes. You’ll have run all three techniques, observed where each one wins, and have a working mental model you can apply to any new task.
FAQ
What’s the actual difference between few-shot and fine-tuning?
Few-shot happens in the prompt, at inference time, with no training involved. You send the examples as part of the input on every call. Fine-tuning changes the model’s weights by running additional training on example data. Fine-tuning is more expensive upfront but cheaper per call (no example tokens). Few-shot is flexible (change examples instantly) and cheap to start. Rule of thumb: if you’re using the same few-shot prompt thousands of times per day with the same examples, fine-tuning often pays off. Below that threshold, few-shot is usually the right call.
How many examples should I use in few-shot?
Usually two to five. One example is ambiguous and the model can’t tell if it’s a pattern or a one-off. Two or three establishes the pattern. Beyond five you’re mostly just paying extra input tokens for marginal accuracy gains. The Anthropic prompting guide recommends three to five as the default. Start at three and add more only if the model is still missing the pattern.
Does chain of thought work on any model?
It works on most modern models, but the size of the improvement varies. Larger models (Gemini Pro, GPT-4o, Claude 3.5) show bigger CoT gains than smaller ones. Very small models (under a billion parameters) sometimes don’t benefit from CoT at all because they lack the reasoning capacity to use the scratchpad effectively. The original Wei et al. paper found CoT gains only emerged at large scale, on the order of 100B parameters, and that threshold has since shifted down as models improved.
Is “Let’s think step by step” better than “Think step by step”?
They’re roughly equivalent. The original “Let’s think step by step” phrasing comes from the Kojima et al. 2022 paper that followed Wei et al.’s work. In practice, most modern models respond similarly to either phrasing, as well as variations like Reason through this or Work it out. The effect is from the pattern the model learned during training, not the specific wording. Any clear instruction to reason before answering tends to work.
Can I combine few-shot and chain of thought in the same prompt?
Yes, and it’s the strongest combination for complex tasks. Write your few-shot examples with the reasoning already included. The model will follow the same structure: reasoning, then answer. This is the pattern behind most high-accuracy prompts you’ll find in research papers and production systems. The main cost is longer input (reasoning makes examples bigger) and longer output (the model writes more per call).
When should I stop using these techniques and fine-tune instead?
Three signals. First, you’re running the same prompt thousands of times daily and the example tokens are a meaningful cost. Second, your few-shot accuracy has plateaued and adding more examples doesn’t help. Third, your task is narrow enough that you have hundreds of labeled examples to train on. If all three are true, a fine-tune on Gemini or GPT will usually outperform the best few-shot prompt and cost less at scale.
Does system instruction count as zero-shot or few-shot?
System instruction is a separate layer. A system instruction with no examples is still zero-shot. A system instruction with a few examples inside it is few-shot. A system instruction that says Think step by step before responding is CoT. The techniques describe what’s in the prompt context (including system instructions); they’re not mutually exclusive with the system layer. More on system instructions in The God Mode of LLMs.
Why do smaller models benefit more from few-shot and less from CoT?
Few-shot gives smaller models specific pattern-matching handholds, which they use well because pattern-matching is what they’re best at. CoT requires the model to reason in a structured way through multiple steps, which needs more capacity. Smaller models sometimes generate reasoning that sounds plausible but reaches the wrong conclusion anyway. If you’re stuck on a smaller model (Flash Lite, local models), lean harder on few-shot and use CoT cautiously. On larger models (Pro, Opus, GPT-4), CoT is more consistently useful and few-shot’s marginal value drops because the base zero-shot is already strong.