LLM evaluation is only useful when it helps a team make decisions. This guide explains the core LLM evaluation metrics that matter in practice—accuracy, hallucination, latency, and cost—and shows how to estimate tradeoffs with simple inputs you can revisit as prompts, models, traffic, and pricing change. If you are building an internal assistant, support workflow, retrieval system, or production chat feature, this article gives you a durable framework for measuring quality without reducing the whole system to a single score.
Overview
Most teams start by asking a narrow question: “Which model is best?” In production, that is almost never the right question. A better question is: which combination of model, prompt, and system design gives acceptable quality at acceptable speed and acceptable cost for this use case?
That is why LLM evaluation should be treated as a multi-metric discipline. A model can be highly accurate on a benchmark and still be too slow for a customer-facing chat tool. It can be fast and cheap but produce unsupported claims. It can look impressive in demos and still fail on the long-tail cases that matter in real workflows.
The four metric families in this article form a practical baseline:
- Accuracy: Does the output correctly complete the task?
- Hallucination: Does the output introduce unsupported, fabricated, or misleading content?
- Latency: How long does the system take to respond?
- Cost: What does each request, session, or workflow run actually cost?
These are not the only useful model quality metrics. Depending on the application, you may also care about consistency, refusal quality, formatting compliance, tool-call success, retrieval relevance, safety, or user satisfaction. But accuracy, hallucination, latency, and cost are the core set because they force real product tradeoffs into view.
One important rule: do not evaluate the model in isolation if the shipped product includes prompts, retrieval, guardrails, post-processing, tools, and UI constraints. Evaluate the system the user will actually experience. That same principle appears throughout reliable prompt testing frameworks and broader prompt engineering best practices.
To keep evaluation useful, define each metric operationally:
- What exactly are you measuring?
- How will you score it?
- What threshold counts as acceptable?
- What business decision depends on the result?
If a metric cannot influence a shipping, routing, or budgeting decision, it may not be the right metric for your current stage.
How to estimate
You do not need a perfect measurement stack to begin. A simple evaluation worksheet is often enough to compare prompts, models, or architectures. The goal is repeatable estimation, not false precision.
1. Start with the task definition
Write one sentence that defines success in plain language. For example:
- “Summarize support tickets into accurate action items.”
- “Answer employee policy questions using approved documents.”
- “Extract structured fields from inbound email.”
This sounds basic, but many evaluation efforts fail because the task is vague. If success is unclear, accuracy scoring becomes arbitrary.
2. Build a representative test set
Create a dataset that reflects the inputs your system will actually receive. Include easy, typical, and difficult cases. If the application is retrieval-based, include questions where the right answer is present in context and cases where the context is missing or conflicting. If the application is extraction, include noisy formatting and edge cases.
For many teams, a small, carefully reviewed test set is more useful than a large but weakly labeled one. You can grow the dataset over time as failures appear in logs.
3. Score accuracy in task-specific terms
Accuracy should match the task. That might mean exact match, rubric-based grading, field-level extraction correctness, pass/fail against a checklist, or human judgment against reference answers. In other words, “accuracy” is not one universal formula.
Examples:
- Classification: percent correct labels
- Extraction: field-level precision and recall
- Summarization: human rubric for coverage, correctness, and usefulness
- Question answering: answer correctness relative to source material
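As one illustration of task-specific scoring, here is a minimal sketch of field-level precision and recall for a single extraction result. The function name, the field names, and the exact-match rule are assumptions made for this example; a real task may need value normalization or fuzzy matching.

```python
def field_precision_recall(predicted: dict, expected: dict) -> tuple[float, float]:
    """Field-level precision and recall for one extraction result.

    Assumption: a predicted field is correct only when the key exists in
    the reference and the value matches exactly.
    """
    correct = sum(
        1 for key, value in predicted.items()
        if key in expected and expected[key] == value
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical model output and human-labeled reference for one email.
predicted = {"sender": "ana@example.com", "order_id": "A-1001", "amount": "40.00"}
expected = {
    "sender": "ana@example.com",
    "order_id": "A-1007",
    "amount": "40.00",
    "currency": "USD",
}

precision, recall = field_precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted fields are right and two of the four reference fields are recovered. Precision and recall tell you different things: one wrong field versus one missing field.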
If you need stronger prompt control, it also helps to review the boundary between system, developer, and user instructions, especially in multi-turn tools. See System Prompt vs User Prompt vs Developer Prompt for a deeper design breakdown.
4. Track hallucination separately from accuracy
Many teams bury hallucination inside general quality scores, which makes debugging harder. A useful approach is to ask a distinct question: Did the system include any claim that was not supported by the input, context, tools, or approved knowledge source?
This is especially important in RAG systems, document question answering, and enterprise assistants. A response can be mostly correct but still unsafe if it adds one invented detail.
In practice, common hallucination metrics include:
- Rate of unsupported factual claims
- Percent of answers grounded in cited context
- Human-rated factual faithfulness
- Pass/fail for “no unsupported content introduced”
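A rough sketch of how the first and last of these metrics could be computed from reviewer labels follows. It assumes a reviewer has already counted the factual claims in each answer and flagged the unsupported ones; the `LabeledAnswer` structure and the sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabeledAnswer:
    total_claims: int        # factual claims a reviewer found in the answer
    unsupported_claims: int  # claims not backed by input, context, or tools


def hallucination_metrics(answers: list[LabeledAnswer]) -> dict[str, float]:
    """Claim-level rate plus answer-level pass rate, reported separately."""
    total = sum(a.total_claims for a in answers)
    unsupported = sum(a.unsupported_claims for a in answers)
    clean_answers = sum(1 for a in answers if a.unsupported_claims == 0)
    return {
        "unsupported_claim_rate": unsupported / total if total else 0.0,
        "groundedness_pass_rate": clean_answers / len(answers) if answers else 0.0,
    }


# Illustrative labels for four answers.
labels = [LabeledAnswer(5, 0), LabeledAnswer(4, 1), LabeledAnswer(6, 0), LabeledAnswer(5, 1)]
metrics = hallucination_metrics(labels)
print(metrics)
```

In this made-up sample only 10% of claims are unsupported, yet half of the answers fail the strict pass/fail check. That gap is why it can help to report both numbers.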
If you are choosing between retrieval-heavy and long-context approaches, hallucination should be measured across both architectures rather than assumed. See RAG vs Long Context for the architectural side of that decision.
5. Measure latency as users experience it
AI latency benchmarks only matter when they reflect the delivered product. Measure end-to-end latency, not just raw model response time. Include retrieval, prompt construction, tool calls, moderation, streaming setup, and post-processing if those happen before the answer reaches the user.
Useful latency measures include:
- Time to first token: how quickly the user sees a response begin
- Time to useful answer: how long until the response becomes actionable
- P50 latency: typical experience
- P95 or P99 latency: slow-tail experience that affects trust
For chat systems, tail latency often matters more than the average. A system that is usually fast but occasionally stalls can feel worse than a slightly slower but more stable one.
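To make the difference between typical and tail latency concrete, here is a small sketch that computes P50 and P95 from end-to-end timings. It uses the nearest-rank percentile definition, which is one of several conventions, and the sample latencies are invented for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds; one request stalled.
latencies_s = [0.8, 0.9, 1.0, 1.1, 1.1, 1.2, 1.3, 1.4, 1.6, 6.5]

p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
mean = sum(latencies_s) / len(latencies_s)
print(f"p50={p50}s p95={p95}s mean={mean:.2f}s")
```

The single stalled request barely moves the median but dominates P95, so the mean alone would hide it. With only ten samples the tail estimate is very coarse; a real measurement needs many more requests.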
6. Estimate cost at the workflow level
LLM cost evaluation should not stop at “price per million tokens.” Measure the full request path:
- Input tokens
- Output tokens
- Retries
- Fallback calls
- Retrieval or embedding operations
- Tool invocations
- Human review where required
A common mistake is to compare models on unit price alone. A cheaper model that needs longer prompts, more retries, or more verification may cost more per successful task than a stronger model with a higher nominal rate.
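The arithmetic behind that claim can be sketched as follows. The model is deliberately simple: it assumes every attempt costs the same and that failed tasks are abandoned rather than escalated. All token counts, prices, retry counts, and success rates are illustrative, not real pricing.

```python
def cost_per_successful_task(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    avg_attempts: float,
    success_rate: float,
) -> float:
    """Expected spend per task that ends in an acceptable result."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    per_call = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    # Retries multiply spend; failures shrink the number of useful results.
    return per_call * avg_attempts / success_rate


# Cheaper model: longer prompt, more retries, lower success rate.
cheap = cost_per_successful_task(3000, 400, 0.50, 1.50, avg_attempts=1.6, success_rate=0.70)
# Stronger model: higher unit price, shorter prompt, fewer retries.
strong = cost_per_successful_task(1200, 300, 1.00, 4.00, avg_attempts=1.05, success_rate=0.95)
print(f"cheap={cheap:.5f} strong={strong:.5f}")
```

With these made-up inputs the stronger model costs more per call but less per successful task. Different inputs can reverse the result, which is why the inputs should be measured on your own workload.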
7. Create a weighted decision score
Once the raw metrics are available, assign weights based on business needs. For a support bot, latency and hallucination may dominate. For offline document analysis, quality may matter far more than speed. For internal productivity tools, cost may matter only after basic reliability is achieved.
A simple formula is:
Decision Score = (Accuracy weight × normalized quality) + (Hallucination weight × groundedness) + (Latency weight × speed score) + (Cost weight × efficiency score)
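Each component should be normalized to a common scale, such as 0 to 1 with higher meaning better, so that the weights are comparable. The exact math matters less than the discipline of making tradeoffs explicit.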
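A minimal sketch of the weighted score follows. It assumes every component has already been converted to a 0 to 1 scale where higher is better. The weight sets, candidate names, and scores are hypothetical.

```python
def decision_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalized metric scores (0 to 1, higher is better)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[m] * scores[m] for m in scores) / total_weight


# Two hypothetical candidates, already normalized.
candidate_a = {"accuracy": 0.90, "groundedness": 0.95, "speed": 0.60, "efficiency": 0.50}
candidate_b = {"accuracy": 0.85, "groundedness": 0.80, "speed": 0.90, "efficiency": 0.90}

# Two illustrative weightings for different product priorities.
balanced = {"accuracy": 0.30, "groundedness": 0.40, "speed": 0.20, "efficiency": 0.10}
grounding_first = {"accuracy": 0.30, "groundedness": 0.50, "speed": 0.15, "efficiency": 0.05}

for name, weights in [("balanced", balanced), ("grounding_first", grounding_first)]:
    a = decision_score(candidate_a, weights)
    b = decision_score(candidate_b, weights)
    print(f"{name}: A={a:.3f} B={b:.3f}")
```

Under the first weighting candidate B scores higher; under the second, candidate A does. The ranking depends on the weights, so they are worth agreeing on before looking at results.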
Inputs and assumptions
A durable evaluation framework depends on transparent assumptions. If your inputs are hidden or unstable, your comparisons will drift.
Core inputs to document
- Use case: chat, extraction, summarization, search assistant, coding aid, workflow automation
- Traffic pattern: daily volume, peak concurrency, average session length
- Prompt design: system prompt length, examples, formatting constraints, tool instructions
- Context strategy: no retrieval, RAG, long context, hybrid routing
- Output expectations: short answer, structured JSON, report, citation-heavy response
- Error handling: retries, fallback models, human review thresholds
These assumptions materially affect all four metrics. For example, adding few-shot examples can improve accuracy but increase token cost and sometimes latency. Structured output constraints may improve downstream reliability but occasionally reduce flexibility. If you are experimenting with example-based prompting, compare designs systematically; Few-Shot Prompting vs Zero-Shot Prompting is useful background.
How to think about accuracy
Accuracy is easiest to measure when the task has a clearly right answer. It becomes harder for open-ended generation, where usefulness and correctness can diverge. To keep accuracy practical:
- Break complex tasks into sub-scores where possible
- Use rubrics instead of vague “good/bad” labels
- Separate formatting failures from factual failures
- Review a sample of automated scores manually
For example, a summarization rubric might score:
- Faithfulness to source
- Coverage of key points
- Clarity and concision
- Actionability for the intended reader
That gives you more diagnostic value than a single quality number.
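As a sketch of why sub-scores help, the snippet below averages each rubric dimension separately across a few reviewed summaries. The 1 to 5 scale, the dimension keys, and the review data are assumptions for illustration.

```python
from statistics import mean

# Rubric dimensions from the list above, scored 1 to 5 by a reviewer.
RUBRIC = ("faithfulness", "coverage", "clarity", "actionability")


def rubric_averages(reviews: list[dict[str, int]]) -> dict[str, float]:
    """Average each rubric dimension separately across reviewed summaries."""
    return {dim: mean(review[dim] for review in reviews) for dim in RUBRIC}


# Hypothetical reviewer scores for three summaries.
reviews = [
    {"faithfulness": 5, "coverage": 3, "clarity": 4, "actionability": 4},
    {"faithfulness": 4, "coverage": 2, "clarity": 5, "actionability": 4},
    {"faithfulness": 5, "coverage": 3, "clarity": 4, "actionability": 3},
]

averages = rubric_averages(reviews)
weakest = min(averages, key=averages.get)
print(averages, "weakest:", weakest)
```

A single blended score for these reviews would look acceptable, while the per-dimension view points at coverage as the thing to fix.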
How to think about hallucination
Hallucination is often discussed as if it were a single phenomenon. In practice, there are several forms:
- Fabricated facts: invented names, dates, rules, citations, or events
- Unsupported inference: plausible but unverified claims beyond the source
- Context contradiction: statements that conflict with provided information
- Tool misuse: claiming an action was completed when it was not
Your scoring method should reflect the risk profile of the application. In a creative writing tool, unsupported elaboration may be acceptable. In a compliance assistant, it is not.
How to think about latency
Latency is not just a backend metric. It is a user trust metric. Two systems with similar average response times may produce very different user experiences depending on streaming behavior, consistency, and task flow.
Consider measuring:
- Interactive latency for chat and copilots
- Batch throughput for offline jobs
- Slow-tail latency (P95 or P99) during peak traffic windows
- Latency impact of retries and fallback routing
If your system uses multiple tools or chained prompts, measure each stage and the full path. This helps identify whether the bottleneck is the model, retrieval, formatting, or orchestration layer. Articles on AI developer tools and prompt engineering techniques often become most useful when tied back to these concrete measurements.
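One lightweight way to time each stage and the full path is a context manager around each step, sketched below. The stage names are hypothetical and the `time.sleep` calls are stand-ins for real retrieval, model, and post-processing work.

```python
import time
from contextlib import contextmanager

stage_seconds: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent inside the block under a stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_seconds[stage] = stage_seconds.get(stage, 0.0) + elapsed


with timed("total"):
    with timed("retrieval"):
        time.sleep(0.01)    # stand-in for a retrieval call
    with timed("model"):
        time.sleep(0.05)    # stand-in for the model call
    with timed("post_processing"):
        time.sleep(0.005)   # stand-in for formatting and validation

bottleneck = max((s for s in stage_seconds if s != "total"), key=stage_seconds.get)
print(stage_seconds, "bottleneck:", bottleneck)
```

Recording the total alongside the stages also exposes orchestration overhead: any gap between the total and the sum of the stages is time spent outside the measured steps.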
How to think about cost
For LLM cost evaluation, calculate both cost per request and cost per successful outcome. The second number is usually more decision-relevant.
A practical template:
- Base model cost: input + output usage
- Prompt overhead: long instructions, examples, schema
- Retrieval cost: search, reranking, embeddings where applicable
- Failure cost: retries, escalations, human review
- Volume multiplier: expected monthly or peak traffic
Then ask: if quality improves by a small margin, does it reduce downstream support burden, review time, or operational risk enough to justify the spend? Cost is not just an infrastructure line item. It is part of total workflow economics.
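The template above can be turned into a small worksheet in code. This sketch treats each line item as an average cost per request in a single currency; the `RequestCost` name, the amounts, and the monthly volume are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class RequestCost:
    """Average cost per request, split by the template's line items."""
    base_model: float       # input + output usage
    prompt_overhead: float  # long instructions, examples, schema
    retrieval: float        # search, reranking, embeddings
    failure: float          # expected retries, escalations, human review

    def per_request(self) -> float:
        return self.base_model + self.prompt_overhead + self.retrieval + self.failure

    def monthly(self, requests_per_month: int) -> float:
        # Volume multiplier: expected monthly traffic.
        return self.per_request() * requests_per_month


estimate = RequestCost(base_model=0.0020, prompt_overhead=0.0008, retrieval=0.0002, failure=0.0010)
monthly_total = estimate.monthly(250_000)
print(f"per request: {estimate.per_request():.4f}, monthly: {monthly_total:.2f}")
```

Keeping the line items separate makes it easier to see which one a proposed change would move, for example whether a shorter prompt or fewer retries would save more.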
Worked examples
The best way to understand evaluation tradeoffs is to walk through a few simple scenarios. The numbers below are illustrative structures, not current market prices or benchmark claims.
Example 1: Internal policy assistant
Goal: answer employee questions using approved internal documents.
Priority order: hallucination, accuracy, latency, cost.
Evaluation setup:
- 100 representative employee questions
- Known source documents
- Pass/fail scoring for answer correctness
- Separate groundedness check for unsupported claims
- End-to-end latency measured from user submit to first useful answer
What to compare:
- Model A with shorter context window and retrieval
- Model B with larger context and fewer retrieval steps
- Prompt version 1 with strict citation rules
- Prompt version 2 with more conversational freedom
Likely insight: the strict citation prompt may reduce hallucinations but slightly increase answer length and token cost. A retrieval-based setup may lower unsupported claims if the source pipeline is strong, but it could add latency. The right choice depends on whether your environment values confidence and verifiability over speed.
Example 2: Support ticket summarization
Goal: summarize long support threads into a handoff note.
Priority order: accuracy, latency, cost, hallucination.
Evaluation setup:
- 50 real ticket threads with human-written reference notes
- Rubric for issue summary, status, next action, and tone neutrality
- Latency tracked for single-thread and peak batch runs
- Cost estimated per summary and per monthly support volume
What to compare:
- Short prompt without examples
- Few-shot prompt with high-quality examples
- Light model with lower unit cost
- Stronger model with better structured output adherence
Likely insight: few-shot prompting may improve consistency enough to reduce manual cleanup, making a slightly higher token cost worthwhile. This is a good reminder that prompt design and model choice should be tested together, not separately.
Example 3: Customer-facing chat workflow
Goal: answer common account questions and escalate edge cases gracefully.
Priority order: latency, hallucination, accuracy, cost.
Evaluation setup:
- Common intents plus adversarial or ambiguous prompts
- Measure time to first token and time to resolution
- Track escalation correctness and refusal quality
- Estimate average session cost, not just single-turn cost
What to compare:
- One-model architecture
- Router that sends simple tasks to a cheaper model and harder ones to a stronger model
- Prompt variant with explicit escalation rules
Likely insight: a routing approach can reduce average cost while preserving quality on hard cases, but only if the router itself is reliable. In support experiences, the best result is often not the answer with the highest raw model score but the workflow that fails safely. For that kind of design, operational guidance like Empathetic Automation can complement technical evaluation.
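The routing tradeoff can be estimated with a back-of-the-envelope model like the one below. It rests on strong simplifying assumptions: easy requests always go to the cheap model, the router's only mistake is missing a hard request, and the router itself costs nothing. Every number is invented for illustration.

```python
def routed_outcome(
    hard_share: float,
    hard_recall: float,
    cost: dict[str, float],
    accuracy: dict[tuple[str, str], float],
) -> tuple[float, float]:
    """Expected (cost per request, accuracy) for a two-model router.

    hard_share: fraction of traffic that is genuinely hard.
    hard_recall: fraction of hard requests the router sends to the strong model.
    """
    to_strong = hard_share * hard_recall
    missed_hard = hard_share * (1 - hard_recall)
    easy = 1 - hard_share
    avg_cost = (easy + missed_hard) * cost["cheap"] + to_strong * cost["strong"]
    avg_accuracy = (
        easy * accuracy[("cheap", "easy")]
        + missed_hard * accuracy[("cheap", "hard")]
        + to_strong * accuracy[("strong", "hard")]
    )
    return avg_cost, avg_accuracy


cost = {"cheap": 0.001, "strong": 0.008}
accuracy = {("cheap", "easy"): 0.95, ("cheap", "hard"): 0.50, ("strong", "hard"): 0.90}

reliable = routed_outcome(0.3, 0.9, cost, accuracy)    # router catches 90% of hard cases
unreliable = routed_outcome(0.3, 0.5, cost, accuracy)  # router catches only 50%
print("reliable:", reliable, "unreliable:", unreliable)
```

Both routers are far cheaper than sending everything to the strong model, but the unreliable one buys its extra savings by sending hard cases to the weaker model. This is why the router itself needs to be measured.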
A simple scorecard template
For each model or prompt candidate, capture:
- Accuracy: percentage or rubric average
- Hallucination rate: unsupported-claim rate or groundedness pass rate
- Latency: P50 and P95 end-to-end
- Cost: per request, per session, and per successful task
- Notes: common failure modes and reviewer observations
That compact scorecard is often enough to make better decisions than a long benchmark spreadsheet detached from your real workload.
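If a spreadsheet feels too loose, the same scorecard can be kept as a small typed record. This is only one possible shape; the field names, the candidate label, and the values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Scorecard:
    candidate: str                   # model + prompt variant being evaluated
    accuracy: float                  # percentage or rubric average, as 0 to 1
    groundedness_pass_rate: float    # share of answers with no unsupported claims
    p50_latency_s: float             # end-to-end
    p95_latency_s: float             # end-to-end
    cost_per_request: float
    cost_per_session: float
    cost_per_successful_task: float
    notes: list[str] = field(default_factory=list)

    def as_row(self) -> str:
        """One-line summary for side-by-side comparison."""
        return (
            f"{self.candidate} | acc {self.accuracy:.0%} "
            f"| grounded {self.groundedness_pass_rate:.0%} "
            f"| p50 {self.p50_latency_s:.1f}s | p95 {self.p95_latency_s:.1f}s "
            f"| {self.cost_per_successful_task:.4f}/task"
        )


card = Scorecard(
    candidate="prompt-v2 + model-a",
    accuracy=0.91,
    groundedness_pass_rate=0.97,
    p50_latency_s=1.2,
    p95_latency_s=3.8,
    cost_per_request=0.0025,
    cost_per_session=0.0110,
    cost_per_successful_task=0.0031,
    notes=["misses multi-part questions", "occasionally omits citations"],
)
row = card.as_row()
print(row)
```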
When to recalculate
Evaluation should be revisited whenever a material input changes. This is what makes the topic evergreen: the framework stays stable, but the underlying numbers and tradeoffs move.
Recalculate your baseline when:
- Model pricing changes or usage terms shift
- Prompt design changes, especially if you add examples, schemas, or tool instructions
- Traffic patterns change, including peak concurrency and session length
- Architecture changes, such as moving from direct prompting to RAG or adding fallback routing
- Quality expectations change, for example when a prototype becomes customer-facing
- New failure modes appear in production logs or human review queues
- Newer models improve enough on benchmarks relevant to your task that an older model choice deserves a fresh look
A practical review cadence is:
- Before launch: establish a baseline on a representative test set
- After major prompt or architecture changes: rerun targeted evaluations
- On a regular schedule: monthly or quarterly, depending on usage and risk
- After incidents: add failed cases to the evaluation set immediately
To make this sustainable, keep an evaluation changelog with the exact prompt version, model version, test set version, and scoring rubric used for each run. Without versioning, comparisons become noisy and hard to trust.
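A changelog entry can be as simple as one JSON line per run. The sketch below assumes that format, and every version label, date, and result in it is a hypothetical placeholder. The helper only treats two runs as comparable when they share a test set and rubric.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EvalRun:
    run_date: str
    prompt_version: str
    model_version: str
    test_set_version: str
    rubric_version: str
    results: dict


def to_changelog_line(run: EvalRun) -> str:
    """Serialize one run as a single JSON line for an append-only log."""
    return json.dumps(asdict(run), sort_keys=True)


def comparable(a: EvalRun, b: EvalRun) -> bool:
    """Scores are only comparable when test set and rubric are unchanged."""
    return (
        a.test_set_version == b.test_set_version
        and a.rubric_version == b.rubric_version
    )


# Placeholder values for illustration only.
run_a = EvalRun("2024-01-15", "prompt-v3", "model-a-2024-01", "tickets-v2", "rubric-v1", {"accuracy": 0.88})
run_b = EvalRun("2024-02-12", "prompt-v4", "model-a-2024-01", "tickets-v2", "rubric-v1", {"accuracy": 0.91})
run_c = EvalRun("2024-03-04", "prompt-v4", "model-a-2024-01", "tickets-v3", "rubric-v1", {"accuracy": 0.86})

line = to_changelog_line(run_b)
print(line)
print(comparable(run_a, run_b), comparable(run_b, run_c))
```

In this example the drop between the second and third runs cannot be attributed to the system, because the test set changed at the same time.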
If you want a practical next step, do this:
- Pick one production or near-production LLM workflow
- Define success in one sentence
- Create 25 to 50 representative test cases
- Score accuracy and hallucination separately
- Measure P50 and P95 latency end-to-end
- Estimate cost per request and per successful task
- Compare at least two prompt or model variants
- Write down the decision rule before you run the test
That process is simple enough to repeat and strong enough to support real product decisions. For teams building more mature evaluation habits, pairing this framework with a prompt testing framework and stronger system prompt best practices will make results more reliable over time.
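The last step in that list, writing down the decision rule before running the test, can be made concrete by encoding the thresholds as data. The sketch below is one way to do that; the threshold values and variant results are illustrative assumptions, not recommendations.

```python
# Written down before the evaluation runs. Each entry is (kind, limit):
# "min" means the value must be at least the limit, "max" at most.
DECISION_RULE = {
    "accuracy": ("min", 0.85),
    "unsupported_claim_rate": ("max", 0.02),
    "p95_latency_s": ("max", 4.0),
    "cost_per_successful_task": ("max", 0.01),
}


def failed_checks(result: dict[str, float]) -> list[str]:
    """Return a description of every threshold the result violates."""
    failures = []
    for metric, (kind, limit) in DECISION_RULE.items():
        value = result[metric]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(f"{metric}={value} violates {kind} {limit}")
    return failures


# Hypothetical results for two variants.
variant_a = {"accuracy": 0.90, "unsupported_claim_rate": 0.01,
             "p95_latency_s": 3.2, "cost_per_successful_task": 0.004}
variant_b = {"accuracy": 0.93, "unsupported_claim_rate": 0.05,
             "p95_latency_s": 2.1, "cost_per_successful_task": 0.003}

print("A:", failed_checks(variant_a))
print("B:", failed_checks(variant_b))
```

Variant B is more accurate, faster, and cheaper here, yet it fails the rule on unsupported claims. Fixing the rule in advance keeps an attractive headline number from overriding a constraint the team already agreed on.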
The main takeaway is straightforward: good LLM evaluation does not chase one perfect metric. It balances quality, groundedness, speed, and economics in a way that matches the job the system is meant to do. When those inputs change—and they will—your evaluation should change with them.