Traditional application monitoring tracks request latency, error rates, and throughput. LLM applications need all of that plus: token consumption per request (directly impacts cost), model response quality (did the LLM answer correctly?), hallucination detection (did the LLM make things up?), retrieval quality for RAG systems (did we find the right documents?), prompt performance (which prompts produce better results?), and safety monitoring (did the LLM generate harmful or off-policy content?).
Without LLM-specific observability, you're operating blind. You won't know that your AI chatbot started hallucinating after a prompt change, that your RAG system's retrieval quality degraded after a knowledge base update, or that a single customer is consuming 40% of your LLM budget. This guide covers what to monitor, how to monitor it, and the tools available in 2026.
The Five Pillars of LLM Observability
1. Cost and Token Tracking
LLM API calls are metered by tokens. Without per-request cost tracking, you cannot: identify expensive queries, optimize prompts for cost efficiency, set per-user budgets, or forecast monthly spend. Track input tokens, output tokens, and total cost for every LLM call.
// LLM call wrapper with observability: every call emits a span plus
// cost, token, latency, and error metrics.
import { trace, metrics } from '@opentelemetry/api';
import OpenAI from 'openai';

const tracer = trace.getTracer('llm-service');
const meter = metrics.getMeter('llm-service');
const llmCostCounter = meter.createCounter('llm.cost_usd');
const llmTokenHistogram = meter.createHistogram('llm.tokens');
const llmLatencyHistogram = meter.createHistogram('llm.latency_ms');
const llmErrorCounter = meter.createCounter('llm.errors');

const openai = new OpenAI();

interface LLMCallMetrics {
  model: string;
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cost: number;
  latency: number;
  success: boolean;
  userId?: string;
  feature?: string;
}

// Cost per 1K tokens (as of 2026)
const MODEL_COSTS: Record<string, { input: number; output: number }> = {
  'gpt-4o':        { input: 0.0025,  output: 0.01 },
  'gpt-4o-mini':   { input: 0.00015, output: 0.0006 },
  'claude-opus':   { input: 0.015,   output: 0.075 },
  'claude-sonnet': { input: 0.003,   output: 0.015 },
  'claude-haiku':  { input: 0.00025, output: 0.00125 },
};

async function trackedLLMCall(
  model: string,
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
  options: { userId?: string; feature?: string } = {}
): Promise<OpenAI.Chat.ChatCompletion> {
  return tracer.startActiveSpan('llm.call', async (span) => {
    const startTime = Date.now();
    span.setAttribute('llm.model', model);
    span.setAttribute('llm.user_id', options.userId || 'anonymous');
    span.setAttribute('llm.feature', options.feature || 'unknown');
    try {
      const response = await openai.chat.completions.create({ model, messages });
      const usage = response.usage!;
      const costs = MODEL_COSTS[model] || { input: 0, output: 0 };
      const cost =
        (usage.prompt_tokens / 1000) * costs.input +
        (usage.completion_tokens / 1000) * costs.output;
      const latency = Date.now() - startTime;
      // Record metrics on the span...
      span.setAttribute('llm.input_tokens', usage.prompt_tokens);
      span.setAttribute('llm.output_tokens', usage.completion_tokens);
      span.setAttribute('llm.cost_usd', cost);
      span.setAttribute('llm.latency_ms', latency);
      // ...and emit to the metrics system
      llmCostCounter.add(cost, {
        model,
        feature: options.feature || 'unknown',
        user_id: options.userId || 'anonymous',
      });
      llmTokenHistogram.record(usage.total_tokens, { model });
      llmLatencyHistogram.record(latency, { model });
      return response;
    } catch (error: any) {
      span.setAttribute('llm.error', true);
      span.setAttribute('llm.error_type', error.code || 'unknown');
      llmErrorCounter.add(1, { model, error_type: error.code || 'unknown' });
      throw error;
    } finally {
      span.end(); // always close the span, success or failure
    }
  });
}
2. Quality Evaluation
The hardest part of LLM observability: how do you know if the LLM's response is good? Unlike traditional software where correctness is binary (right or wrong), LLM quality is a spectrum. Approaches:
LLM-as-judge: Use a separate LLM call to evaluate the quality of the first LLM's response. "Given this question and context, rate the following answer on a scale of 1-5 for accuracy, relevance, and completeness." This is imperfect (the judge LLM has its own biases) but scalable and surprisingly correlated with human judgment.
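A minimal LLM-as-judge sketch: build the rubric prompt, then parse the judge model's 1-5 scores out of its reply. The rubric wording and the JSON reply format are illustrative assumptions, and the actual judge-model call is omitted so the helpers stay self-contained:

```typescript
interface JudgeScore {
  accuracy: number;
  relevance: number;
  completeness: number;
}

// Build the grading prompt sent to the judge model.
function buildJudgePrompt(question: string, context: string, answer: string): string {
  return [
    "You are grading an AI assistant's answer.",
    `Question: ${question}`,
    `Context: ${context}`,
    `Answer: ${answer}`,
    'Rate the answer 1-5 for accuracy, relevance, and completeness.',
    'Reply with JSON only, e.g. {"accuracy": 4, "relevance": 5, "completeness": 3}.',
  ].join('\n');
}

// Parse the judge's reply, tolerating surrounding prose around the JSON.
function parseJudgeScore(reply: string): JudgeScore | null {
  const match = reply.match(/\{[^}]*\}/);
  if (!match) return null;
  try {
    const { accuracy, relevance, completeness } = JSON.parse(match[0]);
    const valid = [accuracy, relevance, completeness].every(
      (n) => Number.isInteger(n) && n >= 1 && n <= 5,
    );
    if (valid) return { accuracy, relevance, completeness };
  } catch {
    /* malformed JSON falls through to null */
  }
  return null;
}
```

Returning null on a malformed judge reply matters in practice: track the parse-failure rate as its own metric rather than silently dropping those evaluations.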
User feedback: Add thumbs up/down buttons to AI responses. Track the feedback rate (what percentage of responses get feedback) and the positive feedback ratio. A sudden drop in positive feedback indicates a quality regression.
Automated checks: Validate responses against known patterns — does the response contain a valid URL when one was expected? Does a code generation response actually compile? Does a data extraction response match the expected schema? These are simple but catch obvious failures.
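As a sketch of such an automated check, here is a schema validator for a hypothetical data-extraction response; the expected fields (name, amount, dueDate) are illustrative assumptions:

```typescript
// Validate that an extraction response is JSON matching the expected
// shape; returns every violation so failures can be logged as metrics.
function validateExtraction(raw: string): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ['response is not valid JSON'] };
  }
  if (typeof parsed.name !== 'string' || parsed.name.length === 0) {
    errors.push('missing or empty "name"');
  }
  if (typeof parsed.amount !== 'number' || !Number.isFinite(parsed.amount)) {
    errors.push('"amount" is not a number');
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(parsed.dueDate ?? '')) {
    errors.push('"dueDate" is not an ISO date');
  }
  return { ok: errors.length === 0, errors };
}
```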
3. Hallucination Detection
Hallucination is the most dangerous LLM failure mode. Detection strategies:
Groundedness check: After generating a response from RAG, use a separate model to verify that every claim in the response is supported by the retrieved context. Claims not supported by the context are flagged as potential hallucinations.
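A full groundedness check sends each claim to a verifier model; as a cheap first pass, a lexical-overlap heuristic can flag answer sentences that share almost no content words with the retrieved context. The sentence splitter, the 4-letter word filter, and the 0.3 overlap threshold below are illustrative assumptions, not a production detector:

```typescript
// Extract lowercase "content words" (4+ letters) from a text.
function contentWords(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z]{4,}/g) ?? []);
}

// Flag answer sentences whose content words barely overlap the context.
function flagUngrounded(answer: string, context: string, minOverlap = 0.3): string[] {
  const ctxWords = contentWords(context);
  const sentences = answer.split(/(?<=[.!?])\s+/).filter((s) => s.trim());
  return sentences.filter((sentence) => {
    const words = [...contentWords(sentence)];
    if (words.length === 0) return false;
    const hits = words.filter((w) => ctxWords.has(w)).length;
    return hits / words.length < minOverlap; // low overlap → suspicious
  });
}
```

Flagged sentences become candidates for the (more expensive) model-based verification step, so you only pay for the verifier on suspicious output.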
Consistency check: Ask the same question multiple times (with different phrasings) and compare responses. If the responses are inconsistent, the LLM is likely uncertain and may be hallucinating.
Fact verification: For responses containing specific facts (dates, numbers, names), verify against a trusted data source. This is expensive but essential for high-stakes applications (legal, medical, financial).
4. Retrieval Quality (for RAG)
If your LLM application uses RAG, monitor retrieval quality separately from generation quality. Track: retrieval latency (how long does vector search take?), relevance scores of retrieved documents (are the top results actually relevant?), hit rate (what percentage of queries retrieve at least one relevant document?), and context window utilization (are you using the context window efficiently or stuffing it with irrelevant chunks?).
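These retrieval metrics can be aggregated from per-query logs along these lines; the RetrievalLog shape and the 0.7 relevance threshold are illustrative assumptions:

```typescript
interface RetrievalLog {
  latencyMs: number;
  scores: number[]; // similarity scores of retrieved chunks, best first
}

// Aggregate retrieval logs into hit rate, mean top-1 score, and p95 latency.
function retrievalMetrics(logs: RetrievalLog[], relevanceThreshold = 0.7) {
  const hits = logs.filter((l) => l.scores.some((s) => s >= relevanceThreshold));
  const topScores = logs.flatMap((l) => l.scores.slice(0, 1));
  const latencies = logs.map((l) => l.latencyMs).sort((a, b) => a - b);
  return {
    hitRate: logs.length ? hits.length / logs.length : 0,
    meanTopScore: topScores.length
      ? topScores.reduce((a, b) => a + b, 0) / topScores.length
      : 0,
    p95LatencyMs: latencies.length
      ? latencies[Math.min(latencies.length - 1, Math.floor(latencies.length * 0.95))]
      : 0,
  };
}
```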
5. Safety and Policy Monitoring
Monitor for: prompt injection attempts (track detection rate and false positive rate), toxic or harmful content in responses, PII leakage (social security numbers, credit card numbers, personal information in responses), and off-topic responses (the chatbot for a banking app shouldn't be writing poetry).
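A minimal sketch of the PII-leakage scan, run over model output before it reaches the user. The two patterns here (US-style SSN, 16-digit card number) are illustrative only; production scanners use far broader pattern sets plus checksum validation:

```typescript
// Named PII patterns to scan for in model output (illustrative subset).
const PII_PATTERNS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  creditCard: /\b(?:\d[ -]?){15}\d\b/,
};

// Return the names of every PII pattern that matches the text.
function findPII(text: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}
```

A non-empty result should both block (or redact) the response and increment a leakage counter so you can alert on spikes.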
Tools for LLM Observability
LangSmith (LangChain): The most comprehensive LLM observability platform. Provides tracing, evaluation, dataset management, and prompt versioning. Best for teams using the LangChain ecosystem.
Weights & Biases (W&B) Prompts: Extends W&B's ML experiment tracking to LLM applications. Good for teams already using W&B for ML model training.
Arize Phoenix (open-source): Open-source LLM observability with tracing, evaluation, and embedding visualization. Good for teams who want to self-host their observability data.
OpenTelemetry + custom dashboards: For teams who want full control, instrument LLM calls with OpenTelemetry spans and build custom dashboards in Grafana. This approach gives you maximum flexibility but requires more engineering effort.
Building Your LLM Monitoring Dashboard
Essential dashboard panels: total LLM cost (today, this week, this month, projected), cost per feature/endpoint, cost per user (identify heavy consumers), latency percentiles (p50, p95, p99) by model, error rate by model and error type, token usage trends, quality score trends (from LLM-as-judge or user feedback), hallucination detection rate, and top 10 most expensive queries.
Essential alerts: cost exceeds daily budget (immediate alert), error rate exceeds 5% (page on-call), latency p95 exceeds 10 seconds (Slack notification), quality score drops below threshold (Slack notification), and hallucination rate spikes (page on-call).
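In production these rules typically live in your alerting system (e.g. as Grafana alert rules), but the evaluation logic amounts to threshold checks over a metrics snapshot; the MetricsSnapshot shape and routing targets below are illustrative assumptions:

```typescript
interface MetricsSnapshot {
  dailyCostUsd: number;
  dailyBudgetUsd: number;
  errorRate: number;    // 0..1
  p95LatencyMs: number;
  qualityScore: number; // e.g. mean judge score
  qualityFloor: number;
}

type Alert = { name: string; severity: 'page' | 'slack' };

// Evaluate the alert thresholds listed above against one snapshot.
function evaluateAlerts(m: MetricsSnapshot): Alert[] {
  const alerts: Alert[] = [];
  if (m.dailyCostUsd > m.dailyBudgetUsd)
    alerts.push({ name: 'daily_budget_exceeded', severity: 'page' });
  if (m.errorRate > 0.05)
    alerts.push({ name: 'error_rate_high', severity: 'page' });
  if (m.p95LatencyMs > 10_000)
    alerts.push({ name: 'latency_p95_high', severity: 'slack' });
  if (m.qualityScore < m.qualityFloor)
    alerts.push({ name: 'quality_regression', severity: 'slack' });
  return alerts;
}
```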
ZeonEdge implements LLM observability stacks for production AI applications. From cost tracking to quality evaluation and safety monitoring, we build the visibility you need to run AI in production confidently. Contact us to discuss your AI observability needs.
Daniel Park
AI/ML Engineer focused on practical applications of machine learning in DevOps and cloud operations.