The demo works. The stakeholders are impressed. Someone says "let's ship it." Three weeks later the system is quietly turned off and no one mentions it again.
This is the most common arc for AI integrations right now. Not failure at the prototype stage — failure after launch, in production, under real conditions. The causes are usually the same.
Prompts that work in development break with real inputs
In development, you test with clean, representative inputs. In production, users do things you didn't anticipate. They write in different languages, use abbreviations, submit empty fields, copy-paste from Word documents with invisible Unicode characters, or ask questions wildly outside the system's intended scope.
Prompts that look robust in a notebook turn brittle fast. A system prompt tuned for a narrow use case starts hallucinating or refusing to respond the moment it receives something unexpected.
The fix is an evaluation suite — a set of real-world inputs with known expected outputs, run against the system before every change. Without this, you're flying blind. You don't know if a prompt change improved things or quietly broke a different class of inputs.
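A minimal sketch of what such a suite looks like, assuming a `generate()` function that wraps your model call (the function and the cases here are illustrative stand-ins, not a real integration):

```python
# Minimal evaluation harness: run known inputs through the system and
# check each output against an expectation before shipping a prompt change.

def generate(prompt: str) -> str:
    # Placeholder for the real LLM call; deterministic here for the sketch.
    return "REFUSE" if not prompt.strip() else f"SUMMARY: {prompt[:40]}"

EVAL_CASES = [
    # (input, predicate the output must satisfy)
    ("Refund request for order #1234", lambda out: out.startswith("SUMMARY")),
    ("", lambda out: out == "REFUSE"),  # empty field
    ("Nächste Woche kündigen", lambda out: out.startswith("SUMMARY")),  # non-English input
]

def run_evals() -> float:
    """Return the fraction of eval cases that pass."""
    passed = sum(1 for prompt, ok in EVAL_CASES if ok(generate(prompt)))
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    score = run_evals()
    assert score == 1.0, f"eval suite regressed: {score:.0%} passing"
```

The key property is that the suite runs in CI: a prompt change that fixes one case but breaks another fails the build instead of failing in production.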
No fallback when the API goes down
LLM APIs go down. They have rate limits. They return 500s. They time out on long completions.
Most AI integrations treat the model like a local function call — if it fails, the whole flow crashes. In a demo this doesn't matter. In production it means users hit errors, workflows stop, and data gets dropped.
Every AI integration needs a fallback posture:
- Retry with exponential backoff for transient failures
- Graceful degradation — what does the user see when the AI is unavailable?
- Queue-based architecture for non-interactive flows, so work isn't lost during outages
- Circuit breakers to stop hammering a degraded API
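The first and last items can be combined in a few dozen lines. A sketch, with illustrative thresholds and a stand-in for the actual API call:

```python
import random
import time

class CircuitOpen(Exception):
    """Raised when the breaker is open and calls are skipped."""

class Breaker:
    # Trip after `threshold` consecutive failures; stay open for `cooldown` seconds.
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, retries: int = 3, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpen("skipping call; provider marked unhealthy")
            self.opened_at = None  # half-open: allow one probe through
        for attempt in range(retries):
            try:
                result = fn(*args, **kwargs)
                self.failures = 0  # success resets the failure count
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                    raise
                if attempt == retries - 1:
                    raise
                # Exponential backoff with jitter: ~1s, ~2s, ~4s ...
                time.sleep((2 ** attempt) + random.random())
```

Wrap every model call in `breaker.call(...)` and the two failure modes are handled together: transient errors get retried with backoff, and a provider that keeps failing stops receiving traffic until the cooldown expires.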
None of this is novel. It's standard distributed systems practice. But developers new to AI often don't apply it because the model feels like "just a function."
Context windows fill up and quality degrades silently
LLM quality degrades as context grows. A system that handles a 3-message conversation cleanly starts making mistakes at 30 messages — confusing earlier instructions with later ones, losing track of constraints, repeating itself.
Most applications don't monitor this. The model doesn't throw an error when context becomes unwieldy. It just quietly performs worse.
For conversational systems, you need a strategy:
- Summarization — periodically compress older turns into a summary before feeding them back
- Windowing — only include the N most recent turns plus the system prompt
- Persistent memory — extract and store key facts separately so the context stays lean
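The windowing strategy is the simplest to start with. A sketch, assuming messages are dicts in the usual `{"role": ..., "content": ...}` shape; a production version would budget in tokens rather than turns:

```python
# Sliding-window context: always include the system prompt, then only
# the N most recent conversation turns.

def windowed_context(system_prompt: str, history: list[dict], max_turns: int = 10) -> list[dict]:
    recent = history[-max_turns:]  # drop everything older than the window
    return [{"role": "system", "content": system_prompt}] + recent
```

Summarization and persistent memory layer on top of this: before a turn falls out of the window, compress it or extract its key facts into stored state.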
For document-processing pipelines, you need to chunk inputs deliberately and test with documents at the extreme end of your expected size range.
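"Chunk deliberately" means splitting on natural boundaries with a size budget and a small overlap, not slicing at arbitrary character offsets. A sketch using paragraph boundaries; sizes are in characters for simplicity, where production code should budget in tokens:

```python
# Split a document on paragraph boundaries, packing paragraphs into
# chunks under a size budget. A one-paragraph overlap is carried into
# the next chunk so context isn't cut dead at each boundary.

def chunk_paragraphs(text: str, max_chars: int = 2000, overlap: int = 1) -> list[str]:
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paras:
        if current and sum(len(q) for q in current) + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # carry trailing paragraph(s) forward
        current.append(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Then run your eval suite against chunks produced from the largest documents you expect, not just the convenient ones.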
Token costs scale faster than expected
A prototype that costs $2 in API calls during testing can cost $2,000 a month once 1,000 users each generate comparable usage. That math surprises people who didn't instrument usage during development.
Costs spike for a few common reasons:
- System prompts that are several thousand tokens long, prepended to every request
- Logging full conversations to the LLM for "context" when only summaries are needed
- No caching layer for identical or near-identical queries
- Using a flagship model (GPT-4, Claude Opus) for tasks a cheaper model handles fine
The right approach is to track token usage per request from day one, set alerts when cost per interaction exceeds a threshold, and route tasks to appropriately-sized models. Not every call needs the most capable model.
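A sketch of per-request accounting; the prices and threshold below are illustrative placeholders, not real rates — plug in your provider's current pricing:

```python
# Track cost per request and alert when a single call exceeds a budget.
# (input_rate, output_rate) are dollars per 1K tokens — illustrative only.
PRICE_PER_1K = {"small-model": (0.0005, 0.0015), "flagship": (0.01, 0.03)}
ALERT_COST_PER_CALL = 0.05  # dollars

def record_usage(model: str, prompt_tokens: int, completion_tokens: int, ledger: list) -> float:
    in_rate, out_rate = PRICE_PER_1K[model]
    cost = (prompt_tokens / 1000) * in_rate + (completion_tokens / 1000) * out_rate
    ledger.append({"model": model, "cost": cost})
    if cost > ALERT_COST_PER_CALL:
        print(f"ALERT: {model} call cost ${cost:.4f}")  # swap for real alerting
    return cost
```

With the ledger in place, routing decisions become data-driven: if most calls to the flagship model are for tasks the small model handles in your evals, the ledger tells you exactly what the downgrade saves.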
No mechanism to catch and correct hallucinations
LLMs hallucinate. This is a property of the architecture, not a bug that will be patched away. In low-stakes applications — summarizing notes, drafting first drafts — this is tolerable. In applications where the output drives real decisions, it isn't.
The failure mode is building an AI integration that produces confident-sounding wrong answers with no mechanism to detect or surface them.
Mitigations depend on the use case:
- Grounding: force the model to cite sources, then verify those sources exist and match the claim
- Structured output validation: if the model is supposed to return a JSON object, validate the schema and reject anything malformed
- Human-in-the-loop for high-stakes paths: automate the common case, escalate edge cases to a person
- Output classifiers: a second, cheaper model that evaluates whether the first model's output meets quality criteria
The worst AI integrations are the ones where wrong outputs are invisible to the people affected by them.
Tight coupling to one provider
Teams that build directly against a single provider's SDK — calling openai.chat.completions.create() everywhere — create fragile systems. When OpenAI has an outage, everything stops. When Anthropic releases a better model for your use case, migrating is a project.
A thin abstraction layer that normalizes provider interfaces costs one afternoon to write and pays dividends repeatedly. It also enables fallback routing: if provider A returns an error, route to provider B.
This is especially important for production systems that need reliability guarantees you can't get from any single AI provider today.
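The abstraction layer can be as thin as one interface and a router. A sketch — the provider classes here are stubs, and in practice each `complete` method would wrap the vendor's SDK (the OpenAI or Anthropic client) behind this one interface:

```python
class ProviderError(Exception):
    """Normalized error raised by any provider on failure."""

class Provider:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError  # each subclass wraps one vendor SDK

class Router:
    def __init__(self, providers: list):
        self.providers = providers  # ordered by preference

    def complete(self, prompt: str) -> str:
        last_err = None
        for provider in self.providers:
            try:
                return provider.complete(prompt)
            except ProviderError as err:
                last_err = err  # fall through to the next provider
        raise ProviderError("all providers failed") from last_err
```

Application code calls `router.complete(...)` and never imports a vendor SDK directly, so an outage becomes a failover and a migration becomes a new subclass.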
What actually ships successfully
AI integrations that hold up in production share a few traits:
- Evals from the start — measurable quality metrics, not just "does it seem right"
- Defensive error handling — every API call wrapped with retry logic and graceful fallback
- Cost tracking — usage metrics instrumented before launch, not after the first bill
- Scope discipline — the system does one thing well, with explicit handling for out-of-scope inputs
- Human review in the loop — at least initially, until confidence in the system is earned through data
The demo is easy. The demo works by design — you control the inputs, you run it in optimal conditions, you skip the edge cases. Production is where the real engineering starts.