Planning AI Applications
TL;DR
Before building an AI application, answer three questions: why should it exist, what role does AI play vs humans, and what milestone gets you from demo to production? It is easy to build a cool demo with foundation models. It is hard to create a profitable product.
The historical problem
In 2023-2024, most companies felt FOMO. "We need AI" became a directive. Teams were told to "integrate AI" without a clear use case. The result was thousands of demos, most of which never reached production.
Huyen saw this across the industry and wrote Chapter 1 to give AI engineers a framework to answer the uncomfortable question: should this app even exist?
How it works
Step 1: Use case evaluation (why build it)
Three levels of motivation, from urgent to speculative:
- Existential risk: if we do not build AI, competitors will kill us. Common in document processing (insurance, finance), creative work (advertising, design), and information-heavy industries. A 2023 Gartner survey found 7% of AI adopters cited business continuity as a driver. Reference: OpenAI's "GPTs are GPTs" (Eloundou et al., 2023) ranks industries by exposure.
- Opportunity: AI boosts profit or productivity. Customer support, sales lead generation, content creation, internal knowledge search.
- Strategic learning: not sure where AI fits yet, but do not want to be left behind. R&D budget, optional.
If the motivation is existential, build in-house. If it is opportunity or strategic learning, buy first, and build only if buying fails.
Step 2: The role of AI and humans
Three axes to classify the AI feature:
Critical vs complementary
- Critical: app cannot work without AI (Face ID, DALL-E, Cursor autocomplete).
- Complementary: app works without AI, AI enhances it (Gmail Smart Compose, Google Maps traffic prediction).
Rule: the more critical AI is, the higher the reliability bar. Users forgive a wrong suggestion in Smart Compose. They do not forgive a failed Face ID unlock.
Reactive vs proactive
- Reactive: AI responds to user action (chatbot, search, completion).
- Proactive: AI acts without being asked (traffic alerts, recommendations, scheduled summaries).
Reactive needs low latency (users wait). Proactive needs high quality (users did not ask, so mistakes feel intrusive).
Dynamic vs static
- Static: model updates rarely, one model per user segment (default ChatGPT for everyone).
- Dynamic: model adapts continuously per user (Face ID updates as your face ages, ChatGPT's memory feature, personalized fine-tunes).
Dynamic is harder. You need per-user state, drift detection, and privacy guarantees.
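To make the static/dynamic distinction concrete, here is a minimal sketch of the per-user state a dynamic feature requires (all names are hypothetical; the staleness check is a crude stand-in for real drift detection):

```python
from dataclasses import dataclass, field
import time

# Hypothetical sketch: per-user state for a "dynamic" feature.
# A static feature would serve one shared config instead of this store.

@dataclass
class UserProfile:
    preferences: dict = field(default_factory=dict)
    last_updated: float = field(default_factory=time.time)

class ProfileStore:
    def __init__(self, staleness_days: float = 30.0):
        self._profiles: dict[str, UserProfile] = {}
        self._staleness_s = staleness_days * 86400

    def update(self, user_id: str, key: str, value: str) -> None:
        profile = self._profiles.setdefault(user_id, UserProfile())
        profile.preferences[key] = value
        profile.last_updated = time.time()

    def is_stale(self, user_id: str) -> bool:
        # Crude drift proxy: flag profiles not refreshed recently.
        profile = self._profiles.get(user_id)
        if profile is None:
            return True
        return (time.time() - profile.last_updated) > self._staleness_s

store = ProfileStore()
store.update("u1", "tone", "concise")
print(store.is_stale("u1"))  # just updated → False
```

Even this toy version surfaces the extra obligations: storage per user, a freshness policy, and (in a real system) deletion paths for privacy compliance.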
Step 3: AI product defensibility
Huyen asks: if your app is just a GPT wrapper, why should you exist?
Three common moats:
- Proprietary data - unique training data or retrieval corpus (medical records, legal history)
- Distribution - owning the user (Microsoft integrates Copilot into Office, Salesforce into CRM)
- Workflow and UX - deep integration into a specific job (Cursor for devs, Harvey for lawyers)
"Wrapper" apps without a moat were replicated by Microsoft within two weeks in 2023. Defensibility is a must.
Step 4: Setting expectations and milestones
Demo-to-production is the AI engineering graveyard. Huyen's advice:
- Quality bar depends on criticality: medical diagnosis needs 99.99%, content generation can ship at 80%.
- Set user-facing expectations: tell users the system is AI-powered and can be wrong. Confidence indicators, escape hatches to humans.
- Iterative milestones: plan an alpha (5% works), beta (70% works), GA (95% works) path. Plan for the evaluation framework BEFORE shipping alpha, not after.
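The alpha/beta/GA path can be enforced mechanically. A minimal sketch of a promotion gate tied to an eval-set pass rate (function names and the exact thresholds are illustrative, not from the book):

```python
# Illustrative sketch: gate each release stage on eval pass rate.
# Thresholds mirror the alpha/beta/GA targets above; tune for your product.

MILESTONES = {"alpha": 0.05, "beta": 0.70, "ga": 0.95}

def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases the system handled correctly."""
    return sum(results) / len(results) if results else 0.0

def can_promote(results: list[bool], stage: str) -> bool:
    """True if the eval pass rate meets the bar for the target stage."""
    return pass_rate(results) >= MILESTONES[stage]

eval_results = [True] * 72 + [False] * 28  # e.g. 72 of 100 cases pass
print(can_promote(eval_results, "beta"))   # 0.72 >= 0.70 → True
print(can_promote(eval_results, "ga"))     # 0.72 <  0.95 → False
```

The point of the gate is the ordering: the eval set and thresholds must exist before the alpha ships, or "beta-quality" is just an opinion.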
Relevance today (2026)
Huyen's framework from 2024 holds up well. Adjustments for 2026:
- Defensibility pressure is higher. In 2024, a GPT-4 wrapper could get VC funding. In 2026, wrappers get cloned in a weekend by someone with Cursor. Defensibility is not optional.
- Dynamic features easier to build. Vector DBs + per-user memory (ChatGPT Memory, Claude Projects) are now cheap. Huyen's "dynamic is hard" is less true in 2026.
- "Proactive AI" is hot. Agent-based apps (scheduled summaries, inbox triage, workflow automation) exploded in 2025-2026 with MCP. Huyen's proactive category deserves more weight today.
- Quality bar shifted up. Users experienced Claude Opus 4.x and GPT-5 in ChatGPT. They will not tolerate a 70%-works product in 2026 unless the value is extraordinary.
- Regulatory pressure. EU AI Act (2024-2025 enforcement), US state-level AI laws. Planning must now include: is this a high-risk AI system under the regulation? What compliance does it trigger?
Question: in 2026, is "should we build this?" the wrong question? Maybe it is "should we buy, integrate, or build?". Most answers are "integrate existing SaaS, do not build from scratch".
Critical questions
- If your use case is "opportunity" (not existential), and a SaaS already exists (e.g., Zendesk AI for support), what justifies building your own?
- What is your quality bar? What would cause a user to churn after a bad AI response?
- Who is responsible when the AI fails? Do you have a human-in-the-loop fallback?
- How will you measure success? Business metric (conversion, NPS, retention) or AI-specific metric (accuracy, helpfulness)?
- What is your cost per successful interaction? Can you sustain it at 10x your current scale?
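To make the last question concrete (all numbers invented for illustration), cost per successful interaction divides total serving cost by only the interactions that succeeded, which is why a 70%-works product is more expensive per unit of value than its per-call price suggests:

```python
# Illustrative numbers only: cost per *successful* interaction, not per call.

def cost_per_success(total_cost_usd: float, interactions: int,
                     success_rate: float) -> float:
    successes = interactions * success_rate
    if successes == 0:
        raise ValueError("no successful interactions")
    return total_cost_usd / successes

# 10,000 interactions at $0.03/call in inference cost, 70% judged successful:
unit = cost_per_success(total_cost_usd=10_000 * 0.03,
                        interactions=10_000,
                        success_rate=0.70)
print(round(unit, 4))  # ≈ 0.0429, ~43% above the $0.03 per-call price
```

Running the same arithmetic at 10x scale shows whether the margin survives growth, since failed calls are paid for but produce no value.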
Production pitfalls
- Skipping the "why" question. Teams get a leadership mandate and skip use-case evaluation. Result: a showpiece demo with no business value.
- Mistaking novelty for moat. First-mover advantage in AI is 2-6 weeks before a competitor clones you. Only durable moats (data, distribution, workflow) survive.
- Over-promising accuracy. Telling users "AI can do X" when it does X 70% of the time. Better: "AI suggests X, you confirm" (human-in-the-loop framing).
- Missing the eval. Without an eval set, you cannot tell if your prompt change helped or hurt. See 08-evaluations/.
- Static when you needed dynamic. One model for all users, serving wildly different use cases, ends up mediocre across the board. Consider personalization early.
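A minimal sketch of the "did my prompt change help?" check the eval pitfall describes. The model and grader below are toy stand-ins (assumed names, not a real API); the structure that matters is scoring both variants against the same fixed eval set:

```python
# Sketch: compare two prompt variants on one fixed eval set.
# `run_model` and `is_correct` are placeholders for your system and grader.

def evaluate(run_model, eval_set: list[dict], is_correct) -> float:
    """Pass rate of one prompt variant over a fixed eval set."""
    passed = sum(is_correct(run_model(case["input"]), case["expected"])
                 for case in eval_set)
    return passed / len(eval_set)

# Toy stand-ins so the sketch runs end to end:
eval_set = [{"input": "2+2", "expected": "4"},
            {"input": "3+3", "expected": "6"}]
old_prompt = lambda q: str(sum(int(x) for x in q.split("+")))  # answers both
new_prompt = lambda q: "4"                                     # always says "4"
grade = lambda out, exp: out == exp

old_score = evaluate(old_prompt, eval_set, grade)  # 1.0
new_score = evaluate(new_prompt, eval_set, grade)  # 0.5
print("helped" if new_score > old_score else "hurt or neutral")
```

Without the fixed eval set, the regression in `new_prompt` would only surface as anecdotes from users.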
Alternatives / Comparisons
Alternatives to building a custom AI app:
| Option | When to prefer | Downside |
|---|---|---|
| Use an existing SaaS with AI (Notion AI, Intercom Fin) | Off-the-shelf fits your workflow | Limited customization, shared moat |
| Embed Claude/GPT API into existing product | Your product is the moat, AI is a feature | Vendor dependency |
| Fine-tune a smaller model for your task | Privacy, cost, or scale requires it | Ops burden, slower iteration |
| Build a full vertical AI product | You have proprietary data or workflow insight | 6-12 month build, risky |
Mini-lab
For your Torah Study AI project, answer Huyen's framework explicitly:
- Use case motivation: existential, opportunity, or strategic learning?
- Critical or complementary: does the app work without AI?
- Reactive or proactive: does it answer when asked or push insights?
- Static or dynamic: personalized per user or not?
- Defensibility: proprietary Torah interpretation data? Integration with existing tools (Sefaria)? Unique UX for havruta mode?
- Milestones: what is alpha / beta / GA quality?
Write a short brief (1-2 pages) in outputs/reports/torah-study-huyen-brief.md. Goal: you can pitch the project to a senior AI product manager in 2 minutes using Huyen's vocabulary.
Further reading
- Huyen, Chapter 1 of AI Engineering, "Planning AI Applications" section
- Apple, "Human Interface Guidelines for Machine Learning" - deep on human-AI interaction patterns
- Andrew Ng, "AI for Everyone" (Coursera) - non-technical framing that complements Huyen
- Eloundou et al., "GPTs are GPTs" (OpenAI, 2023) - industry exposure analysis
- a16z, Enterprise AI playbooks (annual) - defensibility case studies