Step 4: FastAPI Backend - How I Build
Building the REST API with FastAPI, SSE streaming, Pydantic validation, and TestClient testing.
The Goal
Wrap the Torah chat logic in a proper REST API. Two endpoints: one synchronous (for simple clients), one streaming (for the real-time chat experience). Plus health check, CORS, and error handling.
API Design
POST /chat - Send question, get full answer (sync)
POST /chat/stream - Send question, get SSE stream (real-time)
GET /health - Health check
Why two chat endpoints? The sync endpoint is useful for testing, scripts, and mobile clients that do not support SSE. The stream endpoint is what the Next.js frontend actually uses.
The FastAPI App
```python
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import google.generativeai as genai
import json
import os

app = FastAPI(title="Torah Study AI", version="0.1.0")

# CORS for Next.js frontend
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

class ChatRequest(BaseModel):
    """Request body for chat endpoints."""
    question: str = Field(..., min_length=1, max_length=2000)
    language: str = Field(default="en", pattern="^(en|fr|he)$")

class ChatResponse(BaseModel):
    """Response body for sync chat endpoint."""
    answer: str
    model: str = "gemini-2.0-flash"
```
Pydantic Validation
Pydantic does the heavy lifting for input validation. The Field constraints replace manual if/else checks:
```python
class ChatRequest(BaseModel):
    question: str = Field(
        ...,                     # required
        min_length=1,            # no empty strings
        max_length=2000,         # prevent abuse
    )
    language: str = Field(
        default="en",
        pattern="^(en|fr|he)$",  # only supported languages
    )
```
If someone sends {"question": ""}, FastAPI returns a 422 with a clear error message. No custom validation code needed. This is why I chose FastAPI over Flask for this project.
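That behavior is easy to see outside the server by exercising the model directly. A quick sketch (assuming Pydantic v2, where `Field` takes `pattern` rather than v1's `regex`):

```python
from pydantic import BaseModel, Field, ValidationError

class ChatRequest(BaseModel):
    question: str = Field(..., min_length=1, max_length=2000)
    language: str = Field(default="en", pattern="^(en|fr|he)$")

# Valid input parses cleanly and fills in the default language
ok = ChatRequest(question="What is Shema?")
print(ok.language)  # en

# An empty question fails validation; this is what FastAPI turns into a 422
try:
    ChatRequest(question="")
except ValidationError as exc:
    print(exc.errors()[0]["loc"])  # ('question',)
```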
The Sync Endpoint
```python
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Send a Torah question and get the full answer."""
    try:
        api_key = os.getenv("GOOGLE_API_KEY")
        if not api_key:
            raise HTTPException(status_code=500, detail="API key not configured")
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel(
            model_name="gemini-2.0-flash",
            system_instruction=SYSTEM_PROMPT,
        )
        response = model.generate_content(request.question)
        return ChatResponse(answer=response.text)
    except HTTPException:
        raise  # don't wrap our own HTTP errors in a second 500
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
The Streaming Endpoint
```python
@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    """Stream a Torah answer via Server-Sent Events."""
    async def generate():
        api_key = os.getenv("GOOGLE_API_KEY")
        if not api_key:
            yield f"data: {json.dumps({'error': 'API key not configured'})}\n\n"
            return
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel(
            model_name="gemini-2.0-flash",
            system_instruction=SYSTEM_PROMPT,
        )
        response = model.generate_content(request.question, stream=True)
        for chunk in response:
            if chunk.text:
                data = json.dumps({"content": chunk.text})
                yield f"data: {data}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )
```
The key pattern here is the generator function. generate() yields SSE-formatted strings one at a time. FastAPI streams them to the client without buffering the entire response in memory.
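On the consuming side, the client has to undo this framing. A small parser for the data: lines shows the contract the frontend relies on (parse_sse_lines is a hypothetical helper for illustration, not part of the backend):

```python
import json

def parse_sse_lines(lines):
    """Collect text chunks from /chat/stream SSE lines.

    Skips blank separator lines, decodes each 'data: {...}' payload,
    and stops at the '[DONE]' sentinel the server emits.
    """
    chunks = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # blank lines separate SSE events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunks.append(json.loads(payload)["content"])
    return chunks

sample = [
    'data: {"content": "The Shema "}',
    "",
    'data: {"content": "is..."}',
    "",
    "data: [DONE]",
]
print("".join(parse_sse_lines(sample)))  # The Shema is...
```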
CORS Configuration
CORS is the thing that always bites you on day one. The frontend runs on localhost:3000, the backend on localhost:8000. Without CORS middleware, the browser blocks every request.
```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # Explicit, not "*"
    allow_credentials=True,                   # Needed for JWT cookies
    allow_methods=["*"],
    allow_headers=["*"],
)
```
I use explicit origins instead of "*" because allow_credentials=True requires it: combining wildcard origins with credentials is not allowed by the CORS spec.
Testing with TestClient
FastAPI ships with a TestClient that lets you test endpoints without starting a real server. No requests library, no port binding, no cleanup.
```python
from fastapi.testclient import TestClient
from unittest.mock import patch, MagicMock

from main import app

client = TestClient(app)

def test_health_check():
    """Health endpoint should return 200."""
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy"}

def test_chat_empty_question():
    """Should return 422 for empty question."""
    response = client.post("/chat", json={"question": ""})
    assert response.status_code == 422

def test_chat_missing_question():
    """Should return 422 for missing question field."""
    response = client.post("/chat", json={})
    assert response.status_code == 422

@patch("main.genai")
def test_chat_success(mock_genai):
    """Should return answer for valid question."""
    mock_response = MagicMock()
    mock_response.text = "The Shema is..."
    mock_genai.GenerativeModel.return_value.generate_content.return_value = mock_response
    with patch.dict("os.environ", {"GOOGLE_API_KEY": "test-key"}):
        response = client.post("/chat", json={"question": "What is Shema?"})
    assert response.status_code == 200
    assert "answer" in response.json()
    assert len(response.json()["answer"]) > 0
```
The TestClient pattern is one of the best things about FastAPI. You write tests that look like integration tests (full HTTP request/response cycle) but run like unit tests (in-process, no network).
Chaining Insight (Stanford CS230)
The Stanford CS230 course talks about chaining: breaking a complex LLM task into independent steps that can be debugged separately.
Our architecture already follows this pattern naturally:
User question -> Validate -> Search Weaviate -> Rerank (Cohere) -> Generate (Gemini)
Each step is a separate function with clear input/output. If the answer is bad, I can check: was the search wrong? Did reranking drop the right source? Or did generation hallucinate? This is why I have separate endpoints and a pipeline architecture instead of one monolithic "answer the question" call.
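As a sketch of that pipeline shape (the function names are mine, and the bodies are stand-ins for the real Weaviate, Cohere, and Gemini calls):

```python
def retrieve(question):
    # Stand-in for the Weaviate search step
    return ["Deut. 6:4 (Shema)", "Rashi on Deut. 6:4"]

def rerank(question, passages):
    # Stand-in for the Cohere rerank step: reorder, best passage first
    return sorted(passages, key=len)

def generate(question, passages):
    # Stand-in for the Gemini call; real code builds a prompt from passages
    return f"Answer to {question!r}, grounded in {passages[0]}"

def answer(question):
    """Each stage is a plain function, so each can be logged and tested alone."""
    passages = retrieve(question)
    passages = rerank(question, passages)
    return generate(question, passages)

print(answer("What is Shema?"))
```

Because every stage has a plain input/output signature, swapping one stand-in for the real service changes nothing about the others.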
Lessons Learned
- Pydantic replaces manual validation. Define constraints in the model, let the framework handle error responses. Less code, better error messages.
- Generators make streaming trivial. The `yield` keyword turns a complex streaming protocol into a simple loop. No callbacks, no event listeners.
- TestClient is a superpower. You can test the full HTTP layer (status codes, headers, JSON serialization) without any network overhead.
- CORS must be configured day one. Do not wait until the frontend is ready. Add it immediately and test with `curl -H "Origin: http://localhost:3000"`.
What is Next
Step 5 adds authentication and conversation persistence. SQLite database, JWT tokens, and CRUD endpoints for sessions.