Build a Serverless AI Web App on Azure – Architecture & Cost Guide - Internee

Marcel Broschk

Published Mar 2, 2026

Moritz Goeke

Manager Technology Consulting @ Protiviti | Microsoft MVP Azure & AI | M. Sc. Computer Science

Moritz Goeke’s session on the M365Con “Azure & Entra” stage walks through a practical, modern pattern: ship an AI-powered web app without running (or patching) servers, and pay mostly for what you use. The demo app is intentionally simple—a chat UI in the browser that calls a backend API, talks to a hosted GPT model, and stores conversation history—because simplicity makes the architectural trade-offs obvious. What follows is a “full article” version of that talk: the same building blocks, plus the extra context you’d typically want when you go from a demo to something you can run in a real environment.

What “serverless” really means (and why it changes the build mindset)

Serverless is less about “no servers exist” and more about not managing them yourself. Instead of provisioning virtual machines, scheduling patch windows, and guessing capacity, you deploy code (or static assets) and let the platform handle provisioning, scaling, and availability. In Azure, this is the difference between owning the operational burden versus letting the service absorb it—especially around scaling spikes and resilience. Azure Functions’ Consumption model, for example, is billed on executions and resource consumption, which is the canonical “pay for what ran” model people mean when they say serverless. (

Moritz frames serverless as a speed multiplier: you spend time building product logic, not infrastructure. That’s not just convenience—it directly affects how quickly you can iterate on prompts, swap models, and evolve a UX. If you’ve ever watched a proof-of-concept die during the “now we need to productionize hosting and scaling” phase, serverless is often the antidote. It narrows the gap between a demo and a deployable artifact.

There’s also a cost posture shift. Traditional server-based apps tend to have idle cost: you pay for capacity even when no one is using the app. Serverless services generally align cost with usage: invocations, duration, data operations, tokens. Azure Functions includes a free grant each month (requests + GB-s), which can make small workloads effectively “free-ish” until they grow.

Finally, serverless encourages an event-driven design. Your UI triggers an API call; that API triggers a model call; that response triggers a write to storage; telemetry is emitted along the way. Each step is independently scalable, and each step can be swapped out later (for example, replacing “store chat history in Cosmos DB” with “store in a different data layer” without re-architecting the whole app).

The reference architecture: Static Web App + Functions + AI Foundry + Cosmos DB

The core architecture in the talk uses four Azure-native components that fit together cleanly:

Azure Static Web Apps hosts the frontend (React in the demo) and serves it globally with managed SSL and easy CI/CD hooks. The Standard plan is commonly referenced as $9/app/month, making it easy to budget for the “website” part.
Azure Functions provides the backend API surface: send prompt → validate → call model → return response → persist conversation. This is the “serverless compute” layer that scales automatically.
Azure AI Foundry Models / Azure OpenAI provides the hosted model endpoint (Moritz uses a small GPT variant in the demo). This keeps model hosting and scaling off your plate while charging per token.
Azure Cosmos DB (serverless) stores chat history. In serverless mode it’s positioned for low/variable traffic, billing per database operation and storage without pre-provisioned throughput.

What’s elegant here is the separation of concerns. The frontend is purely static content (HTML/CSS/JS), so it’s cheap to serve and easy to deploy. The backend is thin and stateless: it doesn’t “remember” anything between calls, it simply orchestrates requests and writes state to the database. The model is an external managed dependency accessed via API. The database is the single source of truth for conversation history, which makes multi-device and multi-session experiences straightforward.

Moritz also mentions authentication as an optional add-on (for example, Entra ID), but keeps it out of the demo for focus. That’s a practical teaching choice: auth can dominate a talk, and the interesting pattern here is how the serverless pieces connect. The key takeaway is that auth is orthogonal—you can add it at the edge (Static Web Apps auth) or in the API (Functions) later.

One more important architectural note: conversation context. Chat models don’t inherently “keep memory” across calls; the “memory” is what you send in each request. That’s why the backend stores message history and replays it (or summarizes it) when calling the model. Azure’s chat completion guidance explicitly describes the use of a message list and roles such as system, user, and assistant.

Implementing the app: request flow, message roles, and the token reality

At runtime, the browser sends a user message to an API endpoint (Azure Functions). The function builds a request payload for the model: a system message (rules/behavior), plus the conversation history, plus the newest user message. Azure’s documentation calls out that after the system role, you include a series of messages between user and assistant, and you end with a user message to trigger the next assistant response.

This is where Moritz’s demo makes a subtle point visible: tokens are not linear with “how many messages you’ve sent” if you keep replaying history. Each new question includes the previous turns, so prompt tokens tend to rise as the thread grows. That’s why long chats cost more over time: you pay for the tokens you send (input) and the tokens you receive (output). Azure OpenAI pricing pages make this “per token” model explicit, and Azure AI Foundry models similarly present pay-as-you-go pricing for hosted models.

In practical terms, you have three common strategies once a chat grows:

Recommended by LinkedIn

Fall Into Serverless this September

Jason Smith 8 months ago

Microsoft Azure Containers - An Architect Perspective

Vishnu Bharath R 3 years ago

The Rise of Serverless Architecture in 2025: Why It’s…

Cybrain Software Solutions Pvt.Ltd 10 months ago

truncate old turns, 2) summarize older context into fewer tokens, or 3) move to an API pattern that manages threads and truncation more automatically. Azure’s Assistants API description, for example, explains that “threads store messages and automatically handle truncation” to fit the model context window. Moritz’s demo stays with the simpler “chat completions” style to keep the mechanics transparent.

On the frontend, the UX can be minimal: a textbox, a send button, and a message list. The interesting parts are the “production-ish” features Moritz layers in: tracking token usage, tracking function duration, and showing estimated cost per interaction. Those are not gimmicks; they’re exactly what you need in the first weeks of real usage, when you’re trying to understand whether your architecture is cheap because it’s efficient—or cheap because nobody is using it yet.

Finally, deployment is where Static Web Apps shines. For typical React/Angular/Vue setups, the service is designed to build and deploy from a repository with minimal ceremony. That “low ceremony” matters because it makes it easier to treat the whole app as code—frontend + backend + infrastructure templates—so you can recreate it, fork it, and evolve it safely.

Data and persistence: why Cosmos DB serverless is a good fit for chat history

Chat apps need storage if you want continuity: reload the page, switch devices, or show previous conversations. Cosmos DB is a natural Azure-native option because it’s globally distributable and supports multiple APIs—but Moritz’s key choice is Cosmos DB serverless. Serverless Cosmos DB is explicitly positioned for workloads with low traffic and intermittent bursts, billing per database operation and consumed storage without provisioning throughput ahead of time.

That billing model maps well to chat. Most chat workloads are “spiky”: users send a burst of messages, then go idle. If you provision throughput for peak but run at low average traffic, you can overpay. Serverless smooths that, at the cost of less predictability at high sustained traffic (at which point you might switch to provisioned throughput). The point isn’t that serverless Cosmos is always best—it’s that it’s a sensible default for an early-stage app or an internal tool with variable usage.

The data model can be simple: a conversation record keyed by user/session, with an array of message objects (role/content/timestamp), plus lightweight metadata (token counts, model name, latency). Because the model call needs history anyway, persisting history also lets you debug issues: “Did the model answer strangely because the system prompt was wrong, or because the previous turn contained something unexpected?” Storage becomes your audit trail.

In Moritz’s demo, Functions handles CRUD-style operations: create conversation, append messages, load prior messages, delete if needed. This is a very common Functions pattern: keep the function small and composable, and push durable state into the database. The operational win is big: Azure monitors and logs Functions natively, and you scale automatically with demand, without provisioning app servers.

Once you have this baseline, you can evolve: add user auth, enforce per-user quotas, encrypt sensitive fields, or introduce retention policies. The important part is that serverless doesn’t prevent “grown-up” data practices—it just means you’re applying them on managed services instead of hand-rolled infrastructure.

Cost, model choice, and “how do I not wake up to a €3,000 weekend?”

Moritz’s cost discussion lands because it connects architectural choices to euros. Static Web Apps has a predictable base price in Standard, Functions has consumption pricing, Cosmos DB serverless charges per operation + storage, and models charge per token. That combination can be extremely cost-effective for moderate usage because so much of your stack is idle-cost-free.

Model choice is the biggest variable because it scales directly with usage, and it can scale brutally. Smaller models (“mini”/“nano” style tiers) can be dramatically cheaper, but you must map them to the job. If your app is mostly retrieval-augmented (RAG) where the knowledge comes from your documents and the model is primarily rewriting/summarizing, smaller models often perform well enough. If your app needs deeper reasoning, orchestration, tool-use decisions, or agent-style planning, larger models can justify their cost because mistakes become more expensive than tokens.

This is also where Azure AI Search enters the conversation. Moritz notes that Azure AI Search isn’t “serverless” in the same way—search services provision capacity and have a meaningful baseline cost. Azure’s own pricing page lists monthly prices per Search Unit, and even the Basic tier can be a noticeable fixed cost compared to the “pay-per-request” feel of the rest of the stack. That doesn’t make AI Search bad; it means you choose it when you need its retrieval, vector indexing, filters, and scaling behavior—and you budget for it as a standing capability.

Now the scary part: runaway spend. In the Q&A, the question comes up directly: can you hard-cap spend for a model or agent? In practice, Azure gives you budgets and alerts rather than a guaranteed “hard stop.” Microsoft’s Cost Management budget scenario documentation shows how budget thresholds can trigger notifications and even automation actions (for example, a Logic App that shuts down resources) when a threshold is reached. The operational pattern Moritz recommends is sensible: always set budgets, alert early (like 50%/80%), and wire automation where feasible—especially in dev/test subscriptions where experimentation is frequent.

Finally, there’s the “Power Platform vs code” decision. Moritz’s answer is pragmatic: if you’re a low-code team with licensing and you want fast app assembly, Power Apps/Power Pages can be great. The serverless-code approach (Static Web Apps + Functions) shines when you want full control, code-first workflows, and distribution at scale without per-user licensing—because your primary costs become usage-based cloud consumption rather than user licensing.

Source URL: Review: Building a Fully Serverless AI Web App with Azure Cloud Native Services by Moritz Goeke at the M365 Con

Review: Building a Fully Serverless AI Web App with Azure Cloud Native Services by Moritz Goeke at the M365 Con

Admin Content

Marcel Broschk

Moritz Goeke

What “serverless” really means (and why it changes the build mindset)

The reference architecture: Static Web App + Functions + AI Foundry + Cosmos DB

Implementing the app: request flow, message roles, and the token reality

Recommended by LinkedIn

Data and persistence: why Cosmos DB serverless is a good fit for chat history

Cost, model choice, and “how do I not wake up to a €3,000 weekend?”

Related Blog Posts

How to Control App Registration Secret Expiry in Azure

Categories

Popular Posts

Popular Tags

Review: Building a Fully Serverless AI Web App with Azure Cloud Native Services by Moritz Goeke at the M365 Con

Admin Content

Marcel Broschk

Moritz Goeke

What “serverless” really means (and why it changes the build mindset)

The reference architecture: Static Web App + Functions + AI Foundry + Cosmos DB

Implementing the app: request flow, message roles, and the token reality

Recommended by LinkedIn

Data and persistence: why Cosmos DB serverless is a good fit for chat history

Cost, model choice, and “how do I not wake up to a €3,000 weekend?”

Related Blog Posts

How to Control App Registration Secret Expiry in Azure

Categories

Popular Posts

Popular Tags

Get New Internship Notification!