The days of stringing together a clunky, cascaded pipeline just to get an AI to speak are over.
Historically, building a real-time voice agent meant gluing together three separate systems: an Automatic Speech Recognition (ASR) module, a Large Language Model (LLM), and a Text-to-Speech (TTS) synthesizer. Every network hop added latency. Worse, converting speech to text completely stripped away the human elements of conversation – like intonation, emotion, and hesitation.
Amazon Nova Sonic models change the game. By natively processing and generating audio within a single, unified architecture, they eliminate the cascading delays of traditional pipelines. When paired with a highly optimized WebRTC media infrastructure, this end-to-end architecture achieves the sub-500-millisecond latency required for truly natural, human-like interaction.
This blog post accompanies our recent voice agent webinar, where we built a real-time voice agent using this exact stack. The complete code and webinar recording are available now.
But before we dive into the code, we need to talk about the models. Selecting a foundation model shouldn’t be a five-second decision, nor should it be a budget reflex. It’s a core engineering decision. By the end of this post you’ll know exactly how to route your workloads across the new Amazon Nova 2 tier.
💡 New to AI Agents?
We highly recommend reading that before tackling the advanced Amazon Nova models and voice architecture discussed below.
The Short Version (TL;DR)
If you are skimming, here are the immediate takeaways for building with the Nova suite:
- Use Nova 2 Lite by default: It boasts a massive 1M-token context window and offers industry-leading price-to-performance for everyday reasoning and multimodal tasks.
- Reserve Nova 2 Pro (Preview) for the heavy lifting: Save this one for highly complex agentic planning, multi-file code refactoring, or long-horizon reasoning.
- Deploy Nova 2 Sonic for real-time voice: It features native, asynchronous tool-use and robust user interruption (barge-in) handling, meaning your agent won’t freeze up with awkward “dead air” while fetching database records.
- Default to Gen-2, but keep Gen-1 as an escape hatch: Stick to Nova 2 Lite and Sonic for your core voice app. However, because advanced Gen-2 models (like Nova 2 Pro) are currently in restricted preview, Gen-1 heavyweights like Nova Pro are your immediate fallback if your background tasks hit Nova 2 Lite’s reasoning ceiling.
- The rest of the suite is for niche workloads: The broader Amazon Nova ecosystem includes image generation (Canvas), video generation (Reel), UI/browser automation (Nova Act), and custom model pre-training (Nova Forge). They are incredibly powerful, but for a conversational voice agent, you can safely tune them out.
- Ditch standard HTTP REST for WebRTC or WebSockets: Traditional request-response HTTP REST adds too much overhead and polling latency for audio. Pair a WebRTC media infrastructure or persistent WebSockets on your frontend to keep the bidirectional pipeline stutter-free.
- Mind your sample rates: Nova 2 Sonic strictly requires 16kHz audio input (Linear PCM) but returns a higher-fidelity 24kHz audio output. Fail to resample this correctly on the client side, and your agent will sound like a chipmunk.
- We did the coding: Our vox-brief demo voice agent’s open source code and webinar recording are available now.
Table of Contents
What is Amazon Nova? A deep-dive
Amazon Nova is AWS’s portfolio of foundation models available through Amazon Bedrock. The model family is designed to provide strong performance across a range of enterprise use cases while maintaining competitive pricing, low latency, and deep integration with the AWS ecosystem.
The Nova family tree
There are several Nova models (and non-model features) on offer, so let’s map the relationships to provide a solid base for the rest of this post.
Amazon Nova landed at re:Invent 2024 in December. The original announcement covered four understanding models (Micro, Lite, Pro, Premier), two creative models (Canvas for images, Reel for video), and – several months later in April 2025 – a speech model called Nova Sonic.
Then re:Invent 2025 brought Nova 2: four new models, two already GA and two in preview. All Nova 2 models share a 1M-token context window, extended thinking (togglable with low/medium/high compute budgets), built-in Code Interpreter, Web Grounding, and native remote MCP tool support. Alongside the models, Amazon also announced Nova Forge and Nova Act – covered in the section below. The full model breakdown is in the comparison table after that.
Beyond the models: Nova Forge and Nova Act
re:Invent 2025 also brought two new Nova services that don’t fit neatly into the model categories above – worth knowing about even if they’re not what we’re building with today.
Nova Forge is Amazon’s offering for teams who need to build truly custom foundation models. Rather than fine-tuning an existing model on a task dataset, Forge lets you build a model from Amazon’s pre-trained, mid-trained, and post-trained checkpoints using your own proprietary data – AWS calls the resulting models “Novellas”. It’s clearly enterprise-tier, with pricing starting at $100,000/year. For most teams this is firmly out of scope, but there’s an important side effect worth knowing: preview access to Nova 2 Pro and Nova 2 Omni is currently tied to Nova Forge customers.
Nova Act is a purpose-built agent for browser and UI workflow automation – an agent that can reliably navigate web interfaces, fill forms, chain multi-step browser tasks, and extract structured data without brittle CSS selectors or pre-recorded scripts. It spent most of 2025 in research preview and went GA with the re:Invent 2025 wave. If you’re automating anything that lives inside a browser, it’s worth a look.
Neither of these is a foundation model you’d invoke directly from Bedrock for inference – they’re tools for different problems. But they’re part of the Nova ecosystem and good to have on your radar. For the rest of this post, we’re back on the foundation models themselves.
Model comparison: the whole suite at a glance
Still confused? You’re not alone! Let’s unpack the Nova models so you can better understand the purpose of each.
Bold rows are gen-2 models, paired directly below their gen-1 equivalent.
| Gen | Model | Status | Core purpose |
|---|---|---|---|
| 1 | Nova Micro | ✅ GA | The ultra-fast sprinter. Text-only routing and classification at minimal cost. |
| 1 | Nova Lite | ✅ GA | The multimodal workhorse. Balances speed and cost for everyday tasks – RAG, chat, doc analysis. |
| 2 | Nova 2 Lite | ✅ GA | Nova Lite’s upgrade. Adds extended thinking and a 1M-token context window. |
| 1 | Nova Pro | ✅ GA | The heavy reasoner. Deep multimodal analysis across text, images, video, and documents. |
| 2 | Nova 2 Pro | ⏳ Preview | Nova Pro’s successor. Most capable reasoning model; built for complex agentic tasks. |
| 1 | Nova Premier | ✅ GA | The gen-1 powerhouse. When you need maximum capability and 1M context without waiting for gen-2. |
| 1 | Nova Sonic | ✅ GA | The original voice specialist. Real-time speech-to-speech; 6 languages at launch. |
| 2 | Nova 2 Sonic | ✅ GA | Nova Sonic’s upgrade. 7 languages, 1M context, async tool calling, <500ms latency. |
| 2 | Nova 2 Omni | ⏳ Preview | The all-in-one future. Multimodal in (text, image, video, speech) – text and image out. |
| N/A | Nova Canvas | ✅ GA | The image generator. For visual generation inside Bedrock workflows. |
| N/A | Nova Reel | ✅ GA | The video generator. 720p video generation billed per second of output. |
Pricing and context windows are covered a bit further down. Full details about each model can be found on the official Nova models landing page.
When to use each model
As the previous section highlights – each model is good at something. And some models are better at certain things even though they can also perform many tasks.
Amazon Nova Micro (gen-1 only)
Use when you need the absolute lowest latency and cost for text-only tasks. Classification, routing, short-form extraction, simple Q&A with no images or audio. At $0.035/$0.14 per 1M input/output tokens it’s one of the cheapest capable models on Bedrock. If you’re running millions of small text operations, start here.
Amazon Nova Lite (gen-1)
Use when you need everyday multimodal tasks without paying Nova Pro prices. Handles text, images, video, and documents. At $0.06/$0.24 per 1M tokens it’s nearly as cheap as Micro while adding full multimodal input. Good for document processing, image analysis, or mixed-input pipelines that don’t require heavy reasoning depth.
Amazon Nova Pro (gen-1)
Use when you need analytical depth and you want a model that’s been GA for over a year. It’s what we use in our demo architecture later in this post for digest generation – structured reasoning over a transcript, clean JSON output, predictable pricing at $0.80/$3.20 per 1M. It’s also the practical fallback for teams that need Nova 2 Pro-level tasks but can’t get preview access.
Amazon Nova Premier (gen-1 only)
Use when you’re hitting the limits of Nova Pro – very long documents, complex synthesis tasks, or workloads that genuinely benefit from the 1M-token context. At $2.50/$12.50 per 1M it’s the expensive option in gen-1, so use it only when the simpler models demonstrably fall short.
Amazon Nova 2 Lite
Use when you’re starting a new project today and want a current-generation model for everyday reasoning. It’s GA, has extended thinking available on demand, and AWS benchmarks show it beating Claude Haiku 4.5 on 13 of 15 tasks. At $0.30/$2.50 per 1M tokens it’s pricier than gen-1 Nova Lite but considerably more capable. For greenfield work, this is probably your default.
Amazon Nova 2 Pro
Use when you need maximum reasoning depth for complex agentic tasks – multi-document analysis, long software migrations, multi-step planning. Worth noting it’s currently only available to Nova Forge customers and pricing isn’t published. Factor both of those things into your evaluation timeline.
Amazon Nova 2 Sonic
Use when you’re building any kind of real-time voice interface. Customer service agents, voice assistants, live feedback interviews – if the user interaction is primarily spoken, this is your model. The unified speech-to-speech architecture beats stitched ASR/LLM/TTS pipelines on latency, naturalness, and cost. The 1M-token context window means it can hold a long conversation. If you’re building voice AI on AWS, this is the answer.
Amazon Nova 2 Omni
It’s in preview and Forge-gated, so “watch it” is the right posture for most teams. The capability is genuinely interesting – multimodal in plus image generation out in one model. When it goes broadly available, it could simplify pipelines that currently need both a reasoning model and a generation model. Keep it on your radar.
Nova Canvas / Nova Reel
Use when you need image or video generation inside a Bedrock workflow. Canvas for stills, Reel for video. These aren’t conversational models, but they integrate cleanly if your pipeline involves generating visual output.
Pricing
All Nova models are natively hosted on AWS and accessed directly through the Amazon Bedrock APIs (such as the Converse or Invoke API). Usage is billed directly to your standard AWS invoice on a pay-as-you-go, per-token basis.
| Model | Input | Output |
|---|---|---|
| Nova Micro | $0.035 / 1M tokens | $0.14 / 1M tokens |
| Nova Lite | $0.06 / 1M tokens | $0.24 / 1M tokens |
| Nova 2 Lite | $0.30 / 1M tokens | $2.50 / 1M tokens |
| Nova Pro | $0.80 / 1M tokens | $3.20 / 1M tokens |
| Nova 2 Pro | In preview | In preview |
| Nova Premier | $2.50 / 1M tokens | $12.50 / 1M tokens |
| Nova Sonic | $3.40 / 1M speech tokens + $0.06 / 1M text tokens | $13.60 / 1M speech tokens + $0.24 / 1M text tokens |
| Nova 2 Sonic | $3.00 / 1M speech tokens + $0.33 / 1M text tokens | $12.00 / 1M speech tokens + $2.75 / 1M text tokens |
| Nova 2 Omni | $0.30 / 1M text·img·vid + $1.00 / 1M tokens (audio) | $2.50 / 1M text + $40 / 1M tokens (image) |
| Nova Canvas | — | $0.04 to $0.08 per image |
| Nova Reel | $0.08 per second (720p, 24 fps) | — |
Nova 2 Sonic billing note: Unlike text models, Sonic is billed by audio duration – roughly 1 token per 100ms of audio. The speech and text tiers are billed separately: speech tokens cover the audio itself, text tokens cover transcription, tool calls, and conversation history. At ~$0.017/min for combined speech I/O, it’s roughly 80% cheaper than OpenAI’s Realtime API for equivalent workloads.
Always verify current rates at aws.amazon.com/nova/pricing – these can change without notice.
The cost levers nobody talks about
Pricing is rarely black and white with AI workflows – especially once real-time streaming audio enters the equation. To truly optimize your margins, you need to understand how AWS meters this traffic.
Speech tokens are not text tokens
Nova 2 Sonic bills on speech tokens, not word count. Production telemetry monitored by hands-on cloud architects demonstrates that Bedrock scales this at precisely 25 speech tokens per second.
At $3.00/1M speech input and $12.00/1M speech output:
| Scenario | Speech in | Speech out | Speech cost |
|---|---|---|---|
| 1-minute call (equal split) | ~600 tokens | ~600 tokens | ~$0.009 |
| 5-minute call (equal split) | ~3,000 tokens | ~3,000 tokens | ~$0.045 |
Text tokens (transcript, tool calls, conversation history) add a small fraction on top.
Flex tier for non-latency-sensitive work
Amazon Bedrock offers multiple real-time inference service tiers: Standard, Priority, and Flex. Flex is priced significantly below Standard (giving you a massive discount on Nova 2 Lite) in exchange for lower processing priority and reduced throughput guarantees during peak traffic windows. For Nova Pro digest generation in our demo architecture later in this post we use standard, but the Flex tier is absolutely worth evaluating. You are running a post-call background analysis job that can comfortably take a few extra seconds to complete synchronously. Because you aren’t in a racing hurry for the final dashboard summary, there is zero reason to pay full Standard rates for throughput guarantees you don’t actually need. As always – test and see if it works for your needs.
Extended thinking is opt-in – and costs tokens
For Nova 2 Lite and Nova 2 Pro, extended thinking is off by default. You toggle it on per-request with a budget: low, medium, or high. Higher budget = more reasoning tokens = higher cost. For simple digest generation, low might be all you need to get noticeably better structured output over a complex transcript. For straightforward classification or summarisation tasks, leave it off entirely.
💡 Fun fact: When you turn on extended thinking for Nova 2 Lite, the API response contains a reasoningContent block, but the actual step-by-step thinking prose is returned as [REDACTED]. AWS does this to protect privacy and optimize performance, but they still count and bill those hidden reasoning tokens because they actively shape the final high-quality output block.
Text caching for long system prompts in Sonic sessions
Nova 2 Sonic’s text tier (conversation history, tool results, system context) has a cached input price of $0.033/1M tokens – 10× cheaper than the uncached $0.33/1M. If your voice agent has a long, fixed system prompt (the persona, ground rules, background), explore prompt caching for that context. The speech tokens themselves don’t cache, but the text layer does.
Pin your model versions
Amazon Nova model IDs in Bedrock follow a strict, versioned scheme. Use explicit, versioned IDs in production – never rely on base aliases to “just point to the latest.” At some point, the default pointer will change, and an unexpected model update could easily disrupt your application’s prompt engineering or structured JSON formatting.
Instead of hunting blindly through the AWS Console UI – which frequently abstracts these strings behind buttons – you can always find the exact, updated programmatic strings in the official Amazon Bedrock Model Cards Guide.
Here is exactly how to structure your model configuration variables depending on your chosen routing architecture:
| Model | Production Model ID | Inference Routing Style | Supported Inference Profiles |
| Amazon Nova Micro | amazon.nova-micro-v1:0 | Supports In-Region & Geo Cross-Region | us.amazon.nova-micro-v1:0eu.amazon.nova-micro-v1:0 |
| Amazon Nova Lite | amazon.nova-lite-v1:0 | Supports In-Region & Geo Cross-Region | us.amazon.nova-lite-v1:0eu.amazon.nova-lite-v1:0 |
| Amazon Nova Pro | amazon.nova-pro-v1:0 | Supports In-Region & Geo Cross-Region | us.amazon.nova-pro-v1:0eu.amazon.nova-pro-v1:0 |
| Amazon Nova Premier | amazon.nova-premier-v1:0 | Supports In-Region & Geo Cross-Region | us.amazon.nova-premier-v1:0 |
| Amazon Nova Sonic | amazon.nova-sonic-v1:0 | Strictly In-Region Only | Not supported |
| Amazon Nova 2 Lite | amazon.nova-2-lite-v1:0 | Supports In-Region, Geo Cross-Region, & Global Cross-Region | us.amazon.nova-2-lite-v1:0eu.amazon.nova-2-lite-v1:0jp.amazon.nova-2-lite-v1:0global.amazon.nova-2-lite-v1:0 |
| Amazon Nova 2 Sonic | amazon.nova-2-sonic-v1:0 | Strictly In-Region Only | Not supported |
| Amazon Nova 2 Multimodal Embeddings | amazon.nova-2-multimodal-embeddings-v1:0 | Strictly In-Region Only | Not supported |
| Amazon Nova Canvas | amazon.nova-canvas-v1:0 | Strictly In-Region Only | Not supported |
| Amazon Nova Reel | amazon.nova-reel-v1:0 | Strictly In-Region Only | Not supported |
The table above deals with many things, including inference profiles… Let’s pause on that for a bit.
What Are Inference Profiles and How Do They Work?
Cross-Region Inference (CRIS) profiles act as AWS-managed smart-routing proxies for your application. Instead of pointing your code to a single fixed data center, you use an inference profile prefix (like us., eu., or global.). When your app experiences sudden spikes in traffic, Bedrock automatically evaluates available compute and dynamically routes the request over the encrypted AWS private backbone network to a region with open capacity. This seamlessly clears out local Rate Limit Exceeded (429) bottlenecks, can literally double your default Tokens-Per-Minute (TPM) quotas, and charges zero data-transfer premiums.
Crucially, AWS segments these routing pathways into geographic compliance tiers to respect your data residency restrictions. For example, a eu. prefix guarantees your data dynamically balances across regions like Frankfurt or Stockholm, but never leaves the European Union.
💡 Why Sonic is In-Region Only: As noted in the table above, real-time voice models like Nova 2 Sonic don’t support these profiles. Real-time full-duplex speech-to-speech demands a direct localized line; bouncing audio data frames between global regions mid-conversation would introduce unacceptable latency lag and packet jitter.
For detailed configuration architecture steps, check out the official Amazon Bedrock Cross-Region Inference Guide.
Nova 2 Sonic: why unified speech-to-speech matters
The traditional voice pipeline is a lot longer than the simple Nova Sonic flow. The legacy speech agent involves speech to text (STT), a text large language model (LLM) and then text to speech (TTS) again.
That’s three models, three round trips, and three places for latency to stack up. More importantly, the STT step strips out everything except the words – tone, pace, hesitation, emphasis all disappear. The LLM has no idea the user sounded frustrated or uncertain.
Nova 2 Sonic collapses this into a single model that processes audio directly. Audio goes in; audio comes out. It preserves prosodic signals and handles turn-taking naturally without explicit end-of-utterance markers.

What Nova Sonic unlocks practically:
- Sub-500ms End-to-End Latency: In typical production environments, the model achieves near-instantaneous response speeds. On Amazon Bedrock, your backend opens a persistent bidirectional channel via
InvokeModelWithBidirectionalStreamand keeps it open for the duration of the session. Audio chunks stream in continuously; raw text transcripts and generated audio chunks stream back in real time. - Granular Turn-Taking Controllability: Voice Activity Detection (VAD) pause sensitivity is completely configurable by the developer. You can dial it to Low (patient; perfect for complex technical explanations or educational workflows), Medium, or High (highly responsive; optimized for casual, rapid-fire conversation).
- Asynchronous Tool Use: Traditional voice bots lock up or emit awkward silences when running a database query. Nova 2 Sonic supports native async tool calling. If a user asks for data, the model initiates the background API lookup and continues to process incoming audio natively without interrupting or freezing the stream.
- Seven Native Languages (with Polyglot Shifting): Out of the box, it natively supports English, French, Italian, German, Spanish, Portuguese, and Hindi. More importantly, it supports true polyglot voices – meaning a single chosen voice avatar can fluidly shift languages mid-sentence with native, natural expressivity.
- Seamless Cross-Modal Sessions: You can seamlessly alternate between typing and talking in a single, continuous interaction. The model maintains perfect conversation history context across both modalities, allowing a user to speak a question, receive a spoken answer, and then type a complex serial number or address into the chat UI.
- A Massive 1M-Token Context Window: Unlike older voice stacks that suffer from short memory limits, Sonic inherits the core capability of the Nova 2 family tree, allowing it to maintain deep recall over exceptionally long conversational logs.
While Nova 2 Sonic originally debuted exclusively in the US and Tokyo, AWS has expanded its footprint. It is natively available in four key deployment regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Stockholm). If your core user base sits in the EU, you can spin this stack up locally in eu-north-1 without suffering trans-Atlantic pipe latency.
For hands-on exploration of Nova models, AWS maintains a solid sample repo in the official Amazon Nova Samples GitHub repository
under the speech-to-speech/amazon-nova-2-sonic directory. To get up and running quickly, step-by-step onboarding is detailed directly in the Official Amazon Nova 2 Sonic Getting-Started Guide.
The demo: vox-brief, a serverless voice agent
As a proof of concept we have built vox-brief – a voice-powered customer feedback agent that runs entirely on AWS, costs nothing when idle, and can be deployed in a single command. The code is open at github.com/janobarnard/vox-brief.
What it does
You call into the agent, and a persona called Alex conducts a structured feedback interview over voice. A live transcript appears in the browser. On hang-up, Alex generates a structured digest: summary, key takeaways, action items, danger signals, and sentiment. The whole thing runs serverless – no always-on infrastructure, no idle cost.

The two Nova models
| Layer | Model | Model ID |
|---|---|---|
| Voice conversation | Amazon Nova 2 Sonic | amazon.nova-2-sonic-v1:0 |
| Post-call digest | Amazon Nova Pro | amazon.nova-pro-v1:0 |
Why Nova Pro (gen-1) for the digest rather than Nova 2 Lite or Nova 2 Pro? Nova 2 Pro is Forge-gated and in preview; Nova Pro is GA, predictably priced, and well-tested on reasoning tasks. The digest runs after the call – a few extra seconds is fine – so stability trumps novelty. When Nova 2 Pro goes broadly available, swapping it in requires changing one line.
Architecture
The serverless architecture of vox-brief is built around Bedrock AgentCore, with other AWS services assisting with components like UI and authentication (in our demo we use basic HTTP auth, AgentCore Identity is fully capable of handling this).

Everything is serverless or on-demand. The only fixed-ish cost is WAF if you enable it (~$7/month + $0.60/M requests). Everything else scales to zero between calls.
Key files
vox-brief/
├── agent/
│ ├── server.py # FastAPI WebSocket server (AgentCore runtime)
│ ├── nova_sonic.py # Bidirectional stream client for Nova 2 Sonic
│ ├── digest.py # Post-call digest generation via Nova Pro
│ ├── prompt.py # System prompts and Alex persona definition
│ └── Dockerfile
├── frontend/
│ ├── index.html
│ ├── app.js # WebSocket client, live transcript rendering
│ └── styles.css
├── infra/
│ ├── bootstrap.yaml # CloudFormation: ECR repository
│ └── template.yaml # CloudFormation: full deployment stack
└── deploy.sh # One-command deploymentDeployment is made simple with the single shell script:
./deploy.sh --profile <your-aws-profile> --password <your-demo-password> --prebuiltPythondeploy.sh creates the ECR repo via bootstrap.yaml, builds and pushes the Docker image, then deploys the full stack via template.yaml. You can drop --prebuilt to create the container from scratch.
The Nova 2 Sonic integration
The core voice logic lives in nova_sonic.py. The InvokeModelWithBidirectionalStream API opens a persistent channel at session start and holds it open. Audio chunks stream in continuously from the browser; transcribed text and audio response chunks stream back. server.py exposes a WebSocket endpoint – browser connects, Sonic stream opens, and from that point on the server is just proxying in both directions.
On hang-up, the accumulated transcript is passed to Nova Pro in digest.py via a standard synchronous Bedrock invoke. Nova Pro returns a structured JSON object (summary, takeaways, action items, danger signals, sentiment), which gets written to DynamoDB and rendered in the browser.

AgentCore handles the operational overhead – session lifecycle, container management, scaling. For a long-running voice session involving potential tool calls, that matters. You’re not managing compute; you’re managing conversation logic.
What this all means
The Nova suite covers more ground than it might first appear: cheapest-in-class text processing (Micro), solid multimodal reasoning across two generations (Lite, Pro, Premier, Nova 2 Lite, Nova 2 Pro), real-time voice (Nova 2 Sonic), image and video generation (Canvas, Reel), and – coming when Nova 2 Omni goes broadly available – a model that handles multimodal input and generates images in a single call.
For most teams building on AWS, the practical starting point is Nova 2 Lite for everyday reasoning and Nova 2 Sonic for voice. Both are GA, both have 1M-token context windows, both are competitively priced. If you need heavier reasoning, keep an eye on Nova 2 Pro going broadly available – and keep using Nova Pro (gen-1) in the meantime.
We built vox-brief to show what a production-realistic voice agent looks like with this stack – real AWS services, real infrastructure-as-code, zero idle cost, and an architecture you can fork and extend. Whether you’re building customer feedback tools, voice assistants, or something more ambitious, the pieces are there.


