Building Voice-Enabled Content Experiences: How to Add AI Speech Features to Your Headless CMS

Tony Spiro
February 23, 2026

Audio content is no longer optional. Major publishers like The Guardian have added "Listen to this article" features, and accessibility requirements are driving demand for multi-modal content delivery. With Cosmic's built-in AI audio generation, developers can now add voice capabilities to any Cosmic-powered application in minutes, not weeks.
This guide walks you through using Cosmic's native text-to-speech capabilities, covering the dashboard workflow, API integration, content modeling, implementation patterns, and cost optimization.
Why Voice-Enabled Content Matters
Voice-enabled content serves multiple audiences:
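- Accessibility compliance: WCAG 2.1 does not require an audio version of text content, but a listening option helps people with low vision or reading difficulties, and any audio you publish falls under its time-based media criteria (for example 1.2.1, which requires a text alternative for prerecorded audio-only content)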
- Multi-modal consumption: Users increasingly want to listen while commuting, exercising, or multitasking
- Content reach expansion: Audio versions open your content to visually impaired users and auditory learners
- Engagement metrics: Some publishers report longer time-on-site when audio options are available, so measure the effect on your own content before assuming it
Cosmic AI Audio Generation
Cosmic provides built-in text-to-speech powered by OpenAI's TTS models, available directly from the dashboard and API. No third-party API keys or external services required.
Key Features
- 9 natural-sounding voices: Choose from feminine voices (nova, shimmer, coral, sage, alloy) and masculine voices (echo, onyx, fable, ash)
- Two quality tiers: Standard (tts-1) for fast, low-latency generation or HD (tts-1-hd) for higher quality output
- Long text support: Texts over 4,096 characters are automatically chunked at paragraph boundaries and concatenated into a single file
- Instant CDN delivery: Generated audio is saved as MP3 to your Media Library and served through Cosmic's global CDN
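To make the long text behavior concrete, here is a minimal sketch of paragraph-boundary chunking. Cosmic does this for you and its exact algorithm is not described in this guide, so treat the function name, the blank-line paragraph rule, and the hard-split fallback as assumptions for illustration only.

```typescript
// Illustrative only: Cosmic performs chunking internally. This is not its
// implementation, just one way to split text at paragraph boundaries.
const MAX_TTS_CHARS = 4096;

function chunkAtParagraphs(text: string, limit: number = MAX_TTS_CHARS): string[] {
  // Assumption: one or more blank lines mark a paragraph boundary.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    // Fallback: a single paragraph longer than the limit is hard-split.
    if (paragraph.length > limit) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      for (let i = 0; i < paragraph.length; i += limit) {
        chunks.push(paragraph.slice(i, i + limit));
      }
      continue;
    }

    // Pack as many whole paragraphs as fit into the current chunk.
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= limit) {
      current = candidate;
    } else {
      chunks.push(current);
      current = paragraph;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk would then be sent to the TTS model separately and the resulting audio joined in order, which is the part Cosmic handles before saving a single MP3.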
Voice Selection Guide
| Voice | Style | Best For |
|---|---|---|
| nova | Warm and bright, friendly narration | Podcasts, explainers |
| shimmer | Soft and intimate, gentle delivery | Meditation, ASMR, bedtime stories |
| coral | Clear and polished, professional tone | Product demos, business content |
| sage | Calm and steady, thoughtful pace | Education, tutorials |
| alloy | Neutral and balanced, versatile | General purpose, articles |
| echo | Deep and authoritative, confident | Announcements, trailers, news |
| onyx | Bold and commanding, strong presence | Intros, branding, dramatic reads |
| fable | Animated and expressive, natural storyteller | Storytelling, audiobooks |
| ash | Warm and approachable, conversational | Conversational, interviews |
Generating Audio from the Dashboard
The fastest way to get started is directly from the Cosmic dashboard:
1. Navigate to Media in your project
2. Click Create and select Audio
3. Select a voice from the dropdown (default: Nova)
4. Paste or type the text you want to convert to speech
5. Click Generate to create the audio file
Audio files are automatically saved to your Media Library in MP3 format, ready for use in your applications. For full details on dashboard usage, see the AI dashboard docs.
Content Modeling for Audio
Before implementing programmatic TTS, extend your content model to support audio metadata. In Cosmic, add these metafields to your blog post or article object type:
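A minimal sketch of what such metafields could look like, written as a typed TypeScript constant. Only the audio_file key is named in this guide; the other keys (audio_duration, audio_voice, generate_audio) and the exact type strings are assumptions to check against the Cosmic object type documentation.

```typescript
// Hypothetical metafield definitions for a blog post object type.
// Only `audio_file` appears in this guide; everything else is illustrative.
interface MetafieldDefinition {
  title: string;
  key: string;
  type: "file" | "number" | "text" | "switch"; // assumed type names
  required: boolean;
}

const audioMetafields: MetafieldDefinition[] = [
  // Reference to the generated MP3 in the Media Library.
  { title: "Audio File", key: "audio_file", type: "file", required: false },
  // Length in seconds, useful for showing remaining time in a player.
  { title: "Audio Duration (seconds)", key: "audio_duration", type: "number", required: false },
  // Which voice produced the file, so regenerated audio stays consistent.
  { title: "Audio Voice", key: "audio_voice", type: "text", required: false },
  // Editor-facing toggle for opting an article in or out of audio.
  { title: "Generate Audio", key: "generate_audio", type: "switch", required: false },
];
```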
This schema captures essential audio metadata for playback controls and accessibility features. The audio_file metafield uses Cosmic's file type, which stores a reference to a media object in your Media Library, giving you access to the full media properties including the CDN URL, file name, and size.
Implementation: Cosmic SDK Audio Generation
Here's a complete implementation pattern for generating audio from Cosmic content using the built-in AI audio API:
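The flow can be sketched independently of the SDK: clean up the markdown, generate the audio, then attach the result to the article. The three dependency functions below (toNarration, generateAudio, attachAudio) are hypothetical stand-ins, not Cosmic SDK methods; wire them to the calls documented in the AI API reference.

```typescript
// Hypothetical shapes: none of these names come from the Cosmic SDK.
interface MediaRef {
  name: string; // file name in the Media Library
  url: string;  // CDN URL of the generated MP3
}

interface AudioPipelineDeps {
  // Stand-in for the AI text step that turns markdown into speakable prose.
  toNarration: (markdown: string) => Promise<string>;
  // Stand-in for the audio generation call.
  generateAudio: (text: string, voice: string) => Promise<MediaRef>;
  // Stand-in for updating the article's audio metafield.
  attachAudio: (articleId: string, media: MediaRef) => Promise<void>;
}

interface ArticleInput {
  id: string;
  markdown: string;
}

async function generateArticleAudio(
  article: ArticleInput,
  deps: AudioPipelineDeps,
  voice: string = "nova",
): Promise<MediaRef> {
  // Step 1: convert markdown into narration-ready text.
  const narration = (await deps.toNarration(article.markdown)).trim();
  if (narration.length === 0) {
    throw new Error(`Article ${article.id} produced no narration text`);
  }

  // Step 2: generate the audio file from the cleaned text.
  const media = await deps.generateAudio(narration, voice);

  // Step 3: link the generated media back to the article.
  await deps.attachAudio(article.id, media);
  return media;
}
```

Passing the dependencies in keeps the orchestration testable with stubs and makes it easy to swap the voice or the cleanup step later.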
This approach uses Cosmic AI's text generation with a lightweight model (Claude Haiku 4.5) to intelligently convert markdown content into narration-ready text before passing it to the audio generator. Unlike simple HTML stripping, the AI understands context and converts structured elements like tables and code blocks into natural, speakable prose.
There is no need for a separate OpenAI client or API key. Cosmic handles the text cleanup, TTS generation, file storage, and CDN delivery all within the platform. The generated MP3 is automatically available in your Media Library. When updating the article, we set the metafield to the media object's value, which links the file metafield to the corresponding media asset in your library.
Using the API Directly with cURL
You can also generate audio directly via the REST API:
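If you are calling the API from TypeScript rather than a shell, the sketch below builds and validates the request before it is sent. The endpoint path is not given in this section, so it is left to the caller; the body field names (text, voice, model) and the Bearer authorization header are assumptions to confirm in the AI API reference.

```typescript
// The nine voices and two models listed in this guide.
const VOICES = ["nova", "shimmer", "coral", "sage", "alloy", "echo", "onyx", "fable", "ash"] as const;
type TtsModel = "tts-1" | "tts-1-hd";

interface AudioRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildAudioRequest(
  text: string,
  voice: string,
  model: TtsModel,
  writeKey: string,
): AudioRequestInit {
  // Fail fast on typos instead of spending a network round trip.
  if ((VOICES as readonly string[]).indexOf(voice) === -1) {
    throw new Error(`Unknown voice: ${voice}`);
  }
  if (text.trim().length === 0) {
    throw new Error("Text must not be empty");
  }

  return {
    method: "POST",
    headers: {
      // Assumption: the API accepts a bucket write key as a Bearer token.
      Authorization: `Bearer ${writeKey}`,
      "Content-Type": "application/json",
    },
    // Assumption: body field names; verify against the API reference.
    body: JSON.stringify({ text, voice, model }),
  };
}
```

The returned object can be passed as the second argument to fetch, together with the audio endpoint URL from the API reference.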
The response includes the full media object with a CDN URL you can use immediately in your application.
Batch Audio Generation
For generating audio across multiple articles at once, use a batch processing approach:
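One way to structure the batch is a small worker pool that limits how many generations run at once and records failures instead of aborting the whole run. The generateOne callback is a hypothetical per-article generator that you supply, and the default concurrency of 3 is an arbitrary choice, not a documented Cosmic limit.

```typescript
interface BatchOutcome {
  succeeded: string[];
  failed: { id: string; reason: string }[];
}

async function generateAudioBatch(
  articleIds: string[],
  generateOne: (id: string) => Promise<void>, // hypothetical: generates audio for one article
  concurrency: number = 3,
): Promise<BatchOutcome> {
  const outcome: BatchOutcome = { succeeded: [], failed: [] };
  let next = 0; // index of the next article to pick up

  // Each worker pulls the next unprocessed id until none remain.
  async function worker(): Promise<void> {
    while (next < articleIds.length) {
      const id = articleIds[next++];
      try {
        await generateOne(id);
        outcome.succeeded.push(id);
      } catch (err) {
        // Record the failure and keep going with the rest of the batch.
        outcome.failed.push({
          id,
          reason: err instanceof Error ? err.message : String(err),
        });
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, articleIds.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return outcome;
}
```

Because failures are collected rather than thrown, you can retry only the ids in outcome.failed on a second pass.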
Automating Audio with Cosmic AI Agents
For a fully hands-off approach, use Cosmic AI Agents to automatically generate audio whenever new content is published. Set up an event-triggered Content Agent that listens for publish events:
1. Navigate to the AI Agents page in your project
2. Click Create Agent and select Content Agent
3. Set the prompt: "When a blog post is published, generate an audio version using the nova voice and update the audio_file metafield with the result."
4. Enable Event Triggers and select Object Published for your blog posts object type
You can also create a Workflow that chains audio generation with other tasks, such as generating social media content or translating the article into other languages, all triggered by a single publish event.
Deployment and Caching Strategies
Audio generation consumes AI tokens, so implement these optimizations:
1. Pre-generate on publish: Use Cosmic webhooks or event-triggered agents to generate audio when content is published, not on user request.
2. CDN caching: Generated audio files are stored in Cosmic's Media Library and delivered through the global CDN for edge caching. No additional CDN configuration is needed.
3. Selective generation: Not all content needs audio. Add a toggle metafield to let editors choose which articles get audio versions.
4. Choose the right quality tier: Use standard (tts-1) for most content. Reserve HD (tts-1-hd) for premium content where higher audio fidelity justifies the 2x token cost.
5. Automatic chunking: Cosmic handles long texts automatically. Articles over 4,096 characters are split at paragraph boundaries and concatenated into a single MP3, so you do not need to manage chunking logic yourself.
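The selective generation and quality tier rules above can be combined into one decision function that runs at publish time. The article fields here (audioEnabled, hasAudio, premium) are hypothetical names for an editor toggle, an existing audio check, and a premium flag; map them to whatever metafields your content model uses.

```typescript
interface PublishableArticle {
  id: string;
  body: string;
  audioEnabled: boolean; // hypothetical editor toggle metafield
  hasAudio: boolean;     // true when an audio file is already attached
  premium: boolean;      // hypothetical flag for premium content
}

type AudioPlan =
  | { generate: false; reason: string }
  | { generate: true; model: "tts-1" | "tts-1-hd" };

function planAudio(article: PublishableArticle): AudioPlan {
  // Selective generation: respect the editor's choice.
  if (!article.audioEnabled) {
    return { generate: false, reason: "disabled by editor" };
  }
  // Avoid paying twice for content that already has audio.
  if (article.hasAudio) {
    return { generate: false, reason: "audio already exists" };
  }
  if (article.body.trim().length === 0) {
    return { generate: false, reason: "empty body" };
  }
  // Quality tier: standard by default, HD only for premium content.
  return { generate: true, model: article.premium ? "tts-1-hd" : "tts-1" };
}
```

Note that the hasAudio check skips regeneration even when the text has changed; if you edit published articles often, compare a content hash or modified date instead.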
Accessibility Compliance
When implementing voice features, ensure WCAG 2.1 compliance:
- Provide transcripts: Audio content must have text alternatives (WCAG 1.2.1). Since your source content is already text, link to it alongside the audio player.
- Player controls: Include play, pause, volume, and playback speed controls
- Keyboard navigation: Ensure the audio player is fully keyboard accessible
- Visual indicators: Show current playback position and remaining time
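As a starting point for the checklist above, this sketch renders a player as an HTML string using the browser's native audio element, whose controls attribute provides play, pause, volume, and a position indicator in mainstream browsers. The function names and markup structure are illustrative choices, not part of Cosmic.

```typescript
// Escape text before placing it in HTML content or attribute values.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function renderAudioPlayer(audioUrl: string, title: string, transcriptUrl: string): string {
  const label = escapeHtml(`Listen to: ${title}`);
  return [
    `<figure>`,
    `  <figcaption>${label}</figcaption>`,
    // Native controls are keyboard operable; preload="none" avoids
    // downloading the MP3 until the user presses play.
    `  <audio controls preload="none" src="${escapeHtml(audioUrl)}" aria-label="${label}"></audio>`,
    // The source article doubles as the text alternative.
    `  <p><a href="${escapeHtml(transcriptUrl)}">Read the text version</a></p>`,
    `</figure>`,
  ].join("\n");
}
```

Native controls do not expose playback speed consistently across browsers, so a custom speed control would set the element's playbackRate property from client-side script.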
Understanding Token Costs
Cosmic AI audio generation uses tokens as the unit of usage. In general, 1 token equals roughly 1 word or 4 characters. Audio generation tokens are counted as output tokens, which reflect the computational requirements of producing speech.
You can monitor your AI token usage from the Usage section of your project settings. For details on plan limits and token add-ons, visit the pricing page.
To optimize costs:
- Use the standard model (tts-1) for routine content
- Reserve HD (tts-1-hd) for high-visibility or premium content
- Skip audio generation for short-form content like product updates or changelogs
- Track usage patterns monthly and adjust your generation strategy accordingly
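For rough budgeting, the heuristics in this guide (about 4 characters per token, and a 2x token cost for HD) can be turned into a quick estimator. This is a back-of-the-envelope sketch, not Cosmic's billing formula; check the Usage section for actual consumption.

```typescript
// Heuristics taken from this guide; actual metering may differ.
const CHARS_PER_TOKEN = 4;
const HD_MULTIPLIER = 2;

function estimateAudioTokens(text: string, model: "tts-1" | "tts-1-hd" = "tts-1"): number {
  // Round up so short texts never estimate to zero tokens.
  const base = Math.ceil(text.length / CHARS_PER_TOKEN);
  return model === "tts-1-hd" ? base * HD_MULTIPLIER : base;
}
```

Under these assumptions, a 4,000-character article estimates to 1,000 tokens on the standard model and 2,000 on HD.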
Putting It All Together
Voice-enabled content transforms how users interact with your applications. With Cosmic's built-in AI audio generation, you do not need to manage separate TTS API keys, handle file storage, or configure CDN delivery. Everything is handled within the platform.
Start by generating a few audio files from the dashboard to test voice options. Then integrate the SDK into your publish workflow for automated generation. For fully autonomous audio production, set up an event-triggered agent that generates audio every time new content goes live.
Your content becomes instantly more accessible, engaging, and versatile with just a few lines of code. Log in to your Cosmic account and start generating audio today.
Resources
- Introducing AI Audio Generation - Announcement blog post
- AI API Reference - Full API documentation for audio generation
- AI Dashboard Docs - Dashboard usage guide
- API Reference - Complete Cosmic API documentation