OpenAI’s New Model in the Competitive AI Voice Market
OpenAI has introduced gpt-realtime, a new model aimed at the increasingly competitive market for enterprise AI voice technology. The model is designed to follow complex instructions and produce voices that sound more natural and expressive. As demand for voice AI grows—particularly in applications like customer service calls and real-time translation—realistic-sounding AI voices that also meet enterprise-grade security requirements are becoming more critical.
Advancements in Voice Technology
OpenAI asserts that gpt-realtime offers a more human-like voice, but it faces stiff competition from companies such as ElevenLabs. The new model will be accessible through the Realtime API, which has also been made generally available. Alongside gpt-realtime, OpenAI has introduced new voices named Cedar and Marin, and it has updated its existing voices to be compatible with this latest model. During a livestream, OpenAI highlighted that it collaborated with customers developing voice applications to train gpt-realtime, aligning it with evaluations based on real-world scenarios like customer support and academic tutoring.
Features of gpt-realtime
OpenAI has emphasized the model’s capacity to generate emotive, natural-sounding voices that align well with developers’ needs. Operating within a speech-to-speech framework, gpt-realtime can comprehend spoken prompts and respond vocally. This capability makes it ideally suited for real-time interactions, such as when a customer calls a service platform to return products and engages with an AI voice assistant that responds as if it were a human.
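As a rough illustration of how a developer might steer such a speech-to-speech session, here is a minimal sketch of a session configuration event for the Realtime API. The event shape follows the publicly documented `session.update` pattern, but the specific field values (voice IDs, instruction wording) are assumptions for illustration; check the current API reference before relying on them.

```python
import json

def build_session_update(voice: str, instructions: str) -> str:
    # Construct a "session.update" event that configures the voice agent.
    # Field names follow the documented Realtime API event shape; the
    # voice IDs ("marin", "cedar") correspond to the newly announced voices.
    event = {
        "type": "session.update",
        "session": {
            "voice": voice,                # e.g. the new "marin" or "cedar" voices
            "instructions": instructions,  # steer tone, accent, and pacing
            "modalities": ["audio", "text"],
        },
    }
    return json.dumps(event)

payload = build_session_update(
    voice="marin",
    instructions="Speak warmly and patiently, as a returns-desk agent would.",
)
print(payload)
```

In a live integration, this JSON would be sent over the API's WebSocket connection before audio streaming begins.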
During the livestream, T-Mobile showcased an AI voice-powered agent that assists customers in finding new phones, while Zillow presented an agent that helps users narrow down neighborhoods to discover their ideal homes.
Competitive Landscape
OpenAI claims that gpt-realtime is its “most advanced, production-ready voice model.” Like its predecessors, it can switch languages mid-sentence, and it can now follow more intricate instructions, such as “speak emphatically in a French accent.” However, it must compete with models already in use by various brands. ElevenLabs launched Conversational AI 2.0 in May, and SoundHound has partnered with fast-food chains for AI voice-driven drive-thrus. Additionally, the startup Hume has introduced its EVI 3 model, enabling users to create AI versions of their own voices.
As enterprises explore diverse use cases for voice AI, other general model providers offering multimodal large language models (LLMs) are also gaining traction. Mistral has released its Voxtral model, which is touted for its effectiveness in real-time translation, while Google is enhancing its audio capabilities, notably with an audio feature on NotebookLM that converts research notes into podcasts.
Performance and Features
OpenAI has stated that gpt-realtime is smarter and better at understanding native audio, including recognizing non-verbal cues like laughter or sighs. Benchmarking with the Big Bench Audio evaluation revealed that the model achieved an accuracy score of 82.8%, a significant improvement over its predecessor’s score of 65.6%. However, OpenAI has not released comparative performance data against competitive models.
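For context, the reported Big Bench Audio scores correspond to roughly a 17-point absolute gain, or about a 26% relative improvement over the predecessor:

```python
# Quick check of the Big Bench Audio gain reported above.
old_score, new_score = 65.6, 82.8
absolute_gain = new_score - old_score            # in percentage points
relative_gain = absolute_gain / old_score * 100  # as a percentage of the old score
print(f"{absolute_gain:.1f} points absolute, {relative_gain:.1f}% relative")
```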
The focus for OpenAI has been on enhancing the model’s ability to follow instructions effectively. It scored 30.5% on the MultiChallenge audio benchmark. Additionally, engineers have improved function calling so that gpt-realtime can access the appropriate tools.
To further support the new model and ease the integration of real-time AI capabilities into enterprise applications, OpenAI has added several features to the Realtime API. It now supports the Model Context Protocol (MCP) and accepts image inputs, allowing the model to give users real-time information about their visual surroundings—a capability Google highlighted during its Project Astra presentation last year. The Realtime API can also handle the Session Initiation Protocol (SIP), which connects applications to telephony systems and expands contact center use cases, and users can save and reuse prompts on the API. Initial feedback on the model has been positive, although it is still early days for this recently launched technology.
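To make the image-input support described above concrete, here is a hedged sketch of the event a client might send to attach an image to a conversation. The event shape follows the documented `conversation.item.create` pattern for the Realtime API, but the exact content-part field names are assumptions for illustration and should be verified against the current API reference.

```python
import json

def build_image_item(image_data_url: str, prompt: str) -> str:
    # Construct a "conversation.item.create" event carrying both a text
    # prompt and an image, so the model can describe what it "sees".
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_text", "text": prompt},
                {"type": "input_image", "image_url": image_data_url},
            ],
        },
    }
    return json.dumps(event)

event_json = build_image_item(
    "data:image/png;base64,<encoded-image>",  # placeholder, not a real image
    "What do you see in this frame?",
)
print(event_json)
```

As with the session configuration, this JSON would be sent over the API's realtime connection, after which the model can answer questions about the attached image by voice.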