OpenAI has announced a delay in the rollout of its highly anticipated advanced voice and video features for the GPT-4o model, citing the need for further technical and safety improvements. The capabilities, a centerpiece of the company’s “Spring Update” event in May, had initially been slated for a limited alpha release to ChatGPT Plus subscribers in the weeks following the announcement.
In a statement, the company explained that it needs more time to refine the user experience and bolster safety measures. “We’re delaying the launch of advanced voice and video capabilities to all users,” OpenAI stated. “We need more time to get these features right, including making our models more robust and improving the user experience.” Specifically, the company is working to improve the model’s ability to detect and refuse certain content in real-time conversations, and it is continuing to build out the infrastructure needed to serve these features to millions of users.
The original demonstration of the “Advanced Voice Mode” showcased an AI capable of real-time, emotionally nuanced conversation, with the ability to perceive user emotion and interrupt or be interrupted naturally. This generated significant excitement but also raised questions about the potential for misuse and the technical challenges of deploying such a system at scale.
While the advanced voice and video features are now tentatively scheduled for a fall release, GPT-4o’s existing text and image capabilities are unaffected and remain widely available to ChatGPT users. The cautious approach signals a growing awareness within the AI industry of how difficult it is to move from controlled demos to robust, safe, and reliable public-facing products. The delay also gives competitors, such as Google with Project Astra and Apple with its upcoming Siri overhaul, a window to catch up or refine their own multimodal offerings.