In the world of AI, speed isn’t just a feature – it’s the foundation of a seamless user experience. As we move from simple single shot queries to complex multi step “agentic workflows”, the traditional architecture of API communication is showing its age.
For an AI agent to perform a complex task, it engages in a continuous loop of thinking, acting, and observing. In a standard REST API model, each step in this loop is a separate, stateless transaction. This means setting up a new connection, re authenticating, and most critically resending the entire conversational history and context with every single call. The result is significant latency and computational overhead that can break the flow of interaction.
OpenAI is introducing WebSockets into its Responses API, establishing a persistent bidirectional channel. Once the connection is open, data can flow freely without the repetitive overhead of new connections.
With a persistent WebSocket, OpenAI’s infrastructure can cache critical context – such as the system prompt, user instructions, and recent conversation history – on the server for the duration of the session. Applications only need to send new information, dramatically reducing payload size and latency.
This shift from stateless to stateful communication unlocks new possibilities:
OpenAI’s internal benchmarks on the new Agentic Execution Engine show a staggering reduction in API overhead and a significant drop in end to end model latency, paving the way for agents that feel truly responsive.
For a deeper technical dive and performance metrics, read the original post: