Speeding up agentic workflows with WebSockets in the Responses API

Agent-based systems have transformed how we think about automation in the cloud. Instead of rigid workflows, agents can reason through problems, take actions, and adapt based on results. But there’s a catch: traditional REST APIs introduce latency overhead that compounds with each agent step. OpenAI’s recent work on WebSocket support in the Responses API tackles this head-on by maintaining persistent connections and caching context across agent loops. For teams building autonomous systems—whether that’s code generation pipelines, customer service agents, or data processing workflows—this optimization can mean the difference between a system that feels responsive and one that feels sluggish.

Here’s what’s actually happening under the hood. In a typical agentic workflow, the agent loops repeatedly: it receives context, calls the language model, gets a response, processes that response (often calling external tools or APIs), and then loops back with updated context. Each of these model calls previously required establishing a new HTTP connection, serializing the entire conversation history, and waiting for the response. With WebSockets, you maintain a single persistent connection that stays open across multiple requests. More importantly, the Responses API now supports connection-scoped caching, which means repeated context—like system prompts, tool definitions, or large reference documents—doesn’t need to be retransmitted on every call. This is similar to how your browser caches static assets, but applied to AI model inputs. The model can reference cached content by ID, dramatically reducing the amount of data traveling over the wire.
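The caching pattern described above can be sketched in plain Python. The message shapes, the `cached_ref`/`cache_id` fields, and the ID scheme below are illustrative assumptions about how connection-scoped caching might look on the client side, not the actual Responses API wire format:

```python
import hashlib
import json

class ContextCache:
    """Client-side sketch of a hypothetical connection-scoped cache.

    Static context (system prompt, tool definitions) is transmitted once
    per connection and afterwards referenced by ID; dynamic context is
    always sent in full.
    """

    def __init__(self):
        self._ids = {}  # content hash -> cache ID assigned on first send

    def build_payload(self, static_parts, dynamic_parts):
        blocks = []
        for part in static_parts:
            key = hashlib.sha256(part.encode()).hexdigest()
            if key in self._ids:
                # Already cached on this connection: send a tiny reference.
                blocks.append({"cached_ref": self._ids[key]})
            else:
                # First use: transmit the content and remember its ID.
                cache_id = f"cache_{len(self._ids)}"
                self._ids[key] = cache_id
                blocks.append({"content": part, "cache_id": cache_id})
        # Dynamic context changes every loop, so it is retransmitted.
        blocks.extend({"content": part} for part in dynamic_parts)
        return json.dumps(blocks)

cache = ContextCache()
first = cache.build_payload(["You are a code-fixing agent."], ["step 1 output"])
second = cache.build_payload(["You are a code-fixing agent."], ["step 2 output"])
# On the second loop the system prompt collapses to a short reference,
# so the payload shrinks even though the agent logic is unchanged.
```

The key property is that the cache lives with the connection: as long as the WebSocket stays open, every subsequent loop pays only for what changed.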

The practical impact becomes obvious when you do the math. Consider a code generation agent that iterates 10 times, each time sending 50KB of context. With traditional REST calls, that’s 500KB of data plus 10 connection handshakes. With WebSocket caching, you send that context once, cache it, and subsequent calls reference it by ID—maybe an extra 100 bytes each. For teams running thousands of agent workflows daily, this translates to lower API costs, reduced bandwidth consumption, and faster feedback loops for users waiting on agent results. Real-world examples include bug-fixing agents that need to reference large codebases repeatedly, customer support bots that maintain conversation context across multiple tool calls, and data analysis agents that reference the same datasets throughout their reasoning process. The latency improvements are especially noticeable in interactive applications where users are waiting for agent results in real-time.
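The arithmetic above is easy to verify directly. The 100-byte figure for a cache reference is the article's own rough estimate:

```python
# Back-of-the-envelope comparison for the scenario in the text:
# 10 agent loops, 50 KB of context each, vs. a ~100-byte cache reference.
LOOPS = 10
CONTEXT_BYTES = 50 * 1024   # 50 KB of context per loop
REF_BYTES = 100             # assumed size of a cache reference

rest_total = LOOPS * CONTEXT_BYTES                    # retransmit every loop
ws_total = CONTEXT_BYTES + (LOOPS - 1) * REF_BYTES    # send once, then reference

print(f"REST: {rest_total / 1024:.0f} KB")
print(f"WebSocket + cache: {ws_total / 1024:.1f} KB")
print(f"Reduction: {1 - ws_total / rest_total:.0%}")
```

Roughly a 90% cut in context bytes for this workload, and the savings grow with loop count since each extra loop adds 50 KB in the REST case but only ~100 bytes in the cached case.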

To start experimenting with this, you’ll need to understand WebSocket basics (connection lifecycle, message framing) and how to structure your agent logic to take advantage of caching. If you’re already using OpenAI’s Python SDK or REST APIs, the transition is straightforward: you’re swapping the connection mechanism while your agent loop logic remains largely the same. The key is identifying which parts of your context are static (system prompts, tool definitions) versus dynamic (conversation history, current data), so you can leverage caching effectively. Start by profiling your current agent workflows—measure how much context you’re sending per loop and how many loops run in a typical task. If you’re sending kilobytes of repeated data, WebSockets with connection-scoped caching will likely deliver meaningful improvements. This is one of those infrastructure optimizations that doesn’t require rearchitecting your agent logic, but can noticeably improve performance and cost efficiency.
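The profiling step suggested above can be done with a few lines of Python, assuming you can log the context strings your agent sends on each loop. The trace below is hypothetical; parts that recur on every loop are your caching candidates:

```python
from collections import Counter

def profile_context(loops):
    """Classify context parts as static (repeated every loop) or dynamic.

    `loops` is a list of lists: the context strings sent on each agent loop.
    Returns how much of the total traffic is repeated static content.
    """
    counts = Counter(part for loop in loops for part in set(loop))
    static = {p for p, n in counts.items() if n == len(loops)}
    static_bytes = sum(len(p.encode()) for p in static)
    total_bytes = sum(len(p.encode()) for loop in loops for p in loop)
    repeated_bytes = static_bytes * len(loops)
    return {
        "static_parts": len(static),
        "repeated_bytes": repeated_bytes,
        "share_of_traffic": repeated_bytes / total_bytes,
    }

# Hypothetical trace: the same system prompt and tool schema on every
# loop, plus a small dynamic tail that changes each time.
trace = [
    ["SYSTEM_PROMPT" * 100, "TOOL_SCHEMA" * 50, "user: fix the bug"],
    ["SYSTEM_PROMPT" * 100, "TOOL_SCHEMA" * 50, "tool result: tests fail"],
    ["SYSTEM_PROMPT" * 100, "TOOL_SCHEMA" * 50, "tool result: tests pass"],
]
report = profile_context(trace)
```

If `share_of_traffic` comes back high, as it does for this trace, connection-scoped caching is likely to pay off; if most of your context is dynamic, the gains will be limited to the connection-handshake savings.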

Source: OpenAI News