![[../assets/cua_cartoon.webp]]
**DALL-E Prompt** - *a anthropomorphised computer pressing keys on its own keyboard , wide aspect ratio , crayon style*
When *ChatGPT* was first released, interest in AI grew significantly, even amongst those who were not technology enthusiasts. ChatGPT showed the average internet user how easy it was to tap into the vast knowledge of the internet with the help of artificial intelligence.
With the release of *Operator*, OpenAI is once again looking to reinforce this idea of "democratised AI". *Operator* is a new product that lets users tap into *agentic AI*. An "agent" in the real world is a specialised professional who understands an area well enough to help clients reach their goals, taking them through the process step by step. In the AI world, an agent does the same thing, but it's not human.
*Operator* is built on top of the [computer using agent (CUA)](https://openai.com/index/computer-using-agent/), which was also developed by the team at OpenAI. In the remainder of this short post, I will break down what I learned from the official video release of the tool. Unfortunately, *Operator* is not available in the UK, so I do not have a hands-on demo. If you are interested in the video, here it is.
<iframe width="700" height="350" src="https://www.youtube.com/embed/CSE77wAdDLg" title="Introduction to Operator & Agents" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
### What is CUA?
CUA stands for *Computer Using Agent*, and it is what it sounds like. It is a [model that is able to perform tasks on your computer or web browser](https://openai.com/index/computer-using-agent/) like a human being. It marries together GPT-4o's vision capabilities with reinforcement learning and interacts with graphical user interfaces. Below is an illustration of how this works.
![[../assets/cua_working.png]]
[Source](https://openai.com/index/computer-using-agent/)
The agent is *multimodal*, which means that it can take multiple kinds of input: here, both text and images. Everything on a computer screen is basically a *bunch of pixels*. The agent interprets this "raw" pixel data in light of the "instructions" provided to it. It then works *iteratively*, drawing on the reasoning it learned through reinforcement learning and looping several times across 3 key steps:
1. Perception - Understand what is currently displayed on the monitor via screenshots
2. Reasoning - Use *chain of thought* and reason what the next steps could be to get to the main goal
3. Action - Perform the action on the screen with the help of a virtual mouse and keyboard
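The three steps above can be sketched as a simple loop. This is only a toy illustration and not OpenAI's implementation: `FakeScreen`, `reason`, `act` and `run_agent` are all hypothetical names, the "screenshot" is a small dictionary rather than real pixels, and the "reasoning" is a hard-coded rule standing in for the model's chain of thought.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a display: one text field and a submit button."""
    text: str = ""
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real agent would receive raw pixels; a dict keeps the sketch small.
        return {"text": self.text, "submitted": self.submitted}


def reason(goal: str, observation: dict) -> Tuple[str, str]:
    """Toy replacement for chain of thought: decide the next action."""
    if observation["submitted"]:
        return ("done", "")
    typed = observation["text"]
    if typed != goal:
        # Type the next missing character of the goal text.
        return ("type", goal[len(typed)])
    return ("click_submit", "")


def act(screen: FakeScreen, action: str, argument: str) -> None:
    """Apply the chosen action through a pretend keyboard and mouse."""
    if action == "type":
        screen.text += argument
    elif action == "click_submit":
        screen.submitted = True


def run_agent(goal: str, screen: FakeScreen, max_steps: int = 50) -> List[str]:
    history = []
    for _ in range(max_steps):
        observation = screen.screenshot()              # 1. Perception
        action, argument = reason(goal, observation)   # 2. Reasoning
        if action == "done":
            break
        act(screen, action, argument)                  # 3. Action
        history.append(action)
    return history
```

Calling `run_agent("cats", FakeScreen())` types the four characters one at a time and then clicks submit. The `max_steps` cap is a design choice of this sketch, so that a confused agent cannot loop forever.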
CUA has been evaluated on two areas of operation: *Browser Use* and *Computer Use*. The former tests the agent's ability to use a browser alone, while the latter tests its ability to fully operate an operating system (Windows, macOS, Ubuntu). While CUA achieves [*state of the art (SOTA)*](https://cdn.openai.com/cua/CUA_eval_extra_information.pdf) performance, it is still a long way off human ability.
![[../assets/cua_eval.png]]
[Source](https://openai.com/index/computer-using-agent/)
An example of browser use.
![[../assets/cua_browser_use.png]]
An example of computer use.
![[../assets/cua_computer_use.png]]
OpenAI also provides high-level categories describing what CUA does well and where it falls short. Based on these, what it does most reliably are tasks carried out through repeated simple UI actions. If you have done a bit of web automation, either for testing an application or for scraping data off dynamic websites, you will know that existing tools like Selenium already handle browser automation well. However, CUA goes a step further and works through the UI like a human being, by perceiving the visual elements on the screen rather than the underlying code that creates those elements in the first place.
As with all OpenAI products, a comprehensive security and risk assessment accompanies the release of CUA. The risks that OpenAI tested against are:
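To make the difference between reading the code and reading the screen concrete, here is a minimal sketch using only the Python standard library. It does not use Selenium or CUA. `find_in_markup` looks a button up by its `id` in the HTML, the way code-based automation would, while `find_on_screen` locates the same button in a tiny made-up grid of "pixels" by its colour. All names and the pixel grid are assumptions for illustration.

```python
from html.parser import HTMLParser


class ButtonFinder(HTMLParser):
    """Code-based lookup: search the markup for a button with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "button" and dict(attrs).get("id") == self.target_id:
            self.found = True


def find_in_markup(html, target_id):
    finder = ButtonFinder(target_id)
    finder.feed(html)
    return finder.found


def find_on_screen(pixels, button_colour):
    """Pixel-based lookup: centre (row, col) of the matching region, or None."""
    hits = [
        (r, c)
        for r, row in enumerate(pixels)
        for c, value in enumerate(row)
        if value == button_colour
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    # The centre of the bounding box is where a virtual mouse would click.
    return ((min(rows) + max(rows)) // 2, (min(cols) + max(cols)) // 2)


page = '<form><button id="submit">Go</button></form>'
screen = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
```

The markup lookup breaks if the site renames the `id`, whereas the pixel lookup only depends on what is drawn. A real visual agent is of course far more sophisticated than matching a single colour value.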
1. Misuse - What to do when the model is misused
2. Model mistake - What to do when the model mistakes user intent
3. Adversarial attacks on websites - What to do if the model is "hacked"
### Thoughts on the Operator
CUA powers the *Operator*. It's an agent that perceives the interface visually, much like a human, interpreting on-screen elements to make decisions. For a given prompt, it spins up a virtual browser and walks you through what it is doing to perform the task. One of the most interesting aspects of _Operator_ is its **transparency**. As a user, you can actually observe its "thought process"—a significant improvement in making AI decision-making more understandable. If necessary, humans can step in to take control, creating a _human-in-the-loop_ interaction that ensures oversight.
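The human-in-the-loop idea can be sketched as an approval gate. This is a hypothetical illustration rather than how *Operator* is built: `SENSITIVE_ACTIONS`, `run_with_oversight` and the `approve` callback are invented names, and the action names are made up.

```python
from typing import Callable, List

# Hypothetical set of actions that should never run without a human's consent.
SENSITIVE_ACTIONS = {"pay", "send_email"}


def run_with_oversight(
    planned_actions: List[str], approve: Callable[[str], bool]
) -> List[str]:
    """Execute actions in order, pausing for approval on sensitive ones."""
    executed = []
    for action in planned_actions:
        if action in SENSITIVE_ACTIONS and not approve(action):
            # The human declined: stop here and hand control back to them.
            break
        executed.append(action)
    return executed
```

For example, `run_with_oversight(["search", "add_to_cart", "pay"], lambda action: False)` stops before the payment step, leaving the user to finish the task themselves.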
Currently, *Operator* is available as a "research preview" only to ChatGPT Pro users ($200 per month), and only in the USA.
One gotcha, however, is that the websites or apps being used need to have the right kind of APIs that OpenAI can interface with. Without these, the agent's capabilities may be restricted.
While I do see utility in *Operator*, I am curious about some open questions that came to mind as I wrote this. Be warned: some of these might seem wildly speculative!
1. **How does Operator recommend websites if no preference is selected?**
   - Does it suggest results based on proximity, popularity, or would OpenAI allow businesses to pay for better placement in the future?
   - Would it depend on the search engine used by default? Wait a minute, what search engine would it even use? Or can you set preferences?
2. **Is this the beginning of the 'death of browsing using web UIs'?**
   - If the CUA can visually interpret and interact with websites like a human, wouldn’t it be simpler to have the agent interact directly with business APIs? Over time, could this shift diminish the need for traditional web browsing, making internet navigation an AI-driven backend process rather than a user-driven activity?
3. **How do we ensure trust in an agentic system?**
   - Since humans still need to oversee AI actions to build trust, a fully autonomous browsing experience might not be feasible anytime soon.
---
I am sure a lot more questions will arise over time around GUI-understanding agents and how they could bring about a seismic shift in the way the Internet is used. While their potential is vast, they also raise important questions about security, transparency, and the evolution of internet usage. Could this signal the gradual obsolescence of traditional browsing as AI-driven interactions become the norm? Or will the need for human oversight and trust act as a natural barrier to full automation?
One useful lens to consider this through is the [**Lindy Effect**](https://modelthinkers.com/mental-model/the-lindy-effect), which suggests that the longer something has existed, the longer it is likely to persist. Browsing interfaces, in one form or another, have been a fundamental part of the internet experience for decades. If the Lindy Effect holds, human-facing web interfaces may not disappear anytime soon, despite the push toward more API-based interactions. Instead, we may see a hybrid future—one where AI-driven agents like _Operator_ assist users but do not entirely replace traditional web browsing.