There is much excitement, and justifiably so, around OpenAI’s announcement of “Operator”, its first agent. So, what is Operator? Operator is an AI agent that can autonomously perform tasks on the web. It interacts with web pages by looking at screenshots and by typing, clicking, and scrolling, just as humans do. It can handle tasks such as buying groceries, booking travel, ordering an Uber, buying tickets for sporting events, and many others. People are curious about how Operator works and what it may mean for the future. Here is our take:
- No Website APIs needed: Operator “sees” screenshots and uses the keyboard and mouse just like a human. It doesn't need to check whether a website offers an API or which functions that API exposes; it simply navigates to the site by its URL.
- CUA Model: Operator is powered by Computer-Using Agent (CUA), a model that combines GPT-4o’s vision capabilities with advanced reasoning. CUA is trained to interact with graphical user interfaces (GUIs), meaning the buttons, menus, and text fields people see on a screen. This lets it perform digital tasks without relying on OS-specific or web-specific APIs. Operator also shows a summary of its chain of thought to explain its planning steps.
- Remote browser: Operator runs its tasks in a remote browser hosted in the cloud, not in a browser on the user's own device.
- Human in the Loop: At any point, the user can take over control from Operator. If Operator needs additional information, it asks the user for input. For example, when buying tickets for an event, the user can select “Take Control” to log in to the website and complete the payment.
- Safety/Risk Mitigation: Operator has safety mechanisms built in. For example, if a user tries to buy something illegal, it refuses to complete the task. To guard against model mistakes, Operator asks the user for confirmation before “impactful” actions such as booking a reservation. It also monitors for prompt injection to protect against malicious or fake websites.
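To make the observe, decide, act cycle described above a little more concrete, here is a minimal Python sketch of a generic screenshot-driven agent loop with a human in the loop. It is not OpenAI's implementation: `run_agent`, `Action`, and all callback names are hypothetical, and `decide` is a stand-in for a vision-plus-reasoning model like CUA. The sketch only shows where the behaviors from the list (screenshots instead of APIs, asking the user for input, confirming impactful actions) could sit in such a loop.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str                 # "click", "type", "scroll", "ask_user", or "done"
    target: str = ""          # hypothetical: description of the on-screen element
    text: str = ""            # text to type, or a question for the user
    impactful: bool = False   # e.g. placing an order or booking a reservation


def run_agent(
    task: str,
    take_screenshot: Callable[[], bytes],
    decide: Callable[[str, bytes, List[str]], Action],
    perform: Callable[[Action], None],
    ask_user: Callable[[str], str],
    max_steps: int = 20,
) -> List[str]:
    """Observe the screen, pick one GUI action, act, and repeat (hypothetical loop)."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                  # observe: pixels only, no site API
        action = decide(task, screenshot, history)      # stub for a vision + reasoning model
        if action.kind == "done":
            history.append("done")
            return history
        if action.kind == "ask_user":                   # agent needs more information
            history.append(f"user said: {ask_user(action.text)}")
            continue
        if action.impactful:                            # confirm before impactful steps
            answer = ask_user(f"Confirm {action.kind} on {action.target}?")
            if answer.strip().lower() != "yes":
                history.append(f"stopped: user declined {action.kind} {action.target}")
                return history
        perform(action)                                 # act: click, type, or scroll
        history.append(f"{action.kind} {action.target}")
    history.append("step limit reached")
    return history


if __name__ == "__main__":
    # A scripted "model" stands in for the real one so the loop can run offline.
    script = iter([
        Action("click", "search box"),
        Action("type", "search box", text="2 tickets"),
        Action("click", "Buy button", impactful=True),
        Action("done"),
    ])
    log = run_agent(
        task="buy two tickets",
        take_screenshot=lambda: b"",                     # placeholder screenshot
        decide=lambda task, shot, history: next(script),
        perform=lambda action: None,                     # a real agent would drive the browser here
        ask_user=lambda prompt: "yes",
    )
    print(log)  # ['click search box', 'type search box', 'click Buy button', 'done']
```

Keeping the model behind a single `decide` callback means the loop itself knows nothing about any particular website; everything it learns comes from the screenshot, which mirrors the "no website APIs needed" idea.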
An interesting question comes to mind: will websites now be designed with AI agents in mind? What would that design look like? Would much more of the Internet become accessible to digital agents without the need for website APIs? As OpenAI has stated, Operator is still a research preview and may make mistakes. But it is the beginning of an exciting era.
We at Unvired are super excited about the endless possibilities for AI agents like Operator to help us reimagine business processes and transform our customers' businesses. And, of course, we are equally interested in how agents like Operator can take care of our everyday chores so that we can be more productive. Indeed, Operator is a “Smooth” Operator.