Monday, August 18, 2025

Salesforce CoAct-1 AI Agents Beat Traditional Automation: New Code-Writing System Achieves 85% Higher Task Completion Rates

Innovative Technique for Computer-Use Agents

Researchers from Salesforce and the University of Southern California have developed a technique that lets computer-use agents execute code while navigating graphical user interfaces (GUIs). Agents can write scripts in addition to moving the cursor and clicking buttons, combining the strengths of both approaches to speed up workflows and reduce errors.

The system, called CoAct-1, sets a new state of the art in agent performance, outperforming previous methods while requiring significantly fewer steps to complete complex computer tasks. This could enable more robust and scalable agent automation for real-world applications.

Limitations of Current GUI-Based Agents

Typically, computer-use agents rely on vision-language and vision-language-action models (VLMs or VLAs) to interpret screens and perform actions, mimicking human interaction with a mouse and keyboard. While these GUI-based agents can handle a variety of tasks, they often struggle with lengthy and complex workflows, particularly in applications with intricate menus and options, such as office productivity suites.

For instance, a task requiring the identification of a specific table in a spreadsheet, filtering it, and saving it as a new file can involve a lengthy sequence of precise GUI manipulations. This is where issues of reliability arise. The researchers note, “In these scenarios, existing agents frequently struggle with visual grounding ambiguity (e.g., distinguishing between visually similar icons or menu items) and the compounded probability of making any single error over the long horizon. A single mis-click or misunderstood UI element can derail the entire task.”
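The compounding effect the researchers describe can be illustrated with a little arithmetic. The sketch below assumes, for simplicity, that every GUI step succeeds with the same probability and that steps fail independently; the 95% per-step figure and the step counts are illustrative values, not numbers from the paper.

```python
def task_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance that every step succeeds, assuming steps fail independently."""
    return per_step_accuracy ** num_steps


# Illustrative values only: a fixed 95% per-step accuracy at several task lengths.
for steps in (5, 15, 50):
    probability = task_success_probability(0.95, steps)
    print(f"{steps} steps: {probability:.1%} chance of finishing without an error")
```

Under these assumptions, a 5-step task finishes cleanly about 77% of the time, while a 50-step task does so less than 8% of the time, which is why cutting the number of steps matters as much as improving each one.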

Addressing GUI Limitations

To tackle these challenges, many researchers have sought to enhance GUI agents with high-level planners. These systems employ powerful reasoning models like OpenAI’s o3 to break down a user’s overarching goal into smaller, more manageable subtasks. While this structured approach boosts performance, it does not resolve the inherent difficulties of navigating menus and clicking buttons, especially for operations better suited to direct coding.

To overcome these limitations, the researchers developed CoAct-1 (Computer-using Agent with Coding as Actions), a system designed to merge the intuitive, human-like strengths of GUI manipulation with the precision, reliability, and efficiency of direct system interactions through code.

The CoAct-1 Framework

CoAct-1 is structured as a collaborative team of three specialized agents: an Orchestrator, a Programmer, and a GUI Operator. The Orchestrator serves as the central planner, analyzing the user’s overall goal, breaking it down into subtasks, and assigning each subtask to the most suitable agent.

For backend operations like file management or data processing, the Orchestrator delegates tasks to the Programmer, which writes and executes Python or Bash scripts. For frontend tasks that necessitate button clicks or visual navigation, it turns to the GUI Operator, a VLM-based agent. This strategic delegation enables CoAct-1 to bypass inefficient GUI sequences in favor of robust, single-shot code execution when appropriate, while still utilizing visual interactions for tasks where they are essential.
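To make the contrast concrete, here is a minimal sketch of the kind of single-shot script a Programmer-style agent might write for the spreadsheet example mentioned earlier (filter a table, save the result as a new file). It is not code from CoAct-1: the function name `filter_rows`, the column names, and the sample data are hypothetical, and in-memory text buffers stand in for real files.

```python
import csv
import io


def filter_rows(src, dst, column, value):
    """Copy rows whose `column` equals `value` from src to dst; return the count kept."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] == value:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical data standing in for a spreadsheet exported as CSV.
source = io.StringIO("region,amount\nEMEA,10\nAPAC,20\nEMEA,30\n")
result = io.StringIO()
kept = filter_rows(source, result, "region", "EMEA")
print(f"kept {kept} rows")
print(result.getvalue())
```

One function call replaces what would otherwise be a chain of menu selections, filter dialogs, and save prompts, each of which is a chance for a mis-click.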

The workflow is iterative. After the Programmer or GUI Operator completes a subtask, it sends a summary and a screenshot of the current system state back to the Orchestrator, which then decides whether to issue the next subtask or conclude the task. The Programmer uses a large language model (LLM) to generate code and sends it to a code interpreter, testing and refining it over multiple rounds. Similarly, the GUI Operator sends its commands (e.g., mouse clicks, typing) to an action interpreter, which executes them and returns the resulting screenshot so the operator can assess the outcome of its actions.
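The loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers' implementation: the names `Report`, `run_task`, and `scripted_orchestrator` are hypothetical, both workers are stubs, and the planner walks a fixed plan where a real Orchestrator would consult a reasoning model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Report:
    """What a worker sends back: a text summary plus a screenshot reference."""
    summary: str
    screenshot: str  # placeholder string; a real system would carry image data


Worker = Callable[[str], Report]
Decision = Optional[Tuple[str, str]]  # (agent name, subtask), or None to stop


def programmer(subtask: str) -> Report:
    # Stub: a real Programmer would generate and run Python or Bash.
    return Report(f"ran script for: {subtask}", "screen-after-script")


def gui_operator(subtask: str) -> Report:
    # Stub: a real GUI Operator would click and type under VLM guidance.
    return Report(f"clicked through: {subtask}", "screen-after-clicks")


def scripted_orchestrator(plan: List[Tuple[str, str]]) -> Callable[[List[Report]], Decision]:
    """Stub planner that walks a fixed plan, one subtask per report received."""
    def decide(history: List[Report]) -> Decision:
        return plan[len(history)] if len(history) < len(plan) else None
    return decide


def run_task(decide: Callable[[List[Report]], Decision],
             workers: Dict[str, Worker],
             max_steps: int = 10) -> List[Report]:
    history: List[Report] = []
    for _ in range(max_steps):
        step = decide(history)      # Orchestrator reviews the reports so far
        if step is None:            # Orchestrator concludes the task
            break
        agent_name, subtask = step
        history.append(workers[agent_name](subtask))  # delegate, collect report
    return history


workers = {"programmer": programmer, "gui_operator": gui_operator}
plan = [("programmer", "filter the table"),
        ("gui_operator", "confirm the export dialog")]
history = run_task(scripted_orchestrator(plan), workers)
for report in history:
    print(report.summary)
```

The `max_steps` cap is an added safeguard in this sketch so that a planner that never returns a stop decision cannot loop forever.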

Testing CoAct-1

The researchers evaluated CoAct-1 on OSWorld, a comprehensive benchmark featuring 369 real-world tasks across browsers, integrated development environments (IDEs), and office applications. CoAct-1 sets a new state of the art on the benchmark, with a success rate of 60.76%.
