In this tutorial, we build an advanced computer-use agent from scratch that can reason, plan, and perform virtual actions using a local open-weight model. We create a miniature simulated desktop, equip it with a tool interface, and design an intelligent agent that can analyze its environment, decide on actions like clicking or typing, and execute them step by step. By the end, we see how the agent interprets goals such as opening emails or taking notes, demonstrating how a local language model can mimic interactive reasoning and task execution.
# Notebook shell command (Colab/Jupyter); run it in a cell, not in a plain .py file.
!pip install -q transformers accelerate sentencepiece nest_asyncio
import torch, asyncio, uuid
from transformers import pipeline
import nest_asyncio
# Allow nested event loops, since the notebook already runs one.
nest_asyncio.apply()

We set up the environment by installing essential libraries such as Transformers, Accelerate, and Nest Asyncio, which let us run local models and asynchronous tasks seamlessly in Colab. We prepare the runtime so that the upcoming components of our agent can work without calling any external API.
class LocalLLM:
    """Thin wrapper around a local text2text pipeline."""
    def __init__(self, model_name="google/flan-t5-small", max_new_tokens=128):
        # device=0 selects the first GPU; -1 falls back to CPU.
        self.pipe = pipeline("text2text-generation", model=model_name, device=0 if torch.cuda.is_available() else -1)
        self.max_new_tokens = max_new_tokens
    def generate(self, prompt: str) -> str:
        # temperature=0.0 is not a valid sampling temperature;
        # do_sample=False requests deterministic greedy decoding instead.
        out = self.pipe(prompt, max_new_tokens=self.max_new_tokens, do_sample=False)[0]["generated_text"]
        return out.strip()

class VirtualComputer:
    """Text-only simulated desktop with a browser, a notes app, and a mail app."""
    def __init__(self):
        self.apps = {"browser": "https://example.com", "notes": "", "mail": ["Welcome to CUA", "Invoice #221", "Weekly Report"]}
        self.focus = "browser"
        self.screen = "Browser open at https://example.com\nSearch bar focused."
        self.action_log = []
    def screenshot(self):
        # The "screenshot" is a plain-text rendering of the desktop state.
        return f"FOCUS:{self.focus}\nSCREEN:\n{self.screen}\nAPPS:{list(self.apps.keys())}"
    def click(self, target:str):
        if target in self.apps:
            # Clicking an app name moves focus to it and redraws the screen.
            self.focus = target
            if target=="browser":
                self.screen = f"Browser tab: {self.apps['browser']}\nAddress bar focused."
            elif target=="notes":
                self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
            elif target=="mail":
                inbox = "\n".join(f"- {s}" for s in self.apps['mail'])
                self.screen = f"Mail App Inbox:\n{inbox}\n(Read-only preview)"
        else:
            self.screen += f"\nClicked '{target}'."
        self.action_log.append({"type":"click","target":target})
    def type(self, text:str):
        # Typed text goes to whichever app currently has focus.
        if self.focus=="browser":
            self.apps["browser"] = text
            self.screen = f"Browser tab now at {text}\nPage headline: Example Domain"
        elif self.focus=="notes":
            self.apps["notes"] += ("\n"+text)
            self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
        else:
            self.screen += f"\nTyped '{text}' but no editable field."
        self.action_log.append({"type":"type","text":text})

We define the core components: a lightweight local model and a virtual computer. We use Flan-T5 as our reasoning engine and create a simulated desktop that can open apps, display screens, and respond to typing and clicking actions.
class ComputerTool:
    """Maps command strings from the agent onto VirtualComputer methods."""
    def __init__(self, computer:VirtualComputer):
        self.computer = computer
    def run(self, command:str, argument:str=""):
        if command=="click":
            self.computer.click(argument)
            return {"status":"completed","result":f"clicked {argument}"}
        if command=="type":
            self.computer.type(argument)
            return {"status":"completed","result":f"typed {argument}"}
        if command=="screenshot":
            snap = self.computer.screenshot()
            return {"status":"completed","result":snap}
        # Anything else is reported back instead of raising.
        return {"status":"error","result":f"unknown command {command}"}

We introduce the ComputerTool interface, which acts as the communication bridge between the agent's reasoning and the virtual desktop. We define high-level operations such as click, type, and screenshot, enabling the agent to interact with the environment in a structured way.
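The if-chain in `ComputerTool.run` could also be written as a lookup table, which keeps the dispatch logic flat as more commands are added. The sketch below is one possible variant, not the tutorial's implementation: `StubDesktop` and `DispatchTool` are hypothetical names introduced here, and the stub only records calls so the block runs on its own. It returns the same status/result dictionary shape as the tool above.

```python
from typing import Callable, Dict


class StubDesktop:
    """Hypothetical stand-in for the tutorial's VirtualComputer (assumption)."""

    def __init__(self) -> None:
        self.log = []

    def click(self, target: str) -> None:
        self.log.append(("click", target))

    def type(self, text: str) -> None:
        self.log.append(("type", text))

    def screenshot(self) -> str:
        return f"ACTIONS SO FAR: {len(self.log)}"


class DispatchTool:
    """Same status/result contract as the tool above, using a lookup table."""

    def __init__(self, desktop: StubDesktop) -> None:
        self.desktop = desktop
        # Each handler takes the argument string and returns the result text.
        self.handlers: Dict[str, Callable[[str], str]] = {
            "click": self._click,
            "type": self._type,
            "screenshot": lambda _arg: desktop.screenshot(),
        }

    def _click(self, arg: str) -> str:
        self.desktop.click(arg)
        return f"clicked {arg}"

    def _type(self, arg: str) -> str:
        self.desktop.type(arg)
        return f"typed {arg}"

    def run(self, command: str, argument: str = "") -> dict:
        handler = self.handlers.get(command)
        if handler is None:
            # Unknown commands are reported, not raised.
            return {"status": "error", "result": f"unknown command {command}"}
        return {"status": "completed", "result": handler(argument)}


tool = DispatchTool(StubDesktop())
print(tool.run("click", "mail"))
print(tool.run("scroll", "down"))
```

Adding a new command then means registering one more entry in `handlers`, while the error path for unknown commands stays in a single place.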
class ComputerAgent:
    """Observe-think-act loop: screenshot, prompt the model, run the parsed action."""
    def __init__(self, llm:LocalLLM, tool:ComputerTool, max_trajectory_budget:float=5.0):
        self.llm = llm
        self.tool = tool
        self.max_trajectory_budget = max_trajectory_budget
    async def run(self, messages):
        user_goal = messages[-1]["content"]
        steps_remaining = int(self.max_trajectory_budget)
        output_events = []
        total_prompt_tokens = 0
        total_completion_tokens = 0
        while steps_remaining>0:
            screen = self.tool.computer.screenshot()
            prompt = (
                "You are a computer-use agent.\n"
                f"User goal: {user_goal}\n"
                f"Current screen:\n{screen}\n\n"
                "Think step-by-step.\n"
                "Reply with: ACTION <click/type/screenshot> ARG <target or text> THEN <assistant message>.\n"
            )
            thought = self.llm.generate(prompt)
            # Whitespace word counts: a rough proxy, not real tokenizer counts.
            total_prompt_tokens += len(prompt.split())
            total_completion_tokens += len(thought.split())
            # Defaults used when the model ignores the requested format.
            action="screenshot"; arg=""; assistant_msg="Working..."
            for line in thought.splitlines():
                if line.strip().startswith("ACTION "):
                    after = line.split("ACTION ",1)[1]
                    action = after.split()[0].strip().lower()
                if "ARG " in line:
                    part = line.split("ARG ",1)[1]
                    if " THEN " in part:
                        arg = part.split(" THEN ")[0].strip()
                    else:
                        arg = part.strip()
                if "THEN " in line:
                    assistant_msg = line.split("THEN ",1)[1].strip()
            output_events.append({"summary":[{"text":assistant_msg,"type":"summary_text"}],"type":"reasoning"})
            call_id = "call_"+uuid.uuid4().hex[:16]
            tool_res = self.tool.run(action, arg)
            output_events.append({"action":{"type":action,"text":arg},"call_id":call_id,"status":tool_res["status"],"type":"computer_call"})
            snap = self.tool.computer.screenshot()
            output_events.append({"type":"computer_call_output","call_id":call_id,"output":{"type":"input_image","image_url":snap}})
            output_events.append({"type":"message","role":"assistant","content":[{"type":"output_text","text":assistant_msg}]})
            # Stop early once the model signals that it has finished.
            if "done" in assistant_msg.lower() or "here is" in assistant_msg.lower():
                break
            steps_remaining -= 1
        usage = {"prompt_tokens": total_prompt_tokens,"completion_tokens": total_completion_tokens,"total_tokens": total_prompt_tokens + total_completion_tokens,"response_cost": 0.0}
        # run() is an async generator that yields one result after the loop ends.
        yield {"output": output_events, "usage": usage}

We build the ComputerAgent, which serves as the system's intelligent controller. We program it to reason about goals, decide which actions to take, execute them through the tool interface, and record each interaction as a step in its decision-making process.
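Small models often drift from the requested `ACTION ... ARG ... THEN ...` format, so it can help to isolate the parsing step in a function that is easy to test. The following is a minimal sketch under stated assumptions rather than the tutorial's parser: `parse_reply`, `ParsedReply`, and `VALID_ACTIONS` are hypothetical names, and unlike the loop above, an unrecognized action falls back to a harmless screenshot instead of being passed to the tool.

```python
import re
from typing import NamedTuple

# Assumption: these are the only commands the tool understands.
VALID_ACTIONS = {"click", "type", "screenshot"}


class ParsedReply(NamedTuple):
    action: str
    arg: str
    message: str


# ACTION <word> [ARG <text>] [THEN <text>], all on one line.
_REPLY = re.compile(
    r"ACTION\s+(?P<action>\S+)"
    r"(?:\s+ARG\s+(?P<arg>.*?))?"
    r"(?:\s+THEN\s+(?P<msg>.*))?$"
)


def parse_reply(text: str) -> ParsedReply:
    """Parse the first well-formed line; otherwise fall back to a screenshot."""
    fallback = ParsedReply("screenshot", "", "Working...")
    for line in text.splitlines():
        match = _REPLY.search(line.strip())
        if not match:
            continue
        action = match.group("action").lower()
        if action not in VALID_ACTIONS:
            return fallback
        return ParsedReply(
            action,
            (match.group("arg") or "").strip(),
            (match.group("msg") or fallback.message).strip(),
        )
    return fallback


print(parse_reply("ACTION click ARG mail THEN Opening the inbox."))
print(parse_reply("I will open the mail app."))
```

Keeping the fallback in one place makes the agent's behavior on malformed output explicit: the step budget is spent on an observation rather than on an invalid command.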
async def main_demo():
    computer = VirtualComputer()
    tool = ComputerTool(computer)
    llm = LocalLLM()
    agent = ComputerAgent(llm, tool, max_trajectory_budget=4)
    messages=[{"role":"user","content":"Open mail, read inbox subjects, and summarize."}]
    # agent.run is an async generator, so we consume it with "async for".
    async for result in agent.run(messages):
        print("==== STREAM RESULT ====")
        for event in result["output"]:
            if event["type"]=="computer_call":
                a = event.get("action",{})
                print(f"[TOOL CALL] {a.get('type')} -> {a.get('text')} [{event.get('status')}]")
            if event["type"]=="computer_call_output":
                snap = event["output"]["image_url"]
                print("SCREEN AFTER ACTION:\n", snap[:400],"...\n")
            if event["type"]=="message":
                print("ASSISTANT:", event["content"][0]["text"], "\n")
        print("USAGE:", result["usage"])
# In a notebook the loop is already running; nest_asyncio lets us re-enter it.
loop = asyncio.get_event_loop()
loop.run_until_complete(main_demo())

We bring everything together by running the demo, where the agent interprets a user's request and performs tasks on the virtual computer. We observe it generating reasoning, executing commands, updating the virtual screen, and working toward its goal step by step.
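To exercise the control flow without downloading a model, the language model can be swapped for a scripted stand-in that exposes the same `generate(prompt) -> str` method. The sketch below is illustrative only: `ScriptedLLM`, `run_steps`, and `collect` are hypothetical names, the loop is a simplified version of the agent's (no tool calls), and the stop phrase is assumed to be "done".

```python
import asyncio
from typing import AsyncIterator, Dict, List


class ScriptedLLM:
    """Hypothetical test double: returns canned replies in order."""

    def __init__(self, replies: List[str]) -> None:
        self._replies = list(replies)

    def generate(self, prompt: str) -> str:
        # Once the script runs out, keep asking for a screenshot.
        return self._replies.pop(0) if self._replies else "ACTION screenshot"


async def run_steps(llm: ScriptedLLM, goal: str, budget: int) -> AsyncIterator[Dict]:
    """Simplified loop: one reply per step, stop on 'done' or when the budget ends."""
    transcript = []
    for step in range(budget):
        reply = llm.generate(f"User goal: {goal}\nStep: {step}")
        transcript.append(reply)
        if "done" in reply.lower():
            break
    # Like the agent above, yield a single summary once the loop has finished.
    yield {"output": transcript, "steps": len(transcript)}


async def collect(llm: ScriptedLLM, goal: str, budget: int) -> List[Dict]:
    # Async generators are consumed with "async for" (here, a comprehension).
    return [result async for result in run_steps(llm, goal, budget)]


script = [
    "ACTION click ARG mail THEN Opening the inbox.",
    "ACTION screenshot THEN Done: three subjects found.",
]
results = asyncio.run(collect(ScriptedLLM(script), "Summarize inbox", budget=4))
print(results[0]["steps"])
```

Outside a notebook, `asyncio.run` is enough to drive the coroutine; inside Colab or Jupyter, where an event loop is already running, the `nest_asyncio.apply()` call from the setup step is what makes a nested run possible.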
In conclusion, we implemented the essence of a computer-use agent capable of autonomous reasoning and interaction. We see how a local language model like Flan-T5 can drive desktop-style automation inside a safe, text-based sandbox. This project helps us understand the architecture behind computer-use agents, which bridges natural language reasoning with virtual tool control. It lays a strong foundation for extending these capabilities toward real-world, multimodal, and secure automation systems.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.

 
