Metadata-Version: 2.1
Name: self-operating-computer
Version: 1.2.0
Summary: UNKNOWN
Home-page: UNKNOWN
License: UNKNOWN
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: annotated-types ==0.6.0
Requires-Dist: anyio ==3.7.1
Requires-Dist: certifi ==2023.7.22
Requires-Dist: charset-normalizer ==3.3.2
Requires-Dist: colorama ==0.4.6
Requires-Dist: contourpy ==1.2.0
Requires-Dist: cycler ==0.12.1
Requires-Dist: distro ==1.8.0
Requires-Dist: EasyProcess ==1.1
Requires-Dist: entrypoint2 ==1.1
Requires-Dist: exceptiongroup ==1.1.3
Requires-Dist: fonttools ==4.44.0
Requires-Dist: h11 ==0.14.0
Requires-Dist: httpcore ==1.0.2
Requires-Dist: httpx ==0.25.1
Requires-Dist: idna ==3.4
Requires-Dist: importlib-resources ==6.1.1
Requires-Dist: kiwisolver ==1.4.5
Requires-Dist: matplotlib ==3.8.1
Requires-Dist: MouseInfo ==0.1.3
Requires-Dist: mss ==9.0.1
Requires-Dist: numpy ==1.26.1
Requires-Dist: openai ==1.2.3
Requires-Dist: packaging ==23.2
Requires-Dist: Pillow ==10.1.0
Requires-Dist: prompt-toolkit ==3.0.39
Requires-Dist: PyAutoGUI ==0.9.54
Requires-Dist: pydantic ==2.4.2
Requires-Dist: pydantic-core ==2.10.1
Requires-Dist: PyGetWindow ==0.0.9
Requires-Dist: PyMsgBox ==1.0.9
Requires-Dist: pyparsing ==3.1.1
Requires-Dist: pyperclip ==1.8.2
Requires-Dist: PyRect ==0.2.0
Requires-Dist: pyscreenshot ==3.1
Requires-Dist: PyScreeze ==0.1.29
Requires-Dist: python3-xlib ==0.15
Requires-Dist: python-dateutil ==2.8.2
Requires-Dist: python-dotenv ==1.0.0
Requires-Dist: pytweening ==1.0.7
Requires-Dist: requests ==2.31.0
Requires-Dist: rubicon-objc ==0.4.7
Requires-Dist: six ==1.16.0
Requires-Dist: sniffio ==1.3.0
Requires-Dist: tqdm ==4.66.1
Requires-Dist: typing-extensions ==4.8.0
Requires-Dist: urllib3 ==2.0.7
Requires-Dist: wcwidth ==0.2.9
Requires-Dist: zipp ==3.17.0
Requires-Dist: google-generativeai ==0.3.0
Requires-Dist: aiohttp ==3.9.1
Requires-Dist: ultralytics ==8.0.227

<h1 align="center">Self-Operating Computer Framework</h1>

<p align="center">
  <strong>A framework to enable multimodal models to operate a computer.</strong>
</p>
<p align="center">
  Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. 
</p>

<div align="center">
  <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/self-operating-computer.png" width="750"  style="margin: 10px;"/>
</div>

<!--
:rotating_light: **OUTAGE NOTIFICATION: gpt-4-vision-preview**
**This model is currently experiencing an outage so the self-operating computer may not work as expected.**
-->


## Key Features
- **Compatibility**: Designed for various multimodal models.
- **Integration**: Currently integrated with **GPT-4v** as the default model, with extended support for Gemini Pro Vision.
- **Future Plans**: Support for additional models.

## Current Challenges
> **Note:** GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.

## Ongoing Development
At [HyperwriteAI](https://www.hyperwriteai.com/), we are developing Agent-1-Vision a multimodal model with more accurate click location predictions.

## Agent-1-Vision Model API Access
We will soon be offering API access to our Agent-1-Vision model.

If you're interested in gaining access to this API, sign up [here](https://othersideai.typeform.com/to/FszaJ1k8?typeform-source=www.hyperwriteai.com).

### Additional Thoughts
We recognize that some operating system functions may be more efficiently executed with hotkeys such as entering the Browser Address bar using `command + L` rather than by simulating a mouse click at the correct XY location. We plan to make these improvements over time. However, it's important to note that many actions require the accurate selection of visual elements on the screen, necessitating precise XY mouse click locations. A primary focus of this project is to refine the accuracy of determining these click locations. We believe this is essential for achieving a fully self-operating computer in the current technological landscape.
## Demo

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0


## Quick Start Instructions
Below are instructions to set up the Self-Operating Computer Framework locally on your computer.

### Option 1: Traditional Installation

1. **Clone the repo** to a directory on your computer:
```
git clone https://github.com/OthersideAI/self-operating-computer.git
```
2. **Cd into directory**:

```
cd self-operating-computer
```

3. **Create a Python virtual environment**. [Learn more about Python virtual environment](https://docs.python.org/3/library/venv.html).

```
python3 -m venv venv
```
4. **Activate the virtual environment**:
```
source venv/bin/activate
```
5. **Install Project Requirements and Command-Line Interface: Instead of using `pip install .`, you can now install the project directly from PyPI with:**
```
pip install self-operating-computer
```
6. **Then rename the `.example.env` file to `.env` so that you can save your OpenAI key in it.**
```
mv .example.env .env
``` 
7. **Add your Open AI key to your new `.env` file. If you don't have one, you can obtain an OpenAI key [here](https://platform.openai.com/account/api-keys)**:
```
OPENAI_API_KEY='your-key-here'
```

8. **Run it**!
```
operate
```
9. **Final Step**: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".

<div align="center">
  <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-1.png" width="300"  style="margin: 10px;"/>
  <img src="https://github.com/OthersideAI/self-operating-computer/blob/main/readme/terminal-access-2.png" width="300"  style="margin: 10px;"/>
</div>


### Option 2: Installation using .sh script

1. **Clone the repo** to a directory on your computer:
```
git clone https://github.com/OthersideAI/self-operating-computer.git
```
2. **Cd into directory**:

```
cd self-operating-computer
```

3. **Run the installation script**: 

```
./run.sh
```


## Using `operate` Modes

### Multimodal Models  `-m`
An additional model is now compatible with the Self Operating Computer Framework. Try Google's `gemini-pro-vision` by following the instructions below. 

**Add your Google AI Studio API key to your .env file.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working, if anyone knows a simpler way, please make a PR:
```
GOOGLE_API_KEY='your-key-here'
```

Start `operate` with the Gemini model
```
operate -m gemini-pro-vision
```

### Set-of-Mark Prompting `-m gpt-4-with-som`
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).

For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).

Start `operate` with the SoM model

```
operate -m gpt-4-with-som
```

### Voice Mode `--voice`
The framework supports voice inputs for the objective. Try voice by following the instructions below. 

Install the additional `requirements-audio.txt`
```
pip install -r requirements-audio.txt
```
**Install device requirements**
For mac users:
```
brew install portaudio
```
For Linux users:
```
sudo apt install portaudio19-dev python3-pyaudio
```
Run with voice mode
```
operate --voice
```

## Contributions are Welcomed!:

If you want to contribute yourself, see [CONTRIBUTING.md](https://github.com/OthersideAI/self-operating-computer/blob/main/CONTRIBUTING.md).

## Feedback

For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter. 

## Join Our Discord Community

For real-time discussions and community support, join our Discord server. 
- If you're already a member, join the discussion in [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).
- If you're new, first [join our Discord Server](https://discord.gg/YqaKtyBEzM) and then navigate to the [#self-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).

## Follow HyperWriteAI for More Updates

Stay updated with the latest developments:
- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).
- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/othersideai/).

## Compatibility
- This project is compatible with Mac OS, Windows, and Linux (with X server installed).

## OpenAI Rate Limiting Note
The ```gpt-4-vision-preview``` model is required. To unlock access to this model, your account needs to spend at least \$5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum \$5.   
Learn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)**


