Metadata-Version: 2.1
Name: ssh-gpu-monitor
Version: 1.0.1
Summary: A fast, asynchronous GPU monitoring tool for multiple machines through SSH
Author-email: Alex Spies <alex@afspies.com>
License: MIT License
        
        Copyright (c) 2024 Alexander F. Spies
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE. 
Project-URL: Homepage, https://github.com/afspies/gpu-monitor
Project-URL: Bug Tracker, https://github.com/afspies/gpu-monitor/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich>=10.0.0
Requires-Dist: asyncssh>=2.13.1
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: pyOpenSSL==23.1.1
Requires-Dist: cryptography==40.0.2

# SSH GPU Monitor 🖥️ 
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A fast, asynchronous GPU monitoring tool that provides real-time status of NVIDIA GPUs across multiple machines through SSH, with support for jump hosts and per-machine credentials.

![Example Output](example_running.png)

## ✨ Features

- **Real-time Monitoring**: Live updates of GPU status across multiple machines
- **Asynchronous Operation**: Fast, non-blocking checks using `asyncio` and `asyncssh`
- **Jump Host Support**: Access machines behind a bastion/jump host
- **Rich Display**: Beautiful terminal UI using the `rich` library
- **Flexible Configuration**: 
  - YAML-based configuration
  - Per-machine SSH credentials
  - Pattern-based target generation
- **Robust Error Handling**: Graceful handling of network issues and timeouts
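
The asynchronous and error-handling features above can be illustrated with a minimal sketch. It is not the tool's actual implementation: `fake_check` and `check_all` are hypothetical names, and the SSH call is replaced by `asyncio.sleep` so the example runs without any network access. It shows how all hosts are queried concurrently and how one slow host is reported as timed out without holding up the rest.

```python
import asyncio


async def fake_check(host: str, delay: float) -> str:
    """Stand-in for an SSH query (hypothetical); sleeps instead of connecting."""
    await asyncio.sleep(delay)
    return f"{host}: ok"


async def check_all(delays: dict[str, float], timeout: float) -> dict[str, str]:
    """Query every host concurrently, turning timeouts into status strings."""

    async def guarded(host: str, delay: float) -> str:
        try:
            return await asyncio.wait_for(fake_check(host, delay), timeout)
        except asyncio.TimeoutError:
            return f"{host}: timed out"

    # gather() runs all guarded checks at once and preserves input order.
    results = await asyncio.gather(*(guarded(h, d) for h, d in delays.items()))
    return dict(zip(delays, results))


# gpu-server2 is deliberately slower than the timeout.
statuses = asyncio.run(
    check_all({"gpu-server1": 0.01, "gpu-server2": 1.0}, timeout=0.1)
)
print(statuses)
```

Because each check carries its own timeout, the total wait is bounded by the slowest allowed check rather than by the sum of all checks.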

## 🚀 Quick Start

1. Install the package from PyPI (or clone the repository to run `main.py` directly):
```bash
pip install ssh-gpu-monitor
```

2. Create a basic configuration file (`config/config.yaml`):
```yaml
ssh:
  username: "your_username"
  key_path: "~/.ssh/id_rsa"
  jump_host: "jump.example.com"
  timeout: 10

targets:
  individual:
    - "gpu-server1"
    - "gpu-server2"

display:
  refresh_rate: 5
```

3. Run the monitor:
```bash
python main.py
```

## 📖 Configuration

### Basic Structure
```yaml
ssh:
  username: "default_user"  # Default username
  key_path: "~/.ssh/id_rsa"  # Default SSH key
  jump_host: "jump.example.com"
  timeout: 10  # seconds

targets:
  # Individual machines
  individual:
    - host: "gpu-server1"
      username: "different_user"  # Optional override
      key_path: "~/.ssh/special_key"  # Optional override
    - "gpu-server2"  # Uses default credentials
  
  # Pattern-based groups
  patterns:
    - prefix: "gpu"
      start: 1
      end: 30
      format: "{prefix}{number:02}"  # Results in gpu01, gpu02, etc.
      username: "gpu_user"  # Optional override
      key_path: "~/.ssh/gpu_key"  # Optional override

display:
  refresh_rate: 5  # seconds

debug:
  enabled: false
  log_dir: "logs"
  log_file: "gpu_checker.log"
  log_max_size: 1048576  # 1MB
  log_backup_count: 3
```

### Command Line Options
Override any configuration option via the command line:
```bash
# Enable debug logging
python main.py --debug.enabled

# Override SSH settings
python main.py --ssh.username=other_user --ssh.key_path=~/.ssh/other_key

# Check specific targets
python main.py --targets gpu01 gpu02 special-server
```
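
As a rough illustration of how dotted options such as `--ssh.username=other_user` could map onto the nested YAML structure, here is a minimal sketch. The helper name `apply_override` is hypothetical and the tool's real argument handling may differ; the sketch only shows the dotted-key-to-nested-dict idea.

```python
def apply_override(config: dict, dotted_key: str, value) -> None:
    """Set config['a']['b'] = value for dotted_key 'a.b', creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Defaults as they might be loaded from config.yaml.
config = {
    "ssh": {"username": "default_user", "timeout": 10},
    "debug": {"enabled": False},
}

apply_override(config, "ssh.username", "other_user")
apply_override(config, "debug.enabled", True)
print(config)
```

Keys that are not overridden, such as `ssh.timeout` here, keep their configured values.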

## 🔧 Advanced Usage

### Custom Target Patterns
Generate targets using patterns:
```yaml
patterns:
  - prefix: "compute"
    start: 1
    end: 100
    format: "{prefix}-{number:03d}"  # compute-001, compute-002, etc.
```
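
To make the expansion concrete, here is a minimal sketch of how one pattern entry could be turned into hostnames with Python's `str.format`. The function name `expand_pattern` is hypothetical, and treating `end` as inclusive is an assumption; the tool's own behaviour at the upper bound may differ.

```python
def expand_pattern(prefix: str, start: int, end: int, fmt: str) -> list[str]:
    """Expand one pattern entry into hostnames (assumes `end` is inclusive)."""
    return [fmt.format(prefix=prefix, number=n) for n in range(start, end + 1)]


hosts = expand_pattern("compute", 1, 100, "{prefix}-{number:03d}")
print(hosts[0], hosts[-1], len(hosts))  # compute-001 compute-100 100
```

The `format` string uses standard Python format specifiers, so `{number:03d}` pads to three digits and `{number:02}` pads to two.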

### Per-Machine Credentials
Specify different credentials for specific machines:
```yaml
individual:
  - host: "special-gpu"
    username: "admin"
    key_path: "~/.ssh/admin_key"
```
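
The fallback from per-machine values to the `ssh` defaults can be sketched as follows. This assumes entries are either a plain hostname string or a mapping with a `host` key, as in the configuration above; `resolve_target` is a hypothetical helper, not part of the tool's documented API.

```python
def resolve_target(entry, defaults: dict) -> dict:
    """Return host, username and key_path for one `individual` entry."""
    if isinstance(entry, str):
        # A bare hostname uses the default credentials.
        entry = {"host": entry}
    return {
        "host": entry["host"],
        "username": entry.get("username", defaults["username"]),
        "key_path": entry.get("key_path", defaults["key_path"]),
    }


defaults = {"username": "default_user", "key_path": "~/.ssh/id_rsa"}
targets = [
    {"host": "special-gpu", "username": "admin", "key_path": "~/.ssh/admin_key"},
    "gpu-server2",
]
resolved = [resolve_target(t, defaults) for t in targets]
print(resolved)
```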

### Debug Logging
Enable detailed logging for troubleshooting:
```yaml
debug:
  enabled: true
  log_dir: "logs"
  log_file: "debug.log"
```
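
For orientation, the `debug` keys map naturally onto Python's standard `RotatingFileHandler`. The sketch below assumes that mapping (`log_max_size` to `maxBytes`, `log_backup_count` to `backupCount`); the function `build_logger` and the logger name are hypothetical, and the tool's own logging setup may differ.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(debug_cfg: dict) -> logging.Logger:
    """Build a logger from a `debug` config mapping (hypothetical helper)."""
    logger = logging.getLogger("gpu_monitor_debug")  # hypothetical logger name
    logger.handlers.clear()
    if not debug_cfg.get("enabled", False):
        # Debugging disabled: swallow all records.
        logger.addHandler(logging.NullHandler())
        return logger

    log_dir = Path(debug_cfg.get("log_dir", "logs"))
    log_dir.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_dir / debug_cfg.get("log_file", "gpu_checker.log"),
        maxBytes=debug_cfg.get("log_max_size", 1048576),  # rotate at 1 MiB
        backupCount=debug_cfg.get("log_backup_count", 3),  # keep 3 old files
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger
```

Calling `build_logger({"enabled": True, "log_dir": "logs", "log_file": "debug.log"})` would create `logs/debug.log` and rotate it once it reaches the size limit.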

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

### Original Contributors
Originally created as "some awful, brittle code to check GPU status of multiple machines at a given host address through an SSH jumpnode."

Special thanks to:
- @harrygcoppock and @minut1bc for their PRs on v1
- [gpuobserver](https://github.com/pawni/gpuobserver) for earlier code concepts
- [Stack Overflow answer](https://stackoverflow.com/a/36096801/7565759) for SSH connection handling insights

### Libraries
- [Rich](https://github.com/Textualize/rich) for the beautiful terminal interface
- [asyncssh](https://github.com/ronf/asyncssh) for async SSH support
- [PyYAML](https://pyyaml.org/) for configuration management

## 🔍 Similar Projects

- [nvidia-smi-tools](https://github.com/example/nvidia-smi-tools)
- [gpu-monitor](https://github.com/example/gpu-monitor)

## ⚠️ Known Issues

- SSH connections might time out on very slow networks
- Some older NVIDIA drivers might return incompatible XML formats
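
One way to cope with differing XML layouts is to parse defensively and fall back to placeholder values when an element is missing. The sketch below is an assumption about how such parsing could look, not the tool's actual code: the sample is a hand-written fragment modeled on `nvidia-smi -q -x` output, and `parse_gpus` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; the second GPU lacks usage fields,
# standing in for a driver that reports a reduced layout.
SAMPLE_XML = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <fb_memory_usage><total>24576 MiB</total><used>1024 MiB</used></fb_memory_usage>
    <utilization><gpu_util>37 %</gpu_util></utilization>
  </gpu>
  <gpu id="00000000:02:00.0">
    <product_name>Older GPU</product_name>
  </gpu>
</nvidia_smi_log>"""


def parse_gpus(xml_text: str) -> list[dict]:
    """Extract per-GPU fields, substituting defaults for missing elements."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # Unparseable output: report no GPUs rather than crash.
    return [
        {
            "name": gpu.findtext("product_name", default="unknown"),
            "memory_used": gpu.findtext("fb_memory_usage/used", default="N/A"),
            "memory_total": gpu.findtext("fb_memory_usage/total", default="N/A"),
            "utilization": gpu.findtext("utilization/gpu_util", default="N/A"),
        }
        for gpu in root.iter("gpu")
    ]


gpus = parse_gpus(SAMPLE_XML)
print(gpus)
```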

## 📊 Roadmap

- [ ] Add support for AMD GPUs
- [ ] Implement process name filtering
- [ ] Add web interface
- [ ] Support for custom SSH config files
