Metadata-Version: 2.1
Name: hfsync
Version: 0.0.1
Summary: hfsync
Home-page: http://github.com/trisongz/hfsync
Author: Tri Songz
Author-email: ts@growthengineai.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Requires-Python: >3.6
Description-Content-Type: text/markdown
Requires-Dist: file-io[cloud]
Requires-Dist: transformers

# hfsync
 Sync Huggingface Transformer Models to Cloud Storage (GCS/S3 + more soon)



## Quickstart

```python
!pip install --upgrade git+https://github.com/trisongz/hfsync.git
!pip install --upgrade hfsync

```

## Usage
```python
from hfsync import GCSAuth, S3Auth # AZAuth (not yet supported)
from hfsync import Sync

# Note: Only use one auth client.

# To have Auth picked up from env vars / implicitly
auth_client = GCSAuth()
auth_client = S3Auth()

# To set Auth directly / explicitly
auth_client = GCSAuth(service_account='service_account', token='gcs_token')
auth_client = S3Auth(access_key='access_key', secret_key='secret_key', session_token='token')

# Local and Cloud Paths
local_path = '/content/model'
cloud_path = 'gs://bucket/model/experiment' # or 's3://bucket/model/experiment'
sync_client = Sync(local_path=local_path, cloud_path=cloud_path, auth_client=auth_client)

# Or Implicitly without an auth_client to have it figure things out based on your cloud path
sync_client = Sync(local_path=local_path, cloud_path=cloud_path)

# After training loop, sync your pretrained model to both local and cloud. 
# You don't need to explicitly call model.save_pretrained(path) as this function will do that automatically
results = sync_client.save_pretrained(model, tokenizer)
# results = {
# '/content/model/pytorch_model.bin': 'gs://bucket/model/experiment/pytorch_model.bin'
# ...
# }

# Pull Down from your bucket to local
results = sync_client.sync_to_local(overwrite=False)
# results = {
# 'gs://bucket/model/experiment/pytorch_model.bin': '/content/model/pytorch_model.bin'
# ...
# }

# Or set explicit paths if you are changing paths, for example, different dirs for each checkpoint
new_local = '/content/model2'
results = sync_client.sync_to_local(local_path=new_local, cloud_path=cloud_path, overwrite=False)
# results = {
# 'gs://bucket/model/experiment/pytorch_model.bin': '/content/model2/pytorch_model.bin'
# ...
# }

# Or set paths to use
new_cloud = 'gs://bucket/model/experiment2'
sync_client.set_paths(local_path=new_local, cloud_path=new_cloud)
results = sync_client.save_pretrained(model, tokenizer)
# results = {
# '/content/model2/pytorch_model.bin': 'gs://bucket/model/experiment2/pytorch_model.bin'
# ...
# }


# You can also use the underlying filesystem to copy a file directly

# Implicitly & Explicitly
filename = '/content/model2/pytorch_model.bin'
sync_client.copy(filename) # Copies to the set cloud_path variable -> 'gs://bucket/model/experiment2/pytorch_model.bin'
sync_client.copy(filename, dest='gs://bucket/model/experiment3') # Copies to dest variable -> 'gs://bucket/model/experiment3/pytorch_model.bin'
sync_client.copy(filename, dest='gs://bucket/model/experiment3/model.bin') # Copies to dest variable -> 'gs://bucket/model/experiment3/model.bin'

filename = 'gs://bucket/model/experiment2/pytorch_model.bin'
sync_client.copy(filename) # Copies to the set local_path variable -> '/content/model2/pytorch_model.bin'
sync_client.copy(filename, dest='/content/model3') # Copies to dest variable -> '/content/model3/pytorch_model.bin'
sync_client.copy(filename, dest='/content/model3/model.bin') # Copies to dest variable -> '/content/model3/model.bin'

# Copy Explicitly
src_file = '/content/mydataset.pb'
dest_file = 's3://bucket/data/dataset.pb'
sync_client.copy(src_file, dest_file)

```
### Environment Variables Used

Google Cloud Storage
- `GOOGLE_API_TOKEN`
- `GOOGLE_APPLICATION_CREDENTIALS`

AWS S3
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN`


### Limitations

While the library tries to respect and check prior to overwriting where `overwrite=false` is available, it's not currently supported agnostically across all cloud.
Support for other Cloud FS is WIP.

### Credits

- [file-io](https://github.com/trisongz/file_io)
- [smart-open](https://github.com/RaRe-Technologies/smart_open)

