
Should we switch to using DVC + Azure from Git LFS for storing resource files? #1465

Open · matt-graham opened this issue Sep 23, 2024 · 4 comments
Labels: question (Further information is requested)

Comments


matt-graham commented Sep 23, 2024

Currently we use Git Large File Storage (LFS) to store and retrieve large files in the repository, specifically the data files in the resources directory (see previous discussion in #150).

While this has generally worked as intended, the relatively large number of files we have stored (~300 MiB currently) and, more crucially, the large amount of Git LFS bandwidth we consume regularly downloading these files have caused issues with LFS quotas being exceeded.

A possible alternative would be to use Data Version Control (DVC) with Azure Blob Storage as a remote storage backend. DVC acts as a layer on top of Git and, among other things, allows large data files to be versioned and tracked efficiently by keeping only proxy .dvc files under version control with Git. The files themselves can be synchronized to a variety of remote data storage platforms, including both cloud and self-hosted options.
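As a minimal sketch of what tracking a resource file would look like (the file path below is just an illustration, not one of our actual resource files):

```sh
# Track a data file with DVC instead of Git; this writes a small
# my_data.csv.dvc proxy file (containing the file's hash and size) which is
# what gets committed, and adds the data file itself to .gitignore.
dvc add resources/my_data.csv
git add resources/my_data.csv.dvc resources/.gitignore
git commit -m "Track my_data.csv with DVC"

# Upload the file contents to whichever remote storage is configured.
dvc push
```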

Given we are already using Azure resources for the project and have an existing Azure storage account, Azure Blob Storage seems a natural option to use. We would need to create a blob container with access set appropriately; ideally, to keep the code runnable by anyone, we would enable anonymous read access on the blob storage, which might mean we want to use a separate storage account.
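A rough sketch of the remote configuration, where the container and account names below are placeholders rather than our actual Azure resources, would be something like:

```sh
# Point DVC at an Azure Blob Storage container (names are placeholders).
dvc remote add -d azure-remote azure://tlo-resources/dvc
dvc remote modify azure-remote account_name 'tlostorageaccount'

# Writers would authenticate via locally-stored credentials (kept out of the
# committed .dvc/config), e.g. a connection string or SAS token.
dvc remote modify --local azure-remote connection_string '<REDACTED>'
```

If anonymous read access were enabled on the container, dvc pull should then work without any credentials, though the exact configuration needed for that would want checking against the DVC documentation.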

While there would be a bit of additional overhead for users in using DVC rather than Git LFS, it should be relatively minimal. DVC is a Python package and so can be installed using pip along with the other package dependencies. To synchronize the files from remote storage, a user just needs to run dvc pull after cloning (or after pulling Git changes to the .dvc proxy files).
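The extra steps for a user would be roughly as follows (the dvc[azure] extra is an assumption here; it could equally be pinned in the project's requirements):

```sh
# Install DVC together with its Azure Blob Storage support.
pip install "dvc[azure]"

# After cloning, or after a git pull that changed any .dvc proxy files,
# download the corresponding resource files from the Azure remote.
dvc pull
```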

In the longer term there is also scope for considering other DVC features, such as its support for data pipelines and experiment tracking, which it seems could fit well with our Azure Batch / Scenario class system.

matt-graham added the question (Further information is requested) label Sep 23, 2024

tamuri commented Sep 23, 2024

Yes, definitely worth considering - to discuss at the next softeng meeting.


matt-graham commented Sep 24, 2024

While I remember: when discussing this with @tamuri just now, he asked how, if we went this route, someone working on a branch where the currently checked-out proxy files are out of sync with the locally cached versions of the underlying data files would know this was the case (for example, after pulling in commits where the data files were updated). There is a dvc status command that would allow checking this, but we would need to figure out how to ensure people embed it into their workflows, or whether there is any way we can automatically flag this as an issue when running simulations.
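As a very rough sketch of the automated option (the exact wording of dvc status's output is an assumption here and would need checking against the installed DVC version), a wrapper script or simulation entry point could do something like:

```sh
# Warn if the workspace data files are out of sync with the checked-out .dvc
# proxy files, e.g. after pulling commits that updated the data. The
# "up to date" message matched here is an assumption about dvc status output.
if ! dvc status | grep -q "up to date"; then
    echo "WARNING: resource files may be stale - run 'dvc pull' or 'dvc checkout'" >&2
fi
```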


tamuri commented Sep 24, 2024

> how to ensure people embed it into their workflows

Looks like the git hooks installed by dvc install help a bit.
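For reference, my understanding (to be confirmed against the DVC docs) is that this sets up hooks roughly as follows:

```sh
# Install DVC's Git hooks in the local clone.
dvc install
# - post-checkout: runs `dvc checkout` so data files follow the .dvc files
# - pre-commit:    runs `dvc status` to flag data that is out of sync
# - pre-push:      runs `dvc push` so tracked data reaches the remote
```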


tamuri commented Sep 24, 2024

Or how about another whole tool to orchestrate 😆 🤦 https://github.com/dagshub/fds
