
Should we switch to using DVC + Azure from Git LFS for storing resource files? #1465

Open · matt-graham opened this issue Sep 23, 2024 · 4 comments
Labels: question (Further information is requested)

Comments


matt-graham commented Sep 23, 2024

Currently we use Git Large File Storage (LFS) to store and retrieve large files in the repository, specifically the data files in the resources directory (see previous discussion in #150).

While this has generally worked as intended, the relatively large number of files we have stored (~300 MiB currently) and, more crucially, the large amount of Git LFS bandwidth we consume regularly downloading these files have caused issues with LFS quotas being exceeded.

A possible alternative would be to use Data Version Control (DVC) with Azure Blob Storage as a remote storage backend. DVC acts as a layer on top of Git and, among other things, allows large data files to be versioned and tracked efficiently by keeping only proxy .dvc files under version control with Git. The files themselves can be synchronized to a variety of remote data storage platforms, including both cloud and self-hosted options.
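As a minimal sketch of what tracking a resource file would look like (the file path below is just an illustration, not one of our actual resource files):

```sh
# Track a data file with DVC instead of Git; this writes a small
# my_data.csv.dvc proxy file (containing the file's hash and size) which is
# what gets committed, and adds the data file itself to .gitignore.
dvc add resources/my_data.csv
git add resources/my_data.csv.dvc resources/.gitignore
git commit -m "Track my_data.csv with DVC"

# Upload the file contents to whichever remote storage is configured.
dvc push
```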

Given we are already using Azure resources for the project and have an existing Azure storage account, Azure Blob Storage seems a natural option to use. We would need to create a blob container with access set appropriately; ideally, to keep the code runnable by anyone, we would enable anonymous read access on the blob storage, which might mean we want to use a separate storage account.
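A rough sketch of the remote configuration, where the container and account names below are placeholders rather than our actual Azure resources, would be something like:

```sh
# Point DVC at an Azure Blob Storage container (names are placeholders).
dvc remote add -d azure-remote azure://tlo-resources/dvc
dvc remote modify azure-remote account_name 'tlostorageaccount'

# Writers would authenticate via locally-stored credentials (kept out of the
# committed .dvc/config), e.g. a connection string or SAS token.
dvc remote modify --local azure-remote connection_string '<REDACTED>'
```

If anonymous read access were enabled on the container, dvc pull should then work without any credentials, though the exact configuration needed for that would want checking against the DVC documentation.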

While there would be a bit of additional overhead for users in using DVC rather than Git LFS, it should be relatively minimal. DVC is a Python package and so can be installed using pip along with the other package dependencies. To synchronize the files from remote storage, a user just needs to run dvc pull after cloning (or after pulling Git changes to the .dvc proxy files).
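The extra steps for a user would be roughly as follows (the dvc[azure] extra is an assumption here; it could equally be pinned in the project's requirements):

```sh
# Install DVC together with its Azure Blob Storage support.
pip install "dvc[azure]"

# After cloning, or after a git pull that changed any .dvc proxy files,
# download the corresponding resource files from the Azure remote.
dvc pull
```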

In the longer term there is also scope for considering other DVC features, such as its support for data pipelines and experiment tracking, which it seems could fit well with our Azure Batch / Scenario class system.

matt-graham added the question (Further information is requested) label Sep 23, 2024

tamuri commented Sep 23, 2024

Yes, definitely worth considering - to discuss at the next softeng meeting.


matt-graham commented Sep 24, 2024

While I remember: when discussing this with @tamuri just now, he asked how, if we went this route, someone working on a branch where the currently checked-out proxy files are out of sync with the locally cached versions of the underlying data files would know this was the case (for example, after pulling in commits where the data files were updated). There is a dvc status command that would allow checking this, but we would need to figure out how to ensure people embed it into their workflows, or whether there is any way we can automatically flag this as an issue when running simulations.
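As a very rough sketch of the automated option (the exact wording of dvc status's output is an assumption here and would need checking against the installed DVC version), a wrapper script or simulation entry point could do something like:

```sh
# Warn if the workspace data files are out of sync with the checked-out .dvc
# proxy files, e.g. after pulling commits that updated the data. The
# "up to date" message matched here is an assumption about dvc status output.
if ! dvc status | grep -q "up to date"; then
    echo "WARNING: resource files may be stale - run 'dvc pull' or 'dvc checkout'" >&2
fi
```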


tamuri commented Sep 24, 2024

> how to ensure people embed it into their workflows

Looks like the git hooks installed by dvc install help a bit.
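For reference, my understanding (to be confirmed against the DVC docs) is that this sets up hooks roughly as follows:

```sh
# Install DVC's Git hooks in the local clone.
dvc install
# - post-checkout: runs `dvc checkout` so data files follow the .dvc files
# - pre-commit:    runs `dvc status` to flag data that is out of sync
# - pre-push:      runs `dvc push` so tracked data reaches the remote
```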


tamuri commented Sep 24, 2024

Or how about another whole tool to orchestrate 😆 🤦 https://github.com/dagshub/fds
