Currently we use Git Large File Storage (LFS) to store and retrieve large files in the repository, specifically the data files in the `resources` directory (see previous discussion in #150).
While this has generally worked as intended, the relatively large volume of files we have stored (~300MiB currently) and, more crucially, the large amount of Git LFS bandwidth we consume in regularly downloading these files have caused issues with LFS quotas being exceeded.
A possible alternative would be to use Data Version Control (DVC) with Azure Blob Storage as a remote storage backend. DVC acts as a layer on top of Git and, among other things, allows large data files to be versioned and tracked efficiently by keeping only proxy `.dvc` files under version control with Git. The files themselves can be synchronized to a variety of remote data storage platforms, including both cloud and self-hosted options.
Given we are already using Azure resources for the project and have an existing Azure storage account, Azure Blob Storage seems a natural option to use. We would need to create a blob container with access set accordingly - ideally, to keep the code runnable by anyone, we would enable anonymous read access on the blob container - which might mean we want to use a separate storage account.
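For concreteness, pointing DVC at such a container as the project's default remote might look roughly like the following (a sketch only: the container and storage account names are placeholders, and the anonymous-access option would need checking against the DVC Azure remote documentation):

```sh
# Sketch only: names are placeholders, not decisions.
dvc remote add -d azure-remote azure://<container-name>/resources
dvc remote modify azure-remote account_name '<storage-account-name>'
# If anonymous read access is enabled on the container, users should not need
# credentials to pull; DVC's Azure remote has an option along these lines:
dvc remote modify azure-remote allow_anonymous_login true
```

Write credentials for maintainers pushing updated data files could then be supplied separately (for example a SAS token or connection string kept in each maintainer's local DVC config) rather than being stored in the repository.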
While there would be a bit of additional overhead for users in using DVC rather than Git LFS, it should be relatively minimal. DVC is a Python package and so can be installed using `pip` along with the other package dependencies. To synchronize the files from remote storage, a user just needs to run `dvc pull` after cloning (or after pulling changes in Git to the `.dvc` proxy files).
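As a rough sketch, assuming DVC is added to the package dependencies with its Azure extra, the per-user workflow would be something like:

```sh
pip install "dvc[azure]"   # alongside the other Python dependencies
git pull                   # brings in any updated .dvc proxy files
dvc pull                   # downloads the corresponding data files from the remote
```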
In the longer term there is also scope for considering other DVC features, such as its support for data pipelines and experiment tracking, which seem like they could fit well with our Azure Batch / `Scenario` class system.
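As a purely illustrative example of the pipelines feature, a `Scenario`-style run could in principle be declared as a stage in a `dvc.yaml` file, with its data dependencies and outputs tracked by DVC (all paths and names below are hypothetical):

```yaml
# Hypothetical stage definition - not a proposal for specific files.
stages:
  run_scenario:
    cmd: python scripts/run_scenario.py   # hypothetical entry point
    deps:
      - resources/example_input.csv       # hypothetical data dependency
    outs:
      - outputs/scenario_results/         # hypothetical tracked output
```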
While I remember: when discussing this with @tamuri just now, he asked how, if we went this route, someone working on a branch where the currently checked-out proxy files are out of sync with the locally cached versions of the underlying data files would know this was the case (for example, after pulling in commits where the data files were updated). There is a `dvc status` command that allows checking this, but we would need to figure out how to ensure people either embed using it in their workflows, or whether there is any way we can automate flagging this as an issue when running simulations.
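One possible way to automate this (a sketch only, assuming the `dvc` CLI is on the path and simulations are launched from the repository root) would be a small check at simulation start-up that shells out to `dvc status` and warns if the workspace looks out of date:

```python
import subprocess
import warnings


def warn_if_dvc_out_of_date() -> None:
    """Warn if DVC-tracked data files appear out of sync with the checked-out .dvc files."""
    result = subprocess.run(
        ["dvc", "status"], capture_output=True, text=True, check=False
    )
    # When everything is in sync `dvc status` reports that data is up to date;
    # any other output suggests missing or outdated data files in the workspace.
    if "up to date" not in result.stdout.lower():
        warnings.warn(
            "DVC-tracked resource files may be out of date with the current branch; "
            "consider running `dvc pull`.\n" + result.stdout
        )
```

Exactly what `dvc status` reports in the in-sync case would need checking, so the string match above is indicative rather than definitive.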