- Document identifier: https://w3id.org/ro/bagit
- Author: Stian Soiland-Reyes http://orcid.org/0000-0001-9842-9718
BagIt is an Internet Draft that specifies a file system structure for transferring and archiving a collection of files, including their checksums and brief metadata.
Research Object Bundle is a specification for a structured ZIP file, based on the ePub and Adobe UCF specifications. The bundle serializes a Research Object, embedding some or all of its resources within the ZIP file, and lists the RO content in a manifest, in addition to embedding and referencing annotations and provenance.
A BagIt bag can be considered a mechanism for serialization and transport consistency, while a Research Object can be considered a way to capture identity, annotations and provenance of the resources. As such, the two formats complement each other. They are, however, not directly compatible.
This GitHub repository explains by example a profile for a BagIt bag to also be a Research Object. Feel free to provide comments and raise issues, or suggest changes as pull requests.
Run the build.sh script (requires zip, md5sum, sha1sum, find) to generate example1.bagit.zip and the corresponding example1.bundle.zip.
Overview of this example:
- example1/ - the bag
  - bagit.txt - complies with BagIt version 0.97
  - manifest-md5.txt - manifest, md5-sums of all of data/
  - manifest-sha1.txt - .. and sha1
  - tagmanifest-md5.txt - tag manifest, md5-sums of the remaining tag files
  - tagmanifest-sha1.txt - .. and sha1
  - fetch.txt - external URLs to add to data/
  - bag-info.txt - bag metadata such as size in bytes
  - data/ - payload directory - what this bag is primarily transferring
    - README.md - Describes the payload, e.g. how to run script
    - numbers.csv - Raw data as CSV-file
    - analyse.py - A script to analyze the CSV
    - results.txt - Output from script
  - metadata/ - tag directory for Research Object metadata
    - manifest.json - RO manifest as JSON-LD
    - annotations/ - structured annotations of RO and RO content, e.g. user-provided descriptions
      - numbers.jsonld - JSON-LD annotations, describing data/numbers.csv
    - provenance/ - provenance of RO content
      - result.prov.jsonld - Provenance of execution of data/analyse.py, which created data/results.txt
A bag in BagIt is a base folder (in this example example1/) that contains the bagit declaration in bagit.txt. A bag contains a payload, the data files that are being transferred, in addition to tag files, metadata for the bag and its content.
A BagIt serialization is typically a tar- or zip-file which contains the base folder. BagIt archives include at the root a subdirectory for the base folder of the bag, e.g. the ZIP file would contain example1/bagit.txt.
The payload of a bag consists of the files within a directory that is always called data. The data folder may contain arbitrary files and subdirectories. In this example we include a simple CSV data file, an analytical script, and the results of running that script. In addition, a textual README.md is included to describe this execution.
The payload files are listed in one or more manifest files that provide hashes of the file content. The BagIt specification specifies that the two most common hashing mechanisms, md5 and sha1, are represented by manifest-md5.txt and manifest-sha1.txt. Other hash mechanisms can also be added (e.g. sha512), but the content of any manifest-* file needs to follow the $hash $filename pattern.
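As a rough illustration of the $hash $filename pattern, the following Python sketch generates a payload manifest for a bag directory. The function names are hypothetical and this is not how build.sh does it (that script relies on md5sum and sha1sum); paths are written relative to the bag base, as in the example bag.

```python
import hashlib
from pathlib import Path


def payload_manifest(bag_dir, algorithm="md5"):
    """Return "<hash> <path>" lines for every file under data/."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            # Paths are relative to the bag base and use forward slashes.
            lines.append(f"{digest} {path.relative_to(bag).as_posix()}")
    return lines


def write_payload_manifest(bag_dir, algorithm="md5"):
    """Write manifest-<algorithm>.txt into the bag base directory."""
    target = Path(bag_dir) / f"manifest-{algorithm}.txt"
    body = "\n".join(payload_manifest(bag_dir, algorithm)) + "\n"
    target.write_text(body, encoding="utf-8")
    return target
```

Calling write_payload_manifest("example1", "sha1") would produce example1/manifest-sha1.txt; any algorithm name accepted by hashlib (e.g. sha512) can be used the same way.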
Files that are too big to practically include in a BagIt archive can be referenced externally in fetch.txt, which includes the URLs to download, expected file size and destination filenames within the bag base directory.
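A minimal sketch of reading such a file, assuming each line holds a URL, a length in bytes (or "-" when the size is unknown) and a destination filename separated by whitespace. The FetchEntry type and parse_fetch function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the size is given as "-"
    filename: str          # relative to the bag base, e.g. data/external.txt


def parse_fetch(text):
    """Parse the content of fetch.txt into a list of FetchEntry tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first two runs of whitespace only, so that a
        # filename containing spaces stays in one piece.
        url, length, filename = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append(FetchEntry(url, size, filename))
    return entries
```

The sketch only parses the file; it deliberately makes no HTTP requests, since headers and authentication are left open.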
The BagIt specification does not define which Accept* headers should be used in such a retrieval, or whether any authentication might be required. This example does not need to make any assumptions about this, as the referenced external.txt is only available in a single representation. The specification also leaves undefined whether the resources in fetch.txt should be considered when creating manifest-* and Payload-Oxum; this example assumes they should not be included. Finally, it does not define the expected interpretation if a file in fetch.txt already exists in the bag's data directory.
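To make the Payload-Oxum assumption concrete, here is a small sketch that computes the value as the total payload size in bytes, a dot, and the number of payload files. The payload_oxum function and its exclude parameter are hypothetical; the parameter exists so that files listed in fetch.txt can be left out even if they have already been downloaded into data/.

```python
from pathlib import Path


def payload_oxum(bag_dir, exclude=()):
    """Return "<total bytes>.<file count>" for the files under data/.

    exclude holds bag-relative paths (e.g. "data/external.txt") that
    should not be counted, such as the destinations listed in fetch.txt.
    """
    bag = Path(bag_dir)
    files = [
        p for p in (bag / "data").rglob("*")
        if p.is_file() and p.relative_to(bag).as_posix() not in exclude
    ]
    octets = sum(p.stat().st_size for p in files)
    return f"{octets}.{len(files)}"
```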
A bag can also contain other tag files, which would be listed in a separate tag manifest, e.g. tagmanifest-md5.txt and tagmanifest-sha1.txt. In this example, the tag manifest lists the content of the metadata directory.
It is undefined in the BagIt specification whether the remaining tag files (e.g. bag-info.txt or fetch.txt) should be included in the tag manifest; this example assumes they should not be included.
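A consumer can check the metadata files against the tag manifest in the same way as payload files are checked against the payload manifest. The sketch below assumes the $hash $filename line pattern and takes the hash algorithm from the manifest filename; verify_manifest is a hypothetical helper, not part of this repository.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, manifest_name):
    """Return the listed paths that are missing or have a different hash."""
    bag = Path(bag_dir)
    # "tagmanifest-md5.txt" -> "md5"
    algorithm = Path(manifest_name).stem.rsplit("-", 1)[1]
    failed = []
    text = (bag / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        expected, filename = line.split(None, 1)
        path = bag / filename
        if not path.is_file():
            failed.append(filename)
            continue
        actual = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        if actual != expected.lower():
            failed.append(filename)
    return failed
```

An empty list from verify_manifest("example1", "tagmanifest-md5.txt") would mean every listed metadata file is present and unchanged.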
A Research Object (RO) is conceptually an aggregation of related resources, an assignment of their identities, and any relevant annotations and provenance statements. The Research Object model specifies how to declare these relations, combining existing Linked Data standards such as OAI-ORE, the W3C Annotation Data Model and W3C PROV.
Serialized as a Research Object Bundle, some or all of those resources are included in the encapsulating ZIP archive together with a JSON-LD manifest, metadata/manifest.json.
A Research Object BagIt archive follows the same structure as a Research Object Bundle, except that the base directory is the bag base (e.g. example1/), rather than the root folder of the ZIP archive (/). The RO Bundle's .ro/ folder is instead called metadata/ in a Research Object BagIt.
The aggregates section of the manifest lists the payload files, both embedded (e.g. ../data/numbers.csv) and external resources (e.g. http://example.com/doc1). Note that local paths are under ../data/, relative to the metadata/ folder.
This aggregates listing provides hooks for additional metadata and provenance, e.g. mediatype, authoredBy and retrievedFrom.
A file can claim to conform to a standard, minimum information checklist, requirements or similar using conformsTo.
If more detailed provenance is available, then history can link to a separate provenance trace, e.g. a PROV-O RDF file, although any kind of embedded or external provenance resource could be appropriate (e.g. log file, word document, git repository). Provenance can also be included for the research object itself.
Annotations about any of the resources in the bag (or the RO itself) can be linked to from the annotations section. Here about specifies one or more resources that are annotated, while content links to the annotation content, which could be any aggregated or external resource (e.g. ../data/README.md that describes analyse.py, numbers.csv and results.txt), or a metadata file under metadata/annotations/, typically in a Linked Data format.
In this example, annotations/numbers.jsonld provides semantic annotations of ../data/numbers.csv in JSON-LD format.
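For orientation, a manifest with the aggregates and annotations sections described above might be assembled like this in Python. This is an illustrative sketch rather than a copy of example1/metadata/manifest.json: the uri key and the @context value are assumptions based on the RO Bundle manifest format and should be checked against the actual example, and the media type is only a plausible value.

```python
import json

manifest = {
    # Assumption: JSON-LD context used by RO Bundle manifests.
    "@context": ["https://w3id.org/bundle/context"],
    "aggregates": [
        # Local payload files, relative to the metadata/ folder.
        {"uri": "../data/README.md"},
        {"uri": "../data/numbers.csv", "mediatype": "text/csv"},
        {"uri": "../data/analyse.py"},
        # history links to a more detailed provenance trace.
        {"uri": "../data/results.txt",
         "history": "provenance/result.prov.jsonld"},
        # External resources are aggregated by absolute URL.
        {"uri": "http://example.com/doc1"},
    ],
    "annotations": [
        {"about": "../data/numbers.csv",
         "content": "annotations/numbers.jsonld"},
        {"about": ["../data/analyse.py",
                   "../data/numbers.csv",
                   "../data/results.txt"],
         "content": "../data/README.md"},
    ],
}

print(json.dumps(manifest, indent=2))
```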
It is customary in Research Object Bundles for non-payload (metadata) files to not be listed under aggregates and to be stored under .ro/. Research Object BagIt archives follow this convention (using metadata/), and in addition the payload files must exclusively be within the data/ folder (or be external URLs).
The metadata/ content is listed in the tag manifest, while the data/ payload is listed in the payload manifest with external URLs in the fetch file.
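The rule that aggregated resources are either inside data/ or external URLs can be checked mechanically. The following sketch resolves each aggregated path relative to the metadata/ folder and reports those that do not end up under data/. It assumes aggregates entries are either plain strings or objects with a uri key, which is an assumption about the manifest format; misplaced_aggregates is a hypothetical name.

```python
import posixpath
from urllib.parse import urlsplit


def misplaced_aggregates(manifest):
    """Return aggregated local paths that resolve outside data/."""
    misplaced = []
    for entry in manifest.get("aggregates", []):
        uri = entry["uri"] if isinstance(entry, dict) else entry
        if urlsplit(uri).scheme:
            continue  # external URL, allowed anywhere
        # The manifest lives in metadata/, so resolve against that folder.
        resolved = posixpath.normpath(posixpath.join("metadata", uri))
        if not resolved.startswith("data/"):
            misplaced.append(uri)
    return misplaced
```

For example, ../data/numbers.csv resolves to data/numbers.csv and passes, while annotations/numbers.jsonld resolves to metadata/annotations/numbers.jsonld and would be reported.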
Research Object BagIt archives SHOULD specify the BagIt profile for bagit-ro within bag-info.txt as:
BagIt-Profile-Identifier: https://w3id.org/ro/bagit/profile
The combination of BagIt and Research Object adds:
- RO consistency with checksums for payload and metadata
- Structured metadata, provenance and annotations for the bag and its content
- With extensions in JSON-LD using any Linked Data vocabulary
- Graceful degradation/conversion to plain BagIt or RO Bundle
An RO Bundle is fundamentally not very different from an archived BagIt bag, except that in the RO Bundle, the .ro/ folder is in the root directory together with a marker mimetype file to help mime magic-like tools identify the file type.
BagIt serialization mandates that a BagIt archive contains only a single directory when unpacked, which is the base directory of the bag. While in theory a hybrid RO Bundle and BagIt ZIP archive could exist, it would have to use the bag name .ro and could not include the mimetype file (without a binary zip file hack). In addition the payload would then be contained in .ro/data/, which is not what you would expect from the RO Bundle specification and which would hide all content from Unix/Linux users.
The approach shown here is therefore a variation of RO Bundle which contains the Research Object within a bag of an arbitrary name. Thus the RO manifest in a Research Object BagIt archive is, in this example, at example1/metadata/manifest.json rather than .ro/manifest.json.
The interpretation of manifest.json according to the RO Bundle specification assumes that /, the root of the ZIP file, is also the root of the RO.
A BagIt bag is not necessarily rooted within an archive, and could be living standalone within a file system directory, or be exposed on the Web at an arbitrary URL base. The name of the containing bag is not declared outside its directory name. The RO manifest and annotations in this approach therefore use only relative URI paths, e.g. ../data/analyse.py, while the RO Bundle manifest would have used /data/analyse.py.
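The effect of using relative URI paths can be seen by resolving the same reference against different manifest locations. The base URLs below are hypothetical; the sketch uses standard URI resolution from the Python standard library.

```python
from urllib.parse import urljoin

relative = "../data/analyse.py"

# Hypothetical places where the same bag might live.
manifest_locations = [
    "file:///home/alice/bags/example1/metadata/manifest.json",
    "http://example.com/bags/example1/metadata/manifest.json",
    "http://example.org/archive/2015/renamed-bag/metadata/manifest.json",
]

for location in manifest_locations:
    # The reference always lands in the data/ folder next to metadata/,
    # whatever the bag is called and wherever it is published.
    print(urljoin(location, relative))
```

An absolute path such as /data/analyse.py would instead resolve against the root of the host or file system, which is only correct when the RO root is also the root of the archive.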
Developers can struggle to generate correct relative paths. An alternative approach, moving /metadata/manifest.json to /manifest.json, could improve on this, but would mean the manifest would no longer be easily usable as an RO Bundle manifest as well, since its relative paths would differ.
The build.sh script shows how this structure means that a Research Object BagIt archive can be converted to a Research Object Bundle by adding the mimetype file and simply archiving from within the bag directory.
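The same conversion can be sketched in Python with the zipfile module. This is not a port of build.sh, and it rests on two labeled assumptions: that the RO Bundle media type is application/vnd.wf4ever.robundle+zip, and that the metadata/ folder should be stored as .ro/ in the bundle, as the folder naming described earlier suggests. Both can be overridden through the hypothetical parameters of bag_to_bundle.

```python
import zipfile
from pathlib import Path

# Assumption: media type for RO Bundles; verify against the RO Bundle spec.
RO_BUNDLE_MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def bag_to_bundle(bag_dir, bundle_path, mimetype=RO_BUNDLE_MIMETYPE,
                  rename_metadata=True):
    """Archive the content of a bag, from within the bag, as an RO Bundle."""
    bag = Path(bag_dir)
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The marker file goes first and uncompressed so that
        # mime magic-like tools can find it near the start of the file.
        bundle.writestr("mimetype", mimetype,
                        compress_type=zipfile.ZIP_STORED)
        for path in sorted(bag.rglob("*")):
            if not path.is_file():
                continue
            # Names are relative to the bag, so the bag folder itself
            # (e.g. example1/) does not appear in the bundle.
            name = path.relative_to(bag).as_posix()
            # Assumption: metadata/ in the bag corresponds to .ro/.
            if rename_metadata and name.startswith("metadata/"):
                name = ".ro/" + name[len("metadata/"):]
            bundle.write(path, name)
```

Because the manifest uses ../data/ paths relative to its own folder, those references still resolve correctly after the folder is stored under .ro/. The bundle_path should point outside the bag directory so that the archive does not try to include itself.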
A similar conversion from RO Bundle to Research Object BagIt would require moving its embedded resources to data/ and rewriting the local paths in its manifest and annotations. See bundle-to-bagit.sh for an example.
Having two kinds of manifests (manifest-sha1.txt and metadata/manifest.json) can be confusing, and can lead to inconsistency if a tool supporting only one of these kinds modifies an RO BagIt.
The bag-info.txt format supports some basic bag-level metadata, e.g. Bagging-Date, Contact-Phone and Organization-Address. While some of these might seem archaic, the specification states that "other arbitrary metadata elements may also be present", allowing extensions.
The BagIt specification has no requirements for such alternative elements (e.g. they are not RFC 2822 headers), and it is unclear whether any whitespace (e.g. newlines and indentation) forms part of the BagIt values or not.
It is recommended that only the basic metadata is provided in bag-info.txt, while more structured metadata and provenance should be provided in the Research Object manifest or annotations.