A Single Cell Expression Atlas module for loading experiments to a Postgres 11 database. Release v0.4.0 was used for the October 2022 data release of Single Cell Expression Atlas.
- Rscript >= 3.6.3 with `optparse`, `tidyr` and `data.table` (in Ubuntu and Debian-based distributions install packages `r-cran-optparse`, `r-cran-tidyr` and `r-cran-data.table`)
- PostgreSQL 11
- Node v12+
Schema definitions and migrations of the database used by Single Cell Expression Atlas are managed by Flyway. They are stored at https://github.com/ebi-gene-expression-group/atlas-schemas/tree/master/flyway/scxa. An example of how to initialise a Docker container with Flyway is available in the development environment of Single Cell Expression Atlas.
Run `bin/load_db_scxa_analytics.sh`. It requires the following environment variables:
| Variable name | Description |
|---|---|
| `EXP_ID` | Experiment accession |
| `EXPRESSION_TYPE` | The expression type of the matrices (see next row); e.g. `aggregated_filtered_counts`; default value is `expression_tpm` |
| `EXPERIMENT_MATRICES_PATH` | Path where `$EXP_ID/$EXP_ID.$EXPRESSION_TYPE.mtx_cols_rows.gz` files are stored |
| `dbConnection` | A Postgres connection string in the form `postgresql://{user}:{password}@{host:port}/{databaseName}` pointing to a Postgres 11 server where the expected `scxa_analytics` table exists |
It is recommended that `bin` is prepended to the `PATH`.
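As a minimal sketch of how the variables above fit together, the snippet below derives the matrix file location and checks that the file exists before calling the loader. The functions `scxa_matrix_file` and `load_analytics_checked` are hypothetical helpers, not part of this module, and the layout `$EXPERIMENT_MATRICES_PATH/$EXP_ID/...` is an assumption read from the table above.

```shell
# Hypothetical helper (not part of this module): print the matrix file path
# that the variables in the table above are assumed to describe.
scxa_matrix_file() {
  # EXPRESSION_TYPE falls back to the documented default, expression_tpm.
  printf '%s/%s/%s.%s.mtx_cols_rows.gz\n' \
    "$EXPERIMENT_MATRICES_PATH" "$EXP_ID" "$EXP_ID" \
    "${EXPRESSION_TYPE:-expression_tpm}"
}

# Hypothetical wrapper: refuse to run the loader if the file is missing.
load_analytics_checked() {
  matrix_file=$(scxa_matrix_file)
  if [ ! -f "$matrix_file" ]; then
    echo "Matrix file not found: $matrix_file" >&2
    return 1
  fi
  # Assumes bin is on the PATH and dbConnection is exported.
  load_db_scxa_analytics.sh
}
```

With `EXP_ID`, `EXPERIMENT_MATRICES_PATH` and `dbConnection` exported, calling `load_analytics_checked` would run the loader only when the expected file is present.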
Set `dbConnection` and `EXP_ID`, then run `delete_db_scxa_analytics.sh`:
export EXP_ID=...
export dbConnection=...
delete_db_scxa_analytics.sh
Run `bin/load_db_scxa_dimred.sh`. It requires the following environment variables:
| Variable name | Description |
|---|---|
| `EXP_ID` | Experiment accession |
| `DIMRED_TYPE` | The dimension reduction type, such as "UMAP" or "t-SNE"; the value is arbitrary and supplied by the user |
| `DIMRED_FILE_PATH` | TSV file with the coordinates |
| `DIMRED_PARAM_JSON` | Optional JSON array with the parameters used by the dimension reduction method (e.g. perplexity is typical for t-SNE, thus `[{"perplexity": 20}]`) |
| `dbConnection` | A Postgres connection string in the form `postgresql://{user}:{password}@{host:port}/{databaseName}` pointing to a Postgres 11 server where the expected `scxa_coords` and `scxa_dimension_reduction` tables exist |
It is recommended that `bin` is prepended to the `PATH`.
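To illustrate the optional `DIMRED_PARAM_JSON` value, here is a small sketch that builds the t-SNE example from the table above. The function `tsne_params_json` is a hypothetical helper introduced only for this example; the module itself simply reads the variable.

```shell
# Hypothetical helper: build the JSON array for a t-SNE run with one parameter.
tsne_params_json() {
  # $1: perplexity. Rejects empty values and anything containing characters
  # other than digits and dots (a loose check, not a full number parser).
  case "$1" in
    ''|*[!0-9.]*) echo "Perplexity must be a number: $1" >&2; return 1 ;;
  esac
  printf '[{"perplexity": %s}]\n' "$1"
}

export DIMRED_TYPE="t-SNE"
export DIMRED_PARAM_JSON="$(tsne_params_json 20)"
```

Quoting the assignment keeps the JSON intact when it contains spaces or braces.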
Set `dbConnection` and `EXP_ID`, then run `delete_db_scxa_dimred.sh`:
export EXP_ID=...
export dbConnection=...
delete_db_scxa_dimred.sh
Run `bin/load_db_scxa_cell_clusters.sh`. It requires the following environment variables:
| Variable name | Description |
|---|---|
| `EXP_ID` | Experiment accession |
| `EXPERIMENT_CLUSTERS_FILE` | Path to the file `<EXP-ID>.clusters.tsv` containing the clusters in wide format |
| `CONDENSED_SDRF_TSV` | Path to the condensed SDRF file of the experiment; it will be used to derive cell groups from the metadata, in addition to the clusters |
| `dbConnection` | A Postgres connection string in the form `postgresql://{user}:{password}@{host:port}/{databaseName}` pointing to a Postgres 11 server where the expected `scxa_cell_group` and `scxa_cell_group_membership` tables exist |
It is recommended that `bin` is prepended to the `PATH`.
Set `dbConnection` and `EXP_ID`, then run `delete_db_scxa_cell_clusters.sh`:
export EXP_ID=...
export dbConnection=...
delete_db_scxa_cell_clusters.sh
The script used in this step reads the table `scxa_cell_group` in order to load data into `scxa_cell_group_marker_genes` and `scxa_cell_group_marker_gene_stats`. Ensure you’ve run `load_db_scxa_cell_clusters.sh` as detailed above to successfully carry out this operation.
Run `bin/load_db_scxa_marker_genes.sh`. It requires the following environment variables:
| Variable name | Description |
|---|---|
| `EXP_ID` | Experiment accession |
| `EXPERIMENT_MGENES_PATH` | Path of marker genes files for transforming and loading |
| `CLUSTERS_FORMAT` | `ISL` or `SCANPY`; `ISL` is deprecated |
| `NUMBER_MGENES_FILES` | Optional hint on whether there are marker genes files to load (zero or a positive integer). If set to zero, the script will not fail even if no marker genes files are found. Currently the script only distinguishes 0 (no marker genes files) from a greater value (that number of marker genes files is expected); it could be extended to fail when the expected number of files is not found. |
| `dbConnection` | A Postgres connection string in the form `postgresql://{user}:{password}@{host:port}/{databaseName}` pointing to a Postgres 11 server where the expected `scxa_cell_group_marker_genes` and `scxa_cell_group_marker_gene_stats` tables exist |
It is recommended that `bin` is prepended to the `PATH`.
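One way to set `NUMBER_MGENES_FILES` is to count the files before calling the loader. The sketch below is an illustration only: `count_matching_files` is a hypothetical helper, and the file name pattern is left to the caller because this document does not state how marker genes files are named.

```shell
# Hypothetical helper: count regular files directly inside a directory whose
# names match a glob pattern. The right pattern depends on how your marker
# genes files are named, which this sketch does not assume.
count_matching_files() {
  # $1: directory, $2: glob pattern for file names
  find "$1" -maxdepth 1 -type f -name "$2" | wc -l | tr -d '[:space:]'
}
```

For example, `export NUMBER_MGENES_FILES=$(count_matching_files "$EXPERIMENT_MGENES_PATH" '<your-pattern>')`, where `<your-pattern>` is a placeholder you would replace with the naming scheme of your files.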
Set `dbConnection` and `EXP_ID`, then run `delete_db_scxa_marker_genes.sh`:
export EXP_ID=...
export dbConnection=...
delete_db_scxa_marker_genes.sh
Run `bin/load_exp_design.sh`. It requires the following environment variables:
| Variable name | Description |
|---|---|
| `CONDENSED_SDRF_FILE` | Path of the condensed SDRF file of the experiment being loaded |
| `SDRF_FILE` | Path of the SDRF file of the experiment being loaded |
| `dbConnection` | A Postgres connection string in the form `postgresql://{user}:{password}@{host:port}/{databaseName}` pointing to a Postgres 11 server where the expected `exp_design` and `exp_design_column` tables exist |
Currently, there’s no script to delete data from these tables. You can do it manually with the following SQL statements:
DELETE FROM exp_design WHERE exp_design_column_id IN (SELECT id FROM exp_design_column WHERE experiment_accession='E-FOO-123');
DELETE FROM exp_design_column WHERE experiment_accession='E-FOO-123';
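The two statements above can be generated for any accession and piped into `psql`. This is a sketch under stated assumptions: `exp_design_delete_sql` is a hypothetical helper, not a script shipped with this module, and the character check is only a guard against pasting arbitrary text into the SQL, not a validation of the real accession format.

```shell
# Hypothetical helper: print the two DELETE statements for one accession.
exp_design_delete_sql() {
  # Accept only letters, digits and hyphens so that quotes, semicolons or
  # newlines can never reach the SQL text.
  case "$1" in
    ''|*[!A-Za-z0-9-]*)
      echo "Unexpected accession: $1" >&2
      return 1
      ;;
  esac
  cat <<EOF
DELETE FROM exp_design WHERE exp_design_column_id IN (SELECT id FROM exp_design_column WHERE experiment_accession='$1');
DELETE FROM exp_design_column WHERE experiment_accession='$1';
EOF
}
```

A possible invocation is `exp_design_delete_sql E-FOO-123 | psql "$dbConnection" -v ON_ERROR_STOP=1 --single-transaction -f -`, which runs both statements in one transaction and stops at the first error.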
Once a number of experiments have been loaded, tables should be re-indexed:
# if not set, set the dbConnection
export dbConnection=...
reindex_tables.sh
Icons stored in the `collections` table can be consumed through the `lo_export` function within a `SELECT` statement, for instance:
SELECT lo_export(collections.icon, '/tmp/icon.png') FROM collections
WHERE coll_id = 'PHANTOM';
This is the preferred and most reproducible way of testing using containers. It requires docker to be installed:
bash run_tests_with_containers.sh
The first run will be expensive because of the image build; subsequent runs will use your cache. Every run leaves the PostgreSQL container running, with the schema and some dummy data loaded in the database. You can connect to that database locally with the following credentials:
dbConnection='postgresql://scxa:postgresPass@localhost:5432/scxa-test'
On every run of `run_tests_with_containers.sh`, the container database is deleted and re-created.
- Start an empty Postgres 11 database through a container or any other means:
docker run -e POSTGRES_PASSWORD=lasdjasd -e POSTGRES_USER=user -e POSTGRES_DB=scxa-test -p 5432:5432 -d postgres:11-alpine
- Build and export the appropriate `dbConnection` environment variable for the Postgres database you have just started.
export dbConnection=postgresql://user:lasdjasd@localhost:5432/scxa-test
- Execute `bash tests/run_tests.sh`
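The connection string in the steps above follows the form `postgresql://{user}:{password}@{host:port}/{databaseName}`. As a sketch, it can be assembled from the values passed to `docker run`; `build_db_connection` is a hypothetical helper introduced here for illustration only.

```shell
# Hypothetical helper: assemble a connection string in the form
# postgresql://{user}:{password}@{host:port}/{databaseName}.
# Special characters in the password are NOT percent-encoded here.
build_db_connection() {
  # $1 user, $2 password, $3 host, $4 port, $5 database name
  printf 'postgresql://%s:%s@%s:%s/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Same values as in the docker run command above.
export dbConnection="$(build_db_connection user lasdjasd localhost 5432 scxa-test)"
```

Keeping the pieces separate makes it harder for the exported string to drift from the values the container was started with.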
Tests are automatically executed via GitHub Actions.
The container is available at quay.io/ebigxa/db-scxa-module, tagged latest (recommended) or with any of the tags after 0.2.0. It can be used like this, for example:
docker run -v /local_data:/data \
-e dbConnection=<your-database-connection-string-for-postgres> \
-e EXP_ID=<the-accession-of-experiment> \
-e EXPERIMENT_CLUSTERS_FILE=<path-inside-container-for-clusters-file> \
--entrypoint load_db_scxa_cell_clusters.sh \
quay.io/ebigxa/db-scxa-module:latest
Please note that `EXPERIMENT_CLUSTERS_FILE` must be a path that is valid inside the container, given how you mount data into it. You can change the entrypoint and the environment variables to use the other scripts mentioned above.
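Because the path must be valid inside the container, it can help to derive it from the host path and the `-v` mapping. The following sketch assumes the `/local_data:/data` mount from the example above; `container_path` is a hypothetical helper, not part of the image.

```shell
# Hypothetical helper: translate a host path under the mounted directory
# into the path the same file has inside the container.
container_path() {
  # $1 host file, $2 host mount point, $3 container mount point
  case "$1" in
    "$2"/*) printf '%s/%s\n' "$3" "${1#"$2"/}" ;;
    *) echo "$1 is not under $2" >&2; return 1 ;;
  esac
}
```

For instance, `container_path /local_data/E-FOO-123.clusters.tsv /local_data /data` prints `/data/E-FOO-123.clusters.tsv`, which is the value to pass as `EXPERIMENT_CLUSTERS_FILE`.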