Scanpy Scrublet runs on all samples by default; best practice is to run it per sample #283

pcm32 · 2023-01-27T14:03:23Z

Currently, in the main EBI SC Expression Atlas tertiary pipeline, we run Scrublet as a single process over all samples. When a batch variable is defined, Scrublet is run per batch, which is more correct. The batch case may be acceptable (it is uncommon for SC Expression Atlas datasets), but the base case, where the whole dataset is processed at once, is certainly not ideal.

As a first approach, this could be fixed by having the Galaxy wrapper accept both a sample_variable (the header in obs where the samples are defined) and a batch_variable; when the second is given, it overrides the first. If the batch variable is not given, Scrublet runs per sample by default. If neither is given, Scrublet should of course run as it does now, since it has no way of knowing how to partition the dataset. In this setup, scanpy will run Scrublet serially. It would be great if scanpy could do this in parallel, but that would require upstream code that we don't control.
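The precedence rule described above could be sketched roughly as follows. This is only an illustration, not the wrapper's actual code: the function name `resolve_partition_key` is hypothetical, and the parameter names simply mirror the sample_variable and batch_variable proposed here. The returned key is meant to be handed to whatever argument the Scrublet call uses for partitioning (assumed here to be scanpy's `batch_key`).

```python
from typing import Iterable, Optional


def resolve_partition_key(
    obs_columns: Iterable[str],
    sample_variable: Optional[str] = None,
    batch_variable: Optional[str] = None,
) -> Optional[str]:
    """Pick the obs column used to partition cells before doublet detection.

    Hypothetical helper: batch_variable overrides sample_variable, and
    None means "run on the whole dataset at once" (the current behaviour).
    """
    columns = set(obs_columns)
    # Check the batch variable first so that it takes precedence.
    for key in (batch_variable, sample_variable):
        if key:
            if key not in columns:
                raise KeyError(f"'{key}' is not a column in obs")
            return key
    # Neither variable was given: there is no way to partition the dataset.
    return None


# Assumed usage: pass the result on as the partitioning key, e.g.
#   key = resolve_partition_key(adata.obs.columns, sample_variable="sample")
#   then run Scrublet with batch_key=key
```

With this rule, a dataset that only declares a sample column would be processed per sample, while one that also declares a batch column would keep the existing per-batch behaviour.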

pcm32 added the persist-seq Requests from Persist-Seq label Jan 27, 2023
