
Scanpy Scrublet runs on all samples by default; best practice is to run it per sample #283

Open
pcm32 opened this issue Jan 27, 2023 · 0 comments
Labels
persist-seq Requests from Persist-Seq

Currently, in the main EBI SC Expression Atlas tertiary pipeline, we run Scrublet as a single process across all samples. When we define a batch variable, Scrublet is instead run per batch, which is more correct. The batch case may be acceptable (it is not the usual case for most SC Expression Atlas datasets), but the base case, where the whole dataset is processed at once, is certainly not ideal.

As a first approach, this could be fixed by having the Galaxy wrapper accept both a sample_variable (the obs column where the samples are defined) and a batch_variable. When the batch variable is given, it overrides the sample variable. When only the sample variable is given, Scrublet runs per sample by default. If neither is given, Scrublet should of course run as it does now, since it has no way of knowing how to partition the dataset. In this setup, scanpy will run Scrublet serially over the partitions (it would be great if scanpy could do this in parallel, but that depends on upstream code that we don't control).
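The precedence rule described above could be sketched roughly as follows. This is only an illustration, not the wrapper's actual code: `resolve_partition_key` and `group_cells` are hypothetical helper names, and the assumption is that the chosen obs column would then be handed to scanpy's Scrublet wrapper as its batch key (or left unset to run on the whole dataset).

```python
from typing import Dict, List, Optional, Sequence


def resolve_partition_key(
    sample_variable: Optional[str] = None,
    batch_variable: Optional[str] = None,
) -> Optional[str]:
    """Choose the obs column used to partition cells before running Scrublet.

    Hypothetical helper; parameter names mirror the ones proposed in this issue.
    """
    # The batch variable, when given, overrides the sample variable.
    if batch_variable:
        return batch_variable
    # Otherwise default to running per sample.
    if sample_variable:
        return sample_variable
    # Neither given: nothing to partition on, so run on the whole dataset.
    return None


def group_cells(labels: Sequence[str]) -> Dict[str, List[int]]:
    """Group cell indices by their label in the chosen obs column."""
    groups: Dict[str, List[int]] = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    return groups


if __name__ == "__main__":
    key = resolve_partition_key(sample_variable="sample", batch_variable=None)
    print(key)
    print(group_cells(["s1", "s2", "s1"]))
```

Each group returned by `group_cells` corresponds to one serial Scrublet run in this sketch; returning `None` from `resolve_partition_key` keeps the current whole-dataset behaviour.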

@pcm32 pcm32 added the persist-seq Requests from Persist-Seq label Jan 27, 2023