Add Seurat first version #2047

mblue9 · 2018-08-20T01:47:20Z

This PR adds Seurat, a tool for exploring single-cell data. It takes a gene count matrix as input and outputs a PDF of plots.

FOR CONTRIBUTOR:

- I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
- License permits unrestricted use (educational + commercial)
- This PR adds a new tool or tool collection
- This PR updates an existing tool or tool collection
- This PR does something else (explain below)

FOR REVIEWER:

mblue9 · 2018-08-20T03:22:35Z

This is another one that passes fine locally but Travis is failing :( The error is

unable to load shared object '/home/travis/conda/envs/mulled-v1-5e8ede748527129d5a4d1b84ca2b5f88dcf7cde8cdf27677640f438584dc0fbe/lib/R/library/caret/libs/caret.so':
|  libgfortran.so.4: cannot open shared object file: No such file or directory

mblue9 · 2018-08-20T23:30:04Z

Thanks for the new conda package @bgruening!

Now there's the no output received error, do you know what I can do about that?

2018-08-20 22:24:26,644 DEBUG [galaxy.tools.deps.conda_util] Executing command: /home/travis/conda/bin/conda create -y --override-channels --channel iuc --channel bioconda --channel conda-forge --channel defaults --name mulled-v1-d0fcfd537b3822de7f96303838d9da1126440e4f0c08a203b711ba1ccb48bc32 r-seurat=2.3.4 bioconductor-singlecellexperiment=1.0.0 r-optparse=1.6.0
No output has been received in the last 20m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
The build has been terminated

bgruening · 2018-08-21T06:29:01Z

Yeah, got stuck with this yesterday. The problem is again the solver :( I need more time for this.

mblue9 · 2018-08-21T06:40:51Z

Ok no worries, I have plenty of other work to keep me busy in the meantime!

bgruening · 2018-08-26T19:58:39Z

Sorry, this took me a while and needed some upstream work. The conda people currently have a huge solver rework in progress, so things should improve in the coming weeks.

Thanks @mblue9!

mblue9 · 2018-08-26T22:30:08Z

Great! Thanks a lot for working on it @bgruening !!

pcm32 · 2018-08-28T15:20:23Z

@mblue9 can you check this out: https://github.com/ebi-gene-expression-group/r-seurat-scripts (this is Bjoern; I hijacked @pcm32's notebook)

mblue9 · 2018-08-28T21:49:08Z

Nice! Thanks @pcm32! Do you mean I should wrap these wrapper scripts instead of Seurat itself, or use those scripts as a reference? At the moment I'm working on adding the method for aligning two samples (the RunCCA function), which isn't covered in those wrappers yet from what I can see.

bgruening · 2018-08-28T21:56:42Z

@mblue9 this is up for discussion I think. @pcm32 currently planned to split the tools up into the separate scripts. Let's discuss this here. ping @mtekman as well.

mblue9 · 2018-08-28T22:14:15Z

Well, this more modular approach of separate scripts could be better, giving more flexibility and less ugliness. The way I've added the RunCCA function into the script I wrote (https://github.com/mblue9/tools-iuc/blob/seurat_add_cca/tools/seurat/seurat.R) is currently ugly, I think, and it'd probably be good to add RunMultiCCA for multiple datasets as well.

pcm32 · 2018-08-29T05:36:28Z

Hi @mblue9! Great that you're also working on this! We are aiming to decompose several tertiary analysis tools (Seurat, sc3, scanpy, scater, etc.) into their main separate methods, so that in the long term users can cherry-pick and match different analysis steps from the different packages. The work that @bgruening pointed to is the substrate for the bioconda package for the proposed seurat scripts (where we could add the missing step that you propose). So yes, we aim to write the Galaxy wrappers for each individual script, but since you are already on it, it would be great to join efforts and avoid duplication. Having functionality modularized like this would also mean that we can easily divide up the wrappers to write 😀. We are of course open to reshaping the modularisation if you have any thoughts on that as well. I'll redirect the other people that are working on this with me to this thread.

mblue9 · 2018-08-29T06:23:55Z

@pcm32 that sounds like a great plan! Definitely would be keen to join forces and avoid duplicating effort. And working on the wrappers together would be great. Count me in! 👍

pinin4fjords · 2018-08-29T16:36:44Z

Hi all! I'm Jon, @pcm32 's colleague, and I've started to publish this work via PRs (thanks @bgruening for the review). See in particular bioconda/bioconda-recipes#10747.

We also have some draft documentation at https://tertiary-workflows-docs.readthedocs.io/en/latest/ if you're interested in our general strategy. We're trying to be very consistent in our approach, so we're writing some guidelines, e.g. for the R package wrappers: https://tertiary-workflows-docs.readthedocs.io/en/latest/scripts_for_r_packages.html

Very happy to receive feedback and work with people on this.

bgruening · 2018-08-29T17:23:23Z

@pinin4fjords this is awesome! Just saying :)

mblue9 · 2018-08-30T02:59:12Z

Was wondering... if the aim for these seurat tools, and other single-cell tools, is to create one Galaxy wrapper per R function, then should we try using @blankenberg's r2g2 tool to help automate creating the wrappers and for consistency?

I've just tried it out and now have >300 Seurat tools! 😄

I've put just the ones that are currently in the r-seurat-scripts in my repo here if you want to see: https://github.com/mblue9/tools-iuc/tree/seurat_r2g2/tools/seurat_r2g2.

They need some cleanup, e.g. removing options not available in r-seurat-scripts, and every argument currently requires the input type to be specified (e.g. integer), see below.

But what do people think, should we pursue this automated way to create the wrappers? It could maybe be used for Seurat, scater, SC3, etc.

mtekman · 2018-08-30T05:38:23Z

Wow I had no idea that tool existed! I was toying with the idea of scripting the process too after spending too much time manually copying and pasting help text and default parameters into the XML.

I am 100% for this.

mtekman · 2018-08-30T05:55:07Z

The tool might need to be extended to parse a script rather than an entire library though. I made a feature request with this in mind.

blankenberg/r2g2#2

I will play with this today to see how much it speeds up filling in param labels, help text, and default values.

mblue9 · 2018-08-30T06:11:52Z

Great! Thanks @mtekman ! That would be really good to know how it compares to what you've already done.

I tried r2g2 because I'd been looking at how many functions I'd already included in the Seurat tool, and wrapping them all individually would mean a lot of wrappers: ~12 currently in r-seurat-scripts plus another ~20 here. And that's just for Seurat, not counting other single-cell tools, so if you have any ideas for automating or speeding up the creation of wrappers for these tools, that would be great!

pinin4fjords · 2018-08-30T06:54:41Z

Thanks @mtekman @mblue9, I had no idea about r2g2 either. I'm delighted with any process that makes this easier, and I was wondering if there was a way I could automate the wrapper script creation.

r-seurat-scripts and the related packages we're working on are designed to be stand-alone, i.e. independent of Galaxy, so that the scripts are available as components for any workflow system. From our point of view, we'd have to think about the best way of achieving that alongside, and consistently with, the Galaxy wrappers made by r2g2, but it seems there must be a way of doing that.

Obviously I'm not married to the way we're doing this now; if there is a way of doing it in an automated way, I'll be very happy.

mtekman · 2018-08-30T08:19:10Z

@mblue9 The r2g2 tool at the moment does not seem to autofill labels, help text, or defaults, so it is currently not that helpful in expediting the XML writing, which for me is the most time-consuming part.

I propose the development of a new tool which would parse help() and formals() text better. From what I have seen, most Bioconductor-worthy packages are well documented, so creating something like this should be reasonably straightforward.
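As a rough illustration of what such a tool would need to emit, here is a minimal Python sketch that turns already-parsed argument metadata into Galaxy `<param>` elements. The `ARGS` records and the `galaxy_type` and `to_param` helpers are hypothetical and only for illustration; in practice the names, defaults, and help strings would come from formals() and help() on the R side, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata, standing in for what formals()/help() would yield in R.
ARGS = [
    {"name": "min_cells", "default": 3, "help": "Keep genes detected in at least this many cells"},
    {"name": "scale_factor", "default": 10000.0, "help": "Scale factor for normalisation"},
    {"name": "do_plot", "default": False, "help": "Draw a diagnostic plot"},
    {"name": "method", "default": "LogNormalize", "help": "Normalisation method"},
]


def galaxy_type(default):
    """Guess a Galaxy param type from the type of the declared default."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(default, bool):
        return "boolean"
    if isinstance(default, int):
        return "integer"
    if isinstance(default, float):
        return "float"
    return "text"  # fallback for strings and anything unrecognised


def to_param(arg):
    """Build one <param> element with label, help text, and default filled in."""
    ptype = galaxy_type(arg["default"])
    attrs = {
        "name": arg["name"],
        "type": ptype,
        "label": arg["name"].replace("_", " ").capitalize(),
        "help": arg["help"],
    }
    if ptype == "boolean":
        attrs["checked"] = "true" if arg["default"] else "false"
    else:
        attrs["value"] = str(arg["default"])
    return ET.Element("param", attrs)


inputs = ET.Element("inputs")
for arg in ARGS:
    inputs.append(to_param(arg))

xml_text = ET.tostring(inputs, encoding="unicode")
print(xml_text)
```

Guessing the type from the default alone is the simplest possible rule; it would not cope with arguments that accept several kinds of value, so the output is best treated as a starting point to edit by hand.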

pcm32 · 2018-08-30T09:12:17Z

Using r2g2 sounds like a great addition! Besides stressing the fact that we want bioconda packages being Galaxy agnostic (I'm sure we can reuse results from r2g2 for the bioconda packages), so that they can be used with other workflow environments, I would add the following:

- We should put a reasonable editorial effort into deciding what is shown in the end in Galaxy, as there will be functionality that might be unnecessary to add (maybe internal methods). Having too many methods per tool (which is a temptation when using something like r2g2) will also confuse users. >300 Seurat tools becomes unusable in my view. I guess this could also be handled at the level of the Galaxy instance setup, but making decisions ahead of time might save us time developing/refurbishing wrappers. We should probably point to the core functionalities, instead of being driven by the choice of individual methods made by the implementers of these tools (which will probably differ from tool to tool and again will confuse users).
- In the second part of our project, we want to enable interaction between methods of different tools, which in many cases will mean adding the ability to output data in an "exchange" or "neutral" format, and also to read it as input. This should go inside the bioconda package as well. Galaxy-wise, one could consider whether the data transformation step would happen inside the tool or as a separate box. My personal taste is to make this part of each Galaxy tool, which keeps the flow or narrative of the workflow much closer to the main steps of the analysis; having plenty of transformation steps in the middle of the workflow makes the workflow less clear.
- Another thing that normally fails with automated tool writing is usability (UX). We need to see whether the generated wrappers are adequate in terms of usability; we might find that we are better off with 5, 10 or 15 modules for Seurat that are manually written but focus on core functionality and are truly useful for users, instead of several tens that are difficult to understand.

So I would harness the power given by r2g2 to produce scripts that we can add to the bioconda packages, but then work on top of that to trim to relevant functions and cherry pick Galaxy wrappers for important functionality instead of just any method available inside.

pcm32 · 2018-08-30T09:23:29Z

What is your opinion on this @bgruening?

mtekman · 2018-08-30T09:33:36Z

@pcm32 If I can jump in on your comment -- I have also come across this issue of having too many methods floating around for different parts of the analyses.

In #1841 I suggested binding all methods into one of the 4 main stages of processing (Filtering, Normalisation, Confounder Removal, and Clustering). Users would then choose a method via a dropdown box.

This has the advantage of keeping all tools of the same type within the same wrapper and not cluttering the Galaxy tool search. At the time I was trying to put these 4 stages into one single tool (i.e. the user chooses the filtering method, the normalisation method, the confounder removal method, and the clustering method -- each stage being optional so that the user could skip a stage if they wanted to re-analyse the output RDS), in order to circumvent the problem of creating an exchange format between stages.

pcm32 · 2018-08-30T09:41:26Z

Agreed, those functionalities (give or take a few) are what we should be going after (and then in the same coherent way for other tools, making it easier for users; so you have F, N, CR and Clustering for Seurat, SC3, scanpy, etc where available).

I would avoid putting all functionalities together, because then you cannot cherry-pick the different functionalities from the different tools and use them together (i.e. use filtering from Seurat with normalization from sc3 and clustering from scanpy, to name an example).

pcm32 · 2018-08-30T09:51:09Z

@mtekman I think that #1841 is great and really aligns with the view that we have. The main difference would be to add the R scripts to the bioconda -scripts package instead of having them right next to the Galaxy wrapper, so that the work can be re-used in a similar way in other workflow environments. Would you be happy to move in that direction and collaborate at the level of both the Galaxy wrappers and the bioconda packages? Here at the Genome Campus we are 3 people, possibly 4, working part time on this project, and we will get two more people at the Sanger.

pcm32 · 2018-08-30T09:52:25Z

Besides Seurat, I think we have made some progress on scater, and maybe on some other packages. Maybe @pinin4fjords can fill in more details.

pinin4fjords · 2018-08-30T09:58:44Z

Yep, we've got most of the way through Scater too: https://github.com/ebi-gene-expression-group/bioconductor-scater-scripts/tree/devel.

There are differences with @mtekman 's efforts, e.g. we have more content to do with argument parsing etc (since the scripts are intended to be standalone, so need more UI), but hopefully we could converge.

mtekman · 2018-08-30T12:31:44Z

@mblue9 I have started a small skeleton repo for what an Rscript2Galaxy tool should ideally look like https://github.com/mtekman/rscript2galaxy

@pcm32 it makes sense to keep the core R scripts wrapper independent. I would most certainly be happy to collaborate in that direction, and definitely agree that having things set in bioconda first before wrapping them in Galaxy makes the most sense.

@pinin4fjords I have also been in discussion with @ethering on continuing the scater modules, but I agree it would be better if we built upon your scripts before resuming any work.

blankenberg · 2018-08-30T15:33:55Z

Hi Guys, interesting things going on here :)

With R2G2 the idea was to create a generic tool that creates Galaxy tools that are fully functional. This does create a situation where for each formally declared function option, we can determine a default value and type, but due to overloading we need to allow setting other options; e.g. a distance option might have a default value set with a metric name declared as a string, but it could also accept a function, an object containing distance values, etc. Early versions of r2g2 only allowed setting Galaxy parameters based upon the declared formal default, which quickly became too limiting with the tested packages. This is why there is a dropdown for each parameter, and also a repeat for the ellipsis. Just one of the 'problems' with having a generic tool that is designed to work in all cases. It probably makes sense to make this a configurable option: 'dumb mode' where it only lets setting values based upon the type of the declared default (in many cases this could be subjectively considered smarter), and 'technically correct' (the best kind of correct) which it is doing now.

Now if you have a priori knowledge of the actual tools and packages that you want to create, it makes sense to do things in a more focused fashion. For example, I have a Galaxy tool generator focused directly on anvi'o (50 or so Python-based tools), and because of its focused nature, the tools it creates are more-or-less 'production quality', whereas, as noted, having 300 r2g2 tools for a single R package might not have the same user-friendliness factor. They can often make great starting points, however.

I am a huge fan of automated approaches for many reasons, and I think it makes lots of sense to use them when the expected number of tools exceeds a certain count. Arbitrarily, let's just say 10 tools or so: spend an hour per tool manually creating Galaxy XML files, or spend 10 hours writing a conversion script? If you know what you want the tools to end up looking like, and are familiar with the specific use-cases, writing the script almost certainly wins out -- a huge bonus here is dealing with version updates: simply rerun the script and have new updated versions of the tools in just a few seconds.
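To make the trade-off concrete, here is a small Python sketch of that arithmetic. The one hour per tool and ten hours per script are the arbitrary figures from above; the function names, and the assumption that every later release costs the full manual effort again while a script rerun costs nothing, are purely illustrative.

```python
def manual_cost(n_tools, n_releases=1, hours_per_tool=1.0):
    """Hours spent hand-writing (and re-writing) one wrapper per tool per release."""
    return n_tools * n_releases * hours_per_tool


def script_cost(n_releases=1, script_hours=10.0, rerun_hours=0.0):
    """Hours spent writing the generator once, then rerunning it for later releases."""
    return script_hours + (n_releases - 1) * rerun_hours


# Compare both approaches for a few tool counts, for a single release.
for n_tools in (5, 10, 30):
    print(n_tools, manual_cost(n_tools), script_cost())
```

With these figures the two approaches break even at 10 tools for a single release, and the script pulls further ahead with every additional release. In reality a manual update is usually cheaper than a first write, so this overstates the manual side somewhat.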

That being said, let me know what I can do to help.

bgruening · 2018-08-30T20:14:33Z

Hi all,

sorry for being late to the game here, had an unpleasant day. But I'm glad you are all now here! Really awesome!

> Another thing that normally fails with automated tool writing is usability (UX). We need to see whether the generated wrappers are adequate in terms of usability; we might find that we are better off with 5, 10 or 15 modules for Seurat that are manually written but focus on core functionality and are truly useful for users, instead of several tens that are difficult to understand.

Totally agree. What Galaxy can gain from really well-designed tools is a great UX and fewer failures in data analysis. We have been using scripts to automatically generate Galaxy integrations for years now, for example in OpenMS or with our python-argparse converter. My experience is very mixed. For OpenMS, with over 100 tools, it's really the only viable option if you don't have a huge community behind it, but the wrappers are not perfect, the UX is bad and the upgrade process is even worse, especially once you start fixing bugs. But as @blankenberg already said: "They can often make great starting points, however." :)

The ideal solution for me would be, as Mehmet already pointed out, one sc-filtering tool, one sc-normalization tool, and so on. The user should not have to care about the underlying method, but should be able to pick a method out of many. The big picture is what counts, I think, and here that is filtering, not the name of the underlying package. That said, this is only possible if we find a way to store the intermediate data in a format that can be understood by all tools, and I guess we should spend a few days finding or defining this format. Some hdf5 dialect like the loom format would be nice to explore: http://linnarssonlab.org/loompy/ and https://satijalab.org/loomR/loomR_tutorial.html. It would be really cool if we could offer a scNormalisation tool in Galaxy that offers all the different normalization methods that the Python and R communities come up with, and we simply consume and write loom files :)
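A minimal Python sketch of this "one tool per stage, many methods" idea follows. Everything in it is hypothetical: the two normalisation functions are toy stand-ins rather than the Seurat, scater, or scanpy implementations, and the in-memory list-of-lists matrix merely stands in for a shared exchange format such as loom.

```python
import math

# Toy count matrix: rows are genes, columns are cells.
# Stand-in for data read from a shared exchange format.
COUNTS = [[0, 4, 2], [10, 6, 8]]


def cpm(matrix):
    """Counts per million: scale each cell (column) so that it sums to 1e6."""
    totals = [sum(col) for col in zip(*matrix)]
    return [[v / t * 1e6 for v, t in zip(row, totals)] for row in matrix]


def log1p_counts(matrix):
    """Natural log of (count + 1) for every entry."""
    return [[math.log1p(v) for v in row] for row in matrix]


# The stage owns the registry; the user only picks a key (the "dropdown").
NORMALISATION_METHODS = {"cpm": cpm, "log1p": log1p_counts}


def normalise(matrix, method):
    """Run the chosen normalisation method on a genes-by-cells matrix."""
    try:
        func = NORMALISATION_METHODS[method]
    except KeyError:
        choices = sorted(NORMALISATION_METHODS)
        raise ValueError(f"unknown method {method!r}; choose from {choices}") from None
    return func(matrix)


print(normalise(COUNTS, "cpm"))
```

Because every method takes and returns the same matrix shape, a filtering stage backed by one package could feed a normalisation stage backed by another, which is the reason the common intermediate format matters.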

That said, I'm happy about supporting whatever we decide here - I'm just super excited about the number of people that would like to push scRNA in Galaxy - this is exciting.

For people that are interested in how we can create a matrix, @mtekman has created a pipeline for this and has described the UMI handling and much more in a Galaxy training material: galaxyproject/training-material#969. Comments welcome!

pinin4fjords · 2018-08-30T21:37:26Z

@bgruening that's quite exciting. Our project actually concerns intermediate formats for tertiary analysis tools as a key objective, but it's a complex issue and I can see it providing quite a few challenges, so a community-driven solution would be really awesome.

What you're talking about could dovetail perfectly- I can envisage a 'filtering' galaxy tool that simply calls the appropriate script from the relevant *-scripts bioconda package. Even if we ended up using different single-tool Galaxy wrappers for our own internal objectives, just getting those intermediate formats agreed and working would be a big achievement- and yes, hdf5 had crossed our minds.

Should we move this discussion somewhere away from this random PR thread :-P. Slack anyone?

nsoranzo · 2018-08-31T00:16:45Z

@pinin4fjords I think a separate issue in this repository would be the best place to keep track of the various ongoing efforts. For quick chats, we use https://gitter.im/galaxy-iuc/iuc

mblue9 · 2018-08-31T04:55:50Z

I've created an issue in the IUC repo here if you want to use that to continue the discussion. I tried to summarise the main points but not sure I got them all so feel free to edit!

mblue9 · 2018-08-31T05:31:35Z

I've created that other issue but in case you don't want to use that I'll respond here to some of the points raised.

I was not proposing to add 300 Seurat functions, a few done well is definitely better! With the automated approach I was seeing it as a starting point.

> @mblue9 I have started a small skeleton repo for what an Rscript2Galaxy tool should ideally look like https://github.com/mtekman/rscript2galaxy

That tool looks cool!

> I propose the development of a new tool which would parse help() and formals() text better. From what I have seen, most bioconductor-worthy packages are well documented, so creating something like this should be reasonably straight forward.

Yes, I was also thinking mainly of Bioconductor in terms of automation here. Having worked on a few Bioconductor tools now, copying and pasting info that's already provided by Bioconductor just doesn't seem efficient. Parsing their packages to create a starting point for wrappers could be great for a number of reasons, imo. Saving time is one thing, but I think consistency and standardisation are more important, and this could help with both. It could potentially also help any interested Bioconductor tool authors create Galaxy wrappers more easily: their focus is R, and while creating Galaxy wrappers is 'easy', auto-creating a wrapper to start from could lower the barrier to entry for them.

suhaibMo · 2018-09-25T15:14:19Z

Hi all, I'm Suhaib (@suhaibMo), working with @pcm32 and @pinin4fjords. I previously wrote a few R scater wrapper scripts for data processing functions (https://github.com/ebi-gene-expression-group/bioconductor-scater-scripts) that have been integrated into the Bioconda recipes (https://github.com/bioconda/bioconda-recipes/tree/7d1f13c7f91fc65ed235eb4b860cfdb0287ab082/recipes/bioconductor-scater-scripts). I'm now moving on to writing Galaxy wrappers for Scater; as a newbie, I'm still familiarising myself with the process and the XML schema. However, I understand @mtekman is also planning to write Galaxy wrappers for scater? I aim to have wrappers for the following scater scripts:

- scater-read-10x-results.R
- scater-normalize.R
- scater-calculate-cpm.R
- scater-extract-qc-metric.R
- scater-calculate-qc-metrics.R
- scater-is-outlier.R

If anyone is planning to write any of the above, or already has work ongoing, could you please ping me so that we don't duplicate effort and can perhaps re-use work? Thanks!

mblue9 added 3 commits August 20, 2018 11:42

Add Seurat first version

b80cd84

Fix .shed.yml

44472e2

Small fixes

5db7214

mblue9 and others added 3 commits August 20, 2018 13:49

Make test data smaller

ef0a6f8

Remove old test file

ba7eda3

use seurat 2.3.4

99f5b7e

mblue9 closed this Aug 20, 2018

mblue9 reopened this Aug 20, 2018

we need to help the solver ....

e4307ac

until we can use a newer conda version

bgruening merged commit 24c0223 into galaxyproject:master Aug 26, 2018

pcm32 mentioned this pull request Aug 29, 2018

Make a pr to bioconda soon to enable more fluid collaboration ebi-gene-expression-group/r-seurat-scripts#3

Closed

mblue9 mentioned this pull request Aug 30, 2018

Update r2g2 to Python 3 so can use rpy2 blankenberg/r2g2#1

Merged

pinin4fjords mentioned this pull request Aug 30, 2018

Add more Seurat functions? ebi-gene-expression-group/r-seurat-scripts#4

Open

mblue9 mentioned this pull request Aug 31, 2018

scRNA-Seq Workflows #2057

Open

pcm32 mentioned this pull request Dec 4, 2018

[WIP] Seurat divided in several modules #2195

Closed

20 tasks

pcm32 mentioned this pull request Jan 9, 2019

[WIP] Merges efforts of modular seurat wrappers (artbio-mblue9-ebi-sanger) #2231

Closed

20 tasks
