JupyLab Tool Issues #6326

Open
nomadscientist opened this issue Sep 13, 2024 · 11 comments

@nomadscientist

JupyterLab/Jupyter tool issues collected while preparing for the 2024 Bioinfo Bootcamp and trying to get these tutorials to work after updates:

Please see notes here about get/put issues & installation issues & kernel switching
galaxyproject/training-material#5297

Possible ideal future = have a single-cell Jupyter tool (one that auto-updates with any updates to the main tool) but with the 'slow' packages pre-installed...

@nomadscientist
Author

@wee-snufkin
@FoxHin5431

@bgruening
Member

Can you please test this again with a fresh Jupyter notebook?

@nomadscientist
Author

nomadscientist commented Oct 10, 2024

OK -

  1. We need to calculate how much resource each step in the notebook takes. Can you point me to a screenshot or FAQ on how to do this?
  2. One key time-suck in this tutorial is that users have to download salmon (wget -nv https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz), unpack it (tar -xvzf salmon-1.10.0_linux_x86_64.tar.gz), and also run this (conda install -y -c conda-forge -c bioconda atlas-gene-annotation-manipulation), which takes a while.

Can these just be pre-installed in the Jupyter Notebook to avoid these long steps? We'll keep the code in a comment box so users can apply it to other scenarios, like Google Colab, but it would speed up usage a lot if we didn't have to install these every time.
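
Until the tools are pre-installed, one way to avoid repeating the slow steps is to check whether an executable is already on the PATH before installing. The following is a minimal sketch, not the tutorial's actual code: the function name `pending_install_steps` and the tool-to-command mapping are hypothetical, and only the three commands themselves come from the comment above. Note that unpacking the salmon tarball does not by itself put `salmon` on the PATH, so the check only helps where the image or an earlier step has done that.

```python
import shutil

# Commands quoted in the comment above. Grouping them per tool is an
# assumption made for this sketch.
INSTALL_STEPS = {
    "salmon": [
        "wget -nv https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz",
        "tar -xvzf salmon-1.10.0_linux_x86_64.tar.gz",
    ],
    "atlas-gene-annotation-manipulation": [
        "conda install -y -c conda-forge -c bioconda atlas-gene-annotation-manipulation",
    ],
}


def pending_install_steps(executables, which=shutil.which):
    """Return install commands only for tools that are not yet available.

    executables maps a tool name (a key of INSTALL_STEPS) to the name of
    an executable expected on PATH once that tool is installed.
    """
    steps = []
    for tool, exe in executables.items():
        if which(exe) is None:  # executable not found, so keep its steps
            steps.extend(INSTALL_STEPS[tool])
    return steps


# Print whatever still needs to be run in this environment.
for command in pending_install_steps({"salmon": "salmon"}):
    print(command)
```

The commands are only printed here; in a notebook they could be pasted into a bash cell, which keeps the install code visible for users on other platforms.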

@FoxHin5431

Hi Wendi,
Would something like this work? I only did it for the bash parts of the Alevin tutorial, as each kernel needs a different way to do this (unless there is a system behind the notebook that I can't access). I wrote some code that logs CPU and memory usage over time, and then some code that compares the start and finish times of each cell run against the logged CPU and memory usage.
[Images: memory_usage_no_bars, memory_usage_with_bars, cpu_usage_with_bars]
[Attachments: memory_usage.log, task_duration_data.xlsx]

*I don't trust the cell 9 times - the browser kept crashing, which is not Jupyter's fault.

Best

Mark

Log files: cell_log.log, cpu_usage.log
[Image: cpu_usage_no_bars]
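
The comparison step described above (matching cell start and finish times against logged usage) could look roughly like the sketch below. This is not Mark's code, which is not shown in the thread; the function name `peak_per_cell`, the tuple layouts, and the demo values are all assumptions made for illustration.

```python
from datetime import datetime


def peak_per_cell(samples, cells):
    """Return the highest logged value seen while each cell was running.

    samples: iterable of (timestamp, value) pairs from a usage log.
    cells:   iterable of (label, start, finish) tuples from a cell log.
    A cell with no samples inside its window maps to None.
    """
    samples = list(samples)
    peaks = {}
    for label, start, finish in cells:
        # Keep only the samples taken between the cell's start and finish.
        in_window = [value for ts, value in samples if start <= ts <= finish]
        peaks[label] = max(in_window) if in_window else None
    return peaks


# Illustrative values only, not measurements from this thread.
demo_samples = [
    (datetime(2024, 1, 1, 12, 0, 0), 1.2),
    (datetime(2024, 1, 1, 12, 0, 5), 3.4),
    (datetime(2024, 1, 1, 12, 0, 10), 2.0),
]
demo_cells = [
    ("cell_a", datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 0, 6)),
]
print(peak_per_cell(demo_samples, demo_cells))
```

If the sampling interval is longer than a cell's run time, that cell can end up with no samples at all, which is why short cells may need a finer logging interval.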

@nomadscientist
Author

Hmm, am I understanding this correctly that if we manage to remove the GTF annotation step (or, perhaps, run it but only on 1 chromosome or something similar) then we will substantially reduce the resources? I'm surprised, as I thought Alevin was the killer.
Regardless, the memory used plot - is this what's kicking over the 4GB allotment per notebook? Although it's expressed as a % so I am not sure if I'm interpreting it correctly. Can you express those plots as raw data Mark?
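
On reading the percentage plots: if the percentage is a share of the machine's total memory (an assumption, since the thread does not say what the logging code measured), the conversion to raw figures is a single multiplication. A small sketch, with `percent_to_gb` as a hypothetical helper name:

```python
def percent_to_gb(percent_used, total_gb):
    """Convert a memory percentage into GB, given the total memory in GB.

    Assumes percent_used is a share of total_gb. The total has to come
    from the machine that produced the log; it is not stated in the thread.
    """
    return percent_used / 100.0 * total_gb


# Example with made-up numbers: 50 % of an 8 GB machine.
print(percent_to_gb(50, 8))  # 4.0
```

Without knowing the total, the percentages cannot be compared against the 4 GB allotment, which is why the raw values are needed.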

@wee-snufkin

wee-snufkin commented Oct 10, 2024

> 1. One key time-suck in this tutorial is they have to install salmon (wget -nv https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz) and unzip it (tar -xvzf salmon-1.10.0_linux_x86_64.tar.gz) and also this (conda install -y -c conda-forge -c bioconda atlas-gene-annotation-manipulation) which takes a while.

Just to clarify - Salmon can also be installed using conda. The tutorial shows the alternative of downloading and unzipping because, at the time of writing, that worked better and faster than the conda installation.

> Can these be just inherent in a Jupyter Notebook to avoid these long steps? We'll keep the code in a comments box for users to apply to other scenarios, like GoogleCollab etc., but it would speed up usage a lot if we didn't have to install these every time.

But having salmon and dropletutils (as well as atlas-gene-annotation-manipulation and tximeta) pre-installed in the Jupyter Notebook would make running this tutorial much easier.

@wee-snufkin

> Hmm, am I understanding this correctly that if we manage to remove the GTF annotation step (or, perhaps, run it but only on 1 chromosome or something similar) then we will substantially reduce the resources? I'm surprised, as I thought Alevin was the killer. Regardless, the memory used plot - is this what's kicking over the 4GB allotment per notebook? Although it's expressed as a % so I am not sure if I'm interpreting it correctly. Can you express those plots as raw data Mark?

Isn't Alevin the killer, having crashed Mark's browser multiple times? That doesn't really happen in Google Colab.

@FoxHin5431

FoxHin5431 commented Oct 10, 2024

Something like this:
[Image: memory_usage_with_bars]

Looks like cell 7 goes to just about 5 GB.

@FoxHin5431

> > Hmm, am I understanding this correctly that if we manage to remove the GTF annotation step (or, perhaps, run it but only on 1 chromosome or something similar) then we will substantially reduce the resources? I'm surprised, as I thought Alevin was the killer. Regardless, the memory used plot - is this what's kicking over the 4GB allotment per notebook? Although it's expressed as a % so I am not sure if I'm interpreting it correctly. Can you express those plots as raw data Mark?
>
> Isn't Alevin the killer, having crashed Mark's browser multiple times? It rather doesn't happen in Google Colab

I was running a lot of browser tabs at the same time, but it could have been Jupyter. I'll try it again in the morning. I saved my version of the pipeline with the monitoring, so I should be able to just set it up again.

@bgruening
Member

On the information page of every Galaxy dataset you will also find something like this:

[Image: screenshot]

The "Max memory usage recorded" would be interesting.

@FoxHin5431

Hi,
Here are the max memory usage and duration for each task.

| start | finish | max_usage_gb | duration | cell_task |
|---|---|---|---|---|
| 10/10/2024 13:29 | 10/10/2024 13:30 | 4.81 | 5 seconds | gtf2annotation |
| 10/10/2024 13:30 | 10/10/2024 13:30 | 4.81 | 5 seconds | gtf2annotation |
| 10/10/2024 13:30 | 10/10/2024 13:30 | 4.81 | 5 seconds | gtf2annotation |
| 10/10/2024 13:30 | 10/10/2024 13:31 | 4.82 | 5 seconds | gtf2annotation |
| 10/10/2024 13:31 | 10/10/2024 13:33 | 5.05 | 2.0 minutes | gtf2annotation |
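
As a quick check of these figures against the 4 GB per-notebook allotment mentioned earlier in the thread, something like the following could be used. It is only a sketch: the limit is taken from that earlier comment and treated as an assumption, and `over_limit` is a hypothetical helper name. The rows are copied from the values above.

```python
LIMIT_GB = 4.0  # allotment mentioned earlier in the thread (assumption)

# (start, finish, max_usage_gb) copied from the values reported above.
rows = [
    ("10/10/2024 13:29", "10/10/2024 13:30", 4.81),
    ("10/10/2024 13:30", "10/10/2024 13:30", 4.81),
    ("10/10/2024 13:30", "10/10/2024 13:30", 4.81),
    ("10/10/2024 13:30", "10/10/2024 13:31", 4.82),
    ("10/10/2024 13:31", "10/10/2024 13:33", 5.05),
]


def over_limit(rows, limit_gb=LIMIT_GB):
    """Return the rows whose peak memory is above limit_gb."""
    return [row for row in rows if row[2] > limit_gb]


flagged = over_limit(rows)
print(f"{len(flagged)} of {len(rows)} rows exceed {LIMIT_GB} GB")
```

Every reported row is above 4 GB, so if that allotment applies, the gtf2annotation task alone would already exceed it.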
