Add resource limits #106

Open · wants to merge 11 commits into develop
Conversation

@cmelone (Collaborator) commented Sep 27, 2024

This is the first version of our prediction formulas for max CPU and memory.

This PR also sets SPACK_BUILD_JOBS equal to the CPU request (rounded to the nearest core).
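As a rough illustration of the shape of this logic (not the exact formula in this PR — the mean-of-sample-peaks aggregation and the names below are assumptions), the allocation step might look like:

```python
# Sketch only: the aggregation strategy and names here are assumptions,
# not the formula implemented in this PR.

def predict_limits(samples: list[dict]) -> dict:
    """Predict CPU/memory requests for a build from prior usage samples."""
    n = len(samples)
    cpu_request = sum(s["cpu_max"] for s in samples) / n
    # 20% "bump" on the memory prediction to reduce OOM kills (see below).
    mem_request = 1.2 * sum(s["mem_max"] for s in samples) / n
    return {
        "cpu_request": cpu_request,
        "mem_request": mem_request,
        # SPACK_BUILD_JOBS = the CPU request, rounded to the nearest core.
        "SPACK_BUILD_JOBS": max(1, round(cpu_request)),
    }
```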

Using the included simulation script, I ran a scenario where we allocated resources for 8000 specs.


The max memory prediction includes a 20% "bump", which avoids the OOM-killing of ~1100 jobs.

The ratio of actual to predicted memory usage was 0.6963, meaning we are overallocating by roughly 30%. However, 437 jobs were still OOM-killed, an OOM rate of 0.055, far higher than we would like.
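For reference, the two headline numbers can be computed from per-job simulation output roughly like this (the record fields are hypothetical; the included script may organize its output differently):

```python
# Hypothetical per-job simulation records; field names are assumed.
jobs = [
    {"mem_used": 3100.0, "mem_predicted": 4400.0, "oom_killed": False},
    # ... ~8000 records in the actual run ...
]

# Ratio of actual to predicted memory usage (~0.6963 in the run above).
usage_ratio = sum(j["mem_used"] for j in jobs) / sum(j["mem_predicted"] for j in jobs)

# Fraction of jobs OOM-killed (437 / ~8000 ≈ 0.055 in the run above).
oom_rate = sum(j["oom_killed"] for j in jobs) / len(jobs)
```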

@alecbcs and I discussed alternative prediction strategies, including factoring in the ratio of memory to cores.

For example, consider a job whose predicted peak memory was 3x lower than what it actually used, along with the data used to make that prediction:

```
<package>@<version> ~guile build_system=generic %<compiler>@<version> gitlab_id=12859608
duration: 1262 cpu_mean: 0.621, cpu_max: 0.956, mem_mean: 2590.126, mem_max: 4448.702

samples:
duration: 181 cpu_mean: 0.169, cpu_max: 0.424, mem_mean: 105.722, mem_max: 168.37
duration: 149 cpu_mean: 0.531, cpu_max: 1.064, mem_mean: 702.054, mem_max: 1033.888
duration: 107 cpu_mean: 0.283, cpu_max: 0.415, mem_mean: 95.556, mem_max: 149.381
duration: 432 cpu_mean: 0.31, cpu_max: 1.051, mem_mean: 100.313, mem_max: 1300.226
duration: 396 cpu_mean: 0.268, cpu_max: 1.023, mem_mean: 172.576, mem_max: 1364.505
```
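To make the gap concrete: whatever aggregation is used, even the most generous estimate from these samples — the largest sample peak, 1364.505, plus the 20% bump, or roughly 1637 — is still about a third of the observed peak of 4448.702, while the mean of the sample peaks (≈803) fares far worse.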

This package usually takes 4-5 minutes to build, but this run took 21 minutes and peaked at nearly 4x its typical memory usage.

In my opinion, no data available to us would allow an accurate prediction in this scenario, and the same holds for most of the outliers I've seen. The job in question may have been affected by a noisy neighbor not respecting its allocation.

My vote is to keep the formula as-is and tweak it once we deploy gantry to the staging cluster with limits in place.


The actual-to-predicted ratio for max CPU was 0.9546.

@cmelone self-assigned this Sep 27, 2024
@cmelone marked this pull request as ready for review October 8, 2024 18:00
@cmelone changed the title from "draft: add resource limits" to "Add resource limits" Oct 8, 2024
@cmelone (Collaborator, Author) commented Oct 23, 2024

will rebase this as well as #93

@cmelone (Collaborator, Author) commented Oct 25, 2024

past thread on deciding # of build jobs: spack/spack#26242

@HadrienG2 I figured you might be interested to know we're working on this for our CI; the approach is quite similar to your comment.

@github-actions bot added the "ci" (Involving Project CI & Unit Tests) label Oct 29, 2024