Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race condition with multiple numpys in a single shell #3483

Closed
jhgoebbert opened this issue Mar 9, 2022 · 19 comments
Closed

race condition with multiple numpys in a single shell #3483

jhgoebbert opened this issue Mar 9, 2022 · 19 comments
Labels

Comments

@jhgoebbert
Copy link

I came across the following error while loading xpra:

2022-03-08 20:00:47,065  cannot load enc_nvjpeg (nvjpeg encoder)
Traceback (most recent call last):
  File "/p/software/juwels/stages/2022/software/xpra/4.3.2-GCCcore-11.2.0/lib/python3.9/site-packages/xpra/codecs/loader.py", line 52, in codec_import_check
    ic =  __import__(class_module, {}, {}, classnames)
  File "xpra/codecs/nvjpeg/encoder.pyx", line 8, in init xpra.codecs.nvjpeg.encoder
  File "/p/software/juwels/stages/2022/software/SciPy-bundle/2021.10-gcccoremkl-11.2.0-2021.4.0/lib/python3.9/site-packages/numpy/__init__.py", line 150, in <module>
    from . import core
  File "/p/software/juwels/stages/2022/software/SciPy-bundle/2021.10-gcccoremkl-11.2.0-2021.4.0/lib/python3.9/site-packages/numpy/core/__init__.py", line 51, in <module>
    del os.environ[envkey]
  File "/p/software/juwels/stages/2022/software/Python/3.9.6-GCCcore-11.2.0/lib/python3.9/os.py", line 695, in __delitem__
    raise KeyError(key) from None
KeyError: 'OPENBLAS_MAIN_FREE'

After digging through the code I come to the following conclusion:
In numpy/core/__init__.py the environment variable 'OPENBLAS_MAIN_FREE' is set in line 19 if not existant and removed (if set in line 19) in line 52.
To my understanding, this can result in a race condition if two numpys are initialized at the same time (by two threads) in the same shell.
It might be an answer to the error shown above.

@totaam
Copy link
Collaborator

totaam commented Mar 9, 2022

This could be solved in xpra by wrapping all numpy imports with an indirection layer, but that's not an elegant solution.

Shouldn't this issue be solved by numpy itself? (builds without openblas don't have this issue)

@totaam totaam transferred this issue from Xpra-org/xpra-html5 Mar 9, 2022
@totaam
Copy link
Collaborator

totaam commented Mar 9, 2022

In any case, this issue has absolutely nothing to do with the html5 client so I have transferred it.

@jhgoebbert
Copy link
Author

jhgoebbert commented Mar 9, 2022

Shouldn't this issue be solved by numpy itself? (builds without openblas don't have this issue)

Yes, probably this is an issue which should be solved in numpy itself. I wonder if it is "allowed" by numpy to import numpy at the same time (by different threads) in the same shell multiple times. But I could not find any comment on that, yet.
To me this seems to be an issue independent if numpy is build with openblas or not (I do not use a numpy build with openblas).

(thanks for moving the issue)

@jhgoebbert
Copy link
Author

I created an issue here: numpy/numpy#21223

@totaam totaam added the upstream label Apr 3, 2022
@totaam
Copy link
Collaborator

totaam commented Apr 3, 2022

As per the issue above, the fix should come from upsteam Python, or a band-aid in numpy.

@totaam totaam closed this as completed Apr 3, 2022
@totaam
Copy link
Collaborator

totaam commented Jun 17, 2022

I am now also seeing this error with Fedora 36.

@totaam
Copy link
Collaborator

totaam commented Jul 17, 2022

@jhgoebbert I am unable to test this properly on my non-nvidia laptop, but the commit above should fix this issue.
It isn't as intrusive as I had feared.

@jhgoebbert
Copy link
Author

Great! Thank you for the fix.

totaam added a commit that referenced this issue Jul 20, 2022
@totaam
Copy link
Collaborator

totaam commented Jul 20, 2022

I had missed some that were loaded from pycuda. (see commit above)
One remains in the client:

from pycuda.driver import memcpy_dtod #pylint: disable=no-name-in-module
from pycuda.gl import RegisteredBuffer, graphics_map_flags

And perhaps this is related to the memory leak in #3504.
See also #3515

totaam added a commit that referenced this issue Jul 20, 2022
totaam added a commit that referenced this issue Sep 14, 2022
also add notes where we import pycuda to explain why it is safe to do so
@totaam
Copy link
Collaborator

totaam commented Dec 5, 2022

Not fixed.
I don't even have an NVidia GPU in this system!
Starting a server:

2022-12-05 19:06:59,178  cannot load nvenc (nvenc encoder)
Traceback (most recent call last):
  File "/usr/lib64/python3.11/site-packages/xpra-5.0-py3.11-linux-x86_64.egg/xpra/codecs/loader.py", line 81, in codec_import_check
    ic =  __import__(class_module, {}, {}, classnames)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xpra/codecs/nvidia/nvenc/encoder.pyx", line 18, in init xpra.codecs.nvidia.nvenc.encoder
    from xpra.codecs.nvidia.cuda_context import (
  File "/usr/lib64/python3.11/site-packages/xpra-5.0-py3.11-linux-x86_64.egg/xpra/codecs/nvidia/cuda_context.py", line 31, in <module>
    from pycuda import driver   #@UnresolvedImport
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/site-packages/pycuda/driver.py", line 7, in <module>
    import numpy as np
  File "/usr/lib64/python3.11/site-packages/numpy/__init__.py", line 144, in <module>
    from . import core
  File "/usr/lib64/python3.11/site-packages/numpy/core/__init__.py", line 52, in <module>
    del os.environ[envkey]
        ~~~~~~~~~~^^^^^^^^
  File "<frozen os>", line 694, in __delitem__
KeyError: 'OPENBLAS_MAIN_FREE'

@totaam totaam reopened this Dec 5, 2022
@totaam
Copy link
Collaborator

totaam commented Jul 11, 2023

Not seen since.

@totaam totaam closed this as completed Jul 11, 2023
@totaam
Copy link
Collaborator

totaam commented Sep 28, 2023

Seen today:

2023-09-27 23:11:53,642  cannot load enc_nvjpeg (nvjpeg encoder)
Traceback (most recent call last):
  File "xpra/codecs/loader.py", line 83, in xpra.codecs.loader.codec_import_check
    ic : ModuleType =  __import__(class_module, {}, {}, classnames)
  File "xpra/codecs/nvidia/nvjpeg/encoder.pyx", line 27, in init xpra.codecs.nvidia.nvjpeg.encoder
    from xpra.codecs.nvidia.cuda.context import get_CUDA_function, select_device, cuda_device_context
  File "/usr/lib64/python3.11/site-packages/xpra/codecs/nvidia/cuda/context.py", line 35, in <module>
    from pycuda.driver import (
  File "/usr/lib64/python3.11/site-packages/pycuda/driver.py", line 7, in <module>
    import numpy as np
  File "/usr/lib64/python3.11/site-packages/numpy/__init__.py", line 141, in <module>
    from . import core
  File "/usr/lib64/python3.11/site-packages/numpy/core/__init__.py", line 52, in <module>
    del os.environ[envkey]
        ~~~~~~~~~~^^^^^^^^
  File "<frozen os>", line 695, in __delitem__
KeyError: 'OPENBLAS_MAIN_FREE'

@totaam
Copy link
Collaborator

totaam commented Dec 3, 2023

Anyone seeing this issue, please try with the latest version 6 or later.
Add these 3 lines anywhere near the top of /usr/lib64/python3.*/site-packages/numpy/__init__.py:

import os
if not os.environ.get("XPRA_NUMPY_IMPORT"):
    import traceback
    traceback.print_stack()

(ie: just before or after import sys)
This will show us where numpy is being imported unexpectedly.

totaam added a commit that referenced this issue Jan 16, 2024
we don't want to fail to enable OpenGL acceleration just because the codec initialization thread is also initializing numpy, so wait for it to complete
@totaam
Copy link
Collaborator

totaam commented Jan 16, 2024

Got this one today:

  File "/usr/bin/xpra", line 17, in <module>
    sys.exit(main(sys.argv[0], sys.argv))
  File "/usr/lib64/python3.12/site-packages/xpra/platform/posix/gl_context.py", line 300, in __init__
    self.props.update(get_context_info())
  File "/usr/lib/python3.12/site-packages/OpenGL/plugins.py", line 18, in load
    return importByName( self.import_path )
  File "/usr/lib/python3.12/site-packages/OpenGL/plugins.py", line 45, in importByName
    module = __import__( ".".join(moduleName), {}, {}, moduleName)
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 994, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/lib/python3.12/site-packages/OpenGL/arrays/numpymodule.py", line 12, in <module>
    import numpy
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 994, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/lib64/python3.12/site-packages/numpy/__init__.py", line 117, in <module>
    traceback.print_stack()

So the commit above adds locking in the X11 OpenGL context path.
The other paths should already be safe as they don't import OpenGL directly, only via check_PyOpenGL_support after Gtk created the context.

The big problem is that doing any kind of locking in the UI thread is bad..
Unfortunately, by the time we want to import the OpenGL backend, we have no way of knowing how long the numpy import will take if one has already started, or even if it is actually needed (the numpy array handler is optional in PyOpenGL..)
Should we prevent PyOpenGL from loading numpy? How much performance would we lose?
(we would gain some huge memory savings..)

totaam added a commit that referenced this issue Jan 17, 2024
we don't want to fail to enable OpenGL acceleration just because the codec initialization thread is also initializing numpy, so wait for it to complete
totaam added a commit that referenced this issue Jan 17, 2024
we don't want to fail to enable OpenGL acceleration just because the codec initialization thread is also initializing numpy, so wait for it to complete
totaam added a commit that referenced this issue Jan 29, 2024
@totaam
Copy link
Collaborator

totaam commented Jan 29, 2024

Should we prevent PyOpenGL from loading numpy?

The commits above does just that via: XPRA_NUMPY=0.
It prevents numpy from being loaded - anywhere.
Or one can use the more subsystem specific variants, should this be needed: XPRA_OPENGL_NUMPY=0 or XPRA_CUDA_NUMPY=0. (the latter one is only available for completeness, as CUDA won't work without numpy)


I am closing this ticket because I am reasonably confident that the underlying bug is now fixed, even if the solution is not ideal (blocking in the main thread...).
Feel free to re-open if you can reproduce this with any version newer than v5.0.5 or v6-r35031.

@totaam totaam closed this as completed Jan 29, 2024
totaam added a commit that referenced this issue Feb 12, 2024
coming from the posix gl context calling 'get_context_info'
totaam added a commit that referenced this issue Feb 12, 2024
simpler version of 8d40c79 for v5.x
totaam added a commit that referenced this issue Feb 14, 2024
not just when importing these calls..
totaam added a commit that referenced this issue Feb 14, 2024
not just when importing these calls..
totaam added a commit that referenced this issue Feb 19, 2024
@totaam
Copy link
Collaborator

totaam commented Apr 3, 2024

Minor updates:

  • 4ec161b will log more details about the location holding the initial lock
  • nvfbc shadow servers could trigger a locking issue: 48a08ad
  • we can now also make it easier to trigger issues: 5d9f498 using XPRA_NUMPY_LOCK_SLEEP=10, CUDA initialization will take 10 seconds longer, so any other thread is likely to hit the lock

@totaam
Copy link
Collaborator

totaam commented Jun 15, 2024

As part of fixes for Fedora 40 which fails to build pyopengl-accelerate against numpy, I have made numpy optional when installing pyopengl.

And I think I found the culprit:

File "/usr/bin/xpra", line 20, in <module>
    sys.exit(main(sys.argv[0], sys.argv))
  File "/usr/lib64/python3.10/site-packages/xpra/scripts/main.py", line 139, in main
    return run_mode(script_file, cmdline, err, options, args, mode, defaults)
  File "/usr/lib64/python3.10/site-packages/xpra/scripts/main.py", line 522, in run_mode
    return do_run_mode(script_file, cmdline, error_cb, options, args, full_mode, defaults)
  File "/usr/lib64/python3.10/site-packages/xpra/scripts/main.py", line 614, in do_run_mode
    return run_client(script_file, cmdline, error_cb, options, args, mode)
  File "/usr/lib64/python3.10/site-packages/xpra/scripts/main.py", line 1450, in run_client
    app = get_client_app(cmdline, error_cb, opts, extra_args, mode)
  File "/usr/lib64/python3.10/site-packages/xpra/scripts/main.py", line 1619, in get_client_app
    app = get_client_gui_app(error_cb, opts, request_mode, extra_args, mode)
  File "/usr/lib64/python3.10/site-packages/xpra/scripts/main.py", line 1691, in get_client_gui_app
    app.init_ui(opts)
  File "/usr/lib64/python3.10/site-packages/xpra/client/gui/ui_client_base.py", line 235, in init_ui
    self.init_opengl(opts.opengl)
  File "/usr/lib64/python3.10/site-packages/xpra/client/gtk3/client_base.py", line 1269, in init_opengl
    self.opengl_props, gl_client_window_module = get_gl_client_window_module(enable_opengl)
  File "/usr/lib64/python3.10/site-packages/xpra/client/gl/window.py", line 61, in get_gl_client_window_module
    props, window_module = test_window_module(module_name, force_enable)
  File "/usr/lib64/python3.10/site-packages/xpra/client/gl/window.py", line 70, in test_window_module
    mod = import_module(f"xpra.client.gl.gtk3.{module_name}_window")
  File "/usr/lib64/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib64/python3.10/site-packages/xpra/client/gl/gtk3/native_window.py", line 8, in <module>
    from xpra.client.gl.gtk3.drawing_area import GLDrawingArea
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib64/python3.10/site-packages/xpra/client/gl/gtk3/drawing_area.py", line 11, in <module>
    from xpra.client.gl.backing import GLWindowBackingBase
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib64/python3.10/site-packages/xpra/client/gl/backing.py", line 16, in <module>
    from OpenGL.error import GLError
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3.10/site-packages/OpenGL/error.py", line 164, in <module>
    from OpenGL import acceleratesupport
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/lib/python3.10/site-packages/OpenGL/acceleratesupport.py", line 8, in <module>
    import OpenGL_accelerate

So I am now very hopeful that the two commits above will finally catch the last of the threaded loading of the numpy module.
Note: as per f5e6e3e, we don't load the numpy format handler on Fedora 40 because it doesn't build properly. So I haven't tested this fully yet.

totaam added a commit that referenced this issue Jul 6, 2024
including a better name for the subsystem and the full backtrace from both call sites
@totaam
Copy link
Collaborator

totaam commented Jul 6, 2024

Reproduced with MS Windows shadow server here: #4287 (comment) : RuntimeError: the numpy import lock is already held by CUDA in thread 91700! - which was not as helpful as I was hoping... so the commit above adds even more details.

ie:

$ python3.12
Python 3.12.3 (main, Apr 17 2024, 00:00:00) [GCC 14.0.1 20240411 (Red Hat 14.0.1-0)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from xpra.util.thread import start_thread
>>> from xpra.util.env import NumpyImportContext
>>> n=NumpyImportContext("foo", False)
>>> n.__enter__()
>>> def ni():
...  n2=NumpyImportContext("foo2", False)
...  n2.__enter__()
... 
>>> start_thread(ni, "goo")
<Thread(goo, started 139833305466560)>
>>> 2024-07-06 16:05:40,864 debug enabled for ['threading', 'util']
2024-07-06 16:05:40,864 numpy lock was already acquired from:
2024-07-06 16:05:40,864   File "/usr/lib64/python3.12/threading.py", line 1030, in _bootstrap
2024-07-06 16:05:40,865     self._bootstrap_inner()
2024-07-06 16:05:40,865   File "/usr/lib64/python3.12/threading.py", line 1073, in _bootstrap_inner
2024-07-06 16:05:40,865     self.run()
2024-07-06 16:05:40,865   File "/usr/lib64/python3.12/threading.py", line 1010, in run
2024-07-06 16:05:40,865     self._target(*self._args, **self._kwargs)
2024-07-06 16:05:40,865   File "<stdin>", line 2, in ni
2024-07-06 16:05:40,865 failed to acquire it again from:
2024-07-06 16:05:40,865   File "/usr/lib64/python3.12/threading.py", line 1030, in _bootstrap
2024-07-06 16:05:40,865     self._bootstrap_inner()
2024-07-06 16:05:40,865   File "/usr/lib64/python3.12/threading.py", line 1073, in _bootstrap_inner
2024-07-06 16:05:40,865     self.run()
2024-07-06 16:05:40,865   File "/usr/lib64/python3.12/threading.py", line 1010, in run
2024-07-06 16:05:40,865     self._target(*self._args, **self._kwargs)
2024-07-06 16:05:40,865   File "<stdin>", line 3, in ni
Exception in thread goo:
Traceback (most recent call last):
  File "/usr/lib64/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "<stdin>", line 3, in ni
  File "xpra/util/env.py", line 226, in xpra.util.env.NumpyImportContext.__enter__
RuntimeError: the numpy import lock is already held by 'foo2'!

@totaam
Copy link
Collaborator

totaam commented Nov 22, 2024

Today with xpra 5.0.11:

2024-11-22 16:54:47,175  cannot load nvenc (nvenc encoder)
Traceback (most recent call last):
  File "/usr/lib64/python3.13/site-packages/xpra/codecs/loader.py", line 82, in codec_import_check
    ic : ModuleType =  __import__(class_module, {}, {}, classnames)
                       ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xpra/codecs/nvidia/nvenc/encoder.pyx", line 19, in init xpra.codecs.nvidia.nvenc.encoder
    from xpra.util.objects import AtomicInteger, typedict
  File "/usr/lib64/python3.13/site-packages/xpra/codecs/nvidia/cuda_context.py", line 32, in <module>
    from pycuda.driver import (
    ...<4 lines>...
        )
  File "/usr/lib64/python3.13/site-packages/pycuda/driver.py", line 7, in <module>
    import numpy as np
  File "/usr/lib64/python3.13/site-packages/numpy/__init__.py", line 130, in <module>
    from numpy.__config__ import show as show_config
  File "/usr/lib64/python3.13/site-packages/numpy/__config__.py", line 4, in <module>
    from numpy.core._multiarray_umath import (
    ...<3 lines>...
    )
  File "/usr/lib64/python3.13/site-packages/numpy/core/__init__.py", line 53, in <module>
    del os.environ[envkey]
        ~~~~~~~~~~^^^^^^^^
  File "<frozen os>", line 732, in __delitem__
KeyError: 'OPENBLAS_MAIN_FREE'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants