Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downsampling pixel when min-quality set as 100 #4397

Open
Simba98 opened this issue Oct 19, 2024 · 11 comments
Open

Downsampling pixel when min-quality set as 100 #4397

Simba98 opened this issue Oct 19, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@Simba98
Copy link

Simba98 commented Oct 19, 2024

Describe the bug
Threaded Lock Contention while threaded-clean thread is cleaning.
So that NVENC failed and fallback (see below)

Reason why I notice that:
See https://github.com/orgs/Xpra-org/discussions/4343#discussioncomment-10554736
I set the parameter of --min-quality=100 on client side, but even that, the YUV420 downsampling will still appears so that I can not see rectangle clearly.

To Reproduce
Steps to reproduce the behavior:

  1. server command: xpra start :A_DISPLAY -d video,cuda
  2. client command: Xpra_cmd.exe attach ssh://Server/A_DISPLAY --min-quality=100
  3. specific action to trigger the bug: No, just run software do something.

System Information (please complete the following information):

  • Server OS: Rocky Linux 8
  • Client OS: e.g. Windows 11
  • Xpra Server Version 6.2.0
  • Xpra Client Version 6.2.0

Debug Code
Change https://github.com/Xpra-org/xpra/blob/v6.2.x/xpra/codecs/nvidia/cuda/context.py#L524
To the following:
(And import threading and also change slots)

    def __enter__(self):
        if not self.lock.acquire(False):
            log(f"Current thread: {threading.current_thread().name}")
            log(f"Lock is held by: {self.holding_thread.name}")
            raise TransientCodecException("failed to acquire cuda device lock")
        self.holding_thread = threading.current_thread()
        if not self.context:
            self.make_context()
        return self.push_context()

Additional context

// My debug Code Print
2024-10-19 20:45:06,699 Current thread: encode
2024-10-19 20:45:06,699 Lock is held by: threaded-clean
// End of my debug code
2024-10-19 20:45:06,700 Warning: setup_pipeline failed for
2024-10-19 20:45:06,700  (260, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 613, 673, nvjpeg(BGRX to jpeg)):
2024-10-19 20:45:06,700  failed to acquire cuda device lock
2024-10-19 20:45:06,700 setup_pipeline: trying (241, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 613, 673, nvenc(BGRX to h264))
2024-10-19 20:45:06,700 setup_pipeline: csc=None, video encoder=nvenc(BGRX/NV12/H264 - None -  613x673 ), info: {'version': (11, 0), 'device_count': 1, 'context_count': 3, 'generation': 31, 'cards': {0: {'name': b'NVIDIA GeForce RTX 4090', 'uuid': b'GPU-UUID', 'pci': {'domain': 0, 'bus': 3, 'device': 0, 'pci-device-id': SOMEID, 'pci-subsystem-id': SOMEID, 'bus-id': '00000000:03:00.0'}, 'memory': {'total': 25757220864, 'free': 24575606784, 'used': 1181614080}, 'pcie-link': {'generation-max': 4, 'width-max': 16, 'generation': 4, 'width': 8}, 'clock-info': {'graphics': 2625, 'sm': 2625, 'mem': 10251, 'graphics-max': 3165, 'sm-max': 3165, 'mem-max': 10501}, 'fan-speed': 30, 'temperature': 29, 'power-state': 2, 'vbios-version': b'95.02.3C.00.40'}}, 'kernel_module_version': (535, 183, 1), 'kernel_version': '5.15.0-119-generic', 'width': 613, 'height': 673, 'frames': 0, 'codec': 'H264', 'encoder_width': 640, 'encoder_height': 704, 'bitrate': 1000000, 'quality': 100, 'speed': 1, 'lossless': {'': 0, 'supported': 1, 'threshold': 100}, 'yuv444': {'supported': True, 'threshold': 85}, 'cuda-device': {}, 'cuda': {}, 'pycuda': {}, 'src_format': 'BGRX', 'pixel_format': 'NV12', 'total_time_ms': 0, 'free_memory': 0, 'total_memory': 0}, setup took 0.29ms
2024-10-19 20:45:06,700 video encoder nvenc(BGRX/NV12/H264 - None -  613x673 ) is not ready yet, using temporary fallback
2024-10-19 20:45:06,713 get_CUDA_function(BGRX_to_NV12) module=<pycuda._driver.Module object at 0x7f733a35f760>
2024-10-19 20:45:06,713 loading function 'BGRX_to_NV12' from pre-compiled cubin took 0.2ms

** Extra Comment**
Well, I really want to have a way to avoid YUV420 with Xpra 6.x Windows Client.
But now the H264 with YUV444 is not supported on Xpra 6.x Windows Client. Maybe I should consider port ffmpeg from 5.x branch? Or just use 5.x client with 6.x server?

@Simba98 Simba98 added the bug Something isn't working label Oct 19, 2024
@totaam
Copy link
Collaborator

totaam commented Oct 19, 2024

Threaded Lock Contention while threaded-clean thread is cleaning.

What is the actual problem here?

Well, I really want to have a way to avoid YUV420 with Xpra 6.x Windows Client.

The only h264 decoder in v6 onwards is the openh264 one: #3734
The ffmpeg decoder was removed 1.5 years ago: 20bb5f0

Maybe I should consider port ffmpeg from 5.x branch?

Unless you can prove that you will do the work for supporting this feature in the long term, including all the build and packaging for all platforms, I cannot merge this type of change.

Or just use 5.x client with 6.x server?

That should work.


Is this a desktop or seamless session?
Setting a high enough min-quality value should let the engine decide which encoding is best and it should not use subsampling if it can be avoided. Which makes me think that the real bug is elsewhere.

@Simba98
Copy link
Author

Simba98 commented Oct 19, 2024

Threaded Lock Contention while threaded-clean thread is cleaning.

What is the actual problem here?

The actual problem is downsampling still happens. The GUI is smooth. And some region is not update correctly (until I do refresh operation in the software.)

Well, I really want to have a way to avoid YUV420 with Xpra 6.x Windows Client.

The only h264 decoder in v6 onwards is the openh264 one: #3734 The ffmpeg decoder was removed 1.5 years ago: 20bb5f0

Maybe I should consider port ffmpeg from 5.x branch?

Unless you can prove that you will do the work for supporting this feature in the long term, including all the build and packaging for all platforms, I cannot merge this type of change.

I see,

Or just use 5.x client with 6.x server?

That should work.

Is this a desktop or seamless session? Setting a high enough min-quality value should let the engine decide which encoding is best and it should not use subsampling if it can be avoided. Which makes me think that the real bug is elsewhere.

It is a seamless session. I think it is due to "fallback"

video encoder nvenc(BGRX/NV12/H264 - None - 613x673 ) is not ready yet, using temporary fallback.

Which makes me think that the real bug is elsewhere.

Could be,
GUI smooth / "downsampling" is only happens on H264 / VP8 encoding with NVENC. Forcely using JPEG encoding (should be NVJPEG) will make xpra runs very well. Forcely using any other non-NVENC encoding (includes AV1) looks good also.

*: I see this in the log so I think jpeg is using nvjpeg.

found GPU accelerated encoder nvjpeg(BGRX to jpeg)

But you suggested not to use --encoding=jpeg and use automatic (I agree if xpra is working well).

What information I can provide is that: it usually happens on partial updating (my window is 1920x1080 and in this example Xpra is trying to update 613x673. The refresh operation force my software re-draw the whole window so Xpra seems like will update whole window so, and in this case xpra usually OK.

I took two pictures from MS Windows screenshot. Not easy to say the reason why but this looks like due to downsample on YUV420.
Smooth
Smooth

I checked the server log and find quality is 100, sometime is 99 (Very few). Neither 100 or 99 should not have downsample.

found GPU accelerated encoder nvenc(BGRX to h264)
do_video_encode(h264, XShmImageWrapper(BGRX: 0, 0, 2560, 1334), typedict({'quality': 100, 'speed': 5, 'rgb_formats': ('BGRX', 'RGBX', 'BGR', 'RGB', 'r210', 'BGR565'), 'lz4': True, 'alpha': False, 'window-size': (2560, 1334), 'damage-time': 389716.302764601, 'process-damage-time': 389716.32538758, 'scroll': True, 'cuda-device-context': cuda_device_context(0)}))
video_encode('h264', XShmImageWrapper(BGRX: 0, 0, 2560, 1334), typedict({'quality': 100, 'speed': 5, 'rgb_formats': ('BGRX', 'RGBX', 'BGR', 'RGB', 'r210', 'BGR565'), 'lz4': True, 'alpha': False, 'window-size': (2560, 1334), 'damage-time': 389716.302764601, 'process-damage-time': 389716.32538758, 'scroll': True, 'cuda-device-context': cuda_device_context(0)})) image size: 2560x1334, encoder/csc size: 2560x1334
video_encode nvenc encoder: h264 2560x1334 result is 3521 bytes, 493 MPixels/s, client options={'csc': 'YUV420P', 'frame': 40, 'pts': 51679, 'full-range': True, 'csc-type': 'cuda:BGRX_to_NV12', 'quality': 99}

@totaam
Copy link
Collaborator

totaam commented Oct 20, 2024

Threaded Lock Contention while threaded-clean thread is cleaning.

What is the actual problem here?

The actual problem is downsampling still happens.

I mean here.
What does this have to do with "threaded lock contention". Why is this github issue named after it?

I think it is due to "fallback"

I don't understand what that means.

Forcely using any other non-NVENC encoding (includes AV1) looks good also.

That's because the other encodings have non-subsampled pixel formats supported at both ends.
It looks to me like the bug in this ticket is due to the fact that the engine chooses a subsampled format with quality=100, this should not happen.

I see this in the log so I think jpeg is using nvjpeg.

The debug flag to get details on picture compression is -d compress, it will show the compressor used.

@Simba98
Copy link
Author

Simba98 commented Oct 20, 2024

Threaded Lock Contention while threaded-clean thread is cleaning.

What is the actual problem here?

The actual problem is downsampling still happens.

I mean here. What does this have to do with "threaded lock contention". Why is this github issue named after it?

I think Threaded Lock Contention makes the fallback mechanism, and I observed the subsampling appears with the fallback mechanism is applied. I am writing code to change to acquire(True) for blocking mode to avoid temporary fallback to verify if it is real reason.
*: fallback mechanism, is referred to the following log.

video encoder nvenc(BGRX/NV12/H264 - None - 613x673 ) is not ready yet, using temporary fallback

I think it is due to "fallback"

I don't understand what that means.

See above.

Forcely using any other non-NVENC encoding (includes AV1) looks good also.

That's because the other encodings have non-subsampled pixel formats supported at both ends. It looks to me like the bug in this ticket is due to the fact that the engine chooses a subsampled format with quality=100, this should not happen.

I see this in the log so I think jpeg is using nvjpeg.

The debug flag to get details on picture compression is -d compress, it will show the compressor used.

Sure, let me add compress and check the log.

@totaam
Copy link
Collaborator

totaam commented Oct 20, 2024

I observed the subsampling appears with the fallback mechanism is applied

Then the easy solution is to raise the quality for the fallback path.
Much better than messing with locking.
Or perhaps the fallback chosen only supports subsampling.

@Simba98
Copy link
Author

Simba98 commented Oct 20, 2024

setup_pipeline: csc=None, video encoder=nvenc(BGRX/NV12/H264 - None - 1920x998 ), info: {'version': (11, 0), 'device_count': 1, 'context_count': 2, 'generation': 48, 'cards': {0: {'name': b'NVIDIA GeForce RTX 4090', 'uuid': b'', 'pci': {'domain': 0, 'bus': 3, 'device': 0, 'pci-device-id': x, 'pci-subsystem-id': z, 'bus-id': '00000000:03:00.0'}, 'memory': {'total': 25757220864, 'free': 24575606784, 'used': 1181614080}, 'pcie-link': {'generation-max': 4, 'width-max': 16, 'generation': 4, 'width': 8}, 'clock-info': {'graphics': 2625, 'sm': 2625, 'mem': 10251, 'graphics-max': 3165, 'sm-max': 3165, 'mem-max': 10501}, 'fan-speed': 30, 'temperature': 29, 'power-state': 2, 'vbios-version': b'95.02.3C.00.40'}}, 'kernel_module_version': (535, 183, 1), 'kernel_version': '5.15.0-119-generic', 'width': 1920, 'height': 998, 'frames': 0, 'codec': 'H264', 'encoder_width': 1920, 'encoder_height': 1024, 'bitrate': 5534023, 'quality': 100, 'speed': 28, 'lossless': {'': 0, 'supported': 1, 'threshold': 100}, 'yuv444': {'supported': True, 'threshold': 85}, 'cuda-device': {'id': 0, 'device': {'name': 'NVIDIA GeForce RTX 4090', 'pci_bus_id': '0000:03:00.0', 'memory': 24217}, 'opengl': False, 'api_version': 3020}, 'cuda': {'driver': {'version': (12, 6, 0), 'driver_version': 12020}}, 'pycuda': {'version': {'': (2024, 1, 2), 'text': '2024.1.2'}}, 'src_format': 'BGRX', 'pixel_format': 'NV12', 'total_time_ms': 0, 'free_memory': 0, 'total_memory': 0}, setup took 16.31ms
loading function 'BGRX_to_NV12' from pre-compiled cubin took 15.6ms
video encoder nvenc(BGRX/NV12/H264 - None - 1920x998 ) is not ready yet, using temporary fallback
fallback encodings(('png', 'png/P', 'png/L', 'webp', 'avif', 'rgb24', 'rgb32', 'jpeg', 'jpega'), {'rgb24': [<function encode at 0x7fc05470d080>], 'rgb32': [<function encode at 0x7fc05470d080>], 'png': [<function encode at 0x7fc04d561e40>], 'png/L': [<function encode at 0x7fc04d561e40>], 'png/P': [<function encode at 0x7fc04d561e40>], 'jpeg': [<cyfunction encode at 0x7fc05d47b100>, <cyfunction encode at 0x7fc05d4b8520>, <function encode at 0x7fc04d561e40>], 'webp': [<cyfunction encode at 0x7fc05d4b98a0>, <function encode at 0x7fc04d561e40>], 'jpega': [<cyfunction encode at 0x7fc05d47b100>, <cyfunction encode at 0x7fc05d4b8520>]})=('png', 'png/L', 'rgb32', 'png/P', 'rgb24', 'webp', 'jpega', 'jpeg')

// But when fallback, xpra is using webp. 

compress:  45.3ms for 1920x998  pixels at    0,0    for wid=3     using      webp with ratio   2.8%  ( 7485KB to   209KB), sequence    43, client_options={'rgb_format': 'BGRX', 'quality': 100, 'encoder': 'fallback: webp', 'flush': 0}, options=typedict({'quality': 100, 'speed': 28, 'rgb_formats': ('YUV420P', 'YUV422P', 'YUV444P', 'NV12', 'GBRP', 'BGRX', 'RGBX', 'RGB', 'BGR'), 'lz4': True, 'alpha': False, 'window-size': (1920, 998), 'damage-time': 430813.901918346, 'process-damage-time': 430813.904211574, 'scroll': True, 'cuda-device-context': cuda_device_context(0)})
send_delayed for wid 4, batch delay is 44ms, elapsed time is 17ms
nonvideo(0, area is too small)
must_encode_full_frame(rgb24)=False full_frames_only=False, non_video=('png', 'png/P', 'png/L', 'webp', 'avif', 'rgb24', 'rgb32', 'jpeg', 'jpega'), video=('h264', 'vp8', 'webp', 'jpeg', 'jpega', 'h265')
send_delayed_regions: queuing 1 regions
process_damage_region(430814.008523344, 2159, 803, 8, 8, 'rgb24', {'quality': 100, 'speed': 4, 'rgb_formats': ('YUV420P', 'YUV422P', 'YUV444P', 'NV12', 'GBRP', 'BGRX', 'RGBX', 'RGB', 'BGR'), 'lz4': True, 'alpha': False}, 0)
get_damage_image(2159, 803, 8, 8) took 1ms
process_damage_region: av_delay=0, must_freeze=False, size=(8, 8), encoding=rgb24
process_damage_region: wid=4, sequence=1726, adding pixel data to encode queue (   8x8    - rgb24), elapsed time: 19.7 ms, request time: 1.5 ms, frame delay=  0ms
send_delayed_regions: queued 1 regions for encoding using ['rgb24']
scroll detection took 0ms
make_data_packet: image=XShmImageWrapper(BGRX: 2159, 803, 8, 8), damage data: (4, 2159, 803, 8, 8, 'rgb24')
compress:   0.0ms for    8x8    pixels at 2159,803  for wid=4     using     rgb24 with ratio 100.0%  (    1KB to     1KB), sequence  1752, client_options={'rgb_format': 'BGRX', 'encoder': 'argb', 'flush': 0}, options=typedict({'quality': 100, 'speed': 4, 'rgb_formats': ('YUV420P', 'YUV422P', 'YUV444P', 'NV12', 'GBRP', 'BGRX', 'RGBX', 'RGB', 'BGR'), 'lz4': True, 'alpha': False, 'window-size': (2538, 1334), 'damage-time': 430814.008523344, 'process-damage-time': 430814.028227996, 'cuda-device-context': cuda_device_context(0)})

@totaam
Copy link
Collaborator

totaam commented Oct 20, 2024

webp at 100% quality is lossless.
Are you sure that this is the problematic update?
If so, then perhaps the bug is client side paint code.

@Simba98
Copy link
Author

Simba98 commented Oct 20, 2024

webp at 100% quality is lossless. Are you sure that this is the problematic update? If so, then perhaps the bug is client side paint code.

Let me double-check, too much log in one second. And I can reproduce it on Xpra 5.10.0 Windows Client with force H264 encoding with 6.x server.
Screenshot from 5.10.0
image
image

New clue:
By disabling NVENC (prohibit GPU libcuda.so access), the H264 encoding still work incorrectly.

codec module nvidia.nvjpeg.encoder cannot be loaded: libcuda.so.1: cannot open shared object file: No such file or directory
video encoder 'nvenc' could not be loaded: libcuda.so.1: cannot open shared object file: No such file or directory

@totaam
Copy link
Collaborator

totaam commented Oct 20, 2024

the H264 encoding still work incorrectly

Please try to collect -d compress log samples corresponding to the faulty screen updates.
Another way of replaying video screen updates is using XPRA_SAVE_VIDEO_STREAMS=1.
(bear in mind that non-video screen updates normally interspersed with it will not be shown in that stream)

@Simba98 Simba98 changed the title Threaded Lock Contention Downsampling pixel when min-quality set as 100 Oct 21, 2024
@Simba98
Copy link
Author

Simba98 commented Oct 21, 2024

Hi Totamm,

I have sent an email to [email protected] with title of "[Github Issue #4343] Video Recording of the Incorrect Rendering"
Three recordings with log samples are attached.

All recording uses the following config.
Server: RockyLinux 8 Xpra 6.2.0
Server command: xpra start :123 -d video,cuda,encoding,compress
Client: Windows 11 Enterprise Xpra 6.2.0 full build.
Client command: Xpra_cmd.exe attach ssh://SERVER/123 --desktop-scaling=off --min-quality=100 -d comporess,video,encoding

NVENC is disabled by blocking access libcuda.so.1 for Xpra.

codec module nvidia.nvjpeg.encoder cannot be loaded: libcuda.so.1: cannot open shared object file: No such file or directory

Please consider delete these files after using.
Thank you for your patience and selfless devotion to open source communities.

@totaam
Copy link
Collaborator

totaam commented Nov 11, 2024

So, it turns out that this quite tricky to fix.
We really want to use the GPU for encoding if we can, but the only decoder we have that support non-subsampled data is libvpx with vp9.

I thought that adding a good AV1 decoder would solve this problem, with one major downside: this would require an Ada Lovelace GPU.
dav1d looks like a decent option for that.
Unfortunately, NVENC only supports YUV420 subsampled input with AV1!

It does support YUV444 input with HEVC (and I have the hardware required), but again, we don't have a decoder for it, yet.
openHEVC has not been updated in 7 years!
It's not clear if libde265 supports YUV444:
https://github.com/strukturag/libde265/blob/336272b1245c480a4cce8231168e5adebbc1234d/libde265/de265.h#L174C3-L174C19
The documentation is very helpful, in particular: Decoder API Tutorial which even describes custom memory allocators.


I don't really want to change the heuristics too much at this point, so I guess that I will give a bigger discount to subsampled encoders when quality is at 100%.
This should fix things for now, though it will make things a lot slower..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants