Hack grammar contains invalid Oniguruma token `\xff` #159

slevithan · 2024-11-26T17:13:05Z

Problem

This line in the Hack grammar is an invalid Oniguruma pattern. While it doesn't throw, it also doesn't match what is intended due to the use of encoded byte value \xff (which doesn't work the same as code point \x{ff} in Oniguruma).

The error in this pattern is in the negated range [^...\\x7f-\\xff]. From context, it's easy to figure out that the author meant for this to exclude, among other things, the range from code point U+007F to U+00FF. And although that's what it would do in nearly every modern regex engine, that's not exactly what it's doing in Oniguruma.

In Oniguruma, \xHH matches an "encoded byte value", not a code point value like the enclosed version \x{...} does. That means for values from 0 to 7F, \xHH and \x{HH} work the same, but they diverge for values 80 to FF. With \xHH above 7F, the token must be part of a valid encoded byte sequence. E.g. the three-byte sequence \xEF\xBB\xBF in Oniguruma is equivalent to the single code point \x{FEFF} in Oniguruma (and not the same as the three code points \xEF\xBB\xBF in JavaScript and other regex flavors).

\xFF, on it's own, is not a valid encoded byte sequence. In Oniguruma, standalone invalid byte values \x80 through \xF4 throw. However, due to the peculiarities of Oniguruma's error handling, \xF5 through \xFF do not throw, but they also don't match what's intended. This seems to be because bytes F5 through FF are unused in UTF-8 encoding. So instead of throwing in Oniguruma, these byte values don't match anything if used on their own, and they cause unintended/difficult-to-predict behavior in character class ranges, intersection, etc.

Current behavior

If the pattern in the Hack grammar used e.g. \x7f-\xf4 instead of \x7f-\xff, it would be a clear error (and throw). But because it happens to use an encoded byte in the range of F5 to FF, the behavior is more complicated (but still wrong). Unlike how this pattern would behave in almost every other regex engine, it is preventing the negated character class from matching any code point above FF, which does not seem to be intended.

Fix

To fix this, the \\xff should be replaced with \\x{ff}. It will then run correctly in Oniguruma.

The text was updated successfully, but these errors were encountered:

slevithan · 2024-11-27T15:28:01Z

Re-opening after additional research and editing my comment to add more clarifying details.

slevithan · 2024-11-27T15:36:35Z

In addition to causing edge case bugs, this issue is also preventing the Hack grammar from running in Shiki when using its JS engine (which transpiles Oniguruma regexes to JS using Oniguruma-To-ES). Oniguruma-To-ES intentionally doesn't reproduce Oniguruma's bugs related to handling of unenclosed \xF5 through \xFF, and instead throws for them (as Oniguruma does for \x80 through \xF4).

slevithan mentioned this issue Nov 26, 2024

Hack grammar contains invalid Oniguruma token \xff shikijs/textmate-grammars-themes#107

Closed

slevithan closed this as completed Nov 26, 2024

slevithan reopened this Nov 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hack grammar contains invalid Oniguruma token `\xff` #159

Hack grammar contains invalid Oniguruma token `\xff` #159

slevithan commented Nov 26, 2024 •

edited

Loading

slevithan commented Nov 27, 2024

slevithan commented Nov 27, 2024

Hack grammar contains invalid Oniguruma token \xff #159

Hack grammar contains invalid Oniguruma token \xff #159

Comments

slevithan commented Nov 26, 2024 • edited Loading

Problem

Current behavior

Fix

slevithan commented Nov 27, 2024

slevithan commented Nov 27, 2024

Hack grammar contains invalid Oniguruma token `\xff` #159

Hack grammar contains invalid Oniguruma token `\xff` #159

slevithan commented Nov 26, 2024 •

edited

Loading