Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hack grammar contains invalid Oniguruma token \xff #159

Open
slevithan opened this issue Nov 26, 2024 · 2 comments
Open

Hack grammar contains invalid Oniguruma token \xff #159

slevithan opened this issue Nov 26, 2024 · 2 comments

Comments

@slevithan
Copy link

slevithan commented Nov 26, 2024

Problem

This line in the Hack grammar is an invalid Oniguruma pattern. While it doesn't throw, it also doesn't match what is intended due to the use of encoded byte value \xff (which doesn't work the same as code point \x{ff} in Oniguruma).

The error in this pattern is in the negated range [^...\\x7f-\\xff]. From context, it's easy to figure out that the author meant for this to exclude, among other things, the range from code point U+007F to U+00FF. And although that's what it would do in nearly every modern regex engine, that's not exactly what it's doing in Oniguruma.

In Oniguruma, \xHH matches an "encoded byte value", not a code point value like the enclosed version \x{...} does. That means for values from 0 to 7F, \xHH and \x{HH} work the same, but they diverge for values 80 to FF. With \xHH above 7F, the token must be part of a valid encoded byte sequence. E.g. the three-byte sequence \xEF\xBB\xBF in Oniguruma is equivalent to the single code point \x{FEFF} in Oniguruma (and not the same as the three code points \xEF\xBB\xBF in JavaScript and other regex flavors).

\xFF, on it's own, is not a valid encoded byte sequence. In Oniguruma, standalone invalid byte values \x80 through \xF4 throw. However, due to the peculiarities of Oniguruma's error handling, \xF5 through \xFF do not throw, but they also don't match what's intended. This seems to be because bytes F5 through FF are unused in UTF-8 encoding. So instead of throwing in Oniguruma, these byte values don't match anything if used on their own, and they cause unintended/difficult-to-predict behavior in character class ranges, intersection, etc.

Current behavior

If the pattern in the Hack grammar used e.g. \x7f-\xf4 instead of \x7f-\xff, it would be a clear error (and throw). But because it happens to use an encoded byte in the range of F5 to FF, the behavior is more complicated (but still wrong). Unlike how this pattern would behave in almost every other regex engine, it is preventing the negated character class from matching any code point above FF, which does not seem to be intended.

Fix

To fix this, the \\xff should be replaced with \\x{ff}. It will then run correctly in Oniguruma.

@slevithan
Copy link
Author

Re-opening after additional research and editing my comment to add more clarifying details.

@slevithan slevithan reopened this Nov 27, 2024
@slevithan
Copy link
Author

In addition to causing edge case bugs, this issue is also preventing the Hack grammar from running in Shiki when using its JS engine (which transpiles Oniguruma regexes to JS using Oniguruma-To-ES). Oniguruma-To-ES intentionally doesn't reproduce Oniguruma's bugs related to handling of unenclosed \xF5 through \xFF, and instead throws for them (as Oniguruma does for \x80 through \xF4).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant