You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This line in the Hack grammar is an invalid Oniguruma pattern. While it doesn't throw, it also doesn't match what is intended due to the use of encoded byte value\xff (which doesn't work the same as code point\x{ff} in Oniguruma).
The error in this pattern is in the negated range [^...\\x7f-\\xff]. From context, it's easy to figure out that the author meant for this to exclude, among other things, the range from code point U+007F to U+00FF. And although that's what it would do in nearly every modern regex engine, that's not exactly what it's doing in Oniguruma.
In Oniguruma, \xHH matches an "encoded byte value", not a code point value like the enclosed version \x{...} does. That means for values from 0 to 7F, \xHH and \x{HH} work the same, but they diverge for values 80 to FF. With \xHH above 7F, the token must be part of a valid encoded byte sequence. E.g. the three-byte sequence \xEF\xBB\xBF in Oniguruma is equivalent to the single code point \x{FEFF} in Oniguruma (and not the same as the three code points \xEF\xBB\xBF in JavaScript and other regex flavors).
\xFF, on it's own, is not a valid encoded byte sequence. In Oniguruma, standalone invalid byte values \x80 through \xF4 throw. However, due to the peculiarities of Oniguruma's error handling, \xF5 through \xFF do not throw, but they also don't match what's intended. This seems to be because bytes F5 through FF are unused in UTF-8 encoding. So instead of throwing in Oniguruma, these byte values don't match anything if used on their own, and they cause unintended/difficult-to-predict behavior in character class ranges, intersection, etc.
Current behavior
If the pattern in the Hack grammar used e.g. \x7f-\xf4 instead of \x7f-\xff, it would be a clear error (and throw). But because it happens to use an encoded byte in the range of F5 to FF, the behavior is more complicated (but still wrong). Unlike how this pattern would behave in almost every other regex engine, it is preventing the negated character class from matching any code point above FF, which does not seem to be intended.
Fix
To fix this, the \\xff should be replaced with \\x{ff}. It will then run correctly in Oniguruma.
The text was updated successfully, but these errors were encountered:
In addition to causing edge case bugs, this issue is also preventing the Hack grammar from running in Shiki when using its JS engine (which transpiles Oniguruma regexes to JS using Oniguruma-To-ES). Oniguruma-To-ES intentionally doesn't reproduce Oniguruma's bugs related to handling of unenclosed \xF5 through \xFF, and instead throws for them (as Oniguruma does for \x80 through \xF4).
Problem
This line in the Hack grammar is an invalid Oniguruma pattern. While it doesn't throw, it also doesn't match what is intended due to the use of encoded byte value
\xff
(which doesn't work the same as code point\x{ff}
in Oniguruma).The error in this pattern is in the negated range
[^...\\x7f-\\xff]
. From context, it's easy to figure out that the author meant for this to exclude, among other things, the range from code point U+007F to U+00FF. And although that's what it would do in nearly every modern regex engine, that's not exactly what it's doing in Oniguruma.In Oniguruma,
\xHH
matches an "encoded byte value", not a code point value like the enclosed version\x{...}
does. That means for values from0
to7F
,\xHH
and\x{HH}
work the same, but they diverge for values80
toFF
. With\xHH
above7F
, the token must be part of a valid encoded byte sequence. E.g. the three-byte sequence\xEF\xBB\xBF
in Oniguruma is equivalent to the single code point\x{FEFF}
in Oniguruma (and not the same as the three code points\xEF\xBB\xBF
in JavaScript and other regex flavors).\xFF
, on it's own, is not a valid encoded byte sequence. In Oniguruma, standalone invalid byte values\x80
through\xF4
throw. However, due to the peculiarities of Oniguruma's error handling,\xF5
through\xFF
do not throw, but they also don't match what's intended. This seems to be because bytesF5
throughFF
are unused in UTF-8 encoding. So instead of throwing in Oniguruma, these byte values don't match anything if used on their own, and they cause unintended/difficult-to-predict behavior in character class ranges, intersection, etc.Current behavior
If the pattern in the Hack grammar used e.g.
\x7f-\xf4
instead of\x7f-\xff
, it would be a clear error (and throw). But because it happens to use an encoded byte in the range ofF5
toFF
, the behavior is more complicated (but still wrong). Unlike how this pattern would behave in almost every other regex engine, it is preventing the negated character class from matching any code point aboveFF
, which does not seem to be intended.Fix
To fix this, the
\\xff
should be replaced with\\x{ff}
. It will then run correctly in Oniguruma.The text was updated successfully, but these errors were encountered: