As more and more products use the JSON-P reference implementation in production, parsing performance becomes critical.
I think there is an opportunity to improve the performance of JsonParserImpl. Right now, the underlying tokenizer operates on a Java character string and is completely unaware of the underlying byte representation. In many cases, JSON is persisted as UTF-8 - from RFC 8259:
JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [RFC3629].
Java characters are represented in UTF-16 and conversion from UTF-8 to UTF-16 is often expensive.
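To make the cost concrete, here is a small illustration (not from the implementation) of the mismatch between the two encodings: ASCII markup bytes are identical to their code points in UTF-8, while every Java char is a UTF-16 code unit, so reading a UTF-8 byte stream into a String always transcodes, even for content the application never looks at.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // ASCII structural characters like '{' occupy exactly one byte in UTF-8,
        // so they can be recognized with a plain byte comparison.
        byte[] brace = "{".getBytes(StandardCharsets.UTF_8);
        System.out.println(brace.length);      // 1

        // Non-ASCII text is where UTF-8 and UTF-16 diverge: 'é' is two bytes
        // in UTF-8 but a single UTF-16 code unit in a Java String.
        byte[] text = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(text.length);       // 6 bytes
        System.out.println("héllo".length());  // 5 chars
    }
}
```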
I suggest making a special-purpose tokenizer that operates directly on the UTF-8 byte stream. Other encodings can continue to use the current code path, as they will be less common. A special-case UTF-8 tokenizer would provide the following benefits:
(1) Markup characters in the ASCII range (curly braces, brackets, string delimiters, whitespace, etc.) can be scanned with byte comparisons and never converted to UTF-16.
(2) JSON numbers, true, false, and null never need to be converted to UTF-16.
(3) Strings (keys and values) can be converted to UTF-16 lazily so that if they are never consumed by an application, they need not be converted.
(4) Skip methods (like skipArray() and skipObject()) could avoid any character set conversion of the skipped item.
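To sketch what benefits (1)-(3) might look like in practice, here is a minimal, hypothetical byte-level tokenizer (the class and method names are invented for illustration, not Parsson's actual code). Structural characters and literals are matched with byte comparisons only; string contents are recorded as a byte span and decoded to a UTF-16 String lazily, on demand. This works because UTF-8 continuation bytes are all >= 0x80 and can never collide with ASCII markup bytes like '"' (0x22).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a UTF-8 byte tokenizer; not JsonParserImpl itself.
public class Utf8TokenizerSketch {
    private final byte[] buf;
    private int pos;
    private int strStart, strEnd; // span of the last string token, quotes excluded

    public Utf8TokenizerSketch(byte[] utf8) { this.buf = utf8; }

    public String nextToken() {
        skipWhitespace();
        if (pos >= buf.length) return "EOF";
        switch (buf[pos]) {
            case '{': pos++; return "BEGIN_OBJECT";
            case '}': pos++; return "END_OBJECT";
            case '[': pos++; return "BEGIN_ARRAY";
            case ']': pos++; return "END_ARRAY";
            case ':': pos++; return "COLON";
            case ',': pos++; return "COMMA";
            case '"': scanString(); return "STRING";
            default:  scanLiteral(); return "LITERAL"; // number, true, false, null
        }
    }

    // Lazy decode: until the caller asks for the value, no UTF-8 -> UTF-16
    // conversion happens at all (benefit 3).
    public String lastString() {
        return new String(buf, strStart, strEnd - strStart, StandardCharsets.UTF_8);
    }

    private void skipWhitespace() {
        // RFC 8259 whitespace is space, tab, LF, CR -- all single ASCII bytes.
        while (pos < buf.length && (buf[pos] == ' ' || buf[pos] == '\t'
                || buf[pos] == '\n' || buf[pos] == '\r')) pos++;
    }

    private void scanString() {
        pos++; // consume opening quote
        strStart = pos;
        // Continuation bytes of multi-byte characters are >= 0x80, so they
        // cannot be mistaken for '"' or '\\'; byte scanning is safe.
        while (pos < buf.length && buf[pos] != '"') {
            if (buf[pos] == '\\') pos++; // skip the escaped byte
            pos++;
        }
        strEnd = pos;
        pos++; // consume closing quote
    }

    private void scanLiteral() {
        while (pos < buf.length && "{}[]:,\" \t\n\r".indexOf(buf[pos]) < 0) pos++;
    }
}
```

The same byte-span bookkeeping would extend naturally to benefit (4): a skip method could track brace/bracket depth over raw bytes and never record spans at all.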
@jjspiegel Thanks for raising those issues. Unlike the JCP model, which had a dedicated Reference Implementation, Jakarta EE does not mandate one; it can, and in theory should, have more than one implementation (at Eclipse or in other communities such as Apache, JBoss, etc.).
If the Glassfish "Spec Implementation" under the Jakarta EE umbrella is widely used, then I am pretty sure the team and community will try to address many of these issues in upcoming releases, but that does not prevent others from creating and maintaining their own independent implementations, which may have advantages over the SI.