Unicode Case Folding #228

isomarcte · 2022-02-05T20:03:13Z

While working on cats-uri, I ran into an issue with how CIString was handling certain unicode values which led me to notice it wasn't respecting Caseless matching from the Unicode standard. As it turns out, neither does String.equalsIgnoreCase.

I'd just about completed a branch to implement full case folding as defined by the Unicode standard when I ran across this test.

  test("character based equality") {
    assert(CIString("ß") != CIString("SS"))
  }

Since under the Unicode standard's caseless matching these two strings would compare equal, I'm beginning to think we are intentionally not following the standard here. Is that the case? If so, why? Is it to maintain parity with what the Java standard library is doing with methods like equalsIgnoresCase?

The text was updated successfully, but these errors were encountered:

armanbilge · 2022-02-05T20:38:33Z

I was curious and it seems that JS may support this out-of-the-box. So at least we wouldn't have to copy those tables in 😅

"ß".localeCompare("SS", 'en', {sensitivity: 'base'}) // 0

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare

armanbilge · 2022-02-05T20:43:45Z

Ah, here's the equivalent with JDK.

import java.text.Collator
val coll = Collator.getInstance
coll.setStrength(Collator.PRIMARY)
coll.compare("ß", "SS") // 0

isomarcte · 2022-02-05T22:46:43Z

@armanbilge I actually already have the lookup tables: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CaseFolds.scala#L23

It only took a few minutes to get emacs to transform the text file into that.

That said, I'm happy to use existing JS/JRE based implementations, but I think we need to actually have a case folded string: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CIString.scala#L43

I didn't see anyway to get that from the Collator. We need the folded string to get a safe/correct implementation of .hashCode for CIString.

armanbilge · 2022-02-05T23:01:04Z

Thanks, yes, I actually saw your lookup tables which is what prompted me to look for any stdlib solutions for this ... at least for JS, this is just one more thing to bloat the generated code 🙃

You're right though, if there's no way to get a folded string then your lookup tables are the way to go 👍

isomarcte · 2022-02-05T23:45:05Z

I've only poked around in the JRE stdlib. I can check around in JS land to see if we can get a case folded string there and then maybe lighten the generated code and use the lookup tables on the JRE.

armanbilge · 2022-02-05T23:49:47Z

No worries, you can stay focused on doing awesome work in cats-uri and I'll worry about the JS stuff. Correctness/functionality first, optimizations second :)

armanbilge · 2022-02-06T00:34:23Z

~~@isomarcte what about:~~

scala> "ß".toLowerCase.toUpperCase == "SS".toLowerCase.toUpperCase
val res9: Boolean = true

~~Does this do what we want?~~

Edit: ignore me :)

rossabaker · 2022-02-06T03:56:07Z

It looks like you've implemented "full" case folding in #229. Is what we have currently consistent with "simple" case folding? (I could probably answer this myself...)

This would be a semantic breaking change, even if it's binary compatible. If we proceed, should this be 2.0?

I wonder whether our current implementation varies by Java version. JDK8 supported Unicode 6.2, and Java 17 supports Unicode 13.0. It looks like your proposed algorithm guarantees stability, but only if we also normalize to NKFC?

For each string S containing characters only from a given Unicode version, toCasefold(toNFKC(S)) under that version is identical to toCasefold(toNFKC(S)) under any later version of Unicode.

rossabaker · 2022-02-06T04:17:57Z

Ugh, there are four definitions of this.

Default caseless matching
Canonical caseless matching
Compatibility caseless matching
Identifier caseless matching

rossabaker · 2022-02-06T04:51:24Z

It seems you're targeting the Default definition, which seems as good a starting point as any.

isomarcte · 2022-02-06T14:22:37Z

This would be a semantic breaking change, even if it's binary compatible. If we proceed, should this be 2.0?

I'm not sure. It is a semantic change, but most users will be unaffected. What we currently implement appears to be semantically equivalent to the simple case folding rules + rules for some Turkic languages with no normalization.

The Turkic rules are supposed to off by default. That is, this should yield false.

scala> val a: Char = 0x0049.toChar
val a: Char = I

scala> val b: Char = 0x0131.toChar                                                                                      
val b: Char = ı

scala> CIString(a.toString) == CIString(b.toString)
val res0: Boolean = true

I can probably adapt my PR to provide both explicit simple caseless matching and full caseless matching, and then maybe we can give users the option. Let me take a look. I also want to look into the normalization as well.

isomarcte · 2022-02-06T18:10:37Z

@rossabaker @armanbilge take a look here for an ~~insane~~ alternative approach which handles the different normalization and folding rules more explicitly and doesn't break the semantics of CIString.

isomarcte linked a pull request Feb 6, 2022 that will close this issue

Implement Unicode Case Folding #229

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode Case Folding #228

Unicode Case Folding #228

isomarcte commented Feb 5, 2022

armanbilge commented Feb 5, 2022

armanbilge commented Feb 5, 2022

isomarcte commented Feb 5, 2022

armanbilge commented Feb 5, 2022

isomarcte commented Feb 5, 2022

armanbilge commented Feb 5, 2022

armanbilge commented Feb 6, 2022 •

edited

Loading

rossabaker commented Feb 6, 2022

rossabaker commented Feb 6, 2022

rossabaker commented Feb 6, 2022

isomarcte commented Feb 6, 2022

isomarcte commented Feb 6, 2022

Unicode Case Folding #228

Unicode Case Folding #228

Comments

isomarcte commented Feb 5, 2022

armanbilge commented Feb 5, 2022

armanbilge commented Feb 5, 2022

isomarcte commented Feb 5, 2022

armanbilge commented Feb 5, 2022

isomarcte commented Feb 5, 2022

armanbilge commented Feb 5, 2022

armanbilge commented Feb 6, 2022 • edited Loading

rossabaker commented Feb 6, 2022

rossabaker commented Feb 6, 2022

rossabaker commented Feb 6, 2022

isomarcte commented Feb 6, 2022

isomarcte commented Feb 6, 2022

armanbilge commented Feb 6, 2022 •

edited

Loading