Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Fix Alphabetizing for the ICANN section #2287

Merged
merged 4 commits into from
Nov 29, 2024

Conversation

groundcat
Copy link
Contributor

This PR is to address #2255

No entry has been modified, which can be verified with

diff <(sort public_suffix_list.dat) <(sort public_suffix_list_sorted.dat)

@wdhdev
Copy link
Contributor

wdhdev commented Nov 23, 2024

Are you able to share the script you used to sort it?

@groundcat
Copy link
Contributor Author

Script used:
https://gist.github.com/groundcat/6d81a42e009468c24a80d1c57a7211a8

@simon-friedberger
Copy link
Contributor

We should probably make sure that psltool also checks this now so it doesn't regress again.

@simon-friedberger
Copy link
Contributor

We should probably make sure that psltool also checks this now so it doesn't regress again.

@groundcat Would you be interested in trying?

@groundcat
Copy link
Contributor Author

We should probably make sure that psltool also checks this now so it doesn't regress again.

@groundcat Would you be interested in trying?

Thank you for the suggestion regarding psltool checks. I should note that I'm not competent in Go, unfortunately.

Additionally, there might be a structural consideration - the ICANN section is formatted quite differently from the private section that the parser was originally designed for. The ICANN entries don't include email addresses or require TXT records that are present in the private section. Simply appending the same cleaning for ICANN entries would likely trigger failures in the PR checker due to these fundamental format differences. (@danderson correct me if I'm wrong though!)

switch v.Name {
case "PRIVATE DOMAINS":
for _, child := range v.Blocks {
ret = append(ret, cleanBlock(child)...)
}
// The private domains section must be sorted according to
// the names of the maintainers of suffix blocks.
ret = append(ret, sortSection(v)...)
case "ICANN DOMAINS":
// We don't currently clean the ICANN section, since it
// has a lot of historical and non-normalized data. We
// want to focus on the private domains section first.
//
// TODO: when we're ready, apply the same cleaning as the
// private section.
}

If someone more experienced with Go would like to take on this task, they might need to account for these structural differences in the parser implementation.

@danderson
Copy link
Contributor

Happy tp send a patch for this if you'd like. I didn't do sorting in the ICANN section at the request of maintainers, when we were just getting started.

I believe the formatter should be able to reformat the ICANN section without causing validation errors, the validation and the reformatting are separate passes. It might, however, cause some additional unexpected reflowing of the suffixes because the ICANN section is full of fairly old conventions for suffix layout, and I'm not sure the formatter takes care of all of them.

@danderson
Copy link
Contributor

Just tried it. The change is mostly trivial, but formatting the ICANN section triggers a bunch of lint errors due to inline comments that prevent full sorting. The easiest fix for that, for now, is to just not produce those warnings when sorting the ICANN section, unless we want to try and remove them... But in the ICANN section a lot of those comments are load-bearing subsection separators, see for example the comments in the .aero section.

@danderson
Copy link
Contributor

psltool's sorting also adds a few more changes compared to this PR, all related to accented letters: https://gist.github.com/danderson/cb0fb30acbec38651cf8c0a1f4275125.

This isn't too surprising, accented letter sorting is one of the big things that change between collations, there is no "correct" answer. But, this is the collation we picked for psltool so 🤷

Assuming we're okay with the additional diff, I suggest merging this change (with or without the accents diff applied) and I'll send a separate PR to make psltool sort the ICANN section to maintain the ordering?

@danderson
Copy link
Contributor

#2295 updates psltool to enforce ICANN section ordering to match this PR. The accent sorting changes above need to be merged one way or another, either in this PR or I can add them to the psltool PR, your call :)

@groundcat
Copy link
Contributor Author

#2295 updates psltool to enforce ICANN section ordering to match this PR. The accent sorting changes above need to be merged one way or another, either in this PR or I can add them to the psltool PR, your call :)

Just applied the diff patch to this PR :)

@simon-friedberger simon-friedberger merged commit f34b2f1 into publicsuffix:main Nov 29, 2024
2 checks passed
@groundcat groundcat deleted the fix-icann-sorting branch November 29, 2024 07:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants