IDN Support suggestion #15

Open
wadih opened this issue Sep 9, 2022 · 2 comments

wadih commented Sep 9, 2022

It seems that no IDN can get past the validation regexp in regDomain.class.php:

if (!preg_match("/^([a-z0-9])(([a-z0-9-])*([a-z0-9]))*$/", $domPart)) return FALSE;

If anybody's interested, the library could be extended to support UTF-8 letters by adding the \p{L} Unicode property alongside a-z0-9 and appending the /u modifier:

if (!preg_match("/^([a-z0-9\p{L}])(([a-z0-9-\p{L}])*([a-z0-9\p{L}]))*$/u", $domPart)) return FALSE;

I tested it like this, and it worked:

echo (new regDomain())->getRegisteredDomain("example.мон", false);

It returns the expected result for that IDN (.xn--l1acc):

example.мон
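As a rough sketch of the difference between the two patterns, the snippet below runs a single label through each. The helper name `matchesLabel` and the constant names are made up for illustration and are not part of the library. The hyphen is moved to the end of the character class here only to keep it unambiguous. One side effect worth noting: \p{L} also covers uppercase letters, so the proposed pattern accepts labels the original rejects.

```php
<?php
// Illustrative only: matchesLabel() and both constants are hypothetical names,
// not part of regDomain.class.php.

// The pattern quoted from regDomain.class.php.
const ORIGINAL_LABEL = '/^([a-z0-9])(([a-z0-9-])*([a-z0-9]))*$/';

// The proposed variant: \p{L} added and /u appended (hyphen moved to the end
// of the class, which does not change what the class matches).
const UNICODE_LABEL = '/^([a-z0-9\p{L}])(([a-z0-9\p{L}-])*([a-z0-9\p{L}]))*$/u';

function matchesLabel(string $pattern, string $label): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $label) === 1;
}

var_dump(matchesLabel(ORIGINAL_LABEL, 'мон'));     // bool(false)
var_dump(matchesLabel(UNICODE_LABEL, 'мон'));      // bool(true)
var_dump(matchesLabel(ORIGINAL_LABEL, 'EXAMPLE')); // bool(false)
var_dump(matchesLabel(UNICODE_LABEL, 'EXAMPLE'));  // bool(true), since \p{L} includes uppercase
```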

I could submit a pull request if nobody sees a problem with this.

wadih (Author) commented Sep 9, 2022

I found an issue with the regexp above: it doesn't appear to match Indian TLDs such as the ones below, although about 99% of the rest worked.

ಭಾರತ     xn--2scrj9c
ଭାରତ     xn--3hcrj9c
ভাৰত     xn--45br5cyl
भारतम्    xn--h2breg3eve

I tried varying the regexp, including using \p{Devanagari}, but haven't found one that works.
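One possible explanation, offered as an assumption rather than a confirmed diagnosis: these labels contain combining marks (vowel signs, and a final virama in भारतम्), which fall under the Unicode category \p{M} rather than \p{L}. Also, only the last of the four is written in Devanagari; the others use the Kannada, Odia and Bengali scripts. The sketch below compares the \p{L}-only pattern with a hypothetical variant that also allows \p{M} in non-initial positions. The constant names are made up for illustration, and the variant may still not cover everything IDNA permits.

```php
<?php
// Hypothetical patterns for illustration; neither constant exists in the library.

// The \p{L}-only proposal (hyphen moved to the end of the class).
const LETTERS_ONLY = '/^([a-z0-9\p{L}])(([a-z0-9\p{L}-])*([a-z0-9\p{L}]))*$/u';

// Same shape, but combining marks (\p{M}) are allowed after the first character,
// so vowel signs and viramas can match.
const LETTERS_AND_MARKS = '/^([a-z0-9\p{L}])(([a-z0-9\p{L}\p{M}-])*([a-z0-9\p{L}\p{M}]))*$/u';

// The four TLDs listed above.
$tlds = ['ಭಾರತ', 'ଭାରତ', 'ভাৰত', 'भारतम्'];

foreach ($tlds as $tld) {
    printf(
        "%s letters-only=%d with-marks=%d\n",
        $tld,
        preg_match(LETTERS_ONLY, $tld),
        preg_match(LETTERS_AND_MARKS, $tld)
    );
}
```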

It may be necessary to validate the ASCII form of the domain instead, to avoid these script-specific challenges.

usrflo (Owner) commented Feb 5, 2023

@wadih, thanks for your feedback and research. In 9fccafa I simply disabled the domain label length validation and the regexp check to prevent incorrect analysis.

I'm inclined to change the processing so that a second suffix tree is encoded to ACE and all checks, including length validation and character validation, are done in ASCII. Backward compatibility should be kept.
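A minimal sketch of what ASCII-side validation could look like, not the planned implementation: it assumes PHP's intl extension is available for `idn_to_ascii()`, and the function name `isValidAceDomain` is hypothetical. It does not touch the suffix tree. It only shows the label checks, using the original ASCII regexp and the 63-octet label limit from RFC 1035.

```php
<?php
// Hypothetical helper, not part of the library. Requires the intl extension.
function isValidAceDomain(string $domain): bool
{
    if ($domain === '') {
        return false; // guard: idn_to_ascii() rejects an empty domain
    }

    // Convert the whole name to its ASCII-compatible encoding (ACE / Punycode).
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($ascii === false) {
        return false; // conversion failed
    }

    foreach (explode('.', strtolower($ascii)) as $label) {
        // Length check on the ASCII form: 1 to 63 octets per label.
        $length = strlen($label);
        if ($length < 1 || $length > 63) {
            return false;
        }
        // Character check with the original ASCII-only pattern.
        if (!preg_match('/^([a-z0-9])(([a-z0-9-])*([a-z0-9]))*$/', $label)) {
            return false;
        }
    }
    return true;
}

if (function_exists('idn_to_ascii')) {
    var_dump(isValidAceDomain('example.мон'));
    var_dump(isValidAceDomain('example.भारतम्'));
}
```

Because the checks run on the encoded form, no per-script character classes are needed, which sidesteps the combining-mark problem seen with the Unicode regexp.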
