Charset: Track detection of non-characters when scanning strings. #9827
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. However, they are valid code points and will not cause a string to be reported as invalid. Still, HTML and XML both impose semantic rules on their use, and it may be important for code to know whether they are present in a string. This patch introduces a new function, `wp_has_noncharacters()`, which answers this question. This is accomplished through an inline check within the fallback UTF-8 scanner. There are 66 noncharacters, which makes them difficult to find reliably with common string-search functionality. While the inline check adds overhead to the scanning process, the rare occurrence of noncharacters should keep the actual overhead minimal thanks to strong branch prediction. See https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12612
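To make the count of 66 concrete, here is a minimal Python sketch of the check at the code-point level. It is not the PHP implementation from this patch, which works inline in the UTF-8 scanner. The names `is_noncharacter` and `has_noncharacters` are hypothetical and used only for illustration. The ranges follow the Unicode Standard: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True if code_point is one of the Unicode noncharacters."""
    # Contiguous block of 32 noncharacters in the Basic Multilingual Plane.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # Last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF.
    return code_point <= 0x10FFFF and (code_point & 0xFFFE) == 0xFFFE


def has_noncharacters(text: str) -> bool:
    """Hypothetical stand-in: scan decoded code points one at a time."""
    return any(is_noncharacter(ord(ch)) for ch in text)


# Counting over the whole code space confirms the figure quoted above.
NONCHARACTER_COUNT = sum(is_noncharacter(cp) for cp in range(0x110000))
```

Because the set is one 32-wide range plus 17 scattered pairs, a single substring search cannot cover it, which is why the check is expressed here as a per-code-point predicate.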
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. However, they are valid code points and will not cause a string to return as invalid. Still, HTML and XML both impose semantic rules on their use and it may be important for code to know whether they are present in a string. This patch introduces a new function, `wp_has_noncharacters()`, which answers this question.

See https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12612

Developed in #9827
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

git-svn-id: https://develop.svn.wordpress.org/trunk@61000 602fd350-edb4-49c9-b593-d223f7449a82
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. However, they are valid code points and will not cause a string to return as invalid. Still, HTML and XML both impose semantic rules on their use and it may be important for code to know whether they are present in a string. This patch introduces a new function, `wp_has_noncharacters()`, which answers this question.

See https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12612

Developed in WordPress/wordpress-develop#9827
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

Built from https://develop.svn.wordpress.org/trunk@61000

git-svn-id: http://core.svn.wordpress.org/trunk@60336 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Trac ticket: Core-63863
See: #9825, #9830, #9498, #9826, (#9827), #9798, #9828, #9829

Provides a method to determine whether a given input string contains Unicode noncharacters, something relevant to XML and HTML semantics.
Where supported, a PCRE-based approach runs 10x–35x faster on worst-case documents (very long, with minimal US-ASCII content). Where PCRE support is unavailable, the check is computed within the UTF-8 scanning pipeline, adding minimal overhead to that function.
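The idea behind a regex-based check can be sketched as a single character class covering all noncharacters. The sketch below uses Python's `re` module as a stand-in, which is not PCRE. The actual pattern used in this PR is not shown on this page, so the pattern and the names `build_noncharacter_pattern` and `has_noncharacters_regex` are assumptions for illustration only.

```python
import re


def build_noncharacter_pattern():
    """Build one character class matching any Unicode noncharacter."""
    # U+FDD0..U+FDEF is written as a range inside the class.
    parts = ["\uFDD0-\uFDEF"]
    # Append the last two code points of each of the 17 planes as literals.
    for plane in range(17):
        base = plane << 16
        parts.append(chr(base | 0xFFFE) + chr(base | 0xFFFF))
    return re.compile("[" + "".join(parts) + "]")


NONCHARACTER_PATTERN = build_noncharacter_pattern()


def has_noncharacters_regex(text: str) -> bool:
    """Hypothetical helper: let the regex engine scan the whole string."""
    return NONCHARACTER_PATTERN.search(text) is not None
```

Handing the scan to the regex engine moves the per-character loop out of interpreted code, which is the general reason such an approach can be faster than a hand-written scanner on long, mostly non-ASCII input.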
Noncharacters are valid Unicode code points with well-formed UTF-8 encodings, but their use in transmission is strongly discouraged. They will not invalidate a string, but some functions need to know whether they are present.