@dmsnell
Member

@dmsnell dmsnell commented Sep 10, 2025

Trac ticket: Core-63863
See: #9825, #9830, #9498, #9826, (#9827), #9798, #9828, #9829

Provides a method to determine if a given input string contains Unicode noncharacters, something relevant to XML and HTML semantics.

Where supported, a PCRE-based approach runs 10x–35x faster on worst-case documents (very long, with minimal US-ASCII). Where PCRE support is unavailable, the check runs inside the UTF-8 pipeline, adding minimal overhead to that function.

Noncharacters are valid Unicode code points, but their use in open interchange is strongly discouraged. They will not invalidate a string, but some functions need to know whether they are present.
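
To make the pattern-based strategy concrete, here is a minimal sketch in Python, with the `re` module standing in for PCRE. It is not the code from this PR: the names `is_noncharacter`, `NONCHARACTER_PATTERN`, and `has_noncharacters` are hypothetical, and the only fact assumed is the Unicode definition of noncharacters (U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes).

```python
import re


def is_noncharacter(code_point: int) -> bool:
    """Return True if the code point is one of the Unicode noncharacters."""
    if not 0 <= code_point <= 0x10FFFF:
        return False
    # Contiguous block of 32 noncharacters in the Basic Multilingual Plane.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # Last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF.
    return (code_point & 0xFFFE) == 0xFFFE


# Build one character class that covers the same set (hypothetical helper names).
_PLANE_ENDS = "".join(
    chr((plane << 16) | low) for plane in range(17) for low in (0xFFFE, 0xFFFF)
)
NONCHARACTER_PATTERN = re.compile("[\uFDD0-\uFDEF" + _PLANE_ENDS + "]")


def has_noncharacters(text: str) -> bool:
    """Return True if the decoded text contains at least one noncharacter."""
    return NONCHARACTER_PATTERN.search(text) is not None
```

A single precompiled character class lets the regex engine do the scanning, which is the general idea behind preferring a pattern match where one is available. The per-code-point predicate is the kind of test a hand-written scanner would apply instead.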

@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell force-pushed the utf8/track-noncharacters branch 9 times, most recently from 2c0a22e to 74a6c52 Compare September 16, 2025 13:27
@dmsnell dmsnell force-pushed the utf8/track-noncharacters branch 5 times, most recently from 220bf5e to 258508b Compare September 23, 2025 04:06
@WordPress WordPress deleted a comment from github-actions bot Sep 23, 2025
@dmsnell dmsnell force-pushed the utf8/track-noncharacters branch 4 times, most recently from 7a9b9a7 to 2cc5894 Compare September 25, 2025 18:47
@dmsnell dmsnell force-pushed the utf8/track-noncharacters branch 6 times, most recently from 809d9c9 to 2a169ab Compare October 9, 2025 23:39
@dmsnell dmsnell force-pushed the utf8/track-noncharacters branch 7 times, most recently from 74c6a72 to 4975f20 Compare October 18, 2025 21:40
@dmsnell dmsnell marked this pull request as ready for review October 18, 2025 21:41
@dmsnell dmsnell force-pushed the utf8/track-noncharacters branch 8 times, most recently from f401838 to 9e2f6f2 Compare October 20, 2025 23:24
Noncharacters are code points that are permanently reserved in the
Unicode Standard for internal use. They are not recommended for use in
open interchange of Unicode text data. However, they are valid code
points and will not cause a string to return as invalid.

Still, HTML and XML both impose semantic rules on their use and it may
be important for code to know whether they are present in a string. This
patch introduces a new function, `wp_has_noncharacters()`, which answers
this question.

This is accomplished through an inline check with the fallback UTF-8 scanner.
There are 66 noncharacters, making it difficult to find them properly
with common string search functionality. While the inline check adds
overhead to the scanning process, the rare occurrence of noncharacters
should lead to minimal actual overhead due to strong branch prediction.

See https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12612
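
As an illustration of the inline check described in the commit message above, the following Python sketch validates UTF-8 bytes and flags noncharacters in the same pass. It is a simplified stand-in, not the fallback scanner from this patch: the function name `scan_utf8` and its return shape are assumptions made for the example, and a Python loop says nothing about the branch-prediction behavior mentioned above.

```python
from typing import Tuple


def scan_utf8(data: bytes) -> Tuple[bool, bool]:
    """Return (is_valid_utf8, has_noncharacters) from one pass over the bytes.

    When the input is not valid UTF-8 the result is (False, False).
    """
    found_noncharacter = False
    i, end = 0, len(data)
    while i < end:
        lead = data[i]
        if lead < 0x80:
            # US-ASCII bytes are never noncharacters.
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            length, code_point, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, code_point, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, code_point, minimum = 4, lead & 0x07, 0x10000
        else:
            return False, False  # stray continuation byte or invalid lead byte
        if i + length > end:
            return False, False  # truncated sequence
        for byte in data[i + 1 : i + length]:
            if (byte & 0xC0) != 0x80:
                return False, False  # expected a continuation byte
            code_point = (code_point << 6) | (byte & 0x3F)
        if code_point < minimum or code_point > 0x10FFFF:
            return False, False  # overlong encoding or out of range
        if 0xD800 <= code_point <= 0xDFFF:
            return False, False  # surrogate halves are not valid in UTF-8
        # Inline noncharacter check: two cheap comparisons per decoded code point.
        if 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE:
            found_noncharacter = True
        i += length
    return True, found_noncharacter
```

For example, `scan_utf8("a\uFFFEb".encode("utf-8"))` reports a valid string that contains a noncharacter, while plain ASCII input never reaches the check at all.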
@dmsnell dmsnell force-pushed the utf8/track-noncharacters branch from 9e2f6f2 to 8e0c7eb Compare October 21, 2025 00:34
pento pushed a commit that referenced this pull request Oct 21, 2025
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. However, they are valid code points and will not cause a string to return as invalid.

Still, HTML and XML both impose semantic rules on their use and it may be important for code to know whether they are present in a string. This patch introduces a new function, `wp_has_noncharacters()`, which answers this question.

See https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12612

Developed in #9827
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.


git-svn-id: https://develop.svn.wordpress.org/trunk@61000 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Oct 21, 2025
Developed in WordPress/wordpress-develop#9827
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

Built from https://develop.svn.wordpress.org/trunk@61000


git-svn-id: http://core.svn.wordpress.org/trunk@60336 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Oct 21, 2025
Developed in WordPress/wordpress-develop#9827
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

Built from https://develop.svn.wordpress.org/trunk@61000


git-svn-id: https://core.svn.wordpress.org/trunk@60336 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell
Member Author

dmsnell commented Oct 21, 2025

Merged in cfab276
[61000]

@dmsnell dmsnell closed this Oct 21, 2025
@dmsnell dmsnell deleted the utf8/track-noncharacters branch October 21, 2025 03:08