Commit b6b9cf8

Charset: Rely on new UTF-8 pipeline for mb_strlen() fallback.
The existing polyfill for `mb_strlen()` contains a number of issues leaving plenty of opportunity for improvement. Specifically, the following are all deficiencies:

 - It relies on Unicode PCRE support.
 - It assumes input strings are valid UTF-8.
 - It splits input strings into an array of characters to count them (1,000 at a time, iterating until complete).
 - It entirely gives up when the Unicode support is missing.

This patch provides an updated polyfill which will reliably count code points in a UTF-8 string, even in the presence of sequences of invalid bytes. It scans through the input with zero allocations.

Additionally, the underlying fallback extends the behavior of `mb_strlen()` to provide character counts for substrings within a larger input without extracting the substring (it can count characters within a byte offset and length of a larger string).

This change improves the reliability of UTF-8 string length calculations and removes behavioral variability based on the runtime system.

Developed in WordPress/wordpress-develop#9828
Discussed in https://core.trac.wordpress.org/ticket/63863

See #63863.

Built from https://develop.svn.wordpress.org/trunk@60949

git-svn-id: https://core.svn.wordpress.org/trunk@60285 1a063a9b-81f0-0310-95a4-ce76da25c4cd
1 parent edf32b4 commit b6b9cf8

3 files changed, +57 -57 lines changed

wp-includes/compat-utf8.php

Lines changed: 46 additions & 0 deletions
@@ -291,3 +291,49 @@ function _wp_scrub_utf8_fallback( string $bytes ): string {
 
 	return $scrubbed;
 }
+
+/**
+ * Returns how many code points are found in the given UTF-8 string.
+ *
+ * Invalid spans of bytes count as a single code point according
+ * to the maximal subpart rule. This function is a fallback method
+ * for calling `mb_strlen( $text, 'UTF-8' )`.
+ *
+ * When negative values are provided for the byte offsets or length,
+ * this will always report zero code points.
+ *
+ * Example:
+ *
+ *     4 === _wp_utf8_codepoint_count( 'text' );
+ *
+ *     // Groups are 'test', "\x90" as '�', 'wp', "\xE2\x80" as '�', "\xC0" as '�', and 'test'.
+ *     13 === _wp_utf8_codepoint_count( "test\x90wp\xE2\x80\xC0test" );
+ *
+ * @since 6.9.0
+ * @access private
+ *
+ * @param string $text Count code points in this string.
+ * @param ?int $byte_offset Start counting after this many bytes in `$text`. Must be positive.
+ * @param ?int $max_byte_length Optional. Stop counting after having scanned past this many bytes.
+ *                              Default is to scan until the end of the string. Must be positive.
+ * @return int How many code points were found.
+ */
+function _wp_utf8_codepoint_count( string $text, ?int $byte_offset = 0, ?int $max_byte_length = PHP_INT_MAX ): int {
+	if ( $byte_offset < 0 ) {
+		return 0;
+	}
+
+	$count = 0;
+	$at = $byte_offset;
+	$end = strlen( $text );
+	$invalid_length = 0;
+	$max_byte_length = min( $end - $at, $max_byte_length );
+
+	while ( $at < $end && ( $at - $byte_offset ) < $max_byte_length ) {
+		$count += _wp_scan_utf8( $text, $at, $invalid_length, $max_byte_length - ( $at - $byte_offset ) );
+		$count += $invalid_length > 0 ? 1 : 0;
+		$at += $invalid_length;
+	}
+
+	return $count;
+}
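To make the counting rule in the docblock above concrete, here is a minimal, self-contained sketch. It is not the WordPress implementation: the real function delegates to `_wp_scan_utf8()`, whose body is not part of this diff. The name `sketch_utf8_codepoint_count()` is hypothetical, and the byte ranges are taken from the Unicode Standard's table of well-formed UTF-8 sequences. The sketch also assumes that a character cut off by the end of the byte window counts as one invalid span.

```php
<?php
/**
 * Hypothetical sketch, not the WordPress code: counts code points in a byte
 * window of $text, counting each maximal invalid subpart as one code point.
 */
function sketch_utf8_codepoint_count( string $text, int $byte_offset = 0, int $max_byte_length = PHP_INT_MAX ): int {
	$length = strlen( $text );
	if ( $byte_offset < 0 || $max_byte_length < 0 || $byte_offset >= $length ) {
		return 0;
	}

	// Clamp the window first so that adding PHP_INT_MAX cannot overflow.
	$end   = $byte_offset + min( $length - $byte_offset, $max_byte_length );
	$at    = $byte_offset;
	$count = 0;

	while ( $at < $end ) {
		$lead = ord( $text[ $at++ ] );

		// Continuation bytes wanted, and the range allowed for the next one.
		$need = 0;
		$low  = 0x80;
		$high = 0xBF;

		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			$need = 1;
		} elseif ( $lead >= 0xE0 && $lead <= 0xEF ) {
			$need = 2;
			if ( 0xE0 === $lead ) {
				$low = 0xA0; // Rejects overlong three-byte forms.
			}
			if ( 0xED === $lead ) {
				$high = 0x9F; // Rejects UTF-16 surrogate halves.
			}
		} elseif ( $lead >= 0xF0 && $lead <= 0xF4 ) {
			$need = 3;
			if ( 0xF0 === $lead ) {
				$low = 0x90; // Rejects overlong four-byte forms.
			}
			if ( 0xF4 === $lead ) {
				$high = 0x8F; // Rejects values above U+10FFFF.
			}
		}
		// ASCII bytes and invalid lead bytes keep $need at 0 and consume one byte.

		// Consume continuation bytes while they still fit a well-formed prefix.
		while ( $need > 0 && $at < $end ) {
			$byte = ord( $text[ $at ] );
			if ( $byte < $low || $byte > $high ) {
				break;
			}
			$at++;
			$need--;
			$low  = 0x80;
			$high = 0xBF;
		}

		// A complete character and a truncated prefix each count once.
		$count++;
	}

	return $count;
}
```

With this sketch, `sketch_utf8_codepoint_count( "h\xC3\xA9llo w\xC3\xB6rld", 7 )` counts the five code points of "wörld" directly inside the larger string, which is the substring-counting behavior the commit message describes. Passing `3` as the third argument stops after "wö" and yields two.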

wp-includes/compat.php

Lines changed: 10 additions & 56 deletions
@@ -228,69 +228,23 @@ function mb_strlen( $string, $encoding = null ) { // phpcs:ignore Universal.Nami
 /**
  * Internal compat function to mimic mb_strlen().
  *
- * Only understands UTF-8 and 8bit. All other character sets will be treated as 8bit.
- * For `$encoding === UTF-8`, the `$str` input is expected to be a valid UTF-8 byte
- * sequence. The behavior of this function for invalid inputs is undefined.
+ * Only supports UTF-8 and non-shifting single-byte encodings. For all other
+ * encodings expect the counts to be wrong. When the given encoding (or the
+ * `blog_charset` if none is provided) isn’t UTF-8 then the function returns
+ * the byte-count of the provided string.
  *
  * @ignore
  * @since 4.2.0
  *
  * @param string $str The string to retrieve the character length from.
- * @param string|null $encoding Optional. Character encoding to use. Default null.
- * @return int String length of `$str`.
+ * @param string|null $encoding Optional. Count characters according to this encoding.
+ *                              Default is to consult `blog_charset`.
+ * @return int Count of code points if UTF-8, byte length otherwise.
  */
 function _mb_strlen( $str, $encoding = null ) {
-	if ( null === $encoding ) {
-		$encoding = get_option( 'blog_charset' );
-	}
-
-	/*
-	 * The solution below works only for UTF-8, so in case of a different charset
-	 * just use built-in strlen().
-	 */
-	if ( ! _is_utf8_charset( $encoding ) ) {
-		return strlen( $str );
-	}
-
-	if ( _wp_can_use_pcre_u() ) {
-		// Use the regex unicode support to separate the UTF-8 characters into an array.
-		preg_match_all( '/./us', $str, $match );
-		return count( $match[0] );
-	}
-
-	$regex = '/(?:
-		[\x00-\x7F] # single-byte sequences 0xxxxxxx
-		| [\xC2-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
-		| \xE0[\xA0-\xBF][\x80-\xBF] # triple-byte sequences 1110xxxx 10xxxxxx * 2
-		| [\xE1-\xEC][\x80-\xBF]{2}
-		| \xED[\x80-\x9F][\x80-\xBF]
-		| [\xEE-\xEF][\x80-\xBF]{2}
-		| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences 11110xxx 10xxxxxx * 3
-		| [\xF1-\xF3][\x80-\xBF]{3}
-		| \xF4[\x80-\x8F][\x80-\xBF]{2}
-	)/x';
-
-	// Start at 1 instead of 0 since the first thing we do is decrement.
-	$count = 1;
-
-	do {
-		// We had some string left over from the last round, but we counted it in that last round.
-		--$count;
-
-		/*
-		 * Split by UTF-8 character, limit to 1000 characters (last array element will contain
-		 * the rest of the string).
-		 */
-		$pieces = preg_split( $regex, $str, 1000 );
-
-		// Increment.
-		$count += count( $pieces );
-
-		// If there's anything left over, repeat the loop.
-	} while ( $str = array_pop( $pieces ) );
-
-	// Fencepost: preg_split() always returns one extra item in the array.
-	return --$count;
+	return _is_utf8_charset( $encoding ?? get_option( 'blog_charset' ) )
+		? _wp_utf8_codepoint_count( $str )
+		: strlen( $str );
 }
 
 // sodium_crypto_box() was introduced in PHP 7.2.
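The rewritten `_mb_strlen()` reduces to a single choice: count code points when the effective charset is UTF-8, otherwise return the byte length. The sketch below illustrates only that dispatch and runs outside WordPress. Every `sketch_` name is hypothetical. The site charset is passed as a parameter instead of being read with `get_option( 'blog_charset' )`, the charset check is an assumed approximation of `_is_utf8_charset()`, and the counter is a deliberately simplified stand-in that is only correct for valid UTF-8.

```php
<?php
// Hypothetical approximation of `_is_utf8_charset()`; the accepted spellings are an assumption.
function sketch_is_utf8_charset( ?string $charset ): bool {
	return null !== $charset && in_array( strtolower( $charset ), array( 'utf-8', 'utf8' ), true );
}

// Simplified stand-in for the code point counter. Valid UTF-8 only:
// every byte that is not a continuation byte (10xxxxxx) starts a character.
function sketch_count_valid_utf8( string $str ): int {
	$count  = 0;
	$length = strlen( $str );
	for ( $i = 0; $i < $length; $i++ ) {
		if ( 0x80 !== ( ord( $str[ $i ] ) & 0xC0 ) ) {
			$count++;
		}
	}
	return $count;
}

// Mirrors the shape of the new ternary: code points for UTF-8, bytes otherwise.
function sketch_mb_strlen( string $str, ?string $encoding = null, string $site_charset = 'UTF-8' ): int {
	return sketch_is_utf8_charset( $encoding ?? $site_charset )
		? sketch_count_valid_utf8( $str )
		: strlen( $str );
}
```

For the six-byte string `"h\xC3\xA9llo"` ("héllo"), the sketch returns 5 under the default UTF-8 charset and 6 when the encoding is `'ISO-8859-1'`, since non-UTF-8 encodings fall back to the byte count.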

wp-includes/version.php

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
  *
  * @global string $wp_version
  */
-$wp_version = '6.9-alpha-60948';
+$wp_version = '6.9-alpha-60949';
 
 /**
  * Holds the WordPress DB revision, increments when changes are made to the WordPress DB schema.
