<?php

if ( extension_loaded( 'mbstring' ) ) :
	/**
	 * Determines if a given byte string represents a valid UTF-8 encoding.
	 *
	 * Note that it’s unlikely for non-UTF-8 data to validate as UTF-8, but
	 * it is still possible. Many texts are simultaneously valid UTF-8,
	 * valid US-ASCII, and valid ISO-8859-1 (`latin1`).
	 *
	 * Example:
	 *
	 *     true === wp_is_valid_utf8( '' );
	 *     true === wp_is_valid_utf8( 'just a test' );
	 *     true === wp_is_valid_utf8( "\xE2\x9C\x8F" );    // Pencil, U+270F.
	 *     true === wp_is_valid_utf8( "\u{270F}" );        // Pencil, U+270F.
	 *     true === wp_is_valid_utf8( '✏' );               // Pencil, U+270F.
	 *
	 *     false === wp_is_valid_utf8( "just \xC0 test" ); // Invalid bytes.
	 *     false === wp_is_valid_utf8( "\xE2\x9C" );       // Invalid/incomplete sequences.
	 *     false === wp_is_valid_utf8( "\xC1\xBF" );       // Overlong sequences.
	 *     false === wp_is_valid_utf8( "\xED\xB0\x80" );   // Surrogate halves.
	 *     false === wp_is_valid_utf8( "B\xFCch" );        // ISO-8859-1 high-bytes.
	 *                                                     // E.g. The “ü” in ISO-8859-1 is a single byte 0xFC,
	 *                                                     // but in UTF-8 is the two-byte sequence 0xC3 0xBC.
	 *
	 * A “valid” string consists of “well-formed UTF-8 code unit sequence[s],” meaning
	 * that the bytes conform to the UTF-8 encoding scheme, all characters use the minimal
	 * byte sequence required by UTF-8, and that no sequence encodes a UTF-16 surrogate
	 * code point or any character above the representable range.
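	 *
	 * As a sketch of that last rule, with byte values chosen here purely for
	 * illustration, the representable range ends at U+10FFFF:
	 *
	 *     true  === wp_is_valid_utf8( "\xF4\x8F\xBF\xBF" ); // U+10FFFF, the highest code point.
	 *     false === wp_is_valid_utf8( "\xF4\x90\x80\x80" ); // Would decode to U+110000.
	 *     false === wp_is_valid_utf8( "\xF5\x80\x80\x80" ); // 0xF5 never appears in UTF-8.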

	 *
	 * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860
	 *
	 * @since 6.9.0
	 *
	 * @param string $bytes String which might contain text encoded as UTF-8.
	 * @return bool Whether the provided bytes can decode as valid UTF-8.
	 */
	function wp_is_valid_utf8( string $bytes ): bool {
		return mb_check_encoding( $bytes, 'UTF-8' );
	}
else :
	/**
	 * Fallback function for validating UTF-8.
	 *
	 * @ignore
	 * @private
	 *
	 * @since 6.9.0
	 */
	function wp_is_valid_utf8( string $string ): bool {
		return _wp_is_valid_utf8_fallback( $string );
	}
endif;

if (
	extension_loaded( 'mbstring' ) &&
	// Maximal subpart substitution introduced by php/php-src@04e59c916f12b322ac55f22314e31bd0176d01cb.
	version_compare( PHP_VERSION, '8.1.6', '>=' )
) :
	/**
	 * Replaces ill-formed UTF-8 byte sequences with the Unicode Replacement Character.
	 *
	 * Knowing what to do in the presence of text encoding issues can be complicated.
	 * This function replaces invalid spans of bytes to neutralize any corruption that
	 * may be there and prevent it from causing further problems downstream.
	 *
	 * However, it’s not always ideal to replace those bytes. In some settings it may
	 * be best to leave the invalid bytes in the string so that downstream code can handle
	 * them in a specific way. Replacing the bytes too early, like escaping for HTML too
	 * early, can introduce other forms of corruption and data loss.
	 *
	 * When in doubt, use this function to replace spans of invalid bytes.
	 *
	 * Replacement follows the “maximal subpart” algorithm for secure and interoperable
	 * strings. This can lead to sequences of multiple replacement characters in a row.
	 *
	 * Example:
	 *
	 *     // Valid strings come through unchanged.
	 *     'test' === wp_scrub_utf8( 'test' );
	 *
	 *     // Invalid sequences of bytes are replaced.
	 *     $invalid = "the byte \xC0 is never allowed in a UTF-8 string.";
	 *     "the byte \u{FFFD} is never allowed in a UTF-8 string." === wp_scrub_utf8( $invalid );
	 *     'the byte � is never allowed in a UTF-8 string.' === wp_scrub_utf8( $invalid );
	 *
	 *     // Maximal subparts are replaced individually.
	 *     '.�.' === wp_scrub_utf8( ".\xC0." );              // C0 is never valid.
	 *     '.�.' === wp_scrub_utf8( ".\xE2\x8C." );          // Missing A3 at end.
	 *     '.��.' === wp_scrub_utf8( ".\xE2\x8C\xE2\x8C." ); // Maximal subparts replaced separately.
	 *     '.��.' === wp_scrub_utf8( ".\xC1\xBF." );         // Overlong sequence.
	 *     '.���.' === wp_scrub_utf8( ".\xED\xA0\x80." );    // Surrogate half; each byte is its own subpart.
	 *
	 * Note! The Unicode Replacement Character is itself a Unicode character (U+FFFD).
	 * Once a span of invalid bytes has been replaced by one, it will not be possible
	 * to know whether the replacement character was originally intended to be there
	 * or if it is the result of scrubbing bytes. It is ideal to leave replacement for
	 * display only, but some contexts (e.g. generating XML or passing data into a
	 * large language model) require valid input strings.
	 *
	 * @since 6.9.0
	 *
	 * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630
	 *
	 * @param string $text String which is assumed to be UTF-8 but may contain invalid sequences of bytes.
	 * @return string Input text with invalid sequences of bytes replaced with the Unicode replacement character.
	 */
	function wp_scrub_utf8( $text ) {
		/*
		 * While it looks like setting the substitute character could fail,
		 * the internal PHP code will never fail when provided a valid
		 * code point as a number. In this case, there’s no need to check
		 * its return value to see if it succeeded.
		 */
		$prev_replacement_character = mb_substitute_character();
		mb_substitute_character( 0xFFFD );
		$scrubbed = mb_scrub( $text, 'UTF-8' );
		mb_substitute_character( $prev_replacement_character );

		return $scrubbed;
	}
else :
	/**
	 * Fallback function for scrubbing UTF-8.
	 *
	 * @ignore
	 * @private
	 *
	 * @since 6.9.0
	 */
	function wp_scrub_utf8( $text ) {
		return _wp_scrub_utf8_fallback( $text );
	}
endif;

if ( _wp_can_use_pcre_u() ) :
	/**
	 * Returns whether the given string contains Unicode noncharacters.
	 *
	 * XML recommends against using noncharacters and HTML forbids their
	 * use in attribute names. Unicode recommends that they not be used
	 * in open exchange of data.
	 *
	 * Noncharacters are code points within the following ranges:
	 *  - U+FDD0–U+FDEF
	 *  - U+FFFE–U+FFFF
	 *  - U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, …, U+10FFFE, U+10FFFF
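	 *
	 * Example (a sketch; the inputs are illustrative and the results follow
	 * from the ranges listed above):
	 *
	 *     false === wp_has_noncharacters( 'just a test' );
	 *     false === wp_has_noncharacters( "\u{FFFD}" );   // The replacement character is an ordinary character.
	 *     true  === wp_has_noncharacters( "\u{FDD0}" );   // First code point of U+FDD0–U+FDEF.
	 *     true  === wp_has_noncharacters( "a\u{FFFE}b" ); // Noncharacter surrounded by text.
	 *     true  === wp_has_noncharacters( "\u{10FFFF}" ); // Last code point of the last plane.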

	 *
	 * @see https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G12612
	 * @see https://www.w3.org/TR/xml/#charsets
	 * @see https://html.spec.whatwg.org/#attributes-2
	 *
	 * @since 6.9.0
	 *
	 * @param string $text Are there noncharacters in this string?
	 * @return bool Whether noncharacters were found in the string.
	 */
	function wp_has_noncharacters( string $text ): bool {
		return 1 === preg_match(
			'/[\x{FDD0}-\x{FDEF}\x{FFFE}\x{FFFF}\x{1FFFE}\x{1FFFF}\x{2FFFE}\x{2FFFF}\x{3FFFE}\x{3FFFF}\x{4FFFE}\x{4FFFF}\x{5FFFE}\x{5FFFF}\x{6FFFE}\x{6FFFF}\x{7FFFE}\x{7FFFF}\x{8FFFE}\x{8FFFF}\x{9FFFE}\x{9FFFF}\x{AFFFE}\x{AFFFF}\x{BFFFE}\x{BFFFF}\x{CFFFE}\x{CFFFF}\x{DFFFE}\x{DFFFF}\x{EFFFE}\x{EFFFF}\x{FFFFE}\x{FFFFF}\x{10FFFE}\x{10FFFF}]/u',
			$text
		);
	}
else :
	/**
	 * Fallback function for detecting noncharacters in a text.
	 *
	 * @ignore
	 * @private
	 *
	 * @since 6.9.0
	 */
	function wp_has_noncharacters( string $text ): bool {
		return _wp_has_noncharacters_fallback( $text );
	}
endif;