* Efficiently scan through block structure in document without parsing
* the entire block tree and all of its JSON attributes into memory.
* Class for efficiently scanning through block structure in a document
* without parsing the entire block tree and JSON attributes into memory.
* This class is designed to help analyze and modify block structure in a
* streaming fashion and to bridge the gap between parsed block trees and
* the text representing them.
* Use-cases for this class include but are not limited to:
* - Counting block types in a document.
* - Queuing stylesheets based on the presence of various block types.
* - Modifying blocks of a given type, e.g. migrations, updates, and styling.
* - Searching for content of specific kinds, e.g. checking for blocks
* with certain theme support attributes, or block bindings.
* - Adding CSS class names to the element wrapping a block’s inner blocks.
* > *Note!* If a fully-parsed block tree of a document is necessary, including
* > all the parsed JSON attributes, nested blocks, and HTML, consider
* > using {@see \parse_blocks()} instead which will parse the document
* For typical usage, jump first to the methods {@see self::next_block()},
* {@see self::next_delimiter()}, or {@see self::next_token()}.
* As a lower-level interface than {@see \parse_blocks()}, this class follows
* different performance-focused values:
* - Minimize allocations so that documents of any size may be processed
* on a fixed or marginal amount of memory.
* - Make hidden costs explicit so that calling code only has to pay the
* performance penalty for features it needs.
* - Operate with a streaming and re-entrant design to make it possible
* to operate on chunks of a document and to resume after pausing.
* This means that some operations might appear more cumbersome than one
* might expect. This design tradeoff opens up opportunity to wrap this in
* a convenience class to add higher-level functionality.
* All text documents can be considered a block document containing a combination
* of “freeform HTML” and explicit block structure. Block structure forms through
* special HTML comments called _delimiters_ which include a block type and,
* optionally, block attributes encoded as a JSON object payload.
* This processor is designed to scan through a block document from delimiter to
* delimiter, tracking how the delimiters impact the structure of the document.
* Spans of HTML appear between delimiters. If these spans exist at the top level
* of the document, meaning there is no containing block around them, they are
* considered freeform HTML content. If, however, they appear _inside_ block
* structure they are interpreted as `innerHTML` for the containing block.
* ### Tokens and scanning
* As the processor scans through a document it reports information about the token
* on which it pauses. Tokens represent spans of text in the input comprising block
* delimiters and spans of HTML.
* - {@see self::next_token()} visits every contiguous subspan of text in the
* input document. This includes all explicit block comment delimiters and spans
* of HTML content (whether freeform or inner HTML).
* - {@see self::next_delimiter()} visits every explicit block comment delimiter
* unless passed a block type which covers freeform HTML content. In these cases
* it will stop at top-level spans of HTML and report a `null` block type.
* - {@see self::next_block()} visits every block delimiter which _opens_ a block.
* This includes opening block delimiters as well as void block delimiters. With
* the same exception as above for freeform HTML block types, this will visit
* top-level spans of HTML content.
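*
* As a rough illustration of the scanning methods above, the sketch below counts
* block types by visiting only delimiters which open blocks. The function name
* `count_block_types()` is hypothetical, and the sketch assumes the constructor
* accepts the document text as its only argument.
*
* ```php
* function count_block_types( string $post_content ): array {
*     $processor = new WP_Block_Processor( $post_content );
*     $counts    = array();
*
*     // next_block() pauses on openers and void blocks, so each block counts once.
*     while ( $processor->next_block() ) {
*         $block_type            = $processor->get_block_type();
*         $counts[ $block_type ] = ( $counts[ $block_type ] ?? 0 ) + 1;
*     }
*
*     return $counts;
* }
* ```
*
* Because no block type is passed to next_block(), top-level freeform HTML
* should not be visited by this loop.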
* When matched on a particular token, the following methods provide structural
* and textual information about it:
* - {@see self::get_delimiter_type()} reports whether the delimiter is an opener,
* a closer, or if it represents a whole void block.
* - {@see self::get_block_type()} reports the fully-qualified block type which
* the delimiter represents.
* - {@see self::get_printable_block_type()} reports the fully-qualified block type,
* but returns `core/freeform` instead of `null` for top-level freeform HTML content.
* - {@see self::is_block_type()} indicates if the delimiter represents a block of
* the given block type, or wildcard or pseudo-block type described below.
* - {@see self::opens_block()} indicates if the delimiter opens a block of one
* of the provided block types. Opening, void, and top-level freeform HTML content
* - {@see static::get_attributes()} is currently reserved for a future streaming
* - {@see self::allocate_and_return_parsed_attributes()} extracts the JSON attributes
* for delimiters which open blocks and returns the fully-parsed attributes as an
* associative array. See {@see static::get_last_json_error()} for when this fails.
* - {@see self::is_html()} indicates if the token is a span of HTML which might
* be top-level freeform content or a block’s inner HTML.
* - {@see self::get_html_content()} returns the span of HTML.
* - {@see self::get_span()} reports the byte offset and length of the token
* within the input document.
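*
* The following sketch shows one way the inspection methods might be combined:
* it collects the `level` attribute from heading blocks. The function name
* `collect_heading_levels()` and the choice of the `level` attribute are
* illustrative assumptions, as is the single-argument constructor.
*
* ```php
* function collect_heading_levels( string $post_content ): array {
*     $processor = new WP_Block_Processor( $post_content );
*     $levels    = array();
*
*     while ( $processor->next_block( 'heading' ) ) {
*         // Parsing attributes allocates memory, so do it only for matched blocks.
*         $attributes = $processor->allocate_and_return_parsed_attributes();
*
*         // Skip blocks without a usable JSON payload or without the attribute.
*         if ( is_array( $attributes ) && isset( $attributes['level'] ) ) {
*             $levels[] = $attributes['level'];
*         }
*     }
*
*     return $levels;
* }
* ```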
* It’s possible for the processor to fail to scan forward if the input document ends
* in a proper prefix of an explicit block comment delimiter. For example, if the input
* ends in `<!-- wp:` then it _might_ be the start of another delimiter. The parser
* cannot know, however, and therefore refuses to proceed. Use {@see static::get_last_error()}
* to distinguish between a failure to find the next token and an incomplete input.
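*
* A hedged sketch of telling the two outcomes apart once scanning stops. It relies
* only on get_last_error() reporting `null` when nothing went wrong. The function
* name `scanned_without_error()` is hypothetical, and the constructor is assumed
* to take the document text.
*
* ```php
* function scanned_without_error( string $chunk ): bool {
*     $processor = new WP_Block_Processor( $chunk );
*
*     while ( $processor->next_token() ) {
*         // Handle each delimiter or HTML span here.
*         continue;
*     }
*
*     // A non-null value means scanning stopped on an error, such as input
*     // ending in what might be the start of another delimiter.
*     return null === $processor->get_last_error();
* }
* ```
*
* When this returns `false` for a chunk of a larger document, the caller may
* want to wait for more input before deciding how to treat the remainder.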
* A block’s “type” comprises an optional _namespace_ and _name_. If the namespace
* isn’t provided it will be interpreted as the implicit `core` namespace. For example,
* the type `gallery` is the name of the block in the `core` namespace, but the type
* `abc/gallery` is the _fully-qualified_ block type for the block whose name is still
* `gallery`, but in the `abc` namespace.
* Methods on this class are aware of this block naming semantic and anywhere a block
* type is an argument to a method it will be normalized to account for implicit namespaces.
* Passing `paragraph` is the same as passing `core/paragraph`. On the contrary, anywhere
* this class returns a block type, it will return the fully-qualified and normalized form.
* For example, for the `<!-- wp:group -->` delimiter it will return `core/group` as the
* There are two special block types that change the behavior of the processor:
* - The wildcard `*` represents _any block_. In addition to matching all block types,
* it also represents top-level freeform HTML whose block type is reported as `null`.
* - The `core/freeform` block type is a pseudo-block type which explicitly matches
* top-level freeform HTML.
* These special block types can be passed into any method which searches for blocks.
* There is one additional special block type which may be returned from
* {@see self::get_printable_block_type()}. This is the `#innerHTML` type, which
* indicates that the HTML span on which the processor is paused is inner HTML for
* Non-block content plays a complicated role in processing block documents. This
* processor exposes tools to help work with these spans of HTML.
* - {@see self::is_html()} indicates if the processor is paused at a span of
* HTML but does not differentiate between top-level freeform content and inner HTML.
* - {@see self::is_non_whitespace_html()} indicates not only if the processor
* is paused at a span of HTML, but also whether that span incorporates more than
* whitespace characters. Because block serialization often inserts newlines between
* block comment delimiters, this is useful for distinguishing “real” freeform
* content from purely aesthetic syntax.
* - {@see self::is_block_type()} matches top-level freeform HTML content when
* provided one of the special block types described above.
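*
* These tools might be combined as in the sketch below, which gathers top-level
* freeform HTML while ignoring whitespace-only spans and inner HTML. The function
* name `get_top_level_freeform_html()` is hypothetical, and the constructor is
* assumed to take the document text.
*
* ```php
* function get_top_level_freeform_html( string $post_content ): array {
*     $processor = new WP_Block_Processor( $post_content );
*     $spans     = array();
*
*     while ( $processor->next_token() ) {
*         // `core/freeform` matches only top-level HTML, never inner HTML.
*         if (
*             $processor->is_block_type( 'core/freeform' ) &&
*             $processor->is_non_whitespace_html()
*         ) {
*             $spans[] = $processor->get_html_content();
*         }
*     }
*
*     return $spans;
* }
* ```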
* As the processor traverses block delimiters it maintains a stack of which blocks are
* open at the given place in the document where it’s paused. This stack represents the
* block structure of a document and is used to determine where blocks end, which blocks
* represent inner blocks, whether a span of HTML is top-level freeform content, and
* more. Investigate the stack with {@see self::get_breadcrumbs()}, which returns an
* array of block types starting at the outermost-open block and descending to the
* currently-visited block.
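*
* As a small sketch of using the breadcrumbs, the function below counts paragraph
* blocks which sit anywhere inside a group block. The function name
* `count_paragraphs_inside_groups()` is hypothetical, and the constructor is
* assumed to take the document text.
*
* ```php
* function count_paragraphs_inside_groups( string $post_content ): int {
*     $processor = new WP_Block_Processor( $post_content );
*     $count     = 0;
*
*     while ( $processor->next_block( 'paragraph' ) ) {
*         // Breadcrumbs list the open blocks, outermost first, as
*         // fully-qualified block types.
*         if ( in_array( 'core/group', $processor->get_breadcrumbs(), true ) ) {
*             $count++;
*         }
*     }
*
*     return $count;
* }
* ```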
* Unlike {@see \parse_blocks()}, spans of HTML appear in this structure as the special
* reported block type `#html`. Such a span represents inner HTML for a block if the
* depth reported by {@see self::get_depth()} is greater than one.
* It will generally not be necessary to inspect the stack of open blocks, though
* depth may be important for finding where blocks end. When visiting a block opener,
* the depth will have been increased before pausing; in contrast the depth is
* decremented before visiting a closer. This makes the following an easy way to
* determine if a block is still open.
* $depth = $processor->get_depth();
* while ( $processor->next_token() && $processor->get_depth() > $depth ) {
*     continue;
* }
* // Processor is now paused at the token immediately following the closed block.
* A unique feature of this processor is the ability to return the same output as
* {@see \parse_blocks()} would produce, but for a subset of the input document.
* For example, it’s possible to extract an image block, manipulate that parsed
* block, and re-serialize it into the original document. It’s possible to do so
* while skipping over the parse of the rest of the document.
* {@see self::extract_full_block_and_advance()} will scan forward from the current block opener
* and build the parsed block structure until the current block is closed. It will
* include all inner HTML and inner blocks, and parse all of the inner blocks. It
* can be used to extract a block at any depth in the document, helpful for operating
* on blocks within nested structure.
* if ( ! $processor->next_block( 'gallery' ) ) {
*     return $post_content;
* }
*
* $gallery_at    = $processor->get_span()->start;
* $gallery_block = $processor->extract_full_block_and_advance();
* $after_gallery = $processor->get_span()->start;
* // modify_gallery() stands in for any caller-provided function.
* return (
*     substr( $post_content, 0, $gallery_at ) .
*     serialize_block( modify_gallery( $gallery_block ) ) .
*     substr( $post_content, $after_gallery )
* );
* #### Handling of malformed structure
* There are situations where closing block delimiters appear for which no open block
* exists, or where a document ends before a block is closed, or where a closing block
* delimiter appears but references a different block type than the most-recently
* opened block does. In all of these cases, the stack of open blocks should mirror
* the behavior in {@see \parse_blocks()}.
* Unlike {@see \parse_blocks()}, however, this processor can still operate on the
* invalid block delimiters. It provides a few functions which can be used for building
* custom and non-spec-compliant error handling.
* - {@see self::has_closing_flag()} indicates if the block delimiter contains the
* closing flag at the end. Some invalid block delimiters might contain both the
* void and closing flag, in which case {@see self::get_delimiter_type()} will
* report that it’s a void block.
* - {@see static::get_last_error()} indicates if the processor reached an invalid
* block closing. Depending on the context, {@see \parse_blocks()} might instead
* ignore the token or treat it as freeform HTML content.
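*
* A minimal sketch of custom error detection built on these functions: it records
* where delimiters carry both the void and closing flags. The function name
* `find_void_delimiters_with_closing_flag()` is hypothetical, and the constructor
* is assumed to take the document text.
*
* ```php
* function find_void_delimiters_with_closing_flag( string $post_content ): array {
*     $processor = new WP_Block_Processor( $post_content );
*     $offsets   = array();
*
*     while ( $processor->next_delimiter() ) {
*         // Such delimiters are reported as void, so the flag must be checked too.
*         if (
*             WP_Block_Processor::VOID === $processor->get_delimiter_type() &&
*             $processor->has_closing_flag()
*         ) {
*             $offsets[] = $processor->get_span()->start;
*         }
*     }
*
*     return $offsets;
* }
* ```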
* This class provides helpers for performing semantic block-related operations.
* - {@see self::normalize_block_type()} takes a block type with or without the
* implicit `core` namespace and returns a fully-qualified block type.
* - {@see self::are_equal_block_types()} indicates if two spans across one or
* more input texts represent the same fully-qualified block type.
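*
* The naming rule behind these helpers can be pictured with the standalone sketch
* below. It is an illustration of the rule only, not the method's implementation,
* and `sketch_normalize_block_type()` is a hypothetical name.
*
* ```php
* function sketch_normalize_block_type( string $block_type ): string {
*     // A type without a namespace separator belongs to the implicit `core` namespace.
*     return false === strpos( $block_type, '/' )
*         ? "core/{$block_type}"
*         : $block_type;
* }
* ```
*
* Under this rule `paragraph` and `core/paragraph` name the same block type,
* while `abc/gallery` is left unchanged.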
* This processor is designed to accurately parse a block document. Therefore, many
* of its methods are not meant for subclassing. However, overall this class supports
* building higher-level convenience classes which may choose to subclass it. For those
* classes, avoid re-implementing methods except for the list below. Instead, create
* new names representing the higher-level concepts being introduced. For example, instead
* of creating a new method named `next_block()` which only advances to blocks of a given
* kind, consider creating a new method named something like `next_layout_block()` which
* won’t interfere with the base class method.
* - {@see static::get_last_error()} may be reimplemented to report new errors in the subclass
* which aren’t intrinsic to block parsing.
* - {@see static::get_attributes()} may be reimplemented to provide a streaming interface
* to reading and modifying a block’s JSON attributes. It should be fast and memory efficient.
* - {@see static::get_last_json_error()} may be reimplemented to report new errors introduced
* with a reimplementation of {@see static::get_attributes()}.
class WP_Block_Processor {
* Indicates if the last operation failed, otherwise
* will be `null` for success.
private $last_error = null;
* Indicates failures from decoding JSON attributes.
* @see \json_last_error()
private $last_json_error = JSON_ERROR_NONE;
* Source text provided to processor.
* Byte offset into source text where a matched delimiter starts.
* 5 10 15 20 25 30 35 40 45 50
* <!-- wp:group --><!-- wp:void /--><!-- /wp:group -->
* ╰─ Starts at byte offset 17.
private $matched_delimiter_at = 0;
* Byte length of full span of a matched delimiter.
* 5 10 15 20 25 30 35 40 45 50
* <!-- wp:group --><!-- wp:void /--><!-- /wp:group -->
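*
* The offset and length in these diagrams can be checked with plain string
* functions. This is only a sketch of the arithmetic, and the variable names
* are illustrative.
*
* ```php
* $html      = '<!-- wp:group --><!-- wp:void /--><!-- /wp:group -->';
* $void      = '<!-- wp:void /-->';
* $void_at   = strpos( $html, $void );
* $void_span = substr( $html, $void_at, strlen( $void ) );
* // $void_at is 17 and strlen( $void ) is 17, so $void_span equals $void.
* ```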
private $matched_delimiter_length = 0;
* First byte offset into source text following any previously-matched delimiter.
* Used to indicate where an HTML span starts.
* 5 10 15 20 25 30 35 40 45 50 55
* <!-- wp:paragraph --><p>Content</p><⃨!⃨-⃨-⃨ ⃨/⃨w⃨p⃨:⃨p⃨a⃨r⃨a⃨g⃨r⃨a⃨p⃨h⃨ ⃨-⃨-⃨>⃨
* │ ╰─ This delimiter was matched, and after matching,
* │ revealed the preceding HTML span.
* ╰─ The first byte offset after the previous matched delimiter
* is 21. Because the matched delimiter starts at 35, which is after
* this, a span of HTML must exist between these boundaries.
private $after_previous_delimiter = 0;
* Byte offset where namespace span begins.
* When no namespace is present, this will be the same as the starting
* byte offset for the block name.
* <!-- wp:core/gallery -->
* ╰─ Namespace starts here.
* ├─ The namespace would start here but is implied as “core.”
* ╰─ The name starts here.
private $namespace_at = 0;
* Byte offset where block name span begins.
* When no namespace is present, this will be the same as the starting
* byte offset for the block namespace.
* <!-- wp:core/gallery -->
* ╰─ Namespace starts here.
* ├─ The namespace would start here but is implied as “core.”
* ╰─ The name starts here.
* Byte length of block name span.
* <!-- wp:core/gallery -->
private $name_length = 0;
* Whether the delimiter contains the block-closing flag.
* This may be erroneous if present within a void block,
* therefore {@see self::has_closing_flag()} can be used by
* calling code to perform custom error-handling.
private $has_closing_flag = false;
* Byte offset where JSON attributes span begins.
* <!-- wp:paragraph {"dropCaps":true} -->
* ╰─ Starts at byte offset 18.
* Byte length of JSON attributes span, or 0 if none are present.
* <!-- wp:paragraph {"dropCaps":true} -->
private $json_length = 0;
* Internal parser state, differentiating whether the instance is currently matched,
* on an implicit freeform node, in error, or ready to begin parsing.
* @see self::INCOMPLETE_INPUT
protected $state = self::READY;
* Indicates what kind of block comment delimiter was matched.
* - {@see self::OPENER} If the delimiter is opening a block.
* - {@see self::CLOSER} If the delimiter is closing an open block.
* - {@see self::VOID} If the delimiter represents a void block with no inner content.
* If a parsed comment delimiter contains both the closing and the void
* flags then it will be interpreted as a void block to match the behavior
* of the official block parser. This is a syntax error, however, and the
* delimiter probably ought to close an open block of the same name, if one is open.
* Whether the last-matched delimiter acts like a void block and should be
* popped from the stack of open blocks as soon as the parser advances.
* This applies to void block delimiters and to HTML spans.
private $was_void = false;