WordPress Importer can now migrate URLs in your content

Moving a WordPress site has always meant fixing countless broken URLs. The links still pointed to the old domain, the images didn’t load, and the cover blocks lost their background. Not anymore! WordPress Importer now migrates the URLs in your imported content.

A Real Example

Imagine you’re editor-in-chief of https://yummy-🍲-recipes.org/vegan — an imaginary cooking site with vegan recipes. Your reader base is growing, things are going well, but when you meet people in person, they find it difficult to type in that emoji. You decide to move to an all-ASCII domain: https://yummy-cooking-recipes.org/

Your first step is exporting the site content. You go to wp-admin, click the right button, export the XML file, and… what is it? Some posts have really weird-looking markup. Is it because of that plugin you installed last week? Or the block editor work you contracted last month? You’re not sure. The markup is not wrong. It is valid HTML, and it renders well in every web browser. It’s just not what you expected:

<!-- wp:cover {"url":"https://yummy-\uD83C\uDF72-recipes.org/vegan/wp-content/uploads/photo.jpg","align":"left","id":761} -->
<div
	class="wp-block-cover"
	style="
		background-image: url(&#104;ttps:&#x2f;&#x2f;yummy-\u1f372-recipes.org&#x2f;vegan&#x2f;wp-content&#x25;2Fuploads%2Fphoto.jpg);
	"
>
	<div class="wp-block-cover__inner-container yummy-🍲-recipes.org/vegan/-cover">
		<img
			src="&#104;ttps://xn--yummy--recipes-vb87&#x6d;.org/vegan/wp-content/uploads/cover.jpg"
		/>

		<h1>Yummy Vegan Recipes!</h1>

		<p>You are on the official yummy-🍲-recipes.org/vegan site!</p>

		<p>
			Be careful – there is a phishing site you may mistake us for:
			extra-yummy-🍲-recipes.org/vegan/.

			Oh! And our email is: hello@yummy-🍲-recipes.org
		</p>
	</div>
</div>
<!-- /wp:cover -->

You sigh and think: Well, that will take some work to adjust. The existing URL rewriting tools, such as wp search-replace, will catch some URLs but miss most of them. They may also alter the warning about the phishing site, which mentions a similar but distinct domain. And then you notice the latest WordPress Importer release.
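To make the problem concrete, here is a minimal Python sketch of a literal substring replacement applied to three strings from the export above. It is not how wp search-replace is implemented; the function and variable names are hypothetical and only illustrate why literal matching struggles with encoded URLs and lookalike domains.

```python
# Illustrative sketch only: a plain substring replacement, standing in for a
# naive search-and-replace pass. All names here are hypothetical.
OLD_DOMAIN = "yummy-🍲-recipes.org/vegan"
NEW_DOMAIN = "yummy-cooking-recipes.org"


def naive_replace(text: str) -> str:
    """Replace every literal occurrence of the old domain, with no parsing."""
    return text.replace(OLD_DOMAIN, NEW_DOMAIN)


# 1. An <img src> value that hides the domain behind punycode and HTML entities.
img_src = "&#104;ttps://xn--yummy--recipes-vb87&#x6d;.org/vegan/wp-content/uploads/cover.jpg"

# 2. The phishing warning, which names a different, longer domain.
warning = "extra-yummy-🍲-recipes.org/vegan/"

# 3. A plain-text mention of the site, the only easy case.
mention = "You are on the official yummy-🍲-recipes.org/vegan site!"

missed = naive_replace(img_src)     # unchanged: the literal never appears
corrupted = naive_replace(warning)  # the lookalike domain gets rewritten too
rewritten = naive_replace(mention)  # the one case that works

print(missed == img_src)
print(corrupted)
print(rewritten)
```

The first string is left untouched, the second is changed even though it should not be, and only the third comes out as intended.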

The WordPress Importer now solves this exact problem!

You import your content on the new site with a bit of disbelief. Can it really get it right? you wonder. But you try it, and after a brief moment the import finishes with all the URLs correctly updated:

<!-- wp:cover {"url":"https://yummy-cooking-recipes.org/wp-content/uploads/photo.jpg","align":"left","id":761} -->
<div
	class="wp-block-cover"
	style="
		background-image: url(&quot;https://yummy-cooking-recipes.org/wp-content%2Fuploads%2Fphoto.jpg&quot;);
	"
>
	<div class="wp-block-cover__inner-container yummy-🍲-recipes.org/vegan/-cover">
		<img
			src="https://yummy-cooking-recipes.org/wp-content/uploads/cover.jpg"
		/>

		<h1>Yummy Vegan Recipes!</h1>

		<p>You are on the official yummy-cooking-recipes.org/ site!</p>

		<p>
			Be careful – there is a phishing site you may mistake us for:
			extra-yummy-🍲-recipes.org/vegan/.

			Oh! And our email is: hello@yummy-🍲-recipes.org
		</p>
	</div>
</div>
<!-- /wp:cover -->

Isn’t that great?

Breaking down what the importer did

The WordPress Importer knows the difference between a URL that needs migrating and unrelated text that just happens to contain similar characters. Let’s take a closer look at the data migration we’ve just done.

These parts were migrated:

  • Domain encoded using punycode (xn--yummy--recipes-vb87m.org)
  • JSON with Unicode escapes (yummy-\uD83C\uDF72-recipes.org)
  • HTML attributes with entities (src="https://xn--yummy--recipes-vb87&#x6d;.org/vegan/wp-content/uploads/cover.jpg")
  • CSS with Unicode escapes encoded as an HTML attribute (style="background-image: url(https://yummy-\u1f372-recipes.org/vegan/wp-content%2Fuploads%2Fphoto.jpg);")
  • URLs using %-encoding mixed with HTML entities (&#x25;2Fuploads%2Fphoto.jpg)
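
As a rough illustration of two of these layers, the Python sketch below peels HTML entities and percent-encoding off strings copied from the export. It uses only the standard library and is not the importer’s code; here decoding serves only to show how the raw URL becomes recognizable.

```python
import html
from urllib.parse import unquote

# Fragment from the style attribute above: "%" itself is written as the
# HTML entity &#x25;, so "%2F" only appears after entity decoding.
fragment = "&#x25;2Fuploads%2Fphoto.jpg"

after_entities = html.unescape(fragment)  # HTML entity layer
after_percent = unquote(after_entities)   # URL percent-encoding layer

# The <img src> value from the export, with two characters hidden as entities.
img_src = "&#104;ttps://xn--yummy--recipes-vb87&#x6d;.org/vegan/wp-content/uploads/cover.jpg"
decoded_src = html.unescape(img_src)

print(after_entities)
print(after_percent)
print(decoded_src)
```

Only after the entity layer is removed does the punycode domain show up in full, which is why matching against the raw attribute text is not enough.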

These parts stayed exactly as they were:

  • The CSS class yummy-🍲-recipes.org/vegan/-cover. The class name coincides with the domain, but it’s still a unique identifier defined in a stylesheet. Changing it would affect how the site is displayed.
  • The email address hello@yummy-🍲-recipes.org. In this migration, only the website domain changes. Old emails continue to work.
  • The reference to extra-yummy-🍲-recipes.org. It’s a different domain. It would be modified by a simple string replacement, but the WordPress importer recognizes the difference and preserves the original domain.
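
One way to picture the lookalike-domain and email cases is to compare parsed hosts instead of substrings. The Python sketch below is an assumption about the general idea, not the importer’s logic; points_to_old_site is a hypothetical helper. It does not cover the CSS class case, which depends on where the text sits in the markup and not on the text itself.

```python
from urllib.parse import urlsplit

OLD_HOST = "yummy-🍲-recipes.org"


def points_to_old_site(candidate: str) -> bool:
    """True only if the candidate parses as a URL whose host is exactly OLD_HOST."""
    if "://" not in candidate:
        candidate = "https://" + candidate  # assumption: treat bare domains as https
    parts = urlsplit(candidate)
    if parts.username is not None:
        return False  # "hello@host" looks like an email address, not a page URL
    return parts.hostname == OLD_HOST


print(points_to_old_site("yummy-🍲-recipes.org/vegan"))
print(points_to_old_site("extra-yummy-🍲-recipes.org/vegan/"))
print(points_to_old_site("hello@yummy-🍲-recipes.org"))
```

Because the comparison is made on the whole host, extra-yummy-🍲-recipes.org is never mistaken for the old site, even though it contains the old domain as a substring.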

The WordPress importer parses each data format and encoding, respecting the syntactical nuances, and finds the raw URLs beneath all the layers. All of that happens during the import. The old URLs never make it to the database.
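
The idea of parsing a format before rewriting can be sketched for the block comment alone. The Python example below is a simplified assumption, not the importer’s code: it parses the JSON attributes of a wp:cover opener, rewrites the url attribute, and serializes it again. BLOCK_OPENER and rewrite_block_comment are hypothetical names, and real block delimiters have more variants than this pattern handles. The surrogate pair \uD83C\uDF72 is the JSON escape for 🍲 (U+1F372).

```python
import json
import re

OLD_BASE = "https://yummy-🍲-recipes.org/vegan"
NEW_BASE = "https://yummy-cooking-recipes.org"

# The block opener as it might appear in an export, with the emoji written
# as a JSON surrogate-pair escape.
block_comment = (
    '<!-- wp:cover {"url":"https://yummy-\\uD83C\\uDF72-recipes.org'
    '/vegan/wp-content/uploads/photo.jpg","align":"left","id":761} -->'
)

# Hypothetical, simplified pattern for a block opener with JSON attributes.
BLOCK_OPENER = re.compile(r"<!-- wp:(?P<name>[a-z][a-z0-9/-]*) (?P<attrs>\{.*\}) -->")


def rewrite_block_comment(comment: str) -> str:
    """Rewrite the "url" attribute of a block opener; return other input unchanged."""
    match = BLOCK_OPENER.fullmatch(comment)
    if match is None:
        return comment
    attrs = json.loads(match["attrs"])  # the JSON parser decodes the escape into 🍲
    url = attrs.get("url")
    if isinstance(url, str) and (url == OLD_BASE or url.startswith(OLD_BASE + "/")):
        attrs["url"] = NEW_BASE + url[len(OLD_BASE):]
    encoded = json.dumps(attrs, separators=(",", ":"), ensure_ascii=False)
    return f"<!-- wp:{match['name']} {encoded} -->"


print(rewrite_block_comment(block_comment))
```

Letting a JSON parser handle the escapes means the comparison runs on the decoded URL, so the same check works whether the emoji was written literally or as an escape.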

Under the hood, URL rewriting is powered by the new structured data parsers shipped in WordPress Core and in the WordPress/php-toolkit repository. BlockMarkupUrlProcessor is the orchestra director coordinating the effort of multiple format-specific parsers such as WP_HTML_Processor, CSSProcessor, CSSURLProcessor, URLInTextProcessor, _wp_scan_utf8, and others.

Because the imported data doesn’t require post-processing, there’s no need to run the traditional UPDATE wp_posts SET post_content = REPLACE(post_content, old_url, new_url) queries after the import. This is a big deal. Those queries might be just a minor inconvenience on a small site, but on larger sites, they could take days and lock the most important tables.

If you are interested in even more technical context, see the original Pull Request and the various resources linked in the description.

Try It Out

URL rewriting is available in WordPress Importer 0.9.5 out of the box. You only need to check the checkbox before starting the import.

WP-CLI has an open Pull Request to support this feature.

If you’d like to try it out now, here’s a WordPress Playground demo that imports a content page similar to the example used in this post. You can inspect the imported markup and also go to wp-admin and try importing your own file.

Please share your feedback – it matters a lot! You can share your experience with WordPress Importer in the comments under this post. For any issues and feature requests, feel free to open an issue in the WordPress/wordpress-importer repository.

What’s next?

The WordPress Importer improvement roadmap lists several more upcoming features to improve site migrations, such as support for importing large files, concurrent media downloads, and direct WordPress-to-WordPress site synchronization. You can follow along and share your thoughts in the roadmap issue.

Props to @dmsnell for the major effort he put into the structured data parsers and all his guidance and feedback. Props to @zaerl for his help with reviewing WordPress-importer PRs. Props to @bph for the feedback that helped greatly improve the URL rewriting experience and also for reviewing this post.