Exporting content from Drupal 7 to Markdown

Recently, we decided to rebuild the 7Sabores website in GatsbyJS, a front-end framework and static site generator based on React. Once that decision was made, the next step was to define how to move the content stored in the database of the original Drupal 7 site. While it is possible to use Drupal as a content source in Gatsby, we chose not to use a CMS and instead exported all the Drupal content to Markdown.

With the help of two plugins, Gatsby can read Markdown files from the file system and generate pages and listings from them. For this reason, Markdown was the best format for exporting the Drupal content, and it reduced the complexity of the new site by removing the need for a CMS or database altogether.

Exporting the content

The solution used to export the Drupal content consisted of two PHP scripts, one for node entities and another for user entities. These scripts were executed from the command line and created one Markdown file for each node and user. Files were stored in different folders depending on the source content type to help organize the content in Gatsby.
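As a minimal sketch of the "one folder per content type" layout, the helper below builds the destination path of a Markdown file. The function name export_build_file_path(), the base directory and the naming scheme (last segment of the path alias) are assumptions made for illustration, not taken from the original scripts.

```php
<?php

/**
 * Builds the destination path of the Markdown file for an exported item.
 *
 * Hypothetical helper: the directory layout and naming scheme are assumptions.
 */
function export_build_file_path($base_dir, $type, $alias) {
  // Use the last segment of the path alias as the file name,
  // e.g. "blog/my-post" becomes "my-post".
  $name = basename(trim($alias, '/'));

  // One folder per content type: <base_dir>/<type>/<name>.md
  return rtrim($base_dir, '/') . '/' . $type . '/' . $name . '.md';
}
```

The target folder has to exist before writing, for example by calling mkdir(dirname($file), 0755, TRUE) when is_dir() reports that it is missing.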

To access the data stored in the Drupal site’s database and to take advantage of all Drupal functions and APIs, we performed a full bootstrap of Drupal by calling drupal_bootstrap() at the beginning of the export scripts. Because DRUPAL_ROOT is taken from getcwd() in the snippet below, the scripts have to be run from the root directory of the Drupal site.

define('DRUPAL_ROOT', getcwd());

require_once DRUPAL_ROOT . '/includes/bootstrap.inc';

drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

This gives us easy access to all the data we need from Drupal: we build dynamic queries with the db_select() function and then load all the nodes with node_load_multiple().

As an example, for Drupal nodes we did the following:

// Select the IDs of all published nodes of the requested type,
// oldest first. fetchCol() returns the first column, the nid.
$nids = db_select('node', 'n')
  ->fields('n', array('nid'))
  ->condition('n.status', 1)
  ->condition('n.type', $content_type)
  ->orderBy('n.created', 'ASC')
  ->execute()
  ->fetchCol();

$nodes = node_load_multiple($nids);

The PHP script that exports content requires the user to specify the type of content to export as a parameter. This allows us to export only the nodes of a particular type, for example “blog” or “page”. The parameter is assigned to the variable $content_type and used in the condition of the dynamic query. In the previous example, db_select() selects all published nodes of that type and returns their IDs, and node_load_multiple() then loads from the database the entities that match those IDs.
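As a rough sketch of how that parameter might be read from the command line, the helper below takes the script arguments and returns the content type. The function name export_get_content_type(), the script name in the usage message and the validation rule are hypothetical and not taken from the original scripts.

```php
<?php

/**
 * Returns the content type passed as the first command-line argument.
 *
 * Hypothetical helper: the name and the error messages are illustrative.
 */
function export_get_content_type(array $argv) {
  // $argv[0] is the script name, so the content type is expected at $argv[1].
  if (empty($argv[1])) {
    throw new InvalidArgumentException('Usage: php export-nodes.php <content-type>');
  }

  // Drupal machine names only contain lowercase letters, numbers and
  // underscores, so anything else is rejected early.
  if (!preg_match('/^[a-z0-9_]+$/', $argv[1])) {
    throw new InvalidArgumentException('Invalid content type: ' . $argv[1]);
  }

  return $argv[1];
}
```

In the export script this could be used as $content_type = export_get_content_type($argv); right after the bootstrap.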

Once we have the entities that represent the content, the next step is to convert the body of each one to Markdown and save it to a file. In this case, we chose the HTML To Markdown for PHP library to do the conversion.

The library is available as a Composer package, so to use it we must first require the package with the following command.

composer require league/html-to-markdown

And then add the following require statement to the top of the PHP script to autoload the dependencies.

require 'vendor/autoload.php';

Finally, we create a new converter and use it to transform the HTML content of the nodes.

$converter = new League\HTMLToMarkdown\HtmlConverter();

// Strip HTML tags that have no Markdown equivalent.
$converter->getConfig()->setOption('strip_tags', true);
$markdown = $converter->convert($clean_html);

Problems

Unfortunately, things don’t always go as planned, and there were some complications when we converted the content of our nodes to Markdown.

In our case, after running the conversion and reviewing the exported content, we found problems in the Markdown files of many nodes that had code blocks in <pre> tags. We discovered that the root cause was not the tag itself but the class attribute it carried.

The solution to avoid these problems was to remove the attributes before converting from HTML to Markdown with the help of PHP’s DOMDocument and DOMXPath classes.

// Create a DOMDocument object.
$dom = new DOMDocument();

// Load body content and fix encoding issues.
$dom->loadHTML(mb_convert_encoding($node->body['und'][0]['value'], 'HTML-ENTITIES', 'UTF-8'));

Once the HTML content of our node’s body has been parsed by DOMDocument, we can start manipulating it with the help of DOMXPath.

// Create a DOMXPath object.
$xpath = new DOMXPath($dom);
// Search all <pre> elements and remove the class attribute.
foreach ($xpath->evaluate("//pre") as $code_block) {
  $code_block->removeAttribute('class');
}

The last of our problems was related to the image style tokens that Drupal adds to the images embedded in the content. Since version 7.20, Drupal adds a token in the form of an “itok” parameter to the query string of the URLs of all derived images to mitigate denial-of-service attacks. Since this “itok” parameter should not be in the Markdown files, we used XPath again to manipulate the HTML and remove the parameter before converting to Markdown.

// Find all <img> elements.
foreach ($xpath->evaluate("//img") as $image) {

  // Get the src attribute.
  $src = $image->getAttribute('src');

  // Split the URL into its different components.
  $parsed = parse_url($src);

  // Skip malformed URLs and images without a query string.
  if ($parsed === FALSE || empty($parsed['query'])) {
    continue;
  }

  // Get the parameters of the query string.
  parse_str($parsed['query'], $params);

  // Remove the "itok" parameter.
  unset($params['itok']);

  // Rebuild the query string, dropping it entirely if nothing is left.
  if ($params) {
    $parsed['query'] = http_build_query($params);
  }
  else {
    unset($parsed['query']);
  }

  // Rebuild the URL. unparse_url() is a custom helper, not a PHP built-in.
  $new_src = unparse_url($parsed);

  // Replace the URL of the image.
  $image->setAttribute('src', $new_src);

}

In this example we use an unparse_url() function to reassemble the URL as a string from its components. Unfortunately, PHP has no built-in function that reverses the effect of parse_url(), but there are many examples of unparse_url() functions that you can add to your scripts.
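For reference, here is one possible unparse_url() implementation. It is a sketch and not necessarily the one used in the original scripts. It concatenates the components that parse_url() returns and omits the "?" when the query string is empty. Note that it does not restore the leading "//" of scheme-relative URLs.

```php
<?php

/**
 * Rebuilds a URL from the array returned by parse_url().
 *
 * Not a PHP built-in: one possible implementation among many.
 */
function unparse_url(array $parts) {
  $scheme   = isset($parts['scheme']) ? $parts['scheme'] . '://' : '';
  $user     = isset($parts['user']) ? $parts['user'] : '';
  $pass     = isset($parts['pass']) ? ':' . $parts['pass'] : '';
  $auth     = ($user !== '' || $pass !== '') ? $user . $pass . '@' : '';
  $host     = isset($parts['host']) ? $parts['host'] : '';
  $port     = isset($parts['port']) ? ':' . $parts['port'] : '';
  $path     = isset($parts['path']) ? $parts['path'] : '';
  // Only add the "?" separator when there is something after it.
  $query    = (isset($parts['query']) && $parts['query'] !== '') ? '?' . $parts['query'] : '';
  $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

  return $scheme . $auth . $host . $port . $path . $query . $fragment;
}
```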

Finally, we save all the modifications made to the HTML in a string, and this is the string that we will use for the conversion from HTML to Markdown.

$clean_html = $dom->saveHTML();

Frontmatter

Up to this point, we have the body of our entities in Markdown format, but the body is not the only content we want to migrate from a node. There is more data we want to keep, for example the author, the creation date, the path, and any additional fields created for that content type.

Luckily, Gatsby gives us a way to represent all this additional information in a Markdown file. With the help of Frontmatter we can attach additional data to our content and use it in the GraphQL data layer.

In a book, the front matter is everything that comes at the beginning of the book before its main content, for example the title page, the copyright page and the table of contents. In a Markdown file, the Frontmatter is a block of key-value pairs in YAML format at the beginning of the document, before the Markdown content.

For Gatsby to recognize the Frontmatter, it must be the first thing in the file, it must be valid YAML, and it must start and end with a line containing three hyphens (---).

Below you can see an example of the Frontmatter of one of the blog posts.

---
title: 'Como utilizar tablas personalizadas en Views con Drupal 7'
date: 2014-05-09
author: enzo
path: blog/utilizar-tablas-personalizadas-views-drupal-7
nid: 489
topics:
  - Drupal
  - 'Desarrollo de Modulos'
  - Vistas
cover: ../assets/custom_field_formatter_1.png
summary: "Como ya hemos visto en las entrada de blog Como crear tablas para nuestros módulos personalizados en Drupal 7 y Como agregar una nueva tabla a modulo ya activado en Drupal 7 es posible la creación de un modelo de datos personalizado en la base de datos que guarde cierta relación con el modelo de datos de Drupal."
---

# Markdown Content ...

As you can see in the previous example, the Frontmatter lets us attach a large amount of additional information to the Markdown content. The values can be of different data types: in this example you can see strings, dates, numbers, lists and even paths to images. All the metadata contained in the Frontmatter of our Markdown files will be available in Gatsby's GraphQL data layer.
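To illustrate how such a block could be generated from PHP, here is a minimal sketch of a Frontmatter builder. The function names export_yaml_scalar() and export_build_frontmatter() are hypothetical, and the original scripts may well have used a YAML library instead. The sketch only handles scalars and flat lists, and it quotes every string (including dates) by relying on the fact that a JSON string is also a valid double-quoted YAML string.

```php
<?php

/**
 * Formats a single value as a YAML scalar.
 */
function export_yaml_scalar($value) {
  if (is_int($value) || is_float($value)) {
    return (string) $value;
  }
  if (is_bool($value)) {
    return $value ? 'true' : 'false';
  }
  // json_encode() takes care of escaping quotes, backslashes and line breaks.
  return json_encode((string) $value, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

/**
 * Builds a Frontmatter block from a flat array of metadata.
 *
 * Values may be scalars or lists of scalars; nested arrays are not handled.
 */
function export_build_frontmatter(array $metadata) {
  $lines = array('---');
  foreach ($metadata as $key => $value) {
    if (is_array($value) && !$value) {
      // An empty list is written inline.
      $lines[] = $key . ': []';
    }
    elseif (is_array($value)) {
      $lines[] = $key . ':';
      foreach ($value as $item) {
        $lines[] = '  - ' . export_yaml_scalar($item);
      }
    }
    else {
      $lines[] = $key . ': ' . export_yaml_scalar($value);
    }
  }
  $lines[] = '---';

  return implode("\n", $lines) . "\n";
}
```

The exported file would then be the Frontmatter followed by a blank line and the converted Markdown body, written for example with file_put_contents().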

I hope you found this interesting and that it helps you if you ever have to export data from Drupal to Markdown.