Structured Data with schema.org
Monday, May 13, 2013
Good website content is important so people can learn and interact, but it is robots that interpret that content to figure out whether it is actually useful to people. With the new website I wanted to be sure I was using standards and metadata so the content could be programmatically useful. I chose the markup from schema.org for its fairly comprehensive data types and broad adoption by search engines.
I think the importance of structured data is growing. Not only does it make it easier for search engines to consistently interpret content, it can also encourage properly designed website architecture. For example, if I want search engines to know the brand of a product, it probably means I should ensure the product page links to the main brand page. A byproduct is that a regular user can then click back to the main brand listings as well.
One of the most difficult things about embedding structured data is verifying that the markup looks how I expect. There
are tools on both Google and Bing for testing structured data, but they really work best for
publicly accessible pages (not development-local content). I found a few other tools, but they were either limited in their features or had some inconvenient bugs in how they represented data.
Ultimately, I wanted to see the website from a robot's perspective and make sure I could traverse it as one. To help with that, I created a tool which parses arbitrary local pages into JSON data based on my understanding of how robots interpret them. For example, I could view the home page as raw JSON, or I could pretend I was a robot and browse it as a formatted HTML page where links are rewritten for follow-up.
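The tool itself isn't shown here, but the core idea — walking the DOM and lifting itemscope/itemprop attributes into JSON — can be sketched with only the Python standard library. The parser below is a simplification, not the actual tool: it expects well-formed XHTML, ignores several microdata rules, and uses a made-up page fragment.

```python
import json
import xml.etree.ElementTree as ET

def extract_item(el):
    """Turn an itemscope element into {"type": ..., "properties": ...}."""
    props = {}
    collect_props(el, props)
    return {"type": el.get("itemtype"), "properties": props}

def collect_props(el, props):
    # Gather itemprop values from descendants, stopping at nested
    # itemscopes, which become child items instead of plain text.
    for child in el:
        name = child.get("itemprop")
        if name is not None:
            if child.get("itemscope") is not None:
                value = extract_item(child)
            elif child.tag == "a":
                value = child.get("href")
            elif child.tag == "img":
                value = child.get("src")
            else:
                value = "".join(child.itertext()).strip()
            props.setdefault(name, []).append(value)
        if child.get("itemscope") is None:
            collect_props(child, props)

def extract_microdata(xhtml):
    # Expects well-formed XHTML; a real crawler would use a
    # forgiving HTML parser instead.
    items = []
    def walk(el):
        if el.get("itemscope") is not None:
            items.append(extract_item(el))  # top-level item
        else:
            for c in el:
                walk(c)
    walk(ET.fromstring(xhtml))
    return items

# Hypothetical page fragment, not the site's actual markup.
page = """<body>
<div itemscope="" itemtype="http://schema.org/WebPage">
  <h1 itemprop="name">Loopy Groupies</h1>
  <img itemprop="image" src="/images/loopy-groupies.jpg" />
</div>
</body>"""
print(json.dumps(extract_microdata(page), indent=2))
```

Every property maps to a list of values because microdata allows an itemprop to repeat within one item.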
Even basic pages can provide some useful structured data. For example, the page describing the Loopy Groupies
doesn’t have complicated content, but it still uses the basic WebPage type to identify breadcrumbs, titles, main
content, and a significant image on the page. Because it integrates the main site template, it also identifies the header and footer as WPHeader and WPFooter.
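In outline, that pattern looks something like the sketch below — the breadcrumb labels, file paths, and property placement are illustrative, not the page's actual markup:

```html
<body itemscope itemtype="http://schema.org/WebPage">
  <header itemscope itemtype="http://schema.org/WPHeader">...</header>
  <nav itemprop="breadcrumb">
    <a href="/">Home</a> &gt; <a href="/yarns/">Yarns</a> &gt; Loopy Groupies
  </nav>
  <h1 itemprop="name">Loopy Groupies</h1>
  <div itemprop="mainContentOfPage">...</div>
  <img itemprop="primaryImageOfPage" src="/images/loopy-groupies.jpg">
  <footer itemscope itemtype="http://schema.org/WPFooter">...</footer>
</body>
```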
Of course it’s not limited to schema.org data types. The robot data also includes detailed breadcrumb data in the
raw JSON structure.
One of the most useful types in an e-commerce environment is SomeProducts. It lets robots see things like
pricing, inventory, availability, company, model, and various product attributes. For example, here’s what our
Slate Blue product currently looks like to robots:
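The shape of that markup follows the SomeProducts pattern, with pricing and availability nested in an Offer. The sketch below uses made-up prices, model numbers, and URLs rather than the real product data:

```html
<div itemscope itemtype="http://schema.org/SomeProducts">
  <span itemprop="name">Slate Blue</span>
  <link itemprop="brand" href="/brands/example-brand/">
  <span itemprop="model">SB-100</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">4.50</span>
    <meta itemprop="priceCurrency" content="USD">
    <link itemprop="availability" href="http://schema.org/InStock">
  </div>
</div>
```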
With the markup on the page, it’s now possible for search engines to quickly show information such as pricing and
availability alongside results for the product. Given sufficient parsing, a search engine can also infer the relationships that a specific page (marked up as a product) has with other product concepts, building a more intelligent data model.
For the main product types, pages also support listings that reference the individual products. The main
Solid Series listing has the following data:
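A listing page can be sketched as a container whose product tiles are each an itemscope pointing back at the product's own page — the product names and URLs below are illustrative, not the actual listing data:

```html
<div itemscope itemtype="http://schema.org/CollectionPage">
  <h1 itemprop="name">Solid Series</h1>
  <div itemscope itemtype="http://schema.org/SomeProducts">
    <a itemprop="url" href="/products/slate-blue/">Slate Blue</a>
  </div>
  <div itemscope itemtype="http://schema.org/SomeProducts">
    <a itemprop="url" href="/products/example-color/">Another Color</a>
  </div>
</div>
```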
Nearly all pages on the new website have at least some structured data present, if only the breadcrumb data. All
this markup isn’t simply an academic exercise though. For example, Ravelry supports checking the pricing and
inventory of our product ads and displaying them to users. Instead of complex, fragile regular expressions or DOM
traversal, we can simply supply an XPath query like *[@itemscope and @itemtype = "http://schema.org/SomeProducts"].
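That kind of query can be sketched with the standard library alone. ElementTree's XPath subset has no and operator, so matching on the itemtype attribute alone stands in for the fuller query above, and the product fragment is a placeholder rather than real page data:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product fragment; a real consumer
# would fetch the live page and use a forgiving HTML parser.
page = ET.fromstring("""<body>
<div itemscope="" itemtype="http://schema.org/SomeProducts">
  <span itemprop="name">Slate Blue</span>
  <span itemprop="price">4.50</span>
</div>
</body>""")

# ElementTree's limited XPath: select by the itemtype attribute.
products = page.findall(".//*[@itemtype='http://schema.org/SomeProducts']")
for p in products:
    name = p.find(".//*[@itemprop='name']")
    price = p.find(".//*[@itemprop='price']")
    print(name.text, price.text)  # -> Slate Blue 4.50
```

Because the selector keys on the schema.org type rather than on class names or DOM position, it keeps working even when the page's layout changes.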
One of the original motivations behind focusing on structured data was the goal of having an internal search for the
site. Instead of writing web page scrapers that know what the DOM looks like and how to find significant content, it has
been much easier to rely on simple schema.org types which are consistent across all pages. The structured data on
pages is still a work in progress as I learn more about what robots are interested in and figure out the best way to