The Basics of a Custom Search Engine

June 1, 2013

One of the most useful features of a website is the ability to search. The Loopy Ewe has had some form of faceted product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog posts and the like. Google seems to lead in offering custom search products with both Custom Search Engine and Site Search, but they're either branded or cost a bit of money. Instead of investing in their proprietary products, I wanted to try to create a simple search engine for our needs which took advantage of my previous work in implementing existing open standards.

Introduction

In my mind, there are four basic processes when creating a search engine:

Discovery - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up a sitemap for the site. Internally, the feature bundles of the site are responsible for generating their own sitemaps (e.g. blog posts, regular content pages, photo galleries, products, product groups) and sitemap.xml just advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links.

Parsing - understanding the documents to know what content is significant. Given my previous work of implementing structured data on the site and creating internal tools for reviewing the results, parsing becomes a very simple task.
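
As a rough illustration of the discovery step, here is a minimal Python sketch (not the site's actual code) that walks a sitemap index and collects page links. The `fetch` callable and the function names are hypothetical; the only thing taken from the sitemaps.org protocol is the XML namespace and the `sitemapindex`/`loc` element names.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text):
    """Return (kind, urls), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs


def discover(entry_url, fetch):
    """Walk a sitemap (or sitemap index) and return every page URL found.

    `fetch` is a hypothetical callable: it takes a URL and returns XML text.
    """
    pending, seen, pages = [entry_url], set(), []
    while pending:
        url = pending.pop()
        if url in seen:
            continue  # guard against sitemaps that reference each other
        seen.add(url)
        kind, locs = extract_locs(fetch(url))
        if kind == "index":
            pending.extend(locs)  # child sitemaps still need to be read
        else:
            pages.extend(locs)    # actual documents to consider indexing
    return pages
```

Passing `fetch` in keeps the traversal separate from HTTP, so the same function can be exercised against canned XML.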

The next two processes are what I want to focus on here:

Indexing - ensuring the documents are accessible via search queries.
Maintenance - keeping the index current when documents are changed or removed.

Indexing

We were already using elasticsearch, so I was hoping to use it for full-text searching as well. I decided to maintain two types in the search index.

Discovered Documents (`resource`)

The resource type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it directly, it's more of a basic key-value store keyed on the URL. The mapping looks something like:

{ "_id" : {
      "type" : "string" },
  "url" : {
      "type" : "string",
      "index" : "no" },
  "response_status" : {
      "type" : "string",
      "index" : "no" },
  "response_headers" : {
      "properties" : {
          "key" : {
              "type" : "string",
              "index" : "no" },
          "value" : {
              "type" : "string",
              "index" : "no" } } },
  "response_content" : {
      "type" : "string",
      "index" : "no" },
  "date_retrieved" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_expires" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" } }

The _id is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it creates a new record and queues a task to download the document. The initial record looks like:

{ "_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
  "url" : "/local/fling/spring-fling-2013/",
  "date_expires" : "2001-01-01 00:00:00" }

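To make the record creation concrete, here is a small Python sketch under a few assumptions: the hash is taken to be an MD5 hex digest (the post only says "a hash", though the 32-character example is consistent with that), and `index` and `queue` are hypothetical in-memory stand-ins for the resource type and the task queue.

```python
import hashlib

# An already-expired date, as in the initial record above, so the
# document is treated as stale until it has been downloaded once.
NEVER_FETCHED = "2001-01-01 00:00:00"


def url_id(url):
    # Assumption: MD5 hex digest; the post does not name the hash function.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def initial_resource(url):
    """Build the placeholder record for a newly discovered URL."""
    return {"_id": url_id(url), "url": url, "date_expires": NEVER_FETCHED}


def register(url, index, queue):
    """Create the record and queue a download only if the URL is new.

    `index` (dict) and `queue` (list) are hypothetical stand-ins.
    """
    doc_id = url_id(url)
    if doc_id in index:
        return False
    index[doc_id] = initial_resource(url)
    queue.append(url)
    return True
```
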
Then the download task is responsible for:

Receiving a URL to download;
Finding the current resource record;
Validating it against robots.txt;
Sending a new request for the URL (respecting ETag and Last-Modified headers);
Updating the resource record with the response and new date_* values;
And, if the document has changed, queueing a task to parse the resource.
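
A few of the steps above can be sketched in Python. This is only an illustration, not the actual task: it assumes `response_headers` is stored as a list of key/value pairs (as the mapping suggests), that a 304 response means "unchanged", and the function names are made up for the example. The robots.txt check uses the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def conditional_headers(record):
    """Build If-None-Match / If-Modified-Since from the cached headers."""
    cached = {h["key"].lower(): h["value"]
              for h in record.get("response_headers", [])}
    headers = {}
    if "etag" in cached:
        headers["If-None-Match"] = cached["etag"]
    if "last-modified" in cached:
        headers["If-Modified-Since"] = cached["last-modified"]
    return headers


def needs_parse(status, record, new_content):
    """Decide whether a parse task should be queued after a download.

    A 304 means the server confirmed nothing changed; for a 200 the body
    is compared with the cached copy. Other statuses never queue a parse.
    """
    if status == 304:
        return False
    return status == 200 and new_content != record.get("response_content")
```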

By default, if an Expires response header isn't provided, I set the date_expires field to several days in the future. The field is used to find stale documents later on.
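
The expiry rule might look something like the following Python sketch. The three-day default is an assumption (the post only says "several days"), and `date_expires` here is a hypothetical helper that formats its result to match the `yyyy-MM-dd HH:mm:ss` format in the mapping.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # mirrors "yyyy-MM-dd HH:mm:ss"
DEFAULT_TTL = timedelta(days=3)    # assumption: "several days"


def date_expires(headers, now):
    """Return the date_expires value for a response fetched at `now` (UTC)."""
    expires = headers.get("Expires")
    if expires:
        try:
            parsed = parsedate_to_datetime(expires)
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=timezone.utc)
            return parsed.astimezone(timezone.utc).strftime(DATE_FORMAT)
        except (TypeError, ValueError):
            pass  # malformed header (e.g. "0"): fall back to the default
    return (now + DEFAULT_TTL).strftime(DATE_FORMAT)
```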

Parsed Documents (`result`)

The result type has all our indexed URLs which were parsed and found to be useful. The documents contain some structured fields which are generated by the parsing step. The mapping looks like:

{ "_id": {
      "type": "string" },
  "url": {
      "type": "string",
      "index": "no" },
  "itemtype": {
      "type": "string",
      "analyzer": "keyword" },
  "image": {
      "type": "string",
      "index": "no" },
  "title": {
      "boost": 5.0,
      "type": "string",
      "include_in_all": true,
      "position_offset_gap": 64,
      "index_analyzer": "snowballed",
      "search_analyzer": "snowballed_searcher" },
  "keywords": {
      "boost": 6.0,
      "type": "string",
      "include_in_all": true,
      "index_analyzer": "snowballed",
      "search_analyzer": "snowballed_searcher" },
  "description": {
      "boost": 3.0,
      "type": "string",
      "analyzer": "standard" },
  "crumbs": {
      "boost": 0.5,
      "properties": {
          "url": {
              "type": "string",
              "index": "no" },
          "title": {
              "type": "string",
              "include_in_all": true,
              "analyzer": "standard" } } },
  "content": {
      "type": "string",
      "include_in_all": true,
      "position_offset_gap": 128,
      "analyzer": "standard" },
  "facts": {
      "type": "object",
      "enabled": false,
      "index": "no" },
  "date_parsed" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_published" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" } }

A few notes on the specific fields:

itemtype - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
image - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
title - usually based on the title tag or more-concise og:title data
keywords - usually based on the keywords meta tag (the field is boosted because they're specifically targeted phrases)
description - usually the description meta tag
content - any remaining useful content on the page that somebody might search for
facts - arbitrary data used for rendering more helpful search results; some common keys:
- collection - indicates there are multiple of something (e.g. product quantities, styles of a product)
- product_model - indicates a product model name for the result
- brand - indicates the brand name for the result
- price, priceMin, priceMax - indicate the price(s) of a result
- availability - for a product this is usually "in stock" or "out of stock"
date_published - for content such as blog posts or announcements

The result type is updated by the parse task which is responsible for:

Receiving a URL to parse;
Finding the current resource record;
Running the response_content through the appropriate structured data parser;
Extracting generic data (e.g. title, keywords);
Extracting itemtype-specific metadata, usually for facts;
And updating the result record.
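
To illustrate just the "generic data" step, here is a minimal Python sketch using the standard library's `html.parser`. It is not the structured data parser the site uses; it only shows how title, keywords and description could be pulled from the tags mentioned in the field notes, preferring og:title over the title tag. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <title> text and <meta name|property=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("content"):
            name = attrs.get("name") or attrs.get("property")
            if name:
                self.meta[name.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_generic(html):
    """Return the generic fields of a result record from raw HTML."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    keywords = [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()]
    return {
        # Prefer the more concise og:title, falling back to the title tag.
        "title": meta.get("og:title") or "".join(parser.title_parts).strip(),
        "keywords": keywords,
        "description": meta.get("description", ""),
    }
```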

For example, this parsed product model looks like:

{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
  "itemtype" : "ProductModel",
  "title" : "Madelinetosh Tosh DK",
  "keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
  "image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
  "crumbs": [
      {
          "url" : "/shop/",
          "title" : "Shop" },
      {
          "url" : "/shop/g/yarn/",
          "title" : "Yarn" },
      {
          "url" : "/shop/g/yarn/madelinetosh/",
          "title" : "Madelinetosh" } ],
  "content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
  "facts" : {
      "collection": [
          {
              "value" : 93,
              "label" : "products" } ],
      "brand" : "Madelinetosh",
      "price" : "22.00" },
  "_boost" : 4 }

Searching

Once some documents are indexed, I can create simple searches with the ruflin/Elastica library:

$bool = (new \Elastica\Query\Bool())
    ->addMust(
        (new \Elastica\Query\Bool())
            ->setParam('minimum_number_should_match', 1)
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'keywords')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'title')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'content')
                    /* ...snip... */ ) );

/* ...snip... */

$query = new \Elastica\Query($bool);
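
For readers who don't use Elastica, the query above corresponds roughly to a request body like the one built by this Python sketch. This is an approximation under assumptions: the snipped options are not reproduced, the `query` key is added only because `query_string` requires one, and `build_search_body` is a hypothetical helper. The `minimum_number_should_match` key is kept exactly as it appears in the PHP.

```python
import json


def build_search_body(terms, fields=("keywords", "title", "content")):
    """Build a bool query requiring a match in at least one of `fields`."""
    should = [
        {"query_string": {"default_field": field, "query": terms}}
        for field in fields
    ]
    return {
        "query": {
            "bool": {
                "must": [
                    {"bool": {
                        "should": should,
                        "minimum_number_should_match": 1,
                    }}
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("tosh dk"), indent=2))
```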

To draw attention to specific matches in the title and content fields, I can enable highlighting:

$query->setHighlight(
    array(
        'pre_tags' => array('<strong>'),
        'post_tags' => array('</strong>'),
        'fields' => array(
            'title' => array(
                'fragment_size' => 256,
                'number_of_fragments' => 1 ),
            'content' => array(
                'fragment_size' => 64,
                'number_of_fragments' => 3 ) ) ) );

Maintenance

A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I take two approaches:

Time-based updates - one of the reasons for the indexed date_expires field of the resource type is so a process can go through and identify documents which have not been updated recently. If it sees something is stale, it goes ahead and queues it for update.

Real-time updates - sometimes things (like product availability) change frequently, impacting the quality of search results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when they see things like inventory changes or product changes in an order.

In either case, when a URL is discovered to be gone, the records from both resource and result are removed for the URL.
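
The time-based sweep and the removal rule can be sketched in Python as follows. The in-memory dicts are hypothetical stand-ins for the resource and result types, and treating HTTP 404 and 410 as "gone" is an assumption; the post doesn't say which responses count.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # mirrors "yyyy-MM-dd HH:mm:ss"
GONE_STATUSES = (404, 410)         # assumption: what "gone" means here


def stale_urls(resources, now):
    """Return the URLs whose date_expires is at or before `now`."""
    return [
        r["url"] for r in resources
        if datetime.strptime(r["date_expires"], DATE_FORMAT) <= now
    ]


def remove_if_gone(doc_id, status, resource_index, result_index):
    """Drop both records when the server reports the URL as gone."""
    if status not in GONE_STATUSES:
        return False
    resource_index.pop(doc_id, None)
    result_index.pop(doc_id, None)  # may not exist if it was never parsed
    return True
```

Because newly discovered records start with a date_expires in 2001, the same sweep also picks up documents that have never been downloaded.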

Utilities

Sometimes there are deploys where specific pages are definitely changing, or when a whole new sitemap is getting registered with new URLs. Instead of waiting for the time-based updates or cron jobs to run, I have these commands available for scripting:

search:index-rebuild - re-read the sitemaps and assert the links in the resource index
search:index-update - find all the expired resources and queue them for update
search:result-rerun - force the download and parsing of a URL
search:sitemap-generate - regenerate all registered sitemaps

Conclusion

Starting with structured data and elasticsearch makes building a search engine significantly easier. Data and indexing make it faster to show smarter search results. Existing standards like OpenSearch make it easy to extend the search from a web page into the browser and even third-party applications via Atom and RSS feeds. Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and ranking algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website from the perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.