[Danny Berger](https://dpb587.me/ "Home")

# The Basics of a Custom Search Engine

June 1, 2013

One of the most useful features of a website is the ability to search. [The Loopy Ewe](http://www.theloopyewe.com/) has had some form of faceted product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog posts and the like. [Google](http://www.google.com/) seems to lead in offering custom search products with both [Custom Search Engine](http://www.google.com/cse/all) and [Site Search](http://www.google.com/enterprise/search/products_gss_pricing.html), but they're either branded or cost a bit of money. Instead of investing in their proprietary products, I wanted to create a simple search engine for our needs, taking advantage of my previous work implementing existing open standards.

# Introduction

In my mind, there are four basic processes when creating a search engine:

**Discovery** - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up a [sitemap](http://www.sitemaps.org/) for the site. Internally, the feature bundles of the site are responsible for generating their own sitemaps (e.g. blog posts, regular content pages, photo galleries, products, product groups) and [`sitemap.xml`](http://www.theloopyewe.com/sitemap.xml) just advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links.

**Parsing** - understanding the documents to know what content is significant. Given my previous work of [implementing structured data](https://dpb587.me/blog/2013/05/13/structured-data-with-schema-org.html) on the site and creating internal tools for reviewing the results, parsing becomes a very simple task.

The next two processes are more what I want to focus on here:

- **Indexing** - ensuring the documents are accessible via search queries.
- **Maintenance** - keeping the index current as documents change or are removed.
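The discovery step can be sketched roughly like this. Python is used here for brevity (the site itself is not Python), and `extract_urls` is a hypothetical helper, not actual site code; it simply pulls the `<loc>` values out of a standard sitemap document:

```python
import xml.etree.ElementTree as ET

# Sitemaps share one XML namespace, defined at sitemaps.org.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml):
    """Return the <loc> values from a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# A trimmed, invented example of what one feature bundle's sitemap might contain.
example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.theloopyewe.com/local/fling/spring-fling-2013/</loc></url>
  <url><loc>https://www.theloopyewe.com/shop/g/yarn/madelinetosh/tosh-dk/</loc></url>
</urlset>"""

print(extract_urls(example))
```

The same `iter` call works for a sitemap index, since its `<sitemap>` entries also wrap their targets in `<loc>` tags.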

# Indexing

We were already using [elasticsearch](http://www.elasticsearch.org/), so I was hoping to use it for full-text searching as well. I decided to maintain two types in the search index.

## Discovered Documents (`resource`)

The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it directly, it's more of a basic key-based storage based on the URL. The mapping looks something like:

```
{ "_id" : {
      "type" : "string" },
  "url" : {
      "type" : "string",
      "index" : "no" },
  "response_status" : {
      "type" : "string",
      "index" : "no" },
  "response_headers" : {
      "properties" : {
          "key" : {
              "type" : "string",
              "index" : "no" },
          "value" : {
              "type" : "string",
              "index" : "no" } } },
  "response_content" : {
      "type" : "string",
      "index" : "no" },
  "date_retrieved" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_expires" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" } }
```

The `_id` is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it creates a new record and queues a task to download the document. The initial record looks like:

```
{ "_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
  "url" : "/local/fling/spring-fling-2013/",
  "date_expires" : "2001-01-01 00:00:00" }
```
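The post doesn't name the hash function, but a 32-character hex id like the one above is consistent with MD5; a sketch of deriving the id might look like this (`resource_id` is a hypothetical helper, and hashing the path rather than the absolute URL is an assumption):

```python
import hashlib

def resource_id(url):
    # Hypothetical: derive a stable, compact document id from the URL so
    # the same URL always maps to the same elasticsearch record.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

print(resource_id("/local/fling/spring-fling-2013/"))
```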

Then the download task is responsible for:

1. Receiving a URL to download;
1. Finding the current `resource` record;
1. Validating it against `robots.txt`;
1. Sending a new request for the URL (respecting `ETag` and `Last-Modified` headers);
1. Updating the `resource` record with the response and new `date_*` values;
1. And, if the document has changed, queueing a task to parse the `resource`.

By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the future. The field is used to find stale documents later on.
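Two pieces of the download task above can be illustrated in isolation: building the conditional-request validators from the cached headers, and computing the next `date_expires`. This is a minimal Python sketch under my own assumptions (the helpers and the three-day default TTL are hypothetical; the post only says "several days"):

```python
from datetime import datetime, timedelta

DEFAULT_TTL = timedelta(days=3)  # assumption: "several days in the future"

def conditional_headers(resource):
    """Turn cached ETag/Last-Modified response headers into request validators."""
    headers = {}
    for h in resource.get("response_headers", []):
        if h["key"].lower() == "etag":
            headers["If-None-Match"] = h["value"]
        elif h["key"].lower() == "last-modified":
            headers["If-Modified-Since"] = h["value"]
    return headers

def next_expiry(response_headers, now):
    """Honor an Expires header when present, else default several days out."""
    for h in response_headers:
        if h["key"].lower() == "expires":
            return h["value"]
    return (now + DEFAULT_TTL).strftime("%Y-%m-%d %H:%M:%S")
```

A `304 Not Modified` response to the conditional request means only the `date_*` fields need refreshing, and no parse task gets queued.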

## Parsed Documents (`result`)

The `result` type has all our indexed URLs which were parsed and found to be useful. The documents contain some structured fields which are generated by the parsing step. The mapping looks like:

```
{ "_id": {
      "type": "string" },
  "url": {
      "type": "string",
      "index": "no" },
  "itemtype": {
      "type": "string",
      "analyzer": "keyword" },
  "image": {
      "type": "string",
      "index": "no" },
  "title": {
      "boost": 5.0,
      "type": "string",
      "include_in_all": true,
      "position_offset_gap": 64,
      "index_analyzer": "snowballed",
      "search_analyzer": "snowballed_searcher" },
  "keywords": {
      "_boost": 6.0,
      "type": "string",
      "include_in_all": true,
      "index_analyzer": "snowballed",
      "search_analyzer": "snowballed_searcher" },
  "description": {
      "_boost": 3.0,
      "type": "string",
      "analyzer": "standard" },
  "crumbs": {
      "boost": 0.5,
      "properties": {
          "url": {
              "type": "string",
              "index": "no" },
          "title": {
              "type": "string",
              "include_in_all": true,
              "analyzer": "standard" } } },
  "content": {
      "type": "string",
      "include_in_all": true,
      "position_offset_gap": 128,
      "analyzer": "standard" },
  "facts": {
      "type": "object",
      "enabled": false,
      "index": "no" },
  "date_parsed" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" },
  "date_published" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss" } }
```

A few notes on the specific fields:

- `itemtype` - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
- `image` - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
- `title` - usually based on the `title` tag or more-concise `og:title` data
- `keywords` - usually based on the keywords `meta` tag (the field is boosted because they're specifically targeted phrases)
- `description` - usually the description `meta` tag
- `content` - any remaining useful, searchable content somebody might try to find something in
- `facts` - arbitrary data used for rendering more helpful search results; some common keys:
  
  - `collection` - indicates there are multiple of something (e.g. product quantities, styles of a product)
  - `product_model` - indicates a product model name for the result
  - `brand` - indicates the brand name for the result
  - `price`, `priceMin`, `priceMax` - indicate the price(s) of a result
  - `availability` - for a product this is usually "in stock" or "out of stock"
  
- `date_published` - for content such as blog posts or announcements

The `result` type is updated by the parse task which is responsible for:

1. Receiving a URL to parse;
1. Finding the current `resource` record;
1. Running the `response_content` through the appropriate structured data parser;
1. Extracting generic data (e.g. title, keywords);
1. Extracting `itemtype`-specific metadata, usually for `facts`;
1. And updating the `result` record.
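The generic-data extraction step can be sketched with Python's standard-library HTML parser. This is an illustration, not the site's actual parser (which works from structured data); it only pulls the `title` tag and the keywords `meta` tag:

```python
from html.parser import HTMLParser

class GenericExtractor(HTMLParser):
    """Pull the <title> text and the keywords <meta> tag out of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "keywords":
            # Keywords are conventionally a comma-separated list.
            self.keywords = [k.strip() for k in attrs.get("content", "").split(",")]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# An invented page fragment for illustration.
page = """<html><head><title>Madelinetosh Tosh DK</title>
<meta name="keywords" content="tosh dk, madelinetosh yarn"></head></html>"""

parser = GenericExtractor()
parser.feed(page)
print(parser.title, parser.keywords)
```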

For example, this parsed [product model](https://www.theloopyewe.com/shop/g/yarn/madelinetosh/tosh-dk/) looks like:

```
{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
  "itemtype" : "ProductModel",
  "title" : "Madelinetosh Tosh DK",
  "keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
  "image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
  "crumbs": [
      {
          "url" : "/shop/",
          "title" : "Shop" },
      {
          "url" : "/shop/g/yarn/",
          "title" : "Yarn" },
      {
          "url" : "/shop/g/yarn/madelinetosh/",
          "title" : "Madelinetosh" } ],
  "content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
  "facts" : {
      "collection": [
          {
              "value" : 93,
              "label" : "products" } ],
      "brand" : "Madelinetosh",
      "price" : "22.00" },
  "_boost" : 4 }
```

## Searching

Once some documents are indexed, I can create simple searches with the [`ruflin/Elastica`](https://github.com/ruflin/Elastica/) library:

```
$bool = (new \Elastica\Query\Bool())
    ->addMust(
        (new \Elastica\Query\Bool())
            ->setParam('minimum_number_should_match', 1)
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'keywords')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'title')
                    /* ...snip... */ )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'content')
                    /* ...snip... */ ) );

/* ...snip... */

$query = new \Elastica\Query($bool);
```

To draw attention to the specific matches in the `title` and `content` fields, I can enable highlighting:

```
$query->setHighlight(
    array(
        'pre_tags' => array('<strong>'),
        'post_tags' => array('</strong>'),
        'fields' => array(
            'title' => array(
                'fragment_size' => 256,
                'number_of_fragments' => 1 ),
            'content' => array(
                'fragment_size' => 64,
                'number_of_fragments' => 3 ) ) ) );
```

# Maintenance

A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I take two approaches:

**Time-based updates** - one of the reasons for the indexed `date_expires` field of the `resource` type is so a process can go through and identify documents which have not been updated recently. If it sees something is stale, it goes ahead and queues it for update.

**Real-time updates** - sometimes things (like product availability) change frequently, impacting the quality of search results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when they see things like inventory changes or product changes from an order.

In either case, when a URL is discovered to be gone, the records from both `resource` and `result` are removed for the URL.
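The time-based sweep boils down to one query against the `resource` type: anything whose `date_expires` is in the past is stale. A sketch of that query body, expressed as a Python dict (the helper name is mine, and the syntax matches the elasticsearch versions of that era):

```python
from datetime import datetime

def stale_resources_query(now):
    """Match resource documents whose date_expires has already passed."""
    return {
        "query": {
            "range": {
                "date_expires": {
                    # Same yyyy-MM-dd HH:mm:ss format the mapping declares.
                    "lt": now.strftime("%Y-%m-%d %H:%M:%S")
                }
            }
        }
    }

print(stale_resources_query(datetime(2013, 6, 1, 12, 0, 0)))
```

Each hit then gets its URL queued for the same download task used during initial indexing, so the stale path and the fresh path share one code path.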

## Utilities

Sometimes there are deploys where specific pages are definitely changing, or when a whole new sitemap is getting registered with new URLs. Instead of waiting for the time-based updates or cron jobs to run, I have these commands available for scripting:

- `search:index-rebuild` - re-read the sitemaps and assert the links in the `resource` index
- `search:index-update` - find all the expired resources and queue them for update
- `search:result-rerun` - force the download and parsing of a URL
- `search:sitemap-generate` - regenerate all registered sitemaps

# Conclusion

Starting with structured data and elasticsearch makes building a search engine significantly easier. Structured data and indexing make it faster to show smarter [search results](https://www.theloopyewe.com/search/?q=madelinetosh). Existing standards like [OpenSearch](http://www.opensearch.org/Home) make it easy to extend the search from a web page into the [browser](https://www.theloopyewe.com/search/opensearch.xml) and even third-party applications via [Atom](https://www.theloopyewe.com/search/results.atom?q=spring+fling) and [RSS](https://www.theloopyewe.com/search/results.rss?q=spring+fling) feeds. Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and ranking algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website from the perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.

