One of the most useful features of a website is the ability to search. The Loopy Ewe has had some form of faceted product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog posts and the like. Google seems to lead in offering custom search products with both Custom Search Engine and Site Search, but they’re either branded or cost a bit of money. Instead of investing in their proprietary products, I wanted to try to create a simple search engine for our needs which took advantage of my previous work in implementing existing open standards.
In my mind, there are four basic processes when creating a search engine:
- Discovery - finding the documents that are worthy of indexing. This step was fairly easy since I had already set up a sitemap for the site. Internally, the feature bundles of the site are responsible for generating their own sitemaps (e.g. blog posts, regular content pages, photo galleries, products, product groups) and advertising them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links.
- Parsing - understanding the documents to know what content is significant. Given my previous work of implementing structured data on the site and creating internal tools for reviewing the results, parsing becomes a very simple task.
The next two processes are more what I want to focus on here:
- Indexing - ensuring the documents are accessible via search queries.
- Maintenance - keeping the index current as documents change or are removed.
We were already using elasticsearch, so I was hoping to use it for full-text searching as well. I decided to maintain two types in the search index.
Discovered Documents (`resource`)

The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it directly, it's more of a basic key-based store keyed by the URL. The mapping looks something like:
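A minimal sketch of such a mapping, assuming recent elasticsearch field types; field names other than `date_expires` and `response_content` are my assumptions:

```python
# Hypothetical mapping for the "resource" type: a key-based cache of
# downloaded documents. Only date_expires and response_content are named
# in the post; the other fields are illustrative assumptions.
resource_mapping = {
    "resource": {
        "properties": {
            "url": {"type": "keyword"},           # canonical URL; its hash is the _id
            "date_downloaded": {"type": "date"},  # when the document was last fetched
            "date_expires": {"type": "date"},     # used later to find stale documents
            # cached but never queried directly, so keep them out of the index
            "response_headers": {"type": "object", "enabled": False},
            "response_content": {"type": "text", "index": False},
        }
    }
}
```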
The `_id` is simply a hash of the actual URL and is used elsewhere. Whenever the discovery process finds a new URL, it creates a new record and queues a task to download the document. The initial record looks like:
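As a hypothetical illustration (the hash algorithm and the already-expired initial date are both assumptions), the freshly discovered record might contain little more than the URL:

```python
import hashlib
from datetime import datetime, timezone

url = "https://www.example.com/some-page"  # hypothetical URL

# The _id is a hash of the actual URL; sha1 is an assumption here.
doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()

initial_record = {
    "url": url,
    # no content yet; a long-past expiry marks it as immediately stale
    "date_expires": datetime(1970, 1, 1, tzinfo=timezone.utc).isoformat(),
}
```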
Then the download task is responsible for:
- Receiving a URL to download;
- Finding the current `resource` record;
- Validating it against the cached response;
- Sending a new request for the URL (respecting caching headers);
- Updating the `resource` record with the response and a new `date_expires`;
- And, if the document has changed, queueing a task to parse the document.
By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the future. The field is used to find stale documents later on.
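The expiry logic can be sketched as a small helper; the exact default TTL and the header parsing are assumptions on my part:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_TTL = timedelta(days=3)  # "several days"; the exact value is assumed

def compute_date_expires(headers, now=None):
    """Derive a resource's date_expires from its response headers.

    Uses the Expires header when present, otherwise falls back to a
    default TTL so the document is re-checked periodically.
    """
    now = now or datetime.now(timezone.utc)
    expires = headers.get("Expires")
    if expires:
        try:
            return parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            pass  # malformed header: fall through to the default
    return now + DEFAULT_TTL
```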
Parsed Documents (`result`)

The `result` type has all our indexed URLs which were parsed and found to be useful. The documents contain some structured fields which are generated by the parsing step. The mapping looks like:
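A sketch of such a mapping, built from the fields described below; the analyzer defaults and the boost value are my assumptions:

```python
# Hypothetical mapping for the "result" type. Field names follow the
# post's descriptions; type choices and the keywords boost are assumed.
result_mapping = {
    "result": {
        "properties": {
            "itemtype": {"type": "keyword"},             # schema.org type, e.g. Product
            "image": {"type": "keyword", "index": False},  # thumbnail URL, display only
            "title": {"type": "text"},
            "keywords": {"type": "text", "boost": 2.0},  # targeted phrases rank higher
            "description": {"type": "text"},
            "content": {"type": "text"},
            "facts": {"type": "object", "enabled": False},  # arbitrary render-time data
        }
    }
}
```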
A few notes on the specific fields:
- `itemtype` - the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
- `image` - a primary image from the page; it becomes a thumbnail on search results to make them more inviting
- `title` - usually based on the `title` tag, or something more concise
- `keywords` - usually based on the keywords `meta` tag (the field is boosted because they're specifically targeted phrases)
- `description` - usually the description `meta` tag
- `content` - any remaining useful, searchable content somebody might try to find
- `facts` - arbitrary data used for rendering more helpful search results; some common keys:
  - `collection` - indicates there are multiple of something (e.g. product quantities, styles of a product)
  - `product_model` - indicates the product model name for the result
  - `brand` - indicates the brand name for the result
  - `priceMax` - indicates the price(s) of a result
  - `availability` - for a product, this is usually "in stock" or "out of stock"
  - `date_published` - for content such as blog posts or announcements
The `result` type is updated by the parse task, which is responsible for:
- Receiving a URL to parse;
- Finding the current `resource` record;
- Running the `response_content` through the appropriate structured data parser;
- Extracting generic data (e.g. title, keywords);
- Extracting `itemtype`-specific metadata, usually for `facts`;
- Updating the `result` record.
For example, this parsed product model looks like:
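A hypothetical parsed document in the shape described above; every value here is invented for illustration:

```python
# An invented example of a parsed product result, following the fields
# listed earlier; none of these values come from the actual site.
parsed_result = {
    "itemtype": "Product",
    "image": "https://www.example.com/images/worsted-yarn.jpg",
    "title": "Example Worsted Yarn",
    "keywords": ["yarn", "worsted", "wool"],
    "description": "A soft worsted-weight wool yarn.",
    "content": "100% wool, 200 yards per skein. Hand wash only.",
    "facts": {
        "collection": 12,                # e.g. twelve colorways of the product
        "product_model": "Worsted Yarn",
        "brand": "Example Brand",
        "priceMax": "24.00",
        "availability": "in stock",
    },
}
```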
Once some documents are indexed, I can create simple searches with the search API:

To easily focus specific matches in the `content` field, I can enable highlighting:
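A sketch of such a search body using standard elasticsearch query DSL; the `multi_match` query, the field boosts, and the highlight settings are my assumptions rather than the site's exact query:

```python
def build_search(query):
    """Build a hypothetical search body over the result fields
    described above, with highlighting on the content field."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                # boost the hand-curated fields over the page body
                "fields": ["title^3", "keywords^2", "description", "content"],
            }
        },
        "highlight": {
            # matched terms are wrapped in <em> tags by default
            "fields": {"content": {}},
        },
    }
```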
A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I take two approaches:
Time-based updates - one of the reasons for the indexed `date_expires` field of the `resource` type is so a scheduled process can go through and identify documents which have not been updated recently. If it sees something is stale, it goes ahead and queues it for update.
Real-time updates - sometimes things (like product availability) change frequently, impacting the quality of search results. Instead of waiting for time-based updates, I use event listeners to trigger re-indexing when they see things like inventory changes or product changes from an order.
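The time-based sweep amounts to a range filter on `date_expires`; this query shape is my assumption based on the description above:

```python
from datetime import datetime, timezone

def stale_resources_query(now=None):
    """Build a hypothetical query matching resource records whose
    date_expires has already passed, so they can be queued for update."""
    now = now or datetime.now(timezone.utc)
    return {
        "query": {
            "range": {
                "date_expires": {"lt": now.isoformat()}  # expired before now
            }
        }
    }
```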
In either case, when a URL is discovered to be gone, the records are removed from both `resource` and `result` for that URL.
Sometimes there are deploys where specific pages are definitely changing, or when a whole new sitemap is getting registered with new URLs. Instead of waiting for the time-based updates or cron jobs to run, I have these commands available for scripting:
- `search:index-rebuild` - re-read the sitemaps and assert the links in the index
- `search:index-update` - find all the expired resources and queue them for update
- `search:result-rerun` - force the download and parsing of a URL
- `search:sitemap-generate` - regenerate all registered sitemaps
Starting with structured data and elasticsearch makes building a search engine significantly easier. Good data and indexing make it faster to show smarter search results. Existing standards like OpenSearch make it easy to extend the search from a web page into the browser, and even into third-party applications via Atom and RSS feeds. Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and ranking algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website from the perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.