The Basics of a Custom Search Engine
One of the most useful features of a website is the ability to search. The Loopy Ewe has had some form of faceted product search for a long time, but it has never had the ability to quickly find regular pages, categories, brands, blog posts and the like. Google seems to lead in offering custom search products with both Custom Search Engine and Site Search, but they're either branded or cost a bit of money. Instead of investing in their proprietary products, I wanted to try creating a simple search engine for our needs, one that took advantage of my previous work implementing existing open standards.
Introduction
In my mind, there are four basic processes when creating a search engine:
Discovery – finding the documents that are worth indexing. This step was fairly easy since I had already set up a sitemap for the site. Internally, the feature bundles of the site are responsible for generating their own sitemaps (e.g. blog posts, regular content pages, photo galleries, products, product groups) and `sitemap.xml` just advertises them. So, for our purposes, the discovery step just involves reviewing those sitemaps to find the links.
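To give a feel for it, the review might look roughly like the sketch below; the sitemap URL and the `queueResourceForDownload()` helper are illustrative stand-ins, not the site's actual code.

```php
// Sketch of the discovery step: walk the sitemap index, then each bundle's
// sitemap, and queue every <loc> for download. Names are illustrative.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

$index = simplexml_load_file('https://www.example.com/sitemap.xml');

foreach ($index->children($ns)->sitemap as $entry) {
    $sitemap = simplexml_load_file((string) $entry->children($ns)->loc);

    foreach ($sitemap->children($ns)->url as $url) {
        queueResourceForDownload((string) $url->children($ns)->loc); // hypothetical queue helper
    }
}
```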
Parsing – understanding the documents to know what content is significant. Given my previous work implementing structured data on the site and creating internal tools for reviewing the results, parsing becomes a very simple task.
The next two processes are more what I want to focus on here:
- Indexing – ensuring the documents are accessible via search queries.
- Maintenance – keeping the index current as documents change or are removed.
Indexing
We were already using elasticsearch, so I was hoping to use it for full-text searching as well. I decided to maintain two types in the search index.
Discovered Documents (`resource`)
The `resource` type has all our indexed URLs and a cache of their contents. Since we're not going to be searching it directly, it's more of a basic key-based store keyed by URL. The mapping looks something like:
{ "_id" : {
"type" : "string" },
"url" : {
"type" : "string",
"index" : "no" },
"response_status" : {
"type" : "string",
"index" : "no" },
"response_headers" : {
"properties" : {
"key" : {
"type" : "string",
"index" : "no" },
"value" : {
"type" : "string",
"index" : "no" } } },
"response_content" : {
"type" : "string",
"index" : "no" },
"date_retrieved" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" },
"date_expires" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
The `_id` is simply a hash of the actual URL and is used to reference the document elsewhere. Whenever the discovery process finds a new URL, it creates a new record and queues a task to download the document. The initial record looks like:
{ "_id" : "b48d426138096d66bfaa4ac9dcbc4cb6",
"url" : "/local/fling/spring-fling-2013/",
"date_expires" : "2001-01-01 00:00:00" }
Then the download task is responsible for:
- Receiving a URL to download;
- Finding the current `resource` record;
- Validating it against `robots.txt`;
- Sending a new request for the URL (respecting `ETag` and `Last-Modified` headers);
- Updating the `resource` record with the response and new `date_*` values;
- And, if the document has changed, queueing a task to parse the `resource`.
By default, if an `Expires` response header isn't provided, I set the `date_expires` field to several days in the future. That field is used later on to find stale documents.
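The conditional request itself could be sketched with cURL along these lines; the cached header values, the three-day fallback, and the host name are assumptions rather than the actual implementation.

```php
// Sketch of the download task's conditional request. Assume $resource is the
// stored record and these values were cached from its response_headers.
$resource           = array('url' => '/local/fling/spring-fling-2013/');
$cachedEtag         = '"abc123"';
$cachedLastModified = 'Tue, 16 Jul 2013 00:00:00 GMT';

$ch = curl_init('https://www.example.com' . $resource['url']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'If-None-Match: ' . $cachedEtag,
    'If-Modified-Since: ' . $cachedLastModified,
));

$body   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

$expiresHeader = null; // would be parsed from the response headers

// Fall back to "several days" out when no Expires header is provided;
// three days is an arbitrary choice for this sketch.
$dateExpires = $expiresHeader
    ? date('Y-m-d H:i:s', strtotime($expiresHeader))
    : date('Y-m-d H:i:s', strtotime('+3 days'));

if ($status !== 304) {
    // Content changed: update the resource record and queue the parse task.
}
```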
Parsed Documents (`result`)
The `result` type has all our indexed URLs which were parsed and found to be useful. These documents contain structured fields generated by the parsing step. The mapping looks like:
{ "_id": {
"type": "string" },
"url": {
"type": "string",
"index": "no" },
"itemtype": {
"type": "string",
"analyzer": "keyword" },
"image": {
"type": "string",
"index": "no" },
"title": {
"boost": 5.0,
"type": "string",
"include_in_all": true,
"position_offset_gap": 64,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"keywords": {
"_boost": 6.0,
"type": "string",
"include_in_all": true,
"index_analyzer": "snowballed",
"search_analyzer": "snowballed_searcher" },
"description": {
"_boost": 3.0,
"type": "string",
"analyzer": "standard" },
"crumbs": {
"boost": 0.5,
"properties": {
"url": {
"type": "string",
"index": "no" },
"title": {
"type": "string",
"include_in_all": true,
"analyzer": "standard" } } },
"content": {
"type": "string",
"include_in_all": true,
"position_offset_gap": 128,
"analyzer": "standard" },
"facts": {
"type": "object",
"enabled": false,
"index": "no" },
"date_parsed" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" },
"date_published" : {
"type" : "date",
"format" : "yyyy-MM-dd HH:mm:ss" } }
A few notes on the specific fields:
- `itemtype` – the generic result type in schema.org terms (e.g. Product, WebPage, Organization)
- `image` – a primary image from the page; it becomes a thumbnail on search results to make them more inviting
- `title` – usually based on the `title` tag or the more concise `og:title` data
- `keywords` – usually based on the keywords `meta` tag (the field is boosted because these are specifically targeted phrases)
- `description` – usually the description `meta` tag
- `content` – any remaining useful, searchable content somebody might try to find something in
- `facts` – arbitrary data used for rendering more helpful search results; some common keys:
  - `collection` – indicates there are multiple of something (e.g. product quantities, styles of a product)
  - `product_model` – indicates a product model name for the result
  - `brand` – indicates the brand name for the result
  - `price`, `priceMin`, `priceMax` – indicate the price(s) of a result
  - `availability` – for a product this is usually "in stock" or "out of stock"
- `date_published` – for content such as blog posts or announcements
The `result` type is updated by the parse task, which is responsible for:
- Receiving a URL to parse;
- Finding the current `resource` record;
- Running the `response_content` through the appropriate structured data parser;
- Extracting generic data (e.g. title, keywords);
- Extracting `itemtype`-specific metadata, usually for `facts`;
- Updating the `result` record.
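The generic extraction is plain DOM work; a rough sketch of that part (leaving out the structured data parser and the `facts` handling) might look like this, with the field names mirroring the mapping above.

```php
// Sketch of the generic extraction step: pull title, keywords and description
// out of the cached response_content with DOMXPath. The stub $resource record
// and its markup are placeholders for illustration.
$resource = array(
    'url'              => '/shop/g/yarn/madelinetosh/tosh-dk/',
    'response_content' => '<html><head><title>Madelinetosh Tosh DK</title></head><body></body></html>',
);

$doc = new \DOMDocument();
@$doc->loadHTML($resource['response_content']);
$xpath = new \DOMXPath($doc);

$title = $xpath->evaluate('string(//meta[@property="og:title"]/@content)')
    ?: $xpath->evaluate('string(//title)');
$keywords    = $xpath->evaluate('string(//meta[@name="keywords"]/@content)');
$description = $xpath->evaluate('string(//meta[@name="description"]/@content)');

$result = array(
    'url'         => $resource['url'],
    'title'       => trim($title),
    'keywords'    => array_map('trim', explode(',', $keywords)),
    'description' => trim($description),
    'date_parsed' => date('Y-m-d H:i:s'),
);
```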
For example, this parsed product model looks like:
{ "url" : "/shop/g/yarn/madelinetosh/tosh-dk/",
"itemtype" : "ProductModel",
"title" : "Madelinetosh Tosh DK",
"keywords" : [ "tosh dk", "tosh dk yarn", "madelinetosh", "madelinetosh yarn", "madelinetosh tosh dk", "madelinetosh" ],
"image" : "/asset/catalog-entry-photo/17c1dc50-37ab-dac6-ca3c-9fd055a5b07f~v2-96x96.jpg",
"crumbs": [
{
"url" : "/shop/",
"title" : "Shop" },
{
"url" : "/shop/g/yarn/",
"title" : "Yarn" },
{
"url" : "/shop/g/yarn/madelinetosh/",
"title" : "Madelinetosh" } ],
"content" : "Hand-dyed by the gals at Madelinetosh in Texas, you'll find these colors vibrant and multi-layered. Perfect for thick socks, scarves, shawls, hats, gloves, mitts and sweaters.",
"facts" : {
"collection": [
{
"value" : 93,
"label" : "products" } ],
"brand" : "Madelinetosh",
"price" : "22.00" },
"_boost" : 4 }
Searching
Once some documents are indexed, I can create simple searches with the `ruflin/Elastica` library:
```php
$bool = (new \Elastica\Query\Bool())
    ->addMust(
        (new \Elastica\Query\Bool())
            ->setParam('minimum_number_should_match', 1)
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'keywords')
                    /* ...snip... */
            )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'title')
                    /* ...snip... */
            )
            ->addShould(
                (new \Elastica\Query\QueryString())
                    ->setParam('default_field', 'content')
                    /* ...snip... */
            )
    );

/* ...snip... */

$query = new \Elastica\Query($bool);
```
To easily call out specific matches in the `title` and `content` fields, I can enable highlighting:
$query->setHighlight(
array(
'pre_tags' => array('<strong>'),
'post_tags' => array('</strong>'),
'fields' => array(
'title' => array(
'fragment_size' => 256,
'number_of_fragments' => 1 ),
'content' => array(
'fragment_size' => 64,
'number_of_fragments' => 3 ) ) ) );
Maintenance
A search engine is no good if it's using outdated or no-longer-existent information. To help keep content up to date, I take two approaches:
Time-based updates – one of the reasons for the indexed `date_expires` field on the `resource` type is so a process can go through and identify documents which have not been updated recently. If it sees something is stale, it goes ahead and queues it for update.
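That sweep can be sketched as a simple range query on `date_expires`; here `$resourceType` and `queueDownloadTask()` are the same assumed names as in the earlier sketches.

```php
// Sketch: find resources whose date_expires has passed and re-queue them.
$expired = new \Elastica\Query\Range('date_expires', array(
    'lte' => date('Y-m-d H:i:s'),
));

foreach ($resourceType->search(new \Elastica\Query($expired))->getResults() as $hit) {
    $data = $hit->getData();
    queueDownloadTask($data['url']); // hypothetical queue helper
}
```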
Real-time updates – sometimes things (like product availability) change frequently, impacting the quality of search results. Instead of waiting for the time-based updates, I use event listeners to trigger re-indexing when they see things like inventory changes or a product changing within an order.
In either case, when a URL is discovered to be gone, the records for that URL are removed from both `resource` and `result`.
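The cleanup itself is just two deletes keyed by the shared hash id; roughly:

```php
// Sketch: when a URL 404s or drops out of the sitemaps, remove both records.
$id = md5($url); // same hash used for _id elsewhere (md5 assumed for the sketch)

$resourceType->deleteById($id);
$resultType->deleteById($id);
```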
Utilities
Sometimes there are deploys where specific pages are definitely changing, or when a whole new sitemap is getting registered with new URLs. Instead of waiting for the time-based updates or cron jobs to run, I have these commands available for scripting:
- `search:index-rebuild` – re-read the sitemaps and assert the links in the `resource` index
- `search:index-update` – find all the expired resources and queue them for update
- `search:result-rerun` – force the download and parsing of a URL
- `search:sitemap-generate` – regenerate all registered sitemaps
Conclusion
Starting with structured data and elasticsearch makes building a search engine significantly easier. Because the data is already structured and indexed, it's faster to show smarter search results. Existing standards like OpenSearch make it easy to extend the search from a web page into the browser, and even into third-party applications via Atom and RSS feeds. Local, real-time updates ensure search results are timely and useful. Even with the basic parsing and ranking algorithms shown here, results are quite accurate. It has been a beneficial experience to approach the website from the perspective of a bot, giving me a better appreciation of how to efficiently mark up and market content.