Xapian Search for Drupal

Here at Trellon, clients come to us all the time looking for ways to make knowledge more accessible through their web sites. Because search features are a primary tool for exposing data, and because the performance of Drupal's search engine is less than optimal in certain situations, we developed a module that replaces Drupal's native search features with the Xapian search engine. Here's why we did it.

Reason

A common challenge for Drupal sites is working with documents in different formats and getting them into the search engine. Drupal does not natively index PDFs and Word documents, despite the fact that they are the most commonly exchanged text formats on the Internet (outside of HTML). This presents problems for sites where content is driven by document uploads, and has led to some sub-optimal solutions and messy UI workflow patterns in the past.

Another challenge is performance. While Drupal scales well in terms of serving pages through the use of memcache, CDNs and opcode caching, search functionality is another matter entirely. The organic nature of searches often defies attempts to scale this important subsystem in high traffic environments. Alternative search engines, like Google's Site Search and Yahoo's web services, often need to be brought in to handle the demands that come along with high utilization.

To attempt to solve these problems, Trellon spent some time working with alternative search engine solutions for Drupal, and one product we came across that can get the job done is Xapian. We have just released a module that replaces Drupal's native search with Xapian, and it can be downloaded from its Drupal project page.

Xapian is an open source search engine that has some cool features. The most exciting feature is the ability to index just about any document type commonly used on the Internet: HTML, PHP, PDF, PostScript, OpenOffice/StarOffice, OpenDocument, Microsoft Word/Excel/PowerPoint/Works, WordPerfect, AbiWord, RTF, DVI, Perl POD documentation, and plain text documents. The search engine supports a full range of boolean operators, meaning users can punch AND, OR and NOT operators into their searches to get really fine grained with the results returned. Xapian supports Unicode and UTF-8, meaning content can be supported in many different languages. It can also output to XML or CSV formats, which means there are ways to get information from the search without necessarily having to bootstrap Drupal (making things like an AJAX search component possible).

While we are still adding new indexing features, the Drupal module is stable and ready for testing. We performed some benchmarks to understand the performance implications, and they are good. Keep in mind these are early stage benchmarks that have not been vetted by the community, and we will be collecting more as time goes on.

Environment

These benchmarks were performed on Trellon's development server, which has a 2.0 GHz Xeon processor and 512 MB of RAM. It runs Apache, Postfix, MySQL and memcache, and serves the Drupal pages used for these benchmarks.

Your desktop machine, and maybe your laptop, is more powerful than our development server. Don't consider these results representative of what will happen on a more powerful server; they are meant to demonstrate some low level concepts in performance.

We ran these tests against a default installation of Drupal with few modules enabled; the most important ones are probably path and taxonomy. There are 94,567 nodes loaded into the database, each of which is pretty long (a minimum of 3 full paragraphs).

Natural Language Search

We added a timer within the search module and a watchdog alert to tell us how long it takes to return results on several different keyword searches.

It is important to understand we are benchmarking the time taken within PHP to access search records, not the time for generating a page.
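To make the measurement concrete, here is a minimal sketch of the timing approach, written in Python rather than the PHP the module runs in. Everything in it is illustrative: the `timed` decorator, the `timings` list (standing in for Drupal's watchdog log) and the toy `search` function are assumptions, not code from the module.

```python
import time
from functools import wraps

def timed(log):
    """Decorator that records how long each call takes, in milliseconds."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Only the time spent inside the search call is measured,
                # not the time needed to render a page around the results.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                log.append((func.__name__, args, elapsed_ms))
        return wrapper
    return decorator

timings = []  # hypothetical stand-in for the watchdog log

@timed(timings)
def search(keywords):
    # Hypothetical placeholder for the real search backend.
    documents = ["xapian search for drupal", "drupal caching", "apache benchmark"]
    return [doc for doc in documents if keywords in doc]

results = search("drupal")
for name, args, elapsed_ms in timings:
    print(f"{name}{args} took {elapsed_ms:.3f} ms")
```

Wrapping only the search call is what keeps page generation out of the numbers.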

Some takeaways:

  • On a first pass, Xapian's search engine is about 14 times faster for a natural language search, averaged across these terms on this dataset.
  • On a second pass, Xapian's search engine is about 71 times faster for a natural language search, averaged across these terms on this dataset.
  • Both search engines are faster the second time a term is searched for. Drupal's search performance improves by a factor of 8.6; Xapian's improves by a factor of 43.7.
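These three figures hang together. If Xapian starts out 14 times faster and then gains a factor of 43.7 on the second pass while Drupal gains 8.6, the second-pass advantage works out to 14 × 43.7 / 8.6. A quick check in Python, using only the numbers quoted above (the variable names are ours):

```python
# Factors quoted in the takeaways above.
first_pass_speedup = 14.0       # Xapian vs. Drupal, first pass
drupal_second_pass_gain = 8.6   # Drupal, second pass vs. first pass
xapian_second_pass_gain = 43.7  # Xapian, second pass vs. first pass

# Each engine's second-pass time is its first-pass time divided by its gain,
# so the ratio between the engines scales by the ratio of the gains.
second_pass_speedup = (
    first_pass_speedup * xapian_second_pass_gain / drupal_second_pass_gain
)

print(round(second_pass_speedup, 1))  # prints 71.1
```

That matches the "about 71 times faster" figure for the second pass.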

Performance Under Load

We tested the performance of the two search engines under load using Apache Benchmark (ab) under two scenarios: first with no caching enabled, and then with normal caching enabled. The results are a little different, as described below.
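For readers who have not used it, "100 hits, 10 at a time" is what `ab -n 100 -c 10` against a URL does: it issues 100 requests with at most 10 in flight and reports the time within which each percentage of requests was served. The Python sketch below imitates that shape against a stand-in function. The names `run_load_test`, `percentile` and `fake_search` are hypothetical, and the sleep is a placeholder for a real HTTP request.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(request, total=100, concurrency=10):
    """Call request() `total` times with at most `concurrency` in flight.

    Returns the per-request durations in seconds, sorted ascending.
    """
    def timed_call(_):
        start = time.perf_counter()
        request()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(timed_call, range(total)))
    return sorted(durations)

def percentile(sorted_durations, pct):
    """Time within which `pct` percent of the requests completed."""
    index = max(0, math.ceil(len(sorted_durations) * pct / 100) - 1)
    return sorted_durations[index]

def fake_search():
    # Placeholder for fetching a search results page.
    time.sleep(0.001)

durations = run_load_test(fake_search, total=100, concurrency=10)
for pct in (50, 90, 100):
    print(f"{pct:>3}%  {percentile(durations, pct) * 1000:.1f} ms")
```

The 50% row is the median request and the 100% row is the slowest one, which is the distinction the rest of this section leans on.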

100 Hits, 10 At a Time, No Caching

The top chart is Xapian and the bottom chart is Drupal search. No caching is enabled in Drupal.


This demonstrates that, without caching enabled, Xapian makes searching a Drupal site faster.

It is important to remember that benchmarks need to model actual user behavior, and these numbers may be misleading. First and foremost, it is a bad idea to run Drupal without caching enabled if your site is going to have any amount of traffic. These charts are concerned with out-of-the-box Drupal behavior.

Secondly, users don't all search for terms at the same time, and search indexing occurs organically. To judge whether a benchmark models actual user behavior, think about the way users come to search engines. There are typically some terms which are more popular than others, and a power law distribution occurs as most of the search traffic goes towards the terms that are searched for most often. Unlike benchmarks for standard web pages, the numbers at the 100% level may be more important than the numbers at the 50% level, depending on which part of the curve you are looking at.
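A toy simulation shows why the tail matters. The sketch below draws search terms from a Zipf-like distribution and charges a high cost the first time a term is seen and a low cost afterwards. The function name, the costs and the term counts are made-up illustrative values, not measurements from our benchmarks.

```python
import random

def simulate(num_requests=1000, num_terms=200,
             cached_cost=0.05, uncached_cost=4.0, seed=1):
    """Return sorted response times for searches drawn from a power law."""
    rng = random.Random(seed)
    terms = list(range(1, num_terms + 1))
    weights = [1.0 / rank for rank in terms]  # popular terms dominate
    seen = set()
    times = []
    for term in rng.choices(terms, weights=weights, k=num_requests):
        if term in seen:
            times.append(cached_cost)    # repeat search: served quickly
        else:
            times.append(uncached_cost)  # first search for this term
            seen.add(term)
    return sorted(times)

times = simulate()
median = times[len(times) // 2]
slowest = times[-1]
print(f"50%: {median} s, 100%: {slowest} s")
```

Under these assumptions the median says nothing about what an uncommon query costs; only the upper percentiles do.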

Some takeaways:

  • These numbers simulate traffic under load for a single search term. The search performance increases on second pass searches are evident in these results.
  • For Drupal, the average time to generate pages is around 25 seconds.
  • For Xapian, the average time to generate pages is around 4 seconds.
  • For some terms, Drupal was unable to return pages in under 60 seconds. ab timed out. We performed multiple tests to confirm this.

100 Hits, 10 At a Time, Normal Caching

Next, we took a look at the way these numbers work with the cache on. The top chart is Xapian and the bottom chart is Drupal search. Normal caching is enabled in Drupal.


Under normal caching conditions, the case for performance changes considerably. We observe that the performance of Xapian and Drupal is roughly similar at the 50% mark, with a slight edge going to Drupal's native search. We attribute this to the pages themselves already being stored in cache at this point. At the upper end of the spectrum, we see the difference in performance between Xapian and Drupal. Time to return results on most terms is reduced dramatically under Xapian. Going back to the power law analogy, results in the 100th percentile represent the performance times for organic searches on the tail end of the graph.

Some final takeaways:

  • These numbers simulate traffic under load for a single search term with caching enabled. The performance benefits of caching are more dramatic than any second pass results for either Xapian or Drupal search.
  • For popular search terms that will be searched by more than one user, Drupal has a slight edge in performance.
  • For organic searches that are not likely to be repeated often, Xapian offers dramatically better performance.

Now that we have an understanding of performance implications, our next post is going to focus on indexing different media types. While the Xapian module can handle most of the work, module developers will need to be aware of implications for handling uploads to get stuff into the search engine. Until next time!

NOTE: This module was originally developed for http://www.taniwhasolutions.com/