If your SQL search is just getting to slow, or is not returning the results you expect from it anymore, its probably time to move to a dedicated search-engine.

There are a lot of full text search-engines available like Xapian or Apache Lucese, but in this blogpost we will be showing you how to set up Sphinx in a Ruby on Rails environment. The easiest way to get Sphinx going with Rails is to use excellent Thinking Sphinx gem by Pat Allan. For an introduction to Thinking Sphinx, I recommend viewing the Thinking Sphinx Railscast.

The Thinking Sphinx integrates directly with your active record or merb models and sets up communication with Sphinx through your database. This is the biggest advantage of this gem, but unfortunately also the biggest disadvantage of it. Because of this direct integration it is impossible to have different setups for Sphinx and your application. Also it is not possible to index data which is not stored in your databae. Because I wanted to index all my the data and include data inside my Reddis store, I was unable to use Thinking Sphinx for this.

Luckily Pat separated out the connection layer between Thinking Sphinx and Sphinx into a separate gem called Riddle. In this blogpost I will explain how I set everything up, so that I could use Sphinx to index a media database running on Reddis and Ruby on Rails.

Setting up Sphinx

Installing Sphinx is a piece of cake, just head down to the Sphinx download page and download it for the OS you want. Most distributions provide Sphinx in their default packages and on OSX you can install it through brew.

As I said, the default way to setup Sphinx is to tether it directly to your database. However, it is also possible to use xml as an input for Sphinx. There are two formats Sphinx supports: xmlpipe and xmlpipe2. For this example we will be using the xmlpipe2 standard. Here is an example on you you have to config Sphinx to do this config/sphinx.conf:

indexer {
  mem_limit = 128000000
}

searchd {
  listen = 127.0.0.1:9306:mysql41
  listen = 127.0.0.1:9312
  log = searchd.log
  query_log = searchd.query.log
  pid_file = searchd.pid
}

source media_xml_sphinx {
    type = xmlpipe2
    xmlpipe_command = rake sphinx:generate_xml RAILS_ENV=production
}

index media_stemmer
{
  source = media_xml_sphinx
  path = ./db/sphinx/media_stemmer
  morphology = stem_en
  charset_type = utf-8
}

index media_soundex
{
  source = media_xml_sphinx
  path = ./db/sphinx/media_soundex
  morphology = soundex
  charset_type = utf-8
}

Indexer

First up is the indexer config. We need to set a maximum amount of memory the indexer may use. 128MB should be safe and good enough if you are not wrestling huge amounts of data.

Searchd

The searchd block specifies the options searchd will be using. I have it set up so that it listens on the default port 9312, using Sphinx protocol, but also on port 9306 using Mysql protocol. This will help you a lot when debugging as it allows you to query your SphinxDB directly using any Mysql client and SphinxQL. This also allows you to connect to it with the Mysql2 gem, which I will show you later on.

Source

This is where we define the source of our data. Xmlpipe2 gets the structure of our data from the XML itself, so we just need to specify the source. You can either just cat data.xml here, or call a rake task to stream data directly from your app into the Sphinx indexer.

Indexes

For this example we will be using two types of indexes: a soundex and a stemmer. All the documents from our source will be fed into the morphology function of the index one by one. For each document, the morphology function will simplify the data considerably and then store the id of the document at the correct location in its index.

For example if you feed the words John and Jen through the soundex morphology, you get J500 for the both of them. Sphinx will create in entry in its index named J500 and place the document ids of John and Jen at that entry. If somebody searches for Jehn, Sphinx will return the document ids for John and Jen, as those are stored at the index that sounds like Jehn.

This is repeated for all the documents. When this is done the index will be a searchable index of your data, but not actually contain your data. This is how Sphinx can so efficiently search a lot of data. By setting up multiple indexes, we will be able to tweak the order of returned results exactly to our liking later on.

Generating the XML file for Sphinx

The XML file you generate must contain the attributes you want Sphinx to index and also your documents.

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="artist"/>
    <sphinx:field name="title"/>
    <sphinx:field name="description"/>
    <sphinx:field name="keywords"/>
    <sphinx:attr type="int" name="content_id"/>
    <sphinx:attr type="int" name="class_id"/>
    <sphinx:attr type="int" name="rating"/>
  </sphinx:schema>

  <sphinx:document id="100312">
    <artist>Golden Earring</artist>
    <title>Radar love</title>
    <description>Favourite song of every bassplayer.</description>
    <keywords>night, driving, wheel</keywords>
    <content_id>321</content_id>
    <class_id>1</class_id>
    <rating>10</rating>
  </sphinx:document>

  <sphinx:document id="200456">
    <artist>The Prodigy</artist>
    <title>Voodoo People</title>
    <description>Video of the Live performance at my house.</description>
    <keywords>magic, voo, doo</keywords>
    <content_id>456</content_id>
    <class_id>2</class_id>
    <rating>8</rating>
  </sphinx:document>
</sphinx:docset>

In the first few fields of the schema definition, we are defining the fields we want to have indexed: the artist, title, description and some keywords. Please note again that these fields will be indexed, which means that these fields will NOT be available in your search results. The fields document_id, content_id, class_id and rating however will be available in a search result.

You might be wondering what I am doing with the class, content and document ids here. As I said in the beginning, I have media content from mixed sources I want searchable as a whole. So I am indexing an Audio class (class_id=1) and also a Video class (class_id=2), which I both want available in my Sphinx index. For this I use the following trick: in my Rails app, I have a simple function mapping Classes to a class_id and vice versa. The content_id field here is actually the id field of my objects in ActiveRecord. This way, when I get a search result from Sphinx, I can just look op the class and do a item = get_class(class_id).find(content_id) to retrieve the corresponding object. Because Sphinx requires unique document ids, I use the original class and content_id to generate that.

At this point we need to write an xml generator for our data. I used an extremely simple string-xml generator, as it allowed me to gather my Reddis data per object and stream the generated xml directly into Sphinx. Just implement a to_sphinx_xml function on all the objects that that gather up all the data they need from Reddis or ActiveRecord and write out xml. If you are just setting things up, it is smarter to first generate your XML to file and then read it into Sphinx. Sphinx will provide line numbers to the XML file where you messed up, so that you can easily correct it.

After you’ve done that, you can get it all working, by running the following commands:

indexer --config /config/sphinx.conf
searchd --config /config/sphinx.conf --pidfile

Querying the Sphinx database

There are two easy ways to query the Sphinx database: using the Riddle gem or using the current Master of the Mysql2 gem. You need the current Mysql2 master because has multi statement support, which allows you to query Sphinx using SphinxQL.

Riddle

Add Riddle to your Gemfile and run bundle install. After that you can use the following syntax to query the database:

sphinx = Riddle::Client.new
sphinx.add_query('john', 'media_stemmer, media_soundex')
sphinx.set_filter('class_id', 1)
sphinx.sort_mode = :extended
sphinx.field_weights = {'artist' => 10, 'title' => 5, 'description' => 2, 'keywords'=> 2}
sphinx.index_weights = {'media_stemmer' => 10, 'media_soundex' => 8}
sphinx.rank_mode = :sp04
sphinx.group_clause = '@weight DESC, rating DESC'
results = sphinx.run_queries.first

The set_filter allows us to still search specifically for audio or video in our search database. If you remove this line, we search through the entire media library. As you might have guessed from the run_queries statement, it is possible to perform different queries on multiple indexes in parallel. Take a look at the Riddle documentation on how to do this.

Mysql2

Add the Mysql2 gem to your Gemfile and run bundle install. Quering using the Mysql2 gem is quite similar to Riddle:

sphinx = Mysql2::Client.new(:host => '127.0.0.1', :port => 9306,
          :database => 'nothing', :user => 'nothing',
          :flags => Mysql2::Client::MULTI_STATEMENTS)

query = 'john'

sphinx_ql_query = <<-ENDOFQUERY
  SELECT *
  FROM media_stemmer, media_soundex
  WHERE MATCH('#{john}')
  AND class_id IN(1)
  ORDER BY "@weight DESC, rating DESC"
  OPTION
    field_weights=(artist=10,title=5,description=2,keywords=2),
    index_weights=(media_stemmer=10,media_soundex=8),
    ranker=sph04; SHOW META;
ENDOFQUERY              

results = sphinx.query(sphinx_ql_query)
meta_information = nil

if sphinx.respond_to?(:next_result) && sphinx.next_result
  meta_information = sphinx.store_result
end

The database name and username can be anything really, but are needed for the Mysql2 gem, otherwise it will refuse to connect. As you can see, the second query we are passing here is SHOW META;. This contains vital information like how many more results there are, which you need for things like pagination.

Tweaking

As you can see in the Riddle and Mysql2 example, we are setting field and index weights. These can be used to finely tune the query results you get back from Sphinx. They work in the following order:

  • Sphinx selects all documents matching your query and any filters you might have set.
  • Each matching document gets a total value called weight, calculated by matches in specific fields. So if you search for John and you get a match in the Artist field, that documents gets 10 points. If the keywords also include John, you the line gets an additional 2 points.
  • The points gathered by a document get multiplied by the index weights and added together. So our example document gets 12x10 = 120 points in the media_stemmer index and 6x8 = 48 points in the media_soundex index. This brings it to a total score to 120+48 = 168, which is the weight.
  • All documents get sorted by their calculated weight, returning the documents with the highest weight first.
  • If two documents have the exact same weight, they will be sorted by our manually added rating field in descending order.

As you’ll understand, changing of any of the weight values will alter the calculated weight of each document and therefore the sorting you get back. Please note that you can tweak this process in far more detail by selecting different rankers, sorters, morphologies and groupers. The real challenge here is to get your Sphinx to return relevant results for each query your users throw at it. In our case the soundex preforms great when people are misspelling, but can sometimes give strange results when people aren’t.



blog comments powered by Disqus

Published

15 May 2012

Tags