
Comparing Thinking Sphinx And Acts As Ferret For Full-text Indexing In Rails

Filed in Ruby. Delivered Thursday, March 5, 2009

This will by no means be an exhaustive list of differences between using Sphinx and using Ferret,
but we’ll look at a few major differences between the way these two search engines are
implemented via acts_as_ferret (AAF) and Thinking Sphinx (TS).

Active Development
First things first: TS is under much more active development. At the time of writing, TS last had a
commit to the official repo 13 days ago, compared with three months ago for AAF. And it
shows. Online discussion of acts_as_ferret is minimal these days and most of the tutorials are
rather old. Meanwhile, Thinking Sphinx has a very active Google group and several more recent
tutorials, including a fairly recent Railscast.

So Thinking Sphinx wins in terms of active development.

Speed, Reliability, and Resources


Another place TS blows AAF out of the water is in speed and resource usage. Sphinx uses
kilobytes of memory where a ferret daemon will sit on megabytes, having to load your entire
Rails app into memory. For example, on my machine, the Sphinx daemon sat at 376 KB while
my ferret process ate 57.69 MB. Not kidding.

Ezra Zygmuntowicz said:

Ferret is unstable in production. Segfaults, corrupted indexes galore. We've switched
around 40 clients from Ferret to Sphinx and solved their problems this way. I will never
use Ferret again after all the problems I have seen it cause people's production apps.

Plus Sphinx can reindex many, many times faster than Ferret and uses less CPU and
memory as well.

Anecdotally, that’s my experience as well. Thinking Sphinx can index my database in less than a
minute, while acts_as_ferret can take up to 30 minutes or more.

The Level of Interaction with Rails
Simply put, acts_as_ferret obeys ActiveRecord, while Thinking Sphinx goes low-level.

In his Thinking Sphinx Peepcode PDF, Pat Allen writes “For those familiar with Ferret, Sphinx is
quite similar, except that Sphinx talks directly to database servers – both MySQL and PostgreSQL
– to obtain the data to index.”

This is largely what gives Sphinx its speed advantage, but it also makes Thinking Sphinx dumb
as far as your ActiveRecord models are concerned.

For instance, this means that TS isn’t aware of your acts_as_paranoid models until you add the
deleted_at conditional to your define_index block.

define_index do
  indexes [:body, :title]
  where "deleted_at IS NULL"
end

This also means that TS can’t index computed values as easily. In AAF you can index methods
on your object, so you could index a method like the following

def ordinalized_names_of_children
  ordinalized_children = []
  self.children.sort_by(&:birth_date).each_with_index do |child, i|
    # ordinalize comes from ActiveSupport; the first child becomes "1st", etc.
    ordinalized_children << [child.first_name, (i + 1).ordinalize]
  end
  ordinalized_children
end

This is a silly example, but to accomplish the same with TS you need to use db-specific string
transformations and add all your conditional logic to the query as well. And you can easily
imagine more complicated examples. Where AAF gives you the entire landscape of Ruby to use
and abuse, constrained only by ActiveRecord, TS limits you to what can be done at the database
level. Luckily this is usually enough.
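
For a rough point of comparison, here's a minimal sketch of how the raw data could be exposed to
Sphinx instead, assuming a children association with a first_name column. The association-based
indexes call is plain Thinking Sphinx; reproducing the ordinal prefixes ("1st", "2nd", ...) is exactly
the part that would have to be pushed into database-specific SQL.

define_index do
  indexes [:body, :title]

  # Index the children's names via the association; Thinking Sphinx generates
  # the join and the aggregation in SQL for us.
  indexes children.first_name, :as => :children_names

  where "deleted_at IS NULL"
end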

Indexing Changes to Your Models


As far as ease of handling updates goes, acts_as_ferret has a big initial advantage. From Gregg
Pollack:

the index gets modified every time you add/edit/remove the ActiveRecord model it’s
associated with. You never have to worry about doing this yourself, it happens
automatically, so your search index is always 100% accurate. No rebuilding needed.

With Thinking Sphinx, you need to specify something called delta indexes on the models you want
to keep up to date for searches between full index rebuilds. This is a little more intrusive than AAF’s
approach, since you also have to add a field called “delta” to your table to track what has been
updated, but a single boolean field doesn’t incur much overhead. You’ll still need to rebuild your
indexes periodically, as the delta indexes can slow things down over time.
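
For concreteness, the setup looks something like this (a sketch assuming an Article model; the
migration and the :delta property are the two pieces you have to add):

# Migration adding the boolean column Thinking Sphinx uses to flag changed rows
class AddDeltaToArticles < ActiveRecord::Migration
  def self.up
    add_column :articles, :delta, :boolean, :default => true, :null => false
  end

  def self.down
    remove_column :articles, :delta
  end
end

# Model: turn on delta indexing inside the index definition
class Article < ActiveRecord::Base
  define_index do
    indexes :title, :body

    set_property :delta => true
  end
end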

In both AAF and TS, deleted models are immediately removed from the index.

To sum up the differences:

- AAF requires less work on your part to keep an accurate running index, but the overhead
  involved in updating the index on each record update can be painfully slow over time.
- TS requires more work to specify delta indexes and requires you to add a column to your
  database, but gives you options for how often to incorporate changes, so that you're not
  waiting on a change to be indexed on every model save.

In my experience, very few model updates need to be instantly available for search, and both
approaches have their pros and cons. Though it requires slightly more work on your part, I feel
TS puts more control in your hands.

The Overall Winner


The winner here is obviously Thinking Sphinx. It uses fewer resources, it's faster and more
reliable, and its future support looks a lot more certain. Sure, you may have to get your
hands a little dirtier with some SQL, but the benefits more than make up for it.

Also (and I get nothing out of this), you should buy the Peepcode PDF as it will give you a huge
head start on Thinking Sphinx.

An Aside on Sphinx’s Treatment of Primary Keys


There’s another thing Ferret can do that Sphinx cannot. As Section 3.5 of the Sphinx
documentation states, “ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER
NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).” And Thinking Sphinx
enforces this in its config file by performing math on your table’s id column to help create unique
sphinx index ids.
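
The exact arithmetic lives in the generated sphinx.conf, but the idea is roughly an offset-and-multiply
scheme so that ids from different indexed models never collide. An illustrative (not literal) version
in Ruby, with hypothetical names:

# Illustrative only; these names are not Thinking Sphinx internals.
def sphinx_document_id(record_id, model_offset, indexed_model_count)
  record_id * indexed_model_count + model_offset
end

sphinx_document_id(42, 0, 3) # => 126 -- unique across three indexed models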

99 times out of 100 this is fine, since most tables just have auto-incremented integer ids anyway,
but what if you have tables with ids of significant value? That’s the situation I found myself in
when adopting Thinking Sphinx on my current project. We have a ton of external data coming in,
and much of that data already has a GUID, so we decided early on to use the GUID as the primary
key and foreign key. That allows us to later recreate any table without having to worry about the
foreign-key integrity issues that can sometimes be a taxing side effect of using auto-incremented ids.

My first approach to overcoming this limitation was to add an auto-incremented column named
“id” to the table and then make use of set_primary_key in Rails. Unfortunately, once you do
that, Thinking Sphinx tries to use the primary key you specified. So Thinking Sphinx had to be
patched. Essentially, I added a method, set_sphinx_primary_key, that lets you specify the
primary key TS should use regardless of what the ActiveRecord model specifies as its
primary key.

So in the example:

class Robot < ActiveRecord::Base
  # The key ActiveRecord will use on joins, map to id, etc.
  # Setting the primary key isn't necessary for set_sphinx_primary_key to work
  set_primary_key :internal_id

  # The key Sphinx will use for indexing; must be a unique integer
  set_sphinx_primary_key :alternate_primary_key

  define_index do
    indexes :name
  end
end

ActiveRecord will use the internal_id field on the “robots” table (the set_primary_key call could just
as well be left out, and ActiveRecord would use the default “id”). But while ActiveRecord uses
internal_id, Thinking Sphinx will instead use alternate_primary_key. So our robots can internally
use a GUID string while Sphinx is still given the integer column it needs to index them.
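
Searching then works as usual; a quick example with a hypothetical query:

# Results come back as Robot instances, even though Sphinx internally keyed
# the documents on alternate_primary_key rather than ActiveRecord's primary key.
robots = Robot.search "astromech"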

You can find these updates in my github branch of Thinking Sphinx. I have no idea if Pat will ever
merge these into the main repo, as it is admittedly a niche need. But if you find yourself in the
situation I found myself in, it can really help you overcome this limitation of Sphinx.
