Search Implementation Using Zend_Search_Lucene

Zend_Search_Lucene is a general-purpose information retrieval component written entirely in PHP (the index is completely file system based). It can be used on most small to medium-sized websites to provide full-text search. In this blog, we will look at a practical implementation. From the outside, the API makes Lucene seem almost trivially simple, but it is in fact quite a complex data analysis, storage and retrieval engine that draws on many major areas of computer science: data structures, file systems, indexes and parsing. For the technically inclined who wish to learn more about Lucene internals, I would strongly recommend Lucene in Action. (The entire PHP source code is also available under the Zend/Search folder. You can easily insert a breakpoint and explore it line by line using a good IDE like NetBeans!)

Credit where it is due: The code has been inspired by a chapter from the book “Zend Framework in Action” by Rob Allen. Although coding a Lucene search solution that “works” is quite trivial to accomplish with the Zend_Search_Lucene API, this book/chapter looks at a “well-engineered” solution using the Observer design pattern that scales beautifully.

Note:

  1. The PHP implementation of Lucene is slower than its pure Java counterpart. This is especially true as the amount of data to be indexed increases. (Understandably so: PHP, an interpreted language, comes nowhere close to Java in runtime efficiency.)
  2. On a 32-bit OS, the maximum size of a Lucene index is 2 GB. A 64-bit OS does not have this limit.
  3. The binary index structures created by PHP and Java are identical and can be used interchangeably.
  4. The next step in scaling your search solution would be migrating to SOLR. SOLR is a pure Java server-side indexing engine (built on Lucene) that avoids the bottlenecks of the PHP implementation. It is truly enterprise-ready, with the ability to scale to multiple servers and index millions of documents. It runs in a servlet container such as Tomcat and provides a web service which can be easily queried by PHP. So, once you migrate to SOLR, your search query will be submitted to SOLR via a web service call, which in turn returns the results in a structured format such as XML or JSON. The SOLR API provides similar mechanisms to add/modify/delete from your index.
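
To make item 4 a little more concrete, here is a minimal sketch of how a PHP client might build the URL for such a web service call. The helper name buildSolrQueryUrl(), the base URL and the field name in the query are assumptions for illustration only; the right values depend entirely on your own SOLR installation and schema.

```php
<?php
// Hypothetical helper (not part of any library): builds the URL for a
// SOLR search request. $baseUrl is an assumption about your setup.
function buildSolrQueryUrl($baseUrl, $query, $rows = 10)
{
    $params = array(
        'q'    => $query,  // the search query
        'wt'   => 'json',  // ask for a JSON response
        'rows' => $rows,   // maximum number of results to return
    );
    // pass '&' explicitly so the result does not depend on php.ini settings
    return rtrim($baseUrl, '/') . '/select?' . http_build_query($params, '', '&');
}

// assumed local installation and an assumed "description" field
$url = buildSolrQueryUrl('http://localhost:8983/solr', 'description:button');
```

The response could then be fetched with file_get_contents($url) and decoded with json_decode(); turning the results into HTML remains the job of your PHP code.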

The following framework is built on the Zend Framework and helps us to ‘instantaneously’ index fresh content so that it is available for search. This is a core requirement for dynamic websites (like Facebook and Twitter).

We will be building a search solution for a web app for the infamous ‘bugs’ database – although the example is quite contrived, you will see that the solution can easily be extended to any model type.

Our simple model has the following (self-explanatory) fields: bug_id, bug_description, reported_by. In order to search, we must first have an “index” of terms in place. The index comprises one or more “documents” that are in turn made up of “field” objects. We will add to the index when new bugs are “inserted” into the system (to get reacquainted with the basics of Zend_Search_Lucene, please refer to my earlier blog).

The following describes the Lucene field types that will be used in our index:

Lucene Field Name | Lucene Field Type | Description
----------------- | ----------------- | -----------
class | UnIndexed | For now, we are indexing “bugs”. As our application grows, we may need to store data from various other models. Storing the class name will help us construct links to pages that contain the full data.
key | UnIndexed | The id that uniquely identifies the record in the table. Again used for creating the URL to link to the full data.
docRef | Keyword | Unique id in the index. We concatenate the “class” and “key” fields and store the result so we can accurately locate the document. This is especially useful for updates (Lucene does not permit updates, so the document needs to be deleted and a fresh document inserted). We want this field to be indexed so that we can use Zend_Search_Lucene_Search_Query_Term to search for this exact term and delete the documents returned!
description | Text | Bug description. Needs to be indexed, tokenized and stored.
reportedBy | Keyword | Name (we don’t want this to be tokenized. A name is a name is a name!)
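
As a rough illustration of why the class and key values are stored, the sketch below turns them into a link to the full record and into a docRef. Both function names and the /controller/action/param/value URL layout are assumptions made for this example, not something the application above prescribes.

```php
<?php
// Hypothetical helper: maps a search hit back to the page that shows the
// full record. The URL layout is an assumption for this sketch.
function buildResultUrl($class, $key)
{
    return '/' . strtolower($class) . '/view/id/' . rawurlencode($key);
}

// Hypothetical helper: the docRef is the class and the key joined together,
// so it stays unique even when several models share one index.
function buildDocRef($class, $key)
{
    return $class . ':' . $key;
}

$url    = buildResultUrl('Bugs', 42);
$docRef = buildDocRef('Bugs', 42);
```

Because neither field is used for searching, storing them unindexed keeps the index small while still making them available on every hit.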

The code for creating an index is the same irrespective of the data model:

//open the index directory
$index = Zend_Search_Lucene::open(APPLICATION_PATH . '/searchindex');
//declare the search document structure
//(pseudocode: in practice the fields are attached with $doc->addField())
$doc = new Zend_Search_Lucene_Document(field1, field2, field3…);
//insert the document into the index
$index->addDocument($doc);

The naive solution is to repeat the above 3 lines whenever/wherever we have data to be indexed. However, as our application (and the number of data models) grows, we would run into code maintenance issues. We would also like to make all the models searchable (if required) – automatically.

To a mind attuned to design patterns, hints of the “observer” pattern begin to emerge. The data/model rows in our case are the “subjects” and the SearchIndexer class is the “observer”. The SearchIndexer class registers with the models to be notified of any changes. Conveniently for us, the Zend_Db_Table_Row_Abstract class calls the _postInsert(), _postUpdate() and _postDelete() hook methods AFTER an insert, update and delete respectively on the data model. This design pattern can be used effectively to decouple the model from the search.

Advantages of a “loosely coupled” search solution:

  1. It is scalable. It can grow easily with our application and not require major work to make additions searchable
  2. We could swap out our search solution tomorrow with another without impacting any other business logic code (you know this is coming!)

We start off by creating the cornerstones of the Observer design pattern, namely the Observer and Subject interfaces, as shown below. (An in-depth treatment of the design pattern itself is beyond the scope of this article. If you are new to design patterns, Head First Design Patterns is an exceptional read.) The “Subject” raises the event and the “Observer” responds to it. In our case, the subject is the model row (that is inserted, updated or deleted) and the observer is the SearchIndexer that modifies the Lucene index files accordingly.

<?php
interface ZF_ISubject {
    public static function Register(ZF_IObserver $o);
    public function Notify($flag);
}
?>

<?php
interface ZF_IObserver {
    public function update($flag, $row);
}
?>

Looking at the target Use Cases will give you a high-level view of what we are aiming for:

1. Registering the Observer

class Bootstrap extends Zend_Application_Bootstrap_Bootstrap
{
    public function _initSearchListeners()
    {
        $search = new ZF_SearchIndexer(APPLICATION_PATH . '/index');
        ZF_SearchableRow::Register($search);
        Zend_Registry::set('search', $search);
    }
}

The SearchIndexer object registers itself with SearchableRow using the static “Register” function. This is typically done in the Bootstrap process.

2. Trigger an insert into the index by updating a row and then search for the newly inserted term

//retrieve search indexer from Zend Registry
$search = Zend_Registry::get('search');
//Use a plain old object to create a bug
$data = new Model_Bugs();
$data->bug_id = 1;
$data->bug_description = "Button click does not work!";
//save through the data layer (the same class used in use case 3)
$obj = new Model_BugsDAL();
$obj->update($data);    //this automatically indexes the data!
//search for the data in the index
$index = Zend_Search_Lucene::open($search->getIndexDirectory());
$hits = $index->find('Button');
Zend_Debug::dump($hits);

3. Rebuild the whole index

//retrieve search indexer from registry
$search = Zend_Registry::get('search');
//loop through all our model rows and index!
$obj = new Model_BugsDAL();
$rows = $obj->fetchAll();
foreach ($rows as $row)
{
    $search->update('insert', $obj->getRow($row));
}

See how easy it is to extend the indexing to the whole site (and possibly all models) once we have the search infrastructure set up?

The ZF_SearchableRow class is defined below. As you may notice from its definition, it extends the Zend_Db_Table_Row_Abstract class and implements the ZF_ISubject interface. The class is declared abstract (no instances can be created). Further, it defines two abstract methods, ModelType() and getSearchFields(), that each concrete subclass must implement.

The $_observers array and the Register() function are static (so that they can be accessed from the bootstrapper without the need to instantiate an object). The _postInsert(), _postUpdate() and _postDelete() hooks are overridden to call Notify() with the type of event; Notify() in turn passes the event type and the entire row ($this) to every registered observer.

<?php
abstract class ZF_SearchableRow extends Zend_Db_Table_Row_Abstract implements ZF_ISubject
{
    protected static $_observers = array();

    //all classes that inherit this must provide implementation for
    //the following:
    //Return the name of the model.
    abstract public function ModelType();
    //Return the index fields. This is best known to the individual models.
    abstract public function getSearchFields();

    public static function Register(ZF_IObserver $o)
    {
        self::$_observers[] = $o;
    }

    public function Notify($flag)
    {
        foreach (self::$_observers as $observer)
        {
            $observer->update($flag, $this);
        }
    }

    protected function _postInsert()
    {
        $this->Notify('insert');
        parent::_postInsert();
    }

    protected function _postUpdate()
    {
        $this->Notify('update');
        parent::_postUpdate();
    }

    protected function _postDelete()
    {
        $this->Notify('delete');
        parent::_postDelete();
    }
}
?>

We next implement the model classes. I won't go into too much detail here. I follow the approach recommended in the Zend Framework documentation for creating models (if you are new to this, please also take time to view this informative presentation on data models by Matthew Weier O’Phinney: http://mtadata.s3.amazonaws.com/webcasts/20090724-playdoh.wmv).

We create three classes for the Bugs model layer

  • Bugs: business logic
  • BugsDAL: Data abstraction layer
  • BugsDB: data source model; in our example we extend Zend_Db_Table_Abstract

Additionally, we create BugsRow.php that extends ZF_SearchableRow above (and set the _rowClass variable in BugsDB to refer to this class).

<?php
/*
 * override the __get and __set magic methods to provide base functionality for models.
 */
class ZF_BaseGetSet
{
    protected $_data = array();

    /**
     * Automatically invoked when a nonexistent property is read.
     * Returns null if the property was never set.
     * @param string $name
     */
    public function __get($name)
    {
        if (array_key_exists($name, $this->_data))
            return $this->_data[$name];
    }

    /**
     * Automatically invoked when a nonexistent property is written
     * @param string $name
     * @param mixed $value
     */
    public function __set($name, $value)
    {
        $this->_data[$name] = $value;
    }
}
?>

<?php
class Model_Bugs extends ZF_BaseGetSet
{
    //this model now has default getter and setter
    //inherited from ZF_BaseGetSet
}
?>

<?php
class Model_BugsDB extends Zend_Db_Table_Abstract
{
    protected $_name = "zfbugs";
    protected $_rowClass = "Model_BugsRow";
}
?>

<?php
class Model_BugsRow extends ZF_SearchableRow
{
    public function getSearchFields()
    {
        $fields = array();
        $fields['class'] = $this->ModelType();
        $fields['key'] = $this->bug_id;
        $fields['description'] = $this->bug_description;
        $fields['reportedBy'] = $this->reported_by;
        return $fields;
    }

    /**
     * Each model row exposes what type it is. This helps make our search
     * more generic.
     * @return string
     */
    public function ModelType()
    {
        return "Bugs";
    }
}
?>

So far, we have handled the models and “Subjects”. Let us now move on to implementing the SearchIndexer class (the observer).

<?php
class ZF_SearchIndexer implements ZF_IObserver
{
    protected $_indexDirectory;

    public function __construct($indexDirectory)
    {
        $this->_indexDirectory = $indexDirectory;
        try
        {
            $index = Zend_Search_Lucene::open($this->_indexDirectory);
        } catch (Exception $e)
        {
            $index = Zend_Search_Lucene::create($this->_indexDirectory);
        }
    }

    public function setIndexDirectory($directory)
    {
        $this->_indexDirectory = $directory;
    }

    public function getIndexDirectory()
    {
        return $this->_indexDirectory;
    }

    protected function getDocument($row)
    {
        //use factory design pattern to figure out the
        //appropriate zend lucene document type and field structure
        $doc = ZF_SearchIndexFactory::getDocument($row);
        return $doc;
    }

    //this is the function invoked by the subject (Observer pattern)
    public function update($flag, $row)
    {
        $doc = $this->getDocument($row);
        $this->_modifyIndex($flag, $doc);
    }

    protected function _modifyIndex($flag, Zend_Search_Lucene_Document $doc)
    {
        $docRef = $doc->docRef;
        $index = Zend_Search_Lucene::open($this->_indexDirectory);
        //update or delete: remove any existing document with this docRef
        if ($flag != 'insert')
        {
            $term = new Zend_Search_Lucene_Index_Term($docRef, 'docRef');
            $query = new Zend_Search_Lucene_Search_Query_Term($term);
            $hits = $index->find($query);
            if (count($hits) > 0)
            {
                foreach ($hits as $hit)
                {
                    $index->delete($hit->id);
                }
            }
        }
        //insert or update: add the fresh document
        if ($flag != 'delete')
        {
            $index->addDocument($doc);
            $index->optimize();
        }
    }
}
?>

The constructor checks whether an index is present in the directory. If not, it goes ahead and creates a new one.

Notice how the above class handles “updates” (the _modifyIndex() method). The index is first searched for the ‘docRef’, which is guaranteed to be unique in the index. If found, the existing document is deleted before the fresh one is added (unfortunately, this is the only way to handle updates in Lucene).
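
The delete-then-add flow is easier to see without the Lucene calls in the way. The sketch below mirrors the control flow of _modifyIndex() using a plain PHP array as a stand-in for the index; modifyToyIndex() and the array layout are hypothetical and exist only for this illustration.

```php
<?php
// Toy stand-in for an index: a list of documents, each carrying a docRef.
// This is NOT Zend_Search_Lucene; it only mirrors the flag handling.
function modifyToyIndex(array $index, $flag, array $doc)
{
    if ($flag != 'insert') {
        // update or delete: drop every document with the same docRef
        $kept = array();
        foreach ($index as $existing) {
            if ($existing['docRef'] !== $doc['docRef']) {
                $kept[] = $existing;
            }
        }
        $index = $kept;
    }
    if ($flag != 'delete') {
        // insert or update: add the fresh document
        $index[] = $doc;
    }
    return $index;
}

$index = array();
$index = modifyToyIndex($index, 'insert',
    array('docRef' => 'Bugs:1', 'description' => 'Button click does not work!'));
$index = modifyToyIndex($index, 'update',
    array('docRef' => 'Bugs:1', 'description' => 'Button click fixed'));
$afterUpdate = $index;   // still a single document for Bugs:1
$index = modifyToyIndex($index, 'delete', array('docRef' => 'Bugs:1'));
```

An ‘update’ therefore goes through both branches (delete, then add), while ‘insert’ and ‘delete’ each go through only one.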

The “Factory” design pattern is used to create Zend_Search_Lucene_Document instances conforming to particular models:

<?php
class ZF_SearchIndexFactory
{
    //returns a lucene document for indexing
    public static function getDocument($row)
    {
        //return a Zend_Search_Lucene_Document corresponding to the
        //model passed in as parameter.
        if ($row->ModelType() == 'Bugs')
        {
            $fields = $row->getSearchFields();
            return new ZF_BugsLuceneDocument($fields['class'], $fields['key'],
                $fields['description'], $fields['reportedBy']);
        }
        //if you have more types, add them here...

        //fail loudly instead of returning an undefined variable
        throw new InvalidArgumentException('Unsupported model type: ' . $row->ModelType());
    }
}
?>

And finally, here is the ZF_BugsLuceneDocument class, which simply extends Zend_Search_Lucene_Document and adds the fields that are passed to its constructor. The ‘docRef’ field is created as a concatenation of the class and key fields.

<?php
class ZF_BugsLuceneDocument extends Zend_Search_Lucene_Document
{
    public function __construct($class, $key, $description, $reportedBy)
    {
        $this->addField(Zend_Search_Lucene_Field::Keyword(
            'docRef', "$class:$key"));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed(
            'class', $class));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed(
            'key', $key));
        $this->addField(Zend_Search_Lucene_Field::Text(
            'description', $description));
        $this->addField(Zend_Search_Lucene_Field::Keyword(
            'reportedBy', $reportedBy));
    }
}
?>

Down the line, if we need ‘Blogs’ to be indexed, we create the appropriate models (inheriting and implementing the required classes), make sure that ModelType() returns “Blogs” and that getSearchFields() returns an array of the fields to be indexed from the Blogs model, and finally tweak ZF_SearchIndexFactory to return instances of ZF_BlogsLuceneDocument. See how elegant the code is?
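
If the list of model types keeps growing, one possible variation (not what the code above does) is to register one builder per model type so the factory itself never needs editing. The sketch below shows the idea with plain arrays standing in for the Lucene document classes; ToyDocumentFactory, registerBuilder() and the ‘title’ field are hypothetical names used only here.

```php
<?php
// Hypothetical alternative to the if-chain: one builder callable per model type.
class ToyDocumentFactory
{
    protected static $_builders = array();

    public static function registerBuilder($modelType, $builder)
    {
        self::$_builders[$modelType] = $builder;
    }

    public static function getDocument($modelType, array $fields)
    {
        if (!isset(self::$_builders[$modelType])) {
            throw new InvalidArgumentException("No document builder for '$modelType'");
        }
        return call_user_func(self::$_builders[$modelType], $fields);
    }
}

// Plain arrays stand in for Zend_Search_Lucene_Document instances here.
ToyDocumentFactory::registerBuilder('Bugs', function (array $f) {
    return array('docRef' => $f['class'] . ':' . $f['key'],
                 'description' => $f['description']);
});
ToyDocumentFactory::registerBuilder('Blogs', function (array $f) {
    return array('docRef' => $f['class'] . ':' . $f['key'],
                 'title' => $f['title']);
});

$doc = ToyDocumentFactory::getDocument('Blogs',
    array('class' => 'Blogs', 'key' => 7, 'title' => 'Hello'));
```

Whether this is worth the extra indirection depends on how many models you expect; for two or three types the simple if-chain is arguably easier to read.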
