Searching with Zend (Zend_Lucene)

The Zend Framework provides Zend_Search_Lucene, a PHP port of the Apache Lucene search engine, which makes full-text search straightforward. Adding search to an application can be divided into the following phases:

Part 1 – Create Index

1. Create an index – $index = Zend_Search_Lucene::create($indexPath);

2. Add documents to index – $index->addDocument(new MyDocument($document));

Part 2 – Search the Index

1. Open the index – $index = Zend_Search_Lucene::open($indexPath);

2. Search against the index – $hits = $index->find('search phrase');

Part 3 – Format and return hits

1. Iterate over the hits array – foreach ($hits as $hit) { … }

2. Display relevant data stored in the index, one page at a time – $this->paginationControl($hits, 'Sliding', 'p.phtml');
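
The indexing phase (Part 1) can be sketched as follows. This is a minimal example, assuming Zend Framework 1 is on the include path. The index path, the $articles array, and the field names (title, author, teaser, url, contents) are illustrative assumptions, chosen to match the fields the view script in this post reads.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Hypothetical index location and sample data, for illustration only.
$indexPath = '/tmp/docindex';
$articles = array(
    array(
        'title'    => 'Searching with Zend',
        'author'   => 'Jane Doe',
        'teaser'   => 'A short summary shown in the result list.',
        'url'      => '/articles/searching-with-zend',
        'contents' => 'Full article text that should be searchable.',
    ),
);

// Start a new index at the given path.
$index = Zend_Search_Lucene::create($indexPath);

foreach ($articles as $article) {
    $doc = new Zend_Search_Lucene_Document();
    // text(): tokenized, indexed, and stored, so it can be searched and displayed.
    $doc->addField(Zend_Search_Lucene_Field::text('title', $article['title']));
    $doc->addField(Zend_Search_Lucene_Field::text('author', $article['author']));
    // unIndexed(): stored for display only, not searchable.
    $doc->addField(Zend_Search_Lucene_Field::unIndexed('teaser', $article['teaser']));
    $doc->addField(Zend_Search_Lucene_Field::unIndexed('url', $article['url']));
    // unStored(): searchable, but the text itself is not kept in the index.
    $doc->addField(Zend_Search_Lucene_Field::unStored('contents', $article['contents']));
    $index->addDocument($doc);
}

// Flush pending changes to disk.
$index->commit();
```

Only fields that are stored (text and unIndexed here) can be read back from a hit later, so anything you want to display in the result list needs to be stored.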

Resource considerations:

The search engine returns an array of Zend_Search_Lucene_Search_QueryHit objects. Each element holds an ID (the number assigned to the document in the index) and a score (a float), along with a reference that tells Lucene how to retrieve the data stored in the index.

So, if your search returns 100,000 results, an array of 100,000 elements will be allocated. Bear in mind that Lucene works with "disk-based data structures", so this array is still lightweight: the actual payload is retrieved on the fly (i.e. when you iterate over the hits array and access $hit->title, $hit->teaser, etc.). Lucene's lazy loading mechanism optimizes resource use.
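
To make the lazy loading concrete, here is a small sketch. The index path is hypothetical, and 'title' is assumed to be a stored field in your index.

```php
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/docindex'); // hypothetical path
$hits  = $index->find('search phrase');

foreach ($hits as $hit) {
    // id and score are plain properties of the hit object;
    // no stored fields are read from the index here.
    printf("doc %d scored %.4f\n", $hit->id, $hit->score);
}

if (count($hits) > 0) {
    // Reading a stored field makes the hit load its document on demand.
    echo $hits[0]->title, "\n"; // 'title' is an assumed stored field
}
```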

1. If your index spans thousands of pages, consider capping the number of search results with Zend_Search_Lucene::setResultSetLimit(). Be aware that, according to the Zend documentation, the limit gives the "first N" results rather than the "best N", so set it high enough that well-ranked hits are not cut off.

2. Lucene is a mature, well-established information retrieval engine that handles its internal operations efficiently. Once it is set up properly, resource requirements are rarely a concern, although searching does use extra memory and processor time. As mentioned in the official Lucene documentation, for indexing up to 10M documents on a single server, Zend Lucene is your best bet. Yes, there are more scalable IR engines out there, but few offer the ease and simplicity of Lucene.

3. As described earlier, the array returned by Lucene DOES NOT contain the actual data, only pointers to data on disk. So do not iterate over the entire array returned by the search to display the components of each result (author, url, teaser, etc.). Use the Zend_Paginator component to restrict disk access to about 10-20 results per page. Most users will revise their search criteria if they do not find what they need within a couple of pages.
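
Capping the result set (point 1 above) might look like the sketch below. The limit of 1000 and the index path are arbitrary example values. Note that the Zend documentation describes the limit as returning the "first N" hits rather than the "best N", so a low limit can drop well-ranked results.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Cap the number of hits returned by find(); 0 (the default) means no limit.
// setResultSetLimit() is static, so it applies to every index in this request.
Zend_Search_Lucene::setResultSetLimit(1000); // arbitrary example value

$index = Zend_Search_Lucene::open('/path/to/docindex'); // hypothetical path
$hits  = $index->find('search phrase');

echo count($hits), ' hits (limit: ', Zend_Search_Lucene::getResultSetLimit(), ")\n";
```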

Here is an example that illustrates pagination of the search result set:

IndexController.php

<?php
class IndexController extends Zend_Controller_Action
{
    public function indexAction()
    {
        // The search is re-run on every request; hits cannot be cached directly.
        $indexPath = 'C:\\web\\search\\docindex';
        $index = Zend_Search_Lucene::open($indexPath);
        $hits = $index->find('search string');

        // Page through the hits; fall back to page 1 when no page parameter is given.
        $paginator = Zend_Paginator::factory($hits);
        $paginator->setItemCountPerPage(10);
        $paginator->setCurrentPageNumber($this->_getParam('page', 1));
        $this->view->hits = $paginator;
    }
}
?>

index.phtml (view script)

<?php
$hits = $this->hits;
?>
<?php foreach ($hits as $hit): ?>
    <h3><?php echo $this->escape($hit->title) ?> (score: <?php echo $hit->score ?>)</h3>
    <p>
        By <?php echo $this->escape($hit->author) ?>
    </p>
    <p>
        <?php echo $this->escape($hit->teaser) ?><br />
        <a href="<?php echo $this->escape($hit->url) ?>">Read more…</a>
    </p>
<?php endforeach; ?>

<?php echo $this->paginationControl($hits, 'Sliding', 'p.phtml'); ?>

Finally, here is the script that renders the page number links (p.phtml):

<?php if ($this->pageCount): ?>
<div class="paginationControl">
    <!-- Previous page link -->
    <?php if (isset($this->previous)): ?>
        <a href="<?php echo $this->url(array('page' => $this->previous)); ?>">&lt; Previous</a> |
    <?php else: ?>
        <span class="disabled">&lt; Previous</span> |
    <?php endif; ?>

    <!-- Numbered page links -->
    <?php foreach ($this->pagesInRange as $page): ?>
        <?php if ($page != $this->current): ?>
            <a href="<?php echo $this->url(array('page' => $page)); ?>"><?php echo $page; ?></a> |
        <?php else: ?>
            <?php echo $page; ?> |
        <?php endif; ?>
    <?php endforeach; ?>

    <!-- Next page link -->
    <?php if (isset($this->next)): ?>
        <a href="<?php echo $this->url(array('page' => $this->next)); ?>">Next &gt;</a>
    <?php else: ?>
        <span class="disabled">Next &gt;</span>
    <?php endif; ?>
</div>
<?php endif; ?>

One important thing to remember is that the "hits" array CANNOT be cached or stored in a session variable directly. The error you will get if you attempt this is:

Fatal error: Call to a member function getDocument() on a non-object in C:\Program Files\php\includes\Zend\Search\Lucene\Proxy.php on line 368

That is by design. The Lucene developers' recommended approach for paging is to re-execute the search, and that is exactly what we do in the above code. The Zend_Paginator component handles slicing the array and displaying the relevant data to the user.
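
If re-running the search on every page turns out to be too slow, one possible workaround is to keep only the scalar document IDs and load each document on demand. This is not the approach used in this post, only a hedged sketch. The session namespace name, the index path, and the 'title' field are assumptions, and in real code the cached IDs would need to be keyed by the query string. Cached IDs may also go stale if the index is modified or optimized in the meantime.

```php
<?php
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Session/Namespace.php';

$index   = Zend_Search_Lucene::open('/path/to/docindex'); // hypothetical path
$session = new Zend_Session_Namespace('search');          // hypothetical namespace name

if (!isset($session->hitIds)) {
    // Store plain integers only, never the hit objects themselves.
    $ids = array();
    foreach ($index->find('search phrase') as $hit) {
        $ids[] = $hit->id;
    }
    $session->hitIds = $ids;
}

// Load stored fields only for the IDs on the requested page.
$page    = 2;  // example page number
$perPage = 10;
$pageIds = array_slice($session->hitIds, ($page - 1) * $perPage, $perPage);

foreach ($pageIds as $id) {
    $doc = $index->getDocument($id);
    echo $doc->title, "\n"; // 'title' is an assumed stored field
}
```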

Click here to view another blog I wrote detailing a search solution implementation using Zend_Lucene.


7 thoughts on "Searching with Zend (Zend_Lucene)"

  1. Setting a result set limit with setResultSetLimit() is usually not a good idea. This is applied before the results are scored, so you could potentially lose good search results in the process.

    The Zend documentation says: “It doesn’t give the ‘best N’ results, but only the ‘first N’.”

    Also, did you ever work out the session issue?

    1. Thanks for your comment.
      The key to using setResultSetLimit() is to set the limit high enough that most searches return ‘well ranked relevant results’ AND low enough so that the search does not squander memory. It is a dicey issue I admit!
      This is where Google is absolutely fascinating… Even a search for a generic word like 'but' yields results (yeah, 3 billion hits in 0.29 seconds! Bing yields about one third of that, which is still very impressive).

      And no.. I have not revisited the session problem with zend_lucene hits.

  2. Hi, sorry for digging up an old topic, but I have the same problem and zend_lucene “Fatal error: Call to a member function getDocument()..” Can you write back if you found a solution?

  3. Thanks for your post

    Say your search returns 100,000 results and you want only 10 results per page. I see you use the Zend_Paginator component, but it only limits the number of results shown on a page. The first time, when you request page 1, it has to search the index file and return 100,000 results. The second time, when you request page 2, it again has to search the index file and return 100,000 results.

    Zend_Paginator doesn't help you improve the performance of searching.

    I think you should save the results in a cache, so you don't have to hit the index file many times.

    Do you agree ?

    1. You are right in that the entire resultset is returned on every single page. Unfortunately, I don’t see a way around it given that the hit object is not amenable to either storing in session or cache. Please refer to the link in the last paragraph of my post above. It is indeed the recommended way.
      The most time consuming process is “iterating” over resultsets (as it involves disk/file access).. and “paging” minimizes that.

  4. Are you sure: “re-execute the search and ignore the hits you don’t want to show” is better than reading from cache file ?
