r4 - 17 Sep 2006 - 08:03:52 - ReidEllis

How to use PyLucene efficiently for search in Chandler?

  • Host: Andi
  • Notes: Jeffrey
  • Date: Monday March 27th, 2006 @ 2pm

History

Chandler needed full text indexing. Lucene is a Java product: it had the right license but was written in the wrong language. PyLucene wraps it for Python.

Details

  • All of the Lucene API is implemented; use the Java Lucene javadocs as the reference. There's also a good book
  • Lucene is very fast at search, not necessarily fast at indexing
  • Indexing only happens at commit, won't find items that have changed but haven't yet been committed
  • Complicated sorting can be done by Lucene
  • There's an API to return where matches occurred
  • Analyzers are language specific: features like stemming (accepting plurals and conjugated forms of words) depend on the language
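
The point about indexing happening only at commit can be illustrated with a small sketch. This is a toy in-memory index written for these notes, not the PyLucene API; the class ToyIndex and its methods change, commit and search are hypothetical names.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for a full-text index (hypothetical, not PyLucene)."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of item ids
        self._pending = {}                 # item id -> text changed since last commit

    def change(self, item_id, text):
        # Record the change only; nothing becomes searchable yet.
        self._pending[item_id] = text

    def commit(self):
        # Indexing happens here, for every item changed since the last commit.
        for item_id, text in self._pending.items():
            for ids in self._postings.values():
                ids.discard(item_id)       # drop stale postings for this item
            for term in text.lower().split():
                self._postings[term].add(item_id)
        self._pending.clear()

    def search(self, term):
        # Only committed text is visible to search.
        return sorted(self._postings.get(term.lower(), set()))
```

With this sketch, a call to search("dog") right after change("note-1", "Walk the dog") returns an empty list, and returns ["note-1"] only once commit() has run.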

Open questions

  • How to expose query syntax? It's not that bad, but it's not beginner friendly
  • How to sort and render results?
  • Do we need to worry about UI blocking when a query is run?
    • Search is very fast, so probably shouldn't be a problem
  • Chandler needs to use different Lucene analyzers depending on language. Locale isn't enough. Improve guesses about language? Provide UI for setting language?
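
One way the analyzer question might be approached is a lookup that prefers an explicit language over the locale. The sketch below is an assumption made for illustration: the analyzers are toy functions rather than Lucene analyzers, and pick_analyzer, ANALYZERS and the parameter names are hypothetical.

```python
def simple_analyzer(text):
    """Language-neutral fallback: lowercase and split on whitespace."""
    return text.lower().split()


def english_analyzer(text):
    """Toy English analyzer: strips a trailing plural 's' (not real stemming)."""
    tokens = []
    for token in simple_analyzer(text):
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        tokens.append(token)
    return tokens


# Hypothetical registry: language code -> analyzer function.
ANALYZERS = {"en": english_analyzer}


def pick_analyzer(item_language=None, ui_language=None, locale=None):
    """Prefer a per-item guess, then a UI setting, then the locale."""
    language = item_language or ui_language or locale or ""
    code = language.split("_")[0].lower()
    # Unknown languages get the neutral analyzer rather than the wrong stemmer.
    return ANALYZERS.get(code, simple_analyzer)
```

The design choice here is that a known-but-unsupported language falls back to the neutral analyzer instead of the locale's analyzer, so French text on an en_US machine is not stemmed as English.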

Use cases

  • Search only within a collection
  • Display the number of matches associated with a search, different match numbers for each collection in the sidebar
    • Fast as long as testing membership doesn't load an item, currently membership tests load items for collections like All
    • Lucene knows nothing about collections, but can filter on only UUIDs in the current collection
    • Could perhaps add collection information to common collections
  • Virtual collection, based on search term
    • Lucene collections could receive notifications about attribute changes
    • Could get expensive when committing thousands of items
      • Can cache Lucene query object
  • Find all items with any attribute with value "dog"
    • Can be done
  • Tagging
    • Might tagging use Lucene?
      • Lucene collection could be implemented to do most of what a filtered collection does today, except filtered collections have monitors to get instant response from an uncommitted change
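
The per-collection filtering and match counts described above could be sketched as set operations on UUIDs, so that no item needs to be loaded. This assumes search hits arrive as a list of UUIDs and each collection can supply its member UUIDs; the function names and data shapes are hypothetical.

```python
def filter_to_collection(hit_uuids, member_uuids):
    """Keep the hits, in their original order, that belong to one collection."""
    members = set(member_uuids)
    return [uuid for uuid in hit_uuids if uuid in members]


def count_matches(hit_uuids, collections):
    """Map each collection name to the number of hits it contains.

    `collections` maps a name to an iterable of member UUIDs.
    """
    hits = set(hit_uuids)
    return {name: len(hits.intersection(members))
            for name, members in collections.items()}
```

For example, count_matches(hits, {"Work": work_uuids, "Home": home_uuids}) would give the numbers to show next to each collection in the sidebar.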

Issues

  • Indexing will be expensive
    • Can fine tune indexing
      • cache more or less
      • index only the first 10 lines
    • Could indexing run in a different thread from commit?
      • If willing to accept false positives, can decouple
        • Could indexing start only after a successful commit? Separating the two would improve the apparent performance of commit
        • Searching in the same thread as indexing? Queue searches, blocking on indexing completion
      • Currently notifications dwarf indexing costs
    • Let's not over-engineer indexing when notification time >> indexing time > search time
  • Transactions are a layer higher than Lucene. Lucene in Chandler supports transactions because it's built on top of Chandler's Berkeley DB
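
The idea of indexing in a different thread from commit, with searches queued behind indexing, can be sketched with the standard library. This is a minimal illustration under stated assumptions, not how Chandler does it: BackgroundIndexer and after_commit are hypothetical names, and the index is a toy dictionary that never removes stale postings.

```python
import queue
import threading


class BackgroundIndexer:
    """Toy index that is updated on a worker thread after each commit."""

    def __init__(self):
        self._postings = {}              # term -> set of item ids
        self._lock = threading.Lock()
        self._work = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def after_commit(self, committed):
        # Called once a commit has succeeded; returns without indexing.
        for item_id, text in committed.items():
            self._work.put((item_id, text))

    def _run(self):
        while True:
            item_id, text = self._work.get()
            try:
                with self._lock:
                    for term in text.lower().split():
                        self._postings.setdefault(term, set()).add(item_id)
            finally:
                self._work.task_done()

    def search(self, term, wait=True):
        if wait:
            self._work.join()            # block until queued indexing is done
        with self._lock:
            return sorted(self._postings.get(term.lower(), set()))
```

Calling search(term, wait=False) skips the wait and may return results that lag behind the latest commit, which is the trade-off raised in the notes.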
Open Source Applications Foundation
Except where otherwise noted, this site and its content are licensed by OSAF under a Creative Commons License, Attribution Only 3.0.
See list of page contributors for attributions.