How to use PyLucene efficiently for search in Chandler?
- Host: Andi
- Notes: Jeffrey
- Date: Monday March 27th, 2006 @ 2pm
History
Chandler needed full-text indexing. Lucene, a Java product, had the right license but was written in the wrong language. PyLucene wraps it for Python.
Details
- All of the Lucene API is implemented; use the Java Lucene javadocs as the reference. There's also a good book
- Lucene is very fast at search, not necessarily fast at indexing
- Indexing only happens at commit, so a search won't find items that have changed but haven't yet been committed
- Complicated sorting can be done by Lucene
- There's an API to return where matches occurred
- Analyzers are language specific: features like stemming (accepting plurals and conjugated forms of words) depend on the language
Open questions
- How to expose query syntax? It's not that bad, but it's not beginner friendly
- How to sort and render results?
- Do we need to worry about UI blocking when a query is run?
  - Search is very fast, so it probably shouldn't be a problem
- Chandler needs to use different Lucene analyzers depending on language. Locale isn't enough. Improve guesses about language? Provide UI for setting language?
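
One possible answer to the query syntax question is to keep the raw Lucene syntax hidden and build the query string from plain words. The sketch below is only an illustration: `escape_term` and `build_query` are hypothetical helper names, not part of PyLucene or Chandler. It assumes the standard Lucene query parser syntax, where `field:term` restricts a term to a field and the special characters are escaped with a backslash.

```python
import re

# Characters and operators with special meaning in Lucene's query parser syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')


def escape_term(term):
    """Backslash-escape Lucene query syntax characters in a single word."""
    return _SPECIAL.sub(r'\\\1', term)


def build_query(text, field=None, require_all=True):
    """Turn plain user input into a Lucene query string.

    Hypothetical helper: every whitespace-separated word becomes one
    escaped term, optionally restricted to `field`, joined with AND or OR.
    """
    terms = [escape_term(word) for word in text.split()]
    if field is not None:
        terms = ["%s:%s" % (field, term) for term in terms]
    joiner = " AND " if require_all else " OR "
    return joiner.join(terms)


# The resulting string would be handed to Lucene's query parser.
query_string = build_query("dog cat", field="body")
```

This does not handle a user typing the words AND, OR, or NOT, which the query parser would still read as operators. An advanced mode could pass the user's text through unescaped.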
Use cases
- Search only within a collection
- Display the number of matches associated with a search, with different match numbers for each collection in the sidebar
  - Fast as long as testing membership doesn't load an item; currently membership tests load items for collections like All
  - Lucene knows nothing about collections, but can filter on only UUIDs in the current collection
  - Could perhaps add collection information to common collections
- Virtual collection, based on search term
- Lucene collections could receive notifications about attribute changes
- Could get expensive when committing thousands of items
- Can cache Lucene query object
- Find all items with any attribute with value "dog"
- Tagging
- Might tagging use Lucene?
- A Lucene collection could be implemented to do most of what a filtered collection does today, except that filtered collections have monitors that give an instant response to an uncommitted change
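
The collection-scoped search and per-collection match counts above could be sketched as plain set operations on UUIDs. This assumes, hypothetically, that a search returns item UUIDs in ranked order and that each collection's membership is available as a set of UUIDs without loading items (which the notes say is not yet true for collections like All). `filter_hits` and `count_matches` are illustrative names, not Chandler or PyLucene API.

```python
def filter_hits(hit_uuids, collection_uuids):
    """Keep only the hits that belong to the collection, in ranked order."""
    members = set(collection_uuids)
    return [uuid for uuid in hit_uuids if uuid in members]


def count_matches(hit_uuids, collections):
    """Return {collection name: number of hits in that collection}.

    `collections` maps a sidebar collection name to its member UUIDs.
    """
    hits = set(hit_uuids)
    return {name: len(hits & set(uuids)) for name, uuids in collections.items()}


# Hypothetical data: UUIDs shortened for readability.
hits = ["u3", "u1", "u7"]
sidebar = {"Work": {"u1", "u2", "u3"}, "Home": {"u7"}, "Trash": set()}

work_hits = filter_hits(hits, sidebar["Work"])
counts = count_matches(hits, sidebar)
```

The cost is dominated by the membership test, which is why the notes care that testing membership should not load the item.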
Issues
- Indexing will be expensive
  - Can fine-tune indexing
    - cache more or less
    - index only the first 10 lines
- Could indexing run in a different thread from commit?
  - If we are willing to accept false positives, indexing can be decoupled from commit
- Could indexing start only after a successful commit? Separating it out would improve the apparent performance of commit
- Searching in the same thread as indexing? Queue searches, blocking on indexing completion
- Currently notifications dwarf indexing costs
  - Let's not over-engineer indexing when notification time >> indexing time > search time
- Transactions are a layer above Lucene; Lucene in Chandler supports transactions because it's built on top of Chandler's Berkeley DB
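
The idea of indexing after a successful commit, with searches blocking until indexing completes, might look roughly like the sketch below. It uses only the Python standard library; `BackgroundIndexer`, `index_item`, and `run_query` are hypothetical names standing in for whatever Chandler and PyLucene would actually provide, and a plain dict stands in for the index.

```python
import queue
import threading


class BackgroundIndexer:
    """Indexes committed items on a worker thread, off the commit path."""

    def __init__(self, index_item):
        # index_item: hypothetical callable that adds one item to the index.
        self._index_item = index_item
        self._pending = queue.Queue()
        self.errors = []
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def on_commit(self, items):
        """Call after a successful commit; returns without indexing."""
        for item in items:
            self._pending.put(item)

    def _run(self):
        while True:
            item = self._pending.get()
            try:
                self._index_item(item)
            except Exception as exc:
                # Keep the worker alive and record the failure.
                self.errors.append((item, exc))
            finally:
                self._pending.task_done()

    def search(self, run_query):
        """Block until queued items are indexed, then run the query."""
        self._pending.join()
        return run_query()


# Toy usage: a dict stands in for the Lucene index.
index = {}
indexer = BackgroundIndexer(lambda item: index.update([item]))
indexer.on_commit([("u1", "dog"), ("u2", "cat")])
found = indexer.search(lambda: sorted(u for u, text in index.items() if text == "dog"))
```

Here the search blocks the calling thread until the queue drains. Skipping the `join()` would trade that wait for stale results, which is the decoupling the notes describe.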