Note: This paper was given at PyCON 2005 and EuroPython 2005. It contains a lot of information that is not yet duplicated elsewhere on the wiki, but is not kept up to date.
Pulling Java Lucene into Python: PyLucene
Author: Andi Vajda
Abstract
As OSAF needed an open source text search engine library for its Python
based project, Chandler, we made the following bet: what if we pulled
together Java Lucene, GNU's gcj Java compiler and SWIG to build a
Python extension?
This paper examines the issues of pulling an open source Java library
into Python by matching the differences in memory management, thread
support and type systems. It also describes a technique for extending a
Java class from Python by providing a python implementation to a Java
extension proxy.
Introduction
OSAF's flagship project, Chandler (
http://www.osafoundation.org), is a
personal information manager. As such, it needs the ability to run
unstructured full text queries over arbitrarily large repositories of
text.
There are not that many open source text search engines
available. Lucene is considered among the better ones and it is
licensed under the Apache license, both of which make it a very
attractive solution.
But Lucene is written in Java.
Java presents quite a challenge since it is a very insular
environment. It doesn't play well, both culturally and technically,
with other non-Java environments. The Java Native Interface (JNI) is
clunky, the Java Runtime Environment is something we didn't want to be
encumbered with either.
Really, for Lucene to be an acceptable solution for Chandler, it had to
be able to be compiled down to a simple Python extension that is loaded
and managed just like any other Python extension written in C or C++.
Enter GNU's Java Compiler, gcj.
GCJ compiles a set of Java classes together into a shared library and
exports these classes as if they were implemented in C++ via the Common
Native Interface (CNI) which makes them very easily accessible to other
C/C++ programs.
GCJ does not depend on the Apple or Sun Java Runtime Environments. It
comes with its own, 100% open source, clean room implementation of the
Java runtime classes.
Enter the interface compiler, SWIG.
SWIG is an interface compiler that connects programs written in C and
C++ with scripting languages such as Perl, Python, Ruby, and Tcl. It
works by taking the declarations found in C/C++ header files and using
them to generate the wrapper code that scripting languages need to
access the underlying C/C++ code. (quoted from
http://www.swig.org/exec.html)
Using SWIG and gcj allows to keep the resulting Python extension very
close to the actual Java Lucene library which is under active
development. This approach, while involving many thousands of lines of
boilerplate code, almost all generated by SWIG, ultimately yields a
much closer and more up to date library than a handcrafted port.
Alternatives to the gcj/SWIG approach include a manual port of Java
Lucene to Python such as Lupy, which is ten times slower than Java
Lucene and way behind the Java Lucene version; a manual port to C++
integratable into Python such as CLucene, which is four times faster
than Java Lucene but comes with its own set of C++ bugs and is also way
behind the original; or finally, a JPipe-based solution, which is very
clever but relies on the JNI and the JRE.
PyLucene Architecture
PyLucene is compiled as a shared library - a python extension - loaded
into the Python process by an 'import' statement.
On Mac OS X and Linux, gcj's libgcj.so and libstdc++.so need to be
shipped along with _PyLucene.pyd. On Windows however, all these
libraries are linked together into a single shared library of a few
megabytes in size.
----------- -------------- -------------------
| Java | --- gcj ---> | lucene.o |->------<-| PyLucene_wrap.o |
| Lucene | --- gcjh -------------- | -------------------
----------- | v ^
v ----------------- |
--------------- | _PyLucene.pyd | g++
| C++ headers | -->| ----------------- |
--------------- | |
|--- SWIG ---> ---------------------
-------------- | | PyLucene_wrap.cxx |
| PyLucene.i | ----------------->| ---------------------
-------------- | PyLucene.py |
---------------------
Preparing the Java Lucene library for compilation by gcj
The GNU java compiler, gcj, has been under very active development for
a number of years now and is quite usable. Still, it comes with a
number of bugs and kinks that need to be worked around. For example,
gcj can compile .java source files and .class bytecode files to .o
files alike. There are, however, more bugs in the .java to .o
compilation process, notably with handling anonymous inner
classes.
The infamous Miranda bug,
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15411, needs to be
carefully worked around by adding abstract method declarations to
abstract java classes for the methods of interfaces they claim to
implement.
Differences in reserved words between Java, C++ and Python can also
cause problems. For example, 'delete' is a perfectly acceptable name
for a method in Java but a reserved keyword in C++. Similarly, 'or' is
acceptable in Java but a reserved keyword in Python. Some of these
incompatibilities can be dealt with in the SWIG layer, others need to
be patched in the original Java sources.
Matching the memory models
PyLucene involves three languages, Java, C++ and Python, each with
their own conflicting memory management policies. Java's memory is
garbage collected, C++ memory is more or less managed by hand and
Python's memory objects are reference counted.
The GNU Java class header file generator, gcjh, makes Java classes
available to C++/CNI programs as C++ classes via the C++ headers it
generates.
SWIG, which generates the glue code between Python and C++, is not
aware that the C++ pointers it wraps for Python's use are not really
C++ but pointers to Java objects that cannot be explicitely 'deleted'
when Python releases them.
The gcj runtime library, libgcj, includes a garbage collector which
keep track of Java objects in the Java heap, on the stack and in static
variables but not in the Python heap.
To solve these problems all Java objects returned to Python are kept in
a Java IdentityHashMap where they are reference counted. That way, they
are not garbage collected until they are removed from the table when
their refcount reaches zero.
Matching thread support and concurrency models
Both Python and Java support threads. On the platforms PyLucene is
currently supported, Mac OS X, Linux and Windows, Python and Java use
the same operating system thread implementations. One would hope that
threads created in Python could be used to call into Java and
vice-versa, but it is not that simple.
Before any Java memory can be allocated by a thread using gcj's Java
runtime library, libgcj, the garbage collector has to be made aware of
the thread and its stack. The current implementation of the garbage
collector, boehm-gc, does not support the registration of a thread
after the fact. Hans Boehm, the author of the garbage collector module,
intends to eventually support this but making it work reliably on the
three supported PyLucene operating systems is a lost cause at the
moment.
Luckily, Python is a lot more nimble about threading support. It
happily runs in threads it did not create and can even be coaxed into
treating such a thread as one of its own.
PyLucene exposes a new class, called PythonThread, which is a subclass
of Python's threading.Thread class whose start() method delegates to
Java the creation, initialization and starting of the actual operating
system thread. There are then in fact two distinct thread objects, one
Python, one Java, that use the same operating system thread.
Any thread in python wishing to use the PyLucene APIs needs to be an
instance of this PythonThread class.
Python's threading support is not truly concurrent. While a Python
thread is running, it is holding a Global Interpreter Lock (GIL) that
prevents any other Python thread in the same process from
running. Java, on the other hand, is fully concurrent. Several Java
threads can run truly concurrently on hardware that supports it.
Because of this concurrency mismatch, deadlocks may occur if the GIL is
not released whenever a call is made from Python to Java since Java
code may call back into Python when invoking Python 'extensions' to
Java Lucene classes.
Releasing the GIL also helps the Python process in not appearing hung
when a longer running Java Lucene API such as IndexWriter.optimize() is
in progress.
'Extending' Java classes from Python
Many areas of the Lucene API expect the programmer to provide their own
implementation or specialization of a feature where the default is
inappropriate. For example, text analyzers and tokenizers are an area
where many parameters and environmental or cultural factors are calling
for customization.
PyLucene enables this by providing Java extensions that serve as
proxies for Java to call back into the Python implementations of these
customizations.
Technically, the PyLucene programmer is not providing an 'extension'
but a Python implementation of a set of methods encapsulated by a
Python class whose instances are wrapped by the Java proxies provided
by PyLucene.
For example, the code below, extracted from a PyLucene unit test,
defines a custom analyzer using a custom token stream that returns the
tokens '1', '2', '3', '4', '5' for any document it is called on.
All that is needed in order to provide a custom analyzer in Python is
defining a class that implements a method called 'tokenStream'. The
presence of the 'tokenStream' method is detected by the corresponding
SWIG type handler and the python instance passed in is wrapped by a new
Java PythonAnalyzer instance that extends Lucene's abstract Analyzer
class.
In other words, SWIG in reverse.
class _analyzer(object):
def tokenStream(self, fieldName, reader):
class _tokenStream(object):
def __init__(self):
self.tokens = ['1', '2', '3', '4', '5']
self.increments = [1, 2, 1, 0, 1]
self.i = 0
def next(self):
if self.i == len(self.tokens):
return None
t = Token(self.tokens[self.i], self.i, self.i)
t.setPositionIncrement(self.increments[self.i])
self.i += 1
return t
return _tokenStream()
analyzer = _analyzer()
store = RAMDirectory()
writer = IndexWriter(store, analyzer, True)
d = Document()
d.add(Field.Text("field", "bogus"))
writer.addDocument(d)
writer.optimize()
writer.close()
Supporting downcasting
Python's type system does not require type casting. On the other hand,
downcasting is a very common operation in Java. SWIG will wrap a C++
object with a Python object matching the object's declared protocol.
For example, if a Lucene API is declared to return Query, the resulting
Python wrapper implements the Query methods, exactly. If the wrapped
object is actually an instance of a subclass of Query, such as
BooleanQuery, the subclass's methods are not available on the Python
proxy.
Where this is a problem, PyLucene extends the types in question to have
is
() type checkers and to() type casters to work this
around.
For example:
analyzer = StandardAnalyzer()
query = QueryParser("data", analyzer).parse("a AND b").toBooleanQuery()
Pythonic API flavors
Java is a rather verbose programming language. In places where it made
sense, PyLucene added some pythonic extensions to the Lucene APIs such
as iterators or property accessors. For example, one of the most
commonly used Lucene classes, Hits, which returns the documents found
in a search, is not iterable in Java Lucene, a hit counter is used
instead. PyLucene wraps this nicely with a Python iterator. This can be
illustrated as follows:
The Java loop:
for (int i = 0; i < hits.length(); i++) {
Document doc = hits.doc(i);
System.out.println(hits.score(i) + " : " + doc.get("title"));
}
with PyLucene becomes:
for i, doc in hits:
print hits.score(i), ':', doc['title']
or:
for hit in hits:
print hit.getScore(), ':', hit.get('title')
Error reporting
Java exceptions are caught by the Python - Java call boundary and
wrapped into a Python JavaError exception. Errors that occur in Python
code called from Java, are also caught there and reported as usual.
In effect, for every call made from Python to Java, the glue code is
the following:
try {
PythonThreadState state; // release/restore GIL
$action // call Java API
} catch (org::osafoundation::util::PythonException *e) {
return NULL; // report Python error
} catch (java::lang::Throwable *e) {
PyErr_SetObject(PyExc_JavaError, // wrap Java exception
jo2pr_type(e, "java::lang::Throwable *"));
return NULL; // report Java error
}
The PythonThreadState type is a simple C++ class that ensures that the
Python thread state is saved and the GIL released when it enters into
scope and conversely that the Python thread state is restored and the
GIL reacquired when it goes out of scope.
class PythonThreadState {
private:
PyThreadState *state;
public:
PythonThreadState()
{
state = PyEval_SaveThread();
}
~PythonThreadState()
{
PyEval_RestoreThread(state);
}
};
Samples
A large number of samples are shipped with PyLucene. Most notably, all
the samples published in the "Lucene in Action" book that did not
depend on a third party Java library for which there was no obvious
Python equivalent were ported to Python and PyLucene.
"Lucene in Action" is a great companion to learning Lucene. Having all
the samples available in Python should make it even easier for Python
developers.
"Lucene in Action" was written by Erik Hatcher and Otis Gospodnetic,
both part of the Java Lucene development team, and is available from
Manning Publications at http://www.manning.com/hatcher2.
Future work
Most of PyLucene's SWIG code is boilerplate code that could also be
generated before being fed to SWIG. Such a Java classes to SWIG
generation tool remains to be written.
The same gcj/SWIG-based techniques could be used for other languages
supported by SWIG such as Perl, Ruby, Lisp, etc... Various people have
shown interest in a Ruby version for a while now. It would be exciting
to see PyLucene morph into SWIGLucene as support for some of these
languages is added.
PyLucene could be ported to other operating systems. That effort is
effectively bounded by the state of the implementation of libgcj on
these platforms. In particular, threading and garbage collection
support can be problematic. GNU gcj is under very active development
and progress is made on a regular basis.
Acknowledgements
PyLucene wouldn't be possible without the tireless efforts of the
people contributing to the open source projects below:
Thank you all !
License
This paper is licensed under a Creative Commons License:
http://creativecommons.org/licenses/by/2.0