r2 - 23 Nov 2004 - 01:34:14 - TedLeung

Notes

Erik Hatcher at ApacheCon

  • I spoke with Erik Hatcher at ApacheCon about the use of Lucene as an implementation vehicle for our query system. There are two major issues. The first is that it's not clear how you'd support our observable query concept using Lucene -- there'd be quite a bit of work to hook into the guts of Lucene. The second issue is using Lucene to implement object/item path traversal queries. Erik said that someone has looked at doing this but that it wasn't very efficient. Following item references inside of queries is important, so this makes Lucene less appealing for this purpose. Erik was very happy with the work that Andi has done on PyLucene and was excited about some ways to get broader visibility for that work.

IM conversation with Katie on 11/19

Katie and I discussed a pair of issues:

  • Brian and I were having a discussion on the use of queries/notification in the system, and I wanted a sanity check. From our discussion, it seems reasonable to limit notification to items that are in collections -- we want to keep the number of notifications under control, and we don't want to have to add lots of attributes/queries in order to send the right notifications. We also discussed the possibility of getting a notification if any item that is in the result set of a query changes state -- not just when it enters or exits the result set of the query. This sounds like a perfect job for monitors.

  • We also did a little noodling on possible server implementation ideas. One thing that we kicked around was the following set of ideas:
    1. Use queries to specify the bulk data transfer
    2. The client just provides an object cache which is loaded via bulk transfers. We also need to handle notification for observable queries. A writeback object cache on the client can be used to detect changes for notification.
    3. We might have to poll the server in order to pick up changes that were made to the server via another client.
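
To make the notification idea from the first bullet concrete, here is a minimal sketch of a query that reports when an item enters or exits its result set, and also when an item already in the result set changes state. Everything in it is hypothetical: the `ObservableQuery` class, the dict-based items with an "id" key, and the listener callback are illustrations only, not the repository's actual query or monitor API.

```python
from typing import Any, Callable, Dict, Set

Item = Dict[str, Any]  # hypothetical item representation: a dict with an "id" key


class ObservableQuery:
    """Tracks which items match a predicate and reports result-set changes."""

    def __init__(self,
                 predicate: Callable[[Item], bool],
                 listener: Callable[[str, str], None]) -> None:
        self.predicate = predicate
        self.listener = listener          # called as listener(kind, item_id)
        self.results: Set[str] = set()    # ids currently in the result set

    def item_changed(self, item: Item) -> None:
        """Call whenever an item in the watched collection changes."""
        item_id = item["id"]
        was_in = item_id in self.results
        is_in = self.predicate(item)
        if is_in and not was_in:
            self.results.add(item_id)
            self.listener("entered", item_id)
        elif was_in and not is_in:
            self.results.discard(item_id)
            self.listener("exited", item_id)
        elif was_in and is_in:
            # Still in the result set, but its state changed.
            self.listener("changed", item_id)
        # Items that never matched generate no notification at all.


events = []
query = ObservableQuery(lambda item: item["status"] == "open",
                        lambda kind, item_id: events.append((kind, item_id)))

task = {"id": "t1", "status": "open", "title": "draft"}
query.item_changed(task)      # item enters the result set
task["title"] = "final"
query.item_changed(task)      # item changes while in the result set
task["status"] = "done"
query.item_changed(task)      # item exits the result set
print(events)
```

Only changes to items that the caller feeds in are ever examined, which is one way to read "limit notification to items that are in collections": the collection decides which changes reach the query in the first place.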

Performance Benchmarking

In theory, repository performance should be bounded by the performance of the disk subsystem in the machine. I broke out a copy of Tim Bray's Bonnie filesystem benchmark to get a rough idea of the performance of my machine. The disk in this machine is a Hitachi ICN25N080ATMR04, which is an 80G, 4200 RPM drive. The Bonnie output is below.

File './Bonnie.13028', size: 0
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
            0  6702 58.7 23450  9.8 18408  6.5  9851 75.4 204884 30.0 1871.4  8.0

Picking the lowest numbers (the per-character columns), you get a sequential write of 6702K/s and a sequential read of 9851K/s.

So take the lower number, 6702, and make it a round number, say 6000K/s.

So if items are 1K each, then we should be able to write 6000 items/s MAX.

So call that the theoretical upper limit.
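
The arithmetic above is simple enough to write down, which also makes it easy to redo for other item sizes. This is only a back-of-the-envelope sketch; the function name is made up for illustration, and the 1K item size is the assumption from the text, not a measured value.

```python
def max_items_per_second(throughput_kb_per_s: float,
                         item_size_kb: float = 1.0) -> float:
    """Upper bound on items/s if disk throughput is the only limit."""
    return throughput_kb_per_s / item_size_kb


measured_write_kb_per_s = 6702   # per-character sequential write from Bonnie
rounded_write_kb_per_s = 6000    # rounded down, as in the notes

for size_kb in (1.0, 4.0):
    limit = max_items_per_second(rounded_write_kb_per_s, size_kb)
    print(f"{size_kb:g}K items: at most {limit:.0f} items/s")
```

With 1K items this gives the 6000 items/s figure; if items turn out to average 4K, the same disk number caps us at 1500 items/s.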

The next limit which needs to be investigated is the write bandwidth that we can obtain through Berkeley DB from Python.
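
A rough sketch of how that measurement might be set up: time a loop of 1K writes through whatever `put(key, value)` function the store exposes, and compare the resulting items/s against the 6000 items/s ceiling. The harness below is hypothetical, and the example run uses a plain Python dict as a stand-in store, so it measures only loop overhead and not Berkeley DB or the disk. A real run would pass in the database's own put call and make sure the data is actually flushed to disk before stopping the clock.

```python
import os
import time
from typing import Callable


def measure_write_rate(put: Callable[[bytes, bytes], None],
                       n_items: int = 10000,
                       item_size: int = 1024) -> float:
    """Return the items/s achieved by calling put(key, value) n_items times."""
    payload = os.urandom(item_size)          # one 1K value, reused for every key
    start = time.perf_counter()
    for i in range(n_items):
        put(str(i).encode("ascii"), payload)
    elapsed = time.perf_counter() - start
    return n_items / max(elapsed, 1e-9)      # guard against a zero interval


# Stand-in store: an in-memory dict, NOT Berkeley DB.
store = {}
rate = measure_write_rate(store.__setitem__, n_items=1000)
print(f"{rate:.0f} items/s through the stand-in store")
```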

-- TedLeung - 22 Nov 2004
