Observations on Cosmo 0.2 resource usage
Cosmo stores calendar objects such as events and timezones in a JCR content repository as nodes and properties. The node type definitions can be viewed at
http://svn.osafoundation.org/server/cosmo/branches/0.2/src/docs/jcr/icalendar.txt. iCalendar components, properties and some value types are modeled as nodes, while property values, parameters and most value types are modeled as JCR properties.
(Cosmo also stores the original iCalendar stream itself, which it uses to service GET requests for the calendar object. The calendar nodes and properties exist only to support calendar reports via CalDAV.)
The problem with this approach, other than the complex code required to convert back and forth between the JCR node structure and a calendar object, is that storing even the simplest calendar object calls for an absurdly large number of nodes and properties and a correspondingly absurd amount of disk resources.
Jackrabbit's "object" persistence manager stores a node as follows:
- All files for the node (but not its child nodes) in a single directory
- A .node file describing the structure of the node (its properties, primary type, mixin types, etc)
- A separate file for each property containing its value
(For comparison's sake, the "XML" persistence manager does the same thing, but writes the contents of the files as XML rather than using a binary format. The "simple db" persistence manager uses the binary format but stores the data in "node" and "prop" tables in an RDBMS.)
To test this, I stored a calendar containing a single, simple (non-recurring) event with few calendar properties. That one event created 34 nodes, using 129 files and a whopping 648 KB of disk.
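To put those numbers in terms of the storage layout described above, here is a back-of-the-envelope breakdown. It assumes every one of the 129 files is either a .node file or a property file (directories and any other bookkeeping files are not counted), which is an inference from the layout rather than something that was measured. The variable names are illustrative only.

```python
# Measured figures for the single-event calendar.
nodes = 34
files = 129
disk_kb = 648

# Assumption: one .node file per node, and every remaining file holds one property value.
property_files = files - nodes
props_per_node = property_files / nodes
kb_per_file = disk_kb / files

print(f"property files:      {property_files}")
print(f"properties per node: {props_per_node:.2f}")
print(f"average KB per file: {kb_per_file:.2f}")
```

If that assumption holds, the event is spread across roughly 95 property files, each consuming disk far out of proportion to the handful of bytes a typical property value needs.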
As a more extreme example, a 535-event calendar required over 800 MB of disk and took tens of minutes to rm -rf. Furthermore, the JVM (containing Tomcat, Cosmo, and Jackrabbit) required 100 MB of heap in order to process the PUT requests to publish this calendar.
Our rough goal for the server is to support (on a single commodity Linux host) 10k users, each with a calendar of 1k events, at a 10% activity rate.
At the measured rate of 648 KB per event, 10k users * 1k events (10 million events) would require over 6 terabytes of disk. And that's just for calendar metadata - it doesn't account for the actual iCalendar streams or other blobs that are stored with WebDAV. While we don't have a storage goal yet, my gut feeling is that it will wind up being a couple of orders of magnitude lower.
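As a rough check on that figure, the projection can be worked through in a few lines of Python. This is only a sketch: it assumes decimal units (1 TB = 10^9 KB = 10^6 MB) and that disk usage scales linearly with the number of events. The second estimate, extrapolated from the 535-event calendar, is not a figure stated above. It is simply what the same arithmetic gives, and since that calendar took "over" 800 MB it should be read as a lower bound.

```python
users = 10_000
events_per_user = 1_000
total_events = users * events_per_user  # 10 million events

# Estimate 1: scale up the single-event measurement (648 KB per event).
kb_per_event = 648
single_event_estimate_tb = kb_per_event * total_events / 1e9  # KB -> TB

# Estimate 2: scale up the 535-event calendar (at least 800 MB in total).
mb_per_event = 800 / 535
bulk_estimate_tb = mb_per_event * total_events / 1e6  # MB -> TB

print(f"from single event:       {single_event_estimate_tb:.2f} TB")
print(f"from 535-event calendar: {bulk_estimate_tb:.2f} TB or more")
```

Either way the result is several terabytes for metadata alone, so the conclusion does not depend on which measurement is used.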
It's still unclear what the memory requirements would be for 1k concurrent users (10% of 10k). We need to observe the heap with more than one person publishing/syncing at the same time.
Since we only store the calendar metadata in order to service CalDAV reports, we need to analyze specifically what data the reports need to query on. My assumption is that the reports need to be able to perform case-insensitive "contains" queries on the string values of arbitrary calendar component properties.
If this is true, then:
- we could store component property values as strings; we wouldn't have to break down complex property value types and store the constituent parts separately
- we could ignore parameters and calendar properties altogether
This would go a long way toward decreasing the storage requirement of calendar metadata as well as the memory consumption and processing time of converting from the calendar object to the JCR representation and vice versa.
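To make the proposal concrete, here is a minimal sketch of what the flattened representation and the assumed query could look like. It is plain Python rather than JCR or Cosmo code, and everything in it is hypothetical: the function names (flatten_component, contains), the (name, parameters, value) tuple shape, and the sample event are made up for illustration. It only demonstrates the idea of keeping each component property value as a single string, dropping parameters, and answering case-insensitive "contains" queries against those strings.

```python
from collections import defaultdict


def flatten_component(properties):
    """Reduce (name, params, value) tuples to {NAME: [value strings]}.

    Parameters are dropped, and values are kept as the raw strings
    instead of being broken down into their constituent parts.
    """
    flat = defaultdict(list)
    for name, _params, value in properties:
        flat[name.upper()].append(value)
    return dict(flat)


def contains(flat, prop_name, text):
    """Case-insensitive substring match against one property's values."""
    needle = text.casefold()
    values = flat.get(prop_name.upper(), [])
    return any(needle in value.casefold() for value in values)


# Hypothetical event, written as (property name, parameters, value string).
event = [
    ("SUMMARY", {}, "Staff Meeting"),
    ("DTSTART", {"TZID": "America/Los_Angeles"}, "20051101T090000"),
    ("LOCATION", {"LANGUAGE": "en"}, "Conference Room B"),
]

flat_event = flatten_component(event)
print(contains(flat_event, "summary", "MEETING"))
print(contains(flat_event, "location", "meeting"))
```

Under this scheme an event needs one stored string per component property, rather than a subtree of nodes for each property, its parameters and its value parts.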