More on Cosmo 0.2 Resource Usage
It turns out that CalDAV reports allow querying based on iCalendar component type, property and parameter, so we'll need to store the names and string values for all of these things. We can get away without storing the "broken down" form of property and parameter values, which is a small win, but it's not enough. We're now looking into alternate ways to cram lots of info into a single JCR property in a way that remains queryable. Things look hopeful there. That work proceeds on the trunk, for the 0.3 release.
I spent the morning changing how we generate calendar collection indexes, extracting the info we need from the raw iCalendar streams rather than from JCR. Since 0.2 doesn't implement CalDAV reports, it has no need to store calendar metadata in JCR, so I've removed all of that code from the 0.2 branch. If the work I mentioned above pans out, we'll likely remove it all from the trunk (0.3) as well.
All that said, we still need to analyze how many resources even the basic WebDAV/CalDAV metadata requires. Here's a simple breakdown of this metadata:
| node type | properties |
| dav:collection (extends nt:folder) | dav:displayname, * (arbitrary client-generated dead properties) |
| dav:resource (extends nt:file) | dav:displayname, dav:contentlanguage, * |
| caldav:home (mixin) | caldav:calendar-description, xml:lang |
| caldav:collection (mixin) | caldav:calendar-description, xml:lang |
| caldav:resource (mixin) | caldav:uid |
So even without storing huge amounts of iCalendar metadata, we still have what looks like a relatively small metadata requirement. Yeah, it looks small now...
Vanilla CalDAV/webcal Disk Usage
I ran some simple CalDAV and webcal operations to observe the number of files and amount of disk used for each operation:
| After this operation... | Num files (cum) | Disk usage (cum) |
| sign up for account | 15 | 60kb |
| MKCALENDAR | 22 | 88kb |
| PUT event1.ics into calendar | 34 | 136kb |
| PUT event2.ics into calendar | 46 | 184kb |
| PUT event3.ics into calendar | 58 | 232kb |
| PUT event1.ics into homedir | 69 | 276kb |
The second through fifth ops model creating a CalDAV calendar and uploading individual events into it, while the final op models publishing a single-event webcal calendar (like say from Apple iCal).
This shows a very regular 12 files and 48kb per CalDAV event, and 11 files and 44kb for the webcal event (we don't have to store the uid for that one like we do for the CalDAV events). The calendar itself costs 7 files and 28kb.
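The per-operation figures are just the differences between consecutive rows of the cumulative table. A minimal Python sketch of that subtraction, with the rows copied from the table above (the function and variable names are only illustrative):

```python
# Cumulative (operation, files, kb) rows copied from the table above.
cumulative = [
    ("sign up for account", 15, 60),
    ("MKCALENDAR", 22, 88),
    ("PUT event1.ics into calendar", 34, 136),
    ("PUT event2.ics into calendar", 46, 184),
    ("PUT event3.ics into calendar", 58, 232),
    ("PUT event1.ics into homedir", 69, 276),
]


def per_operation_cost(rows):
    """Return (operation, files_added, kb_added) for every row after the first."""
    deltas = []
    for (_, prev_files, prev_kb), (op, files, kb) in zip(rows, rows[1:]):
        deltas.append((op, files - prev_files, kb - prev_kb))
    return deltas


deltas = per_operation_cost(cumulative)
for op, files, kb in deltas:
    print(f"{op}: +{files} files, +{kb}kb")
```

Each delta works out to 4kb per file, which would be consistent with every file occupying a single 4kb filesystem block, though that is a guess rather than something measured here.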
Compare the 34 files and 136kb for the single-event CalDAV calendar with the 129 files and 648kb that was being used yesterday. Definitely an improvement. Let's use these numbers to calculate sizing for our target population of 10k users each with a calendar of 1k events:
48kb/event * 1000 events/calendar * 1 calendar/user * 10k users = 480,000,000kb = ~450gb
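For anyone who wants to plug in different populations, here is the same arithmetic as a small Python sketch. The helper name is made up for illustration, and it assumes binary units (1gb = 1024 * 1024kb), which is an assumption on top of the numbers above:

```python
KB_PER_GB = 1024 * 1024  # assumption: binary units


def metadata_kb(kb_per_event, events_per_calendar, calendars_per_user, users):
    """Total metadata footprint in kb for a uniform user population."""
    return kb_per_event * events_per_calendar * calendars_per_user * users


# 48kb per CalDAV event, 1000 events per calendar, 1 calendar per user, 10k users
total_kb = metadata_kb(48, 1000, 1, 10_000)
total_gb = total_kb / KB_PER_GB
print(f"{total_kb:,}kb is about {total_gb:.0f}gb")
```

With binary units this lands a little above 450gb; with decimal units it would be 480gb. Either way the ballpark is the same.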
Chandler Disk Usage
Now, that's the general CalDAV case. But I think we want to assume that all of our users will be using Chandler, which requires two files per event (one iCalendar, one with Chandler internal data) plus a couple of other Chandler-internal files. So let's look at what happens when we share the 535-event calendar from Chandler.
| After this operation... | Num files (cum) | Disk usage (cum) |
| sign up for account | 15 | 60kb |
| share test calendar | 12,265 | 49,116kb |
This averages out to 23 files and 92kb per event (in this size calendar, there is some variability in the amount of information in each event, but it averages out over a large-ish calendar), or just under twice what's needed for a CalDAV event, which is pretty much what we expected.
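The averages come from subtracting the post-signup baseline and dividing by the event count. A quick Python sketch of that calculation, using only numbers from the tables above (variable names are illustrative):

```python
# Cumulative figures from the Chandler table above.
baseline_files, baseline_kb = 15, 60          # after account signup
shared_files, shared_kb = 12_265, 49_116      # after sharing the test calendar
events = 535

files_per_event = (shared_files - baseline_files) / events
kb_per_event = (shared_kb - baseline_kb) / events

# Ratio against the 48kb measured per vanilla CalDAV event.
ratio_to_caldav = kb_per_event / 48

print(f"{files_per_event:.1f} files and {kb_per_event:.1f}kb per event")
print(f"{ratio_to_caldav:.2f}x the vanilla CalDAV cost")
```

Note that the average also absorbs the calendar collection itself and the handful of Chandler-internal files, which is negligible at this calendar size.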
So again let's calculate sizing:
92kb/event * 1000 events/calendar * 1 calendar/user * 10k users = 920,000,000kb = ~875gb
So this is almost an order of magnitude improvement (<1tb vs ~6tb), and it's probably close to being acceptable, though I'd really like to see it sub-100gb.
That Was Just the Metadata
Something very important to remember: this does not take the actual content of the shared data into account. That stuff is stored in a different place. So far we've just been looking at metadata storage.
For the 535-event calendar, we're using 4,368kb of disk to store the shared data (the individual iCalendar events and the private Chandler data files). That is what it is, and there ain't much we can do about it. One implication of the above number is that it'll probably cost ~8mb to store a 1000-event calendar. So what do we set as the default disk quota for the average user? How much free space (above the 8mb) should we give them?
Let's say we decide on a 15mb user quota. That means our total disk requirement becomes:
15mb/user * 10k users = 150,000mb = ~150gb quota; ~150gb quota + ~875gb metadata = ~1tb total disk
How much does a terabyte of disk cost these days?
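To make the content and quota arithmetic easy to rerun with a different quota, here is a rough Python sketch. It assumes binary units (1024kb per mb, 1024mb per gb) and takes the 15mb quota and ~875gb metadata estimate from above as inputs; the variable names are illustrative only:

```python
# Content storage measured for the 535-event Chandler calendar.
content_kb, events = 4_368, 535
content_kb_per_event = content_kb / events
content_mb_per_1000_events = content_kb_per_event * 1000 / 1024

# Quota and metadata totals, assuming binary units.
quota_mb_per_user = 15
users = 10_000
metadata_gb = 875  # Chandler metadata estimate from the sizing above

quota_gb = quota_mb_per_user * users / 1024
total_gb = quota_gb + metadata_gb
total_tb = total_gb / 1024

print(f"content: about {content_mb_per_1000_events:.1f}mb per 1000-event calendar")
print(f"quota: about {quota_gb:.0f}gb, total: about {total_tb:.2f}tb")
```

The metadata dominates: changing the quota from 15mb to 30mb moves the total far less than a modest change in per-event metadata cost would.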
What's Next?
We can potentially eke out even more gains by using an RDBMS instead of the filesystem. Jackrabbit's SimpleDbPersistenceManager stores nodes in one table and properties in another. Depending on how specific database vendors store table data, we might gain a lot of savings. However, we'd almost certainly take a performance hit: 12,250 INSERTs to share the 535-event calendar above, even over loopback or against an embedded db, could be painful. We won't know without testing, of course, and we may well have many more significant performance issues to worry about first.
Memory Usage
A note on memory usage: while running the above tests, I kept an eye on the JVM heap size (started with -Xms256m -Xmx256m) and saw the following:
| After this operation... | Heap size |
| JVM & webapp startup | 8.5m |
| PUTing 3 CalDAV events | 9.9m |
| sharing Chandler test calendar | 78.0m |
| half hour of no further activity | 51.0m |
This shows that we have improved quite a bit in this area as well since yesterday, and that I really need to start using -Xincgc so that I don't have to wait a half hour for garbage collection to show me the real heap size.
Anyway, we aren't going to be able to draw many conclusions about memory usage until we can observe Cosmo during concurrent usage.