r10 - 10 Jun 2005 - 15:29:07 - HeikkiToivonen

Kibble Sharing architecture

This page describes the sharing architecture for Kibble (with notes on pre-Kibble release targets). We are working on something that allows calendar sharing to get implemented quickly as a simple feature, without too many barriers to sharing arbitrary data, and then make incremental improvements once it's usable.

Note: For brevity of examples in this document, Pablo is the publisher or sharer of data, and Sherry is the "sharee", the one who accepts the sharing invitation and synchs Pablo's share.

WebDAV Servers

Chandler sharing is to be done by synchronizing Chandler items via a server repository. The server repository is ideally highly available (always on), giving the sharer the ability to upload Chandler items even when the sharees are not online, and the sharees get the ability to download even when the sharer is not online.

The server is a WebDAV server for a number of reasons.

  • First, using an implemented standard gives us the ability to use an out-of-the-box product (both commercial and open source versions exist) to run the server at least in the Kibble timeframe. This reduces development overhead for OSAF.
  • Second, using a standard with a flexible data model saves us from the hassle and long-term costs of designing a protocol from scratch. Custom-designed protocols often bear long-term costs in maintenance, documentation, and limited flexibility.
  • Third, a server with a standard data model opens up data on the server (should the sharer choose to open it up) to access by third-party software. For example, we envision that Blackberry users could write a Blackberry module to allow that client to access and modify the same calendar data that a Chandler client can access and modify. This happens for free with our sharing architecture -- when Pablo uploads the data to share with himself, synchronization works in the same way as it does when Sherry synchs up, as we'll see later.
  • Fourth, a proven technology that has been known to scale and perform decently is a good thing. A protocol's model and design have a large effect on the scalability of the server. Contrast IMAP and FTP with HTTP/WebDAV. Because IMAP and FTP both involve long-term connections (where the connections are mostly idle waiting for user input) and the server must maintain state, both IMAP and FTP servers can handle fewer users than an HTTP/WebDAV server, which is more stateless. A protocol with a too-high granularity such as SMB performs worse than HTTP/WebDAV. Smart protocol designers could invent a better, more scalable protocol from scratch but this would require significant attention, expertise, time and testing.

User Accounts

One could imagine storing the Chandler data on a server in a kind of a soup or database -- one pool of data, either structured or unstructured. Some calendaring servers do work this way, such as Oracle's calendaring server. In a single pool of data, an item can be shared by several people in a single instance, and the item doesn't go away until its last user goes away. This is a fine model for enterprises, however it doesn't work too well for federated systems (multiple servers) or use by consumers who need to rent or borrow repository space on some server run by an ISP or ASP.

Instead, if we model the repository space as broken into accounts, and accounts are spread across several servers, we'll make it significantly easier for consumers to be able to obtain server access. This is generally true for server access, but specifically true for WebDAV. Already, Apple's .Mac servers provide rented WebDAV account space and Sharemation offers free limited space. ISPs are beginning to offer WebDAV account space because there's a long history of providing Web space, and the WebDAV functionality offers a more usable way of editing Web content than do alternatives.

The account model imposes some design limitations and issues.

  • When Pablo uploads a set of shared items, these items may count against his quota. If Sherry adds items to Pablo's account, then these also count against Pablo's quota.
  • If Pablo's account is erased or moved, the server repository location for Pablo's share disappears. The copies synched on Pablo's and Sherry's machines still exist but there is no longer an obvious place to synchronize the copies. In the future we can mitigate this by designating other server synch locations (hot backups) but that will not come freely.
  • One logical share -- such as a calendar for a given person or a calendar for a room -- logically belongs on one account. Consider the opposite case where a schedule for a room could consist of a number of items in various accounts and even on various servers. Collating all that information from multiple sources is costly and imperfect. It may be better for the entire share to be unavailable than for parts of the share to be available but items missing.

Since 0.4 release, sharing requires each user to enter their account location, login information and password. The only way to locate somebody else's account or shares is to receive an invitation. In the future we will probably add some way of browsing around for shares that are publicly readable.

Using WebDAV Collections

A WebDAV collection is a convenient container for a group of items that ought to be synchronized together. The most useful request is probably the depth-infinity (or depth 1) PROPFIND request which asks for the 'getetag' property for each item inside the collection. If the ETag has changed since the last time the client synchronized, then either the body or properties of the resource must be downloaded. It also allows the client to deduce when a resource has been added or deleted (though there is probably no way to distinguish a move/rename).

Because this PROPFIND is a single short request, it's a very high-performing way of synchronizing the members of that collection. In the (expected common) case where there are no changes to the items in the collection in the span of the poll, this is a very reasonable way to poll for changes.
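
The polling step described above could look roughly like this in Python. This is a minimal sketch, not Chandler's actual sharing code: it assumes the client keeps a dictionary mapping each resource URL to the ETag it saw at the last synchronization, and the function names parse_etags and diff_etags are made up for the example. The request body is the ordinary WebDAV PROPFIND form that asks only for getetag.

```python
import xml.etree.ElementTree as ET

DAV = "{DAV:}"

# Body of a PROPFIND request that asks only for the getetag property.
# It would be sent with a "Depth: 1" (or "Depth: infinity") header.
PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<D:propfind xmlns:D="DAV:"><D:prop><D:getetag/></D:prop></D:propfind>'
)


def parse_etags(multistatus_xml):
    """Map each href in a 207 Multi-Status body to its ETag, if it has one."""
    etags = {}
    root = ET.fromstring(multistatus_xml)
    for response in root.findall(DAV + "response"):
        href = response.findtext(DAV + "href")
        etag = response.findtext(f"{DAV}propstat/{DAV}prop/{DAV}getetag")
        if href is not None and etag is not None:
            etags[href] = etag
    return etags


def diff_etags(previous, current):
    """Compare the ETags from the last synch with the ones just fetched.

    Returns three sorted lists of hrefs: added, changed, deleted.
    A move or rename shows up as one deletion plus one addition.
    """
    added = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    changed = sorted(
        href for href in current
        if href in previous and current[href] != previous[href]
    )
    return added, changed, deleted
```

When nothing has changed, all three lists come back empty and the client has spent a single round trip on the poll.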

So, to synchronize a share such as Pablo's calendar, there's a strong benefit to having Pablo's calendar items stored in one WebDAV collection or a small number of WebDAV collections. There's also a strong benefit to having that set of collections specialize in the data in that share -- so that there aren't a lot of items that the client must ignore, but only items that the client is supposed to synchronize.

What if Sherry wants to synchronize Pablo's entire calendar, while Suzie wants to synchronize only the work events that Pablo shares? If this is a common case we can divide the shares up into collections as follows: /users/pablo/work-events /users/pablo/general-events

It should be easy for Sherry's client to synchronize both WebDAV collections while Suzie synchronizes only one.

Note that the collection can itself have properties, or can contain some items other than the events which are the main content of the share. For example, if we want Pablo and Sherry to each know each other's synch status, we could create editable properties on the collection, or resources within the collection that represent each user.

In 0.4 release we put all the resources in one collection even if the owner created multiple shares. In 0.5 we're planning to use separate collections per share.

WebDAV resources: granularity

The natural granularity offered by WebDAV resources is not really a clear fit to Chandler items. The Chandler repository data model encourages large numbers of small items. Currently, an event with 10 attendees is modelled as 11 items. The 10 items that model each attendee are very small. Instead, we are working on modeling a Chandler cloud of items as a WebDAV resource.

Why does granularity matter? A large number of small independent resources is more laborious to synchronize. One must handle more ETags and make more PROPFIND and PROPPATCH requests. Each resource must be updated in a separate PROPPATCH request because WebDAV PROPPATCH handles only one resource per request.

Note that the granularity problem isn't just a WebDAV problem; it would equally be a problem if we were to use FTP and model one FTP file per item. It's simply more work to synchronize more objects beyond a certain point. Because latency costs dwarf bandwidth restrictions in their effect on performance at a small granularity, performance is better if the round-trips are minimized, even if that means wasted bytes on the wire. In the case of WebDAV, what that means is that it would be more efficient to synchronize an XML-valued "attendees" property on an event resource than it would be to synchronize 10 attendee resources.
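
A back-of-the-envelope model makes the latency argument concrete. The numbers here are assumptions chosen for illustration (100 ms round-trip time, roughly 1 Mbit/s of bandwidth, requests issued one after another with no pipelining), and sync_seconds is a hypothetical helper, not a measurement of any real server.

```python
def sync_seconds(round_trips, payload_bytes, rtt_s=0.1, bytes_per_s=125_000):
    """Rough transfer time: one round trip per request plus time on the wire."""
    return round_trips * rtt_s + payload_bytes / bytes_per_s


# Ten small attendee resources of about 200 bytes each, one request apiece.
separate = sync_seconds(round_trips=10, payload_bytes=10 * 200)

# One XML-valued "attendees" property of about 3000 bytes, fetched in a
# single request. The extra bytes stand in for the XML wrapping overhead.
combined = sync_seconds(round_trips=1, payload_bytes=3000)

print(f"separate resources: {separate:.3f} s, single property: {combined:.3f} s")
```

Under these assumptions the single property is about eight times faster even though it moves more bytes, because nine round trips are saved.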

At some point in the middle of the granularity scale (assuming a fixed protocol) the efficiency is maximized. For example, it would be less efficient to put the entire calendar in one resource and download the whole calendar every time one event changed. In other words, there's an ideal granularity from the point of view of WebDAV synchronization performance, that lies somewhere between representing the entire calendar as a single resource on one extreme, and representing every attendee on every event and many other things as separate resources at the other extreme.

Synchronization efficiency is not the only concern when choosing granularity: another is minimizing conflicts and allowing multiple authors to coordinate smoothly. Consistency may be more work to achieve too. What would one do with an orphaned attendee resource when the event has been deleted?

To minimize conflicts, it seems obvious that an event and all its metadata is a good granularity choice to model as a single resource. If two clients are changing the same event at the same time, there's a fair argument that they need to know about the conflict. Clients may still be able to resolve the conflict, for example if Pablo changes the location and Sherry adds a guest to an event, then whichever client tries to modify the event last will detect the other's change, download the change and compare to see if the change can be merged. Note that this decision involves knowledge of the content model. If Pablo changed the start time and Sherry changed the end time, the client might choose to flag that as unmergeable changes, and rather than make the second change, flag to the user.
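
The merge decision described above can be sketched as a three-way comparison against the version both clients started from. This is only a sketch under stated assumptions: events are plain attribute dictionaries, the attribute names are invented, and the COUPLED list (which encodes the content-model knowledge that start and end times belong together) is a hypothetical stand-in for whatever the real content model would say.

```python
# Attributes whose values only make sense together. If each client changed
# a different member of a group, the changes are treated as unmergeable.
COUPLED = [{"start", "end"}]


def merge_event(base, mine, theirs):
    """Three-way merge of attribute dictionaries.

    Returns (merged, conflicts). When conflicts is non-empty, merged is
    None and the listed attributes must be shown to the user.
    """
    keys = set(base) | set(mine) | set(theirs)
    my_changes = {k for k in keys if mine.get(k) != base.get(k)}
    their_changes = {k for k in keys if theirs.get(k) != base.get(k)}

    # The same attribute changed to different values on both sides.
    conflicts = {
        k for k in my_changes & their_changes if mine.get(k) != theirs.get(k)
    }
    # Both sides touched a coupled group and ended up with different values.
    for group in COUPLED:
        if my_changes & group and their_changes & group:
            if any(mine.get(k) != theirs.get(k) for k in group):
                conflicts |= group & (my_changes | their_changes)

    if conflicts:
        return None, sorted(conflicts)

    merged = dict(base)
    for changes, source in ((my_changes, mine), (their_changes, theirs)):
        for k in changes:
            if k in source:
                merged[k] = source[k]
            else:
                merged.pop(k, None)  # the attribute was removed on that side
    return merged, []
```

With this sketch, Pablo's location change and Sherry's added guest merge cleanly, while a start-time change on one side and an end-time change on the other come back as a conflict for the user.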

Will we see cases where two events have interdependencies and modeling them as separate resources causes difficulties in detecting conflicts? Possibly, but I don't think this is too bad given the use cases examined so far. Take, for example, the case of Pablo swapping the time slots of two events. Pablo's client could well succeed at updating the first event but then conflict with Sherry's time change to the second event. Is this a problem? Conceivably yes, but in practice probably not so bad. Think about Pablo's likely user interface. He may drag one meeting over the other meeting before dragging the second meeting back to the first spot, creating a time period where the events overlap. He may decide in the middle of this process to leave them overlapping -- temporarily or permanently. Our content model clearly does not see this as an inherent conflict. If Pablo is really intending to swap the event time slots but one of them conflicts, then Pablo will get that information and be able to rectify that problem, in our desired model of conflict resolution.

Part of the reason why this works well to reduce conflicts is because although the Chandler data model has the event as many items, the user views it as one object that should be copied, moved, deleted and shared as a whole. We can use the Chandler Content Model, which says that a view of events shows each event as a Content Item (and each attendee as part of the event), as a guide to what granularity is most appropriate.

In the Chandler data model, we currently resolve the need to associate an event and all its attendees together by using a cloud. A cloud is the set of things that are shared together, or deleted together, or copied together. The clouds for those three purposes may be slightly different but for sharing, clearly the sharing cloud can be considered as a natural thing to map to a WebDAV resource.

So far this has been a lot of special-case thinking, to convince ourselves that an event's sharing cloud is best modelled as one WebDAV resource. Do we have to do the same thinking for other application domains? Is an email always one resource? A contact? A task? These all seem reasonable, but will external developers make less appropriate choices? There are some risks here, but probably not showstoppers, and we'll probably learn more about this as we go on.

In 0.4 we did each repository 'item' as one WebDAV resource, but in 0.5 we're shifting to representing a cloud (a whole Content Item, i.e. an event including its location and attendees) as a WebDAV resource.

WebDAV resource bodies

The email attachment case, however, probably should be considered immediately. It's a use case we know a lot about and there's a problem fitting a single email and multiple attachments into one WebDAV resource, because a WebDAV resource has only one body. I see three basic approaches.

1. Could we put attachments into properties instead? We could but this feels wrong. Bodies are intended for content data such as an image or a PDF file, and WebDAV has a bunch of tools to make that work well.

  • Content-Type: when a chunk of data has a variable MIME type, HTTP/WebDAV already has a way to represent that. We'd have to reinvent ways to indicate that a big property value had a certain MIME type
  • Byte-range requests: Clients can download a Web resource body a chunk at a time, to compensate for poor connections. The connection can be resumed without throwing away the byte-ranges that have been fully received so far.
  • Deltas, diffs or patches: RFC3229 and the PATCH proposal should allow for great improvements in synchronizing large bodies like images or PDF files.

2. Could we put multiple attachments into a single body as a MIME multipart document? This would work a lot better and is a reasonable approach. However, then the client must download the entire multi-part body and unpack it. This works fine for the synchronization case but when we do support browsing (and recall that 3rd party software could do browsing even before it gets on our schedule) it doesn't work as well.

3. Could we create separate resources for each attachment? Yes, only now we have the consistency problem (make sure the attachments are deleted when the parent thing is deleted, same thing with copy and move). The granularity is probably neither too small nor too big in this case, or at least not obviously so (there may be extreme cases if I send an email containing hundreds of tiny attachments but because email systems aren't optimized for that they are fairly edge cases and do no worse than slow down the process).

I prefer choice 3, although I don't strongly object to choice 2.

[Mimi pointed out that there are two use cases; one where the multiple bodies are all seen as part of the same Content Item (e.g. the text and the html body versions for an email) and one where there are really separable attachments (e.g. photos attached to the email). I should clarify that I prefer choice 3 for both use cases so that we don't have to deal with two architectures. The GUI can describe and handle the two use cases without requiring the sharing code to be aware of the distinctions.]

Our sharing code does not yet deal with attachments and probably won't in 0.5.

Data formats: follow standards

Recall two of the benefits from using WebDAV in the first place:

  • Extensible protocol and extensible data model allows Chandler to evolve
  • Open/standard data model and formats allow 3rd party software to access shared data on WebDAV server

To make the most of these two benefits, we should use standard metadata names and value formatting when possible. For example, our internal data model might not have separate start, end time and location attributes on an event, but our sharing data model ought to because the standard for event data (iCalendar) does. Non-Chandler clients will easily be able to see how to handle our sharing data model when it is obviously derived from iCalendar. For email, the model should be RFC 822 header values, and for contacts, vCards. Tasks are again iCalendar.

Note that if we follow existing standards, this implies

  • Not using properties (iCalendar, vCard are most definitely document-oriented)
  • Not using XML

We have considered converting some standard to a format usable as WebDAV properties. The iCalendar and vCard properties have names with a string profile that is compliant with WebDAV rules for property names, so those are easy to transform. All iCalendar properties can have attributes, and these may need to be transformed to XML syntax. E.g. for the LOCATION property which has a string value and an "ALTREP" (alternate representation) attribute, there are a number of ways we could represent that in WebDAV property syntax with XML sub-structure (shown here: (1) simple string value (client must parse), (2) model iCalendar attribute as XML attribute, and (3) model iCalendar attribute as XML element with value):

    <LOCATION>ALTREP="http://xyzcorp.com/conf-rooms/f123.vcf":
      Conference Room - F123, Bldg. 002</LOCATION>

Or:

  <LOCATION ALTREP="http://xyzcorp.com/conf-rooms/f123.vcf">Conference
   Room  - F123, Bldg. 002</LOCATION>

Or:

  <LOCATION>
    <ALTREP>http://xyzcorp.com/conf-rooms/f123.vcf</ALTREP>
    Conference Room  - F123, Bldg. 002
  </LOCATION>
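
For illustration only, representations (2) and (3) could be generated from an already-parsed iCalendar property along these lines. This is a sketch with hypothetical helper names; it assumes the property has already been split into a name, a dictionary of iCalendar attributes and a string value, and it ignores WebDAV namespaces.

```python
import xml.etree.ElementTree as ET


def as_xml_attributes(name, params, value):
    """Representation (2): each iCalendar attribute becomes an XML attribute."""
    elem = ET.Element(name, params)
    elem.text = value
    return elem


def as_xml_elements(name, params, value):
    """Representation (3): each iCalendar attribute becomes a child element."""
    elem = ET.Element(name)
    last = None
    for pname, pvalue in params.items():
        last = ET.SubElement(elem, pname)
        last.text = pvalue
    # Mixed content: the property value follows the last child element.
    if last is None:
        elem.text = value
    else:
        last.tail = value
    return elem


location = as_xml_elements(
    "LOCATION",
    {"ALTREP": "http://xyzcorp.com/conf-rooms/f123.vcf"},
    "Conference Room - F123, Bldg. 002",
)
print(ET.tostring(location, encoding="unicode"))
```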

After considering this same path for CalDAV, we eventually abandoned it for the simplicity of storing each event as an iCalendar document and only using properties for non-calendar, WebDAV-related metadata (getetag, resourcetype).

When we decide that Chandler needs more information than the standard has structure for, we have a number of options. We could add our own WebDAV properties in a Chandler namespace. So we could have the LOCATION property in some iCalendar-derived namespace plus some "extra-location-information" property in the Chandler namespace. Or we could put X-OSAF-FOO properties into the iCalendar/vCard document body.
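
The second option, extension properties inside the document body, might look like this. It is a simplified sketch rather than a complete iCalendar writer: event_document and escape_text are hypothetical names, the PRODID is a placeholder, X-OSAF-FOO is just the placeholder name used above, and the folding of long lines that iCalendar requires is left out.

```python
def escape_text(value):
    """Escape a TEXT value the way iCalendar requires."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def event_document(uid, dtstamp, dtstart, dtend, summary, location, extra=None):
    """Build one iCalendar document for one event (one WebDAV resource body)."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//sharing sketch//EN",  # placeholder product id
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{dtstamp}",
        f"DTSTART:{dtstart}",
        f"DTEND:{dtend}",
        f"SUMMARY:{escape_text(summary)}",
        f"LOCATION:{escape_text(location)}",
    ]
    # Data the standard has no structure for travels in X- properties,
    # which clients that don't know them can skip.
    for name, value in (extra or {}).items():
        lines.append(f"{name}:{escape_text(value)}")
    lines += ["END:VEVENT", "END:VCALENDAR"]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```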

For 0.4 we exported Chandler attributes as WebDAV properties without considering the standard. For 0.5 we'll actually export attribute values as an XML document which will be the body of the WebDAV resource, not properties. We haven't decided yet what to do after 0.5.

WebDAV Namespaces

While we're on the topic of namespaces... clearly, namespaces help avoid property collision. Two software projects can each define a "synch-ID" property and as long as they use their own namespaces (ideally containing an org domain name or a UUID) they can even create their properties on the same resource without conflict.
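
A small Python sketch shows why the collision doesn't happen. Both namespace URIs and both property values are made up for the example; only the "DAV:" namespace is real.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are invented for this example.
CHANDLER_NS = "http://example.org/chandler/"
OTHER_NS = "http://example.com/other-project/"

# Two projects each define a "synch-ID" property on the same resource.
prop = ET.Element("{DAV:}prop")
ET.SubElement(prop, f"{{{CHANDLER_NS}}}synch-ID").text = "c-1234"
ET.SubElement(prop, f"{{{OTHER_NS}}}synch-ID").text = "o-9876"

# A property is identified by namespace plus local name, so neither
# value overwrites or hides the other.
chandler_value = prop.findtext(f"{{{CHANDLER_NS}}}synch-ID")
other_value = prop.findtext(f"{{{OTHER_NS}}}synch-ID")
print(chandler_value, other_value)
```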

However, beyond this need, there's not much purpose to defining a large set of namespaces run by the same people. For instance, we don't need to define a Chandler mail namespace, a Chandler calendaring namespace and a Chandler synch namespace unless we think the groups working together on WebDAV can't reasonably coordinate to choose non-overlapping names. It may seem easier at the outset to have a number of Chandler namespaces but it's just another decision to make and argue over (which namespace to use and what to call it), and it does impose a namespace management and storage cost on WebDAV servers, a slight overhead in request and response size, etc.

For 0.4 we used only one namespace and we'll continue this for 0.5.

WebDAV Properties: multi-valued properties not ideal

A lot of property information can come in the form of a list. Assuming we end up modeling attendees as property data, then we will encounter this.

WebDAV properties, strictly speaking, have one value per property name. So a property name like 'dtstart' and a value like 20041117T14:10 work fine; there's no need for additional values.

However, WebDAV properties can have XML-formatted values. So if we define a property name like 'attendees' it can contain multiple 'attendee' values. The values themselves can even be complex.

    <attendees>
      <attendee>Lisa Dusseault lisa@osafoundation.org</attendee>
      <attendee>tentative joe@example.com</attendee>
    </attendees>

So far, so good. But this property is atomic -- it can only be set all together. To add an attendee, the client must get the whole property value, add one and set the whole property value back to the server.
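
The read-modify-write cycle might be sketched as follows. The child element names email and status are hypothetical, as is the add_attendee helper; the network steps are left as comments because they depend on the client library in use.

```python
import xml.etree.ElementTree as ET


def add_attendee(attendees_xml, email, status=None):
    """Return a new value for the whole 'attendees' property.

    attendees_xml is the complete current property value as fetched from
    the server. The child element names used here are hypothetical.
    """
    attendees = ET.fromstring(attendees_xml)
    attendee = ET.SubElement(attendees, "attendee")
    ET.SubElement(attendee, "email").text = email
    if status is not None:
        ET.SubElement(attendee, "status").text = status
    return ET.tostring(attendees, encoding="unicode")


# 1. Fetch the current value of the property (PROPFIND).
current = (
    "<attendees><attendee><email>lisa@osafoundation.org</email>"
    "</attendee></attendees>"
)
# 2. Change it locally.
updated = add_attendee(current, "joe@example.com", status="tentative")
# 3. Write the entire value back (PROPPATCH); there is no way to send
#    only the new attendee.
```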

Another option might be to define multiple property names for multi-valued properties. For instance, we could define "attendee_8a4d" and "attendee_b128" as two property names, with values for two attendees. The problem with that is that it becomes very hard to just ask for attendee information. The client must ask the server for a list of all properties which contains a lot of extra data. This is fine for synchronization scenarios but not so great for browsing.

I prefer the XML-formatted multiple values, because it's a common feature request for WebDAV to be able to deal with property values formatted this way. There's a good chance the standard will be extended.

We can also look at use cases. It's rare for more than 20 attendees to be listed each explicitly, so that use case probably works in a number of implementation alternatives. Emails can have long lists of to addresses, but on the other hand those lists are rarely changed, so a single XML-valued property doesn't hold much of a penalty.

CalDAV Interoperability

If we choose to model each Event or Task as a WebDAV resource, with all the item data stored in the resource body, then we have a nearly perfect match to CalDAV as it's now specified. We can probably use a CalDAV server almost exactly as we use a general WebDAV server for sharing -- just be sure to use iCalendar format and check to see if the collection is a Calendar collection before putting events in it.

Smart servers, special purpose servers and layering

We've talked about whether Chandler needs to talk to a "smart" server -- a server that has Chandler-osity, that is programmed to support the same content model that Chandler supports. This sounds rather compelling but it's a lot of work, could limit flexibility and make future functionality more difficult. That's because a server that intimately understands what Chandler wants to do today, will need to change to work with the Chandler of tomorrow. Making that change is costly, particularly if it requires a change to the protocol (with difficult backwards-compatibility problems) as well as a server upgrade, just in order to allow for a client upgrade.

Describing such a server as "smart" is misleading -- it would be more accurate to call it special-purpose and high-level. The alternative is to use a more general-purpose server and do more work, where possible, on the client. WebDAV is a general-purpose protocol and WebDAV servers are general-purpose repositories so that fits the model. This approach is well understood in software architecture, where we constantly look for generic modules to build applications out of (e.g. using XML libraries, databases, crypto libraries) and only write custom code where we have to. Not only does that make application building faster in the first place, it also makes it easier to change the application later because the application has imported generic modules that have a lot of flexible functionality. The same principles hold for network architecture. WebDAV is a generic repository access layer.

WebDAV, by itself, isn't an email access protocol, nor is it a calendar sharing protocol. To make that work, we have to layer on top of WebDAV, using properties to describe what we're doing. You could even say that we're defining a Chandler sharing protocol on top of WebDAV. Now we can see that this is effectively a layering decision. If we were to write a connection-oriented protocol from scratch we'd be layering on top of TCP. If we were to write a connection-oriented protocol that used channels and authentication, we could layer on top of BEEP. If we found the envelope semantics of SOAP useful we could layer on top of SOAP. However, WebDAV clearly provides more services than the other protocols for what we're doing.

We could also extend WebDAV to do "smart" Chandler-esque things, such as filter for Kinds. But if we invented a WebDAV method such as KINDFIND, that would be special purpose -- why not use SEARCH? And if we're going to use SEARCH, why not make Kind just another property, since as a general-purpose feature, WebDAV search already supports arbitrary properties.

Although this approach is more difficult, in the long run it should pay off -- and luckily we don't have to do that much of the work. Most of the features we need, to make our Chandler use cases work, can be modelled through existing WebDAV features (provided we're very careful about how to use its data model). This allows us more flexibility to change use cases without locking them into server features.

In Kibble (0.4, 0.5...) we're using WebDAV as a dumb server.

Permissions

Permissions are notoriously difficult for users to control appropriately. Engineers have the temptation to provide deep flexibility in assigning permissions, but then users find it difficult to set up everything in the way they intend.

Our design team has approached this by simplifying permissions or access control. An earlier idea was that each "share" would have a "sharing circle" and that circle would have full read/write and even administration privileges on the share. While this idea may still be under review, it's illustrative of an approach which trusts other people to do the right thing, and focuses on social control rather than technological.

  • I do not view sharing circle like this. I understood that sharing circle was the people you could potentially share with (you would not be able to share with anyone unless they entered your circle). Setting permissions is totally different in my view. Of course you could set permissions so that everyone in your circle would have admin rights to everything, but that makes no sense. I always thought that you would give read, maybe write, and in extremely rare cases admin, rights to the people you specifically shared some pieces of information with (some individuals from the circle, not the whole circle). -- HeikkiToivonen - 10 Jun 2005

Still, some technological control is considered appropriate: a share will not always be world-readable, and a share will not be world-writable or world-administrable in most cases. So our approach should be to assign read/write and administer privileges to the share as a whole.

The easiest way to achieve that with a WebDAV server is through an architecture that puts one share in one WebDAV collection. After that, splitting the share into several collections (e.g., by Kind primarily) would be almost as easy: the client would simply have to keep the permissions in synch across a small set of collections. The use of collections makes ACLs easier to maintain because the item inherits permissions from the parent. To add an item to the share, the client simply uploads the item and it inherits the correct permissions.

The cost of doing by-item ACL administration could be high. First, in performance -- each item would require an ACL method to set its ACLs.

Bindings add a bit of a wrinkle. If an Item is bound into multiple collections, and the server does support bindings, we could choose to use bindings but we'd still need to check what the server does (the standard is silent). If we don't require the server to support bindings or we choose not to use them then we have two choices.

  • Upload the Item only once and put markers into the other collections: this requires us to manually manage ACLs on the Item so that it meets the read/write requirements of all the shares it appears in.

  • Upload the Item into every share: this requires the originator of the Item to synchronize all these multiple copies and increases the chances for conflict. E.g. Pablo uploads an event into his work and home calendar shares. Sherry modifies the work event copy, and Pablo's girlfriend modifies the home event copy. How does Pablo or his client merge these?

Our first work on permissions will be to begin working on a Python client library that does WebDAV ACL in 0.5.

Principals

There are issues in how to map somebody's email address (where we send the sharing invitation) to their WebDAV account id, in order to grant them permissions. This is a general problem similar to finding out whether two email addresses both refer to the same person, or an email address and IM address.

Our current plan for Kibble is two-fold.

First, we will try a really simple hack to learn what sharing account is used by a person with a given email address by advertising share location in emails. So when Pablo sends an email it will have a Chandler header, and the Chandler client used by recipient Sherry will know that Pablo is also using Chandler. Then when Sherry sends Pablo an email, her Chandler client can add an advertisement to that email letting Pablo's client know that Sherry's email address can be mapped to a given WebDAV account location and userid. Pablo's client can cache that information (long-term persisted cache) for future use. Now when Pablo wants to share his calendar with Sherry, his client will already know what account to try to grant access permissions to.
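
The advertising hack could be sketched with Python's standard email package. The header name X-Chandler-Sharing-Account and its "URL; userid=..." layout are invented for this example (no header format has been decided), and the in-memory dictionary stands in for the long-term persisted cache.

```python
from email import message_from_string, policy
from email.message import EmailMessage
from email.utils import parseaddr

# Hypothetical header name and value layout.
ADVERT_HEADER = "X-Chandler-Sharing-Account"


def advertise(msg, account_url, userid):
    """Sender side: add the account advertisement to an outgoing message."""
    msg[ADVERT_HEADER] = f"{account_url}; userid={userid}"
    return msg


def learn_account(raw_message, cache):
    """Recipient side: record sender address -> (account URL, userid).

    Returns the sender address that was cached, or None if the message
    carries no advertisement.
    """
    msg = message_from_string(raw_message, policy=policy.default)
    advert = msg[ADVERT_HEADER]
    if advert is None or msg["From"] is None:
        return None
    url, _, rest = str(advert).partition(";")
    userid = rest.partition("=")[2].strip()
    sender = parseaddr(str(msg["From"]))[1]
    cache[sender] = (url.strip(), userid)
    return sender
```

When Pablo later shares his calendar with Sherry, his client would look up her email address in the cache to find which account to grant permissions to.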

Second, we need a backup plan for the case where we simply don't have the account information, or when it doesn't help -- when Pablo and Sherry have accounts on different servers, we can't grant permissions from one to another. Each user cannot log into the other's server -- they only have identities/accounts on their own server. So our backup plan is to use tickets. A ticket doesn't rely on the identity of the recipient; it simply grants a capability to the user of the ticket by assigning a single-recipient single-resource access token.
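
The capability idea behind tickets can be illustrated with a toy in-memory model. This is not a WebDAV ticket implementation: TicketStore is a hypothetical stand-in for whatever the server keeps, and details such as expiry, revocation and "write implies read" are left out.

```python
import secrets


class TicketStore:
    """Toy stand-in for a server-side ticket table (hypothetical)."""

    def __init__(self):
        self._tickets = {}

    def issue(self, resource, privilege="read"):
        """Create an unguessable token tied to one resource and privilege."""
        ticket = secrets.token_urlsafe(16)
        self._tickets[ticket] = (resource, privilege)
        return ticket

    def allows(self, ticket, resource, privilege):
        """Check the token only; the identity of the presenter is never used."""
        return self._tickets.get(ticket) == (resource, privilege)


store = TicketStore()
# Pablo asks his server for a ticket and sends it to Sherry in the invitation.
ticket = store.issue("/users/pablo/work-events", "read")
```

Sherry needs no account on Pablo's server: presenting the ticket is enough to read that one collection, and nothing else.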

We don't use principals/tickets/account-advertising yet (0.4, 0.5) because we don't do anything with permissions yet. Our plans in this area for 0.6+ are as you see.

Modeling bi-directional references

Why I don't think it's necessary (or helpful) to synchronize most bi-di references to the server or between clients...

[unfinished topic]

WebDAV Bindings

how bindings might help with items that appear in multiple shares... and what to do in the meantime...

[unfinished topic]

-- LisaDusseault - 18 Nov 2004

Open Source Applications Foundation
Except where otherwise noted, this site and its content are licensed by OSAF under an Creative Commons License, Attribution Only 3.0.
See list of page contributors for attributions.