From an email sent to me (Katie Parlante) by Brian, who was in early conversations about the data model and the repository. (I added minor wiki formatting for readability).
link to spike overview
Hi Katie,
Thanks for the pointer. Good document. I think it raises a bunch of
interesting questions.
Here are some reactions off the top of my head to the things that caught
my eye. Feel free to share any of this with other people if that would be
useful to the discussion.
- brian
Independently testable layers
This all sounds great to me... "commands down, events up", "events only
travel between adjacent layers". This part is probably uncontroversial.
Basic undergraduate doctrine. But it's worth having someone remind you
about it from time to time. It might be worth the effort to set up the unit
tests in a layer-by-layer way, if that isn't the case already.
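
As a rough illustration of "commands down, events up" and of testing one layer at a time, here is a minimal Python sketch. The `Layer` and `StubUpper` classes and the `done:` event string are hypothetical names made up for this example; they are not taken from the paper or from Chandler's code.

```python
class Layer:
    """One layer in a stack. Hypothetical sketch, not Spike's actual API."""

    def __init__(self, name, below=None):
        self.name = name
        self.below = below      # adjacent lower layer, if any
        self.above = None       # adjacent upper layer, if any
        self.received = []      # events this layer has seen
        if below is not None:
            below.above = self

    def command(self, action):
        """Commands travel down the stack."""
        if self.below is not None:
            self.below.command(action)
        else:
            # The bottom layer does the work and reports back with an event.
            self.emit("done:" + action)

    def emit(self, event):
        """Events travel up, and only to the adjacent layer."""
        if self.above is not None:
            self.above.on_event(event)

    def on_event(self, event):
        self.received.append(event)
        self.emit(event)        # this sketch simply forwards the event upward


class StubUpper:
    """Test double that stands in for whatever layer sits above."""

    def __init__(self):
        self.events = []

    def on_event(self, event):
        self.events.append(event)


# Test the bottom layer on its own, with a stub in place of the layer above.
model = Layer("model")
stub = StubUpper()
model.above = stub
model.command("save")
```

Because a layer only ever talks to its neighbours, each one can be exercised in isolation by stubbing the adjacent layer, which is the point of setting up the unit tests layer by layer.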
Presentation Layer separate from Interaction Model Layer
Seems like a noble goal, but in practice it might not be worth the effort.
I'm not a good person to offer an opinion on this: I don't know enough about
the current CPIA architecture, and I don't have enough experience writing
presentation and interaction code.
Storage Layer above Model Layers
For me, this was the single most interesting idea in the paper. This week
I've been writing code for my "blue-sky" project, and one of the issues I've
been wrestling with is what relationship I want between the "Modelling
Layer" and the persistence code. Right now I have them munged together
because I couldn't figure out a good separation.
I've never seen a code base where the Storage code was layered above the
Model code, but if that's a workable solution it would certainly be more
elegant. And it would have the big pragmatic payoff: it would be easier to
swap in and out different Storage modules, to test different performance
numbers, or to add different features. It would also make it easier to
define an ideal Model first, without being influenced by pragmatic storage
layer constraints. I think the Chandler project may have suffered some from
doing too much storage work too early on, before experimenting enough with
the model layer.
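
One possible reading of "Storage above Model" is sketched below in Python, under the assumption that the model exposes change notifications and the storage module subscribes to them. The names `Item`, `MemoryStore`, `subscribe`, and `track` are invented for this example and do not come from the paper.

```python
class Item:
    """A model-layer item that knows nothing about persistence."""

    def __init__(self, **attributes):
        self._attributes = dict(attributes)
        self._listeners = []

    def subscribe(self, listener):
        # Any upper layer may register interest in changes.
        self._listeners.append(listener)

    def set(self, name, value):
        self._attributes[name] = value
        for listener in self._listeners:
            listener(self, name, value)   # change event travels up

    def get(self, name):
        return self._attributes[name]


class MemoryStore:
    """One interchangeable storage module. It depends on the model, not the reverse."""

    def __init__(self):
        self.log = []

    def track(self, item):
        item.subscribe(self._on_change)

    def _on_change(self, item, name, value):
        # A real store would write to disk; this one just records the change.
        self.log.append((name, value))


item = Item(title="draft")
store = MemoryStore()
store.track(item)
item.set("title", "final")
```

In this arrangement the model can be defined and tested with no store attached, and a different store can be swapped in by writing another class with a `track` method.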
Platform extensibility, plugins, and start-up time
This all sounds good to me. I don't know much about this kind of stuff, but
it seems like the Eclipse model would work for Chandler.
Two Worlds: static access API and dynamic access API
This is a big deal. All along, Chandler has been struggling with the
tension between static and dynamic APIs. RAP and ZODB and all that old
stuff. I think it's a hard problem. I don't have any experience with any
system that has tried to solve the problem and somehow combine static and
dynamic APIs.
If I were starting the Chandler project from scratch today, I would argue
for not trying to have both static and dynamic APIs, and instead having
only dynamic access APIs. That would make all the e-mail and calendar
code way uglier, but I think in the end it would lead to a much more
wonderful product. I worry about losing the soul of Agenda with all the
compromises required to support even a limited static API. So, on this "two
worlds" question, my position is at one extreme, and I'm opposed to the
solution the paper is suggesting.
Chandler-defined attributes and user-defined attributes
The paper talks about two types of attributes: "Chandler-defined" and
"user-defined". I don't think those are good terms. Presumably
"Chandler-defined" means all the attributes that exist in the Chandler that
OSAF ships, when I first install it. And "user-defined" means the
attributes that get added later. But those "added-later" attributes won't
just be "user-defined", they'll also be added when the user installs
third-party parcels. Or they'll be added when user Foo looks at a shared
item that was created by user Bar using a third-party parcel, even though
user Foo never installed the third-party parcel.
In my mental model, there isn't a single distinction between
"Chandler-defined" and "user-defined" attributes. Instead, there are two
distinctions. One distinction is between "attributes which have code
written against them" and "attributes which don't", without regard to who
wrote the code, OSAF or a third party. And the other distinction is between
"attributes shipped by OSAF" and "attributes added later".
If you assume my mental model, then some of what the paper says is problematic.
For example, this paragraph concerns me:
"Such code cannot reference user-defined attributes in such a way, but
not only because that code doesn't know what user-defined attributes will
exist ahead of time. Spike's architecture must actually make a stronger
guarantee: it must never be possible to access a user-defined attribute
(UDA) via normal Python attribute access directly from a content item,
because it would otherwise be more difficult to evolve Chandler's schema
safely."
Attribute names, naming conflicts, and schema evolution
The paper assumes that kinds and attributes have both display names and
"names", where the name is some "unique" symbol that can be used in python
code or xml. For example, an attribute might have a display name of "Start
Date" and a name "startDate". The startDate name is meant to be unique
within some context, where that context might be an xml namespace, or
might be a python class.
If I were starting the Chandler project from scratch today, I would argue
for having only display names. I don't think unique symbol names should be
stored in the repository alongside display names. Python code could still
be written using unique symbol names ("sd = item.startDate"), it's just that
the mapping between the symbols and their corresponding items should be
stored with the python code base, not in the schema. The symbol "startDate"
would uniquely specify a certain attribute because the python code that it's
used in would explicitly associate it with that one attribute. That
association could live in the python code itself, or in a mapping file along
with the python code, or in a parcel of items that describes the python code,
but not in a parcel that describes the kinds and attributes themselves.
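
To make the idea concrete, here is a small Python sketch of a symbol-to-attribute mapping that lives with the code rather than in the schema. It assumes each attribute in the repository has an opaque identity (a UUID here) plus a display name. `Attribute`, `Item`, `SYMBOLS`, and `CodeView` are all hypothetical names for this example, not anything in Chandler.

```python
import uuid


class Attribute:
    """Repository-side attribute: an opaque id and a display name, no symbol."""

    def __init__(self, display_name):
        self.id = uuid.uuid4()
        self.display_name = display_name


class Item:
    def __init__(self):
        self.values = {}   # keyed by attribute id, never by symbol name


START_DATE = Attribute("Start Date")

# This table belongs to the python code base, not to the schema.
SYMBOLS = {"startDate": START_DATE.id}


class CodeView:
    """Wraps an item so that code can still write view.startDate."""

    def __init__(self, item, symbols):
        # Bypass our own __setattr__ for the two internal fields.
        object.__setattr__(self, "_item", item)
        object.__setattr__(self, "_symbols", symbols)

    def __getattr__(self, symbol):
        try:
            return self._item.values[self._symbols[symbol]]
        except KeyError:
            raise AttributeError(symbol) from None

    def __setattr__(self, symbol, value):
        if symbol not in self._symbols:
            raise AttributeError(symbol)
        self._item.values[self._symbols[symbol]] = value


item = Item()
view = CodeView(item, SYMBOLS)
view.startDate = "2004-11-01"
sd = view.startDate
```

Since the item stores values by attribute id, two attributes that happen to share a display name, or two code bases that happen to pick the same symbol, never collide in the repository.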
I believe that if you set things up without the repository storing unique
names, then a lot of other headaches just go away, including a lot of the
questions raised in the section of the paper on schema evolution. Another
headache that would go away is the problem described in this paragraph in
the paper -- and this problem would go away not only for "user defined
attributes", but also for attributes defined by third-party parcels, where
the parcel has code written against the attribute:
"Right now, if a user were to add an attribute to a Chandler-defined
'kind', and a future version of Chandler added an attribute to that kind
with the same name as the one the user added, Chandler would be forced to
rename the user's existing attribute in order to avoid conflict. However,
if UDA's always exist only in a dynamic namespace with no direct mapping to
Python object attribute names, then there is never any possibility of
conflict."
Information modelling
I totally agree with everything the paper says about relationship
cardinality.
I also like the material the paper presents about bi-directional
relationships. Good material about weak-typing of relationships, and
defining a relationship as its own thing, apart from the kinds that it
relates. And the stuff about iterating over relationships. Also good stuff
about problems with having to modify existing kinds in order to create new
kinds where items of the new kind can have references to items of the old
kind. (Although some of the problems go away if you don't have to worry
about unique names.) And good stuff about having an API that allows
bi-directional traversal, regardless of how the actual storage is done.
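
A relationship defined as its own thing, with traversal in both directions, might look roughly like the Python sketch below. The `Relationship` class and its `link`, `targets`, and `sources` methods are assumptions made for illustration; the paper's actual API may differ, and the two in-memory indexes are just one possible storage choice behind the same interface.

```python
from collections import defaultdict


class Relationship:
    """A relationship that exists apart from the kinds it relates (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._forward = defaultdict(list)    # source -> targets
        self._backward = defaultdict(list)   # target -> sources

    def link(self, source, target):
        # Neither endpoint needs to be modified to take part.
        self._forward[source].append(target)
        self._backward[target].append(source)

    def targets(self, source):
        return list(self._forward.get(source, ()))

    def sources(self, target):
        return list(self._backward.get(target, ()))


authored = Relationship("authored")
authored.link("brian", "note1")
authored.link("brian", "note2")
```

Here a new kind can be related to an existing one just by creating a new `Relationship`, without touching the definition of the existing kind.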
Terminology
The paper suggests using the term "Entity" instead of "Kind". I think that
would be a mistake.
Ideally, the software designer and the programmer should have the same
mental model of the app as the end-user will have. It doesn't work well for
the developers to build one set of abstractions but then present the user
with a different set of abstractions. You get the problem with leaky
abstractions. Better to just have one mental model to start with, and that
mental model should be the end-user model, not the programmer model. So,
when it comes to naming things, I would argue that you should pick the names
that are going to show up in the UI and the help text. I can explain to my
grandfather that "there's a kind of item called book, and these items here
are book items", but that's harder if I have to use the word "Entity".
Also, words like "Entity" and "Class" are problematic precisely because they
already have established meanings in programming languages and database
theory and metamodelling standards. Better to start with a clean slate and
introduce terms that are free of preconceptions. That way there's less
confusion when you start explaining that an item can be assigned to more
than one kind, or whatever else might be innovative.