r6 - 24 Mar 2004 - 19:17:25 - BrianDouglasSkinnerYou are here: OSAF >  Projects Web  >  DevelopmentHome > RepositoryFramework > QueryFramework > ChandlerQuerySystem

Chandler Query Proposal 0.1

Goal

Produce a subsystem that allows developers to request a set of items that match a criteria. This request is called a query. The query is made against a set of items.

Requirements

  1. Queries must operate on all items in the repository and on subsets of items in the repository (including other query results)
  2. Queries must operate on Kinds and other metadata as well as "regular" items
  3. Queries must be Items and stored in the repository
  4. Queries produce sets of items satisfying a boolean condition
  5. Queries specify what the result is, not how to produce it
  6. Query language must be closed - results of queries must be queryable
  7. Queries support set theoretic operations like Union, Intersection, Difference
  8. It must be possible to determine the result set size of the previous exeuction of a query (this really means query over a particular collection(s)
  9. Queries must allow attribute traversal
  10. There should be a way to express an extensional (explicitly defined - without a rule) set as a query
  11. Queries must allow execution of Python code as part of content/itemSpec
  12. Query results should be produced incrementally - see Python itertools
  13. App can be notified when items enter/exit a result set
  14. There should be string based and python based API methods for constructing queries
  15. It should be easy to compose queries programmatically to facilitate implementation of gui or natural language query builders
  16. Query results can be either transient or persistent
  17. Query capability should not be tied to User Level Collections/ItemCollection/Playlists
  18. Query Results should know which input and query generated the result

Context

The Chandler repository / data model supplies the items that a query operates on. The obvious source of items is reference collections. This means that the input to a query is supplied by one or more reference collections, and that the output of the collection is also a reference collection. (More specifically, this "interface" could be narrowed to an item iterator. The only difficulty with this approach is that reference collections only exist as attributes of items. This means that all access to reference collections is via an item. As we'll see in the examples, this leads to code that is more verbose than you might expect.

The Chandler design team also has a notion of a set of items which will be visible to the user. This notion has been called ItemCollection, or more recently Playlists.

Requirements for this notion include:

  1. can have a bunch of items in it
  2. can have a name
  3. examples are "my e-mail inbox", and "my san jose sharks calendar"
  4. i can share them with other users
  5. i can give different people different levels of access permissions to different playlists
  6. i can see who has subscribed to different playlists

ItemCollection/Playlists's searching querying functionality will be implemented using the functionality of the query system.

One open question is whether or not Reference Collections should be stand alone items with their own kind. Doing this would shorten some of the code involving queries and present a data model more consistent with the philosophy that everything is an Item.

This could also be accomplished by creating a new Kind, Collection, which exists solely to "wrap" a reference collection. It is likely that something like this will need to be created anyway in order to represent things such as mailboxes, calenders, etc. Whether this is actually Playlists/ItemCollection or an abstraction used to construct Playlist/ItemCollection is unclear.

If we chose to do this, the new kinds we would add to the system are

New Kinds

Collection
    name
    values (refCollection)

Query
    name - name of the query
    input - one collection or a list of collections
    itemSpec - the query expression 
    lastResultSize - size of the last execution (persistent?)

UserCollection/Playlist/ItemCollection(Collection)
    strictness
    + application specific attributes

There would be subkinds of either Collection or UserCollection/Playlist/ItemCollection for representing mailboxes, calendars, contacts, and so forth.

(Shouldn't "name" be an attribte of "Collection" instead of being an attribute of "UserCollection/Playlist/ItemCollection"? -- BrianDouglasSkinner - 03 Feb 2004)

Issues

Here is a short list of open issues regarding queries
  1. Order preservation -- esp since all our "sets" are really lists
  2. Do we need a Join like operation?
  3. Do queries operate on literal collections?
  4. How do queries fit in as attributes -- do we want "query valued attributes"?

Queries

This section gives examples of how a Chandler developer would interact with the query system. I assume that the variable rep is an instance of the repository. As an example, we use the request "all e-mail messages that have been marked up as being in the Foo project". I've also assumed that we are not introducing a Collection Kind, which implies that we need a Kind for persistent query results, which I'm calling ResultSetItem?. This is so that I can show the amount of code required under this approach.

First we need to set up a query item:

inbox = rep.find("//userdata/roots/email/inbox")
query = QueryItem?()
query.input = inbox.contents
query.itemSpec = 'FOR i IN $1 WHERE "Foo" in i.Projects'

To execute the query we need to do something like this, to get a transient query result:

result = query.execute(QueryItem.TRANSIENT)

To get persistent query result it would look like this:

result = ResultSetItem?()
result.contents = query.execute(QueryItem.PERSISTENT)

We can shorten this by using the QueryItem? constructor and defaults on the execute method. The shortened version is:

query = QueryItem?('FOR i IN $1 WHERE "Foo" in i.Projects', inbox.contents)
result = query.execute() # transient default

The shortened persistent version is:

query = QueryItem?('FOR i IN $1 WHERE "Foo" in i.Projects', inbox.contents)
result = ResultSetItem?()
result.contents = query.execute(QueryItem.PERSISTENT)

What happens when we want to do a query over the result of a query? The code looks like this for the transient case:

query1 = QueryItem?()
query1.input = result
query1.itemSpec 'FOR i in $1 WHERE "urgent" in i.subject'
result1 = query.execute(QueryItem.TRANSIENT)

or this for the persistent case:

query1 = QueryItem?()
query1.setInput(result.contents)
query1.setItemSpec('FOR i in $1 WHERE "urgent" in i.subject)
result1 = ResultSetItem?()
result1.content = query1.execute(QueryItem.PERSISTENT)

Again, we can shorten the transient case to

query1 = QueryItem?('FOR i in $1 WHERE "urgent" in i.subject', result)</br /> result1 = query1.execute()

and the persistent case to:

query1 = QueryItem?('FOR i in $1 WHERE "urgent" in i.subject', result.contents)
result1 = ResultSetItem?()
result1.contents = query1.execute(QueryItem.PERSISTENT)

I think that we need to do something to make the persistent and transient cases more alike.

Item specs for Example Queries

This section shows how to express these queries using the system being proposed.

Here is the key for reading this section, the example query, questions about the query, and issues raised by the query

These have the path exprs worked out, except for . vs / as step delimiter.

Instead of showing the full code for each query, which contains a lot of boilerplate Python code, only the query's Input and ItemSpec? are shown. The Input has been plugged directly into the ItemSpec? for readability. The Python code to retrieve a particular input is only shown once per group of queries.

All inbox email items that have been marked up as being in the Foo project
inbox.contents = rep.find("//userdata/roots/email/inbox")
FOR i IN inbox WHERE "Foo" in i.Projects

(This example has the project name "hard-coded". That's the right thing to do if the query represents some kind of a one-off search where the user just typed in "Foo". But if the user built the query by selecting the project "Foo" from a pop-up list and afterwards the user saved the query, then the saved query should not contain the text "Foo", but should instead have a bi-directional item-reference to the item representing the Foo project. -- BrianDouglasSkinner - 03 Feb 2004)

All inbox email items that were sent in the last month/day/week
FOR i in inbox.contents WHERE i.dateSent < 1 month/1 day/1 week
note easy notation for date literals - may be wishful thinking

All inbox email items that were sent by a particular person
FOR i in inbox.contents WHERE i.fromAddress == <Joe'sContactItem>???
May need fuzzy match or alias based equality to do a good job here.
mail schema is buggy because reply-address = From:/Sender

All inbox email items with the word "Bar" in the body
FOR i in inbox.contents WHERE "Bar" in i.messageBody
Note inference of full text indexing!

All inbox email items with "Bar" in either the subject, the body, or any of the to or from fields (or in the names of the Contacts associated with the email addresses in any of the to or from fields)
FOR i in inbox.contents WHERE "Bar" in i.subject OR "Bar" in i.messageBody" OR "Bar" in i.toAddress or "Bar" in i.replyAdddress OR "Bar" in i.toAddress.emailAddress
not clear enough how to specify in schema.
Not clear when to use full text (or if able due to schema)

All "recent" mail to or from a particular person
Is this across all mailboxes? -- (Yes, across all mailboxes -- BrianDouglasSkinner - 03 Feb 2004)
mail = rep.find('//userdata/roots/email')
FOR i in mail.contents WHERE i.returnAddress == "John" AND i.dateSent < 1 week

is this the correct nested semantics?
mail = rep.find('//userdata/roots/email') or a query on kinds
(Instead of checking for "i.returnAddress == 'John'", it would be better to check to see if the returnAddress is a member of the set of pastEmailAddresses belonging to the Contact that represents "John". -- BrianDouglasSkinner - 03 Feb 2004)

A user's home calendar: all calendar events, tasks, notes, etc. in the "home" calendar
home = rep.find('//userdata/roots/calendars/home')
FOR i in home.contents WHERE i.kind in (calendarKind, taskKind, notesKind, ...)
either a big union, or any item in the home calendar -- attached notes may make it harder

All calendar items with durations overlapping a given day/week/month (startTime and endTime denote the duration)
cals = rep.find('//userdata/roots/calendars')
FOR i in cals.contents WHERE interval(i.startTime,i.endTime).contains("4/28/2004")

Assumes we create interval convenience classes for duration overlapping -- may be part of calender or egenix-MX - need to research

All calendar items with the word "Bar" in the headline or body
FOR i in cals.contents WHERE "Bar" in i.heading OR "Bar" in i.body

All calendar items that are in the Foo project
FOR i IN cal.contents WHERE "Foo" in i.Projects

All content items that have some relationship to "today"
What does relationship to "today mean"? -- any date in any attribute?

Any Contact item where the related ContactName? has the value "Lee" for any of the attributes (firstName, middleName, lastName)
contacts = rep.find('//userdata/roots/contacts')
FOR i IN contacts.contents WHERE i.firstName = "Lee" OR i.middleName = "Lee" OR i.lastName == "Lee"

Any content item with "Foo" in any attribute-value
items = rep.find('//userdata/contentitems')
FOR i in items.contents WHERE ????

Okay, this is a tough one

All content items with a "project" attribute
FOR i in items.contents WHERE i.hasAttribute("project")
Currently there is no hasAttribute() on items An alternate formulation is a query over kind.items and kind.subKinds.items

A set of content items that the user has selected and dragged to the "MyStuff" view

Couldn't find a view attribute -- this means a UserCollection?, which should be queryable -- but the "query" probably just returns.

Some/any combination of these use cases, unions and intersections
UNION(refCollection1, refCollection2)
INTERSECTION(refCollection1, refCollection2)
DIFFERENCE(refCollection1, refCollection2)

Implementation issues

Here are a few issues that we'll need to take into account in the implementation

Streaming results (implementation via threads / generators) - also see python itertools

Indices - values and path indices

Clustering - may be difficult given current implementation

Order Statistics

Notification / constraint enforcement

Nested Reference Collections

Related Projects

-- TedLeung - 02 Feb 2004

Edit | WYSIWYG | Attach | Printable | Raw View | Backlinks: Web, All Webs | History: r6 < r5 < r4 < r3 < r2 | More topic actions
 
Open Source Applications Foundation
Except where otherwise noted, this site and its content are licensed by OSAF under an Creative Commons License, Attribution Only 3.0.
See list of page contributors for attributions.