points out that even a tiny amount of the right sample data would be extremely useful -- it is important that the sample data capture examples of rich interconnections (e.g. a stamped item)
stresses that it's important that the data be realistic -- it should be based on realistic user profiles and realistic use cases
it would be good to have a couple of user profiles -- maybe one set of data for a home user and one set of data for a business user
would find the data most useful if it were made available in a spreadsheet format
needs enough data to fully populate the different views, including filtered or collapsed views of the sample data -- maybe 50 to 100 mail messages, and comparable amounts of other kinds of items?
Next Actions:
Mitch to harvest real data from his own PIM and create a small starter set of quality sample data. It needs to be scrubbed of any personal info so that it can be used publicly in examples.
Mimi may add additional realistic sample data so that we have a good amount of sample data.
Brian to take combined data sets and hand them off to Jeffrey.
Jeffrey to map sample data to the 0.4 content model and convert the data into parcel.xml format. Jeffrey to write a transform to make the parcel.xml data also available in some simple spreadsheet format (e.g. CSV format).
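A rough sketch of what the XML-to-CSV transform might look like. This is only an illustration: the `<items>`/`<item>` element names, the `kind` and `id` attributes, and the `items_to_csv` function are made up for the example and are not the actual parcel.xml schema, which would need to be substituted in.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for sample data; NOT the real parcel.xml schema.
SAMPLE_XML = """\
<items>
  <item kind="MailMessage" id="m1">
    <subject>Lunch on Friday?</subject>
    <sender>c1</sender>
  </item>
  <item kind="Contact" id="c1">
    <name>Pat Example</name>
  </item>
</items>
"""


def items_to_csv(xml_text):
    """Flatten each <item> into one CSV row; child elements become columns."""
    root = ET.fromstring(xml_text)
    rows = []
    columns = ["id", "kind"]
    for item in root.findall("item"):
        row = {"id": item.get("id"), "kind": item.get("kind")}
        for child in item:
            row[child.tag] = (child.text or "").strip()
            if child.tag not in columns:
                columns.append(child.tag)
        rows.append(row)

    out = io.StringIO()
    # restval fills in blank cells for attributes a given kind does not have.
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_text = items_to_csv(SAMPLE_XML)
print(csv_text)
```

References between items (the `sender` column pointing at contact `c1`) are kept as plain ids, so the interconnections stay visible in the spreadsheet.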
Open Questions (and answers from Ted via email)
Can we have a single, unified set of sample data, which meets our needs for design work, content model validation, and repository testing (repository unit tests, not performance tests)? Or would it be better to have different sample data for different purposes?
"I think that having a reference set of sample data that works for design, content model, and repository purposes is a good idea. I also think that we will have additional data sets for other purposes, (e.g. my 17,000 RSS feeds of data for stress testing the repository)." -- TedLeung - 9 Feb 2004
How much sample data do we want? A few dozen items? A few hundred? A few thousand?
"I think that we probably want a few dozen items of each kind for this purpose. If you want a standard set of test data, we probably need more than this, a few hundred items each." -- TedLeung - 9 Feb 2004
Can this be made-up data, or would it be better to collect real data from a PIM program that one of us uses? Should we put some effort into making sure that the sample data reflects the use cases we care most about?
"I think that things like people's names and addresses can be made up, but the relationships between the items should be as close to realistic as possible." -- TedLeung - 9 Feb 2004
What format do we want the sample data in? Parcel.xml files? Python unit test code? Excel spreadsheets? Interconnected wiki pages?
"If you want live data that can be shown in Chandler/prototypes, then we need at least parcel.xml files. Unit test code should access the data via the repository API anyway. Encoding lots of data in test files leads to problems. It would be great to generate the various formats (xml, XLS, wiki pages, etc) from a single source, in order to reduce the number of typing bugs." -- TedLeung - 9 Feb 2004
If we use parcel.xml files as the standard source format for sample data, does that format meet everybody's needs, or should we have transforms or export tools that can convert it into another format?
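One way to picture the "single source" idea from Ted's answer above: keep the items in one place and generate each format from it. The sketch below is an assumption-laden illustration, not a proposal for the real tooling. The item fields, the XML element names, and the helper functions `to_xml`, `to_wiki_table`, and `dangling_references` are all hypothetical, and the XML produced is not the real parcel.xml format.

```python
import xml.etree.ElementTree as ET

# Single source: made-up items with a realistic cross-reference (sender -> contact).
ITEMS = [
    {"id": "c1", "kind": "Contact", "name": "Pat Example"},
    {"id": "m1", "kind": "MailMessage", "subject": "Lunch on Friday?", "sender": "c1"},
]


def to_xml(items):
    """Emit a simple XML document (hypothetical layout, not parcel.xml)."""
    root = ET.Element("items")
    for item in items:
        el = ET.SubElement(root, "item", id=item["id"], kind=item["kind"])
        for key, value in item.items():
            if key not in ("id", "kind"):
                ET.SubElement(el, key).text = value
    return ET.tostring(root, encoding="unicode")


def to_wiki_table(items):
    """Emit a TWiki-style table with one row per item."""
    columns = []
    for item in items:
        for key in item:
            if key not in columns:
                columns.append(key)
    lines = ["| " + " | ".join("*%s*" % c for c in columns) + " |"]
    for item in items:
        lines.append("| " + " | ".join(item.get(c, "") for c in columns) + " |")
    return "\n".join(lines)


def dangling_references(items, ref_fields=("sender",)):
    """Return (item id, field) pairs whose reference points at no known item."""
    ids = {item["id"] for item in items}
    return [
        (item["id"], field)
        for item in items
        for field in ref_fields
        if field in item and item[field] not in ids
    ]


print(to_xml(ITEMS))
print(to_wiki_table(ITEMS))
print(dangling_references(ITEMS))
```

Because every format is generated from `ITEMS`, a typo only has to be fixed once, and a check like `dangling_references` can catch broken relationships before the data is exported.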
Should we just dive in, and learn as we go, or should we do a little planning first to identify "requirements" and figure out who wants to be involved?
"I think that generating a small subset of the total data set accompanied by feedback about what's easy/hard/whatever would be a good start. I've already noted some issues with modeling in the query proposal (ChandlerQuerySystem)." -- TedLeung - 9 Feb 2004