r7 - 29 Aug 2006 - 14:45:08 - PhilippeBossutYou are here: OSAF >  Journal Web  >  TWikiUsers > TedLeung > TedLeungNotes > TedLeung20060127 > InternProjectMVA

Multivariate analysis for search and data exploration in PIM data

1. People

2. Original Proposal by PhilippeBossut

Skills Required: good OO skills, serious math skills

This is more a research project than a software engineering project. The idea is to use multivariate analysis tools (Principal Component Analysis, segmentation) to create a model of the data in a PIM and map newly acquired data. The objective is to allow some level of sorting (tagging) to be automatically done by the software. There will be few UI involved here. There's however quite a bit of Python code to write including coding of statistical methods (no known Python package offering this) or SWIG wrapping some existing C statistical package.

3. Summer of Code Proposal by XunLuo

http://code.google.com/soc/osaf/appinfo.html?csaid=AE8284BFF8C48FEA

4. Ongoing Notes

(with no further statements, 'Chandler' in the text means Chandler 0.6).

4.1 Overview of Chandler data models

The best document to read about Chandler's data model is the architecture document. Here some key points are reviewed briefly:
  • ContentItem is the basic data unit Chandler maintains for PIM items.
  • ListCollection groups ContentItems.
  • Block is responsible for data representation, UI and event handling. A specific parcel needs to subclass Block, such as the FeedController in feeds parcel.
  • RepositoryView implements a cache for repository items. It performs the refresh/commit functionality as well. Through a RepositoryView, Lucene indexing and search operations on underlying Repository could be performed.
  • DBRepository is the Berkeley DB style repository Chandler uses for physical data persistence. It maintains all the metadata as well, especially the Lucene index.

In the MVA project, the powerful tools we use are Lucene's full-text indexing, transforming and representation capabilities. All the computation and scoring are based on quantification and deep analysis of the text content of a ContentItem.

4.2 Supervised Tagging: choose the most appropriate tag for an item

4.2.1 Proposed Technique

Supervised Tagging is defined as this: given a set of tagged ContentItems, we establish a mapping model T=f(C.content) between the contents of a ContentItem and the tag(s) it bears (T stands for tag(s), C stands for a ContentItem); then this mapping model is used to caculate the proper tag for a new, untagged ContentItem c, by deriving t from f(c.content). We build the mapping model based on Principle Component Analysis, or, (although not strictly identical), Latent Semantics Analysis (see [1], [2] in Reference, these methods are abbreviated for PCA and LSA in the following).

What PCA or LSA does is basically map high dimension data points to a low dimension space. While there will be information loss to some extent, difference among the data points are better mapped along the new axes. This page is not intended to give elaborate descriptions about the details of PCA and LSA. Readers could check the references if interested. In simple words, each term of a text document forms a dimension in term vector space, and as not every term plays a role of same importance in distinguishing one document from another, PCA and LSA are very useful in finding reprensentative term combinations; and, in turn, help us establish accurate mapping model between tags and the content of a ContentItem.

Here is the tagging process in pseudo-code:

{
     set up an empty mapping model M;
     do {
        collect tagged ContentItems; 
        adjust parameters of M;
     } until M is stable/reliable;
     
     foreach new ContentItem C
     {
        caculate the tag(s) using M(C); 
     }
}

It seems straightforward. However to implement this idea in Chandler, two questions need to be answered: 1)How to map the control-flow based p-code to Chandler's dataflow-based architecture? 2) How is M caculated, persisted, and used? We did experiments on implemeting the proposed technique on Feeds parcel, due to its relative simplicity as well as being a tutorial itself for parcel developpers.

4.2.2 Experiment on Feeds parcel: a concrete example

The feed parcel retrieves feed items from a RSS channel, and then store all items retrieved as Chandler ContentItems. Of course, the ContentItem class is sub-classed into a FeedItem class. FeedItem has a field category, which is mapped to the field in feed xml. We can use the category field as the tag on a feed item, supposing that the category properly reflects the content of the feed item.

Enlighted by this, we choose the feeds from slashdot.org as experimental datasets. Each slashdot.org feed item has a field in xml, which is mapped to one of the 15 major categories: Mainpage, Apple, Askslashdot, Backslash, Books, Developers, Games, Hardware, Interviews, IT, Linux, Politics, Science, YRO. With enough feed items, we will be able to build the mapping model between the feed item's content with its category, i.e. the tag. We firstly modify the feedparser.py file under util to let it recognize field and map to category field in FeedItem class.

Then, we added a new member function modelIsSetUp() in class FeedChannel (in channels.py). This boolean function returns a value which indicates if the tag-content mapping model has been setup. If it returns true, then the system is still in training stage and every new feed item's category field is used to tune up the model's parameters. Otherwise, the system is in tagging stage and every new feeditem's category field is assigned to the best fitting tag. In this experiment, we implemented a very simple modelIsSetUp(): just returns true if the feed items' count has already exceeded 300, otherwise returns false.

After these two steps, it comes to the part of model generation. As mentioned earlier, the model contains two pieces of information: 1) general represention of the 15 categories, in PCA or LSA term, these are called 'eigenvectors'. 2) The parameters of the PCA or LSA. For 1), one observation we could take advantage of is that all the 'eigenvectors' are also FeedItems themselves, the only difference is that the content of these 'eigenFeedItems' are artificial and do not have explicit meaning. Based on this observation, we expand the FeedItem class to add one more member variable isEigen, items with this variable set to true are calculated 'eigenVectors'; while items with this variable set to false are retrieved from a RSS feed. For 2), we played a small trick to make parameter persistence easier.

Normally when using PCA for classification, eigenVectors and unlabeled data points are projected to eigen space first. This is relatively easy when we have control of data representation. However in the case of Chandler, the projection is a bit difficult. TermVectors are the atomic unit we have control of in Chandler archtecture, which is handled by the PyLucene library. As Lucene tokenized all terms to ids, it is not easy to store and represent the projected coordinates. The solution to this problem is to always keep all eigenVectors and unlabeled data points in the original space, but modify the vector operation function to apply the projection.

For example, a cosine function which computes the angle between two vectos is defined:

cos(v1, v2) = (v1 x v2)/(|v1| *|v2|)

In eigen space, v1 and v2 are mapped to v1' and v2' respectively by applying projection: v1' = P x v1, v2' = P x v2. Where P = [p1 p2 .... pm] is the projection matrix to the first principle component axes.

Philippe's notes: I replaced "vector" here above by "matrix" and "axis" (singular), by "axes" (plural). Projecting to one eigenvector will give you only a scalar so the cosine between 2 projections won't be of much help... I suppose it was a math typo in your paragraph.

So, a cosine function which computes angle between v1' and v2' could be computed using v1 and v2 only, by:

cos(v1', v2') = [((inv(P) x v1) x (inv(P) x v2))]/[|inv(P| * |v1| *|v2|]

Philippe's notes: this seems wrong, cos(v1',v2')=(v1' x v2')/(|v1'|*|v2'|)=((Pxv1)x(Pxv2))/(|Pxv1|*|Pxv2|), there's no need to inverse P.

By this formula, the only parameter we need to store is P. And we do not need to project all the FeedItems and eigen vectors, but just keep them intact.

Philippe's notes: well, you do project since you have to compute P x v... and P does contain the coordinates of the eigen vectors (or it's at least equivalent in term of storage).

The eigen vectors are managed by FeedChannel class, which is a subclass of ListCollection, and all cosine computations are performed in FeedChannel class as well. Because through FeedChannel.itsView, we will be able to access the indexReader in the PyLucene library, and in turn get the TermVectors (see chapter 5 in Reference [4] for details about Lucene TermVector).

Experiments on 433 slashdot feed items (300 for model training, 133 for automatic tagging using the model) revealed that the accuracy is as high as 73%.

Philippe's notes: that would be nice to have the data and tagging result to appreciate this result.

4.3 Unsupervised Tagging: automatic retrieve term(s) to tag an item [TODO]

Unsupervised tagging is actually not MVA. The definition is: given a untagged ContentItem and no training sets, tag the ContentItem purely based on its content, using predefined tag or key term extracted from the ContentItem. One possible effective way is to use natural language processing (NLP) techniquest to extract key words, such as date, time, event type or named entities, and then apply these keywords as the tag learned in unsupervised fashion. This approach is still being investigated and research is in progress.

Philippe's notes: Sure, unsupervised tagging is not MVA but MVA (CPA or LSA) can be used to reduce the dimensionality (which is computationally interesting) and have dimensions that are the most relevant to the data set (the ones concentrating the most spread). We should discuss this in more details on the list I think.

5. Reference Links

  1. Code Sandbox
  2. a PCA tutorial
  3. Latent Semantic Analysis
  4. Lucene in Action book
  5. Yet another documentation about the Chandler repository
Edit | WYSIWYG | Attach | Printable | Raw View | Backlinks: Web, All Webs | History: r7 < r6 < r5 < r4 < r3 | More topic actions
 
Open Source Applications Foundation
Except where otherwise noted, this site and its content are licensed by OSAF under an Creative Commons License, Attribution Only 3.0.
See list of page contributors for attributions.