r3 - 07 Jul 2005 - 14:23:00 - LisaDusseault
As mentioned in the summary, I consider it important that the user can change the behavior of the “composition” of high-level functions. For this purpose the tools will be embedded in a generic environment with a common interface that allows easy rearrangement. You might have guessed it by now: the single interface for getting this done is access to the taxonomy.

The Taxonomy

It makes sense to start with a description of what the taxonomy should be on a technical level. Generally speaking, a taxonomy is nothing other than a repository for relational information and attributes, which can be viewed in hierarchical fashion.

It can be any type of data storage, such as a database, a stream or, most desirably, structured files in a common format like XML/RDF. I will stay with this implementation proposal because it already seems to be accepted by OSAF, but remember that RDF is just bricks & stones. Amazing results or just crap: it depends on the architect.

The taxonomy's basic structure is “Nodes” with non-directional references and attributes. A category node will point to multiple other nodes (elements, categories or documents). The non-directional reference is a simple pointer to another element or node within the taxonomy. “Non-directional” means that there is no requirement to be hierarchically correct (such as parents pointing to their siblings or vice versa). In fact, a node might not even have a reference. Attributes (multiple property elements and property attributes in RDF) might contain information like: nodename, nodetype, name and location of the related document, version information, flags, comparison data and possibly more. The taxonomy file will be read, interpreted along with a definition and loaded as an object tree into persistent memory to form the environment for our toolbox at runtime. Taxonomies and parts of taxonomies can and will be exported, imported, exchanged through email, synchronized or browsed through the network.
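To make the node structure more concrete, here is a minimal Python sketch of such an object tree. The class names `Node` and `Taxonomy`, the method names and the sample entries are all hypothetical, and storing each reference on both ends is only one possible reading of “non-directional”.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One taxonomy node: a category, an element or a document."""
    name: str
    nodetype: str                                    # e.g. "category" or "document"
    attributes: dict = field(default_factory=dict)   # location, version info, flags, ...
    references: set = field(default_factory=set)     # names of the nodes it points to


class Taxonomy:
    """Hypothetical in-memory container for nodes, keyed by node name."""

    def __init__(self):
        self.nodes = {}

    def add(self, name, nodetype, **attributes):
        node = Node(name, nodetype, dict(attributes))
        self.nodes[name] = node
        return node

    def link(self, a, b):
        # Assumption: a non-directional reference is recorded on both ends.
        self.nodes[a].references.add(b)
        self.nodes[b].references.add(a)


tax = Taxonomy()
tax.add("RDF", "category")
tax.add("rdf-primer.txt", "document", location="/docs/rdf-primer.txt")
tax.add("orphan.txt", "document")        # a node without any reference
tax.link("RDF", "rdf-primer.txt")
```

Nothing here enforces a hierarchy: any node may reference any other, and a node such as `orphan.txt` may have no references at all.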

(@@@Need to put some RDF and schema-examples here later…)

The Taxonomy Viewer

Although I have only a rough idea of a graphical user interface, the capability to view the taxonomy from different “angles” is the most important function for working effectively with a taxonomy. It has to be very simple in the first place, because it will give the user an impression of how to “imagine” his pieces of information relating to each other. The user can pick any point in the taxonomy's “spaghetti” of relations to be the “top-level” point of his view. All other references form a tree “below” this point. Of course, there will be a “default” view, which will also allow viewing of unrelated objects and taxonomies. If the chosen top-level node is a category, the Taxonomy-viewer will behave like an explorer application, allowing navigation through the taxonomy's relations in a hierarchical fashion. If the chosen “top-level” node is a document, the Taxonomy-viewer will form the document-concentric working environment with all appropriate tools, views and functionality. Predominantly, this means managing the references to all the related documents. For comfort, this could be done through “dragging” and “dropping”, if applicable. It also means that the document is opened from within the viewer and then edited and saved with the appropriate helper application. There needs to be some kind of communication from the helper application to notify the viewer when the document has been changed, so that it can update the versioning information, which we will discuss in the next topic.
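One way to turn the “spaghetti” of relations into a tree below a freely chosen top-level node is a breadth-first walk over the references. The following Python sketch assumes the references are available as a plain dictionary; the function name `view_from` and the sample node names are hypothetical.

```python
from collections import deque


def view_from(references, top):
    """Return {node: [children]} for the tree seen from `top`.

    `references` maps each node name to the set of node names it is linked to.
    Each reachable node appears once, below the node it was first reached from.
    """
    tree = {top: []}
    queue = deque([top])
    while queue:
        current = queue.popleft()
        for neighbour in sorted(references.get(current, ())):
            if neighbour not in tree:
                tree[neighbour] = []
                tree[current].append(neighbour)
                queue.append(neighbour)
    return tree


refs = {
    "Projects": {"RDF", "report.txt"},
    "RDF": {"Projects", "primer.txt", "report.txt"},
    "report.txt": {"Projects", "RDF"},
    "primer.txt": {"RDF"},
}

category_view = view_from(refs, "Projects")     # browsing mode
document_view = view_from(refs, "primer.txt")   # document-concentric mode
```

The same set of references yields a different hierarchy for each top-level node, and nodes that cannot be reached from the chosen node simply do not appear; a default view would have to list them separately.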

(@@@Need to put some sketches here later…)

Versioning

The versioning I imagine works on document level only. From my point of view, there is no urgent necessity to version categories or the taxonomy itself. The taxonomy will be far too dynamic to track all its changes, and it represents the world of information in its “current state”. And yes, you know it already: we will use the corresponding document node in the taxonomy to store the version information. The version information will contain flags and attributes. Our “You stopped working here” flag will be there along with other, possibly user-defined, custom flags. Attributes contain the version count (a simple number), the author and date of the last modification and a reference to the previous version of the document, along with other, possibly user-defined, custom attributes. For my work there will be a status attribute, which allows me to switch between “draft”, “release candidate” and “final”. Although versioning is just another aspect of the information stored and represented in the taxonomy, it is an essential function, which has to be implemented in the viewer.
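As an illustration, the version information on a document node could be held in a small record like the one below. This is only a sketch: the class and field names, the flag name `stopped-here` and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

STATUSES = ("draft", "release candidate", "final")


@dataclass
class VersionInfo:
    count: int                         # version count, a simple number
    author: str                        # author of the last modification
    modified: date                     # date of the last modification
    previous: Optional[str] = None     # reference to the previous version's node
    status: str = "draft"
    flags: set = field(default_factory=set)


def new_version(old, old_node_name, author, modified):
    """Derive the record for the next version; it points back at the old node."""
    return VersionInfo(old.count + 1, author, modified,
                       previous=old_node_name, status=old.status,
                       flags=set(old.flags))


def set_status(info, status):
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    info.status = status


v1 = VersionInfo(1, "author-a", date(2003, 12, 23), flags={"stopped-here"})
v2 = new_version(v1, "spec-v1.txt", "author-a", date(2003, 12, 28))
set_status(v2, "release candidate")
```

Only document nodes would carry such a record; categories and the taxonomy itself stay unversioned.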

(@@@Need to put some sketches here later…)

The Toolbox

The Toolbox represents the functions of the taxonomy-viewer, available in document-concentric working mode (Top-Level = document-Node) and in browsing mode (Top-Level = Category-Node). In my toolbox, I want to have the following functions that can be manually activated or used by automation:

Count

I want to count words, characters, paragraphs and so on in a given document.
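For plain text, Count could be as small as the following Python sketch. It assumes that paragraphs are separated by blank lines and that words are separated by whitespace; the function name `count` is hypothetical.

```python
import re


def count(text):
    """Count characters, words and paragraphs in a plain-text document."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "characters": len(text),
        "words": len(text.split()),
        "paragraphs": len(paragraphs),
    }


sample = "RDF is just bricks.\n\nThe architect decides."
result = count(sample)
```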

Diff

I want to take two or more documents and “diff” them along with the corresponding meta-information from the taxonomy. As a result, I want to see an indication of the differences between the documents and between their corresponding meta-information.
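A sketch of Diff for two documents, using Python's standard `difflib` for the text and a plain dictionary comparison for the meta-information. The function name and the idea of passing the meta-information as dictionaries are assumptions.

```python
import difflib


def diff_documents(text_a, text_b, meta_a, meta_b):
    """Return line-level text differences and attribute-level metadata differences."""
    text_diff = [
        line
        for line in difflib.ndiff(text_a.splitlines(), text_b.splitlines())
        if line.startswith(("- ", "+ "))     # keep only removed and added lines
    ]
    meta_diff = {
        key: (meta_a.get(key), meta_b.get(key))
        for key in sorted(set(meta_a) | set(meta_b))
        if meta_a.get(key) != meta_b.get(key)
    }
    return text_diff, meta_diff


text_diff, meta_diff = diff_documents(
    "alpha\nbeta\ngamma", "alpha\ndelta\ngamma",
    {"author": "A", "version": 1}, {"author": "A", "version": 2},
)
```

Lines prefixed with `- ` occur only in the first document, lines prefixed with `+ ` only in the second; `meta_diff` pairs the two values of every attribute that differs.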

Match

I want to take two or more documents and “match” them along with the corresponding meta-information from the taxonomy. As a result, I want to see an indication of the similarities between the documents and between their corresponding meta-information.
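Match can be sketched as the counterpart of a diff: instead of reporting what changed, it reports what the two documents share. The Python sketch below uses `difflib.SequenceMatcher` on lines; the function name and the dictionary form of the meta-information are assumptions.

```python
import difflib


def match_documents(text_a, text_b, meta_a, meta_b):
    """Return the lines and the metadata attributes both documents have in common."""
    lines_a, lines_b = text_a.splitlines(), text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    shared_lines = [
        line
        for block in matcher.get_matching_blocks()
        for line in lines_a[block.a:block.a + block.size]
    ]
    shared_meta = {
        key: value
        for key, value in meta_a.items()
        if key in meta_b and meta_b[key] == value
    }
    return shared_lines, shared_meta


shared_lines, shared_meta = match_documents(
    "alpha\nbeta\ngamma", "alpha\ndelta\ngamma",
    {"author": "A", "version": 1}, {"author": "A", "version": 2},
)
```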

Summarize

I want to extract a configurable number of the strongest semantic concepts from one or more documents. This can be done through available algorithms based on the paradigms of Bayesian belief networks. Copernic or MS summarizer products will provide some results in the form of a text-paragraph summary, which is a degradation of the original idea but will work fine for a start. There are a gazillion tools for semantic or static text analysis out there; I might choose a couple of them to work for me.
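To show only the shape of the interface (documents in, a configurable number of concepts out), the Python sketch below swaps in plain term frequency for the semantic analysis. It is a deliberately crude stand-in, not a Bayesian belief network; the stopword list and the function name are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}


def summarize(texts, n=3):
    """Return the `n` most frequent non-stopword terms across `texts`."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    # Sort by descending count, then alphabetically, so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


docs = [
    "The taxonomy is a repository of nodes.",
    "Nodes point to nodes in the taxonomy.",
]
top_two = summarize(docs, n=2)
```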

Expand

I want to expand one or more semantic concepts found in my documents with references from other data sources, like a dictionary, glossary or encyclopedia. Millions of school kids (and possibly strategic consultants and “researchers”) dream of having some kind of “fluff-generator”, which can produce a complex abstract on a specific topic from just a seed of a few keywords.

Search

Of course I want to take a document and drop it onto Search, which will then find similar documents. All kinds of sources should be accessible (DMS, network, email, Internet… etc.). This can be achieved in two ways: by classic full-text search or by searching taxonomic categories.
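The two search modes can be contrasted in a few lines of Python. This sketch assumes that every source has already been reduced to a dictionary of documents with a text and a set of categories; the function names and sample documents are hypothetical.

```python
def full_text_search(documents, query):
    """Names of documents whose text contains every word of the query."""
    words = query.lower().split()
    return sorted(
        name for name, doc in documents.items()
        if all(word in doc["text"].lower() for word in words)
    )


def taxonomic_search(documents, categories):
    """Names of documents that share at least one of the given categories."""
    wanted = set(categories)
    return sorted(
        name for name, doc in documents.items()
        if wanted & doc["categories"]
    )


documents = {
    "primer.txt": {"text": "An introduction to RDF triples", "categories": {"RDF"}},
    "mail-42": {"text": "Meeting notes about the RDF schema", "categories": {"Meetings", "RDF"}},
    "recipe.txt": {"text": "How to cook pasta", "categories": {"Cooking"}},
}

by_text = full_text_search(documents, "rdf schema")
by_category = taxonomic_search(documents, ["RDF"])
```

The full-text search depends on the wording of each document, while the taxonomic search finds everything filed under a category regardless of wording.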

Compare

Here is my favorite, and I have been working on it for quite a while. This is quite a complex digital animal, which will tell you how “near” one piece of information is to another. Its output is n-dimensional vectors, which have to be weighted again to produce a human-readable result. Going into more detail would go beyond the scope of this documentation, but I will tell you what it does. You take one or more documents (or references to documents, like search results) and run them through it. As a result you get an indication (in percent) of how “near” these documents are to each other by the semantic “meaning” of their content.

Other tools will come with time, when I stumble over them or just exchange some pieces in search of something better. Now for the final thing:
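The actual Compare algorithm is not described here, so the following Python sketch only illustrates the general idea of reducing vectors to a percentage. It swaps in a well-known stand-in, the cosine similarity of simple word-count vectors, which measures shared vocabulary rather than semantic meaning; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a vector of word counts (one dimension per word)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def compare(text_a, text_b):
    """Cosine similarity of the two term vectors, expressed in percent."""
    a, b = term_vector(text_a), term_vector(text_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                       # an empty document is near to nothing
    return 100.0 * dot / (norm_a * norm_b)


nearness = compare("rdf taxonomy", "rdf email")
```

Two documents with identical vocabulary score close to 100 percent and documents without a common word score 0; anything closer to semantic nearness would need weighting on top of this.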

Categorize

Categorize groups the simple functions discussed above together to provide more complex (automated) functionality. Categorize will create taxonomic trees from any information, relying on specific arrangements of the tools outlined above, driven by the working context. Categorize will extract semantic concepts, sort them, search and expand on them and finally deliver a taxonomic proposal.

Here are some examples of Categorize compositions:

I would build an “Intelligence” function, or agent, if you like. If I drop something onto it (maybe a document about RDF), it tries to find people (Contacts in this case) with corresponding attributes in their taxonomies (RDF-related documentation). That is done by:
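Since the point of Categorize is that the arrangement of tools can be changed, one possible shape is a list of steps that are run in order. The Python sketch below uses trivial placeholders for the real tools; every function name and the sample glossary are hypothetical.

```python
def categorize(document, steps):
    """Run `document` through `steps` in order; each step reads and extends a dict."""
    state = {"document": document}
    for step in steps:
        state = step(state)
    return state


def extract_concepts(state):
    # Placeholder for Summarize: distinct words longer than three letters, sorted.
    words = state["document"].lower().split()
    return {**state, "concepts": sorted({w for w in words if len(w) > 3})}


def expand_concepts(glossary):
    # Placeholder for Expand: attach a glossary entry to each concept that has one.
    def step(state):
        found = {c: glossary[c] for c in state["concepts"] if c in glossary}
        return {**state, "expanded": found}
    return step


def propose(state):
    # The taxonomic proposal: one category per concept, pointing at the document.
    return {**state, "proposal": {c: [state["document"]] for c in state["concepts"]}}


pipeline = [extract_concepts, expand_concepts({"taxonomy": "a classification scheme"}), propose]
result = categorize("RDF taxonomy nodes", pipeline)
```

Rearranging the composition then means nothing more than passing a different list, for example `[extract_concepts, propose]` to skip the expansion.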

  • Analyze the item
  • If it's not categorized, propose an association with my taxonomy through Summarize and Compare
  • If it has one or more categories, search for these taxonomic categories in all available sources
  • If there are seven or fewer results, display them
  • If there are more than seven results, get the corresponding document summaries from the located peer systems
  • Run the summaries through Compare
  • Give me the best five results and prompt for automatic copies in my contacts and further action
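
The branching in these steps can be sketched in Python as follows. The search, the retrieval of summaries from peer systems and Compare are passed in as functions, because their real implementations are not specified here; all names are hypothetical, and the proposal step for uncategorized items is left out.

```python
def intelligence(item, search, fetch_summary, compare, limit=7, best=5):
    """Return the contacts to display for `item`.

    search(categories)       -> list of contact names found in all sources
    fetch_summary(contact)   -> document summary from that contact's peer system
    compare(text_a, text_b)  -> nearness of the two texts in percent
    """
    if not item["categories"]:
        # Uncategorized: an association would be proposed first (not shown here).
        return []
    results = search(item["categories"])
    if len(results) <= limit:
        return results                   # few enough to display directly
    scored = [(compare(item["text"], fetch_summary(c)), c) for c in results]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))   # best score first
    return [contact for _, contact in scored[:best]]
```

The thresholds of seven and five are taken from the steps above and kept as parameters so that they can be changed without touching the logic.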

I would like a “Related” hint box in the context-block. While I type an email, references (links) to documents with related content, together with their summaries, would scroll through the box. If I am interested in an item, I click on the link and get the corresponding document. That is done by:

  • Analyze the semantic concepts of my writing through Summarize
  • Look for matches in my taxonomy with Compare
  • Look for matches in all other (configurable) sources
  • Get back the results and display them continuously by link and summary

An “Explain” function would be nice: I drag or right-click a word out of a document (e.g. RDF) and use the “Explain” function. As a result it would show the explanation together with a link to the W3C Web page. That is done by:

  • Look up the explanation for RDF in the connected glossary through Expand
  • Get the link from Google and open it in a browser
  • Categorize the URL of the opened Web pages in my taxonomy

I would have a “Smell” button: I drop a file on the “Smell” button and get a result of “similar” files together with a list of similarities and differences. That is done by:

  • Conventional search for other copies of the file
  • Taxonomic search of available sources
  • Running Diff and Match on the discovered Files
  • Display the results and prompt for further Action

I would also have an Analyze, a Count and a Search button, each of which would implement exactly one function from the toolbox.

ToolBoxImplementationReference20031223
GeneralDocumentationReference20031223

-- BernhardGroehl - 28 Dec 2003

Open Source Applications Foundation
Except where otherwise noted, this site and its content are licensed by OSAF under a Creative Commons License, Attribution Only 3.0.
See list of page contributors for attributions.