Just an Idea

I have this idea for a Content Management System.

I know, who doesn’t, right?

It would not be just a content management system, though. It would also be a way to take notes and keep track of various pieces of data. The main idea is that people intuitively think in sets, even if they don’t want to talk about things that way. Venn diagrams make sense to people, and that’s the way this system would work. It would also strive to be efficient in terms of the time it takes to find something. The trade-off is that a lot of precomputation would be necessary to put the data into the system’s internal format. The second main idea is that everything is a document, so that any piece of the system can be included in another piece seamlessly. It would also be good to incorporate a web browser, so that people could visually separate pieces of pages into new document units. I’m going to try to explain each piece now.

The Sets

I feel that a set is a good way to set up a CMS. In particular, a set cannot have circular references. Allowing them would be novel, but I don’t think it’s a necessity for most people. With the document structure I have rolling around in my head, one could come up with an easy hack and fake it. The document structure is just that: everything is a document. So every document is the leaf of a tree (more about that later). Further, every document can be tagged with as many tags as are necessary. The tags themselves are the names of sets. Therefore, if a document is tagged with Tag A, it is an element of the set Tag A.

Now sometimes it makes sense to give a value to a tag, for instance the Date tag. So a document could have a tag Date=20090212. Further, a document could have any number of Date tags, depending on the purpose of the document. In that case, the document would be in the Date set, but also in the Date=20090212 set, and in the set for any other date value it carries. This also gives a way for documents to be ‘near’ other documents, which is important to the search methodology. Part of the precomputation would be to place every document tagged like Date=XXXX02XX into the February (or similarly named) set.

So in a search for “February”, a disambiguation document (everything is a document, including search results) would be served, displaying all tags with the matching display name of February. Tags must have the option of sharing the same display name, because English has words that are spelled the same but mean different things (homographs). Take a minute to consider this minute point. The user can click through to whichever definition they were interested in, or to the results displayed in the preview modules (a la a Google search for “dakota” circa 20090212). Similarly, the URLs for these documents will be configurable to have a display version as well as an internal one. The server (or an extra layer) would accept the display URLs and map them to the internal representation. The idea here is that if the semantics of a document change, the internal representation will not have to.
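
To make the tag idea a bit more concrete, here is a rough sketch in Python of how tags, tag values, and shared display names could fit together. Everything in it is an assumption made for illustration: the `TagIndex` class, the internal set name `Date.month=02`, and the document ids are all made up, and a real system would precompute far more than the month.

```python
from collections import defaultdict

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


class TagIndex:
    """Hypothetical index: every tag is the name of a set of document ids."""

    def __init__(self):
        self.sets = defaultdict(set)           # internal set name -> document ids
        self.display_names = defaultdict(set)  # display name -> internal set names

    def _add(self, doc_id, internal, display):
        self.sets[internal].add(doc_id)
        self.display_names[display].add(internal)

    def tag(self, doc_id, name, value=None):
        # A tagged document is always an element of the plain tag set.
        self._add(doc_id, name, name)
        if value is not None:
            # A valued tag also puts the document in the Name=value set.
            self._add(doc_id, f"{name}={value}", f"{name}={value}")
            if name == "Date":
                # Precomputation: Date=YYYYMMDD also lands in a month set
                # whose display name is the month's name.
                month = value[4:6]
                self._add(doc_id, f"Date.month={month}", MONTHS[int(month) - 1])

    def disambiguate(self, display):
        """Return every set sharing a display name, with its documents."""
        names = sorted(self.display_names.get(display, ()))
        return {internal: sorted(self.sets[internal]) for internal in names}


index = TagIndex()
index.tag("doc1", "Date", "20090212")
index.tag("doc2", "Date", "20080229")
index.tag("doc3", "February")  # an unrelated tag with the same display name

print(index.disambiguate("February"))
# {'Date.month=02': ['doc1', 'doc2'], 'February': ['doc3']}
```

The internal set names never have to change, even if the display name “February” is later renamed or shared by more tags, which is the same separation I want for display URLs versus internal ones.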

The Trees

The trees will be spanning trees of the graphs resulting from the precomputation. The documents will be nodes, and a document will share an edge with each document it contains or links to. The tags will make node sets which will, in a future step, be considered macro nodes. Spanning trees of these graphs will be determined, and these will define the refinements one may make. That is, certain searches will make use of a specific spanning tree to aid in the search. A document matching a search for “rhombus”, for instance, will likely be highly connected to other documents about geometry, so along with a tag search for Rhombus, nearest neighbors tagged Geometry may also appear. This will provide a natural “nearness” measure which can be used when other measurements are unavailable. Further, if the tags are semantic, then this will provide a high level of relevance to searches. These spanning trees will then be transformed into binary trees to aid future searching. There are a few cases which must be addressed in order to keep the trees balanced, and this will often add depth and a number of not-so-semantic divisions, but I do think it will be worth it to do this type of precomputation at the same time as more traditional computations are taking place. These will be particularly useful for internal searches based on user queries.
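
As a small illustration of the spanning tree and “nearness” idea, here is a sketch that assumes the precomputed graph is a plain adjacency dictionary and uses a breadth-first spanning tree. The function names and the toy graph are hypothetical, and breadth-first search is just one convenient way to get a spanning tree, not necessarily the one the system would end up using.

```python
from collections import deque


def spanning_tree(graph, root):
    """Return {node: parent} for a breadth-first spanning tree of root's component."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Sort neighbors so the resulting tree is deterministic.
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def tree_distance(parent, a, b):
    """Count the tree edges between a and b (an upper bound on graph distance)."""
    def path_to_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path

    steps_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in steps_from_a:  # first common ancestor
            return steps_from_a[node] + j
    raise ValueError("nodes are not in the same tree")


# Toy graph: documents are nodes, containment or links are edges.
graph = {
    "Geometry": {"Rhombus", "Square", "Triangle"},
    "Rhombus": {"Geometry", "Square"},
    "Square": {"Geometry", "Rhombus"},
    "Triangle": {"Geometry"},
    "Gettysburg Address": set(),
}

tree = spanning_tree(graph, "Rhombus")
nearest = sorted((n for n in tree if n != "Rhombus"),
                 key=lambda n: (tree_distance(tree, "Rhombus", n), n))
print(nearest)  # ['Geometry', 'Square', 'Triangle']
```

Documents that are not connected to the root at all, like the Gettysburg Address here, simply never show up in that tree.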

The Browser

Basically the browser would be a dynamic template design interface. One could edit a document and select a certain portion to be a subdocument. These subdocuments could be included as modules on other pages. The advantage of doing this in a browser-like interface is that the browser will be able to tell what is inside any closed shape that is drawn. This will allow the browser to render the subdocument independently with only the data necessary. Instead of masking the entire document to reveal only the subdocument, the browser would try to save as little information as possible while still faithfully recreating the chosen piece in another document or on its own. This necessitates an internal format that would make it possible to include entire documents in other ones. For example, in the case where one wants to include multiple full documents in one document (a slideshow, for instance), each of these included documents should still retain its metadata. Let’s say that we have two documents, the Gettysburg Address and the preamble to the Constitution, that we want to include in a single document. On their own they would be viewed (by default) as fully qualified XHTML documents. In this other document they must be encapsulated in some other structure. So instead of <body /> and <meta /> elements storing the content and metadata, we could use some <div /> elements with a “body” class and <span /> elements with a “meta” class. That way the metadata for each document would still be available in the larger document, but it would not be that document’s metadata. This means that the internal format of these documents should be easily transformed into XHTML and into this subdocument format. It is likely that XHTML will be sufficient for the internal format, with a special XSLT stylesheet to translate it into the subdocument format.
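
Here is a rough sketch of that body/meta to div/span translation. It uses Python’s standard `xml.etree.ElementTree` instead of an XSLT stylesheet, purely to show the shape of the transform. It assumes the source documents are well-formed and carry no XML namespace; the `subdocument` wrapper class, the `data-name` attribute, and the sample documents are my own placeholders.

```python
import xml.etree.ElementTree as ET


def to_subdocument(xhtml_source):
    """Wrap one document's metadata and body so it can live inside another document."""
    root = ET.fromstring(xhtml_source)
    wrapper = ET.Element("div", {"class": "subdocument"})

    # Each <meta /> becomes a <span class="meta"> that keeps its name and content.
    for meta in root.iter("meta"):
        span = ET.SubElement(wrapper, "span", {"class": "meta"})
        span.set("data-name", meta.get("name", ""))
        span.text = meta.get("content", "")

    # The <body> becomes a <div class="body"> with the same children.
    body_div = ET.SubElement(wrapper, "div", {"class": "body"})
    body = root.find("body")
    if body is not None:
        body_div.text = body.text
        body_div.extend(list(body))
    return wrapper


gettysburg = """<html><head><title>Gettysburg Address</title>
<meta name="Date" content="18631119" /></head>
<body><p>Four score and seven years ago...</p></body></html>"""

preamble = """<html><head><title>Preamble</title>
<meta name="Date" content="17870917" /></head>
<body><p>We the People of the United States...</p></body></html>"""

# The containing document (a slideshow, say) holds both subdocuments.
slideshow = ET.Element("body")
for source in (gettysburg, preamble):
    slideshow.append(to_subdocument(source))

print(ET.tostring(slideshow, encoding="unicode"))
```

Each included document keeps its own Date inside its wrapper, so the metadata is still available in the slideshow without becoming the slideshow’s metadata.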

Uncategorized ideas

Several tags like Date or Length or Geographical Location will allow for “nearness” searches.
Most documents will be meta documents like templates that include other, more basic documents.
A distinguished Link tag will be created based on any actual links or “recognizable” references to other documents.
Preview modules should stick to four or fewer results, with links to longer result listing pages. On these pages the user should be able to specify how many results to see at a time. Automatic pagination should be used. If a user specifies 10 results per page, the page will show 10 results and the back and next buttons will load the previous and next 10 results. If a user decides to show all results, they will be loaded in exponentially increasing increments. That is, if there are thousands of results, display the first 128 and notify the user that not all results are shown. The user should then be given the option of displaying all the results (with a warning that this could take a considerable amount of time) or displaying the next batch of results. The number of results retrieved will then double each time. I can foresee very few instances where a user would want to see all results on screen at once in this case. It would make more sense to let the engine refine the search.
There should be a ’search this result set’ refinement option.
The trees should be examined from the leaves up. This will save time when computing cardinalities. Further, building the balanced binary tree pieces this way will allow for simple binary refinements in most cases when necessary.
Transforms will translate the internal format into other formats for inclusion of multiple documents into various types of superdocuments.
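
For the “show all results” case above, the exponentially increasing increments could look something like this sketch. It assumes each batch is twice the size of the one before it, starting at 128, which is one reading of “double each time”; the function name is made up.

```python
def doubling_batches(results, first=128):
    """Yield successive slices of results whose sizes double: 128, 256, 512, ..."""
    start, size = 0, first
    while start < len(results):
        yield results[start:start + size]
        start += size
        size *= 2


results = list(range(1000))  # stand-in for a long result listing
sizes = [len(batch) for batch in doubling_batches(results)]
print(sizes)  # [128, 256, 512, 104]
```

Because it is a generator, the next batch is only fetched when the user actually asks for it.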

I just had to write some of this down before it got too big to remember all of it. I’ll keep adding to this post as I think of more things.