Wednesday, August 16, 2006

decaying filesystems

Wikipedia is a big collection of interconnected articles, each with its own edit history.
If we view each article+history as one object, then it will come as no surprise that links between articles link to the most recent version of the article.
But if instead we view each version of an article as a separate article, a different picture comes out. Instead of Wikipedia being a big collection of interconnected articles, we have a big collection of interconnected stacks of articles, where each stack represents the article and its previous versions.
Now why do links between articles automatically point to the 'top of the stack'? The article may be changed out of all recognition from what it was further down the stack (earlier in its history).
In fact, we arrive right at the paradox of the heap: when does an article undergoing many small incremental changes actually cross the boundary between being a modification of its earlier self and being a totally new article?
The answer is, of course, that it doesn't. The boundary is entirely artificial, and almost entirely not useful.
So this is the fix: Wikipedia article links should link through to the version of the article that is relevant to the text linking to it. Intuitive, no?
This leads to some nice behaviours: because each link points at a specific version of an article, and can be redirected to a different version when you choose, you know you're pointing at what you intended to point at. Also, there is more emphasis on the age and stability of the information you're viewing.
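To make the idea concrete, here is a minimal sketch of a version-pinned link. Everything in it (the `Article` and `Link` classes, the `follow` function, the sample text) is made up for illustration and is not how Wikipedia actually stores anything; it just shows a link that remembers which version it was written against.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """A link that names both the target article and the revision it points at."""
    title: str
    revision: int  # index into the target's history, 0 = oldest


@dataclass
class Article:
    title: str
    history: list[str] = field(default_factory=list)  # every saved version, oldest first

    def save(self, text: str) -> int:
        """Store a new version and return its revision number."""
        self.history.append(text)
        return len(self.history) - 1


def follow(wiki: dict[str, Article], link: Link) -> str:
    """Resolve a link to the exact version it was pinned to, not the newest one."""
    return wiki[link.title].history[link.revision]


wiki = {"Heap": Article("Heap")}
first = wiki["Heap"].save("A heap is a pile of sand.")
pinned = Link("Heap", first)

# Later the article is rewritten out of all recognition.
wiki["Heap"].save("A heap is a tree-based data structure.")

pinned_text = follow(wiki, pinned)      # still the version the link was written for
latest_text = wiki["Heap"].history[-1]  # the 'top of the stack'
```

Redirecting a link is then just a matter of replacing it with a new `Link` carrying a different revision number.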
But perhaps it would be tedious to have to keep updating links as articles improved over time. Perhaps there could be mechanisms that tracked people's browsing paths and updated links automatically. Or perhaps it would form the basis of a new recommendation system: major edits would gain approval from the community by getting linked to.
This system also suggests another improvement: branching articles. Each modification to an article is actually a branch. Vandalized branches would soon die, while community-approved branches would blossom.
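A rough sketch of how branches might compete, assuming that "approval" simply means inbound links. The `Revision` class and the `branch_support` and `preferred_tip` functions are invented names for this example, and counting links is only one possible measure of community approval.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Revision:
    text: str
    parent: Optional["Revision"] = field(default=None, repr=False, compare=False)
    children: list["Revision"] = field(default_factory=list)
    inbound_links: int = 0  # how many other pages link to exactly this revision

    def edit(self, new_text: str) -> "Revision":
        """Every edit starts a new branch hanging off this revision."""
        child = Revision(new_text, parent=self)
        self.children.append(child)
        return child


def branch_support(rev: Revision) -> int:
    """Links to this revision plus links to everything that grew from it."""
    return rev.inbound_links + sum(branch_support(c) for c in rev.children)


def preferred_tip(root: Revision) -> Revision:
    """Walk down from the root, always following the best-supported child."""
    node = root
    while node.children:
        node = max(node.children, key=branch_support)
    return node


root = Revision("Original article")
good = root.edit("Improved article")
vandal = root.edit("lol")            # a vandalized branch nobody links to
good.inbound_links = 3
better = good.edit("Improved further")
better.inbound_links = 1

tip = preferred_tip(root)            # follows the branch people actually link to
```

The vandalized branch is never deleted here; it just never attracts links, so nothing ever leads a reader to it.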
You could even have a system whereby short branches with low link counts (or a high proportion of links from ancient but successful branches, representing abandoned links) could be migrated to lower-quality media, or even disposed of: a sort of garbage collection for high-level human-readable information.
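As a sketch of what such a collector might decide, assuming each branch carries a length and two link counts: the `Branch` fields, the `storage_tier` function and both thresholds are arbitrary choices for illustration, not a worked-out policy.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    name: str
    length: int       # number of revisions on the branch
    live_links: int   # links from branches that are still being edited
    stale_links: int  # links from ancient branches, i.e. probably abandoned links


def storage_tier(branch: Branch, max_length: int = 3, min_live_links: int = 2) -> str:
    """Return 'keep', 'archive' or 'discard' for one branch (thresholds are made up)."""
    if branch.length > max_length or branch.live_links >= min_live_links:
        return "keep"
    if branch.live_links + branch.stale_links > 0:
        return "archive"  # still referenced somewhere: move to cheaper storage
    return "discard"      # short and unlinked: safe to collect


branches = [
    Branch("main", length=40, live_links=12, stale_links=5),
    Branch("typo-fix", length=1, live_links=0, stale_links=4),
    Branch("vandalism", length=1, live_links=0, stale_links=0),
]
tiers = {b.name: storage_tier(b) for b in branches}
```

Here "archive" stands in for migration to lower-quality media, and "discard" for outright disposal.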
Branches are also excellent in that they solve the article renaming, moving, merging and splitting problems at a stroke: because the mechanisms to redirect links quickly are already in place, it is comparatively cheap to perform these operations.
The possibilities are endless. It would be a job to get right, but I think it's something we could benefit from a lot.