Wednesday, February 22, 2006

Cognitive Librarian, part 2: Reference Clouds

A tee-shirt company has offered up a utility that scans a website and generates a word cloud for it. This offers a combination of technological sweetness and cutting-edge narcissism that I, personally, find irresistable. But one soon tires of contemplating oneself, especially when one is a mere wiggly worm in the ecology of the blogosphere. How much more interesting to use this tool to do a little amateur librarianism.

So here is a word cloud made from the text of the Constitution, sans Amendments:

(There may very well be a few internet-specific terms hovering in the cloud, because of the way I set this up and the way the program scans websites.)

Here is a word cloud made from the Bill of Rights:

Some predictable words pop out, which is as it should be - predictability is one of the things cataloguing should do.

Here is a word cloud made from the text of the Constitution, with all of its Amendments:

Note, for instance, that "people," invisible at this resolution in the Constitution, is quite prominent in the Bill of Rights, a lesser but still notable presence in the Constitution taken as a whole.

Now this, while it's more fun than useful, strikes me as very like something that could, eventually, be useful. What if someone expert in knowledge assessment, and the way the mind categorizes and organizes knowledge - say, a cognitive psychologist, or a cataloguing librarian - were to create a rich (i.e. useful, and large but manageable) bunch of categories, and then what if someone with a lot of time on his hands wrote a program that could extrapolate from words and phrases to those categories? Then one could cross-reference to other works by not just one, but any number of simultaneous categories. In other words, we would have created a multidimensional, interactive catalogue... something like what Library Thing does, only I believe it relies on user-generated tags, not any formal, well-defined system. How is this different from what search engines already do? Well, it would be different, and presumably better, because it would be designed by someone who could make it particularly useful. How? How the hell would I know? What do I look like, a librarian?

Actually, I'm sure any number of real librarians are working on this, as it is neither new nor less than obvious. But there is one thing that is new: now (thank you, Snapshirts and Project Gutenberg) I can do it myself.

Here are a few well-known works, automatically analyzed:


At 6:15 PM, Blogger Murky Thoughts said...

Seems like a stretch to go to categories, given that word meaning depends so much on context. Word clouds themselves are crude and potentially misleading, nifty though they are.

At 7:01 PM, Blogger rain_rain said...

The Library of Congress cataloguing system is also crude and potentially misleading, nifty though it is. I think tag clouds (to be more specific), or something like them, will shortly be much more than nifty. There is already interest in them from real librarians... none of whose interest or comments I have handy. But you don't have to take my word for it: I'm sure Google, crude and misleading though it is, will help you find a few.

At 8:35 PM, Blogger Murky Thoughts said...

I've hardly thought about word clouds, and you're making me think harder about them, annoyingly though you're doing it. I didn't pause long on the size issue and how it might make context and meaning clear--especially when you see certain combinations of words of like size. Also as a way to categorize a text I like that its raw and theory-independent. Still, clearly only for dorks and their librarian croneys.

At 11:50 PM, Blogger rain_rain said...

Sorry. Long day.

At 8:41 AM, Blogger Murky Thoughts said...

Well, me too. e.g. misspelled "cronies"


Post a Comment

<< Home