For many years, I could draw my full directory tree from memory. Things have changed; I'm doing more things than I can track. Today, my $HOME holds 2.4k directories, 43k files, and 1.3G bytes (this is almost all plain ASCII files -- no MS Office, no multimedia -- so 1.3G is a lot). My present filesystem has been with me without interruption since 1993, and there are old things in there that I can scarcely remember. Now, I often wander around $HOME like a stranger, using file completion and "locate" to feel my way around. I recently needed some HTML files that I was sure I had once written, but I didn't know where they were. I found myself reduced to saying:
$ find ~ -name '*.html' -print | xargs egrep -il string
which is a new low in terms of having no idea where things might be.
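The same search can be made a little more robust. The sketch below wraps it in a hypothetical helper, search_html (the name is an assumption, not something from my setup), and swaps xargs for find's own "-exec ... {} +" so that file names containing spaces survive; it also uses "grep -E" in place of egrep.

```shell
# search_html DIR PATTERN: list the *.html files under DIR whose contents
# match PATTERN (extended regex, case-insensitive).
# "-exec ... {} +" hands file names to grep directly, so names with
# spaces or other odd characters are passed through intact.
search_html() {
    find "$1" -type f -name '*.html' -exec grep -Eil -- "$2" {} +
}
```

For example, `search_html ~ string` plays the role of the command above.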
This article is a plea for help. We're all used to devoting effort to problems of information retrieval on the net. I think it's worth worrying about inner space. What lies beneath, under $HOME? How can relevant information and files be pulled up when needed? How can we navigate our own HOMEs with less bewilderment and confusion? Can software help us do this better? I know nothing about the literature on information retrieval, but this scratches my itch.
We have accumulated three different tree systems for organizing different pieces of information: the filesystem, email folders, and web browser bookmarks.
This is a mess. There should be only one filesystem, one set of folders.
Email is a major culprit. Everyone I know uses a sparse set of email folders and an elaborate filesystem, so we innately cut corners in organizing email.
We really need to make up our minds about how we treat email. Is email a channel, containing material which is in transit from the outside world to the "real" filesystem? In this case, the really important pieces of mail will get stored in their proper directory somewhere, and all other pieces of email will die. I have tried to achieve this principle in my life, with limited success.
Or is email permanent (as it is for most people), in which case material on any subject is fragmented between the directory system and email folders? If so, can email folders automatically adopt the organization of the directory system? Can email files be placed alongside the rest of the filesystem?
Web browser bookmarks are a third tree-structured organization which should not exist. It's easy to imagine a metadata.html file in every directory, with the bookmarks stored there. The browser would inherit the directory tree of $HOME, and when sitting inside any one directory, the pertinent metadata would be handy.
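Even without browser support, a per-directory metadata.html can be read from the shell. This is a minimal sketch, assuming the file holds ordinary double-quoted href="..." links and that grep supports -o (GNU and BSD grep do); the helper name bookmarks_here is hypothetical.

```shell
# bookmarks_here [DIR]: print the URLs linked from DIR/metadata.html,
# one per line. DIR defaults to the current directory.
# Assumes links are written as href="..." with double quotes.
bookmarks_here() {
    f="${1:-.}/metadata.html"
    [ -f "$f" ] || return 0          # no metadata here: print nothing
    grep -o 'href="[^"]*"' "$f" | sed 's/^href="//; s/"$//'
}
```

Running `bookmarks_here` inside a project directory would then list the bookmarks that belong to that project.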
Dhananjay Bal Sathe pointed out to me another source of escalation of the complexity of filesystems. This only affects users of software from Microsoft, so I'd never encountered it. It is MS's notion of "compound files", which are objects which look like normal files to the OS but are actually full directory systems (I guess they're like tarfiles). Since the content is hidden inside the compound file, you cannot use ordinary OS tools to navigate inside this little filesystem; only the application that made the compound file can. He feels that if compound files had been treated as ordinary directories of the filesystem, it would have been a "simple, beautiful, elegant" and largely acceptable solution instead of the mess which compound files have created.
If you use file utilities to navigate and search inside the filesystem, you will encounter some email. I use the "maildir" format, which is nice in that each piece of email lies in a separate file. However, MIME formats are a problem. When useful text is kept in MIME form, it's harder for tools to search for and access it.
MIME is probably a good idea when it comes to moving documents from one computer to another, but it seems to me that once email reaches its destination, it is better to store files in their native format.
In my dream world, each directory has all the material on a subject (files, email, or metadata), and file utilities work correctly, without being blocked by MIME-encoded files.
Geetanjali Sampemane pointed out that this is related to the questions about content-based filesystems, and suggested I look at a paper by Burra Gopal and Udi Manber on the subject (ask Google for it).
Postscript and PDF have worked wonders for document transmission over the Internet, but this has helped escalate the complexity of inner space:
While I'm on this subject, I should describe a file naming
convention I've evolved which seems to work well. I like it if a file
is named Authoryyyy_string.pdf; this encodes the lastname of the
author, the year, and a few bytes of a description of what this file
is about. For example, I use the filename
SrinivasanShah2001_fastervar.pdf for a paper written by
Srinivasan and Shah in 2001 about doing VaR faster.
I also take care to use this Authoryyyy_string as the key in my
.bib file, so it's easy to move between the bibliography file and the
documents. I often use regular expression searches on my bibliography
file, and once I know I want a document, I just say
Authoryyyy to track it down.
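One way to move between the bibliography and the documents is sketched below. The helper names find_by_key and bib_keys are hypothetical, and the sketch assumes that each BibTeX entry puts its key on the same line as the opening "@type{".

```shell
# find_by_key KEY [DIR]: list files under DIR (default $HOME) whose
# name starts with the citation key, e.g. SrinivasanShah2001.
find_by_key() {
    find "${2:-$HOME}" -type f -name "$1*" -print
}

# bib_keys FILE: print the citation keys of a BibTeX file, one per line.
# Assumes entries look like "@article{Authoryyyy_string," on one line.
bib_keys() {
    sed -n 's/^@[A-Za-z]*{ *\([^, ]*\) *,.*$/\1/p' "$1"
}
```

With these, `find_by_key SrinivasanShah2001` would print the path of the PDF, and `bib_keys refs.bib` (refs.bib being a placeholder name) lists every key that ought to have a document somewhere.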
I'm not an expert on information retrieval, so these are just some ideas on what might be possible, from a user perspective.
Dhananjay Bal Sathe
reminded me that there is a good case for doing this on
a more ambitious scale, to comprehensively support URLs as
files so one would be able to say
$ cp URL file
$ lynx http://fqdn/path/a.html
:-) and it should work just fine. This goes beyond just symlinks.
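Real support for URLs as files would have to live in the filesystem or the C library, but the flavour of "cp URL file" can be approximated in user space. This is only a sketch: urlcp is a hypothetical wrapper, and it assumes curl is installed.

```shell
# urlcp SRC DEST: behaves like cp, except that SRC may also be an
# http, https or ftp URL, in which case it is fetched with curl.
# A user-level approximation only; ordinary programs still cannot
# open URLs as if they were files.
urlcp() {
    case "$1" in
        http://*|https://*|ftp://*)
            # -f: fail on HTTP errors, -sS: quiet but show errors,
            # -L: follow redirects, -o: write to DEST
            curl -fsSL -o "$2" "$1" ;;
        *)
            cp -- "$1" "$2" ;;
    esac
}
```

So `urlcp http://fqdn/path/a.html a.html` fetches the page, while `urlcp notes.txt backup.txt` is a plain copy.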
A tool, superstrings, which thinks intelligently about the files it is facing. If the file it faces is a normal textfile, superstrings is just strings(1), but if it faces .pdf, .ps, MIME, etc., it should extract the useful text with greater intelligence than ordinary strings(1). This can be combined with grep, etc., to improve tools for information access in the filesystem.
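A crude version of superstrings can be sketched in a few lines of shell. The sketch dispatches on the file extension only, assumes that pdftotext (from poppler or xpdf) and ps2ascii (from Ghostscript) are installed, and leaves MIME decoding out entirely; pick_extractor is a hypothetical helper name.

```shell
# pick_extractor FILE: name the program to use for FILE, judged purely
# by its extension (a real tool would inspect the contents as well).
pick_extractor() {
    case "$1" in
        *.pdf|*.PDF) echo pdftotext ;;
        *.ps|*.PS)   echo ps2ascii ;;
        *)           echo strings ;;
    esac
}

# superstrings FILE...: write the readable text of each file to stdout.
# Assumes pdftotext and ps2ascii are installed; MIME is not handled.
superstrings() {
    for f in "$@"; do
        case "$(pick_extractor "$f")" in
            pdftotext) pdftotext "$f" - ;;   # "-" sends the text to stdout
            ps2ascii)  ps2ascii "$f" ;;
            *)         strings "$f" ;;
        esac
    done
}
```

Something like `superstrings SrinivasanShah2001_fastervar.pdf | grep -i string` then searches inside the PDF rather than its compressed bytes.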
In summary, people working in information retrieval are focused on searching the Web, but I think we have a real problem lurking in our backyard. Many of us are finding it harder and harder to navigate inside our HOMEs and find the stuff we need. I think it's worth putting some effort into making things better. There is a lot that ye designers of software can do to help, ranging from putting file completion into Mozilla to new ideas in indexing tools.