
Information retrieval from $HOME

Like everyone else, when I first encountered tree directory systems, I thought they were a marvelous way to organize information. I've been around computers since 1983, and have staunchly struggled to keep files and directories neatly organized. My physical filing cabinet has always been a mess, but I clung to the hope that my hard disk would be perfect.

For many years, I could draw my full tree directory from memory. Things have changed; I'm doing more things than I can track. Today, my $HOME is 2.4k directories, 43k files, and 1.3G bytes (this is almost all plain ASCII files -- no MS Office, no multimedia -- so 1.3G is a lot). My present filesystem has been with me, uninterrupted, since 1993, and there are old things in there that I can scarcely remember. Now, I often wander around $HOME like a stranger, using file completion and "locate" to feel my way around. I recently needed some HTML files that I was sure I had once written, but I didn't know where they were. I found myself reduced to saying:

	$ find ~ -name '*.html' -print | xargs egrep -il string

which is a new low in terms of having no idea where things might be.

This article is a plea for help. We're all used to devoting effort to problems of information retrieval on the net. I think it's worth worrying about inner space. What lies beneath, under $HOME? How can relevant information and files be pulled up when needed? How can we navigate our own HOMEs with less bewilderment and confusion? Can software help us do this better? I know nothing about the literature on information retrieval, but this scratches my itch.

Multiplicity of trees

We have accumulated three different tree systems for organizing different pieces of information:

  1. The filesystem
  2. Email folders
  3. Web browser bookmarks

This is a mess. There should be only one filesystem, one set of folders.

Email is a major culprit. Everyone I know uses a sparse set of email folders alongside an elaborate filesystem, so we are evidently cutting corners in organizing email.

We really need to make up our minds about how we treat email. Is email a channel, containing material which is in transit from the outside world to the "real" filesystem? In this case, the really important pieces of mail will get stored in their proper directory somewhere, and all other pieces of email will die. I have tried to achieve this principle in my life, with limited success.

Or is email permanent (as it is for most people), in which case material on any subject is fragmented between the directory system and email folders? If so, can email folders automatically adopt the organization of the directory system? Can email files be placed alongside the rest of the filesystem?

Web browser bookmarks are a third tree-structured organization which should not exist. It's easy to imagine a convention of keeping a metadata.html file in every directory and storing the bookmarks there. The browser would inherit the tree directory structure of $HOME, and when sitting inside any one directory, the pertinent metadata would be handy.
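Until browsers learn to do this natively, the convention can be approximated with a small script. This is a sketch under stated assumptions: "metadata.html" is the per-directory convention proposed above, and gather_bookmarks is an invented name, not an existing tool. It splices every metadata.html under a root into one page any browser can load:

```shell
#!/bin/sh
# gather_bookmarks [ROOT]: concatenate every metadata.html under ROOT
# (default $HOME) into one HTML page on stdout, one section per
# directory, so the bookmark "tree" is just the directory tree.
gather_bookmarks() {
  root=${1:-$HOME}
  echo "<html><body>"
  find "$root" -name metadata.html -print | sort | while read -r f; do
    echo "<h2>$(dirname "$f")</h2>"   # section heading = directory
    cat "$f"                          # the links kept in that directory
  done
  echo "</body></html>"
}
```

One could run this from cron and point the browser's home page at the output file.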

Dhananjay Bal Sathe pointed out to me another source of escalation in the complexity of filesystems. This only affects users of software from Microsoft, so I'd never encountered it. It is MS's notion of "compound files": objects which look like normal files to the OS but are actually full directory systems (I guess they're like tarfiles). Since the content is hidden inside the compound file, you cannot use ordinary OS tools to navigate inside this little filesystem, only the application that made it. He feels that if compound files had been treated as ordinary directories in the filesystem, it would have been a "simple, beautiful, elegant" and largely acceptable solution instead of the mess which compound files have created.

Non-text files

If you use file utilities to navigate and search inside the filesystem, you will encounter some email. I use the "maildir" format, which is nice in that each piece of email lies in a separate file. However, MIME formats are a problem. When useful text is kept in MIME form, it's harder for tools to search for and access it.

MIME is probably a good idea when it comes to moving documents from one computer to another, but it seems to me that once email reaches its destination, it is better to store files in their native format.

In my dream world, each directory has all the material on a subject (files, email, or metadata), and grep would work correctly, without being blocked by MIME-encoded files.

Geetanjali Sampemane pointed out that this is related to the questions about content-based filesystems, and suggested I look at a paper by Burra Gopal and Udi Manber on the subject (ask Google for it).

PDF and postscript documents

Postscript and PDF have worked wonders for document transmission over the Internet, but this has helped escalate the complexity of inner space:

  • As with MIME, .ps and .pdf files are not amenable to regular-expression searches the way text files are.
  • An interesting and subtle consequence of the proliferation of .ps and .pdf files in my filesystem is that a larger fraction of the files there are alien. In the olden days, every file that was in my filesystem was mine. It used my file naming conventions, etc., so when I wandered around my filesystem, I knew my way. Today, there are so many alien files hanging around that it reduces my confidence that I know what is going on.
  • Every now and then, I notice a .pdf file "which is going to be invaluable someday", and snarf it. If I'm lucky, it has a sensible filename, and if I'm lucky, I'll place it in the correct place in my filesystem. In this case, there's a bit of a hope that it'll get used nicely in the future. Unfortunately, a lot of people use incomprehensible names for .pdf files, such as ms6401.pdf, seiler.pdf, D53CCFF4C9021C19988841169FB6FD6EC1D56F711.pdf, and sr133.pdf. I find that interactive programs like Web browsers, email programs, etc. are clumsy at navigating tree directories, so my habit is to save into /tmp, then move the file using the commandline. Sometimes, I'm in too much of a hurry, and this gets messed up. Now and then, I place an incoming file into $HOME/JUNKPDF, hoping that I'll get around to organizing it later.

While I'm on this subject, I should describe a file naming convention I've evolved which seems to work well. I like it if a file is named Authoryyyy_string.pdf; this encodes the last name of the author, the year, and a few bytes of description of what the file is about. For example, I use the filename SrinivasanShah2001_fastervar.pdf for a paper written by Srinivasan and Shah in 2001 about doing VaR faster.

I also take care to use this Authoryyyy_string as the key in my .bib file, so it's easy to move between the bibliography file and the documents. I often use regular expression searches on my bibliography file, and once I know I want a document, I just say locate Authoryyyy to track it down.
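A tiny helper can take the pain out of applying the convention at snarf time. This is a convenience sketch, not an existing tool; file_paper is an invented name:

```shell
#!/bin/sh
# file_paper SRC AUTHOR YEAR DESC DESTDIR
# Move a freshly downloaded PDF to DESTDIR under the
# Authoryyyy_string.pdf convention, and echo the new path so it can
# be pasted into the .bib file as the key.
file_paper() {
  src=$1; author=$2; year=$3; desc=$4; destdir=$5
  dest="$destdir/${author}${year}_${desc}.pdf"
  mv "$src" "$dest" && echo "$dest"
}
```

For example: file_paper /tmp/ms6401.pdf SrinivasanShah 2001 fastervar ~/finance/var (the destination directory here is hypothetical).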

Some suggestions

I'm not an expert on information retrieval, so these are just some ideas on what might be possible, from a user perspective.

  • Email and Web bookmarks. As mentioned above, we really need a solution to the problem of email folders versus Web bookmark folders versus the filesystem. I'd like a MUA and a Web browser which treat my normal filesystem as the classification scheme and save information into the corresponding directories. Every time I change the directory structure, the MUA and browser should pick up the new structure automatically.
  • Fulltext search. I think we should have fulltext search engines which are hooked into the filesystem. Every time a file under $HOME changes, the search engine should update its indexes. Like Google, this search engine should walk into .html, .pdf, and .ps files and index all the text found therein. This will give us the ability to search inside inner space.
  • URLs-as-symlinks. If we had a fulltext search engine which worked on $HOME, it'd be nice if we could have a concept of a symlink which links to a URL. This reduces overhead in the filesystem, and ensures that one is always accessing the most recent version of the file (in return, one suffers from the problem of stale links, but hopefully producers of information will be careful to leave redirects). By placing symlinks into my directory, I'd feed PDF or PS files into the universe that my personal search engine indexes. These files would be just as usable as normal downloaded files as far as Unix operations such as reading, printing, emailing, etc. are concerned. Web browsers should give me a choice between downloading the file and placing a symlink with a filename of my choice in a directory of my choice.

    Dhananjay Bal Sathe reminded me that there is a good case for doing this on a more ambitious scale: comprehensively supporting URLs as files, so that one would be able to say
    $ cp URL file
    or
    $ lynx http://fqdn/path/a.html
    and it should work just fine. :-) This goes beyond just symlinks.
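    Without kernel support, the symlink half of the idea can at least be approximated in user space. A sketch, with urlln and urlcat as invented names: the "symlink" is a tiny placeholder file holding the URL, and following it means fetching the URL (wget is assumed to be installed for the fetch):

```shell
#!/bin/sh
# urlln URL NAME.url : create a poor man's symlink-to-URL by storing
# the URL in a tiny placeholder file.
urlln() {
  echo "$1" > "$2"
}
# urlcat NAME.url : "follow the link" -- fetch the current version of
# the document to stdout, so it can be piped to a viewer or printer.
urlcat() {
  wget -q -O - "$(cat "$1")"
}
```

    A search engine walking $HOME could recognize *.url files and index their targets at lower priority, as suggested under quality scoring below in spirit, though that wiring is left out here.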

  • Digital libraries. I have seen software systems like Greenstone which do a good job of being digital library managers, and they may be part of the solution.
    I have sometimes toyed with the idea of using a digital library manager for all alien files. I could have a lifestyle in which every time I got a .pdf or .ps file from the net, I would simply toss it at the digital library software. (It would be nice if Mozilla and wget supported such a lifestyle with fewer keystrokes.) The digital library manager of my dreams would extract all the text from these files and fulltext index them (something that most library managers do not do), and it would not force me to type too much information about the file (which most of them do).
    The logical next step of this idea is a digital library manager which just scours my $HOME ferreting out all files and fulltext indexing them, and that seems like a better course. In this case, it's just my fulltext search engine which indexes everything in $HOME.
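    A crude first cut at such an indexer is a shadow tree of extracted text, so that ordinary grep works across formats. This is a sketch under assumptions: index_home is an invented name, and pdftotext/ps2ascii are assumed to be installed for the .pdf and .ps branches (files whose extractor is missing are simply skipped):

```shell
#!/bin/sh
# index_home ROOT INDEXDIR: mirror every file under ROOT into INDEXDIR
# as extracted plain text, so "grep -ril string INDEXDIR" searches
# inner space across .pdf, .ps, and plain files alike.
index_home() {
  root=$1; index=$2
  find "$root" -type f | while read -r f; do
    shadow="$index${f#$root}.txt"
    mkdir -p "$(dirname "$shadow")"
    case "$f" in
      *.pdf) command -v pdftotext >/dev/null && pdftotext "$f" "$shadow" ;;
      *.ps)  command -v ps2ascii  >/dev/null && ps2ascii  "$f" > "$shadow" ;;
      *)     cat "$f" > "$shadow" ;;
    esac
  done
}
```

    A real implementation would update the shadow incrementally as files change rather than re-walking $HOME, but even this brute-force version makes "grep everything I own" possible.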
  • Bibliographical information for the library manager. One path for progress could be for people who publish .pdf and .ps documents on the Web to adopt standards through which XML files containing bibliographical information about them are also made available. Every URL http://path/file.pdf should be accompanied by a http://path/file.bib.xml which contains this information.
    I know of one initiative -- RePEc -- in which people supplying .pdf or .ps files also supply bibliographical information about them, but I think it's not quite there yet; it requires too much overhead. The proposal above is simpler. Every time a client fetches http://path/file.pdf, it can test for the existence of http://path/file.bib.xml, and if that's found, the user is spared the pain of typing bibliographical information into his digital library manager.
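    The client-side probe is trivial. A sketch, with bib_url and fetch_with_bib as invented names and wget assumed available; the .bib.xml companion is the convention proposed above, not an existing standard:

```shell
#!/bin/sh
# bib_url URL: map http://path/file.pdf to its proposed companion
# http://path/file.bib.xml.
bib_url() {
  echo "${1%.pdf}.bib.xml"
}
# fetch_with_bib URL: fetch the paper, then quietly probe for the
# companion bibliography file; complain if the publisher supplies none.
fetch_with_bib() {
  wget -q "$1" && { wget -q "$(bib_url "$1")" \
    || echo "no bibliographic companion for $1" >&2; }
}
```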
  • A user interface for supplying a path. When a file is being downloaded, the user is required to supply a filename and a path. I would really like it if authors of software (like Mozilla) gave us a commandline with file completion to do this. I find the GUI interaction that they force me to have extremely inefficient, and it costs so much time that when I'm in a hurry, I tend to misclassify an incoming file. File completion is the fastest way to locate a directory inside a filesystem, and I think I should at least have the choice of configuring Mozilla to use it instead of the silly GUI interface. When we re-engineer Unix to make it easy-to-learn, we should not give up easy-to-use.
  • Quality scoring in inner space. A search string will get hundreds of hits on a fulltext search engine, so how can software give us a better sense of which are the important documents and which aren't? In the problem of searching inside inner space, Google's technology (of counting hyperlinks to you) will not work. A few things that might help in inventing heuristics:
    1. The most recently read or written files should be treated as more important.
    2. Files that are accessed more often should be treated as more important. (This will require instrumenting the filesystem component inside the kernel.)
    3. Makefiles articulate relationships between files. An information retrieval tool that crawls around $HOME should use this information when it exists. Targets in makefiles are less important, and files mentioned in make clean or make squeaky are less important.
      As an example, such intelligence would really help an information retrieval tool which hit my $HOME. In every document directory, I have a Makefile, and the tool could use it to learn that a few .tex files matter, and the .pdf or .ps files do not (since they are all produced by the Makefile, and mentioned in make clean and make squeaky).
    4. "My files are more important than files by others" is a useful principle, but it's difficult to accurately know the authorship of a file. The URLs-as-symlinks idea (mentioned earlier) can help. If I have snarfed a .pdf file down into a directory, the search engine has no way of knowing that it's an alien file. If I have left a symlink to the .pdf file, the search engine knows this should be indexed, but at a lower priority.
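    Heuristics 1 and 2 are easy to prototype from timestamps the filesystem already keeps, without any kernel instrumentation. A sketch using GNU find; rank_recent is an invented name, and an access-time variant would use %A@ in place of %T@ (on filesystems that record atime):

```shell
#!/bin/sh
# rank_recent ROOT: list files under ROOT, most recently modified
# first -- a crude stand-in for "recently touched means important".
rank_recent() {
  find "$1" -type f -printf '%T@ %p\n' | sort -rn | cut -d' ' -f2-
}
```

    A fulltext search engine could use such a ranking to break ties among the hundreds of hits a query returns.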
  • Less is more -- how to store less. One way to reduce the complexity of the filesystem is to help people feel comfortable about not downloading from the net. When I see a page on the net that looks interesting, I tend to download it and keep a local copy, partly because I'm thinking that I might not be able to find it later.
    Instead, I'd like to hit a button on the browser which talks to Google and says "I think this page could be useful to me." From this point on, when I do searches with Google, this page should earn a higher relevancy score. If a large number of people used Google in this fashion, it would be a new and powerful way for Google to obtain information about the quality of pages on the Web.
  • Superstrings. I think we need a tool called superstrings which thinks intelligently about the files it is facing. If the file it faces is a normal textfile, superstrings is just strings(1), but if it faces .pdf, .ps, MIME, etc. it should extract the useful text with greater intelligence than ordinary strings(1). This can be combined with grep, etc., to improve tools for information access in the filesystem.
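    A first cut at superstrings is just a dispatcher over file type. The extractors below (pdftotext, ps2ascii) are assumptions about what is installed; anything unrecognized falls back to plain strings(1):

```shell
#!/bin/sh
# superstrings FILE...: extract readable text with per-format
# intelligence, falling back to strings(1) for unknown formats.
superstrings() {
  for f in "$@"; do
    case "$f" in
      *.pdf) pdftotext "$f" - ;;   # PDF text to stdout
      *.ps)  ps2ascii "$f" ;;      # PostScript to ASCII
      *)     strings "$f" ;;       # everything else: printable runs
    esac
  done
}
```

    This composes with the usual tools, e.g.: superstrings paper.pdf | egrep -i var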
  • Help me delete files. Deleting files is one important way of reducing complexity. I'd like to get data about what parts of my filesystem I am never reading/touching. I could launch into spring cleaning every now and then and blow away files and directories that are really obsolete, supported by evidence about what I tend to use and what I tend to ignore. Note that I'm only envisioning a decision support tool, not an automated tool which deletes infrequently-used files. (Once again, this will require instrumenting the filesystem component inside the kernel.)
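Even without kernel instrumentation, access times already recorded by the filesystem give a rough version of this decision support. A sketch; stale_report is an invented name, and it assumes a filesystem that records atime (i.e., not mounted noatime):

```shell
#!/bin/sh
# stale_report [ROOT]: list files not read in over a year, largest
# first (size in bytes, then path).  It only prints candidates;
# deleting stays a human decision.
stale_report() {
  find "${1:-$HOME}" -type f -atime +365 -printf '%s %p\n' | sort -rn
}
```

Running this before spring cleaning turns "what do I never touch?" from guesswork into a ranked list.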

In summary, people working in information retrieval are focused on searching the Web, but I think we have a real problem lurking in our backyard. Many of us are finding it harder and harder to navigate inside our HOMEs and find the stuff we need. I think it's worth putting some effort into making things better. There is a lot that ye designers of software can do to help, ranging from putting file completion into Mozilla to new ideas in indexing tools.

Recent comments

30 Aug 2001 00:44 anim8

The right software ...
... is the answer. Better yet, the right 'filesystem' ... an RDBMS filesystem where files can be categorized and tagged quickly (point-and-click -- not hand-typed). Then, while browsing your filesystem any one file could potentially be found under more than one 'directory'.

The methods mentioned in the article are sound. But trying to keep ALL your files on disk seems wasteful. Old stuff not seen in years should probably be backed-up, catalogued and removed.

Hierarchical storage requires rigid organization and diligent maintenance. It has probably outlived its usefulness.

30 Aug 2001 01:36 jfgg

The Semantic Web
There is a nice article, The Semantic Web (www.scientificamerican...), by Tim Berners-Lee et al. on the Web aspect of information storage/retrieval.

(subtitle "A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities")

30 Aug 2001 02:46 nutbar

It all comes down to organization skills
I've not been dealing with computers as long as most people, but in my experience, you can keep any operating system or filesystem neat, tidy, easy to search, and free of unknowns simply by keeping it organized yourself.

Most Linux distributions suffer from tons of useless garbage lying around in common directories such as /usr/bin (bet you don't even know what half those programs are for!) and the like. I find that most people who use computers tend to be messy slobs when it comes to organizing their data.

My $HOME is virtually blank; I have my public_html and a couple of things I save for later reference, which I quickly move or delete when I'm done with them. I've tidied up every directory in Linux, including /usr/bin (and I know what every program does). I know where every file is, and only resort to using locate or find when I need a list of files matching a specific pattern (e.g., manually removing a program after an install where I wasn't watching 'make install'). My Windows partitions are kept spanking clean, and I know what stuff should and shouldn't be there. The only exception is my System directory in Windows; I just try to keep from installing useless programs and to know which .dlls I need.

So when you step back and take a look at it all, you don't need to re-invent the wheel on where things should be stored, how to store them, and all that jazz.

Don't collect so much useless crap you think might be useful -- keep it if it IS useful and bloody well use it, and when you're done, delete what you don't need!

If you don't like a file name, read up on the ren/mv commands! Amazingly I have yet to come across a file system that DOESN'T let you rename files!

As for other data files that contain non-plaintext data, use a useful filename (after all, the name itself is text, so renaming the file will not break anything) and store it somewhere that makes sense. JUNKPDF isn't such an example; try something like Filesystems-renaming.pdf.

The only flaw in my rant is that certain software bundles do have dorks who love to make a mess of your nicely structured filesystem. I just don't use their software, or, if it's for Linux and I've got their source, I try to edit what I can to make it better (then send them a .patch -- usually pisses them off :)

30 Aug 2001 04:09 afields

Right On!
I think you raise a very important issue. This all makes me think of the "Future Vision" section of the Namesys page, where Hans Reiser talks about the need for a mathematical closure between applications and for bringing more advanced features into the FS, much like the idea of 'the database is the filesystem'.

Sometimes I am rather stumped as to how I can organize all my files well, simply because of the sheer volume. Sometimes I wonder where it all comes from. =) One thing is for sure: the current system isn't making it natural to file it all away as it comes in.

Because I hate recasting my thoughts into the separate islands that are file formats (specifically, new file formats that I am unfamiliar with), just to decode them at a later date, I don't feel like I can naturally organize my files, ideas, correspondence, etc. in an intuitive or overall advanced fashion. There is always the fear that once it is all organized, it will be organized in the wrong file format or directory structure for when I need to use it all again, or that when the next file format comes along, I'll have to do it all over again. Then there is the issue of remembering what the files I place in directories are for, where and when I saved that specific file I have in mind, which files it related to, and how it ties into the concept... Add to that the number of computers I use for personal use, work, etc. It all seems unmanageable if I don't just cast off the old and archive it all up for that 'some day, when I am gonna organize this all'.

Additionally, when I try organizing PDFs (containing specs or product information, for instance) and all the other files that I yank off various pages from the Internet, I am at an even greater loss as to how to integrate them into a logical structure and relate them to the existing files. It is much like a problem where you can define so many bins that you don't know or remember which to drop something in, thus defeating their purpose. What if it belongs in two or more drop-boxes? And it sure would be nice to link those PDFs to the source websites and to the related searches on Google.

How do I save my concepts of linkage between all these files and URLs and emails? I can't remember it all, especially after a few years. It's a mess!!

I fear the trend towards a small country's population of different solutions above the scope of the OS, with thousands of different approaches implementing the same structure over and over, without any coordination. XML can help, but I am not confident that any markup/hypertextual system alone is enough for anything past a level of interoperability. Not that interoperability would be bad or anything, but...

It might be nice if we could get something to organize it all that is a unified standard and, heck, even cross-platform. Can anyone suggest what I should look at?

GroupWare (PHPGroupware looks promising) and tools like Livelink (not Open Source) are a start, but they all fall short in that they just build on top of the filesystem and operating system, removing the convenience of the UNIX 'everything is a file' idea, accessible at the flat scope. Maybe our filesystems just aren't advanced enough to handle the load we are trying to put on them.

In the case of commercial tools that might suit my needs, they are all too pricey. I also refuse to store my data in any Windows file formats: too many bad experiences. I don't buy the Microsoft integration concept. Call it a lack of trust that Microsoft can ever be compatible without sucking my time and dollars into a downward spiral of non-addressable bugs and unholy propriety that requires me to switch from Windows at a great cost of time in the end anyway.

I might have some ideas for the KDE team so that they can avoid the same problems. Can we please get past this dark age of the stand-alone application to something that finally draws some closure? ;)

---

Allan Fields

30 Aug 2001 04:35 afields

Re: It all comes down to organization skills
Forgive me for saying so, but the above sounds simplistic. You may be looking at the simple case, where there generally isn't a need to implement anything like this, because usage doesn't involve large volumes of information from many sources.

Deleting seems like a good workaround, but many people have home directories chock-full of "good" stuff that they really can use (if only they could correctly link it all together and index it in a timely fashion) and wouldn't think of deleting, because there WAS a reason they got it in the first place, and there is still a reason to keep it. Pruning is OK, but don't chop down the tree. (I agree with keeping your binary trees clean; why not, right? But data is a little different. Isn't that the whole reason we have a PC and not just some terminal?)

30 Aug 2001 05:08 jfw

The tree structure is one problem

I also ran into the data organization problem in 1993 (when I last lost files).

I found, among other things, the file metaphor and the strict tree structure to be a major mismatch with human cognition. Askemos (www.askemos.org/) was written to tackle the problem. It really helps.

BTW: Askemos is GPLed software (soon to be recategorized at freshmeat) which faces a legal threat at the moment. Please help keep it free: download! Thanks

30 Aug 2001 05:18 rumblefish

when in doubt use brute force

I think all these fancy techniques are not really needed. Look at history: there were a lot of early search engines and systems designed by architecture astronauts, such as WAIS, which never got anywhere. In contrast, look at the absolutely brilliant Google, which cares nothing for categories or semantics. I use Google for everything, in preference even to categorized vendor support pages for my support issues.

When in doubt, use brute force.
www.tuxedo.org/~esr/ja...

30 Aug 2001 05:37 simonsunnyboy

Re: when in doubt use brute force
Why use a search engine in your $HOME? Simply delete files you don't need, and back up everything you may need in the future to external storage such as a tape archive or a CD-R. Your $HOME will stay small and tidy. Just make sure to go through this procedure once a week or once a month. My $HOME is organized that way, even though I store HTML, downloads, pictures, and other non-plain-text information in there. I use a few well-known subdirectories, and it works perfectly. Simply tidy up! Why use complex software for things that can be achieved with a little self-discipline, or even cron jobs?

30 Aug 2001 07:07 virtualizer

What about using the remembrance agent
I have had more than decent success using Bradley Rhodes' Remembrance Agent. It does a very good job of providing me with JITIR (just-in-time information retrieval).

30 Aug 2001 07:37 jimliotier

Re: What about using the remembrance agent

Trees are inherently limited to single entry: a document can sit in only one place. Organizing documents in a single tree will inevitably hit that wall. The only way to break through is to use thesaurus-based keywords. The snag is that thesaurus building is a task of pharaonic proportions.

The quick-and-dirty approach that I used successfully when in dire need of hacking my way through 40GB of ps, pdf, txt, doc, ppt, html, and xls documents is a fulltext indexer with external parsers. Dig (freshmeat.net/projects...) has done a great job (although phrase searching is sorely lacking for now).

Thesaurus-based keyword indexation is best because documents can be hit from any semantic angle. I would love to have the time and resources to do it for my company. But in the real world, meaningful file names, a basic and sane tree, and fulltext indexing on top of that will do, cheaply.

As far as mail is concerned, the single entry tree problem is somewhat alleviated by virtual folder approaches such as with Evolution (freshmeat.net/projects...)

30 Aug 2001 07:41 jimliotier

Re: What about using the remembrance agent
Sorry, I hit "reply" and forgot to modify the title. My post's title should read : "Experience dealing with large numbers of heterogeneous documents". Relational data rules !

30 Aug 2001 08:10 sorinm

There are tools...
I had the same problem, until I discovered that there are a lot of tools that can help. Sure, one has to find all those tools and select the best of them. The starting point for me was the desire to have one (or two) places into which my important stuff goes, ideally with a common interface for all of it. And the only environment that is ready to deal with all sorts of objects is the Web. Therefore, my way of solving the problem is:

- Run a Perl- and PHP-enabled web server on your own computer
- Use a personal information system (there are several out there; I use MyPhPPim) with a web interface, connected to a MySQL database. Into that database go all your email, notes, todos, etc.
- Use a bookmark manager connected to the same MySQL server, also with a web interface
- Use a web file manager (such as phpFileFarm) to work with the pdf, html, and ps files
- Use a web photo album to keep your photos (of course with a database back end)
- Use a CVS system for ASCII work in progress, and install a web CVS front end (I use viewcvs)
- Finally, use htDig or another search engine to index the whole lot. Configure it to search in separate directories or in all of them.

Several more ideas:

- Use the same database engine (MySQL, PostgreSQL, or another) to minimize the load
- Back up all the databases and the CVS system daily, to a separate partition (or computer)
- Back up the pdf/ps/html directories weekly

And to add a little touch, make a script that checks things into the CVS system daily:

- ls -lR of important directories
- system settings

Your computer will have to work for an hour during the night, but... you have a clever system.

30 Aug 2001 10:36 Jodrell

File naming rules
More thoughts on file naming rules:

www.everything2.com/in...

30 Aug 2001 11:33 Avatar tal197

Storing files
> When a file is being downloaded, the user is required to supply a filename and a path. I would really like it if authors of software (like Mozilla) gave us a commandline with file completion to do this. I find the GUI interaction that they force me to have extremely inefficient, and it costs so much time that when I'm in a hurry, I tend to misclassify an incoming file.

This is perhaps the biggest problem -- it's so easy to just dump a file in the default
directory that people don't take a couple of seconds to put it somewhere sensible.

A solution? Get rid of the save dialog box and replace it with a draggable icon. To save, the icon is dragged to a filer window, a directory on the panel, etc. Common save destinations (e.g., the project you're currently working on) can then be kept handy along the bottom of the screen (or wherever). See here for an implementation of this system.

As any computer scientist knows, spending a little extra time storing your data can help a lot when it comes to retrieving it! BTW, I agree that an indexing agent should update as the filesystem is changed. The current massive-scan-once-a-day is slow and irritating.

30 Aug 2001 13:39 belg4mit

Mail / FS
This is exactly what MH mail (nmh) is designed for:

"nmh consists of a collection of fairly simple single-purpose programs to send, receive, save, retrieve, and manipulate e-mail messages. Since
nmh is a suite rather than a single monolithic program, you may freely intersperse nmh commands with other commands at your shell prompt, or write custom scripts which use these commands in flexible ways."

www.mhost.com/nmh/

And if you must have a GUI there is xmh and exmh, or mh-rmail for emacs etc...

30 Aug 2001 18:40 ruddo

Data and metadata
The solution to these problems has been discussed on Tom's Hardware. The proper solution is to have a filesystem that stores metadata, such as ReiserFS, and a unified interface to it, such as OMS (an XML dialect and a categorization/metadata standard for storing metadata).

Naturally, it would require operating system kernel support, VFS support in applications, and application front-end support, so it may well be a herculean task. Whatever approach is used to solve the problem, it has to keep in mind that dumping the metadata while transferring files across the Internet is unacceptable. MacOS had that solved with bundles; why they dropped support for them in Mac OS X, I don't know.

30 Aug 2001 21:15 zenlunatics

beginnings
I've been wanting a non-hierarchical organizational system for quite some time. My main reason for wanting this is to organize browser bookmarks that can belong to more than one category. So, I've written the beginnings of such a system which can be found at zenlunatics.com (www.zenlunatics.com/zl...)

It's currently somewhere around the alpha stage, and I haven't worked on it in a while. I haven't written a bookmark manager yet, but I did write an image viewer, an mp3 player, a simple note keeper, and a utility for creating catalogs from a filesystem. For the bookmark manager, I'm thinking of modifying gnobog, galeon, or maybe mozilla (suggestions welcome). After that, I'd like to tackle the filesystem, possibly with a document launcher, although I recently read about multi-session support, which may solve that problem in a different way.

Anyway I'd really appreciate any comments on zl_catalog including suggestions for a better name :-)

thanks,

sean

31 Aug 2001 01:10 nutbar

Re: It all comes down to organization skills

> ... because usage doesn't involve
> large volumes of information from
> many sources.

When you have large volumes of information from many sources, that is called a data repository, also known as a library. When you have a library, you have an interface to get the data you want. I believe that a filesystem should do nothing more than what its original intent was: to store data. If you want to retrieve data by 'searching' the stored contents of all your files, you should be using some sort of interface to retrieve that data. It's not the fault of the OS or filesystem that you can't find your stuff - it's your fault.

> ...Deleting seems like a good approach
> for a work around, but many people have
> home directories chalk full of "good"
> stuff, that they really can use ...

Organization. If you're unorganized, all that 'good stuff' is theoretically useless if you can't find it when you need it. Also, if you keep enough junk around that you think is 'good stuff' and you never really use it, chances are that when you go to use it, it's outdated by something far superior, or something else entirely (*laugh* ... ipfw, no wait, ipfwadm? nonono, ipchains!, no wait... iptables - that's it!).

Like I said, it all comes down to organizational skills - if you aren't adept at keeping structure in your data, you shouldn't be allowed to find what you want when you want it.

31 Aug 2001 01:36 afields

Re: It all comes down to organization skills

> When you have large volumes of
> information from many sources, that is
> called a data repository, also known as
> a library. When you have a library, you
> have an interface to get the data you

I don't yet.. that's exactly what I need, but I don't have an interface to "get the data" -- just an FS into which I chuck stuff until I do. Where else would I put it, lacking a repository?
I could put it into a temporary repository that doesn't yet have all the features I need, like, say, building a searchable web page set. But why do that when I can do it once, properly? No half-measures!

Yes, it is a problem of knowledge management. The knowledge management systems that exist don't do it for me, and many are commercial, so they are closed. No thanks; too many bad experiences.

None of them integrate tightly enough. I cited Livelink already; that is a good example of something that makes the web a repository, but it is commercial and doesn't have everything I would need to set up the repository properly.

> want. I believe that a filesystem
> should do nothing more than what it's
> original intent was - to store data. If
> you want to retrieve data by 'searching'
> the stored contents of all your files,
> you should be using some sort of
> interface to retrieve that data. It's
> not the fault of the OS or filesystem
> that you can't find your stuff - it's
> your fault.

I'm sorry, I know I screwed up; I'll do better next time. I knew that the filesystem wasn't designed for what I am trying to do..
That's why I've just been storing data on it, like it was intended to be used.

But wait a minute, that was my point! Maybe the filesystem needs to be extended -- not necessarily all in the kernel space, but the user space as well!

> Organization. If you're unorganized,
> all that 'good stuff' is theoretically
> useless if you can't find it when you
> need it. Also, if you keep enough junk
> around that you think is 'good stuff'
> and you never really use it, chances are

Actually, it is still useful; it's just less likely that it will be used effectively.


> Like I said, it all comes down to
> organizational skills - if you aren't
> adept at keeping structure in your data,
> you shouldn't be allowed to find what
> you want when you want it.

I'm allowed to do whatever I please with my own hardware, regardless of qualification or skill (within the bounds of law). I think I should be able to. And I am trying to better keep structure *in my data* -- not in my head! That's why the computer should store structure or metadata, not just data that requires us to worry about the enforcement of the structure as an afterthought. That is error prone, as we are not machines.

Isn't the computer supposed to help us store, retrieve, and compute information? Why not design it better to do so? I'm not the computer; the computer is the computer. I want to have it present the information in an optimal and timely fashion, so it can offload some of the burden of remembering the structure of my data.

---
Allan Fields

31 Aug 2001 02:07 afields

Re: The tree structure is one problem
Jerry,

I've taken a brief look, and find the structure a little daunting (also, unfortunately, I don't speak much German :( ) -- some of the concepts seem neat. I am interested to find out more, so I'll take another look some time soon. Good to see people working on solutions... One thing we should perhaps be careful of is to allow these solutions a tight level of integration with existing facilities, so that they are intuitive to users and don't appear to be a layer on a layer on a layer of storage (the multiplicity of trees, as mentioned above). Yours appears to also be an anonymous sharing protocol?

31 Aug 2001 02:12 afields

Re: Storing files
All good ideas; I think these types of UI innovations are what we all need!

31 Aug 2001 02:36 afields

Re: beginnings
Hi,

Looks good. I think you and all the other authors who have been working on these types of projects are heading in the right direction. We need to make sure we can bridge between all the apps, solutions, filesystems, transport mechanisms, etc. The library is definitely a great idea. An exhaustive effort is probably also required to rival some commercial environments where integration is an explicit goal of the project.

I have visions of what the filesystem should be like and how it should interface to the UI/shell. They are in many ways in agreement with Reiser and the original Macintosh vision, and with some aspects of Windows (although I am no Microsoft fan) -- and many different schools of thought! I definitely agree with the author of the originating post; he has got some great points!!

Thanks to all who are working on a solution to this existent and persisting problem of computer science (which may have been solved in some past era, if only we could revive the great software of the past!!! -- and which might already be solved in some expensive commercial package that I can't afford and wouldn't want to use because of the software model).

31 Aug 2001 02:52 afields

Re: Storing files

> A solution? Get rid of the save dialog

Actually, come to think of it, there's no reason to get rid of it; just implement another approach and allow them to be configured on or off.

31 Aug 2001 03:33 Caglios

Re: There are tools...
Yes, the tools are there. But more often than not you need to write them yourself. Only in the last few months have I got my scripts down so that not even a tmp file escapes my wrath (yay for Perl).

The overhead for this probably isn't worth it, and there are still a few bugs. The package (as yet unreleased) needs to work at a relatively low level to query the fs to see which files have been opened (it presently only works on x86 machines), and another cron job takes an image of the complete filesystem once a day, compares it against the previous day's, sees which files have been opened, and stores this and other data in a MySQL table. Then... every month, like clockwork, I switch my pootie on and it takes about an hour to archive all of the unused files for the period.
After that, it's just a matter of scanning through the .zips and removing what I don't really need.

Seems a bit gratuitous, really. But it works.
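
A database-free approximation of the same monthly sweep is possible with find alone, using modification times (the paths here are invented, and GNU find/tar/touch are assumed):

```shell
# Sketch: pack files untouched for 30+ days into a dated archive.
mkdir -p /tmp/stuff /tmp/archive
echo old > /tmp/stuff/old.txt
touch -d '40 days ago' /tmp/stuff/old.txt   # simulate an unused file

# Find the stale files and feed the list straight into tar:
find /tmp/stuff -type f -mtime +30 -print \
    | tar czf /tmp/archive/unused-$(date +%Y%m).tar.gz -T -

tar tzf /tmp/archive/unused-$(date +%Y%m).tar.gz
# prints tmp/stuff/old.txt (tar strips the leading "/")
```

Tracking *access* rather than modification would need atime, which not every setup updates reliably -- hence the open-files logging described above.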

31 Aug 2001 05:14 jfw

Re: The tree structure is one problem

> I've taken a brief look, and find the
> structure a little daunting (also
> unfortunately I don't speak much German
> :( ) -- some of conepts seem neat.. I am

Thanks. Yes, I know there are several years of work to be documented. I appreciate all comments on how to improve the documentation structure. Promise: the German is going to be translated.

> One thing we perhaps should be careful
> of is to allow these solutions to have a
> tight level of integration to existing
> facilities so that they are intuative to
> users and don't appear to be a layer on

That's a main point of Askemos. It actually started when I realized that I can understand files, but my dad, a philosopher, could not.

It's certainly not his fault.

> a layer on a layer of storage (The
> multiplicity of trees - as mentioned
> above). Yours appears to also be an

That's about technology. Askemos stores its data in one repository (two files, provided by RScheme's pstore module). Within that repository, you find internal hash tables and document trees.

The technology is called pointer swizzling at page fault time, which says it all.

> anonymous sharing protocol?

Not exactly, yet. One is needed.

Askemos is by definition based on standards wherever feasible. For the sharing, I currently go through SOAP. This is not the final solution.

31 Aug 2001 07:29 freem

Dumb data and the file system
I think that this article is brilliant, and also enjoyed the comments.

My perspectives on the issues raised are this.

I think there is a fundamental legacy that is difficult to overcome - of course it could be, but it is made very difficult by the commonly accepted underlying abstractions.

The UNIX operating system made a very effective design decision: "everything is a file". This made the abstraction of "file" more useful than it had been in previous operating systems, and provided benefits similar to the benefits of closure in algebraic mathematical structures. The most obvious is the ability to use small programs in concert using pipes. There was a high level of consistency in the implementation, allowing greater productivity for users and allowing general utility programs to be very useful.

However a "file" is just an abstraction. There are also downsides to treating everything as a file. The most obvious being that you need some intelligence about the data in the file to gain higher levels of usefullness.

The continued use of "files" as the dominant user interface abstraction for data storage - while at the same time loading more meaning into that data - has lead to both monolithic applications, and monolithic file sizes and increaing complex file types.

So in the examples raised, we have "bookmarks", "mail messages", "pdf files", etc., which are different concepts managed by different applications. By creating different applications to deal with them, though, we have lost something: their similarities. It IS useful to think of them both at the level of "file" - a chunk of data - and at a more meaningful level - "bookmark".

Relational databases also had a brilliant idea - everything is a table - and also raised the usefulness bar in some contexts for a whole bunch of reasons, including greater description of what the data was.

However, I disagree with making a relational database interface the only way to access the underlying data storage. The data in relational databases is weakly typed. The abstraction is brilliant, but it continues to encourage the separation of the data from its meaning, so you still lose generality; i.e., the data is still too dumb.

So, I think, there is a fundamental problem with the underlying abstraction "file" (specifically as the user interface), and I don't think progress is made by using the abstraction "table" or, for that matter, "XML file".

We do, however, have an abstraction that could serve as the basis for systems that could at least dodge some of the objections raised. Here goes...

"Everything is an object"

A system based on the user interface to data being a persistent object store would, I think, provide a foundation that more easily leads to the desired features.

With objects you have an explicit type hierarchy, giving you the ability to manipulate objects at either a high or a low semantic level - giving you both general and specific tools.

An underlying object repository also gives you a way to deal with the legacy problem of files: you can easily wrap a file in an object if you happen not to have access to the underlying semantics.

Going further, different views onto your object repository and different ways of locating the object(s) you want are required - but I think that, as an abstraction, objects are a much better starting point than files.

Just for clarity, I am talking about the abstraction presented to the user interface. I am not advocating what I think should be used as a physical storage abstraction.

The (very interesting) ReiserFS documentation makes an argument for moving semantics into the file system implementation, which is another way of saying that there is no clean, general decomposition between the user interface abstractions and the data storage abstraction. But then, where ReiserFS is heading could be used as a persistent object store.

So, as an OO bigot, objects (and then a lot of hard work) ARE the panacea ;-)

- Michael.

31 Aug 2001 09:48 alexfarrell

This problem is solved very neatly already. Has nobody noticed...?
BeOS solves this problem very nicely, using something similar to the suggestion in the first post.

The filesystem (BFS) allows attributes (arbitrary data streams) to be attached to filesystem elements, and these can be indexed by the OS. Queries can be performed on the filesystem based on attributes, and the results are served as a "directory".

This makes organizing files very easy.

For example, the ID3 tags of MP3 files can be stored as attributes attached to each file. All MP3 files are then stored in a single directory, and a query (pseudo-directory) is created which shows all files belonging to, for instance, the "Rock" genre. Another query is created which shows all files by the Rolling Stones. Another could show all tracks written in the 80s, etc. A particular track might show up in one, two, or all of the queries, and this allows you to get to what you want very quickly.
Feel like listening to the Stones? Just drag all files in the Stones directory (query) into your MP3 player. Feel like a rock evening? Also easy. Maybe you want to listen to all the old Stones stuff - easy again; just create a query for all Stones music before the 70s.
Your queries can be stored for later use.

These queries are provided at a filesystem level, which means that all applications can use them transparently. They are also instant, since they are indexed.

Problem solved.

Of course now that BeOS has been slain, it's probably not a good OS in which to invest your time. AtheOS ( www.atheos.cx ) promises similar facilities in the future, but it's not there yet.

Anyone feel like working on a cool new filesystem? Maybe you should contact the AtheOS author (I don't know him, so maybe he's not interested in support, but maybe he is).

01 Sep 2001 04:52 Avatar sparre

Filesystems, objects, databases and a command line interface...
Thanks for this inspiring article and comments. It got me thinking about how such an information retrieval system could be organised without dropping too many of the features people like about their Unix-like systems. There is not necessarily much new in the text below.

Information about files should be stored in a structured form (i.e. a "database"). This is the information I imagine is relevant to index:

* author
* language
* title (filename?)
* keywords
* description
* file type
* encoding
* creation and modification times
* projects
* categories
* dependencies (A is constructed from B and C)
* relations (if you are interested in A, then you are also likely to want to read B)
* full text
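
Most of these fields map naturally onto one table in a small SQL database. A minimal sketch with the sqlite3 command line tool (the schema and file names are invented, and only a few of the fields are shown):

```shell
# Sketch: a tiny metadata index over $HOME-style files in SQLite.
db=/tmp/home_index.db
rm -f "$db"
sqlite3 "$db" 'CREATE TABLE files (
    path TEXT PRIMARY KEY, filetype TEXT, mtime INTEGER, fulltext TEXT);'

# "Index" one file: record its path, type, mtime, and full text.
f=/tmp/notes.txt
echo "notes on information retrieval" > "$f"
sqlite3 "$db" "INSERT INTO files VALUES
    ('$f', 'text/plain', $(stat -c %Y "$f"), '$(cat "$f")');"

# Query: which files mention 'retrieval'?
sqlite3 "$db" "SELECT path FROM files WHERE fulltext LIKE '%retrieval%';"
# prints /tmp/notes.txt
```

A real indexer would extract author, keywords, etc. per file type; and splicing file contents into SQL strings like this is only safe for a toy example.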

"URL symlinks" should be available as an alternative to "bookmark files". And they should be treated as if they were file they refer to.

The indexing system should recognise file types and extract as much as possible of the abovementioned information from the files.

"tar", "zip" and mailbox files - as well as other composite file types - should be indexed both as a whole and as the individual components.

If it is possible, then the indexing system should receive information from the file system, when files are created, modified or removed. Secondarily, the indexing system will have to scan the file system for changes on its own.

There should definitely be some kind of shell/command line interface to the system. And it should of course include file name completion-like features.

A virtual filesystem for formulating queries against the database should also be considered. It would be great if I could do something like this in my favourite shell:

$ xv /<tab>
[ the program `xv` can only read images, so completion offers: ]
$ xv /.filetype/image/.ca<tab>
$ xv /.filetype/image/.category/L<tab>
Choose:
L(i)nux
L(E)GO
$ xv /.filetype/image/.category/Li<tab>
$ xv /.filetype/image/.category/Linux/T<tab>
$ xv /.filetype/image/.category/Linux/Tux.png

(not bad, I would say)
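
Until such a query filesystem exists, the closest everyday approximation is to encode one category level in the path and filter with find (the layout below is invented):

```shell
# Emulate /.filetype/image/.category/Linux/Tux.png with a plain tree:
mkdir -p /tmp/img/Linux /tmp/img/LEGO
touch /tmp/img/Linux/Tux.png /tmp/img/LEGO/brick.png

# "Query": PNG images in category Linux.
find /tmp/img -name '*.png' -path '*/Linux/*'
# prints /tmp/img/Linux/Tux.png
```

The difference, of course, is that a path commits each file to exactly one category, which is precisely what the query filesystem avoids.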

It would be nice if filesystems could store more of the basic information about files (for example file type, encoding, language, author, keywords, and description).

/Jacob

PS: Why are the "UL", "OL", "LI" and "BLOCKQUOTE" elements banned from the HTML formatting? :-(

02 Sep 2001 05:26 afields

Re: There are tools...

> Yes, the tools are there. But more
> often than not you need to write them
> yourself. Only in the last few months
> have I got my scripts down so that not
> even a tmp file escapes my wrath (Yay
> for PERL).

That is a good point; sometimes the best way to do it is your own script. I am also fond of Perl for some tasks.

> Seems a bit gratuitous, really. But
> it works.

Hmm.. seems like a good way to archive, but remember that archiving offline isn't always the right/full solution. Depends on people's usage patterns, I guess. :)

02 Sep 2001 06:58 afields

Re: There are tools...

> I had the same problem. Until I discovered
> that there are a lot of tools that can help.

I've been looking for tools, but even if I found a tool for each application (and an open one, say), it still doesn't fully solve the closure issue. You can get pretty close, though, by using all web-based tools.

> has to find all those tools and select
> the best of
> them. The starting point for me was
> the desire to
> have one (or two) places in which my
> important
> stuff goes. Ideally a common interface
> for all this.

Yeah, it would be nice to have one interface for all of the tasks you mention. Also, can you post a small list of links to the packages that you have found? That might be helpful for everyone here trying to set up a repository. I have searched Freshmeat and SourceForge but haven't yet got a good idea of what exists and the extent of the work on these solutions. I know there are already a lot of commercial solutions for these types of things... I imagine most are for large business/project management/office problems, though. I wonder if any exist for research work.

> And the only environment that is ready
> to deal with
> all sort of objects is the web.
> Therefore, my way of
> solving the problem is:
>
> - Use a perl, php enabled web server
> for your own
> computer
> - Use a Personal Information system
> (there are
> several out there, I use MyPhPPim)
> with a web
> interface, connected with a mysql
> database. In
> that database goes all your E-mail,
> notes, todo-s,
> etc.
> - Use a bookmark manager connected
> with the
> same Mysql server and with a web
> interface
> - Use a web file manager system (such
> as
> phpFileFarm) to work the pdf, html, ps
> files
> - Use a web photo album to keep your
> photos (of
> course with database back end)
> - Use a cvs system for ASCII work in
> progress and
> install a webcvs system (I use
> viewcvs).
> - Finally, use HtDig or another search
> engine to
> index the whole stuff. Configure
> htdig to search in
> separate directories or in all.
>
> use the same database engine (mysql or
> postgress
> or another) to minimize the load

I agree with trying to get everything into one DBMS at least, even if there isn't seamless integration. Even more ideal is a strong level of linkage between all the member DBs of the DBMS.

Also, it appears PostgreSQL and MySQL are a little behind Oracle in some of the object-over-relational framework features. Even nicer are the OO or object-relational ODBMSes like Cache, DB40, or (open source examples) GOODS and Gigabase.

On the DB access layer, another project that caught my eye was ColdStore (a persistence framework using a simple DB). And then there is J2EE, which is something to look at for Java apps.

There are lots of things to look at, and there are many projects addressing specific sections of the problem...

---

Allan Fields

03 Sep 2001 07:33 sorinm

Re: There are tools...

> I've been looking for tools, but even
> if I found a tool for each application
> (and it was open say), it still doesn't
> solve the closure issue fully. You can
> get pretty close though by using all web
> based tools.

Yes, web-based tools are probably the most complete ones. And, yes, sometimes you just get very close. But most of the time, since the web tools are (some of them, at least) rather standard, you can adapt yourself to the tools.

> Also, can you post a small
> list of links to the packages that you
> have found? That might be helpful for
> everyone here trying to setup a
> repository.

I will post several links, but, as usual, you should check for yourself. Open source projects in particular sometimes move very fast. And let's hope that others will reply, adding some more.

About the repository: I use the common CVS system that can be found at www.cvshome.org. The CVS from there is used from the command line; it has no graphical or web interface. But once you've set up a repository (or more), there are several tools available:

CVSWeb

stud.fh-heilbronn.de/~...
This is a single Perl script that does not need a database. You need to have the repository set up, and that's it.

ViewCVS

freshmeat.net/projects...

The one I am using now. It is based on CVSWeb but is written in Python. You can download tarballs from your repositories, and it has syntax highlighting for a lot of file types (based on enscript). It can be used with a MySQL database, but this is not compulsory; the database does not hold the repository, just information about it.

Chora

horde.org/chora/
I haven't personally tried this one, but I've seen some online repositories, and it matches the previous one pretty closely.

Freepository

www.freepository.com

The one I'll use in the future :) if I have time to move all my stuff from MySQL to PostgreSQL. It is a fully web-based tool: checkin, checkout, whatever. PostgreSQL backend.

For CVS documentation or tutorials, go to the ViewCVS site; there are several links.

In principle, my web site has to have:

* A news system - something that grabs the news from Slashdot, Freshmeat, etc.
* A calendar
* A bookmark manager
* A place for notes
* A place to put small articles that I find on the web
* A photo gallery
* An e-mail system
* A file manager
* A CVS interface
* An interface to the computer administration tools
* Web ssh login

There are several tools that can do this. I will mention two (although I am sure that more - maybe better - can be found).

PhpGroupware - a multiuser groupware tool that has everything in the above list except the last two (as far as I know). The CVS interface is Chora, mentioned above. Very actively developed (if you go on SourceForge, you will almost always see it in one of the first three places).
www.phpgroupware.org

PhpNuke + several modules (News, Gallery, Calendar, etc.). All can be found on the PhpNuke site: www.phpnuke.org. PhpNuke is a system for building news sites, but you can use it for all of the above list except the last five.

I am now using PhpNuke. For other tasks:

E-mail system: there are a lot of webmail programs. If you have mail delivered to your machine, then you can use NeoMail (neomail.sourceforge.ne...) or OpenWebmail, and much more.

Web-based file managers: an interesting one is phpFileFarm.

Interface for administration: the best one seems to be Webmin (I am running Linux; you should check their page for other systems).

www.webmin.com/webmin/

Webmin also has a file manager and an ssh login shell, and much more.

Or you can use a combination of MyPhPIM (sourceforge.net/projec...), which has mail, calendar, todo, and addressbook, and other tools described above.

> Also, it appears PostgreSQL and MySQL
> are a little behind Oracle in some of
> the Object over Relation framework
> features. Even nicer is the OO or
> Object-Relation ODBMSes like Cache, DB40
> or (open source example) GOODS and
> Gigabase.
> On the DB access layer another project
> that caught my eye was ColdStore
> (persistence framework using simple DB).
> And then there is J2EE for Java which
> is something to look at for Java apps.

I am not very familiar with object-oriented databases. Postgres has table inheritance, though. But for such a project (which is personal, and therefore single-user) a lightweight database seems the best choice. This is why I am not yet convinced to move my system from MySQL.

>
> There are lots of things to look at,
> and there are many projects adressing
> specific sections of the problem...

This is true. It would be more than nice to start a project for this - a personal web content manager. Oh, I mentioned HtDig. It is a search/indexing engine that can be found at:

www.htdig.org

>
> ---
> Allan Fields

03 Sep 2001 23:10 mschmidt

Indexing PDFs
I don't think having a bib file for every PDF is a good solution. PDF provides the possibility of metadata inclusion (pdflatex lets you do this, and I'm sure Adobe's own tools do as well). Unfortunately, almost nobody is using this correctly. You can store title, author, date, keywords, and probably more in a PDF. Command line tools like pdfinfo (or Ctrl-D in Acrobat Reader) show this information. Additional files always get lost or do not get updated, so store the information in the files!

03 Sep 2001 23:18 mschmidt

Problems with superstrings
The idea is good; there is just a minor problem with character encoding. I don't say this cannot be solved, but I noticed that with LaTeX, special characters often get described in a visual way - e.g., a small 'a' with two dots on top of it. While this results in nice output, the information that this really is an umlaut 'a' (as in German Bär or Jägermeister) gets lost (it does not get stored in the dvi or pdf file that results from the (pdf)latex run). So searching for anything with ä in it doesn't work anymore.

04 Sep 2001 04:16 hvameln25

Another ERP-System
For those of you in need of a Linux-based ERP system, have a look at www.pentaprise.de.

06 Sep 2001 07:19 ask

MacOS X "bundles"
In MacOS X we have "compound files" which can be navigated like the directory structures they really are. In the MacOS Finder they look and work like a single file, but after opening a shell you can easily see what's inside.

- ask

07 Sep 2001 07:56 gregholt

Re: when in doubt use brute force

> Simply delete files not needed and
> backup everything you may need in the
> future to an external storage device
> like a tape archiver or a cd-r.

I've seen this recommended by several folks, so don't think I'm singling you out...

Simply archiving and deleting things does not solve the problem; it makes it *worse*. How do you find out what the hell you've archived? "Gee, I know John sent me an article he wrote on graphing small population relationships, but which of these 50 CDs or 150 backup tapes did I put that on?"

Greg

07 Sep 2001 11:27 afields

See also.. Story on OSNews
www.osnews.com/story.p...

Are Linux meta-data enabled filesystems ready for production? Never hurts to try out something new on a test machine. I plan to look at and compare the various file systems discussed in this article.

On another note, XML databases are very interesting indeed. Take a look at these handy resources: www.rpbourret.com/xml/... and www.rpbourret.com/xml/... These pages describe XML database solutions and talk about how XML and databases fit together. Various XML databases are also mentioned, and at this point there is a large pool of XML database projects, both commercial and open source. Some examples of XML/OO database products: Prowler, Ozone, etc.
XML databases might be a key element of how to get this to the user level in existing solutions. Also take a look at the section describing DTD schema translation, part of what is described as the 'object-relational mapping' method: document-centric XML files map to object frameworks and then to a relational database backend, such as PostgreSQL.

Since I'm no XML expert some of this is new to me.

Well hope someone still reads through this thread, it seems a bit dated by now.

---

Allan Fields

08 Sep 2001 21:48 aglasgall

Re: Storing files
Didn't Acorn's RISC OS do this with saving stuff?

> A solution? Get rid of the save dialog
> box and replace it with a draggable
> icon. To save, the icon is dragged to a
> filer window, directory on the panel,
> etc. Common save destinations (eg, the
> project you're currently working on) can
> then be kept handy along the bottom of
> the screen (or whereever). See here for
> an implementation of this system.

10 Sep 2001 08:15 Avatar tal197

Re: Storing files

> Didn't Acorn's RiscOS do this wrt saving
> stuff?

Yep; my implementation looks very similar to it.

16 Sep 2001 15:27 jregel

Re: This problem is solved very neatly already. Has nobody noticed...?
To expand on the above, BeOS did this well because it was so integrated into the OS, and all applications made use of it. Scot Hacker had a great article on BeOS filetypes at www.byte.com a while back that explained how apps such as the web browser would automatically fill in certain attributes for the user (such as the source URL, the date it was downloaded, the mimetype, etc.), and how MP3 rippers would populate the song title, author, track length, etc.

My understanding is that the issue of filesystem metadata has been discussed on the Linux kernel mailing list, and Linus and co are trying to work on a proper implementation.

19 Sep 2001 01:26 rmjorja

Isn't Oracle IFS the answer (albeit an expensive one)
I do not have any experience with Oracle IFS (Internet File System), but from what I have heard and read about it, it seems to be the answer.
It saves everything in an RDB (obviously Oracle), so you can add tags, metadata, etc. (it even handles XML), and then you can query it via http, nfs, smb, imap, etc. - any protocol!

03 Oct 2001 06:06 grom

command line vs gui
Regarding being able to call up the command-line interface when saving: why not integrate command-line functionality into the GUI? For example, why not have path name completion in the filename text box, so a user can use either the GUI or the command line? I feel that GUIs need to have the power of command lines, but I have yet to witness this. Another idea I have is a find-file dialog (like that in Windows Explorer) that allows regular expressions. Why give up the command line for a GUI when we can have the command line built into the GUI?

16 Oct 2001 18:28 tobho970

Reducing everything to a file and using file filters
Maybe part of the solution to the problem that bookmarks cannot easily be integrated into the file system would be to write an http file system (maybe one already exists?). I am thinking you might mount it under say /http and reach web documents by standard URL:s like /http/user.passwd@host:port/path/filename.xyz where ofcourse user, passwd, port and path would be optional. This way you could create symlinks anywhere you want to save your web documents (enabling you to choose any filename you like) that points to somewhere under /http and then use it as a (probably but not necessarily exclusively (I guess?) read-only) file. This way you would always be using the latest version of the file and all tools that works with files would also work transparently with web documents.

Of course, you can imagine the HTTP filesystem automatically finding stale links, doing regular downloads/backups of documents in the filesystem, etc., and of course there is no reason why protocols other than HTTP couldn't be used the same way.
Another useful protocol to implement a filesystem for would be standard mail folders, but from the comments to the article, there seem to be some mail storage standards already for reaching mail folders as part of the filesystem.

Well, reducing bookmarks and mail folders to files might not be what we want, but it would make it rather easy to do the indexing, etc., for people wanting to be able to search their HOME.
When it comes to searching/indexing .ps and .pdf files and the like, I do not understand why the indexer/searcher cannot apply filters based on filename extensions, magic numbers, or MIME types, for example.

Wanting to use text tools like grep on .ps and .pdf files and the like can be solved in at least two ways:
1) Make the tools (e.g., grep) smarter, and allow for a filter to be applied to the files grep searches before it does its search.
2) Make your favourite shell recognize a special character that you add just before filenames you want read as plain text on the command line. The shell then applies the filter mechanism (e.g., pdf2txt or ps2txt) and feeds the filtered file (probably stored in something like /tmp/shellfiltermechanism/path_to_the_real_file) into the command you entered (for example, grep). Of course, grep would report matches in /tmp/shellfiltermechanism/foo/bar/fie.pdf, but you would know to look at /foo/bar/fie.pdf. This is of course not a perfect solution, but people using grep and the special shell character would know how to handle it, so for us hackers it might do? :)
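Idea (1) above can be sketched today as a small wrapper function rather than a smarter grep itself. This is only a sketch under assumptions: `xgrep` is a made-up name, and the `pdftotext`/`ps2ascii` converters (from poppler-utils and ghostscript) are assumed to be installed for the .pdf/.ps cases; plain files just fall through to `cat`.

```shell
# xgrep: grep through files, converting .pdf/.ps to text first.
# GNU grep's -H --label prints the *original* filename for stdin input.
xgrep() {
    pattern="$1"; shift
    for f in "$@"; do
        case "$f" in
            *.pdf) pdftotext "$f" - ;;   # assumes poppler-utils
            *.ps)  ps2ascii "$f" ;;      # assumes ghostscript
            *)     cat "$f" ;;
        esac | grep -H --label="$f" "$pattern"
    done
}
```

A dispatch on magic numbers (via `file --mime-type`) instead of extensions would be more robust, but the extension-based version keeps the sketch short.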

Just some thoughts...

28 Oct 2002 09:46 malcolmkav

Perishable Files
I remember reading somewhere about a filesystem that automatically deletes files that haven't been accessed for a certain amount of time.

Strange as this may seem, it does make a certain amount of sense. If you haven't accessed a file in 5 years, you could probably do without it; you would probably not even notice that it had faded away.
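Short of a self-expiring filesystem, the same effect can be approximated by hand. A minimal sketch, assuming the filesystem records access times at all (many are mounted `noatime` these days), with `stale_files` as a made-up name:

```shell
# stale_files DIR DAYS: list regular files under DIR whose last
# access time (atime) is more than DAYS days ago.
stale_files() {
    find "$1" -type f -atime +"$2" -print
}

# e.g. candidates for fading away after ~5 years:
#   stale_files "$HOME" 1825
```

Piping the result into an interactive review step, rather than straight into `rm`, is obviously the safer habit.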

05 Mar 2005 20:18 zoohoolinux

Re: Indexing PDFs

> I don't think having a bib file for
> every PDF is a good solution. PDF
> provides the possibility for metadata
> inclusion (pdflatex lets you do this,
> and I'm sure Adobe's own tools as well).
> Unfortunately, almost nobody is using
> these correctly. You can store title,
> author, date, keywords and probably more
> in a PDF. Command line tools like
> pdfinfo (or CTRL-D in Acrobat Reader)
> show this information. Additional files
> always get lost or do not get updated,
> so store the information in the files!

Since the authors of PDF files do not always include metadata in the file, design a tool that handles metadata both ways: putting metadata into a PDF from a file, or pulling the metadata out of the PDF into a file. If an extensible standard for the separate metadata files were made, a whole set of tools could be built to put and pull metadata for almost any file type that carries metadata: PDF, MP3, Ogg, MPEG, AVI, and probably others.
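The "pull" direction of such a tool is almost trivial to sketch in shell, building on the `pdfinfo` tool mentioned in the quoted comment. This is only a sketch under assumptions: `meta_sidecar` and the `.meta` sidecar naming are made up, and poppler's `pdfinfo` is assumed for the PDF case; the sidecar writer itself just takes "Key: value" lines on stdin, so any per-format extractor could feed it.

```shell
# meta_sidecar FILE: read "Key: value" metadata lines on stdin
# (e.g. pdfinfo output) and store them, sorted, in FILE.meta.
meta_sidecar() {
    sort > "$1.meta"
}

# Pull direction for PDFs (assumes poppler-utils is installed):
#   pdfinfo paper.pdf | meta_sidecar paper.pdf
```

The "push" direction (writing a sidecar's contents back into the PDF) is the harder half and would need a format-aware tool rather than a pipeline.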

05 Mar 2005 20:48 mschmidt

Re: Indexing PDFs

> Since the authors of pdf files do not
> always include metadata into the file,
> design a tool that handles metadata both
> ways, putting metadata into a pdf from
> and file or pulling the metadata out of
> the pdf into a file. If an extensible
> standard for the seperate metadata files
> were made a whole set of tools might
> could be made that handles putting and
> pulling metadata from almost any file
> type that includes metadata pdf, mp3,
> ogg, mpeg, avi, and probably others.

Adobe's XMP (www.adobe.com/products...) is a common metadata framework. I think it is already used with some PDFs as well.

I agree that a multi-level approach that tries different ways to access metadata is preferable.

Marco

14 Sep 2007 19:40 Logomachist

Re: The tree structure is one problem
I'm afraid that went over my head. What exactly is Askemos? In light of all the improved search tools we have, is it still as strong a solution today as it was 6 years ago?

14 Sep 2007 19:41 Logomachist

Re: File naming rules

> More thoughts on file naming rules:

> www.everything2.com/in...

Link's broken. Is that article still around?

14 Sep 2007 19:46 Logomachist

Good organizational skills are a help, BUT...
Good organizational skills are a help, but so are good user interfaces. One is not a replacement for the other; in fact, IMHO, they are mutually reinforcing.

14 Sep 2007 19:48 Logomachist

Re: This problem is solved very neatly already. Has nobody noticed...?
Doesn't NTFS already have this?
