Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network).
| Tags | Internet Web Indexing/Search |
|---|---|
| Licenses | GPL |
| Operating Systems | POSIX Linux BSD FreeBSD |
| Implementation | C++ |
Recent releases


Release Notes: This release fixes some compilation problems with recent gcc versions, improves the configuration file parser, and adds new options for following links selectively.


Release Notes: This release compiles on Solaris and adds cookie management. Images can be fetched with pages, and many parts have been rewritten for efficiency and portability.


Release Notes: With this release, it is again possible to crawl through a proxy, all configurations should compile (Linux and BSD), images can now be downloaded with pages, and the robots.txt parser has been enhanced.


Release Notes: Many efficiency updates were made to the sequencer, to buffer recycling, and to DNS management. A new output module for statistics has been added.


Release Notes: Output and buffer interfaces have been simplified. A dynamic buffer option has been added. The web server has been reworked.
Recent comments
13 Oct 2001 12:20
larbin@somewhere.com
I tried to reach the larbin project owner but I get the
following error, so I'm posting this here.
<sebastien.ailleret@inr...>
(reason: 550 5.7.1 <sebastien.ailleret@inr...>...
Access denied)
Either larbin does this by default, or someone has
configured their version that way, but I would appreciate it if
someone would make sure that the user-agent field for
larbin is *not* larbin@somewhere.com. I've gotten
several complaints, and I don't really appreciate it.