GNU Wget is a utility for noninteractive download of files from the Web. It supports HTTP and FTP protocols, as well as retrieval through HTTP proxies. It can follow HTML links, download many pages, and convert the links for local viewing. It can also mirror FTP hierarchies or only those files that have changed. Wget has been designed for robustness over slow network connections; if a download fails due to a network problem, it will keep retrying until the whole file has been retrieved.
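Wget's keep-retrying policy can be sketched in a few lines. This is an illustrative approximation only, not wget's actual implementation: the `retry` helper and its parameters are made up for the example, and real wget also resumes partially downloaded files rather than restarting them.

```python
import time

def retry(fetch, max_tries=5, wait=2.0):
    """Call fetch() until it succeeds or max_tries is exhausted,
    sleeping `wait` seconds between attempts -- the same keep-trying
    policy wget applies to downloads that fail mid-transfer.
    (Hypothetical helper for illustration; not wget's own code.)"""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_tries:
                raise  # give up after the last attempt
            time.sleep(wait)  # back off, then try again
```

Here `fetch` stands in for any download step that raises `OSError` on a network failure.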
With LinkChecker, you can check HTML documents and Web sites for broken links. It features recursion, robots.txt exclusion protocol support, HTTP proxy support, i18n support, multithreading, regular expression filtering rules for links, and user/password checking for authorized pages. Output can be colored or normal text, HTML, SQL, CSV, or a sitemap graph in DOT, GML, or XML format. Supported link types are HTTP/1.1 and 1.0, HTTPS, FTP, mailto:, news:, nntp:, Telnet, and local files.
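Regular-expression filtering rules of the kind LinkChecker supports boil down to dropping any candidate link that matches an exclusion pattern. A minimal sketch (the function name and the example patterns are invented for illustration, not LinkChecker's configuration syntax):

```python
import re

def filter_links(links, ignore_patterns):
    """Drop links matching any of the given regular expressions,
    mimicking regex-based link filtering rules. The patterns are
    plain Python regexes, applied with re.search."""
    compiled = [re.compile(p) for p in ignore_patterns]
    return [link for link in links
            if not any(rx.search(link) for rx in compiled)]
```

For example, `filter_links(links, [r"^mailto:", r"/logout$"])` keeps only links that are neither mailto: addresses nor logout URLs.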
urlwatch is a script intended to help you watch URLs and get notified (via email) of any changes. The change notification includes the URL that has changed and a unified diff of what has changed. The script works out of a single directory, so there is no need to install anything; state files are kept in that same directory. Parts of a page that are always changing can be stripped through a filter hook function. It is typically run as a cron job.
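The "URL plus unified diff" notification format can be reproduced with the standard library's `difflib`. A simplified sketch (the function name is invented; urlwatch itself handles fetching, state, and email delivery around this core):

```python
import difflib

def change_report(url, old_text, new_text):
    """Build a urlwatch-style change report: the URL that changed,
    followed by a unified diff of the old and new page text."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=url + " (old)",
        tofile=url + " (new)",
    )
    return url + "\n" + "".join(diff)
```

Running the two saved snapshots of a page through this function yields exactly the kind of `-removed` / `+added` summary that lands in the notification email.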
RabbIT is a mutating, caching Web proxy used to speed up surfing over slow links like modems. It does this by removing advertising and background images and by scaling images down to low-quality JPEGs. RabbIT is written in Java and should be able to run on any platform, but it depends on an external image converter when image scaling is enabled. The recommended image converter is "convert" from the ImageMagick package.
webcheck is a Web site checking tool for Web masters. It crawls a given Web site and generates a number of reports. The whole system is pluggable, allowing extra reports and checks to be added easily. It supports retrieving Web sites over HTTP, file, and FTP protocols and produces reports on site structure, broken links, old Web pages, overviews of external links, and more. The links that webcheck considers external are configurable through regular expressions, and webcheck honors robots.txt.
Web-Analiser is a script that gathers and analyzes Web site traffic statistics. It provides a wide range of statistics and reports covering visitors, pages, and site traffic. It is designed to handle one or several domains (Web sites) at once, maintaining statistics for each site separately as well as in aggregate. Its many statistical reports make it possible to form a complete picture of your audience.
Websitary is a script that monitors Web pages, RSS feeds, and podcasts and reports what's new. For many tasks, it reuses other programs (such as w3m, diff, and webdiff) to do the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based Web browsers. With the help of some friends, it can also work with HTML.
phoneutria is a multi-threaded, scalable, high-performance, extensible, and polite Web crawler. It can be used to crawl, index, load-test, or even download any Web or enterprise domain, and it is configurable through an XML configuration file. Because the level of politeness is itself configurable, phoneutria can be used either for checking the links of a Web site or for load-testing. It provides a plug-in mechanism for further extensions.
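"Politeness" in a crawler usually means enforcing a minimum delay between successive requests to the same host; dialing that delay down turns the same crawler into a load-testing tool. A minimal sketch of the idea (the class name, method, and delay value are invented for illustration, not phoneutria's actual XML configuration):

```python
class PolitenessThrottle:
    """Track the last request time per host and report how long a
    crawler should pause before hitting that host again. A large
    delay is polite crawling; a delay of 0 behaves like a load test."""

    def __init__(self, delay=1.0):
        self.delay = delay      # minimum seconds between same-host requests
        self._last = {}         # host -> timestamp of last request

    def pause_needed(self, host, now):
        """Return seconds to sleep before requesting from `host` at time `now`."""
        last = self._last.get(host)
        self._last[host] = now
        if last is None:
            return 0.0          # first contact with this host: no wait
        return max(0.0, self.delay - (now - last))
```

A crawl loop would call `pause_needed` with the current monotonic time and sleep for the returned duration before each request.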
ht://Check is a link checker derived from ht://Dig. It can retrieve information through HTTP/1.1 and store it in a MySQL database so that after a "crawl", ht://Check can return broken links, anchors not found, and content-type and HTTP status code summaries. ht://Check also performs accessibility checks in accordance with the principles of the University of Toronto's Open Accessibility Checks (OAC) project, allowing users to discover site-wide barriers like images without proper alternatives, missing titles, etc. A PHP interface lets the user query and view the results directly via the Web.
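One of the simplest accessibility barriers of the kind described above, images without text alternatives, can be detected with the standard library's HTML parser. This is a simplified sketch of that style of check (class and function names are invented; ht://Check's own checks are broader and database-backed):

```python
from html.parser import HTMLParser

class MissingAltChecker(HTMLParser):
    """Collect the src of every <img> tag that lacks an alt attribute,
    the kind of site-wide accessibility barrier a checker reports."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            self.missing.append(attrs.get("src", "<no src>"))

def images_missing_alt(html):
    """Return the src attributes of all images without alt text."""
    checker = MissingAltChecker()
    checker.feed(html)
    return checker.missing
```

Run over every crawled page, the per-page results roll up into exactly the kind of site-wide summary a link-and-accessibility checker produces.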