This tool collects information from Web servers and spiders Web sites. It was written for the Open Source Security Testing Methodology Manual (OSSTMM), located at http://www.ideahamster.org/osstmm-description.htm. The spider is a multi-threaded, reusable module that can be used in other projects.
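As a rough illustration of what a multi-threaded, reusable spider module can look like, here is a minimal Python sketch. It is not this tool's code. The names `crawl`, `extract_links`, and the injected `fetch` callable are hypothetical, and the sketch assumes that `fetch(url)` returns the page's HTML as a string. Passing `fetch` in from outside keeps the module reusable and lets it run without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def crawl(start_url, fetch, max_pages=50, workers=4):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is a hypothetical callable: fetch(url) -> HTML string.
    Returns a dict mapping each visited URL to its HTML.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    frontier = [start_url]
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            # Fetch one level of the crawl in parallel.
            batch = frontier[: max_pages - len(pages)]
            frontier = []
            for url, html in zip(batch, pool.map(fetch, batch)):
                pages[url] = html
                for link in extract_links(url, html):
                    link = urldefrag(link).url  # drop "#fragment"
                    if urlparse(link).netloc == host and link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages
```

In real use, `fetch` could wrap `urllib.request.urlopen`; error handling, robots.txt handling, and rate limiting are left out of this sketch.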
Nuxeo Platform provides a framework and set of components to address document management and collaboration needs, including metadata/taxonomies, versioning, lifecycle management, workflow, relations, searching, reporting, transformation, auditing, and retention. Its flexible extension system, based on OSGi, allows developers to quickly configure and extend the platform by creating new components. Its default Web user interface, based on the JSF standard, uses AJAX to create a pleasant user experience. It can also be accessed by a rich client interface through the use of Web services, for instance using the Eclipse-based Nuxeo RCP rich client platform.
MM3WebAssistant Proxy Offline Browser Pro archives Web pages as you visit them in your browser, for use online or offline. Offline, each page is available under its original URL, so there is no difference between browsing the Internet and browsing the archive. You can even use your bookmarks offline. Search, navigation, and marking features help you use the archive efficiently. It allows mobile users to access Internet information when they don't have Internet access.
focuseek searchbox is a family of easily installable full-text search engines that can spider Internet and intranet data sources (Web sites, newsgroups, FTP sites, and others) or index data you feed to them, and make it available for searching. It supports a variety of input formats (among them HTML, PDF, Microsoft Word DOC, and RTF), and is easily scriptable via SOAP and extensible through plugins. It can scale to millions of documents and comes with a full-fledged GUI client, a built-in Web search portal, and an RSS server.
Readerware is an easy and fast tool for cataloging your books, music, and videos. Its unique auto-catalog feature lets you feed in a list of ISBNs, UPCs, or barcode scans, automatically searching multiple Web sites to build the most complete database possible, with cover art. It is also possible to drag and drop from a browser. A Palm OS interface is provided, allowing you to take your database with you.
iTree Pro-XQ Powertree is a Java tree menu with drag 'n' drop, tabs, scalability, and database friendliness. It also features a large variety of search facilities and runtime editors, which are especially useful for large interactive application interfaces. Menu content can be fed from an easily written text file or database interface script. Other features include multi-state user-definable icons, TrueType fonts, pixel-level customizable colors and layout, script triggers, multiple scrolling logics, checkbox and radio menu items, line-wrapping, and rollover effects.
Xapian is a search engine library, scalable to collections containing hundreds of millions of documents. It's written in C++ with bindings for Perl, Python, PHP, Java, Tcl, C#, Ruby, and Lua. It is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also a rich set of boolean query operators. Omega is a Web search application built upon the Xapian library. It can index a Web server's document tree (including HTML, PDF, OpenOffice, MS Word/Excel/PowerPoint/Works, WordPerfect, RTF, PS, etc.), or data exported from arbitrary sources (e.g. SQL databases).
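To give a feel for what ranking under a probabilistic retrieval model involves, here is a small pure-Python sketch of the textbook Okapi BM25 formula, a widely used probabilistic weighting scheme. It does not use Xapian's API and is not Xapian's implementation. The function name `bm25_scores`, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in docs against query with textbook BM25.

    Tokenization is naive lowercase whitespace splitting (an assumption
    made to keep the sketch short).
    """
    if not docs:
        raise ValueError("docs must not be empty")
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Guard against an all-empty collection to avoid dividing by zero.
    avg_len = (sum(len(tokens) for tokens in tokenized) / n) or 1.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that do not contain any query term score zero, and repeated occurrences of a term raise the score with diminishing returns.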
Spider Linker provides the ability to efficiently make all content on a Web site, particularly dynamic content, available to Internet search engines. It discovers content on one or more Web sites, and creates a table of contents (TOC) of all the content in a format that is friendly to search engine spiders. HTML, XML, sitelist.txt, and Harvest Control List formats are supported, and custom formats can be constructed. Since it provides FTP, HTTP/HTTPS, upload, and email publishing mechanisms, it can also support XML feeds and other URL submission strategies.
PDFTextStream is a PDF text and metadata extraction library available for Java and .NET. It supports all versions of the PDF document specification (including v1.7, used by Acrobat 8, 9, and X), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of documents encrypted using 40-bit, 128-bit, 256-bit, and variable bit length ciphers, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations). Easy integration with Jakarta Lucene is included, as well as interactive form update capability.