OpenSearchServer is a stable, high-performance search engine and a suite of high-powered full text search algorithms. Documents can be indexed in sixteen languages. Multi-lingual analyzers slice sentences into words, then run lemmatisation algorithms on words based on the document's language. Numerous document formats are supported, such as XML, HTML/XHTML, PDF, Word, PowerPoint, RTF, OpenOffice, plain text, MP3/4, Ogg, FLAC, etc. The Web interface, built around the Zkoss framework, provides an easy way to manage OSS. The integration is fast using the PHP client or the API (XML over HTTP). The crawlers of OpenSearchServer go through Web sites, file systems, and databases to rapidly and easily build your index.
LogicalDOC is a Web-based document management system that is easy to use and learn. Its architecture leverages best-of-breed Java technology to achieve a powerful and flexible solution. It supports its users with a powerful search engine (Lucene), a Web service interface (JAX-WS via CXF) compatible with .NET and PHP, versioning, annotation on documents, a WebDAV interface, and importing from and exporting to .zip files. Documents can be organized into hierarchical folders, searched using the integrated search engine, or browsed by Tag. The system is extensible thanks to the technologies used (Spring-Hibernate) and its plugin architecture.
Xapian is a search engine library, scalable to collections containing hundreds of millions of documents. It's written in C++ with bindings for Perl, Python, PHP, Java, Tcl, C#, Ruby, and Lua. It is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also a rich set of boolean query operators. Omega is a Web search application built upon the Xapian library. It can index a Web server's document tree (including HTML, PDF, OpenOffice, MS Word/Excel/PowerPoint/Works, WordPerfect, RTF, PS, etc.), or data exported from arbitrary sources (e.g. SQL databases).
Nuxeo Platform provides a framework and set of components to address document management and collaboration needs, including metadata/taxonomies, versioning, lifecycle management, workflow, relations, searching, reporting, transformation, auditing, and retention. Its flexible extension system, based on OSGi, allows developers to quickly configure and extend the platform by creating new components. Its default Web user interface, based on the JSF standard, uses AJAX to create a pleasant user experience. It can also be accessed by a rich client interface through the use of Web services, for instance using the Eclipse-based Nuxeo RCP rich client platform.
Splunk is an engine for machine data. Use Splunk to collect, index, and harness the fast moving machine data generated by all your applications, servers, and devices: physical, virtual, and in the cloud. Search and analyze all your real-time and historical data from one place. Splunking your machine data lets you troubleshoot problems and investigate security incidents in minutes, not hours or days. Monitor your end-to-end infrastructure to avoid service degradation or outages. Meet compliance mandates at lower cost. Correlate and analyze complex events spanning multiple systems. Gain new levels of operational visibility and intelligence for IT and the business.
PDFTextStream is a PDF text and metadata extraction library available for Java and .NET. It supports all versions of the PDF document specification (including v1.7, used by Acrobat 8, 9, and X), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of documents encrypted using 40-bit, 128-bit, 256-bit, and variable bit length ciphers, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations). Easy integration with Jakarta Lucene is included, as well as interactive form update capability.
Readerware is an easy and fast tool for cataloging your books, music, and videos. Its unique auto-catalog feature lets you feed in a list of ISBNs, UPCs, or barcode scans, automatically searching multiple Web sites to build the most complete database possible, with cover art. It is also possible to drag and drop from a browser. A Palm OS interface is provided, allowing you to take your database with you.
focuseek searchbox is a family of easily installable full-text search engines that can spider Internet and intranet data sources (Web sites, newsgroups, FTP sites, and others) or index data you feed to it and make it available for searching. It supports a variety of input formats (among them HTML, PDF, Microsoft Word DOC, and RTF), and is easily scriptable via SOAP and extendable through plugins. It can scale to millions of documents and comes with a full-fledged GUI client, a built-in Web search portal, and an RSS server.
This is a tool to collect information from Web servers and to spider Web sites. It was written for the Open Source Security Testing Methodology Manual (OSSTMM), located at http://www.ideahamster.org/osstmm-description.htm. The spider is a multi-threaded, reusable module that can be used in other projects.
MM3WebAssistant Proxy Offline Browser Pro archives the Web pages you visit with your browser so that they can be used online or offline. Offline, each page is available under its original URL, so there is no difference between browsing the Internet and browsing the archive. You can even use your bookmarks offline. Search, navigation, and marking make efficient use possible. It allows mobile users to access Internet information when they don't have Internet access.