OpenSearchServer is a stable, high-performance search engine and a suite of high-powered full text search algorithms. Documents can be indexed in sixteen languages. Multi-lingual analyzers slice sentences into words, then run lemmatisation algorithms on words based on the document's language. Numerous document formats are supported, such as XML, HTML/XHTML, PDF, Word, PowerPoint, RTF, OpenOffice, plain text, MP3/4, Ogg, FLAC, etc. The Web interface, built around the Zkoss framework, provides an easy way to manage OSS. The integration is fast using the PHP client or the API (XML over HTTP). The crawlers of OpenSearchServer go through Web sites, file systems, and databases to rapidly and easily build your index.
LogicalDOC is a Web-based document management system that is easy to use and learn. Its architecture leverages best-of-breed Java technology to achieve a powerful and flexible solution. It supports its users with a powerful search engine (Lucene), Web service interface (JAX-WS via CXF) compatible with .NET and PHP, versioning, annotation on documents, a WebDAV interface, importing and exporting from .zip files. Documents can be organized into hierarchical folders, searched using the integrated search engine, or browsed by Tag. The system is extensible thanks to the technologies used (Spring-Hibernate) and its plugin architecture.
Nuxeo Platform provides a framework and set of components to address document management and collaboration needs, including metadata/taxonomies, versioning, lifecyle management, workflow, relations, searching, reporting, transformation, auditing, and retention. Its flexible extension system, based on OSGi, allows developers to quickly configure and extend the platform by creating new components. Its default Web user interface, based on the JSF standard, uses AJAX to create a pleasant user experience. It can also be accessed by a rich client interface through the use of Web services, for instance using the Eclipse-based Nuxeo RCP rich client platform.
PDFTextStream is a PDF text and metadata extraction library available for Java and .NET. It supports all versions of the PDF document specification (including v1.7, used by Acrobat 8, 9, and X), extraction of text encoded using double-byte character sets (including Chinese, Japanese, and Korean), decryption of documents encrypted using 40-bit, 128-bit, 256-bit, and variable bit length ciphers, and extraction of all document metadata provided by PDF documents (including form data, bookmarks, and annotations). Easy integration with Jakarta Lucene is included, as well as interactive form update capability.
Xapian is a search engine library, scalable to collections containing hundreds of millions of documents. It's written in C++ with bindings for Perl, Python, PHP, Java, Tcl, C#, Ruby, and Lua. It is a highly adaptable toolkit that allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also a rich set of boolean query operators. Omega is a Web search application built upon the Xapian library. It can index a Web server's document tree (including HTML, PDF, OpenOffice, MS Word/Excel/Powerpoint/Works, WordPerfect, RTF, PS, etc.), or data exported from arbitrary sources (e.g. SQL databases).
Splunk is an engine for machine data. Use Splunk to collect, index, and harness the fast moving machine data generated by all your applications, servers, and devices: physical, virtual, and in the cloud. Search and analyze all your real-time and historical data from one place. Splunking your machine data lets you troubleshoot problems and investigate security incidents in minutes, not hours or days. Monitor your end-to-end infrastructure to avoid service degradation or outages. Meet compliance mandates at lower cost. Correlate and analyze complex events spanning multiple systems. Gain new levels of operational visibility and intelligence for IT and the business.
Sesat is a search middleware with federation capabilities and a built-in search portal framework. It makes it easy to build applications that look for information in many different places simultaneously. It can connect to almost any kind of data source that can be accessed using Java - databases, search indexes, files, back office systems, Web services, and ESBs.
p-get is a porn Web spider for the efficient download of Internet video pornography. It is designed to never download the same file twice. It downloads from TGP galleries. Although written in Java, it is a command line tool in the Unix tradition. It is known to work in Linux, Windows, and OS X with Sun's Java, gcj, and jamvm+GNU classpath.
focuseek searchbox is a family of easily installable full-text search engines that can spider Internet and intranet data sources (Web sites, newsgroups, FTP sites, and others) or index data you feed to it and make it available for searching. It supports a variety of input formats (among them HTML, PDF, Microsoft Word DOC, and RTF), and is easily scriptable via SOAP and extendable through plugins. It can scale to millions of documents and comes with a full-fledged GUI client, a built in Web search portal, and an RSS server.
This is a tool to collect information from web servers and to spider the web sites. This was written for the Open Source Security Testing Methodology (OSSTM) located on http://www.ideahamster.org/osstmm- description.htm. The spider is a multi-threaded resusable module that can be used in other projects.