Duke

Duke is a fast and flexible record linkage engine. Instead of the traditional blocking (sort by key) approach, it relies on Lucene to find candidate matches. This makes it high-performance (able to process 1,000,000 records in ~10 minutes). Duke can be run from the command line, and it also has an API that makes it easy to build incremental linking applications. It reads data from CSV, JDBC, SPARQL, and N-Triples sources, and supports a number of string comparators and string normalizers.

Recent releases

  •  15 Feb 2014 15:20

    Release Notes: This release adds much faster backends based on blocking. One is in-memory, the other is based on MapDB. It also has a new Record implementation which uses only 50% of the memory, and a number of other changes.

  •  19 Oct 2013 10:06

    Release Notes: The main new feature is a genetic algorithm, which can be used to tune configurations automatically. Thanks to active learning it can even be used without a correct set of test data.

  •  02 Mar 2013 09:14

    Release Notes: Support for multi-threading, an upgrade to Lucene 4.0, higher performance, more comparators, more cleaners, major improvements to the command line client, and more.

  •  15 Sep 2012 09:10

    Release Notes: New comparators, new cleaners, some bugfixes, an upgrade to Lucene 3.6.1, and some improvements in configurability.

  •  28 Mar 2012 10:01

    Release Notes: The internals have been cleaned and refactored, adding some performance tuning parameters. There are new cleaners, support for pluggable backends, a new naïve in-memory backend, and much more.
