Duke

Duke is a fast and flexible record linkage engine. Instead of the traditional blocking (sort by key) approach, it relies on Lucene, which makes it fast: it can process 1,000,000 records in about 10 minutes. Duke can be run from the command line, and it also has an API that makes it easy to build incremental linking applications. It reads data from CSV, JDBC, SPARQL, and N-Triples sources, and it provides a number of string comparators and string normalizers.
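To make the roles of string normalizers and string comparators concrete, here is a minimal sketch of record linkage in Python. It is not Duke's API: the names `normalize`, `levenshtein`, `similarity`, and `link`, the record layout, and the 0.8 threshold are all assumptions chosen for illustration. It also compares every pair of records, which is the naive approach; Duke relies on Lucene instead, and that part is not reproduced here.

```python
import unicodedata
from itertools import combinations


def normalize(value: str) -> str:
    """Hypothetical normalizer: strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical comparator: 1.0 for identical normalized strings, lower otherwise."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def link(records, field, threshold=0.8):
    """Return ID pairs whose field similarity meets the (assumed) threshold.

    Compares all pairs, so the cost grows quadratically with the record count.
    """
    matches = []
    for left, right in combinations(records, 2):
        if similarity(left[field], right[field]) >= threshold:
            matches.append((left["id"], right["id"]))
    return matches


# Made-up sample records for illustration only.
records = [
    {"id": 1, "name": "Jose  García"},
    {"id": 2, "name": "jose garcia"},
    {"id": 3, "name": "Maria Lopez"},
]
linked_pairs = link(records, "name")
```

The quadratic cost of the all-pairs loop is the reason real engines restrict comparisons to likely candidates, whether through blocking or through an index lookup.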

Recent releases

  •  28 Mar 2012 17:14

    Release Notes: The internals have been cleaned and refactored, adding some performance tuning parameters. There are new cleaners, support for pluggable backends, a new naïve in-memory backend, and much more.

  •  13 Jan 2012 16:18

    Release Notes: This release adds a more flexible API, a new cleaner (for personal names), two new data sources (in-memory and JNDI), and a number of bugfixes. Some additional utilities have also been added.

  •  11 Sep 2011 17:05

    Release Notes: This release offers a cleaned-up API and more comparators.

  •  02 Jun 2011 07:55

    Release Notes: This version fixes a number of bugs and adds a number of improvements. Example data and setup are now included in the distribution. New JaroWinklerTokenized and DifferentComparator comparators were provided along with a new DebugCompare command, more flexibility in the CSV data source, better reporting of configuration errors, and a --verbose option.

  •  20 May 2011 22:46

    Release Notes: The first version.
