go install github.com/go-air/dupi/...@latest
(If you would like binary distributions, please let us know on the issue tracker)
Dupi is designed to identify all "duplicates" in a set of documents. What a "duplicate" means in this context can vary a bit, so let's first rule out some possible misuse cases:
Dupi is not designed to find all similar words in a set of words, nor is it designed to show "similar" documents to a given document, particularly in a way which takes into account semantic content, such as considering themes or synonyms.
A "duplicate" for dupi means, more or less, the same text being present in more than one document. Dupi can be used to help identify plagiarism or copyright violations across large sets of documents, for example.
Text processing is, however, inherently noisy: the same text may be formatted differently, wrapped at different line lengths, capitalised differently, or contain OCR errors. For this reason, dupi is built so that the notion of a "duplicate" can be extended in various ways and to various kinds of documents.
Out of the box, dupi is configured simply to find common subsequences of text across sets of documents. This tutorial addresses that use case; other use cases require more sophisticated usage and development.
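To make the noise problem concrete, here is a minimal Go sketch of one possible normalisation step, which lower-cases text and collapses line breaks and repeated whitespace. The function name `normalize` is hypothetical and is not part of dupi's API; how dupi itself handles these differences is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lower-cases s and collapses every run of whitespace
// (spaces, tabs, line breaks) into a single space.
// Illustrative helper only; not taken from dupi.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

func main() {
	a := "The quick brown\nfox jumps."
	b := "the  QUICK brown fox\n\tjumps."
	// The raw strings differ, but their normalised forms are equal.
	fmt.Println(a == b, normalize(a) == normalize(b)) // prints: false true
}
```

A step like this does nothing for OCR errors, which change the characters themselves rather than their layout.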
Let's get started.
In the following, we will
- Look at the command line usage.
- Create a dupi index from a set of documents.
- Extract duplicate blots from the set of documents.
- Examine the blots.
- Append more documents to the index.
- Query the set of documents for things similar to a given document.
Dupi provides a command line interface of a form which is common these days: dupi <verb> [args], where verb tells dupi what to do and [args] provides information about the object of the verb or modifiers.
dupi -h
usage: dupi [global opts] <verb> <args>
verbs are:
index paths
extract extract from the index root
blot blot [files]
unblot unblot <blot>
inspect inspect the root index.
like file.
global options:
-r default="" index root
To get help on a verb, try dupi <verb> -h.
To create an index just run 'dupi index' and provide it with a list of files or directories. Dupi will traverse all subdirectories and add each file. The files should be text files.
Example:
dupi index .
This will create an index of all files under the current directory and store it in $HOME/.dupi.
The 'extract' verb extracts sets of documents which share a blot.
dupi extract
These results are fast but noisy due to blot collisions. Extraction skips any blots which are associated with one or fewer documents.
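As a rough picture of what this means, the Go sketch below treats a blot as a 32-bit hash of a fixed-width window of words, builds a map from blot to documents, and drops blots seen in only one document. All of this is an assumption for illustration: `blotOf`, `index`, `shared`, the FNV hash and the window width are hypothetical and do not describe dupi's actual blotting scheme.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// blotOf hashes a window of words to 32 bits. Hypothetical stand-in for a
// dupi blot: different windows can collide on the same value, which is
// one source of noise.
func blotOf(words []string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(strings.Join(words, " ")))
	return h.Sum32()
}

// index maps each blot to the sorted names of the documents containing it.
func index(docs map[string]string, width int) map[uint32][]string {
	seen := map[uint32]map[string]bool{}
	for name, text := range docs {
		words := strings.Fields(text)
		for i := 0; i+width <= len(words); i++ {
			b := blotOf(words[i : i+width])
			if seen[b] == nil {
				seen[b] = map[string]bool{}
			}
			seen[b][name] = true
		}
	}
	out := map[uint32][]string{}
	for b, names := range seen {
		for n := range names {
			out[b] = append(out[b], n)
		}
		sort.Strings(out[b])
	}
	return out
}

// shared keeps only blots associated with more than one document.
func shared(idx map[uint32][]string) map[uint32][]string {
	out := map[uint32][]string{}
	for b, names := range idx {
		if len(names) > 1 {
			out[b] = names
		}
	}
	return out
}

func main() {
	docs := map[string]string{
		"a.txt": "to be or not to be that is the question",
		"b.txt": "he asked whether to be or not to be",
		"c.txt": "something else entirely with no overlap here",
	}
	for b, names := range shared(index(docs, 4)) {
		fmt.Printf("%08x %v\n", b, names)
	}
}
```

A narrower hash or a shorter window makes collisions and accidental matches more likely, which is the kind of noise the options below are meant to reduce.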
Some options may be of interest
-b output only blots, one per line
-json output json (incompatible with -b)
-sigma float output only those blots with mean + sigma documents
By default, sigma is 3.0. It is expressed in terms of the standard deviation of
the number of documents associated with a blot. A higher value
outputs less information, which is more likely to be associated with
actual duplicate text. A lower value is more thorough (has higher
recall) but is less precise.
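One plausible reading of this option, sketched in Go below, is a cut-off of mean + sigma × standard deviation over the per-blot document counts. This interpretation is an assumption, and the names `threshold` and `keep` are hypothetical; consult `dupi extract -h` for the authoritative description.

```go
package main

import (
	"fmt"
	"math"
)

// threshold returns mean + sigma*stddev of the per-blot document counts.
// ASSUMPTION: this reading of -sigma is a guess, not confirmed by dupi.
func threshold(counts []int, sigma float64) float64 {
	if len(counts) == 0 {
		return 0
	}
	var sum float64
	for _, c := range counts {
		sum += float64(c)
	}
	mean := sum / float64(len(counts))
	var sq float64
	for _, c := range counts {
		d := float64(c) - mean
		sq += d * d
	}
	return mean + sigma*math.Sqrt(sq/float64(len(counts)))
}

// keep returns the counts that reach the threshold for the given sigma.
func keep(counts []int, sigma float64) []int {
	t := threshold(counts, sigma)
	var out []int
	for _, c := range counts {
		if float64(c) >= t {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Number of documents associated with each of ten made-up blots.
	counts := []int{2, 2, 2, 2, 2, 2, 4, 4, 8, 12}
	for _, s := range []float64{1.0, 2.0, 3.0} {
		fmt.Printf("sigma=%.1f threshold=%.2f kept=%v\n", s, threshold(counts, s), keep(counts, s))
	}
}
```

Under this reading, raising sigma raises the cut-off, so fewer blots survive, which matches the trade-off described above.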
dupi index -a /path/to/new/docs
Sometimes it might be interesting to see if a file has a blot. Dupi provides the ability to blot files using the same mechanism as is used in the index.
dupi blot file
Dupi provides primitives for unblotting, which takes a blot and reconstructs the corresponding text and instances. This is still rudimentary, but here is an example.
dupi extract -b | xargs dupi unblot
Dupi provides a 'like' verb which permits finding documents that are similar to a given one which is not in the index.
dupi like file
We have shown some basic usage of dupi. As dupi is in the early stages of development, some things may be added or changed; we will try to keep this document up to date. Feel free to help improve our documentation using issues or pull requests.