All about Tracker Duplicates (Potential Duplicates)


A Script has been associated to this Guide: Listing Potential Duplicate TEIs

Let's imagine you get a ticket that says “I think we have duplicate TEIs and Enrollments - how do I identify and delete them?”. Now what? This Guide pulls different pieces of the documentation together to help us find a solution to the client’s needs.

Version Check! At the time this Guide was created (2.36.5), the “Potential Duplicates” feature was still under design and development. Depending on when the new changes are released, some things may be different than what this Guide details and will need to be updated.

Details here: https://community.dhis2.org/t/questions-about-the-potential-duplicates-feature/43784

The duplicate finding feature

Whether you register a tracked entity instance with or without an enrollment, the system will warn you about possible duplicates. Potential duplicates are the records the data deduplication feature works with: each one represents a single record, or a pair of records, suspected of being a duplicate.

In addition to informing users about the tracked entity instance potentially being a duplicate, the flag will be used by the underlying system for finding and merging duplicates in coming versions of DHIS2.

How are duplicates identified in programs?

When configuring a program, if you enable Searchable on specific attributes, then DHIS2 will search on those attributes to see if there are any duplicate combinations of values.

If I had to guess, suppose the program had Searchable enabled for the following fields:

  • First Name
  • Last Name
  • Date of Birth
  • Org Unit

Then I think at the database level, that would be searching for groups with a COUNT > 1 where first_name, last_name, DOB, and organisation_unit are all equal.

That means, if you have more searchable fields (5), you’ll likely find fewer duplicates than if you had fewer fields (3), because every one of those fields has to match before two records count as a duplicate.
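To make the guess above concrete, here is a minimal Python sketch of the "COUNT > 1" idea. It is not how DHIS2 implements the check; the record layout, field names, and the `find_duplicate_groups` helper are all made up for illustration.

```python
from collections import defaultdict


def find_duplicate_groups(records, fields):
    """Group records by the chosen fields and keep groups with more than one member.

    Hypothetical helper: mimics a GROUP BY ... HAVING COUNT(*) > 1 in memory.
    Note that records missing a field are grouped under None for that field.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in fields)
        groups[key].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample data: tei1 and tei2 differ only in org unit.
records = [
    {"id": "tei1", "first_name": "Ana", "last_name": "Diaz", "dob": "1990-01-01", "org_unit": "ou1"},
    {"id": "tei2", "first_name": "Ana", "last_name": "Diaz", "dob": "1990-01-01", "org_unit": "ou2"},
    {"id": "tei3", "first_name": "Ben", "last_name": "Okoro", "dob": "1985-05-05", "org_unit": "ou1"},
]

three_fields = find_duplicate_groups(records, ["first_name", "last_name", "dob"])
four_fields = find_duplicate_groups(records, ["first_name", "last_name", "dob", "org_unit"])

print(len(three_fields), "duplicate group(s) with three fields")
print(len(four_fields), "duplicate group(s) with four fields")
```

With three fields, tei1 and tei2 are flagged together; adding the org unit as a fourth field separates them, which is the trade-off described above.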

Either can be a good design or a poor design - it depends on your requirements, as well as how solid data collection is. If the data entrants cannot reliably fill out five fields for every record, I would not recommend having five searchable fields.

In short - it is likely best practice to enable searchable on fields that are most relevant to your TEIs as well as those that you are certain will have very high response rates.

How do we view duplicates?

You can retrieve a list of potential duplicates using the following endpoint:

GET /api/potentialDuplicates

Additionally you can inspect individual records:

GET /api/potentialDuplicates/<id>
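A rough Python sketch of calling these two endpoints with only the standard library might look like the following. The base URL, credentials, and the `build_request` / `fetch_json` helpers are placeholders of mine, Basic authentication is assumed, and the shape of the JSON response is deliberately not assumed here, so inspect it on your own instance.

```python
import base64
import json
import urllib.request


def build_request(base_url, path, username, password, method="GET"):
    """Build an authenticated request for a DHIS2 API path (hypothetical helper)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}",
        method=method,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


def fetch_json(request):
    """Send the request and return the decoded JSON body as-is."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder server and credentials; replace with your own instance.
BASE_URL = "https://dhis2.example.org"
list_request = build_request(BASE_URL, "/api/potentialDuplicates", "someuser", "somepassword")

# Nothing is sent until fetch_json is called, for example:
# all_duplicates = fetch_json(list_request)
# one_duplicate = fetch_json(build_request(BASE_URL, "/api/potentialDuplicates/<id>", "someuser", "somepassword"))
```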

Potential Duplicates are also viewable in the Tracker Capture App. This is not a very useful way of “finding” duplicates when you don’t already know anything about them.

The other place is within the tracked entity instance dashboard.

How do we manage duplicates?

Linking

Linking duplicate TEIs is the feature that was designed for acknowledging and managing true duplicates.

From the documentation (very bottom of linked section):

When filling in data you might face a warning telling you that a possible duplicate has been found. You can click the warning to see these duplicates and if the duplicate is a match you can choose to link that Tracked Entity Instance by clicking the Link button. If the warning is still present when you are done filling in data, you will not see the Create Tracked Entity Instance and Link button. Instead you will be presented with a button called Review duplicates. When you click this button a list of possible duplicates will be displayed. If any of these duplicates matches the Tracked Entity Instance you are trying to create you can click the Link button, if not you can click the Save as new person button to register a new Tracked Entity Instance.

What about duplicates with no “pair”?

The payload of a potential duplicate looks like this:

{
  "teiA": "<id>",
  "teiB": "<id>",
  "status": "OPEN|INVALID|MERGED"
}

But what if no teiB is listed? What does that mean?

Well, we don’t know yet. Perhaps it is a way of marking a TEI as: "maybe this is a duplicate but I don't know of what" 🤷‍♂️
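Until the meaning is clear, it may help to at least separate the two kinds of records when reviewing a list. A small Python sketch, assuming you already have a list of payloads shaped like the one above (the `split_by_pairing` helper and the sample ids are made up):

```python
def split_by_pairing(potential_duplicates):
    """Split payloads into those with a teiB and those without (hypothetical helper)."""
    paired, unpaired = [], []
    for item in potential_duplicates:
        # A missing, null, or empty teiB all count as "no pair" here.
        if item.get("teiB"):
            paired.append(item)
        else:
            unpaired.append(item)
    return paired, unpaired


# Made-up sample payloads for illustration only.
sample = [
    {"teiA": "aaa111", "teiB": "bbb222", "status": "OPEN"},
    {"teiA": "ccc333", "status": "OPEN"},
    {"teiA": "ddd444", "teiB": None, "status": "INVALID"},
]

paired, unpaired = split_by_pairing(sample)
print(len(paired), "paired;", len(unpaired), "with no teiB")
```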

What about the Statuses?

Open

It has been flagged as a duplicate and no action has been taken.

Invalid

From Developer Docs

You can mark a potential duplicate as invalid to tell the system that the potential duplicate has been investigated and deemed to be not a duplicate. To do so you can use the following endpoint:

PUT /api/potentialDuplicates/<id>/invalidation
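As a sketch only, the invalidation call could be prepared in Python like this. The server URL, credentials, the example id, and the `build_invalidation_request` helper are placeholders, and Basic authentication is assumed. The request is built but not sent, so nothing changes on a server until you pass it to `urllib.request.urlopen`.

```python
import base64
import urllib.request


def build_invalidation_request(base_url, duplicate_id, username, password):
    """Prepare a PUT to the invalidation endpoint for one potential duplicate."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/potentialDuplicates/{duplicate_id}/invalidation",
        method="PUT",
        headers={"Authorization": f"Basic {token}"},
    )


# Placeholder values; "abc123" stands in for a real potential duplicate id.
invalidation_request = build_invalidation_request(
    "https://dhis2.example.org", "abc123", "someuser", "somepassword"
)

# To actually mark the record as invalid (this modifies data on the server):
# urllib.request.urlopen(invalidation_request)
```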

Linked?

Why is there no status for Linked, which is described as an action you can take on Potential Duplicates?