[SOLVED] Record Duplication Detection, Deduping, and Duplicate Merging

Aaron_Owen · ‎Feb 10, 2016

I would love to have a feature that would search a given table by a field or set of fields to find and surface duplicate rows. It would then be great to be able to take a pre-defined action on duplicated rows such as delete, or merge. Dealing with dupes is always a pain with large data sets or importing new data into a set where some records may already exist. A way to deal with dupes in a somewhat automated way would be really helpful.

W_Vann_Hall · ‎Aug 20, 2017

I’ve just made available on Google Drive a brief write-up of two approaches to duplication prevention within Airtable, one for de-duplication of an existing base and a second for ongoing duplicate detection. It’s an excerpt (in PDF format) from a larger work-in-progress: a collection of Airtable tips and tricks gleaned from postings in the Support forum, premised on the belief that important and useful information often goes overlooked because it lies buried in a confusingly or uninformatively titled thread. (Case in point: These two solutions were originally responses to Find duplicates with Zapier – even though they are built entirely with garden-variety Airtable.) I’ve created a 1,000-record[1] demonstration base for each approach, and the guide provides step-by-step instructions on how to retrofit either functionality into an existing base.

The de-duplication routine is intended to be run one or fewer times against an existing base only; it is almost painfully resource-hungry, and I offer it only reluctantly. Fortunately, once the base has been purged of duplicates, the worst-offending of the data structures can be eliminated and the rest modified or enhanced to support duplicate detection as part of ongoing base update.

Briefly, for both solutions the user defines a matchStr assembled from fields in the main table – that is, the table where duplicate records should be identified. matchStr should be designed to place more importance on the more-distinguishing characteristics of the record while ignoring those liable to be common among a number of records. For example, in the demo bases, matchStr contains the formula

UPPER(GivenName&SurName)

hoping to prevent false negatives caused by, say, the failure to include MiddleInitial on one record.

The systems alert when more than one record contains identical matchStrs; with the de-duplication code, any record containing matching matchStrs is tagged as a duplicate, as well. (While the duplicate detection routine toggles a single, system-wide flag, the expectation is the most recently entered record is one of a pair.) In both instances, the records are flagged only as possible duplicates, with final determination left to the user. Should he or she decide the records represent a false positive – for instance, a misperceived match between John Q. Public and John S. Public – the user selects the DupeOK field of the current record. This causes RECORD_ID() to become appended to matchStr, making it unique and resetting the flags.

Do these two sample implementations address all of the requirements and concerns voiced elsewhere in this thread? Of course not – but, then, they never intended to. Rather, these were attempts to demonstrate just how much of needed functionality could be provided using only the current release of Airtable.

. ----------------------

Actually, a 995-record base, having backed off from 1,000 to avoid the limit on free accounts.

[Note: The document referenced here is only an extract from a larger piece, released here prior to editing and fact-checking. The piece should be considered a lightly edited draft, at best.]

May_Tulin · ‎Sep 24, 2017

Hi there, apologies if this is written about elsewhere, I searched and didn’t find many relevant posts.

We have two duplicate records created from Acuity via Zapier. One Randall H. and one Randy H. Same person. I changed the name of the Randy in airtable to Randall, but they’re still different records and I’d like to merge all this student’s lessons to be under the same single student. Can this be done? Thanks!

Chais_Meyer · ‎Oct 09, 2017

You can “update” an Airtable record using zapier (Acuity & Airtable), which would replace the old record information. This is what you’d want to do. But, there is no way currently to merge information, without building a SUPER complex zap or some custom code.

I would love to do contact merge info as well, just not there yet.

Paul_K · ‎Oct 09, 2017

Have a public/example base showing how you’d like to merge them?

David_Von_Lehma · ‎Oct 11, 2017

In our case we want to prevent duplicates from getting created in the primary field. The UI should detect a duplicate as it is typed in and prevent the row from being saved. If trying to create a row via the API and the primary field value is a duplicate, then an error response should be returned. Hovering over the lock in the first column header states that the value is intended as “short unique representation” of the record. So it seems strange that this uniqueness would not be enforced automatically, or at least be able to toggle a setting.

Christopher_Rex · ‎Oct 21, 2017

I’ve been struggling with this problem for a long time also, but I walked through @W_Vann_Hall’s demo and it was extremely helpful. I was looking (like I think lots of ppl are) for a way to constantly check for duplicates, so the ARRAY issues concerned me.

Here’s my attempt to simplify the demo:

My own example table is posted here: https://airtable.com/shrEZNMgiKrQdv8lp/tblcywQ2RsDL3H9iH/viwd63P4gD3yMRUqD
Generate a “test duplicates” base with ARRAYJOIN roll-up to your list (this creates a string variable to pass back to your individual list items
In your main base, pass the list back using an ARRAYJOIN roll-up. I know this is counter-intuitive since you are joining only 1 other, but its necessary to read the column as a string
Create a formula to detect duplicates:
IF(LEN(List) - LEN(SUBSTITUTE(List,Name,"")) > LEN(Name),“yes”,“no”)

Hope this is helpful for others!

Julian_Kirkness · ‎Oct 22, 2017

@Christopher_Rex - like it - and wouldn’t it work so much better it you could default (and then hide) the linking field so that the process is transparent to the user.

kera_mchugh · ‎Oct 26, 2017

minor challenge with “as you type” detection and prevention is what if you have multiple people with same company name? and want each of them… but don’t want two bob smiths, or there are two bob smiths but they are actually different people with different contact info…
that needs to be part of the protocol… allowing for specific fields to be duplicated when necessary/logical…

totally agree with the disconnect of the primary field not forcing uniqueness at least!
i’ve taken to creating the formula field with auto-numbering to achieve that, and so far it’s working.

Being able to SET your primary field when you import would sure be a help…

… back to my manual deduplicating…

Gustavo_Lopez · ‎Oct 30, 2017

I agree with the prevention of possible records, which could be done when entering data manually. This could easily be a feature on a column type, where the “dupe detector” is turned on, and it searches to see if the record already exists, giving the user an option to add, ignore, or auto-rename. I have a small Contacts base that already has Bob Smith several times.

W_Vann_Hall · ‎Oct 30, 2017

@Gustavo_Lopez

The Duplicate Detection demo base allows the user to override a detected match. For instance, I enter the name of the second Bob Smith, create the link to the DeDupe table, and am immediately notified of a detected match. I then click the checkbox field beside Bob Smith II’s name, and the alert disappears.

Again, I see the techniques demonstrated in the two example bases as aspiring to some day being called a kludge; next to them, your typical stopgap measure would qualify as ‘elegant.’ That said, they still beat deduping manually from a printout hands down… :winking_face:

[SOLVED] Record Duplication Detection, Deduping, and Duplicate Merging

Format Formula as Checkbox

CSV Import in Interfaces

React to Record Comments

Allow Emoji reactions to comments

Multiplying by percent column yields result that i...