[SOLVED] Record Duplication Detection, Deduping, and Duplicate Merging


#33

In Airtable, there’s no such thing as using a field for “your” primary key. Airtable uses an internal field that is not visible through the user interface but is accessible through the API.

The first field at the left in Grid View is not the primary key. It’s what shows up in a linked record and it can be changed without affecting the link. There’s no need for it to be unique.


#34

Thanks for the explanation. I guess I mentally glossed over the difference between the primary field and primary key.


#35

I’m looking for a way to delete and/or merge duplicates as well!


#36

I see people are still suffering because of this, so I want to bring up this suggestion which didn’t seem to get much attention and it will absolutely help those who really need to find [at least some kind of] duplicates with some ease.


#37

As with lots of other issues this ia longly requested one.

I’ll add my +1 also by making a suggestion: How about allowing to filter grouped results.

Example: If I group by an email field in a table of 5000 entries with 10 duplicates, i can find duplicate email entries but I have to scroll through 4990 grouped rows. If I would be able to filter out grouped results with a result count <= 1 this could be much easier.

This feature would also help in other situations.


#38

I’ve just made available on Google Drive a brief write-up of two approaches to duplication prevention within Airtable, one for de-duplication of an existing base and a second for ongoing duplicate detection. It’s an excerpt (in PDF format) from a larger work-in-progress: a collection of Airtable tips and tricks gleaned from postings in the Support forum, premised on the belief that important and useful information often goes overlooked because it lies buried in a confusingly or uninformatively titled thread. (Case in point: These two solutions were originally responses to Find duplicates with Zapier – even though they are built entirely with garden-variety Airtable.) I’ve created a 1,000-record[1] demonstration base for each approach, and the guide provides step-by-step instructions on how to retrofit either functionality into an existing base.

The de-duplication routine is intended to be run one or fewer times against an existing base only; it is almost painfully resource-hungry, and I offer it only reluctantly. Fortunately, once the base has been purged of duplicates, the worst-offending of the data structures can be eliminated and the rest modified or enhanced to support duplicate detection as part of ongoing base update.

Briefly, for both solutions the user defines a matchStr assembled from fields in the main table – that is, the table where duplicate records should be identified. matchStr should be designed to place more importance on the more-distinguishing characteristics of the record while ignoring those liable to be common among a number of records. For example, in the demo bases, matchStr contains the formula

UPPER(GivenName&SurName)

hoping to prevent false negatives caused by, say, the failure to include MiddleInitial on one record.

The systems alert when more than one record contains identical matchStrs; with the de-duplication code, any record containing matching matchStrs is tagged as a duplicate, as well. (While the duplicate detection routine toggles a single, system-wide flag, the expectation is the most recently entered record is one of a pair.) In both instances, the records are flagged only as possible duplicates, with final determination left to the user. Should he or she decide the records represent a false positive – for instance, a misperceived match between John Q. Public and John S. Public – the user selects the DupeOK field of the current record. This causes RECORD_ID() to become appended to matchStr, making it unique and resetting the flags.

Do these two sample implementations address all of the requirements and concerns voiced elsewhere in this thread? Of course not – but, then, they never intended to. Rather, these were attempts to demonstrate just how much of needed functionality could be provided using only the current release of Airtable.

. ----------------------

  1. Actually, a 995-record base, having backed off from 1,000 to avoid the limit on free accounts.

[Note: The document referenced here is only an extract from a larger piece, released here prior to editing and fact-checking. The piece should be considered a lightly edited draft, at best.]


Going mad trying to avoid duplicates with ARRAYUNIQUE
Validating particular sets of fields
[SOLVED] Help With Staging A Base
Booking items shown as booked to avoid double booking
Unique email - check duplicates in the table
Check if column value from one table exists in another table
Prevent Duplicates
What are your favorite Airtable tips & tricks?
#39

In our case we want to prevent duplicates from getting created in the primary field. The UI should detect a duplicate as it is typed in and prevent the row from being saved. If trying to create a row via the API and the primary field value is a duplicate, then an error response should be returned. Hovering over the lock in the first column header states that the value is intended as “short unique representation” of the record. So it seems strange that this uniqueness would not be enforced automatically, or at least be able to toggle a setting.


#40

I’ve been struggling with this problem for a long time also, but I walked through @W_Vann_Hall’s demo and it was extremely helpful. I was looking (like I think lots of ppl are) for a way to constantly check for duplicates, so the ARRAY issues concerned me.

Here’s my attempt to simplify the demo:

  1. My own example table is posted here: https://airtable.com/shrEZNMgiKrQdv8lp/tblcywQ2RsDL3H9iH/viwd63P4gD3yMRUqD
  2. Generate a “test duplicates” base with ARRAYJOIN roll-up to your list (this creates a string variable to pass back to your individual list items
  3. In your main base, pass the list back using an ARRAYJOIN roll-up. I know this is counter-intuitive since you are joining only 1 other, but its necessary to read the column as a string
  4. Create a formula to detect duplicates:
    IF(LEN(List) - LEN(SUBSTITUTE(List,Name,"")) > LEN(Name),“yes”,“no”)

Hope this is helpful for others!


#41

@Christopher_Rex - like it - and wouldn’t it work so much better it you could default (and then hide) the linking field so that the process is transparent to the user.


#42

minor challenge with “as you type” detection and prevention is what if you have multiple people with same company name? and want each of them… but don’t want two bob smiths, or there are two bob smiths but they are actually different people with different contact info…
that needs to be part of the protocol… allowing for specific fields to be duplicated when necessary/logical…

totally agree with the disconnect of the primary field not forcing uniqueness at least!
i’ve taken to creating the formula field with auto-numbering to achieve that, and so far it’s working.

Being able to SET your primary field when you import would sure be a help…

… back to my manual deduplicating…


#43

I agree with the prevention of possible records, which could be done when entering data manually. This could easily be a feature on a column type, where the “dupe detector” is turned on, and it searches to see if the record already exists, giving the user an option to add, ignore, or auto-rename. I have a small Contacts base that already has Bob Smith several times.


#44

@Gustavo_Lopez

The Duplicate Detection demo base allows the user to override a detected match. For instance, I enter the name of the second Bob Smith, create the link to the DeDupe table, and am immediately notified of a detected match. I then click the checkbox field beside Bob Smith II’s name, and the alert disappears.

Again, I see the techniques demonstrated in the two example bases as aspiring to some day being called a kludge; next to them, your typical stopgap measure would qualify as ‘elegant.’ That said, they still beat deduping manually from a printout hands down… :wink:


#45

Here is another solution to this issue. It requires one relationship that you might already have.

Let’s say you have two tables: Contacts and Companies and you want to detect duplicate Emails in the Contacts table. The initial structure should look something like this:

Contacts
Email | Company (Link to)

Company
Company Name | Emails (link to)

In the Contacts table we’re going to add two columns, that we can later hide.

Col1: Rollup::Company::Emails with the function CONCATENATE(values)
Col2: Formula with the formula LEN({Col1}) != LEN(SUBSTITUTE({Col1},Email,"",2))

Hide these columns and create a conditional color change with the formula Where Col2 = 1 to create a highlight effect on duplicate emails.

Hope this can help someone else.


#46

Christoff,
The fact that the “Key” field can contain duplicates causes big problems for us when we convert columns of text to links - we can’t be sure which record gets the link.
-Donald


#47

Donald, I’m not sure why you directed your comment to me. I am not an Airtable employee and I can’t solve the big problems that you’re experiencing with duplicate records.


#48

Just adding my +1 for a dupe detection feature, I’m using Zapier to bring in form data and don’t want to duplicate contacts if they are already in the table.


#49

Hi Everyone,

Just of the top of my head…
(This would be directed to the users more experienced with programming/coding)

A solution to this problem would be to use AIrtable’s API…
It allows you to create,delete etc records in Airtable…

eg. Use case

When you want to import data in from Google sheets…
Use the Google SheetsAPI to store each record in a data structure(In case you are using JAVA to access the API you could use an ArrayList)
ONLY add the record in the data structure if a similar record does not exists.

After that you could loop over the data structure (which now consists of ONLY unique records) and use the Airtable API to insert it into a particular table.


#50

+1 here for a duplicates detection and possibly pre-defined actions when they are detected. This is casually usefully in SO MANY scenarios. Specifically, I’m using Airtable for asset tracking using barcodes, and would like to know if someone has attempted to use the same asset ID number twice.


#51

This is ultimately an ask for data validation per field, against another field or linked field.


#52

This feature looks like it’s something the user base want and was requested back in early 2016… can we get an official answer from Airtable on whether or not its been approved?

2 years…