[SOLVED] Record Duplication Detection, Deduping, and Duplicate Merging


#1

I would love to have a feature that would search a given table by a field or set of fields to find and surface duplicate rows. It would then be great to be able to take a pre-defined action on duplicated rows such as delete, or merge. Dealing with dupes is always a pain with large data sets or importing new data into a set where some records may already exist. A way to deal with dupes in a somewhat automated way would be really helpful.


#2

I was told this is a future feature. Does anyone have an idea of when?


#3

As a rule, we don’t commit to specific timelines. But the more details we get from you all about how you’d want to use a feature, or how you’d want it to be implemented, the easier it is for our team to work on those features and make them the best they can be. Your feedback is valuable and appreciated!


#4

First of all I would like help on preventing duplicates.

Second, on duplicate detection, I think it should work similarly to Gmail’s contacts duplicate detection and merging.


#5

This request came about because of an issue with a contact table. Most address book or contact management applications have the ability to search for dupes. This is usually based on the name field. However, since Airtable’s power is in its flexibility, I would like the ability to choose the criteria by which dupes are surfaced. For example, if I have a contact that has a different first name in two rows, but the same surname and same email address, I’d like to surface those two rows as duplicates.
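A rough sketch of the "choose your own criteria" idea in Python, working on exported rows rather than inside Airtable. The function name `find_duplicates`, the field names, and the sample contacts are all made up for illustration; the normalization (trim and lowercase) is an assumption about what should count as "the same".

```python
from collections import defaultdict

def find_duplicates(records, key_fields):
    """Group records that share the same values for every field in key_fields."""
    groups = defaultdict(list)
    for record in records:
        # Normalize so "BSmith@example.com " and "bsmith@example.com" compare equal.
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        if not any(key):
            continue  # skip rows where every key field is empty
        groups[key].append(record)
    # Only keys seen more than once are duplicate groups.
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical sample data: different first names, same surname and email.
contacts = [
    {"first": "Bob", "surname": "Smith", "email": "bsmith@example.com"},
    {"first": "Robert", "surname": "Smith", "email": "BSmith@example.com"},
    {"first": "Ann", "surname": "Lee", "email": "ann@example.com"},
]
dupes = find_duplicates(contacts, ["surname", "email"])
```

Here `dupes` holds one group containing the Bob and Robert rows; matching on `["first", "surname"]` instead would surface nothing.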

The ability to surface dupes against a percentage of sameness would also be awesome. The example here would be that if 80% of a record’s fields are found to be identical, surface as a dupe.
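One way to read the "percentage of sameness" rule, sketched in Python. The helper names `sameness` and `similar_pairs` and the sample rows are hypothetical, and "identical" is taken literally as exact equality of field values, which is an assumption.

```python
from itertools import combinations

def sameness(a, b, fields):
    """Fraction of the given fields whose values are identical in both records."""
    if not fields:
        return 0.0
    matches = sum(1 for f in fields if a.get(f) == b.get(f))
    return matches / len(fields)

def similar_pairs(records, fields, threshold=0.8):
    """Return index pairs of records whose sameness meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if sameness(a, b, fields) >= threshold
    ]

# Hypothetical sample data: rows 0 and 1 agree on 4 of 5 fields.
rows = [
    {"first": "Bob", "last": "Smith", "email": "b@example.com", "city": "Austin", "phone": "555-0100"},
    {"first": "Robert", "last": "Smith", "email": "b@example.com", "city": "Austin", "phone": "555-0100"},
    {"first": "Ann", "last": "Lee", "email": "a@example.com", "city": "Austin", "phone": "555-0199"},
]
pairs = similar_pairs(rows, list(rows[0]), threshold=0.8)
```

This compares every pair of rows, so it is only practical for small tables; a larger table would need some pre-grouping first.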

Once the duplicate is found, an action should be the next step. One possible action would be to merge the two rows, but that needs more information about HOW to merge them: which should be the master record and which should be the slave / deleted record? Delete older / newer would also be an expected action. Knowing which record is older and which is newer is a primary reason that a lot of folks have requested a date created and a date modified field in other feature requests. Lots of times, a record that was created at a later date is actually the older record in terms of date of modification, and thus has older data. The ability to switch the definition of “newer” based on created / modified status would be helpful here.
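A small Python sketch of the master / slave merge with a switchable definition of "newer". The function `merge_pair`, the `created` and `modified` field names, and the sample records are assumptions for illustration; the rule that an empty master field falls back to the other record is one possible policy, not the only one.

```python
from datetime import datetime

def merge_pair(a, b, newer_by="modified"):
    """Merge two records; the newer one (by the chosen timestamp) is the master."""
    master, other = (a, b) if a[newer_by] >= b[newer_by] else (b, a)
    merged = dict(other)
    # Master values win, except where the master field is empty.
    for field, value in master.items():
        if value not in (None, ""):
            merged[field] = value
    return merged

# Hypothetical sample data: `late` was created later but modified earlier.
late = {"name": "Ann Lee", "email": "ann@old.example", "phone": "",
        "created": datetime(2016, 3, 1), "modified": datetime(2016, 3, 1)}
early = {"name": "Ann Lee", "email": "ann@new.example", "phone": "555-0100",
         "created": datetime(2016, 1, 1), "modified": datetime(2016, 6, 1)}

by_modified = merge_pair(late, early, newer_by="modified")
by_created = merge_pair(late, early, newer_by="created")
```

With `newer_by="modified"` the merged record keeps `ann@new.example`; with `newer_by="created"` it keeps `ann@old.example`. In both cases the phone number survives because the other record left it blank.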

While contact management is a primary use case, I also see this as being useful for data imports. Many times the data stored in Airtable is being generated by an outside source and is imported via csv. Let’s say that this is an outside database system or third party cloud application that can dump its data to a flat file. If I’ve already done an export from the other system, then many of the records that are present in this new file have already been imported into Airtable. The ability to see that there are potential duplicates that are about to be imported, and take an action on them BEFORE the import occurs would be a great feature. The duplicate filtering should work the same as described above, but dupes would be caught before the data lands in a table where it may be linked with other tables.
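Until something like this exists, the pre-import check can be approximated outside Airtable. The sketch below assumes the existing rows are available as a list of dictionaries and the new export is a CSV file; `split_import`, the column names, and the sample data are hypothetical.

```python
import csv
import io

def split_import(existing, incoming, key_fields):
    """Separate incoming rows into new rows and rows that match an existing record."""
    def key(row):
        return tuple(str(row.get(f, "")).strip().lower() for f in key_fields)

    seen = {key(row) for row in existing}
    new_rows, duplicates = [], []
    for row in incoming:
        # Rows whose key already exists are held back for review.
        (duplicates if key(row) in seen else new_rows).append(row)
    return new_rows, duplicates

# Hypothetical sample data standing in for the table and the CSV export.
existing = [{"email": "ann@example.com", "name": "Ann Lee"}]
csv_text = "email,name\nann@example.com,Ann Lee\nbob@example.com,Bob Smith\n"
incoming = list(csv.DictReader(io.StringIO(csv_text)))

new_rows, duplicates = split_import(existing, incoming, ["email"])
```

Only `new_rows` would then be imported, while `duplicates` can be reviewed or merged by hand. Note that this does not catch duplicates within the incoming file itself.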

Another use case that I’ve run into is that I’m integrating records from lots of various sources (data mining) that oftentimes have the same data on a particular organization. Instead of simply having to choose between one or the other, I’d like the ability to intelligently blend the two data sets together in order to aggregate all of the information about a particular company. An example might be that data set A might have the Company Name, phone number and website for Company X, while data set B might have the website and names plus email addresses for various contacts. Being able to import these two records and blend the different data points into a single record (and maybe break out the people into linked records on a different table) in one operation would be a really great feature.
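The blending case could look roughly like this in Python. The function `blend`, the field names, and the Company X values are invented for illustration; keeping data set A's value on a conflict is an arbitrary assumption, and conflicts are returned so they can be reviewed.

```python
def blend(record_a, record_b):
    """Combine two records about the same company, keeping every known value."""
    blended, conflicts = {}, {}
    for field in record_a.keys() | record_b.keys():
        a, b = record_a.get(field), record_b.get(field)
        if a in (None, "") or a == b:
            blended[field] = b
        elif b in (None, ""):
            blended[field] = a
        else:
            # Both sides have different values: keep A and report it for review.
            blended[field] = a
            conflicts[field] = (a, b)
    return blended, conflicts

# Hypothetical sample data for the two sources.
set_a = {"company": "Company X", "phone": "555-0100", "website": "companyx.example"}
set_b = {"company": "Company X", "website": "companyx.example",
         "contacts": [{"name": "Jane Doe", "email": "jane@companyx.example"}]}

blended, conflicts = blend(set_a, set_b)
```

The result carries the phone number from A and the contacts from B. Breaking the `contacts` list out into linked records on another table would be a separate step.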


#6

I work with data that may be easily duplicated, but I don’t have the pressing need to merge that others do, as the data is likely to be identical most of the time. To me it’d be valuable to be able to identify duplicates based on criteria of my choice and have the ability to merge as needed. This could be achieved with a side-by-side comparison of potential matching records where I have the option to ignore, delete one or the other, or merge.


#7

I’d really like this. I hope to just start managing everything in one place but in the meantime, and especially for importing, it would be nice to have some way for duplicates to be flagged.

I’m doing a CRM. We have a lot of places where information is stored (a master contact list with email addresses, a different list with organization names, another list from an event showing who registered; and all of these have overlap in the people involved).

It would be really helpful to have a way to identify which contacts might be duplicates, at a few levels – for example, if I add an organization called The Happy Station and also one called Happy Station; or if I add Shannon Joseph this week because she gets our newsletter and I add her a month from now because she attended an event. It’s easier for me to just paste the whole list of 100 event attendees than look up each one and add the event to a cell in their row.
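For near-matches like The Happy Station versus Happy Station, a fuzzy comparison is one option. This is a minimal Python sketch using the standard library's `difflib`; the helper names, the rule of dropping a leading "the", and the 0.85 threshold are all assumptions to tune, not anything Airtable provides.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, collapse whitespace, and drop a leading 'the'."""
    words = name.lower().split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)

def likely_same(a, b, threshold=0.85):
    """True when two names are similar enough to flag as possible duplicates."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

same_org = likely_same("The Happy Station", "Happy Station")
same_person = likely_same("Shannon Joseph", "shannon  joseph")
different = likely_same("Happy Station", "Sad Depot")
```

The first two comparisons flag a possible duplicate and the third does not. A match here should only be a prompt to look at the two rows, since similar names can still be different people or organizations.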

Maybe I could press a “Check for duplicates” button or a prompt could show up upon opening the spreadsheet, like, “Looks like you might have some duplicates. Want to glance through them?” with a checkbox saying “Don’t show this again.”

I think it could be helpful for the system to flag them so I can see them, help me filter them to see only duplicates, or maybe show a separate visual: like, show one set of possible duplicates at a time and give me the option to merge the two or choose one (default to the most recent); or, show them all in rows with one pane for each version and, again, those options to merge or choose one. It would be great if the platform could figure out how to show me a merge version that is likely to be what I want. (Pipe dream? :))

In Excel, I usually do a conditional formatting filter of duplicate text, apply a color on those cells, then manually reconcile.


#8

I just came across this thread because I need a way to find duplicate URLs throughout rows of a specific column.

The data I am storing is a learning resource URL for my University and many times, the course instructor will post the same lesson link in the syllabus and I would like a way to identify these and mark them as a duplicate.

Functionally, I’d like to be able to write a formula using something like IF(NOTUNIQUE(columnStart, columnStop), TRUE, FALSE).
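The hypothetical NOTUNIQUE above can be imitated on an exported column with a few lines of Python. The function name `not_unique_flags` and the sample URLs are made up; treating a trailing slash as insignificant is an assumption about how the links should be compared.

```python
from collections import Counter

def not_unique_flags(values):
    """Return one boolean per value: True when the value occurs more than once."""
    def clean(value):
        # Ignore surrounding whitespace and a trailing slash when comparing.
        return value.strip().rstrip("/")

    counts = Counter(clean(v) for v in values)
    return [counts[clean(v)] > 1 for v in values]

# Hypothetical sample data for the URL column.
urls = [
    "https://example.edu/lesson-1",
    "https://example.edu/lesson-2",
    "https://example.edu/lesson-1/",
]
flags = not_unique_flags(urls)
```

The flags line up with the rows, so they could be pasted back into a checkbox-style "duplicate" column.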


#9

Another vote for duplicate detection!

A merge wizard would be amazing too. Something similar to Gmail’s implementation, ideally.


#10

Yes, this feature would be a big step in the right direction!


#12

I need this as well. I am trying to house a large number of events and contacts but duplicate detection is needed.


#13

I would vote for this feature… it is hugely helpful in CRM


#14

I’m running into this problem right now. I want to import data but I know there are duplicates.


#15

+1 on duplicate detection!


#16

It is hard to believe this is missing, as many tables are formed by bringing in dirty data and then need dupes removed or managed. It gives me second thoughts, as migrating into a system like this requires these types of tools.


#17

+1, duplicate detection is one of the few things I miss in Airtable


#18

I would also need this feature since I am dealing with large sets of data. I am using Airtable to organize a database of business people, so obviously having duplicated records for e-mail addresses and company names is quite an issue for me.


#19

I also think that row merge would be a great improvement.


#20

Are any of you using duplicate-prevention processes or procedures you’d recommend beyond “Hey, check the DB before you add a new record”? I am setting up a system and have yet to release it to the team. When I do, I would like to specify procedural guidelines for quality data curation. Thank you.


#21

Duplicate checking is important when collaborating. As far as I can tell, if I share a form, invitees can’t see if a name has already been added to a base, and they are able to re-add a name without warning. They are submitting names of people that could be donation prospects, and some of our best prospects will be recommended multiple times.

It appears I would really need to share the base in grid view rather than a form.