Deduplication 3.0 -- Now with video!


#1

I know I fell off the board lately, but it was for [I hope] a worthy cause.

Data Deduplication v3.0

The third iteration of my deduplication routines for Airtable, totally rewritten, with additional features and functionality.

  • Standard Airtable functionality
    • Works on Free and Plus subscriptions
    • Optional use of Pro features (record color, Blocks)
  • Does not require 3rd-party SaaS integration
  • Includes 3 “modules”
    • Duplicate identification
    • Suppression of false positives
    • Merging of true duplicates
  • Supports persistent tagging of false positives
    • No need repeatedly to vet the same flagged dupes
    • Suppressed dupes still evaluated against new records
    • Color coding indicates past, new matches
  • Supports real-time deduplication
    • Each new record checked at time of data entry

This version of the routines has been tested against bases as large as 25,000 records while proving to maintain an acceptable response time for duplicate detection and false positive functionality.¹ Three demonstration bases are provided, identical except for the number of records, to allow prospective users to judge how well the solution might work in their environments: 14 records, 1,000 records, and 10,000 records.

I’ve prepared a surprisingly long (to me, at least, but probably not to anyone reading this) document describing how the routines work, the steps one needs to take to add such functionality to an existing base, possible problems and ‘gotchas,’ the philosophy of crafting the optimal match field, and many other areas of, at best, tangential interest.

However, please note it also includes a Quickstart section.

The document can be found at paladesigns.com/airtable/dedupe.pdf.

There should also be an introductory video available later today or tomorrow.


  1. A framework within which to perform duplicate merger is provided for completeness’ sake. As each individual merger results in the recalculation of a dozen or more fields for each record in the base, processing delay becomes noticeable after only a few thousand records. However, as it takes essentially the same amount of time to merge one record as it would to merge all records, the system allows one to indicate desired mergers, which are then executed en masse.

#2

That’s pretty impressive work there. Nice job. Will definitely refer to when need. thx


#3

Alas, there was a mistake in the initial version of the 14-Record Deduplication Demo base I shared. (It has since been corrected.)

In the [DeDupe] table, the correct formula for {OKRecsMatch} should be simply

{DupeOKRecs}&{OKRecsHold}

Sorry for any trouble this may have caused.


#4

Thank you for creating this, I’m not a techie but would like to utilize this on one of my tables. You mention a video, I’m a visual learner could you direct me to the URL of the video so I can watch?

Grateful for your brilliance!


#5

With luck, I hope to post it (them?) a little later today — if I can stand to listen to myself go through the narration another couple-dozen times. I’ll update this thread when it’s ready.

Edit: OK, later today [Saturday], is what I really meant. :wink:


#6

First, thank you for this awesome tool and for this clear forum.

Second, I have two questions:

  1. Any progress on the explainer video yet?

  2. Do you think you will build an auto-deduper? e.g. a program that always overwrites selected older fields with selected newer fields, rather than having to do this manually for each record?


#7

Well, in the end I had to replace my sound card, break out the soldering iron to fix an XLR cable, and record only between 2 and 5 a.m., to prevent the dulcet tones of South Van Ness from overwhelming the narrative, but it’s finally finished — and I hope I never do anything like this ever again. (Actually, I know in a few days the irregular audio will bother me enough I’ll be compelled to fix it, but for the moment I’m fantasizing I’m free.)

And while I’m at it, whose bright idea was it to call this ‘data deduplication’ instead of, say, ‘Bob?’ Do you have any idea how difficult it is to say ‘deduplication’ repeatedly? Even now, several of the takes are ones in which I didn’t mangle the term too badly. If I’d kept at it until I managed to nail that word every time, it would have been next Summer before the video appeared. (Here’s a sneak preview of version 4.0 of my deduplication routines: They’re called ‘Bob.’)

It’s a long video — around 22 minutes, as I recall — but it’s thorough: Each of the component dedupe ‘modules’ is discussed, with an animation detailing each routine’s process flow, followed by a narrated screen capture sequence illustrating how the functionality evinces itself within Airtable. I tried to put together a shorter introductory video consisting solely of the capture portions, but it weighed in at nearly 15 minutes. I’ll probably throw together a barebones, screen cap intro eventually, once the memory of this one had faded. In the meantime, feel free to fast forward.

Undoubtedly, self-hosting the video means someone out there won’t be able to watch it. Unfortunately, at the moment I’ve somehow managed to screw up my Google credentials to the point where I can’t verify I’m me and thus enable uploads of longer than 15 minutes. Once I figure out what I did wrong, I’ll relocate this to YouTube. (In the meantime, for anyone having trouble viewing it from my server, it exists on YouTube in a ‘Part 1’/‘Part 2’ version.)

I want to emphasize, though, this video is intended to accompany the users guide, rather than replace it. That document can be found here. It’s been updated recently, but only to include the video URLs found in this reply.

Full video: http://paladesigns.com/airtable/dedupe/

Alternative part 1: https://youtu.be/1IQBNKidE0k
Alternative part 2: https://youtu.be/6bHT3CLclAM


#8

See previous reply. :wink:

Probably not. For one thing, the Deduplication Block does an excellent job of merging records — and it can work around issues of type infelicities and the lack of a persistent calculated field type that makes writing such generic applications within Airtable virtually impossible. Actually, I guess that’s two things, and together they pretty much cover it.

My goal in creating version 3 was to provide functionality currently not available in the Block — namely, persistent support for false positives and support for ongoing deduplication. (The former has been a bugbear of mine from a day or two after I first fired up PC-File, up through a couple of weeks ago, when I found myself trying to sanitize a client’s database.) I didn’t see much reason to implement a half-assed version of something Airtable already did quite elegantly. :wink: