Sep 25, 2018 07:58 AM
I know I fell off the board lately, but it was for [I hope] a worthy cause.
The third iteration of my deduplication routines for Airtable, totally rewritten, with additional features and functionality.
This version of the routines has been tested against bases as large as 25,000 records while proving to maintain an acceptable response time for duplicate detection and false positive functionality.¹ Three demonstration bases are provided, identical except for the number of records, to allow prospective users to judge how well the solution might work in their environments: 14 records, 1,000 records, and 10,000 records.
I’ve prepared a surprisingly long (to me, at least, but probably not to anyone reading this) document describing how the routines work, the steps one needs to take to add such functionality to an existing base, possible problems and ‘gotchas,’ the philosophy of crafting the optimal match field, and many other areas of, at best, tangential interest.
However, please note it also includes a Quickstart section.
The document can be found at paladesigns.com/airtable/dedupe.pdf.
There should also be an introductory video available later today or tomorrow.
Sep 25, 2018 08:19 AM
That’s pretty impressive work there. Nice job. Will definitely refer to when need. thx
Sep 25, 2018 07:31 PM
Alas, there was a mistake in the initial version of the 14-Record Deduplication Demo base I shared. (It has since been corrected.)
In the [DeDupe]
table, the correct formula for {OKRecsMatch}
should be simply
{DupeOKRecs}&{OKRecsHold}
Sorry for any trouble this may have caused.
Oct 02, 2018 08:12 AM
Thank you for creating this, I’m not a techie but would like to utilize this on one of my tables. You mention a video, I’m a visual learner could you direct me to the URL of the video so I can watch?
Grateful for your brilliance!
Oct 03, 2018 01:24 PM
With luck, I hope to post it (them?) a little later today — if I can stand to listen to myself go through the narration another couple-dozen times. I’ll update this thread when it’s ready.
Edit: OK, later today [Saturday], is what I really meant. :winking_face:
Oct 12, 2018 03:33 AM
First, thank you for this awesome tool and for this clear forum.
Second, I have two questions:
Any progress on the explainer video yet?
Do you think you will build an auto-deduper? e.g. a program that always overwrites selected older fields with selected newer fields, rather than having to do this manually for each record?
Oct 22, 2018 02:40 PM
Well, in the end I had to replace my sound card, break out the soldering iron to fix an XLR cable, and record only between 2 and 5 a.m., to prevent the dulcet tones of South Van Ness from overwhelming the narrative, but it’s finally finished — and I hope I never do anything like this ever again. (Actually, I know in a few days the irregular audio will bother me enough I’ll be compelled to fix it, but for the moment I’m fantasizing I’m free.)
And while I’m at it, whose bright idea was it to call this ‘data deduplication’ instead of, say, ‘Bob?’ Do you have any idea how difficult it is to say ‘deduplication’ repeatedly? Even now, several of the takes are ones in which I didn’t mangle the term too badly. If I’d kept at it until I managed to nail that word every time, it would have been next Summer before the video appeared. (Here’s a sneak preview of version 4.0 of my deduplication routines: They’re called ‘Bob.’)
It’s a long video — around 22 minutes, as I recall — but it’s thorough: Each of the component dedupe ‘modules’ is discussed, with an animation detailing each routine’s process flow, followed by a narrated screen capture sequence illustrating how the functionality evinces itself within Airtable. I tried to put together a shorter introductory video consisting solely of the capture portions, but it weighed in at nearly 15 minutes. I’ll probably throw together a barebones, screen cap intro eventually, once the memory of this one had faded. In the meantime, feel free to fast forward.
Undoubtedly, self-hosting the video means someone out there won’t be able to watch it. Unfortunately, at the moment I’ve somehow managed to screw up my Google credentials to the point where I can’t verify I’m me and thus enable uploads of longer than 15 minutes. Once I figure out what I did wrong, I’ll relocate this to YouTube. (In the meantime, for anyone having trouble viewing it from my server, it exists on YouTube in a ‘Part 1’/‘Part 2’ version.)
I want to emphasize, though, this video is intended to accompany the users guide, rather than replace it. That document can be found here. It’s been updated recently, but only to include the video URLs found in this reply.
Full video: http://paladesigns.com/airtable/dedupe/
Alternative part 1: https://youtu.be/1IQBNKidE0k
Alternative part 2: https://youtu.be/6bHT3CLclAM
Oct 22, 2018 03:38 PM
See previous reply. :winking_face:
Probably not. For one thing, the Deduplication Block does an excellent job of merging records — and it can work around issues of type infelicities and the lack of a persistent calculated field type that makes writing such generic applications within Airtable virtually impossible. Actually, I guess that’s two things, and together they pretty much cover it.
My goal in creating version 3 was to provide functionality currently not available in the Block — namely, persistent support for false positives and support for ongoing deduplication. (The former has been a bugbear of mine from a day or two after I first fired up PC-File, up through a couple of weeks ago, when I found myself trying to sanitize a client’s database.) I didn’t see much reason to implement a half-assed version of something Airtable already did quite elegantly. :winking_face:
Mar 19, 2019 05:33 PM
@W_Vann_Hall we added to resource directory here: https://builtonair.com/resources - thanks for making cool stuff!
Jan 26, 2021 07:00 AM
Absolutely unreal. I am blown away by your work. You helped me so much. This is insanely good.