Better merging via Dedupe


#1

Our entire workflow hinges on quick and easy mass-merging of records based on simple rules:

  • When the Email keys of two records match create a single record out of both:
    • Ignore empty fields
    • When available, select the more recent entry to a field

This is because we plan on sending follow-up surveys to people in the database, and updating their records w/ updated answers or new columns.

Also, at the scale we’re planning, it’s simply not workable for us to go record-by-record and manually figure out which fields are newer. So, ideally this could be done for the entire database with a click of the button—or even automatically as new rows are added to the database.

Thanks!


#2

You can accomplish this relatively easily with Airtable today. Probably the best way to describe how this works is to try it using the following demo base with the step-by-step instructions that follow. (Note: Also see the update, which includes a second demo base, for an alternative approach.)

First, here’s a read-only link to the base. Duplicate the base in your own workspace and the follow the directions. When you’re done, you will have created a 1000-record base and then merged a second 1090 records, 201 of which have duplicate keys. The resulting 1799-record base will contain the most up-to-date values for the merged records.

A quick description of the base

There are three tables:

  • [Import]. This is the initial 1000-record import and the main entry point for newly importaed data.
  • [Master]. This is the table containing canonical data. For instance, this is the table you would use to print mailing address for surveys, and so forth.
  • [Second import]. This contains the data records to be merged with the initial import. I created it as a second table in order to create records with duplicate keys and to make it easier to demonstrate. In practice, this might be replaced with a feed from the CSV Import Block or similar.

How it works

The keys from newly imported records in [Import] are linked to matching records in [Master]. If no matching record is found, a new [Master] record is created.

If a [Master] record is linked to more than one [Import] record, the date of the most recent [Import] record is logged. Each [Import] record, in turn, checks to see if it is the latest record; if so, each of its data fields are reproduced in a series of fields named {Latest[whatever]}. Finally, each [Master] record rolls up all of the associated {Latest[whatever]} fields to create a master list of most-recent data.

To demonstrate the base

  1. In [Import], click on the header for {EmailAddress}. This marks all cells in the column.
  2. Press Ctrl-C to copy all cells in the column.
  3. Click on the header for {Link2Master}. This selects all cells in that column.
  4. Press Ctrl-V to paste the values copied from {EmailAddress} into {Link2Master}. This will create 1000 records in [Master] essentially duplicating the records in [Import].
  5. Right-click the header for {Link2Master} and hide the field.
  6. In [Master], right-click the header for {Original Value} and select ‘Customize field value.’ Change the field type from formula to single-line text. Note: This step is merely to help identify updated records; it is not required in a production base.
  7. In [Second Import], select {EmailAddress}` in Row 1 of the table.
  8. Scroll to Row 1000 and, while holding down the Shift key, select {ZipCode}. This selects all 5000 cells in the table.
  9. Press Ctrl-C to copy the cells.
  10. In '[Import]`, scroll to the bottom of the table.
  11. Click ‘+’ to create a blank Row 1001.
  12. Select {EmailAddress} in Row 1001.
  13. Press Crtl-V to paste all 5000 copied cells (1000 rows) into [Import]. Airtable
    asks if you want to add 999 records; say ‘yes’. Airtable whirs a bit and then creates 999 records in [Import].
  14. Select {EmailAddress} for Row 2000.
  15. Scroll back to Row 1001 and, **while holding down the Shift key, select {EmailAddress} for Row 1001. This causes the `{EmailAddress} cells for Rows 1001 through 2000 to be selected.
  16. Press Ctrl-C to copy the selected cells.
  17. Click the Hide Fields control and un-hide {Link2Master}.
  18. Select {Link2Master} in Row 1001.
  19. Press Ctrl-V to paste the copied values into {Link2Master}. This will create and populate 799 new [Master] records and update an additional 201.

You can filter on (or just browse for) {Master::Updated?}} = 1 to find those records that were updated. (Again, neither {Original Value}, {Updated Values}, nor {Updated?} are necessary in a working base; they are included here only as part of the demonstration.


Update: It just dawned on me the following is a potentially simpler variation — although it requires more manual intervention following each import — and it may more closely meet your needs. It uses a demo base similar to the first, which can be found here; as always, duplicate it into your workspace:


The differences in table structure should be apparent as you follow carry out these steps:

  1. In [Import], click on the header for {EmailAddress}. This marks all cells in the column.
  2. Press Ctrl-C to copy all cells in the column.
  3. Click on the header for {Link2Master}. This selects all cells in that column.
  4. Press Ctrl-V to paste the values copied from {EmailAddress} into {Link2Master}. This will create 1000 records in [Master] essentially duplicating the records in [Import].
  5. In [Master], select {UpdatedAddress} in Row 1.
  6. Scroll tothe bottom of the base and, while holding down the Shift key, select {UpdatedZip} in the bottom-most row of the base. This selects all of the cells for all imported records.
  7. Press Ctrl-C to copy all of the selected cells.
  8. Scroll back to the top of the base and select {CurrentAddress} in Row 1.
  9. Press Ctrl-V to paste all of the copied cells.
  10. Return to [Import] and select the check-box to the far left of the field header row. This causes all records in the table to be selected.
  11. Right-click on any record and select ‘Delete all selected records.’ Confirm this is what you wish to do. Airtable deletes all records from the table.
  12. Right-click on the {Link2Master} header and select ‘Hide field’.
  13. In [Second Import], select {EmailAddress}` in Row 1 of the table.
  14. Scroll to Row 1000 and, while holding down the Shift key, select {ZipCode}. This selects all 5000 cells in the table.
  15. Press Ctrl-C to copy the cells.
  16. In ‘[Import], click '+`’ to create a new Row 1.
  17. Select {EmailAddress} in Row 1.
  18. Press Crtl-V to paste all 5000 copied cells (1000 rows) into [Import]. Airtable
    asks if you want to add 999 records; say ‘yes’. Airtable whirs a bit and then creates 999 records in [Import].
  19. Click the 1 hidden field control and un-hide {Link2Master}.
  20. Click the header for {EmailAddress}. This marks all cells in the column.
  21. Press Ctrl-C to copy all cells in the column.
  22. Click the header for {Link2Master}. This selects all cells in that column.
  23. Press Ctrl-V to paste the values copied from {EmailAddress} into {Link2Master}. This will create and populate 799 new [Master] records and update an additional 201.

You can filter on (or just browse for) {Master::Updated?}} = 1 to find those records that were updated.

Note that before each import, you need to copy and paste the updated field values into current field values (Steps 5 – 9) in [Master] and to mark and delete all records from [Import] (Steps 10 and 11).

This variant provides the additional functionality of consolidating non-blank fields, which the original base did not support. (For instance, if {Field 1} is blank, filled, and blank in Imports 1, 2, and 3, respectively, in the original base it would be blank after Import 3, while in the variant it would be filled with the value originally taken from Import 2.)

Note: There is an even more streamlined modification suitable for either base I could not implement without obscuring part of the demonstration. This would be to define [Import]'s primary field as a formula configured to equal {Link2Master}. This allows all imports (or copy-and-paste) to be performed with the imported email address mapped or pasted directly to {Link2Master}. This allows the linked records to be created immediately upon import and eliminates the need to hide/un-hide {Link2Master} and to copy and paste {EmailAddress} into {Link2Master}.


Almost forgot a shout-out to FakeNameGenerator, the source — as usual — for these dummy records.