Best Practices for linking tables together - Unique ID numbers vs

Bill_Sewell · ‎Jun 06, 2019

Greetings all, I am building a non-profit membership tracking system that pulls in new member requests from our Mailchimp list and invoices from our Xero system. I am used to the standard unique key, foreign key database structure and am struggle a bit to understand Airtable’s relationship structure. As of now, our mailchimp list and xero invoices link OK on somene’s primary email address, which is my unique identifier for the Contact table. I realize that if they change / udpate their email address (which happens), that connector will break. It took me a long time to realize that the first column of each is automatically the “unique” ID, even if it allows for multiple identical string entries. So, originally I had full name as the first column, but John Smith and John Smith can exist and are not the same person… which screws up the linked tables. “Name” + “Company” (or any secondary field) doesn’t work well either, as often times people don’t fill out too much in the forms. Phone number isn’t great, and email kinda sucks as well (personal vs business).

What’s a newbie to do?

NelsonBonner · ‎Jun 06, 2019

What about a concatenation of name + created time?

Justin_Barrett · ‎Jun 06, 2019

…or name + autonumber. Or even just autonumber, though having the name in there makes it more useful if you see those records linked anywhere else in your base.

Bill_Sewell · ‎Jun 06, 2019

My understanding is that the leftmost column of Table A is the Unique ID that needs to match with an exact copy in the “foreign” key of Table B. I have 1 table for Contacts and 1 for Invoices. Both tables are updated from different sources, but I need them to connect together. I am using email address as the shared unique / foreighn key.

Justin_Barrett · ‎Jun 06, 2019

I think you may be making an incorrect assumption about how linked records work in Airtable. In a sense, it’s using the unique key/foreign key concept that you’re talking about, but not with the literal content of the primary field in the linked record. What actually ties them together is an internal record ID that Airtable assigns (e.g. recKOooLFtPySUxpW - that’s an actual Airtable record ID). The text you see in the primary field is kind of a pseudonym, and that pseudonym can change as much as you want without breaking the link because the link is based on that internal record ID, not the pseudonym.

You expressed a concern in your initial message that a member’s record in the [Contacts] table would lose its link to their relevant records in the [Invoices] table if their email address were to change, but that’s not the case. If you’re using an email address as the primary field value, and you change that address, you’ve only changed the pseudonym. The link between the tables’ records remains unaffected. So if you prefer to use their email address in that primary field, that shouldn’t be a problem…unless someone doesn’t have an email address, of course.

W_Vann_Hall · ‎Jun 07, 2019

Actually, he does have a point: He runs the risk of losing historical data over time. Imagine this scenario:

‘John Smith’ with email address ‘iamjohnsmith@example.com’ purchases a membership. @Bill_Sewell imports records from Mailchimp and Xero and matches them up.
The following year, ‘John Smith’ buys another membership — but having forgotten his original email password, he’s now ‘iamjohnnysmith@example.com’.

Now, unless Bill has some way of knowing the two ‘John Smiths’ are actually a single John Smith, he’ll carry two membership records for one person. You’re right in the connections between existing [Contacts] and [Invoices] won’t break, but he’ll now have two [Contacts] records representing the same person.

This isn’t an Airtable-only issue, of course: It’s a reality issue. Without some degree of omniscience, Bill will have this problem, with any — or with no — data management system.

There are two possible solutions, depending on just how much of an issue this is:

Ignore it. Assuming the purchases are something relatively monolithic, such as an annual membership, and both the cost of maintaining duplicate entries and the value of maintaining consistent records are low, who cares? Use Mailchimp’s annual opt-in feature to weed out inactive subscribers, and resign oneself to dragging around some extraneous member data.
If Option 1 isn’t, um, an option, the only way around is through manual deduplication of the data. Routines within Airtable can help with this, making it much more manageable, but it’s nearly impossible to automate completely.¹

Unfortunately, in terms of effort, this latter task falls somewhere on the ‘medium-large’ to ‘gargantuan’ scale, depending on how many variables one needs to examine to determine two records are or are not duplicates (as well as how OC one — or one’s management — is).

I have a client whose business lies in staging large conferences at a number of locations worldwide for people working in a certain industry. The industry is one in which people frequently change employers or job titles, and it’s one that has seen extensive M&A activity and rebranding. My client’s source data were drawn from multiple years’ worth of attendee records, with most attendees participating in one or two conferences a year. Oh, and to make things even more fun, many of the larger employers in the industry are known by several variations of their name, the way “The Great Atlantic & Pacific Tea Company” was also known as “A&P”.²

Now, in their case it was important to establish consistency across years and conferences — and, to an extent, it was this experience that eventually led to version 3 of my data deduplication routines. Essentially, I set up multiple {MatchKeys} on the data, based on such combinations as {FirstName}&{LastName}, {LastName}&{Employer}, {EmailSansDomain}, and so on; I also performed extensive normalization on the {MatchKeys}: substituting unaccented vowels for accented vowels; replacing common diphthongs and non-ASCII characters with ASCII representations; eliminating such low-significance words and abbreviations as ‘Company’, ‘Inc.’, ‘Gmbh’, ‘SDn BHD’, ‘D.O.O.’; and so on. Finally, I used these various `{MatchKeys} to find possible duplicate records. Those that were false positives, I flagged as such, while actual duplicates I merged. (In the client’s case, all the deduplication took place one rung above the base root — during the initial import, I created a unique ID for each contact, and simply changed the master key for duplicate records accordingly. Today, I’d probably just use a self-join to merge dupes, but when I built the client base, Airtable did not support lookup or rollup fields on self-joins.)

Does Bill need to go to such trouble? Probably not — but it depends on his use case. If there is a significant value, to either the organization or the member, to being able to track a member’s purchases over time, I’d say he has no choice.

For instance, if his non-profit is, say, a medical buyer’s club to which the authorities turn a blind eye as long as each member purchases only enough gray-market meds for his or her own use, then I’d say, yeah, it’s pretty darn essential he maintains data consistency across purchases. Similarly, if a member accrues discounts or benefits based on cumulative purchases, then it’s worth making the extra effort in order to keep the member happy. Or if there is a significant per-member cost incurred — say, the cost to print and mail an annual report — then it could also be in the organization’s best interests to devote more resources to keeping the contacts list squeaky clean.

On the other hand, if data purity is only a ‘nice-to-have,’ it’s probably not worth the effort to do a deep cleanse. If the necessity falls somewhere between ‘ignore’ and ‘essential’, Bill can try deduping based on one or two {MatchKeys}.

The Deduplication Block is probably of limited utility, here: It’s ideal for one-time (such as after initial importation) or infrequent deduplication, or for more-frequent deduplication of data with relatively few false positives, since it works on existing tables with no need for additional coding or prep work. Since Bill’s use case wants there to be a lot of false positives — he wants to find as many possible dupes as possible, so he can identify the relative few that actually do represent the same person — that would mean he’d have to ignore manually every false positive in the table every time he runs the Dedupe Block. For ongoing or real-time deduplication, he’d probably be better off using my routines: While they require additional fields and a table be added to the base to deduplicate (as well as the dreaded ‘link-every-record-of-a-table-to-a-single-record-in-another-table’ step), they do support persistent marking of false positives, so after he flags all the known false positives in the base, they won’t appear on subsequent runs unless a newly added record triggers the match.

@Bill_Sewell, if you’re interested, the post linked above contains links to several demonstration bases (of 14, 1,000, and 10,000 records, respectively, so you can get an idea of what performance might be like for your data), a 32-page User’s Guide (with a Quickstart section, thank goodness), and a 20+ minute video that demonstrates how the code works behind the scenes and shows what the user sees when using it. Yes, it is daunting for a self-proclaimed newbie — so if you have any questions about whether your use case merits such firepower or how to implement it in your case, please don’t hesitate to ask. :winking_face:

Think of all those ‘people search’ sites out there, such as Spokeo or, well, PeopleSearch: Despite access to creepily deep pools of personal data and potentially vast amounts of targeted marketing-related development funding, you’ll still find duplicate, triplicate, quadruplicate entries for the same individual — as well as entries that conflate multiple individuals.

Even systems that hold themselves out as the gold standard in data management aren’t immune: I’m a Jr., and my father was forever grousing about how my somewhat peripatetic approach to personal finance was fouling up his credit score. On the other hand, I was delighted my credit record showed I’d had a Sears charge card in good standing since I was 3 years old…
OK, still not fun enough for you? Recall these are global conferences, with people registering by mail, email, and phone, so ‘Sonia’, ‘Sónia’, and ‘Soñia’ might all be the same person — to say nothing of ‘Sonya’ and ‘Sonja’.

M_k · ‎Jun 07, 2019

Hi

Just wanting two add my two cents worth into the conversation.

I don’t know if it is of interest, but Airtable has opened up their API to Integromat, to now allow Airtable users with the Integromat app, to delete duplicate records. Now I don’t know if there is a way to view the duplicates beforehand or have the duplicate placed into Data store (Integromats own database app), this I don’t know. Integromat has a Facebook page for users to post questions, though. As far as creating a scenario to delete duplicate records, the scenario setup is quick and easy to use. I am quite pleased with this feature.

Just thought I would throw that idea into the mix.

Thank you,
Mary K

Bill_Sewell · ‎Jun 07, 2019

Thank you - this is good to know… If I can train my members to use their personal email address to match between mailchimp and xero, I should be OK. I did notice that I needed to get data cleaned up perfectly before importing into Airtable, because I could not change values to match once I pasted into the system. I really dislike leading off with the email address as the unique field, because that’s what’s showing up as the first item in Airtable’s mobile app. I recognize the value of the dedupe block to merge records… am I correct to assume that you cannot manually merge records w/o dedupe?

Bill_Sewell · ‎Jun 07, 2019

Thank you so much for your deep dive explanations here. It took me quite a few attempts to discover how tables linked up properly by importing 2 tables and getting them to match up. The experience is more like VLOOKUP in Excel than Index/Match, which lets me control where the matching keys exist in each table.

I am going to work through your examples to improve my understanding even further. I am new to Airtable but not new to data modeling… I’m just struggling a little with Airtable’s “we protect you from the info” approach to key matching. Quickbase seems to work the same way, btw.

Selimafifi · ‎May 28, 2023

@W_Vann_Hall thank you for your answer.

. I guess my question here is pros and cons of email vs. Full name as Primary fields. At first i had emails, which is very useful as i tend to link my calendar instances using "attendees" field which then links with contact email address. But the cons was whenever i searched i want to search for a contact, i can't do so using their name. (in forms etc, linking manually in other tables etc). Now i switched the primary field to full name, but then i am unable to auto link my calendar instance using Attendees field. So my question here is: what is best practice? is there a way of keeping email, but being able to find contacts using their name rather when searching? alternatively, is there a way of linking record to table using a defined field than automatic? thanks