Nov 03, 2023 08:32 AM - edited Nov 03, 2023 08:33 AM
Are there help entries providing best practices for normalizing data? Typically data comes in somewhat non-normalized, records with duplicate (non-unique) or empty values. Take as an example, the following table, Grocery_Shopping:
Day | Qty | Item | Type |
Mon | 4 | apples | fruit |
Mon | 10 | snap beans | vegetable |
Tues | 13 | strawberries | |
Tues | 6 | potatoes | vegetable |
Tues | 5 | kiwi | fruit |
Thurs | 23 | bananas | fruit |
Thurs | 6 | bananas | fruit |
I can manually create a unique table of Types, then manually change field type in Grocery_Shopping.type so it's linked to Types.type. Similarly I could do the same, creating unique reference tables for Day and Item. What many users need to a set of rules, automations, links, etc which automatically update reference tables when additional entries into Grocery_Shopping are imported. If there is such a set of best practices, please point in their direction. Thanks.
Nov 03, 2023 09:09 AM
Hey @DanBliss Can you clarify a bit more what you mean by normalizing your data?
The first thought that comes to mind is a using a 'Single Select' Field, which limits the options to pre-defined values (example in screenshot). But I need more context to really understand what you're trying to do. Thanks!
Nov 03, 2023 10:16 AM - edited Nov 03, 2023 10:20 AM
Hi Arthur,
Thanks for your response. In a small table with not much data, manually designating characteristics of the data is possible, as you suggest, but as the data set gets quite large (imagine expanding columns to include the Buyer; Name_of_Retailer; Store_Location; Recipient (Grandma, Auntie, etc). Now let's add several years worth of shopping - now you might have thousands of rows, the dataset becomes too large to update manually.
Now let's imagine that we wish to clean up the Recipient column data, so that entries like "Aunt Sally" and "Susan Smith", who are the same person, are shown consistently as "Aunt Sally". Typically that might involve a series of steps: (1) Create a working copy of Grocery_Shopping, (2) in the working copy remove all columns except Buyer, (3) remove duplicate Buyer names, (4) remove any empty values. This creates a table of unique buyer names, which we can call Buyer_Unique. Next create a crosswalk table of 2 columns, the first being a list of all unique Buyer values from Grocery_Shopping, the second column matching Buyer_Unique. The task of building the crosswalk is mostly human intervention (although with the increasing prevalence of AI, I expect this problem will go away). Once the crosswalk table is updated and complete, all matches made, a "view" of Grocery_Shopping could be constructed replacing Buyer with Buyer_Unique values: no more "Sally Smith" - instead they would all appear as "Aunt Sally". Such standardization of data involves a lot of normalizing of the data. I hope you get the idea. The question is... how best to automate this process so that as more data is imported, unique data sets are updated and links built to other table values, in a way that minimizes repetitive data editing?
Updated Grocery_Shopping:
Buyer Day Qty Item Type
Aunt Sally | Mon | 4 | apples | fruit |
Sally Smith | Mon | 10 | snap beans | vegetable |
George Costanza | Tues | 13 | strawberries | |
G. Costanza | Tue | 6 | potatoes | vegetable |
Martin Luther | Tues | 5 | kiwi | fruit |
Hareit Marshall | Thur | 23 | bananas | fruit |
H Marshal | Thurs | 6 | banana | fruit |
Hariett Marshall | Fri | 14 | chili pepers | spice |
List of preferred buyer names,
Table: Buyer_Unique
Preferred Name |
Aunt Sally |
George Costanza |
Martin Luther |
Hariett Marshall |
And here's a table, Buyer_Crosswalk matching Buyer to Buyer_Unique values. In the crosswalk table, the Buyer column is linked to Grocery_Shopping.Buyer and the Preferred Name column values are linked to the table Buyer_Unique.Preferred Name.
Buyer Preferred_Name
Aunt Sally | Aunt Sally |
Sally Smith | Aunt Sally |
George Costanza | George Costanza |
G. Costanza | George Costanza |
Martin Luther | Martin Luther |
Hareit Marshall | Hariett Marshall |
H Marshal | Hariett Marshall |
Hariett Marshall | Hariett Marshall |
We could avoid the crosswalk table by adding a column in Grocery_Shopping linking to Buyer_Unique. Whether this is advisable depends on the number of rows - with very large number of rows, it's more efficient to create the crosswalk table instead - and also better normalization.
Nov 03, 2023 08:21 PM
Hey @DanBliss wow thanks for the detailed follow up, that really helped to better understand your situation.
Unfortunately this one falls outside my experience level and I don't believe there's much here I can help you with.
Sounds like you have a large and complex dataset you're working with, might be worth hiring an airtable consultant or agency to help you set it up and manage it.
But I'll leave it open here for someone else to chime in with some suggestions...
Nov 05, 2023 12:16 PM
@DanBliss For your dataset do you have something you could use as a unique identifier such as an email?
This would take some time to set up either way, if you want to book a call to discuss further click this link here.
https://zcal.co/systemswithsteve/freeconsult
Steven Da Silva
Systems With Steve