Help

Best Practices for Normalizing Data?

Topic Labels: Data
679 4
cancel
Showing results for 
Search instead for 
Did you mean: 
DanBliss
5 - Automation Enthusiast
5 - Automation Enthusiast

Are there help entries providing best practices for normalizing data?  Typically data comes in somewhat non-normalized, records with duplicate (non-unique) or empty values.  Take as an example, the following table, Grocery_Shopping:

DayQtyItemType
Mon4applesfruit
Mon10snap beansvegetable
Tues13strawberries 
Tues6potatoesvegetable
Tues5kiwifruit
Thurs23bananasfruit
Thurs6bananasfruit

I can manually create a unique table of Types, then manually change field type in Grocery_Shopping.type so it's linked to Types.type.  Similarly I could do the same, creating unique reference tables for Day and Item.  What many users need to a set of rules, automations, links, etc which automatically update reference tables when additional entries into Grocery_Shopping are imported.  If there is such a set of best practices, please point in their direction.  Thanks.   

4 Replies 4
Arthur_Tutt
8 - Airtable Astronomer
8 - Airtable Astronomer

Hey @DanBliss Can you clarify a bit more what you mean by normalizing your data? 

The first thought that comes to mind is a using a 'Single Select' Field, which limits the options to pre-defined values (example in screenshot). But I need more context to really understand what you're trying to do. Thanks! 

Screenshot 2023-11-03 120639.png

DanBliss
5 - Automation Enthusiast
5 - Automation Enthusiast

Hi Arthur,

Thanks for your response.  In a small table with not much data, manually designating characteristics of the data is possible, as you suggest, but as the data set gets quite large (imagine expanding columns to include the Buyer; Name_of_Retailer; Store_Location; Recipient (Grandma, Auntie, etc).  Now let's add several years worth of shopping - now you might have thousands of rows, the dataset becomes too large to update manually. 

Now let's imagine that we wish to clean up the Recipient column data, so that entries like "Aunt Sally" and "Susan Smith", who are the same person, are shown consistently as "Aunt Sally".   Typically that might involve a series of steps:  (1) Create a working copy of Grocery_Shopping, (2) in the working copy remove all columns except Buyer, (3) remove duplicate Buyer names, (4) remove any empty values.  This creates a table of unique buyer names, which we can call Buyer_Unique.  Next create a crosswalk table of 2 columns, the first being a list of all unique Buyer values from Grocery_Shopping, the second column matching Buyer_Unique.  The task of building the crosswalk is mostly human intervention (although with the increasing prevalence of AI, I expect this problem will go away). Once the crosswalk table is updated and complete, all matches made, a "view" of Grocery_Shopping could be constructed replacing Buyer with Buyer_Unique values: no more "Sally Smith" - instead they would all appear as "Aunt Sally".  Such standardization of data involves a lot of normalizing of the data.  I hope you get the idea.  The question is... how best to automate this process so that as more data is imported, unique data sets are updated and links built to other table values, in a way that minimizes repetitive data editing?

Updated Grocery_Shopping:

Buyer                Day        Qty                   Item                Type

Aunt SallyMon4applesfruit
Sally SmithMon10snap beansvegetable
George CostanzaTues13strawberries 
G. CostanzaTue6potatoesvegetable
Martin LutherTues5kiwifruit
Hareit MarshallThur23bananasfruit
H MarshalThurs6bananafruit
Hariett MarshallFri14chili pepersspice

List of preferred buyer names,

Table: Buyer_Unique

Preferred Name
Aunt Sally
George Costanza
Martin Luther
Hariett Marshall

And here's a table, Buyer_Crosswalk matching Buyer to Buyer_Unique values.  In the crosswalk table, the Buyer column is linked to Grocery_Shopping.Buyer and the Preferred Name column values are linked to the table Buyer_Unique.Preferred Name.

Buyer                               Preferred_Name

Aunt SallyAunt Sally
Sally SmithAunt Sally
George CostanzaGeorge Costanza
G. CostanzaGeorge Costanza
Martin LutherMartin Luther
Hareit MarshallHariett Marshall
H MarshalHariett Marshall
Hariett MarshallHariett Marshall

We could avoid the crosswalk table by adding a column in Grocery_Shopping linking to Buyer_Unique.  Whether this is advisable depends on the number of rows - with very large number of rows, it's more efficient to create the crosswalk table instead - and also better normalization.  

Arthur_Tutt
8 - Airtable Astronomer
8 - Airtable Astronomer

Hey @DanBliss wow thanks for the detailed follow up, that really helped to better understand your situation. 

Unfortunately this one falls outside my experience level and I don't believe there's much here I can help you with. 

Sounds like you have a large and complex dataset you're working with, might be worth hiring an airtable consultant or agency to help you set it up and manage it. 

But I'll leave it open here for someone else to chime in with some suggestions...

stevendasilva
6 - Interface Innovator
6 - Interface Innovator

@DanBliss For your dataset do you have something you could use as a unique identifier such as an email? 

This would take some time to set up either way, if you want to book a call to discuss further click this link here. 

https://zcal.co/systemswithsteve/freeconsult

Steven Da Silva
Systems With Steve