Best Practices for Normalizing Data?

Forum|Forum|2 years ago
November 3, 2023
4 replies
207 views

+4

DanBliss
New Participant

Are there help entries providing best practices for normalizing data? Typically data comes in somewhat non-normalized, records with duplicate (non-unique) or empty values. Take as an example, the following table, Grocery_Shopping:

Day	Qty	Item	Type
Mon	4	apples	fruit
Mon	10	snap beans	vegetable
Tues	13	strawberries
Tues	6	potatoes	vegetable
Tues	5	kiwi	fruit
Thurs	23	bananas	fruit
Thurs	6	bananas	fruit

I can manually create a unique table of Types, then manually change field type in Grocery_Shopping.type so it's linked to Types.type. Similarly I could do the same, creating unique reference tables for Day and Item. What many users need to a set of rules, automations, links, etc which automatically update reference tables when additional entries into Grocery_Shopping are imported. If there is such a set of best practices, please point in their direction. Thanks.

+19

Arthur_Tutt
Brainy
Forum|Forum|2 years ago
November 3, 2023

Hey @DanBliss Can you clarify a bit more what you mean by normalizing your data?

The first thought that comes to mind is a using a 'Single Select' Field, which limits the options to pre-defined values (example in screenshot). But I need more context to really understand what you're trying to do. Thanks!

Like

D

+4

DanBliss
Author
New Participant
Forum|Forum|2 years ago
November 3, 2023

Hey @DanBliss Can you clarify a bit more what you mean by normalizing your data?

The first thought that comes to mind is a using a 'Single Select' Field, which limits the options to pre-defined values (example in screenshot). But I need more context to really understand what you're trying to do. Thanks!

Hi Arthur,

Thanks for your response. In a small table with not much data, manually designating characteristics of the data is possible, as you suggest, but as the data set gets quite large (imagine expanding columns to include the Buyer; Name_of_Retailer; Store_Location; Recipient (Grandma, Auntie, etc). Now let's add several years worth of shopping - now you might have thousands of rows, the dataset becomes too large to update manually.

Now let's imagine that we wish to clean up the Recipient column data, so that entries like "Aunt Sally" and "Susan Smith", who are the same person, are shown consistently as "Aunt Sally". Typically that might involve a series of steps: (1) Create a working copy of Grocery_Shopping, (2) in the working copy remove all columns except Buyer, (3) remove duplicate Buyer names, (4) remove any empty values. This creates a table of unique buyer names, which we can call Buyer_Unique. Next create a crosswalk table of 2 columns, the first being a list of all unique Buyer values from Grocery_Shopping, the second column matching Buyer_Unique. The task of building the crosswalk is mostly human intervention (although with the increasing prevalence of AI, I expect this problem will go away). Once the crosswalk table is updated and complete, all matches made, a "view" of Grocery_Shopping could be constructed replacing Buyer with Buyer_Unique values: no more "Sally Smith" - instead they would all appear as "Aunt Sally". Such standardization of data involves a lot of normalizing of the data. I hope you get the idea. The question is... how best to automate this process so that as more data is imported, unique data sets are updated and links built to other table values, in a way that minimizes repetitive data editing?

Updated Grocery_Shopping:

Buyer Day Qty Item Type

Aunt Sally	Mon	4	apples	fruit
Sally Smith	Mon	10	snap beans	vegetable
George Costanza	Tues	13	strawberries
G. Costanza	Tue	6	potatoes	vegetable
Martin Luther	Tues	5	kiwi	fruit
Hareit Marshall	Thur	23	bananas	fruit
H Marshal	Thurs	6	banana	fruit
Hariett Marshall	Fri	14	chili pepers	spice

List of preferred buyer names,

Table: Buyer_Unique

Preferred Name

Aunt Sally

George Costanza

Martin Luther

Hariett Marshall

And here's a table, Buyer_Crosswalk matching Buyer to Buyer_Unique values. In the crosswalk table, the Buyer column is linked to Grocery_Shopping.Buyer and the Preferred Name column values are linked to the table Buyer_Unique.Preferred Name.

Buyer Preferred_Name

Aunt Sally	Aunt Sally
Sally Smith	Aunt Sally
George Costanza	George Costanza
G. Costanza	George Costanza
Martin Luther	Martin Luther
Hareit Marshall	Hariett Marshall
H Marshal	Hariett Marshall
Hariett Marshall	Hariett Marshall

We could avoid the crosswalk table by adding a column in Grocery_Shopping linking to Buyer_Unique. Whether this is advisable depends on the number of rows - with very large number of rows, it's more efficient to create the crosswalk table instead - and also better normalization.

Like

+19

Arthur_Tutt
Brainy
Forum|Forum|2 years ago
November 4, 2023

Hey @DanBliss wow thanks for the detailed follow up, that really helped to better understand your situation.

Unfortunately this one falls outside my experience level and I don't believe there's much here I can help you with.

Sounds like you have a large and complex dataset you're working with, might be worth hiring an airtable consultant or agency to help you set it up and manage it.

But I'll leave it open here for someone else to chime in with some suggestions...