Parsing Text without Split()

Bill_French · ‎Dec 24, 2019

Recently, a member contacted me directly asking for help …

I am have been searching desperately to find a clean way to break apart an address like this…
123 anywhere st, San Diego, CA, 90001

Unfortunately, there is no “clean” way to do this without a Split() method as described in detail here.

And rather than jumping to the end by showing a unified working formula (that even I would be unable to read), I approached this question by providing a framework for tackling almost any parsing challenges.

Without Split() we are forced to fabricate rather complex formulas and like many of you, my brain is incapable of juggling a lot of string pointers to create that perfect [single] formula. And therein lay the issue - we often assume we must build a single formula from the beginning. This is neither necessary nor likely to be helpful. Sure, you want to use the fewest number of fields and create concise formulas for simplicity. However, there’s plenty of time to polish the code once you have something that works reliably.

So here’s how I tend to approach these tasks.

I reduce every step of the parsing process into the smallest unit of functionality. For example, in this challenge, we need to split the four entities of the address string into separate columns. But to do that, we have to sustain an always shrinking string so we can easily peel off the leading segment with each step toward parsing completion.

Step 1

Tease out the first entity (the street address).

Step 2

Eliminate the first entity from the full address string.

Step 3

Repeat steps one and two until all four entities are separated.

As you work through the steps, a common parsing pattern will emerge. Each of the patterns become[somewhat] reusable. It always helps to do this with real data and separate fields exposing each step and the remaining text to be parsed one entity at a time.

While I laid out the stages separately, you can combine these formulas pretty easily to eliminate the interim fields Remainder1 and Remainder2. Just replace each of the remainder field references with their respecting field formulas. Working in small chunks provides a manageable framework that also serves as a good learning exercise. But most important - when tackling a complex parsing project, I recommend you eat the elephant in small stages before trying to compile a unified parsing approach.

Street Address
Take the left-most characters from the beginning of the address string until the first character before the first comma.

LEFT(Address, FIND(",", Address) -1)

Remainder1
Take the full address string and remove the left-most characters from the beginning of the address string until the first comma. This serves as a waypoint to capture what’s remaining after we parse off the first entity in the string.

RIGHT(Address, (LEN(Address) - FIND(",", Address)))

City
Using the value of Remainder1 we are now in a position to parse just like we did for the street address.

LEFT(Remainder1, FIND(",", Remainder1) -1)

Remainder2
Once again, we reduce Remainder1 to exclude the City, thus setting the stage for two final splits to reveal the state, and the zip entities.

RIGHT(Remainder1, (LEN(Remainder1) - FIND(",", Remainder1)))

State

LEFT(Remainder2, FIND(",", Remainder2) -1)

Zip

RIGHT(Remainder2, (LEN(Remainder2) - FIND(",", Remainder2)))

Unfortunately, Airtable doesn’t support a Split() method for strings. If Airtable would provide a Split() method, this challenge would be delightfully fun. :winking_face:

address = split(Address, ", ")[0]
city    = split(Address, ", ")[1]
state   = split(Address, ", ")[2]
zip     = split(Address, ", ")[3]

Merry Christmas!

Christiaan_Huiz · ‎Dec 26, 2019

I like the solution a lot and I wonder what to do in case we have no “,” but only white spaces. The addresses we work with are like this :

Fortstraat 14, 3249 AD Herkingen, The Netherlands.

As you can see, we have no ‘,’ between the PC ( 3249 AD ) and the village (Herkingen)

Thanks for sharing your insights.

Bill_French · ‎Dec 26, 2019

This may be impossible to parse reliably, but let’s investigate. Reviewing the postal code system in The Netherlands, I assume 3249 AD Herkingen is the postal code and the city and it is a complete entity.

If you want to parse the city out of the post code fragment, you must build the logic required to detect the postcode pattern - i.e.,

The first two digits of the postcode indicate a city and a region. The following two digits and letters indicate a range of house numbers usually on the same street.

Overcoming the added complexity in separating the city name from the postcode will require another step that performs a search for every city in your country inside the postcode. Then - and only then - will you be able to extract the postcode and the city separately.

This also demonstrates why Airtable needs to open up the Blocks platform so that we can call out to AI systems that can read addresses like this and perform entity extractions based on well-trained AI models.

Assuming the string is in a field named “Address” and these are the desired entities…

Fortstraat 14
3249 AD Herkingen
The Netherlands

Street Address

LEFT(Address, FIND(", ", Address) -1)

Remainder1

RIGHT(Address, (LEN(Address) - FIND(",", Address)))

Postal Code & City

LEFT(Remainder1, FIND(", ", Remainder1) -1)

Country

RIGHT(Remainder1, (LEN(Remainder1) - FIND(",", Remainder1)))

Christiaan_Huiz · ‎Dec 26, 2019

Yep, this would solve a lot of issues. I work with addresses in The Netherlands, Belgium and France and for each country we have to create different rules…

Christiaan_Huiz · ‎Dec 26, 2019

thx a lot Bill, as I understand your writing is that using " " (so white space) is not possible, we can only split if we have “,” or “;” or something else, but not white space (like we can have in xlsx). I am I right or does my idea need an adjustment?

Bill_French · ‎Dec 26, 2019

And therein lay the benefits of AI - the models have all been trained; we simply need a way to apply them to data and Airtable doesn’t make this possible internally. We can - however - use the API to do this. But that approach feels like turning off a light switch with a wrecking ball.

Not exactly correct. We can find spaces in strings, so all you’d need to do in this case is perform one additional parse on postcode + city looking for the second space. I assume all post codes are of the format “NNNN AA” (i.e., 2500 DL) and thus, all text after this pattern is by definition the city, right?

The absolute necessity of AI-based entity extraction becomes evident when spaces are used as delimiters adjacent to city names wich may include multiple spaces or no spaces.

Christiaan_Huiz · ‎Dec 26, 2019

Yes in The Nethelands it is NNNN AA (i.e. 2500 DL) so indeed all text after this pattern is the city

and then of cours you have city names that contain spaces like “Den Haag” (The Hague)

In Belgium it is different, only NNNN (i.e. 9000) . When a French speaking people writes an address it goes like : housenumber, street, pc, city, thus something like 34, Rue du Président, 1050 Ixelles, while a Flemish speaking person would write for the same address the Flemish version: Voorzittersstraat 34, 1050 Elsene.

Yep it is confusing.

Thanks a lot for helping me out on the white space logic.

JonathanBowen · ‎Feb 12, 2020

@Bill.French, trying out the new scripting block on this:

let table1 = base.getTable("Table1");

let result = await table1.selectRecordsAsync();

for (let record of result.records) {
    let address = record.getCellValue('Address');
    let split_address = address.split(',');
    let address1 = split_address[0];
    let city = split_address[1];
    let state = split_address[2];
    let zip  = split_address[3];

    await table1.updateRecordAsync(record.id, {
        "Address 1": address1,
        "City": city,
        "State": state,
        "Zip": zip
    })    
}

JB

Bill_French · ‎Feb 12, 2020

Unfortunately, I didn’t make the cut for beta. I requested access three times.

BobBannanas · ‎Feb 12, 2020

Someone get this man beta