Parsing Text without Split()

Recently, a member contacted me directly asking for help …

I am have been searching desperately to find a clean way to break apart an address like this…
123 anywhere st, San Diego, CA, 90001

Unfortunately, there is no “clean” way to do this without a Split() method as described in detail here.

And rather than jumping to the end by showing a unified working formula (that even I would be unable to read), I approached this question by providing a framework for tackling almost any parsing challenges.

Without Split() we are forced to fabricate rather complex formulas and like many of you, my brain is incapable of juggling a lot of string pointers to create that perfect [single] formula. And therein lay the issue - we often assume we must build a single formula from the beginning. This is neither necessary nor likely to be helpful. Sure, you want to use the fewest number of fields and create concise formulas for simplicity. However, there’s plenty of time to polish the code once you have something that works reliably.

So here’s how I tend to approach these tasks.

I reduce every step of the parsing process into the smallest unit of functionality. For example, in this challenge, we need to split the four entities of the address string into separate columns. But to do that, we have to sustain an always shrinking string so we can easily peel off the leading segment with each step toward parsing completion.

Step 1

Tease out the first entity (the street address).

Step 2

Eliminate the first entity from the full address string.

Step 3

Repeat steps one and two until all four entities are separated.

As you work through the steps, a common parsing pattern will emerge. Each of the patterns become[somewhat] reusable. It always helps to do this with real data and separate fields exposing each step and the remaining text to be parsed one entity at a time.


While I laid out the stages separately, you can combine these formulas pretty easily to eliminate the interim fields Remainder1 and Remainder2. Just replace each of the remainder field references with their respecting field formulas. Working in small chunks provides a manageable framework that also serves as a good learning exercise. But most important - when tackling a complex parsing project, I recommend you eat the elephant in small stages before trying to compile a unified parsing approach.

Street Address
Take the left-most characters from the beginning of the address string until the first character before the first comma.

LEFT(Address, FIND(",", Address) -1)

Remainder1
Take the full address string and remove the left-most characters from the beginning of the address string until the first comma. This serves as a waypoint to capture what’s remaining after we parse off the first entity in the string.

RIGHT(Address, (LEN(Address) - FIND(",", Address)))

City
Using the value of Remainder1 we are now in a position to parse just like we did for the street address.

LEFT(Remainder1, FIND(",", Remainder1) -1)

Remainder2
Once again, we reduce Remainder1 to exclude the City, thus setting the stage for two final splits to reveal the state, and the zip entities.

RIGHT(Remainder1, (LEN(Remainder1) - FIND(",", Remainder1)))

State

LEFT(Remainder2, FIND(",", Remainder2) -1)

Zip

RIGHT(Remainder2, (LEN(Remainder2) - FIND(",", Remainder2)))

Unfortunately, Airtable doesn’t support a Split() method for strings. If Airtable would provide a Split() method, this challenge would be delightfully fun. :wink:

address = split(Address, ", ")[0]
city    = split(Address, ", ")[1]
state   = split(Address, ", ")[2]
zip     = split(Address, ", ")[3]

Merry Christmas!

4 Likes

I like the solution a lot and I wonder what to do in case we have no “,” but only white spaces. The addresses we work with are like this :

Fortstraat 14, 3249 AD Herkingen, The Netherlands.

As you can see, we have no ‘,’ between the PC ( 3249 AD ) and the village (Herkingen)

Thanks for sharing your insights.

This may be impossible to parse reliably, but let’s investigate. Reviewing the postal code system in The Netherlands, I assume 3249 AD Herkingen is the postal code and the city and it is a complete entity.

If you want to parse the city out of the post code fragment, you must build the logic required to detect the postcode pattern - i.e.,

The first two digits of the postcode indicate a city and a region. The following two digits and letters indicate a range of house numbers usually on the same street.

Overcoming the added complexity in separating the city name from the postcode will require another step that performs a search for every city in your country inside the postcode. Then - and only then - will you be able to extract the postcode and the city separately.

This also demonstrates why Airtable needs to open up the Blocks platform so that we can call out to AI systems that can read addresses like this and perform entity extractions based on well-trained AI models.

Assuming the string is in a field named “Address” and these are the desired entities…

  • Fortstraat 14
  • 3249 AD Herkingen
  • The Netherlands

Street Address

LEFT(Address, FIND(", ", Address) -1)

Remainder1

RIGHT(Address, (LEN(Address) - FIND(",", Address)))

Postal Code & City

LEFT(Remainder1, FIND(", ", Remainder1) -1)

Country

RIGHT(Remainder1, (LEN(Remainder1) - FIND(",", Remainder1)))

Yep, this would solve a lot of issues. I work with addresses in The Netherlands, Belgium and France and for each country we have to create different rules…

thx a lot Bill, as I understand your writing is that using " " (so white space) is not possible, we can only split if we have “,” or “;” or something else, but not white space (like we can have in xlsx). I am I right or does my idea need an adjustment?

And therein lay the benefits of AI - the models have all been trained; we simply need a way to apply them to data and Airtable doesn’t make this possible internally. We can - however - use the API to do this. But that approach feels like turning off a light switch with a wrecking ball.

Not exactly correct. We can find spaces in strings, so all you’d need to do in this case is perform one additional parse on postcode + city looking for the second space. I assume all post codes are of the format “NNNN AA” (i.e., 2500 DL) and thus, all text after this pattern is by definition the city, right?

The absolute necessity of AI-based entity extraction becomes evident when spaces are used as delimiters adjacent to city names wich may include multiple spaces or no spaces.

Yes in The Nethelands it is NNNN AA (i.e. 2500 DL) so indeed all text after this pattern is the city

and then of cours you have city names that contain spaces like “Den Haag” (The Hague)

In Belgium it is different, only NNNN (i.e. 9000) . When a French speaking people writes an address it goes like : housenumber, street, pc, city, thus something like 34, Rue du Président, 1050 Ixelles, while a Flemish speaking person would write for the same address the Flemish version: Voorzittersstraat 34, 1050 Elsene.

Yep it is confusing.

Thanks a lot for helping me out on the white space logic.

@Bill.French, trying out the new scripting block on this:

let table1 = base.getTable("Table1");

let result = await table1.selectRecordsAsync();

for (let record of result.records) {
    let address = record.getCellValue('Address');
    let split_address = address.split(',');
    let address1 = split_address[0];
    let city = split_address[1];
    let state = split_address[2];
    let zip  = split_address[3];

    await table1.updateRecordAsync(record.id, {
        "Address 1": address1,
        "City": city,
        "State": state,
        "Zip": zip
    })    
} 

JB

2 Likes

Unfortunately, I didn’t make the cut for beta. I requested access three times.

Someone get this man beta

@Bobby_Estep,

You must have some pull - I have access now. Coincidence?

Thanks!

1 Like

Do you have any idea how we would apply {typecast: true} in the context of updateRecordAsync?

@Bill.French - Doesn’t look like that’s an option right now.

Really unfortunate. This is why I come a little unglued when I learn that a vendor has deliberately chosen to not use their own API to build new functionality. Instead of hardening the core API so that it can support many use cases, they decided to implement this as an offshoot afterthought in an entirely separate code base.

I had a use case that required just extracting the city names from addresses, but they were sometimes 1-line addresses and sometimes 2-line (e.g., with an Apt.# or Suite#). I modified @Bill.French’s technique to do it, and thought I’d share in case it helps anyone else.

Caveats:

  • There’s a built-in assumption that cities are all in the same state (NC) (they were for me), so it would have to be tweaked for multiple states.

  • It requires the original addresses to have commas in between Street Address, Street Address Line 2, City, and State.

Here’s what it looks like:

Here are the formulas:

Start with {Address_Full} …then…

{FX_Street_Address} =
LEFT({Address_Full}, FIND(", ", {Address_Full}) -1)

{FX_Remainder} =
TRIM(RIGHT({Address_Full}, (LEN({Address_Full}) - FIND(",", {Address_Full}))))

{FX_Remainder_Split_a} =
TRIM(LEFT(FX_Remainder, FIND(", ", FX_Remainder) -1))

{FX_Remainder_Split_b} =
TRIM(RIGHT(FX_Remainder, (LEN(FX_Remainder) - FIND(",", FX_Remainder))))

{City_Extracted} =
IF(LEFT(FX_Remainder_Split_b,2)=‘NC’,{FX_Remainder_Split_a},LEFT(FX_Remainder_Split_b, FIND(", ", FX_Remainder_Split_b) -1))

Here’s a commenter link to the sample base (let me know if I need to share differently…I’m new to sharing examples).

@Bill.French I owe you a beer! I’m guessing you just saved me hours of work, thank you!

1 Like