Use Regex to extract info form text

Marc_Evrard · ‎Apr 07, 2018

First Airtable. is. awesome. I which I knew about it before to help organize my work.
Something I feel is missing though is a more powerful approach to deal with string-based functions.
For instance, it would be great to be able to use RegEx in the FIND or SUBSTITUTE functions.
A use case, e.g., is the extraction of part of a URL to use it as a primary key. It would allow me to name automatically a bunch of website links I wanna store.

I.e., there’s a REGEXEXTRACT function in Google Sheet, this is extremely useful for numerous of applications.

kuovonne · ‎Jan 09, 2021

Thanks for adding this support for regular expressions. It is also awesome that you have both example uses and the results that the produce.

REGEX_REPLACE("Hello World", "W.*", "") => "Hello"

Shouldn’t the result of this function have a trailing space? (I’m hours away from a computer and cannot test it myself for sure.)

REGEX_REPLACE("Hello World", "W.*", "") => "Hello "

Also, you might want to consider an example where there are multiple replacements.

Cedric_Carrobou · ‎Jan 10, 2021

Hello @Adam_Minich, great news for these new Regex features ! In the support document I see the Regex Extract returns the FIRST substring that matches a regular expression. Are you planning another Regex which will return ALL subtrings that match a regular expression ?

Cheers,

ScottWorld · ‎Jan 10, 2021

Yes, this seems like a major oversight on Airtable’s part that the REGEX_EXTRACT only returns the FIRST substring that matches a regular expression. This makes the function significantly less useful than one would hope for.

For example, if I was searching for all the email addresses in a long text field, I might use this regular expression:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,63}

However, Airtable’s REGEX_EXTRACT function will only return the very first email address that it can find.

@Adam_Minich @Jason @Taylor_Savage , this seems like a really important thing to fix. Can you guys please look into seeing if this can be fixed?

However, the good news for all of us is that MiniExtensions.com has come to the rescue yet again!

MiniExtensions has a RegEx extraction tool which works perfectly for returning ALL substrings, and it even lets you specify how you want to separate the different substrings that it finds (comma, new line, blank space, etc.). See screenshot below for how MiniExtensions lets you specify what you would like to use as your separator.

I’ve just tested this out on my end, and it works 100% perfectly:

Screenshot #1

kuovonne · ‎Jan 11, 2021

I expect that this design choice was made because Airtable’s formula fields have extremely limited ability to deal with arrays.

Justin_Barrett · ‎Jan 11, 2021

The formatting that you see in the example on the formula reference page doesn’t make this clear, but the actual sample formula when copied is:

REGEX_REPLACE("Hello World", " W.*", "")

There’s a space in front of the “W”, so “Hello” is returned with no trailing space.

kuovonne · ‎Jan 11, 2021

Hum. I’m pretty sure that I did a direct copy/paste when I created my post, and in my post there is no space before the W. Maybe Airtable fixed the documentation after I posted.

kuovonne · ‎Jan 12, 2021

@Adam_Minich The documentation does not state what REGEX_EXTRACT(string, regex) returns when there is no matching string. It looks like it returns #ERROR!.

It would make more sense for REGEX_EXTRACT to return either null or an empty string if there is no match. (Of course, it would be even nicer if the function returned an array of all the matches, but that is a different matter.)

The current situation leads to the workaround of having to test for a match before extracting it to avoid the unsightly error:

IF(
  REGEX_MATCH({Text Field}, "W.*"), 
  REGEX_EXTRACT({Text Field}, "W.*")
)

In comparison, notice that REGEX_REPLACE does not return a error if there is no match; it simply does not replace any text.

carloschristian · ‎Jan 12, 2021

I’m trying to build a formula that extracts the text between [en-multi] in the following string:

[en-multi]THIS SPRING
12 JANUARY[en-multi][de-DE]DIESES FRÜHJAHR
AB 12. JANUAR[de-DE]ESTA PRIMAVERA
12 DE ENEROESTA PRIMAVERA
12 DE ENEROCE PRINTEMPS
LE 12 JANVIER
QUESTA PRIMAVERA
DAL 12 GENNAIODIT VOORJAAR
12 JANUARI

I will use another field to extract the text between [de-DE] and so on and so forth… Is this something that can be “easily” achieved with REGEX?

Justin_Barrett · ‎Jan 12, 2021

I’m somewhat of a regex newbie, but I was able to make this work after some experimentation on regex101.com (recommended by @Jason in the announcement post - THANK YOU!). Make sure to use the Golang (RE2) “flavor”.

Screen Shot 2021-01-12 at 9.30.30 PM

The formula is:

REGEX_EXTRACT(Name, "(?:\\[en\\-multi\\])([a-zA-Z\\d\\s]*)")

carloschristian · ‎Jan 13, 2021

So awesome! I’m so thankful for your help. Thank you, Justin. This code works well. I did run into some trouble with the umlaut u - ü - and period after the numbered date for German, but I did some googling an came up with the following for languages with other characters. Sharing just in case someone else should find it helpful.

REGEX_EXTRACT({Name}& “”, “(?:\[de\-DE\])([A-Za-zÀ-ž\u0370-\u03FF\u0400-\u04FF*.\d\s]*)”)

Thanks again! :slightly_smiling_face:

Use Regex to extract info form text

Format Formula as Checkbox

CSV Import in Interfaces

React to Record Comments

Allow Emoji reactions to comments

Multiplying by percent column yields result that i...