Use Regex to extract info form text

@Marc_Evrard Thanks for mentioning this Google Sheet formula. For some reason I had never seen that before. I do think that the things you mention can be realized by using FIND + LEFT/MID/RIGHT + LEN and so on. I’m kinda curious to see some of the cases.

1 Like

Hi Andre,
Thanks for the suggestion.
For instance the function =REGEXEXTRACT(<url text>, ".*\://?([^\/]+)") in Google Sheet would extract the sub-domain and domain names of a given URL. I’m not sure how I could produce such a concise solution using a bunch of FIND, LEFT, MID, RIGHT and LEN functions. RegEx may not be particularly readable, but they are extremely powerful.
Best

2 Likes

Just a quick follow up on this, I did solve the problem with the following formula:

MID(URL, 
    FIND("//", URL) + 2, 
    FIND("/", 
         URL, 
         FIND("//", URL) + 2
        ) - (FIND("//", URL) + 2)
   )

Some people might actually find it more readable that the regex above. Though, for instance, this problem was a rather simple one when compared to all the possibilities offered by regular expressions (some examples).
By the way, if you think about any better way to proceed with this, feel free to comment further.

1 Like

Regex in GENERAL would be a great addition to Airtable.

10 Likes

Oh that’s sad, I thought this existed! Damn REGEXEXTRACT is SO useful. You can enter a URL, and then get page title, meta data, even drill down to specific DIVs. I would LOVE this added.

1 Like

We’ve built a tool that allows you to use Regex with Airtable.

3 Likes

Nice proposal but rather expensive in my opinion!

Absolutely REGEX feature is a “must have” feature in Airtable. So many problems could be resolved w/ it ( even if it’s not always easy to read | understand :wink: )

2 Likes

Hello,

Adam from the Airtable support team here. We now have REGEX functions (REGEX_EXTRACT() being one of them) available for use in formula fields. You can find some more information about those functions in this support document.

Cheers!

6 Likes

@Adam_Minich, @Jason, @Taylor_Savage,

This addition to formulas deserves its own announcement post!!! This is huge!

I didn’t know about this until just now when I was checking something in the formula field reference page and noticed the new “Regex” section. I thought, “Wow! Surely there was an announcement or a beta that I somehow totally missed!” But I came back to the forums and all I can find are your replies to Product Suggestions from two days ago. I went at least two days not even knowing that this incredibly useful and powerful addition to formula fields existed!!!

I would suggest making an announcement post about this new feature, and when replying to Product Suggestions for the feature (like this one, which you missed), link to the announcement post instead.

4 Likes

Thanks for adding this support for regular expressions. It is also awesome that you have both example uses and the results that the produce.

REGEX_REPLACE("Hello World", "W.*", "") => "Hello"

Shouldn’t the result of this function have a trailing space? (I’m hours away from a computer and cannot test it myself for sure.)

REGEX_REPLACE("Hello World", "W.*", "") => "Hello "

Also, you might want to consider an example where there are multiple replacements.

1 Like

Hello @Adam_Minich, great news for these new Regex features ! In the support document I see the Regex Extract returns the FIRST substring that matches a regular expression. Are you planning another Regex which will return ALL subtrings that match a regular expression ?

Cheers,

3 Likes

Yes, this seems like a major oversight on Airtable’s part that the REGEX_EXTRACT only returns the FIRST substring that matches a regular expression. This makes the function significantly less useful than one would hope for.

For example, if I was searching for all the email addresses in a long text field, I might use this regular expression:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,63}

However, Airtable’s REGEX_EXTRACT function will only return the very first email address that it can find.

@Adam_Minich @Jason @Taylor_Savage , this seems like a really important thing to fix. Can you guys please look into seeing if this can be fixed?

However, the good news for all of us is that MiniExtensions.com has come to the rescue yet again!

MiniExtensions has a RegEx extraction tool which works perfectly for returning ALL substrings, and it even lets you specify how you want to separate the different substrings that it finds (comma, new line, blank space, etc.). See screenshot below for how MiniExtensions lets you specify what you would like to use as your separator.

I’ve just tested this out on my end, and it works 100% perfectly:

1 Like

I expect that this design choice was made because Airtable’s formula fields have extremely limited ability to deal with arrays.

The formatting that you see in the example on the formula reference page doesn’t make this clear, but the actual sample formula when copied is:

REGEX_REPLACE("Hello World", " W.*", "")

There’s a space in front of the “W”, so “Hello” is returned with no trailing space.

1 Like

Hum. I’m pretty sure that I did a direct copy/paste when I created my post, and in my post there is no space before the W. Maybe Airtable fixed the documentation after I posted.

1 Like

@Adam_Minich The documentation does not state what REGEX_EXTRACT(string, regex) returns when there is no matching string. It looks like it returns #ERROR!.

It would make more sense for REGEX_EXTRACT to return either null or an empty string if there is no match. (Of course, it would be even nicer if the function returned an array of all the matches, but that is a different matter.)

The current situation leads to the workaround of having to test for a match before extracting it to avoid the unsightly error:

IF(
  REGEX_MATCH({Text Field}, "W.*"), 
  REGEX_EXTRACT({Text Field}, "W.*")
)

In comparison, notice that REGEX_REPLACE does not return a error if there is no match; it simply does not replace any text.

2 Likes

I’m trying to build a formula that extracts the text between [en-multi] in the following string:

[en-multi]THIS SPRING
12 JANUARY[en-multi][de-DE]DIESES FRÜHJAHR
AB 12. JANUAR[de-DE]ESTA PRIMAVERA
12 DE ENEROESTA PRIMAVERA
12 DE ENEROCE PRINTEMPS
LE 12 JANVIER
QUESTA PRIMAVERA
DAL 12 GENNAIODIT VOORJAAR
12 JANUARI

I will use another field to extract the text between [de-DE] and so on and so forth… Is this something that can be “easily” achieved with REGEX?

I’m somewhat of a regex newbie, but I was able to make this work after some experimentation on regex101.com (recommended by @Jason in the announcement post - THANK YOU!). Make sure to use the Golang (RE2) “flavor”.

Screen Shot 2021-01-12 at 9.30.30 PM

The formula is:

REGEX_EXTRACT(Name, "(?:\\[en\\-multi\\])([a-zA-Z\\d\\s]*)")
4 Likes

So awesome! I’m so thankful for your help. Thank you, Justin. This code works well. I did run into some trouble with the umlaut u - ü - and period after the numbered date for German, but I did some googling an came up with the following for languages with other characters. Sharing just in case someone else should find it helpful.

REGEX_EXTRACT({Name}& “”, “(?:\[de\-DE\])([A-Za-zÀ-ž\u0370-\u03FF\u0400-\u04FF*.\d\s]*)”)

Thanks again! :slight_smile:

1 Like