Use Regex to extract info form text

Marc_Evrard · 2018-04-07T09:22:50+00:00

First Airtable. is. awesome. I which I knew about it before to help organize my work. Something I feel is missing though is a more powerful approach to deal with string-based functions. For instance, it would be great to be able to use RegEx in the FIND or SUBSTITUTE functions. A use case, e.g., is the extraction of part of a URL to use it as a primary key. It would allow me to name automatically a bunch of website links I wanna store. I.e., there’s a REGEXEXTRACT function in Google Sheet, this is extremely useful for numerous of applications.

A

+18

Andre_Zijlstra
Inspiring
Forum|Forum|7 years ago
April 9, 2018

@Marc_Evrard Thanks for mentioning this Google Sheet formula. For some reason I had never seen that before. I do think that the things you mention can be realized by using FIND + LEFT/MID/RIGHT + LEN and so on. I’m kinda curious to see some of the cases.

Like

M

Marc_Evrard
Author
New Participant
Forum|Forum|7 years ago
April 9, 2018

@Marc_Evrard Thanks for mentioning this Google Sheet formula. For some reason I had never seen that before. I do think that the things you mention can be realized by using FIND + LEFT/MID/RIGHT + LEN and so on. I’m kinda curious to see some of the cases.

Hi Andre,
Thanks for the suggestion.
For instance the function =REGEXEXTRACT(<url text>, ".*\://?([^\/]+)") in Google Sheet would extract the sub-domain and domain names of a given URL. I’m not sure how I could produce such a concise solution using a bunch of FIND, LEFT, MID, RIGHT and LEN functions. RegEx may not be particularly readable, but they are extremely powerful.
Best

Like

M

Marc_Evrard
Author
New Participant
Forum|Forum|7 years ago
April 19, 2018

Hi Andre,
Thanks for the suggestion.
For instance the function =REGEXEXTRACT(<url text>, ".*\://?([^\/]+)") in Google Sheet would extract the sub-domain and domain names of a given URL. I’m not sure how I could produce such a concise solution using a bunch of FIND, LEFT, MID, RIGHT and LEN functions. RegEx may not be particularly readable, but they are extremely powerful.
Best

Just a quick follow up on this, I did solve the problem with the following formula:

MID(URL, 
    FIND("//", URL) + 2, 
    FIND("/", 
         URL, 
         FIND("//", URL) + 2
        ) - (FIND("//", URL) + 2)
   )

Some people might actually find it more readable that the regex above. Though, for instance, this problem was a rather simple one when compared to all the possibilities offered by regular expressions (some examples).
By the way, if you think about any better way to proceed with this, feel free to comment further.

Like

B

+7

Brian_Frank
Participating Frequently
Forum|Forum|7 years ago
February 6, 2019

Regex in GENERAL would be a great addition to Airtable.

Like

M

+3

Maurice11
Known Participant
Forum|Forum|6 years ago
November 15, 2019

Oh that’s sad, I thought this existed! Damn REGEXEXTRACT is SO useful. You can enter a URL, and then get page title, meta data, even drill down to specific DIVs. I would LOVE this added.

Like

M

+15

Moe
Inspiring
Forum|Forum|5 years ago
May 25, 2020

We’ve built a tool that allows you to use Regex with Airtable.

Like

E

+4

Eiffel_Bemorese
Participating Frequently
Forum|Forum|5 years ago
October 27, 2020

We’ve built a tool that allows you to use Regex with Airtable.

Nice proposal but rather expensive in my opinion!

Like

E

+4

Eiffel_Bemorese
Participating Frequently
Forum|Forum|5 years ago
October 27, 2020

Absolutely REGEX feature is a “must have” feature in Airtable. So many problems could be resolved w/ it ( even if it’s not always easy to read | understand :winking_face: )

Like

+18

Adam_Minich
Employee
Forum|Forum|5 years ago
January 7, 2021

Hello,

Adam from the Airtable support team here. We now have REGEX functions (REGEX_EXTRACT() being one of them) available for use in formula fields. You can find some more information about those functions in this support document.

Cheers!

Like

J

+18

Jeremy_Oglesby
Inspiring
Forum|Forum|5 years ago
January 9, 2021

Hello,

Adam from the Airtable support team here. We now have REGEX functions (REGEX_EXTRACT() being one of them) available for use in formula fields. You can find some more information about those functions in this support document.

Cheers!

@Adam_Minich, @Jason, @Taylor_Savage,

This addition to formulas deserves its own announcement post!!! This is huge!

I didn’t know about this until just now when I was checking something in the formula field reference page and noticed the new “Regex” section. I thought, “Wow! Surely there was an announcement or a beta that I somehow totally missed!” But I came back to the forums and all I can find are your replies to Product Suggestions from two days ago. I went at least two days not even knowing that this incredibly useful and powerful addition to formula fields existed!!!

I would suggest making an announcement post about this new feature, and when replying to Product Suggestions for the feature (like this one, which you missed), link to the announcement post instead.

Like

+29

kuovonne
Brainy
Forum|Forum|5 years ago
January 9, 2021

Hello,

Adam from the Airtable support team here. We now have REGEX functions (REGEX_EXTRACT() being one of them) available for use in formula fields. You can find some more information about those functions in this support document.

Cheers!

Thanks for adding this support for regular expressions. It is also awesome that you have both example uses and the results that the produce.

REGEX_REPLACE("Hello World", "W.*", "") => "Hello"

Shouldn’t the result of this function have a trailing space? (I’m hours away from a computer and cannot test it myself for sure.)

REGEX_REPLACE("Hello World", "W.*", "") => "Hello "

Also, you might want to consider an example where there are multiple replacements.

Like

C

+2

Cedric_Carrobou
New Participant
Forum|Forum|5 years ago
January 10, 2021

Hello,

Adam from the Airtable support team here. We now have REGEX functions (REGEX_EXTRACT() being one of them) available for use in formula fields. You can find some more information about those functions in this support document.

Cheers!

Hello @Adam_Minich, great news for these new Regex features ! In the support document I see the Regex Extract returns the FIRST substring that matches a regular expression. Are you planning another Regex which will return ALL subtrings that match a regular expression ?

Cheers,

Like

+35

ScottWorld
Genius
Forum|Forum|5 years ago
January 10, 2021

Hello @Adam_Minich, great news for these new Regex features ! In the support document I see the Regex Extract returns the FIRST substring that matches a regular expression. Are you planning another Regex which will return ALL subtrings that match a regular expression ?

Cheers,

Yes, this seems like a major oversight on Airtable’s part that the REGEX_EXTRACT only returns the FIRST substring that matches a regular expression. This makes the function significantly less useful than one would hope for.

For example, if I was searching for all the email addresses in a long text field, I might use this regular expression:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,63}

However, Airtable’s REGEX_EXTRACT function will only return the very first email address that it can find.

@Adam_Minich @Jason @Taylor_Savage , this seems like a really important thing to fix. Can you guys please look into seeing if this can be fixed?

However, the good news for all of us is that MiniExtensions.com has come to the rescue yet again!

MiniExtensions has a RegEx extraction tool which works perfectly for returning ALL substrings, and it even lets you specify how you want to separate the different substrings that it finds (comma, new line, blank space, etc.). See screenshot below for how MiniExtensions lets you specify what you would like to use as your separator.

I’ve just tested this out on my end, and it works 100% perfectly:

Like

+29

kuovonne
Brainy
Forum|Forum|5 years ago
January 11, 2021

Yes, this seems like a major oversight on Airtable’s part that the REGEX_EXTRACT only returns the FIRST substring that matches a regular expression. This makes the function significantly less useful than one would hope for.

For example, if I was searching for all the email addresses in a long text field, I might use this regular expression:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,63}

However, Airtable’s REGEX_EXTRACT function will only return the very first email address that it can find.

@Adam_Minich @Jason @Taylor_Savage , this seems like a really important thing to fix. Can you guys please look into seeing if this can be fixed?

However, the good news for all of us is that MiniExtensions.com has come to the rescue yet again!

MiniExtensions has a RegEx extraction tool which works perfectly for returning ALL substrings, and it even lets you specify how you want to separate the different substrings that it finds (comma, new line, blank space, etc.). See screenshot below for how MiniExtensions lets you specify what you would like to use as your separator.

I’ve just tested this out on my end, and it works 100% perfectly:

I expect that this design choice was made because Airtable’s formula fields have extremely limited ability to deal with arrays.

Like

+21

Justin_Barrett
Inspiring
Forum|Forum|5 years ago
January 11, 2021

Thanks for adding this support for regular expressions. It is also awesome that you have both example uses and the results that the produce.

REGEX_REPLACE("Hello World", "W.*", "") => "Hello"

Shouldn’t the result of this function have a trailing space? (I’m hours away from a computer and cannot test it myself for sure.)

REGEX_REPLACE("Hello World", "W.*", "") => "Hello "

Also, you might want to consider an example where there are multiple replacements.

The formatting that you see in the example on the formula reference page doesn’t make this clear, but the actual sample formula when copied is:

REGEX_REPLACE("Hello World", " W.*", "")

There’s a space in front of the “W”, so “Hello” is returned with no trailing space.

Like

+29

kuovonne
Brainy
Forum|Forum|5 years ago
January 11, 2021

The formatting that you see in the example on the formula reference page doesn’t make this clear, but the actual sample formula when copied is:

REGEX_REPLACE("Hello World", " W.*", "")

There’s a space in front of the “W”, so “Hello” is returned with no trailing space.

Hum. I’m pretty sure that I did a direct copy/paste when I created my post, and in my post there is no space before the W. Maybe Airtable fixed the documentation after I posted.

Like

+29

kuovonne
Brainy
Forum|Forum|5 years ago
January 12, 2021

Hello,

Adam from the Airtable support team here. We now have REGEX functions (REGEX_EXTRACT() being one of them) available for use in formula fields. You can find some more information about those functions in this support document.

Cheers!

@Adam_Minich The documentation does not state what REGEX_EXTRACT(string, regex) returns when there is no matching string. It looks like it returns #ERROR!.

It would make more sense for REGEX_EXTRACT to return either null or an empty string if there is no match. (Of course, it would be even nicer if the function returned an array of all the matches, but that is a different matter.)

The current situation leads to the workaround of having to test for a match before extracting it to avoid the unsightly error:

IF(
  REGEX_MATCH({Text Field}, "W.*"), 
  REGEX_EXTRACT({Text Field}, "W.*")
)

In comparison, notice that REGEX_REPLACE does not return a error if there is no match; it simply does not replace any text.

Like

C

+2

carloschristian
Participating Frequently
Forum|Forum|5 years ago
January 12, 2021

I’m trying to build a formula that extracts the text between [en-multi] in the following string:

[en-multi]THIS SPRING
12 JANUARY[en-multi][de-DE]DIESES FRÜHJAHR
AB 12. JANUAR[de-DE]ESTA PRIMAVERA
12 DE ENEROESTA PRIMAVERA
12 DE ENEROCE PRINTEMPS
LE 12 JANVIER
QUESTA PRIMAVERA
DAL 12 GENNAIODIT VOORJAAR
12 JANUARI

I will use another field to extract the text between [de-DE] and so on and so forth… Is this something that can be “easily” achieved with REGEX?

Like

+21

Justin_Barrett
Inspiring
Forum|Forum|5 years ago
January 13, 2021

I’m trying to build a formula that extracts the text between [en-multi] in the following string:

[en-multi]THIS SPRING
12 JANUARY[en-multi][de-DE]DIESES FRÜHJAHR
AB 12. JANUAR[de-DE]ESTA PRIMAVERA
12 DE ENEROESTA PRIMAVERA
12 DE ENEROCE PRINTEMPS
LE 12 JANVIER
QUESTA PRIMAVERA
DAL 12 GENNAIODIT VOORJAAR
12 JANUARI

I will use another field to extract the text between [de-DE] and so on and so forth… Is this something that can be “easily” achieved with REGEX?

I’m somewhat of a regex newbie, but I was able to make this work after some experimentation on regex101.com (recommended by @Jason in the announcement post - THANK YOU!). Make sure to use the Golang (RE2) “flavor”.

The formula is:

REGEX_EXTRACT(Name, "(?:\\[en\\-multi\\])([a-zA-Z\\d\\s]*)")

Like

C

+2

carloschristian
Participating Frequently
Forum|Forum|5 years ago
January 13, 2021

I’m somewhat of a regex newbie, but I was able to make this work after some experimentation on regex101.com (recommended by @Jason in the announcement post - THANK YOU!). Make sure to use the Golang (RE2) “flavor”.

The formula is:

REGEX_EXTRACT(Name, "(?:\\[en\\-multi\\])([a-zA-Z\\d\\s]*)")

So awesome! I’m so thankful for your help. Thank you, Justin. This code works well. I did run into some trouble with the umlaut u - ü - and period after the numbered date for German, but I did some googling an came up with the following for languages with other characters. Sharing just in case someone else should find it helpful.

REGEX_EXTRACT({Name}& “”, “(?:\[de\-DE\])([A-Za-zÀ-ž\u0370-\u03FF\u0400-\u04FF*.\d\s]*)”)

Thanks again! :slightly_smiling_face:

Like

+24

Karlstens
Brainy
Forum|Forum|4 years ago
October 21, 2021

BUMP … please please please, can we have REGEX_MATCH() return an integer count of the matches found, instead of the boolean true/false? Such functionality would make my log parsing table a dream to work with.

Like

A

+17

Ambroise_Dhenai
Inspiring
Forum|Forum|3 years ago
June 16, 2022

I’m surprised REGEX_EXTRACT returns an ERROR when no match is found.

Like

+21

Justin_Barrett
Inspiring
Forum|Forum|3 years ago
June 22, 2022

I’m surprised REGEX_EXTRACT returns an ERROR when no match is found.

It’s similar to how the DATETIME_FORMAT() function returns an error when trying to format a date from an empty date field. As such, it doesn’t surprise me, though that doesn’t make it any less annoying. It just becomes one of those “this is how Airtable works” things that you have to log away for future reference. Similar to looking for a date before trying to format it, you have to look for a match with REGEX_MATCH() before trying to extract it.

Like

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded