Help

Extract every first letter from a strings with regex without a loop

Topic Labels: Formulas
Solved
Jump to Solution
6353 12
cancel
Showing results for 
Search instead for 
Did you mean: 

I’m trying to make shortenings of company names using a regular expression, by taking taking the first letter every part of the name.
Fx ‘Great Garden Centre’ → GGC
I’m not sure if this is possible without a loop, but I’m also no buff at using regex.

I’ve sofar I can only extract the first letter using this:

REGEX_EXTRACT(Company, "\\b[a-zA-Z]")

Is this possible without looping through every word?

1 Solution

Accepted Solutions

An even simpler formula would be

REGEX_REPLACE(Notes, "\\B\\w+\\W*", "")

This will delete all word characters that are not at the start of a word up to, and including, eventual subsequent non-word characters. “Great Garden Company” becomes “GGC” and “Not so great Company” becomes “NsgC”.

Note that, as of this writing, Airtable regular expressions are not Unicode aware, so that letters outside the English Latin subset will be considered non-word characters and thus will not be abbreviated.

See Solution in Thread

12 Replies 12
Ela_Automatela_
5 - Automation Enthusiast
5 - Automation Enthusiast

Hi there! What you did will only extract the letter G from the example you’ve provided. It’s because according to documentation REGEX_EXTRACT returns only the first occurence that matches your regex. I thought it’s not possible but I’ve played a while and I was able to achieve it by below formula:

SUBSTITUTE(REGEX_REPLACE(Notes, "[a-z]", ""), " ", "")

So with REGEX_REPLACE I’m replacing every small letter with empty string first. This doesn’t seem to work as expected giving me “G G C” with spaces between letters. For that I’ve wrapped it with SUBSTITUTE that replaces spaces (" “) with nothing (”") so you get “GGC”.

Ahhh thats quite a nice way around it, but I assume I’ll need always to write first letters in caps. But guess it’ll have to do. Thank you for your help and clear explanation.

If you don’t want to worry about capital letters or not, this should do it:

SUBSTITUTE(REGEX_REPLACE(Notes, "\\B[a-z]", ""), " ", "")

I’ve checked it and it seems to work. I’m not an expert in regex though, I just try and see what happens :grinning_face_with_big_eyes:

An even simpler formula would be

REGEX_REPLACE(Notes, "\\B\\w+\\W*", "")

This will delete all word characters that are not at the start of a word up to, and including, eventual subsequent non-word characters. “Great Garden Company” becomes “GGC” and “Not so great Company” becomes “NsgC”.

Note that, as of this writing, Airtable regular expressions are not Unicode aware, so that letters outside the English Latin subset will be considered non-word characters and thus will not be abbreviated.

That is just perfect - thank you @Martin_Kopischke

This is a nice solution.

However, one of the problems with creating a regular expression like this is dealing with punctuation:

  • Annie’s Bagels
  • Pick 'n Save
  • 'Tis the Season
  • company.com

A slight tweak will seal with the first case.

REGEX_REPLACE(Notes, "\\B[\\w']+\\W*", "")

But there are so many cases where you just can’t predict punctuation that is is very hard to have a robust regular expression to deal with all of these situations.

Absolutely, you are right on all counts. There is only so much you can achieve with a regex considering the bewildering range of possible inputs, including incorrect forms (case in point: your regex will actually not match your first example the way you would expect it to, because that apostrophe is a typographically correct one while the regex only matches on ASCII single quotes), never mind the fact that any letter not accounted for by ASCII throws a spanner into the works, as mentioned. Personally, I’d recommend not relying on such an automatism at all and setting the abbreviation manually instead, but that is outside the scope of OP’s question, and I prefer to avoid answering “you shouldn’t be doing that” / “why would you want to do that” when I can :winking_face: .

For whatever reason I always dip headless into the automation bucket and completely disregard that sometimes the best way is just do it manually.

One of the nicer aspects of Airtable is that you can actually combine both approaches for an existing data set. Define a formula column using a variant of my regex to get a first raw set of abbreviations, then convert that column into text. Airtable will keep the calculated values while converting, which means you can now fine tune the existing abbreviations, and any new entry will then add their abbreviation manually.

Or, if you prefer, you can keep the calculated column and add an override text column where users can manually tune (or deviate from) it, then base all display and operations on a third formula column that selects the override value if present, or the automatic one if not (I do that in some scenarios where the automatic value is right in at least 75 % of the cases). It all depends on factors like existing data set, frequency of new inputs, qualification of users inputting data etc.

On a more personal note, I wouldn’t call this “headless”: the allure of automatisms is great, and when they work, they feel like magic. It’s when you start hitting the edge cases you will come to the insight that doing the work, albeit annoying, is often superior to magic that goes wrong. At least that’s what happened to me: I have rolled back most automatisms in our system over time, sometimes falling back to the override pattern I described, but most often just saying “f**k it, not worth the hassle” (disclaimer: we have a very small, specialised team doing data entry and the volume of new inputs at any point in time is small enough for me to keep tabs on inconsistencies and lags in the data; YMMV).

This is working really well. Do you have any advice on how to extract the first TWO characters from each word? Having trouble finding documentation online about this. Thanks!

Zach_Young
5 - Automation Enthusiast
5 - Automation Enthusiast

Like @Martin_Kopischk pointed out, Airtable regex doesn't support Unicode categories in its character classes: both `\bB` and `\wW` only consider ASCII characters. Being able to specify something like `\p{Uppercase_Letter}` might be very handy.

Airtable regex does support grouping and group references in the substitution. With that you can specify more generally what you want to keep instead of trying to specify what you want to get rid of.

(Oh, very basic regex primer: the period (dot) means "any character".)

 

 

REGEX_REPLACE(Field, '(.).*?\s', '$1')

 

 

Means find the first character and capture (remember) it:

 

 

(.)

 

 

find any remaining characters up to the first space char (could be space, tab, or any line breaking char)

 

 

.*?\s

 

 

then, replace all that with the single captured char

 

 

'$1'

 

 

Because only the first character was captured, the following characters for each word are removed when substituting in the captured char.

For a string like 'Great Garden Centre' that yields only 'GG' because 'Centre' doesn't have a space after it. That can easily be addressed by just tacking on a space to the end of the field in the formula:

 

 

REGEX_REPLACE(Field&' ', '(.).*?\s', '$1')

 

 

@Steady_Admin, this can be adapted for your need by just extending the capture group to include the first two chars:

 

 

REGEX_REPLACE(Field, '(..).*?\s', '$1')

 

 

I made a comparison of the various proposals, here.