Re: Omitting words from Regex_Replace() sanitation

Karlstens · ‎Jul 02, 2022

Using https://regex101.com/ - the below works well enough for omitting words from sanitation.

(?i)(?!keepWord1|keepWord2|keepWord3)(replaceWord1|replaceWord2|replaceWord3|replaceWord4|replaceWord5|replaceWord6)

So this doesn’t work;

REGEX_REPLACE(Player,"(?i)(?!keepWord1|keepWord2|keepWord3)(replaceWord1|replaceWord2|replaceWord3|replaceWord4|replaceWord5|replaceWord6)", "Word_Replacement")

The below at least does work, but it’s missing the important omission check - so some proper words get flagged and replaced unnecessarily.

REGEX_REPLACE(Player,"(?i)(replaceWord1|replaceWord2|replaceWord3|replaceWord4|replaceWord5|replaceWord6)", "Word_Replacement")

Any ideas of how to formulate this within a Formula Field?

Ben_Young1 · ‎Jul 06, 2022

Hey @Karlstens!

Airtable and regex have a strange relationship.
The functionality, as you’ve noticed, is limited.

In your situation, I can think of three possible solutions:

Nested Formula Oblivion

You’d pretty much just have to write a massive IF function.
If you pass this condition:

OR(
    SEARCH(
        LOWER(
            {Player}
        ),
        "keepWord1"
    ),
    SEARCH(
        LOWER(
            {Player}
        ),
        "keepWord2"
    ),
    SEARCH(
        LOWER(
            {Player}
        ),
        "keepWord3"
    )
)

You’re doing the equivalent of this portion of your regex pattern:
(?!keepWord1|keepWord2|keepWord3)

I don’t even want to think about how dense the nesting would be for the next parameters of the formula.

Multiple Formula Fields

I’m not even sure I can properly think about how I would do it, but another idea would be to create a dedicated formula field to evaluate the big parts of what you’re trying to do.

You’d have a formula field to see if keepWord{1,3} is present.
Then you’d need a formula field to look at the replaceWord{1,6} values and evaluate whether the value exists.

If those two fields cooperate, you’d then have a final field that spits out the properly transformed string using either the SUBSTITUTE or REGEX_REPLACE functions.

Needless to say, this is also messy.

Scripting / Javascript

For your use case, I would actually recommend this method.
You’d just have an automation that looks at changes to the {Player} field.

If the field returns a value, then evaluate the string against the regex pattern, and then write the transformed string into the field.

Again… none of it is ideal, but the scripting method is honestly pretty clean if you ask me.

Karlstens · ‎Jul 06, 2022

Thanks @Ben.Young - I thought I may have been asking too much of Airtable Formula Regex when having play with this idea - and yes, I’m sanitising swearing & cuss words here, solving the age old problem of allowing the user name “Master Assassin” but not allowing “### Master” or incorrect returns such as “Master Buttbuttin”. :joy:

Knowing now that my initial Regex formula isn’t compatible with Airtable Formula Field version of Regex (and thanks a million for checking this with me) I too feel the best solution will be found through some Javascript tinkering - with the exception that, I may just revisit the problem with a multi-formula field approach over the weekend, just to be sure there isn’t a comfortable method that will work. Like, there’s only so many swear words I’m checking for… ponders at the thought, just how many there might actually be? :thinking:

It’s a little frustrating, knowing how simple, powerful and instant Regex can be, and how Airtable have implemented a watered down version (yet, that’s still powerful in its own right).

The main advantage of using a Formula Field, is that it’s calculated near on instantly and the data from it returns with an API Post new record creation - which might have been overly useful.

Keen to here if anyone else has tried anything for sanitising their user name data.

Justin_Barrett · ‎Jul 06, 2022

The negative lookahead token combo—(?!...) doesn’t work with the variation of REGEX that Airtable uses. The Golang variant is closest, and you won’t find that negative lookahead token in the token list when switching to that variant on regex101.

Even if it did, I’m not sure that it would actually work as expected. The way that I read the docs, negative lookahead is positionally based, meaning that it’s specifically looking to ignore the listed content after other content earlier in the expression.

Something else to keep in mind is that the majority of regular expressions are parsed from left to right. The way your original expression is designed, it’s specifically looking to ignore the “keep” words when they appear immediately before a “replace” word. In other words, the expression will never be read as “keep these and ignore these, wherever they happen to appear in the source text.”

That said, I think that there might still be a way to do what you want. Are you able to provide examples of some “keep” words that are getting mixed up with “replace” words?