Help

Re: Extract Text from a Data string

2233 0
cancel
Showing results for 
Search instead for 
Did you mean: 
mustOfGotLost
4 - Data Explorer
4 - Data Explorer

I have read the solutions for extracting text but still can’t seem to get it right.

Need extract text between the dashes (-). The text between the dashes could be any number of characters, if there is a limited it is unknown.

- ###### - ##### - ##### -

4 Replies 4
augmented
10 - Mercury
10 - Mercury

Hi Rick. I would just substitute in an empty string or a space, depending on your preference.

SUBSTITUTE({string field name}, '-','')

Sound good?

what would removing the spaces get me.

ABCD - EFGH - IJKL - MNOP - QRST

Would like to create 5 columns from each data set between the dashes. The length of each data set is not fixed.

This is a very similar use case to this thread:

That example was looking for numbers presented in the pattern of
(length)( x )(width)( x )(height)

Your pattern appears to be alphanumeric characters separated by spaces and dashes, or
(alphanumeric text 1)( - )(alphanumeric text 2)( - )(alphanumeric text 3)( - )(alphanumeric text 4)( - )(alphanumeric text 5)

The REGEX pattern for that is:

(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)

^ if you want to explicitly define the separator as one or more consecutive dashes. (\\-+) means one or more - characters.
or

(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)

^ if you want to explicitly define the separator as one or more consecutive non-alphanumeric characters (which would include both spaces and dashes). (\\W+) means one or more non-alphanumeric characters

Therefore a full formula for the first of the five fields could be either:
(1)

REGEX_REPLACE(
   SUBSTITUTE({string field name}, " ", ""), 
   "(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)", 
   "$1"
)

Note: the SUBSTIUTE was included, assuming you want to remove the spaces.

(2)

REGEX_REPLACE(
   {string field name}, 
   "(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)", 
   "$1"
)

^ this is the recommended formula. If there are not always 5 groups of text, then replace all the + with * to change the formula from “one or more” to “zero or more”.


For either method, the REGEX pattern has 9 “capture groups” (each group is what’s in-between the parentheses). Each of your five formula fields that extracts the alphanumeric text will have the exact same REGEX_REPLACE() formula apart from the last argument to reference the relevant capture group.

Your first field will have "$1" as the last argument (as shown above), the second field will have "$3", the third will have "$5", etc.

Someday I will “get” regex but until that day…thank you.

I am having an issue with the extraction, the first $1 work great, but after that, $3, $5 I am getting improper results.

I did realize my example was slightly incorrect the first part of the pattern is always a number, not text.