Skip to main content

I have read the solutions for extracting text but still can’t seem to get it right.


Need extract text between the dashes (-). The text between the dashes could be any number of characters, if there is a limited it is unknown.



- ###### - ##### - ##### -

Hi Rick. I would just substitute in an empty string or a space, depending on your preference.


SUBSTITUTE({string field name}, '-','')


Sound good?


Hi Rick. I would just substitute in an empty string or a space, depending on your preference.


SUBSTITUTE({string field name}, '-','')


Sound good?


what would removing the spaces get me.


ABCD - EFGH - IJKL - MNOP - QRST


Would like to create 5 columns from each data set between the dashes. The length of each data set is not fixed.


what would removing the spaces get me.


ABCD - EFGH - IJKL - MNOP - QRST


Would like to create 5 columns from each data set between the dashes. The length of each data set is not fixed.


This is a very similar use case to this thread:



That example was looking for numbers presented in the pattern of

(length)( x )(width)( x )(height)


Your pattern appears to be alphanumeric characters separated by spaces and dashes, or

(alphanumeric text 1)( - )(alphanumeric text 2)( - )(alphanumeric text 3)( - )(alphanumeric text 4)( - )(alphanumeric text 5)


The REGEX pattern for that is:


(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)

^ if you want to explicitly define the separator as one or more consecutive dashes. (\\-+) means one or more - characters.

or


(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)

^ if you want to explicitly define the separator as one or more consecutive non-alphanumeric characters (which would include both spaces and dashes). (\\W+) means one or more non-alphanumeric characters


Therefore a full formula for the first of the five fields could be either:

(1)


REGEX_REPLACE(
SUBSTITUTE({string field name}, " ", ""),
"(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)",
"$1"
)

Note: the SUBSTIUTE was included, assuming you want to remove the spaces.


(2)


REGEX_REPLACE(
{string field name},
"(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)",
"$1"
)

^ this is the recommended formula. If there are not always 5 groups of text, then replace all the + with * to change the formula from “one or more” to “zero or more”.




For either method, the REGEX pattern has 9 “capture groups” (each group is what’s in-between the parentheses). Each of your five formula fields that extracts the alphanumeric text will have the exact same REGEX_REPLACE() formula apart from the last argument to reference the relevant capture group.


Your first field will have "$1" as the last argument (as shown above), the second field will have "$3", the third will have "$5", etc.


This is a very similar use case to this thread:



That example was looking for numbers presented in the pattern of

(length)( x )(width)( x )(height)


Your pattern appears to be alphanumeric characters separated by spaces and dashes, or

(alphanumeric text 1)( - )(alphanumeric text 2)( - )(alphanumeric text 3)( - )(alphanumeric text 4)( - )(alphanumeric text 5)


The REGEX pattern for that is:


(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)

^ if you want to explicitly define the separator as one or more consecutive dashes. (\\-+) means one or more - characters.

or


(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)

^ if you want to explicitly define the separator as one or more consecutive non-alphanumeric characters (which would include both spaces and dashes). (\\W+) means one or more non-alphanumeric characters


Therefore a full formula for the first of the five fields could be either:

(1)


REGEX_REPLACE(
SUBSTITUTE({string field name}, " ", ""),
"(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)(\\-+)(\\w*)",
"$1"
)

Note: the SUBSTIUTE was included, assuming you want to remove the spaces.


(2)


REGEX_REPLACE(
{string field name},
"(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)(\\W+)(\\w*)",
"$1"
)

^ this is the recommended formula. If there are not always 5 groups of text, then replace all the + with * to change the formula from “one or more” to “zero or more”.




For either method, the REGEX pattern has 9 “capture groups” (each group is what’s in-between the parentheses). Each of your five formula fields that extracts the alphanumeric text will have the exact same REGEX_REPLACE() formula apart from the last argument to reference the relevant capture group.


Your first field will have "$1" as the last argument (as shown above), the second field will have "$3", the third will have "$5", etc.


Someday I will “get” regex but until that day…thank you.


I am having an issue with the extraction, the first $1 work great, but after that, $3, $5 I am getting improper results.


I did realize my example was slightly incorrect the first part of the pattern is always a number, not text.


Reply