Re: REGEX extract all text lines except for last

stefw · ‎Mar 13, 2023

I have addresses in multiple lines and am using REGEX EXTRACT in Airtable to separate the components (street address, suburb, state, post code) before sending them on for shipping labels.

I've used REGEX_EXTRACT({text input}, "^(?:.*\n){1}") to extract the first line and this works in most cases except when the street address is over two lines (see example B).

Can someone please help me with a REGEX EXTRACT that extracts every line except the last line (whether the address is 2 or 3 lines)?

Or is there a simpler solution I'm not aware of?

Example A

1 Test Street

Richmond, Perth 3021

Example B

1 Test Street

Unit 6

Richmond, Perth 3021

For reference on how I'm dealing with the rest of the components...

I've used REGEX_EXTRACT({text input}, "^.\r?\n(.)") to extract the last line (Subrub, State, Postcode) and then used REGEX_EXTRACT({text input}, ".*(\d{4})") to extract the post code.

Once I have the post code I'm matching it against a database to find the Suburb and State. I did this because I believe it's too difficult to do a REGEX that will extract the Suburb, State and Postcode individually. I think it's too difficult because both the Suburb and State can be one, two or three words long. If you have a better solution I'm all ears!

TheTimeSavingCo · ‎Mar 13, 2023

Hmm, could you try:

REGEX_EXTRACT(Address,'(?s)^(.*)\n[^\n]*$')

Should do what you want?

---

I'm not too familiar with the suburb and state stuff I'm afraid. If you could provide some examples I can see what I can figure out!

Adam, Consultant @ The Time Saving Co

See Solution in Thread

Steve_Haysom · ‎Mar 14, 2023

Here is a solution that extracts suburb and state, allowing for extra line breaks, spaces and differing capitalis ation:

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2,300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

I've only done this for new south wales and victoria so far, the fields LastLineWithShortState, State and StateFull need finishing with the other states.

See Solution in Thread

TheTimeSavingCo · ‎Mar 13, 2023

Hmm, could you try:

REGEX_EXTRACT(Address,'(?s)^(.*)\n[^\n]*$')

Should do what you want?

---

I'm not too familiar with the suburb and state stuff I'm afraid. If you could provide some examples I can see what I can figure out!

Adam, Consultant @ The Time Saving Co

stefw · ‎Mar 13, 2023

Thanks Adam, you're a legend!

The last line of each address is made up of the Suburb, State and Postcode.

Currently I'm only extracting the postcode and then matching it against a database to find the Suburb and State. I'm doing this because I don't know how to split the three elements.

This is solution is mostly fine however sometime it returns a Suburb that is not the one in the orginal address. This is because a single post code may apply to multiple suburbs. For example, the postcode 3121 is for Richmond, Cremorne, Burnley etc.

Removing the 4 digit post code is easy enough. It is always at the end and is always 4 digits. However splitting the Suburb and State is trickier. I think it would have to be something around the State given there are only 8 of them and thousands of Suburbs.

The8 States/Territories in Australia (where we deliver):
VIC - Victoria
NSW - New South Wales
QLD - Queensland
NT - Northern Territory
SA - South Australia
ACT - Australia Capital Territory
WA - Western Australia
TAS - Tasmania
Customers will write it either as an abbreviation listed or as the full name, (NSW, Nsw, nsw, new south wales, New South Wales, New south wales), which I think ads to the complexity.

Note: There are no commas or any characters splitting the elements.

Examples:
Richmond VIC 3121
Upwey Victoria 3158
Sydney New south wales 2000
St Albans Nsw 2021

Open to any thoughts on this and thanks again for solving the last one!

Stefaan

Steve_Haysom · ‎Mar 14, 2023

Here is a solution that extracts suburb and state, allowing for extra line breaks, spaces and differing capitalis ation:

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2,300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

I've only done this for new south wales and victoria so far, the fields LastLineWithShortState, State and StateFull need finishing with the other states.

TheTimeSavingCo · ‎Mar 14, 2023

Glad I could help!

re: the states
Hmm, yeah this is tricky. Given that it's either written as an abbreiviation or as a full name, I'd just hardcode it and call it a day, really

Link to base

And you can duplicate the base to see the formulas used!

Adam, Consultant @ The Time Saving Co

Steve_Haysom · ‎Mar 14, 2023

I posted a reply to this but edited it, it has now disappeared...

Steve_Haysom · ‎Mar 14, 2023

I'll try again:

Here is a solution that extracts suburb and state, allowing for extra line breaks/spaces, and differing spelling of state and capitalisation (and also if the state name is in the suburb!):

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2, 300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

The fields LastLineWithShortState, State and StateFull need finishing off with the other states.

Steve_Haysom · ‎Mar 14, 2023

Here is a solution that extracts suburb and state, allowing for extra line breaks/spaces, and differing spelling of state and capitalisation (and also if the state name is in the suburb!).

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2, 300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

The fields LastLineWithShortState, State and StateFull need finishing off with the other states.