Help

Re: REGEX extract all text lines except for last

Solved
Jump to Solution
3686 0
cancel
Showing results for 
Search instead for 
Did you mean: 
stefw
4 - Data Explorer
4 - Data Explorer

I have addresses in multiple lines and am using REGEX EXTRACT in Airtable to separate the components (street address, suburb, state, post code) before sending them on for shipping labels.

I've used REGEX_EXTRACT({text input}, "^(?:.*\n){1}") to extract the first line and this works in most cases except when the street address is over two lines (see example B).

Can someone please help me with a REGEX EXTRACT that extracts every line except the last line (whether the address is 2 or 3 lines)?

Or is there a simpler solution I'm not aware of? 

Example A

1 Test Street

Richmond, Perth 3021

Example B

1 Test Street

Unit 6

Richmond, Perth 3021

For reference on how I'm dealing with the rest of the components...

I've used REGEX_EXTRACT({text input}, "^.\r?\n(.)") to extract the last line (Subrub, State, Postcode) and then used REGEX_EXTRACT({text input}, ".*(\d{4})") to extract the post code.

Once I have the post code I'm matching it against a database to find the Suburb and State. I did this because I believe it's too difficult to do a REGEX that will extract the Suburb, State and Postcode individually. I think it's too difficult because both the Suburb and State can be one, two or three words long. If you have a better solution I'm all ears!

2 Solutions

Accepted Solutions
TheTimeSavingCo
18 - Pluto
18 - Pluto

Hmm, could you try: 

REGEX_EXTRACT(Address,'(?s)^(.*)\n[^\n]*$')

Screenshot 2023-03-14 at 10.26.23 AM.png

Should do what you want?

---

I'm not too familiar with the suburb and state stuff I'm afraid.  If you could provide some examples I can see what I can figure out!

See Solution in Thread

Steve_Haysom
8 - Airtable Astronomer
8 - Airtable Astronomer

Here is a solution that extracts suburb and state, allowing for extra line breaks, spaces and differing capitalis ation:

Untitled.png

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2,300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

I've only done this for new south wales and victoria so far, the fields LastLineWithShortState, State and StateFull need finishing with the other states.

 

 

See Solution in Thread

7 Replies 7
TheTimeSavingCo
18 - Pluto
18 - Pluto

Hmm, could you try: 

REGEX_EXTRACT(Address,'(?s)^(.*)\n[^\n]*$')

Screenshot 2023-03-14 at 10.26.23 AM.png

Should do what you want?

---

I'm not too familiar with the suburb and state stuff I'm afraid.  If you could provide some examples I can see what I can figure out!

Thanks Adam, you're a legend! 

The last line of each address is made up of the Suburb, State and Postcode.

Currently I'm only extracting the postcode and then matching it against a database to find the Suburb and State. I'm doing this because I don't know how to split the three elements. 

This is solution is mostly fine however sometime it returns a Suburb that is not the one in the orginal address. This is because a single post code may apply to multiple suburbs. For example, the postcode 3121 is for Richmond, Cremorne, Burnley etc.

Removing the 4 digit post code is easy enough. It is always at the end and is always 4 digits. However splitting the Suburb and State is trickier. I think it would have to be something around the State given there are only 8 of them and thousands of Suburbs. 

The8 States/Territories in Australia (where we deliver):
VIC - Victoria
NSW - New South Wales
QLD - Queensland
NT - Northern Territory 
SA - South Australia 
ACT - Australia Capital Territory 
WA - Western Australia
TAS - Tasmania
Customers will write it either as an abbreviation listed or as the full name, (NSW, Nsw, nsw, new south wales, New South Wales, New south wales), which I think ads to the complexity. 

Note: There are no commas or any characters splitting the elements. 

Examples:
Richmond VIC 3121
Upwey Victoria 3158
Sydney New south wales 2000
St Albans Nsw 2021

Open to any thoughts on this and thanks again for solving the last one!

Stefaan

 

 

Steve_Haysom
8 - Airtable Astronomer
8 - Airtable Astronomer

Here is a solution that extracts suburb and state, allowing for extra line breaks, spaces and differing capitalis ation:

Untitled.png

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2,300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

I've only done this for new south wales and victoria so far, the fields LastLineWithShortState, State and StateFull need finishing with the other states.

 

 

Glad I could help!

re: the states
Hmm, yeah this is tricky.  Given that it's either written as an abbreiviation or as a full name, I'd just hardcode it and call it a day, really

Screenshot 2023-03-14 at 6.36.33 PM.png

Link to base

And you can duplicate the base to see the formulas used!

I posted a reply to this but edited it, it has now disappeared...

I'll try again:

Here is a solution that extracts suburb and state, allowing for extra line breaks/spaces, and differing spelling of state and capitalisation (and also if the state name is in the suburb!):

Untitled.png

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2, 300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

The fields LastLineWithShortState, State and StateFull need finishing off with the other states.

 

 

 

 

Here is a solution that extracts suburb and state, allowing for extra line breaks/spaces, and differing spelling of state and capitalisation (and also if the state name is in the suburb!). 

AddressPiped: "¬" & REGEX_REPLACE(REGEX_REPLACE(Address,"[\n\\r]+", "¬"), " +", " ")
LastLine: LOWER(REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$"))
LastLineWithShortState: REGEX_REPLACE(REGEX_REPLACE(LastLine, "victoria ([0-9]{4})", "vic $1"),"new south wales ([0-9]{4})","nsw $1")
State: REGEX_EXTRACT(LastLineWithShortState," (nsw|vic) [0-9]{4}")
Suburb: REGEX_EXTRACT(LastLineWithShortState,"^(.+) " & State & " [0-9]{4}")
LastLineProper: REGEX_EXTRACT(AddressPiped,"¬([^¬]+)$")
SuburbProper: LEFT(LastLineProper, LEN(Suburb))
SuburbProperFinal: UPPER(LEFT(SuburbProper,1)) & MID(SuburbProper,2, 300)
StateFull: SWITCH(State,"nsw","New South Wales", "vic", "Victoria")

The fields LastLineWithShortState, State and StateFull need finishing off with the other states.