
Re: How to scan for duplicates in a URL?

Mr_Kav
5 - Automation Enthusiast

Hi guys,

First of all, I’m happy to be here, and I would love your help with a method I want to develop.
I’ll try to explain myself as clearly as I can so you can understand my needs.

We use LinkedIn search URLs on a daily basis. Most of them contain many strings and terms to filter our target audience. For example:

(Lead OR Architect OR Director OR SVP OR EVP OR Vice OR Chief OR VP OR Head) AND Data AND (Engineering OR “Application Development” OR “Software Development”) AND NOT (Expert OR Technician OR Support OR Maintenance OR QA OR quality OR Customer OR Specialist OR student

And here’s part of a URL, for example:

companyIncluded=NOT%2520(college%2520OR%2520university%2520OR%2520school%2520OR%2520HP%2520OR%2520%2522Hewlett%2520Packard%2522%2520OR%2520Samsung%2520OR%2520Nvidia%2520OR%2520Paypal%2520OR%2520google%2520OR%2520BAE)&companySize=F%2CG%2CH%2CI&companyTimeScope=CURRENT&doFetchHeroCard=false&geoIncluded=103644278%2C100506914%2C101165590%2C101174742%2C101452733%2C102454443%2C103291313%2C104738515%2C103323778%2C105146118%2C100446943%2C105490917&industryExcluded=71%2C75%2C77%2C96&keywords=(KYC%20OR%20AML%20OR%20%22authentication%22%20OR%20Fraud%20OR%20Identity%20OR%20passport%20OR%20%22customer%20experience%22%20OR%20%22Compliance%20and%20Assurance%22%20OR%20compliancence%20OR%20Onboarding%20OR%20Identification%20OR%20Journey%20OR%20%22call%20center%22%20OR%20%22contact%20center%22%20OR%20KYC%20OR%20PI%20OR%20%22Personal%20Information%22%20OR%20%22personal%20data%22%20OR%20%22data%20governance%22)%20NOT%20(Intern%20OR%20Student)&listExcluded=all&listType=ACCOUNT&logHistory=true&page=1&rsLogId=529923394&searchSessionId=LDepG29UTV6p3xAdg0C5mg%3D%3D&seniorityIncluded=6%2C4%2C7%2C5%2C8&titleIncluded=(Risk%2520OR%2520Credit%2520OR%2520Prevention%2520OR%2520%2522Compliance%2520and%2520Assurance%2522)%2520AND%2520(Fraud%2520OR%2520Identity%2520OR%2520Authentication%2520OR%2520Risk%2520OR%2520%2522Digital%2520Identity%2522)%2520AND%2520NOT%2520(professor%2520OR%2520office%2520OR%2520Sales%2520OR%2520owner%2520OR%2520software%2520OR%2520Consultant%2520OR%2520Adviser%2520OR%2520Consulting%2520OR%2520Board%2520OR%2520professor%2520OR%2520intern%2520OR%2520assistant%2520OR%2520junior%2520OR%2520JR%2520OR%2520office%2520OR%2520founder%2520OR%2520%2522co-founder%2522%2520OR%2520owner%2520OR%2520extern%2520OR%2520graduate%2520OR%2520undergrad%2520OR%2520contractor%2520OR%2520%2522Chief%2520Executive%2522%2520OR%2520Associate%2520OR%2520advisor%2520OR%2520entry%2520OR%2520journalist%2520OR%2520writer%2520OR%2520secretary%2520OR%2520trainee%2520OR%2520volunteer%2520OR%2520volunteering%252
0OR%2520aide%2520OR%2520apprentice%2520OR%2520recruit%2520OR%2520novice%2520OR%2520beginner%2520OR%2520adviser%2520OR%2520Postgrad%2520OR%2520author%2520OR%2520freshman%2520OR%2520novice%2520OR%2520Undergraduate%2520OR%2520postgraduate%2520OR%2520Sales%2520OR%2520coed%2520OR%2520cofounder%2520OR%2520council%2520OR%2520partner%2520OR%2520entry%2520OR%2520expert%2520OR%2520supervisor%2520OR%2520Learning%2520OR%2520Project%2520OR%2520Tutor%2520OR%2520Support%2520OR%2520CEO%2520OR%2520analyst%2520OR%2520Agent%2520OR%2520Rep%2520OR%2520Representative%2520OR%2520Manager%2520OR%2520Lead%2520OR%2520HR%2520OR%2520Human%2520OR%2520Talent%2520OR%2520Recruiter%2520OR%2520Recruiting%2520OR%2520Audit%2520OR%2520Research%2520OR%2520Researcher%2520OR%2520Project%2520OR%2520Account%2520OR%2520Sales%2520OR%2520Presales%2520OR%2520Audit%2520OR%2520Analyst%2520OR%2520Engineer%2520OR%2520Developer%2520OR%2520Regional%2520OR%2520Area%2520OR%2520Local%2520OR%2520account%2520OR%2520Sales%2520OR%2520Specialist%2520OR%2520Modelling%2520OR%2520Investigator%2520OR%2520Instruction%2520OR%2520Instructional%2520OR%2520Products%2520OR%2520Solutions)&titleTimeScope=CURRENT

Basically, we have a lot of duplicates in these links, and I want to reduce them to optimize our search and targeting and, of course, to free up space for other filters (space is limited).

So, I’ve created a new workspace in Airtable for this task.
I want to use a script/method/app to spot all the duplicates in the above chain, remove them, and write the output to a different column.

So the script needs to flag and remove these duplicates automatically, and reconstruct a clean chain.

Perhaps the way to do this is to break the link into Cells and regroup it into a chain.
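
One way to "break the link into cells" is to split the URL fragment into its parameters first. What follows is only a minimal sketch, assuming the fragment is an ordinary `key=value&key=value` query string and that some values were percent-encoded twice (the `%2520` sequences in `titleIncluded` above suggest this). The helper names `decodeFully` and `splitSearchParams` are made up for illustration.

```javascript
// Decode repeatedly until the text stops changing (%2520 -> %20 -> " ").
// Assumption: values are encoded at most a few times, hence the small cap.
function decodeFully(value) {
  let current = value;
  for (let i = 0; i < 3; i++) {
    let next;
    try {
      next = decodeURIComponent(current);
    } catch (e) {
      break; // malformed escape sequence: keep what we have so far
    }
    if (next === current) break;
    current = next;
  }
  return current;
}

// Turn "a=1&b=2" into { a: "1", b: "2" } with fully decoded values.
function splitSearchParams(fragment) {
  const fields = {};
  for (const pair of fragment.split("&")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // skip pieces without a value
    fields[pair.slice(0, eq)] = decodeFully(pair.slice(eq + 1));
  }
  return fields;
}
```

Each decoded value (for example `fields.titleIncluded`) is then a readable boolean chain that can be cleaned on its own and encoded again afterwards.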

The chain follows these rules:

  1. " OR " (space, OR, space) separates every keyword.
  2. Quotes " " combine two or more keywords into one term. For example, “Thank you” will go into a cell of its own.
  3. Parentheses () group terms for AND/OR conditions.
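
The three rules above map fairly directly onto a tokenizer. This is a rough sketch rather than a definitive parser: it assumes the operators are always written in upper case, and it accepts both straight and curly quotes because the forum shows curly ones. The function name `tokenize` and the token shape `{ type, text }` are hypothetical.

```javascript
// Split a boolean chain into parentheses, operators, quoted phrases and single words.
function tokenize(chain) {
  // Order matters: a parenthesis, then a quoted phrase, then any other run of characters.
  const pattern = /[()]|["“][^"”]*["”]|[^\s()"“”]+/g;
  return (chain.match(pattern) || []).map((text) => {
    if (text === "(" || text === ")") return { type: "paren", text };
    if (text === "OR" || text === "AND" || text === "NOT") return { type: "operator", text };
    return { type: "term", text }; // a keyword or a whole quoted phrase
  });
}
```

Each `term` token corresponds to one "cell" in the sense described above, and a quoted phrase such as “Thank you” stays in a single token.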

Any idea to solve this problem would be very helpful. Thanks for your attention!

5 Replies

Hi @Mr_Kav, and welcome to the community!

Please point out an example of a “duplicate” in the above example. Also, please define a “clean chain”.

@Bill.French Thanks for your response!

I’ve pasted only a partial chain so the post will be user friendly.
Here’s the full chain:

(Lead OR Architect OR Director OR SVP OR EVP OR Vice OR Chief OR VP OR Head) AND Data AND (Engineering OR “Application Development” OR “Software Development”) AND NOT (Expert OR Technician OR Support OR Maintenance OR QA OR quality OR Customer OR Specialist OR student OR owner OR media OR founder OR Consultant OR Adviser OR Consulting OR Consultancy OR Advisory OR Board OR professor OR intern OR assistant OR junior OR JR OR office OR ceo OR “co-founder” OR graduate OR undergrad OR contractor OR “Chief Executive” OR advisor OR entry OR journalist OR writer OR secretary OR trainee OR recruit OR novice OR adviser OR author OR freshman OR Undergraduate OR postgraduate OR Sales OR coed OR cofounder OR council OR partner OR advisor OR Project OR Projects OR District OR Regional OR HR OR Human OR Analyst OR Freelance OR Freelancer OR Editor OR Associate OR Supervisor OR Analyst OR Prep OR Account OR Accounting OR Accounts OR Crime OR solutions OR Services OR Solution OR “Data Center” OR Datacenter OR Azure OR GCP OR Science OR Azure OR portfolio OR field OR Mechanical OR Electric OR Electronic OR Controls OR Structure OR clinical)

You can see that Analyst and Azure each appear twice in this chain.

“Clean chain” means a new cell with the same chain minus the duplicates.

Thanks so much for your help!

Yep - I see that, and isn’t it possible (in some “chains”) that such duplications are required depending upon the entire context of the boolean query? If so, how will such streamlining logic be able to understand the intent of the query-writer?

If there are indeed cases where fields need to be used multiple times to construct the intended query, whatever algorithm you create to optimize the length of the query must deconstruct the entire query to be sure it doesn’t thwart the original intent.
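
One conservative way to respect that context is to treat a term as a duplicate only when it repeats inside the same parenthesized group. The sketch below is an illustration under several assumptions, not a full query parser: terms inside a group are joined by OR, operators are upper case, and matching is case-insensitive (so `ceo` and `CEO` count as the same term). The name `dedupeChain` is hypothetical.

```javascript
// Remove repeated terms, but only when the repeat sits in the same parenthesized group.
function dedupeChain(chain) {
  const tokens = chain.match(/[()]|["“][^"”]*["”]|[^\s()"“”]+/g) || [];
  const isOperator = (t) => t === "OR" || t === "AND" || t === "NOT";
  const seenPerGroup = [new Set()]; // one "seen" set per open group, top level first
  const kept = [];
  for (const tok of tokens) {
    if (tok === "(") {
      seenPerGroup.push(new Set());
    } else if (tok === ")") {
      if (seenPerGroup.length > 1) seenPerGroup.pop();
    } else if (!isOperator(tok)) {
      const seen = seenPerGroup[seenPerGroup.length - 1];
      const key = tok.toLowerCase().replace(/[“”]/g, '"');
      // Only drop a repeat that was introduced by OR; anything else is left alone.
      if (seen.has(key) && kept[kept.length - 1] === "OR") {
        kept.pop(); // remove the OR that introduced the repeat
        continue;   // and skip the repeat itself
      }
      seen.add(key);
    }
    kept.push(tok);
  }
  // Re-join with single spaces, keeping parentheses tight against their contents.
  return kept.reduce(
    (text, tok) => (tok === ")" || text === "" || text.endsWith("(") ? text + tok : text + " " + tok),
    ""
  );
}
```

With this approach a term such as `Risk` that appears once in each of two AND-ed groups is kept in both, while a second `Azure` inside the same NOT group is dropped.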

Certainly, this example demonstrates cases where additional mentions of Azure serve no purpose and may be eliminated. And such optimization is certainly possible through a script transformation. However, I would be more interested in fixing this at the source. One must ask -

What tool created this sub-optimized query in the first place? And can it be modified to produce “clean chains”?

One Approach…

To do this with precision probably requires a regular expression in JavaScript. This is supported in Script blocks and Action scripts. It might be possible in a formula, but that would be extremely difficult to achieve without knowing the repeating tokens in advance. Ideally, you want an algorithm that works when unanticipated [duplicate] patterns emerge.

@Bill.French, Thanks again for your attention

Actually, there isn’t a case in which I’ll need the same expression to appear twice. This query is built to include/exclude titles.
We can have similar words like “Engineer” and “Engineering” or “Project” and “Projects”, but nothing identical.
The reason we have duplicates in the first place is that too many users have access to this query, and each of them adds/removes titles. That’s why we ended up with a lot of expressions and duplicates.

The current situation requires us to go over the whole chain, spot the duplicates, and remove them.
So I looked for a solution that could filter out these duplicates and print the new output “ready to use”.

Looking at the bigger picture, this is part of the new base we’re building right now. I want to create a campaign builder, so our users fill in the relevant fields and get the final query. Their next step will be to copy and paste it into the website to start the campaign search.

I’ve looked into your solution and tried to implement it using this app, but it doesn’t recognize JS.
Also, I need to figure out how to exclude tokens like ), (, OR, and ". They appear multiple times, but they are syntax and can’t be removed.
Can you give an example using this script and the above chain?

One last thing: if this is too complicated, I thought maybe I could use VLOOKUP and print only the duplicates so I’ll know what to remove. Honestly, I prefer the first method so it’ll be “plug & play” without the need to edit it, but if there’s no solution, I’ll take that.
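
For the fallback idea of only printing the duplicates, a small counting function may be enough. This is a sketch under the same assumptions as the rules listed earlier in the thread: OR, AND, NOT and parentheses are syntax and are never counted, a quoted phrase is one term, and comparison ignores case. The name `listRepeatedTerms` is hypothetical.

```javascript
// Report every term that occurs more than once, with its count.
function listRepeatedTerms(chain) {
  // Parentheses are excluded by the pattern; operators are skipped below.
  const tokens = chain.match(/["“][^"”]*["”]|[^\s()"“”]+/g) || [];
  const counts = new Map();
  for (const tok of tokens) {
    if (tok === "OR" || tok === "AND" || tok === "NOT") continue;
    const key = tok.toLowerCase().replace(/[“”]/g, '"');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts]
    .filter(([, n]) => n > 1)
    .map(([term, n]) => `${term} (x${n})`);
}
```

Note that this counts across the whole chain, so a term that legitimately appears once in two different groups is reported too; the list is meant for manual review, not automatic removal.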

Thanks so much for your time, I really appreciate it.

Sure. Imagine a string with these duplicate queries:

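let queryString = "... OR Analyst OR Freelance OR Freelancer OR Editor OR Associate OR Supervisor OR Analyst OR Prep ...";

The way to remove all the dupes would go something like this:

let seen = new Set(); // terms already encountered, lowercased
let keep = (t) => /^(OR|AND|NOT)$/.test(t) || (!seen.has(t.toLowerCase()) && seen.add(t.toLowerCase()));
let newQuery = queryString.replace(/["“][^"”]+["”]|[\w-]+/g, (t) => (keep(t) ? t : "")).replace(/(?:\s+OR)+\s+OR /g, " OR ");

The first replace() blanks out the second, third, n… occurrence of each term; a quoted phrase counts as one term, and OR, AND and NOT are never touched. The second replace() closes the gaps left between the ORs. Note that this treats the whole string as one list, so it suits a single OR group like the one above.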
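
Inside an Airtable script, the chain still has to be read from one field and the cleaned result written to another column. The following is a hedged sketch of that wiring using the scripting calls `selectRecordsAsync`, `getCellValueAsString` and `updateRecordAsync`. The function name `writeCleanChains`, the table name "Searches" and the field names "Chain" and "Clean chain" are placeholders, and `cleanFn` stands for whatever string-cleaning function you settle on.

```javascript
// Read each record's chain, clean it, and write the result to another field.
// `table` is an Airtable scripting Table; `cleanFn` is any string -> string cleaner.
async function writeCleanChains(table, sourceField, targetField, cleanFn) {
  const query = await table.selectRecordsAsync({ fields: [sourceField] });
  let updated = 0;
  for (const record of query.records) {
    const chain = record.getCellValueAsString(sourceField);
    if (!chain) continue; // skip empty cells
    await table.updateRecordAsync(record, { [targetField]: cleanFn(chain) });
    updated += 1;
  }
  return updated; // number of records written
}

// In an Airtable script this might be called as (all names are placeholders):
// await writeCleanChains(base.getTable("Searches"), "Chain", "Clean chain", myCleaner);
```

The target field is assumed to be a text field. Writing to a separate column leaves the original chain untouched, so the two can be compared before the cleaned version is used.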

Without an understanding of the app and how it crafts these suboptimal queries, I can’t really advise.