Sep 07, 2019 10:42 AM
Hi all. This is my first post, so let me know if my request makes sense and is directed towards the right group.
I have a list of URLs for companies that I'm tracking as part of a broader database, and I want to add a column showing the date (mm/dd/yyyy) when each site was last updated. Ideally it would reflect any change anywhere on the site, regardless of size or scope, since my primary objective is simply an additional data point on whether the company is still active.
My question is twofold:
Sep 08, 2019 03:01 AM
Hi.
To find out when a website was last updated, you can make an HTTP request.
The response headers (not the request headers) may include a field called "Last-Modified", though not every server sends it.
You could then compare that date with the existing date in Airtable and, if it is newer, update the field in Airtable.
That kind of integration can be done in Integromat. It has an HTTP request module where you can parse headers, and it also has a ready-made Airtable module.
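If you would rather script it than use Integromat, here is a minimal Python sketch of the same idea. The base ID, table name, and "Last Updated" field are hypothetical placeholders - substitute your own:

```python
import requests  # third-party: pip install requests
from email.utils import parsedate_to_datetime

AIRTABLE_API_KEY = "keyXXXXXXXXXXXXXX"  # hypothetical credentials
AIRTABLE_BASE = "appXXXXXXXXXXXXXX"     # hypothetical base ID
AIRTABLE_TABLE = "Companies"            # hypothetical table name

def site_last_modified(url):
    """Return the Last-Modified response header as a datetime, or None if absent."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    header = resp.headers.get("Last-Modified")
    return parsedate_to_datetime(header) if header else None

def update_airtable_date(record_id, new_date):
    """Write the date (mm/dd/yyyy) to a 'Last Updated' field via the Airtable REST API."""
    requests.patch(
        f"https://api.airtable.com/v0/{AIRTABLE_BASE}/{AIRTABLE_TABLE}/{record_id}",
        headers={"Authorization": f"Bearer {AIRTABLE_API_KEY}"},
        json={"fields": {"Last Updated": new_date.strftime("%m/%d/%Y")}},
        timeout=10,
    )
```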
If you need help with this kind of integration, let me know.
Kind regards.
Sep 08, 2019 06:46 PM
There are numerous challenges with this requirement. For example, to know whether anything has changed across the entire site, you must index the entire site, because the page at the top-level domain address may not be updated when deep-linked pages in the site are updated. So, unless you have a complete list of every page's URL in the domain, you may get some false negatives - i.e., the site was actively changing, but the home page was not modified at all.
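One partial mitigation - an assumption on my part, since not every site publishes one - is to enumerate the site's sitemap.xml to build that list of page URLs. A minimal Python sketch:

```python
import requests
import xml.etree.ElementTree as ET

def sitemap_urls(domain):
    """List the page URLs a site declares in its sitemap.xml, if it has one.

    Note: this ignores nested sitemap index files, which large sites often use.
    """
    xml_text = requests.get(f"{domain.rstrip('/')}/sitemap.xml", timeout=10).text
    root = ET.fromstring(xml_text)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

# e.g., sitemap_urls("https://example.com") -> every page the site chooses to declare
```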
An HTTP request that senses that a given page has been updated could also produce false positives, because many web sites are "touched" daily even though the content has not changed at all; they do this in an attempt to influence search engines. As such, your competitive intelligence system could be completely inaccurate simply because a content management system is aggressively touching the pages.
Lastly, web servers are likely to get in the way of your intelligence-gathering logic because many are set up to optimize responses and defend against certain types of crawling activities. This can give you misleading outcomes.
Computing Content Deltas
The only reasonably reliable way to determine content change over time is to compute the delta (i.e., difference) for each page in the site between two dates. This is a big task, similar to version control for content, and building it is non-trivial.
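To make that concrete, here is a minimal Python sketch of a per-page delta check. Hashing only the visible text (via BeautifulSoup) is my own choice here; it screens out the markup-only "touches" described above. You would store one fingerprint per URL and compare on each run:

```python
import hashlib
import requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def content_fingerprint(url):
    """Hash only the visible text, so markup-only 'touches' don't register as changes."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# On each run, compare against the fingerprint stored from the previous run:
# changed = content_fingerprint(url) != previously_stored_hash
```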
Google Alerts - build out your competitive intelligence solution by using alerts that target specific keywords, providing precise notifications about the terms that really matter.
Visual Ping - this platform is also very good at detecting change.
For each of these services, you can have the alerts forwarded to Zapier or Integromat and then on to Airtable.
Sep 14, 2019 05:28 PM
Bill, thanks so much for the comprehensive and informative response. Are you saying that Visual Ping is likely the closest purpose-built service for the task I described, or are there others as well?
Sep 14, 2019 06:58 PM
Sure - there are many in this class, and they all try to differentiate a little, but the core requirement is pretty simple. I regard them all as features looking for a product to live in. 😉
One thing I like about Visual Ping is that it uses screenshots to determine whether there has been a change. This is a unique approach because it is "visual" (as they advertise), as opposed to being swayed by data in the HTML, which can be misleading - e.g., no content changes but many underlying data changes.
You can still get spoofed, though - for example, when the color of a logo changes, that will [literally] be seen as a change even though it's probably uninteresting for your purposes.