Re: Airtable should buy or implement SimpleScraper's functionality

chrismessina · ‎Dec 04, 2019

The SimpleScraper (go to simplescraper dot io) Chrome extension looks super powerful and useful for sucking up data into Airtable!

chrismessina · ‎Jan 08, 2020

Awesome to have your perspective here, @Mike_ss!

Zapier would be the obvious and probably fastest path for integration, since it’s a known vendor with a bunch of patterns/infra in place, but I can’t speak on behalf of Airtable of course.

In Airtable land, I kind of think setting up a table in advance is a normal task, so not too much friction.

To give you a sense for my use case… I have a huge base with articles in which I’ve been mentioned or have given interviews, and I’d love to grab the metadata and content from them using something like your SimpleScraper. Given a list of URLs, then, SimpleScraper would go and either update or populate my base. Of course, most publishers have different HTML formatting, so it might take some work to generate the scrapers, but if those could be shared w/ the SimpleScraper community, that would spread the burden.
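To make the "update or populate" step concrete, here is a minimal Python sketch. It assumes rows are plain dicts keyed by a `URL` field; `plan_upsert` and the field names are hypothetical, and no real Airtable or SimpleScraper API is called.

```python
def plan_upsert(existing, scraped, key="URL"):
    """Split scraped rows into updates (key already in the base) and creates (new key)."""
    by_key = {row[key]: row for row in existing}
    updates, creates = [], []
    for row in scraped:
        current = by_key.get(row[key])
        if current is None:
            creates.append(row)   # URL not in the base yet: populate
        elif any(current.get(field) != value for field, value in row.items()):
            updates.append(row)   # URL known, but some field changed: update
    return updates, creates


existing = [{"URL": "https://example.com/a", "Title": "Old title"}]
scraped = [
    {"URL": "https://example.com/a", "Title": "New title"},
    {"URL": "https://example.com/b", "Title": "Interview"},
]
updates, creates = plan_upsert(existing, scraped)
```

Rows whose scraped values already match the base are skipped, so re-running over the same URL list would only write what changed.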

Thoughts?

Bill_French · ‎Jan 09, 2020

This is why scraping in general is so brittle. Yes, it will take some effort to get them working, but the maintenance required is vastly more effort because you will have to do it many times. This constant arms race will eventually push you toward a less brittle approach, unless you enjoy constantly discovering that your scraping processes are failing and debugging them repeatedly.

Ideally, you want to build content parsers that do not depend on tokenizing text. Instead, it’s better to employ NLP to extract entities using proven AI models. This abstracts the discovery of data away from specific content formats and makes it possible to harvest data from web sites you never planned or scripted for.

Perhaps @Mike_ss is about to include an AI architecture in SimpleScraper to make these content scraping approaches obsolete. :winking_face:

Jack_Mazzan · ‎Jan 09, 2020

I am interested in discussing your theory of mass scraping. Do you have a minute or many?

Bill_French · ‎Jan 10, 2020

Hi Jack, and welcome to the community!

Are you referring to my comments, those from @chrismessina, or those from @Mike_ss?

Jack_Mazzan · ‎Jan 10, 2020

I believe it was messina’s. Thanks for the response.

chrismessina · ‎Jan 11, 2020

It depends, lol. I’m about to be traveling for two weeks, so prior to February, no — I have but few minutes.

For background, I spent a bunch of time working in the 2010s on various internet formats called microformats to mark up web pages to make it easier to extract data (concept: turn every webpage into a database entry by exposing the structure using well-known semantics via CSS classes).
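A small Python sketch of that idea, assuming a page marked up with the classic hCard class names (`vcard`, `fn`, `org`). The helper names and the sample HTML are made up for illustration, and the parser is deliberately minimal (well-formed HTML only).

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside elements whose class list contains a wanted name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []   # one entry per open element: matched class name or None
        self.found = {}   # class name -> accumulated text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self.stack.append(match)
        if match is not None:
            self.found.setdefault(match, "")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to every open element that matched a wanted class.
        for name in self.stack:
            if name is not None:
                self.found[name] += data


def extract_by_class(html, wanted):
    parser = ClassTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.found.items()}


SAMPLE = ('<div class="vcard"><span class="fn">Jane Doe</span>, '
          '<span class="org">Example Co</span></div>')
print(extract_by_class(SAMPLE, ["fn", "org"]))
# {'fn': 'Jane Doe', 'org': 'Example Co'}
```

Because the extractor keys on shared class names rather than one site's layout, the same call would work on any page that uses those names.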

When I talk about “mass scraping”, I’m imagining a shared repo like Userscripts dot org (which no longer exists) where users upload their scripts for popular sites, and then collaboratively maintain them. I don’t know if this is how a service like Mercury Reader adapts to all the pages it turns into readable content, but something like that.

In an ideal world, I’d go into an editor mode and just draw boxes around the content that I want to extract from a page, and it would either figure out how to reference that content for me and then generalize to other pages on the site, or somehow help me w/ the extraction process.

Jack_Mazzan · ‎Jan 11, 2020

Thank you for taking the time for this information. As I said, I am green to the internet (been in prison for 40 years), but I am halfway computer literate. I will try to digest what you told me and hope to hear from you when you have the time. Thanks again.

Bill_French · ‎Jan 12, 2020

We all are halfway literate and especially so about the web because it is always changing.

This is the promise of tools like import.io, which has a free tier you can use to gain some experience. They use some of the AI concepts I maintain are critical for building reliable and sustainable data extraction processes.

This was a common approach during the past two decades and quite effective, even today. However, the number of new content rendering techniques and the growing number of abstraction possibilities have diminished the use of CSS markers as an effective extraction method. These new layers of abstraction have inspired data seekers to look to AI as the likely pathway for transforming unstructured information into data.

It’s easy to say “AI”; much harder to put it into practical use. But, the effort to make this leap pays huge dividends. Imagine a single extraction model to pull real estate listing data from 10 prominent web platforms and all without ever modifying the extraction model when any one of the sites changes how they render specific data values. I know, it sounds too good to be true, right? But it works.

Mike_ss · ‎Jan 20, 2020

Thanks for explaining the use case some more, @chrismessina.

Similar to your idea of a shared repo, we have the concept of a “recipe store” that contains a collection of scraping configurations for various websites. As these are valuable to everybody, it’s not a problem setting these up.

I’ll dig into the Airtable API over the next couple of weeks and see what can be achieved.

@Bill.French - If the results could live up to the promise, I’d ship an AI-powered Simplescraper tomorrow! Outside of some heuristics for scraping articles, this still appears to be an unsolved problem.

At the end of the day, updating a CSS selector is trivial. The question then becomes “how do we detect and fix invalid selectors as quickly as possible?”, which is a much easier problem to solve.
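As a rough illustration of catching a stale selector early, here is a minimal Python sketch. It assumes a hypothetical recipe format (field name mapped to a simple `.class` selector) and only checks whether that class still appears on the page. `find_broken_fields`, `ClassCollector`, and the sample HTML are made up for illustration and are not Simplescraper's actual API.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record every class name that appears anywhere on the page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        self.classes.update((dict(attrs).get("class") or "").split())


def find_broken_fields(html, recipe):
    """Return the recipe fields whose class selector no longer matches anything."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return sorted(
        field for field, selector in recipe.items()
        if selector.lstrip(".") not in collector.classes
    )


# Hypothetical recipe: field name -> simple class selector.
RECIPE = {"title": ".headline", "author": ".byline"}

OLD_PAGE = '<h1 class="headline">Hi</h1><span class="byline">Jane</span>'
NEW_PAGE = '<h1 class="headline">Hi</h1><span class="author-name">Jane</span>'

print(find_broken_fields(OLD_PAGE, RECIPE))  # []
print(find_broken_fields(NEW_PAGE, RECIPE))  # ['author']
```

Run on a schedule against a known sample URL per recipe, a check like this would flag the broken field before users notice empty columns. Fixing the selector itself remains a manual step.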

Will be in touch with updates soon.

Bill_French · ‎Jan 20, 2020

Indeed, this is unsolved only because no one has invested in developing models that can accurately extract data. In commoditized service offerings, vendors tend to sustain the status quo because the margins are only getting slimmer, a sure indicator that disruption is likely on the near horizon.

The fact that using AI for acquiring information is still unsolved is the business opportunity - the first company that uses AI to scrape successfully will disrupt all other companies in the web-to-data segment because it will have changed the game by eliminating the brittleness, the need to craft so many scraping approaches, and the maintenance issues. Plus, it’ll expand the possibilities by adding data extraction from images and PDFs, formats that are often used to thwart scraping.

Even the BoilerPipe (open-source) project has demonstrated significant improvement in scraping by peeling away shallow text artefacts using a complex algorithm. Subtle uses of AI will soon give way to pervasive use.

I see this not as an unsolved problem; rather it’s simply an unsolved solution.