Airtable should buy or implement SimpleScraper's functionality

Thanks for the info, Bill.

Found the relevant API docs and the JavaScript client, so it looks like it’s possible to cook up some kind of integration. Using something like Zapier is an option, but integrating directly is cleaner and simpler (for the user, at least!).

To begin, we can replicate what happens with the Google Sheets integration: any data scraped using Simplescraper is instantly copied to a table. The new data can either replace or be appended to existing data.
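For the append case, a minimal sketch in Python against Airtable's REST API could look like the following (the base ID, table name, and field names are placeholders, and the table has to exist already). One detail worth noting: the API accepts at most 10 records per create request, so scraped rows need batching.

```python
import json
import urllib.request

AIRTABLE_API = "https://api.airtable.com/v0"  # Airtable REST API root


def to_batches(rows, size=10):
    """Airtable's REST API accepts at most 10 records per create
    request, so scraped rows must be chunked before upload."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def append_rows(base_id, table, rows, api_key):
    """Append scraped rows (a list of field dicts) to an existing table.
    Tables can't be created via the API, so it must be set up first."""
    url = f"{AIRTABLE_API}/{base_id}/{table}"
    for batch in to_batches(rows):
        payload = json.dumps(
            {"records": [{"fields": r} for r in batch]}).encode()
        req = urllib.request.Request(
            url, data=payload, method="POST",
            headers={"Authorization": f"Bearer {api_key}",
                     "Content-Type": "application/json"})
        urllib.request.urlopen(req)  # raises on non-2xx responses
```

"Replace" mode would just be a delete of the existing records followed by the same append path.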

The only limitation I can see is that it’s not possible to create tables via the API, so we’d need to create a table for the data in advance. This presents a little friction but perhaps we can make it work.

@chrismessina, how does this sound as a starting point?


Yep. This is one of the many limitations of the current API.

I also recommend you get on the beta list to work with their forthcoming JavaScript implementation for building integrated Blocks.


Awesome to have your perspective here, @Mike_Simplescraper!

Zapier would be the obvious and probably fastest path for integration, since it’s a known vendor with a bunch of patterns/infra in place, but I can’t speak on behalf of Airtable of course.

In Airtable land, I kind of think setting up a table in advance is a normal task, so not too much friction.

To give you a sense of my use case: I have a huge base of articles in which I’ve been mentioned or have given interviews, and I’d love to grab the metadata and content from them using something like your SimpleScraper. Given a list of URLs, SimpleScraper would then go and either update or populate my base. Of course, most publishers use different HTML formatting, so it might take some work to generate the scrapers, but if those could be shared with the SimpleScraper community, that would greatly lighten the burden.


This is why scraping in general is so brittle. Yes, it will take some effort to get the scrapers working, but the maintenance will require vastly more effort, because you will have to do it many times. This constant arms race will eventually push you toward a less brittle approach, unless you enjoy repeatedly discovering that your scraping processes are failing and debugging them yet again.

Ideally, you want to build content parsers that do not depend on tokenizing text. Instead, it’s better to employ NLP to extract entities using proven AI models. This abstracts the discovery of data away from specific content formats and makes it possible to harvest data from websites you never planned or scripted for.
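As a toy illustration of that idea, here is format-agnostic extraction in Python, with regex patterns standing in for a real NER model (the patterns and the sample text are purely illustrative): the fields are recognized by what they *are*, not by where they sit in the markup.

```python
import re

# Recognize the entities themselves wherever they appear in the text,
# instead of tying each field to a CSS position on one specific site.
# A real system would use a trained NER model; these regexes are a
# stand-in to show the shape of the approach.
PATTERNS = {
    "price": re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d{2})?"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}


def extract_entities(text):
    """Return every recognized entity, keyed by entity type."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}


page = "Contact sales@example.com, now only $1,299.00 (was $1,499.00)."
extract_entities(page)
# finds the email plus both prices, regardless of surrounding markup
```

The same extractor runs unchanged against any page, which is exactly the property that positional CSS scraping lacks.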

Perhaps @Mike_Simplescraper is about to include an AI architecture in SimpleScraper to make these content scraping approaches obsolete. :wink:

I am interested in discussing your theory of mass scraping. Do you have a minute or many?

Hi Jack, and welcome to the community!

Are you referring to my comments, those from @chrismessina, or those from @Mike_Simplescraper?

I believe it was messina’s. Thanks for the response.

It depends, lol. I’m about to be traveling for two weeks, so prior to February, no — I have but few minutes.

For background, I spent a bunch of time working in the 2010s on various internet formats called microformats to mark up web pages to make it easier to extract data (concept: turn every webpage into a database entry by exposing the structure using well-known semantics via CSS classes).

When I talk about “mass scraping”, I’m imagining a shared repo like Userscripts dot org (which no longer exists) where users upload their scripts for popular sites, and then collaboratively maintain them. I don’t know if this is how a service like Mercury Reader adapts to all the pages it turns into readable content, but something like that.

In an ideal world, I’d go into an editor mode and just draw boxes around the content that I want to extract from a page, and it would either figure out how to reference that content for me and then generalize to other pages on the site, or somehow help me with the extraction process.

Thank you for taking the time for this information. As I said, I am green to the internet (been in prison for 40 years), but I am halfway computer literate. I will try to digest what you told me and hope to hear from you when you have the time. Thanks again.

We all are halfway literate and especially so about the web because it is always changing.

This is the promise of tools like these, which offer a free tier to gain some experience. They use some of the AI concepts I maintain are critical for building reliable and sustainable data extraction processes.

This was a common approach during the past two decades and is quite effective, even today. However, the number of new content-rendering techniques and the growing number of abstraction possibilities have diminished the usefulness of CSS markers as an extraction method. These new layers of abstraction have inspired data seekers to look to AI as the likely pathway for transforming unstructured information into data.

It’s easy to say “AI”; it is much harder to put it into practical use. But the effort to make this leap pays huge dividends. Imagine a single extraction model that pulls real estate listing data from 10 prominent web platforms, without ever needing modification when any one of those sites changes how it renders specific data values. I know, it sounds too good to be true, right? But it works.


Thanks for explaining the use case some more, @chrismessina.

Similar to your idea of a shared repo, we have the concept of a “recipe store” that contains a collection of scraping configurations for various websites. As these are valuable to everybody, it’s not a problem setting them up.

I’ll dig into the Airtable API over the next couple of weeks and see what can be achieved.

@Bill.French - If the results could live up to the promise, I’d ship an AI-powered Simplescraper tomorrow! Outside of some heuristics for scraping articles, this still appears to be an unsolved problem.

At the end of the day, updating a CSS selector is trivial. The question then becomes “how do we detect and fix invalid selectors as quickly as possible?”, which is a much easier problem to solve.
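A minimal sketch of that detection step: run each recipe against a freshly fetched page and flag any field whose extractor comes back empty, so a broken selector is caught on the next run instead of silently emitting blanks. (The recipe structure and the toy extractors here are illustrative, not Simplescraper's actual format.)

```python
def check_recipe(recipe, html):
    """recipe maps field names to extractor callables; any extractor
    that returns an empty result marks its selector as suspect."""
    return [field for field, extract in recipe.items() if not extract(html)]


def between(html, start, end):
    """Toy extractor standing in for a CSS-selector lookup."""
    head, sep, tail = html.partition(start)
    return tail.partition(end)[0] if sep else ""


recipe = {
    "title": lambda h: between(h, "<h1>", "</h1>"),
    "price": lambda h: between(h, '<span class="price">', "</span>"),
}

old_page = '<h1>Widget</h1><span class="price">$9</span>'
new_page = '<h1>Widget</h1><div class="cost">$9</div>'  # site redesign

check_recipe(recipe, old_page)  # [] : all selectors healthy
check_recipe(recipe, new_page)  # ["price"] : alert and fix this selector
```

Wired to a scheduled run, the non-empty/empty signal is enough to notify whoever maintains the recipe before bad rows land in a table.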

Will be in touch with updates soon.


Indeed, this is unsolved only because no one has invested in developing models that can accurately extract data. In commoditized service offerings, vendors tend to sustain the status quo because the margins keep getting slimmer, which is a sure indicator that disruption is likely on the near horizon.

The fact that using AI for acquiring information is still unsolved is the business opportunity: the first company that uses AI to scrape successfully will disrupt all other companies in the web-to-data segment, because it will have changed the game by eliminating the brittleness, the need to craft so many scraping approaches, and the maintenance issues. Plus, it’ll expand the possibilities by adding data scraping from images and PDFs, formats often used to thwart scraping.

Even the BoilerPipe (open-source) project has demonstrated significant improvement in scraping by peeling away shallow text artefacts using a complex algorithm. Subtle uses of AI will soon give way to pervasive use.

I see this not as an unsolved problem; rather it’s simply an unsolved solution.

It would be great to see a general-purpose data extraction tool, and many companies are trying.

I think what may inhibit progress in this area is the possibility of ‘only’ achieving 90% reliability, which, while an amazing feat, is effectively no different from the brittleness inherent in today’s scraping tools.

Relevant xkcd ; )

It’s certainly worth trying though as it would be game changing.

Perhaps true if you set aside benefits like these:

  1. Never configuring a scraper to get the data in the first place;
  2. Never having a scraper break;
  3. Never needing to monitor if a scraper is performing;
  4. Never modifying a scraper after it has fallen over;
  5. Never getting egg on your face when a scraper fails;
  6. Never spending any time analyzing the CSS patterns.

I believe that if you eliminate all of these activities and achieve 80% accuracy, the cost-benefit ratio tips well into your advantage column. You then have a chance to use additional pattern detection to remove the remaining 20% inaccurate data.

As such, even if you only eliminate 50% of the effort spent on these six activities, it is very different from the brittleness inherent in today’s scraping tools.

If you have metrics concerning these six classes of profit-robbing activity, I have a hunch the math would demonstrate good reason to invest in a different technical approach.

Bear in mind, I’m certainly not a scraping expert; I just spend a lot of time revamping process automation and quite often, there are vast activities that can be eliminated while changing the underlying infrastructure.

@chrismessina - Hey Chris, a quick update on this:

We’ve been working on a custom block that allows you to easily import data into Airtable using Simplescraper. Here’s a preview of it in action:

In the demo we import data from Stack Overflow, but the source could be any website that you choose; simply use the dropdown to change recipes.

Let me know if this is similar to what you had in mind.

Airtable’s custom blocks are still in preview, so there’s no ETA, but I wanted to keep you posted and hear any suggestions that you may have.

Peace, Mike


Ooo… looks promising! So — how will this apply to arbitrary URLs?

Here’s a use case: I post a lot of stuff to Product Hunt. A lot of the apps I post are in Apple’s App Store. I’d like to collect the app’s:

  • icon
  • the gallery of images, not just at the 300px size, but the 960px size
  • description
  • website

How could I use your block to do this?


This looks amazing! Any improvement with this block/script project?

I would be interested in using it for product imports on a website where I cannot use an API.

Mike from Simplescraper, any update on your block? This seems like a great idea.


Hello! Is the project dead?

Looks like it was put on ice for new users, so no new signups for free-tier accounts, just the sales pitch available :wink: so I too am hoping this project makes some progress. Any news @Mike_Simplescraper?

The one issue with Simplescraper is that it doesn’t look for text in the page based on some logic; rather, it looks at the structure of the page. For example, there can be a page 1 with text as follows:

AAA: XXXX
BBB: YYYY
CCC: ZZZZ

where AAA, BBB, CCC are my field titles.

Page 1 is crawled correctly.

The issue arises when some elements of the page are missing.
If there is another page 2 (same site, essentially) with the same structure, but the element AAA: XXXX is missing (not listed on the page), then because of the way Simplescraper works, when it crawls the page the results get shifted:

CCC will be empty, because its data got shifted up into AAA and BBB.

Simplescraper should look at the page structure but also look for similar blocks of text. If I create a recipe where AAA, BBB, and CCC are present, and AAA is missing on some page, then it should still be able to fill in BBB and CCC correctly and not move things up.
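One way to sketch that behaviour in Python is label-keyed extraction: parse each "Label: value" row into a dict keyed by the label, so a missing row leaves its field blank instead of shifting every later value up one slot. (The field names and values below are illustrative.)

```python
def parse_labeled_rows(lines, wanted):
    """Key each 'Label: value' row by its label; any field not present
    on the page stays empty rather than stealing the next row's value."""
    record = {field: "" for field in wanted}
    for line in lines:
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label in wanted:
            record[label] = value.strip()
    return record


wanted = ["Product Dimensions", "Item Weight", "ASIN"]

page1 = ["Product Dimensions: 10 x 5 x 2 inches",
         "Item Weight: 1.2 pounds",
         "ASIN: B000000000"]
page2 = ["Item Weight: 1.2 pounds",  # dimensions row missing
         "ASIN: B000000000"]

parse_labeled_rows(page1, wanted)["ASIN"]  # "B000000000"
parse_labeled_rows(page2, wanted)["ASIN"]  # still "B000000000", not shifted
```

Because the dict is keyed by label rather than by position, page 2's missing "Product Dimensions" row simply leaves that field empty.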

The specific page is an Amazon product page.

The specific issue is caused by the product details element for the “Discontinued by manufacturer” property. I am not interested in this one; I just want to start the scrape at Product Dimensions.

  • Is Discontinued By Manufacturer : No

I can solve this by having 2 different recipes.

I know this is not Simplescraper support forum, but since I am doing it into Airtable I wanted to mention in case @Mike_Simplescraper happens to be around reading it.

Other than that, it works well. Is a native Airtable integration even necessary, though? What would be the benefit of Airtable buying it? I can use Zapier or Integromat to process the results.

Anyhow, earlier there was discussion of local vs. cloud scraping. From the Simplescraper FAQ:

What’s the difference between local scraping and cloud scraping?

Using the extension to select and download data is local scraping. It’s simple and free. If you scrape the same pages often, need to scrape multiple pages, or want to turn a website into an API, you can create a scrape recipe that runs in the cloud. Cloud scraping has advantages like speed, page navigation, a history of scrape results, scheduling and the ability to run multiple recipes simultaneously.