
Re: Airtable should buy or implement SimpleScraper's functionality

chrismessina
6 - Interface Innovator

The SimpleScraper (go to simplescraper dot io) Chrome extension looks super powerful and useful for sucking up data into Airtable!

37 Replies

It would be great to see a general-purpose data extraction tool, and many companies are trying.

I think what may inhibit progress in this area is the possibility of ‘only’ achieving 90% reliability, which, while an amazing feat, is effectively no different from the brittleness inherent in today’s scraping tools.


Relevant xkcd ; )

It’s certainly worth trying though as it would be game changing.

Perhaps true if you set aside benefits like these:

  1. Never configuring a scraper to get the data in the first place;
  2. Never having a scraper break;
  3. Never needing to monitor if a scraper is performing;
  4. Never modifying a scraper after it has fallen over;
  5. Never getting egg on your face when a scraper fails;
  6. Never spending any time analyzing the CSS patterns.

I believe that if you eliminate all of these activities and achieve 80% accuracy, the cost-benefit ratio tips well into your advantage column. You then have a chance to use additional pattern detection to remove the remaining 20% inaccurate data.

As such, even if you only eliminate 50% of the effort spent on these six activities, it is very different from the brittleness inherent in today’s scraping tools.

If you have metrics concerning these six classes of profit-robbing activity, I have a hunch the math would demonstrate good reason to invest in a different technical approach.

Bear in mind, I’m certainly not a scraping expert; I just spend a lot of time revamping process automation and quite often, there are vast activities that can be eliminated while changing the underlying infrastructure.

Mike_ss
5 - Automation Enthusiast

@chrismessina - Hey Chris, a quick update on this:

We’ve been working on a custom block that allows you to easily import data into Airtable using Simplescraper. Here’s a preview of it in action:

In the demo we import data from Stack Overflow, but the source could be any website that you choose - simply use the dropdown to change recipes.

Let me know if this is similar to what you had in mind?

Airtable’s custom blocks are still in preview so no ETA but wanted to keep you posted and listen to any suggestions that you may have.

Peace, Mike

Ooo… looks promising! So — how will this apply to arbitrary URLs?

Here’s a use case: I post a lot of stuff to Product Hunt. A lot of the apps I post are in Apple’s App Store. I’d like to be able to collect the app’s:

  • icon
  • the gallery of images, not just at the 300px size, but the 960px size
  • description
  • website

How could I use your block to do this?

This looks amazing! Any progress on this block/script project?

I would be interested in using it to import products from a website where I cannot use an API.

Air_Table3
5 - Automation Enthusiast

Mike from Simplescraper, any update on your block? This seems like a great idea.

Hello! Is the project dead?

Looks like import.io was put on ice for new users, so there are no new signups for free-tier accounts, just the sales pitch ;) so I am also hoping this project makes some progress. Any news, @Mike_ss?

The one issue with Simplescraper is that it doesn’t look for text in the page based on some logic; it only looks at the structure of the page. For example, there can be a page 1 with text as follows:
AAA: XXXX
BBB: YYYYY
CCC: ZZZZZZ

Where AAA, BBB, CCC are my titles of fields.

Page 1 is crawled correctly.

The issue arises when some elements of the page are missing.
Suppose there is a page 2 (essentially the same site) with the same structure, but the element AAA: XXXX is missing (not listed on the page). Because of the way Simplescraper works, when it crawls that page it returns:

AAA: YYYYY
BBB: ZZZZZZ
CCC: (empty, because the data got shifted up into AAA and BBB)

Simplescraper should look at the page structure but also match on the label text. If I create a recipe with AAA, BBB, and CCC, and AAA is missing on a page, it should still fill in BBB and CCC correctly and not move things up.
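To make the difference concrete, here is a minimal Python sketch of the two approaches, using the AAA/BBB/CCC example above. This is not how Simplescraper is implemented; `parse_by_position` and `parse_by_label` are hypothetical helpers, and the input is assumed to be plain "Label: value" lines already extracted from the page.

```python
def parse_by_position(text, field_names):
    """Assign values to fields purely by order, ignoring the labels.

    This mimics the shifting problem: a missing line moves every
    later value up by one field.
    """
    values = [line.partition(":")[2].strip()
              for line in text.splitlines() if ":" in line]
    # Pad with None so trailing fields come out empty.
    values += [None] * (len(field_names) - len(values))
    return dict(zip(field_names, values))


def parse_by_label(text, field_names):
    """Assign values to fields by matching the label text on each line.

    A label that is absent from the page yields None, and the other
    fields keep their own values.
    """
    found = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # only lines that actually contain a colon
            found[label.strip(" \t•")] = value.strip()
    return {name: found.get(name) for name in field_names}


fields = ["AAA", "BBB", "CCC"]
page_1 = "AAA: XXXX\nBBB: YYYYY\nCCC: ZZZZZZ"
page_2 = "BBB: YYYYY\nCCC: ZZZZZZ"  # AAA is missing on this page

shifted = parse_by_position(page_2, fields)
aligned = parse_by_label(page_2, fields)
```

On page 2, `shifted` puts YYYYY under AAA and leaves CCC empty, while `aligned` leaves AAA empty and keeps BBB and CCC where they belong. Both give the same result on page 1.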

The specific page is an Amazon product page.

The specific issue is caused by the “Discontinued by manufacturer” property in the product details element. I am not interested in this one; I just want to start the scrape at Product Dimensions.

  • Is Discontinued By Manufacturer : No

I can solve this by having 2 different recipes.

I know this is not the Simplescraper support forum, but since I am sending the results into Airtable I wanted to mention it in case @Mike_ss happens to be around reading this.

Other than that it works well. Is an Airtable integration necessary, and what would be the benefit of Airtable buying it? I can use Zapier or Integromat to process the results.

Anyhow, earlier there was a discussion of local vs. cloud scraping. From the Simplescraper FAQ:

What’s the difference between local scraping and cloud scraping?

Using the extension to select and download data is local scraping. It’s simple and free. If you scrape the same pages often, need to scrape multiple pages, or want to turn a website into an API, you can create a scrape recipe that runs in the cloud. Cloud scraping has advantages like speed, page navigation, a history of scrape results, scheduling and the ability to run multiple recipes simultaneously.

Mike_ss
5 - Automation Enthusiast

@Air_Table3, @chrismessina, @Nicolas_Lapierre, @Bill.French, @itoldusoandso

Hey, please excuse the shortage of updates on this. I progressed further down the block (now App) path before realizing that a direct integration would be a smoother and more user-friendly solution.

So I’ve built the integration and it’s now live at Simplescraper.

Enter your Airtable info (API key, Base ID, table name) and then any website data that you scrape will instantly appear in Airtable. Here’s the tutorial and below is a video of scraping jobs from Indeed.com into Airtable with Simplescraper.
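For anyone curious what those three settings are used for, here is a rough Python sketch of pushing scraped rows into a table through Airtable's public REST API (the create-records endpoint, which accepts up to 10 records per request). It is not Simplescraper's actual code; `build_create_requests`, the base ID, the table name, and the field names are all made up for illustration, and the requests are only built, not sent.

```python
import json
import urllib.parse
import urllib.request

AIRTABLE_API_ROOT = "https://api.airtable.com/v0"
MAX_RECORDS_PER_REQUEST = 10  # Airtable's limit for one create call


def build_create_requests(api_key, base_id, table_name, rows):
    """Build one POST request per batch of up to 10 scraped rows.

    `rows` is a list of dicts mapping Airtable field names to values.
    """
    table_path = urllib.parse.quote(table_name, safe="")
    url = f"{AIRTABLE_API_ROOT}/{base_id}/{table_path}"
    requests = []
    for start in range(0, len(rows), MAX_RECORDS_PER_REQUEST):
        batch = rows[start:start + MAX_RECORDS_PER_REQUEST]
        body = json.dumps({"records": [{"fields": row} for row in batch]})
        requests.append(urllib.request.Request(
            url,
            data=body.encode("utf-8"),
            method="POST",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
        ))
    return requests


# Hypothetical scraped data: 12 job listings with made-up field names.
rows = [{"Title": f"Job {i}", "Company": "Example Co"} for i in range(12)]
reqs = build_create_requests("keyPLACEHOLDER", "appPLACEHOLDER", "Job Listings", rows)
```

Each request in `reqs` could then be sent with `urllib.request.urlopen(req)`. With 12 rows, the sketch produces two requests: one with 10 records and one with 2.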

Give it a try and hope it proves useful. Happy to answer questions and take suggestions.

Simplescraper Airtable integration preview

Yep, we will take a look, very helpful for other sites …
Unfortunately, it gets knocked out whenever the Amazon site changes its structure.
Because of the limitations of SimpleScraper I described earlier, I ended up going with the Airtable clipper tool (more manual, and it requires lots of fields).
Thanks for the update.