Dec 04, 2019 12:19 PM
The SimpleScraper (go to simplescraper dot io) Chrome extension looks super powerful and useful for sucking up data into Airtable!
Jan 20, 2020 09:26 AM
It would be great to see a general-purpose data extraction tool, and many companies are trying.
I think what may inhibit progress in this area is the possibility of ‘only’ achieving 90% reliability, which, while an amazing feat, is effectively no different from the brittleness inherent in today’s scraping tools.
Relevant xkcd ; )
It’s certainly worth trying though as it would be game changing.
Jan 20, 2020 09:40 AM
Perhaps true if you set aside benefits like these:
I believe that if you eliminate all of these activities and achieve 80% accuracy, the cost-benefit ratio tips well into your advantage column. You then have a chance to use additional pattern detection to remove the remaining 20% of inaccurate data.
As such, even if you only eliminate 50% of the effort spent on these six activities, it is very different from the brittleness inherent in today’s scraping tools.
If you have metrics concerning these six classes of profit-robbing activity, I have a hunch the math would demonstrate good reason to invest in a different technical approach.
Bear in mind, I’m certainly not a scraping expert; I just spend a lot of time revamping process automation and quite often, there are vast activities that can be eliminated while changing the underlying infrastructure.
Apr 09, 2020 12:33 PM
@chrismessina - Hey Chris, a quick update on this:
We’ve been working on a custom block that allows you to easily import data into Airtable using Simplescraper. Here’s a preview of it in action:
In the demo we import data from Stackoverflow but the source could be any website that you choose - simply use the dropdown to change recipes.
Let me know if this is similar to what you had in mind?
Airtable’s custom blocks are still in preview so no ETA but wanted to keep you posted and listen to any suggestions that you may have.
Peace, Mike
Apr 27, 2020 10:41 PM
Ooo… looks promising! So — how will this apply to arbitrary URLs?
Here’s a use case: I post a lot of stuff to Product Hunt. A lot of the apps I post are in Apple’s App Store. I’d like to be able to collect the app’s:
How could I use your block to do this?
Oct 20, 2020 07:21 AM
This looks amazing! Any improvement with this block/script project?
I would be interested in using it to import products from a website where I cannot use an API.
Nov 19, 2020 03:58 PM
Mike from Simplescraper, any update on your block? This seems like a great idea.
May 04, 2021 12:07 PM
Hello! Is the project dead?
May 14, 2021 11:53 AM
Looks like import.io has been put on ice for new users: no new signups for free-tier accounts, just the sales pitch ;) So I am also hoping this project makes some progress. Any news, @Mike_ss?
The one issue with Simplescraper is that it doesn’t look for text on the page based on its content; it only looks at the structure of the page. For example, page 1 might contain the following text:
AAA: XXXX
BBB: YYYYY
CCC: ZZZZZZ
where AAA, BBB, and CCC are my field titles.
Page 1 is crawled correctly.
The issue is if some of the elements of the page are missing.
Suppose there is a page 2 (essentially the same site) with the same structure, but the element AAA: XXXX is missing (not listed on the page). Because of the way Simplescraper works, when it crawls that page it produces these results:
AAA: YYYYY
BBB: ZZZZZZ
CCC: (empty, because the data got shifted up into AAA and BBB)
Simplescraper should look at the page structure but also match on the label text of each block. If I create a recipe that contains AAA, BBB, and CCC, and AAA is missing from a page, it should still fill in BBB and CCC correctly rather than moving things up.
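To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is only an illustration and not how Simplescraper is actually implemented: the function names `extract_by_position` and `extract_by_label` are hypothetical, and it assumes each scraped line has the form "LABEL: value".

```python
def extract_by_position(lines, field_names):
    """Positional mapping: the n-th scraped line fills the n-th field.

    This mimics the shifting problem: if a line is missing, every later
    value moves up into the wrong field.
    """
    values = [line.partition(":")[2].strip() for line in lines]
    # Pad with empty strings so trailing fields exist but stay empty.
    padded = values + [""] * (len(field_names) - len(values))
    return dict(zip(field_names, padded))


def extract_by_label(lines, field_names):
    """Label mapping: a value is stored only under the label it appeared with."""
    result = {name: "" for name in field_names}
    for line in lines:
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label in result:
            result[label] = value.strip()
    return result


FIELDS = ["AAA", "BBB", "CCC"]
page_1 = ["AAA: XXXX", "BBB: YYYYY", "CCC: ZZZZZZ"]
page_2 = ["BBB: YYYYY", "CCC: ZZZZZZ"]  # the AAA element is missing

print(extract_by_position(page_2, FIELDS))
# {'AAA': 'YYYYY', 'BBB': 'ZZZZZZ', 'CCC': ''}
print(extract_by_label(page_2, FIELDS))
# {'AAA': '', 'BBB': 'YYYYY', 'CCC': 'ZZZZZZ'}
```

On page 1 both approaches give the same result; they only diverge when an element is missing. Matching on the label also means unwanted rows are simply ignored because their label is not in the field list.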
The specific page is an Amazon product page.
The specific issue is caused by the “Discontinued by manufacturer” property in the product details element. I am not interested in that one; I just want to start the scrape at Product Dimensions.
I can solve it by having two different recipes.
I know this is not the Simplescraper support forum, but since I am scraping into Airtable I wanted to mention it in case @Mike_ss happens to be around reading this.
Other than that it works well. Is an integration with Airtable really necessary, and what would be the benefit of Airtable buying it? I can use Zapier or Integromat to process the results.
Anyhow, earlier there was a discussion of local vs. cloud scraping. From the Simplescraper FAQ:
Using the extension to select and download data is local scraping. It’s simple and free. If you scrape the same pages often, need to scrape multiple pages, or want to turn a website into an API, you can create a scrape recipe that runs in the cloud. Cloud scraping has advantages like speed, page navigation, a history of scrape results, scheduling and the ability to run multiple recipes simultaneously.
Jul 04, 2021 10:57 PM
@Air_Table3, @chrismessina, @Nicolas_Lapierre, @Bill.French, @itoldusoandso
Hey, please excuse the shortage of updates on this. I progressed further down the block (now App) path before realizing that a direct integration would be a smoother and more user-friendly solution.
So I’ve built the integration and it’s now live at Simplescraper.
Enter your Airtable info (API key, Base ID, table name) and then any website data that you scrape will instantly appear in Airtable. Here’s the tutorial and below is a video of scraping jobs from Indeed.com into Airtable with Simplescraper.
Give it a try and hope it proves useful. Happy to answer questions and take suggestions.
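For anyone curious what those three settings are used for, here is a rough Python sketch of how scraped rows could be pushed to Airtable through its public REST API. This is an assumption about the general shape of such a push, not a description of what Simplescraper does internally. The helper name `build_create_requests` and the example values (`keyXXXX`, `appXXXX`, "Job listings") are hypothetical placeholders. The sketch only builds the requests and does not send them.

```python
import json
import urllib.request
from urllib.parse import quote

AIRTABLE_API_ROOT = "https://api.airtable.com/v0"
MAX_RECORDS_PER_REQUEST = 10  # Airtable accepts at most 10 records per create call


def build_create_requests(api_key, base_id, table_name, rows):
    """Build one POST request per batch of up to 10 scraped rows.

    Each row is a dict mapping an Airtable field name to a value.
    Nothing is sent here; the caller decides when to send each request.
    """
    # Table names may contain spaces, so they are percent-encoded in the URL.
    url = f"{AIRTABLE_API_ROOT}/{base_id}/{quote(table_name, safe='')}"
    requests = []
    for start in range(0, len(rows), MAX_RECORDS_PER_REQUEST):
        batch = rows[start:start + MAX_RECORDS_PER_REQUEST]
        payload = {"records": [{"fields": row} for row in batch]}
        requests.append(
            urllib.request.Request(
                url,
                data=json.dumps(payload).encode("utf-8"),
                headers={
                    "Authorization": f"Bearer {api_key}",
                    "Content-Type": "application/json",
                },
                method="POST",
            )
        )
    return requests


# Hypothetical scraped rows and credentials, for illustration only.
rows = [{"Title": f"Job {i}"} for i in range(12)]
reqs = build_create_requests("keyXXXX", "appXXXX", "Job listings", rows)
print(len(reqs))         # 2 (one batch of 10 rows, one batch of 2)
print(reqs[0].full_url)  # https://api.airtable.com/v0/appXXXX/Job%20listings
```

To actually send a batch you would pass each request to `urllib.request.urlopen(req)` with real credentials, and the field names in each row have to match the column names in the target table.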
Jul 05, 2021 10:47 AM
Yep, we will take a look, very helpful for other sites …
Unfortunately, it gets knocked out whenever the Amazon site changes its structure.
Because of the SimpleScraper limitations I described earlier, I ended up going with the Airtable clipper tool (more manual, and it requires lots of fields).
Thanks for the update.