Airtable should buy or implement SimpleScraper's functionality

Topic Labels: Extensions
chrismessina
6 - Interface Innovator

The SimpleScraper (simplescraper.io) Chrome extension looks super powerful and useful for sucking up data into Airtable!

37 Replies

Indeed it is. However, there are many issues with scraping data on the web, and one is the possibility that the domain doing the scraping will be blacklisted by the site being scraped. As such, don’t look for Airtable to implement this functionality on airtable.com anytime soon.

Another issue is the ToS of any given content site. If you violate it, your account may be suspended. Airtable should fear getting in between its users and their non-Airtable data.

Lastly, there’s the security context. Scraping data from secure sites is really risky and fraught with a variety of security issues that Airtable would need to divert resources to address. Are you sure you want to distract these folks from the core functionality of their solution?

Thanks for your perspective.

I don’t think that the issues you raised apply here, given that the extension is running in the user’s browser context, where it can access all the information I can access. Therefore, Airtable’s domain is unlikely to be blacklisted because the extension is accessing the web resource from the user’s IP address.

I’m unclear what ToS I might be violating if I copy content from a webpage, even if it’s restricted access. I can always save a webpage to my hard drive, or copy the text manually and put it into an Airtable base myself… what difference does it make if I use a tool for that purpose? An appeals court has already ruled that scraping publicly accessible websites doesn’t violate anti-hacking law (see arstechnica.com/tech-policy/2019/09/web-scraping-doesnt-violate-anti-hacking-law-appeals-court-rules/).

The Airtable extension would simply be automating the copying and pasting of content that I can already see, so I don’t see how there are any additional security concerns here. The data is going to get into my base one way or another. Of course, it would be best if the scraper ran locally and then uploaded the data directly to Airtable; presuming that, I’m not immediately seeing how your concerns apply.

Perhaps to clarify my point: the SimpleScraper functionality should run on the client side without requiring any requests from Airtable to the page being scraped.
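For illustration, here’s a rough sketch of what purely client-side extraction inside a browser extension content script can look like. This is not SimpleScraper’s actual code; the CSS selectors and field names are made up.

```javascript
// Minimal sketch of client-side extraction, as a Chrome extension content
// script could do it. NOT SimpleScraper's actual code; the selectors and
// field names below are hypothetical examples.
function scrapeVisiblePage() {
  // Runs in the user's own browser tab, so requests originate from the
  // user's IP and session -- no Airtable server ever touches the page.
  const cards = document.querySelectorAll('.product-card'); // hypothetical selector
  return Array.from(cards).map((card) => ({
    name: card.querySelector('.title')?.textContent?.trim() ?? '',
    price: card.querySelector('.price')?.textContent?.trim() ?? '',
    url: card.querySelector('a')?.href ?? '',
  }));
}

// The extension could then hand this array to Airtable (e.g. via the
// Web Clipper or the REST API) without any server-side scraping.
console.log(scrapeVisiblePage());
```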

This may indeed be the case today. You certainly know more about it than I do. However, are you certain that there would be no advantage to running some aspect of this technology on servers if, per your hypothetical recommendation, SimpleScraper were acquired by Airtable? What would be the point of buying a scraping technology if you didn’t intend to integrate it in some way that creates competitive advantage?

But even so, how can you be so certain that the Chrome extension you are using doesn’t interact with a server that performs scraping activity on your behalf?


To me, what “scrape in the cloud” means is far different from “scraping the cloud”. It suggests that there are processes that may indeed be configured and managed in your browser, but which are actually performed by proxy services elsewhere.

If you can schedule a scrape and close your browser, how do you think that works? Surely, it cannot run [solely] in your browser.

I’ve worked for a few companies that provide a variety of data harvesting technologies (import.io) and every one of them utilized servers to do the heavy lifting, because browsers are (a) inefficient, (b) unable to pace themselves, (c) unable to spread requests across multiple IPs, user-agents, and domains, and (d) miserable tools where same-origin security policies come into play.
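To illustrate the kind of thing a browser tab simply can’t do, here’s a toy sketch of server-side pacing and identity rotation. The proxy addresses, user-agent strings, and pacing values are all placeholders, and the actual proxied request is left as a comment since it depends on the HTTP library used.

```javascript
// Illustrative sketch only: how a server-side scraper can pace itself and
// rotate identities per request -- things a browser cannot do on its own.
// The proxy list and user-agent strings are placeholders.
const proxies = ['http://proxy-a.example:8080', 'http://proxy-b.example:8080'];
const userAgents = ['Mozilla/5.0 (Windows NT 10.0)', 'Mozilla/5.0 (Macintosh)'];

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawl(urls) {
  for (const [i, url] of urls.entries()) {
    // Rotate proxy and user-agent on every request.
    const proxy = proxies[i % proxies.length];
    const ua = userAgents[i % userAgents.length];
    console.log(`fetching ${url} via ${proxy} as "${ua}"`);
    // ...issue the request through the chosen proxy here (library-specific)...
    await sleep(2000 + Math.random() * 3000); // pace: 2-5 seconds between requests
  }
}
```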

I think scraping is a fine activity and it’s wonderful to utilize these tools to acquire data. My comment was simply to point out that …

  • Airtable is not likely to get into the data scraping business;
  • There are lots of potential issues with scraping (technical and otherwise);
  • Airtable has some pretty big competitive gaps in their solution; scraping is not going to help to close those gaps.

You’re right; “scrape in the cloud” does imply “closing my browser” and having some server execute the scraping.

My interest is simply in making the Web Clipper more useful and powerful; right now it’s pretty barebones, and SimpleScraper has some great features that I’d like to see implemented in Web Clipper. Personally, I care less about the implementation and more about the value/functionality delivered. If it preserves privacy and avoids getting blocked along the way, then those characteristics seem necessary to provide that value/functionality over time.

The current Web Clipper is severely limited in its utility, so that’s my primary concern.

Indeed, the Web Clipper, like many aspects of Airtable, suffers from a very low operating ceiling. And despite the fact that I don’t really keep tabs on all things Airtable, I can quickly name a dozen critical features that are severely limited.

In my view, web clipping doesn’t even make the top twenty most critical things in the scope of clear-headed product management.

I don’t intend to lessen the importance of web clipping to you or any other user, but I do intend to call out the deep contrast between poorly-implemented “features” that fall into the nice-to-have category and deeply depended-upon infrastructure requirements that have been ignored for more than half a decade.

This is the difference between features that help users in a narrow scope of activity – versus – features that help people help themselves, which have a vastly broader and deeper-reaching impact on the ability of users to solve many data management challenges, the latter of which tend to be boundless.

Airtable does some things really, really well and they’ve captivated the attention and imagination of a large segment of underserved workers who need information management solutions that are both delightfully appealing and easy to use.

How did Airtable win these customers?

Not by making it easy to import data. :winking_face: To the contrary - Airtable’s data ingestion and importing capabilities are among the weakest in their segment. Certainly, this aspect of the product must be improved, but a seamless flow of data into their platform is not in their wheelhouse. As such, I must ask -

Will more users come to Airtable because it has a great web clipper?

Or …

Will vastly more users come to Airtable because it is a great data management platform – AND STAY – because they can effortlessly solve complex data management problems with advanced formula methods such as Split()?

Across this forum, I can’t recall a single instance where a user has announced their departure from Airtable because it couldn’t clip data from the web. Yet I see many such announcements from serious business users who cannot sustain their interest going forward because the product lacked the essential infrastructure that makes it possible to meet data manipulation objectives.

Mike_ss
5 - Automation Enthusiast

@chrismessina @Bill.French,

Hey guys,

Mike from Simple Scraper here. Just stumbled across this discussion.

To clarify how Simple Scraper works: you can either scrape locally (no servers involved) or in the cloud (servers required), with cloud scraping having more powerful features like scheduling, as you’ve mentioned.

While not totally familiar with Airtable, I did have a quick read through the docs and it looks like a custom Simple Scraper ‘block’ may provide some of the functionality that you’re looking for, Chris.

Is it possible to build and publish custom blocks, and if not, is there any other way to pass data to Airtable?

We’ve just launched a Google Sheets integration today that instantly copies data that you’ve scraped to a Google Sheet (see ‘Saving data automatically to Google Sheets’ here: https://simplescraper.io/guide). Airtable appears to lack an API that would support this but if there’s some equivalent method I’d gladly build an integration.

I think we’re all in agreement that moving data from A to B on the web should be as easy as possible so happy to invest time in anything that furthers that goal.

Cheers

Hi Mike, thanks for chiming in.

Yes, everyone loves data. :winking_face: The easier the better.

No. Airtable doesn’t expose this to the general user community or vendors. I recommend you contact Airtable though as a block would be one of the better approaches to providing a clean integration.

Only their REST API, which means you’d have to create a boatload of infrastructure (at your server) to manage authentication, etc. (I think). Anyone, of course, could use your API and Airtable’s API to glue scraped content to a table. Airtable does support Zapier and Integromat, so that might be a pathway if you support either of those platforms.
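As a rough sketch of that glue, pushing scraped rows into a table via the REST API could look something like the following. The base ID, table name, field names, and sample rows are all hypothetical; the API key would come from the user’s own account.

```javascript
// Rough sketch of gluing scraped rows to a table with Airtable's REST API.
// Base ID, table name, and field names are hypothetical placeholders.
const fetch = require('node-fetch'); // or the global fetch in newer Node versions

async function pushToAirtable(rows) {
  const url = 'https://api.airtable.com/v0/appXXXXXXXXXXXXXX/Scraped%20Data';
  // The REST API accepts at most 10 records per request, so send in batches.
  for (let i = 0; i < rows.length; i += 10) {
    const batch = rows.slice(i, i + 10).map((fields) => ({ fields }));
    const res = await fetch(url, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.AIRTABLE_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ records: batch }),
    });
    if (!res.ok) throw new Error(`Airtable API error: ${res.status}`);
  }
}

// e.g. pushToAirtable([{ Name: 'Example', Price: '$10', URL: 'https://example.com' }]);
```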

That’s great - a good feature to have.

Mike_ss
5 - Automation Enthusiast

Thanks for the info, Bill.

Found the relevant API docs and the JavaScript client, so it looks like it’s possible to cook up some kind of integration. Using something like Zapier is an option, but going direct is cleaner and simpler (for the user at least!).

To begin we can replicate what happens with the Google Sheets integration: any data scraped using Simple Scraper is instantly copied to a table. The new data can either replace or be appended to existing data.

The only limitation I can see is that it’s not possible to create tables via the API, so we’d need to create a table for the data in advance. This presents a little friction but perhaps we can make it work.
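As a hedged sketch of the “append” path using the official Airtable JavaScript client (table and field names are hypothetical; “replace” would additionally mean listing and deleting the existing records first, since the API has no truncate operation):

```javascript
// Sketch of appending scraped rows with the Airtable.js client.
// Table and field names are hypothetical; the table must already exist,
// since the API cannot create tables.
const Airtable = require('airtable');

const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
  .base('appXXXXXXXXXXXXXX');

async function appendScrapedRows(rows) {
  // create() also caps out at 10 records per call, so batch the writes.
  for (let i = 0; i < rows.length; i += 10) {
    await base('Scraped Data').create(
      rows.slice(i, i + 10).map((fields) => ({ fields }))
    );
  }
}

// e.g. appendScrapedRows([{ Name: 'Example', Price: '$10' }]);
```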

@chrismessina, how does this sound as a starting point?

Yep. This is one of the many limitations of the current API.

I also recommend you get on the beta list to work with their forthcoming JavaScript implementation for building integrated Blocks.
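Purely as a speculative sketch, since the Blocks SDK is still in beta and its API may change, a custom block that writes scraped rows straight into a table the user already has might look roughly like this. The imports, method names, and table name are assumptions, not confirmed details of the SDK or of SimpleScraper.

```javascript
// Speculative sketch only -- the Blocks SDK is in beta, so treat the imports
// and method names as assumptions. The idea: a block runs inside the base
// itself, so scraped rows could be written directly into a chosen table.
import React from 'react';
import { initializeBlock, useBase, Button } from '@airtable/blocks/ui';

function SimpleScraperBlock() {
  const base = useBase();
  const table = base.getTableByNameIfExists('Scraped Data'); // hypothetical table

  async function importRows(rows) {
    // rows: [{ Name: '...', Price: '...' }, ...] coming from the scraper
    await table.createRecordsAsync(rows.map((fields) => ({ fields })));
  }

  return (
    <Button onClick={() => importRows([{ Name: 'Example' }])}>
      Import scraped data
    </Button>
  );
}

initializeBlock(() => <SimpleScraperBlock />);
```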