It would be great to see a general purpose data extraction tool, and many companies are trying.
I think what may inhibit progress in this area is the possibility of ‘only’ achieving 90% reliability, which, while an amazing feat, is effectively no different from the brittleness inherent in today’s scraping tools.
Relevant xkcd ; )
It’s certainly worth trying though as it would be game changing.
Perhaps true if you set aside benefits like these:
Never configuring a scraper to get the data in the first place;
Never having a scraper break;
Never needing to monitor whether a scraper is still working;
Never modifying a scraper after it has fallen over;
Never getting egg on your face when a scraper fails;
Never spending any time analyzing the CSS patterns.
I believe that if you eliminate all of these activities and achieve 80% accuracy, the cost-benefit ratio tips well into your advantage column. You then have a chance to use additional pattern detection to remove the remaining 20% inaccurate data.
As such, even if you only eliminate 50% of the effort spent on these six activities, the result is very different from the brittleness inherent in today’s scraping tools.
If you have metrics concerning these six classes of profit-robbing activity, I have a hunch the math would demonstrate good reason to invest in a different technical approach.
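To make that hunch concrete, here’s a minimal back-of-the-envelope model. Every number in it (hours per activity, hourly rate, fleet size, cleanup effort) is a hypothetical placeholder, not measured data; the point is only the shape of the comparison:

```python
# Back-of-the-envelope cost comparison: hand-built scrapers vs. a
# general-purpose extraction tool. All figures below are assumptions.

hours_per_year = {            # effort per scraper per year (assumed)
    "configure": 8,
    "fix_breakage": 12,
    "monitor": 6,
    "modify_after_failure": 10,
    "incident_cleanup": 4,
    "css_analysis": 5,
}
hourly_rate = 100             # assumed fully loaded cost per hour
num_scrapers = 50             # assumed fleet size

scraper_cost = sum(hours_per_year.values()) * hourly_rate * num_scrapers

# Extraction tool: assume it eliminates only 50% of that effort, and
# that its 80% accuracy means some rows need pattern-based cleanup.
tool_cost = 0.5 * scraper_cost
cleanup_hours = 2 * num_scrapers      # assumed cleanup effort
tool_cost += cleanup_hours * hourly_rate

print(f"scrapers: ${scraper_cost:,.0f}  tool: ${tool_cost:,.0f}")
```

Even under the pessimistic assumption that only half the maintenance effort disappears, the tool comes out well ahead; plugging in your own metrics for the six activities is what would actually settle the question.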
Bear in mind, I’m certainly not a scraping expert; I just spend a lot of time revamping process automation, and quite often entire classes of activity can be eliminated by changing the underlying infrastructure.