
I have been having quite a time getting this to work reliably for hundreds of thousands of terms and potentially millions of pages per source, and then ETL-ing the resulting data into a database in an automated fashion. I need to run the tasks in Mesos on a repeating schedule. The required languages are Scala/Java.

For acquisition, I need to parse JavaScript, render data loaded via AJAX, work with tracking cookies, etc., in order to scrape the sites. I've been working on an open source tool for this as well: an extremely simple API around Selenium with a serializable configuration for distribution. The tool is plug and play for any WebDriver.

However, the crawls constantly run into trouble: they always hang, despite being fairly well isolated and stripped down from one another (by specifying cache locations, minimizing the cache size, not downloading images, etc.).
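For illustration, the isolation settings look roughly like this (a simplified sketch; the cache path, cache size and flags shown are just example values):

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    // Simplified sketch of the per-process isolation settings described above.
    // The cache directory and size are placeholder values.
    public final class IsolatedChrome {
        public static WebDriver create(String cacheDir) {
            ChromeOptions options = new ChromeOptions();
            options.addArguments(
                "--disk-cache-dir=" + cacheDir,          // separate cache per worker
                "--disk-cache-size=10485760",            // keep the cache small (~10 MB)
                "--blink-settings=imagesEnabled=false"); // don't download images
            return new ChromeDriver(options);
        }
    }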

Errors range from PhantomJS returning a cleanup error and failing to continue, to a general hang in ChromeDriver, despite not running out of memory according to VisualVM. In fact, the highest memory use has been 25%, with CPU use at 50%, across 3-5 individual child processes.

Should I be running each term in a container? How can I make WebDriver reliable over a period of weeks or months? Is there an equally generic alternative?

gnat

2 Answers


This may not be the most satisfactory type of answer, but the fact is that web browsers are not built and tested with the expectation that they will be run continuously for weeks or months, while fetching hundreds of thousands of pages.

While the browser developers of course do their best to make their software work reliably during long browsing sessions, problems that only appear in extreme use cases are unlikely to get the highest level of attention.

Therefore, first of all, try to recycle browser processes on a regular basis. Restarting after every few thousand pages fetched might be a good starting point.
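A rough sketch of what such recycling could look like in Java (the page threshold and driver factory are placeholders, not a tested implementation):

    import java.util.function.Supplier;
    import org.openqa.selenium.WebDriver;

    // Hypothetical wrapper that tears down and re-creates the WebDriver after a
    // fixed number of page loads, so no single browser process lives long enough
    // to accumulate leaks or hang.
    public final class RecyclingDriver implements AutoCloseable {
        private final Supplier<WebDriver> factory; // e.g. () -> new ChromeDriver(options)
        private final int pagesPerProcess;
        private WebDriver driver;
        private int pagesFetched = 0;

        public RecyclingDriver(Supplier<WebDriver> factory, int pagesPerProcess) {
            this.factory = factory;
            this.pagesPerProcess = pagesPerProcess;
            this.driver = factory.get();
        }

        public String fetch(String url) {
            if (pagesFetched >= pagesPerProcess) {
                driver.quit();          // discard the old browser process entirely
                driver = factory.get(); // start a fresh one
                pagesFetched = 0;
            }
            pagesFetched++;
            driver.get(url);
            return driver.getPageSource();
        }

        @Override
        public void close() {
            driver.quit();
        }
    }

The crawl loop then only ever calls fetch(), and no browser process survives more than a few thousand page loads.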

Second most important, try different browser types. Perhaps Chrome/Chromium/PhantomJS is not currently the most stable browser type for your use case. (At the time of writing, PhantomJS is no longer under active development.) A great advantage of using Selenium is that many different types of browsers are supported. Try Firefox and see if performance is better. Or, if the platform supports it, try Edge or Safari.
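Because Selenium exposes every browser through the common WebDriver interface, switching is usually just a matter of constructing a different driver; a minimal sketch, assuming geckodriver is installed and on the PATH:

    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;

    // Drop-in replacement for a Chrome-based factory: the rest of the crawl
    // code only ever depends on the WebDriver interface.
    public final class FirefoxFactory {
        public static WebDriver create() {
            return new FirefoxDriver();
        }
    }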

Third, make sure that unexpected situations are handled in ways that do not consume too many resources: for example, if links to PDF files, downloadable documents, etc. are encountered during scraping, or if popup windows, new tabs or dialog boxes are opened. Some websites link to print versions of pages that automatically open a print dialog.
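A sketch of the kind of safeguards meant here, assuming Selenium 4 (the timeout values are arbitrary examples):

    import java.time.Duration;
    import java.util.Set;
    import org.openqa.selenium.WebDriver;

    // Cap how long any page load or script may take, and close any windows or
    // tabs a page has opened beyond the original one.
    public final class CrawlSafeguards {
        public static void apply(WebDriver driver) {
            driver.manage().timeouts().pageLoadTimeout(Duration.ofSeconds(30));
            driver.manage().timeouts().scriptTimeout(Duration.ofSeconds(10));
        }

        public static void closeExtraWindows(WebDriver driver, String mainHandle) {
            Set<String> handles = driver.getWindowHandles();
            for (String handle : handles) {
                if (!handle.equals(mainHandle)) {
                    driver.switchTo().window(handle);
                    driver.close();                    // discard the popup/tab
                }
            }
            driver.switchTo().window(mainHandle);      // return to the main window
        }
    }

Skipping links that point directly at PDFs or other downloads (e.g. by checking the URL before navigating) avoids another common source of stuck sessions.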

Otto G

Most websites do not require that you stand up an entire browser and automation framework to scrape them; doing so marshals a lot of machinery that is simply not necessary, and introduces a lot of complexity that reduces your overall system reliability.

For examples of how to write scrapers that are more lightweight and more reliable, take a look at Jaunt: http://jaunt-api.com/
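To illustrate the lighter-weight style (this sketch uses jsoup rather than Jaunt, simply as another well-known Java HTML parser; the URL and selector are placeholders):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // No browser process at all: fetch the HTML over plain HTTP and parse it.
    public final class LightweightScrape {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("https://example.com/search?q=term")
                    .userAgent("my-crawler")
                    .timeout(10_000)
                    .get();
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("abs:href") + " -> " + link.text());
            }
        }
    }

Note that this only parses the HTML the server returns; if the data you need is rendered by JavaScript, a browser (or another rendering engine) is still required for those pages.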

Robert Harvey