Nutch with Selenium to fetch JavaScript content

I am trying to enable the Selenium plugin to crawl content that is rendered dynamically by JavaScript on websites served over HTTPS (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L, https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G).

I have followed all the instructions at https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium. Because the sites being crawled use HTTPS, I kept protocol-httpclient enabled alongside protocol-selenium in nutch-site.xml under NUTCH_HOME/conf. I also enabled the selenium.take.screenshot property, and Selenium is running.
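For readers trying to reproduce this, the setup described above might look roughly like the following fragment of nutch-site.xml. This is only a sketch, not the actual configuration from the post: the property names are the ones I understand the protocol-selenium README and nutch-default.xml to document, while the plugin list beyond the two protocol plugins, the driver value, and the screenshot path are illustrative assumptions.

```xml
<!-- Sketch of NUTCH_HOME/conf/nutch-site.xml; values are assumptions, not the poster's real config -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-httpclient and protocol-selenium come from the post;
         the remaining plugins are an illustrative default-style list -->
    <value>protocol-httpclient|protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
  <property>
    <name>selenium.driver</name>
    <!-- assumed: Firefox is the browser mentioned in the post -->
    <value>firefox</value>
  </property>
  <property>
    <name>selenium.take.screenshot</name>
    <value>true</value>
  </property>
  <property>
    <name>selenium.screenshot.location</name>
    <!-- hypothetical path, chosen only for illustration -->
    <value>/tmp/nutch-screenshots</value>
  </property>
</configuration>
```

With two protocol plugins enabled at once, it may be worth checking in the fetcher logs which plugin actually handles the https URLs.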

When I run the crawl, the JavaScript-rendered data is not fetched from the websites, and no Selenium screenshot is captured.

Has anyone tried the same? If so, please let me know. Thanks!

Apache Nutch version: 1.12; Firefox version: 60.3.0; Selenium version: 3.4.0 (standalone).
