Nowadays most popular websites have some kind of dynamic element and use JavaScript to display information, so chances are you will have to crawl a website full of JavaScript-generated content. When designing a web scraper, we should look for simple, pure HTML pages so we can fetch data without hassling with JavaScript. There are cases, though, when we cannot get around scraping JavaScript-rendered pages. If you are struggling with scraping JavaScript-generated information, keep reading: in this tutorial I'll show you how to make it happen easily in Java with HtmlUnit!

Scraping JavaScript content

So you want to scrape information that is rendered or displayed with JavaScript. Before we jump in, be aware that you cannot scrape JavaScript-generated HTML with a simple HTML parser like BeautifulSoup in Python or JSoup in Java. You need something more: a real browser (engine) that will run the JavaScript for you. For scraping, you can use a headless browser. In this tutorial I will use HtmlUnit, which is widely used to test web applications; it has excellent JavaScript support and its headless browser is relatively fast.

Setting up HtmlUnit

You can download the HtmlUnit library from HERE.
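If you use a build tool instead of a manual download, the 2.x releases of HtmlUnit are published on Maven Central under the `net.sourceforge.htmlunit` group. The version below is just an illustrative example; pick a current release:

```xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.23</version>
</dependency>
```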

How to use WebClient to scrape JavaScript

In HtmlUnit you will use the WebClient class to simulate a real browser.

You can instantiate a WebClient like this:

WebClient webClient = new WebClient();

This will create a headless browser with the default configuration (Internet Explorer). In some cases it won't load JavaScript properly, so you might try other browser configurations:

WebClient webClient = new WebClient(BrowserVersion.CHROME);

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);

Obviously, after trying a few of them, use the one that works best for you.

Enable JavaScript in your WebClient

After setting up your WebClient, the next step is to enable JavaScript in it.
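HtmlUnit exposes this switch through the client's options object; a minimal sketch using the standard WebClientOptions API:

```java
// Turn on the JavaScript engine for this headless browser
webClient.getOptions().setJavaScriptEnabled(true);
```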


Additionally, through getOptions() you can enable, disable, or configure things like CSS, the maximum timeout, SSL, etc.
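For example, here are a few commonly tweaked options; the method names come from HtmlUnit's WebClientOptions class, and the values are just illustrative:

```java
webClient.getOptions().setCssEnabled(false);                  // skip CSS processing for speed
webClient.getOptions().setTimeout(10000);                     // connection timeout in milliseconds
webClient.getOptions().setUseInsecureSSL(true);               // accept self-signed certificates
webClient.getOptions().setThrowExceptionOnScriptError(false); // don't fail on the page's own JS errors
```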

In most cases you have to give your WebClient some time to load JavaScript properly. You can do it like this:
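One common approach is waitForBackgroundJavaScript, which blocks the current thread for up to the given number of milliseconds while background JavaScript jobs (AJAX calls, timers) finish:

```java
// Wait up to 10 seconds for background JavaScript to finish running
webClient.waitForBackgroundJavaScript(10000);
```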


Real Working Example: Scraping JavaScript

Let’s say you want to scrape some soccer data from THIS soccer stats page.

It's going to be a great example because the full page is loaded by JavaScript.

Here’s the full code:

    String START_URL = "_1,3_1,22_1,5_32887,9_overview,25_1";
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    HtmlPage page = null;
    try {
        page = webClient.getPage(START_URL);
        // give the page's JavaScript time to run before scraping
        webClient.waitForBackgroundJavaScript(10000);
    } catch (IOException ex) {
        ex.printStackTrace();
    }

Now you can scrape and fetch whatever you like, because all of the JavaScript-rendered HTML content is now fully visible.

For example, this snippet scrapes each Premier League team from the page and stores it in a List:

List<?> teams = page.getByXPath("//td[@class='team']");
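Each matched element is an HtmlTableDataCell. Assuming HtmlUnit 2.x, you can read the visible text of each cell with asText() (renamed asNormalizedText() in later versions):

```java
// teams holds the matched <td class="team"> cells
for (Object cell : teams) {
    HtmlTableDataCell team = (HtmlTableDataCell) cell;
    System.out.println(team.asText()); // visible text of the cell, i.e. the team name
}
```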