Scrapy - Login to websites

Sometimes you have to be logged in to access the data you are after. With Scrapy this should not discourage you, because Scrapy handles login forms and cookies easily. Be aware, though, that data behind a login is not accessible to everyone, so scraping it may not be ethical. I have no idea what your situation is, but make sure that what you do is both ethical and legal!

Login to websites

When you log in to a site in your browser, you need your username (or maybe email) and password. That’s all you need when you log in with Scrapy too. You don’t have to build the POST request by hand, store cookies or anything else. Scrapy does the hard work for you.

Scrapy login With FormRequest

You need to use Scrapy’s FormRequest class. Its from_response() method reads the login form from the page, fills in the credentials you pass to it and builds the request that submits the form. This is the general use of a FormRequest:

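def parse(self, response):
    # response is the downloaded login page; from_response() finds the
    # form in it and fills in the fields given in formdata.
    return scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'randomuser', 'password': 'topsecret'},
        callback=self.after_login,  # called with the post-login response
    )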

In the example above, the response object is the HTTP response of the page that contains the login form. As you can see, the FormRequest needs to know where the form is (response), which credentials to send to the server (formdata) and which callback function to call once the login request has completed; that callback will usually scrape the page you land on after logging in.

Before scraping the page, you should make sure you are correctly logged in. A simple solution is to search the response body for the error message the site shows when a login fails. If that message is not there, you are good to go with scraping the page.

def after_login(self, response):
    # response.text is the decoded body (response.body is bytes)
    if "Error while logging in" in response.text:
        self.logger.error("Login failed!")
    else:
        self.logger.info("Login succeeded!")
        item = SampleItem()
        item["quote"] = response.css(".text").extract()
        item["author"] = response.css(".author").extract()
        return item

If the page you are redirected to after login is not the one that holds the data, request the right page from your callback like this:

yield scrapy.Request(url="http://scrape.this.com", callback=self.parse_something)

Pagination After Login

You may not find all the data you need on one page, so you need some sort of pagination. As I mentioned, Scrapy does everything you need under the hood, so it takes care of keeping you logged in while you paginate through the pages. Cookies are stored automatically.

Here’s a simple example of pagination on a website which has a “Next Page” button:

next_page_url = response.css("li.next > a::attr(href)").extract_first()
if next_page_url is not None:
    yield scrapy.Request(response.urljoin(next_page_url))

Include this, or something similar, at the end of your parsing method and you will be able to parse every page that needs authentication.