Not so long ago, I was building a spider that queried product IDs from a database before actually scraping the site. The task was to assign a specific product ID to each scraped product.

In the database table I had two columns: product_id and url. Each URL pointed Scrapy to a product page. So when Scrapy was about to scrape a URL, I had to find a way to pass the correct product ID to the parsing method, so I could populate the item with the correct product_id.

Essentially, I had to connect to the database, get the URL and product_id, then scrape the URL while passing its product ID along. All of this had to happen in start_requests, because that is the method Scrapy invokes to request URLs, and it has to yield Request objects. I figured out that Request has a parameter called meta, which we can use to pass an arbitrary amount of data of any type. We can then access that data in the parsing function as a dictionary on the Response object.

Passing the meta parameter in the Request object

So, as I said, start_requests has to yield Request objects. But before that, I queried the database for the URLs and product IDs, just like I did in another project:

import MySQLdb
import MySQLdb.cursors

def start_requests(self):
    # Connect to the database holding the product_id/url pairs
    conn = MySQLdb.connect(
        user='user',
        passwd='pass',
        db='mdb',
        host='1.1.1.1',
        charset="utf8",
        use_unicode=True,
        cursorclass=MySQLdb.cursors.DictCursor  # rows come back as dicts
    )
    cursor = conn.cursor()
    cursor.execute(
        'SELECT * FROM products'
    )
    rows = cursor.fetchall()
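If you want to try the query step without a MySQL server, the same pattern works with Python's built-in sqlite3 module. This is a runnable stand-in, not the article's actual setup: the in-memory table, the sample row, and the example URL are made up here, but the column names match the schema above, and sqlite3.Row plays the role of DictCursor by making rows indexable by column name.

```python
import sqlite3

# In-memory stand-in for the MySQL table described above
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows behave like dicts, as with DictCursor
conn.execute("CREATE TABLE products (product_id INTEGER, url TEXT)")
conn.execute(
    "INSERT INTO products VALUES (?, ?)",
    (101, "https://example.com/products/101"),  # sample data for illustration
)

rows = conn.execute("SELECT * FROM products").fetchall()
for row in rows:
    # Same access pattern the spider uses on the MySQL rows
    print(row["product_id"], row["url"])
```

Because each row supports row["product_id"] and row["url"], the loop that yields the requests works unchanged against either backend.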

I defined a dict in the meta parameter so later I could easily access it:

for row in rows:
    product_id = row["product_id"]
    url = row["url"]
    # Request is scrapy.Request; meta travels with the request to the callback
    yield Request(url=url, meta={"product_id": product_id})

I could also have passed the URL in the meta parameter, but I didn't, because I can access it directly through the Response object anyway.

These two pieces of code are all I have in the start_requests function: query the database, then pass the data along. The task here is done.

Getting meta data from the Response object

Now, we should be able to access our passed data in the parsing function where we populate the items.

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags

def parse(self, response):
    # ProductItem is the project's own Item subclass
    item_loader = ItemLoader(item=ProductItem(), response=response)
    item_loader.default_input_processor = MapCompose(remove_tags)
    item_loader.default_output_processor = TakeFirst()
    # The product id we attached in start_requests comes back on the response
    product_id = response.meta["product_id"]
    price_selector = "dd[itemprop='price']"
    name_selector = "h1[itemprop='name']"

    item_loader.add_value("product_id", product_id)
    item_loader.add_css("price", price_selector)
    item_loader.add_css("name", name_selector)
    item_loader.add_value("url", response.url)
    return item_loader.load_item()

Simply by accessing response.meta, we get back the dictionary we passed through the Request object.
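To make the round-trip concrete without spinning up a crawl, here is a minimal pure-Python model of what Scrapy does. These are stand-in classes that only mirror the relevant behavior, not Scrapy's real implementations; the URL and product id are invented for the demo.

```python
# Simplified stand-ins mimicking how Scrapy carries meta from a
# Request to its Response (not Scrapy's actual classes).
class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class Response:
    def __init__(self, url, request):
        self.url = url
        self.request = request

    @property
    def meta(self):
        # In Scrapy, Response.meta is likewise a shortcut to request.meta
        return self.request.meta

req = Request("https://example.com/p/1", meta={"product_id": 42})
resp = Response(req.url, request=req)
print(resp.meta["product_id"])  # the product id survives the round-trip
```

The key point the model captures: the meta dict lives on the request, and the response merely exposes it, which is why whatever you attach in start_requests is waiting for you in the callback.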