Selectors are one of the most important parts of your scraper. Well-written selectors keep your web scraper fast and reliable. When a website’s layout changes, your scraper’s selectors have to change with it; in a well-structured scraping setup, they are the only thing that has to change. In this post I will dig into CSS selectors and XPath and share some tips for writing effective, fast selectors for your web scraper.

CSS Selectors and XPath

CSS Selectors

CSS selectors are widely used by frontend developers to attach CSS properties to HTML elements. In a web scraper, we can use them to navigate the structure of an HTML document. If you are new to scraping and already familiar with CSS, I suggest you prefer CSS selectors over XPath, though in some cases you will have to use XPath.

XPath

XPath is a specification created for navigating XML documents, so you can also use it when parsing an HTML file. Almost every HTML parsing or web scraping library supports XPath. It’s a more robust and powerful way to locate elements than CSS selectors.
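To make this concrete, here is a minimal sketch that runs an XPath-style query using only Python’s standard library. The sample page and the variable names are made up for illustration. Note that xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, so a real scraper would more likely use a dedicated HTML parsing library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not
# well-formed and needs a more forgiving parser than ElementTree.
PAGE = """
<html>
  <body>
    <div id="content">
      <h1>Products</h1>
      <p class="price">19.99</p>
      <p class="price">4.50</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative to an element, so the XPath
# //p[@class='price'] is written with a leading dot.
prices = [p.text for p in root.findall(".//p[@class='price']")]
print(prices)
```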

CSS Selector Basics

#x

The element whose id is x.

#id

.x

All elements that have the class x.

.content

x y

y elements that are descendants of x at any depth (direct children or deeper).

body p

x, y

Elements that match either x or y.

div, p

x + y

The y element that immediately follows x, where both share the same parent.

div + p

x > y

y elements that are direct children of x.

div > p

x ~ y

All y elements that come after x and share the same parent.

p ~ ul
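The difference between x + y and x ~ y is easy to mix up, so here is a small sketch that emulates both by walking siblings by hand. Python’s standard library has no CSS selector engine, so the two helper functions below are hypothetical stand-ins written for this illustration, not part of any library, and they only handle plain tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup with two lists after a paragraph.
PAGE = """
<div>
  <p>intro</p>
  <ul id="first"/>
  <span>note</span>
  <ul id="second"/>
</div>
"""

def adjacent_siblings(root, x, y):
    """Emulate `x + y`: y elements whose previous sibling element is an x."""
    matches = []
    for parent in root.iter():
        children = list(parent)
        for prev, cur in zip(children, children[1:]):
            if prev.tag == x and cur.tag == y:
                matches.append(cur)
    return matches

def general_siblings(root, x, y):
    """Emulate `x ~ y`: y elements that come after an x with the same parent."""
    matches = []
    for parent in root.iter():
        seen_x = False
        for child in parent:
            if seen_x and child.tag == y:
                matches.append(child)
            if child.tag == x:
                seen_x = True
    return matches

root = ET.fromstring(PAGE)
adjacent = [ul.get("id") for ul in adjacent_siblings(root, "p", "ul")]
general = [ul.get("id") for ul in general_siblings(root, "p", "ul")]
print(adjacent)  # ['first']
print(general)   # ['first', 'second']
```

Only the first list directly follows the paragraph, so p + ul matches it alone, while p ~ ul also picks up the second list further along.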

x[y]

x elements that have a y attribute.

div[alt]

x[y='z']

x elements whose y attribute is exactly “z”.

img[alt='image']

x:last-child

x elements that are the last child of their parent.

p:last-child

x:empty

x elements that have no children at all, including text.

p:empty
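As a rough illustration of what some of the selectors above match, the sketch below translates a few of them by hand into the path syntax of Python’s built-in xml.etree.ElementTree and counts the matches. The sample page and the mapping are assumptions made up for this example. In a real scraper you would hand the CSS string straight to a library with CSS selector support instead of translating it yourself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html>
  <body>
    <div id="main">
      <p>intro</p>
      <section>
        <p>nested</p>
      </section>
      <img alt="image" src="a.png"/>
      <img alt="logo" src="b.png"/>
      <img src="c.png"/>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# CSS selector -> hand-written ElementTree path with the same meaning.
CSS_TO_PATH = {
    "#main": ".//*[@id='main']",
    "div p": ".//div//p",        # descendant at any depth
    "div > p": ".//div/p",       # direct child only
    "img[alt]": ".//img[@alt]",
    "img[alt='image']": ".//img[@alt='image']",
}

counts = {css: len(root.findall(path)) for css, path in CSS_TO_PATH.items()}
for css, n in counts.items():
    print(f"{css:20} matches {n} element(s)")
```

The interesting pair is div p versus div > p: the first also finds the paragraph nested inside the section, the second does not.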

XPath Basics

/

Starts the path at the root node. Between two steps, it selects direct children.

//

Selects matching nodes at any depth. At the start of an expression, it searches the whole document.

//x[@id='y']

The x element whose id is y.

//div[@id='foo']

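//x[@class='y']

x elements whose class attribute is exactly y. An element with several classes, such as class="y z", does not match; contains(@class, 'y') is a common, looser alternative.

//div[@class='foo']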
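An [@class='…'] test compares the whole attribute string, which is a frequent source of selectors that silently miss elements. The sketch below shows the effect on a made-up snippet. Since ElementTree does not support XPath functions such as contains(), the looser check is done in plain Python here by splitting the attribute into class names.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: one div with a single class, one with two.
PAGE = """
<body>
  <div class="foo">only foo</div>
  <div class="foo bar">foo and bar</div>
</body>
"""

root = ET.fromstring(PAGE)

# Exact string comparison: the second div is skipped.
exact = [d.text for d in root.findall(".//div[@class='foo']")]

# Check each class name separately: both divs are found.
by_token = [
    d.text
    for d in root.iter("div")
    if "foo" in d.get("class", "").split()
]

print(exact)
print(by_token)
```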

//x | //y

Selects elements that are x or y, searching the whole document. Element names are case-sensitive in XPath, and HTML parsers such as lxml expose them in lowercase.

//h1 | //h2

//x[@y='z']

x elements whose y attribute is z.

//img[@alt='image']

//x/y/z

z elements that are direct children of y, where y is a direct child of x.

//p/ul/li

//x/text()

Selects the text nodes directly inside x, not the text of nested elements.

//p/text()

//x/y[N]

The Nth y element among the children of x. Positions start at 1.

//div[@id='foo']/td[1]
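Two details of positional predicates are worth seeing in action: the index starts at 1, and it is counted separately under each parent. Here is a minimal sketch with Python’s built-in ElementTree, which supports this part of XPath. The table and its id are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table.
PAGE = """
<html>
  <body>
    <table id="foo">
      <tr><td>r1c1</td><td>r1c2</td></tr>
      <tr><td>r2c1</td><td>r2c2</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# td[1] is the first cell of *each* row, not the first cell overall.
first_cells = [td.text for td in root.findall(".//table[@id='foo']/tr/td[1]")]

# last() picks the final cell of each row.
last_cells = [td.text for td in root.findall(".//table[@id='foo']/tr/td[last()]")]

print(first_cells)  # ['r1c1', 'r2c1']
print(last_cells)   # ['r1c2', 'r2c2']
```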

4 Basic Tips for Writing Effective Selectors

  • Be as specific as necessary, but keep your selectors as short as you can.
  • Know the HTML structure of the website thoroughly. Take the time to go over it.
  • Maintain your selectors. If the layout changes, you will probably need to change your code.
  • Write your selectors yourself. Try to avoid generator tools.
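The first and third tips can be sketched together: a long selector that spells out every level of the page breaks as soon as the layout gains a wrapper element, while a short one anchored on an id keeps working. The two page versions, the paths, and the extract helper below are all hypothetical and only meant to illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a layout change that adds a <main> wrapper.
OLD_PAGE = "<html><body><div><p id='price'>19.99</p></div></body></html>"
NEW_PAGE = "<html><body><main><div><p id='price'>19.99</p></div></main></body></html>"

LONG_PATH = "./body/div/p"          # depends on every level of the layout
SHORT_PATH = ".//p[@id='price']"    # depends only on the id

def extract(page, path):
    """Return the text of the first match, or None if nothing matches."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None

for name, path in [("long", LONG_PATH), ("short", SHORT_PATH)]:
    print(name, extract(OLD_PAGE, path), extract(NEW_PAGE, path))
```

Returning None instead of raising makes it easy to notice a broken selector and to log which one needs updating.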

XPath and CSS Selector Generator Tools

Figuring out and testing your selectors can take a lot of time, especially on a large project. If you are not afraid of messy CSS selectors or XPath expressions, or you simply don’t want to spend time writing your own, you can use one of the tools below to make your job easier. These tools generate the CSS selectors and XPath expressions for you. Be aware that they don’t necessarily produce the most readable or most efficient result. They also sometimes generate strings that are simply wrong and don’t select what you need.

CSS Selector Tools

http://selectorgadget.com/

https://chrome.google.com/webstore/detail/css-selector-helper-for-c/gddgceinofapfodcekopkjjelkbjodin

http://getfirebug.com/

XPath Tools

https://extendsclass.com/xpath-tester.html

http://www.altova.com/xmlspy/xpath-analyzer.html#xpath_analyzer20

http://xmltoolbox.appspot.com/xpath_generator.html

http://getfirebug.com/