Lately, I’ve been trying to tackle two of my biggest challenges: getting new clients on board and creating the product matching module. I can’t say much about the first one yet, because we’re still figuring out how to approach clients with this kind of software. At the moment we’ve been cold-calling small and medium-sized e-commerce companies. We’ve created a short script that lets us quickly introduce PriceMind and get clients excited about the software. After the first call, if the client agrees, we show off a real working demo created just for them. Anyway, in this post I wanna talk about something else I only have a little clue about as well, but it’s more of a technical challenge: data matching.
As you may know, we build a price intelligence platform. The base of everything is the database, with a huge amount of crawled data containing prices and other product information. Now, the difficult part is not web scraping: in most cases you can write any number of web spiders and get almost any kind of data from any website. You can also clean and standardize the data so the whole database is easy to work with later. The really hard part is building a system that finds the same product across various websites. A price intelligence platform is worth zero (well, not zero, but not much) if it doesn’t compare your prices with your competitors’ prices. That’s an essential feature for clients.
To solve this, we have two options. The first one is easy but lame.
You can gather the product page URLs of the same product from multiple websites, then save these URLs in a database so the scraper can visit and scrape them accordingly. This is called manual matching. It’s really time-consuming and not much fun to do. On the other hand, you only have to do it once and it will keep working (hopefully — as long as the URL doesn’t change, which can happen). At the moment, this is how PriceMind works: we have to gather product match URLs manually.
The second option is to automate this process. For the last two weeks I’ve been working on a module that does just that. I feel like it will save us countless hours of URL gathering. I’m gonna give you an introduction to how I’ve been building this module.
So let’s imagine this scenario: I have two tables in the db. One contains the products of SHOP A, the other the products of SHOP B. There are products you can find in both shops. The problem is that the data records are not exactly the same, even though they refer to the same product. Let’s take an example from the contact lens industry (there’s a relatively small number of products in this category, so it’s easy to test our process). These are the same product, but with different product names:
SHOP A: “Focus dailies toric”
SHOP B: “Focus dailies all day comfort toric”
How can we figure out whether they refer to the same product (without much knowledge about contact lens products)? String comparison!
So I researched how to compare two strings and get their similarity value. There are some well-known algorithms for this:
Levenshtein distance: this method gives you the minimum number of edits (insertions, deletions, substitutions) required to change one string into another.
Jaro-Winkler: this returns a value between 0 and 1, where 0 means the two strings are completely different and 1 means they are exactly the same. It measures how many characters the two strings have in common, and it assumes that similarity or difference near the start of the string is more important than at the end.
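Neither algorithm is hard to sketch in plain Python. Here’s a minimal, unoptimized version of both, just to make the definitions concrete (in practice you’d use a library implementation, which is faster and better tested):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaro(a: str, b: str) -> float:
    """Plain Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    a_match = [False] * len(a)
    b_match = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_match[j] and b[j] == ca:
                a_match[i] = b_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions: matched characters that appear in a different order
    t, k = 0, 0
    for i, matched in enumerate(a_match):
        if matched:
            while not b_match[k]:
                k += 1
            if a[i] != b[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity, boosted for a shared prefix of up to 4 characters."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(levenshtein("Focus dailies toric",
                  "Focus dailies all day comfort toric"))  # → 16 (insert "all day comfort ")
print(jaro_winkler("Focus dailies toric",
                   "Focus dailies all day comfort toric"))
```

The Levenshtein function uses the classic dynamic programming table (kept to two rows), so it runs in O(len(a) × len(b)) time, which is fine for product-name-sized strings.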
There are a bunch of other ways to measure the similarity between two strings (q-grams, phonetic encoding, and other modifications of the above), but these two are the ones I’ve been experimenting with lately. I prefer the Jaro-Winkler method because it seems to be more accurate when comparing product names.
The great thing is that you don’t have to implement these algorithms yourself, because numerous libraries have already done that. I’ve been experimenting with these:
A Python lib containing a lot of string comparison methods. It can also compare strings based on phonetic encoding. It’s a really useful library for trying out different approaches and just playing around.
This is another pretty useful Python lib. What makes it so cool is that it’s got not only pure implementations of the algorithms I mentioned above, but it can also compare strings after tokenizing them, which is very useful when dealing with multi-word strings. It can also compare not just two but multiple strings at once. It’s a fun library for sure.
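I won’t commit to either library’s exact API here, but the tokenizing trick itself is easy to demo with Python’s standard library. The helper below is hypothetical (and uses difflib as a stand-in for the fancier similarity measures): it sorts the words of each string before comparing, so two names listing the same words in a different order still score 1.0:

```python
from difflib import SequenceMatcher

def token_sort_similarity(a: str, b: str) -> float:
    """Compare two multi-word strings after lowercasing and sorting
    their tokens, so word order doesn't hurt the score."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()

print(token_sort_similarity("Focus toric dailies", "Focus dailies toric"))  # → 1.0
print(token_sort_similarity("Focus dailies toric",
                            "Focus dailies all day comfort toric"))
```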
Moving on from string matching, what I found is that considering only product names is not enough to create a module that provides accurate product matches. Normally, a product has several parameters — for example size, color, width, height, weight, type and so on. Of course, these are all domain-specific properties, and we should make use of them when searching for product matches! Again, I’m gonna stick to contact lenses because right now this is the domain I’ve been experimenting with. Contact lenses have many parameters aside from the name: diameter, use time, oxygen content, water content, toric, multifocal, etc. These are all valid parameters, and we could even figure out whether two products are the same based on parameters alone, ignoring the product names.
In data matching there are some fundamental techniques we need in order to create an accurate matching system. One of them is the similarity vector. A similarity vector contains a number of similarity values. Let’s take this product as an example:
SHOP A: PureVision 2
SHOP B: PureVision 2 HD

Sab = [brand, name, size, water content, oxygen content]
Sab = [1, 0.75, 1, 0, 1]
Each value inside the similarity vector refers to a parameter: 1 means the two values are exactly the same, 0 means the two products have totally different values for that parameter. In this example the name parameter is 0.75, which means the names are somewhat similar but not exactly the same. Using the Jaro-Winkler algorithm we get a similarity value of 1 for these two strings:
These are the brand names, and it’s obvious they are the same brand. Let’s see something not so obvious, the product names:
- “PureVision 2”
- “PureVision 2 HD”
These are the product names. Using the Jaro-Winkler method we get a similarity value of 0.75. Knowing only the similarity values between the brand names and the product names, and without deep knowledge of this specific domain, I would have a hard time figuring out whether these products are the same. Maybe the HD means they are totally different products, who knows. That’s why multi-parametric matching is so useful. The next parameter we match is the size, where we get a similarity value of 1, which means it matches. Now we’re getting more confident that these two products are a match.
Moving on, the next parameter is water content. Its value is 0, which means either the two products have totally different values or the field is simply NULL for one of them. Unfortunately, it’s not always possible to gather the same kind and amount of information from different websites, so there will be empty data fields that make the matching process more difficult.
The last parameter we measure is oxygen content. The similarity value is 1, which means it’s the same. Cool. Knowing these similarities, we can make a pretty good guess about whether the products are the same or not. Still, without domain knowledge it’s hard to be sure: maybe if the water content is different, there’s no way these products are the same, for example.
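To tie the example together, here’s how I’d sketch building the similarity vector in Python. The records are made up for illustration (the water and oxygen content values are not real product specs), and I’m using difflib’s ratio as a stand-in for Jaro-Winkler, so the name score comes out 0.89 rather than the 0.75 above — but the shape of the process is the same:

```python
from difflib import SequenceMatcher

def sim(a, b) -> float:
    """Field similarity in [0, 1]; a missing value (None) scores 0.0."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

# Hypothetical records -- field values are illustrative, not real specs.
product_a = {"brand": "Bausch & Lomb", "name": "PureVision 2",
             "size": "14.0", "water_content": "36%", "oxygen_content": "130"}
product_b = {"brand": "Bausch & Lomb", "name": "PureVision 2 HD",
             "size": "14.0", "water_content": None, "oxygen_content": "130"}

FIELDS = ["brand", "name", "size", "water_content", "oxygen_content"]
similarity_vector = [round(sim(product_a[f], product_b[f]), 2) for f in FIELDS]
print(similarity_vector)  # → [1.0, 0.89, 1.0, 0.0, 1.0]
```

From here, a simple approach is to combine the vector into one score (for example, a weighted average that leans on the parameters the domain cares about most) and flag pairs above a threshold as candidate matches.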
Anyway, I will continue my journey of learning data matching. Hopefully in the next post I’ll be able to show you a solid process for how I do it. For now, I just quickly rambled through some interesting stuff I’ve researched.
Some libraries and frameworks that I’ve been working with recently:
Awesome data matching learning material: