Scraping ecommerce with JavaScript: Node, Puppeteer and Cheerio

Did you know? Scraping is the new black! Let's talk about web scraping with JavaScript in 2018.

Well, if you know me, you know I love Java. But when it comes to web scraping, nobody beats JavaScript. So today I'm gonna talk a little about scraping in JS in 2018.
ATTENTION: this is not a technical article, or rather, this article is not filled with code that you can already easily find by googling.

Table of Contents

  • Scraping? What is it and why
  • Scraping tools and considerations
    • Cheerio
    • Puppeteer
    • Pros and Cons
      • Cheerio
      • Puppeteer
      • Comparison verdict
  • Scraping ecommerce: tips & tricks
    • Save the HTML
    • Proxy
    • Ecommerce structure
    • Ecommerce: scraping strategies
      • First, scrape all the list pages
      • For each JSON file, parse its item
      • Filters
      • DB
    • Parallelize work
  • Final considerations

Scraping? What is it and why

If you need to retrieve information from a website programmatically, the best way is to get it via an API. Sometimes, however, that isn't possible, because the API doesn't exist or doesn't provide the specific information you need.
Here comes web scraping: scraping means analyzing a web page and extracting information from it.
The reasons why you may want to retrieve that information can be very different: price tracking, competitor analysis, data collection and so on.

Scraping tools and considerations

When it comes to scraping, almost every language provides a library or framework with scraping capabilities, but there is one language with a valuable advantage over the others: JavaScript.

The reason is that web pages are built by mixing HTML and JavaScript, so the language is a de facto first-class citizen there.

Web scraping could be performed on a page simply by injecting JavaScript code via the browser inspector console, but obviously that approach isn't practical, since we want to scrape multiple pages, multiple times. So we need server-side JavaScript, AKA Node.js.

The two most commonly used (and best) libraries are Cheerio and Puppeteer.

Cheerio

This library needs to be fed HTML code. Once it has loaded the markup, it lets you query the page with a jQuery-like syntax to retrieve all the information you need.
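
Just to give an idea, here is a minimal sketch (the HTML string and the selectors are made up):

const cheerio = require('cheerio');

// Load some HTML and query it with the jQuery-like API.
const html = '<div class="product"><h1>Blue sneakers</h1><span class="price">49.90</span></div>';
const $ = cheerio.load(html);

const title = $('h1').text();          // "Blue sneakers"
const price = $('span.price').text();  // "49.90"
console.log(title, price);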

Puppeteer

This library lets you drive a Chromium browser (the open-source browser project Chrome is based on) and perform any JavaScript operation on the rendered page. The nice thing is that it can run Chromium headless: if you want, you can run it without starting its UI.
Puppeteer is a great library whose scope goes well beyond scraping: you can use it to print PDFs, run automated tests, automate operations like form filling, and even automate interaction with web services that can only be used via the web, without a public API, like WhatsApp for example.
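
A minimal sketch of Puppeteer in action (the URL is just a placeholder):

const puppeteer = require('puppeteer');

(async () => {
  // Launch headless Chromium, open a page and read the rendered title.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  const title = await page.evaluate(() => document.title);
  console.log(title);
  await browser.close();
})();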

Pros and Cons

Cheerio

  • [PRO] Cheerio works on plain HTML: you can feed it the HTML source of any page, fetched with any JS HTTP library like request, axios, fetch, etc., and it will work without any other resource (see the sketch after this list)
  • [PRO] since Cheerio only needs HTML code, you can save the HTML and pass it to Cheerio any time you need it, instead of calling the source again (useful when developing a scraper, since you may retry the scraping more than once to get all the info you need)
  • [CONS] if the information you need is not present in the HTML source but is provided via an AJAX call, then you will have to discover that call, invoke it yourself and pass the resulting response to Cheerio to parse
  • [CONS] Cheerio has a jQuery-like syntax and returns Cheerio objects, so you must learn its syntax and methods: e.g. the innerText DOM node property is accessed via the text() method
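
For example, a sketch of the fetch-then-parse flow, assuming axios and a placeholder URL and selectors:

const axios = require('axios');
const fs = require('fs');
const cheerio = require('cheerio');

(async () => {
  // Download the HTML once and keep a copy on disk for later runs.
  const url = 'https://www.example-shop.com/product/123'; // placeholder
  const { data: html } = await axios.get(url);
  fs.writeFileSync('product-123.html', html);

  // Parse it with Cheerio (the saved file can be re-parsed for free).
  const $ = cheerio.load(html);
  console.log($('h1').text(), $('span.product__price').text());
})();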

Puppeteer

  • [PRO] Puppeteer starts a Chromium instance, so what it renders on its pages is exactly what you would see in a Chrome browser, no difference at all
  • [PRO] since it uses Chromium, you don't need to discover which AJAX calls the page may perform, since their results will be rendered on the page: coding a scraper doesn't require you to know how the website's calls work
  • [PRO] since it uses Chromium, you can scrape information via standard client-side JavaScript: you can test it in a Chrome instance using its powerful inspector and, when you are happy with the results, simply copy that code into Puppeteer's page.evaluate method (see the docs, and the sketch after this list) and it will work seamlessly
  • [PRO] since it uses Chromium, if you need to interact with an in-page JavaScript method or variable, you can simply call or evaluate it, since you have full access to the page's JavaScript engine (with Cheerio you have to treat in-page JavaScript as text inside a script tag)
  • [PRO] Puppeteer allows you to inject JavaScript files into the page (this is extremely useful when scraping, see later in the article)
  • [CONS] since it uses Chromium, Puppeteer must load a page completely: it loads all the page resources even if you only need the initial HTML source. This is both time and bandwidth consuming
  • [CONS] since it uses Chromium, Puppeteer requires machine resources: both CPU and memory
  • [CONS] since it uses Chromium, Puppeteer requires you to load a page… well, you can also serve a local file and, if needed, manually inject a complete HTML structure into the page, so it could work offline too, but it is not as comfortable as Cheerio in this regard
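
Here is a sketch of the page.evaluate and script-injection points mentioned above (the URL, the selectors and the scraping-helpers.js file are placeholders):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.example-shop.com/product/123');

  // Optionally inject a local helper script into the page.
  await page.addScriptTag({ path: './scraping-helpers.js' });

  // The function below runs inside Chromium: you can develop it in the
  // Chrome inspector first and paste it here unchanged.
  const item = await page.evaluate(() => ({
    title: document.querySelector('h1').innerText,
    price: document.querySelector('span.product__price').innerText,
  }));
  console.log(item);

  await browser.close();
})();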

Comparison verdict

If you can afford the extra load time, CPU and bandwidth consumption, then there is no contest: Puppeteer is the winner. Puppeteer is also the only solution when you need to scrape AJAX calls that are hard to discover or to repeat (sometimes an AJAX call requires specific headers that are hard or impossible to determine without running the page in a browser).
That said, still consider Cheerio any time you have to perform scraping: in many cases it may be more than sufficient.

Scraping ecommerce: tips & tricks

Ecommerce websites are among the favourite scraping targets. This kind of "target" requires massive scans and multiple scan iterations, since you want to track items over time.
I use Puppeteer, but the following considerations may be useful for Cheerio scraping too.

Save the HTML

When retrieving the pages, always save the rendered HTML (you can do it easily with Puppeteer): you may need to re-access it to extract specific fields you forgot during the scans, without repeating a full massive scan; in that case you'll write short Cheerio scripts to retrieve the missing information and update your Puppeteer script for future scans.
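
A sketch of the saving step, assuming a placeholder URL: page.content() returns the HTML after JavaScript has run, ready to be re-parsed later with Cheerio.

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.example-shop.com/product/123');
  // Write the rendered markup to disk for later offline re-parsing.
  fs.writeFileSync('product-123.html', await page.content());
  await browser.close();
})();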

Proxy

Use a proxy if possible… and rotate IP addresses.
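
Chromium accepts a --proxy-server flag, so you can give each Puppeteer instance a different proxy (the address below is a placeholder):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://123.45.67.89:3128'],
  });
  // ...scrape as usual, then launch with a different proxy next time.
  await browser.close();
})();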

Ecommerce structure

Ecommerce sites share the same logical structure 99% of the time: they are massive paginated lists of items, each item has its own item page, and the list pages have filters.
Build a framework around this concept: I have built a framework I can configure via a JSON file to scrape ecommerce websites, which dramatically speeds up the process of integrating new ecommerce sites to scrape. If you use Puppeteer, you can test all the client-side scraping code in the Chrome inspector and then execute it dynamically with Puppeteer. I use a JSON structure like this:

"itemInfos": [
  {"key": "title", "selector": "h1", "attr": "innerText"},
  {"key": "brand", "selector": "div.product__attributes__item--marca", "attr": "innerText"},
  {"key": "price", "selector": "span.product__price", "attr": "innerText"},
  {"key": "image", "selector": "a.product__images__gallery__item img", "attr": "src"},
  {"type": "code", "path": "custom-code/targetecommerce.js", "function": "extractItem"}
]

As you can see, I can provide a JavaScript selector directly via JSON, specifying which attribute I want to retrieve, or even execute a custom function to extract the fields I need.
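
This is not my actual framework code, just a hypothetical sketch of how such a config could be consumed: each entry is resolved with document.querySelector inside page.evaluate, while the "code" entries would be handled separately via injected custom scripts.

// page is a Puppeteer page, itemInfos is the array from the JSON config above.
async function extractItem(page, itemInfos) {
  return page.evaluate((infos) => {
    const result = {};
    for (const info of infos) {
      if (info.type === 'code') continue; // custom code handled elsewhere
      const node = document.querySelector(info.selector);
      if (node) result[info.key] = node[info.attr];
    }
    return result;
  }, itemInfos);
}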

Ecommerce: scraping strategies

Subdivide your scraping process into steps. These are mine.

First, scrape all the list pages

Scrape all the list pages (usually you have a paginated "full list", or you can go by topics). For each item on the page, save a JSON file whose name is the URL of the item (not the full path, just the URL of the page) plus ".json".
This will help when you need to understand which item a file refers to, instead of having anonymous 1234.json file names. Store in the file the item URL plus some info you can already get while parsing the list (usually the item title and its price, but this may vary from site to site). Ah, and remember to save the list HTML, it may turn out to be a useful suggestion ;)
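
A hypothetical sketch of this step (the folder name and fields are my own placeholders): the URL is encoded into a safe filename and the list-page info is stored next to it.

const fs = require('fs');

function saveListItem(item, outDir = './list-items') {
  fs.mkdirSync(outDir, { recursive: true });
  // Encode the URL so it can be used as a filename.
  const fileName = encodeURIComponent(item.url) + '.json';
  fs.writeFileSync(`${outDir}/${fileName}`, JSON.stringify(item, null, 2));
}

saveListItem({
  url: 'https://www.example-shop.com/product/blue-sneakers', // placeholder
  title: 'Blue sneakers',
  price: '49.90',
});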

For each JSON file, parse its item

For each JSON file, take the URL, visit the item page and parse it. Merge in the info you got from the list page and save a JSON file for the item too, with the same url.json filename as before (obviously in a different folder).
Then move the original list JSON file to a folder called processed: this will help if you need to restart the same "scan session" multiple times: you know, the script may get interrupted or crash for some reason, and you don't want to re-process what you have already processed.
This approach may also turn out useful if you want to split your scraping session into multiple sub-sessions. And remember, save the item's HTML ;)
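
A hypothetical sketch of this step (folder names are placeholders, and scrapeItemPage is assumed to wrap the Puppeteer code shown earlier):

const fs = require('fs');

async function processListFiles(scrapeItemPage) {
  fs.mkdirSync('./items', { recursive: true });
  fs.mkdirSync('./list-items/processed', { recursive: true });

  for (const fileName of fs.readdirSync('./list-items')) {
    if (!fileName.endsWith('.json')) continue; // skips the processed folder too
    const listInfo = JSON.parse(fs.readFileSync(`./list-items/${fileName}`, 'utf8'));
    const itemInfo = await scrapeItemPage(listInfo.url);

    // Save the merged item, then move the list file away so a crashed run
    // can be restarted without re-processing it.
    fs.writeFileSync(`./items/${fileName}`, JSON.stringify({ ...listInfo, ...itemInfo }, null, 2));
    fs.renameSync(`./list-items/${fileName}`, `./list-items/processed/${fileName}`);
  }
}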

Filters

If the item and list pages don't provide all the info you can obtain through filters (yes, sometimes it happens), then you have to iterate through the filters on the list pages and add the filter-provided information to a new file with the same item-URL name.

DB

Well, now you have to move this amount of stuff into a DB: iterate through all the items' JSON files, look for the "filter created" JSON with the same URL filename, merge their information, and you have a full JSON object to import into the DB, pre-processing it if needed.
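
A hypothetical sketch of the merge step (folder names are placeholders, and importIntoDb is assumed to wrap your DB client of choice):

const fs = require('fs');

async function mergeAndImport(importIntoDb) {
  for (const fileName of fs.readdirSync('./items')) {
    if (!fileName.endsWith('.json')) continue;
    const item = JSON.parse(fs.readFileSync(`./items/${fileName}`, 'utf8'));

    // Merge in the filter-provided info, if a file with the same name exists.
    const filterFile = `./filters/${fileName}`;
    const filterInfo = fs.existsSync(filterFile)
      ? JSON.parse(fs.readFileSync(filterFile, 'utf8'))
      : {};

    await importIntoDb({ ...item, ...filterInfo });
  }
}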
This was the last step!

Parallelize work

Using Puppeteer you can parallelize the work in several ways: one browser instance with multiple tabs, multiple browser instances each with one tab, or multiple browser instances with multiple tabs… choose wisely. I prefer one multi-tab browser, but if it crashes, problems may arise!
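
A sketch of the "one browser, multiple tabs" approach (the URLs are placeholders and the in-page selector is made up):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const urls = [
    'https://www.example-shop.com/product/1',
    'https://www.example-shop.com/product/2',
    'https://www.example-shop.com/product/3',
  ];

  // One tab per URL, all scraped concurrently in the same browser instance.
  const results = await Promise.all(urls.map(async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const title = await page.evaluate(() => document.querySelector('h1').innerText);
    await page.close();
    return title;
  }));

  console.log(results);
  await browser.close();
})();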

Final considerations

If possible, use APIs. When that's not possible, use web scraping, but please take care not to DDoS your target!
Always consider the legal issues: even if everybody does it, scraping is almost always not allowed.
Since you will have to repeat scraping, using a proxy is necessary, and since you will be doing it for multiple ecommerce sites, consider creating a framework… or contact me to buy mine :D

Written by Alberto Plebani

Alberto Plebani is an Italian software engineer currently working as a freelance developer. Happy father, happy bushcrafter!

Find him on LinkedIn