Webscraping Aliexpress with Rselenium

R
Web scraping
Author

Aurelien Callens

Published

November 18, 2020

Today, I am going to show you how to scrape product prices from the Aliexpress website.

A few words on web scraping

Before diving into the subject, you should be aware that web scraping is not allowed on certain websites. To know if this is the case for the website you want to scrape, I invite you to check the robots.txt page, which should be located at the root of the website address. For Aliexpress this page is located here: www.aliexpress.com/robots.txt .

This page indicates that web scraping and crawling are not allowed on several page categories, such as /bin/*, /search/* and /wholesale*. Fortunately for us, the /item/* category, where the product pages are stored, can be scraped.
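If you want to check this programmatically rather than reading the file by hand, the robotstxt package can do it. This package is not used in the original post, and the example item URL below is a placeholder, so treat both as assumptions:

```r
# Hypothetical check with the 'robotstxt' package (not part of the original post)
library(robotstxt)

# Returns TRUE if the path may be crawled by any bot, FALSE otherwise.
# The item id in the URL is a placeholder.
paths_allowed("https://www.aliexpress.com/item/12345.html", bot = "*")
paths_allowed("https://www.aliexpress.com/wholesale?SearchText=watch", bot = "*")
```

This is a convenient sanity check to run before pointing a scraper at a new site.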

RSelenium

Installation for Ubuntu 18.04 LTS

The installation for RSelenium was not as easy as expected and I encountered two errors.

The first error I got after installing the package and trying the rsDriver function was:

Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Unrecognized content encoding type. libcurl understands deflate, gzip content encodings.

Thanks to this post, I installed the missing package: stringi.

Once this error was addressed, I had a different one :

Error: Invalid or corrupt jarfile /home/aurelien/.local/share/binman_seleniumserver/generic/4.0.0-alpha-2/selenium-server-standalone-4.0.0-alpha-2.jar

This time the problem came from a corrupted file. Thanks to this post, I knew that I just had to download the file selenium-server-standalone-4.0.0-alpha-2.jar from the official Selenium website and replace the corrupted file with it.

I hope this will help some of you to install RSelenium with Ubuntu 18.04 LTS !

Opening a web browser

After addressing the errors above, I can now open a Firefox browser:

library(RSelenium)

# Open a Firefox driver
rD <- rsDriver(browser = "firefox") 
remDr <- rD[["client"]]

Logging in to Aliexpress

The first step to scrape product prices on Aliexpress is to log in to your account:

log_id <- "Your_mail_address"
password <- "Your_password"

# Navigate to aliexpress login page 
remDr$navigate("https://login.aliexpress.com/")

# Fill the form with mail address
remDr$findElement(using = "id", "fm-login-id")$sendKeysToElement(list(log_id))

# Fill the form with password
remDr$findElement(using = 'id', "fm-login-password")$sendKeysToElement(list(password))

# Submit the login form by clicking the Submit button
remDr$findElement("class", "fm-button")$clickElement()
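Once logged in, the price itself can be extracted from a product page. The item URL and the CSS class below are assumptions on my part (Aliexpress markup changes regularly), so inspect the page to find the element that actually holds the price at the time you scrape:

```r
# Navigate to a product page (placeholder URL -- any /item/* page works)
remDr$navigate("https://www.aliexpress.com/item/4000469460770.html")

# The CSS class is hypothetical: inspect the page in your browser's
# developer tools to find the element containing the price
price_elem <- remDr$findElement(using = "css selector", ".product-price-value")

# getElementText() returns a list; take the first element
price <- price_elem$getElementText()[[1]]
print(price)
```

From here, the extracted string can be cleaned (currency symbol removed, converted to numeric) and stored in a data frame.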

PhantomJS version

If you execute the code above, you should see a Firefox browser open and perform the actions you scripted. In case you don’t want an active window, you can replace Firefox with the PhantomJS browser, which is a headless browser (i.e. without a window).

I don’t know why, but using rsDriver(browser = "phantomjs") does not work for me. I found this post, which proposes starting the PhantomJS browser with the wdman package:

library(wdman)
library(RSelenium)
# start phantomjs instance
rPJS <- wdman::phantomjs(port = 4680L)

# is it alive?
rPJS$process$is_alive()

# connect Selenium to it
remDr <-  RSelenium::remoteDriver(browserName="phantomjs", port=4680L)

# open a browser
remDr$open()

remDr$navigate("http://www.google.com/")

# Screenshot of the headless browser to check if everything is working
remDr$screenshot(display = TRUE)

# Don't forget to close the browser when you are finished ! 
remDr$close()

Conclusion

Once you have understood the basics of RSelenium and how to select elements inside HTML pages, it is really easy to write a script to scrape data on the web. This post was a short example of scraping the product price on Aliexpress pages, but the script can be extended to scrape more data on each page, such as the name of the item, its rating, etc. It is even possible to automate this script to run daily in order to track price changes over time. As you can see, the possibilities are endless!
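For the daily automation mentioned above, a cron entry is one simple option on Ubuntu. The script path and schedule below are placeholders, not part of the original post:

```shell
# Hypothetical crontab entry: run the scraper every day at 07:00,
# appending output and errors to a log file
0 7 * * * Rscript /home/user/scrape_aliexpress.R >> /home/user/scrape.log 2>&1
```

Add it with `crontab -e`; make sure the script itself starts and closes its own browser session so no orphaned processes accumulate.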
