Today, I am going to show you how to scrape product prices from the Aliexpress website.
A few words on web scraping
Before diving into the subject, you should be aware that web scraping is not allowed on certain websites. To know if that is the case for the website you want to scrape, I invite you to check the robots.txt page, which should be located at the root of the website address. For Aliexpress this page is located here: www.aliexpress.com/robots.txt.
This page indicates that web scraping and crawling are not allowed on several page categories such as /bin/*, /search/* and /wholesale*. Fortunately for us, the /item/* category, where the product pages are stored, can be scraped.
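If you prefer to check this directly from R, a minimal way is to read the robots.txt file and look at its Allow/Disallow rules. The snippet below is just a sketch using base R:
# Read the robots.txt of Aliexpress and keep only the Allow/Disallow rules
robots <- readLines("https://www.aliexpress.com/robots.txt")
robots[grepl("^(Allow|Disallow)", robots)]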
RSelenium
Installation for Ubuntu 18.04 LTS
The installation of RSelenium was not as easy as expected, and I encountered two errors.
The first error I got after installing the package and trying the rsDriver function was:
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Unrecognized content encoding type. libcurl understands deflate, gzip content encodings.
Thanks to this post, I installed the missing package: stringi.
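If you run into the same error, installing stringi from CRAN should be enough:
# Install the missing dependency reported by the error above
install.packages("stringi")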
Once this error was addressed, I had a different one:
Error: Invalid or corrupt jarfile /home/aurelien/.local/share/binman_seleniumserver/generic/4.0.0-alpha-2/selenium-server-standalone-4.0.0-alpha-2.jar
This time the problem came from a corrupted file. Thanks to this post, I learned that I just had to download the file selenium-server-standalone-4.0.0-alpha-2.jar from the official Selenium website and replace the corrupted file with it.
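For reference, the replacement can also be done from R with file.copy(). In the sketch below the source path (~/Downloads) is an assumption, and the destination is the path reported in the error message, which will differ on your machine.
# Overwrite the corrupted jar with the freshly downloaded one
# (source path is an assumption; destination comes from the error message above)
file.copy(from = "~/Downloads/selenium-server-standalone-4.0.0-alpha-2.jar",
          to = "/home/aurelien/.local/share/binman_seleniumserver/generic/4.0.0-alpha-2/selenium-server-standalone-4.0.0-alpha-2.jar",
          overwrite = TRUE)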
I hope this will help some of you install RSelenium on Ubuntu 18.04 LTS!
Opening a web browser
After addressing the errors above, I can now open a firefox browser:
library(RSelenium)

# Open a firefox driver
rD <- rsDriver(browser = "firefox")
remDr <- rD[["client"]]
Logging in to Aliexpress
The first step to scrape product prices on Aliexpress is to log in to your account:
<- "Your_mail_adress"
log_id <- "Your_password"
password
# Navigate to aliexpress login page
$navigate("https://login.aliexpress.com/")
remDr
# Fill the form with mail address
$findElement(using = "id", "fm-login-id")$sendKeysToElement(list(log_id))
remDr
# Fill the form with password
$findElement(using = 'id', "fm-login-password")$sendKeysToElement(list(password))
remDr
#Submit the login form by clicking Submit button
$findElement("class", "fm-button")$clickElement() remDr
Phantomjs version
If you execute the code above, you should see a firefox browser open and navigate through the list you provided. In case you don't want an active window, you can replace firefox with the phantomjs browser, which is a headless browser (it has no window).
I don't know why, but using rsDriver(browser = "phantomjs") does not work for me. I found this post, which proposes starting the phantomjs browser with the wdman package:
library(wdman)
library(RSelenium)

# Start a phantomjs instance
rPJS <- wdman::phantomjs(port = 4680L)

# Is it alive?
rPJS$process$is_alive()

# Connect Selenium to it
remDr <- RSelenium::remoteDriver(browserName = "phantomjs", port = 4680L)

# Open a browser
remDr$open()

remDr$navigate("http://www.google.com/")

# Screenshot of the headless browser to check if everything is working
remDr$screenshot(display = TRUE)

# Don't forget to close the browser when you are finished!
remDr$close()
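One last point: depending on your setup, the phantomjs process started by wdman (and the Selenium server started earlier by rsDriver) may keep running after the browser itself is closed. The calls below are the cleanup steps I would try; treat them as a sketch.
# Stop the phantomjs process started by wdman
rPJS$process$kill()

# If you used rsDriver() earlier, also stop the Selenium server it started
rD$server$stop()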
Conclusion
Once you have understood the basics of RSelenium and how to select elements inside HTML pages, it is really easy to write a script to scrape data on the web. This post was a short example of scraping the product price on Aliexpress pages, but the script can be extended to scrape more data on each page, such as the name of the item, its rating, etc. It is even possible to automate this script to run daily in order to see price changes over time. As you can see, the possibilities are endless!