
Web Scraping With R: The Complete Guide for 2024

Justas Vitaitis


Web scraping is a widely used practice of developing automated scripts that visit a large number of URLs and extract important information from them. It’s been done with many programming languages, each of which has its own benefits and drawbacks.

R is normally used for data analysis, which makes it uniquely suited to the practice since a large part of web scraping is making use of the information you collect. While it doesn’t have as much community support as some other programming languages, there’s more than enough for most web scraping projects.

At the same time, R has plenty of data manipulation libraries, each of which can be immensely useful at the later stages of a web scraping project.

Which Is Better For Web Scraping, R or Python?

Python is likely the most popular language used in web scraping applications. There are a few reasons for its popularity:

  • Ease of use

Python is a high-level, object-oriented language, which makes it a lot simpler to work with than lower-level languages (such as C++). Python is relatively easy to learn as well, so it’s also a great choice for beginners.

  • Community support

Being both a popular language in general and a popular one for web scraping, Python has a lot of libraries tailor-made for data extraction and analysis. Additionally, you can find code samples for nearly any possible scenario.

For these and many other reasons, Python has often been considered the number one programming language for web scraping, with the exception of a few niche scenarios.

R, however, doesn’t fall far behind Python and, in some cases, may be even better. There are quite a few web scraping libraries available, and while they’re not as optimized as Python’s, R has many more data manipulation tools.

As such, it’s a trade-off, depending on the end goal of web scraping. If all you need is web scraping and no analysis, Python may be better for you. If you need both, R could be the better choice.

We can also compare two of the most popular libraries used in web scraping: BeautifulSoup (Python) and Rvest (R). Both have their own drawbacks and benefits:

  • BeautifulSoup

Highly versatile and great at handling HTML and XML documents. It’s well-documented, supported, and constantly updated. On the other hand, BeautifulSoup cannot fetch web pages on its own and relies on libraries such as Requests, Playwright, or Selenium to visit websites.

  • Rvest

Great at handling smaller websites while being slightly less flexible than BeautifulSoup. On the other hand, it has no additional requirements for web scraping, so you can do most of the work with just one package.
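To illustrate the single-package point, here’s a minimal Rvest sketch (the URL is just an example) that fetches and parses a page without any additional HTTP library:

library(rvest)

# read_html() downloads and parses the page in a single call
page <- read_html("https://iproyal.com")
html_text(html_nodes(page, "h1"))  # extract the text of every <h1> element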

In the end, Python is usually the preferred option for large web scraping projects that have a lot of moving parts and many dynamic websites. Additionally, it’s a great choice if all you need is web scraping itself.

On the other hand, R is better for smaller projects, such as static websites, especially if a developer is already familiar with the programming language.

Web Scraping Using The Rvest Package

Before you can even get started with Rvest, you’ll need R itself and an IDE. Luckily, there’s an excellent free IDE built specifically for the language (RStudio).

Once both are installed, open RStudio, follow the instructions, and start a new project (if necessary). Then, use the console to enter the command:

install.packages("rvest")

After the package installation is complete, click “File”, then “New File”, and create an “R Script”.

Start by loading the Rvest package:

library(rvest)

Now, we can start loading an HTML page. Let’s use the IPRoyal website for testing purposes:

url <- "https://iproyal.com"
webpage <- read_html(url)

All our script currently does is load the page, so running it will essentially do nothing. We’ll need to target some HTML elements first:

links <- webpage %>% html_nodes("a") %>% html_attr("href")
print(links)

Our script now takes the HTML content stored in “webpage” and selects every element matching the CSS selector “a” (in other words, every link). It then extracts the “href” attribute from each of the selected elements.

Take note that R uses the “%>%” (pipe) operator, which lets you perform a sequence of actions on a value, passing the result of each step into the next. It’s equivalent to defining a variable that holds every element with the tag “a” and then another variable that extracts the “href” attribute from the first one.
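For illustration, here’s the same extraction written without the pipe, using an intermediate variable:

# The same extraction without the "%>%" operator
link_nodes <- html_nodes(webpage, "a")   # select every <a> element
links <- html_attr(link_nodes, "href")   # pull the "href" attribute from each one
print(links)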

Finally, we can run the code as:

library(rvest)
url <- "https://iproyal.com"
webpage <- read_html(url)
links <- webpage %>% html_nodes("a") %>% html_attr("href")
print(links)

An important caveat is that the default “Run” function in RStudio works a bit differently than in other IDEs: it runs the currently selected line. So, you can use “CTRL + A” to mark all lines and click “Run”, use the shortcut “CTRL + SHIFT + S”, or use the “Source” button.

We can now tweak our basic configuration for other HTML elements. For example, to extract titles, we’d use:

titles <- webpage %>% html_nodes("h1") %>% html_text()

Replacing the print function’s argument with “titles” will output the text of every H1 heading on the IPRoyal home page.

Finally, you can also extract image URLs from the HTML content in a similar fashion:

images <- webpage %>% html_nodes("img") %>% html_attr("src")

Replacing the print function’s argument with “images” will output the “src” attribute of every element with the “img” tag. Since image URLs are usually stored in “src”, you’ll get a long list of image URLs.
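If you want to keep the scraped results for later analysis, one approach (the file name here is just an example) is to write them out to a CSV file:

# Collect the scraped image URLs and save them for later analysis
images <- webpage %>% html_nodes("img") %>% html_attr("src")
write.csv(data.frame(image_url = images), "images.csv", row.names = FALSE)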

Web Scraping Using Rcrawler

Rcrawler is a more advanced library that makes it easy to run through many pages at once. Instead of working on a single target web page, Rcrawler automatically follows every link it finds and collects data from all of them, so you can target an entire website in one line.

You’ll need to install the library once again:

install.packages("Rcrawler")

Then, import the Rcrawler library:

library(Rcrawler)

And now all you need is a single line to scrape an entire website:

Rcrawler(Website = "https://iproyal.com", 
         ExtractXpathPat = c("//h1", "//p"), 
         no_cores = 4, no_conn = 4)

We set our Rcrawler to run through the IPRoyal website. It will then use the XPath expressions to go through the HTML code and find every <h1> and <p> element.

Additionally, you can set various other optimization settings. Our Rcrawler will use four CPU cores to process the static web pages faster and open up to four connections in parallel.
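Depending on your Rcrawler version, you can also tune the crawl itself. The sketch below assumes the commonly documented MaxDepth and RequestsDelay parameters, which limit how many link levels are followed and add a pause between requests:

Rcrawler(Website = "https://iproyal.com",
         ExtractXpathPat = c("//h1", "//p"),
         no_cores = 4, no_conn = 4,
         MaxDepth = 2,        # only follow links two levels deep
         RequestsDelay = 1)   # wait one second between requests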

Finally, Rcrawler creates a folder and CSV file by default. The folder stores all of the HTML code downloaded as files. In the CSV file you’ll find the output that’s generated according to your web scraping settings.

There are even more benefits to Rcrawler. Even if you cancel the process midway through, all of the previously collected data is stored within the CSV file, so you can easily interrupt the web scraping process whenever you want.
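As documented for recent Rcrawler versions, the crawl also exposes its results inside your R session: an INDEX data frame listing every crawled page and a DATA list with the extracted content. A minimal sketch of inspecting them:

# After the crawl finishes, Rcrawler places its results in the global environment
head(INDEX)    # one row per crawled page
length(DATA)   # extracted content, one list entry per page
DATA[[1]]      # <h1> and <p> content extracted from the first page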

Web Scraping JavaScript-rendered Content

Most modern websites use JavaScript alongside their HTML code to load content dynamically. As such, without JavaScript rendering, some HTML elements or other pieces of information may not appear in the downloaded page at all.

R has a library, RSelenium, that can drive a real browser, following in the footsteps of Python’s Selenium:

install.packages("RSelenium")

Once the package is installed, we’ll need to import it and create a browser instance:

library(RSelenium)
rD <- rsDriver(browser = "chrome", port = 4545L)
remDr <- rD$client

We’ll now visit the IPRoyal website to download the HTML code through the browser:

remDr$navigate("https://iproyal.com")
page_source <- remDr$getPageSource()[[1]]

You can then use the Rvest package or any other data manipulation library to make use of the page source. Note that the page source includes the full rendered HTML, so you’ll still need to find the required CSS selector or XPath to parse the data.
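For example, you can feed the rendered source straight back into Rvest, then shut down the browser when you’re done:

library(rvest)

# Parse the browser-rendered HTML with Rvest
rendered <- read_html(page_source)
titles <- rendered %>% html_nodes("h1") %>% html_text()
print(titles)

# Close the browser and stop the Selenium server once you're done
remDr$close()
rD$server$stop()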

Author

Justas Vitaitis

Senior Software Engineer

Justas is a Senior Software Engineer with over a decade of proven expertise. He currently holds a crucial role in IPRoyal’s development team, regularly demonstrating his profound expertise in the Go programming language, contributing significantly to the company’s technological evolution. Justas is pivotal in maintaining our proxy network, serving as the authority on all aspects of proxies. Beyond coding, Justas is a passionate travel enthusiast and automotive aficionado, seamlessly blending his tech finesse with a passion for exploration.
