Nokogiri: Parse HTML & XML in Ruby

Learn how to use Nokogiri for high-performance HTML and XML parsing. Master data extraction using CSS selectors and XPath in your Ruby projects.

Justas Vitaitis

Key Takeaways

  • Nokogiri makes parsing and querying XML and HTML documents in Ruby straightforward and efficient.

  • Installation via gem install nokogiri is simple, but you may need to consider dependencies and Ruby version compatibility.

  • Use CSS selectors and XPath expressions to extract data from HTML content or structured XML.

Nokogiri is the definitive Ruby library for efficiently parsing and manipulating both XML and HTML documents. It wraps high-speed native C libraries, which makes it indispensable for web scraping, data extraction, and general structured data management in professional Ruby workflows.

What Is Nokogiri?

Nokogiri is a foundational Ruby gem engineered for the parsing and manipulation of XML and HTML documents. Its use of native code distinguishes it: Nokogiri functions as an interface that wraps industry-standard C libraries, primarily libxml2 and libxslt.

Such architecture grants Nokogiri superior speed and efficiency within the Ruby ecosystem, which makes it the preferred solution for performance-critical data processing tasks.

The processing of web page content and structured data feeds is a common requirement in Ruby development, particularly for:

  • Web scraping and data extraction. Systematically retrieving data from external web pages.
  • API development. Handling data payloads in XML or constructing web service responses.

Nokogiri provides a strong, well-documented API and boasts widespread community adoption. As a result, it has become the de facto standard for HTML and XML processing in Ruby projects and frameworks, even though it is a third-party gem rather than part of Ruby's standard library.

Installing Nokogiri Ruby Gem

Effective deployment of Nokogiri requires attention to both gem management and system prerequisites. The gem can be installed directly via the command line:

gem install nokogiri

For projects using Bundler (which is most Ruby projects), add this line to your Gemfile:

gem 'nokogiri'

Following this, execute bundle install to lock the dependency.

Managing Dependencies and Compatibility

There are two main things here to take note of:

  • System dependencies. Nokogiri relies on native extensions, so when the gem is compiled from source, some operating systems (particularly Linux distributions) need the development packages for the underlying libraries, such as libxml2-dev and libxslt-dev, installed first. If they are missing, compilation fails during gem installation. Recent Nokogiri releases ship precompiled native gems for common platforms, which avoids this step in many setups.
  • Ruby versioning. The Nokogiri version must be compatible with the Ruby runtime in use. Check the gem's page on RubyGems for the Ruby version each release requires before upgrading either one.


Parsing HTML With Nokogiri

Nokogiri processes HTML content by constructing a standard document tree model. The Nokogiri::HTML method is used for parsing.

Document Loading

Content can be loaded from standard strings or external URLs.

1. Loading From a String

require 'nokogiri'

html_string = "<html><body><h1>Page Header</h1><p class='intro'>Introduction.</p></body></html>"
doc = Nokogiri::HTML(html_string)

puts doc

2. Loading From a URL (Requires open-uri)

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("https://iproyal.com"))

puts doc

Note: for production web scraping, developers should use a dedicated HTTP client gem (like HTTParty or Faraday) to handle requests, error responses, and timeouts more reliably than the standard open-uri.

Data Extraction via XPath and CSS Selectors

The doc object enables data retrieval through two powerful querying mechanisms:

Method | Selector standard | Use case
.css() | CSS selectors (e.g., 'h1', '.class', '#id') | Simple, familiar querying, ideal for class and ID attributes
.xpath() | XPath expressions (e.g., '//h1', '//a/@href') | Complex querying, better for traversing hierarchical relationships and accessing specific attributes

Both methods return a NodeSet, which is an enumerable collection of matching elements.

HTML Extraction Example

require 'nokogiri'
require 'open-uri'

url = "https://iproyal.com"
doc = Nokogiri::HTML(URI.open(url))

puts "Page title: " + doc.css('title').text 

doc.css('p.intro').each do |p_node|
  puts "Intro paragraph: " + p_node.text.strip
end

links = doc.css('a').map { |a| a['href'] } 

Parsing XML With Nokogiri

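XML parsing uses Nokogiri::XML, which treats the input as XML rather than HTML: element names are case-sensitive, and no missing html or body elements are added. By default the parser still recovers from malformed markup and records the problems in doc.errors; strict parsing, which raises on the first error, has to be enabled explicitly.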

Loading XML Content

require 'nokogiri'

xml_string = <<~XML
<catalog>
  <book id="bk101">
    <title>XML Developer’s Guide</title>
    <author>Matthew Gambardella</author>
  </book>
</catalog>
XML

doc = Nokogiri::XML(xml_string)
puts doc

Structured Data Retrieval

XPath is often the preferred method for XML because it handles namespaces and complex hierarchical structures well:

require 'nokogiri'

xml_string = <<-XML
<library>
  <book id="1">
    <title>The Pragmatic Programmer</title>
    <author>David Thomas</author>
  </book>
  <book id="2">
    <title>Clean Code</title>
    <author>Robert Martin</author>
  </book>
</library>
XML

doc = Nokogiri::XML(xml_string)

doc.xpath('//book').each do |book_node|
  title  = book_node.at_xpath('title').text
  author = book_node.at_xpath('author').text
  id     = book_node['id']
  puts "[ID: #{id}] #{title} by #{author}"
end

Nokogiri vs Other Parsers

Nokogiri serves as the performance-optimized equivalent to parsing tools found in other programming languages:

Library | Language | Primary role | Key differentiator
Nokogiri | Ruby | HTML/XML parsing | High performance driven by native C libraries (libxml2).
BeautifulSoup | Python | HTML/XML parsing | Pure-Python implementation, valued for its tolerance of broken HTML.
Cheerio | JavaScript | HTML parsing (Node.js) | Provides a server-side API mirroring the familiar jQuery syntax.

Nokogiri is the definitive selection for Ruby developers requiring fast, reliable, and standards-compliant manipulation of XML and HTML documents. Its strong native architecture and comprehensive feature set integrate seamlessly into professional Ruby development projects.

Pros of Nokogiri

  • Good performance due to native parsing libraries.
  • Supports both XML and HTML documents.
  • Rich querying via CSS selectors and XPath.
  • Large community and widely used in Ruby projects.

Cons of Nokogiri

  • Installation may require handling of system dependencies for native extensions.
  • As with any powerful library, there is a learning curve for advanced features.

Conclusion

Nokogiri is the definitive, high-performance Ruby library for manipulating HTML and XML documents. It uses native C libraries for speed and provides methods like CSS selectors and XPath expressions for precise data extraction.

Mastering Nokogiri is essential for developers needing efficient and reliable parsing and processing of complex web pages and structured data feeds in the Ruby environment.
