How to Scrape Emails with Python: A Step-by-Step Guide


Learn how to scrape email addresses from websites using Python. This step-by-step guide covers tools, code examples, and tips for ethical email scraping.

Eugenijus Denisov

7 min read

Key Takeaways

  • Use regular expressions to extract email-like patterns from HTML.

  • Handle duplicates, malformed matches, and JavaScript-heavy pages with the right techniques.

  • Store your email addresses in .csv or .txt to use them easily later.

Finding email addresses on websites using code takes some technical know-how, the right tools, and an understanding of ethical and legal boundaries. In this guide, we’ll take you through everything: the basics, more advanced tips using Python, and legal considerations when extracting data.

Even if you’ve never done email scraping with Python before, you’ll be able to understand what’s going on as we’ll break it down as clearly as possible.

What Is Email Scraping?

Email scraping means using code to extract email addresses from websites. It’s a way to find contact information that’s published online, usually in plain text. The goal is to scrape email addresses by looking through the HTML content of sites, blogs, forums, or directories.

Here are some everyday use cases for email scraping:

  • Marketing teams use it to build outreach lists.
  • Recruiters use it to connect with potential candidates.
  • Researchers gather emails for academic or survey purposes.
  • Sales teams use it to find and contact leads from web pages.

While email scraping can have legitimate research or business applications, it’s also often associated with spam or unsolicited outreach, which is why extracting data is heavily regulated.

Is Email Scraping Legal?

You may think that all public data is safe to scrape, but that’s not entirely true. Even if emails are publicly visible, scraping them may still fall under data protection laws (like GDPR in the EU) or a site’s terms of service. Legality depends on jurisdiction and purpose, not just on whether the data is public.

However, one thing is for sure: if you’re scraping private or protected data, you will most likely get into legal trouble.

Before you start any email scraping project, consult a legal professional or, at the very least, check and follow the site’s terms of service, and make sure you comply with GDPR, CCPA, or CAN-SPAM where they apply.
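One quick technical check you can automate (separate from proper legal review) is the site’s robots.txt file, using Python’s built-in urllib.robotparser. Here’s a minimal sketch; the URL and user-agent string are placeholders:

from urllib.robotparser import RobotFileParser

# Checking robots.txt is a crawling courtesy, not a substitute
# for legal or terms-of-service review
rp = RobotFileParser()
rp.set_url('https://httpbin.org/robots.txt')  # placeholder site
rp.read()

# can_fetch(user_agent, url) returns True if the rules allow crawling
print(rp.can_fetch('MyEmailScraper/1.0', 'https://httpbin.org/'))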

Tools and Libraries You’ll Need

To scrape email addresses with Python, you only need a few tools. Most are simple to install and use:

  • requests. For making HTTP requests to websites.
  • BeautifulSoup. Helps you read and break down HTML from websites.
  • re. Lets you use regular expressions to match patterns like emails.

You can install the first two by running:

pip install requests beautifulsoup4

The re module is part of Python’s standard library, so there’s nothing extra to install.

These three cover most needs. But for more advanced tasks or dynamic content, you can also try Scrapy, which is a full-featured framework, and Selenium for handling JavaScript-heavy web pages.

You’ll often use regular expressions with these tools to extract email-like patterns, but be aware that regex isn’t perfect and may return false positives or miss some valid emails from time to time.
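For instance, the simple pattern happily matches strings that only look like email addresses, such as retina image filenames. Here’s a quick illustration (the HTML snippet is made up):

import re

pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

# 'logo@2x.png' is an image filename, not an email address,
# but it fits the pattern's general shape
html = '<img src="img/logo@2x.png"> Contact: sales@example.com'
print(re.findall(pattern, html))
# ['logo@2x.png', 'sales@example.com']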


Basic Email Scraper Using Python (With Regex)

Here’s a simple Python script to scrape email addresses from one page:

import requests
from bs4 import BeautifulSoup
import re

url = 'https://httpbin.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Method 1: Look for mailto links specifically
emails = set()
for link in soup.find_all('a', href=True):
    if 'mailto:' in link['href']:
        email = link['href'].replace('mailto:', '').strip()
        emails.add(email)

# Method 2: Search in all text content AND attributes
for tag in soup.find_all(True):
    # Check tag text
    if tag.string:
        found_emails = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', tag.string)
        emails.update(found_emails)

    # Check all attributes
    for attr_value in tag.attrs.values():
        if isinstance(attr_value, str):
            found_emails = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', attr_value)
            emails.update(found_emails)

# Method 3: Search the raw HTML (most comprehensive)
all_emails = set(re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', response.text))
emails.update(all_emails)

print("Found emails:")
for email in emails:
    print(email)

The Python script sends a request to the website, reads the page, looks for patterns using regular expressions, and then prints all found email addresses. It’s quick and works great for web scraping emails from simple pages.

Keep in mind that the regex pattern r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+' is only a practical shortcut for finding most email addresses. It matches the general structure of emails, but it’s not fully compliant with the official RFC-5322 standard.

If you’re interested in a slightly improved regular expression to use in place of the simpler regex, you can try this one: r'^[\w.+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$'. It’s better because:

  • The ^ and $ anchors force the entire string to match, which makes it well suited for validating candidates you’ve already extracted.
  • The (\.[A-Za-z0-9-]+)+ group requires at least one dot-separated domain label and rejects trailing dots, which the simpler pattern can pick up from surrounding punctuation.

However, the simpler regex still works well for the majority of everyday email formats and is easy enough for beginners to use effectively.
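Here’s a small sketch of how the two patterns differ in practice; the sample text is made up, and the loose pattern swallows the sentence-ending period:

import re

LOOSE = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
STRICT = re.compile(r'^[\w.+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$')

text = 'Write to contact@example.com. We reply fast.'

for raw in re.findall(LOOSE, text):
    print('raw match:', raw)            # 'contact@example.com.' (trailing dot)
    candidate = raw.rstrip('.')         # strip punctuation the loose pattern swallowed
    if STRICT.match(candidate):
        print('validated:', candidate)  # 'contact@example.com'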

Extract Emails From Multiple Pages or URLs

If you want to scrape email addresses from several pages, you can use a loop and save the results:

import requests
from bs4 import BeautifulSoup
import re
import csv

urls = [
    'http://example.com',  
    'https://httpbin.org/',  
    'https://www.python.org/community/forums/', 
]

all_emails = set()
results_per_site = {}  

for url in urls:
    try:
        print(f"Scraping: {url}")
        res = requests.get(url, timeout=10)
        soup = BeautifulSoup(res.text, 'html.parser')

        # Remove script and style elements
        for script_or_style in soup(["script", "style"]):
            script_or_style.decompose()

        text = soup.get_text()

        # Also search in the raw HTML (sometimes emails are in attributes)
        combined_text = text + " " + res.text

        # Find emails with regex
        found = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', combined_text)

        # Clean and validate emails
        site_emails = set()
        for email in found:
            email = email.lower().strip()
            # Basic validation - must have @ and . after @
            if '@' in email and '.' in email.split('@')[1]:
                site_emails.add(email)
                all_emails.add(email)

        # Store results for this site
        results_per_site[url] = site_emails

        if site_emails:
            print(f"  Found {len(site_emails)} email(s): {', '.join(site_emails)}")
        else:
            print(f"  No emails found")

    except Exception as e:
        print(f"  Error scraping {url}: {e}")
        results_per_site[url] = set()

# Save to text file
with open('emails.txt', 'w') as f:
    f.write(f"Total unique emails found: {len(all_emails)}\n")
    f.write("=" * 50 + "\n\n")

    # Write emails by site
    for url, emails in results_per_site.items():
        f.write(f"From {url}:\n")
        if emails:
            for email in sorted(emails):
                f.write(f"  - {email}\n")
        else:
            f.write("  - No emails found\n")
        f.write("\n")

    # Write all unique emails
    f.write("=" * 50 + "\n")
    f.write("All unique emails:\n")
    for email in sorted(all_emails):
        f.write(email + '\n')

# Save to CSV file with more information
with open('emails.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["email", "source_url"])  # headers

    for url, emails in results_per_site.items():
        for email in sorted(emails):
            writer.writerow([email, url])

# Also create a simple CSV with just unique emails
with open('unique_emails.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["email"])  # header
    for email in sorted(all_emails):
        writer.writerow([email])

# Print summary
print("\n" + "=" * 50)
print(f"Scraping complete!")
print(f"Total sites scraped: {len(urls)}")
print(f"Sites with emails found: {sum(1 for emails in results_per_site.values() if emails)}")
print(f"Total unique emails found: {len(all_emails)}")
print(f"\nResults saved to:")
print("  - emails.txt (detailed report)")
print("  - emails.csv (emails with source URLs)")
print("  - unique_emails.csv (just unique emails)")

if all_emails:
    print(f"\nEmails found: {', '.join(sorted(all_emails))}")

In this example, the results are saved to both .txt and .csv files, making them easy to share or reuse later. As before, the regex example is not fully RFC-5322 compliant but is good enough for scraping everyday email addresses.

Handling Edge Cases

Web scraping emails can sometimes run into issues. Here’s how to handle the trickier parts:

  • Duplicate emails. Use set() to keep only unique ones.
  • Malformed emails. Use better regular expressions to filter bad ones.
  • Irrelevant sections. Remove <script> and <style> tags before calling .get_text(), otherwise their contents may be included in the extracted text.
  • Validation. Use extra checks to see if an email format is valid.

For example, if you want to clean your emails, you could use:

def clean_emails(email_list):
    valid_emails = []
    for email in email_list:
        if re.match(r'^[\w\.\+\-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$', email):
            valid_emails.append(email)
    return list(set(valid_emails))

It removes bad entries and duplicates from your list. Again, the regex is only a practical shortcut here, as mentioned earlier.
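For example, with a made-up scraped list:

# A quick usage sketch with made-up addresses
scraped = ['john@example.com', 'john@example.com', 'broken@@site', 'jane@site.org.']
print(clean_emails(scraped))
# ['john@example.com'] - the duplicate, the double @, and the trailing dot are all dropped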

When to Use Scrapy or Selenium for Email Scraping

Sometimes requests and BeautifulSoup can’t handle complex sites, especially if they’re running heavy JavaScript. In those cases, Scrapy or Selenium can help.

Scrapy is excellent when you need to scrape email addresses at scale and you want built-in throttling and crawling. You can install it like so:

pip install scrapy
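As a rough sketch (not a complete project), a minimal email spider might look like this; the file name, spider name, and start URL are placeholders:

import re
import scrapy

# Minimal spider sketch - save as email_spider.py and run with:
#   scrapy runspider email_spider.py -o emails.csv
class EmailSpider(scrapy.Spider):
    name = 'emails'
    start_urls = ['https://httpbin.org/']  # placeholder URL

    def parse(self, response):
        # Same practical regex as before, applied to the raw response body
        found = set(re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', response.text))
        for email in found:
            yield {'email': email, 'source': response.url}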

Selenium, on the other hand, is a browser automation tool that helps when sites use JavaScript to load content, and you need to click buttons or scroll. You can run it in headless mode if you don’t want the browser window to open visibly.

Install Selenium with pip install selenium, and also ensure you have a browser and the matching webdriver available (or use selenium-manager/webdriver-manager to install the driver automatically).
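Here’s a minimal headless sketch, assuming Chrome and a recent Selenium 4 (which can fetch the matching driver itself); the URL is a placeholder:

import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')   # no visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://httpbin.org/')   # placeholder URL
    html = driver.page_source            # rendered HTML, including JS-inserted content
    emails = set(re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', html))
    print(emails)
finally:
    driver.quit()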

You can check out our tutorial on Selenium Stealth to help bypass anti-bot protection by making the automation look human.

How to Store and Export Scraped Emails

After you scrape email addresses, saving them comes next.

To save to .csv, use:

import csv

with open('emails.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Email Address'])
    for email in all_emails:
        writer.writerow([email])

To save to .txt, run:

with open('emails.txt', 'w') as f:
    for email in all_emails:
        f.write(email + '\n')

To print everything you’ve collected:

print("Scraped Email Addresses:")
for email in all_emails:
    print(email)

Now you can use the data, share it, or store it for later.

Conclusion

Email scraping with Python is valuable and useful as long as you do it right and remain compliant. Now you know how to use tools like requests, BeautifulSoup, and regex, how to loop through web pages, save your scraped data, and handle some common errors.

If you run into more complex tasks, you can rely on Selenium and Scrapy instead. Just make sure that you’re always on the right side of the law and you’re not scraping private or protected data.
