How to Scrape Job Postings from Indeed: A Step-by-Step Guide

Marijus Narbutas
Key Takeaways
- Scrape ethically to avoid overloading or crashing servers.
- Use advanced scraping setups with tools like Selenium, Playwright, or specialized platforms like Octoparse to gather job posting data.
- Stay hidden with residential proxies to avoid IP bans while you scrape.
Web scraping is the process of pulling information from websites automatically without spending hours reading each page manually.
One major use case is tracking job details to identify hiring trends and gather other HR-related insights. Companies and researchers watch job listings to study job markets. Students sometimes check job listings to find internships. All of that could be valuable data to someone working in the hiring field.
Indeed is one of the most popular places for such data collection. It’s full of job details from everywhere. While free for human browsing, Indeed strictly protects its data against automated bots, meaning you will need strong stealth techniques or access to their official API to collect this data successfully.
Is It Legal to Scrape Indeed Job Postings?
In the U.S., web scraping lives in a grey area. It’s not downright illegal, but it can get you in some trouble. When you scrape Indeed, you might face legal issues if you break the site’s rules.
Indeed’s Terms of Service have anti-scraping rules in place. They don’t want bots extracting or aggregating job data without permission in violation of their user agreements. If you ignore that, you risk bans or worse.
Websites like Indeed use specific tools to spot scraping attempts. If they catch your bot sifting through their job details, they might block your IP very quickly. Other times, they may quietly flag your account without any warning. It’s only a matter of time.
While residential proxies are essential for solving the IP banning problem, you’ll also need specialized anti-bot bypass tools or fortified headless browsers to handle the advanced Cloudflare and DataDome protections that Indeed employs.
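As a rough sketch, routing traffic through a residential proxy usually comes down to building a proxies mapping and passing it with each request. The host, port, and credentials below are placeholders, not a real provider endpoint:

```python
def build_proxy(user, password, host, port):
    """Return a proxies mapping usable by requests-style HTTP clients."""
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

# Hypothetical residential proxy credentials; replace with your provider's.
proxies = build_proxy("proxy_user", "proxy_pass", "resi.example-provider.com", 8000)

# With curl_cffi you would then pass this mapping on each call:
#   session = creq.Session(impersonate="chrome")
#   resp = session.get(url, proxies=proxies)
```

Most providers rotate the exit IP behind a single gateway hostname, so the same mapping can serve an entire scraping session.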
Tools and Methods to Scrape Indeed Job Listings
There are several ways to do Indeed web scraping and collect job listing data. Here’s a quick overview:
- BeautifulSoup and requests are great for simple websites, but they will fail on Indeed because they cannot execute JavaScript or bypass its advanced anti-bot protections.
- Headless browsers like Selenium or Playwright are necessary to render Indeed’s JavaScript, but you must use stealth modifications (like undetected-chromedriver) so the site doesn’t instantly recognize your browser as a bot.
- Octoparse is a visual scraper tool that’s great if you can’t code but still want to extract job data. It also has a free and paid version.
- Third-party scraper APIs can bypass the security blocks for you and deliver clean job details without you having to build the crawler yourself.
- Purchasing datasets is another option if scraping is not a necessity for you.
Big companies usually use smart bots, multiple IP addresses, and a handful of tools to scrape Indeed and gather or update thousands of job details simultaneously.
Comparing Extraction Methods: APIs, No-Code Tools, and Custom Scripts
Choosing the right approach for your project depends heavily on both your technical expertise and the scale of your operation. Evaluating the available methods therefore ensures you pick the most efficient route for collecting the job data you need.
Writing your own custom scripts gives you complete flexibility over how you parse and store the data, though your extraction success will still depend entirely on your ability to bypass Indeed’s evolving anti-bot defenses continuously. It demands significant programming knowledge and forces you to handle maintenance manually whenever the target website updates its internal structure.
Platforms like Octoparse provide a highly visual interface that simplifies the entire workflow, and they’re great for users with zero coding experience who want to gather information quickly. On the downside, these no-code solutions frequently come with restrictive paywalls, and they might lack the deep customization necessary for highly complex Indeed scraping tasks.
Utilizing a dedicated scraping API streamlines the process by delivering pre-structured data directly to your database. It typically handles all the proxy rotation and CAPTCHA solving behind the scenes. But these premium services can become quite expensive at scale, so they might not fit within the budget of smaller projects.
Evaluating these tradeoffs allows you to align the solutions better with your business objectives.
Step-by-Step Guide to Scraping Indeed Jobs
Now, let’s get down to brass tacks. Here’s a simple guide on how to scrape Indeed using Python.
Step 1: Set Up Your Environment
Install these libraries before you start coding:
pip install requests
pip install beautifulsoup4
pip install pandas
pip install curl-cffi
Step 2: Write a Basic Indeed Scraper
Here’s a basic script that demonstrates the fundamental logic of parsing job data, though you will quickly find that Indeed’s security blocks standard requests like this, meaning you'll need to upgrade to stealth tools to run it successfully. This specific code snippet will scrape software engineer jobs in New York City from Indeed’s search page:
import pandas as pd
import time
import random
from bs4 import BeautifulSoup
from curl_cffi import requests as creq

QUERIES = [
    "software+engineer",
    "backend+developer",
    "frontend+developer",
    "python+developer",
    "fullstack+engineer",
]
LOCATION = "New+York%2C+NY"

def random_headers():
    chrome_versions = ["123.0.0.0", "124.0.0.0", "125.0.0.0", "126.0.0.0"]
    v = random.choice(chrome_versions)
    return {
        "User-Agent": f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      f"AppleWebKit/537.36 (KHTML, like Gecko) "
                      f"Chrome/{v} Safari/537.36",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-US,en;q=0.8"]),
    }

def scrape_indeed_jobs():
    job_list = []
    seen_urls = set()
    session = creq.Session(impersonate="chrome")
    for query in QUERIES:
        url = f"https://www.indeed.com/jobs?q={query}&l={LOCATION}&start=0"
        print(f"Scraping [{query}]: {url}")
        resp = session.get(url, headers=random_headers())
        if resp.status_code != 200:
            print(f"  Got status {resp.status_code}, skipping")
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        cards = soup.find_all("div", class_="job_seen_beacon")
        print(f"  Found {len(cards)} cards")
        if not cards:
            print("  No cards found — likely blocked or selectors changed.")
            print(f"  Response length: {len(resp.text)} chars")
        for card in cards:
            h2 = card.find("h2", class_="jobTitle")
            a = h2.find("a", href=True) if h2 else None
            job_title = a.get_text(strip=True) if a else None
            job_url = f"https://www.indeed.com{a['href']}" if a else None
            if job_url in seen_urls:
                continue
            seen_urls.add(job_url)
            comp = card.find("span", {"data-testid": "company-name"})
            company = comp.get_text(strip=True) if comp else None
            loc = card.find("div", {"data-testid": "text-location"})
            location = loc.get_text(strip=True) if loc else None
            snippet = card.find("div", {"data-testid": "jobsnippet_footer"})
            summary = snippet.get_text(" ", strip=True) if snippet else None
            job_list.append({
                "Job Title": job_title,
                "Company": company,
                "Location": location,
                "Summary": summary,
                "Query": query.replace("+", " "),
                "URL": job_url,
            })
        # Randomized pause between queries to crawl politely.
        time.sleep(random.uniform(5, 10))
    return job_list

if __name__ == "__main__":
    jobs = scrape_indeed_jobs()
    df = pd.DataFrame(jobs)
    df.to_csv("indeed_job_postings.csv", index=False)
    print(f"Scraped {len(jobs)} unique jobs and saved to indeed_job_postings.csv")
If you need different job descriptions or job positions in some other locations, you will have to adjust the code to fit your scraping needs. If you want to scrape other platforms like Glassdoor, you cannot simply reuse this code; you’ll need to write an entirely new script tailored to that site's unique HTML structure and specific anti-bot defenses.
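To adapt the script to other keywords and locations without hand-encoding query strings, you can build the search URL with the standard library. `urlencode` applies the same `+` and `%2C` escaping used in the hardcoded constants above:

```python
from urllib.parse import urlencode

def build_search_url(query, location, start=0):
    """Build an Indeed search URL from plain-text query and location."""
    params = urlencode({"q": query, "l": location, "start": start})
    return f"https://www.indeed.com/jobs?{params}"

url = build_search_url("software engineer", "New York, NY")
print(url)  # https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY&start=0
```

This keeps the QUERIES list readable ("software engineer" instead of "software+engineer") and removes a whole class of manual encoding mistakes.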
Extracting Data From Embedded JSON
When inspecting the page source, you'll often find that modern websites load their content through embedded JSON objects inside the HTML. Extracting that JSON directly is highly efficient, and parsing it is significantly more reliable than navigating complex HTML trees.
Once you've successfully used stealth tools to bypass the anti-bot protections and retrieve the actual page source, you can locate the script tag containing the application state and load it into a structured format for immediate access to the data. Here's a quick example demonstrating how you can parse embedded data:
import json
import re
import pandas as pd
import time
import random
from bs4 import BeautifulSoup
from curl_cffi import requests as creq

QUERIES = [
    "software+engineer",
    "backend+developer",
    "frontend+developer",
    "python+developer",
    "fullstack+engineer",
]
LOCATION = "New+York%2C+NY"

def random_headers():
    chrome_versions = ["123.0.0.0", "124.0.0.0", "125.0.0.0", "126.0.0.0"]
    v = random.choice(chrome_versions)
    return {
        "User-Agent": f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      f"AppleWebKit/537.36 (KHTML, like Gecko) "
                      f"Chrome/{v} Safari/537.36",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-US,en;q=0.8"]),
    }

def extract_jobs_from_json(html):
    """Extract job data from embedded JSON in script tags."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for script in soup.find_all("script"):
        if not script.string:
            continue
        # Look for common Indeed data variable patterns.
        match = re.search(
            r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*({.+?})\s*;',
            script.string,
            re.DOTALL,
        )
        if not match:
            continue
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        # Navigate to the job results — structure may vary.
        results = (
            data.get("metaData", {})
            .get("mosaicProviderJobCardsModel", {})
            .get("results", [])
        )
        for r in results:
            jobs.append({
                "Job Title": r.get("title"),
                "Company": r.get("company"),
                "Location": r.get("formattedLocation"),
                "Summary": r.get("snippet"),
                "URL": f"https://www.indeed.com/viewjob?jk={r.get('jobkey')}" if r.get("jobkey") else None,
            })
        break  # found our data, no need to check more scripts
    return jobs

def extract_jobs_fallback(html):
    """Fallback: dump all script tags to find the right JSON blob."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        if not script.string or len(script.string) < 5000:
            continue
        # Search for any large JSON object assigned to a window variable.
        match = re.search(r'window\.[a-zA-Z_.]+\s*=\s*({.+})\s*;', script.string, re.DOTALL)
        if match:
            try:
                data = json.loads(match.group(1))
                # Save for inspection so you can map the correct keys.
                with open("indeed_raw_json.json", "w", encoding="utf-8") as f:
                    json.dump(data, f, indent=2, ensure_ascii=False)
                print("  Dumped raw JSON to indeed_raw_json.json — inspect to find job keys")
                return data
            except json.JSONDecodeError:
                continue
    return None

def scrape_indeed_jobs():
    job_list = []
    seen_urls = set()
    session = creq.Session(impersonate="chrome")
    for query in QUERIES:
        url = f"https://www.indeed.com/jobs?q={query}&l={LOCATION}&start=0"
        print(f"Scraping [{query}]: {url}")
        resp = session.get(url, headers=random_headers())
        if resp.status_code != 200:
            print(f"  Got status {resp.status_code}, skipping")
            continue
        jobs = extract_jobs_from_json(resp.text)
        if not jobs:
            print("  Primary JSON extraction found nothing, trying fallback...")
            extract_jobs_fallback(resp.text)
            print("  Check indeed_raw_json.json and update the key paths.")
            continue
        print(f"  Extracted {len(jobs)} jobs from JSON")
        for job in jobs:
            if job["URL"] in seen_urls:
                continue
            seen_urls.add(job["URL"])
            job["Query"] = query.replace("+", " ")
            job_list.append(job)
        time.sleep(random.uniform(5, 10))
    return job_list

if __name__ == "__main__":
    jobs = scrape_indeed_jobs()
    df = pd.DataFrame(jobs)
    df.to_csv("indeed_job_postings_json.csv", index=False)
    print(f"Scraped {len(jobs)} unique jobs and saved to indeed_job_postings_json.csv")
While it provides a cleaner path to the job data and avoids the extreme fragility of parsing changing HTML classes, keep in mind that websites periodically rename these internal JavaScript variables, meaning your regular expressions will still require occasional maintenance.
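To see the mechanics in isolation, here is the same regex applied to a synthetic page. The embedded blob below is invented for illustration; real Indeed payloads are far larger and their key structure may differ:

```python
import json
import re

# Synthetic HTML imitating the embedded-state pattern; not a real Indeed page.
html = """
<script>
window.mosaic.providerData["mosaic-provider-jobcards"] = {"metaData":
  {"mosaicProviderJobCardsModel": {"results": [{"title": "Data Engineer",
  "company": "Acme Corp", "jobkey": "abc123"}]}}};
</script>
"""

# The lazy quantifier stops at the first closing brace followed by a semicolon.
match = re.search(
    r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*({.+?})\s*;',
    html,
    re.DOTALL,
)
data = json.loads(match.group(1))
results = data["metaData"]["mosaicProviderJobCardsModel"]["results"]
print(results[0]["title"])  # Data Engineer
```

Once the blob parses cleanly, every field is a plain dictionary lookup rather than a fragile CSS selector.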
Step 3: Handle Pagination
Each results page holds around 15 job positions. You can continue scraping through the pages by incrementing the start parameter in the request URL, but keep in mind that Indeed caps search results at 1,000 jobs (about 66 pages) per query, meaning you'll need to use narrower search filters to scrape larger datasets.
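A pagination loop can be sketched by stepping the start offset. The step of 10 and the 1,000-result ceiling below reflect Indeed's commonly observed behavior and may change:

```python
def page_offsets(max_results=1000, step=10):
    """Return the start offsets for paginated search requests."""
    return list(range(0, max_results, step))

offsets = page_offsets(50)
print(offsets)  # [0, 10, 20, 30, 40]

# Each offset then plugs into the search URL:
#   f"https://www.indeed.com/jobs?q={query}&l={LOCATION}&start={offset}"
```

In practice you would also stop early once a page returns no cards, since hammering empty pages only increases your chance of getting flagged.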
At some point, Indeed also began showing a login screen from page 2 onwards. Scraping behind a login requires accepting the Terms of Service, which means doing so would breach them. We recommend consulting a legal professional before engaging in any scraping.
Step 4: Extract Key Fields
Here you should include the data fields you need. You’ll only get as much data as you requested. You can pull job details like:
- Job position title
- Company name
- Location
- Short summary snippet
Extracting the full job description, however, requires writing additional code to visit each job's specific URL.
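A follow-up parser for a job detail page might look like this. The `jobDescriptionText` container id is commonly seen on Indeed's view-job pages, but treat it as an assumption and verify it against the live markup; the HTML below is a made-up sample:

```python
from bs4 import BeautifulSoup

def parse_description(html):
    """Pull the full description text out of a job detail page."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", id="jobDescriptionText")
    return box.get_text(" ", strip=True) if box else None

# Made-up detail-page snippet for illustration.
sample = '<div id="jobDescriptionText"><p>Build data pipelines.</p></div>'
print(parse_description(sample))  # Build data pipelines.
```

You would call this once per job URL collected earlier, keeping the same randomized delays between detail-page requests.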
Step 5: Export Results
Use CSV, JSON, or any other format that works for you to save your job listing data cleanly. It's worth the extra effort, because your data is only as useful as it is readable.
Exporting your gathered data into a structured JSON format preserves the hierarchical nature of the information, which makes it highly compatible with modern databases or web applications. You can accomplish it easily with pandas by using the df.to_json('indeed_job_postings.json', orient='records') command, and you'll have a neatly organized file ready for further processing.
Furthermore, cleaning your dataset involves removing duplicate entries and standardizing text fields, so you guarantee the accuracy of analytics. For example, you might normalize salary ranges into a consistent currency format, or you could filter out postings that lack critical information.
As a result, it becomes a powerful asset for visualizing hiring trends across different regions, and it allows HR teams to identify the most frequently requested skills within their specific industry.
Taking the time to structure your output properly ultimately maximizes the value of your web scraping efforts.
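The cleaning step described above can be sketched with the standard library alone. The field names match the scraper output earlier, and the normalization shown (URL dedup plus whitespace collapsing) is deliberately minimal:

```python
def clean_jobs(rows):
    """Drop duplicate URLs and collapse stray whitespace in text fields."""
    seen, cleaned = set(), []
    for row in rows:
        url = row.get("URL")
        if url in seen:
            continue
        seen.add(url)
        cleaned.append({
            k: " ".join(v.split()) if isinstance(v, str) else v
            for k, v in row.items()
        })
    return cleaned

rows = [
    {"Job Title": "Dev  Ops  Engineer", "URL": "https://example.com/1"},
    {"Job Title": "DevOps Engineer", "URL": "https://example.com/1"},  # duplicate
]
print(clean_jobs(rows))  # one row, title normalized to "Dev Ops Engineer"
```

Richer passes, such as normalizing salary ranges into one currency, would slot into the same loop.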
Tips for Staying Undetected
If you scrape Indeed too fast and too much, you’ll get banned. There are some tricks, however, to help you stay under the radar:
- Use residential proxies. They make your bot look like a normal user since the traffic comes from a legitimate home network.
- Crawl politely. Slow down between requests when collecting job details to prevent server overload.
- Rotate user agents and IPs. Professionals who scrape thousands of job positions daily do that, along with advanced stealth tools, to avoid getting flagged or banned.
Some more advanced Indeed scrapers even randomize patterns to look more human when they scrape Indeed for massive job detail gathering operations.
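Polite crawling largely comes down to bounded random delays. A tiny helper like this keeps request timing irregular; the bounds are arbitrary examples, not recommended values:

```python
import random

def polite_delay(base=5.0, jitter=5.0):
    """Return a randomized wait time in seconds between requests."""
    return base + random.uniform(0, jitter)

delay = polite_delay()
print(round(delay, 1))  # somewhere between 5.0 and 10.0
# In the scraper loop you would call time.sleep(polite_delay()).
```

Fixed intervals are one of the easiest bot signatures to detect, which is why even this small amount of jitter matters.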
Conclusion
Scraping job positions from Indeed can be useful for gathering job market intelligence. It’s great for market research, trend tracking, and HR teams that need to spot promising openings before anyone else.
But you have to be smart about it. Stay hidden, use the right tools, and innovate to overcome new anti-scraping measures that are constantly being deployed by the targets. When you set up a good system, you can scrape Indeed and other platforms smoothly without getting slammed by bans.
FAQ
How do I handle CAPTCHAs when scraping Indeed?
You'll inevitably encounter automated security checks, but instead of relying on traditional CAPTCHA-solving services, the most effective workaround is using premium residential proxies or heavily fortified stealth browsers that prevent the challenges from triggering in the first place.
Additionally, mimicking human behavior by adding random delays between your requests significantly reduces the likelihood of tripping these defensive mechanisms.
High-quality proxy infrastructure is practically mandatory for uninterrupted access when scraping Indeed at any scale.
Why is my IP getting blocked or rate-limited?
Websites monitor incoming traffic for unnaturally rapid request patterns, and they'll immediately restrict access if they detect a single IP address making hundreds of connections simultaneously.
You must route your traffic through a reliable proxy network, since rotating your IP hides your automated activities effectively. Proper rate limiting on your end prevents these temporary bans from derailing your project completely.
How can I detect blocked or partial responses?
While monitoring HTTP status codes is helpful, advanced firewalls often return a 200 OK even for a block page. You must explicitly check the HTML for challenge keywords (like “Verify you are human”) or verify the presence of core data elements to confirm a successful scrape.
Furthermore, implementing specific checks for known error messages or missing core HTML elements helps you identify when the server is serving a restricted version of the site.
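A block check following that advice might scan the body both for challenge keywords and for the data container you expect. The keyword list is illustrative, not exhaustive:

```python
CHALLENGE_MARKERS = ["verify you are human", "captcha", "access denied"]

def looks_blocked(html):
    """Heuristic: flag challenge pages even when the status code is 200."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    # A real results page should contain the job-card container class.
    return "job_seen_beacon" not in lowered

print(looks_blocked("<h1>Verify you are human</h1>"))  # True
print(looks_blocked('<div class="job_seen_beacon">...</div>'))  # False
```

Running this check on every response lets you pause, rotate proxies, or alert yourself before silently saving empty data.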
What should I do if my CSS selectors stop working?
Frontend developers frequently update their website's layout, which breaks the hardcoded selectors within your extraction script. Because Indeed uses dynamically generated class names that change with every site update, you should update your code to rely on stable data-testid attributes or bypass the HTML entirely by extracting the embedded JSON data.
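Selecting on stable data-testid attributes instead of generated class names looks like this; the sample markup is invented for illustration:

```python
from bs4 import BeautifulSoup

html = '<span class="css-x1y2z3" data-testid="company-name">Acme Corp</span>'
soup = BeautifulSoup(html, "html.parser")

# The random-looking class changes between deploys; the testid tends not to.
node = soup.find("span", {"data-testid": "company-name"})
print(node.get_text(strip=True))  # Acme Corp
```

The same attribute-filter pattern is already used in the scrapers above for company name, location, and snippet fields.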
How can I scrape JavaScript-loaded content?
Many modern platforms rely heavily on client-side rendering, so utilizing a headless browser like Selenium allows you to execute the necessary JavaScript before parsing the HTML. You can also intercept the background network requests to find the direct JSON data source, which often yields faster results than simulating a full browser environment.
How can I stay updated when Indeed changes its structure?
Maintaining an automated testing suite that runs your script against a known static page alerts you instantly whenever the extraction logic fails. Furthermore, checking developer communities and web scraping forums provides valuable early warnings about major platform updates, so you can adapt your code before your Indeed scraping project grinds to a halt.