Proxies for Large Language Model (LLM) Training: A Complete Guide
Find out how proxies are used to train large language models (LLMs) such as ChatGPT and Claude. Start collecting data for your models today.

Marijus Narbutas
Key Takeaways
- LLMs use deep learning neural networks to process textual data and discover important patterns without human intervention.
- Machine learning algorithms, and neural networks especially, require enormous volumes of data to start producing reasonably accurate results.
- The amount of data required can only be practically acquired through web scraping and the use of proxies.
- Proxies are essential for web scraping to retain long-term access to diverse data. In turn, proxies are the backbone of LLMs.
LLMs have become all the rage ever since ChatGPT launched. While large AI companies continue to compete to bring the largest LLMs to market, there are plenty of smaller players running similar models for more niche tasks. There are even prebuilt open source models that enthusiasts can run on their own machine and tweak to their liking.
Under the hood, every large language model is a machine learning model that has been tuned to perform extremely well on natural language. To get all the impressive results we have been seeing, however, machine learning models need data – and lots of it.
What Are Large Language Models?
Large Language Models (LLMs) are a type of machine learning algorithm, belonging to the deep learning family. They are neural networks with numerous layers placed on top of each other.
Regular machine (or statistical) learning models require humans to engineer features, deciding which attributes of the data are relevant.
For example, to predict house prices, an engineer might choose features like square footage, number of bedrooms, and neighborhood. But there may be other and possibly even more important features, such as proximity to good schools.
Issues arise when applying classic machine learning models to real-world problems such as spam detection. Machine learning engineers must define what is important for spam detection (e.g., the ratio of uppercase to lowercase letters, or previous interactions between the two email addresses), but doing so can lead them to miss other important features (e.g., word frequency counts).
In deep learning, the machine learning algorithm attempts to understand the features of datasets on its own. Language, for example, has certain rules and usage features that are nearly identical across numerous texts. Given enough data, an algorithm can detect patterns, ranging from surface-level (e.g., syntax) to deep understanding (e.g., being able to detect unique word usage).
Large Language Models generally use the transformer architecture to perform analytical and generative tasks. As mentioned above, they receive various forms of text as training input.
Each text is broken down into tokens (roughly the length of a word). These tokens are then mapped to multidimensional vectors. Over time, words converge close to each other based on features such as implied qualities – the word “king” implies the gender “male”. On the other hand, “king” likely has a low semantic connection to the word “table”.
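To make the “words as vectors” idea concrete, here’s a minimal sketch using made-up 4-dimensional vectors (real embeddings have hundreds or thousands of learned dimensions, and these values are invented purely for illustration). Cosine similarity is a common way to measure how close two word vectors are:

```python
import numpy as np

# Toy 4-dimensional "embeddings" – the values are invented for illustration only;
# real models learn vectors with hundreds or thousands of dimensions.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "male":  np.array([0.8, 0.9, 0.0, 0.2]),
    "table": np.array([0.1, 0.0, 0.9, 0.7]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 = similar, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["male"]))   # high – related meanings
print(cosine_similarity(embeddings["king"], embeddings["table"]))  # low – unrelated meanings
```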
The transformer architecture, in simple terms, allows each word to “look at” every other word in the sequence and gather relevant information from them.
Over time, the transformer will incorporate lots of contextual information, improving the accuracy of the output and overall understanding. So, for example, the LLM may understand that the word “bank” has been used in the context of bodies of water, so it’s likely a “river bank”.
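The mechanism behind that “looking at every other word” is attention. Below is a bare-bones, single-head sketch in NumPy – no learned weight matrices or multi-head machinery, just enough to show how each token’s output becomes a weighted mix of every token in the sequence:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head attention where queries, keys, and values are all the raw token
    vectors; real transformers use learned projection matrices for each of them."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)   # how strongly each token attends to every other token
    weights = softmax(scores)       # each row sums to 1
    return weights @ x              # each output is a weighted mix of all token vectors

# Three "tokens", each represented by an arbitrary 4-dimensional vector.
tokens = np.array([
    [0.2, 0.1, 0.9, 0.4],
    [0.8, 0.7, 0.1, 0.0],
    [0.3, 0.2, 0.8, 0.5],
])
print(self_attention(tokens))
```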
LLMs can reach their massive scale because they don’t require data labeling. As they simply predict the next word in a sequence, the actual next word serves as the answer. Any text becomes training data without any annotation requirements.
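In practice, “any text becomes training data” means the token sequence itself can be split into (context, next token) pairs with no human labelling at all – a rough sketch:

```python
def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Turn a token sequence into (context, target) training examples."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in next_token_pairs(["the", "cat", "sat", "on", "the", "mat"]):
    print(context, "->", target)
```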
When compared to classical machine learning, however, LLMs are significantly more data-hungry. After all, they have to find the relevant features of something as complex as language all on their own.
So, model training requires immense data collection capabilities: LLMs aim to learn an inherently complex subject while also using a highly data-inefficient algorithm.
Importance of Data Diversity
Training data diversity is a key requirement for any large-scale LLM. While one of the benefits of deep learning is that models learn and test on their own, it’s difficult to intervene and steer them in a particular direction. The dataset decides how the model learns.
For example, you could likely train a model on just the works of Shakespeare (or Shakespearean English). It might become remarkably good at understanding and producing works in that vernacular, but it would struggle to understand modern English.
Additionally, each text presents some particular viewpoint or philosophical position. While these may be more subtle for works of literature, political texts, for example, can exhibit significant bias. Failing to control for these factors can lead to a biased LLM.
LLM training can also surface implicit biases that are culturally embedded deep within texts. Training an LLM on works in the public domain (not copyrighted, usually works published before the early 20th century) might inadvertently cause it to reflect the cultural sensibilities of those earlier periods.
Remember that for an LLM, text is everything. They will learn the features and meanings from the works they parse. If the works themselves hold significant bias, that will bleed into the LLM.
What Is a Proxy and How Does It Help AI Systems?
A proxy is another device that works as an internet relay, taking requests from the source and forwarding them to the destination. In most cases, the proxy will present itself as the source of the request, effectively hiding the true originator.
While proxies are sometimes used for privacy and security purposes, in many cases, VPNs are the better option. They’re easier to set up, generally have better encryption, and work in a nearly identical fashion.
Where proxies truly shine is in business use cases, particularly large-scale data acquisition. Web scraping would be nigh impossible without proxies, and just that use case powers numerous businesses across the globe.
One of the key features of proxies is a large number of IP addresses from numerous geographical locations. Some data online changes based on geographic location, such as prices on some ecommerce stores.
The large number of IP addresses helps retain access to data sources. Web scraping bots are often hit with IP bans. Normally, such restrictions would cut off access to data, as most devices are assigned a single address. Proxies, however, make IP tracking and banning mostly irrelevant, as you can change your address on a whim.
On the other hand, the multitude of locations helps you bypass geo restrictions. If you know that a website displays region-specific content, all you need to do is switch to an IP address from that location.
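As a rough illustration of how this looks in a scraping pipeline, the snippet below routes a request through a proxy using Python’s requests library. The gateway address, port, and credentials are placeholders, and encoding the target country in the username is a convention some providers use rather than a universal standard:

```python
import requests

# Placeholder credentials and gateway – substitute your provider's actual details.
PROXY_USER = "username-country-de"   # some providers encode geo-targeting in the username
PROXY_PASS = "password"
PROXY_HOST = "proxy.example.com:8080"

proxies = {
    "http":  f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
}

# The target site sees the proxy's (German, in this example) IP address instead of
# yours, so region-specific content is served accordingly.
response = requests.get("https://example.com/pricing", proxies=proxies, timeout=30)
print(response.status_code, len(response.text))
```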
Web Scraping, LLMs, and Proxies
Since LLMs require such a vast amount of data, the only source that can cover those needs is the internet at large. Web scraping is the only option, and to do that, you’ll need proxies. Otherwise, you’ll miss out on a lot of region-specific data and will eventually get an IP ban.
To avoid detection, prevent IP bans, and stay under rate limits, proxies have to be used. So, all three (web scraping, LLMs, and proxies) form a tight triad that enables the production of state-of-the-art AI.
Since the article is about LLMs and proxies, we’ll skip over the data collection technicalities; however, we have plenty of web scraping guides available on our blog.
Regarding proxies, several types may be chosen, depending on the source of data and budget availability:
- Datacenter proxies
These are some of the cheapest and fastest proxies available for LLM training. Datacenter IPs, however, are easily detected by most websites, which can lead to frequent IP bans and sometimes even preemptive blocks. As such, they’re a good fit when you need global data from sources that aren’t heavily protected by advanced anti-bot systems.
- Residential proxies
Pricing-wise, they’re usually middle-of-the-pack, as these IPs are sourced from household devices. While connections are slower than with datacenter proxies, they make up for it with low detection rates and large IP pools. Best used to avoid detection and bypass geo-restrictions. Generally, they’re the go-to proxies for LLM training.
- ISP proxies
These combine the benefits of residential and datacenter proxies. They’re hosted on business-grade servers but carry IP addresses assigned by an Internet Service Provider, so they retain the trust of residential IPs. ISP proxies are the most expensive, so they’re best reserved for the most important LLM data collection use cases.
- Mobile proxies
Technically a subtype of residential proxies. They’re also sourced from household devices; however, only mobile phones are used (as the name implies). Mobile proxies are even harder to detect than their regular residential counterparts, so they’re generally great for all types of data collection.
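One common pattern – sketched below with entirely hypothetical gateway addresses – is to keep a separate pool per proxy type and pick one based on how aggressively a target site blocks bots:

```python
# Hypothetical proxy gateways – replace with your provider's real endpoints.
PROXY_POOLS = {
    "datacenter":  "http://user:pass@dc.proxy.example.com:8080",
    "residential": "http://user:pass@res.proxy.example.com:8080",
    "mobile":      "http://user:pass@mobile.proxy.example.com:8080",
}

# Rough mapping from how heavily a site is protected to which pool to use.
SITE_PROFILE = {
    "lightly-protected": "datacenter",    # cheap and fast where bans are rare
    "geo-sensitive":     "residential",   # blends in with regular household traffic
    "heavily-protected": "mobile",        # hardest to detect
}

def proxy_for(profile: str) -> dict[str, str]:
    url = PROXY_POOLS[SITE_PROFILE[profile]]
    return {"http": url, "https": url}

print(proxy_for("geo-sensitive"))
```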
Challenges of LLM Data Collection Without Proxies
Since proxies cost money and often make up a significant part of a data collection budget, one might be tempted to skip them entirely and go straight to data collection.
Unfortunately, as mentioned previously, such an endeavour is bound to fail. You’ll eventually get an IP ban on most websites, whether due to sending too many requests in a short period of time or any other reason, and you’ll lack access to geographically sensitive information.
It’s also not a viable approach because every website has different triggers for bot detection and IP bans. Rate limits that work on one website will fail on another. When performing data acquisition, you’ll usually burn a few IP addresses just probing a site’s limits.
If real-time data is required, the issues continue to compound. Even with rotating proxies, losing access to a data source can happen. Without them, accessing real-time data for an extended period of time is nigh impossible.
So, there’s no practical way to collect data for LLM training at the necessary scale without proxies. Even if you were to gather enough data for a single training run, models need constant updating, so the scraping process has to be continuous.
Legal Compliance and Ethical Proxy Use
Proxy usage itself is, in most countries, legal. It’s a simple tool that relays requests from one machine to another, a lot like a VPN, just for business use cases.
How you use proxies, however, is an entirely different story. In terms of data acquisition and LLM training, the nature of the information defines whether you’re in the clear to scrape the data or not.
As a general rule of thumb, publicly accessible, non-personal, and non-copyrighted information is fair game. If you need to log in to see data, that’s usually out of the question, as most Terms of Service forbid the use of automated bots. By logging in, you explicitly agree to the Terms.
Personal data is protected nearly universally worldwide. GDPR and CCPA are two of the most often cited privacy and personal data laws, covering the EU and California respectively. Most of the time, you need explicit consent to collect personal data, and even then it’s subject to numerous restrictions, security requirements, and much more.
Getting consent from every user is impossible when performing web scraping. Even if you did get that, you’d still need to follow the rest of the requirements. So, never scrape personal data.
Copyrighted data follows in the same vein. Rights holders have the exclusive right to reproduce the work, and by web scraping, it could be argued, you’re reproducing works without permission to do so.
Note, however, that the legal landscape of web scraping is swiftly changing. Always first consult with a legal professional and do not take this blog post as legal advice.
Optimizing Proxy Usage for LLM Training and Web Scraping
Every website will have its own unique bot protection algorithm, which you can only figure out through trial-and-error. The algorithms will change frequently, so constant adaptation is part of the process. There are, however, general best practices that work with nearly every website:
1. Proxy rotation
Having a large number of proxies is a lifesaver, mostly because you can avoid a lot of the usual scraping hazards – IP bans, CAPTCHAs, and other restrictions. Setting up website-specific IP rotation (after a certain number of requests, visit the homepage first, then go to the target URL, then switch IP and repeat) will minimize data acquisition issues. A combined sketch of rotation, human-like delays, and basic monitoring follows this list.
2. Human-like behavior
Scrapers can be extremely efficient, more so than any human. But that also makes them easily detectable. Adding randomized and human-like behavior (e.g., scrolling, clicking, delays) to your web scraping solution will also reduce the likelihood of tripping up anti-bot algorithms.
3. Performance monitoring
Eventually, something will go wrong. Website layouts may change, IPs get banned, or proxies stop working. Performance monitoring (such as the amount of data retrieved, HTTP response codes, etc.) will save you a lot of time and effort down the road. It’s not immediately useful, but you’ll be glad you have it once things go awry.
4. Compliance
You’ll need to constantly keep tabs on the ever-evolving legal landscape, Terms of Service, and other intricacies to ensure you don’t get into trouble while scraping data for training purposes. Since so much data is required, you’ll need to keep tabs on the field fairly regularly.
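The sketch below ties the first three practices together: a rotating pool of proxies, randomized human-like delays, and simple response-code monitoring. The gateway addresses are placeholders and the thresholds are arbitrary defaults, not recommended values:

```python
import random
import time
import requests

# Hypothetical pool of proxy gateways – substitute real provider endpoints.
PROXY_POOL = [
    "http://user:pass@gw1.proxy.example.com:8080",
    "http://user:pass@gw2.proxy.example.com:8080",
    "http://user:pass@gw3.proxy.example.com:8080",
]
ROTATE_EVERY = 10  # switch IP after this many requests (arbitrary threshold)

def scrape(urls: list[str]) -> dict[int, int]:
    status_counts: dict[int, int] = {}            # basic performance monitoring
    proxy = random.choice(PROXY_POOL)
    for i, url in enumerate(urls):
        if i and i % ROTATE_EVERY == 0:
            proxy = random.choice(PROXY_POOL)     # proxy rotation
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
            status_counts[resp.status_code] = status_counts.get(resp.status_code, 0) + 1
        except requests.RequestException:
            status_counts[-1] = status_counts.get(-1, 0) + 1  # network or proxy failure
        time.sleep(random.uniform(2, 8))          # randomized, human-like delay
    return status_counts

print(scrape(["https://example.com/page-1", "https://example.com/page-2"]))
```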
Conclusion
There would be no LLMs without scraping and no scraping without proxies. As enticing as skipping over the most expensive part of data acquisition may sound, all projects that avoid proxies are bound to fail. Find a good provider that’ll let you keep costs low while also being highly responsive to your inquiries.
FAQ
What is an LLM proxy?
An LLM proxy is either a proxy that supports data acquisition for the model (integrated into the web scraping pipeline) or one that is integrated directly into the LLM’s tooling to help it access websites when required.
Are proxies legal for AI data collection?
Proxies themselves are legal in most jurisdictions. Data collection practices, however, are limited by legislation (such as GDPR and CCPA), so it’s recommended that you first discuss your plans with a legal professional.
How do proxies prevent IP bans?
Proxies help you distribute requests through a larger pool of addresses, raising less suspicion. Additionally, if you do get banned, simply changing the proxy solves the problem.
Which proxy type is best for LLM training?
Residential proxies will power most of your data acquisition. They’re harder to detect, decently fast, and cover the entire world, so you can get localized data.