How to scrape data anonymously using Tor proxies and Python?
When you scrape from your own server or computer, the target website may detect the activity and block your IP address. This is a hard block, which makes further collection difficult. It happens, for example, on https://www.carrefour.fr/, which is protected by the bot-mitigation tool provided by https://www.cloudflare.com/.
It is then particularly tempting to use an external proxy provider, such as the solid https://brightdata.com/ or the very user-friendly https://www.scrapingbee.com/, and so change your IP to hide your real identity and bypass the block. However, these providers are often expensive: 0.5 EUR per GB at Brightdata, for example.
In this tutorial, we'll look at how to use the Tor proxy network (The Onion Router) with Python 3 and Torpy to browse online. Tor is a collaborative, decentralized network in which the message you send passes through a series of distinct relays before it reaches its destination, a technique called onion routing.
Hence this nice logo that takes the shape of an onion. The complete code is available here.
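To make the onion image concrete, here is a toy sketch of layered wrapping. It is only an illustration: the XOR "cipher", the relay names, and the keys are all assumptions made up for this example, and the real Tor protocol uses proper cryptography and negotiated keys. The client wraps the message once per relay, and each relay can peel exactly one layer, learning only the next hop.

```python
import hashlib


def keystream(key, length):
    """Derive a byte stream from a key (toy construction, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_layer(data, key):
    """Apply or remove one layer: XOR with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def wrap(message, route):
    """Wrap `message` once per relay in `route`, a list of (name, key) pairs.

    The innermost layer belongs to the last relay (the exit), the outermost
    layer to the first relay.
    """
    payload = message
    next_hop = "destination"
    for name, key in reversed(route):
        payload = xor_layer(next_hop.encode() + b"|" + payload, key)
        next_hop = name
    return payload


def peel(packet, key):
    """What one relay does: remove its layer, read the next hop, forward the rest."""
    plain = xor_layer(packet, key)
    next_hop, _, inner = plain.partition(b"|")
    return next_hop.decode(), inner


# Hypothetical three-relay route with made-up keys.
route = [("guard", b"key-1"), ("middle", b"key-2"), ("exit", b"key-3")]
packet = wrap(b"GET / HTTP/1.1", route)
for name, key in route:
    next_hop, packet = peel(packet, key)
    print(name, "forwards to", next_hop)
print(packet)
```

Each relay sees only where to forward the packet, and only the last one sees the original message.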
Let's go!
In order to complete this tutorial from start to finish, be sure to have the following installed on your computer.
You can click on the links below, which will direct you either to an installation tutorial or to the site in question.
To clarify the purpose of each of the elements mentioned above: python3 is the programming language we will use to scrape, and SublimeText is a text editor. Sublime.
Let's play!
We will proceed as follows:
For the first step, just go here: https://www.torproject.org/download/
Then download the browser that corresponds to your operating system.
Here for me, Mac OS:
And quietly follow the installation instructions:
Finally, we will install the Python Torpy library, along with requests, which lets us make HTTP requests from Python:
    $ pip3 install requests
    $ pip3 install torpy
And now we're ready to scrape.
NB: with 273 stars, 43 forks, and a most recent commit dated 04/15/2021, Torpy is, at the time of writing, the most popular, easiest to use, and best maintained library for accessing Tor from Python 3.
Here is the complete code:
    # Import the TorRequests class from the torpy library
    from torpy.http.requests import TorRequests

    print('start')

    with TorRequests() as tor_requests:
        # First request to ipify.org through a Tor circuit
        print("build circuit #1")
        with tor_requests.get_session() as sess:
            print(sess.get("https://api.ipify.org/").text)

        # Second request to ipify.org through a new Tor circuit
        print("build circuit #2")
        with tor_requests.get_session() as sess:
            print(sess.get("https://api.ipify.org/").text)

    print('~~success')
The code is broken down into 3 distinct parts:
And when we run the code from the terminal:
    $ python3 torpy-tor-proxies-python-tutorial.py
    start
    build circuit #1
    185.220.100.252
    build circuit #2
    185.220.101.33
    ~~success
So we can see that every time a session is opened, a new IP address is assigned to us.
A success that brings tears to our eyes!
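To check this rotation over more than two sessions, the same pattern can be wrapped in a small helper. This is a minimal sketch: the names `collect_exit_ips` and `demo_with_torpy` are hypothetical, and it only assumes, as in the code above, that `tor_requests.get_session()` can be used as a context manager and that the session exposes `get(url).text`.

```python
IP_ECHO_URL = "https://api.ipify.org/"


def collect_exit_ips(open_session, circuits=3, url=IP_ECHO_URL):
    """Open `circuits` sessions one after another and record the IP each one exposes.

    `open_session` is any callable returning a context manager whose value
    has a `get(url)` method, for example `tor_requests.get_session`.
    """
    ips = []
    for _ in range(circuits):
        with open_session() as sess:
            ips.append(sess.get(url).text.strip())
    return ips


def demo_with_torpy(circuits=3):
    """Run the helper against the real Tor network (requires torpy and network access)."""
    # Imported here so collect_exit_ips stays usable without torpy installed.
    from torpy.http.requests import TorRequests

    with TorRequests() as tor_requests:
        ips = collect_exit_ips(tor_requests.get_session, circuits=circuits)
    print(ips)
    print(len(set(ips)), "distinct exit IPs out of", len(ips), "sessions")
    return ips
```

Calling `demo_with_torpy(5)` would open five sessions in a row. Nothing guarantees that every session lands on a different exit IP, which is why the sketch counts the distinct values instead of assuming them.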
This code will allow you, in 50 seconds, to access, from Python 3 and using the Torpy library, the Tor proxy network.
According to Tor Metrics, in 2022 the network has between 1,000 and 2,000 exit relays, whose IPs are the ones a target site sees. That means you'll be able to rely on a pool of IPs of that size:
In other words, you will be able to
Wonderful!
Beware: although the network is free, it is relatively small. For comparison, the market-leading proxy provider Brightdata promises a network of over 1.5 million datacenter IPs. That's roughly 1,000 times larger, no less.
Moreover, in addition to being small and accessible to everyone, the network is also used to browse the Darknet and to take part in more or less legal activities. As a result, you run the risk of being quickly blocked by a target site.
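The order of magnitude can be checked with a quick calculation, using only the figures quoted in this tutorial (1,000 to 2,000 Tor exit IPs, 1.5 million Brightdata datacenter IPs). The constant and function names are just illustrative.

```python
# Figures quoted in this tutorial, not measured here.
TOR_EXIT_IPS_LOW = 1_000
TOR_EXIT_IPS_HIGH = 2_000
BRIGHTDATA_DATACENTER_IPS = 1_500_000


def pool_ratio(large_pool, small_pool):
    """How many times larger the first pool is than the second."""
    return large_pool / small_pool


ratio_min = pool_ratio(BRIGHTDATA_DATACENTER_IPS, TOR_EXIT_IPS_HIGH)
ratio_max = pool_ratio(BRIGHTDATA_DATACENTER_IPS, TOR_EXIT_IPS_LOW)
print(f"Between {ratio_min:.0f}x and {ratio_max:.0f}x larger")
# prints: Between 750x and 1500x larger
```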
With some luck, you can nevertheless access Google normally:
Finally, because each request passes through a chain of relays to guarantee anonymity (the famous onion routing), requests are relatively slow. For example, if we time a request to https://api.ipify.org/ through the Tor network and through a classic Brightdata IP, we see a speed ratio of roughly 1 to 4.
The result of the script below:
    $ python3 test-speed-tor-vs-brightdata.py
    tor ip 185.82.127.25 delay 3.131886832998134
    brightdata ip 185.255.166.252 delay 0.8867947079997975
    ~~success
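The timing script itself is not reproduced in this tutorial. Here is a minimal sketch of how such a comparison could be written, not the original script. The helper names (`time_request`, `compare`, `tor_fetch`, `proxy_fetch`) and the proxy URL placeholder are assumptions. The requests `proxies` argument is the standard way to route a call through an HTTP proxy.

```python
import time


def time_request(fetch, url):
    """Return (body, elapsed_seconds) for a single call to fetch(url)."""
    start = time.perf_counter()
    body = fetch(url)
    elapsed = time.perf_counter() - start
    return body, elapsed


def compare(fetchers, url="https://api.ipify.org/"):
    """Time each named fetch function once and return {name: (body, delay)}."""
    results = {}
    for name, fetch in fetchers.items():
        results[name] = time_request(fetch, url)
    return results


def tor_fetch(url):
    """Fetch through Tor with torpy (requires torpy and network access)."""
    from torpy.http.requests import TorRequests

    with TorRequests() as tor_requests:
        with tor_requests.get_session() as sess:
            return sess.get(url).text


def proxy_fetch(url, proxy_url="http://USER:PASSWORD@PROXY_HOST:PORT"):
    """Fetch through a commercial proxy; proxy_url is a placeholder to fill in."""
    import requests

    proxies = {"http": proxy_url, "https": proxy_url}
    return requests.get(url, proxies=proxies, timeout=30).text
```

Calling `compare({"tor": tor_fetch, "brightdata": proxy_fetch})` would return the IP and delay for each route. Note that in this sketch the Tor timing includes building the circuit, so it may differ from the figures shown above, which may have been measured differently.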
You advance masked, yes, but you advance slowly.
And that's the end of the tutorial!
In this tutorial, we've seen how to use Tor network proxies with Python 3, and Torpy, the latest and easiest to use library on the market.
If you have any questions, or if you need a custom, robust, scalable scraping service that can use a large and powerful IP pool, contact us here.
Happy scraping!
Co-founder @ lobstr.io since 2019. Genuine data avid and lowercase aesthetic observer. Ensure you get the hot data you need.