How to scrape data anonymously using Tor proxies and Python?
In this tutorial, we'll look at how to use the Tor proxy network (Tor stands for The Onion Router) with Python 3 and Torpy to browse online. Tor is a collaborative, decentralized network in which the message you send passes through a series of distinct relays before it reaches its destination, a technique called onion routing.
Hence this nice logo that takes the shape of an onion. The complete code is available here.
Let's go!
Prerequisites
In order to complete this tutorial from start to finish, be sure to have the following installed on your computer.
You can click on the links below, which will direct you either to an installation tutorial or to the site in question.
- python3
- SublimeText
To clarify the purpose of each of the above-mentioned elements: python3 is the programming language we will use to scrape, and SublimeText is a text editor. Sublime.
Let's play!
Setup
We will proceed as follows:
- download tor
- install tor
- install torpy
First, download the Tor browser that corresponds to your operating system.
For me, that is macOS:
And quietly follow the installation instructions. Then install the two Python libraries with pip:
    $ pip3 install requests
    $ pip3 install torpy
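Before moving on, it can help to confirm that both packages are visible to your Python 3 interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration and is not part of Torpy.

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that this interpreter cannot import."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_packages(["requests", "torpy"])` should return an empty list once both `pip3 install` commands have succeeded. Any name it returns still needs to be installed.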
And now we're ready to scrape.
NB: with 273 stars, 43 forks, and the most recent commit on 04/15/2021 at the time of writing, the Torpy library is the most popular, easiest to use, and best maintained Tor access library for Python 3.
Code
Here is the complete code:
    # Import the TorRequests class from the torpy library
    from torpy.http.requests import TorRequests

    print('start')
    with TorRequests() as tor_requests:
        # First request to ipify.org through a Tor proxy
        print("build circuit #1")
        with tor_requests.get_session() as sess:
            print(sess.get("https://api.ipify.org/").text)
        # Second request to ipify.org through a Tor proxy
        print("build circuit #2")
        with tor_requests.get_session() as sess:
            print(sess.get("https://api.ipify.org/").text)
    print('~~success')
The code is broken down into 3 distinct parts:
- we import the torpy library
- we instantiate a Tor session
- we request https://api.ipify.org/ which returns our IP address
And when we run the code from the terminal:
    $ python3 torpy-tor-proxies-python-tutorial.py
    start
    build circuit #1
    185.220.100.252
    build circuit #2
    185.220.101.33
    ~~success
So we can see that every time a session is opened, a new IP address is assigned to us.
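To check this over more than two sessions, the same pattern can be wrapped in a loop. The sketch below is only an illustration built on the `TorRequests` and `get_session` calls used above; the helper names `tor_sessions` and `collect_exit_ips` are hypothetical, and nothing guarantees that every session ends up on a different exit IP.

```python
def tor_sessions(count):
    """Yield `count` torpy sessions, opened one after the other.

    torpy is imported lazily so the rest of the sketch can be read
    and tried without it.
    """
    from torpy.http.requests import TorRequests

    with TorRequests() as tor_requests:
        for _ in range(count):
            with tor_requests.get_session() as sess:
                yield sess


def collect_exit_ips(sessions, ip_url="https://api.ipify.org/"):
    """Ask `ip_url` for our visible IP once per session, in order."""
    return [sess.get(ip_url).text.strip() for sess in sessions]


# Usage (needs network access and torpy installed):
# ips = collect_exit_ips(tor_sessions(3))
# print(len(set(ips)), "distinct exit IPs out of", len(ips))
```

Comparing `len(set(ips))` with `len(ips)` shows how many of the sessions actually received a distinct address.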
A success that brings tears to our eyes!
Benefits
This code gives you access to the Tor proxy network from Python 3, using the Torpy library, in about 50 seconds.
According to Tor Metrics, in 2022 the network has between 1,000 and 2,000 exit IPs, the addresses a target site sees in place of your own. That means you'll be able to rely on a pool of IPs of that size. In other words, you will be able to:
- use a pool of 1000-2000 IP addresses
- anonymize your browsing
- for free
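If you want to estimate the size of that pool yourself, one option is to count the addresses in a published list of exit IPs. The sketch below assumes that the Tor Project's bulk exit list at `https://check.torproject.org/torbulkexitlist` is reachable and returns one IP address per line; that URL and the helper names are assumptions of this example, not something the tutorial's code relies on.

```python
import ipaddress
import urllib.request

# Assumed endpoint: plain text, one exit IP per line.
EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"


def parse_exit_list(text):
    """Return the set of valid IP addresses found in `text`, one per line."""
    ips = set()
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate or candidate.startswith("#"):
            continue  # skip blank lines and comments
        try:
            ips.add(str(ipaddress.ip_address(candidate)))
        except ValueError:
            continue  # skip anything that is not an IP address
    return ips


def fetch_exit_ips(url=EXIT_LIST_URL, timeout=30):
    """Download the list at `url` and return the exit IPs it contains."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_exit_list(resp.read().decode("utf-8", errors="replace"))
```

`len(fetch_exit_ips())` then gives a rough count to compare with the Tor Metrics figure quoted above.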
Wonderful!
Limitations
Beware: although the network of IPs is free, it is relatively small. For comparison, the market-leading proxy provider Brightdata promises a network of over 1.5 million datacenter IPs. That's roughly 1,000 times larger, no less.
Moreover, in addition to being small and accessible to everyone, the network is used to browse the Darknet and takes part in more or less legal activities. As a result, you run the risk of being quickly blocked by a target site.
You can, however, usually still access Google, when you are lucky.
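One common way to cope with blocks is to retry the request through a fresh session, which, as seen earlier, tends to come with a different exit IP. The sketch below illustrates that idea under stated assumptions: the function name `get_with_new_circuits` is hypothetical, and treating HTTP 403 and 429 as "blocked" is a choice made for this example, since target sites signal blocks in many different ways.

```python
# Assumption: these status codes mean "blocked" for the target site.
BLOCK_STATUSES = {403, 429}


def get_with_new_circuits(open_session, url, max_attempts=3):
    """Request `url` with up to `max_attempts` fresh sessions.

    `open_session` is any callable returning a context manager that
    yields a session with a requests-style `get` method, for example
    `tor_requests.get_session` from torpy.
    Returns the first non-blocked response, or None if all attempts
    were blocked.
    """
    for _ in range(max_attempts):
        with open_session() as sess:
            resp = sess.get(url)
            if resp.status_code not in BLOCK_STATUSES:
                return resp
    return None


# Usage (needs network access and torpy installed):
# from torpy.http.requests import TorRequests
# with TorRequests() as tor_requests:
#     resp = get_with_new_circuits(tor_requests.get_session, "https://api.ipify.org/")
```

Network errors and timeouts are deliberately not handled here, so a real scraper would want to catch those as well.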
The network is also slow. Here is the output of a script that times a request through Tor and through Brightdata:
    $ python3 test-speed-tor-vs-brightdata.py
    tor
    ip 185.82.127.25
    delay 3.131886832998134
    brightdata
    ip 185.255.166.252
    delay 0.8867947079997975
    ~~success
Moving forward masked, yes, but moving forward slowly.
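The speed-test script itself is not reproduced in this tutorial. As a rough idea of how such a delay can be measured, here is a minimal sketch. It is not the original `test-speed-tor-vs-brightdata.py`; the helper name `timed_get` is hypothetical, and it only times a single request, so results will vary from run to run.

```python
import time


def timed_get(session, url):
    """Return (response_text, elapsed_seconds) for one GET through `session`."""
    start = time.perf_counter()
    text = session.get(url).text
    elapsed = time.perf_counter() - start
    return text.strip(), elapsed


# Usage (needs network access and torpy installed):
# from torpy.http.requests import TorRequests
# with TorRequests() as tor_requests:
#     with tor_requests.get_session() as sess:
#         ip, delay = timed_get(sess, "https://api.ipify.org/")
#         print("tor", "ip", ip, "delay", delay)
```

Passing a plain `requests.Session()` configured with another proxy to the same function gives a comparable figure for that provider.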
Conclusion
And that's the end of the tutorial!
In this tutorial, we've seen how to use Tor network proxies with Python 3, and Torpy, the latest and easiest to use library on the market.
If you have any questions, or if you need a custom, robust, scalable scraping service that can use a large and powerful IP pool, contact us here.
Happy scraping!
Co-founder @ lobstr.io since 2019. Genuine data avid and lowercase aesthetic observer. Ensure you get the hot data you need.