How to scrape data anonymously using Tor proxies and Python?

Sasha Bouloudnine ● December 16, 2022 ● 4 min read

When scraping from your own server or computer, the target website may end up blocking your IP address. This is a hard block, and it makes further collection difficult. It happens, for example, on https://www.carrefour.fr/, which is protected by the bot-mitigation tool from https://www.cloudflare.com/.

how-to-scrape-data-anonymously-using-tor-proxies-and-python-image-1.png

It is then particularly tempting to turn to an external proxy provider, such as the solid https://brightdata.com/ or the very user-friendly https://www.scrapingbee.com/, to change your IP, hide your real identity, and bypass the block. However, these providers are often expensive: 0.5 EUR per GB at Brightdata, for example.

In this tutorial, we'll look at how to use the Tor network (The Onion Router) as a proxy, with Python 3 and Torpy. Tor is a collaborative, decentralized network in which the message you send passes through a series of distinct relays before it reaches its destination. This technique is called Onion Routing.

how-to-scrape-data-anonymously-using-tor-proxies-and-python-image-2.png

Hence this nice logo that takes the shape of an onion. The complete code is available here.
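To make the onion idea concrete, here is a toy sketch of layered wrapping. It is not how Tor is implemented: the XOR "cipher", the relay names (guard, middle, exit), the keys, and the `wrap` and `peel` helpers are all assumptions made up for illustration. The point it shows is that each relay can only remove its own layer, which reveals the next hop and nothing else.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy reversible 'cipher': XOR each byte with a repeating key. NOT secure."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def wrap(message: bytes, route, destination: str):
    """Build the layered message for `route`, a list of (relay_name, key)
    pairs given in travel order. Returns (first_relay_name, onion_bytes)."""
    next_hop, payload = destination, message
    # Wrap from the inside out: the exit relay's layer is created first,
    # the entry relay's layer last, so it ends up outermost.
    for relay_name, key in reversed(route):
        payload = xor_bytes(next_hop.encode() + b"|" + payload, key)
        next_hop = relay_name
    return next_hop, payload


def peel(onion: bytes, key: bytes):
    """Remove one layer with a relay's key. Returns (next_hop, inner_bytes)."""
    plain = xor_bytes(onion, key)
    next_hop, _, inner = plain.partition(b"|")
    return next_hop.decode(), inner


# Hypothetical three-relay route with made-up keys.
demo_route = [("guard", b"key-1"), ("middle", b"key-2"), ("exit", b"key-3")]
current, onion = wrap(b"GET / HTTP/1.1", demo_route, "api.ipify.org")
for relay_name, key in demo_route:
    next_hop, onion = peel(onion, key)
    print(f"{relay_name} removes its layer and forwards to {next_hop}")
```

Only the last relay sees the original request, and only the first one knows who sent it.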

Let's go!

Prerequisites

To complete this tutorial from start to finish, make sure the following are installed on your computer:

python3
SublimeText

To clarify the purpose of each: python3 is the programming language we will write the scraper in, and SublimeText is a text editor. Sublime.

Let's play!

Setup

We will proceed as follows:

download tor
install tor
install torpy

For the first step, just go here: https://www.torproject.org/download/

Then download the browser that corresponds to your operating system.

Here for me, Mac OS:

how-to-scrape-data-anonymously-using-tor-proxies-and-python-image-5.png

Then simply follow the installation instructions:

how-to-scrape-data-anonymously-using-tor-proxies-and-python-image-3.png

Finally, we will install the Torpy library, along with requests, which lets us make HTTP requests from Python:

$ pip3 install requests
$ pip3 install torpy

And now we're ready to scrape.
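As a quick sanity check, the short sketch below reports whether both packages can be imported in the current environment. The `missing_packages` helper is a hypothetical name introduced here, not part of either library.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# "requests" and "torpy" are the import names of the two packages installed above.
missing = missing_packages(["requests", "torpy"])
print("missing packages:", missing or "none")
```

If a name is listed as missing, rerunning the matching pip3 command is the first thing to try.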

NB: with 273 stars, 43 forks, and a most recent commit on 04/15/2021, Torpy is, at the time of writing, the most popular, easiest-to-use, and best-maintained library we found for accessing Tor from Python 3.

🌟

Code

Here is the complete code:

# Import the TorRequests class from the torpy library
from torpy.http.requests import TorRequests

print('start')
# Open a connection to the Tor network
with TorRequests() as tor_requests:

    # First request to ipify.org, through a first Tor circuit
    print("build circuit #1")
    with tor_requests.get_session() as sess:
        print(sess.get("https://api.ipify.org/").text)

    # Second request to ipify.org, through a new Tor circuit
    print("build circuit #2")
    with tor_requests.get_session() as sess:
        print(sess.get("https://api.ipify.org/").text)

print('~~success')

The code breaks down into 3 distinct parts:

we import the TorRequests class from the torpy library
we open a Tor session
we request https://api.ipify.org/, which returns our IP address

And when we run the code from the terminal:

$ python3 torpy-tor-proxies-python-tutorial.py
start
build circuit #1
185.220.100.252
build circuit #2
185.220.101.33
~~success

So we can see that each time a session is opened, a new circuit is built and a new IP address is assigned to us.
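The same pattern extends to any number of sessions. The sketch below opens `count` sessions in a row and records the IP reported for each one. `collect_exit_ips` and `run_with_tor` are hypothetical helper names. The only Torpy calls used are the ones from the script above, `TorRequests()` and `get_session()`.

```python
def collect_exit_ips(open_session, count, url="https://api.ipify.org/"):
    """Open `count` sessions one after another and return the IP that `url`
    reports for each of them.

    `open_session` is any callable returning a context manager whose value
    has a requests-like .get(url) method, e.g. tor_requests.get_session.
    """
    ips = []
    for _ in range(count):
        with open_session() as sess:
            ips.append(sess.get(url).text.strip())
    return ips


def run_with_tor(count=5):
    """Run the collection over Tor. Needs torpy installed and network access."""
    from torpy.http.requests import TorRequests  # imported lazily on purpose

    with TorRequests() as tor_requests:
        ips = collect_exit_ips(tor_requests.get_session, count)
    print(f"{len(set(ips))} distinct exit IPs over {count} sessions: {ips}")
    return ips
```

Calling `run_with_tor(5)` from a script should print five addresses. Nothing guarantees they are all distinct, since two circuits can end at the same exit relay.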

A success that brings tears to our eyes!

✨

Benefits

This code gives you access, in about 50 seconds, to the Tor proxy network from Python 3 with the Torpy library.

According to Tor Metrics, in 2022 the network had between 1,000 and 2,000 exit relays. The exit relay is the last hop of a circuit, so its IP address is the one the target website sees instead of yours. That means you can rely on a pool of IPs of that size:

how-to-scrape-data-anonymously-using-tor-proxies-and-python-image-4.png

In other words, you will be able to

use a pool of 1000-2000 IP addresses
anonymize your browsing
for free

Wonderful!

🧅

Limitations

Beware: although the network is free, it is relatively small. For comparison, the market-leading proxy provider Brightdata promises a network of over 1.5 million datacenter IPs. That's roughly 1,000 times larger, no less.

Moreover, in addition to being small and open to everyone, the network is also used to browse the Darknet and to take part in more or less legal activities. As a result, you run the risk of being quickly blocked by a target site.
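One way to cope with such blocks is to retry through a fresh session when a response looks blocked. The sketch below assumes that HTTP status codes 403 and 429 signal a block, which is a simplification, as some sites answer 200 with a captcha page. `get_with_fresh_circuits` and `BLOCK_STATUSES` are hypothetical names, and `open_session` stands for something like `tor_requests.get_session`.

```python
# Assumption: these status codes mean "blocked". Adjust per target site.
BLOCK_STATUSES = {403, 429}


def get_with_fresh_circuits(open_session, url, max_attempts=3):
    """Request `url` through up to `max_attempts` sessions, one after another,
    stopping at the first response that does not look blocked.

    Returns (response, attempts_used). If every attempt is blocked, the
    response returned is the last blocked one.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        # Each new session goes through a new circuit, hence a new exit IP.
        with open_session() as sess:
            response = sess.get(url)
        if response.status_code not in BLOCK_STATUSES:
            return response, attempt
    return response, max_attempts
```

Network errors are not caught here, so a failed circuit still raises. With a pool of only 1,000 to 2,000 exit IPs, a low `max_attempts` keeps the script from hammering a site that blocks Tor outright.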

You can, however, still access Google normally when you are lucky.

Finally, since the request has to pass through a chain of servers to guarantee its anonymity (the famous Onion Routing), it is relatively slow. For example, if we time a request to https://api.ipify.org/ through the Tor network and through a classic IP from Brightdata, we get a speed ratio of roughly 1 to 4.

Here is the output of the test script:

$ python3 test-speed-tor-vs-brightdata.py

tor
ip 185.82.127.25
delay 3.131886832998134

brightdata
ip 185.255.166.252
delay 0.8867947079997975

~~success

Advancing masked, yes, but advancing slowly.
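The test script itself is not reproduced here, so what follows is only a sketch of how such a measurement could be made. It times a single GET with `time.perf_counter` and prints the result in the same layout as the output above. `timed_get`, `report`, and `compare_tor_to_direct` are hypothetical names, and the baseline is a direct `requests.get` rather than a Brightdata proxy, whose configuration is not given in this tutorial.

```python
import time


def timed_get(get, url):
    """Call get(url) and return (response_text, elapsed_seconds)."""
    start = time.perf_counter()
    response = get(url)
    delay = time.perf_counter() - start
    return response.text.strip(), delay


def report(label, get, url="https://api.ipify.org/"):
    """Time one request and print it in the layout of the output above."""
    ip, delay = timed_get(get, url)
    print(label)
    print("ip", ip)
    print("delay", delay)
    return delay


def compare_tor_to_direct():
    """Needs requests, torpy and network access. Only the request itself is
    timed, not the connection to Tor (an assumption about the original test)."""
    import requests
    from torpy.http.requests import TorRequests

    with TorRequests() as tor_requests:
        with tor_requests.get_session() as sess:
            tor_delay = report("tor", sess.get)
    direct_delay = report("direct", requests.get)
    print("ratio", tor_delay / direct_delay)
```

A single request is a noisy measurement, so averaging several runs would give a more reliable ratio.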

🐌

Conclusion

And that's the end of the tutorial!

In this tutorial, we've seen how to use Tor network proxies with Python 3, and Torpy, the latest and easiest to use library on the market.

If you have any questions, or if you need a custom, robust, scalable scraping service that can use a large and powerful IP pool, contact us here.

Happy scraping!

🦀
