How to download ebooks from .onion with Python3 and requests?
With a plain requests call, the .onion page is unreachable, and so are the files it links to:
    import requests

    r = requests.get('http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/special/index')
    print(r.text)
What to do?
In this tutorial, we will see how to download, with Python3 and requests, more than 100 free .pdf files available on the darknet.
The code is available here in full: https://gist.github.com/lobstrio/8d1c87203755d569da7ce2433179f099.
Off to the dark places of the darknet!
Prerequisites
In order to complete this tutorial from start to finish, be sure to have the following installed on your computer:
- python3
- SublimeText
To clarify the purpose of each of the above: python3 is the programming language with which we will be scraping sites and downloading PDFs, and SublimeText is a text editor. Sublime.
Let's get to work.
Installation
We will proceed as follows:
- Install TorBrowser
- Install the Python libraries
- Install the tor package
First, download the Tor Browser build that corresponds to your operating system (here for me, Mac OS), and simply follow the installation instructions.
Then install the Python libraries used by the script:
    $ pip3 install requests
    $ pip3 install pysocks
    $ pip3 install lxml
Finally, we'll install Tor from the command line:
Mac OS
First we install brew, the Mac OS package installation tool:
    $ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then we install tor:
    $ brew install tor
We check that tor is correctly installed:
    $ brew info tor
    ==> tor: stable 0.4.7.10 (bottled)
Finally, the service is launched:
    $ brew services start tor
Linux
The package is installed:
    $ sudo apt install tor
And we start the service:
    $ sudo /etc/init.d/tor start
And now we're ready to scrape, with our requests session routed through the local Tor proxy.
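Before running the scraper, it can help to confirm that something is actually listening on the Tor SOCKS port. The snippet below is a minimal sketch using only the standard library; the function name `tor_proxy_is_listening` is made up for this example, and the check only proves that a TCP port is open, not that the listener is really Tor.

```python
import socket


def tor_proxy_is_listening(host="localhost", port=9050, timeout=3.0):
    """Return True if something accepts TCP connections on host:port.

    Assumption: 9050 is the port the local tor service was started on.
    This does not verify that the listener speaks SOCKS or is Tor.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unknown host, and so on.
        return False


if __name__ == "__main__":
    if tor_proxy_is_listening():
        print("Port 9050 is open, the tor service seems to be running")
    else:
        print("Nothing is listening on localhost:9050, is the tor service started?")
```

If the check fails, starting the service again with the commands above is usually the first thing to try.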
Code
Here is the code in full:
    import time

    import requests
    from lxml import html

    print('~~ start')

    anarchist_library_onion_link = "http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/special/index"
    latest_books_library_onion_link = "http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/latest?bare=1"

    # Route every request of this session through the local Tor SOCKS proxy.
    # socks5h (rather than socks5) lets the proxy resolve the .onion hostname.
    session = requests.session()
    session.proxies = {'http': 'socks5h://localhost:9050', 'https': 'socks5h://localhost:9050'}

    # Sanity check: the IP seen through Tor must differ from the local one.
    tor_ip = session.get("https://api.ipify.org/").text
    local_ip = requests.get("https://api.ipify.org/").text
    assert all([tor_ip, local_ip])
    assert tor_ip != local_ip

    # Fetch the listing of the latest books: one div per book.
    response = session.get(latest_books_library_onion_link, timeout=100)
    doc = html.fromstring(response.text)
    items = doc.xpath('//div[@class="list-group"]//div[@class="amw-listing-item"]')

    for i, item in enumerate(items[:10]):
        link = "".join(item.xpath('./a/@href'))
        print(link)

        # Open the book page and locate its PDF download link.
        response = session.get(link, timeout=100)
        assert response.ok
        item_page_doc = html.fromstring(response.text)
        pdf_link = "".join(item_page_doc.xpath('//span[@id="download-format-pdf"]/a/@href'))
        assert pdf_link

        # Stream the PDF to disk in small chunks.
        pdf_name = pdf_link.split('/')[-1]
        pdf_response = session.get(pdf_link, stream=True)
        with open(pdf_name, 'wb') as fd:
            for chunk in pdf_response.iter_content(2000):
                fd.write(chunk)
        print('✓ %s (%s)' % (pdf_name, i+1))
        time.sleep(1)

    print('~~ success')
    # Raw string, so the backslashes of the banner are not read as escapes.
    print(r"""
     _       _         _
    | |     | |       | |
    | | ___ | |__  ___| |_ __ __
    | |/ _ \| '_ \/ __| __/| '__|
    | | (_) | |_) \__ \ |_ | |
    |_|\___/|_.__/|___/\__||_|
    """)
To execute the code:
- Download the .py code
- Run the script via the command line
And this is what will appear directly on your terminal:
    $ python3 scraping-anarchists-library-darknet-requests-tor-tutorial.py
    ~~ start
    ✓ kevin-carson-may-day.pdf (1)
    ✓ theodoros-karyotis-ioanna-maria-maravelidi-yavor-tarinski-asking-questions-with-the-zapatistas.pdf (2)
    ✓ rasmus-hastbacka-six-myths-about-union-action.pdf (3)
    ✓ asbo-bang-up-and-smash-2nd-edition.pdf (4)
    ✓ bob-black-fija.pdf (5)
    ~~ success
     _       _         _
    | |     | |       | |
    | | ___ | |__  ___| |_ __ __
    | |/ _ \| '_ \/ __| __/| '__|
    | | (_) | |_) \__ \ |_ | |
    |_|\___/|_.__/|___/\__||_|
And the precious anarchist ebooks, downloaded for free from the darknet, directly saved on your computer:
Step-by-Step Guide
The guide will be broken down into 3 parts.
- Connecting to Tor proxies with requests and Python
- Browsing the site
- Downloading pdfs
Tor Proxies
First, let's connect to the Tor proxies with Python3 and requests, as follows:
    session = requests.session()
    session.proxies = {'http': 'socks5h://localhost:9050', 'https': 'socks5h://localhost:9050'}
Port 9050 is the default port on which the local Tor service exposes its SOCKS proxy. The socks5h scheme asks the proxy to resolve hostnames, which .onion addresses require. For more information, see our article on this topic.
Now we'll access https://api.ipify.org, once through Tor, once with our local IP:
    tor_ip = session.get("https://api.ipify.org/").text
    local_ip = requests.get("https://api.ipify.org/").text
    print(tor_ip)
    print(local_ip)
And the result is clear:
    $ python3 scraping-anarchists-library-darknet-requests-tor-tutorial.py
    5.45.106.207
    80.125.29.188
Two different IPs: we are connected through the Tor proxies!
Now let's try to connect to the site:
    response = session.get(latest_books_library_onion_link, timeout=100)
    print(response.status_code)
And here again, when running the script, the message is clear:
    $ python3 scraping-anarchists-library-darknet-requests-tor-tutorial.py
    200
No more inaccessible pages. We can now access this darknet site programmatically, with requests and Python.
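Connections through Tor can be slow or drop mid-request, which is why the script uses generous timeouts. As an optional addition, not part of the original script, here is a minimal sketch of a retry helper with exponential backoff. The name `fetch_with_retries` and its parameters are made up for this example; it wraps any callable, so it makes no assumption about the requests API itself.

```python
import time


def fetch_with_retries(fetch, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` tries have failed.

    Waits base_delay, then 2 * base_delay, 4 * base_delay, ... between tries.
    The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # with requests, narrow this to requests.RequestException
            last_error = error
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error


# Possible use with the session defined above (assumption: `session` and
# `latest_books_library_onion_link` exist as in the tutorial script):
# response = fetch_with_retries(
#     lambda: session.get(latest_books_library_onion_link, timeout=100)
# )
```

The `sleep` argument is injectable only so that the helper can be exercised without actually waiting.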
All Results Page
Now we will navigate the site.
First, open Tor Browser, and go to the .onion site: http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/special/index
Once you are on the site, open the developer tools (right click + Inspect), then select the Network tab.
Reload the page, and in the search tool, search for one of the phrases on the page, here "A short history of May Day".
A request appears! It is the URL of this request that we will retrieve, and insert directly into our Python code:
    response = session.get("http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/latest?bare=1", timeout=100)
Now we will retrieve the URL of each book page. When we open the 'Inspector' tab of the developer tools, we see that each book link is located in a div which has the class 'amw-listing-item':
Here is the code:
    response = session.get("http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/latest?bare=1", timeout=100)
    doc = html.fromstring(response.text)
    items = doc.xpath('//div[@class="list-group"]//div[@class="amw-listing-item"]')
    for i, item in enumerate(items[:10]):
        link = "".join(item.xpath('./a/@href'))
        print(link)
And when you run it from the command line, the links appear clearly:
    $ python3 scraping-anarchists-library-darknet-requests-tor-tutorial.py
    http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/library/kevin-carson-may-day
    http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/library/theodoros-karyotis-ioanna-maria-maravelidi-yavor-tarinski-asking-questions-with-the-zapatistas
    http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/library/rasmus-hastbacka-six-myths-about-union-action
    http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/library/asbo-bang-up-and-smash-2nd-edition
    http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/library/bob-black-fija
Beautiful!
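To experiment with the link extraction without going through Tor each time, the same idea can be tried offline on a small HTML snippet. The sketch below deliberately swaps lxml and XPath for the standard library's `html.parser`, so it runs with no extra install. The sample markup and the `example.onion` URLs are hypothetical: they only mimic the class names quoted above, and the matching is looser than the XPath (any link nested inside an 'amw-listing-item' div is kept).

```python
from html.parser import HTMLParser

# Hypothetical markup that only imitates the class names seen in the inspector.
SAMPLE_HTML = """
<div class="list-group">
  <div class="amw-listing-item"><a href="http://example.onion/library/book-one">Book one</a></div>
  <div class="amw-listing-item"><a href="http://example.onion/library/book-two">Book two</a></div>
</div>
<div class="footer"><a href="http://example.onion/about">About</a></div>
"""


class ListingLinkParser(HTMLParser):
    """Collect the href of every <a> nested inside a div.amw-listing-item."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._div_stack = []  # one bool per open <div>: is it a listing item?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            self._div_stack.append("amw-listing-item" in classes)
        elif tag == "a" and any(self._div_stack) and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self._div_stack:
            self._div_stack.pop()


def extract_listing_links(html_text):
    parser = ListingLinkParser()
    parser.feed(html_text)
    return parser.links


links = extract_listing_links(SAMPLE_HTML)
print(links)
```

The footer link is ignored because it does not sit inside a listing item, which is the behaviour we want from the real selector as well.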
All we have to do now is:
- Go to the page of each book
- Download the book in .pdf format
One Result Page
First, we go to the book page with requests:
    response = session.get(link, timeout=100)
    item_page_doc = html.fromstring(response.text)
Then from the TorBrowser, once on the book page, inspect the area with the link to the pdf. It can be selected with this XPath:
    '//span[@id="download-format-pdf"]/a/@href'
Finally, the document is downloaded with requests:
    pdf_link = "".join(item_page_doc.xpath('//span[@id="download-format-pdf"]/a/@href'))
    assert pdf_link
    pdf_name = pdf_link.split('/')[-1]
    pdf_response = session.get(pdf_link, stream=True)
    with open(pdf_name, 'wb') as fd:
        for chunk in pdf_response.iter_content(2000):
            fd.write(chunk)
And there you have it!
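The file name above is taken from the last segment of the link with a simple split. As an optional hardening, not part of the original script, here is a sketch that also copes with query strings and percent-encoding, plus a small check to avoid downloading a file twice. Both function names and the `example.onion` URL are made up for this example.

```python
import os
from urllib.parse import unquote, urlsplit


def pdf_name_from_link(pdf_link):
    """Derive a local file name from the last path segment of a URL."""
    path = urlsplit(pdf_link).path          # drops ?query and #fragment
    name = os.path.basename(unquote(path))  # decodes %20 and similar
    if not name:
        raise ValueError("no file name in %r" % pdf_link)
    return name


def needs_download(pdf_name, directory="."):
    """Return True when the file is missing or empty in `directory`."""
    target = os.path.join(directory, pdf_name)
    return not os.path.exists(target) or os.path.getsize(target) == 0


name = pdf_name_from_link("http://example.onion/library/kevin-carson-may-day.pdf?v=2")
print(name)  # kevin-carson-may-day.pdf
```

In the download loop, wrapping the request in `if needs_download(pdf_name):` would skip books already saved by a previous run.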
Benefits
This code will allow you to download .pdf documents with requests and Python3, directly from an .onion site on the dark net.
An obvious cultural benefit.
Limitations
While this Python script allows you to connect to an .onion site with Python3 and requests, you will only be able to download one page of results. Pagination is not supported.
Furthermore, this tutorial only covers the anarchist library on the darknet, accessible via this .onion URL: http://libraryqxxiqakubqv3dc2bend2koqsndbwox2johfywcatxie26bsad.onion/special/index.
To access other data sources, you will have to adapt the script.
Please note that this tutorial is for educational purposes only.
We therefore disclaim any responsibility for any improper use that may be made of it, particularly on sites linked to other types of services or other types of data. Furthermore, we would like to remind you that, depending on the legislation of the country in which you operate, it may be strictly forbidden to possess texts protected by copyright without offering the fair remuneration due to the author. We therefore strongly recommend that you consult the legislation in force in your country before embarking on any IT development project such as the one illustrated above.
Conclusion
And that's the end of the tutorial!
In this tutorial, we have seen how to download .pdf files from an anarchist .onion library on the darknet, using requests and Python3.
If you have any questions, or if you need a custom, solid and scalable scraping service, we are of course at your disposal here.
Happy scraping!
Co-founder @ lobstr.io since 2019. Genuine data avid and lowercase aesthetic observer. Ensure you get the hot data you need.