How to scrape Yelp listings using Python and requests in 2023
With over 135 million monthly visitors and an extensive business directory, Yelp has become a go-to resource for people searching for restaurants, services, and more. That makes it a wealth of data that entrepreneurs, marketers, and researchers can leverage to gain insights and make informed decisions. In this article, we will learn how to scrape Yelp listings using Python.
Using Python with requests, we'll extract Yelp's restaurant listings from a search URL, capturing 6 data attributes: names, URLs, ratings, reviews, categories, and neighborhoods. Let's dive in.
Before we scrape Yelp restaurant listings, we need to ensure that we have the necessary tools in place. The two essential components we require are Python and Sublime Text.
Once you have Python and Sublime Text set up, you'll be ready to proceed with creating a Yelp scraper using Python and writing the necessary code.
To successfully scrape Yelp listings using Python, we will need to install and import several libraries that provide the necessary functionalities. Here are the key libraries we will be using:

- requests: to fetch the Yelp search pages over HTTP
- lxml: to parse the HTML and query it with XPath
- csv: to write the extracted data to a CSV file
- argparse: to accept the search URL and page count from the command line
- time: to time the run and pause between requests
Before we proceed, make sure these libraries are installed in your Python environment. We can easily install them using the Python package manager pip.
```
pip install requests lxml
```
The csv, argparse, and time modules are included in Python's standard library, so no separate installation is required.
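If you want to confirm everything is importable before writing any code, a quick optional check from your terminal (not part of the scraper itself) looks like this:

```
python -c "import requests, lxml, csv, argparse, time; print('all good')"
```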
The complete code is accessible on GitHub, and here it is in full:
```python
import requests
import csv
from lxml import html
import argparse
import time


class YelpSearchScraper:

    def iter_listings(self, url):
        response = requests.get(url)
        if response.status_code != 200:
            print("Error: Failed to fetch the URL")
            return None

        with open('response.html', 'w') as f:
            f.write(response.text)

        tree = html.fromstring(response.content)
        scraped_data = []
        businesses = tree.xpath('//div[contains(@class, "container__09f24__mpR8_") and contains(@class, "hoverable__09f24__wQ_on") and contains(@class, "border-color--default__09f24__NPAKY")]')
        for business in businesses:
            data = {}

            name_element = business.xpath('.//h3[contains(@class, "css-1agk4wl")]/span/a')
            if name_element:
                data['Name'] = name_element[0].text.strip()
                data['URL'] = "https://www.yelp.com" + name_element[0].get('href')

            rating_element = business.xpath('.//div[contains(@aria-label, "star rating")]')
            if rating_element:
                rating_value = rating_element[0].get('aria-label').split()[0]
                if rating_value != 'Slideshow':
                    data['Rating'] = float(rating_value)
                else:
                    data['Rating'] = None

            reviews_element = business.xpath('.//span[contains(@class, "css-chan6m")]')
            if reviews_element:
                reviews_text = reviews_element[0].text
                if reviews_text:
                    reviews_text = reviews_text.strip().split()[0]
                    if reviews_text.isnumeric():
                        data['Reviews'] = int(reviews_text)
                    else:
                        data['Reviews'] = None

            price_element = business.xpath('.//span[contains(@class, "priceRange__09f24__mmOuH")]')
            if price_element:
                data['Price Range'] = price_element[0].text.strip()

            categories_element = business.xpath('.//span[contains(@class, "css-11bijt4")]')
            if categories_element:
                data['Categories'] = ", ".join([c.text for c in categories_element])

            neighborhood_element = business.xpath('.//p[@class="css-dzq7l1"]/span[contains(@class, "css-chan6m")]')
            if neighborhood_element:
                neighborhood_text = neighborhood_element[0].text
                if neighborhood_text:
                    data['Neighborhood'] = neighborhood_text.strip()

            assert data
            scraped_data.append(data)

        return scraped_data

    def save_to_csv(self, data, filename):
        keys = data[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=keys, extrasaction='ignore')
            writer.writeheader()
            writer.writerows(data)
        print("Success!\nData written to CSV file:", filename)

    def scrape_results(self, search_url, max_page):
        all_results = []
        for page in range(1, max_page + 1):  # +1 so the last page is included
            page_url = search_url + f'&start={(page-1)*10}'
            print(f"Scraping Page {page}")
            results = self.iter_listings(page_url)
            if results:
                all_results.extend(results)
            time.sleep(2)
        return all_results


def main():
    s = time.perf_counter()
    argparser = argparse.ArgumentParser()
    argparser.add_argument('--search-url', '-u', type=str, required=False,
                           help='Yelp search URL',
                           default='https://www.yelp.com/search?find_desc=Burgers&find_loc=London')
    argparser.add_argument('--max-page', '-p', type=int, required=False,
                           help='Max page to visit', default=5)
    args = argparser.parse_args()
    search_url = args.search_url
    max_page = args.max_page
    assert all([search_url, max_page])

    scraper = YelpSearchScraper()
    results = scraper.scrape_results(search_url, max_page)
    if results:
        scraper.save_to_csv(results, 'yelp_search_results.csv')
    else:
        print("No results to save to CSV")

    elapsed = time.perf_counter() - s
    elapsed_formatted = "{:.2f}".format(elapsed)
    print("Elapsed time:", elapsed_formatted, "seconds")
    print('''~~ success
 _       _         _
| |     | |       | |
| | ___ | |__  ___| |_ _ __
| |/ _ \| '_ \/ __| __| '__|
| | (_) | |_) \__ \ |_ | |
|_|\___/|_.__/|___/\__||_|
''')


if __name__ == '__main__':
    main()
```
To begin coding our Yelp scraper, we first need to import the libraries. By importing these libraries, we ensure that we have access to the required tools and functionalities for our Yelp scraper.
```python
import requests
import csv
from lxml import html
import argparse
import time
```
To keep our scraping code organized and reusable, we will create a class called YelpSearchScraper. This class will contain all the methods required for scraping and saving the data; we can create instances of it and call its methods whenever needed.
Within the YelpSearchScraper class, we have the iter_listings method. This method takes a URL parameter, representing the page we want to scrape.
```python
class YelpSearchScraper:
    def iter_listings(self, url):
        response = requests.get(url)
        if response.status_code != 200:
            print("Error: Failed to fetch the URL")
            return None
```
Here's what happens in this method: we send a GET request to the given URL with requests.get(). If the response status code is anything other than 200, we print an error message and return None; otherwise we carry on with the response.
After fetching the HTML content of the Yelp page, we want to save it for reference. To do this, we can use the following code snippet:
```python
with open('response.html', 'w') as f:
    f.write(response.text)
```
By saving the HTML content to the file response.html, we can easily examine the structure and elements of the fetched page. This is particularly useful for understanding the data and designing appropriate XPath expressions or CSS selectors for extracting specific information during the scraping process.
Once we have fetched the HTML content of the Yelp page, the next step is to parse it and extract the relevant information. Let's examine the following code snippet:
```python
tree = html.fromstring(response.content)
scraped_data = []
```
In this code snippet, we utilize the lxml library to parse the HTML content and create a structured tree representation: html.fromstring() takes the raw bytes of the response and returns the root element of a document tree that we can query.
By creating the tree object, we can now navigate the HTML structure and extract specific elements using XPath expressions. The scraped_data list will store the extracted information.
Our next step is to extract the business listings from the Yelp page. Let's open the response.html file we saved earlier and use inspect element to find an XPath that locates the elements representing business listings on the Yelp page.
The business listings on a Yelp search results page are structured as an unordered list (<ul>), with each business listing in an <li>. But there are some blank <li> elements as well. At first, I anchored my XPath on the <ul> element, but this pulled in unwanted data. Switching to the <li> elements still produced inaccurate data.
To avoid these problems, we're narrowing down to a <div> element inside each <li> with the classes "container__09f24__mpR8_", "hoverable__09f24__wQ_on", and "border-color--default__09f24__NPAKY". These classes are common to all business listing cards but absent from the empty list items and other unwanted elements.
```python
businesses = tree.xpath('//div[contains(@class, "container__09f24__mpR8_") and contains(@class, "hoverable__09f24__wQ_on") and contains(@class, "border-color--default__09f24__NPAKY")]')
```
In this code snippet, we utilize the xpath() method of the tree object to extract the desired elements from the parsed HTML based on an XPath expression. Here's what happens:
'//div[contains(@class, "container__09f24__mpR8_") and contains(@class, "hoverable__09f24__wQ_on") and contains(@class, "border-color--default__09f24__NPAKY")]'.
The XPath expression targets <div> elements that contain certain classes. By using the contains() function, we specify that the element must contain all three classes mentioned in the expression. This helps us locate the specific elements that represent the business listings on the Yelp page.
The xpath() method returns a list of matching elements, which is assigned to the businesses variable.
This code will extract the relevant <div> elements representing the business listings from the parsed HTML. These elements will serve as the foundation for further extracting specific details about each restaurant, such as name, rating, reviews, price range, and more.
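If the selector ever stops matching (Yelp changes these auto-generated class names from time to time), a handy optional debugging line, not part of the original script, is to print how many cards matched:

```python
print(f"Matched {len(businesses)} listing cards")  # expect roughly 10 per results page
```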
Now that we have the extracted business listing elements stored in the businesses list, we can proceed to extract specific details for each listing.
```python
for business in businesses:
    data = {}

    name_element = business.xpath('.//h3[contains(@class, "css-1agk4wl")]/span/a')
    if name_element:
        data['Name'] = name_element[0].text.strip()
        data['URL'] = "https://www.yelp.com" + name_element[0].get('href')
```
In this code snippet, we iterate over each business element within the businesses list and extract relevant information. Here's the breakdown of the snippet:
'.//h3[contains(@class, "css-1agk4wl")]/span/a'
targets the <h3> element with the class css-1agk4wl, which contains a <span> element wrapping an <a> element; the <a> holds the restaurant's name and its listing URL.
That's how this snippet will scrape Yelp business names and listing URLs.
Continuing the extraction, let's scrape Yelp business ratings from each listing. For this, we'll extract the <div> element with the aria-label attribute.
```python
rating_element = business.xpath('.//div[contains(@aria-label, "star rating")]')
if rating_element:
    rating_value = rating_element[0].get('aria-label').split()[0]
    if rating_value != 'Slideshow':
        data['Rating'] = float(rating_value)
    else:
        data['Rating'] = None
```
In this snippet, we extract the rating information for each business listing. Here's how it works:
'.//div[contains(@aria-label, "star rating")]'

selects the <div> element that has an aria-label attribute containing the text "star rating". If the rating_element is found, we extract the rating value from its aria-label attribute: we retrieve the attribute using .get('aria-label') and split it into a list of words using .split(). The rating value is the first element of this list, representing the numeric rating.
We'll check whether the extracted rating value equals 'Slideshow' (a special case where Yelp displays a dynamic slideshow instead of a numeric rating). If it's not 'Slideshow', we convert the value to a float using float(rating_value) and assign it to data['Rating']. Otherwise, we assign None to data['Rating'].
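To make this concrete, here's a tiny standalone illustration of the parsing logic with made-up aria-label values:

```python
# Hypothetical aria-label values, for illustration only
label = "4.5 star rating"
print(label.split()[0])  # "4.5" -> float("4.5") gives 4.5

label = "Slideshow"      # Yelp's ad slideshow card carries no numeric rating
print(label.split()[0])  # "Slideshow" -> we store None instead
```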
To extract the number of reviews, we'll target the <span> having the class css-chan6m. Here's the code snippet that extracts reviews using this XPath.
```python
reviews_element = business.xpath('.//span[contains(@class, "css-chan6m")]')
if reviews_element:
    reviews_text = reviews_element[0].text
    if reviews_text:
        reviews_text = reviews_text.strip().split()[0]
        if reviews_text.isnumeric():
            data['Reviews'] = int(reviews_text)
        else:
            data['Reviews'] = None
```
Here's a breakdown of this code snippet:
We use the xpath() method to locate the element that holds the number of reviews.
If the reviews_element exists, we extract the text content of the element using .text. The extracted text represents the number of reviews.
We ensure that reviews_text is not empty or None. If it contains a value, we extract the numeric part of the text by stripping whitespace and taking the first word.
After extracting the numeric portion, we check whether it is a valid number using .isnumeric(). If it is, we convert it to an integer and assign it to data['Reviews']. Otherwise, we assign None to data['Reviews'].
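Here's the same logic run on a couple of made-up review strings, as a quick sketch:

```python
# Hypothetical review-count texts, for illustration only
for text in ("316", "No reviews"):
    first_word = text.strip().split()[0]
    print(int(first_word) if first_word.isnumeric() else None)  # 316, then None
```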
Extracting the price range is the easiest part of our process. All you have to do is locate the <span> with class priceRange__09f24__mmOuH and your code will look like this:
```python
price_element = business.xpath('.//span[contains(@class, "priceRange__09f24__mmOuH")]')
if price_element:
    data['Price Range'] = price_element[0].text.strip()
```
This simple snippet will extract the price range from the listings.
To extract category names, we'll again use inspect element to find the relevant element. Our category name is located in a <span> with the class css-11bijt4. Let's add it to our code and extract the category names of all listings.
To retrieve the categories associated with each business listing, we use the following code snippet:
```python
categories_element = business.xpath('.//span[contains(@class, "css-11bijt4")]')
if categories_element:
    data['Categories'] = ", ".join([c.text for c in categories_element])
```
Here's what this snippet does:
We search for <span> elements that contain the class "css-11bijt4" using the xpath() method.
If categories_element exists, we extract the text content of each <span> element using a list comprehension: [c.text for c in categories_element]. This creates a list of the category names.
Then we join the elements of the list into a single string, separated by commas, using ", ".join(...). This consolidates the category names into one formatted string.
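For instance, with two hypothetical category spans the join produces:

```python
print(", ".join(["Chinese", "Dim Sum"]))  # -> Chinese, Dim Sum
```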
Looking at the HTML, we can again use the class name of a <span> to extract neighborhoods. Our code locates most elements this way. But in the case of the neighborhood, the class name is css-chan6m. What's so special about that? It's the same class name we used to extract the review counts.
If we use it without anchoring to a parent element, it will mix up the two fields and mess up the entire dataset. To solve this, we reference the parent <p> element. Here's how the code looks:
```python
neighborhood_element = business.xpath('.//p[@class="css-dzq7l1"]/span[contains(@class, "css-chan6m")]')
if neighborhood_element:
    neighborhood_text = neighborhood_element[0].text
    if neighborhood_text:
        data['Neighborhood'] = neighborhood_text.strip()
```
We've already seen how this pattern works, so there's no need to explain it again. Let's move to the next part of our code.
After extracting the relevant details from each business listing, we need to store them. Here's how we do it:
```python
        assert data
        scraped_data.append(data)   # still inside the for loop

    return scraped_data             # runs after the loop finishes
```
This code snippet makes sure that the data dictionary contains valid information before storing it in the scraped_data list. Let's break down the code:
The assert statement is used to validate a condition. In this case, we assert that data is not empty or None. If the condition evaluates to False, an AssertionError is raised, indicating that something unexpected occurred during the scraping process.
After the assertion, we append the data dictionary to the scraped_data list. This adds the extracted details for a particular business listing to the list of all scraped data.
Finally, we return the scraped_data list, which contains dictionaries representing the extracted information for each business listing.
Our Python script is now functional and able to scrape Yelp listings with 6 data attributes. Let's save the Yelp listings we extracted to a CSV file.
After scraping Yelp listings with the desired data attributes, let's save the data to a CSV file. To do that, we'll define a save_to_csv method inside the YelpSearchScraper class.
```python
def save_to_csv(self, data, filename):
    keys = data[0].keys()
    with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=keys, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(data)
    print("Success!\nData written to CSV file:", filename)
```
The save_to_csv method takes two parameters: data, representing the extracted data, and filename, specifying the name of the CSV file to create. Here's a breakdown of the code:
```python
def save_to_csv(self, data, filename):
```
This line defines a method named save_to_csv within the YelpSearchScraper class. It takes three parameters: self, which refers to the instance of the class, data, representing the extracted data to be saved, and filename, which specifies the name of the CSV file to create.
```python
keys = data[0].keys()
```
This line retrieves the keys (fieldnames) of the first dictionary in the data list. It uses the keys() method to obtain a view object containing the keys.
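One caveat worth knowing: since the fieldnames come from the first dictionary only, any attribute missing from that first listing (say, Price Range) will be missing from the whole CSV, because extrasaction='ignore' silently drops it for every other row. A more defensive variant, a sketch rather than the original code, collects the keys across all rows:

```python
# Union of keys over all scraped rows, preserving first-seen order
keys = list(dict.fromkeys(k for row in data for k in row))
```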
```python
with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
```
This line opens the specified filename in write mode ('w') using the open() function. The newline='' argument ensures that newline characters are handled correctly by the csv module, and the 'utf-8-sig' encoding writes a byte-order mark so that non-ASCII characters display correctly when the file is opened in tools like Excel.
```python
writer = csv.DictWriter(csvfile, fieldnames=keys, extrasaction='ignore')
```
This line creates a DictWriter object named writer. It takes three arguments: csvfile, which is the opened CSV file, fieldnames=keys, which specifies the fieldnames for the CSV file based on the keys obtained earlier, and extrasaction='ignore', which tells the writer to ignore any extra keys in the dictionaries.
```python
writer.writeheader()
writer.writerows(data)
```
The writeheader() call writes the header row, and writerows() writes the rows of data to our CSV file.
Finally, the line prints a success message, confirming that the data has been successfully written to the CSV file.
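For illustration, the first few lines of the resulting file might look like this (the values here are invented):

```
Name,URL,Rating,Reviews,Price Range,Categories,Neighborhood
Lucky Dragon,https://www.yelp.com/biz/lucky-dragon-london,4.5,132,££,"Chinese, Dim Sum",Soho
Golden Panda,https://www.yelp.com/biz/golden-panda-london,4.0,87,££,Chinese,Chinatown
```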
While scraping Yelp search results, we don't want to stop at the first page. There are only 10 results per page, and Yelp shows up to 24 pages per search URL, which means you can scrape up to 240 results per search. So far, our code only handles the first page. Let's enhance it to deal with Yelp's pagination.
To do so, we'll define a method named scrape_results and add the pagination logic to it. Let's do it.
```python
def scrape_results(self, search_url, max_page):
    all_results = []
    for page in range(1, max_page + 1):  # +1 so the last page is included
        page_url = search_url + f'&start={(page-1)*10}'
        print(f"Scraping Page {page}")
        results = self.iter_listings(page_url)
        if results:
            all_results.extend(results)
        time.sleep(2)
    return all_results
```
Let's break down this code snippet for better understanding:
```python
def scrape_results(self, search_url, max_page):
```
This line defines a method named scrape_results within the YelpSearchScraper class. It takes three parameters: self, which refers to the instance of the class, search_url, representing the URL for the Yelp search, and max_page, specifying the maximum number of pages to scrape.
```python
all_results = []
```
This initializes an empty list named all_results that will be used to store the extracted data from each page.
```python
for page in range(1, max_page + 1):  # +1 because range() excludes its upper bound
```
This sets up a loop that iterates over the page numbers from 1 up to and including max_page (hence the + 1, since range() excludes its upper bound). This loop controls the scraping process for each page.
```python
page_url = search_url + f'&start={(page-1)*10}'
```
This constructs the URL for each page by appending the appropriate start index parameter to the search_url. The start index is calculated as (page-1)*10, where each page displays 10 results.
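With the default search URL, the constructed page URLs come out like this:

```
page 1 -> https://www.yelp.com/search?find_desc=Burgers&find_loc=London&start=0
page 2 -> https://www.yelp.com/search?find_desc=Burgers&find_loc=London&start=10
page 3 -> https://www.yelp.com/search?find_desc=Burgers&find_loc=London&start=20
```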
```python
results = self.iter_listings(page_url)
```
This calls the iter_listings method to scrape the listings from the current page URL (page_url). The method returns a list of dictionaries representing the extracted data for each business listing on the page.
```python
if results:
    all_results.extend(results)
```
This line checks if results is not empty. If there are extracted results from the current page, they are added to the all_results list using the extend() method. This ensures that all extracted data is accumulated across all pages.
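Note that extend() keeps the list flat, which is exactly what we want here:

```python
all_results = [{'Name': 'A'}]
all_results.extend([{'Name': 'B'}, {'Name': 'C'}])
# -> [{'Name': 'A'}, {'Name': 'B'}, {'Name': 'C'}]; append() would nest the list instead
```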
```python
time.sleep(2)
```
This introduces a 2-second delay using the time.sleep() function. It helps to prevent overwhelming the server with frequent requests and ensures a more polite scraping process.
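If you want the scraper to look a little less robotic, a common variation (not in the original script) is to randomize the pause:

```python
import random
import time

# Sleep a random 1.5-3.5 seconds so requests don't arrive on a fixed beat
time.sleep(random.uniform(1.5, 3.5))
```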
```python
return all_results
```
This line returns the all_results list, which contains the extracted data from all scraped pages. Now our Yelp scraper is capable of going beyond page 1 to extract all results while scraping Yelp listings from a search URL.
In the final part of our code, we have the main() function responsible for executing the Yelp scraper.
```python
def main():
    s = time.perf_counter()
    argparser = argparse.ArgumentParser()
    argparser.add_argument('--search-url', '-u', type=str, required=False,
                           help='Yelp search URL',
                           default='https://www.yelp.com/search?find_desc=Burgers&find_loc=London')
    argparser.add_argument('--max-page', '-p', type=int, required=False,
                           help='Max page to visit', default=5)
    args = argparser.parse_args()
    search_url = args.search_url
    max_page = args.max_page
    assert all([search_url, max_page])

    scraper = YelpSearchScraper()
    results = scraper.scrape_results(search_url, max_page)
    if results:
        scraper.save_to_csv(results, 'yelp_search_results.csv')
    else:
        print("No results to save to CSV")

    elapsed = time.perf_counter() - s
    elapsed_formatted = "{:.2f}".format(elapsed)
    print("Elapsed time:", elapsed_formatted, "seconds")
    print('''~~ success
 _       _         _
| |     | |       | |
| | ___ | |__  ___| |_ _ __
| |/ _ \| '_ \/ __| __| '__|
| | (_) | |_) \__ \ |_ | |
|_|\___/|_.__/|___/\__||_|
''')


if __name__ == '__main__':
    main()
```
Let's do a quick breakdown of our main function:
```python
def main():
    s = time.perf_counter()
```
Defines the main() function, which serves as the entry point of our program. Within this function, we initialize a variable s to store the starting time using time.perf_counter().
```python
argparser = argparse.ArgumentParser()
argparser.add_argument('--search-url', '-u', type=str, required=False,
                       help='Yelp search URL',
                       default='https://www.yelp.com/search?find_desc=Burgers&find_loc=London')
argparser.add_argument('--max-page', '-p', type=int, required=False,
                       help='Max page to visit', default=5)
args = argparser.parse_args()
```
These lines create an instance of the argparse.ArgumentParser() class to handle command-line arguments. We define two optional arguments: --search-url (or -u) to specify the Yelp search URL, and --max-page (or -p) to set the maximum number of pages to visit. Default values are provided for convenience. Thanks to this part of our code, we won't need to open the script and edit the URL and page count for every run; we can specify them on the command line.
```python
search_url = args.search_url
max_page = args.max_page
```
These lines extract the values of the command-line arguments search_url and max_page from the args object, assigning them to the respective variables.
```python
assert all([search_url, max_page])
```
This uses the assert statement to validate that both search_url and max_page have non-empty values. If any of the values are missing, an AssertionError is raised.
```python
scraper = YelpSearchScraper()
results = scraper.scrape_results(search_url, max_page)
```
This creates an instance of the YelpSearchScraper class named scraper. We then call the scrape_results() method of the scraper object, passing search_url and max_page as arguments. The method initiates the scraping process and returns the extracted results.
```python
if results:
    scraper.save_to_csv(results, 'yelp_search_results.csv')
else:
    print("No results to save to CSV")
```
This if-else block checks if the results list is not empty. If there are extracted results, we call the save_to_csv() method of the scraper object to save the results to a CSV file named 'yelp_search_results.csv'. If there are no results, we print a message indicating that there are no results to save.
```python
elapsed = time.perf_counter() - s
elapsed_formatted = "{:.2f}".format(elapsed)
print("Elapsed time:", elapsed_formatted, "seconds")
```
These lines calculate the elapsed time by subtracting the starting time (s) from the current time using time.perf_counter(). The elapsed time is stored in the elapsed variable. It is then formatted to two decimal places and stored in the elapsed_formatted variable. Finally, we print the elapsed time in seconds.
It's time to test our Yelp scraper. We'll scrape Yelp to gather details of 30 Chinese restaurants located in London. The first step is to get the URL: visit yelp.com, search for the keyword, and select the location. Then copy the URL from the address bar.
Let's launch our Python-based Yelp scraper. Go to the folder where you've saved the Python script, hold Shift, and right-click on an empty area of the window. From the menu, select "Open PowerShell window here".
In your console, type the following command:
```
python {your_script_name.py} -u {url} -p {max number of pages to scrape}
```
Replace {your_script_name.py} with the name of your Python file, {url} with the Yelp URL you want to scrape, and {max number of pages to scrape} with the number of pages you want to scrape. In our case, the command is:
```
python yelpscraper.py -u "https://www.yelp.com/search?find_desc=Chinese&find_loc=London" -p 3
```
Hit enter and let the magic begin.
Here we go. We just extracted 30 restaurant listings with 6 data attributes from Yelp. Let's see what the output file looks like:
And here they are. 30 Chinese restaurants located in London extracted with name, URL, rating, reviews, categories, and neighborhood. All in just 17 seconds.
While our Python-based Yelp scraper offers convenience in extracting restaurant listings from Yelp, it does have some limitations that should be considered. These limitations revolve around the scope of data extraction and the absence of anti-bot bypass measures.
Firstly, the scraper is unable to extract data from individual listing pages, which restricts the depth and breadth of information obtained. While it successfully captures essential attributes such as name, rating, neighborhood, and other basic details, it lacks the capability to navigate to individual listing pages and extract more comprehensive information like contact details, opening hours, or menu items.
Furthermore, the absence of anti-bot bypass measures exposes the scraper to detection by Yelp's security mechanisms. This can lead to potential IP banning, hindering the scraping process and preventing access to Yelp's data. Without anti-bot measures in place, the scraper's reliability and scalability may be compromised, posing limitations for large-scale scraping operations or frequent data extraction.
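A first, very modest mitigation, though no match for Yelp's real defenses, is to send browser-like headers with each request. A sketch, with an example User-Agent string you would want to rotate in practice:

```python
# Hypothetical header set, for illustration; rotate real browser User-Agents in practice
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-GB,en;q=0.9',
}
response = requests.get(url, headers=headers, timeout=30)
```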
To overcome these limitations, we recommend using Yelp Search Export, a no-code and cloud-based solution. Yelp Search Export offers an expanded set of 13 data attributes, including contact information, opening hours, reviews, photos, and more. It incorporates advanced anti-bot bypass measures, ensuring reliable scraping. The solution is user-friendly, scalable, and provides a free version with premium plans available for additional features.
Learn how to scrape Yelp business listings with all important data attributes without writing a single line of code.
In conclusion, this article has provided a starting point for scraping restaurant listings from Yelp using Python and the requests library. We've walked through the code, explaining its functionality and limitations. If you're curious about web scraping or looking to learn web scraping with Python, we hope this article has been helpful in introducing you to the process.
However, if you're seeking a solution for large-scale scraping, efficiency, and time-saving, we recommend considering our powerful no-code scraper Yelp Search Export. It offers a hassle-free and robust solution for extracting Yelp data.
Happy scraping 🦞
Self-proclaimed Head of Content @ lobstr.io. I write all those awesome how-tos and listicles, and (when they deserve it) troll our competitors.