How to scrape Yelp listings using Python and requests in 2023
With over 135 million monthly visitors and an extensive business directory, Yelp has become a go-to resource for people searching for restaurants, services, and more. That makes it a wealth of data that entrepreneurs, marketers, and researchers can leverage to gain insights and make informed decisions. In this article, we will learn how to scrape Yelp listings using Python.
Using Python with requests, we'll extract Yelp's restaurant listings from a search URL, capturing 6 data attributes: names, URLs, ratings, reviews, categories, and neighborhoods. Let's dive in.
Before we scrape Yelp restaurant listings, we need to ensure that we have the necessary tools in place. The two essential components we require are Python and Sublime Text.
Once you have Python and Sublime Text set up, you'll be ready to proceed with creating a Yelp scraper using Python and writing the necessary code.
To successfully scrape Yelp listings using Python, we will need to install and import several libraries that provide the necessary functionalities. Here are the key libraries we will be using:

- requests: to fetch the Yelp search pages over HTTP
- lxml: to parse the HTML and query it with XPath
- csv: to write the extracted data to a CSV file
- argparse: to accept the search URL and page count from the command line
- time: to time the run and pause between requests
Before we proceed, make sure these libraries are installed in your Python environment. We can easily install them using the Python package manager pip.
```
pip install requests lxml
```
The csv, argparse, and time modules are included in Python's standard library, so no separate installation is required.
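If you want to confirm everything is importable before writing any code, a quick optional check from your terminal (not part of the scraper itself) looks like this:

```
python -c "import requests, lxml, csv, argparse, time; print('all good')"
```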
The complete code is accessible on GitHub, and here it is in full:
```python
import requests
import csv
from lxml import html
import argparse
import time


class YelpSearchScraper:

    def iter_listings(self, url):
        response = requests.get(url)
        if response.status_code != 200:
            print("Error: Failed to fetch the URL")
            return None

        with open('response.html', 'w') as f:
            f.write(response.text)

        tree = html.fromstring(response.content)
        scraped_data = []
        businesses = tree.xpath('//div[contains(@class, "container__09f24__mpR8_") and contains(@class, "hoverable__09f24__wQ_on") and contains(@class, "border-color--default__09f24__NPAKY")]')
        for business in businesses:
            data = {}

            name_element = business.xpath('.//h3[contains(@class, "css-1agk4wl")]/span/a')
            if name_element:
                data['Name'] = name_element[0].text.strip()
                data['URL'] = "https://www.yelp.com" + name_element[0].get('href')

            rating_element = business.xpath('.//div[contains(@aria-label, "star rating")]')
            if rating_element:
                rating_value = rating_element[0].get('aria-label').split()[0]
                if rating_value != 'Slideshow':
                    data['Rating'] = float(rating_value)
                else:
                    data['Rating'] = None

            reviews_element = business.xpath('.//span[contains(@class, "css-chan6m")]')
            if reviews_element:
                reviews_text = reviews_element[0].text
                if reviews_text:
                    reviews_text = reviews_text.strip().split()[0]
                    if reviews_text.isnumeric():
                        data['Reviews'] = int(reviews_text)
                    else:
                        data['Reviews'] = None

            price_element = business.xpath('.//span[contains(@class, "priceRange__09f24__mmOuH")]')
            if price_element:
                data['Price Range'] = price_element[0].text.strip()

            categories_element = business.xpath('.//span[contains(@class, "css-11bijt4")]')
            if categories_element:
                data['Categories'] = ", ".join([c.text for c in categories_element])

            neighborhood_element = business.xpath('.//p[@class="css-dzq7l1"]/span[contains(@class, "css-chan6m")]')
            if neighborhood_element:
                neighborhood_text = neighborhood_element[0].text
                if neighborhood_text:
                    data['Neighborhood'] = neighborhood_text.strip()

            assert data
            scraped_data.append(data)

        return scraped_data

    def save_to_csv(self, data, filename):
        keys = data[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=keys, extrasaction='ignore')
            writer.writeheader()
            writer.writerows(data)
        print("Success!\nData written to CSV file:", filename)

    def scrape_results(self, search_url, max_page):
        all_results = []
        for page in range(1, max_page + 1):  # +1 so the last page is included
            page_url = search_url + f'&start={(page-1)*10}'
            print(f"Scraping Page {page}")
            results = self.iter_listings(page_url)
            if results:
                all_results.extend(results)
            time.sleep(2)
        return all_results


def main():
    s = time.perf_counter()
    argparser = argparse.ArgumentParser()
    argparser.add_argument('--search-url', '-u', type=str, required=False,
                           help='Yelp search URL',
                           default='https://www.yelp.com/search?find_desc=Burgers&find_loc=London')
    argparser.add_argument('--max-page', '-p', type=int, required=False,
                           help='Max page to visit', default=5)
    args = argparser.parse_args()
    search_url = args.search_url
    max_page = args.max_page
    assert all([search_url, max_page])

    scraper = YelpSearchScraper()
    results = scraper.scrape_results(search_url, max_page)
    if results:
        scraper.save_to_csv(results, 'yelp_search_results.csv')
    else:
        print("No results to save to CSV")

    elapsed = time.perf_counter() - s
    elapsed_formatted = "{:.2f}".format(elapsed)
    print("Elapsed time:", elapsed_formatted, "seconds")
    print('''~~ success
 _       _         _
| |     | |       | |
| | ___ | |__  ___| |_ _ __
| |/ _ \| '_ \/ __| __| '__|
| | (_) | |_) \__ \ |_ | |
|_|\___/|_.__/|___/\__||_|
''')


if __name__ == '__main__':
    main()
```
To begin coding our Yelp scraper, we first need to import the libraries. By importing these libraries, we ensure that we have access to the required tools and functionalities for our Yelp scraper.
```python
import requests
import csv
from lxml import html
import argparse
import time
```
To keep our scraping code organized and reusable, we will create a class called YelpSearchScraper. This class will contain all the methods required for scraping and saving the data; we can create instances of it and call its methods whenever needed.
Within the YelpSearchScraper class, we have the iter_listings method. This method takes a URL parameter, representing the page we want to scrape.
```python
class YelpSearchScraper:
    def iter_listings(self, url):
        response = requests.get(url)
        if response.status_code != 200:
            print("Error: Failed to fetch the URL")
            return None
```
Here's what happens in this method: we send a GET request to the given URL with requests.get(). If the response status code is anything other than 200, we print an error message and return None; otherwise we carry on with the response.
After fetching the HTML content of the Yelp page, we want to save it for reference. To do this, we can use the following code snippet:
```python
with open('response.html', 'w') as f:
    f.write(response.text)
```
By saving the HTML content to the file response.html, we can easily examine the structure and elements of the fetched page. This is particularly useful for understanding the data and designing appropriate XPath expressions or CSS selectors for extracting specific information during the scraping process.
Once we have fetched the HTML content of the Yelp page, the next step is to parse it and extract the relevant information. Let's examine the following code snippet:
```python
tree = html.fromstring(response.content)
scraped_data = []
```
In this code snippet, we utilize the lxml library to parse the HTML content and create a structured tree representation: html.fromstring() takes the raw bytes of the response and returns the root element of a document tree that we can query.
By creating the tree object, we can now navigate the HTML structure and extract specific elements using XPath expressions. The scraped_data list will store the extracted information.
Our next step is to extract the business listings from the Yelp page. Let's open the response.html file we saved earlier and use inspect element to find an XPath that locates the elements representing business listings on the Yelp page.
The business listings on a Yelp search results page are structured as an unordered list (<ul>), with each business listing in an <li>. But there are some blank <li> elements as well. At first, I anchored my XPath on the <ul> element, but this pulled in unwanted data. Switching to the <li> elements still produced inaccurate data.
To avoid these problems, we're narrowing down to a <div> element inside each <li> with the classes "container__09f24__mpR8_", "hoverable__09f24__wQ_on", and "border-color--default__09f24__NPAKY". These classes are common to all business listing cards but absent from the empty list items and other unwanted elements.
```python
businesses = tree.xpath('//div[contains(@class, "container__09f24__mpR8_") and contains(@class, "hoverable__09f24__wQ_on") and contains(@class, "border-color--default__09f24__NPAKY")]')
```
In this code snippet, we utilize the xpath() method of the tree object to extract the desired elements from the parsed HTML based on an XPath expression. Here's what happens:
'//div[contains(@class, "container__09f24__mpR8_") and contains(@class, "hoverable__09f24__wQ_on") and contains(@class, "border-color--default__09f24__NPAKY")]'.
The XPath expression targets <div> elements that contain certain classes. By using the contains() function, we specify that the element must contain all three classes mentioned in the expression. This helps us locate the specific elements that represent the business listings on the Yelp page.
The xpath() method returns a list of matching elements, which is assigned to the businesses variable.
This code will extract the relevant <div> elements representing the business listings from the parsed HTML. These elements will serve as the foundation for further extracting specific details about each restaurant, such as name, rating, reviews, price range, and more.
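If the selector ever stops matching (Yelp changes these auto-generated class names from time to time), a handy optional debugging line, not part of the original script, is to print how many cards matched:

```python
print(f"Matched {len(businesses)} listing cards")  # expect roughly 10 per results page
```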
Now that we have the extracted business listing elements stored in the businesses list, we can proceed to extract specific details for each listing.
```python
for business in businesses:
    data = {}

    name_element = business.xpath('.//h3[contains(@class, "css-1agk4wl")]/span/a')
    if name_element:
        data['Name'] = name_element[0].text.strip()
        data['URL'] = "https://www.yelp.com" + name_element[0].get('href')
```
In this code snippet, we iterate over each business element within the businesses list and extract relevant information. Here's the breakdown of the snippet:
'.//h3[contains(@class, "css-1agk4wl")]/span/a'
targets the <h3> element with the class css-1agk4wl, which contains a <span> element wrapping an <a> element; the <a> holds the restaurant's name and its listing URL.
That's how this snippet will scrape Yelp business names and listing URLs.
Continuing the extraction, let's scrape Yelp business ratings from each listing. For this, we'll extract the <div> element with the aria-label attribute.
```python
rating_element = business.xpath('.//div[contains(@aria-label, "star rating")]')
if rating_element:
    rating_value = rating_element[0].get('aria-label').split()[0]
    if rating_value != 'Slideshow':
        data['Rating'] = float(rating_value)
    else:
        data['Rating'] = None
```
In this snippet, we extract the rating information for each business listing. Here's how it works:
'.//div[contains(@aria-label, "star rating")]'

selects the <div> element that has an aria-label attribute containing the text "star rating". If the rating_element is found, we extract the rating value from its aria-label attribute: we retrieve the attribute using .get('aria-label') and split it into a list of words using .split(). The rating value is the first element of this list, representing the numeric rating.
We'll check whether the extracted rating value equals 'Slideshow' (a special case where Yelp displays a dynamic slideshow instead of a numeric rating). If it's not 'Slideshow', we convert the value to a float using float(rating_value) and assign it to data['Rating']. Otherwise, we assign None to data['Rating'].
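To make this concrete, here's a tiny standalone illustration of the parsing logic with made-up aria-label values:

```python
# Hypothetical aria-label values, for illustration only
label = "4.5 star rating"
print(label.split()[0])  # "4.5" -> float("4.5") gives 4.5

label = "Slideshow"      # Yelp's ad slideshow card carries no numeric rating
print(label.split()[0])  # "Slideshow" -> we store None instead
```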
To extract the number of reviews, we'll target the <span> having the class css-chan6m. Here's the code snippet that extracts reviews using this XPath.
```python
reviews_element = business.xpath('.//span[contains(@class, "css-chan6m")]')
if reviews_element:
    reviews_text = reviews_element[0].text
    if reviews_text:
        reviews_text = reviews_text.strip().split()[0]
        if reviews_text.isnumeric():
            data['Reviews'] = int(reviews_text)
        else:
            data['Reviews'] = None
```
Here's a breakdown of this code snippet:
We use the xpath() method to locate the element that holds the number of reviews.
If the reviews_element exists, we extract the text content of the element using .text. The extracted text represents the number of reviews.
We ensure that reviews_text is not empty or None. If it contains a value, we extract the numeric part of the text by stripping whitespace and taking the first word.
After extracting the numeric portion, we check whether it is a valid number using .isnumeric(). If it is, we convert it to an integer and assign it to data['Reviews']. Otherwise, we assign None to data['Reviews'].
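Here's the same logic run on a couple of made-up review strings, as a quick sketch:

```python
# Hypothetical review-count texts, for illustration only
for text in ("316", "No reviews"):
    first_word = text.strip().split()[0]
    print(int(first_word) if first_word.isnumeric() else None)  # 316, then None
```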
Extracting the price range is the easiest part of our process. All you have to do is locate the <span> with class priceRange__09f24__mmOuH and your code will look like this:
```python
price_element = business.xpath('.//span[contains(@class, "priceRange__09f24__mmOuH")]')
if price_element:
    data['Price Range'] = price_element[0].text.strip()
```
This simple snippet will extract the price range from the listings.
To extract category names, we'll again use inspect element to find the relevant element. Our category name is located in a <span> with the class css-11bijt4. Let's add it to our code and extract the category names of all listings.
To retrieve the categories associated with each business listing, we use the following code snippet:
```python
categories_element = business.xpath('.//span[contains(@class, "css-11bijt4")]')
if categories_element:
    data['Categories'] = ", ".join([c.text for c in categories_element])
```
Here's what this snippet does:
We search for <span> elements that contain the class "css-11bijt4" using the xpath() method.
If categories_element exists, we extract the text content of each <span> element using a list comprehension: [c.text for c in categories_element]. This creates a list of the category names.
Then we join the elements of the list into a single string, separated by commas, using ", ".join(...). This consolidates the category names into one formatted string.
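For instance, with two hypothetical category spans the join produces:

```python
print(", ".join(["Chinese", "Dim Sum"]))  # -> Chinese, Dim Sum
```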
Looking at the HTML, we can again use the class name of a <span> to extract neighborhoods. Our code locates most elements this way. But in the case of the neighborhood, the class name is css-chan6m. What's so special about that? It's the same class name we used to extract the review counts.
If we use it without anchoring to a parent element, it will mix up the two fields and mess up the entire dataset. To solve this, we reference the parent <p> element. Here's how the code looks:
```python
neighborhood_element = business.xpath('.//p[@class="css-dzq7l1"]/span[contains(@class, "css-chan6m")]')
if neighborhood_element:
    neighborhood_text = neighborhood_element[0].text
    if neighborhood_text:
        data['Neighborhood'] = neighborhood_text.strip()
```
We've already seen how this pattern works, so there's no need to explain it again. Let's move to the next part of our code.
After extracting the relevant details from each business listing, we need to store them. Here's how we do it:
```python
        assert data
        scraped_data.append(data)   # still inside the for loop

    return scraped_data             # runs after the loop finishes
```
This code snippet makes sure that the data dictionary contains valid information before storing it in the scraped_data list. Let's break down the code:
The assert statement is used to validate a condition. In this case, we assert that data is not empty or None. If the condition evaluates to False, an AssertionError is raised, indicating that something unexpected occurred during the scraping process.
After the assertion, we append the data dictionary to the scraped_data list. This adds the extracted details for a particular business listing to the list of all scraped data.
Finally, we return the scraped_data list, which contains dictionaries representing the extracted information for each business listing.
Our Python script is now functional and able to scrape Yelp listings with 6 data attributes. Let's save the Yelp listings we extracted to a CSV file.
After scraping Yelp listings with the desired data attributes, let's save the data to a CSV file. To do that, we'll define a save_to_csv method inside the YelpSearchScraper class.
```python
def save_to_csv(self, data, filename):
    keys = data[0].keys()
    with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=keys, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(data)
    print("Success!\nData written to CSV file:", filename)
```
The save_to_csv method takes two parameters: data, representing the extracted data, and filename, specifying the name of the CSV file to create. Here's a breakdown of the code:
```python
def save_to_csv(self, data, filename):
```
This line defines a method named save_to_csv within the YelpSearchScraper class. It takes three parameters: self, which refers to the instance of the class, data, representing the extracted data to be saved, and filename, which specifies the name of the CSV file to create.
```python
keys = data[0].keys()
```
This line retrieves the keys (fieldnames) of the first dictionary in the data list. It uses the keys() method to obtain a view object containing the keys.
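One caveat worth knowing: since the fieldnames come from the first dictionary only, any attribute missing from that first listing (say, Price Range) will be missing from the whole CSV, because extrasaction='ignore' silently drops it for every other row. A more defensive variant, a sketch rather than the original code, collects the keys across all rows:

```python
# Union of keys over all scraped rows, preserving first-seen order
keys = list(dict.fromkeys(k for row in data for k in row))
```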
```python
with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
```
This line opens the specified filename in write mode ('w') using the open() function. The newline='' argument ensures that newline characters are handled correctly by the csv module, and the 'utf-8-sig' encoding writes a byte-order mark so that non-ASCII characters display correctly when the file is opened in tools like Excel.
```python
writer = csv.DictWriter(csvfile, fieldnames=keys, extrasaction='ignore')
```
This line creates a DictWriter object named writer. It takes three arguments: csvfile, which is the opened CSV file, fieldnames=keys, which specifies the fieldnames for the CSV file based on the keys obtained earlier, and extrasaction='ignore', which tells the writer to ignore any extra keys in the dictionaries.
```python
writer.writeheader()
writer.writerows(data)
```
The writeheader() call writes the header row, and writerows() writes the rows of data to our CSV file.
Finally, the line prints a success message, confirming that the data has been successfully written to the CSV file.
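For illustration, the first few lines of the resulting file might look like this (the values here are invented):

```
Name,URL,Rating,Reviews,Price Range,Categories,Neighborhood
Lucky Dragon,https://www.yelp.com/biz/lucky-dragon-london,4.5,132,££,"Chinese, Dim Sum",Soho
Golden Panda,https://www.yelp.com/biz/golden-panda-london,4.0,87,££,Chinese,Chinatown
```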
While scraping Yelp search results, we don't want to stop at the first page. There are only 10 results per page, and Yelp shows up to 24 pages per search URL, which means you can scrape up to 240 results per search. So far, our code only handles the first page. Let's enhance it to deal with Yelp's pagination.
To do so, we'll define a method named scrape_results and add the pagination logic to it. Let's do it.
```python
def scrape_results(self, search_url, max_page):
    all_results = []
    for page in range(1, max_page + 1):  # +1 so the last page is included
        page_url = search_url + f'&start={(page-1)*10}'
        print(f"Scraping Page {page}")
        results = self.iter_listings(page_url)
        if results:
            all_results.extend(results)
        time.sleep(2)
    return all_results
```
Let's break down this code snippet for better understanding:
```python
def scrape_results(self, search_url, max_page):
```
This line defines a method named scrape_results within the YelpSearchScraper class. It takes three parameters: self, which refers to the instance of the class, search_url, representing the URL for the Yelp search, and max_page, specifying the maximum number of pages to scrape.
```python
all_results = []
```
This initializes an empty list named all_results that will be used to store the extracted data from each page.
```python
for page in range(1, max_page + 1):  # +1 because range() excludes its upper bound
```
This sets up a loop that iterates over the page numbers from 1 up to and including max_page (hence the + 1, since range() excludes its upper bound). This loop controls the scraping process for each page.
```python
page_url = search_url + f'&start={(page-1)*10}'
```
This constructs the URL for each page by appending the appropriate start index parameter to the search_url. The start index is calculated as (page-1)*10, where each page displays 10 results.
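With the default search URL, the constructed page URLs come out like this:

```
page 1 -> https://www.yelp.com/search?find_desc=Burgers&find_loc=London&start=0
page 2 -> https://www.yelp.com/search?find_desc=Burgers&find_loc=London&start=10
page 3 -> https://www.yelp.com/search?find_desc=Burgers&find_loc=London&start=20
```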
```python
results = self.iter_listings(page_url)
```
This calls the iter_listings method to scrape the listings from the current page URL (page_url). The method returns a list of dictionaries representing the extracted data for each business listing on the page.
```python
if results:
    all_results.extend(results)
```
This line checks if results is not empty. If there are extracted results from the current page, they are added to the all_results list using the extend() method. This ensures that all extracted data is accumulated across all pages.
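Note that extend() keeps the list flat, which is exactly what we want here:

```python
all_results = [{'Name': 'A'}]
all_results.extend([{'Name': 'B'}, {'Name': 'C'}])
# -> [{'Name': 'A'}, {'Name': 'B'}, {'Name': 'C'}]; append() would nest the list instead
```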
```python
time.sleep(2)
```
This introduces a 2-second delay using the time.sleep() function. It helps to prevent overwhelming the server with frequent requests and ensures a more polite scraping process.
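If you want the scraper to look a little less robotic, a common variation (not in the original script) is to randomize the pause:

```python
import random
import time

# Sleep a random 1.5-3.5 seconds so requests don't arrive on a fixed beat
time.sleep(random.uniform(1.5, 3.5))
```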
```python
return all_results
```
This line returns the all_results list, which contains the extracted data from all scraped pages. Now our Yelp scraper is capable of going beyond page 1 to extract all results while scraping Yelp listings from a search URL.
In the final part of our code, we have the main() function responsible for executing the Yelp scraper.
```python
def main():
    s = time.perf_counter()
    argparser = argparse.ArgumentParser()
    argparser.add_argument('--search-url', '-u', type=str, required=False,
                           help='Yelp search URL',
                           default='https://www.yelp.com/search?find_desc=Burgers&find_loc=London')
    argparser.add_argument('--max-page', '-p', type=int, required=False,
                           help='Max page to visit', default=5)
    args = argparser.parse_args()
    search_url = args.search_url
    max_page = args.max_page
    assert all([search_url, max_page])

    scraper = YelpSearchScraper()
    results = scraper.scrape_results(search_url, max_page)
    if results:
        scraper.save_to_csv(results, 'yelp_search_results.csv')
    else:
        print("No results to save to CSV")

    elapsed = time.perf_counter() - s
    elapsed_formatted = "{:.2f}".format(elapsed)
    print("Elapsed time:", elapsed_formatted, "seconds")
    print('''~~ success
 _       _         _
| |     | |       | |
| | ___ | |__  ___| |_ _ __
| |/ _ \| '_ \/ __| __| '__|
| | (_) | |_) \__ \ |_ | |
|_|\___/|_.__/|___/\__||_|
''')


if __name__ == '__main__':
    main()
```
Let's do a quick breakdown of our main function:
```python
def main():
    s = time.perf_counter()
```
Defines the main() function, which serves as the entry point of our program. Within this function, we initialize a variable s to store the starting time using time.perf_counter().
```python
argparser = argparse.ArgumentParser()
argparser.add_argument('--search-url', '-u', type=str, required=False,
                       help='Yelp search URL',
                       default='https://www.yelp.com/search?find_desc=Burgers&find_loc=London')
argparser.add_argument('--max-page', '-p', type=int, required=False,
                       help='Max page to visit', default=5)
args = argparser.parse_args()
```
These lines create an instance of the argparse.ArgumentParser() class to handle command-line arguments. We define two optional arguments: --search-url (or -u) to specify the Yelp search URL, and --max-page (or -p) to set the maximum number of pages to visit. Default values are provided for convenience. Thanks to this part of our code, we won't need to open the script and edit the URL and page count for every run; we can specify them on the command line.
```python
search_url = args.search_url
max_page = args.max_page
```
These lines extract the values of the command-line arguments search_url and max_page from the args object, assigning them to the respective variables.
```python
assert all([search_url, max_page])
```
This uses the assert statement to validate that both search_url and max_page have non-empty values. If any of the values are missing, an AssertionError is raised.
```python
scraper = YelpSearchScraper()
results = scraper.scrape_results(search_url, max_page)
```
This creates an instance of the YelpSearchScraper class named scraper. We then call the scrape_results() method of the scraper object, passing search_url and max_page as arguments. The method initiates the scraping process and returns the extracted results.
```python
if results:
    scraper.save_to_csv(results, 'yelp_search_results.csv')
else:
    print("No results to save to CSV")
```
This if-else block checks if the results list is not empty. If there are extracted results, we call the save_to_csv() method of the scraper object to save the results to a CSV file named 'yelp_search_results.csv'. If there are no results, we print a message indicating that there are no results to save.
```python
elapsed = time.perf_counter() - s
elapsed_formatted = "{:.2f}".format(elapsed)
print("Elapsed time:", elapsed_formatted, "seconds")
```
These lines calculate the elapsed time by subtracting the starting time (s) from the current time using time.perf_counter(). The elapsed time is stored in the elapsed variable. It is then formatted to two decimal places and stored in the elapsed_formatted variable. Finally, we print the elapsed time in seconds.
It's time to test our Yelp scraper. We'll scrape Yelp to gather details of 30 Chinese restaurants located in London. The first step is to get the URL: visit yelp.com, search for the keyword, and select the location. Then copy the URL from the address bar.
Let's launch our Python-based Yelp scraper. Go to the folder where you've saved the Python script, hold Shift, and right-click on an empty area of the window. From the menu, select "Open PowerShell window here".
In your console, type the following command:
```
python {your_script_name.py} -u {url} -p {max number of pages to scrape}
```
Replace {your_script_name.py} with the name of your Python file, {url} with the Yelp URL you want to scrape, and {max number of pages to scrape} with the number of pages you want to scrape. In our case, the command is:
```
python yelpscraper.py -u "https://www.yelp.com/search?find_desc=Chinese&find_loc=London" -p 3
```
Hit enter and let the magic begin.
Here we go. We just extracted 30 restaurant listings with 6 data attributes from Yelp. Let's see what the output file looks like:
And here they are. 30 Chinese restaurants located in London extracted with name, URL, rating, reviews, categories, and neighborhood. All in just 17 seconds.
While our Python-based Yelp scraper offers convenience in extracting restaurant listings from Yelp, it does have some limitations that should be considered. These limitations revolve around the scope of data extraction and the absence of anti-bot bypass measures.
Firstly, the scraper is unable to extract data from individual listing pages, which restricts the depth and breadth of information obtained. While it successfully captures essential attributes such as name, rating, neighborhood, and other basic details, it lacks the capability to navigate to individual listing pages and extract more comprehensive information like contact details, opening hours, or menu items.
Furthermore, the absence of anti-bot bypass measures exposes the scraper to detection by Yelp's security mechanisms. This can lead to potential IP banning, hindering the scraping process and preventing access to Yelp's data. Without anti-bot measures in place, the scraper's reliability and scalability may be compromised, posing limitations for large-scale scraping operations or frequent data extraction.
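A first, very modest mitigation, though no match for Yelp's real defenses, is to send browser-like headers with each request. A sketch, with an example User-Agent string you would want to rotate in practice:

```python
# Hypothetical header set, for illustration; rotate real browser User-Agents in practice
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-GB,en;q=0.9',
}
response = requests.get(url, headers=headers, timeout=30)
```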
To overcome these limitations, we recommend using Yelp Search Export, a no-code and cloud-based solution. Yelp Search Export offers an expanded set of 13 data attributes, including contact information, opening hours, reviews, photos, and more. It incorporates advanced anti-bot bypass measures, ensuring reliable scraping. The solution is user-friendly, scalable, and provides a free version with premium plans available for additional features.
Learn how to scrape Yelp business listings with all important data attributes without writing a single line of code.
In conclusion, this article has provided a starting point for scraping restaurant listings from Yelp using Python and the requests library. We've walked through the code, explaining its functionality and limitations. If you're curious about web scraping or looking to learn web scraping with Python, we hope this article has been helpful in introducing you to the process.
However, if you're seeking a solution for large-scale scraping, efficiency, and time-saving, we recommend considering our powerful no-code scraper Yelp Search Export. It offers a hassle-free and robust solution for extracting Yelp data.
Happy scraping 🦞
Self-proclaimed Head of Content @ lobstr.io. I write all those awesome how-tos and listicles, and (when they deserve it) troll our competitors.