I Used ChatGPT to Web-Scrape Google Maps Data (French Restaurants in NYC) and Finally Used My ‘Bootlegged’ Solution… (Check the Conclusion)
I went looking for French restaurants in New York City, and I wanted their coordinates. I used multiple queries for my search. My first searches were not specific enough: I had to formulate exactly what I was looking for, and how. So I came up with prompts like the following:
I played with my prompts, making variations of the keywords [‘Python’, ‘googlemap’, ‘webscrape’, ‘french’, ‘restaurant’, ‘nyc’, ‘list’, ‘coordinates’], and obtained code similar to the snippet below, also generated by ChatGPT.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.google.com/maps/search/french+restaurant+nyc"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
}

# Fetch the search page, pretending to be a regular browser via the User-Agent
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# ChatGPT assumed the results live in divs with this class name
restaurants = soup.find_all("div", class_="section-result-content")

restaurant_list = []
for restaurant in restaurants:
    name = restaurant.find("h3", class_="section-result-title").text
    address = restaurant.find("span", class_="section-result-location").text
    coordinates = restaurant.find("div", class_="section-result-location").get("data-location")
    restaurant_list.append({
        "Name": name,
        "Address": address,
        "Coordinates": coordinates
    })
The resulting restaurants list was empty. After backtracking through the results, one realizes this is because the call below
soup.find_all("div", class_="section-result-content")
returns an empty bs4.element.ResultSet.
In a nutshell, ChatGPT very often just tries to find the right ‘div’ in the soup. Here, soup.find_all('div') returns this output:
[<div id="XvQR9b"> <div class="wSgKnf"> <div> When you have eliminated the <strong>JavaScript</strong>, whatever remains must be an empty page. </div> <a class="hl4GXb" href="https://support.google.com/maps/?hl=en&authuser=0&p=no_javascript" target="_blank"> Enable JavaScript to see Google Maps. </a> </div> </div>,
 <div class="wSgKnf"> <div> When you have eliminated the <strong>JavaScript</strong>, whatever remains must be an empty page. </div> <a class="hl4GXb" href="https://support.google.com/maps/?hl=en&authuser=0&p=no_javascript" target="_blank"> Enable JavaScript to see Google Maps. </a> </div>,
 <div> When you have eliminated the <strong>JavaScript</strong>, whatever remains must be an empty page. </div>]
The ‘Enable JavaScript’ advice does not really apply to us: requests cannot execute JavaScript, and the map content is rendered client-side, so this notice is essentially all we get. We are left with a nearly empty soup. Still, after examining it, I could come up with a partial solution in the following line:
soup.find_all('meta')[9]
From this meta tag, 10 sets of coordinates can be extracted. That is not a lot. In this case, ChatGPT was only able to render the soup, which to me is a fancy way of copying and pasting the page source using a Python library. However, this is mainly due to the fact that we cannot access the data on the map, or the map itself, without running JavaScript. A solution might be to direct ChatGPT toward what the Google Maps API offers. I doubt that would work, but if someone has done that, please let me know.
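For reference, here is a minimal sketch of what that API route could look like, using Google’s Places Text Search endpoint. This is an untested assumption on my part: it requires an API key and a billing-enabled Google Cloud project, which is exactly the hassle I wanted to avoid.
# Hypothetical sketch: querying the Places Text Search API directly.
# 'YOUR_API_KEY' is a placeholder; calls to this endpoint are billed by Google.
import requests

resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params={"query": "french restaurants in New York", "key": "YOUR_API_KEY"},
)
for result in resp.json().get("results", []):
    loc = result["geometry"]["location"]
    print(result["name"], (loc["lat"], loc["lng"]))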
I proceeded to make my own solution. It is not that elegant, but it saves time for now, although I believe solutions like this may eventually be integrated into ChatGPT (or not), so that it can carry out the required tasks from a single prompt, faster.
I made the query for French restaurants in NYC. I could inspect beyond the first 10 results, but could not obtain the page source beyond them (note: I obtained the most results simply by scrolling down the page). I copied and pasted everything highlighted in the image below into a docx file (select all, copy, paste) and quickly deleted the unnecessary parts, which took five minutes at most.
In the docx file, the data looked like this:
Not very clean, despite first impressions. I proceed to read the data, then clean and organize it based on patterns we can see in the raw file. (Note that there is not much rearranging of the docx file itself; otherwise we might as well obtain each needed address manually, one by one.)
In the code below, the getText function returns one long string that somewhat resembles the soup.
%pip install python-docx
import docx

def getText(filename):
    # Read a .docx file and join all its paragraphs into one long string
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Our docx file is called ‘French2.docx’.
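As a quick sanity check (assuming the file sits in the working directory), we can load it and peek at the beginning of the string:
raw = getText('French2.docx')
print(len(raw))       # total length of the extracted text
print(raw[:200])      # peek at the first 200 characters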
After some manipulation, we see that our string contains a lot of unnecessary things like ‘\n’, ‘·’, ‘\n4.’, etc., which we replace with ‘’ or ‘ ’ to clean up the text. Also, ‘\xa0’ (a non-breaking space) often appears in front of an address; it is replaced with ‘ ’, and we then split the text:
# Strip separators, list numbering, newlines, and opening-hours tokens
newtxt = getText('French2.docx').replace('·', '').replace('\n4.', ' ')
newtxt = newtxt.replace('\n', ' ').replace('AM', '').replace('PM', '')
# '\xa0' is a non-breaking space; normalize it, then split on spaces
newtxt = newtxt.replace('\xa0', ' ').split(' ')  #.replace('\u202f', '')
Knowing the pattern, we create a list of pairs: the street address, which starts with a number, and the name of the establishment, which sits about 4 positions earlier in the token list.
arr = []
for j, token in enumerate(newtxt):
    try:
        int(token[0])  # succeeds only if the token starts with a digit
        # a digit-led token is (likely) a street address; the establishment
        # name sits 4 positions earlier in the token list
        arr.append([token, newtxt[j - 4]])
    except (ValueError, IndexError):
        pass
arr
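To spot-check what the heuristic captured before geocoding (the pair layout is [address, name]):
print(len(arr))               # total number of [address, name] pairs
for address, name in arr[:5]:
    print(name, '->', address)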
We end up with 102 entries, with 2 or 3 errors (entries that do not correspond to an address).
We can now come full circle by retrieving our geocoordinates :). With the aim of being cheap and hassle-free, we use the geopy library. Its performance is lower than that of Google Maps, but it requires no API key and has no billing fee. For the purposes of this article, I am using it.
Quick install:
%pip install geopy
I run the following commands. Note that to improve geopy’s hit rate, I append ‘, Manhattan, New York, New York’ to my addresses. I build a dictionary mapping each address to its full resolved address and geocode, so I can check my results.
import geopy
from geopy.geocoders import Nominatim

# Nominatim asks for an identifiable user_agent (an email address works)
geolocator = Nominatim(user_agent='Your_email_can_be_a_user_agent_id')

dictLoc3 = {}
for item in arr:
    try:
        # appending the borough and city helps Nominatim resolve the address
        location = geolocator.geocode(item[0] + ', Manhattan, New York, New York')
        dictLoc3[item[0]] = [location.address, (location.latitude, location.longitude)]
    except AttributeError:
        # geocode() returns None when it cannot resolve the address
        pass
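One caveat worth noting: the public Nominatim server asks for roughly one request per second, so for a list of this size it is safer to wrap the geocoder in geopy’s RateLimiter (a small addition on my part, not something I used in my original run):
from geopy.extra.rate_limiter import RateLimiter

# throttle calls to respect Nominatim's ~1 request/second usage policy
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
location = geocode(arr[0][0] + ', Manhattan, New York, New York')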
The output:
However, not all entries are correct: we get no more than 10 wrong entries, which brings the count of correct results down to about 82 on the lower end (hey, we’re not using Google Maps…).
That is roughly an 80% success rate (82 / 102 ≈ 80%).
Conclusion:
I was impressed that ChatGPT could give me code to import the right libraries, like bs4; however, these suggestions are standard. To web scrape, a data miner (or whatever you call them) would automatically try what was proposed anyway. Writing the chat command presupposes being versed in scraping already. As more data and solutions are integrated into the system, I believe ChatGPT will come up with various ‘automated’ ways to tackle such problems. For now, ChatGPT could not access Google Maps because Google has ways to restrict access. Smaller companies or websites can play the same game, so that ChatGPT is unable to find the information it wants (here, in the case of web scraping). It becomes a battle of technology. In that sense, I do not think ChatGPT will soon replace software engineers, data scientists, etc. It will certainly be an enhancer for them.
If you are interested in Python or iOS Swift bootcamps, kindly let me know.