Web Scraping with Python: Quick Tips and an Example (the NY Bits Website)
We will scrape the web for relevant information, using NY Bits, a website that lists New York rental properties, as our example. We will use Python and its libraries to do so.
We will obtain the names of the property managers in NYC and a link to a descriptive page for each.
On the NY Bits website, the list of property managers whose names start with 'A' looks like this:
The URL: https://www.nybits.com/managers/a_letter_managers.html
Highlighted in blue are the names with embedded links that interest us. We right-click one of them and choose Inspect.
We see the <li> tags and the <a href> elements where the links and names 'reside'. We will use Python to retrieve this information.
import pandas as pd # out of habit, pandas is always imported
import requests # requests lets us make HTTP requests to URLs
from bs4 import BeautifulSoup # Beautiful Soup parses the HTML
The last library is Beautiful Soup:
“Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.” — The internet — Someone :)
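As a quick illustration of what Beautiful Soup does, here is a minimal, self-contained sketch that parses a small HTML fragment shaped like the NY Bits listing (the fragment itself is invented for the example):

```python
from bs4 import BeautifulSoup

# A tiny HTML fragment mimicking the structure of the listing page
html = """
<ul>
  <li><a href="/manager/abc-management">ABC Management</a></li>
  <li><a href="/manager/acme-realty">Acme Realty</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
for li in soup.find_all('li'):
    link = li.find('a')
    print(link.get_text(), '->', link['href'])
```

Each `<li>` yields the manager's name (the tag text) and the link (the `href` attribute), which is exactly the pattern we will exploit on the real page.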
Our URL for the listing starting with "B" is: https://www.nybits.com/managers/b_letter_managers.html => notice the difference?
We will ask Python to generate the links: by swapping the letter in the URL, we request the complete listing of property managers starting with each letter:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
for letter in letters:
    website_url = requests.get('https://www.nybits.com/managers/' + letter + '_letter_managers.html')
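As an aside, the letter list need not be typed out by hand; the standard library can generate it:

```python
import string

# Equivalent to the hand-written list of 26 letters
letters = list(string.ascii_lowercase)
print(letters[:3])  # ['a', 'b', 'c']
```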
We pull the HTML data. A good hint is to look for all <li> tags (we can print the results; try, for example, .find('a')):
soup = BeautifulSoup(website_url.content, 'html.parser')
My_el_links = soup.find_all('li')
- For every element retrieved, we find the <a> tag and take its text:
el.find('a').get_text()
- This text is the key of dictioMan_comp:
dictioMan_comp[el.find('a').get_text()] = "something"
- The attributes of the <a> tag form a dictionary; the value for the key 'href' is the link itself:
el.find('a')['href']
So we do:
dictioMan_comp[el.find('a').get_text()] = el.find('a')['href']
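Not every <li> on a page necessarily contains a link, so it is worth guarding against find('a') returning None. A small self-contained sketch of the dictionary-building step (the HTML fragment is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<li><a href="/manager/abc">ABC Management</a></li>
<li>No link here</li>
"""

dictio = {}
for el in BeautifulSoup(html, 'html.parser').find_all('li'):
    link = el.find('a')
    if link is not None:  # skip <li> items without an <a> tag
        dictio[link.get_text()] = link['href']

print(dictio)  # {'ABC Management': '/manager/abc'}
```

Without the None check, the loop would raise an AttributeError on the first list item that has no embedded link.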
The entire code looks like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

dictioMan_comp = {}
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

for letter in letters:
    website_url = requests.get('https://www.nybits.com/managers/' + letter + '_letter_managers.html')
    soup = BeautifulSoup(website_url.content, 'html.parser')
    My_el_links = soup.find_all('li')
    for el in My_el_links:
        link = el.find('a')
        if link is not None:  # skip <li> items that contain no link
            dictioMan_comp[link.get_text()] = link['href']

dictioMan_comp
The output (in a Jupyter notebook) is a dictionary mapping each manager's name to the link of its descriptive page.
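Since pandas was imported earlier, the resulting dictionary can be turned into a DataFrame for easier inspection or export (the sample data below is invented; after scraping, dictioMan_comp would hold the real names and links):

```python
import pandas as pd

# Sample of what dictioMan_comp might contain after scraping
dictioMan_comp = {
    'ABC Management': '/manager/abc-management',
    'Acme Realty': '/manager/acme-realty',
}

df = pd.DataFrame(list(dictioMan_comp.items()), columns=['manager', 'link'])
print(df)
```

From here, `df.to_csv('managers.csv', index=False)` would save the results to disk.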
github: chrisb5a
This is it!