Im currently working on a website that helps residents in my borough avoid parking tickets. I wanted to scrape roads information data and save it into a Django model which would later be used in an automated task.
For web scraping the primary weapon in Python is BeautifulSoup. Another excellent library is pandas.
My goal is to scrape data from the following page:
London borough of Camden parking suspensions
This page has a bunch of parking zones and if you click on their links they have a number of roads associated. I would like to store the zone and roads into the following Django model that I will use to check for parking bay suspensions using an automated task later.
Django Model:
class Road(models.Model):
road = models.CharField(blank=False, max_length=120)
zone_code = models.CharField(blank=True, max_length=5)
zone = models.CharField(blank=True, max_length=30)
borough = models.CharField(blank=True, max_length=30)
suspension_href = models.TextField(blank=True, null=True)
created = models.DateTimeField(auto_now_add=True)
updated = models.DateTimeField(auto_now=True)
class Meta:
ordering = ['-updated', '-created']
unique_together = ['road', 'borough', 'zone']
def get_absolute_url(self): # get_absolute_url
return reverse('roads:detail', kwargs={'pk': self.pk})
Pandas has a function called read_html which pulls in all HTML tables into a list of DataFrame objects which is pretty slick but it doesn't pull in data within the anchor tags. In my case within the anchor tags there is an href that provides suspended bays for the zone that I need to perform my automated task looking for new suspensions. For this we will use BeautifulSoup. To be fair this could all be achieved just using BeautifulSoup but I like Pandas efficiency for this to get the data into a DataFrame making it easier to work with.
My preference is to run these scripts in a Jupyter Notebook. With Django-Extensions installed in my virtualenv, I run ./manage.py shell_plus --notebook to launch the notebook.
#imports
from pcndodger.suspensions.models import Road
from bs4 import BeautifulSoup
import pandas as pd
import requests
url='http://camden.gov.uk/ccm/navigation/transport-and-streets/parking/parking-bay-suspensions/search-for-parking-bay-suspensions/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
table = soup.find('table')
rows = table.find_all('tr')
pd.set_option('display.max_colwidth', -1)
df = pd.read_html(str(table))[0]
df['href'] = [tag['href'] for tag in table.find_all('a')]
#Delete objects in Django
Road.objects.all().delete()
for zone_index, zone_row in df.iterrows():
print(zone_row[0],zone_row[1],zone_row[2])
url=zone_row[2]
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
table = soup.find('table')
df2 = pd.read_html(str(table),header=0)[0]
df2['href'] = [tag['href'] for tag in table.find_all('a')]
road_list = [[zone_row[0],zone_row[1],road_row[0],'Camden','http://registers.camden.gov.uk/SuspendedBays/'+road_row['href'].replace(' ','%20')]
for road_index,road_row in df2.iterrows() if pd.notnull(road_row[1]) == False ]
road_df = pd.DataFrame(road_list,columns=['zone_code','zone','road','borough','suspension_href'])
Roads = road_df.to_dict('roads')
model_instances = [Road(
road=record['road'],
zone_code=record['zone_code'],
borough=record['borough'],
zone=record['zone'],
suspension_href=record['suspension_href'],
) for record in Roads]
Road.objects.bulk_create(model_instances)