Structure¶
Each scraper in can-scrapers is a Python class within the `can_tools/scrapers` Python package.
Organization¶
The scrapers are organized into a few sub-directories of `can_tools/scrapers`:

- `official/`: scrapers for data from official federal, state, or county government websites (including health departments, the CDC, HHS, etc.)
  - Scrapers targeting a state-level dashboard are put in `official/XX`, where `XX` is the two-letter state abbreviation (for example, `official/NM/nm_vaccine.py` for a scraper collecting vaccine data for counties in the state of New Mexico)
  - Scrapers for a specific county are organized into the `official/XX/counties` directory. For example, `official/XX/counties/la_county_vaccine.py` might hold a scraper that collects vaccine data from the Los Angeles County dashboard
- `usafacts/`: scrapers for the county-level data provided by USAFacts
- `uscensus/`: scrapers that obtain demographic data from the US Census
Class Hierarchy¶
Let’s consider an example scraper and its lineage: the `NewJerseyVaccineCounty` class found in `can_tools/scrapers/official/NJ/nj_vaccine.py`.

Let `A <: B` represent the phrase “A is a subclass of B”. Then the following is true about `NewJerseyVaccineCounty`:

`NewJerseyVaccineCounty <: ArcGIS <: StateDashboard <: DatasetBase`
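This subclass chain can be checked directly in Python; the import paths below follow the file locations mentioned in this section:

```python
from can_tools.scrapers.base import DatasetBase
from can_tools.scrapers.official.base import ArcGIS, StateDashboard
from can_tools.scrapers.official.NJ.nj_vaccine import NewJerseyVaccineCounty

# Each assertion mirrors one link in the A <: B chain above
assert issubclass(NewJerseyVaccineCounty, ArcGIS)
assert issubclass(ArcGIS, StateDashboard)
assert issubclass(StateDashboard, DatasetBase)
```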
Each of the parent classes has a specific purpose and adds functionality. We’ll start at the top of the hierarchy and work our way down.
DatasetBase¶
Each scraper must be a subclass of the core `DatasetBase` class. The `DatasetBase` class is defined in `can_tools/scrapers/base.py` and does a number of things:

- Automatically generates a prefect flow for execution of the scraper in the production pipeline
- Abstracts away all non-scraper-specific IO. This includes writing out temporary results, storing in cloud buckets, inserting into the database, etc.
- Performs some common data quality checks (called validation)
- Defines helper methods for wrangling data, including `extract_CMU`
- Defines a common interface that must be satisfied by all scrapers. These are abstract methods that must be implemented by a subclass (see the sketch after this list):
  - `fetch`: responsible for doing network operations to collect data
  - `normalize`: consumes the output of `fetch` and returns normalized data (see below)
  - `put`: consumes the output of `normalize` and stores it in the database
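To make this interface concrete, here is a minimal sketch of what a `fetch`/`normalize` pair looks like. This class is not taken from the codebase; the name, URL, and column names are hypothetical, and a real scraper would subclass one of the dashboard classes described below rather than stand alone:

```python
import pandas as pd
import requests


class ExampleScraper:
    """Hypothetical scraper illustrating the fetch/normalize interface."""

    source = "https://example.com/covid-dashboard"  # made-up URL

    def fetch(self) -> dict:
        # Network operations only: grab the raw payload and return it untouched
        return requests.get("https://example.com/api/vaccines.json").json()

    def normalize(self, data: dict) -> pd.DataFrame:
        # Pure transformation: raw payload in, tidy DataFrame out (no network)
        df = pd.DataFrame(data["rows"])
        return df.rename(columns={"county": "location_name", "doses": "value"})
```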
Most of our scrapers target official government or health department websites. There are common tasks and configuration shared by all scrapers of this type. For this reason, there are other abstract classes that inherit from `DatasetBase`. These include `StateDashboard`, `CountyDashboard`, and `FederalDashboard`. We’ll talk about these next.
StateDashboard¶
The majority of our scrapers collect data from a state-maintained dashboard. The `StateDashboard` class (defined in `can_tools/scrapers/official/base.py`) adds some tools to make getting data from these sources easier:

- Defines `table`, `provider`, and `data_type` class attributes
- Methods `put` and `_put_exec`: the code needed to push data to the database. Note that this means none of our scraper classes at the bottom of the hierarchy (like `NewJerseyVaccineCounty`) need to worry about database interactions
- Methods `_rename_or_add_date_and_location` and `_reshape_variables`: tools for cleaning data (see below)
- `can_tools.scrapers.official.base.StateDashboard._rename_or_add_date_and_location(self, data: pandas.core.frame.DataFrame, location_name_column: Optional[str] = None, location_column: Optional[str] = None, location_names_to_drop: Optional[List[str]] = None, location_names_to_replace: Optional[Dict[str, str]] = None, locations_to_drop: Optional[List[str]] = None, date_column: Optional[str] = None, date: Optional[pandas._libs.tslibs.timestamps.Timestamp] = None, timezone: Optional[str] = None, apply_title_case: bool = True)`¶

  Renames or adds date and location columns.

  - Parameters
    - `data` – Input data
    - `location_name_column` – Name of column with location name
    - `location_column` – Name of column with location (FIPS code)
    - `location_names_to_drop` – List of values in `location_name_column` that should be dropped
    - `location_names_to_replace` – Dict mapping from old location_name spelling/capitalization to new location_name
    - `locations_to_drop` – List of values in `location_column` that should be dropped
    - `date_column` – Name of column containing the date
    - `date` – Date for the data
    - `timezone` – Timezone of the data, used when neither `date` nor `date_column` is supplied
    - `apply_title_case` – If True, location names are converted to title case
  - Returns
    - Data with date and location columns normalized
  - Return type
    - `pandas.DataFrame`
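As an illustration, a scraper's `normalize` method might call this helper as follows. The column name, dropped row, and timezone here are hypothetical:

```python
import pandas as pd

# Inside a StateDashboard subclass:
def normalize(self, data) -> pd.DataFrame:
    # `data` is a hypothetical wide DataFrame with a "County" column and no date column
    return self._rename_or_add_date_and_location(
        data,
        location_name_column="County",         # column holding the county name
        location_names_to_drop=["Statewide"],  # drop the non-county summary row
        timezone="US/Eastern",                 # no date column, so stamp the scrape date
    )
```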
- `can_tools.scrapers.official.base.StateDashboard._reshape_variables(self, data: pandas.core.frame.DataFrame, variable_map: Dict[str, can_tools.scrapers.base.CMU], id_vars: Optional[List[str]] = None, **kwargs) → pandas.core.frame.DataFrame`¶

  Reshapes columns in `data` into the long-form definitions given in `variable_map`.

  - Parameters
    - `data` – Input data
    - `variable_map` (`Dict[str, CMU]`) – Map from column name to output variables
    - `id_vars` (`Optional[List[str]]`, default `None`) – Variables that should be included as “id_vars” when melting from wide to long
    - `kwargs` – Other kwargs to pass to `self.extract_CMU`
  - Returns
    - Reshaped DataFrame
  - Return type
    - `pandas.DataFrame`
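Continuing the hypothetical example, `_reshape_variables` is typically the final step of `normalize`. The column name and `CMU` values below are illustrative, not taken from a real scraper:

```python
import pandas as pd

from can_tools.scrapers.base import CMU

# Inside a StateDashboard subclass:
def normalize(self, data) -> pd.DataFrame:
    df = self._rename_or_add_date_and_location(
        data, location_name_column="County", timezone="US/Eastern"
    )
    variable_map = {
        # wide column name -> CMU describing the output variable
        "Doses": CMU(
            category="total_vaccine_initiated",  # illustrative category
            measurement="cumulative",
            unit="people",
        ),
    }
    return self._reshape_variables(df, variable_map=variable_map)
```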
Note

`CountyDashboard` and `FederalDashboard` inherit from `StateDashboard` and update the `provider` attribute. These are also defined in `can_tools/scrapers/official/base.py`.
Dashboard Type subclasses¶
The next level in the hierarchy is a subclass for a specific type of dashboard technology. In the `NewJerseyVaccineCounty` example, this was the `ArcGIS` class. This subclass inherits from `StateDashboard` (so a scraper for an ArcGIS dashboard only needs to subclass `ArcGIS` and will get all the goodies from `StateDashboard` and `DatasetBase`) and adds tools specific to interacting with ArcGIS dashboards.

`ArcGIS` has some siblings:
- `SODA`: tools for interacting with resources that adhere to the SODA standard
- `TableauDashboard`: tools for extracting data from Tableau-based dashboards
- `MicrosoftBIDashboard`: tools for extracting data from Microsoft BI dashboards
- `GoogleDataStudioDashboard`: tools for extracting data from Google Data Studio dashboards
In general, when you begin a new scraper, the initial steps are:

1. Determine the technology used to create the dashboard
2. See if we have a subclass specific to that dashboard type
3. Look at examples of existing scrapers that build on that subclass to get a jump start on how to structure your new scraper (see the skeleton after the note below)
Note

The technology-specific parent classes are defined in `can_tools/scrapers/official/base.py`.
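For example, a new scraper for an ArcGIS-backed state dashboard might start from a skeleton like the one below. The class name, attribute values, and URL are hypothetical; copy the structure from an existing scraper rather than from this sketch:

```python
import pandas as pd

from can_tools.scrapers.official.base import ArcGIS


class MyStateVaccineCounty(ArcGIS):
    """Hypothetical county-level vaccine scraper for an ArcGIS dashboard."""

    has_location = False      # assumption: dashboard reports county names, not FIPS codes
    location_type = "county"
    source = "https://example.com/arcgis-dashboard"  # made-up URL

    def fetch(self):
        # Pull raw JSON from the ArcGIS service here (helpers vary by dashboard)
        raise NotImplementedError

    def normalize(self, data) -> pd.DataFrame:
        # Clean and reshape using the StateDashboard helpers shown above
        raise NotImplementedError
```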
Scraper Lifecycle¶
With all that in mind, we now lay out the lifecycle of a scraper when it runs in production. We will do this by writing the code needed to run the scraper:

```python
scraper = NewJerseyVaccineCounty()
raw = scraper.fetch()
clean = scraper.normalize(raw)
scraper.put(engine, clean)
```
The line-by-line description of this code is:

1. Create an instance of the scraper class. We can optionally pass `execution_dt` as an argument.
2. Call the `.fetch` method to do network requests and get `raw` data. This method is typically defined directly in the child class.
3. Call the `.normalize(raw)` method to get a cleaned DataFrame. This method is also typically defined directly in the child class. Implementing the `.fetch` and `.normalize` methods is the core of what we mean when we say “write a scraper”.
4. Call `.put(engine, clean)` to store the data in the database backing the sqlalchemy Engine `engine`. This is written in `StateDashboard` and should not need to be overridden in child classes.
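Putting this together, a local test run might look like the following sketch. The SQLite URL is for illustration only (production points at the project's configured database), and we assume `NewJerseyVaccineCounty` is importable from `can_tools.scrapers`:

```python
from sqlalchemy import create_engine

from can_tools.scrapers import NewJerseyVaccineCounty

# Throwaway local database for experimentation; production uses the real DB
engine = create_engine("sqlite:///test.db")

scraper = NewJerseyVaccineCounty()
raw = scraper.fetch()
clean = scraper.normalize(raw)
scraper.put(engine, clean)
```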