Structure

Each scraper in can-scrapers is a Python class within the can_tools/scrapers Python package

Organization

The scrapers are organized in a few sub-directories of can_tools/scrapers:

  • official/: these contain scrapers for data from official federal, state, or county government websites (including health departments, the CDC, HHS, etc.).
    • Scrapers targeting a state-level dashboard are put in official/XX where XX is the two-letter state abbreviation (for example official/NM/nm_vaccine.py for a scraper collecting vaccine data for counties in the state of New Mexico)

    • Scrapers for a specific county are organized into an official/XX/counties directory. For example, official/XX/counties/la_county_vaccine.py might hold a scraper that scrapes vaccine data from the Los Angeles County dashboard

  • usafacts/: scrapers for the county-level data provided by USAFacts

  • uscensus/: scrapers that obtain demographic data from the US Census
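
Putting this together, the directory layout looks roughly like the sketch below (XX stands for a two-letter state abbreviation, and the files shown are just the examples from this list, not an exhaustive listing):

    can_tools/scrapers/
        base.py
        official/
            base.py
            NM/
                nm_vaccine.py
            XX/
                counties/
                    la_county_vaccine.py
        usafacts/
        uscensus/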

Class Hierarchy

Let’s consider an example scraper and its lineage: the NewJerseyVaccineCounty class found in can_tools/scrapers/official/NJ/nj_vaccine.py

Let A <: B represent the phrase “A is a subclass of B”

Then the following is true about NewJerseyVaccineCounty

NewJerseyVaccineCounty <: ArcGIS <: StateDashboard <: DatasetBase
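
You can check these relationships directly in a REPL (the import paths below follow the file locations given throughout this document):

    from can_tools.scrapers.base import DatasetBase
    from can_tools.scrapers.official.base import ArcGIS, StateDashboard
    from can_tools.scrapers.official.NJ.nj_vaccine import NewJerseyVaccineCounty

    assert issubclass(NewJerseyVaccineCounty, ArcGIS)
    assert issubclass(ArcGIS, StateDashboard)
    assert issubclass(StateDashboard, DatasetBase)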

Each of the parent classes has a specific purpose and adds in functionality

We’ll start at the top of the hierarchy and work our way down

DatasetBase

Each scraper must be a subclass of the core DatasetBase class.

The DatasetBase class is defined in can_tools/scrapers/base.py and does a number of things:

  • Automatically generates a Prefect flow for execution of the scraper in the production pipeline

  • Abstracts away all non-scraper-specific IO. This includes writing out temporary results, storing in cloud buckets, inserting into the database, etc.

  • Performs some common data quality checks (called validation)

  • Defines helper methods for wrangling data, such as extract_CMU

  • Defines a common interface that must be satisfied by all scrapers. These are abstract methods that must be implemented by a subclass (a minimal sketch follows this list) and include:
    • fetch: responsible for doing network operations to collect data

    • normalize: consumes the output of fetch and returns normalized data (see below)

    • put: consumes the output of normalize and stores it in the database
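
To make this interface concrete, here is a minimal sketch of a direct DatasetBase subclass. Everything in it (the class name, URL, and method bodies) is a hypothetical placeholder, and the exact signatures live in can_tools/scrapers/base.py; real scrapers almost always build on the more specific classes described next:

    import pandas as pd
    import requests
    from sqlalchemy.engine import Engine

    from can_tools.scrapers.base import DatasetBase


    class MyScraper(DatasetBase):  # hypothetical example, not a real scraper
        def fetch(self):
            # network operations only: return the raw payload untouched
            return requests.get("https://example.com/data.json").json()

        def normalize(self, data) -> pd.DataFrame:
            # reshape the raw payload into a cleaned DataFrame
            return pd.DataFrame(data)

        def put(self, engine: Engine, df: pd.DataFrame) -> None:
            # store the normalized data in the database; in practice this
            # is inherited from StateDashboard rather than written by hand
            ...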

Most of our scrapers collect data from official government or health department websites. There are common tasks and configuration shared by all scrapers of this type

For this reason, there are other abstract classes that inherit from DatasetBase

These include: StateDashboard, CountyDashboard, FederalDashboard

We’ll talk about these next

StateDashboard

The majority of our scrapers collect data from a state-maintained dashboard

The StateDashboard class (defined in can_tools/scrapers/official/base.py) adds some tools to make getting data from these sources easier:

  • Defines table, provider, and data_type class attributes

  • Methods put and _put_exec: the code needed to push data to the database. Note that this means none of our scraper classes (at the bottom of the hierarchy, like NewJerseyVaccineCounty) need to worry about database interactions

  • Methods _rename_or_add_date_and_location and _reshape_variables: tools for cleaning data (see below)

can_tools.scrapers.official.base.StateDashboard._rename_or_add_date_and_location(
    self,
    data: pandas.core.frame.DataFrame,
    location_name_column: Optional[str] = None,
    location_column: Optional[str] = None,
    location_names_to_drop: Optional[List[str]] = None,
    location_names_to_replace: Optional[Dict[str, str]] = None,
    locations_to_drop: Optional[List[str]] = None,
    date_column: Optional[str] = None,
    date: Optional[pandas._libs.tslibs.timestamps.Timestamp] = None,
    timezone: Optional[str] = None,
    apply_title_case: bool = True,
)

Renames or adds date and location columns.

Parameters
  • data – Input data

  • location_name_column – Name of column with location name

  • location_column – Name of column with location (fips)

  • location_names_to_drop – List of values in location_name_column that should be dropped

  • location_names_to_replace – Dict mapping from old location_name spelling/capitalization to new location_name

  • locations_to_drop – List of values in location_column that should be dropped

  • date_column – Name of column containing date.

  • date – Date for data

  • timezone – Timezone used to determine the date when neither date nor date_column is supplied.

  • apply_title_case – If True will make location name title case.

Returns

Data with date and location columns normalized.

Return type

pandas.DataFrame
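
For example, inside a scraper's normalize method the helper might be called like this (the input column names and values below are made up for illustration):

    import pandas as pd

    df = pd.DataFrame(
        {
            "County": ["Bergen", "Essex"],
            "Report Date": ["2021-05-01", "2021-05-01"],
            "doses": [100, 200],
        }
    )
    # `self` is the scraper instance; this standardizes the date and
    # location name columns
    df = self._rename_or_add_date_and_location(
        df,
        location_name_column="County",
        date_column="Report Date",
    )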

can_tools.scrapers.official.base.StateDashboard._reshape_variables(
    self,
    data: pandas.core.frame.DataFrame,
    variable_map: Dict[str, can_tools.scrapers.base.CMU],
    id_vars: Optional[List[str]] = None,
    **kwargs,
) -> pandas.core.frame.DataFrame

Reshape columns in data into long form, using the variable definitions in variable_map.

Parameters
  • data – Input data

  • variable_map (Dict[str, CMU]) – Map from column name to output variables

  • id_vars (Optional[List[str]], default None) – Variables that should be included as “id_vars” when melting from wide to long

  • kwargs – Other kwargs to pass to self.extract_CMU

Returns

Reshaped DataFrame.

Return type

pandas.DataFrame
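
Continuing the hypothetical example above, a variable_map pairs each source column with a CMU definition, and _reshape_variables melts the result into long form. The CMU field values shown here are illustrative assumptions, not from a real scraper:

    from can_tools.scrapers.base import CMU

    variables = {
        "doses": CMU(
            category="total_vaccine_doses_administered",  # assumed value
            measurement="cumulative",
            unit="doses",
        ),
    }
    out = self._reshape_variables(df, variable_map=variables)
    # `out` now has one row per (location, date, variable) combination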

Note

CountyDashboard and FederalDashboard inherit from StateDashboard and update the provider attribute. These are also defined in can_tools/scrapers/official/base.py

Dashboard Type Subclasses

The next level in the hierarchy is a subclass for a specific type of dashboard technology

In the NewJerseyVaccineCounty example, this was the ArcGIS class

This subclass inherits from StateDashboard (so a scraper for an ArcGIS dashboard only needs to subclass ArcGIS and will get all the goodies from StateDashboard and DatasetBase) and adds in tools specific to interacting with ArcGIS dashboards (a sketch follows the list below)

ArcGIS has some siblings:

  • SODA: interacting with resources that adhere to the SODA standard

  • TableauDashboard: tools for extracting data from Tableau based dashboards

  • MicrosoftBIDashboard: tools for extracting data from Microsoft BI dashboards

  • GoogleDataStudioDashboard: tools for extracting data from Google Data Studio dashboards
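
In practice, a scraper built on one of these classes only fills in configuration and the fetch/normalize pair, leaning on the parents for everything else. A rough sketch follows; the class name is hypothetical, and the ArcGIS-specific fetch helpers are elided because their exact names are not covered here (consult an existing ArcGIS scraper such as NewJerseyVaccineCounty for the real interface):

    import pandas as pd

    from can_tools.scrapers.official.base import ArcGIS


    class MyStateVaccineCounty(ArcGIS):  # hypothetical example
        def fetch(self):
            # query the dashboard's feature service using the ArcGIS helpers
            ...

        def normalize(self, data) -> pd.DataFrame:
            df = pd.DataFrame(data)
            df = self._rename_or_add_date_and_location(
                df, location_name_column="County", timezone="US/Eastern"
            )
            # `variables` maps columns to CMU objects, as in the earlier sketch
            return self._reshape_variables(df, variable_map=variables)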

In general, when you begin a new scraper, the initial steps are:

  1. Determine the technology used to create the dashboard

  2. See if we have a subclass specific to that dashboard type

  3. See examples of existing scrapers that build on that subclass to get a jump start on how to structure your new scraper

Note

The technology-specific parent classes are defined in can_tools/scrapers/official/base.py

Scraper Lifecycle

With all that in mind, we now lay out the lifecycle of a scraper when it runs in production

We will do this by writing the code needed to run the scraper

    # `engine` is a sqlalchemy Engine connected to the target database
    scraper = NewJerseyVaccineCounty()
    raw = scraper.fetch()
    clean = scraper.normalize(raw)
    scraper.put(engine, clean)

The line-by-line description of this code is

  1. Create an instance of the scraper class. We can optionally pass execution_dt as an argument (see the note after this list)

  2. Call the .fetch method to do network requests and get raw data. This method is typically defined directly in the child class

  3. Call the .normalize(raw) method to get a cleaned DataFrame. This method is also typically defined directly in the child class. Implementing the .fetch and .normalize methods is the core of what we mean when we say “write a scraper”

  4. Call .put(engine, clean) to store the data in the database backing the sqlalchemy Engine engine. This is written in StateDashboard and should not need to be overridden in child classes
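
As a note on step 1: to pin the execution timestamp (for example when backfilling), pass it at construction. A minimal sketch, assuming execution_dt accepts a pandas Timestamp:

    import pandas as pd

    scraper = NewJerseyVaccineCounty(execution_dt=pd.Timestamp.utcnow())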