Writing Scrapers

The goal of the can_tools package is to make it easy to build and maintain COVID data scrapers

As noted in Structure, we have built a number of tools in an effort to achieve this goal

This document describes how to build a scraper

We will analyze the code for the NewJerseyVaccineCounty scraper, repeated below for convenience:

from typing import Any

import pandas as pd
import us

from can_tools.scrapers import variables
from can_tools.scrapers.official.base import ArcGIS


class NewJerseyVaccineCounty(ArcGIS):
    ARCGIS_ID = "Z0rixLlManVefxqY"
    has_location = False
    location_type = "county"
    state_fips = int(us.states.lookup("New Jersey").fips)
    source = "https://covid19.nj.gov/#live-updates"
    source_name = "New Jersey Department of Health"
    service: str = "VaxCov2"

    # NOTE: do not delete the `(start|end) variables` comments
    #       they are needed to generate documentation
    # start variables
    variables = {
        "Dose_1": variables.INITIATING_VACCINATIONS_ALL,
        "CompletedVax": variables.FULLY_VACCINATED_ALL,
        "Grand_Total": variables.TOTAL_DOSES_ADMINISTERED_ALL,
    }
    # end variables

    def fetch(self) -> Any:
        return self.get_all_jsons(self.service, 0, 7)

    def normalize(self, data: Any) -> pd.DataFrame:
        non_counties = ["OUT OF STATE", "UNKNOWN", "MISSING", "TOTALS"]
        df = self.arcgis_jsons_to_df(data)
        df = self._rename_or_add_date_and_location(
            df,
            location_name_column="County",
            timezone="US/Eastern",
            location_names_to_drop=non_counties,
        )
        return self._reshape_variables(df, self.variables)

Finding the right subclass

As we’ve scraped data for the past year, we’ve noticed that a few technologies are used to create the majority of COVID data dashboards

For the technologies we have come across often, we have created classes that capture key patterns of interaction that can be reused by specific scrapers

The first step when starting a new scraper is to determine the technology used to create the dashboard, and then find the corresponding subclass

See Dashboard Type subclasses for a discussion of these classes

In our example with the NewJerseyVaccineCounty scraper, we observed that the dashboard was produced using ArcGIS

For that reason, NewJerseyVaccineCounty subclasses ArcGIS

Filling in class level attributes

After determining the type of dashboard you are planning to scrape and subclassing the appropriate parent, the next step is to define some key class-level attributes

The ones that must be defined include:

  • has_location: bool: whether there is a column called location containing FIPS codes (when has_location = True) or a column called location_name containing county names (when has_location = False)

  • location_type: str: The type of geography represented in the data. Typically this will be county because we aim to scrape state-level dashboards that report county-level data

  • state_fips: int: The (at most) two-digit FIPS code for the state as an integer. This should be found using us.states.lookup, as shown in the example above and in the snippet after this list

  • source: str: A URL pointing to the dashboard

  • source_name: str: The name of the entity that maintains or publishes the dashboard
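
As a quick illustration of the state_fips lookup mentioned above, here is what us.states.lookup returns (the .fips attribute is a string, which is why the scraper casts it to int):

import us

nj = us.states.lookup("New Jersey")
print(nj.fips)       # the FIPS code as a string, e.g. "34"
print(int(nj.fips))  # 34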

We also recommend that you define a mapping from column names to instances of CMU as a class-level attribute called variables (see Writing the normalize method below for more details)

There are other class-level attributes that may be required by the dashboard type specific parent class

For example, for the NewJerseyVaccineCounty class’ parent – ArcGIS – the following are required:

  • ARCGIS_ID: str: A string identifying the resource id within the ArcGIS system

  • service: str: A string representing the name of the service containing the data

Writing the fetch method

The first method you will write is fetch

This is responsible for making a network request to fetch a remote resource

It should not handle parsing or cleaning of the response (other than a simple thing like parsing the JSON body of a requests.Response)

The reason for this is that when scrapers are run, we like to track network-request failures separately from parsing or validation failures
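
As a hypothetical sketch (not part of the can_tools API), a hand-rolled fetch for a plain JSON endpoint would do nothing beyond making the request and parsing the body:

from typing import Any

import requests


def fetch(self) -> Any:
    # download the resource and parse the JSON body;
    # all cleaning is deferred to normalize
    res = requests.get(self.source)
    res.raise_for_status()
    return res.json()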

For many dashboard types, the fetch method will be very simple: a call to one or more of the helper methods defined in the dashboard-type-specific parent class

This is the case in our NewJerseyVaccineCounty scraper, where we call out to the get_all_jsons method defined in ArcGIS

    def fetch(self) -> Any:
        return self.get_all_jsons(self.service, 0, 7)

Writing the normalize method

The normalize method is responsible for converting the raw data obtained by fetch into a clean, structured pandas DataFrame

This step is often the most work as it requires interpreting/understanding the data presented on the dashboard, understanding the desired DataFrame structure/schema, and writing the necessary data transformation code (usually a sequence of calls to pandas.DataFrame methods) to map from the raw form into the can-scrapers schema

The output DataFrames must contain the following columns:

Column         Type          Description
-------------  ------------  ----------------------------------------------------
vintage        pd.Timestamp  UTC timestamp for when the scraper runs
dt             pd.Timestamp  Timestamp capturing the date of the observed data
location       int           County FIPS code
location_name  str           County name (used when there is no location column)
category       str           Category for the variable (see below)
measurement    str           Measurement for the variable (see below)
unit           str           Unit for the variable (see below)
age            str           Age group of the demographic group (see below)
race           str           Race group of the demographic group (see below)
ethnicity      str           Ethnicity group of the demographic group (see below)
sex            str           Sex group of the demographic group (see below)
value          int or float  The observed value

The vintage, dt, and location (or location_name, depending on whether the dashboard reports FIPS codes or county names) columns are typically added using the _rename_or_add_date_and_location method (defined on StateDashboard)
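
For concreteness, here is what a single normalized observation might look like (all values are made up for illustration):

import pandas as pd

# one normalized observation -- the values here are illustrative only
row = {
    "vintage": pd.Timestamp.now(tz="UTC"),
    "dt": pd.Timestamp("2021-05-01"),
    "location_name": "Mercer",
    "category": "total_vaccine_completed",
    "measurement": "cumulative",
    "unit": "people",
    "age": "all",
    "race": "all",
    "ethnicity": "all",
    "sex": "all",
    "value": 123456,
}
df = pd.DataFrame([row])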

CovidVariables

The (category, measurement, unit) triplet defines what type of data is being observed.

  • category describes what the variable is. Some examples are total_deaths or total_vaccine_completed, etc.

  • measurement describes how the data is being reported. Some examples are new, cumulative, rolling_7day_average, etc.

  • unit describes the units used for the observation. Some examples are people, percentage, specimens, doses, etc.

The values used for these three columns must be known in our system

Known values are recorded in the file can_tools/bootstrap_data/covid_variables.csv

Most often, you will be trying to scrape variables that we already know about. In this case, you need to go to the csv file mentioned above and find a row that matches what you are looking for
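
For example, one way to search that file with pandas (assuming it has category, measurement, and unit columns corresponding to the triplet described above):

import pandas as pd

# load the known variable definitions and search for vaccine-related rows;
# the column names here are assumed from the (category, measurement, unit) triplet
known = pd.read_csv("can_tools/bootstrap_data/covid_variables.csv")
print(known[known["category"].str.contains("vaccine")])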

CovidDemographics

The (age, race, ethnicity, sex) 4-tuple defines what demographic group is being observed.

  • age describes the age group represented. Examples are all, 0-16, 20-30, 81_plus, etc.

  • race describes the race of the subpopulation represented. Some examples are all, ai_an, asian, black, etc.

  • ethnicity describes the ethnicity of the subpopulation represented. Some examples are all, hispanic, unknown, and non-hispanic

  • sex describes the sex of the subpopulation represented. Possible values are all, male, female, unknown

This 4-tuple describes the demographic group represented in the data

The reported demographic group must be known in our system

Known values are recorded in the file can_tools/bootstrap_data/covid_demographics.csv

Note

When scraping data that does not have a demographic dimension, these columns should each be filled entirely with the string all

CMU

To help fill in the values for the variable dimensions (category, measurement, unit) and the demographic dimensions (age, race, ethnicity, sex), there is a helper class called CMU.

Note

Before we added demographics to our system, we only had the variable dimensions. The name CMU was chosen as an acronym for (category, measurement, unit)

The CMU class is documented below:

class can_tools.scrapers.base.CMU(category='cases', measurement='cumulative', unit='people', age='all', race='all', ethnicity='all', sex='all')

Define variable and demographic dimensions for an observation

Variable dimensions include:

  • category: The ‘type’ of variable. Examples are cases, total_vaccine_completed

  • measurement: The form of measurement, e.g. cumulative, new

  • unit: The unit of measurement, e.g. people, doses

Demographic dimensions include:

  • age: the age group, e.g. 1-10, 40-49, 65_plus

  • race: the race, e.g. white, black

  • ethnicity: the ethnicity, e.g. hispanic, non-hispanic

  • sex: the sex, e.g. male, female, unknown

Note

All demographic dimensions allow a value of all, which is interpreted as the observation corresponding to all groups of that dimension (i.e. if age is all, then the data represent all ages)

For a complete list of admissible variable 3-tuples see the file can_tools/bootstrap_data/covid_variables.csv

For a complete list of admissible demographic 4-tuples see the file can_tools/bootstrap_data/covid_demographics.csv

Typically a scraper will define a class attribute called variables that is a dictionary mapping from a column name in a wide form dataset we receive from the source into an instance of CMU describing variable and demographic dimensions for the data.
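
As a hypothetical sketch (the source column names are invented for illustration), such a mapping might look like:

from can_tools.scrapers.base import CMU

# hypothetical source columns mapped to variable/demographic dimensions
variables = {
    "TotalCases": CMU(category="cases", measurement="cumulative", unit="people"),
    "FullyVaccinatedFemale": CMU(
        category="total_vaccine_completed",
        measurement="cumulative",
        unit="people",
        sex="female",
    ),
}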

A few CMU instances come up in many scrapers, including those for the total number of people with at least one dose, the total number of people fully vaccinated, etc. Instead of repeating these instantiations in every scraper, we have a helper module can_tools.scrapers.variables that contains common definitions as module-level constants

These were used in the NewJerseyVaccineCounty scraper we’ve been working with:

    variables = {
        "Dose_1": variables.INITIATING_VACCINATIONS_ALL,
        "CompletedVax": variables.FULLY_VACCINATED_ALL,
        "Grand_Total": variables.TOTAL_DOSES_ADMINISTERED_ALL,
    }

Helper Methods

Now we have all the pieces we need in order to fill in the necessary columns of a normalized DataFrame

There are a few helper methods on the StateDashboard class (and therefore its subclasses) that we often use: _rename_or_add_date_and_location and _reshape_variables

These are documented in StateDashboard and shown in the NewJerseyVaccineCounty.normalize method below:

    def normalize(self, data: Any) -> pd.DataFrame:
        non_counties = ["OUT OF STATE", "UNKNOWN", "MISSING", "TOTALS"]
        df = self.arcgis_jsons_to_df(data)
        df = self._rename_or_add_date_and_location(
            df,
            location_name_column="County",
            timezone="US/Eastern",
            location_names_to_drop=non_counties,
        )
        return self._reshape_variables(df, self.variables)

Running the scraper locally

After writing the fetch and normalize methods, you can run your scraper

We could do this as follows:

from can_tools.scrapers import NewJerseyVaccineCounty

# create scraper
d = NewJerseyVaccineCounty()

# fetch raw resource
raw = d.fetch()

# normalize the raw resource into a conformable DataFrame
df = d.normalize(raw)

At this point you should have a normalized DataFrame, ready to be injected into the CAN database.
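
As an optional sanity check (a sketch based on the column table above, not a can_tools helper), you can confirm the required columns are present before inserting:

# required columns from the schema described above; location_name is used
# here because NewJerseyVaccineCounty sets has_location = False
required = {
    "vintage", "dt", "location_name", "category", "measurement",
    "unit", "age", "race", "ethnicity", "sex", "value",
}
missing = required - set(df.columns)
assert not missing, f"missing columns: {missing}"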

While running locally, we suggest you create an in-memory sqlite database and attempt to put your data into it

We have helper methods set up for you to do this:

from can_tools.models import create_dev_engine

# create a sqlalchemy engine and session connected to
# in memory sqlite db
engine, Session = create_dev_engine()

# put the DataFrame into the db
d.put(engine, df)

If at this stage you have any problems inserting the data, see the FAQ page

Running the tests for the scraper

There are a few tests that are automatically defined for you

We use the pytest framework

If you wanted to run the full test suite, you could run the pytest command from the can_tools directory

To select a subset of tests to run, use the -k flag for pytest

For example, to run only the tests for our NewJerseyVaccineCounty scraper, you would run

pytest -k NewJerseyVaccineCounty

This would produce output similar to

❯ pytest -k NewJerseyVaccineCounty
============================================ test session starts =============================================
platform linux -- Python 3.9.1, pytest-6.1.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/sglyon/valorum/covid/can-scrapers
plugins: xdist-2.2.1, forked-1.3.0, parallel-0.1.0
collected 370 items / 365 deselected / 5 selected

tests/test_datasets.py .....                                                                           [100%]

============================================== warnings summary ==============================================
../../../anaconda3/envs/can-scrapers/lib/python3.9/site-packages/us/states.py:86: 46 warnings
/home/sglyon/anaconda3/envs/can-scrapers/lib/python3.9/site-packages/us/states.py:86: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
   val = jellyfish.metaphone(val)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=============================== 5 passed, 365 deselected, 46 warnings in 2.16s ===============================