Writing Scrapers

The goal of the can_tools package is to make it easy to build and maintain COVID data scrapers

As noted in Structure, we have built a number of tools in an effort to achieve this goal

This document describes how to build a scraper

We will analyze the code for the NewJerseyVaccineCounty scraper, repeated below for convenience:

from typing import Any

import pandas as pd
import us

from can_tools.scrapers import variables
from can_tools.scrapers.official.base import ArcGIS


class NewJerseyVaccineCounty(ArcGIS):
    ARCGIS_ID = "Z0rixLlManVefxqY"
    has_location = False
    location_type = "county"
    state_fips = int(us.states.lookup("New Jersey").fips)
    source = "https://covid19.nj.gov/#live-updates"
    source_name = "New Jersey Department of Health"
    service: str = "VaxCov2"

    # NOTE: do not delete the `(start|end) variables` comments
    #       they are needed to generate documentation
    # start variables
    variables = {
        "Dose_1": variables.INITIATING_VACCINATIONS_ALL,
        "CompletedVax": variables.FULLY_VACCINATED_ALL,
        "Grand_Total": variables.TOTAL_DOSES_ADMINISTERED_ALL,
    }
    # end variables

    def fetch(self) -> Any:
        return self.get_all_jsons(self.service, 0, 7)

    def normalize(self, data: Any) -> pd.DataFrame:
        non_counties = ["OUT OF STATE", "UNKNOWN", "MISSING", "TOTALS"]
        df = self.arcgis_jsons_to_df(data)
        df = self._rename_or_add_date_and_location(
            df,
            location_name_column="County",
            timezone="US/Eastern",
            location_names_to_drop=non_counties,
        )
        return self._reshape_variables(df, self.variables)

Finding the right subclass

As we’ve scraped data for the past year, we’ve noticed that a few technologies are used to create the majority of COVID data dashboards

For the technologies we have come across often, we have created classes that capture key patterns of interaction that can be reused by specific scrapers

The first step when starting a new scraper is to determine the technology used to create the dashboard, and then find the corresponding subclass

See Dashboard Type subclasses for a discussion of these classes

In our example with the NewJerseyVaccineCounty scraper, we observed that the dashboard was produced using ArcGIS

For that reason, NewJerseyVaccineCounty subclasses ArcGIS

Filling in class level attributes

After determining the type of dashboard you are planning to scrape and subclassing the appropriate parent, the next step is to define some key class-level attributes

The ones that must be defined include:

  • has_location: bool: whether there is a column called location containing FIPS codes (when has_location = True) or a column called location_name containing county names (when has_location = False)

  • location_type: str: The type of geography represented in the data. Typically this will be county because we aim to scrape state-level dashboards that report county-level data

  • state_fips: int: The (at most) two-digit FIPS code for the state as an integer. This should be found using us.states.lookup, as shown in the example above and in the snippet after this list

  • source: str: A URL pointing to the dashboard

  • source_name: str: The name of the entity that maintains or publishes the dashboard
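
As a quick illustration of the state_fips lookup mentioned above, here is what us.states.lookup returns (the .fips attribute is a string, which is why the scraper casts it to int):

import us

nj = us.states.lookup("New Jersey")
print(nj.fips)       # the FIPS code as a string, e.g. "34"
print(int(nj.fips))  # 34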

We also recommend that you define a mapping from column names to instances of CMU as a class-level attribute called variables (see Writing the normalize method below for more details)

There are other class-level attributes that may be required by the dashboard type specific parent class

For example, for the NewJerseyVaccineCounty class’ parent – ArcGIS – the following are required:

  • ARCGIS_ID: str: A string identifying the resource id within the ArcGIS system

  • service: str: A string representing the name of the service containing the data

Writing the fetch method

The first method you will write is fetch

This is responsible for making a network request to fetch a remote resource

It should not handle parsing or cleaning of the response (other than a simple thing like parsing the JSON body of a requests.Response)

The reason for this is that when scrapers are run, we like to track network-request failures separately from parsing or validation failures
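
As a hypothetical sketch (not part of the can_tools API), a hand-rolled fetch for a plain JSON endpoint would do nothing beyond making the request and parsing the body:

from typing import Any

import requests


def fetch(self) -> Any:
    # download the resource and parse the JSON body;
    # all cleaning is deferred to normalize
    res = requests.get(self.source)
    res.raise_for_status()
    return res.json()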

For many dashboard types, the fetch method will be very simple: a call to one or more of the helper methods defined in the dashboard-type-specific parent class

This is the case in our NewJerseyVaccineCounty scraper, where we call out to the get_all_jsons method defined in ArcGIS

    def fetch(self) -> Any:
        return self.get_all_jsons(self.service, 0, 7)

Writing the normalize method

The normalize method is responsible for converting the raw data obtained by fetch into a clean, structured pandas DataFrame

This step is often the most work as it requires interpreting/understanding the data presented on the dashboard, understanding the desired DataFrame structure/schema, and writing the necessary data transformation code (usually a sequence of calls to pandas.DataFrame methods) to map from the raw form into the can-scrapers schema

The output DataFrames must contain the following columns:

Column         Type          Description
-------------  ------------  ----------------------------------------------------
vintage        pd.Timestamp  UTC timestamp for when the scraper runs
dt             pd.Timestamp  Timestamp capturing the date of the observed data
location       int           County FIPS code
location_name  str           County name (used when there is no location column)
category       str           Category for the variable (see below)
measurement    str           Measurement for the variable (see below)
unit           str           Unit for the variable (see below)
age            str           Age group of the demographic group (see below)
race           str           Race group of the demographic group (see below)
ethnicity      str           Ethnicity group of the demographic group (see below)
sex            str           Sex group of the demographic group (see below)
value          int or float  The observed value

The vintage, dt, and location (or location_name, depending on whether the dashboard reports FIPS codes or county names) columns are typically added using the _rename_or_add_date_and_location method (defined on StateDashboard)
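
For concreteness, here is what a single normalized observation might look like (all values are made up for illustration):

import pandas as pd

# one normalized observation -- the values here are illustrative only
row = {
    "vintage": pd.Timestamp.now(tz="UTC"),
    "dt": pd.Timestamp("2021-05-01"),
    "location_name": "Mercer",
    "category": "total_vaccine_completed",
    "measurement": "cumulative",
    "unit": "people",
    "age": "all",
    "race": "all",
    "ethnicity": "all",
    "sex": "all",
    "value": 123456,
}
df = pd.DataFrame([row])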

CovidVariables

The (category, measurement, unit) triplet defines what type of data is being observed.

  • category describes what the variable is. Some examples are total_deaths or total_vaccine_completed, etc.

  • measurement describes how the data is being reported. Some examples are new, cumulative, rolling_7day_average, etc.

  • unit describes the units used for the observation. Some examples are people, percentage, specimens, doses, etc.

The values used for these three columns must be known in our system

Known values are recorded in the file can_tools/bootstrap_data/covid_variables.csv

Most often, you will be trying to scrape variables that we already know about. In this case, you need to go to the csv file mentioned above and find a row that matches what you are looking for
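
For example, one way to search that file with pandas (assuming it has category, measurement, and unit columns corresponding to the triplet described above):

import pandas as pd

# load the known variable definitions and search for vaccine-related rows;
# the column names here are assumed from the (category, measurement, unit) triplet
known = pd.read_csv("can_tools/bootstrap_data/covid_variables.csv")
print(known[known["category"].str.contains("vaccine")])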

CovidDemographics

The (age, race, ethnicity, sex) 4-tuple defines what demographic group is being observed.

  • age describes the age group represented. Examples are all, 0-16, 20-30, 81_plus, etc.

  • race describes the race of the subpopulation represented. Some examples are all, ai_an, asian, black, etc.

  • ethnicity describes the ethnicity of the subpopulation represented. Some examples are all, hispanic, unknown, and non-hispanic

  • sex describes the sex of the subpopulation represented. Possible values are all, male, female, unknown

This 4-tuple describes the demographic group represented in the data

The reported demographic group must be known in our system

Known values are recorded in the file can_tools/bootstrap_data/covid_demographics.csv

Note

When scraping data that does not have a demographic dimension, these columns should each be filled entirely with the string all

CMU

To help fill in the values for the variable dimensions (category, measurement, unit) and the demographic dimensions (age, race, ethnicity, sex), there is a helper class called CMU.

Note

Before we added demographics to our system, we only had the variable dimensions. The name CMU was chosen as an acronym for (category, measurement, unit)

The CMU class is documented below:

class can_tools.scrapers.base.CMU(category='cases', measurement='cumulative', unit='people', age='all', race='all', ethnicity='all', sex='all')

Define variable and demographic dimensions for an observation

Variable dimensions include:

  • category: The ‘type’ of variable. Examples are cases, total_vaccine_completed

  • measurement: The form of measurement, e.g. cumulative, new

  • unit: The unit of measurement, e.g. people, doses

Demographic dimensions include:

  • age: the age group, e.g. 1-10, 40-49, 65_plus

  • race: the race, e.g. white, black

  • ethnicity: the ethnicity, e.g. hispanic, non-hispanic

  • sex: the sex, e.g. male, female, unknown

Note

All demographic dimensions allow a value of all, which is interpreted as the observation corresponding to all groups of that dimension (i.e. if age is all, then the data represent all ages)

For a complete list of admissible variable 3-tuples see the file can_tools/bootstrap_data/covid_variables.csv

For a complete list of admissible demographic 4-tuples see the file can_tools/bootstrap_data/covid_demographics.csv

Typically a scraper will define a class attribute called variables that is a dictionary mapping from a column name in a wide form dataset we receive from the source into an instance of CMU describing variable and demographic dimensions for the data.
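
As a hypothetical sketch (the source column names are invented for illustration), such a mapping might look like:

from can_tools.scrapers.base import CMU

# hypothetical source columns mapped to variable/demographic dimensions
variables = {
    "TotalCases": CMU(category="cases", measurement="cumulative", unit="people"),
    "FullyVaccinatedFemale": CMU(
        category="total_vaccine_completed",
        measurement="cumulative",
        unit="people",
        sex="female",
    ),
}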

A few CMU instances come up in many scrapers, including those for the total number of people with at least one dose, the total number of people fully vaccinated, etc. Instead of repeating these instantiations in every scraper, we have a helper module can_tools.scrapers.variables that contains common definitions as module-level constants

These were used in the NewJerseyVaccineCounty scraper we’ve been working with:

    variables = {
        "Dose_1": variables.INITIATING_VACCINATIONS_ALL,
        "CompletedVax": variables.FULLY_VACCINATED_ALL,
        "Grand_Total": variables.TOTAL_DOSES_ADMINISTERED_ALL,
    }

Helper Methods

Now we have all the pieces we need in order to fill in the necessary columns of a normalized DataFrame

There are a few helper methods on the StateDashboard class (and therefore its subclasses) that we often use: _rename_or_add_date_and_location and _reshape_variables

These are documented in StateDashboard and shown in the NewJerseyVaccineCounty.normalize method below:

    def normalize(self, data: Any) -> pd.DataFrame:
        non_counties = ["OUT OF STATE", "UNKNOWN", "MISSING", "TOTALS"]
        df = self.arcgis_jsons_to_df(data)
        df = self._rename_or_add_date_and_location(
            df,
            location_name_column="County",
            timezone="US/Eastern",
            location_names_to_drop=non_counties,
        )
        return self._reshape_variables(df, self.variables)

Running the scraper locally

After writing the fetch and normalize methods, you can run your scraper

We could do this as follows:

from can_tools.scrapers import NewJerseyVaccineCounty

# create scraper
d = NewJerseyVaccineCounty()

# fetch raw resource
raw = d.fetch()

# normalize the raw resource into a conformable DataFrame
df = d.normalize(raw)

At this point you should have a normalized DataFrame, ready to be injected into the CAN database.
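
As an optional sanity check (a sketch based on the column table above, not a can_tools helper), you can confirm the required columns are present before inserting:

# required columns from the schema described above; location_name is used
# here because NewJerseyVaccineCounty sets has_location = False
required = {
    "vintage", "dt", "location_name", "category", "measurement",
    "unit", "age", "race", "ethnicity", "sex", "value",
}
missing = required - set(df.columns)
assert not missing, f"missing columns: {missing}"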

While running locally, we suggest you create an in-memory sqlite database and attempt to put your data into it

We have helper methods set up for you to do this:

from can_tools.models import create_dev_engine

# create a sqlalchemy engine and session connected to
# in memory sqlite db
engine, Session = create_dev_engine()

# put the DataFrame into the db
d.put(engine, df)

If at this stage you have any problems inserting the data, see the FAQ page

Running the tests for the scraper

There are a few tests that are automatically defined for you

We use the pytest framework

If you wanted to run the full test suite, you could run the pytest command from the can_tools directory

To select a subset of tests to run, use the -k flag for pytest

For example, to run only the tests for our NewJerseyVaccineCounty scraper, you would run

pytest -k NewJerseyVaccineCounty

This would produce output similar to

❯ pytest -k NewJerseyVaccineCounty
============================================ test session starts =============================================
platform linux -- Python 3.9.1, pytest-6.1.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/sglyon/valorum/covid/can-scrapers
plugins: xdist-2.2.1, forked-1.3.0, parallel-0.1.0
collected 370 items / 365 deselected / 5 selected

tests/test_datasets.py .....                                                                           [100%]

============================================== warnings summary ==============================================
../../../anaconda3/envs/can-scrapers/lib/python3.9/site-packages/us/states.py:86: 46 warnings
/home/sglyon/anaconda3/envs/can-scrapers/lib/python3.9/site-packages/us/states.py:86: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
   val = jellyfish.metaphone(val)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=============================== 5 passed, 365 deselected, 46 warnings in 2.16s ===============================