FAQ

This document contains a list of questions we’ve heard developers ask about how the system works, how to write a scraper, or anything else related to our data engineering efforts. Our intention is for this document to be updated frequently and be a living resource of common questions and their answers.

In code snippets below you will see references to a few variables (d, engine, df), these are

  • d: an instance of a scraper

  • engine: a sqlalchemy engine, most often the sqlite based dev engine

  • df: a clean/normalized DataFrame that is the output of the normalize method

Location_id sql error

How to diagnose this problem: when calling d.put(engine, df) you will see an error that looks like this:

IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: covid_observations.location_id

There are two possible cases for handling locations: using a location_name column with state or county name or using a location column with fips codes

location_name column

If you have a location_name column, chances are you have a misspelled county name, a row that isn’t a county (All or Total are common issues)

How to fix this problem: Try the following method: d.find_unknown_location_id(engine, df)

It will return rows of your DataFrame for which we do not recognize the county name

You can compare this list against the list of counties for that state, which you can obtain via:

locs = pd.read_sql("select * from locations", engine)
state_locs = locs.loc[locs["state_fips"] == d.state_fips, :]

Most often, the fix in this situation is to fix spelling/capitalization for a county name (to match what is in state_locs) or delete the offending rows if they are obviously not counties

location column

If instead you have a location column, check to make sure that each row of the location column maps into a known location for that state

You can use the state_locs DataFrame from the code snippet above to see all known locations for the state

variable_id sql error

How to diagnose this problem: when calling d.put(engine, df) you will see an error that looks like this:

IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: covid_observations.variable_id

How to fix this problem: Try the following method: d.find_unknown_location_id(engine, df)

It will return rows of your DataFrame for which we do not recognize the variable (recall that a variable_id is defined by a triplet ("category", "measurement", "unit") – the CMU columns)

The most common fixes for this problem are:

  • Fix spelling on one of the CMU columns

  • Change recorded value of CMU columns to match a value in the file can_tools/bootstrap_data/covid_variables.csv

  • If it is an entirely new type of variable, you may need to add a row to the can_tools/bootstrap_data/covid_variables.csv file and try to .put again

    • If you are adding a brand new value for any of category, measurement, unit you also need to add the correspoinding value to one of can_tools/bootstrap_data/covid_{categories,measurements,units}.csv

demographic_id sql error

TODO