FAQ¶
This document contains a list of questions we’ve heard developers ask about how the system works, how to write a scraper, or anything else related to our data engineering efforts. Our intention is for this document to be updated frequently and be a living resource of common questions and their answers.
In the code snippets below you will see references to a few variables (d, engine, df). These are:
d: an instance of a scraper
engine: a sqlalchemy engine, most often the sqlite-based dev engine
df: a clean/normalized DataFrame that is the output of the normalize method
location_id sql error¶
How to diagnose this problem: when calling d.put(engine, df) you will see an error that looks like this:
IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: covid_observations.location_id
There are two possible cases for handling locations: using a location_name column with the state or county name, or using a location column with FIPS codes.
location_name column¶
If you have a location_name column, chances are you have a misspelled county name or a row that isn’t a county (All or Total are common culprits).
How to fix this problem: Try the following method: d.find_unknown_location_id(engine, df)
It will return rows of your DataFrame for which we do not recognize the county name
You can compare this list against the list of counties for that state, which you can obtain via:
import pandas as pd
# load every known location, then keep only those for the scraper's state
locs = pd.read_sql("select * from locations", engine)
state_locs = locs.loc[locs["state_fips"] == d.state_fips, :]
Most often, the fix in this situation is to correct the spelling/capitalization of a county name (to match what is in state_locs) or to delete the offending rows if they are obviously not counties.
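The comparison described above can also be sketched with plain pandas. This is only an illustration under stated assumptions, not part of can_tools: the two frames are toy stand-ins, and the name column on state_locs is hypothetical, so substitute whichever column of your locations table holds the county name.

```python
import pandas as pd

# Toy stand-ins for the real frames (assumed shapes, for illustration only)
state_locs = pd.DataFrame({"name": ["Adams", "Boulder", "Denver"]})
df = pd.DataFrame(
    {"location_name": ["Adams", "boulder", "Denvr", "Total"], "value": [1, 2, 3, 6]}
)

known = set(state_locs["name"])

# Rows whose location_name does not match any known county exactly
unknown = df.loc[~df["location_name"].isin(known), "location_name"].tolist()

# Drop rows that are obviously not counties, then repair capitalization
cleaned = df.loc[~df["location_name"].isin(["All", "Total"])].copy()
cleaned["location_name"] = cleaned["location_name"].str.title()

# Anything still unmatched is likely a misspelling that needs a manual fix
still_unknown = cleaned.loc[
    ~cleaned["location_name"].isin(known), "location_name"
].tolist()
```

Note that str.title() is a blunt tool: it would turn a name such as McHenry into Mchenry, so check the result against state_locs rather than trusting it blindly.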
location column¶
If instead you have a location column, check to make sure that each row of the location column maps into a known location for that state.
You can use the state_locs DataFrame from the code snippet above to see all known locations for the state.
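One way to run that check is a left anti-join between your data and the known locations. The sketch below uses toy frames, and it assumes the locations table exposes the FIPS code in a column that is also called location; adjust the column name to match the real table.

```python
import pandas as pd

# Toy stand-ins; the `location` column on state_locs is an assumption
state_locs = pd.DataFrame({"location": [8001, 8013, 8031]})
df = pd.DataFrame({"location": [8001, 8013, 8999], "value": [10, 20, 30]})

# Left anti-join: keep rows of df whose location has no match in state_locs
merged = df.merge(state_locs, on="location", how="left", indicator=True)
bad_rows = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
bad_locations = bad_rows["location"].tolist()
```

Before comparing, make sure both columns share a dtype: FIPS codes stored as strings in one frame and as integers in the other will not line up.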
variable_id sql error¶
How to diagnose this problem: when calling d.put(engine, df) you will see an error that looks like this:
IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: covid_observations.variable_id
How to fix this problem: Try the following method: d.find_unknown_variable_id(engine, df)
It will return rows of your DataFrame for which we do not recognize the variable (recall that a variable_id is defined by a triplet ("category", "measurement", "unit"), the CMU columns).
The most common fixes for this problem are:
Fix spelling on one of the CMU columns
Change the recorded value of the CMU columns to match a value in the file can_tools/bootstrap_data/covid_variables.csv
If it is an entirely new type of variable, you may need to add a row to the can_tools/bootstrap_data/covid_variables.csv file and try to .put again
If you are adding a brand new value for any of category, measurement, unit you also need to add the corresponding value to one of can_tools/bootstrap_data/covid_{categories,measurements,units}.csv
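To see which CMU triplets are the problem before editing any CSV, you can compare the distinct triplets in df against the known variables. The sketch below is a hedged illustration with toy data: the variables frame stands in for the contents of covid_variables.csv, and it assumes that file has category, measurement, and unit columns.

```python
import pandas as pd

cmu = ["category", "measurement", "unit"]

# Toy stand-in for the contents of covid_variables.csv (assumed columns)
variables = pd.DataFrame(
    [("cases", "cumulative", "people"), ("deaths", "cumulative", "people")],
    columns=cmu,
)
df = pd.DataFrame(
    [
        ("cases", "cumulative", "people", 100),
        ("deaths", "cumulative", "peopel", 5),
    ],
    columns=cmu + ["value"],
)

# Distinct triplets in df that have no exact match among known variables
triplets = df[cmu].drop_duplicates()
merged = triplets.merge(variables, on=cmu, how="left", indicator=True)
unknown = merged.loc[merged["_merge"] == "left_only", cmu]
unknown_triplets = list(unknown.itertuples(index=False, name=None))
```

Here the misspelled unit shows up as the only unmatched triplet, which points at a spelling fix rather than a new row in the variables file.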
demographic_id sql error¶
TODO