Deleting Data¶
From time to time some incorrect data makes it into the database
This eventually ends up in the parquet file and then on the CAN website and into the CAN API
We have a system for deleting data
The system is documented on this page
SQL in Production¶
Right now our strategy for deleting bad data is to write SQL against our production database
To make this process easier and less scary, we have written three SQL functions that should be used to handle deleting records
select_scraped_data¶
The first SQL function is called select_scraped_data and can be called as follows:
select * from select_scraped_data(...args)
The ...args is a placeholder for optional arguments that can be used to filter/narrow down the returned results.
Each optional argument should be given in format name => value.
The optional argument are:
_limit(int default = 5000): The maximum number of rows to be returned by the query_provider(textdefault = ‘state’): The type of provider. Should one of the items incan_tools/bootstrap_data/covid_providers.csv_variable_regex(TEXT default = ‘.+’): A regular expression for filtering on the name of the variable (applied to thecategorycolumn fromcan_tools/bootstrap_data/covid_variables.csv)_state_fips(int default = NULL): An integer representing the state fips code_location(int default = NULL): A single fips code – if this is given_state_fipsis ignored_start_date(date default = NULL): A starting date for data_end_date(date default = NULL): An ending date for data_location_type_regex(text default = ‘.+’): A location type. One ofnation,state, orcounty_unit_regex(text default = ‘.+’): A regular expression used to filter on theunitcolumn (applied to theunitcolumn fromcan_tools/bootstrap_data/covid_variables.csv)_measurement_regex(text default = ‘.+’): A regular expression used to filter on themeasurementcolumn (applied to themeasurementcolumn fromcan_tools/bootstrap_data/covid_variables.csv)
An example of how this can be used is given below:
SELECT
*
FROM
select_scraped_data (_provider => 'state',
_variable_regex => '.*vaccine.*',
_start_date => '2021-05-20',
_end_date => '2021-05-24',
_unit_regex => 'doses',
_location_type_regex => 'county',
_measurement_regex => 'cumulative',
_state_fips => 12);
This would return data between May 20, 2021 and May 24, 2021 (inclusive) for counties in Florida where the “vaccine” is part of the variable name and unit and measurement are cumulative doses. This data would also have been scraped from the Florida state dashboard
delete_scraped_data¶
The second helper function is called delete_scraped_data
It takes all the same optional arguments as select_scraped_data
when this routine is called two things happen:
All rows that would be returned by the equivalent call to
select_scraped_dataare updated so that thedeletedcolumn is set toTRUEAND thedelete_batch_idis set to a constant integer representing this call to thedelete_scraped_datafunctionTwo integers are returned representing first the number of rows changed and second the
delete_batch_id
Only rows where deleted = FALSE makes it into the api.covid_us table, and therefore into the parquet file
The purpose of the delete_batch_id column is to make it easier to undo an incorrect change.
After using select_scraped_data to identify rows that should be deleted, simply change the function name to delete_scraped_data to do the deletion
Continuing the example from above, to delete that Florida data we would execute
SELECT
*
FROM
delete_scraped_data(_provider => 'state',
_variable_regex => '.*vaccine.*',
_start_date => '2021-05-20',
_end_date => '2021-05-24',
_unit_regex => 'doses',
_location_type_regex => 'county',
_measurement_regex => 'cumulative',
_state_fips => 12);
undelete_by_delete_batch_id¶
The final helper function we have written makes it easy to undo a delete operation
undelete_by_delete_batch_id accepts a single argument, an integer for the delete_batch_id
When this function is called, all rows of data will have deleted set to FALSE and the delete_batch_id will be set back to NULL
Suppose we were given a delete_batch_id of 42 when calling the delete_scraped_data in our Florida example above…
To undo this deletion we need to call
SELECT
*
FROM
undelete_by_delete_batch_id(42);
This will return a single row and column letting us know how many rows were updated
Timeseries scrapers¶
Note that when a scraper returns a timeseries, some deleted data may have the value and last_updated columns update
When this happens we will also mark deleted as FALSE and set delete_batch_id to NULL
In effect, we make the assumption that if a scraper is trying to insert a data point, it should not begin as deleted