Deleting Data¶
From time to time some incorrect data makes it into the database
This eventually ends up in the parquet file and then on the CAN website and into the CAN API
We have a system for deleting data
The system is documented on this page
SQL in Production¶
Right now our strategy for deleting bad data is to write SQL against our production database
To make this process easier and less scary, we have written three SQL functions that should be used to handle deleting records
select_scraped_data¶
The first SQL function is called select_scraped_data and can be called as follows:
select * from select_scraped_data(...args)
The ...args is a placeholder for optional arguments that can be used to filter/narrow down the returned results.
Each optional argument should be given in format name => value.
The optional argument are:
- _limit(int default = 5000): The maximum number of rows to be returned by the query
- _provider(textdefault = ‘state’): The type of provider. Should one of the items in- can_tools/bootstrap_data/covid_providers.csv
- _variable_regex(TEXT default = ‘.+’): A regular expression for filtering on the name of the variable (applied to the- categorycolumn from- can_tools/bootstrap_data/covid_variables.csv)
- _state_fips(int default = NULL): An integer representing the state fips code
- _location(int default = NULL): A single fips code – if this is given- _state_fipsis ignored
- _start_date(date default = NULL): A starting date for data
- _end_date(date default = NULL): An ending date for data
- _location_type_regex(text default = ‘.+’): A location type. One of- nation,- state, or- county
- _unit_regex(text default = ‘.+’): A regular expression used to filter on the- unitcolumn (applied to the- unitcolumn from- can_tools/bootstrap_data/covid_variables.csv)
- _measurement_regex(text default = ‘.+’): A regular expression used to filter on the- measurementcolumn (applied to the- measurementcolumn from- can_tools/bootstrap_data/covid_variables.csv)
An example of how this can be used is given below:
SELECT
    *
FROM
    select_scraped_data (_provider => 'state',
        _variable_regex => '.*vaccine.*',
        _start_date => '2021-05-20',
        _end_date => '2021-05-24',
        _unit_regex => 'doses',
        _location_type_regex => 'county',
        _measurement_regex => 'cumulative',
        _state_fips_ => 12);
This would return data between May 20, 2021 and May 24, 2021 (inclusive) for counties in Florida where the “vaccine” is part of the variable name and unit and measurement are cumulative doses. This data would also have been scraped from the Florida state dashboard
delete_scraped_data¶
The second helper function is called delete_scraped_data
It takes all the same optional arguments as select_scraped_data
when this routine is called two things happen:
- All rows that would be returned by the equivalent call to - select_scraped_dataare updated so that the- deletedcolumn is set to- TRUEAND the- delete_batch_idis set to a constant integer representing this call to the- delete_scraped_datafunction
- Two integers are returned representing first the number of rows changed and second the - delete_batch_id
Only rows where deleted = FALSE makes it into the api.covid_us table, and therefore into the parquet file
The purpose of the delete_batch_id column is to make it easier to undo an incorrect change.
After using select_scraped_data to identify rows that should be deleted, simply change the function name to delete_scraped_data to do the deletion
Continuing the example from above, to delete that Florida data we would execute
SELECT
    *
FROM
    delete_scraped_data(_provider => 'state',
        _variable_regex => '.*vaccine.*',
        _start_date => '2021-05-20',
        _end_date => '2021-05-24',
        _unit_regex => 'doses',
        _location_type_regex => 'county',
        _measurement_regex => 'cumulative',
        _state_fips_ => 12);
undelete_by_delete_batch_id¶
The final helper function we have written makes it easy to undo a delete operation
undelete_by_delete_batch_id accepts a single argument, an integer for the delete_batch_id
When this function is called, all rows of data will have deleted set to FALSE and the delete_batch_id will be set back to NULL
Suppose we were given a delete_batch_id of 42 when calling the delete_scraped_data in our Florida example above…
To undo this deletion we need to call
SELECT
    *
FROM
    undelete_by_delete_batch_id(42);
This will return a single row and column letting us know how many rows were updated
Timeseries scrapers¶
Note that when a scraper returns a timeseries, some deleted data may have the value and last_updated columns update
When this happens we will also mark deleted as FALSE and set delete_batch_id to NULL
In effect, we make the assumption that if a scraper is trying to insert a data point, it should not begin as deleted