
GCP: Decide what to do with tables when they are removed from this repository #386

Open
fbertsch opened this issue Aug 15, 2019 · 3 comments

@fbertsch (Contributor)

BQ tables are auto-generated and updated as the schema changes. Once a schema is removed from this repository, its table is dropped. We shouldn't drop data when a schema is removed; instead, we should retain the historical data for however long the retention period lasts (cc @mreid-moz).

Option 1: We keep the table in the same location, accepting the small possibility that a new schema will later be written to that location (we could add automatic checking for this; it would be especially bad if the schemas weren't compatible).

Option 2: We move the data to a historical location, so that we know it is no longer being updated and no new data is flowing in, and a new ping can replace it; however, it remains queryable (for the duration of the retention period).

I'm leaning towards (2), but the downside is that we either need to manually change queries to point to the new location, or repoint views there (and version the views for the new data).
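To make the view-repointing downside of option 2 concrete, here is a minimal sketch of the SQL a deploy might generate to aim a stable view at the archived copy of a table. All names here (the `telemetry_historical` dataset, the project and table names in the usage below) are illustrative assumptions, not existing tooling.

```python
def repointed_view_sql(project, table, historical_dataset="telemetry_historical"):
    """SQL to repoint a user-facing view at the archived copy of a table.

    Assumes (hypothetically) that views live in a `telemetry` dataset and
    archived tables land in a sibling historical dataset.
    """
    return (
        f"CREATE OR REPLACE VIEW `{project}.telemetry.{table}` AS "
        f"SELECT * FROM `{project}.{historical_dataset}.{table}`"
    )

# e.g. repointed_view_sql("moz-fx-data", "appusage_v1") produces a
# CREATE OR REPLACE VIEW statement selecting from
# `moz-fx-data.telemetry_historical.appusage_v1`.
```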

@whd (Member) commented Aug 15, 2019

I'm also vaguely pro (2), as it would be nice to more generally have a concept of "deprecated" or "historical" data. There are many ping types we have collected over the years that we probably don't care about processing (e.g. appusage). In #334 we decided which pings we care about and don't care about for the purposes of schemas. I do like the idea of being able to make an active decision to no longer process a ping type by removing its schemas.

I don't like (1) because it means the state of production and the state of generated-schemas would not be precisely the same. In this case I would prefer that we never drop schemas (hitherto the standard practice). This goes back to developing a notion of "deprecated" data, which doesn't exist for ingestion currently.

@fbertsch (Contributor, Author)

You make a good point about (1) not matching the generated-schemas branch. A tentative plan for (2) could be:

  1. Schema is removed from this repository
  2. Deploy notices the schema is now missing
    a. Copy the data to a historical location (TBD)
    b. Update the view to point to the historical data (related: auto-deployed views; if the view definition is not auto-deployed, manual intervention could be required)
    c. Drop the prod table
    d. Deploy gcp-ingest with no JSON schema

I do believe that makes mozilla/bigquery-etl#291 a dependency here.
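The tentative plan above can be sketched as pure planning logic: diff the schemas currently in the repository against the tables in production, and emit the ordered archival actions for any table whose schema is gone. The function name, action labels, and the `telemetry_historical` dataset are made-up illustrations, not part of any existing deploy machinery.

```python
def plan_removal(repo_schemas, prod_tables, historical_dataset="telemetry_historical"):
    """Return the ordered actions for each prod table whose schema was removed.

    Mirrors steps 2a-2d of the plan above: copy to a historical location,
    repoint the view, drop the prod table, redeploy without the JSON schema.
    """
    removed = sorted(set(prod_tables) - set(repo_schemas))
    actions = []
    for table in removed:
        actions.append(("copy", table, f"{historical_dataset}.{table}"))
        actions.append(("repoint_view", table, f"{historical_dataset}.{table}"))
        actions.append(("drop_prod_table", table))
        actions.append(("deploy_without_json_schema", table))
    return actions

# e.g. plan_removal({"main_v4"}, {"main_v4", "appusage_v1"}) plans the
# four steps for appusage_v1 only, since main_v4 still has a schema.
```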

@jklukas (Contributor) commented Aug 20, 2019

I'm not convinced that we gain much by actually moving the table to a historical location. I'd like to see a way of marking a docType as deprecated, perhaps via metadata in the JSON schema file itself.

Once a schema is marked as deprecated, perhaps we'd want the generated-schemas branch to include the BQ schema, but not include the JSON schema, so that the docType is no longer valid in the pipeline, but we don't remove the BQ table.

Perhaps we should have a deprecatedOn date or toBeRemovedOn date such that the schema generation machinery could automatically drop the BQ schema, and thus cause the table to be deleted, after all data in the table has expired.
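The `toBeRemovedOn` idea could be as simple as a date check in the schema generation machinery: once the removal date has passed (i.e. all retained data has expired), drop the BQ schema. This is a sketch under the assumption that the date lives as an ISO-format string in a metadata block of the JSON schema file; the field name and structure are hypothetical, as proposed above.

```python
import datetime

def should_drop_table(schema_metadata, today=None):
    """True once the hypothetical toBeRemovedOn date has passed.

    schema_metadata is assumed to be a dict parsed from a metadata block
    in the JSON schema file, with an optional ISO-date "toBeRemovedOn".
    """
    today = today or datetime.date.today()
    removal = schema_metadata.get("toBeRemovedOn")
    if removal is None:
        return False  # no removal date set: keep the table indefinitely
    return datetime.date.fromisoformat(removal) <= today
```

Schema generation could then skip emitting the BQ schema for any docType where this returns True, causing the table to be deleted only after the retention period has run out.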

This vaguely seems like the kind of metadata we would want to maintain in GCP's Data Catalog.
