Improvements to Dataform Query Testing #1777
bmagyarkuti started this conversation in Ideas
I propose the following extensions to Dataform’s query testing framework. We already have implementations and use them on several projects at my company. I’d like to hear the package developers’ opinion about whether they think any of these features might be a viable candidate for upstreaming.
Convenience Features
These help developers write better tests or be quicker as they work on their local machine. The same tests would be possible without them.
.only, .skip
When there are test cases marked with .only, no other test cases are executed. When any test cases are marked with .skip, those test cases are not executed. Current workarounds involve commenting out code, as well as deleting or renaming files.
For JS, I propose an interface that is like that used by BDD frameworks such as Mocha:
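As a rough illustration of the semantics described above (Mocha exposes them as it.only and it.skip), the selection rule can be modeled in plain JavaScript. This is a sketch of the behavior only, not Dataform's API and not the implementation mentioned in this post; the function name selectTestsToRun and the mode field are hypothetical.

```javascript
// Hypothetical model of the selection rule: each test case carries an
// optional mode of "only" or "skip". This is not Dataform's API.
function selectTestsToRun(testCases) {
  // Skipped cases never run.
  const notSkipped = testCases.filter((t) => t.mode !== "skip");
  const onlyCases = notSkipped.filter((t) => t.mode === "only");
  // If any case is marked "only", no other case runs.
  return onlyCases.length > 0 ? onlyCases : notSkipped;
}

// Sample test cases invented for this sketch.
const cases = [
  { name: "computes gpa", mode: "only" },
  { name: "handles missing grades" },
  { name: "parses birthday", mode: "skip" },
];

const selected = selectTestsToRun(cases).map((t) => t.name);
console.log(selected);
```

With no case marked "only", the same function returns every case that is not marked "skip".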
The interface for .skip is similar.
.columns
Helps write expectations that focus on just those columns that are relevant for the test case.
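To illustrate the idea, here is a minimal sketch in plain JavaScript that compares actual and expected rows only on a chosen set of columns. The helper names (pickColumns, rowsMatchOnColumns) and the sample rows are hypothetical and are not part of Dataform.

```javascript
// Hypothetical helpers, not Dataform API: compare rows on selected columns only.
function pickColumns(row, columns) {
  // Build a new object holding just the requested columns, in a fixed order.
  return Object.fromEntries(columns.map((c) => [c, row[c]]));
}

function rowsMatchOnColumns(actual, expected, columns) {
  if (actual.length !== expected.length) return false;
  return actual.every(
    (row, i) =>
      JSON.stringify(pickColumns(row, columns)) ===
      JSON.stringify(pickColumns(expected[i], columns))
  );
}

// Sample data invented for this sketch.
const actualRows = [
  { name: "John Doe", gpa: 3.5, worstGrade: 2, bestGrade: 5, enrolledAt: "2020-09-01" },
];
// The expectation mentions only the columns under test.
const expectedRows = [{ name: "John Doe", gpa: 3.5 }];

const columnsMatch = rowsMatchOnColumns(actualRows, expectedRows, ["name", "gpa"]);
console.log(columnsMatch);
```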
worstGrade and bestGrade are easy to understand but are just unnecessary noise. They are not what I want to test here. enrolledAt, which comes from the defaults, is also irrelevant, and is difficult to understand without examining the defaults. Usually, there would be many more tests to write about GPA calculation, and each of those tests would need to repeat the noise as well.
.where
Similar to .columns, but instead of filtering out irrelevant columns, it filters out irrelevant rows. Imagine the same example, just with a students table that uses the long representation, where columns effectively become rows.
The interface:
.where("name = 'John Doe' AND field IN ('gpa', 'date_of_birth')")
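One way such a filter could work, sketched here as an assumption rather than a description of the actual implementation, is to wrap the query under test in an outer SELECT that applies the condition. The withWhere helper, the tested alias, and the value column are hypothetical names.

```javascript
// Hypothetical helper: wrap a query so that only rows matching `condition` remain.
function withWhere(querySql, condition) {
  return `SELECT * FROM (\n${querySql}\n) AS tested WHERE ${condition}`;
}

// The inner query stands in for whatever SQL the test would otherwise compare.
const filtered = withWhere(
  "SELECT name, field, value FROM students",
  "name = 'John Doe' AND field IN ('gpa', 'date_of_birth')"
);
console.log(filtered);
```

The condition string is inserted verbatim, so in this sketch it must already be valid SQL for the target warehouse.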
.orderBy
Takes a list of columns. Can be used to ensure results arrive in a particular order, which helps avoid flaky tests. The current workaround is to include ordering in the implementation, or to always test just a single row.
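The flakiness that ordering removes can be shown with a small plain-JavaScript sketch: two runs return the same rows in different orders, and sorting both by a list of columns makes them comparable. The sortRowsBy helper and the sample rows are hypothetical.

```javascript
// Hypothetical helper: sort rows by a list of columns, earlier columns first.
function sortRowsBy(rows, columns) {
  // Copy the array so the caller's rows are left untouched.
  return [...rows].sort((a, b) => {
    for (const c of columns) {
      if (a[c] < b[c]) return -1;
      if (a[c] > b[c]) return 1;
    }
    return 0;
  });
}

// The same two rows, returned in different orders by two runs.
const runA = [{ name: "Ann", gpa: 3.9 }, { name: "Bob", gpa: 3.1 }];
const runB = [{ name: "Bob", gpa: 3.1 }, { name: "Ann", gpa: 3.9 }];

const sameBeforeSort = JSON.stringify(runA) === JSON.stringify(runB);
const sameAfterSort =
  JSON.stringify(sortRowsBy(runA, ["name"])) ===
  JSON.stringify(sortRowsBy(runB, ["name"]));
console.log(sameBeforeSort, sameAfterSort);
```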
The inclusion of this feature might resolve Issue 1265. An alternative to this feature could be an .expect that ignores the order of rows.
Feature Additions
These enable tests that are not currently possible with Dataform query testing.
.vars to Test Dataform Variables
Dataform can generate different queries depending on variables that are passed in at compile time. For example, the following query would parse a birthday using a format string that is specified at compile time.
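As an illustration of such a query, the sketch below builds BigQuery SQL from a variables object. In a Dataform project the value would typically be read from dataform.projectConfig.vars; here a plain object stands in for it so that the snippet runs on its own. The variable name birthdayFormat, the table raw_students, and the column names are hypothetical.

```javascript
// Hypothetical query builder: `vars` stands in for compile-time variables.
function buildStudentsQuery(vars) {
  // PARSE_DATE(format_string, date_string) is a BigQuery function.
  return (
    `SELECT name, PARSE_DATE('${vars.birthdayFormat}', birthday) AS date_of_birth ` +
    `FROM raw_students`
  );
}

// The same definition compiles to different SQL for different variable values.
const isoQuery = buildStudentsQuery({ birthdayFormat: "%Y-%m-%d" });
const usQuery = buildStudentsQuery({ birthdayFormat: "%m/%d/%Y" });
console.log(isoQuery);
console.log(usQuery);
```

A test that can set variables would compile the query once per set of values and check each result separately.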
I propose the introduction of a .vars method, which could be used to write test cases that specify different values for a variable. For example, the above query might be tested like so:
.calculateInput for Integration Tests
There might be cases when instead of testing the behavior of individual tables, it is better to test the behavior of a succession of tables. For example, given an ETL pipeline that consists of a series of transformations, tests that describe just the initial input and the final output might be more expressive.
In general, such tests leave more space for refactoring. For example, if developers choose to split up transformation_2 into two steps, they can perform this refactor without having to adjust any tests. Similarly, it might be beneficial even in a unit test of transformation_2 to specify just the input and the expected state after transformation_2. That way, the test suite might be used as a guide during a refactor when the output of transformation_1 changes, but the expected outputs of transformation_2 remain the same.
An integration test of transformation_2, which specifies input rather than the output of transformation_1, might have this interface. In this example, transformation_1 and transformation_2 each increase value by 1.
Note that, in our experience, these tests execute more slowly on BigQuery than simple query tests. Nevertheless, they remain significantly faster than tests that actually execute the pipeline.
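The chaining idea behind such an integration test can be sketched in plain JavaScript. This is an assumption about one possible mechanism, not the implementation described in this post: each transformation is modeled as a function from a relation to a SELECT statement, and the inline input is fed through the steps by nesting subqueries. The names transformation1, transformation2, chainSteps, and the prev alias are hypothetical.

```javascript
// Hypothetical model: each step maps a FROM-able relation to a SELECT statement.
// Both steps increase `value` by 1, as in the example described in the text.
const transformation1 = (from) => `SELECT value + 1 AS value FROM ${from}`;
const transformation2 = (from) => `SELECT value + 1 AS value FROM ${from}`;

// Feed the inline input through each step in order by nesting subqueries.
function chainSteps(inputSql, steps) {
  return steps.reduce((sql, step) => step(`(${sql}) AS prev`), inputSql);
}

const testedSql = chainSteps("SELECT 1 AS value", [transformation1, transformation2]);
console.log(testedSql);
```

Because the whole chain is a single query over an inline table, nothing has to be materialized between steps. For the input row with value 1, the expectation after both steps would be value 3.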
This would resolve Issue 1387. The solution I propose is in line with @BenBirt's suggestion and requires that the user explicitly specify any intermediate steps that need executing.
Incremental Table Testing and .incremental
dataform test currently doesn’t support testing incremental tables at all. I propose testing incremental tables the same way that regular tables are tested: by executing the tested query against inline tables.
An .incremental(true) call would be used to test when the target table already exists before the execution. In this case, the expectation would list only those rows that will be newly added (or merged) into the table. The insert / merge operation itself would not be tested. However, any conditional logic in the implementation that only activates on an incremental execution would be tested.
An .incremental(false) call would simulate the case when the target table does not exist. In such a case, an incremental table behaves the same way as a regular table, and tests would also behave in the same way.
This would resolve Issue 1132.
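The conditional logic mentioned above can be illustrated with a small sketch. The query function below is written against the context methods incremental(), when(), and self() that Dataform passes to JS query definitions; the fakeContext object is a hypothetical stand-in so that the snippet runs on its own, and the table and column names are invented.

```javascript
// Query definition that adds a filter only on incremental runs.
function eventsQuery(ctx) {
  return (
    "SELECT id, created_at FROM raw_events" +
    ctx.when(
      ctx.incremental(),
      ` WHERE created_at > (SELECT MAX(created_at) FROM ${ctx.self()})`,
      ""
    )
  );
}

// Hypothetical stand-in for the context; a real one is supplied by Dataform.
function fakeContext(isIncremental) {
  return {
    incremental: () => isIncremental,
    when: (cond, ifTrue, ifFalse = "") => (cond ? ifTrue : ifFalse),
    self: () => "events",
  };
}

const incrementalSql = eventsQuery(fakeContext(true));
const fullSql = eventsQuery(fakeContext(false));
console.log(incrementalSql);
console.log(fullSql);
```

In this sketch, a test of the incremental case would exercise the WHERE branch, while the non-incremental case compiles to the same SQL as a regular table.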
.preOps
This would allow the user to specify some SQL to run ahead of the test query. For example, a query that defines a User-Defined Function as part of its preOps block can include the same UDF declaration as part of the tests. This would resolve Issue 1472.
Note: Any upstreamed implementation would likely need an sqlx interface as well, which I’m open to creating if requested.