Compare header and schema #127

peterdesmet · 2023-03-21T14:23:33Z

Update: this can now be defined in fieldMatch #216

It is possible for an (invalid) Data Package to have discrepancies between the schema and the actual data, e.g. the schema defines more or fewer columns, or lists them in a different order. read_resource() silently lets those through when the data types of the switched columns are compatible, which can cause problems for the user (e.g. lat/lon are silently switched). Only when the data types are incompatible will readr return a parsing issue.
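To illustrate the silent switch, here is a minimal sketch in base R. It assumes that the header line is skipped and that column names are assigned by position from the schema; the file content and field names are made up, and read.csv() is only a stand-in for the readr call used by read_resource().

```r
# Hypothetical data file: the header says lon comes first, then lat.
csv_text <- "lon,lat\n4.35,50.85\n3.72,51.05"

# Hypothetical schema that lists the same fields in the opposite order.
schema_fields <- c("lat", "lon")

# Skip the header line and name the columns by position from the schema.
# Base R stand-in, not the actual readr call in read_resource().
data <- read.csv(
  text = csv_text,
  header = FALSE,
  skip = 1,
  col.names = schema_fields
)

# Both columns are numeric, so no parsing issue is raised,
# but `lat` now holds the longitude values from the file.
print(data)
```

No warning or error is produced here, because both columns parse as numbers.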

To avoid passing these issues silently, read_resource() should compare the headers of the file with the schema and raise an error if those are not exactly the same. This implements the following spec:

The field descriptor MUST contain a name property. This property SHOULD correspond to the name of field/column in the data file (if it has a name). As such it SHOULD be unique (though it is possible, but very bad practice, for the data file to have multiple columns with the same name). name SHOULD NOT be considered case sensitive in determining uniqueness. However, since it should correspond to the name of the field in the data file it may be important to preserve case.

Implementation considerations:

Only compare when replace_null(dialect$header, TRUE) is TRUE (i.e. the header property is not set to false). It might be useful to define dialect_header and reuse it here:

frictionless-r/R/read_resource.R

Line 356 in 421c22f

    skip = ifelse(replace_null(dialect$header, TRUE), 1, 0),

The specs say that case should NOT be considered, so both the field names and col_names should be lowercased before comparing.
To allow comparison, the header line of the file should be read separately from the main read_delim(). read_lines() could be used, but delim and encoding/locale might have to be passed too.
A resource can contain multiple files (e.g. observations_1, observations_2). Either all files are read and compared, or only the last one, cf. add_resource():

The last file will be read with readr::read_delim() to create or compare with schema and to set format, mediatype and encoding. The other files are ignored, but are expected to have the same structure and properties.

On a mismatch (different field names, different order, more or fewer columns), an error should be raised, similar to check_schema():

frictionless-r/R/check_schema.R

Lines 65 to 69 in 421c22f

    
    msg = glue::glue(
      "Field names in `schema` must match column names in data:",
      "\u2139 Field names: `{field_names_collapse}`",
      "\u2139 Column names: `{col_names_collapse}`",
      .sep = "\n"

Add a Validation section to explain what we validate:

#' @section Validation:
#' Full validation is not supported.
#' Something about validation issues
#' Something about header compare
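The considerations above could be sketched roughly as follows. This is only an illustration in base R, not the package's implementation: compare_header() and its arguments are hypothetical names, readLines() and scan() stand in for readr::read_lines(), and the error wording loosely follows the check_schema() message.

```r
# Hypothetical helper, not part of frictionless: compare the header of a
# delimited file with the field names defined in a schema.
compare_header <- function(path, field_names, delim = ",", encoding = "UTF-8") {
  # Read only the first line, using the encoding of the file.
  con <- file(path, open = "r", encoding = encoding)
  on.exit(close(con))
  header_line <- readLines(con, n = 1, warn = FALSE)

  # Split on the delimiter; scan() respects quoted column names.
  col_names <- scan(
    text = header_line, what = character(), sep = delim,
    quote = "\"", quiet = TRUE
  )

  # Ignore case, but keep the order and number of columns significant.
  if (!identical(tolower(col_names), tolower(field_names))) {
    stop(
      "Field names in `schema` must match column names in data:\n",
      "Field names: `", paste(field_names, collapse = "`, `"), "`\n",
      "Column names: `", paste(col_names, collapse = "`, `"), "`",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Usage with a made-up file and dialect.
path <- tempfile(fileext = ".csv")
writeLines(c("LON,lat", "4.35,50.85"), path)

dialect <- list() # no `header` property, so the default TRUE applies
dialect_header <- if (is.null(dialect$header)) TRUE else dialect$header

if (dialect_header) {
  compare_header(path, c("lon", "lat")) # passes: case is ignored
  result <- try(compare_header(path, c("lat", "lon")), silent = TRUE)
  inherits(result, "try-error") # the different order raises an error
}
```

Reading a single line keeps the check cheap, which matters when the file is read over a URL.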


PietrH · 2023-03-22T11:00:05Z

Some questions:

Multipart resources

What about multipart resources, should all parts of the resource be checked? Or just the first/last one?
For multipart resources, will either all parts have a header or none of them? Or is it possible that, for example, only the first part has a header?

Naming

What would be a good argument name to toggle this comparison/check?

check_header = TRUE
compare_header = TRUE
check_fields = TRUE

Default behavior

I assume that read_resource() should by default not compare the header and the schema?

peterdesmet · 2023-03-22T11:28:01Z

Multipart resources: to increase performance (especially when reading over URL) I'd be fine with the last file being read.
Whether there is a header is defined at the resource level, so all files should comply.
I would not add a parameter in read_resource, but always include this check. It is a recommended part of the specs: This property SHOULD correspond ...

peterdesmet added the enhancement New feature or request label Mar 21, 2023

peterdesmet assigned PietrH Mar 21, 2023

peterdesmet added this to the 1.1.0 milestone Mar 21, 2023

peterdesmet mentioned this issue Mar 21, 2023

Make validation more explicit #125

Closed

PietrH added a commit that referenced this issue Aug 3, 2023

add test skeletons for #127

a0894ba

PietrH linked a pull request Aug 3, 2023 that will close this issue

127 compare header and schema #146

Draft

peterdesmet added the function:read_resource Function read_resource() label Aug 12, 2023

peterdesmet mentioned this issue Sep 13, 2023

Frictionless silently adds columns not defined in the schema (add schema_sync) #150

Open

peterdesmet modified the milestones: 1.1.0, 1.2.0 Mar 22, 2024

peterdesmet added the complexity:high Likely complex to implement label Jul 3, 2024

peterdesmet removed this from the 1.2.0 milestone Jul 15, 2024
