add interlacer support via `read_interlaced_resource()` #213

khusmann · 2024-06-27T02:06:37Z

Just put together a quick sketch of what interlacer support might look like for frictionless-r

Key features that interlacer brings:

interlaced columns (so missing reasons are easily accessible)
field-level missingValues Support field.missingValues #174
coded factor values and missing values (value labels for factors) Support field.categorical and field.categoriesOrdered #148
an x_col system that can easily accommodate other new frictionless types like list Support new field type list #179

Example csv in tests/testthat/data/type_interlaced.csv:

na_fct,na_int,na_cfct,na_none,na_default,cfct_chr,cfct_int
1.1,1.1,1.1,1,1,a,10
2.2,2.2,2.2,2,2,b,20
OMITTED,-99,-99,3,OMITTED,a,10
REFUSED,-98,-98,3,3,b,20
5.5,5.5,5.5,4,4,a,10

Example run:

pkg <- read_package("tests/testthat/data/types.json")
read_interlaced_resource(pkg, "interlaced")
# # A tibble: 5 × 7
#      na_fct    na_int    na_cfct na_none na_default cfct_chr cfct_int
#   <dbl,fct> <dbl,int> <dbl,cfct>   <dbl>  <dbl,fct> <cfct>   <cfct>  
# 1       1.1       1.1        1.1       1          1 APPLE    APPLE   
# 2       2.2       2.2        2.2       2          2 BANANA   BANANA  
# 3 <OMITTED>     <-99>  <OMITTED>       3  <OMITTED> APPLE    APPLE   
# 4 <REFUSED>     <-98>  <REFUSED>       3          3 BANANA   BANANA  
# 5       5.5       5.5        5.5       4          4 APPLE    APPLE

Notes / thoughts

The complexity of read_resource is getting pretty out of hand (as you note in Make read_resource() more modular #210). I would argue that it's not just that we need to make it more modular; what really hurts us is that we're writing a complex parser here in a language that doesn't support static typing. So we can't just validate the JSON via an object schema (as you'd do with pydantic or serde), then rely on the type signatures thereafter to guarantee valid types; instead type validation is mixed into all of our transformation logic (look at all the null checks that are necessary, for example). Without static typing, it becomes very difficult to manage the input possibilities and decouple input validation from transformation. As I mentioned in the last frictionless call, in the long term I want to write a library in rust to help address / standardize / streamline some of this, but in the short term I agree that your ideas in Make read_resource() more modular #210 will go a long way.
While working on this I noticed some more (easily fixable) performance issues with read_interlaced_*():

Because interlacer is still in its infancy, I don't think it should be the default reader. So presently I have an interlaced argument in read_resource, as well as an alias function read_interlaced_resource() that runs with interlaced = TRUE.

When interlaced = FALSE, it uses read_delim() instead of read_interlaced_delim(). This has the following effects:

cfactors just become whatever their underlying code type was (a cfactor with character codes just turns into a character, for example)
if there is any custom field-level missingness information, it is lost (and the user is warned).

The reason for this, is that even after I fix the performance issues above, we're still looking at a speed compromise if we want to load non-vroom types (interlaced columns , cfactors, list, etc.) with frictionless-r. This is the case because if we touch a column in R after loading it with vroom, the column gets loaded into memory, thereby defeating the lazy ALTREP loading of the column and slowing things down. In the long-term I want to work with ALTREP so it lazy loads everything just like vroom; but in the short term I'm planning to focus on the larger low hanging fruit (like the issues i link above)

Minor cosmetic thing -- I've named all my classes "interlacer_*" to sort of namespace them (similar to "vectrs_vctr"). After working with the lib from the perspective of another package though, I'm not sure I like it. Would you prefer to see "interlaced" and "cfactor" instead of "interlacer_interlaced" and "interlacer_cfactor"?

Key features: - field-level missingValues - coded factor columns (value labels for factors) - tagged missingness (interlaced columns)

khusmann added 3 commits June 26, 2024 18:09

add interlacer support via read_interlaced_resource()

1945ee0

Key features: - field-level missingValues - coded factor columns (value labels for factors) - tagged missingness (interlaced columns)

scope calls to cli::cli_abort

6d2111b

fix r cmd check documentation warnings

839acc9

This was referenced Jul 3, 2024

Support field.missingValues #218

Closed

Support field.missingValues #174

Open

Support labeled schema.missingValues #222

Open

Support field.categorical and field.categoriesOrdered #148

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add interlacer support via `read_interlaced_resource()` #213

add interlacer support via `read_interlaced_resource()` #213

khusmann commented Jun 27, 2024 •

edited

Loading

add interlacer support via read_interlaced_resource() #213

Are you sure you want to change the base?

add interlacer support via read_interlaced_resource() #213

Conversation

khusmann commented Jun 27, 2024 • edited Loading

Notes / thoughts

add interlacer support via `read_interlaced_resource()` #213

add interlacer support via `read_interlaced_resource()` #213

khusmann commented Jun 27, 2024 •

edited

Loading