Multiple train/test splits result in discontinuous batches #127

SimonTopp · 2021-08-17T12:05:25Z

Lines 112 to 117 in a7629eb

    
    for i in range(int(1 / offset)):
        start = int(i * offset * seq_len)
        idx = np.arange(start=start, stop=data_array.shape[1] + 1, step=seq_len)
        split = np.split(data_array, indices_or_sections=idx, axis=1)
        # add all but the first and last batch since they will be smaller
        combined.extend([s for s in split if s.shape[1] == seq_len])

Here, if we have discontinuous training and testing groups (i.e. two sets of date ranges for each), and the batch length (seq_len) is set to anything other than 365 days, then I think this results in one batch that starts in the first date range and ends in the second. I think we should first group by water year, then split into batches and pad and/or drop the last one. What do you all think?
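To make the concern concrete, here is a minimal sketch that is not taken from river-dl. It works on a 1-D array of dates instead of the real `data_array`, and the helper names (`split_fixed`, `split_by_segment`, `is_continuous`) are hypothetical. It also splits at date gaps, not strictly by water year, and pads the short last piece with `NaT`, which is only one way to read the proposal above.

```python
import numpy as np

ONE_DAY = np.timedelta64(1, "D")

def split_fixed(dates, seq_len):
    """Cut the array every seq_len steps, ignoring date gaps; keep full sequences only."""
    idx = np.arange(seq_len, len(dates), seq_len)
    return [s for s in np.split(dates, idx) if len(s) == seq_len]

def split_by_segment(dates, seq_len):
    """Split at date gaps first, then cut each segment and pad its last piece with NaT."""
    gap_idx = np.where(np.diff(dates) != ONE_DAY)[0] + 1
    out = []
    for segment in np.split(dates, gap_idx):
        for start in range(0, len(segment), seq_len):
            piece = segment[start:start + seq_len]
            pad = np.full(seq_len - len(piece), np.datetime64("NaT"), dtype=piece.dtype)
            out.append(np.concatenate([piece, pad]))
    return out

def is_continuous(seq):
    """True if all non-padded steps are exactly one day apart."""
    valid = seq[~np.isnat(seq)]
    return bool(np.all(np.diff(valid) == ONE_DAY))

# Two disjoint 365-day training periods, concatenated as they would be after selection.
period_1 = np.arange("2000-10-01", "2001-10-01", dtype="datetime64[D]")
period_2 = np.arange("2005-10-01", "2006-10-01", dtype="datetime64[D]")
dates = np.concatenate([period_1, period_2])

seqs_365 = split_fixed(dates, 365)            # cuts line up with the gap
seqs_200 = split_fixed(dates, 200)            # second sequence straddles the gap
seqs_by_segment = split_by_segment(dates, 200)

print([is_continuous(s) for s in seqs_365])
print([is_continuous(s) for s in seqs_200])
print([is_continuous(s) for s in seqs_by_segment])
```

With a sequence length of 365 the cut points happen to coincide with the gap, which is why the problem only shows up for other lengths. Splitting per segment keeps every sequence inside one date range, at the cost of padded values that the loss then has to ignore.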

jsadler2 · 2021-10-25T20:58:42Z

Interesting. It's been a while since I wrote this (or thought about this ... or used this 😄). Have you confirmed that this is what happens?

jdiaz4302 · 2021-11-04T15:33:25Z

This may be of interest as confirmation that multiple train/test splits result in discontinuous ~~batches~~ sequences.

janetrbarclay · 2021-11-09T18:05:42Z

Further confirmation if you look at the temperatures in a single sample (these are observed temperatures; the number in the title is the seg_id).

jdiaz4302 · 2021-11-12T22:22:49Z

Using the existing reduce_training_data_continuous function from river_dl/preproc_utils.py can help get continuous batches by filling the excluded dates with NaN values. For example, here is the 365-day sequence for the pretraining and finetuning Ys when I applied it to only the finetuning Y (the gap in the finetuning Y, in summer, is where the NaNs have been placed):

If you apply reduce_training_data_continuous to the x variables, you end up with NaN in the predictions and subsequently in the loss function. Taking this approach in #142 by applying reduce_training_data_continuous to only the Y array (and not the pretraining Y or X arrays) led to a much worse RMSE (by a factor of 2). I assume this is because the model is exposed to out-of-bounds x values whose corresponding Y values are set to NaN, while those x values still sit in the same 365-day sequence as the other values, so it may lead to some misleading learning.
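The asymmetry between NaN in Y and NaN in X can be illustrated with a toy example. This is a sketch under assumptions, not river-dl code: `masked_rmse` and `toy_model` are hypothetical stand-ins, and it assumes the loss skips time steps where the observation is NaN.

```python
import numpy as np

def masked_rmse(y_true, y_pred):
    """RMSE over the time steps where the observation is not NaN."""
    mask = ~np.isnan(y_true)
    return float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))

def toy_model(x, weight=2.0):
    """Stand-in for the network: any arithmetic on a NaN input yields NaN."""
    return weight * x

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# NaN in Y only: the masked loss skips that time step and stays finite.
y_with_nan = y.copy()
y_with_nan[1] = np.nan
loss_y_nan = masked_rmse(y_with_nan, toy_model(x))

# NaN in X: the prediction is NaN where Y is still observed, so the loss is NaN.
x_with_nan = x.copy()
x_with_nan[1] = np.nan
loss_x_nan = masked_rmse(y, toy_model(x_with_nan))

print(loss_y_nan, loss_x_nan)
```

Masking on Y cannot rescue a NaN that entered through X, because the mask only looks at the observations. In a recurrent model the NaN would typically also carry forward through the hidden state to later time steps.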

jds485 · 2023-05-23T13:48:47Z

I think this issue has been addressed by #218.

SimonTopp mentioned this issue Nov 3, 2021

No pretraining exp #142

Closed

SimonTopp mentioned this issue Nov 17, 2021

Simplify training routine #146

Closed

jds485 mentioned this issue Apr 24, 2023

Pad train/val/test data #218

Merged
