Conversation
@jsadler2 any chance you have time to review, or willingness to punt quickly to @SimonTopp if not? Jeremy pointed out the big concern - it's weird that the model predictions are all right on top of one another from 2009-09 to 2010-06 and then suddenly all over the map in the summer (and even a bit in the winter) once the validation period starts. Could be a bunch of things, including but not limited to:
I can look at this this afternoon.
How has it been done previously to train on the winter data and test on the summer? I wonder if shortening the training data like this is causing weird jumps in the data (such that the model thinks that Sept 23 comes immediately after June 19). I have a 2 PM (eastern) meeting but can look a little more afterwards.
I think Janet might be onto something. Here we're cutting up all our observations by start and end date (river-dl/river_dl/preproc_utils.py, lines 46-75 at b716770).
Then here we're taking the resulting sequences and slicing them by 365, which assumes that continuous years are being passed in (river-dl/river_dl/preproc_utils.py, lines 120-139 at b716770). I think I walked through all this when I made issue #127.
I think what we want to be doing here is masking out the summer months rather than excluding them in the start/end dates. Maybe using the exclude file (might need some work after the big update a couple months ago) or by using …
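To illustrate the masking idea, here is a minimal sketch (not the project's actual API; the function and its assumptions are hypothetical). It assumes observation and time arrays shaped like the ones written out below, with times stored as np.datetime64, and it blanks summer observations instead of dropping those dates so the 365-day sequences stay continuous:

import numpy as np

def mask_summer_obs(y_obs, times, summer_months=(7, 8, 9)):
    # y_obs: observation array whose leading dims match `times`
    #        (e.g. n_batches x seq_len x n_vars, like y_obs_trn below)
    # times: np.datetime64 array aligned with y_obs (like times_trn below)
    # Month number (1-12) of every timestep.
    months = times.astype("datetime64[M]").astype(int) % 12 + 1
    y_masked = y_obs.astype(float).copy()
    # Blank the summer observations so the loss skips them, while the input
    # sequences themselves stay continuous in time.
    y_masked[np.isin(months, summer_months)] = np.nan
    return y_masked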
np.savez_compressed(updated_io_data, x_trn = io_data['x_trn'], x_val = io_data['x_val'], x_tst = io_data['x_tst'],
                    x_std = io_data['x_std'], x_mean = io_data['x_mean'], x_vars = io_data['x_vars'],
                    ids_trn = io_data['ids_trn'], times_trn = io_data['times_trn'],
                    ids_val = io_data['ids_val'], times_val = io_data['times_val'],
                    ids_tst = io_data['ids_tst'], times_tst = io_data['times_tst'], dist_matrix = io_data['dist_matrix'],
                    y_obs_trn = io_data['y_obs_trn'], y_obs_wgts = io_data['y_obs_wgts'],
                    y_obs_val = io_data['y_obs_val'], y_obs_tst = io_data['y_obs_tst'],
                    y_std = io_data['y_std'], y_mean = io_data['y_mean'], y_obs_vars = io_data['y_obs_vars'],
                    y_pre_trn = io_data['y_pre_trn'], y_pre_wgts = io_data['y_pre_wgts'],
                    y_pre_val = io_data['y_pre_val'], y_pre_tst = io_data['y_pre_tst'], y_pre_vars = io_data['y_pre_vars'])
Did these all get updated somewhere that I'm not seeing, or are you basically just copying io_data here?
I'm just copying the io_data because that's what satisfied snakemake - correct my next interpretation if it's wrong, I'm still very new to snakemake. The output file, prepped2.npz, wasn't actually updating when I was using PB outputs as inputs until I specified it as an output in the Snakefile, and that required it to always be made. So, I rewrote prepped.npz to prepped2.npz if pretraining did occur - when I didn't really need to - to make the pipeline work under how I set everything else up.
What exactly is prepped2.npz and how does it differ from prepped.npz? I think that would help clarify how to best do this in Snakemake. It seems pretty unusual to me to have to make a straight copy of a file just to make the pipeline work.
Ahh, gotcha, ya Snakemake can be tricky when your experiments have different outputs. I think you could use touch here in the snakemake, which basically creates a temporary phantom file to fool snakemake into thinking the output exists. You'd probably have to play around with it a bit though.
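For reference, a rough Snakefile sketch of that touch idea (the rule, wildcard, and file names are made up, not from this repo). Snakemake's touch() marker writes an empty flag file when the rule finishes, so downstream rules are satisfied even when the experiment doesn't produce a new prepped2.npz:

rule append_pb_outputs:
    input:
        "{outdir}/prepped.npz"
    output:
        # touch() creates an empty flag file on completion, standing in for the
        # "real" output that only exists for some experiment configurations.
        touch("{outdir}/append_pb_outputs.done")
    run:
        # ...append the PB outputs to the x arrays here only when needed...
        pass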
Good catch! Interestingly, this training set processing applies to both groups in the experiment, so I wonder what that implies. That is, both groups are given a discontinuous sequence of input values, but only the …
I was just thinking about that @jdiaz4302. I would expect the discontinuous sequences to decrease accuracy across the board, but we still see pretty decent results from the pre-training, which is surprising. Am I right that the pre-training here has the same breaks as the training dataset? If so, it's bonkers that it can still learn annual signals.
Yep! I'm assuming with a discontinuous 365-day sequence, you could often still reliably use (e.g.) the last 2 weeks of data and learn certain variable relationships with less focus on long-term temporal dynamics.
I've found similar things with the GraphWaveNet model I've been developing, but that would imply that there's relatively little worthwhile information beyond ~1-2 months in a sequence. Might be interesting to run some tests with different sequence lengths and see at what point (how short) the model sees a drop in performance from loss of temporal info. Also, should have said this off the bat, very cool work and great visualizations man!
Really interesting stuff, @jdiaz4302!
I'm having a hard time understanding exactly how you set up the two treatments. I understand what you wrote in the description, but I don't see it in the code anywhere. For example,
- I don't see PB-input and PB-pretraining treatments in the Snakefile.
- I also don't see where you are combining the PB outputs. Is that just in the x_vars in the config file?
Discontinuous sequences
I think that the non-continuous thing is really interesting. So does the model have chunks of time where it's doing well and then, when it gets to a break in the time series it does poorly for a few days and then recovers? That's how I'm picturing it working, but it'd be nice to
- get a confirmation that the 365-day training sequences are indeed discontinuous in time,
- see what impact that is having on predictions.
As an aside, I think the prep_data function has sequence length as an argument that gets propagated to the other relevant functions, so if we want, we could shorten the sequence length to reduce the discontinuities.
Let's fix this if we can! We've hypothesized that a lot of the pretraining benefit is in getting to see predictions for conditions under which the model doesn't get to see any observations (in this case, for summertimes). So maybe we can get the pretraining results even better, justifiably, by adding those back in.
Any ideas on what the mechanism would be for this? I would think the PB inputs approach would have a better shot at learning this since it could learn to rely on the PB input more heavily (which does integrate memory across that missing period) whereas the pretraining approach has no such pseudo-memory to rely on. I've seen a handful of (informal) HPO exercises looking at sequence length for such problems, and people generally settle on ~176 or 365 days. But I bet it varies by region, and I wonder if memory just isn't that important over the summer in these reaches b/c snow is long gone by June and drought is rarely severe. I wouldn't mind seeing this experiment done again for the DRB but also don't see it as a very high priority.
This explanation seems more plausible to me.
PB-input or PB-pretraining is triggered by the number of …
This is updated in my most recent PR (granted in kind of a bulky way), but you could pull it from there if you wanted. It basically just creates an …
Since Jeff and Simon are deep into this review already, I'll follow the conversation and comment if I think of something, but mostly let them dig into the code.
Mmk. I'm pretty sure I know what is going on here as far as why the predictions for PB-input are so wonky in the validation phase and not in the training: the training Y values are normalized and the validation ones are not (river-dl/river_dl/preproc_utils.py, line 584 at b716770).
Yikes, good catch! Is that only the case in this PR, or has it been that way in recent code as well?
It's always been like that. There's been no need to normalize Y_tst and Y_val before. Do you think that we should normalize all Y? I think often no Y partition is normalized; it's just that when doing the multi-variable predictions it's needed.
@jdiaz4302 - I think if you scaled and centered …
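(To make that concrete, here is a minimal sketch of what scaling and centering the held-out observations with the training statistics could look like, using only key names that appear in the np.savez_compressed call above; it's an illustration, not the repo's code.)

import numpy as np

io_data = np.load("prepped2.npz")

# Scale/center the validation and test observations with the *training*
# mean and standard deviation, so everything the model sees or is scored
# against is on the same scale it was trained on.
y_obs_val_scaled = (io_data["y_obs_val"] - io_data["y_mean"]) / io_data["y_std"]
y_obs_tst_scaled = (io_data["y_obs_tst"] - io_data["y_mean"]) / io_data["y_std"]

# Predictions made in scaled space can be mapped back to original units with:
#   y_pred = y_pred_scaled * io_data["y_std"] + io_data["y_mean"]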
BTW - the only reason I thought of this as quickly as I did is because I did basically the same thing for an experiment for my multi-task paper 😄 I kept thinking "how are the training predictions okay and the val/test predictions terrible?!" and then I realized that the model was being trained on variables that were on a totally different scale than what I was giving them in the val/test conditions.
😮 Haha, thanks @jsadler2. I agree there's generally no reason to scale the validation and testing set observations since you're usually going to just use those for evaluation at the scale-of-interest.
That's a great find, Jeff. Experience and team communication paying off big time here. Combining the scaling fix (💯 Jeff!), discontinuity fix (💯 Janet and Simon!), and pretraining fix (💯 Simon), I feel a lot more optimistic that a next run of these models could give us a correct result. Sweet!
If you do pull from the …
Sounds like we need somebody to review & merge Simon's PR soon!
I can take a look at Simon's PR tomorrow.
Results with scaling fix: Regarding the discontinuity fix, I tried using a modified version …
… much different than before. To clarify, the heatmap is from all segments modeled, correct? Not just the two segments with time series plots.
Yes, the heatmap is from all segments. Definitely different, but that should be expected given the previous results were generated with validation set variables that had the wrong scale.
Super interesting Jeremy. So basically, even though the PB input runs saw nothing that resembles summer, they're able to generalize to summer conditions better than models that were at least pre-trained with summer months included? Also, did we confirm our discontinuous training sequences?
There's no pretraining of summer months included here yet - didn't want to duplicate efforts. I posted a graph at #127 showing that we do have discontinuous batches.
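(As an aside, one quick way to spot-check for discontinuities, assuming times_trn is stored as np.datetime64 with shape (batch, seq_len) as in the prepped npz above; the key names mirror that file, but the check itself is just an illustration.)

import numpy as np

io_data = np.load("prepped2.npz")
times = np.squeeze(io_data["times_trn"])   # assumed (n_batches, seq_len) of datetime64

# In a truly continuous daily sequence, consecutive timesteps differ by exactly one day.
gaps = np.diff(times, axis=-1) != np.timedelta64(1, "D")
n_broken = int(gaps.any(axis=-1).sum())
print(f"{n_broken} of {times.shape[0]} training batches contain at least one time gap")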
I messed up some of the version control associated with this, so to clarify, since the last update, I didn’t make any changes to:
What I did make meaningful changes to were:
Figures

Figure showing the continuous batch of […]:

Figure showing latest performance heatmap. I found it strange that performance took a strong hit from using the continuous batches with […]:

Time series for reservoir impacted and not impacted stream:

I'll include these plots of input versus output as well (since I made them), but I didn't find them incredibly insightful (colors are the same; kinda interesting that it seems to taper the effect of higher PB values):

I'm likely going to be helping more on the reservoir task starting next week, and like I said, this was not designed to merge with the existing codebase - more an exploratory tangent. Feel free to close out or maybe I will sometime next week when engagement is practically dead. Also, thanks @jsadler2 for the better approach! I just didn't have time to learn it and get the results, but I will definitely be reviewing it before trying to take on a deeper snakemake-affiliated task.
Interesting. I find it a bit surprising that both methods are overpredicting temperature by quite a bit during the summer periods even though they didn't see any forcing data in that range. Do you know if it's overpredicting at all segments?
@jzwart here's a plot of all observations (x-axis) versus predictions (y-axis); these look approximately the same across models and runs. It does seem like that's the general trend. Made some low-effort quadrants via dashed lines to try to discern summer (upper left quadrant - above 25 Celsius). Solid line is 1:1. Seeing data adjacent to summer (when temperatures are changing faster) may suggest that summer will peak higher than it does (i.e., a sharper rather than rounder parabola)?
This seems like a relevant conversation to be had and maybe a good task to assign to someone for a new PR. We should probably make sure our pipeline is creating continuous sequences and has the flexibility to mask out certain observations within those sequences for experiments like this. I know @jsadler2 mentioned he had some ideas for an upcoming PR, maybe we should put this on the to-do list? Also, at least in these reaches it looks like our high temp bias is in the training predictions as well. I feel like that might be an indication that it could be something wrong with our data prep rather than an issue generalizing to the unseen summer temps. What do you think @jdiaz4302?
The red box annotations are a good point. It's possible, but I don't necessarily suspect that something is wrong with the data prep. In my experience, it's not uncommon for there to be under/overestimating at the low ends and over/underestimating at the high ends (note opposite order with respect to "/") because then performance at the central/median/mean values is still optimized. These plots do seem overly skewed toward not performing at the high ends, but a density view of the plot seems to show that the low end is far more weighted (same plot as above, but …). It's possible that additional variables/missing context could help reel in those low and high ends to the 1:1 line though. RMSE definitely optimizes with respect to the central values, but I've never had luck fixing this problem by using a different generic loss function.
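(If it's useful, a density view like that can be sketched with a hexbin; the obs/pred arrays below are synthetic placeholders, not the experiment's data.)

import numpy as np
import matplotlib.pyplot as plt

# Placeholder paired observations/predictions in degrees C, just to make the
# snippet runnable; swap in the real flattened obs/pred arrays.
rng = np.random.default_rng(0)
obs = rng.normal(15, 8, 5000).clip(0, 32)
preds = obs + rng.normal(0, 2, obs.size)

fig, ax = plt.subplots()
hb = ax.hexbin(obs, preds, gridsize=60, bins="log", mincnt=1)
lims = [obs.min(), obs.max()]
ax.plot(lims, lims, "k-", lw=1)    # 1:1 line
ax.axvline(25, ls="--", lw=1)      # rough "summer" cutoff used in the quadrant plot
ax.axhline(25, ls="--", lw=1)
ax.set_xlabel("Observed temperature (deg C)")
ax.set_ylabel("Predicted temperature (deg C)")
fig.colorbar(hb, ax=ax, label="log10(count)")
plt.show()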
Something that I find really interesting, and @jdiaz4302 brought this up when he first posted this, is the shapes of the inputs vs the outputs for the temp-related inputs... especially the seg_tave_gw. They all have this unusual pattern where it's a little like the "quiet coyote" shape :) - It goes up kind of linearly at the bottom of the range of inputs, but then at the top it kind of splits, where some of the points keep going up and some level off and sometimes go down. I'm scratching my head. Why would the model learn that? Shouldn't increasing air temps (for example) always lead to higher water temps? There is the factor of the reservoirs, but, if I understand these sites correctly, only some of them are influenced by the reservoir. And why would they sometimes go up and sometimes go level and sometimes go down? The gw one is especially interesting because there is also this vertical line when the input is zero. And to me that just seems really weird, like there is some kind of mistake in the model. But again, I'm scratching my head... no ideas so far as to what it might be.
Yeah, I think the implicit assumption for a standard LSTM/RNN architecture is that values are evenly spaced/sampled in time. There are variants (e.g., Time-LSTM, Time-Aware LSTM) that explicitly require the time between values as an input and easily allow a discontinuous segment, but those are probably better suited for truly uneven time series rather than an actually even time series with big chunks missing - also, it's effort into new models, so a masking approach seems like the most applicable in these cases. @jsadler2 I think those plots are really cool for the same reasons 😄. It could only be possible because of some interacting effects. Less confidence in that …
While finding a storage place for this work and testing the storage place, I found that the output versus input plots were specific to segment 1566 (reservoir-impacted); this is the same segment as the reservoir impacted time series throughout this PR (not labelled, but obvious by the spiky summer behavior in those time series plots). Here are the corresponding output versus input plots for 1573 (the not-reservoir-impacted time series segment; I used 1566 and 1573 because they had tons of data). I think it's really interesting that these plots are a lot more straightforward - less of those "quiet coyote" shapes, as Jeff pointed out. Also, the relationship between prediction and PB output (last row) is a lot more monotonic but still noisy/spread (I believe we expect the PB model to be more reliable away from reservoirs), which could be motivation for further refining the PB model for reservoirs.

Here is the same plot for all segments. Overplotting doesn't really resolve even with decimal-point (e.g., 0.01) alpha and marker size. Generally the overall out vs in plots seem to more closely resemble the corresponding 1573 plot, probably because most segments aren't so directly impacted by reservoirs as 1566. There's definitely a lot more spread added when considering the whole data set though.

My plan is to close (and not merge) this PR by the end of the work day just to clean up, and it will still be present in the "closed" tab for reference. I've stored all the output directories generated by this experiment in the newly created pump project space that Alison announced, under …
Regarding #38

What happens here:

- […] config.yml […] config.yml between runs if needed)
- […] x variable arrays prior to training and rewrite that file (prepped.npz -> prepped2.npz)
- Once a run is completed, I copy the output directory to a separate location and rerun (possibly after changing the pretraining epochs); e.g., cp -r output_DRB_offsetTest/ no_pretrain_1_300/.
- After all the runs were done, I made some plots of the learning/training curves, validation set RMSE by month (adjusted to bin "months" by the 21st date of each month, which better aligns with defining summer by equinox dates; see the sketch after this list), time series, and I'm starting to look at the outputs vs input plots (will upload soon). I will add the notebooks to generate those plots shortly (need to clean them up).
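A minimal sketch of that 21st-of-month binning, under the assumption that it means each "month" runs from the 21st through the 20th of the next month (the data frame and column names here are made-up placeholders):

import numpy as np
import pandas as pd

# Placeholder paired validation observations/predictions.
rng = np.random.default_rng(0)
dates = pd.date_range("2010-06-01", periods=400, freq="D")
obs = rng.normal(15, 8, dates.size)
df = pd.DataFrame({"date": dates, "obs": obs, "pred": obs + rng.normal(0, 2, dates.size)})

# Shifting each date back 20 days before taking the calendar month makes the
# bin labeled 6 cover June 21 - July 20, roughly lining up with the solstice.
df["month_bin"] = (df["date"] - pd.Timedelta(days=20)).dt.month

rmse_by_month = (
    df.assign(sq_err=(df["obs"] - df["pred"]) ** 2)
      .groupby("month_bin")["sq_err"]
      .mean()
      .pow(0.5)
)
print(rmse_by_month)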
Right now this looks very favorable for using PB pretraining (specifically validation set RMSE by month), but maybe too favorable? It would be nice to get some more eyes to spot any mistakes or oversights. One thing that is definitely strange is the all-over-the-place behavior of PB input models during validation set summers (see last two plots).
Plots: