Yes, I totally agree with this, but I also don’t trust the vast majority of evaluation done at the moment, which looks at lots of different strata in the data in an ad hoc way, so my threshold for thinking we should use a model is much lower.
I need to engage with the MCS literature more, but on a first pass I assume it can also be represented in a model framework. That would be handy because, again, it means you have just a single set of tools to learn and develop best practices for.
Yup, this is the big problem, right? It depends on how you set them up and, as you say, it is one of the arguments @nikosbosse gave for why a transformed score is nice.
Again, something I wonder is how often we have similar problems when we reason about a model’s performance by, e.g., location and horizon using graphs etc. It would be interesting to try to unpick whether the model setup just makes a more common problem obvious.