I strongly agree with this concern. It really is the wild west at the moment. Not only do people draw too much from artifacts, but the lack of common practice also means people accidentally or intentionally go down different reporting paths to find what they want to find.
My view is that the scores can in some sense be thought of as data, and from there you can naturally apply lots of different statistical tools to them. Historically these have mostly been classic frequentist tests, but I don't see a reason you couldn't also fit more generative Bayesian models etc. The tricky bit, of course, is making sure that whatever you do doesn't destroy the propriety of the proper score.
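To make the "scores as data" idea concrete, here is a minimal sketch of one classic frequentist option: a paired sign-flip permutation test on per-target score differences between two models. The scores here are simulated stand-ins (any per-target proper score, e.g. CRPS, would slot in), and the model names are hypothetical.

```python
import random
import statistics

random.seed(1)

# Hypothetical per-target proper scores for two models (lower is better).
# In practice these would come from your evaluation pipeline.
scores_a = [random.gauss(1.0, 0.3) for _ in range(50)]
scores_b = [s + random.gauss(0.1, 0.2) for s in scores_a]  # model B a bit worse

# Treat the paired score differences as data.
diffs = [b - a for a, b in zip(scores_a, scores_b)]
observed = statistics.mean(diffs)

# Sign-flip permutation test: under the null of no difference,
# the sign of each paired difference is exchangeable.
n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = statistics.mean(d if random.random() < 0.5 else -d for d in diffs)
    if abs(perm) >= abs(observed):
        count += 1
p_value = (count + 1) / (n_perm + 1)

print(f"mean score difference: {observed:.3f}, permutation p: {p_value:.4f}")
```

Because the test only permutes signs of already-computed score differences, it leaves the scores themselves (and hence their propriety) untouched; the same paired-differences framing also extends naturally to a hierarchical Bayesian model over targets.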
Personally, though I am obviously biased, I do think modelling the scores is the way to go most of the time if you are getting serious about this. In the first instance, making sure to report the distribution of scores and so on seems sensible.
I really like the idea of having some living "best practice" guidance, especially for high-dimensional settings. I have had a few chats with @johannes about this and would love to get it off the ground. Something I am unclear about is how much of this needs to live in our domain and how much we can farm out to the stats folks.
Yes, this exists, but speaking for myself at least, I am uncomfortable depending on that as an approach, for reasons I find hard to quantify.
In terms of good practice, I tend to look back at what @nikosbosse was doing and what @johannes has done recently.