New package: `coerceDT`

@samabbott and I have previously discussed the need for slightly more generic ingest-and-check capability for receiving slimy user input and turning it into pristine data.tables for use in epinowcast.

I have enough demand in some of my other work to justify doing so. Currently, epinowcast has an internal function coerce_dt.

I propose a new one-thing-only package to provide a more generic coerceDT functionality. My draft contract for that function is roughly:

#' @param data Any of the types supported by []
#' @param copy A logical; if `TRUE` (default), a new `data.table` is returned;
#' if `FALSE`, `data` *may* be modified in place, but is not *guaranteed* to be
#' so. For example, selecting a subset of columns creates a new `data.table`
#' reference.
#' @param select Optional; if `NULL` (the default), all columns of `data` are
#' returned. If a `character` vector, the corresponding columns are returned. If
#' a `j`-like expression (see [data.table::data.table()]), the result of that
#' expression will be returned. _N.B._ for the `j`-like version: `required`
#' will be considered *before* selection occurs, and thus any coercion to
#' relevant classes will also happen pre-selection.
#' @param drop Optional; if `NULL`, ignored. If a character or integer vector,
#' the corresponding columns *if present* will be dropped. If *not* present,
#' `coerceDT` will warn about a request to drop columns that don't exist.
#' @param required Optional; if `NULL` (the default), there are no required
#' columns. If a `character` vector, `coerceDT` will produce an error indicating
#' which columns are not present. If a named `list`, the names will be required
#' columns and the `list` entries will be used to coerce the corresponding
#' columns. If those entries are themselves characters, those will be assumed to
#' be the class to coerce to via an `as.Class` method; otherwise, they should be
#' single argument functions that will be used to transform the column.
#' @param NAerror Logical; if column coercion results in any `NA`s, is that an
#' error (default: yes)?
#' @param forbidden Optional; if `NULL`, ignored. If a character vector,
#' `coerceDT` will error if any of those columns are present.
#' @return A `data.table`; the returned object will be a copy (default), unless
#' `copy = FALSE`, in which case modifications *may* be made in-place, though
#' are not *guaranteed* to have been so
#' @details This function provides a general-purpose tool for common, basic
#' checking and conversion tasks with `data.table`s. It's intended use is as
#' a simplifying, standardizing interface for raw input checking, not to perform
#' more complex requirement checks. It is not, e.g., able to answer if
#' one-and-only-one of some set of columns are present, or to coerce column
#' values to a new values based on anything other than their initial value.

Looking for feedback on 1) the value of such a function and 2) any tweaks to that contract to increase its value

1 Like

This functionality has been really helpful in epinowcast and I can think of a bunch of places I might use it going forward.

I also think we might want it as we spin up other tools (such as our work led by @sangwoopark on epidist).

I have made a summary poll here but do feel to comment any specific things as well:

1.) Value:

  • Valuable
  • Shrug
  • Not valuable

0 voters

2.) Should this be part of epinowcast (i.e. hosted on the GitHub org)

  • Yes
  • No

0 voters

Something that might be nice is making it optionally use the new rlang message/warning/error interface. Ideally, this would be non-default and as a suggest. Potentially, this could be made more seem less than needing to specify an arg (maybe using options?).

This would be useful immediately for epinowcast as we might move package messaging to be rlang based in the near future.

The other nice to have is the ability to assert required types for columns. This could be a bit fiddly and maybe not worth the effort of wrapping but is the part of input checking in epinowcast that is currently a little weak.

I’ve managed in the past to hook into package loading to sniff for the availability of a suggests, then modify function behavior accordingly.

Presumably, thats possible still, and possibly could figure it out anew. Lets call that 0.2.0 behavior.

Yes that could work. Potentially could just pass message, warning, stop as args so if you want to use different functions for these you just make a light wrapper. Agree it’s 2.0 functionality l.

Working on some of the gory details here:

I’m a bit unclear on the use case for actually forbidding columns. By “forbidding” I mean “throwing an error if present” - what is the advantage of that behavior over “dropping with a warning”?

Second, order of operations:

Do we want drop => forbid => require => select or some other order?

Third (and related to previous), what do we want to support in terms of filtering / transforming?

It’s related because it seems we might want to express e.g. forbid in terms of values (not just presence of a column). I’m not sure I like that idea, because it makes the interface forbid-these-columns OR actually-require-these-columns-to-test-for-forbidden-rows.

That said, filtering / checking for bad rows does strike me as a basic-and-performance-entangled operation. As in, if getting a filtered slice of data, it isn’t necessary to make a copy - though gets somewhat complicated, e.g.

dt <- data.table(x=1:10)
address(dt) != address(dt[x>0]) # TRUE, even though the "view" is everything
dt2 <- dt[x>0][, y := "B" ] # makes a copy
dt[x>0, y := "A"] # *doesn't* make a copy, however!

so if just filtering, can avoid the auto-copy. if also modifying, need to check for copy-vs-not behavior?

Relative to forbidding certain columns vs values in the columns, I think I’m on board with the abstraction forbidden = list("colA", "colB" = \(b) ...), and the idea that the default test function just yields TRUE – i.e. all values of the column A are forbidden, therefore the only way to pass is for A to not be present. Then by implication, column B can either not be present, or if it is, all it’s values must yield FALSE from the test function.

This doesn’t quite work for the drop context however - because the idea would be to drop either a whole column OR rows where that column evaluates to TRUE. But the default-to-TRUE conception would then correspond to dropping all the rows.

I think it is unclear which you might want here though you could make an argument a warning with dropping is preferred to an error in most settings.

Would we not want select → drop → forbid? In what setting would it make sense to have dropped or forbidden cols do anything when you are selecting for a different subset in the first place?

Agree I “think” this is nice.

This doesn’t quite work for the drop context however - because the idea would be to drop either a whole column OR rows where that column evaluates to TRUE . But the default-to-TRUE conception would then correspond to dropping all the rows.

I feel like this is happening because it is trying to stretch the naming approaches a bit far. Perhaps instead of overloading drop and forbidden should instead think about an additional filter option?

As an aside, in our prototype (i.e. in epinowcast) this is a single function. Given we are spinning this out it might be nice to modularise each component (as internal functions) with a single coerceDT() exposed to the users?

Definitely modalurized - the way I’m implementing it now, there are argument checks at the boundary, then deferring to specific internal functions depending on what’s coming in.

Some implementation notes from chat between @pearsonca and @samabbott

  • going with the S3 approach - simplifies dispatch to data.table vs character vs other inputs
  • for the more generic use case, i think it makes sense to include a “from file” option (that’s the character argument) - basically uses fread (though will also handle an rds) - this gets managed via .character then passed on to other methods.
  • my problem remains getting a very clear, consistent abstraction for the “verbs”. some rough ideas there, just need to have a clear think about it.

Verb proposals (and order in which they act):

  • required: Vectorised transformation (i.e. checking column class or try and convert). The typical use case is casting (i.e as.numeric(), as.integer(), etc.). Failure state is to depend on the casting. In the top level will support column names to check for existence.
  • forbidden: Nearly the same as required. Need both as can use together. In addition to casting can also check.
  • select: Select or drop variables (negative select). Also supports renaming. Syntax: c("Sam" = "the best", -"Carl")
  • rename: Rename variables. Need to check that any variable requested to be renamed is an error. To be dropped once select is fully featured.

The plan is to modularise the work plan into issues so that multiple people can work on at once. @pearsonca is the current code owner.

@samabbott to write a user integration test of the new syntax based on the current implementation in epinowcast.

This has got a bit stuck due to lack of developer time but is really very close if anyone has any spare cycles and thinks this is interesting/useful