Designing Machine Learning Toolboxes: Concepts, Principles and Patterns

At the Epiforecasts meeting today @sbfnk pointed out this talk (https://www.youtube.com/watch?v=BQg-aG8J2DU), which discusses criteria for good modelling. It's very interesting and worth checking out.

I did a little digging and found the following paper by the presenter (which shares the title of this post), which I think contains some interesting ideas for thinking about tool design.

Paper: https://arxiv.org/pdf/2101.04938.pdf

LLM summary:

Introduction

  • ML toolboxes like scikit-learn are central to data science, but their key design principles have not been analysed in the literature. This paper attempts to explain and guide ML toolbox design through a conceptual model and design patterns.

Conceptual Model

  • Key abstraction points identified: data, learning algorithms, tasks, workflows. A type system called "scientific typing" is proposed to capture properties of ML objects based on their operations and statistics.

  • Mathematical objects are value objects, learning algorithms are entity objects with state changes. Interface cases identified for mathematical objects.

  • "Scientific types" (scitypes) are introduced, combining structured types with compatibility constraints between parameters and methods. Scitypes define interfaces and properties for ML objects.

  • Higher-order scitypes proposed to describe composite algorithms like pipelines, defined through component scitypes and resultant scitype.
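
To make the scitype idea a bit more concrete, here is a minimal Python sketch of my own (the names `Forecaster`, `MeanForecaster` and `_is_fitted` are illustrative, not the paper's notation): the scitype is modelled as an abstract base class that fixes the interface contract, and learning algorithms are entity objects whose state changes on `fit`.

```python
from abc import ABC, abstractmethod

class Forecaster(ABC):
    """Hypothetical 'forecaster' scitype: fixes the interface and
    state semantics that every concrete strategy must satisfy."""

    def __init__(self):
        self._is_fitted = False  # entity object: fitting mutates state

    def fit(self, y):
        self._fit(y)
        self._is_fitted = True
        return self  # returning self allows method chaining

    def predict(self, horizon):
        if not self._is_fitted:
            raise RuntimeError("call fit before predict")
        return self._predict(horizon)

    @abstractmethod
    def _fit(self, y): ...

    @abstractmethod
    def _predict(self, horizon): ...

class MeanForecaster(Forecaster):
    """Trivial concrete strategy: forecasts the training mean."""

    def _fit(self, y):
        self._mean = sum(y) / len(y)

    def _predict(self, horizon):
        return [self._mean] * horizon

f = MeanForecaster().fit([1.0, 2.0, 3.0])
print(f.predict(2))  # → [2.0, 2.0]
```

The point is that the base class encodes the scitype's contract (you cannot `predict` before `fit`), while concrete strategies stay interchangeable.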

Design Principles

  • Separate key conceptual layers: data, algorithms, tasks, workflows. Avoid conflating concerns.

  • Encapsulate objects by scitype. Interface should reflect conceptual model. Treat higher-order scitypes similarly.

  • Declarative syntax following conceptual model. Specification and usage should mirror scitype formalism.

Design Patterns

  • Universal interfaces for all ML objects: construction, inspection, persistence.

  • Scitype templates define interfaces and inherit base functionality. Strategy pattern ensures interchangeable implementations.

  • Composition patterns: modification, homogeneous and inhomogeneous composition. Parameters set at construction.

  • Contraction factories: bulk conversion of composites into simpler objects.

  • Co-strategies encapsulate recurring motifs like ML task specification.
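
A rough sketch of the composition idea, in my own words rather than the paper's (the `Pipeline`, `Center` and `Slope` names are invented for illustration): the composite is parameterised entirely at construction, and itself satisfies the same fit/predict contract as its components.

```python
class Pipeline:
    """Toy composition pattern: components are fixed at construction,
    and the composite exposes the same fit/predict interface."""

    def __init__(self, steps):
        self.steps = steps  # parameters set at construction time

    def fit(self, X, y):
        for transform in self.steps[:-1]:
            X = transform.fit_transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for transform in self.steps[:-1]:
            X = transform.transform(X)
        return self.steps[-1].predict(X)

class Center:
    """Toy transformer: subtracts the training mean."""
    def fit_transform(self, X):
        self.mean = sum(X) / len(X)
        return [x - self.mean for x in X]
    def transform(self, X):
        return [x - self.mean for x in X]

class Slope:
    """Toy regressor: y = a * x, fitted by least squares through the origin."""
    def fit(self, X, y):
        self.a = sum(xi * yi for xi, yi in zip(X, y)) / sum(xi * xi for xi in X)
        return self
    def predict(self, X):
        return [self.a * x for x in X]

pipe = Pipeline([Center(), Slope()]).fit([1, 2, 3], [0, 1, 2])
print(pipe.predict([4]))  # → [2.0]
```

Because the composite has the same interface as its parts, pipelines can be nested inside other pipelines, which is what makes the strategy pattern compose.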

Examples

  • Patterns explain and can re-derive scikit-learn's fit/predict interface, pipelines, forecasting APIs.

  • Guide new designs like time series forecasting, encapsulating reductions.
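
As a sketch of what "encapsulating reductions" can mean (my own illustration, not code from the paper): forecasting can be reduced to tabular regression by turning sliding windows of the series into feature rows, with the next value as the target.

```python
def sliding_windows(y, window):
    """Reduce a forecasting problem to tabular regression:
    each length-`window` slice of the series becomes a feature row,
    and the value that follows it becomes the regression target."""
    X, targets = [], []
    for i in range(len(y) - window):
        X.append(y[i:i + window])
        targets.append(y[i + window])
    return X, targets

X, t = sliding_windows([1, 2, 3, 4, 5], window=2)
print(X)  # → [[1, 2], [2, 3], [3, 4]]
print(t)  # → [3, 4, 5]
```

A reduction-based forecaster would wrap this transformation plus any tabular regressor behind the forecaster interface, so users never see the plumbing.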

Conclusion

  • Provides a well-grounded reference for future ML software design, connecting formal mathematics, statistics and software engineering. Potentially enables higher-level declarative ML languages.

The talk

I found this slide very interesting as a way of thinking about what not to do (and slightly sad, as so many of these things are common in IDE modelling).

[Screenshot of the slide from the talk]

> Also, we are better because the models are fancy

This is paraphrased, but it really resonates.

Finally, some levels of evidence are set out, with examples of how ML is not meeting them, and the question is asked: what do we say if asked whether a method is good for a real-world application or not?

Thinking about this from an IDE context (and especially in real time), I think all of these points apply just as strongly.
