Designing Machine Learning Toolboxes: Concepts, Principles and Patterns

At the Epiforecasts meeting today @sbfnk pointed out this talk (https://www.youtube.com/watch?v=BQg-aG8J2DU), which discusses criteria for good modelling. It's very interesting and worth checking out.

I did a little digging and found the following paper by the presenter (which shares the title of this post), which I think contains some interesting ideas for thinking about tool design.

Paper: https://arxiv.org/pdf/2101.04938.pdf

LLM summary:

Introduction

  • ML toolboxes like scikit-learn are central to data science, but their key design principles have not been analysed in the literature. This paper attempts to explain and guide ML toolbox design through a conceptual model and design patterns.

Conceptual Model

  • Key abstraction points identified: data, learning algorithms, tasks, workflows. A type system called "scientific typing" is proposed to capture properties of ML objects based on their operations and statistics.

  • Mathematical objects are value objects, learning algorithms are entity objects with state changes. Interface cases identified for mathematical objects.

  • "Scientific types" (scitypes) are introduced, combining structured types with compatibility constraints between parameters and methods. Scitypes define interfaces and properties for ML objects.

  • Higher-order scitypes proposed to describe composite algorithms like pipelines, defined through component scitypes and resultant scitype.
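
To make the scitype idea a bit more concrete, here is a minimal Python sketch of my own (the names `Forecaster`, `MeanForecaster` and `_is_fitted` are illustrative, not the paper's notation): the scitype is modelled as an abstract base class that fixes the interface contract, and learning algorithms are entity objects whose state changes on `fit`.

```python
from abc import ABC, abstractmethod

class Forecaster(ABC):
    """Hypothetical 'forecaster' scitype: fixes the interface and
    state semantics that every concrete strategy must satisfy."""

    def __init__(self):
        self._is_fitted = False  # entity object: fitting mutates state

    def fit(self, y):
        self._fit(y)
        self._is_fitted = True
        return self  # returning self allows method chaining

    def predict(self, horizon):
        if not self._is_fitted:
            raise RuntimeError("call fit before predict")
        return self._predict(horizon)

    @abstractmethod
    def _fit(self, y): ...

    @abstractmethod
    def _predict(self, horizon): ...

class MeanForecaster(Forecaster):
    """Trivial concrete strategy: forecasts the training mean."""

    def _fit(self, y):
        self._mean = sum(y) / len(y)

    def _predict(self, horizon):
        return [self._mean] * horizon

f = MeanForecaster().fit([1.0, 2.0, 3.0])
print(f.predict(2))  # → [2.0, 2.0]
```

The point is that the base class encodes the scitype's contract (you cannot `predict` before `fit`), while concrete strategies stay interchangeable.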

Design Principles

  • Separate key conceptual layers: data, algorithms, tasks, workflows. Avoid conflating concerns.

  • Encapsulate objects by scitype. Interface should reflect conceptual model. Treat higher-order scitypes similarly.

  • Declarative syntax following conceptual model. Specification and usage should mirror scitype formalism.

Design Patterns

  • Universal interfaces for all ML objects: construction, inspection, persistence.

  • Scitype templates define interfaces and inherit base functionality. Strategy pattern ensures interchangeable implementations.

  • Composition patterns: modification, homogeneous and inhomogeneous composition. Parameters set at construction.

  • Contraction factories: bulk conversion of composites into simpler objects.

  • Co-strategies encapsulate recurring motifs like ML task specification.
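
A rough sketch of the composition idea, in my own words rather than the paper's (the `Pipeline`, `Center` and `Slope` names are invented for illustration): the composite is parameterised entirely at construction, and itself satisfies the same fit/predict contract as its components.

```python
class Pipeline:
    """Toy composition pattern: components are fixed at construction,
    and the composite exposes the same fit/predict interface."""

    def __init__(self, steps):
        self.steps = steps  # parameters set at construction time

    def fit(self, X, y):
        for transform in self.steps[:-1]:
            X = transform.fit_transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for transform in self.steps[:-1]:
            X = transform.transform(X)
        return self.steps[-1].predict(X)

class Center:
    """Toy transformer: subtracts the training mean."""
    def fit_transform(self, X):
        self.mean = sum(X) / len(X)
        return [x - self.mean for x in X]
    def transform(self, X):
        return [x - self.mean for x in X]

class Slope:
    """Toy regressor: y = a * x, fitted by least squares through the origin."""
    def fit(self, X, y):
        self.a = sum(xi * yi for xi, yi in zip(X, y)) / sum(xi * xi for xi in X)
        return self
    def predict(self, X):
        return [self.a * x for x in X]

pipe = Pipeline([Center(), Slope()]).fit([1, 2, 3], [0, 1, 2])
print(pipe.predict([4]))  # → [2.0]
```

Because the composite has the same interface as its parts, pipelines can be nested inside other pipelines, which is what makes the strategy pattern compose.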

Examples

  • Patterns explain and can re-derive scikit-learn's fit/predict interface, pipelines, forecasting APIs.

  • Guide new designs like time series forecasting, encapsulating reductions.
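
As a sketch of what "encapsulating reductions" can mean (my own illustration, not code from the paper): forecasting can be reduced to tabular regression by turning sliding windows of the series into feature rows, with the next value as the target.

```python
def sliding_windows(y, window):
    """Reduce a forecasting problem to tabular regression:
    each length-`window` slice of the series becomes a feature row,
    and the value that follows it becomes the regression target."""
    X, targets = [], []
    for i in range(len(y) - window):
        X.append(y[i:i + window])
        targets.append(y[i + window])
    return X, targets

X, t = sliding_windows([1, 2, 3, 4, 5], window=2)
print(X)  # → [[1, 2], [2, 3], [3, 4]]
print(t)  # → [3, 4, 5]
```

A reduction-based forecaster would wrap this transformation plus any tabular regressor behind the forecaster interface, so users never see the plumbing.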

Conclusion

  • Provides a well-grounded reference for future ML software design, connecting formal mathematics, statistics and software engineering. Potentially enables higher-level declarative ML languages.

The talk

I found this slide very interesting as a way of thinking about what not to do (and slightly sad, as so many of these things are common in IDE modelling).

[Screenshot of the slide from the talk]

> Also, we are better because the models are fancy

This is paraphrased, but it really resonates.

Finally, some levels of evidence are set out, with examples of how ML is not meeting them, and the question is asked: what do we say if asked whether a method is good for a real-world application or not?

Thinking about this from an IDE context (and especially in real time), I think all of these points apply just as strongly.
