Preface

Published

2024-10-02

Science is built up with facts, as a house is with stones. But a collection of facts is no more a science than a heap of stones is a house. (H. Poincaré)

Mechanics and engineers

What is the difference between a car mechanic and an automotive engineer?

Both have knowledge about cars, but their knowledge domains are different and focus on different goals.

A car mechanic can keep your car in top-notch condition; can do different kinds of easy and difficult repairs if problems arise with it; knows whether a particular brand of valve can be used as a replacement for another brand; can recommend the optimal kind of tyres to use in a given season for different brands of cars. But a car mechanic would face difficulties in calculating the theoretical maximal efficiency of an engine; or predicting the temperature increase caused by a new kind of fuel; or exploiting the phase transition of a new kind of foam to design a safer airbag system; or calculating the optimal surface curvature for a spoiler. A car mechanic typically possesses a large amount of case-specific knowledge, and doesn’t need to know in depth the principles of electromechanics and thermochemistry, or the laws of balance of momentum, energy, entropy.

Vice versa, an automotive engineer can assess how to use the electromechanical properties of a new material in order to design a more efficient and environmentally friendly engine; can calculate how a new material-surface handling would affect air drag and speed; and ultimately can research how to exploit new physical phenomena to build completely new means of transportation. Yet, an automotive engineer could be completely incapable of changing a pipe in your car, or tell you whether it can use a particular brand of lubricant oil. An automotive engineer typically possesses knowledge about the principles of electromagnetism, mechanics, or thermochemistry; is acquainted with relevant physical laws; and doesn’t need to have in-depth case-specific kinds of knowledge.

Note that the differences just sketched do not imply a judgement of value. Both professions, kinds of knowledge, and goals are necessary, interesting, and couldn’t exist without each other. Choice between them is a subjective matter of personal passions and aspirations.

In fact there isn’t a clear divide between these two kinds of knowledge, but rather a continuum between two vague extremities. A car mechanic can have knowledge and insight about new technologies, and an automotive engineer can know how to fix a carburettor. The two sketches above are meant to expose and emphasize the existence of such a continuum of knowledge and of goals.

Data mechanics and data engineers

A continuum with two similar extremities can also be drawn in data science.

Some data scientists have in-depth knowledge on, for instance, how to optimally store and read large amounts data; what kind of machine-learning algorithm to use in a given task; how to fine-tune an algorithm’s parameters, and the currently best software for this purpose. Their particular knowledge is fundamental for the working of today’s technological infrastructure.

At the same time, these data scientists typically face difficulties, for instance, in:

calculating the theoretical maximal accuracy or performance achievable – by any possible algorithm – in a given inference problem
explaining how the fundamental rules of inference and decision-making are implemented in a particular machine-learning algorithm
identifying which sub-optimal approximations to the fundamental rules are made by popular machine-learning algorithms
exploiting new technologies to build new algorithms that do calculations closer to the exact theoretical ones, thereby achieving a performance closer to the theoretical optimum

And it is also possible that they are not aware of, and maybe would be surprised by, some basic facts of data science. For instance:

there is an optimal, universal inference & decision algorithm, of which all machine-learning algorithms (from support vector machines and deep networks to random forests and large language models), are an approximation
there are only five or six fundamental laws upon which any inference, prediction, classification, regression, decision task is (or ought to be) based upon
splittings of data into “training set”, “validation set”, and similar sets, are not part of the exact application of the laws of inference and decision-making; such splittings arise as coarse approximations of the exact method.
cross-validation and related techniques are not part of the exact method either; they also arise as approximations
overfitting, underfitting and related notions are not problems that appear in the exact method (which takes care of them automatically); they also arise from approximations
it is possible to calculate, within probable bounds, the maximal accuracy (or other performance metric) achievable by any classification or regression algorithm for a given application
some evaluation metrics, such as precision or the area under the curve of the receiver operating characteristic (AUC), have intrinsic flaws and may attribute higher values to worse-performing algorithms

…because this is a kind of general and principled knowledge that these data scientists don’t need in their jobs. Their knowledge is more case-specific.

Drawing a parallel with the car example, a data scientist with this kind of case-specific knowledge is like a “data mechanic”.

A “data engineer”, on the other hand, is the kind of data scientist who has no difficulties with the knowledge and skills implicit in the bullet points above; but at the same time might not know what software to use for tuning parameters of a particular class of deep networks, or the best format to store particular kinds of data.

Just like in the case of the automotive industry, the difference just sketched does not imply any judgement of value. Both kinds of knowledge and goals are important and can’t exist without each other.

Goals of this course

There is a plethora of academic courses, in all kinds of format, that target knowledge and goals for the “data mechanic”. Those courses are usually inadequate to cover the knowledge and goals for the “data engineer”. Some courses, misleadingly, even present approximations and recipes that are only valid for particular situations as if they were universal rules or methods instead.

Courses that target the “data engineer” seem to be more rare. One possible reason is that this kind of knowledge is actually hidden in courses on probability, statistics, and risk analysis, presented with a language which makes only opaque and confusing connections with fields in data science and their goals; or, worse, with a language which emphasizes connections that are actually superficial and misleading.

We believe that it is important to teach and keep alive the less “mechanic” and more “engineer” side of data science:

Continuous advances in computational technology – think of quantum computers – will offer completely novel and superior ways to approximate the exact method of inference and decision. Only the data scientist who knows the exact method and theory, and understands how present-day algorithms approximate it, will be able to exploit new technologies.
Even without looking at the future, several present-day machine-learning algorithms could already be greatly optimized by any data engineer who is acquainted with the basic principles underlying data science.
The foundations of data science are the bridge to the sibling discipline of Artificial Intelligence.

The present course aspires to give an introduction to the “data engineer” side, rather than “data mechanic” one, of data science, but using a point of view more familiar to data scientists than to, say, statisticians.

More details about its aims, structure, and features are already given in the Dear student introduction.