One of the greatest hits by the Boss (Bruce Springsteen) is “Blinded by the light”. The lyrics are rather poetic, so I asked ChatGPT to analyze them for me. According to the LLM (Large Language Model), the song's narrative isn't linear or straightforward, which is characteristic of many of Springsteen's early compositions.
The song's chorus, "Blinded by the light," is a metaphor that can be interpreted in several ways. It might signify being overwhelmed or rendered directionless by a sudden clarity or realization, or it could represent the disorientation one experiences through sudden changes or overwhelming circumstances. For many, it's a symbol of the confusion and intensity of growing up, with all its exciting and confusing experiences.
So how does this relate to the exciting world of data science? Early this year, we were blown away by ChatGPT, built upon the GPT-3 model family by OpenAI, which provides an unprecedented conversational experience with a machine. Under the hood, LLMs (such as GPT-3) are gigantic deep neural networks (DNNs) with billions of parameters (175 billion in the case of GPT-3).
DNNs have become increasingly popular in parallel with the decreasing costs of cloud computing, and today many see them as the hammer for most data-associated nails (aka business problems). While the superb performance and flexibility of DNNs might make them sound like the go-to solution for most use cases data scientists encounter, I argue that we really should avoid being blinded by the hype.
In the following, I’ll provide a few arguments to back this up, both broad scale (environmental) and narrow scale (everyday DS work).
The costs, they are substantial
Let’s start with the broad-scale implications of increasing DNN application. The wind is blowing from the general direction of massive data centers, which account for about 3% of global greenhouse gas emissions. While this figure seems low, it is more than what is produced by, e.g., aviation, or by rice cultivation and food processing together.
Moreover, the energy consumption of data centers is rising. According to a study report by the European Commission, the energy consumption of data centers is expected to grow by 28% between 2018 and 2030. There is no reason to expect that the same pattern would not apply to the rest of the world. Some of the increasing consumption is attributed to data storage, but the vast majority is used to operate processors.
LLMs owe their capabilities to the enormous amount of data used to train the models. To process and analyze such vast amounts of data, tens of thousands of high-performance GPUs are needed. This goes especially for model training, but also for making predictions. As an example, the training of GPT-4 took over 90 days, using more than 20 000 GPUs. Running a single copy of this model for prediction takes a cluster of over 120 GPUs.
The energy consumption of a model, both at training and at inference, is roughly proportional to the number of operations performed, which is in turn proportional to the number of parameters in the model. Thus, simply put, energy can be saved by avoiding unnecessary calls to large models, or by using the least complicated model appropriate for the task. An analogy for the latter would be to use a calculator to perform a calculation instead of asking ChatGPT.
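The proportionality can be made concrete with a back-of-the-envelope sketch. It assumes a rule of thumb commonly used for dense transformer models: roughly 6 floating-point operations per parameter per training token, and roughly 2 per parameter per processed token at inference. The constants, the function names, and the small model size are illustrative assumptions, not figures from the sources mentioned above.

```python
# Rule-of-thumb constants for dense transformer models (assumptions,
# not measurements): operations per parameter per token.
TRAIN_FLOPS_PER_PARAM_PER_TOKEN = 6.0
INFER_FLOPS_PER_PARAM_PER_TOKEN = 2.0


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate number of operations needed to train a model."""
    return TRAIN_FLOPS_PER_PARAM_PER_TOKEN * n_params * n_tokens


def inference_flops(n_params: float, n_tokens: float) -> float:
    """Approximate number of operations needed to process n_tokens."""
    return INFER_FLOPS_PER_PARAM_PER_TOKEN * n_params * n_tokens


def relative_cost(large_params: float, small_params: float) -> float:
    """How many times more operations the larger model needs per token."""
    return inference_flops(large_params, 1.0) / inference_flops(small_params, 1.0)


# Illustrative sizes: a GPT-3-sized model versus a hypothetical small model
# with one million parameters.
large_model = 175e9
small_model = 1e6

print(f"Operations per token, large model: {inference_flops(large_model, 1):.2e}")
print(f"Operations per token, small model: {inference_flops(small_model, 1):.2e}")
print(f"Relative cost: {relative_cost(large_model, small_model):,.0f}x")
```

Because the estimate is linear in the number of parameters, the ratio between two models does not depend on the constants chosen. Under these assumptions, only the relative model size matters for the comparison.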
As a data scientist and IT consultant I might not be the right person to preach about the carbon footprint of AI and digital automation. I admit that I could be much more conscious about the environmental implications of my actions at work. Still, I strongly believe that we all need to do the same: be more conscious and responsible. Of course, it matters a ton how the energy used to power data centers is produced, just as it matters how the food ending up on our plates is produced. As a consumer, it is not always easy or even possible to know what the environmental impact is.
High cost of thinkin’, data’s got us on the brink. Machines dreamin’, but it’s costin’ more than we think. Oh, Mother Earth’s heart is heavy with sorrow. Are we tradin’ her today for a smoky tomorrow? ChatGPT.
Ground control to over-the-moon ideas
Clearly, when one needs to analyze unstructured data, namely images or text, there is really no question that DNNs are the best option for the job. But why is this? Machine learning is not magic; it is all about the information in the data.
By information I mean how strongly the features of the data are associated with the phenomenon/variable of interest. For example, the amount of light outside provides a reliable indication of the time of day, whereas measuring the temperature of the contents of a cup does not tell you whether there is coffee, tea, or hot chocolate in there.
If there is little information in the data, or if there are no clear associations between features and the variable of interest, it matters very little what algorithm is chosen. Linear models can be as good as neural networks. One good example here is prediction of customer behavior, such as churn, based on customer characteristics. This is because the reasons to quit a service are most likely not that well reflected by the data the service provider has on the customers. Collecting better data is either difficult, not possible, or even prohibited by legislation.
NNs are so good with unstructured data because of their ability to automatically create and select useful features from the data. Convolutional networks and DNNs are also able to select features which are invariant to some types of transformations, like rotations, translations, and scaling, while attention mechanisms in sequential models are good at learning the contextual associations between pieces of information. In addition, neural networks can learn complex patterns in the data, which can be especially useful when detecting patterns in images, or the semantic meaning of text and the relationships between different words.
However, Kaggle competitions are typically won not with NNs but with some variant of XGBoost, a tree-based ensemble method. This is because NNs are not optimal for tabular data, which most business problems are essentially about. A key question is: what makes tree-based models work better than deep learning on typical tabular data?
This question was addressed in a research paper that was published last year. According to the authors (Grinsztajn, Oyallon, and Varoquaux), one reason is that neural networks struggle to fit non-smooth, irregular functions as compared to tree-based models (which learn piece-wise constant functions). Another reason is that tree-based methods are less sensitive to uninformative features in the data. The third explanation provided in the paper is that NNs are invariant to rotation (which means that rotating the data does not change the model performance), while the information in tabular data is not: each column carries its own meaning, so a rotation that mixes the columns blurs that information.
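The first point, about piece-wise constant functions, can be illustrated with a small sketch. It is not the experiment from the paper: the data is a made-up step function, and a straight line stands in for a smooth model. The helper names (`fit_line`, `fit_stump`, `mse`) are hypothetical. The sketch fits a least-squares line and a one-split regression tree (a "stump") to the same data and compares their errors.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the split with the lowest squared error
    and predict the mean of the targets on each side of it."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((y - mean_left) ** 2 for y in left)
               + sum((y - mean_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, mean_left, mean_right)
    _, threshold, mean_left, mean_right = best
    return lambda x: mean_left if x < threshold else mean_right


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up, non-smooth target: the value jumps from 0 to 1 at x = 1.0.
xs = [i / 10 for i in range(20)]
ys = [0.0 if x < 1.0 else 1.0 for x in xs]

line = fit_line(xs, ys)
stump = fit_stump(xs, ys)

print(f"MSE of the straight line: {mse(line, xs, ys):.4f}")
print(f"MSE of the stump:         {mse(stump, xs, ys):.4f}")
```

The stump recovers the jump exactly with a single split, while the line has to compromise across the whole range. Real tabular targets are of course messier than this toy example.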
Keep it simple, keep it swift
Most business use cases fall within the scope of customer analytics, demand forecasting, recommendations, or predictive maintenance. For these problems, NN-based solutions can be used, but they are typically neither cost-effective (they require a lot of tuning and computational capacity) nor significantly better than simpler alternatives.
I’ll give a concrete example from a few years back. In 2018, the Finnish Centre for Pensions (Eläketurvakeskus) predicted the probability of starting on disability pension within two years’ time, using a multi-layer perceptron (MLP) model, i.e., a neural network. The reported AUC (the probability that a randomly chosen positive case has a higher predicted probability than a randomly chosen negative case) was 78%, while the AUC for a simple logistic regression model was 77%. In 2018 the AI hype was booming, so it is kind of understandable that the NN results were the ones reported in the press. Still, it is arguable that the transparent linear model was probably more useful for understanding the underlying reasons and possibly mitigating the risk of disability pension.
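The definition of AUC given in parentheses above translates directly into code. The following is a minimal sketch that counts, over all positive/negative pairs, how often the positive case gets the higher score (ties count as half). The function name `pairwise_auc` and the tiny data set are made up for illustration and have nothing to do with the pension data.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs in which the positive
    case receives the higher score; ties count as half a win."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = positive case) and predicted probabilities.
y_true = [1, 0, 1, 0, 0]
scores = [0.90, 0.40, 0.35, 0.20, 0.80]

print(f"AUC: {pairwise_auc(y_true, scores):.3f}")
```

Here four of the six positive/negative pairs are ranked correctly, so the AUC is 4/6. The pairwise loop is quadratic in the number of cases, which is fine for illustration; for real data one would typically rely on a library implementation.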
Deep learning models have their strengths. In addition to the above-mentioned, they also tend to extrapolate better beyond the training data than tree-based models, whose predictions stay constant outside the range of values seen in training. To overcome this drawback, linear trees have been developed (applying linear models in the leaves instead of simple constant approximations), which admittedly comes with increased memory requirements.
In general, it is good practice to apply the least complicated solution to a given problem. This saves development time, reduces direct and indirect computational costs, and makes the solution easier to explain. It takes experience and self-confidence to make smart choices rather than to go with the hype.
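The difference between constant and linear leaves can be shown with a toy sketch. It assumes a "tree" with a single, hand-picked split point rather than a learned one, and made-up data following y = 2x. All function names are hypothetical, and this is not how any particular library implements linear trees.

```python
def fit_leaf_constant(xs, ys):
    """Classic tree leaf: predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_leaf_linear(xs, ys):
    """Linear-tree leaf: fit a least-squares line inside the leaf."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_two_leaf_tree(xs, ys, threshold, fit_leaf):
    """A tree with one fixed split (an assumption made for brevity; a real
    learner would search for the split) and one fitted model per leaf."""
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    left_model = fit_leaf([x for x, _ in left], [y for _, y in left])
    right_model = fit_leaf([x for x, _ in right], [y for _, y in right])
    return lambda x: left_model(x) if x < threshold else right_model(x)


# Made-up training data: y = 2x observed only for x = 0, 1, ..., 9.
xs = [float(i) for i in range(10)]
ys = [2.0 * x for x in xs]

constant_tree = fit_two_leaf_tree(xs, ys, 5.0, fit_leaf_constant)
linear_tree = fit_two_leaf_tree(xs, ys, 5.0, fit_leaf_linear)

# Predict far outside the training range; the true value would be 40.
print(f"Constant leaves at x = 20: {constant_tree(20.0):.1f}")
print(f"Linear leaves at x = 20:   {linear_tree(20.0):.1f}")
```

With constant leaves, every input beyond the last split gets the same prediction, the mean of the rightmost leaf, while the linear leaf follows the trend. The extra cost is that each leaf now stores the coefficients of a model instead of a single number.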