In this blog post, I will go through my first impressions and findings on Snowflake’s recent step into the ML and Data Science space with Python-supported Snowpark.
For the less patient, here is the shortcut to the conclusion: Snowpark shows great promise for Data Science and ML. However, there are still features missing, especially on the MLOps side of things. The idea of bringing the tools to where the data lies is exciting, and I’m sure that Snowflake will continue to develop the platform rapidly to make it viable for more complex use cases.
Something old, something new
Experienced practitioners might recall that it is (still) possible to execute R or Python code in SQL Server, provided by SQL Server Machine Learning Services. This was a powerful approach for executing computationally demanding tasks when cloud computing was less of an option than it is today.
One can still use Machine Learning Services, and it is also available for Azure SQL. The beauty of this approach is that instead of extracting data from its source, the computation is taken to where the data resides. From a developer perspective, however, this is suboptimal: version control of code on SQL Server is not easy to achieve (though still doable), and debugging of non-SQL code needs to be done elsewhere.
Today, many organizations, while still using relational databases, store the bulk of their data in cloud data lakes. Databricks has become a strong player in the field of data engineering and advanced analytics by providing easy access to scalable Spark clusters for massive data processing and multi-language notebook support for flexible development.
In addition, by making Delta Lake easily accessible, Databricks also handles data versioning, which is an important aspect of modern MLOps philosophy. Still, the data needs to be transported between the source and the compute, which can create considerable latencies as well as costs. For lack of a better analogy, this is like taking the workshop to the hammer and not vice versa.
A couple of years ago, Snowflake introduced Snowpark, a solution allowing the execution of non-SQL code. Recently, Snowflake announced that Python support has been added to Snowpark, making it a potentially attractive tool for data science purposes. Conveniently, working with Snowpark DataFrames resembles working with PySpark DataFrames: the syntax is pretty much the same and both are evaluated lazily. However, not all PySpark functionality is (yet) available in Snowpark, which might create some problems.
For a data scientist like myself, who is used to working in Python, accessing data in Snowflake via Snowpark is remarkably simple. One needs only to establish a connection to a database and make a reference to a given table in that database (connection parameters here stored in a .toml file):
<code>
from snowflake.snowpark.session import Session
import toml

# Load the connection parameters (account, user, password, role, warehouse, ...)
config = toml.load("<path to config.toml>")

# Open a session and create a lazy reference to a table in the database
session = Session.builder.configs(config).create()
df = session.table("NAME_OF_OBJECT")
</code>
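The resulting DataFrame feels very much like a PySpark DataFrame. As a minimal sketch (with made-up column names), a few lazy transformations could look like this:
<code>
from snowflake.snowpark.functions import col, avg

# Nothing is computed yet: the transformations are only recorded
summary = (
    df.filter(col("AMOUNT") > 100)
      .group_by("CUSTOMER_ID")
      .agg(avg("AMOUNT").alias("AVG_AMOUNT"))
)

# Execution in the warehouse happens only when results are requested
summary.show()
pdf = summary.to_pandas()
</code>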
It is also possible to run queries in the database and conduct basically any operation that could be done directly in Snowflake. For a new user, there is already a lot of material available on Medium to help one get started with different topics.
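For example, plain SQL can be pushed straight to the warehouse from Python; a small sketch (the query itself is just a placeholder):
<code>
# Run an arbitrary SQL statement in the warehouse; collect() triggers execution
rows = session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE()").collect()
print(rows)
</code>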
One cannot avoid the thought that Snowflake is aiming to challenge Databricks, with the great advantage of being able to use Snowflake warehouses to do the heavy lifting in the database where the data is located. A chat with Snowflake representatives revealed that the roadmap concentrates heavily on data applications enabled by Streamlit (acquired by Snowflake). Python development is also going to become easier within the Snowflake UI with the added Python worksheet support.
Not quite there yet
Great: one can do any required data processing in a familiar way, and no computation happens until data frames are viewed or converted to other formats (such as pandas). So far so good, but what about model training? To be able to execute this in Snowflake, the training logic needs to be registered as a stored procedure (SPROC) or as a user-defined table function (UDTF). At this point, the immaturity of the Snowpark API starts to reveal itself.
Snowpark is integrated with Anaconda and there are quite a few libraries readily available. However, many of these libraries have not been updated in a long time, which creates a dependency discrepancy between developers and the platform. While scikit-learn is up to date, for example, the version of gluonts for time series forecasting is from August 2021, and the gradient boosting library catboost is not available at all.
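If in doubt, the package versions provided by the Anaconda integration can be checked from the information_schema.packages view; a small sketch, using gluonts as an example:
<code>
# List the versions of a package available through the Anaconda integration
session.sql(
    "SELECT package_name, version "
    "FROM information_schema.packages "
    "WHERE language = 'python' AND package_name = 'gluonts'"
).show()
</code>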
Online Snowpark ML tutorials work fine, as they tend to utilize only those libraries that are provided by the Anaconda integration. However, there are countless unsupported libraries that one might want to use in a project. Also, it is very typical that organizations have developed custom libraries/modules that are used internally for specific purposes. Luckily, local code can be included in the SPROC registration (by using the imports argument). However, it gets trickier with unsupported libraries.
If an external library (which can be included in the registration either by importing a zip file or by giving the local path to the library) only depends on built-in libraries or on dependencies provided by the Anaconda integration, you are likely to be fine. Otherwise, all of its dependencies need to be imported as well. These imports are stored as zip files in a stage on Snowflake, whose location needs to be provided in the registration call.
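To make this concrete, here is a hedged sketch of what registering a training SPROC could look like; the table, column, and stage names are hypothetical, and the model is simply dumped to a stage with joblib:
<code>
from snowflake.snowpark import Session

def train_model(session: Session, table_name: str) -> str:
    # Libraries are resolved inside the warehouse via the packages argument
    import joblib
    from sklearn.linear_model import LinearRegression

    # Pull the training data into pandas inside the stored procedure
    pdf = session.table(table_name).to_pandas()
    model = LinearRegression().fit(pdf[["FEATURE"]], pdf["TARGET"])

    # Persist the fitted model to a stage
    local_path = "/tmp/model.joblib"
    joblib.dump(model, local_path)
    session.file.put(local_path, "@MODEL_STAGE", auto_compress=False, overwrite=True)
    return "model stored in @MODEL_STAGE"

session.sproc.register(
    train_model,
    name="TRAIN_MODEL",
    packages=["snowflake-snowpark-python", "scikit-learn", "joblib"],
    imports=["my_local_module.py"],  # local code and zipped external libraries go here
    is_permanent=True,
    stage_location="@SPROC_STAGE",
    replace=True,
)

# Training then runs in the warehouse:
# session.call("TRAIN_MODEL", "TRAINING_TABLE")
</code>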
Let’s say that one manages to include all the imports needed for a particular library, successfully registers a SPROC, and uses it to train a model, which gets stored into the model stage. The logical next step would be to use the stored model to score new data. This is done either with a UDF (user-defined function), a UDTF (user-defined table function), or a SPROC, depending on the requirements of the model’s predict method.
Whatever the approach, the library used for training needs to be included here as well (which is typically not necessary in native Python). One would expect that a library imported previously for model training would be readily available later, but this is unfortunately not the case; the importing needs to be repeated, which feels a bit clumsy.
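As a sketch of what scoring could look like with a UDF (again with hypothetical names), the stored model is re-imported from the stage and loaded from the import directory Snowflake exposes to the handler:
<code>
from snowflake.snowpark.functions import udf

@udf(
    name="PREDICT_TARGET",
    is_permanent=True,
    stage_location="@UDF_STAGE",
    imports=["@MODEL_STAGE/model.joblib"],  # the model file has to be imported again here
    packages=["scikit-learn", "joblib"],    # ...and so does the training library
    replace=True,
    session=session,
)
def predict_target(feature: float) -> float:
    import os
    import sys
    import joblib

    # Imported stage files are extracted to this directory at runtime
    import_dir = sys._xoptions["snowflake_import_directory"]
    model = joblib.load(os.path.join(import_dir, "model.joblib"))
    return float(model.predict([[feature]])[0])
</code>
In practice one would cache the loaded model (for example with cachetools) rather than load it for every row, but the sketch shows the repeated importing that the paragraph above refers to.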
Finally, let’s assume that model training and scoring can be implemented in Snowpark. Then (if not even earlier) it would be logical to start thinking about the model life cycle. Over the past couple of years, MLflow has become a standard component for model management in MLOps pipelines. While there is an MLflow plugin available for Snowflake, it currently only supports model deployment. Model experimentation needs to be done somewhere else (or outside the MLflow context), missing the opportunity to offload training to a Snowflake warehouse.
Great potential awaits to be realised
While Snowpark holds great promise, it is not yet a real competitor to Databricks, which has become a serious MLOps platform over the years. Still, Snowpark is already a great tool for straightforward data operations and lightweight deployments, especially if scikit-learn is the go-to ML library and one does not mind doing some manual work with model versioning and performance logging.
The above-mentioned issues with Python dependencies and the lack of native model life-cycle management, which are undoubtedly solvable, are hindering broader adoption of Snowpark for ML modelling. One thing Snowflake could do here is to add the possibility of registering code environments, as implemented in Azure Machine Learning.
I’m certainly looking forward to seeing how Snowpark evolves in the future. As soon as the obstacles to flexible machine learning experimentation are cleared, there is likely to be increasing interest in moving advanced analytics from other platforms to Snowpark, especially in organizations already hosting their data in Snowflake.