Scale your pandas workflow with Modin – no rewrite required

This post was authored by Vasilij Litvinov, AI Frameworks Engineer, Intel.


Introduction

AI and data science are advancing rapidly, bringing an ever-increasing amount of data to the table and enabling ideas and solutions that grow more complex every day.

On the other hand, these advances are shifting the focus from extracting value to systems engineering, and hardware capabilities may be growing faster than people can learn to utilize them properly.

This trend either requires adding a new position, the so-called "data engineer", or forces the data scientist to deal with infrastructure issues instead of focusing on the core of data science: generating insights. One of the primary reasons is the absence of optimized data science and machine learning infrastructure for data scientists who are not also software engineers – those can be considered two separate, sometimes overlapping skill sets.

We know that data scientists are creatures of habit. They like the tools they're familiar with in the Python data stack, such as pandas, scikit-learn, NumPy, and PyTorch. However, these tools are often unsuited to parallel processing or terabytes of data.

Anaconda and Intel are collaborating to solve data scientists' most central problem: how to make their familiar software stack and APIs more scalable and faster.

This post introduces Modin (aka Intel® Distribution of Modin), part of the Intel® oneAPI AI Analytics Toolkit (AI Kit), which is now available from Anaconda's defaults channel (and from conda-forge, too).
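
For example, installing from either channel could look like this (assuming the package is published under the name modin on both channels, as it was at the time of writing; check the channel listing if that changes):

# from Anaconda's defaults channel
conda install modin

# ...or from conda-forge
conda install -c conda-forge modin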

Why pandas is not enough

While pandas is an "industry standard", it is inherently single-threaded in many cases, which makes it slow on huge data (and it can even fail outright on datasets that do not fit in memory).

While there are many alternatives aimed at solving these issues (such as Dask, PySpark, and vaex.io), none of them provides a fully pandas-compatible interface – users would have to "fix" their workloads accordingly.

What does Modin offer you as the end user? It adheres to the idea that "tools should work for the data scientist, not vice versa". Hence it offers a simple, drop-in replacement for pandas: you just replace your "import pandas as pd" statement with "import modin.pandas as pd" and gain better scalability for a lot of use cases.
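
A minimal sketch of what that looks like in practice (the file and column names below are placeholders):

# the one-line change: Modin's pandas API instead of stock pandas
import modin.pandas as pd

# unchanged pandas code, now executed in parallel under the hood
df = pd.read_csv("my_dataset.csv")               # placeholder path
result = df.groupby("category")["value"].mean()  # placeholder columns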

What Modin offers

By removing the "rewrite your pandas workflow for framework X" step, Modin speeds up the development cycle for insights.

Fig. 1. Utilize Modin in a continuous-development cycle.

Modin makes better use of the hardware by grid-splitting the dataframe, which enables certain operations to run in a parallel, distributed way, be it cell-wise, column-wise, or row-wise.
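
By default, Modin selects an execution engine that is installed (such as Ray or Dask); if you want to pick one explicitly, a minimal sketch using modin.config looks like this:

import modin.config as cfg

# select Ray explicitly; cfg.Engine.put('dask') works the same way
# (setting the MODIN_ENGINE environment variable is an alternative)
cfg.Engine.put('ray')

import modin.pandas as pd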

For certain operations, it's possible to use the experimental integration with the OmniSci engine to leverage the power of multiple cores even better.

Fig. 2. Comparing pandas DataFrame and Modin DataFrame.

If you install Modin through the AI Kit or from the Anaconda defaults (or conda-forge) channel, an experimental, even faster OmniSci backend for Modin is also available; it takes only a few simple code changes to activate:


# select the native execution engine and the OmniSci backend
# before importing Modin's pandas API
import modin.config as cfg
cfg.Engine.put('native')
cfg.Backend.put('omnisci')

# note the experimental namespace used for the OmniSci backend
import modin.experimental.pandas as pd
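
After these lines, the usual pandas-style calls run on the OmniSci backend unchanged; for example (the file and column names are placeholders):

# same pandas-style API, now executed by OmniSci
df = pd.read_csv("trips.csv")                # placeholder path
print(df.groupby("passenger_count").size())  # placeholder column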

Show me the numbers!

Enough with the words, let’s have a look at the benchmarks.

For a detailed comparison of the different Modin engines, please refer to the community-measured microbenchmarks at https://modin.org/modin-bench/, which track the performance of different data science operations across commits to the Modin repo.

For this post, let's use bigger, more end-to-end-relevant benchmarks, run on a server based on the Intel® Xeon® Platinum 8368 processor (see full hardware info below) using OmniSci through Modin.

Fig. 3. Running NYC Taxi (200M records, 79.2GB input dataset)

Fig. 4. Running Census (21M records, 2.1GB input dataset)

Fig. 5. Running PLAsTiCC (460M records, 20GB input dataset)

Hardware information: 1 node, 2x 3rd Gen Intel Xeon Platinum 8368 on a C620 board with 512GB total DDR4 memory (16 slots / 32GB / 3200), microcode 0xd0002a0, HT on, Turbo on, CentOS 7.9.2009, kernel 3.10.0-1160.36.2.el7.x86_64, 1x Intel 960GB SSD OS drive, 3x Intel 1.9TB SSD data drives. Software information: Python 3.8.10, pandas 1.3.2, Modin 0.10.2, OmniSciDB 5.7.0, Docker 20.10.8. Tested by Intel on 10/05/2021.

Wait, there’s more

If running on one node is not enough for your data, Modin supports running on a cluster with a rather simple setup. For setting up a Ray-driven cluster, see https://docs.ray.io/en/latest/cluster/cloud.html#manual-ray-cluster-setup; for Dask, refer to https://docs.dask.org/en/latest/how-to/deploy-dask-clusters.html.
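
For instance, once a Ray cluster is already running, attaching Modin to it can be as simple as this sketch (assuming the script runs on the cluster's head node):

import ray
# connect to the running Ray cluster instead of starting a local one
ray.init(address='auto')

import modin.pandas as pd
# dataframe operations are now distributed across the cluster's nodes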

You can use the experimental XGBoost integration, too, which will automatically utilize the Ray-based cluster for you without any special setup!
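
A minimal sketch of that integration, assuming the modin.experimental.xgboost module and purely illustrative data and parameters:

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# toy dataset; any Modin DataFrame/Series pair would work here
X = pd.DataFrame({'f0': [1, 2, 3, 4], 'f1': [0.5, 1.5, 2.5, 3.5]})
y = pd.Series([0, 1, 0, 1])

dtrain = xgb.DMatrix(X, y)                                   # Modin-aware DMatrix
model = xgb.train({'objective': 'binary:logistic'}, dtrain)  # distributed training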

With all of this in mind, try installing Modin and the Intel® oneAPI AI Analytics Toolkit through Anaconda today!

Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See configuration disclosure for details.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

©Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Talk to an Expert

Talk to one of our experts to find solutions for your AI journey.
