Sparse Inputs¶

SciKeras supports sparse inputs (X/features). You don’t have to do anything special for this to work: just pass a sparse matrix to fit().

In this notebook, we’ll demonstrate how this works and compare memory consumption of sparse inputs to dense inputs.

Setup¶

[1]:

!pip install memory_profiler
%load_ext memory_profiler

Collecting memory_profiler

  Downloading memory_profiler-0.61.0-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: psutil in /home/runner/work/scikeras/scikeras/.venv/lib/python3.12/site-packages (from memory_profiler) (5.9.8)

Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)

Installing collected packages: memory_profiler

Successfully installed memory_profiler-0.61.0

[2]:

import warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import get_logger
get_logger().setLevel('ERROR')
warnings.filterwarnings("ignore", message="Setting the random state for TF")

[3]:

try:
    import scikeras
except ImportError:
    !python -m pip install scikeras

[4]:

import scipy
import numpy as np
from scikeras.wrappers import KerasRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
import keras

Data¶

The dataset we’ll be using is designed to demonstrate a worst-case scenario for dense input features and a best-case scenario for sparse ones. It consists of a single categorical feature with as many categories as rows. This means the one-hot encoded representation will have as many columns as rows, making it very inefficient to store as a dense matrix but very efficient to store as a sparse matrix.

[5]:

N_SAMPLES = 20_000  # hand tuned to be ~4GB peak

X = np.arange(0, N_SAMPLES).reshape(-1, 1)
y = np.random.uniform(0, 1, size=(X.shape[0],))
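
The size of the encoded matrix can be estimated by hand before running anything heavy. The sketch below assumes float64 values (the default dtype of OneHotEncoder) and CSR storage with 32-bit indices (an assumption that holds for SciPy matrices of this size). The helper name `one_hot_footprint` is hypothetical, and the estimate covers only the encoded matrix, not any copies made during training.

```python
import numpy as np
from scipy import sparse


def one_hot_footprint(n_samples: int, value_bytes: int = 8, index_bytes: int = 4):
    """Estimate bytes for an n x n one-hot matrix (hypothetical helper)."""
    dense = n_samples * n_samples * value_bytes
    # CSR stores one value and one column index per nonzero,
    # plus n + 1 row pointers.
    csr = (
        n_samples * value_bytes
        + n_samples * index_bytes
        + (n_samples + 1) * index_bytes
    )
    return dense, csr


dense_bytes, csr_bytes = one_hot_footprint(20_000)
print(f"dense: {dense_bytes / 2**20:.0f} MiB, CSR: {csr_bytes / 2**20:.2f} MiB")

# Sanity check against a real (small) CSR identity matrix, which has the
# same layout as the one-hot encoding of all-distinct categories.
small = sparse.identity(1_000, dtype=np.float64, format="csr")
actual = small.data.nbytes + small.indices.nbytes + small.indptr.nbytes
expected = one_hot_footprint(1_000, index_bytes=small.indices.dtype.itemsize)[1]
print(actual == expected)
```

For 20,000 samples this works out to roughly 3 GiB dense versus about a third of a MiB in CSR form, which is why the dense pipeline dominates the memory benchmark.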

Model¶

The model here is nothing special, just a basic multilayer perceptron with one hidden layer.

[6]:

def get_clf(meta) -> keras.Model:
    n_features_in_ = meta["n_features_in_"]
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape=(n_features_in_,)))
    # a single hidden layer
    model.add(keras.layers.Dense(100, activation="relu"))
    model.add(keras.layers.Dense(1))
    return model

Pipelines¶

Here is where it gets interesting. We make two Scikit-Learn pipelines that use OneHotEncoder: one that uses sparse_output=False to force a dense matrix as the output and another that uses sparse_output=True (the default).

[7]:

dense_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=False)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)

sparse_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=True)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)

Benchmark¶

Our benchmark will be to just train each one of these pipelines and measure peak memory consumption.

[8]:

%memit dense_pipeline.fit(X, y)

peak memory: 5175.21 MiB, increment: 4650.21 MiB

[9]:

%memit sparse_pipeline.fit(X, y)

peak memory: 1001.99 MiB, increment: 40.09 MiB

The memory increment of the dense pipeline should be at least 100x that of the sparse pipeline.
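
`%memit` is an IPython magic, so it is only available in a notebook or IPython session. In a plain script, a rough stand-in is the standard library's `tracemalloc`, sketched below. This is a swapped-in technique, not what this notebook uses: `tracemalloc` only sees allocations reported through Python's allocators (NumPy reports its buffers to it), so it will not capture memory that TensorFlow allocates natively. The helper name `peak_mib` is hypothetical, and the example measures only the cost of materializing a one-hot-shaped matrix, not a full fit().

```python
import tracemalloc

import numpy as np
from scipy import sparse


def peak_mib(func):
    """Peak traced allocation in MiB while running func (hypothetical helper)."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 2**20


n = 2_000  # kept small so the dense case stays around 30 MiB
dense_peak = peak_mib(lambda: np.eye(n, dtype=np.float64))
sparse_peak = peak_mib(lambda: sparse.identity(n, dtype=np.float64, format="csr"))
print(f"dense: {dense_peak:.1f} MiB, sparse: {sparse_peak:.3f} MiB")
```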

Runtime¶

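Using sparse inputs can have a drastic impact on memory usage, but it often (not always) hurts overall runtime. In the run recorded below it did not: the sparse pipeline was the faster of the two, so it is worth timing both on your own data.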

[10]:

%timeit dense_pipeline.fit(X, y)

32.8 s ± 9.49 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

[11]:

%timeit sparse_pipeline.fit(X, y)

12.1 s ± 717 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TensorFlow Datasets¶

TensorFlow provides a whole suite of functionality around the Dataset. Datasets are lazily evaluated, can be sparse, and minimize the transformations required to feed data into the model. At scale, they are typically much more performant and efficient than NumPy data structures, even sparse ones.

SciKeras does not (and cannot) support Datasets directly because Scikit-Learn itself does not support them and SciKeras’ outward-facing API is Scikit-Learn’s API. You may want to explore breaking out of SciKeras and using TensorFlow/Keras directly to see if Datasets have a large impact for your use case.

Bonus: dtypes¶

You might be able to save even more memory by changing the output dtype of OneHotEncoder.

[12]:

sparse_pipeline_uint8 = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=True, dtype=np.uint8)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)

[13]:

%memit sparse_pipeline_uint8.fit(X, y)

peak memory: 1084.54 MiB, increment: 16.99 MiB