Run in Google Colab

Sparse Inputs

SciKeras supports sparse inputs (X/features). You don’t have to do anything special for this to work: just pass a sparse matrix to fit().

In this notebook, we’ll demonstrate how this works and compare memory consumption of sparse inputs to dense inputs.

Setup

[1]:
!pip install memory_profiler
%load_ext memory_profiler
Collecting memory_profiler
  Downloading memory_profiler-0.60.0.tar.gz (38 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: psutil in /home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages (from memory_profiler) (5.9.2)
Building wheels for collected packages: memory_profiler
  Building wheel for memory_profiler (setup.py) ... done
  Created wheel for memory_profiler: filename=memory_profiler-0.60.0-py3-none-any.whl size=31267 sha256=53a9e045284d81a31069de19d473a8891b26ccc5fcccdfb88d4e60062c87fe3d
  Stored in directory: /home/runner/.cache/pip/wheels/01/ca/8b/b518dd2aef69635ad6fcab87069c9c52f355a2e9c5d4c02da9
Successfully built memory_profiler
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.60.0
[2]:
import warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import get_logger
get_logger().setLevel('ERROR')
warnings.filterwarnings("ignore", message="Setting the random state for TF")
[3]:
try:
    import scikeras
except ImportError:
    !python -m pip install scikeras
[4]:
import scipy
import numpy as np
from scikeras.wrappers import KerasRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from tensorflow import keras

Data

The dataset we’ll be using is designed to demonstrate a worst-case scenario for dense input features and a best-case scenario for sparse ones. It consists of a single categorical feature with as many categories as there are rows. This means the one-hot encoded representation will require as many columns as it has rows, making it very inefficient to store as a dense matrix but very efficient to store as a sparse one.

[5]:
N_SAMPLES = 20_000  # hand tuned to be ~4GB peak

X = np.arange(0, N_SAMPLES).reshape(-1, 1)
y = np.random.uniform(0, 1, size=(X.shape[0],))
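
To see why, here is a back-of-the-envelope calculation (assuming OneHotEncoder’s default float64 dtype, and that CSR format stores one 8-byte value plus one 4-byte column index per nonzero, plus a row-pointer array):

n = N_SAMPLES
dense_bytes = n * n * 8                    # full one-hot matrix: ~3.2 GB
sparse_bytes = n * (8 + 4) + (n + 1) * 4   # CSR data + indices + indptr: ~320 kB
print(f"dense: {dense_bytes / 1e9:.1f} GB, sparse: {sparse_bytes / 1e3:.0f} kB")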

Model

The model here is nothing special: just a basic multilayer perceptron with one hidden layer.

[6]:
def get_clf(meta) -> keras.Model:
    # SciKeras passes dataset metadata (including n_features_in_) via `meta`
    n_features_in_ = meta["n_features_in_"]
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape=(n_features_in_,)))
    # a single hidden layer
    model.add(keras.layers.Dense(100, activation="relu"))
    model.add(keras.layers.Dense(1))
    # no compile() here: SciKeras compiles the model using the loss
    # passed to KerasRegressor
    return model
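
As a quick sanity check of the headline claim before building full pipelines, a scipy sparse matrix can go straight into fit(). A minimal sketch (the small random matrix is purely illustrative):

from scipy import sparse

X_demo = sparse.random(1_000, 50, density=0.1, format="csr")
y_demo = np.random.uniform(size=1_000)
reg = KerasRegressor(get_clf, loss="mse", epochs=1, verbose=False)
reg.fit(X_demo, y_demo)  # no conversion to dense required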

Pipelines

Here is where it gets interesting. We make two Scikit-Learn pipelines that use OneHotEncoder: one with sparse=False to force dense output, and another with sparse=True (the default).

[7]:
dense_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse=False)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)

sparse_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse=True)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)
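
A version note: scikit-learn 1.2 renamed this parameter, so on newer versions the cells above need sparse_output instead (the old sparse keyword was deprecated and later removed):

# on scikit-learn >= 1.2:
encoder = OneHotEncoder(sparse_output=True)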

Benchmark

Our benchmark simply trains each of these pipelines and measures memory consumption with memory_profiler’s %memit magic, which reports the peak memory observed while the statement runs and the increment over the baseline before it.

[8]:
%memit dense_pipeline.fit(X, y)
peak memory: 3540.61 MiB, increment: 3151.28 MiB
[9]:
%memit sparse_pipeline.fit(X, y)
/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/tensorflow/python/framework/indexed_slices.py:444: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradient_tape/sequential_1/dense_2/embedding_lookup_sparse/Reshape_1:0", shape=(None,), dtype=int32), values=Tensor("gradient_tape/sequential_1/dense_2/embedding_lookup_sparse/Reshape:0", shape=(None, 100), dtype=float32), dense_shape=Tensor("gradient_tape/sequential_1/dense_2/embedding_lookup_sparse/Cast:0", shape=(2,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  warnings.warn(
peak memory: 678.02 MiB, increment: 62.62 MiB

You should see a dramatically larger memory increment for the dense pipeline: roughly 50x in the run above (about 3151 MiB versus 63 MiB).

Runtime

Using sparse inputs can have a drastic impact on memory usage, but its effect on runtime cuts both ways: it often (though not always) slows training down. In this example, which is deliberately hostile to dense inputs, the sparse pipeline happens to be faster as well.

[10]:
%timeit dense_pipeline.fit(X, y)
27.8 s ± 1.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
[11]:
%timeit sparse_pipeline.fit(X, y)
/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/tensorflow/python/framework/indexed_slices.py:444: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradient_tape/sequential_10/dense_20/embedding_lookup_sparse/Reshape_1:0", shape=(None,), dtype=int32), values=Tensor("gradient_tape/sequential_10/dense_20/embedding_lookup_sparse/Reshape:0", shape=(None, 100), dtype=float32), dense_shape=Tensor("gradient_tape/sequential_10/dense_20/embedding_lookup_sparse/Cast:0", shape=(2,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  warnings.warn(
[the same UserWarning is emitted once per timed run; repeats elided]
10.5 s ± 360 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TensorFlow Datasets

TensorFlow provides a whole suite of functionality built around the tf.data.Dataset. Datasets are lazily evaluated, can be sparse, and minimize the transformations required to feed data into the model. At scale they are far more performant and efficient than NumPy data structures, even sparse ones.

SciKeras does not (and cannot) support Datasets directly because Scikit-Learn itself does not support them, and SciKeras’ outward-facing API is Scikit-Learn’s API. You may want to explore breaking out of SciKeras and using TensorFlow/Keras directly to see whether Datasets have a large impact for your use case.
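
If you do drop down to the Keras API, a rough sketch of feeding the same sparse one-hot features through tf.data might look like the following. This is an illustration, not part of SciKeras; it assumes TF 2.x and reuses the X, y, and N_SAMPLES defined above:

import tensorflow as tf

# re-encode X and convert scipy's sparse matrix to a tf.sparse.SparseTensor
X_coo = OneHotEncoder(sparse=True).fit_transform(X).tocoo()
X_st = tf.sparse.reorder(  # SparseTensor indices must be in canonical order
    tf.sparse.SparseTensor(
        indices=np.stack([X_coo.row, X_coo.col], axis=1).astype(np.int64),
        values=X_coo.data.astype(np.float32),
        dense_shape=X_coo.shape,
    )
)
dataset = tf.data.Dataset.from_tensor_slices((X_st, y)).batch(64)

# sparse=True lets the Input layer accept SparseTensors directly
inputs = keras.layers.Input(shape=(N_SAMPLES,), sparse=True)
hidden = keras.layers.Dense(100, activation="relu")(inputs)
model = keras.Model(inputs, keras.layers.Dense(1)(hidden))
model.compile(loss="mse")
model.fit(dataset, epochs=5, verbose=0)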

Bonus: dtypes

You might be able to save even more memory by changing the output dtype of OneHotEncoder: the one-hot values are only ever 0 or 1, so an 8-bit unsigned integer loses no information compared to the default float64.

[12]:
sparse_pipeline_uint8 = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse=True, dtype=np.uint8)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)
[13]:
%memit sparse_pipeline_uint8.fit(X, y)
/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/tensorflow/python/framework/indexed_slices.py:444: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradient_tape/sequential_18/dense_36/embedding_lookup_sparse/Reshape_1:0", shape=(None,), dtype=int32), values=Tensor("gradient_tape/sequential_18/dense_36/embedding_lookup_sparse/Reshape:0", shape=(None, 100), dtype=float32), dense_shape=Tensor("gradient_tape/sequential_18/dense_36/embedding_lookup_sparse/Cast:0", shape=(2,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  warnings.warn(
peak memory: 1069.02 MiB, increment: 0.44 MiB
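
The increment here is tiny compared to the float64 sparse pipeline above (under 1 MiB versus roughly 63 MiB); the reported peak likely still reflects memory held over from earlier cells. A quick way to see the saving directly is to compare just the CSR data arrays, as in this sketch:

X_u8 = OneHotEncoder(sparse=True, dtype=np.uint8).fit_transform(X)
X_f64 = OneHotEncoder(sparse=True).fit_transform(X)
# float64 stores 8 bytes per nonzero vs. 1 byte for uint8
print(X_f64.data.nbytes, "vs", X_u8.data.nbytes)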