Sparse Inputs
SciKeras supports sparse inputs (X/features). You don’t have to do anything special for this to work; you can just pass a sparse matrix to fit().
In this notebook, we’ll demonstrate how this works and compare memory consumption of sparse inputs to dense inputs.
Setup
[1]:
!pip install memory_profiler
%load_ext memory_profiler
Collecting memory_profiler
Downloading memory_profiler-0.61.0-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: psutil in /home/runner/work/scikeras/scikeras/.venv/lib/python3.12/site-packages (from memory_profiler) (5.9.8)
Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.61.0
[2]:
import warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import get_logger
get_logger().setLevel('ERROR')
warnings.filterwarnings("ignore", message="Setting the random state for TF")
[3]:
try:
    import scikeras
except ImportError:
    !python -m pip install scikeras
[4]:
import scipy
import numpy as np
from scikeras.wrappers import KerasRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
import keras
Data
The dataset we’ll be using is designed to demonstrate a worst-case scenario for dense input features and a best-case scenario for sparse ones. It consists of a single categorical feature with as many categories as rows. This means the one-hot encoded representation will require as many columns as it has rows, making it very inefficient to store as a dense matrix but very efficient to store as a sparse matrix.
[5]:
N_SAMPLES = 20_000 # hand tuned to be ~4GB peak
X = np.arange(0, N_SAMPLES).reshape(-1, 1)
y = np.random.uniform(0, 1, size=(X.shape[0],))
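To put rough numbers on the dense-versus-sparse storage claim above, the sketch below builds the same kind of matrix at a smaller size and compares buffer sizes. It assumes that the one-hot encoding of X is equivalent to an identity matrix (every row has its own category) and that the sparse result is held in CSR format with float64 values. The names n_small, n_full and csr_nbytes are made up for this illustration.

```python
import numpy as np
import scipy.sparse as sp


def csr_nbytes(m) -> int:
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Small stand-in for N_SAMPLES so the dense array stays cheap to allocate.
n_small = 2_000

# One category per row means the one-hot matrix is an identity matrix.
dense = np.eye(n_small, dtype=np.float64)
sparse = sp.identity(n_small, dtype=np.float64, format="csr")

dense_bytes = dense.nbytes         # grows with n ** 2
sparse_bytes = csr_nbytes(sparse)  # grows with n

print(f"dense : {dense_bytes / 2**20:.2f} MiB")
print(f"sparse: {sparse_bytes / 2**20:.3f} MiB")

# Extrapolate the dense float64 cost to the notebook's N_SAMPLES = 20_000.
n_full = 20_000
dense_full_gib = n_full * n_full * 8 / 2**30
print(f"dense at n={n_full}: {dense_full_gib:.2f} GiB")
```

The dense figure covers only a single copy of the encoded matrix; any intermediate copies made during training come on top of it.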
Model
The model here is nothing special, just a basic multilayer perceptron with one hidden layer.
[6]:
def get_clf(meta) -> keras.Model:
    n_features_in_ = meta["n_features_in_"]
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape=(n_features_in_,)))
    # a single hidden layer
    model.add(keras.layers.Dense(100, activation="relu"))
    model.add(keras.layers.Dense(1))
    return model
Pipelines
Here is where it gets interesting. We make two Scikit-Learn pipelines that use OneHotEncoder: one that uses sparse_output=False to force a dense matrix as the output and another that uses sparse_output=True (the default).
[7]:
dense_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=False)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)
sparse_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=True)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)
Benchmark
Our benchmark will be to just train each one of these pipelines and measure peak memory consumption.
[8]:
%memit dense_pipeline.fit(X, y)
peak memory: 5175.21 MiB, increment: 4650.21 MiB
[9]:
%memit sparse_pipeline.fit(X, y)
peak memory: 1001.99 MiB, increment: 40.09 MiB
You should see a memory increment at least 100x larger in the dense pipeline than in the sparse one.
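Outside a notebook, where the %memit magic is not available, a rough comparison can be made with the standard library's tracemalloc module. This is a swapped-in technique, not what the benchmark above uses. tracemalloc only counts allocations reported through Python's allocator hooks (NumPy reports its array buffers there), not the resident memory of the whole process, so its numbers will not match those from memory_profiler. The helper peak_traced_bytes is a hypothetical name, and the sketch measures only the construction of the encoded matrices, not model training.

```python
import tracemalloc

import numpy as np
import scipy.sparse as sp


def peak_traced_bytes(build):
    """Run build() and return the peak number of bytes tracemalloc recorded."""
    tracemalloc.start()
    try:
        result = build()  # hold a reference so the buffers are alive when measured
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


n = 1_000  # kept small so the dense case is cheap
dense_peak = peak_traced_bytes(lambda: np.eye(n, dtype=np.float64))
sparse_peak = peak_traced_bytes(
    lambda: sp.identity(n, dtype=np.float64, format="csr")
)

print(f"dense peak : {dense_peak / 2**20:.2f} MiB")
print(f"sparse peak: {sparse_peak / 2**20:.2f} MiB")
```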
Runtime
Using sparse inputs can have a drastic impact on memory usage, but it often (not always) hurts overall runtime. In the timings recorded below, the sparse pipeline happens to be the faster of the two.
[10]:
%timeit dense_pipeline.fit(X, y)
32.8 s ± 9.49 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
[11]:
%timeit sparse_pipeline.fit(X, y)
12.1 s ± 717 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
TensorFlow Datasets
TensorFlow provides a whole suite of functionality around the Dataset. Datasets are lazily evaluated, can be sparse, and minimize the transformations required to feed data into the model. At scale, they are generally more performant and memory-efficient than NumPy data structures, even sparse ones.
SciKeras does not (and cannot) support Datasets directly because Scikit-Learn itself does not support them and SciKeras’ outward-facing API is Scikit-Learn’s API. You may want to explore breaking out of SciKeras and using TensorFlow/Keras directly to see if Datasets can have a large impact for your use case.
Bonus: dtypes
You might be able to save even more memory by changing the output dtype of OneHotEncoder.
[12]:
sparse_pipeline_uint8 = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=True, dtype=np.uint8)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)
[13]:
%memit sparse_pipeline_uint8.fit(X, y)
peak memory: 1084.54 MiB, increment: 16.99 MiB
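The effect of the dtype on the encoded matrix itself can be checked in isolation. The sketch below assumes, as in the Data section, that the one-hot matrix is an identity matrix stored in CSR format, and compares the size of the value array with the size of the index arrays for float64 and uint8. It says nothing about how the data is cast further down the pipeline. as_float64 and as_uint8 are illustrative names.

```python
import numpy as np
import scipy.sparse as sp

n = 20_000  # same as N_SAMPLES in this notebook

as_float64 = sp.identity(n, dtype=np.float64, format="csr")
as_uint8 = sp.identity(n, dtype=np.uint8, format="csr")

for name, m in [("float64", as_float64), ("uint8", as_uint8)]:
    index_bytes = m.indices.nbytes + m.indptr.nbytes
    print(f"{name:>7}: values={m.data.nbytes} B, index arrays={index_bytes} B")
```

Only the value array shrinks (from 8 bytes to 1 byte per stored entry). The index arrays stay the same size, so the saving on the encoded matrix is bounded by the share of memory the values took up in the first place.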