{ "cells": [ { "cell_type": "raw", "id": "4a36501e", "metadata": {}, "source": [ "Run in Google Colab" ] }, { "cell_type": "markdown", "id": "0cfe80ba", "metadata": {}, "source": [ "# Sparse Inputs" ] }, { "cell_type": "markdown", "id": "d69f3e71", "metadata": {}, "source": [ "SciKeras supports sparse inputs (`X`/features).\n", "You don't have to do anything special for this to work: just pass a sparse matrix to `fit()`.\n", "\n", "In this notebook, we'll demonstrate how this works and compare the memory consumption of sparse and dense inputs." ] }, { "cell_type": "markdown", "id": "fcb78c17", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "ef5dbe87", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:09.472451Z", "iopub.status.busy": "2023-06-13T22:02:09.468737Z", "iopub.status.idle": "2023-06-13T22:02:12.845097Z", "shell.execute_reply": "2023-06-13T22:02:12.843925Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting memory_profiler\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: psutil in /home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages (from memory_profiler) (5.9.5)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: memory_profiler\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed memory_profiler-0.61.0\r\n" ] } ], "source": [ "!pip install memory_profiler\n", "%load_ext memory_profiler" ] }, { "cell_type": "code", "execution_count": 2, "id": "c04458d1", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:12.851228Z", "iopub.status.busy": "2023-06-13T22:02:12.849547Z", "iopub.status.idle": "2023-06-13T22:02:15.435482Z", "shell.execute_reply": 
"2023-06-13T22:02:15.434469Z" } }, "outputs": [], "source": [ "import warnings\n", "import os\n", "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'\n", "from tensorflow import get_logger\n", "get_logger().setLevel('ERROR')\n", "warnings.filterwarnings(\"ignore\", message=\"Setting the random state for TF\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "3835af93", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:15.441145Z", "iopub.status.busy": "2023-06-13T22:02:15.440433Z", "iopub.status.idle": "2023-06-13T22:02:15.449238Z", "shell.execute_reply": "2023-06-13T22:02:15.448070Z" } }, "outputs": [], "source": [ "try:\n", " import scikeras\n", "except ImportError:\n", " !python -m pip install scikeras" ] }, { "cell_type": "code", "execution_count": 4, "id": "13e4a059", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:15.454764Z", "iopub.status.busy": "2023-06-13T22:02:15.452820Z", "iopub.status.idle": "2023-06-13T22:02:15.786043Z", "shell.execute_reply": "2023-06-13T22:02:15.785131Z" } }, "outputs": [], "source": [ "import scipy\n", "import numpy as np\n", "from scikeras.wrappers import KerasRegressor\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.pipeline import Pipeline\n", "from tensorflow import keras" ] }, { "cell_type": "markdown", "id": "29c0d314", "metadata": {}, "source": [ "## Data\n", "\n", "The dataset we'll be using is designed to demonstrate a worst-case scenario for dense input features and a best-case scenario for sparse ones.\n", "It consists of a single categorical feature with as many categories as there are rows.\n", "This means the one-hot encoded representation will require as many columns as it has rows, making it very inefficient to store as a dense matrix but very efficient to store as a sparse matrix." 
] }, { "cell_type": "code", "execution_count": 5, "id": "3e5ef117", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:15.791342Z", "iopub.status.busy": "2023-06-13T22:02:15.790801Z", "iopub.status.idle": "2023-06-13T22:02:15.798426Z", "shell.execute_reply": "2023-06-13T22:02:15.797605Z" } }, "outputs": [], "source": [ "N_SAMPLES = 20_000 # hand tuned to be ~4GB peak\n", "\n", "X = np.arange(0, N_SAMPLES).reshape(-1, 1)\n", "y = np.random.uniform(0, 1, size=(X.shape[0],))" ] }, { "cell_type": "markdown", "id": "d6db3339", "metadata": {}, "source": [ "## Model\n", "\n", "The model here is nothing special, just a basic multilayer perceptron with one hidden layer." ] }, { "cell_type": "code", "execution_count": 6, "id": "e4d60ef8", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:15.803216Z", "iopub.status.busy": "2023-06-13T22:02:15.802474Z", "iopub.status.idle": "2023-06-13T22:02:15.809936Z", "shell.execute_reply": "2023-06-13T22:02:15.808963Z" } }, "outputs": [], "source": [ "def get_clf(meta) -> keras.Model:\n", " n_features_in_ = meta[\"n_features_in_\"]\n", " model = keras.models.Sequential()\n", " model.add(keras.layers.Input(shape=(n_features_in_,)))\n", " # a single hidden layer\n", " model.add(keras.layers.Dense(100, activation=\"relu\"))\n", " model.add(keras.layers.Dense(1))\n", " return model" ] }, { "cell_type": "markdown", "id": "27c66471", "metadata": {}, "source": [ "## Pipelines\n", "\n", "Here is where it gets interesting.\n", "We make two Scikit-Learn pipelines that use `OneHotEncoder`: one that uses `sparse=False` to force a dense matrix as the output and another that uses `sparse=True` (the default).\n", "Scikit-Learn 1.2 renamed this parameter to `sparse_output` and deprecated `sparse` (slated for removal in 1.4), so newer versions need `sparse_output` instead." 
] }, { "cell_type": "code", "execution_count": 7, "id": "33ae5e54", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:15.815498Z", "iopub.status.busy": "2023-06-13T22:02:15.813950Z", "iopub.status.idle": "2023-06-13T22:02:15.821448Z", "shell.execute_reply": "2023-06-13T22:02:15.820612Z" } }, "outputs": [], "source": [ "dense_pipeline = Pipeline(\n", " [\n", " (\"encoder\", OneHotEncoder(sparse=False)),\n", " (\"model\", KerasRegressor(get_clf, loss=\"mse\", epochs=5, verbose=False))\n", " ]\n", ")\n", "\n", "sparse_pipeline = Pipeline(\n", " [\n", " (\"encoder\", OneHotEncoder(sparse=True)),\n", " (\"model\", KerasRegressor(get_clf, loss=\"mse\", epochs=5, verbose=False))\n", " ]\n", ")" ] }, { "cell_type": "markdown", "id": "96314520", "metadata": {}, "source": [ "## Benchmark\n", "\n", "Our benchmark will be to just train each one of these pipelines and measure peak memory consumption." ] }, { "cell_type": "code", "execution_count": 8, "id": "000f39b1", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:02:15.826092Z", "iopub.status.busy": "2023-06-13T22:02:15.825512Z", "iopub.status.idle": "2023-06-13T22:03:41.465775Z", "shell.execute_reply": "2023-06-13T22:03:41.463143Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. 
`sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 3568.23 MiB, increment: 3147.57 MiB\n" ] } ], "source": [ "%memit dense_pipeline.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 9, "id": "469d47e5", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:03:41.471877Z", "iopub.status.busy": "2023-06-13T22:03:41.470612Z", "iopub.status.idle": "2023-06-13T22:04:19.158321Z", "shell.execute_reply": "2023-06-13T22:04:19.157281Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 693.33 MiB, increment: 106.52 MiB\n" ] } ], "source": [ "%memit sparse_pipeline.fit(X, y)" ] }, { "cell_type": "markdown", "id": "cca1b6e8", "metadata": {}, "source": [ "You should see a much larger memory consumption **increment** in the dense pipeline.\n", "In the run recorded above it is roughly 30x larger (3147.57 MiB vs. 106.52 MiB); the exact ratio will vary with your environment." ] }, { "cell_type": "markdown", "id": "bca1b449", "metadata": {}, "source": [ "### Runtime\n", "\n", "Using sparse inputs can drastically reduce memory usage, but it often (though not always) hurts overall runtime.\n", "In the run recorded below, the sparse pipeline was actually the faster of the two." 
] }, { "cell_type": "code", "execution_count": 10, "id": "3631c7e6", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:04:19.164821Z", "iopub.status.busy": "2023-06-13T22:04:19.162972Z", "iopub.status.idle": "2023-06-13T22:15:41.909036Z", "shell.execute_reply": "2023-06-13T22:15:41.907949Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1min 25s ± 25.7 s per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "%timeit dense_pipeline.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 11, "id": "7ae40148", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:15:41.913559Z", "iopub.status.busy": "2023-06-13T22:15:41.913027Z", "iopub.status.idle": "2023-06-13T22:18:40.608777Z", "shell.execute_reply": "2023-06-13T22:18:40.607706Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "22.2 s ± 1.8 s per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" ] } ], "source": [ "%timeit sparse_pipeline.fit(X, y)" ] }, { "cell_type": "markdown", "id": "fb5603b3", "metadata": {}, "source": [ "## TensorFlow Datasets\n", "\n", "TensorFlow provides a whole suite of functionality around the [Dataset] API.\n", "Datasets are lazily evaluated, can be sparse, and minimize the transformations required to feed data into the model.\n", "At scale, they are often _a lot_ more performant and efficient than NumPy data structures, even sparse ones.\n", "\n", "SciKeras does not (and cannot) support Datasets directly because Scikit-Learn itself does not support them and SciKeras' outward-facing API is Scikit-Learn's API.\n", "You may want to explore breaking out of SciKeras and using TensorFlow/Keras directly to see whether Datasets make a large difference for your use case.\n", "\n", "[Dataset]: https://www.tensorflow.org/api_docs/python/tf/data/Dataset" ] }, { "cell_type": "markdown", "id": "6d854b23", "metadata": {}, "source": [ "## Bonus: dtypes\n", "\n", "You might be able to save even more memory by changing the output dtype of `OneHotEncoder`." 
] }, { "cell_type": "code", "execution_count": 12, "id": "1492c50f", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:18:40.614039Z", "iopub.status.busy": "2023-06-13T22:18:40.613597Z", "iopub.status.idle": "2023-06-13T22:18:40.619881Z", "shell.execute_reply": "2023-06-13T22:18:40.619027Z" } }, "outputs": [], "source": [ "sparse_pipeline_uint8 = Pipeline(\n", " [\n", " (\"encoder\", OneHotEncoder(sparse=True, dtype=np.uint8)),\n", " (\"model\", KerasRegressor(get_clf, loss=\"mse\", epochs=5, verbose=False))\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "id": "f78ac962", "metadata": { "execution": { "iopub.execute_input": "2023-06-13T22:18:40.622996Z", "iopub.status.busy": "2023-06-13T22:18:40.622750Z", "iopub.status.idle": "2023-06-13T22:19:05.201191Z", "shell.execute_reply": "2023-06-13T22:19:05.200038Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 1010.32 MiB, increment: 0.50 MiB\n" ] } ], "source": [ "%memit sparse_pipeline_uint8.fit(X, y)" ] } ], "metadata": { "jupytext": { "formats": "ipynb,md" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 5 }