{
 "cells": [
  {
   "cell_type": "raw",
   "id": "4a36501e",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/adriangb/scikeras/blob/docs-deploy/refs/heads/master/notebooks/AutoEncoders.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\">Run in Google Colab</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cfe80ba",
   "metadata": {},
   "source": [
    "# Sparse Inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d69f3e71",
   "metadata": {},
   "source": [
    "SciKeras supports sparse inputs (`X`/features).\n",
    "You don't have to do anything special for this to work, you can just pass a sparse matrix to `fit()`.\n",
    "\n",
    "In this notebook, we'll demonstrate how this works and compare memory consumption of sparse inputs to dense inputs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcb78c17",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ef5dbe87",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:09.472451Z",
     "iopub.status.busy": "2023-06-13T22:02:09.468737Z",
     "iopub.status.idle": "2023-06-13T22:02:12.845097Z",
     "shell.execute_reply": "2023-06-13T22:02:12.843925Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting memory_profiler\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: psutil in /home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages (from memory_profiler) (5.9.5)\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Installing collected packages: memory_profiler\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully installed memory_profiler-0.61.0\r\n"
     ]
    }
   ],
   "source": [
    "!pip install memory_profiler\n",
    "%load_ext memory_profiler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c04458d1",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:12.851228Z",
     "iopub.status.busy": "2023-06-13T22:02:12.849547Z",
     "iopub.status.idle": "2023-06-13T22:02:15.435482Z",
     "shell.execute_reply": "2023-06-13T22:02:15.434469Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "import os\n",
    "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'\n",
    "from tensorflow import get_logger\n",
    "get_logger().setLevel('ERROR')\n",
    "warnings.filterwarnings(\"ignore\", message=\"Setting the random state for TF\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3835af93",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:15.441145Z",
     "iopub.status.busy": "2023-06-13T22:02:15.440433Z",
     "iopub.status.idle": "2023-06-13T22:02:15.449238Z",
     "shell.execute_reply": "2023-06-13T22:02:15.448070Z"
    }
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import scikeras\n",
    "except ImportError:\n",
    "    !python -m pip install scikeras"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "13e4a059",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:15.454764Z",
     "iopub.status.busy": "2023-06-13T22:02:15.452820Z",
     "iopub.status.idle": "2023-06-13T22:02:15.786043Z",
     "shell.execute_reply": "2023-06-13T22:02:15.785131Z"
    }
   },
   "outputs": [],
   "source": [
    "import scipy\n",
    "import numpy as np\n",
    "from scikeras.wrappers import KerasRegressor\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.pipeline import Pipeline\n",
    "from tensorflow import keras"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29c0d314",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "The dataset we'll be using is designed to demostrate a worst-case/best-case scenario for dense and sparse input features respectively.\n",
    "It consists of a single categorical feature with equal number of categories as rows.\n",
    "This means the one-hot encoded representation will require as many columns as it does rows, making it very ineffienct to store as a dense matrix but very efficient to store as a sparse matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3e5ef117",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:15.791342Z",
     "iopub.status.busy": "2023-06-13T22:02:15.790801Z",
     "iopub.status.idle": "2023-06-13T22:02:15.798426Z",
     "shell.execute_reply": "2023-06-13T22:02:15.797605Z"
    }
   },
   "outputs": [],
   "source": [
    "N_SAMPLES = 20_000  # hand tuned to be ~4GB peak\n",
    "\n",
    "X = np.arange(0, N_SAMPLES).reshape(-1, 1)\n",
    "y = np.random.uniform(0, 1, size=(X.shape[0],))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6db3339",
   "metadata": {},
   "source": [
    "## Model\n",
    "\n",
    "The model here is nothing special, just a basic multilayer perceptron with one hidden layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e4d60ef8",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:15.803216Z",
     "iopub.status.busy": "2023-06-13T22:02:15.802474Z",
     "iopub.status.idle": "2023-06-13T22:02:15.809936Z",
     "shell.execute_reply": "2023-06-13T22:02:15.808963Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_clf(meta) -> keras.Model:\n",
    "    n_features_in_ = meta[\"n_features_in_\"]\n",
    "    model = keras.models.Sequential()\n",
    "    model.add(keras.layers.Input(shape=(n_features_in_,)))\n",
    "    # a single hidden layer\n",
    "    model.add(keras.layers.Dense(100, activation=\"relu\"))\n",
    "    model.add(keras.layers.Dense(1))\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27c66471",
   "metadata": {},
   "source": [
    "## Pipelines\n",
    "\n",
    "Here is where it gets interesting.\n",
    "We make two Scikit-Learn pipelines that use `OneHotEncoder`: one that uses `sparse=False` to force a dense matrix as the output and another that uses `sparse=True` (the default)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "33ae5e54",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:15.815498Z",
     "iopub.status.busy": "2023-06-13T22:02:15.813950Z",
     "iopub.status.idle": "2023-06-13T22:02:15.821448Z",
     "shell.execute_reply": "2023-06-13T22:02:15.820612Z"
    }
   },
   "outputs": [],
   "source": [
    "dense_pipeline = Pipeline(\n",
    "    [\n",
    "        (\"encoder\", OneHotEncoder(sparse=False)),\n",
    "        (\"model\", KerasRegressor(get_clf, loss=\"mse\", epochs=5, verbose=False))\n",
    "    ]\n",
    ")\n",
    "\n",
    "sparse_pipeline = Pipeline(\n",
    "    [\n",
    "        (\"encoder\", OneHotEncoder(sparse=True)),\n",
    "        (\"model\", KerasRegressor(get_clf, loss=\"mse\", epochs=5, verbose=False))\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96314520",
   "metadata": {},
   "source": [
    "## Benchmark\n",
    "\n",
    "Our benchmark will be to just train each one of these pipelines and measure peak memory consumption."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "000f39b1",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:02:15.826092Z",
     "iopub.status.busy": "2023-06-13T22:02:15.825512Z",
     "iopub.status.idle": "2023-06-13T22:03:41.465775Z",
     "shell.execute_reply": "2023-06-13T22:03:41.463143Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "peak memory: 3568.23 MiB, increment: 3147.57 MiB\n"
     ]
    }
   ],
   "source": [
    "%memit dense_pipeline.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "469d47e5",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:03:41.471877Z",
     "iopub.status.busy": "2023-06-13T22:03:41.470612Z",
     "iopub.status.idle": "2023-06-13T22:04:19.158321Z",
     "shell.execute_reply": "2023-06-13T22:04:19.157281Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "peak memory: 693.33 MiB, increment: 106.52 MiB\n"
     ]
    }
   ],
   "source": [
    "%memit sparse_pipeline.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cca1b6e8",
   "metadata": {},
   "source": [
    "You should see at least 100x more memory consumption **increment** in the dense pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bca1b449",
   "metadata": {},
   "source": [
    "### Runtime\n",
    "\n",
    "Using sparse inputs can have a drastic impact on memory usage, but it often (not always) hurts overall runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "3631c7e6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:04:19.164821Z",
     "iopub.status.busy": "2023-06-13T22:04:19.162972Z",
     "iopub.status.idle": "2023-06-13T22:15:41.909036Z",
     "shell.execute_reply": "2023-06-13T22:15:41.907949Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1min 25s ± 25.7 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit dense_pipeline.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7ae40148",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:15:41.913559Z",
     "iopub.status.busy": "2023-06-13T22:15:41.913027Z",
     "iopub.status.idle": "2023-06-13T22:18:40.608777Z",
     "shell.execute_reply": "2023-06-13T22:18:40.607706Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22.2 s ± 1.8 s per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit sparse_pipeline.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb5603b3",
   "metadata": {},
   "source": [
    "## Tensorflow Datasets\n",
    "\n",
    "Tensorflow provides a whole suite of functionality around the [Dataset].\n",
    "Datasets are lazily evaluated, can be sparse and minimize the transformations required to feed data into the model.\n",
    "They are _a lot_ more performant and efficient at scale than using numpy datastructures, even sparse ones.\n",
    "\n",
    "SciKeras does not (and cannot) support Datasets directly because Scikit-Learn itself does not support them and SciKeras' outwards API is Scikit-Learn's API.\n",
    "You may want to explore breaking out of SciKeras and just using TensorFlow/Keras directly to see if Datasets can have a large impact for your use case.\n",
    "\n",
    "[Dataset]: https://www.tensorflow.org/api_docs/python/tf/data/Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d854b23",
   "metadata": {},
   "source": [
    "## Bonus: dtypes\n",
    "\n",
    "You might be able to save even more memory by changing the output dtype of `OneHotEncoder`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1492c50f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:18:40.614039Z",
     "iopub.status.busy": "2023-06-13T22:18:40.613597Z",
     "iopub.status.idle": "2023-06-13T22:18:40.619881Z",
     "shell.execute_reply": "2023-06-13T22:18:40.619027Z"
    }
   },
   "outputs": [],
   "source": [
    "sparse_pipline_uint8 = Pipeline(\n",
    "    [\n",
    "        (\"encoder\", OneHotEncoder(sparse=True, dtype=np.uint8)),\n",
    "        (\"model\", KerasRegressor(get_clf, loss=\"mse\", epochs=5, verbose=False))\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f78ac962",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-06-13T22:18:40.622996Z",
     "iopub.status.busy": "2023-06-13T22:18:40.622750Z",
     "iopub.status.idle": "2023-06-13T22:19:05.201191Z",
     "shell.execute_reply": "2023-06-13T22:19:05.200038Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/runner/work/scikeras/scikeras/.venv/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "peak memory: 1010.32 MiB, increment: 0.50 MiB\n"
     ]
    }
   ],
   "source": [
    "%memit sparse_pipline_uint8.fit(X, y)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "ipynb,md"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}