{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "FEL3NlTTDlSX"
},
"source": [
"##### Copyright 2021 The TensorFlow Authors."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"cellView": "form",
"execution": {
"iopub.execute_input": "2024-08-15T02:14:12.043017Z",
"iopub.status.busy": "2024-08-15T02:14:12.042782Z",
"iopub.status.idle": "2024-08-15T02:14:12.046473Z",
"shell.execute_reply": "2024-08-15T02:14:12.045862Z"
},
"id": "FlUw7tSKbtg4"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "77z2OchJTk0l"
},
"source": [
"# Debug a TensorFlow 2 migrated training pipeline\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zTwPu-w6M5sz"
},
"source": [
"This notebook demonstrates how to debug a training pipeline when migrating to TensorFlow 2 (TF2). It consists of following components:\n",
"1. Suggested steps and code samples for debugging training pipeline\n",
"2. Tools for debugging\n",
"3. Other related resources\n",
"\n",
"One assumption is you have the TensorFlow 1 (TF1.x) code and trained models for comparison, and you want to build a TF2 model that achieves similar validation accuracy.\n",
"\n",
"This notebook does **NOT** cover debugging performance issues for training/inference speed or memory usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fKm9R4CtOAP3"
},
"source": [
"## Debugging workflow\n",
"\n",
"Below is a general workflow for debugging your TF2 training pipelines. Note that you do not need to follow these steps in order. You can also use a binary search approach where you test the model in an intermediate step and narrow down the debugging scope. \n",
"\n",
"1. Fix compile and runtime errors\n",
"\n",
"2. Single forward pass validation (in a separate\n",
" [guide](./validate_correctness.ipynb))\n",
"\n",
" a. On single CPU device\n",
"\n",
" * Verify variables are created only once\n",
" * Check variable counts, names, and shapes match\n",
" * Reset all variables, check numerical equivalence with all randomness\n",
" disabled\n",
" * Align random number generation, check numerical equivalence in inference\n",
" * (Optional) Check checkpoints are loaded properly and TF1.x/TF2 models\n",
" generate identical output\n",
"\n",
" b. On single GPU/TPU device\n",
"\n",
" c. With multi-device strategies\n",
"\n",
"3. Model training numerical equivalence validation for a few steps (code\n",
" samples available below)\n",
"\n",
" a. Single training step validation using small and fixed data on single CPU\n",
" device. Specifically, check numerical equivalence for the following\n",
" components\n",
"\n",
" * losses computation\n",
" * metrics\n",
" * learning rate\n",
" * gradient computation and update\n",
"\n",
" b. Check statistics after training 3 or more steps to verify optimizer behaviors like the momentum, still with fixed data on single CPU device\n",
"\n",
" c. On single GPU/TPU device\n",
"\n",
" d. With multi-device strategies (check the intro for [MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) at the bottom)\n",
"\n",
"4. End-to-end convergence testing on real dataset\n",
"\n",
" a. Check training behaviors with TensorBoard\n",
"\n",
" * use simple optimizers e.g., SGD and simple distribution strategies e.g.\n",
" `tf.distribute.OneDeviceStrategy` first\n",
" * training metrics\n",
" * evaluation metrics\n",
" * figure out what the reasonable tolerance for inherent randomness is\n",
"\n",
" b. Check equivalence with advanced optimizer/learning rate\n",
" scheduler/distribution strategies\n",
"\n",
" c. Check equivalence when using mixed precision\n",
"\n",
"5. Additional product benchmarks"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XKakQBI9-FLb"
},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:14:12.049922Z",
"iopub.status.busy": "2024-08-15T02:14:12.049670Z",
"iopub.status.idle": "2024-08-15T02:15:01.407706Z",
"shell.execute_reply": "2024-08-15T02:15:01.406615Z"
},
"id": "i1ghHyXl-Oqd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\r\n",
"tensorflow-datasets 4.9.3 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\r\n",
"tensorflow-metadata 1.15.0 requires protobuf<4.21,>=3.20.3; python_version < \"3.11\", but you have protobuf 3.19.6 which is incompatible.\r\n",
"tf-keras 2.17.0 requires tensorflow<2.18,>=2.17, but you have tensorflow 2.9.3 which is incompatible.\u001b[0m\u001b[31m\r\n",
"\u001b[0m"
]
}
],
"source": [
"# The `DeterministicRandomTestTool` is only available from Tensorflow 2.8:\n",
"!pip install -q \"tensorflow==2.9.*\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "usyRSlIRl3r2"
},
"source": [
"### Single forward pass validation \n",
"\n",
"Single forward pass validation, including checkpoint loading, is covered in a different [colab](./validate_correctness.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:01.412107Z",
"iopub.status.busy": "2024-08-15T02:15:01.411795Z",
"iopub.status.idle": "2024-08-15T02:15:03.408398Z",
"shell.execute_reply": "2024-08-15T02:15:03.407690Z"
},
"id": "HVBQbsZeVL_V"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-08-15 02:15:01.731254: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n"
]
}
],
"source": [
"import sys\n",
"import unittest\n",
"import numpy as np\n",
"\n",
"import tensorflow as tf\n",
"import tensorflow.compat.v1 as v1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4M104dt7m5cC"
},
"source": [
"### Model training numerical equivalence validation for a few steps"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v2Nz2Ni1EkMz"
},
"source": [
"Set up model configuration and prepare a fake dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:03.412472Z",
"iopub.status.busy": "2024-08-15T02:15:03.412017Z",
"iopub.status.idle": "2024-08-15T02:15:03.417011Z",
"shell.execute_reply": "2024-08-15T02:15:03.416438Z"
},
"id": "hUxXadzKU9rT"
},
"outputs": [],
"source": [
"params = {\n",
" 'input_size': 3,\n",
" 'num_classes': 3,\n",
" 'layer_1_size': 2,\n",
" 'layer_2_size': 2,\n",
" 'num_train_steps': 100,\n",
" 'init_lr': 1e-3,\n",
" 'end_lr': 0.0,\n",
" 'decay_steps': 1000,\n",
" 'lr_power': 1.0,\n",
"}\n",
"\n",
"# make a small fixed dataset\n",
"fake_x = np.ones((2, params['input_size']), dtype=np.float32)\n",
"fake_y = np.zeros((2, params['num_classes']), dtype=np.int32)\n",
"fake_y[0][0] = 1\n",
"fake_y[1][1] = 1\n",
"\n",
"step_num = 3"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lV_n3Ukmz4Un"
},
"source": [
"Define the TF1.x model."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:03.420071Z",
"iopub.status.busy": "2024-08-15T02:15:03.419673Z",
"iopub.status.idle": "2024-08-15T02:15:03.428819Z",
"shell.execute_reply": "2024-08-15T02:15:03.428165Z"
},
"id": "ATa5fzL8mAwl"
},
"outputs": [],
"source": [
"# Assume there is an existing TF1.x model using estimator API\n",
"# Wrap the model_fn to log necessary tensors for result comparison\n",
"class SimpleModelWrapper():\n",
" def __init__(self):\n",
" self.logged_ops = {}\n",
" self.logs = {\n",
" 'step': [],\n",
" 'lr': [],\n",
" 'loss': [],\n",
" 'grads_and_vars': [],\n",
" 'layer_out': []}\n",
" \n",
" def model_fn(self, features, labels, mode, params):\n",
" out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n",
" out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n",
" logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n",
" loss = tf.compat.v1.losses.softmax_cross_entropy(labels, logits)\n",
"\n",
" # skip EstimatorSpec details for prediction and evaluation \n",
" if mode == tf.estimator.ModeKeys.PREDICT:\n",
" pass\n",
" if mode == tf.estimator.ModeKeys.EVAL:\n",
" pass\n",
" assert mode == tf.estimator.ModeKeys.TRAIN\n",
"\n",
" global_step = tf.compat.v1.train.get_or_create_global_step()\n",
" lr = tf.compat.v1.train.polynomial_decay(\n",
" learning_rate=params['init_lr'],\n",
" global_step=global_step,\n",
" decay_steps=params['decay_steps'],\n",
" end_learning_rate=params['end_lr'],\n",
" power=params['lr_power'])\n",
" \n",
" optmizer = tf.compat.v1.train.GradientDescentOptimizer(lr)\n",
" grads_and_vars = optmizer.compute_gradients(\n",
" loss=loss,\n",
" var_list=graph.get_collection(\n",
" tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES))\n",
" train_op = optmizer.apply_gradients(\n",
" grads_and_vars,\n",
" global_step=global_step)\n",
" \n",
" # log tensors\n",
" self.logged_ops['step'] = global_step\n",
" self.logged_ops['lr'] = lr\n",
" self.logged_ops['loss'] = loss\n",
" self.logged_ops['grads_and_vars'] = grads_and_vars\n",
" self.logged_ops['layer_out'] = {\n",
" 'layer_1': out_1,\n",
" 'layer_2': out_2,\n",
" 'logits': logits}\n",
"\n",
" return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)\n",
"\n",
" def update_logs(self, logs):\n",
" for key in logs.keys():\n",
" model_tf1.logs[key].append(logs[key])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kki9yILSKS7f"
},
"source": [
"The following [`v1.keras.utils.DeterministicRandomTestTool`](https://www.tensorflow.org/api_docs/python/tf/compat/v1/keras/utils/DeterministicRandomTestTool) class provides a context manager `scope()` that can make stateful random operations use the same seed across both TF1 graphs/sessions and eager execution,\n",
"\n",
"The tool provides two testing modes: \n",
"1. `constant` which uses the same seed for every single operation no matter how many times it has been called and,\n",
"2. `num_random_ops` which uses the number of previously-observed stateful random operations as the operation seed.\n",
"\n",
"This applies both to the stateful random operations used for creating and initializing variables, and to the stateful random operations used in computation (such as for dropout layers)."
]
},
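  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The short sketch below is an illustration, not part of the original guide: with a fresh tool instance for each run (and hypothetical names such as `demo_tool`), a stateful random op should produce the same values in a TF1-style graph/session and in TF2 eager execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch (not from the original guide): with a fresh tool instance per\n",
    "# run, stateful random ops yield the same values in graph and eager execution.\n",
    "demo_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n",
    "with demo_tool.scope():\n",
    "  demo_graph = tf.Graph()\n",
    "  with demo_graph.as_default(), tf.compat.v1.Session(graph=demo_graph) as sess:\n",
    "    graph_value = sess.run(tf.random.normal([3]))\n",
    "\n",
    "demo_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n",
    "with demo_tool.scope():\n",
    "  eager_value = tf.random.normal([3]).numpy()\n",
    "\n",
    "# Both runs observe the same number of stateful random ops, so the values match.\n",
    "np.testing.assert_allclose(graph_value, eager_value)"
   ]
  },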
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:03.431927Z",
"iopub.status.busy": "2024-08-15T02:15:03.431431Z",
"iopub.status.idle": "2024-08-15T02:15:03.706758Z",
"shell.execute_reply": "2024-08-15T02:15:03.706137Z"
},
"id": "X6Y3RWMoKOl8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WARNING:tensorflow:From /tmpfs/tmp/ipykernel_93773/2689227634.py:1: The name tf.keras.utils.DeterministicRandomTestTool is deprecated. Please use tf.compat.v1.keras.utils.DeterministicRandomTestTool instead.\n",
"\n"
]
}
],
"source": [
"random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mk5-ZzxcErX5"
},
"source": [
"Run the TF1.x model in graph mode. Collect statistics for first 3 training steps for numerical equivalence comparison."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:03.710230Z",
"iopub.status.busy": "2024-08-15T02:15:03.709981Z",
"iopub.status.idle": "2024-08-15T02:15:04.586394Z",
"shell.execute_reply": "2024-08-15T02:15:04.585675Z"
},
"id": "r5zhJHvsWA24"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-08-15 02:15:04.252686: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n",
"2024-08-15 02:15:04.252893: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory\n",
"2024-08-15 02:15:04.252990: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory\n",
"2024-08-15 02:15:04.253068: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory\n",
"2024-08-15 02:15:04.324588: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory\n",
"2024-08-15 02:15:04.324786: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n",
"Skipping registering GPU devices...\n",
"/tmpfs/tmp/ipykernel_93773/1984550333.py:14: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n",
" out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n",
"/tmpfs/tmp/ipykernel_93773/1984550333.py:15: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n",
" out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n",
"/tmpfs/tmp/ipykernel_93773/1984550333.py:16: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n",
" logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n"
]
}
],
"source": [
"with random_tool.scope():\n",
" graph = tf.Graph()\n",
" with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n",
" model_tf1 = SimpleModelWrapper()\n",
" # build the model\n",
" inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n",
" labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n",
" spec = model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n",
" train_op = spec.train_op\n",
"\n",
" sess.run(tf.compat.v1.global_variables_initializer())\n",
" for step in range(step_num):\n",
" # log everything and update the model for one step\n",
" logs, _ = sess.run(\n",
" [model_tf1.logged_ops, train_op],\n",
" feed_dict={inputs: fake_x, labels: fake_y})\n",
" model_tf1.update_logs(logs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eZxjI8Nxz9Ea"
},
"source": [
"Define the TF2 model."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:04.590358Z",
"iopub.status.busy": "2024-08-15T02:15:04.589710Z",
"iopub.status.idle": "2024-08-15T02:15:04.598939Z",
"shell.execute_reply": "2024-08-15T02:15:04.598273Z"
},
"id": "AA67rh2TkS1M"
},
"outputs": [],
"source": [
"class SimpleModel(tf.keras.Model):\n",
" def __init__(self, params, *args, **kwargs):\n",
" super(SimpleModel, self).__init__(*args, **kwargs)\n",
" # define the model\n",
" self.dense_1 = tf.keras.layers.Dense(params['layer_1_size'])\n",
" self.dense_2 = tf.keras.layers.Dense(params['layer_2_size'])\n",
" self.out = tf.keras.layers.Dense(params['num_classes'])\n",
" learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(\n",
" initial_learning_rate=params['init_lr'],\n",
" decay_steps=params['decay_steps'],\n",
" end_learning_rate=params['end_lr'],\n",
" power=params['lr_power']) \n",
" self.optimizer = tf.keras.optimizers.legacy.SGD(learning_rate_fn)\n",
" self.compiled_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)\n",
" self.logs = {\n",
" 'lr': [],\n",
" 'loss': [],\n",
" 'grads': [],\n",
" 'weights': [],\n",
" 'layer_out': []}\n",
"\n",
" def call(self, inputs):\n",
" out_1 = self.dense_1(inputs)\n",
" out_2 = self.dense_2(out_1)\n",
" logits = self.out(out_2)\n",
" # log output features for every layer for comparison\n",
" layer_wise_out = {\n",
" 'layer_1': out_1,\n",
" 'layer_2': out_2,\n",
" 'logits': logits}\n",
" self.logs['layer_out'].append(layer_wise_out)\n",
" return logits\n",
"\n",
" def train_step(self, data):\n",
" x, y = data\n",
" with tf.GradientTape() as tape:\n",
" logits = self(x)\n",
" loss = self.compiled_loss(y, logits)\n",
" grads = tape.gradient(loss, self.trainable_weights)\n",
" # log training statistics\n",
" step = self.optimizer.iterations.numpy()\n",
" self.logs['lr'].append(self.optimizer.learning_rate(step).numpy())\n",
" self.logs['loss'].append(loss.numpy())\n",
" self.logs['grads'].append(grads)\n",
" self.logs['weights'].append(self.trainable_weights)\n",
" # update model\n",
" self.optimizer.apply_gradients(zip(grads, self.trainable_weights))\n",
" return"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "I5smAcaEE8nX"
},
"source": [
"Run the TF2 model in eager mode. Collect statistics for first 3 training steps for numerical equivalence comparison."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:04.602294Z",
"iopub.status.busy": "2024-08-15T02:15:04.601726Z",
"iopub.status.idle": "2024-08-15T02:15:04.651956Z",
"shell.execute_reply": "2024-08-15T02:15:04.651281Z"
},
"id": "Q0AbXF_eE8cS"
},
"outputs": [],
"source": [
"random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n",
"with random_tool.scope():\n",
" model_tf2 = SimpleModel(params)\n",
" for step in range(step_num):\n",
" model_tf2.train_step([fake_x, fake_y])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cjJDjLcAz_gU"
},
"source": [
"Compare numerical equivalence for first few training steps.\n",
"\n",
"You can also check the [Validating correctness & numerical equivalence notebook](./validate_correctness.ipynb) for additional advice for numerical equivalence."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:04.655169Z",
"iopub.status.busy": "2024-08-15T02:15:04.654942Z",
"iopub.status.idle": "2024-08-15T02:15:04.696773Z",
"shell.execute_reply": "2024-08-15T02:15:04.695664Z"
},
"id": "6CbCUbsCiabC"
},
"outputs": [],
"source": [
"np.testing.assert_allclose(model_tf1.logs['lr'], model_tf2.logs['lr'])\n",
"np.testing.assert_allclose(model_tf1.logs['loss'], model_tf2.logs['loss'])\n",
"for step in range(step_num):\n",
" for name in model_tf1.logs['layer_out'][step]:\n",
" np.testing.assert_allclose(\n",
" model_tf1.logs['layer_out'][step][name],\n",
" model_tf2.logs['layer_out'][step][name])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dhVuuciimLIY"
},
"source": [
"#### Unit tests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sXZYFC6Hhqeb"
},
"source": [
"There are a few types of unit testing that can help debug your migration code.\n",
"1. Single forward pass validation\n",
"2. Model training numerical equivalence validation for a few steps\n",
"3. Benchmark inference performance\n",
"4. The trained model makes correct predictions on fixed and simple data points\n",
"\n",
"You can use `@parameterized.parameters` to test models with different configurations. [Details with code sample](https://github.com/abseil/abseil-py/blob/master/absl/testing/parameterized.py).\n",
"\n",
"Note that it's possible to run session APIs and eager execution in the same test case. The code snippets below show how."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:04.700763Z",
"iopub.status.busy": "2024-08-15T02:15:04.700246Z",
"iopub.status.idle": "2024-08-15T02:15:04.709897Z",
"shell.execute_reply": "2024-08-15T02:15:04.709284Z"
},
"id": "CdHqkgPPM2Bj"
},
"outputs": [],
"source": [
"import unittest\n",
"\n",
"class TestNumericalEquivalence(unittest.TestCase):\n",
"\n",
" # copied from code samples above\n",
" def setup(self):\n",
" # record statistics for 100 training steps\n",
" step_num = 100\n",
"\n",
" # setup TF 1 model\n",
" random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n",
" with random_tool.scope():\n",
" # run TF1.x code in graph mode with context management\n",
" graph = tf.Graph()\n",
" with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n",
" self.model_tf1 = SimpleModelWrapper()\n",
" # build the model\n",
" inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n",
" labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n",
" spec = self.model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n",
" train_op = spec.train_op\n",
"\n",
" sess.run(tf.compat.v1.global_variables_initializer())\n",
" for step in range(step_num):\n",
" # log everything and update the model for one step\n",
" logs, _ = sess.run(\n",
" [self.model_tf1.logged_ops, train_op],\n",
" feed_dict={inputs: fake_x, labels: fake_y})\n",
" self.model_tf1.update_logs(logs)\n",
"\n",
" # setup TF2 model\n",
" random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n",
" with random_tool.scope():\n",
" self.model_tf2 = SimpleModel(params)\n",
" for step in range(step_num):\n",
" self.model_tf2.train_step([fake_x, fake_y])\n",
" \n",
" def test_learning_rate(self):\n",
" np.testing.assert_allclose(\n",
" self.model_tf1.logs['lr'],\n",
" self.model_tf2.logs['lr'])\n",
"\n",
" def test_training_loss(self):\n",
" # adopt different tolerance strategies before and after 10 steps\n",
" first_n_step = 10\n",
"\n",
" # absolute difference is limited below 1e-5\n",
" # set `equal_nan` to be False to detect potential NaN loss issues\n",
" abosolute_tolerance = 1e-5\n",
" np.testing.assert_allclose(\n",
" actual=self.model_tf1.logs['loss'][:first_n_step],\n",
" desired=self.model_tf2.logs['loss'][:first_n_step],\n",
" atol=abosolute_tolerance,\n",
" equal_nan=False)\n",
" \n",
" # relative difference is limited below 5%\n",
" relative_tolerance = 0.05\n",
" np.testing.assert_allclose(self.model_tf1.logs['loss'][first_n_step:],\n",
" self.model_tf2.logs['loss'][first_n_step:],\n",
" rtol=relative_tolerance,\n",
" equal_nan=False)"
]
},
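  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned above, `@parameterized.parameters` lets you run the same check under several model configurations. Below is a minimal sketch, an illustration rather than part of the original guide: it assumes `absl-py` (shipped with TensorFlow) and reuses `SimpleModel`, `params`, `fake_x`, `fake_y`, and `step_num` from the cells above; the test class and case names are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from absl.testing import parameterized\n",
    "\n",
    "class ParameterizedEquivalenceTest(parameterized.TestCase):\n",
    "\n",
    "  @parameterized.parameters(\n",
    "      {'layer_1_size': 2, 'layer_2_size': 2},\n",
    "      {'layer_1_size': 4, 'layer_2_size': 8},\n",
    "  )\n",
    "  def test_initial_learning_rate(self, layer_1_size, layer_2_size):\n",
    "    test_params = dict(params, layer_1_size=layer_1_size,\n",
    "                       layer_2_size=layer_2_size)\n",
    "    model = SimpleModel(test_params)\n",
    "    for step in range(step_num):\n",
    "      model.train_step([fake_x, fake_y])\n",
    "    # The first logged learning rate should equal the configured initial rate,\n",
    "    # regardless of the layer sizes.\n",
    "    np.testing.assert_allclose(model.logs['lr'][0], params['init_lr'], rtol=1e-6)\n",
    "\n",
    "# To run in a notebook: unittest.main(argv=[''], verbosity=2, exit=False)"
   ]
  },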
{
"cell_type": "markdown",
"metadata": {
"id": "gshSQdKIddpZ"
},
"source": [
"## Debugging tools"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CkMfCaJRclKv"
},
"source": [
"### tf.print\n",
"\n",
"tf.print vs print/logging.info\n",
"\n",
"- With configurable arguments, `tf.print` can recursively display the first and last few elements of each dimension for printed tensors. Check the [API docs](https://www.tensorflow.org/api_docs/python/tf/print) for details.\n",
"- For eager execution, both `print` and `tf.print` print the value of the tensor. But `print` may involve device-to-host copy, which can potentially slow down your code. \n",
"- For graph mode including usage inside `tf.function`, you need to use `tf.print` to print the actual tensor value. `tf.print` is compiled into an op in the graph, whereas `print` and `logging.info` only log at tracing time, which is often not what you want. \n",
"- `tf.print` also supports printing composite tensors like `tf.RaggedTensor` and `tf.sparse.SparseTensor`.\n",
"- You can also use a callback to monitor metrics and variables. Please check how to use custom callbacks with [logs dict](https://www.tensorflow.org/guide/keras/custom_callback#usage_of_logs_dict) and [self.model attribute](https://www.tensorflow.org/guide/keras/custom_callback#usage_of_selfmodel_attribute)."
]
},
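  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, here is a minimal custom-callback sketch (an illustration, not part of the original guide; the `DebugCallback` name is hypothetical) that uses the `logs` dict and the `self.model` attribute to monitor metrics and variables during `Model.fit`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of a custom callback that monitors metrics and variables.\n",
    "class DebugCallback(tf.keras.callbacks.Callback):\n",
    "  def on_epoch_end(self, epoch, logs=None):\n",
    "    logs = logs or {}\n",
    "    tf.print('epoch:', epoch, 'loss:', logs.get('loss'))\n",
    "    # `self.model` gives access to weights, the optimizer, and so on.\n",
    "    tf.print('first weight norm:', tf.norm(self.model.trainable_weights[0]))\n",
    "\n",
    "# Usage with any compiled Keras model:\n",
    "# model.fit(x, y, callbacks=[DebugCallback()])"
   ]
  },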
{
"cell_type": "markdown",
"metadata": {
"id": "S-5h3cX8Dc50"
},
"source": [
"tf.print vs print inside tf.function"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:04.713301Z",
"iopub.status.busy": "2024-08-15T02:15:04.713069Z",
"iopub.status.idle": "2024-08-15T02:15:04.776042Z",
"shell.execute_reply": "2024-08-15T02:15:04.775421Z"
},
"id": "gRED9FMyDKih"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tensor(\"add:0\", shape=(1,), dtype=float32)\n",
"[2]\n"
]
}
],
"source": [
"# `print` prints info of tensor object\n",
"# `tf.print` prints the tensor value\n",
"@tf.function\n",
"def dummy_func(num):\n",
" num += 1\n",
" print(num)\n",
" tf.print(num)\n",
" return num\n",
"\n",
"_ = dummy_func(tf.constant([1.0]))\n",
"\n",
"# Output:\n",
"# Tensor(\"add:0\", shape=(1,), dtype=float32)\n",
"# [2]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3QroLA_zDK2w"
},
"source": [
"tf.distribute.Strategy\n",
"\n",
"- If the `tf.function` containing `tf.print` is executed on the workers, for example when using `TPUStrategy` or `ParameterServerStrategy`, you need to check worker/parameter server logs to find the printed values.\n",
"- For `print` or `logging.info`, logs will be printed on the coordinator when using `ParameterServerStrategy`, and logs will be printed on the STDOUT on worker0 when using TPUs.\n",
"\n",
"tf.keras.Model\n",
"- When using Sequential and Functional API models, if you want to print values, e.g., model inputs or intermediate features after some layers, you have following options.\n",
" 1. [Write a custom layer](https://www.tensorflow.org/guide/keras/custom_layers_and_models) that `tf.print` the inputs. \n",
" 2. Include the intermediate outputs you want to inspect in the model outputs.\n",
"- `tf.keras.layers.Lambda` layers have (de)serialization limitations. To avoid checkpoint loading issues, write a custom subclassed layer instead. Check the [API docs](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda) for more details. \n",
"- You can't `tf.print` intermediate outputs in a `tf.keras.callbacks.LambdaCallback` if you don't have access to the actual values, but instead only to the symbolic Keras tensor objects.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aKazGTr1ZUMG"
},
"source": [
"Option 1: write a custom layer"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:04.779421Z",
"iopub.status.busy": "2024-08-15T02:15:04.779187Z",
"iopub.status.idle": "2024-08-15T02:15:05.295301Z",
"shell.execute_reply": "2024-08-15T02:15:05.294507Z"
},
"id": "8w4aY7wO0B4W"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-0.327884018]\n",
" [-0.109294683]\n",
" [-0.218589365]]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"1/1 [==============================] - ETA: 0s - loss: 0.6077"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r",
"1/1 [==============================] - 0s 402ms/step - loss: 0.6077\n"
]
   }
],
"source": [
"class PrintLayer(tf.keras.layers.Layer):\n",
" def call(self, inputs):\n",
" tf.print(inputs)\n",
" return inputs\n",
"\n",
"def get_model():\n",
" inputs = tf.keras.layers.Input(shape=(1,))\n",
" out_1 = tf.keras.layers.Dense(4)(inputs)\n",
" out_2 = tf.keras.layers.Dense(1)(out_1)\n",
" # use custom layer to tf.print intermediate features\n",
" out_3 = PrintLayer()(out_2)\n",
" model = tf.keras.Model(inputs=inputs, outputs=out_3)\n",
" return model\n",
"\n",
"model = get_model()\n",
"model.compile(optimizer=\"adam\", loss=\"mse\")\n",
"model.fit([1, 2, 3], [0.0, 0.0, 1.0])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KNESOatq7iM9"
},
"source": [
"Option 2: include the intermediate outputs you want to inspect in the model outputs.\n",
"\n",
"Note that in such case, you may need some [customizations](https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit) to use `Model.fit`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2024-08-15T02:15:05.298547Z",
"iopub.status.busy": "2024-08-15T02:15:05.298294Z",
"iopub.status.idle": "2024-08-15T02:15:05.302797Z",
"shell.execute_reply": "2024-08-15T02:15:05.302098Z"
},
"id": "MiifvdLk7g9J"
},
"outputs": [],
"source": [
"def get_model():\n",
" inputs = tf.keras.layers.Input(shape=(1,))\n",
" out_1 = tf.keras.layers.Dense(4)(inputs)\n",
" out_2 = tf.keras.layers.Dense(1)(out_1)\n",
" # include intermediate values in model outputs\n",
" model = tf.keras.Model(\n",
" inputs=inputs,\n",
" outputs={\n",
" 'inputs': inputs,\n",
" 'out_1': out_1,\n",
" 'out_2': out_2})\n",
" return model"
]
},
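  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of such a customization follows (an illustration, not the guide's own code; it assumes `out_2` is the prediction head and the `InspectableModel` name is hypothetical): subclass `tf.keras.Model`, override `train_step` to compute the loss on the prediction head, and build the model functionally with the dictionary outputs above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class InspectableModel(tf.keras.Model):\n",
    "  # Override `train_step` so `Model.fit` works with the dictionary of outputs.\n",
    "  def train_step(self, data):\n",
    "    x, y = data\n",
    "    with tf.GradientTape() as tape:\n",
    "      outputs = self(x, training=True)  # {'inputs': ..., 'out_1': ..., 'out_2': ...}\n",
    "      loss = self.compiled_loss(y, outputs['out_2'])\n",
    "    grads = tape.gradient(loss, self.trainable_weights)\n",
    "    self.optimizer.apply_gradients(zip(grads, self.trainable_weights))\n",
    "    tf.print('out_1:', outputs['out_1'])  # inspect an intermediate feature\n",
    "    return {'loss': loss}\n",
    "\n",
    "inputs = tf.keras.layers.Input(shape=(1,))\n",
    "out_1 = tf.keras.layers.Dense(4)(inputs)\n",
    "out_2 = tf.keras.layers.Dense(1)(out_1)\n",
    "inspectable_model = InspectableModel(\n",
    "    inputs=inputs,\n",
    "    outputs={'inputs': inputs, 'out_1': out_1, 'out_2': out_2})\n",
    "inspectable_model.compile(optimizer='adam', loss='mse')\n",
    "inspectable_model.fit([1, 2, 3], [0.0, 0.0, 1.0], verbose=0)"
   ]
  },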
{
"cell_type": "markdown",
"metadata": {
"id": "MvIKDZpHSLmQ"
},
"source": [
"### pdb\n",
"You can use [pdb](https://docs.python.org/3/library/pdb.html) both in terminal and Colab to inspect intermediate values for debugging.\n"
]
},
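  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch (an illustration, not part of the original guide; `debug_one_step` is a hypothetical helper): pause inside an eager training step to inspect intermediate values interactively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pdb\n",
    "\n",
    "def debug_one_step(model, x, y):\n",
    "  with tf.GradientTape() as tape:\n",
    "    logits = model(x)\n",
    "    loss = model.compiled_loss(y, logits)\n",
    "  pdb.set_trace()  # inspect `logits`, `loss`, `tape.gradient(...)`, then `c` to continue\n",
    "  return loss\n",
    "\n",
    "# Example with the objects defined above (uncomment to drop into the debugger):\n",
    "# debug_one_step(model_tf2, fake_x, fake_y)"
   ]
  },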
{
"cell_type": "markdown",
"metadata": {
"id": "Qu0n4O2umyT7"
},
"source": [
"### Visualize graph with TensorBoard\n",
"\n",
"You can [examine the TensorFlow graph with TensorBoard](https://www.tensorflow.org/tensorboard/graphs). TensorBoard is also [supported on colab](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks). TensorBoard is a great tool to visualize summaries. You can use it to compare learning rate, model weights, gradient scale, training/validation metrics, or even model intermediate outputs between TF1.x model and migrated TF2 model through the training process and seeing if the values look as expected."
]
},
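  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch (an illustration with hypothetical log directories such as `logs/tf1`): write the statistics collected earlier as scalar summaries so the TF1.x and TF2 runs can be compared side by side in TensorBoard."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "writer_tf1 = tf.summary.create_file_writer('logs/tf1')\n",
    "writer_tf2 = tf.summary.create_file_writer('logs/tf2')\n",
    "\n",
    "for step in range(step_num):\n",
    "  with writer_tf1.as_default():\n",
    "    tf.summary.scalar('loss', model_tf1.logs['loss'][step], step=step)\n",
    "    tf.summary.scalar('lr', model_tf1.logs['lr'][step], step=step)\n",
    "  with writer_tf2.as_default():\n",
    "    tf.summary.scalar('loss', model_tf2.logs['loss'][step], step=step)\n",
    "    tf.summary.scalar('lr', model_tf2.logs['lr'][step], step=step)\n",
    "\n",
    "# Then launch TensorBoard, e.g. in a notebook:\n",
    "# %load_ext tensorboard\n",
    "# %tensorboard --logdir logs"
   ]
  },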
{
"cell_type": "markdown",
"metadata": {
"id": "vBnxB6_xzlnT"
},
"source": [
"### TensorFlow Profiler\n",
"\n",
"[TensorFlow Profiler](https://www.tensorflow.org/guide/profiler) can help you visualize the execution timeline on GPUs/TPUs. You can check out this [Colab Demo](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras) for its basic usage."
]
},
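  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch (an illustration with a hypothetical log directory and a fresh `profiled_model` instance): capture a short profile of a few training steps and inspect the timeline in TensorBoard's Profile tab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profiled_model = SimpleModel(params)\n",
    "\n",
    "tf.profiler.experimental.start('logs/profile')\n",
    "for step in range(step_num):\n",
    "  with tf.profiler.experimental.Trace('train', step_num=step, _r=1):\n",
    "    profiled_model.train_step([fake_x, fake_y])\n",
    "tf.profiler.experimental.stop()"
   ]
  },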
{
"cell_type": "markdown",
"metadata": {
"id": "9wNmCSHBpiGM"
},
"source": [
"### MultiProcessRunner\n",
"[MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) is a useful tool when debugging with MultiWorkerMirroredStrategy and ParameterServerStrategy. You can take a look at [this concrete example](https://github.com/keras-team/keras/blob/master/keras/integration_test/mwms_multi_process_runner_test.py) for its usage.\n",
"\n",
"Specifically for the cases of these two strategies, you are recommended to 1) not only have unit tests to cover their flow, 2) but also to attempt to reproduce failures using it in unit test to avoid launch real distributed job every time when they attempt a fix."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "migration_debugging.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
}
},
"nbformat": 4,
"nbformat_minor": 0
}