{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "wlFbFLUghfjo"
},
"source": [
"##### Copyright 2021 The TensorFlow Authors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "4FyfuZX-gTKS"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "I_16rv9EPhB_"
},
"source": [
"# TensorFlow Ranking Keras pipeline for distributed training\n",
"\n",
"\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/ranking/tutorials/ranking_dnn_distributed\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/ranking/blob/master/docs/tutorials/ranking_dnn_distributed.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/ranking/blob/master/docs/tutorials/ranking_dnn_distributed.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/ranking/docs/tutorials/ranking_dnn_distributed.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
" \u003c/td\u003e\n",
"\u003c/table\u003e"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "V8tMYn22vtDV"
},
"source": [
"TensorFlow Ranking can handle heterogeneous dense and sparse features, and scales up to millions of data points. However, building and deploying a learning to rank model to operate at scale creates additional challenges beyond simply designing a model. The Ranking library provides workflow utility classes for building [distributed training](https://www.tensorflow.org/guide/distributed_training) for large-scale ranking applications. For more information about these features, see the TensorFlow Ranking [Overview](../overview).\n",
"\n",
"This tutorial shows you how to build a ranking model that enables a distributed processing strategy by using the Ranking library's support for a pipeline processing architecture.\n",
"\n",
"Note: An advanced version of this code is also available as a [Python script](https://github.com/tensorflow/ranking/blob/master/tensorflow_ranking/examples/keras/antique_ragged.py). The script version supports flags for hyperparameters, and advanced use-cases like [Document Interaction Network](https://research.google/pubs/pub49364)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UxG7i8xbDIDF"
},
"source": [
"## ANTIQUE dataset\n",
"\n",
"In this tutorial, you will build a ranking model for ANTIQUE, a question-answering dataset. Given a query, and a list of answers, the objective is to rank the answers with optimal rank related metrics, such as NDCG. For more details about ranking metrics, review evaluation measures [offline metrics](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Offline_metrics).\n",
"\n",
"[ANTIQUE](https://ciir.cs.umass.edu/downloads/Antique/) is a publicly available dataset for open-domain non-factoid question answering, collected from Yahoo! answers.\n",
"Each question has a list of answers, whose relevance are graded on a scale of 0-4, 0 for irrelevant and 4 for fully relevant.\n",
"The list size can vary depending on the query, so we use a fixed \"list size\" of 50, where the list is either truncated or padded with default values.\n",
"The dataset is split into 2206 queries for training and 200 queries for testing. For more details, please read the technical paper on [arXiv](https://arxiv.org/abs/1905.08957)."
]
},
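{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the metric concrete, the following plain-Python sketch computes NDCG for a single ranked list of graded labels. It assumes the common formulation with gain `2**rel - 1` and a `log2(position + 1)` discount. The `dcg` and `ndcg` helpers are illustrative names written for this note, not part of TensorFlow Ranking; use the library's own metrics for real evaluation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"\n",
"def dcg(relevances, k=None):\n",
"  # Assumed formulation: gain 2**rel - 1, discount log2(position + 1).\n",
"  top = relevances if k is None else relevances[:k]\n",
"  return sum((2 ** rel - 1) / math.log2(rank + 2)\n",
"             for rank, rel in enumerate(top))\n",
"\n",
"\n",
"def ndcg(relevances, k=None):\n",
"  # `relevances` lists the graded labels in the order the model ranked them.\n",
"  ideal = dcg(sorted(relevances, reverse=True), k)\n",
"  return dcg(relevances, k) / ideal if ideal else 0.0\n",
"\n",
"\n",
"perfect = ndcg([4, 2, 0])  # best answer ranked first\n",
"swapped = ndcg([0, 2, 4])  # best answer ranked last\n",
"print(perfect, swapped)"
]
},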
{
"cell_type": "markdown",
"metadata": {
"id": "ucWaXnFazZXD"
},
"source": [
"## Setup\n",
"\n",
"Download and install the TensorFlow Ranking and TensorFlow Serving packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aPmhLkMWgPLO"
},
"outputs": [],
"source": [
"!pip install -q tensorflow-ranking tensorflow-serving-api\n",
"!pip install -U \"tensorflow-text==2.11.*\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9OKDJUjq0rnm"
},
"source": [
"Import TensorFlow Ranking library and useful libraries through the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fmlaz2D5Ux3J"
},
"outputs": [],
"source": [
"import pathlib\n",
"\n",
"import tensorflow as tf\n",
"import tensorflow_ranking as tfr\n",
"import tensorflow_text as tf_text\n",
"from tensorflow_serving.apis import input_pb2\n",
"from google.protobuf import text_format"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JNilCoqq1jJn"
},
"source": [
"## Data preparation\n",
"\n",
"Download training, test data, and vocabulary file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Mwxtsi4wqoOJ"
},
"outputs": [],
"source": [
"!wget -O \"/tmp/train.tfrecords\" \"https://ciir.cs.umass.edu/downloads/Antique/tf-ranking/ELWC/train.tfrecords\"\n",
"!wget -O \"/tmp/test.tfrecords\" \"https://ciir.cs.umass.edu/downloads/Antique/tf-ranking//ELWC/test.tfrecords\"\n",
"!wget -O \"/tmp/vocab.txt\" \"https://ciir.cs.umass.edu/downloads/Antique/tf-ranking/vocab.txt\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T7L-IOmOWm3s"
},
"source": [
"Here, the dataset is saved in a ranking-specific ExampleListWithContext (ELWC) format. Detailed in the next section, shows how to generate and store data in the ELWC format."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6tuna6Td3_UO"
},
"source": [
"### ELWC Data Formats for Ranking\n",
"\n",
"The data for a single question consists of a list of `query_tokens` representing the question (the \"context\"), and a list of answers (the \"examples\"). Each answer is represented as a list of `document_tokens` and a `relevance` score. The following code shows a _simplified_ representation of a question's data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9uqSmfoaQc_-"
},
"outputs": [],
"source": [
"example_list_with_context = {\n",
" \"context\": {\n",
" \"query_tokens\": [\"this\", \"is\", \"a\", \"question\"]\n",
" },\n",
" \"examples\": [\n",
" {\n",
" \"document_tokens\": [\"this\", \"is\", \"a\", \"relevant\", \"answer\"],\n",
" \"relevance\": [4]\n",
" },\n",
" {\n",
" \"document_tokens\": [\"irrelevant\", \"data\"],\n",
" \"relevance\": [0]\n",
" }\n",
" ]\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "95QnMu1cyPYA"
},
"source": [
"The data files, downloaded in the previous section, contain a serialized [protobuffer](https://developers.google.com/protocol-buffers/) representation of this sort of data.\n",
"These protobuffers are quite long when viewed as text, but encode the same data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ooOmCPHbyd02"
},
"outputs": [],
"source": [
"CONTEXT = text_format.Parse(\n",
" \"\"\"\n",
" features {\n",
" feature {\n",
" key: \"query_tokens\"\n",
" value { bytes_list { value: [\"this\", \"is\", \"a\", \"question\"] } }\n",
" }\n",
" }\"\"\", tf.train.Example())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eE7hpEBBykVS"
},
"outputs": [],
"source": [
"EXAMPLES = [\n",
" text_format.Parse(\n",
" \"\"\"\n",
" features {\n",
" feature {\n",
" key: \"document_tokens\"\n",
" value { bytes_list { value: [\"this\", \"is\", \"a\", \"relevant\", \"answer\"] } }\n",
" }\n",
" feature {\n",
" key: \"relevance\"\n",
" value { int64_list { value: 4 } }\n",
" }\n",
" }\"\"\", tf.train.Example()),\n",
" text_format.Parse(\n",
" \"\"\"\n",
" features {\n",
" feature {\n",
" key: \"document_tokens\"\n",
" value { bytes_list { value: [\"irrelevant\", \"data\"] } }\n",
" }\n",
" feature {\n",
" key: \"relevance\"\n",
" value { int64_list { value: 0 } }\n",
" }\n",
" }\"\"\", tf.train.Example()),\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iImhGlJyyo--",
"outputId": "5c3c018d-c7e2-491c-8426-cdee999b9251"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"examples {\n",
" features {\n",
" feature {\n",
" key: \"document_tokens\"\n",
" value {\n",
" bytes_list {\n",
" value: \"this\"\n",
" value: \"is\"\n",
" value: \"a\"\n",
" value: \"relevant\"\n",
" value: \"answer\"\n",
" }\n",
" }\n",
" }\n",
" feature {\n",
" key: \"relevance\"\n",
" value {\n",
" int64_list {\n",
" value: 4\n",
" }\n",
" }\n",
" }\n",
" }\n",
"}\n",
"examples {\n",
" features {\n",
" feature {\n",
" key: \"document_tokens\"\n",
" value {\n",
" bytes_list {\n",
" value: \"irrelevant\"\n",
" value: \"data\"\n",
" }\n",
" }\n",
" }\n",
" feature {\n",
" key: \"relevance\"\n",
" value {\n",
" int64_list {\n",
" value: 0\n",
" }\n",
" }\n",
" }\n",
" }\n",
"}\n",
"context {\n",
" features {\n",
" feature {\n",
" key: \"query_tokens\"\n",
" value {\n",
" bytes_list {\n",
" value: \"this\"\n",
" value: \"is\"\n",
" value: \"a\"\n",
" value: \"question\"\n",
" }\n",
" }\n",
" }\n",
" }\n",
"}\n",
"\n"
]
}
],
"source": [
"ELWC = input_pb2.ExampleListWithContext()\n",
"ELWC.context.CopyFrom(CONTEXT)\n",
"for example in EXAMPLES:\n",
" example_features = ELWC.examples.add()\n",
" example_features.CopyFrom(example)\n",
"\n",
"print(ELWC)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fCjUs4tk0oej"
},
"source": [
"While the text format is verbose, protos can be efficiently serialized to a byte string (and parsed back into a proto)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bt3c6fUOKkAG",
"outputId": "949bd7be-8785-48d0-ef1d-c1bbefdc44f9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b\"\\nL\\nJ\\n4\\n\\x0fdocument_tokens\\x12!\\n\\x1f\\n\\x04this\\n\\x02is\\n\\x01a\\n\\x08relevant\\n\\x06answer\\n\\x12\\n\\trelevance\\x12\\x05\\x1a\\x03\\n\\x01\\x04\\n?\\n=\\n\\x12\\n\\trelevance\\x12\\x05\\x1a\\x03\\n\\x01\\x00\\n'\\n\\x0fdocument_tokens\\x12\\x14\\n\\x12\\n\\nirrelevant\\n\\x04data\\x12-\\n+\\n)\\n\\x0cquery_tokens\\x12\\x19\\n\\x17\\n\\x04this\\n\\x02is\\n\\x01a\\n\\x08question\"\n"
]
}
],
"source": [
"serialized_elwc = ELWC.SerializeToString()\n",
"print(serialized_elwc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DI8maZPqKxQ0"
},
"source": [
"The following parser configuration parses the binary representation into a dictionary of tensors:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EdhG7t3MGr1G",
"outputId": "c634ede5-00e4-4358-f035-00d18513a480"
},
"outputs": [
{
"data": {
"text/plain": [
"{'_list_size_': \u003ctf.Tensor: shape=(1,), dtype=int32, numpy=array([2], dtype=int32)\u003e,\n",
" '_mask_': \u003ctf.Tensor: shape=(1, 2), dtype=bool, numpy=array([[ True, True]])\u003e,\n",
" 'document_tokens': \u003ctf.RaggedTensor [[[b'this', b'is', b'a', b'relevant', b'answer'], [b'irrelevant', b'data']]]\u003e,\n",
" 'query_tokens': \u003ctf.RaggedTensor [[b'this', b'is', b'a', b'question']]\u003e,\n",
" 'relevance': \u003ctf.Tensor: shape=(1, 2), dtype=int64, numpy=array([[4, 0]])\u003e}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def parse_elwc(elwc):\n",
" return tfr.data.parse_from_example_list(\n",
" [elwc],\n",
" list_size=2,\n",
" context_feature_spec={\"query_tokens\": tf.io.RaggedFeature(dtype=tf.string)},\n",
" example_feature_spec={\n",
" \"document_tokens\":\n",
" tf.io.RaggedFeature(dtype=tf.string),\n",
" \"relevance\":\n",
" tf.io.FixedLenFeature(shape=[], dtype=tf.int64, default_value=0)\n",
" },\n",
" size_feature_name=\"_list_size_\",\n",
" mask_feature_name=\"_mask_\")\n",
"\n",
"parse_elwc(serialized_elwc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nCluYFt8LSP-"
},
"source": [
"Note with ELWC, you could also generate `size` and/or `mask` features to indicate the valid size and/or to mask out the valid entries in the list as long as `size_feature_name` and/or `mask_feature_name` are defined."
]
},
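{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what padding, truncation, and the mask feature describe, the sketch below brings a list of labels to a fixed `list_size` and builds the matching mask in plain Python. `pad_or_truncate` is a hypothetical helper written for this note; the actual work is done by `tfr.data.parse_from_example_list`, whose internals may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pad_or_truncate(items, list_size, pad_value):\n",
"  # Hypothetical helper: returns (fixed_size_list, mask).\n",
"  # The mask is True for real entries and False for padding.\n",
"  kept = list(items[:list_size])\n",
"  num_padding = list_size - len(kept)\n",
"  mask = [True] * len(kept) + [False] * num_padding\n",
"  return kept + [pad_value] * num_padding, mask\n",
"\n",
"\n",
"labels, mask = pad_or_truncate([4, 0], list_size=3, pad_value=-1)\n",
"print(labels)  # [4, 0, -1]\n",
"print(mask)  # [True, True, False]"
]
},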
{
"cell_type": "markdown",
"metadata": {
"id": "scY9IVwMThDi"
},
"source": [
"The above parser is defined in `tfr.data` and wrapped in our predefined dataset builder `tfr.keras.pipeline.BaseDatasetBuilder`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tFbFBTUh9WXf"
},
"source": [
"## Overview of the ranking pipeline\n",
"\n",
"Follow the steps depicted in the figure below to train a ranking model with ranking pipeline. In particular, this example uses the `tfr.keras.model.FeatureSpecInputCreator` and `tfr.keras.pipeline.BaseDatasetBuilder` defined specific for the datasets with `feature_spec`.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aQ-VTA56sOTA"
},
"source": [
"## Create a model builder\n",
"\n",
"Instead of directly building a `tf.keras.Model` object, create a model_builder, which is called in the ranking pipeline to build the `tf.keras.Model`, as all training parameters must be defined under the `strategy.scope` (called in `train_and_validate` function in ranking pipeline) in order to train with distributed strategies.\n",
"\n",
"This framework uses the [keras functional api](https://www.tensorflow.org/guide/keras/functional) to build models, where inputs (`tf.keras.Input`), preprocessors (`tf.keras.layers.experimental.preprocessing`), and scorer (`tf.keras.Sequential`) are required to define the model."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "at0nVKnts8Pn"
},
"source": [
"### Specify Features\n",
"\n",
"[Feature Specification](https://www.tensorflow.org/api_docs/python/tf/io) are TensorFlow abstractions that are used to capture rich information about each feature.\n",
"\n",
"Create feature specifications for context features, example features, and labels, consistent with the input formats for ranking, such as ELWC format.\n",
"\n",
"The `default_value` of `label_spec` feature is set to -1 to take care of the padding items to be masked out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nSXd4pEPqaQW"
},
"outputs": [],
"source": [
"context_feature_spec = {\n",
" \"query_tokens\": tf.io.RaggedFeature(dtype=tf.string),\n",
"}\n",
"example_feature_spec = {\n",
" \"document_tokens\":\n",
" tf.io.RaggedFeature(dtype=tf.string),\n",
"}\n",
"label_spec = (\n",
" \"relevance\",\n",
" tf.io.FixedLenFeature(shape=(1,), dtype=tf.int64, default_value=-1)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dw78n3UVZLnR"
},
"source": [
"### Define `input_creator`\n",
"\n",
"`input_creator` create dictionaries of context and example `tf.keras.Input`s for input features defined in `context_feature_spec` and `example_feature_spec`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4AAidWeFY8_h"
},
"outputs": [],
"source": [
"input_creator = tfr.keras.model.FeatureSpecInputCreator(\n",
" context_feature_spec, example_feature_spec)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pmjkqpY0_zoH"
},
"source": [
"Callling the `input_creator` returns the dictionaries of Keras-Tensors, that are used as the inputs when building the model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "F4w_7U1m_3Ev",
"outputId": "ce2cfdca-9025-41d6-fb99-7180595ae0bf"
},
"outputs": [
{
"data": {
"text/plain": [
"({'query_tokens': \u003cKerasTensor: type_spec=RaggedTensorSpec(TensorShape([None, None]), tf.string, 1, tf.int64) (created by layer 'query_tokens')\u003e},\n",
" {'document_tokens': \u003cKerasTensor: type_spec=RaggedTensorSpec(TensorShape([None, None, None]), tf.string, 2, tf.int64) (created by layer 'document_tokens')\u003e})"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_creator()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dg0N0FscZVin"
},
"source": [
"### Define `preprocessor`\n",
"\n",
"In the `preprocessor`, the input tokens are converted to a one-hot vector through the String Lookup preprocessing layer and then embeded as an embedding vector through the Embedding preprocessing layer. Finally, compute an embedding vector for the full sentence by the average of token embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2eSAoISdtQ9r"
},
"outputs": [],
"source": [
"class LookUpTablePreprocessor(tfr.keras.model.Preprocessor):\n",
"\n",
" def __init__(self, vocab_file, vocab_size, embedding_dim):\n",
" self._vocab_file = vocab_file\n",
" self._vocab_size = vocab_size\n",
" self._embedding_dim = embedding_dim\n",
"\n",
" def __call__(self, context_inputs, example_inputs, mask):\n",
" list_size = tf.shape(mask)[1]\n",
" lookup = tf.keras.layers.StringLookup(\n",
" max_tokens=self._vocab_size,\n",
" vocabulary=self._vocab_file,\n",
" mask_token=None)\n",
" embedding = tf.keras.layers.Embedding(\n",
" input_dim=self._vocab_size,\n",
" output_dim=self._embedding_dim,\n",
" embeddings_initializer=None,\n",
" embeddings_constraint=None)\n",
" # StringLookup and Embedding are shared over context and example features.\n",
" context_features = {\n",
" key: tf.reduce_mean(embedding(lookup(value)), axis=-2)\n",
" for key, value in context_inputs.items()\n",
" }\n",
" example_features = {\n",
" key: tf.reduce_mean(embedding(lookup(value)), axis=-2)\n",
" for key, value in example_inputs.items()\n",
" }\n",
" return context_features, example_features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VhIxo4hOZY0L"
},
"outputs": [],
"source": [
"_VOCAB_FILE = '/tmp/vocab.txt'\n",
"_VOCAB_SIZE = len(pathlib.Path(_VOCAB_FILE).read_text().split())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Vd2kk-kGb4sv"
},
"outputs": [],
"source": [
"preprocessor = LookUpTablePreprocessor(_VOCAB_FILE, _VOCAB_SIZE, 20)"
]
},
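{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lookup-then-average idea behind the preprocessor can be sketched without TensorFlow. The toy vocabulary and embedding table below are made-up values for illustration only, and reserving index 0 for unknown tokens is an assumption of this sketch. In the real preprocessor the indices come from the vocabulary file and the embeddings are learned during training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical toy data; index 0 is reserved for out-of-vocabulary tokens.\n",
"TOY_VOCAB = {'this': 1, 'is': 2, 'a': 3, 'question': 4}\n",
"TOY_EMBEDDINGS = [\n",
"    [0.0, 0.0],  # out-of-vocabulary\n",
"    [1.0, 0.0],  # this\n",
"    [0.0, 1.0],  # is\n",
"    [1.0, 1.0],  # a\n",
"    [2.0, 2.0],  # question\n",
"]\n",
"\n",
"\n",
"def embed_sentence(tokens):\n",
"  # Map each token to its index, look up its vector, then average per dimension.\n",
"  ids = [TOY_VOCAB.get(token, 0) for token in tokens]\n",
"  vectors = [TOY_EMBEDDINGS[i] for i in ids]\n",
"  return [sum(dim) / len(vectors) for dim in zip(*vectors)]\n",
"\n",
"\n",
"sentence_embedding = embed_sentence(['this', 'is', 'a', 'question'])\n",
"print(sentence_embedding)  # [1.0, 1.0]"
]
},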
{
"cell_type": "markdown",
"metadata": {
"id": "lMo8KPBwvmIw"
},
"source": [
"Note that the vocabulary uses the same tokenizer that BERT does. You could also use [BertTokenizer](https://www.tensorflow.org/text/guide/subwords_tokenizer) to tokenize the raw sentences."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VvqaOQjSv1l2",
"outputId": "737015a0-7889-4645-9956-b34172ddd561"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u003ctf.RaggedTensor [[[7592], [23435, 12314], [999]]]\u003e\n",
"\u003ctf.RaggedTensor [[[b'hello'], [b'tensorflow'], [b'!']]]\u003e\n"
]
}
],
"source": [
"tokenizer = tf_text.BertTokenizer(_VOCAB_FILE)\n",
"example_tokens = tokenizer.tokenize(\"Hello TensorFlow!\".lower())\n",
"\n",
"print(example_tokens)\n",
"print(tokenizer.detokenize(example_tokens))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ggv8lIpkekng"
},
"source": [
"### Define `scorer`\n",
"\n",
"This example uses a Deep Neural Network (DNN) univariate scorer, predefined in TensorFlow Ranking."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "k31ZYnnAfIpe"
},
"outputs": [],
"source": [
"scorer = tfr.keras.model.DNNScorer(\n",
" hidden_layer_dims=[64, 32, 16],\n",
" output_units=1,\n",
" activation=tf.nn.relu,\n",
" use_batch_norm=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3PwsTZ5PiZBN"
},
"source": [
"### Make `model_builder`\n",
"\n",
"In addition to `input_creator`, `preprocessor`, and `scorer`, specify the mask feature name to take the mask feature generated in datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5uMDkiSOifL2"
},
"outputs": [],
"source": [
"model_builder = tfr.keras.model.ModelBuilder(\n",
" input_creator=input_creator,\n",
" preprocessor=preprocessor,\n",
" scorer=scorer,\n",
" mask_feature_name=\"example_list_mask\",\n",
" name=\"antique_model\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6NY35Ct_-TTr"
},
"source": [
"Check the model architecture,"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-hvxdkNF-P76",
"outputId": "1ed4c533-92fa-41b1-eb45-456fd1e04138"
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"\u003cIPython.core.display.Image object\u003e"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = model_builder.build()\n",
"tf.keras.utils.plot_model(model, expand_nested=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZGJ6rRJyZmiB"
},
"source": [
"## Create a dataset builder\n",
"\n",
"A `dataset_builder` is designed to create datasets for training and validation and to define [signatures](https://www.tensorflow.org/guide/saved_model#specifying_signatures_during_export) for exporting trained model as `tf.function`. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-AycDBrIoct_"
},
"source": [
"### Specify data hyperparameters\n",
"\n",
"Define the hyperparameters to be used to build datasets in `dataset_builder` by creating a `dataset_hparams` object.\n",
"\n",
"Load training dataset at `/tmp/train.tfrecords` with `tf.data.TFRecordDataset` reader. In each batch, each feature tensor has a shape `(batch_size, list_size, feature_sizes)` with `batch_size` equal to 32 and `list_size` equal to 50. Validate with the test data at `/tmp/test.tfrecords` at the same `batch_size` and `list_size`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PvAPaDXPjJ3L"
},
"outputs": [],
"source": [
"dataset_hparams = tfr.keras.pipeline.DatasetHparams(\n",
" train_input_pattern=\"/tmp/train.tfrecords\",\n",
" valid_input_pattern=\"/tmp/test.tfrecords\",\n",
" train_batch_size=32,\n",
" valid_batch_size=32,\n",
" list_size=50,\n",
" dataset_reader=tf.data.TFRecordDataset)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oPg0FRazjo8P"
},
"source": [
"### Make `dataset_builder`\n",
"\n",
"TensorFlow Ranking provides a pre-defined `SimpleDatasetBuilder` to generate datasets from ELWC using `feature_spec`s. As a mask feature is used to determine valid examples in each padded list, must specify the `mask_feature_name` consistent with the `mask_feature_name` used in `model_builder`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JtdPgebnjC18"
},
"outputs": [],
"source": [
"dataset_builder = tfr.keras.pipeline.SimpleDatasetBuilder(\n",
" context_feature_spec,\n",
" example_feature_spec,\n",
" mask_feature_name=\"example_list_mask\",\n",
" label_spec=label_spec,\n",
" hparams=dataset_hparams)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "n3rK7Yb0_RL8",
"outputId": "2ac33742-8bb2-42df-8827-b10589155a28"
},
"outputs": [
{
"data": {
"text/plain": [
"({'document_tokens': RaggedTensorSpec(TensorShape([None, 50, None]), tf.string, 2, tf.int32),\n",
" 'example_list_mask': TensorSpec(shape=(32, 50), dtype=tf.bool, name=None),\n",
" 'query_tokens': RaggedTensorSpec(TensorShape([32, None]), tf.string, 1, tf.int32)},\n",
" TensorSpec(shape=(32, 50), dtype=tf.float32, name=None))"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds_train = dataset_builder.build_train_dataset()\n",
"ds_train.element_spec"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EIhdYyRok9Li"
},
"source": [
"## Create a ranking pipeline\n",
"\n",
"A `ranking_pipeline` is an optimized ranking model training package that implement distributed training, export model as `tf.function`, and integrate useful callbacks including tensorboard and restoring upon failures."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RqM6WW-Kp95E"
},
"source": [
"### Specify pipeline hyperparameters\n",
"\n",
"Specify the hyperparameters to be used to run the pipeline in `ranking_pipeline` by creating a `pipeline_hparams` object. \n",
"\n",
"Train the model with `approx_ndcg_loss` at learning rate equal to 0.05 for 5 epoch with 1000 steps in each epoch using `MirroredStrategy`. Evaluate the model on the validation dataset for 100 steps after each epoch. Save the trained model under `/tmp/ranking_model_dir`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tx1r0ef7lLEG"
},
"outputs": [],
"source": [
"pipeline_hparams = tfr.keras.pipeline.PipelineHparams(\n",
" model_dir=\"/tmp/ranking_model_dir\",\n",
" num_epochs=5,\n",
" steps_per_epoch=1000,\n",
" validation_steps=100,\n",
" learning_rate=0.05,\n",
" loss=\"approx_ndcg_loss\",\n",
" strategy=\"MirroredStrategy\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UMOA9Cojln6K"
},
"source": [
"### Define `ranking_pipeline`\n",
"\n",
"TensorFlow Ranking provides a pre-defined `SimplePipeline` to support model training with distributed strategies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gLC-8SwamAvh",
"outputId": "ba429aa6-b657-4506-f4c6-4b1c7deaad0d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)\n"
]
}
],
"source": [
"ranking_pipeline = tfr.keras.pipeline.SimplePipeline(\n",
" model_builder,\n",
" dataset_builder=dataset_builder,\n",
" hparams=pipeline_hparams)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_EPvXEbomK29"
},
"source": [
"## Train and evaluate the model\n",
"\n",
"The `train_and_validate` function evaluates the trained model on the validation dataset after every epoch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BZr8MX6VmQSj",
"outputId": "2a923f85-7cbd-45f4-fb92-c249aeae0880"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/5\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/indexed_slices.py:450: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor(\"gradient_tape/antique_model/flatten_list_2/RaggedGatherNd/RaggedGatherNd/RaggedGather/Reshape_1:0\", shape=(1600,), dtype=int32, device=/job:localhost/replica:0/task:0/device:CPU:0), values=Tensor(\"gradient_tape/antique_model/flatten_list_2/RaggedGatherNd/RaggedGatherNd/RaggedGather/Reshape:0\", shape=(1600, 20), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0), dense_shape=Tensor(\"gradient_tape/antique_model/flatten_list_2/RaggedGatherNd/RaggedGatherNd/RaggedGather/Cast:0\", shape=(2,), dtype=int32, device=/job:localhost/replica:0/task:0/device:CPU:0))) to a dense Tensor of unknown shape. This may consume a large amount of memory.\n",
" \"shape. This may consume a large amount of memory.\" % value)\n",
"/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/indexed_slices.py:450: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor(\"gradient_tape/while/antique_model/flatten_list_2/RaggedGatherNd/RaggedGatherNd/RaggedGather/Reshape_1:0\", shape=(1600,), dtype=int32, device=/job:localhost/replica:0/task:0/device:CPU:0), values=Tensor(\"gradient_tape/while/antique_model/flatten_list_2/RaggedGatherNd/RaggedGatherNd/RaggedGather/Reshape:0\", shape=(1600, 20), dtype=float32, device=/job:localhost/replica:0/task:0/device:CPU:0), dense_shape=Tensor(\"gradient_tape/while/antique_model/flatten_list_2/RaggedGatherNd/RaggedGatherNd/RaggedGather/Cast:0\", shape=(2,), dtype=int32, device=/job:localhost/replica:0/task:0/device:CPU:0))) to a dense Tensor of unknown shape. This may consume a large amount of memory.\n",
" \"shape. This may consume a large amount of memory.\" % value)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000/1000 [==============================] - 121s 121ms/step - loss: -0.8845 - metric/ndcg_1: 0.7122 - metric/ndcg_5: 0.7813 - metric/ndcg_10: 0.8413 - metric/ndcg: 0.8856 - val_loss: -0.8672 - val_metric/ndcg_1: 0.6557 - val_metric/ndcg_5: 0.7689 - val_metric/ndcg_10: 0.8243 - val_metric/ndcg: 0.8678\n",
"Epoch 2/5\n",
"1000/1000 [==============================] - 88s 88ms/step - loss: -0.8957 - metric/ndcg_1: 0.7428 - metric/ndcg_5: 0.8005 - metric/ndcg_10: 0.8551 - metric/ndcg: 0.8959 - val_loss: -0.8731 - val_metric/ndcg_1: 0.6614 - val_metric/ndcg_5: 0.7812 - val_metric/ndcg_10: 0.8348 - val_metric/ndcg: 0.8733\n",
"Epoch 3/5\n",
"1000/1000 [==============================] - 50s 50ms/step - loss: -0.8955 - metric/ndcg_1: 0.7422 - metric/ndcg_5: 0.7991 - metric/ndcg_10: 0.8545 - metric/ndcg: 0.8957 - val_loss: -0.8695 - val_metric/ndcg_1: 0.6414 - val_metric/ndcg_5: 0.7759 - val_metric/ndcg_10: 0.8315 - val_metric/ndcg: 0.8699\n",
"Epoch 4/5\n",
"1000/1000 [==============================] - 53s 53ms/step - loss: -0.9009 - metric/ndcg_1: 0.7563 - metric/ndcg_5: 0.8094 - metric/ndcg_10: 0.8620 - metric/ndcg: 0.9011 - val_loss: -0.8624 - val_metric/ndcg_1: 0.6179 - val_metric/ndcg_5: 0.7627 - val_metric/ndcg_10: 0.8253 - val_metric/ndcg: 0.8626\n",
"Epoch 5/5\n",
"1000/1000 [==============================] - 52s 52ms/step - loss: -0.9042 - metric/ndcg_1: 0.7646 - metric/ndcg_5: 0.8152 - metric/ndcg_10: 0.8662 - metric/ndcg: 0.9044 - val_loss: -0.8733 - val_metric/ndcg_1: 0.6579 - val_metric/ndcg_5: 0.7741 - val_metric/ndcg_10: 0.8362 - val_metric/ndcg: 0.8741\n",
"INFO:tensorflow:Assets written to: /tmp/ranking_model_dir/export/latest_model/assets\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:tensorflow:Assets written to: /tmp/ranking_model_dir/export/latest_model/assets\n"
]
}
],
"source": [
"ranking_pipeline.train_and_validate(verbose=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "45WYaJNaGfLM"
},
"source": [
"### Launch TensorBoard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sHfuUVQ5D1jq"
},
"outputs": [],
"source": [
"%load_ext tensorboard\n",
"%tensorboard --logdir=\"/tmp/ranking_model_dir\" --port 12345"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S1s1BKWSP8p_"
},
"source": [
"\u003c!-- \u003cimg class=\"tfo-display-only-on-site\" src=\"https://user-images.githubusercontent.com/18746174/136845677-8cd41b8f-0a1a-4b38-b905-779966839e5f.png\" /\u003e --\u003e"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "juSvnOWiSbVw"
},
"source": [
"### Generate predictions and evaluate\n",
"\n",
"Get the test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a23f4q4wSbIQ"
},
"outputs": [],
"source": [
"ds_test = dataset_builder.build_valid_dataset()\n",
"\n",
"# Get input features from the first batch of the test data\n",
"for x, y in ds_test.take(1):\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3werWYrkfYPV"
},
"source": [
"Load the saved model and run a prediction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FgBWzIzXfanI"
},
"outputs": [],
"source": [
"loaded_model = tf.keras.models.load_model(\"/tmp/ranking_model_dir/export/latest_model\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-E4XmBSbSyFY"
},
"outputs": [],
"source": [
"# Predict ranking scores\n",
"scores = loaded_model.predict(x)\n",
"min_score = tf.reduce_min(scores)\n",
"scores = tf.where(tf.greater_equal(y, 0.), scores, min_score - 1e-5)\n",
"\n",
"# Sort the answers by scores\n",
"sorted_answers = tfr.utils.sort_by_scores(\n",
" scores,\n",
" [tf.strings.reduce_join(x['document_tokens'], -1, separator=' ')])[0]"
]
},
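{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tf.where` call above relies on padded entries carrying the label -1 (the `default_value` of `label_spec`): their scores are replaced by a value just below the smallest score, so they sort to the end of the list. The plain-Python sketch below shows the same idea on a toy list. `mask_padded_scores` is a hypothetical helper written for this note, not a TensorFlow Ranking function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mask_padded_scores(scores, labels, epsilon=1e-5):\n",
"  # Hypothetical helper: entries with a negative label are treated as padding\n",
"  # and pushed just below the lowest score in the list.\n",
"  floor = min(scores) - epsilon\n",
"  return [s if y >= 0 else floor for s, y in zip(scores, labels)]\n",
"\n",
"\n",
"toy_scores = [0.2, 1.5, 0.9]\n",
"toy_labels = [3.0, -1.0, 0.0]  # -1 marks a padded slot\n",
"masked = mask_padded_scores(toy_scores, toy_labels)\n",
"ranking = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)\n",
"print(ranking)  # [2, 0, 1]: the padded entry at index 1 comes last"
]
},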
{
"cell_type": "markdown",
"metadata": {
"id": "C2pUkpuTTVFh"
},
"source": [
"Check the top 5 answers for question number 4."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8lHaFjyiTfo-",
"outputId": "1f1fbb08-572d-48c7-9acd-1c0a8df7e962"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q: why do people ask questions they know ?\n",
"A1: because it re ##as ##ures them that they were right in the first place .\n",
"A2: people like to that be ##cao ##use they want to be recognise that they are the one knows the answer and the questions int ##he first place .\n",
"A3: to rev ##ali ##date their knowledge and perhaps they choose answers that are mostly with their side simply because they are being subjective . . . .\n",
"A4: so they can weasel out the judge mental and super ##ci ##lio ##us know all cr ##aa ##p like yourself . . . don ##t judge others , what gives you the right ? . . how do you know what others know . ? . . by asking this question you are putting yourself in the same league as the others you want ot condemn . . face it you already know what your shallow , self absorbed answer is . . . get a reality check pill ##ock , . . . and if you want to go gr ##iz ##z ##ling to the yahoo policeman bring it on . . it will only reinforce my answer and the pathetic ##iness of your q ##est ##ion . . . the only thing you could do that would be even more pathetic is give me the top answer award . . . then you would suck beyond all measure\n",
"A5: human nature i guess . i have noticed that too . maybe it is just for re ##ass ##urance or approval .\n"
]
}
],
"source": [
"question = tf.strings.reduce_join(\n",
" x['query_tokens'][4, :], -1, separator=' ').numpy()\n",
"top_answers = sorted_answers[4, :5].numpy()\n",
"\n",
"print(\n",
" f'Q: {question.decode()}\\n' +\n",
" '\\n'.join([f'A{i+1}: {ans.decode()}' for i, ans in enumerate(top_answers)]))"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "ranking_dnn_distributed.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}