{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "oL9KopJirB2g"
},
"source": [
"##### Copyright 2018 The TensorFlow Authors."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"cellView": "form",
"execution": {
"iopub.execute_input": "2024-07-19T13:01:47.063694Z",
"iopub.status.busy": "2024-07-19T13:01:47.063172Z",
"iopub.status.idle": "2024-07-19T13:01:47.067048Z",
"shell.execute_reply": "2024-07-19T13:01:47.066505Z"
},
"id": "SKaX3Hd3ra6C"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AXH1bmUctMld"
},
"source": [
"# Unicode strings\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LrHJrKYis06U"
},
"source": [
"## Introduction\n",
"\n",
"NLP models often handle different languages with different character sets. *Unicode* is a standard encoding system that is used to represent characters from almost all languages. Every Unicode character is encoded using a unique integer [code point](https://en.wikipedia.org/wiki/Code_point) between `0` and `0x10FFFF`. A *Unicode string* is a sequence of zero or more code points.\n",
"\n",
"This tutorial shows how to represent Unicode strings in TensorFlow and manipulate them using Unicode equivalents of standard string ops. It separates Unicode strings into tokens based on script detection."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:47.070779Z",
"iopub.status.busy": "2024-07-19T13:01:47.070214Z",
"iopub.status.idle": "2024-07-19T13:01:49.434945Z",
"shell.execute_reply": "2024-07-19T13:01:49.434321Z"
},
"id": "OIKHl5Lvn4gh"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-07-19 13:01:47.328260: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2024-07-19 13:01:47.349715: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2024-07-19 13:01:47.356051: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
]
}
],
"source": [
"import tensorflow as tf\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n-LkcI-vtWNj"
},
"source": [
"## The `tf.string` data type\n",
"\n",
"The basic TensorFlow `tf.string` `dtype` allows you to build tensors of byte strings.\n",
"Unicode strings are [utf-8](https://en.wikipedia.org/wiki/UTF-8) encoded by default."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:49.439029Z",
"iopub.status.busy": "2024-07-19T13:01:49.438603Z",
"iopub.status.idle": "2024-07-19T13:01:51.663650Z",
"shell.execute_reply": "2024-07-19T13:01:51.662699Z"
},
"id": "3yo-Qv6ntaFr"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
"I0000 00:00:1721394109.974229 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394109.978217 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394109.981871 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394109.985379 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394109.997087 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394110.000883 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394110.004294 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394110.007621 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394110.011020 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394110.014636 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394110.018010 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394110.021386 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.237174 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.239312 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.241382 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.243348 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.245395 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.247369 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.249317 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.251190 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.253079 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.255025 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.256961 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.258853 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.298850 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.300873 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.302869 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.304789 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See mo"
]
},
{
"data": {
"text/plain": [
"
"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"re at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.306767 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.308726 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.310660 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.313038 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.314913 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.317376 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.319690 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n",
"I0000 00:00:1721394111.321971 53267 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n"
]
}
],
"source": [
"tf.constant(u\"Thanks 😊\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2kA1ziG2tyCT"
},
"source": [
"A `tf.string` tensor treats byte strings as atomic units. This enables it to store byte strings of varying lengths. The string length is not included in the tensor dimensions.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.667631Z",
"iopub.status.busy": "2024-07-19T13:01:51.667317Z",
"iopub.status.idle": "2024-07-19T13:01:51.675303Z",
"shell.execute_reply": "2024-07-19T13:01:51.674303Z"
},
"id": "eyINCmTztyyS"
},
"outputs": [
{
"data": {
"text/plain": [
"TensorShape([2])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.constant([u\"You're\", u\"welcome!\"]).shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jsMPnjb6UDJ1"
},
"source": [
"If you use Python to construct strings, note that [string literals](https://docs.python.org/3/reference/lexical_analysis.html) are Unicode-encoded by default."
]
},
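{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative aside (this sketch is not part of the original tutorial): a Python 3 `str` is a sequence of code points, while `tf.constant` stores the UTF-8 encoded bytes, so the byte length can exceed the character count."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A Python 3 string literal is a sequence of Unicode code points.\n",
"s = u\"café\"\n",
"print(len(s))  # 4 code points\n",
"# tf.constant stores the UTF-8 bytes; 'é' encodes to 2 bytes in UTF-8.\n",
"print(tf.strings.length(tf.constant(s)).numpy())  # 5 bytes"
]
},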
{
"cell_type": "markdown",
"metadata": {
"id": "hUFZ7B1Lk-uj"
},
"source": [
"## Representing Unicode\n",
"\n",
"There are two standard ways to represent a Unicode string in TensorFlow:\n",
"\n",
"* `string` scalar — where the sequence of code points is encoded using a known [character encoding](https://en.wikipedia.org/wiki/Character_encoding).\n",
"* `int32` vector — where each position contains a single code point.\n",
"\n",
"For example, the following three values all represent the Unicode string `\"语言处理\"` (which means \"language processing\" in Chinese):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.678880Z",
"iopub.status.busy": "2024-07-19T13:01:51.678592Z",
"iopub.status.idle": "2024-07-19T13:01:51.684642Z",
"shell.execute_reply": "2024-07-19T13:01:51.683746Z"
},
"id": "cjQIkfJWvC_u"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Unicode string, represented as a UTF-8 encoded string scalar.\n",
"text_utf8 = tf.constant(u\"语言处理\")\n",
"text_utf8"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.688250Z",
"iopub.status.busy": "2024-07-19T13:01:51.687574Z",
"iopub.status.idle": "2024-07-19T13:01:51.694208Z",
"shell.execute_reply": "2024-07-19T13:01:51.693354Z"
},
"id": "yQqcUECcvF2r"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Unicode string, represented as a UTF-16-BE encoded string scalar.\n",
"text_utf16be = tf.constant(u\"语言处理\".encode(\"UTF-16-BE\"))\n",
"text_utf16be"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.697682Z",
"iopub.status.busy": "2024-07-19T13:01:51.697419Z",
"iopub.status.idle": "2024-07-19T13:01:51.704801Z",
"shell.execute_reply": "2024-07-19T13:01:51.703854Z"
},
"id": "ExdBr1t7vMuS"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Unicode string, represented as a vector of Unicode code points.\n",
"text_chars = tf.constant([ord(char) for char in u\"语言处理\"])\n",
"text_chars"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B8czv4JNpBnZ"
},
"source": [
"### Converting between representations\n",
"\n",
"TensorFlow provides operations to convert between these different representations:\n",
"\n",
"* `tf.strings.unicode_decode`: Converts an encoded string scalar to a vector of code points.\n",
"* `tf.strings.unicode_encode`: Converts a vector of code points to an encoded string scalar.\n",
"* `tf.strings.unicode_transcode`: Converts an encoded string scalar to a different encoding."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.708619Z",
"iopub.status.busy": "2024-07-19T13:01:51.707844Z",
"iopub.status.idle": "2024-07-19T13:01:51.717033Z",
"shell.execute_reply": "2024-07-19T13:01:51.716184Z"
},
"id": "qb-UQ_oLpAJg"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_decode(text_utf8,\n",
" input_encoding='UTF-8')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.720761Z",
"iopub.status.busy": "2024-07-19T13:01:51.720088Z",
"iopub.status.idle": "2024-07-19T13:01:51.731865Z",
"shell.execute_reply": "2024-07-19T13:01:51.730932Z"
},
"id": "kEBUcunnp-9n"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_encode(text_chars,\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.735509Z",
"iopub.status.busy": "2024-07-19T13:01:51.735233Z",
"iopub.status.idle": "2024-07-19T13:01:51.741761Z",
"shell.execute_reply": "2024-07-19T13:01:51.740818Z"
},
"id": "0MLhWcLZrph-"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_transcode(text_utf8,\n",
" input_encoding='UTF8',\n",
" output_encoding='UTF-16-BE')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QVeLeVohqN7I"
},
"source": [
"### Batch dimensions\n",
"\n",
"When decoding multiple strings, the number of characters in each string may not be equal. The return result is a [`tf.RaggedTensor`](../../guide/ragged_tensor.ipynb), where the innermost dimension length varies depending on the number of characters in each string."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.745595Z",
"iopub.status.busy": "2024-07-19T13:01:51.745212Z",
"iopub.status.idle": "2024-07-19T13:01:51.751948Z",
"shell.execute_reply": "2024-07-19T13:01:51.751067Z"
},
"id": "N2jVzPymr_Mm"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[104, 195, 108, 108, 111]\n",
"[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]\n",
"[71, 246, 246, 100, 110, 105, 103, 104, 116]\n",
"[128522]\n"
]
}
],
"source": [
"# A batch of Unicode strings, each represented as a UTF8-encoded string.\n",
"batch_utf8 = [s.encode('UTF-8') for s in\n",
" [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]\n",
"batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,\n",
" input_encoding='UTF-8')\n",
"for sentence_chars in batch_chars_ragged.to_list():\n",
" print(sentence_chars)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iRh3n1hPsJ9v"
},
"source": [
"You can use this `tf.RaggedTensor` directly, or convert it to a dense `tf.Tensor` with padding or a `tf.sparse.SparseTensor` using the methods `tf.RaggedTensor.to_tensor` and `tf.RaggedTensor.to_sparse`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.755817Z",
"iopub.status.busy": "2024-07-19T13:01:51.755139Z",
"iopub.status.idle": "2024-07-19T13:01:51.766199Z",
"shell.execute_reply": "2024-07-19T13:01:51.765209Z"
},
"id": "yz17yeSMsUid"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 104 195 108 108 111 -1 -1 -1 -1 -1\n",
" -1 -1 -1 -1 -1 -1 -1 -1 -1 -1\n",
" -1 -1 -1 -1 -1 -1 -1 -1]\n",
" [ 87 104 97 116 32 105 115 32 116 104\n",
" 101 32 119 101 97 116 104 101 114 32\n",
" 116 111 109 111 114 114 111 119]\n",
" [ 71 246 246 100 110 105 103 104 116 -1\n",
" -1 -1 -1 -1 -1 -1 -1 -1 -1 -1\n",
" -1 -1 -1 -1 -1 -1 -1 -1]\n",
" [128522 -1 -1 -1 -1 -1 -1 -1 -1 -1\n",
" -1 -1 -1 -1 -1 -1 -1 -1 -1 -1\n",
" -1 -1 -1 -1 -1 -1 -1 -1]]\n"
]
}
],
"source": [
"batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)\n",
"print(batch_chars_padded.numpy())"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.769567Z",
"iopub.status.busy": "2024-07-19T13:01:51.768980Z",
"iopub.status.idle": "2024-07-19T13:01:51.776426Z",
"shell.execute_reply": "2024-07-19T13:01:51.775607Z"
},
"id": "kBjsPQp3rhfm"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 104, 195, 108, 108, 111, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]\n",
" [ 87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]\n",
" [ 71, 246, 246, 100, 110, 105, 103, 104, 116, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]\n",
" [128522, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]]\n"
]
}
],
"source": [
"batch_chars_sparse = batch_chars_ragged.to_sparse()\n",
"\n",
"nrows, ncols = batch_chars_sparse.dense_shape.numpy()\n",
"elements = [['_' for i in range(ncols)] for j in range(nrows)]\n",
"for (row, col), value in zip(batch_chars_sparse.indices.numpy(), batch_chars_sparse.values.numpy()):\n",
" elements[row][col] = str(value)\n",
"# max_width = max(len(value) for row in elements for value in row)\n",
"value_lengths = []\n",
"for row in elements:\n",
" for value in row:\n",
" value_lengths.append(len(value))\n",
"max_width = max(value_lengths)\n",
"print('[%s]' % '\\n '.join(\n",
" '[%s]' % ', '.join(value.rjust(max_width) for value in row)\n",
" for row in elements))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GCCkZh-nwlbL"
},
"source": [
"When encoding multiple strings with the same lengths, use a `tf.Tensor` as the input."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.779773Z",
"iopub.status.busy": "2024-07-19T13:01:51.779490Z",
"iopub.status.idle": "2024-07-19T13:01:51.803825Z",
"shell.execute_reply": "2024-07-19T13:01:51.802989Z"
},
"id": "_lP62YUAwjK9"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]],\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w58CMRg9tamW"
},
"source": [
"When encoding multiple strings with varying length, use a `tf.RaggedTensor` as the input."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.806971Z",
"iopub.status.busy": "2024-07-19T13:01:51.806724Z",
"iopub.status.idle": "2024-07-19T13:01:51.812535Z",
"shell.execute_reply": "2024-07-19T13:01:51.811726Z"
},
"id": "d7GtOtrltaMl"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T2Nh5Aj9xob3"
},
"source": [
"If you have a tensor with multiple strings in padded or sparse format, convert it first into a `tf.RaggedTensor` before calling `tf.strings.unicode_encode`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:51.816268Z",
"iopub.status.busy": "2024-07-19T13:01:51.815604Z",
"iopub.status.idle": "2024-07-19T13:01:52.188506Z",
"shell.execute_reply": "2024-07-19T13:01:52.187584Z"
},
"id": "R2bYCYl0u-Ue"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_encode(\n",
" tf.RaggedTensor.from_sparse(batch_chars_sparse),\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:52.192220Z",
"iopub.status.busy": "2024-07-19T13:01:52.191674Z",
"iopub.status.idle": "2024-07-19T13:01:52.941975Z",
"shell.execute_reply": "2024-07-19T13:01:52.941004Z"
},
"id": "UlV2znh_u_zm"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_encode(\n",
" tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),\n",
" output_encoding='UTF-8')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hQOOGkscvDpc"
},
"source": [
"## Unicode operations"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NkmtsA_yvMB0"
},
"source": [
"### Character length\n",
"\n",
"Use the `unit` parameter of the `tf.strings.length` op to indicate how character lengths should be computed. `unit` defaults to `\"BYTE\"`, but it can be set to other values, such as `\"UTF8_CHAR\"` or `\"UTF16_CHAR\"`, to determine the number of Unicode codepoints in each encoded string."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:52.945950Z",
"iopub.status.busy": "2024-07-19T13:01:52.945669Z",
"iopub.status.idle": "2024-07-19T13:01:52.952511Z",
"shell.execute_reply": "2024-07-19T13:01:52.951678Z"
},
"id": "1ZzMe59mvLHr"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"11 bytes; 8 UTF-8 characters\n"
]
}
],
"source": [
"# Note that the final character takes up 4 bytes in UTF8.\n",
"thanks = u'Thanks 😊'.encode('UTF-8')\n",
"num_bytes = tf.strings.length(thanks).numpy()\n",
"num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()\n",
"print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fHG85gxlvVU0"
},
"source": [
"### Character substrings\n",
"\n",
"The `tf.strings.substr` op accepts the `unit` parameter, and uses it to determine what kind of offsets the `pos` and `len` paremeters contain."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:52.955921Z",
"iopub.status.busy": "2024-07-19T13:01:52.955677Z",
"iopub.status.idle": "2024-07-19T13:01:52.961881Z",
"shell.execute_reply": "2024-07-19T13:01:52.961045Z"
},
"id": "WlWRLV-4xWYq"
},
"outputs": [
{
"data": {
"text/plain": [
"b'\\xf0'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here, unit='BYTE' (default). Returns a single byte with len=1\n",
"tf.strings.substr(thanks, pos=7, len=1).numpy()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:52.965328Z",
"iopub.status.busy": "2024-07-19T13:01:52.964654Z",
"iopub.status.idle": "2024-07-19T13:01:52.970136Z",
"shell.execute_reply": "2024-07-19T13:01:52.969201Z"
},
"id": "JfNUVDPwxkCS"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'\\xf0\\x9f\\x98\\x8a'\n"
]
}
],
"source": [
"# Specifying unit='UTF8_CHAR' returns a single 4-byte character in this case.\n",
"print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zJUEsVSyeIa3"
},
"source": [
"### Split Unicode strings\n",
"\n",
"The `tf.strings.unicode_split` operation splits unicode strings into substrings of individual characters."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:52.974092Z",
"iopub.status.busy": "2024-07-19T13:01:52.973386Z",
"iopub.status.idle": "2024-07-19T13:01:52.984108Z",
"shell.execute_reply": "2024-07-19T13:01:52.983274Z"
},
"id": "dDjkh5G1ejMt"
},
"outputs": [
{
"data": {
"text/plain": [
"array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\\xf0\\x9f\\x98\\x8a'],\n",
" dtype=object)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf.strings.unicode_split(thanks, 'UTF-8').numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HQqEEZEbdG9O"
},
"source": [
"### Byte offsets for characters\n",
"\n",
"To align the character tensor generated by `tf.strings.unicode_decode` with the original string, it's useful to know the offset for where each character begins. The method `tf.strings.unicode_decode_with_offsets` is similar to `unicode_decode`, except that it returns a second tensor containing the start offset of each character."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:52.987539Z",
"iopub.status.busy": "2024-07-19T13:01:52.986912Z",
"iopub.status.idle": "2024-07-19T13:01:52.993251Z",
"shell.execute_reply": "2024-07-19T13:01:52.992400Z"
},
"id": "Cug7cmwYdowd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"At byte offset 0: codepoint 127880\n",
"At byte offset 4: codepoint 127881\n",
"At byte offset 8: codepoint 127882\n"
]
}
],
"source": [
"codepoints, offsets = tf.strings.unicode_decode_with_offsets(u'🎈🎉🎊', 'UTF-8')\n",
"\n",
"for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):\n",
" print('At byte offset {}: codepoint {}'.format(offset, codepoint))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2ZnCNxOvx66T"
},
"source": [
"## Unicode scripts"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nRRHqkqNyGZ6"
},
"source": [
"Each Unicode code point belongs to a single collection of code points known as a [script](https://en.wikipedia.org/wiki/Script_%28Unicode%29). A character's script is helpful in determining which language the character might be in. For example, knowing that 'Б' is in Cyrillic script indicates that modern text containing that character is likely from a Slavic language such as Russian or Ukrainian.\n",
"\n",
"TensorFlow provides the `tf.strings.unicode_script` operation to determine which script a given codepoint uses. The script codes are `int32` values corresponding to [International Components for\n",
"Unicode](https://site.icu-project.org/home) (ICU) [`UScriptCode`](https://icu-project.org/apiref/icu4c/uscript_8h.html) values.\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:52.996842Z",
"iopub.status.busy": "2024-07-19T13:01:52.996383Z",
"iopub.status.idle": "2024-07-19T13:01:53.001907Z",
"shell.execute_reply": "2024-07-19T13:01:53.001065Z"
},
"id": "K7DeYHrRyFPy"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[17 8]\n"
]
}
],
"source": [
"uscript = tf.strings.unicode_script([33464, 1041]) # ['芸', 'Б']\n",
"\n",
"print(uscript.numpy()) # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2fW992a1lIY6"
},
"source": [
"The `tf.strings.unicode_script` operation can also be applied to multidimensional `tf.Tensor`s or `tf.RaggedTensor`s of codepoints:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:53.005317Z",
"iopub.status.busy": "2024-07-19T13:01:53.004844Z",
"iopub.status.idle": "2024-07-19T13:01:53.010194Z",
"shell.execute_reply": "2024-07-19T13:01:53.009184Z"
},
"id": "uR7b8meLlFnp"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"print(tf.strings.unicode_script(batch_chars_ragged))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mx7HEFpBzEsB"
},
"source": [
"## Example: Simple segmentation\n",
"\n",
"Segmentation is the task of splitting text into word-like units. This is often easy when space characters are used to separate words, but some languages (like Chinese and Japanese) do not use spaces, and some languages (like German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in \"NY株価\" (New York Stock Exchange).\n",
"\n",
"You can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This works for strings like the \"NY株価\" example above. It also works for most languages that use spaces, since the space characters of the various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text."
]
},
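{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea can be sketched in plain Python before building it with TensorFlow ops. Note that `toy_script` below is a hypothetical stand-in for `tf.strings.unicode_script`: it distinguishes only a few scripts, just enough to segment the \"NY株価\" example by script changes.\n",
"\n",
"```python\n",
"def toy_script(ch):\n",
"  # Hypothetical toy classifier; real code should use tf.strings.unicode_script.\n",
"  cp = ord(ch)\n",
"  if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF:\n",
"    return 'CJK'      # Han, Hiragana, Katakana ranges (incomplete)\n",
"  if ch.isalpha():\n",
"    return 'LATIN'\n",
"  return 'COMMON'     # spaces, punctuation, digits, ...\n",
"\n",
"def segment(text):\n",
"  # Start a new word at position 0 and wherever the script changes.\n",
"  words = []\n",
"  for i, ch in enumerate(text):\n",
"    if i == 0 or toy_script(ch) != toy_script(text[i - 1]):\n",
"      words.append(ch)   # script change: start a new word\n",
"    else:\n",
"      words[-1] += ch\n",
"  return words\n",
"\n",
"print(segment('NY株価'))  # ['NY', '株価']\n",
"```"
]
},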
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:53.013863Z",
"iopub.status.busy": "2024-07-19T13:01:53.013226Z",
"iopub.status.idle": "2024-07-19T13:01:53.017024Z",
"shell.execute_reply": "2024-07-19T13:01:53.016150Z"
},
"id": "grsvFiC4BoPb"
},
"outputs": [],
"source": [
"# dtype: string; shape: [num_sentences]\n",
"#\n",
"# The sentences to process. Edit this line to try out different inputs!\n",
"sentence_texts = [u'Hello, world.', u'世界こんにちは']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CapnbShuGU8i"
},
"source": [
"First, decode the sentences into character codepoints, and find the script identifier for each character."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:53.020526Z",
"iopub.status.busy": "2024-07-19T13:01:53.020020Z",
"iopub.status.idle": "2024-07-19T13:01:53.026279Z",
"shell.execute_reply": "2024-07-19T13:01:53.025449Z"
},
"id": "ReQVcDQh1MB8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n"
]
}
],
"source": [
"# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]\n",
"#\n",
"# sentence_char_codepoint[i, j] is the codepoint for the j'th character in\n",
"# the i'th sentence.\n",
"sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')\n",
"print(sentence_char_codepoint)\n",
"\n",
"# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]\n",
"#\n",
"# sentence_char_scripts[i, j] is the Unicode script of the j'th character in\n",
"# the i'th sentence.\n",
"sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)\n",
"print(sentence_char_script)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O2fapF5UGcUc"
},
"source": [
"Use the script identifiers to determine where word boundaries should be added. Add a word boundary at the beginning of each sentence and at each character whose script differs from that of the previous character."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:53.029853Z",
"iopub.status.busy": "2024-07-19T13:01:53.029198Z",
"iopub.status.idle": "2024-07-19T13:01:53.108698Z",
"shell.execute_reply": "2024-07-19T13:01:53.107862Z"
},
"id": "7v5W6MOr1Rlc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tf.Tensor([ 0 5 7 12 13 15], shape=(6,), dtype=int64)\n"
]
}
],
"source": [
"# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]\n",
"#\n",
"# sentence_char_starts_word[i, j] is True if the j'th character in the i'th\n",
"# sentence is the start of a word.\n",
"sentence_char_starts_word = tf.concat(\n",
" [tf.fill([sentence_char_script.nrows(), 1], True),\n",
" tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],\n",
" axis=1)\n",
"\n",
"# dtype: int64; shape: [num_words]\n",
"#\n",
"# word_starts[i] is the index of the character that starts the i'th word (in\n",
"# the flattened list of characters from all sentences).\n",
"word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)\n",
"print(word_starts)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LAwh-1QkGuC9"
},
"source": [
"You can then use those start offsets to build a `RaggedTensor` containing the list of words from all sentences."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:53.112426Z",
"iopub.status.busy": "2024-07-19T13:01:53.111737Z",
"iopub.status.idle": "2024-07-19T13:01:53.122776Z",
"shell.execute_reply": "2024-07-19T13:01:53.121906Z"
},
"id": "bNiA1O_eBBCL"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"# dtype: int32; shape: [num_words, (num_chars_per_word)]\n",
"#\n",
"# word_char_codepoint[i, j] is the codepoint for the j'th character in the\n",
"# i'th word.\n",
"word_char_codepoint = tf.RaggedTensor.from_row_starts(\n",
" values=sentence_char_codepoint.values,\n",
" row_starts=word_starts)\n",
"print(word_char_codepoint)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "66a2ZnYmG2ao"
},
"source": [
"To finish, segment the word codepoints `RaggedTensor` back into sentences and encode into UTF-8 strings for readability."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"execution": {
"iopub.execute_input": "2024-07-19T13:01:53.126295Z",
"iopub.status.busy": "2024-07-19T13:01:53.125705Z",
"iopub.status.idle": "2024-07-19T13:01:53.145730Z",
"shell.execute_reply": "2024-07-19T13:01:53.144879Z"
},
"id": "NCfwcqLSEjZb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"[[b'Hello', b', ', b'world', b'.'],\n",
" [b'\\xe4\\xb8\\x96\\xe7\\x95\\x8c',\n",
" b'\\xe3\\x81\\x93\\xe3\\x82\\x93\\xe3\\x81\\xab\\xe3\\x81\\xa1\\xe3\\x81\\xaf']]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dtype: int64; shape: [num_sentences]\n",
"#\n",
"# sentence_num_words[i] is the number of words in the i'th sentence.\n",
"sentence_num_words = tf.reduce_sum(\n",
" tf.cast(sentence_char_starts_word, tf.int64),\n",
" axis=1)\n",
"\n",
"# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]\n",
"#\n",
"# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character\n",
"# in the j'th word in the i'th sentence.\n",
"sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(\n",
" values=word_char_codepoint,\n",
" row_lengths=sentence_num_words)\n",
"print(sentence_word_char_codepoint)\n",
"\n",
"tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"oL9KopJirB2g"
],
"name": "unicode.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
}
},
"nbformat": 4,
"nbformat_minor": 0
}