|
74 | 74 | "\n",
|
75 | 75 | "Kafka is primarily a distributed event-streaming platform that provides scalable, fault-tolerant data streaming across data pipelines. It is an essential technical component of many major enterprises where mission-critical data delivery is a primary requirement.\n",
|
76 | 76 | "\n",
|
77 |
| - "**NOTE:** A basic understanding of the [Kafka components](https://kafka.apache.org/documentation/#intro_concepts_and_terms) will help you follow the tutorial with ease.", |
| 77 | + "**NOTE:** A basic understanding of the [Kafka components](https://kafka.apache.org/documentation/#intro_concepts_and_terms) will help you follow the tutorial with ease.\n", |
78 | 78 | "\n",
|
79 | 79 | "**NOTE:** A Java runtime environment is required to run this tutorial."
|
80 | 80 | ]
|
|
755 | 755 | "source": [
|
756 | 756 | "### The tfio training dataset for online learning\n",
|
757 | 757 | "\n",
|
758 |
| - "The `streaming.KafkaBatchIODataset` is similar to the `streaming.KafkaGroupIODataset` in its API. Additionally, it is recommended to use the `stream_timeout` parameter to configure the duration for which the dataset will block for new messages before timing out. In the example below, the dataset is configured with a `stream_timeout` of `30000` milliseconds. This implies that, after all the messages from the topic have been consumed, the dataset will wait for an additional 30 seconds before timing out and disconnecting from the Kafka cluster. If new messages are streamed into the topic before timing out, data consumption and model training resume for those newly consumed data points. To block indefinitely, set it to `-1`." |
| 758 | + "The `streaming.KafkaBatchIODataset` is similar to the `streaming.KafkaGroupIODataset` in its API. Additionally, it is recommended to use the `stream_timeout` parameter to configure the duration for which the dataset will block for new messages before timing out. In the example below, the dataset is configured with a `stream_timeout` of `10000` milliseconds. This implies that, after all the messages from the topic have been consumed, the dataset will wait for an additional 10 seconds before timing out and disconnecting from the Kafka cluster. If new messages are streamed into the topic before timing out, data consumption and model training resume for those newly consumed data points. To block indefinitely, set it to `-1`." |
759 | 759 | ]
|
760 | 760 | },
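The `stream_timeout` semantics described above (wait for new messages, reset the wait whenever data arrives, disconnect once the timeout elapses) can be pictured with a plain-Python toy loop. This is only an illustrative sketch of the behavior, not tfio's implementation; the queue and message names are made up:

```python
import time
from collections import deque

def consume_with_timeout(queue, stream_timeout_ms):
    """Toy model of the stream_timeout behavior: keep yielding messages,
    and once the queue has stayed empty for stream_timeout_ms, stop
    (i.e. "disconnect"). stream_timeout_ms=-1 would mean block forever."""
    deadline = time.monotonic() + stream_timeout_ms / 1000.0
    while True:
        if queue:
            yield queue.popleft()
            # New data arrived, so the timeout window restarts.
            deadline = time.monotonic() + stream_timeout_ms / 1000.0
        elif stream_timeout_ms != -1 and time.monotonic() >= deadline:
            return  # timed out waiting for new messages
        else:
            time.sleep(0.01)  # poll again shortly

q = deque(["msg-1", "msg-2"])
print(list(consume_with_timeout(q, 100)))  # -> ['msg-1', 'msg-2'] after a ~100 ms wait
```

With `stream_timeout=10000` the real dataset behaves analogously: the mini-batch loop below simply stops yielding once no new messages arrive for 10 seconds.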
|
761 | 761 | {
|
|
770 | 770 | " topics=[\"susy-train\"],\n",
|
771 | 771 | " group_id=\"cgonline\",\n",
|
772 | 772 | " servers=\"127.0.0.1:9092\",\n",
|
773 |
| - " stream_timeout=30000, # in milliseconds, to block indefinitely, set it to -1.\n", |
| 773 | + " stream_timeout=10000, # in milliseconds, to block indefinitely, set it to -1.\n", |
774 | 774 | " configuration=[\n",
|
775 | 775 | " \"session.timeout.ms=7000\",\n",
|
776 | 776 | " \"max.poll.interval.ms=8000\",\n",
|
|
779 | 779 | ")"
|
780 | 780 | ]
|
781 | 781 | },
|
782 |
| - { |
783 |
| - "cell_type": "markdown", |
784 |
| - "metadata": { |
785 |
| - "id": "sJronJPnZhyR" |
786 |
| - }, |
787 |
| - "source": [ |
788 |
| - "In addition to training the model on existing data, a background thread will be started, which will start streaming additional data into the `susy-train` topic after a sleep duration of 30 seconds. This demonstrates the functionality of resuming the training as soons as new data is fed into the topic without the need for building the dataset over and over again." |
789 |
| - ] |
790 |
| - }, |
791 |
| - { |
792 |
| - "cell_type": "code", |
793 |
| - "execution_count": null, |
794 |
| - "metadata": { |
795 |
| - "id": "iaBjhFkmZd1C" |
796 |
| - }, |
797 |
| - "outputs": [], |
798 |
| - "source": [ |
799 |
| - "def error_callback(exc):\n", |
800 |
| - " raise Exception('Error while sendig data to kafka: {0}'.format(str(exc)))\n", |
801 |
| - "\n", |
802 |
| - "def write_to_kafka_after_sleep(topic_name, items):\n", |
803 |
| - " time.sleep(30)\n", |
804 |
| - " print(\"#\"*100)\n", |
805 |
| - " print(\"Writing messages into topic: {0} after a nice sleep !\".format(topic_name))\n", |
806 |
| - " print(\"#\"*100)\n", |
807 |
| - " count=0\n", |
808 |
| - " producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'])\n", |
809 |
| - " for message, key in items:\n", |
810 |
| - " producer.send(topic_name,\n", |
811 |
| - " key=key.encode('utf-8'),\n", |
812 |
| - " value=message.encode('utf-8')\n", |
813 |
| - " ).add_errback(error_callback)\n", |
814 |
| - " count+=1\n", |
815 |
| - " producer.flush()\n", |
816 |
| - " print(\"#\"*100)\n", |
817 |
| - " print(\"Wrote {0} messages into topic: {1}\".format(count, topic_name))\n", |
818 |
| - " print(\"#\"*100)\n", |
819 |
| - "\n", |
820 |
| - "def decode_kafka_online_item(raw_message, raw_key):\n", |
821 |
| - " message = tf.io.decode_csv(raw_message, [[0.0] for i in range(NUM_COLUMNS)])\n", |
822 |
| - " key = tf.strings.to_number(raw_key)\n", |
823 |
| - " return (message, key)\n" |
824 |
| - ] |
825 |
| - }, |
826 | 782 | {
|
827 | 783 | "cell_type": "markdown",
|
828 | 784 | "metadata": {
|
|
840 | 796 | },
|
841 | 797 | "outputs": [],
|
842 | 798 | "source": [
|
843 |
| - "thread = threading.Thread(target=write_to_kafka_after_sleep,\n", |
844 |
| - " args=(\"susy-train\", zip(x_train, y_train)))\n", |
845 |
| - "thread.daemon = True\n", |
846 |
| - "thread.start()\n", |
847 |
| - "\n", |
| 799 | + "def decode_kafka_online_item(raw_message, raw_key):\n", |
| 800 | + " message = tf.io.decode_csv(raw_message, [[0.0] for i in range(NUM_COLUMNS)])\n", |
| 801 | + " key = tf.strings.to_number(raw_key)\n", |
| 802 | + " return (message, key)\n", |
| 803 | + " \n", |
848 | 804 | "for mini_ds in online_train_ds:\n",
|
849 | 805 | " mini_ds = mini_ds.shuffle(buffer_size=32)\n",
|
850 | 806 | " mini_ds = mini_ds.map(decode_kafka_online_item)\n",
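The `decode_kafka_online_item` mapping used above can be pictured without TensorFlow. The sketch below is a plain-Python analogue of `tf.io.decode_csv` with all-float defaults plus `tf.strings.to_number`; the column count and sample values are illustrative, not taken from the notebook:

```python
def decode_csv_item(raw_message: str, raw_key: str, num_columns: int):
    # Each CSV field becomes a float, with empty fields falling back to the
    # 0.0 default (mirroring tf.io.decode_csv's [[0.0]] * num_columns record
    # defaults); the key is converted to a number like tf.strings.to_number.
    fields = raw_message.split(",")
    message = [float(f) if f else 0.0 for f in fields[:num_columns]]
    key = float(raw_key)
    return message, key

print(decode_csv_item("0.5,1.25,", "1", 3))  # -> ([0.5, 1.25, 0.0], 1.0)
```

In the actual pipeline this decoding runs inside `mini_ds.map(...)`, so each Kafka message/key pair arrives at the model as a feature vector and a numeric label.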
|
|