Consuming from Google Cloud PubSub using only Google Cloud Dataflow SQL
Introduction
Would you like to create your own Dataflow jobs directly? Would you like to create these jobs from the Google BigQuery UI, without writing any Java or Python and without worrying about deployment? Step right up and try Google’s new feature for Dataflow: Dataflow SQL.
Setup
Here I will take you through the steps to create a very simple Dataflow SQL pipeline. To get started, we will need a few things: a user account with appropriate IAM permissions, a BigQuery dataset to create tables in, a BigQuery table to store data in, and a Google Cloud PubSub topic for data ingestion. Let’s create those now:
Create a dataset:
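If you prefer the command line, a minimal sketch with the bq CLI would look something like this (the project ID my-project and dataset name dataflow_sql_demo are placeholders; the UI works just as well):

```bash
# Create a BigQuery dataset to hold the pipeline's output table (names are examples)
bq --location=US mk --dataset my-project:dataflow_sql_demo
```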
Create a table in that dataset:
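Again as a sketch, using the placeholder names from above and the table name test that appears later in this walkthrough; the pipeline will only write the string field a:

```bash
# Create the output table with a single string column matching the pipeline's output
bq mk --table my-project:dataflow_sql_demo.test a:STRING
```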
Create a PubSub Topic:
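A gcloud equivalent, with an example topic name:

```bash
# Create the Pub/Sub topic that will feed the streaming pipeline
gcloud pubsub topics create dataflow-sql-topic
```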
Now that we have everything created, go ahead and switch over to the new Dataflow SQL engine.
To switch over, go to query settings:
and select Cloud Dataflow Engine:
Observe the new resources available:
Now, in order to query the Google Cloud PubSub topic we created, we need to assign it a schema. Ours is simple enough: three string fields plus the event timestamp, which Google Cloud PubSub adds by default.
Create schema for topic:
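Schemas for Pub/Sub topics can also be attached from the command line through Data Catalog. Here is a rough sketch: the field names b and c, the file name, and the topic/project names are illustrative, and the exact gcloud data-catalog syntax may vary by SDK version:

```bash
# Describe the topic schema: three string fields plus the Pub/Sub event timestamp
cat > topic_schema.yaml <<'EOF'
- column: event_timestamp
  description: Pub/Sub event timestamp
  mode: REQUIRED
  type: TIMESTAMP
- column: a
  description: example string field
  mode: NULLABLE
  type: STRING
- column: b
  description: example string field
  mode: NULLABLE
  type: STRING
- column: c
  description: example string field
  mode: NULLABLE
  type: STRING
EOF

# Attach the schema to the topic's Data Catalog entry so Dataflow SQL can query it
gcloud data-catalog entries update \
  --lookup-entry='pubsub.topic.`my-project`.`dataflow-sql-topic`' \
  --schema-from-file=topic_schema.yaml
```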
Now it’s time to move on to creating a Dataflow SQL job.
Job Creation
Create a simple SQL statement that will run as a Dataflow job.
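Something along these lines, typed into the BigQuery editor with the Cloud Dataflow engine selected, does the trick (the project and topic names are the placeholders from above; only field a is kept):

```sql
-- Keep only field "a" from every message arriving on the topic
SELECT tr.a
FROM pubsub.topic.`my-project`.`dataflow-sql-topic` AS tr
```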
Click Create Cloud Dataflow job. Now we can specify some output locations. This is a super helpful feature that allows us to save the results as the job runs, giving us the full power of an ETL process.
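For reference, roughly the same job can be launched from the command line, pointing the output at our BigQuery table (the job name, region, and dataset/table names are placeholders, and the available flags may differ slightly between SDK versions):

```bash
# Launch the streaming Dataflow SQL job and write results into dataflow_sql_demo.test
gcloud dataflow sql query \
  'SELECT tr.a FROM pubsub.topic.`my-project`.`dataflow-sql-topic` AS tr' \
  --job-name=pubsub-to-bq-demo \
  --region=us-central1 \
  --bigquery-dataset=dataflow_sql_demo \
  --bigquery-table=test
```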
Let’s have a look at the Dataflow DAG we created from SQL. Go to the job history in BigQuery:
Click on the Job ID and the DAG can be seen:
It is a very simple one, just applying the single transform and then dumping into a BigQuery table via streaming inserts. Simple yet powerful, an ideal Dataflow job.
Play with our streaming pipeline
From here we can test the pipeline we created. Let’s put some dummy JSON data into our topic and see what our pipeline does.
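For example, publishing a made-up message with gcloud (the values are arbitrary; fields b and c should be dropped by the query):

```bash
# Publish a dummy JSON message to the topic
gcloud pubsub topics publish dataflow-sql-topic \
  --message='{"a": "keep me", "b": "drop me", "c": "drop me too"}'
```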
The pipeline we created should take this JSON data and place only the JSON field “a” into our table test.
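After giving the streaming job a minute or two to process the message, we can check the table (using the placeholder names from earlier):

```bash
# Verify that only field "a" made it into the output table
bq query --use_legacy_sql=false \
  'SELECT a FROM `my-project.dataflow_sql_demo.test`'
```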
Success! We have filtered incoming data from Google Cloud PubSub into a table in BigQuery without ever having to leave BigQuery, writing it all in SQL.
This illustrates how easy it is to set up and deploy Dataflow pipelines via BigQuery SQL. We have only run a very simple query, but much more complex ones are possible.