Easy Data Federation With JuJu And Apache Drill

The Juju Charmers Summit is just a few days away and before I fly out there I wanted to clean up some of the work I’ve been doing on the Apache Drill charm for one of the presentations.

For those of you who don’t know, Drill allows you to run SQL over NOSQL and SQL datasources and present this service as a JDBC interface allowing Analytics tools access to the underlying data even if they don’t speak the query language required.

So my goal was as close to a zero config data federation platform as possible and we’re certainly heading in the right direction.

To get started you need the following charms:

  • Drillbit (cs:~spicule/drillbit)
  • OpenJDK
  • Apache Zookeeper
  • Your choice of data sources, in this demo it is MySQL and Postgres, but the charm also accepts connections from Mongo and HDFS. You can of course upload your own CSV’s and JSON files for additional source data as well.

Here are my commands:

juju bootstrap jujudev aws/us-west-1 --bootstrap-series=trusty
juju deploy cs:~spicule/drillbit
juju deploy openjdk
juju deploy apache-zookeeper zk
juju deploy postgresql
juju deploy mysql

In the above we have bootstrapped a new environment, deployed Apache Drill, OpenJDK, Apache Zookeeper, Postgres and MySQL.

Great, but currently they have no knowledge of each other, so we need to add relations between the charms.

juju add-relation openjdk drillbit

This tells Juju to install OpenJDK on the Drillbit node, which is required for Apache Drill to run.

Next we need to make Drill aware of Zookeeper so it knows to run in Distributed mode.

juju add-relation zk drillbit

This will have provided Drill enough information to install itself, so you should then see Drill doing that via juju status.

Once Drill is installed we can verify this with: juju expose. Then navigate to the IP address listed for that charm.

You should see something like this:

We can also see what we have deployed in Juju GUI by running:

juju gui
if juju is running on your local machine or:

juju gui –no-browser –show-credentials
if it isn’t.

In the GUI you should see a layout similar to this:

Next up we will add relations to our data sources:

juju add-relation drillbit mysql
juju add-relation drillbit postgresql
When these have run if you then click on the Storage menu item in Drill you should see a Postgres and MySQL entry added automatically to the data sources. If you open one of these you’ll see that it has been prepopulated with the JDBC url, a username and password. Clever huh?

Next we can add data to our databases. In this demo I just added a table called test to both and added 2 columns 1 ID and one label.

create table test(id int, label varchar(20);
Then I added one record to each with the same ID.

Finally back in Drill I can now do this:

select m.label, p.label from `juju_mysql_172.31.25.57`.drillbit.test as m, `juju_psql_172.31.0.200`.public.test as p where m.id = p.id

All this query says is select my table from the MySQL connection and the table test from the PostgreSQL connection where the id’s match in both, simple SQL just with the additional connection information.

Then I hit the Submit button and I’m displayed a table like this:

Of course this is a very simple solution, but its pretty cool how with no ETL code or glue, I’ve extracted data from 2 tables is separate data sources and joined them before rendering the result. Of course the Drill GUI isn’t designed for Analytics. In a future blog post we’ll look at hooking this up to Saiku Analytics via Juju to provide end to end Analytics over a multitude of data sources.

You can also scale this solution very easily. Curently we have 1 Zookeeper node, we should have 3+, we can also add more Drill nodes very easily:

juju add-unit -n 2 zk juju add-unit -n 2 drillbit

As soon as that finishes we then have a 3 node Drill cluster talking to your data sources without editing a single configuration file.



Leave a Reply