Apache Drill on the Juju Platform

The last couple of days have been busy. We've been working hard on a new charm for the Juju charm store; it's not yet complete, but it's worth blogging about and having people test.

The new charm is Apache Drill. Those of you who use Juju will know the platform has excellent big data and NoSQL support, enabling people to quickly build scalable data applications. I personally got involved in Juju when Meteorite was asked to bring Saiku Analytics to the platform, but the problem has always been the inability to leverage the big data capabilities within the platform. Until now.

Apache Drill allows you to run SQL queries over a host of different datasources, including CSV, JSON and TSV text files, the Parquet column store format, HBase, JDBC, MongoDB and others. The files can be stored on a local filesystem, in HDFS, in S3 and so on. In short, it's a great way to give analytical tools access to data that would otherwise be locked away behind a Spark or MapReduce job.

Drill also scales well, and can distribute queries over a large cluster to help improve performance on systems that traditionally don't perform particularly well when running SQL over the top of them (think Hive). Of course, on Juju I don't want to have to manually install Drill on each node, and this is where the new charm comes into play nicely.

DEPLOYMENT

You can deploy Apache Drill in embedded (standalone) mode, but that doesn't scale and doesn't fit our requirements, so we use distributed mode instead. To deploy Drill in distributed mode it has two dependencies: ZooKeeper and Java. Because of the way Juju works, we don't bundle these dependencies or enforce specifics upon you; instead, you deploy compatible charms and 'relate' them to our Drill charm. Here's how it works:

juju deploy apache-zookeeper zookeeper
juju add-unit -n 2 zookeeper
(Adding the extra units sets up the ZooKeeper quorum; if you're just testing you can run a single node, and you could even run ZooKeeper on the same machine as the Drill instance.)

juju deploy openjdk java
juju deploy cs:~spicule/drillbit drillbit

So, what have we got here in our four lines of commands? First up we deployed a ZooKeeper charm. ZooKeeper is a coordination service that keeps all our Drill nodes in sync with one another; it also handles failure well, so if nodes fall over (assuming you have enough left), ZooKeeper will keep everything up and running. After that, we added two more ZooKeeper nodes to create our quorum. In reality you'd probably specify a machine size, because ZooKeeper is pretty lightweight and the default instance size is probably a little over the top for our needs; of course, that depends on how hard you're hammering ZooKeeper.
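As an aside, if you do want smaller machines, you can pass constraints at deploy time instead of the plain deploy above. The sizes here are purely illustrative, and the CPU key is cores on newer Juju releases (cpu-cores on older ones):

juju deploy --constraints "mem=2G cores=1" apache-zookeeper zookeeper
juju add-unit -n 2 zookeeper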

Next we deploy OpenJDK. This charm doesn't actually do anything on its own, because installing Java on a server by itself doesn't make much sense; instead it sits there waiting for us to relate it to something.

Finally we deploy our first Drillbit node. Drill does like a lot of RAM; by default it ships with 8GB as the minimum. In this charm we allow a dynamic value to be used, so by default we'll use 75% of the system RAM as the upper limit, but this value is entirely configurable.
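As a sketch of what that tuning looks like, you'd set a charm config option. The option name below is purely a placeholder, so check the charm's published config for the real one (and on older Juju the command is juju set rather than juju config):

juju config drillbit max-memory-percent=50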

If you run watch juju status you'll see your nodes spin up and install their required software. Drill isn't doing anything yet, though: it hasn't had the requirements it needs to run resolved, so it will tell you something like:

Waiting for Zookeeper to become joined

So, we need to do what it says:

juju add-relation zookeeper drillbit
juju add-relation java drillbit

This wires up the two dependencies Drill needs to run: Juju will configure the Drill config files to point at the available ZooKeeper nodes, and it will install the OpenJDK JRE on the Drillbit server so that Drill can use Java. If you run watch juju status again, you'll see the system process the requests until it is ready and says: Drill up and running.
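Under the hood this amounts to the standard Drill distributed setup, where drill-override.conf carries the cluster ID and the ZooKeeper connection string. A sketch, with illustrative addresses:

# /opt/drill/conf/drill-override.conf (addresses are illustrative)
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "10.0.0.11:2181,10.0.0.12:2181,10.0.0.13:2181"
}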

We can now “expose” Drill to the outside world with:

juju expose drillbit
and in your browser you can now navigate to the server's IP address on port 8047 to see the Drill web console.
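Drill also serves a REST API on the same port, so a quick sanity check from the command line (substituting your unit's address) might be:

curl http://<drillbit-ip>:8047/status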

SCALING

Now of course this is all well and good, but how does this system scale?

Well, that bit is easy:

juju add-unit -n x drillbit

Here x is the number of extra units you want in your Drillbit cluster. This time, because the relations are already established, Juju will set everything up for you. And because the system is coordinated by ZooKeeper, when a node becomes available you will instantly see it reflected in the Drill web console. Similarly, if you want to switch nodes off:

juju remove-unit drillbit
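Note that, depending on your Juju version, remove-unit may expect a specific unit name rather than a service name, so in practice that looks something like:

juju remove-unit drillbit/2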

CONNECTING TO DATA!

So where does that leave us? We have a Drill cluster that lets us query data, but what do we connect it to? This part of the charm development is still in flux, but there are a couple of sample relations to get us started.

Firstly, I can deploy MongoDB and run SQL over the top of it:

juju deploy cs:trusty/mongodb

Then I can do:

juju add-relation mongodb drillbit

If I then look at the storage plugins in the Apache Drill web console, I can see a new mongodb entry has appeared, and if I click its update button it shows the generated configuration:

{
  "type": "mongo",
  "connection": "mongodb://172.31.38.188:27017/",
  "enabled": true
}

If I then put data into my MongoDB I can instantly run SQL over the top of it. For example, to load the Northwind sample dataset:

juju ssh mongodb/0
sudo -i
wget https://github.com/tmcnab/northwind-mongo/archive/master.zip
apt-get install unzip
unzip master.zip
cd northwind-mongo-master/
./mongo-import.sh
Log out, then:

juju ssh drillbit/0
sudo -i
cd /opt/drill/bin
./drill-conf
show databases;
use juju_mongo_mongodb.Northwind;

Notice the .Northwind suffix: that's the name of the MongoDB database I created.

show tables;
select * from products;
and suddenly I get the full Northwind products table back as a regular SQL result set.

How cool is that? Not bad for running analytics over a database that doesn’t understand SQL.
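And it isn't limited to a straight select. Assuming the standard Northwind column names from that sample dataset, aggregates work just as you'd expect:

SELECT CategoryID, COUNT(*) AS product_count, AVG(UnitPrice) AS avg_price
FROM products
GROUP BY CategoryID;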

TAKING IT FURTHER

Okay, so this isn't a big Drill how-to, but let's look at HDFS. If I run:

juju deploy cs:hadoop-processing

It will give me a fully operational multi-node Hadoop stack, with a bunch of slaves, namenodes and so on, all underpinned by HDFS. I can then connect Drill to my HDFS storage pool by adding a relation to the namenode:

juju add-relation namenode drillbit

Again, when you add this relation it will add a connection to your Drill storage plugins. In it you will see it has configured your HDFS connection IP and port, added a default location and a bunch of file types.
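The generated entry follows Drill's standard file storage plugin format, so it will look something like this (the address and formats here are illustrative; yours will reflect your namenode):

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://172.31.40.12:8020/",
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "json": { "type": "json" },
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
    "parquet": { "type": "parquet" }
  }
}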

Now I can ingest some data into HDFS. For example, here is some Drill sample data:

{
  "type": "ticket",
  "venue": 123455,
  "sales": {
    "12-10": 532806,
    "12-11": 112889,
    "12-19": 898999,
    "12-21": 10875
  }
}
{
  "type": "ticket",
  "venue": 123456,
  "sales": {
    "12-10": 87350,
    "12-15": 972880,
    "12-19": 49999,
    "12-21": 857475
  }
}
If I paste that into a file called sales.json on my namenode, I can push it into HDFS and check it arrived:

juju ssh namenode/0
hadoop fs -put sales.json /user/ubuntu/
hadoop fs -ls /user/ubuntu/

Next I can use the Drill SQL console, my SQL editor of choice or the Drill web console to execute a query. For example:

cd /opt/drill/bin
./drill-conf
show databases;

This will return my list of available databases/datasources, and in it you can see:

juju_hdfs_namenode.root

This is the name of my connection plus .root, which is the default workspace name.

So I’ll go ahead and use it:

use juju_hdfs_namenode.root;

Next I'll use a command that regular SQL people won't be familiar with:

show files;

which should list the files in the workspace, including the sales.json I just ingested into Hadoop.

So, let's do some analytics:

select * from `sales.json`;

Sorry, what? That's right: SQL over JSON. You can do the same over CSV and other files, or groups of files. If you want to utilise the cluster properly for file-based analytics, try using the Parquet file format for column store operations over files inside HDFS or another filestore.
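Drill can also unnest that nested sales map using its KVGEN and FLATTEN functions, and if you point it at a writable workspace you can materialise results as Parquet with a CTAS. A sketch:

-- Turn each venue's sales map into (venue, date, tickets) rows
SELECT t.venue, t.kv.`key` AS sale_date, t.kv.`value` AS tickets
FROM (SELECT venue, FLATTEN(KVGEN(sales)) AS kv FROM `sales.json`) t;

-- Write the data back out as Parquet (needs a writable workspace)
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE sales_parquet AS SELECT * FROM `sales.json`;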

Anyway, run the plain select and the ticket rows come straight back out of the JSON file.

ROUNDUP

So this charm isn't perfect; there is still plenty to be done on the config and automation side, but if you are happy to do a bit of manual work, this will give you a great platform to get started on SQL over NoSQL.

Apache Drill is a fabulous bit of kit and, in some ways, the missing glue between Juju's big data stack and more traditional analytics kit. Of course, that isn't to say Drill is the only answer, or even in many cases the best answer, but it provides an easily accessible, SQL-compliant, distributed processing engine to help analysts get to grips with "big data".


