The third day with HBase is a bit short, but it opens up a world of possibilities: the Cloud.
This is where HBase belongs: few personal (or even corporate) networks are large enough to let it perform correctly.
HBase depends on a large number of servers running in parallel for its performance, and there are few other places to find that many machines.
The first topic for today is Thrift, a generic remote-interface framework for server programs (and a gift from the new Evil Empire, Facebook).
It is a tool to document a binary API and to generate client stubs that use this API. HBase exposes such an API, making it possible to write clients in a variety of languages.
Using Thrift on your own project (on the server side, if you have one) would make it possible to use different languages on the client side, depending on whichever best fits the needs (scripting languages for glue scripts, …).
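As a quick illustration of the stub-generation workflow (the file name is the interface definition that ships in the HBase source tree; the exact path varies by version):

```shell
# Thrift reads an interface-definition file and emits client stubs
# for each requested target language:
thrift --gen rb Hbase.thrift   # Ruby stubs, written to gen-rb/
thrift --gen py Hbase.thrift   # Python stubs, written to gen-py/
```

The same definition file drives every language, which is what lets each client pick whatever fits best.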
When I tried the example from the book, I had to change the connection address in the thrift_example.rb code from 127.0.0.1; otherwise Thrift would refuse the connection.
The first, and perhaps the most complex, step is to open an account on AWS. It will require a phone, a credit card, a computer, and some time. And perhaps a couple of emails if the account opening remains stuck in “Pending verification” status.
Once this is done, Whirr can be used to create instances (be careful with that: Amazon will charge at least one hour for each server even if you take it down after a couple of minutes), download and install specific servers (mostly from the Hadoop family), and configure them, all of this from the comfort of the command line (which in my case is cosily close to a cup of warm cocoa, so it is very comfortable indeed).
All you have to do is retrieve your security token from your AWS account page, create a public/private key pair, then write a recipe file (which describes what kind of machines you need and how many, what to install on each, …), and Whirr takes care of the rest. The first two steps only have to be done once; you can deploy as many recipes as you need.
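A minimal sketch of those steps, assuming the recipe lives in a file called hbase.properties (the file name and key path are illustrative, not the book's exact ones):

```shell
# One-time setup: a key pair for Whirr, and your AWS credentials
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
export AWS_ACCESS_KEY_ID=...       # from your AWS account page
export AWS_SECRET_ACCESS_KEY=...

# Per-recipe: launch the cluster described in the recipe file...
whirr launch-cluster --config hbase.properties

# ...and destroy it when you are done (this is what stops the billing)
whirr destroy-cluster --config hbase.properties
```

launch-cluster and destroy-cluster are the two Whirr sub-commands you will use most; everything else lives in the recipe file.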
The setup process takes a few minutes, then you can connect with SSH to one of your remote servers.
Whirr also creates a security configuration for each recipe, opening only the ports that are required by the servers in the recipe and limiting the sources of connections to specific servers. You can also edit the security rules directly in the recipe if you want.
The ease with which this can be done is really surprising. It reminds me of how easy it was to deploy a Rails application on Heroku.
Now, I do not have any foreseeable use for such computing capacity, but I can see how helpful it could be for an organisation to run occasional large data-processing jobs without having to maintain a permanent data center.
There is only one exercise today: opening a Thrift connection to an AWS-deployed HBase.
The first step is to get Thrift running on the deployed machines. The book suggests connecting by SSH and starting the server there, but there is a better way if you know you will need Thrift: ask Whirr to deploy it automatically.
In the recipe file, I’ve added the hbase-thriftserver server to the list of roles to install. [The full recipe listing is omitted here.]
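A hedged sketch of what such a recipe can look like (the property names and role names are real Whirr settings; the cluster name, layout, and instance counts are illustrative, not the book's exact recipe):

```shell
# Write a minimal Whirr recipe; note hbase-thriftserver among the roles
# of the worker instances:
cat > hbase.properties <<'EOF'
whirr.cluster-name=hbase-test
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,2 hadoop-datanode+hadoop-tasktracker+hbase-regionserver+hbase-thriftserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
whirr.public-key-file=${whirr.private-key-file}.pub
EOF
```

With the role in the recipe, Whirr starts the Thrift server for you on each instance that carries it.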
As for the connection to the Thrift server, the method described in the book is to open port 9090 to the world and hope to be the only one who knows about it: a likely possibility, but who would want to take such a chance in production?
Fortunately, there is a better solution: SSH Tunneling. It is very easy to set up and requires nothing but what we already have.
The general idea is to open a tunnel between a local port and a remote port: whatever you put into the local port is picked up by ssh and transported over the SSH connection; once it reaches the remote machine, the remote ssh instance forwards the data to the remote port, as if it came from a client running on the remote machine.
The transport between the two machines only requires the remote one to have its SSH port open (which is both already the case and secure), and the connection is authenticated and encrypted by SSH itself.
Implementing the tunnel takes a single ssh command, run from the directory where you created the key pair. Here I map the local port 9090 to the remote machine’s port 9090; that way I don’t even have to change my thrift_example.rb code. But of course, if I had to connect to different machines, I would use different local ports.
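A sketch of the command, assuming the Whirr key pair from earlier and an illustrative EC2 hostname (replace it with one of your cluster's public addresses):

```shell
# Forward local port 9090 to port 9090 on the remote machine:
#   -f  go to the background after connecting
#   -N  do not run a remote command, just forward ports
#   -L  local_port:remote_host:remote_port mapping
ssh -f -N -i ~/.ssh/id_rsa_whirr \
    -L 9090:localhost:9090 \
    ec2-203-0-113-1.compute-1.amazonaws.com
```

Once the tunnel is up, a client pointed at 127.0.0.1:9090 transparently talks to the remote Thrift server.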
The Thrift server was automatically started by the recipe.
With this in place, and after creating some tables in the remote HBase, the thrift_example.rb client runs unchanged against the remote cluster. [The session listing is omitted here.]
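An illustrative session (the table and column-family names are assumptions, not the book's exact ones):

```shell
# On the remote master, over SSH, create a table in the HBase shell:
#
#   hbase shell
#   hbase> create 'wiki', 'text'
#
# Then locally, with the 9090 tunnel up, the client talks to 127.0.0.1
# as if HBase were running on this machine:
ruby thrift_example.rb
```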
(be careful not to use LZO as a compression algorithm in the remote HBase, as I did when I tried the first time: the default HBase has no LZO support and will fail when you try to enable a table with LZO compression).
To take a tunnel down, you’ll have to find and kill it (as far as I can tell). If you have no other ssh connections, killall ssh is a simple solution. In any case, the connection will be cut when the remote servers are destroyed.
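A more surgical alternative to killall, a sketch assuming the tunnel's local end listens on port 9090: find the ssh process that owns the listening socket and kill only that one.

```shell
# -t: print PIDs only; -i: select by TCP port; -sTCP:LISTEN: listening side only
pid=$(lsof -ti tcp:9090 -sTCP:LISTEN)
kill "$pid"
```

This leaves any other ssh sessions you may have open untouched.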
Wrapping up HBase
I like what I see in HBase: the project has strong backers among its users (Yahoo, Facebook, …); it belongs to a large family of tools that help design Big Data solutions; and it integrates well with some Cloud platforms.
The model is easy to understand (the book mentions the possibility of eventual consistency due to regional replication, but this remains a simpler model than Riak’s), and close to the original MapReduce concept.
This is really one tool I will take a closer look at in the near future.