Why The Next Big ETL Framework Will Be Written In Ruby

As I have contemplated the need and purpose for some kind of information architecture/ETL framework, the Ruby language has repeatedly presented itself as an excellent candidate for the implementation language. Here are some reasons:

Clear Syntax

Programming languages exist for two major purposes. The one that everybody classically mentions is, of course, to communicate instructions to a compiler. But there is a second, very important purpose, which is often overlooked: to communicate intent to humans.

Historically, most programming languages have done reasonably well at the former task, but have fallen short on the latter. To shore up the communication disconnect for the human readers, we have things like code comments, and inline documentation. The implication is that you can talk to the compiler, or you can talk to the humans, but not both at the same time.

This division comes at a cost. The separation of configuration from code carries great overhead in and of itself. Worse, the widening gap between where the program does its work and where we do ours (usually in XML configuration files) means that we end up building an app around two concurrent, and often confusing, workflows: that of the operator/configurer and that of the application itself.

Ruby is a great choice for a framework for two reasons: 1) as an interpreted (scripting) language, its code is stored in a readable file format (as opposed to compiled bytecode or machine code), and 2) Ruby’s clean syntax eliminates much of the cruft and ceremony which makes source code so difficult to read in other languages. The result is that very little of what Ruby code says to the interpreter is not also being said to the human reader.

With a language as clear and concise as Ruby, the need to abstract the “interesting” bits out into a configuration file fades away: it no longer makes sense to try to stash configs into a shared human/computer documentation scheme. Instead, when we want to read or change what the computer sees, we simply read or edit the source code itself.
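To make the idea of "configuration as source code" concrete, here is a minimal sketch of what a pipeline definition might look like. The Pipeline class and its extract, transform and load_to methods are hypothetical names I made up for illustration; they do not come from any existing library.

```ruby
# Hypothetical mini-DSL: none of these names come from an existing framework.
class Pipeline
  attr_reader :name, :steps

  def initialize(name, &definition)
    @name = name
    @steps = []
    # Evaluate the block with this pipeline as `self`, so that bare calls
    # to extract/transform/load_to inside it record steps on this object.
    instance_eval(&definition) if definition
  end

  def extract(label, &block)
    @steps << [:extract, label, block]
  end

  def transform(label, &block)
    @steps << [:transform, label, block]
  end

  def load_to(label, &block)
    @steps << [:load, label, block]
  end

  # Run the steps in order, feeding each step's result to the next.
  def run
    @steps.reduce(nil) { |data, (_kind, _label, block)| block.call(data) }
  end
end

# The "configuration" is just Ruby: readable by the operator and
# executable by the interpreter at the same time.
sink = []
nightly = Pipeline.new("nightly_orders") do
  extract("raw orders")     { [{ id: 1, total: "19.90" }, { id: 2, total: "5.00" }] }
  transform("parse totals") { |rows| rows.map { |r| r.merge(total: Float(r[:total])) } }
  load_to("in-memory sink") { |rows| sink.concat(rows) }
end
nightly.run
```

There is no parser to write here. The definition block is ordinary Ruby, so changing the workflow means editing the same file the interpreter reads.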

Not only is a system like this easier to maintain, it is easier to build. Configuration parsers are notoriously complex to build, and all the work spent building and maintaining them comes at the expense of work on the actual task which the application is supposed to accomplish.

Test All The @#$%ing Time

The Ruby community has pioneered some of the most innovative and exciting approaches to test-driven development. Just as Ruby code excels at communicating intent, so Ruby tests excel at communicating clearly what is being tested.

The role of testing in building this framework should be obvious. But a test-driven mindset will also yield a great benefit when implementing a solution on the framework: by first writing tests for all data movement, transformation and amalgamation operations, we can state the expected behavior of the system with high confidence. This assurance becomes more important as our data workflows grow more complex: the more intricate our data universe, the more volatile its response to small changes. By supporting that universe with a comprehensive frame of tests, we can catch small, unexpected consequences in the system and correct them.
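As a rough illustration of a test written before (or alongside) a transformation, here is a sketch using Minitest, the test library that ships with Ruby. The normalize_customer method and its field names are hypothetical, invented only to give the tests something to describe.

```ruby
require "minitest/autorun"

# Hypothetical transformation: normalizes one raw customer record.
def normalize_customer(row)
  {
    id: Integer(row.fetch("id")),
    email: row.fetch("email").strip.downcase,
    name: row.fetch("name").strip
  }
end

class NormalizeCustomerTest < Minitest::Test
  def test_trims_and_downcases_email
    row = { "id" => "42", "email" => "  Ada@Example.COM ", "name" => "Ada" }
    assert_equal "ada@example.com", normalize_customer(row)[:email]
  end

  def test_converts_id_to_integer
    row = { "id" => "42", "email" => "ada@example.com", "name" => "Ada" }
    assert_equal 42, normalize_customer(row)[:id]
  end

  def test_rejects_rows_missing_a_required_field
    # Hash#fetch raises KeyError when the key is absent.
    assert_raises(KeyError) { normalize_customer({ "id" => "42" }) }
  end
end
```

The test names read as a plain statement of what the transformation promises, which is the kind of clarity I have in mind.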

Is Ruby Fast Enough?

The speed of Ruby is, without a doubt, one of the objections I hear most often to using Ruby for ETL tasks. But there are options. Several variants of, and alternatives to, standard C Ruby offer significant speed improvements: Ruby Enterprise Edition, JRuby and Ruby 1.9. When designing the framework, we will need to explore these alternative interpreters in order to address questions of speed.

We will also need to build the framework in a way that facilitates benchmarking: if we can identify areas of systemic slow performance, we can bring in another faster language (e.g. Java, C, etc.) to speed up those operations.
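One modest way to make benchmarking easy is to treat each operation as a named, callable step and time it with Ruby's standard Benchmark module. The sketch below assumes that shape; time_steps and the two sample steps are hypothetical names used only for illustration.

```ruby
require "benchmark"

# Hypothetical helper: times each named callable and returns { label => seconds }.
def time_steps(steps)
  steps.transform_values { |step| Benchmark.realtime { step.call } }
end

# Sample data standing in for extracted rows.
rows = Array.new(50_000) { |i| { id: i, amount: (i % 100).to_s } }

# Two ways of doing the same work, so their cost can be compared.
steps = {
  "map then sum"    => -> { rows.map { |r| r[:amount].to_i }.sum },
  "sum in one pass" => -> { rows.sum { |r| r[:amount].to_i } }
}

# Print the slowest step first; that is the candidate for optimization
# or for a rewrite in a faster language.
time_steps(steps).sort_by { |_label, seconds| -seconds }.each do |label, seconds|
  puts format("%-16s %.4fs", label, seconds)
end
```

Actual timings will vary by machine and interpreter, so the point is the shape of the measurement rather than any particular number.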

I’m convinced Ruby is the right choice for speed. The time spent developing a system is a major concern for enterprises; developer time is a cost which scales up and down very smoothly. Execution time, on the other hand, has a cost/benefit threshold: if an operation completes within an acceptable time frame, there is little benefit in further optimization.

For example, suppose an enterprise has a report which generates automatically overnight. If the report does not complete until 11:30am the next morning, this is a problem. The enterprise will no doubt be interested in getting it to complete by 8:00am. On the other hand, if the report finishes by 6:30am, how much developer time and money will they spend trying to get it to complete by 2am? In the real world, zero dollars and zero time.

So execution speed needs to be good, but we ought to remember that shorter execution time has a threshold of no-additional-value, whereas developer time spent in implementing a solution really does not. Ruby is an excellent choice because it minimizes developer time. If we can also make it a performant choice (and I’m convinced we can) then a Ruby framework will be a compelling choice.

APIs In Ruby Are Easy

One of the most disappointing things about working with Java was the broken promise of web services. Supposedly, web services were going to result in nice, fast, clean, zero-config remote interactions between applications in arbitrary locations. All the tutorials kept saying how much easier and nicer this was than things like COM and RPC. Oof, I hate to imagine. Web services were never easy in Java, not when compared to RESTful interfaces written in Ruby. The Java variants could not keep from dipping into ugly reams of XML: “Hello World” descends into “Hell World” as you create or edit XML in eight different locations just to build a completely useless spike.

By comparison, when I built my first Ruby API, I found myself wondering, “that can’t be it, can it?” In Ruby they are dead simple. This fact is extremely important: in real-world enterprises, you cannot assume all ETL operations will take place on a single monolithic core. The fact is that most enterprises need to integrate data which comes from all over the place, from machines with varying computational power and storage capacity. Furthermore, the various custodians of all the data repositories tend to be very possessive and protective of those resources: they neither want someone else to take over those datastores, nor do they have the budget to become the arbiter of all other data stores. This is both a cause and a result of the kind of information siloing which prevails in so many enterprises. So ETL in the real world must be, above all else, distributed, adaptive and easy. RESTful APIs are important for providing that adaptability.
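To give a sense of how little ceremony is involved, here is a sketch of a read-only JSON endpoint written against the Rack convention, where an application is any object that responds to call(env) and returns a status, headers and body. Choosing Rack is my assumption for the example; CUSTOMERS and CustomerApi are hypothetical names, and serving this over HTTP would additionally need a Rack-compatible server, which is not shown.

```ruby
require "json"

# Hypothetical in-memory data; a real instance would query its own datastore.
CUSTOMERS = { "1" => { "id" => 1, "name" => "Ada" } }.freeze

# A Rack-style application: takes the request environment hash and
# returns [status, headers, body].
CustomerApi = lambda do |env|
  match = env["PATH_INFO"].to_s.match(%r{\A/customers/(\d+)\z})
  record = match && CUSTOMERS[match[1]]

  if env["REQUEST_METHOD"] == "GET" && record
    [200, { "content-type" => "application/json" }, [JSON.generate(record)]]
  else
    [404, { "content-type" => "application/json" }, [JSON.generate({ "error" => "not found" })]]
  end
end

# Exercising the endpoint directly, without a server:
status, _headers, body = CustomerApi.call({ "REQUEST_METHOD" => "GET", "PATH_INFO" => "/customers/1" })
puts status
puts body.join
```

Because the whole endpoint is one callable object, each department could expose its own data this way without handing over its datastore.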

Ruby Frameworks Are Light

When compared to equivalent stacks in the Java world, Ruby on Rails is gloriously light. It can run just about anywhere. In the real world of information architecture, it is unrealistic to imagine that all ETL operations can be done on a single, mammoth server. It is more realistic to anticipate smaller, more distributed ETL going on in a variety of servers spread over several departments.

However, most of the high-end commercial ETL solutions are still tied to an old-school per-seat licensing scheme. Quite frankly, this opens up an opportunity to eat their lunch. What if your ETL framework could run anywhere, in a wide variety of environments, from .NET, to the Java stack, to Linux, Mac, or Windows? What if the hardware requirements were minimal? And what if there were a low barrier to intercommunication between all those instances of the framework?

Extensibility

The idea of a framework is to solve all of the common challenges within a problem domain in a way that puts the truly distinctive work front and center. In the world of web applications, frameworks like Ruby on Rails do this by handling the details of the HTTP protocol, cookies and sessions. Web frameworks do this so well that few people really realize how much benefit they receive by using them.

One difficulty in building a framework is in deciding just which problems really are common. For an ETL framework, an obvious one might be fault tolerant FTP file movement: it’s a fairly common thing to do, so why make users reinvent the wheel each time they need to move files over FTP?
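The fault-tolerant part of such a feature could be as small as a retry wrapper that the framework provides once. The sketch below is one possible shape, not a definitive design: with_retries and its parameters are hypothetical, and the FTP transfer itself is simulated by a block that fails twice before succeeding.

```ruby
require "timeout"

# Hypothetical helper: runs the block, retrying with exponential backoff
# when it raises one of the listed (assumed transient) errors.
def with_retries(attempts: 3, base_delay: 0.5, on: [IOError, SystemCallError, Timeout::Error])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts            # out of attempts: re-raise the last error
    sleep(base_delay * (2**(tries - 1)))  # wait base_delay, then 2x, 4x, ...
    retry
  end
end

# Simulated flaky transfer. A real caller would put its FTP client calls
# inside the block instead.
result = with_retries(attempts: 5, base_delay: 0.01) do |attempt|
  raise Errno::ECONNRESET if attempt < 3
  "transferred on attempt #{attempt}"
end
puts result
```

Errors outside the listed classes are not rescued, so a genuine bug in the transfer code still fails immediately instead of being retried.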

But there will always be tasks which turn out to be only somewhat common: some people don’t need this functionality, but many do. So it makes sense to design the framework in such a way that it can be easily extended in the future. And here is where the power of the open source community can really bear fruit: if your framework is open, extensible and free, and implemented in a language that is also open, extensible and free, its users will scratch their own itches, and often will share the resulting solutions with the community.

Ruby is awesome for designing extensible frameworks. As a language, it has tools for extensibility that many other languages simply don’t have. Powerful concepts such as “everything is an object” and closures mean that extensions which would be difficult in other languages are easy in Ruby. In the same way that Ruby’s clean syntax aids the implementer in configuring a data workflow, it also helps the plugin developer to understand the architecture of the framework, so that plugins are also lightweight, powerful and easy to read.
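Here is a small sketch of how closures can make a plugin mechanism nearly free. The Transforms registry and the plugin names are hypothetical, invented for this example; a real framework would likely want namespacing and error handling on top.

```ruby
# Hypothetical plugin registry: plugins are just named blocks.
module Transforms
  @registry = {}

  class << self
    # Store the given block under a name.
    def register(name, &block)
      @registry[name] = block
    end

    def fetch(name)
      @registry.fetch(name)
    end

    # Apply the named transforms in order to a value.
    def apply(names, value)
      names.reduce(value) { |acc, name| fetch(name).call(acc) }
    end
  end
end

# A plugin author adds behavior without touching the framework's source.
Transforms.register(:strip) { |s| s.strip }
Transforms.register(:upcase, &:upcase)

# The registered block is a closure: it remembers `limit` from the
# method call that created it.
def register_truncate(limit)
  Transforms.register(:"truncate_#{limit}") { |s| s[0, limit] }
end
register_truncate(5)

puts Transforms.apply(%i[strip upcase truncate_5], "  hello world ")  # prints "HELLO"
```

Since blocks are objects like everything else, the registry needs no plugin base class or interface declaration, only a name and a block.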

Published by Joel Helbling on
