Saturday 29 November 2008

In Defense of Rolling-Your-Own Data Access Layer

I would always say that if a library already exists to implement a particular piece of functionality then you should use it. There are so many great open source components implementing pretty much everything you can imagine that you would be foolish not to reuse them. The only place where I would go against my own advice is at the data access layer of an application.

Libraries that implement data access components are available in abundance. You can get both commercial and open source offerings for any language you care to think of. So why, with all this code available, could it be a good idea to go your own way when it comes to data access?

The first reason is down to performance. Given all the metadata interrogation, object creation and data type reflection necessary to implement a generic data access layer, you will always be able to implement a quicker application-specific data access layer than is available from an off-the-shelf solution. A quick search on Google will turn up a number of studies comparing raw performance against performance through a data access library. (Here is a typical one comparing performance using various persistence technologies available for the Java language.)
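To make that overhead concrete, here is a minimal sketch (my own illustration, not taken from any particular library) of what a generic mapper has to do for every single row. It assumes, purely for illustration, that column labels happen to match field names:

    import java.lang.reflect.Field;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;

    public class GenericRowMapper<T> {
        private final Class<T> type;

        public GenericRowMapper(Class<T> type) {
            this.type = type;
        }

        public T map(ResultSet rs) throws Exception {
            // Metadata interrogation: what columns does this row have?
            ResultSetMetaData meta = rs.getMetaData();
            // Object creation via reflection rather than a plain constructor call.
            T instance = type.getDeclaredConstructor().newInstance();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                // Reflection to find and populate the matching field for each
                // column (assumes, for this sketch, field name == column label).
                Field field = type.getDeclaredField(meta.getColumnLabel(i));
                field.setAccessible(true);
                field.set(instance, rs.getObject(i));
            }
            return instance;
        }
    }

All of that work happens per row, on every query. An application-specific layer knows its types and columns at compile time and skips the lot.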

The second reason is down to developer laziness. A data access layer allows the developer to forget that they are interacting with a database at all, and this can lead to very bad code being produced: for example, performing multi-way joins between tables in the application code rather than letting the database take the strain. I don't think this would happen if developers explicitly understood they were interacting with the database.
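Here is a sketch of the kind of code I mean, with made-up table and class names: two tables are fetched in full and joined in a loop, when a single SQL statement would let the database do the work.

    import java.util.ArrayList;
    import java.util.List;

    public class ReportBuilder {

        public static class Customer {
            public final long id;
            public final String name;
            public Customer(long id, String name) { this.id = id; this.name = name; }
        }

        public static class Order {
            public final long customerId;
            public final double total;
            public Order(long customerId, double total) {
                this.customerId = customerId;
                this.total = total;
            }
        }

        // The lazy version: both tables are fetched in full and joined in
        // application code. For n orders and m customers this costs n * m
        // comparisons here, plus two complete result sets over the network.
        public List<String> joinInApplication(List<Order> orders,
                                              List<Customer> customers) {
            List<String> lines = new ArrayList<>();
            for (Order order : orders) {
                for (Customer customer : customers) {
                    if (order.customerId == customer.id) {
                        lines.add(customer.name + ": " + order.total);
                    }
                }
            }
            return lines;
        }

        // What the database should be doing instead, in a single statement,
        // with only the joined rows coming back over the wire:
        //   SELECT c.name, o.total
        //   FROM orders o JOIN customers c ON o.customer_id = c.id
    }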

One of the arguments against going down the roll-your-own route is: why re-invent the wheel when there are plenty of implementations available? Well, there is NOT an implementation available that specifically meets the access requirements of your application and your database. Other kinds of library provide generic functionality that does not depend on the particulars of your application; a data access layer, by contrast, is shaped by your application code and the structure of your database. For this reason, I think the data access layer is a special case.

One of the reasons there are so many implementations around is that it is not that difficult to write a data access layer, especially one for a specific application. There are good books available to get you started (Data Access Patterns by Clifton Nock or Patterns of Enterprise Application Architecture by Martin Fowler). So go ahead and be brave, roll your own. It is not as hard as you think and your application will benefit enormously.
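As a taste of how little code an application-specific layer can need, here is a minimal hand-rolled DAO sketch using plain JDBC. The Customer class and customer table are hypothetical stand-ins for whatever your application actually works with:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class CustomerDao {

        public static class Customer {
            public final long id;
            public final String name;
            public final String email;

            public Customer(long id, String name, String email) {
                this.id = id;
                this.name = name;
                this.email = email;
            }
        }

        private final Connection connection;

        public CustomerDao(Connection connection) {
            this.connection = connection;
        }

        // One query, one hand-written mapping: no metadata interrogation,
        // no reflection, just direct column reads into a known type.
        public Customer findById(long id) throws SQLException {
            String sql = "SELECT id, name, email FROM customer WHERE id = ?";
            try (PreparedStatement stmt = connection.prepareStatement(sql)) {
                stmt.setLong(1, id);
                try (ResultSet rs = stmt.executeQuery()) {
                    if (!rs.next()) {
                        return null;
                    }
                    return new Customer(rs.getLong("id"),
                                        rs.getString("name"),
                                        rs.getString("email"));
                }
            }
        }
    }

Add a method per query your application actually makes, and you have a data access layer that does exactly what you need and nothing more.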

Friday 31 October 2008

Application-specific database engines

I've already written about the fact that databases are increasingly being created to support a single application, and Michael Stonebraker has written about the drivers for multiple types of database engine. However, these trends run much deeper than that.

When packaged applications or services need a storage solution, a traditional relational database system is not necessarily the first choice. Many document management systems, workflow systems, CRM solutions, application servers and so on ship with their own specialized storage systems. This lets their developers build a storage engine to meet their own very specific requirements, which in turn means that a lot of unnecessary overhead and complexity can be stripped from the storage component to deliver consistently high performance to the application.

A great example of this on the web is Facebook and the approach it takes to storing and serving the enormous number of photos it holds. If you take a look at the presentation Needle in a Haystack: Efficient Storage of Billions of Photos you'll see how they've built a database engine for images, driven by some extreme scalability requirements. Could they have used a traditional database? ...No. They wouldn't have been able to meet their performance criteria. They needed to ensure that every I/O that was made was necessary.

Even when a traditional database engine is involved, there can be database-like code sitting in the application to extend the capabilities of the underlying engine. Database sharding is a good example. In this approach, data is federated over a collection of cheap servers to increase scalability and performance. Typically, the applications that use sharding carry the code that distributes the data over the shards, and that combines the results coming back from them, within their application code. I've used similar techniques myself, before most of the commercial database engines started supporting partitioning and clustering natively. (Something that MySQL - which most of the sharding practitioners seem to use - has only just started to support.)
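To give a flavor of what that application-side code looks like, here is a minimal sharding sketch. The JDBC URLs and the hash-based routing scheme are illustrative assumptions, not anyone's production design:

    import java.util.Arrays;
    import java.util.List;

    public class ShardRouter {
        private final List<String> shardUrls;

        public ShardRouter(List<String> shardUrls) {
            this.shardUrls = shardUrls;
        }

        // Route by key: the same customer always lands on the same shard,
        // so single-customer reads and writes touch exactly one server.
        public String shardFor(long customerId) {
            int index = Math.floorMod(Long.hashCode(customerId), shardUrls.size());
            return shardUrls.get(index);
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(Arrays.asList(
                    "jdbc:mysql://shard1/app",   // hypothetical shard URLs
                    "jdbc:mysql://shard2/app"));
            System.out.println(router.shardFor(42L)); // always the same shard for 42
        }
    }

The painful part is everything this sketch leaves out: any query that spans customers has to be fanned out to every shard and the results merged back together in application code, which is exactly the work the database engine would normally be doing for you.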

Now, not all applications need to go to the extremes of building or augmenting a database storage engine, but for those that just aren't getting enough out of an off-the-shelf database solution, more and more brave souls seem to be taking on the challenge.

Monday 22 September 2008

Databases no longer shared resources?

One interesting trend I've noticed in many of the organizations I've been into is that databases are increasingly being built to serve single applications. The early vision of databases shared amongst multiple applications is no longer the first choice. To a certain extent this has always been the case for some operational systems, but the reach of single-application databases has now grown. You'll even find data replicated across multiple multi-terabyte data warehouses to support different business intelligence solutions.

One reason for this trend could be that disk is seen as a cheap resource, so there is no longer a cost constraint pushing organizations to minimize the number of copies of data. Obviously this is not the full story, as there is a cost to keeping the different versions of the data in sync (although this may not always be necessary). Given that synchronization cost, the real driver for the growing popularity of single-application databases seems to be the need for business agility. I think businesses can no longer be held to ransom by the database when they perceive the need to update an application to improve business performance.

In a shared database environment, making a change to an application that requires changes to the underlying database can be extremely costly and time-consuming. Just take a look at the excellent book Refactoring Databases: Evolutionary Database Design if you want to better understand the various impacts of making changes at the data layer. By moving to single-application databases, all these complexities can be removed, and businesses can update their applications at the rate the market demands rather than at the pace the database allows.

This move also fits in with the agile development practices that have come to the fore in the development community in recent years. I'm sure that time and again the business guys have asked for updates to applications and the developers have said "sure", only to be blocked by what was going on in the database layer. That is not to say that changes are impossible in a traditional shared database environment, as Refactoring Databases shows, but without single-application databases you are not going to be able to run as fast as the business guys want.

Thursday 18 September 2008

Big Data

Take a look at the special edition of Nature on Big Data. Although it focuses on the continued explosion of scientific data and the problems of processing it all, it's a fair commentary on what is happening across all industries: there is too much data for current tools to process efficiently. More often than not, organizations find themselves limited by their ability to process and understand all the data they hold. Google (when is it not?) is a stellar example of the payback you can get if you can tame the data deluge. If only ordinary organizations could do half as good a job of getting to grips with their data. Fortunately, that means work for the rest of us.

Another article worth reading is The Claremont Report on Database Research, which reports on a workshop held by some eminent database bigwigs and their conclusions on the challenges facing the database industry in this new climate. The report picks up on the "Breadth of excitement about Big Data". Again, Google has probably been key in stirring up this excitement; they have certainly opened people's eyes to the value of getting to grips with this explosion of information. This is probably why there are so many new venture-backed database companies starting out at the moment. It's all about the data. What a great time to be a database practitioner.