The Often Forgotten Task Of Data Integration

When developing applications, most teams focus on the features that the application must have. This is fairly normal given that many applications only manage data created by the application itself. However, more web applications are now closer to mashups than to the client-server applications of the 1990s. That means that more data may be coming from external sources than from the application itself. I am reminded of this daily, but I decided to give it more thought when I saw an interesting post about controlled vocabularies:

Just recently a survey about “Controlled vocabularies” and their significance for enterprise information management has started… What are the main application areas of controlled vocabularies from your perspective? A bit surprising is the intermediate result, that it’s not “Semantic Search” or “Support of multilingual applications” which was considered to be the most important application. Instead of this it turned out that “Data Integration” is king

You may be wondering why this matters to you, and hopefully I can answer that. In particular, if you are building a social web application, data integration is likely to be one of your core tasks. As an example, look at blog comment platforms like Disqus and IntenseDebate, or social media monitoring applications like Radian6, ScoutLabs, Trackur and YackTrack. All of these applications require data from other social applications like Facebook, Twitter and Digg. They need to combine data from those systems into their data storage platform of choice and then do something with it all.

I know sharding and partitioning are all the rage, but sometimes the data is meant to be combined into one common format. Think about the basic status update information from Twitter, Facebook and LinkedIn. Does it always make sense to store each of these in its own table, shard or partition? What benefit do you gain from separate storage for each kind of status update when the services tend to converge on the same functionality over time? This does not sound like a large issue, but it has the potential to become one, especially if it is not handled well. If you store each service's updates in their own table, then you have data access code for all three of the applications. It also means that if you decide to support any other social update application, you have to write another set of data access code for the new tables.
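To make the "one common format" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `StatusUpdate` type, the `normalize` function and every payload key (`screen_name`, `created_ts`, `from_name` and so on) are hypothetical names invented for this example, not the real response formats of Twitter, Facebook or LinkedIn.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StatusUpdate:
    """One common shape for a status update, whatever service it came from."""
    service: str
    author: str
    text: str
    posted_at: datetime


# NOTE: the payload keys used below are made up for illustration only.
def from_twitter(raw: dict) -> StatusUpdate:
    posted = datetime.fromtimestamp(raw["created_ts"], tz=timezone.utc)
    return StatusUpdate("twitter", raw["screen_name"], raw["text"], posted)


def from_facebook(raw: dict) -> StatusUpdate:
    posted = datetime.fromisoformat(raw["created_time"])
    return StatusUpdate("facebook", raw["from_name"], raw["message"], posted)


def from_linkedin(raw: dict) -> StatusUpdate:
    posted = datetime.fromisoformat(raw["timestamp"])
    return StatusUpdate("linkedin", raw["member"], raw["comment"], posted)


# Supporting a new service means adding one function and one entry here.
NORMALIZERS = {
    "twitter": from_twitter,
    "facebook": from_facebook,
    "linkedin": from_linkedin,
}


def normalize(service: str, raw: dict) -> StatusUpdate:
    """Convert a service-specific payload into the common format."""
    return NORMALIZERS[service](raw)
```

With something like this in place, only the small per-service functions know about the differences between services; everything downstream can work with `StatusUpdate` alone.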

As with your data access code, your entire codebase could become bloated with extraneous code dealing with each service separately. Besides the obvious differences in gathering data, you would likely have code for a user within each service, a status update for each service, and other functionality or data like geolocation. Replicating this basic code once is annoying; duplicating it four times is just extremely bad practice. Hopefully, your solid object-oriented development practices have avoided this type of scenario.

The other side of the data integration question is: how do you know when you have gone too far? Let’s look at this using an example. If you already have the basic status updates integrated into one table, it sounds like a good idea to add other similar concepts to the same table. Digg has comments and diggs (upvotes). A comment looks a lot like a status update on the surface, so you shove that into the same model. Then you decide that diggs are essentially the same thing as retweets or likes. If you continue down this path, you may end up with a smaller codebase, but one loaded with special-case handling for the different services. Yep, you traded less code in the data access layer for more complexity in other areas. Obviously, that does not work either.
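One possible way to stop short of that point, offered only as a sketch and not as the answer, is to merge the things that really are alike (text a user wrote) while keeping a separate small model for things that only look alike (diggs, retweets, likes). The `Post` and `Reaction` types and the `count_reactions` function below are hypothetical names for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Text a user wrote: a status update or a comment."""
    service: str
    author: str
    text: str


@dataclass(frozen=True)
class Reaction:
    """A response to a post that is not itself a piece of writing."""
    service: str
    author: str
    kind: str  # for example "digg", "retweet" or "like"
    post: Post


def count_reactions(reactions: list) -> Counter:
    """Count reactions per post without branching on the service."""
    return Counter(r.post for r in reactions)
```

Because a reaction is never forced into the shape of a post, code that handles posts does not need checks like "if this row is really a digg".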

Granted, I have painted a rather bleak picture of data integration, but that was not entirely the point. The point is that data integration could be a core part of your application and you are not giving it the same weight as the functionality within the application. If you look at data analysis and integration as a core feature of your application, then you may see other things that you can do with your data. Given that we are collecting more data than we ever have before, knowing more about how your data should work for you is becoming much more important.

Remember, this is the year of big data. If you do not design your data storage as carefully as you design your application, you will not be able to use your data effectively. If you cannot use your data effectively, you may lose a big opportunity.
