Is Object Serialization Evil?

In my daily work, I use both an RDBMS and MarkLogic, an XML database. MarkLogic can be considered akin to the newer NoSQL databases, but it has the added structure of XML and standard languages in XQuery and XPath. The NoSQL databases are typically storing documents or key-value pairs, and some other things in between. Given that any datastore will be searched at some point, you will always care how the data is actually stored or whether there is some way to query it easily. Once you start thinking about the problem, you quickly generalize to the “how do I persist any type of data” question. However, my focus is not going to be the comparison of the various data stores, but the comparison of how data is stored. More specifically, I want to show the object serialization, mainly the Java built in method, as a data persistence format is evil.

Given what you normally read on this blog, this may seem like an oddly timed post, but I have run into serialization issues lately in some production code and Mark Needham recently wrote an interesting post about this as well. Coincidentally, Mark is also working with MarkLogic, and there is an interesting item in his post:

The advantage of doing things this way [using lightweight wrappers] is that it means we have less code to write than we would with the serialisation/deserialisation approach although it does mean that we’re strongly coupled to the data format that our storage mechanism uses. However, since this is one bit of the architecture which is not going to change it seems to makes sense to accept the leakage of that layer.

The interesting part of this is that he has accepted using the data format of the storage mechanism, XML in MarkLogic in this case. Why is this interesting? First, it is a move away from the ORM technologies that try to hide the complexities of converting data into objects in the RDBMS world. Also, this is a glimpse into the types of issues that could arise from non-RDBMS storage choices as well as how to persist objects in general.

So, an RDBMS is typically used to map object attributes to a table and columns. The mapping is mostly straightforward with some defined relationship for child objects and collections. This is a well-known area, called Object-Relational Mapping (ORM), and several open source and commercial options exist. In this scenario, object attributes are stored in a similar datatype, meaning a String is stored as a varchar and an int is stored as an integer. But, what happens when you move away from an RDBMS for data persistence?

If you look at Java and its session objects, pure object serialization is used. Assuming that an application session is fairly short-lived, meaning at most a few hours, object serialization is simple, well supported and built into the Java concept of a session. However, when the data persistence is over a longer period of time, possibly days or weeks, and you have to worry about new releases of the application, serialization quickly becomes evil. As any good Java developer knows, if you plan to serialize an object, even in a session, you need a real serialization ID (serialVersionUID), not just a 1L, and you need to implement the Serializable interface. However, most developers do not know the real rules behind the Java deserialization process. If your object has changed, more than just adding simple fields to the object, it is possible that Java cannot deserialize the object correctly even if the serialization ID has not changed. Suddenly, you cannot retrieve your data any longer, which is inherently bad.

Now, may developers reading this may say that they would never write code that would have this problem. That may be true, but what about a library that you use or some other developer no longer employed by your company? Can you guarantee that this problem will never happen? The only way to guarantee that is to use a different serialization method.

What options do we have? Obviously, there are the NoSQL datastores but the actual object format is the relevant question not which solution to choose. Besides the obvious serialized object, some NoSQL datastores use JSON to store objects, MarkLogic uses XML and there are others that store just key-value pairs. Key-value pairs are typically a mapping of a text key to a value that is a serialized object, either a binary or textual format. So, that leaves us with XML, JSON and other textual formats.

One of the benefits of a structured format like XML or JSON is that they can be made searchable and provide some level of context. I have talked about data formats before, so I won’t go into a comparison again. However, do these types of formats avoid the issues that native Java object serialization has? This is really dependent upon what library you are using for serialization. Some libraries will deserialize an object without any issues regardless of whether the object field list has changed. Other libraries could have problems depending upon whether a serialized field exists in the target object, or there might not be solid support for collections (though that is doubtful at this point).

Given that even structured formats could have serialization issues, is the only safe path hand-coded mappings like those used by ORM tools? Some JSON and XML serialization tools use the same mapping methods as the ORM tools in order to avoid these problems. However, once you define these mappings, you are explicitly stating how an object gets translated. This explicit definition will require maintenance, but that is definitely cleaner than trying to trace down a serialization defect in some random stack trace.

So is implicit object serialization really worth the potential headaches? Or should we just consider it evil and never speak of it again?

Enhanced by Zemanta