With social media, big data has come to the forefront of technology. Whether you want to continuously search Twitter, aggregate the social activity on several sites, or mine people’s activity on Facebook, handling big data is critical. There are two questions you need to answer when looking at a project that will handle big data. First, how is big data defined, and how do you know when you are dealing with it? Second, how do you deal with it?
How big is big data?
Big data thresholds change over time, largely because they depend on how well traditional storage mechanisms can cope with the data. Part of the storage problem is hardware related, e.g., can the disk store a file larger than 4GB? That question may not be a big deal now, but 15 years ago it was a major concern. Another question about size is how well an RDBMS can store the data. Will the database crash if it tries to manage 100GB of data? Yes, 100GB of data in one database was huge before 2000. Technologies like database partitioning, where a large table is physically split and managed by the database engine, were still young. Now, even free and open source databases have partitioning and replication. The size of big data has increased dramatically as well. When people talk of big data, they mean hundreds of millions of rows in one table and a database potentially over 1TB. Yes, one terabyte. Even though big data is a hot topic, most developers get few opportunities to really work with it. For our purposes, let’s assume you are going to aggregate data from social services in some way; otherwise this post would be fairly short and uninteresting.
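If you are not sure whether your project crosses that line, a quick back-of-the-envelope estimate can help. The sketch below is only a rough illustration: the function names, the average row size and the 30% index overhead are my own assumptions, not figures from any particular database, so plug in numbers measured from your own data.

```python
def estimate_table_bytes(row_count, avg_row_bytes, index_overhead=0.3):
    """Rough on-disk size: raw row data plus a fractional overhead for indexes.

    index_overhead is an assumed fraction (0.3 means 30% extra), not a
    measured value; real overhead depends on the engine and the schema.
    """
    raw_bytes = row_count * avg_row_bytes
    return round(raw_bytes * (1 + index_overhead))


def human_readable(num_bytes):
    """Format a byte count using decimal units (1 TB = 10**12 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


if __name__ == "__main__":
    # Hypothetical workload: 300 million rows averaging 3,000 bytes each.
    total = estimate_table_bytes(300_000_000, 3_000)
    print(human_readable(total))
```

With those made-up numbers the estimate lands a little over one terabyte, which is the territory this post is about.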
How do you deal with big data?
One of the first questions when dealing with any database, big data or smaller data, is what you are going to do with it. Is search your primary function? Are you going to analyze the data using typical data mining techniques? Are you creating more of a FriendFeed-like reading and browsing service? Knowing your target is very important, as it will likely change the way your data is stored as well as affect your choice of technologies. One major assumption that I am making is that you do not want to spend money on expensive tools like Oracle, SAS or Informatica. So, what kind of tools and technologies do you need to look at?
Data storage is possibly the most important decision when dealing with large amounts of data. Traditional RDBMS software can handle huge amounts of data but sometimes requires extensive knowledge to manage. MySQL can easily handle many data storage needs and it is well known by many developers, which makes it the easy choice for many people. However, there is a growing number of NoSQL choices that may also make sense. Some of the NoSQL options have very good text search capabilities, while others have been optimized for speed of reads or writes. Knowing how your application will access its data helps refine this choice. Also, do not forget about the potential of a mixed environment where some data is in an RDBMS and other data is better suited to a NoSQL datastore. There is a large list of categorized NoSQL options at NoSQL-Databases.org.
No matter how well architected your data storage solution is, sometimes reads are just not fast enough. This typically happens on a highly trafficked site, but it also applies when you simply have some data that does not change very often. In order to squeeze as much speed as possible out of your application, you probably want some level of data caching. The basic idea is that your data cache is one big hashmap stored in memory, which allows extremely fast reads. This is much faster than traditional database access or basic file I/O. If you have paid any attention to web application development over the past several years, you have heard of memcache. Memcache is a data caching server that you can use with your application. This is one option you can take, but some people like to have more control over how data caching works with their application. In that case, you need to find a data caching library for your language of choice. For Java, there are several available, and some have been integrated into web frameworks like Spring. In particular, ehcache has good integration with Spring, so you could quickly include data caching in your application.
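To make the "big hashmap in memory" idea concrete, here is a minimal sketch of an in-process cache with a time-to-live. It is not how memcache or ehcache are implemented; the class name, the `loader` callback and the default 60-second expiry are all assumptions I made up for illustration.

```python
import time


class TTLCache:
    """A tiny read-through cache: a dict in memory with per-entry expiry."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._store = {}             # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: no trip to the database
        value = loader(key)          # slow path, e.g. a database query
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example after the underlying row is updated."""
        self._store.pop(key, None)


if __name__ == "__main__":
    def load_profile(user_id):
        # Hypothetical stand-in for a slow database read.
        print("loading", user_id, "from the database")
        return {"id": user_id}

    cache = TTLCache(ttl_seconds=30)
    cache.get("user:1", load_profile)   # miss: calls load_profile
    cache.get("user:1", load_profile)   # hit: served from memory
```

A sketch like this has no size limit and no eviction beyond expiry, which is exactly the kind of detail a real caching server or library handles for you.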
If you take the NoSQL route, many of those solutions are meant to be deployed in a distributed environment. In many cases, the software will have an agent running on each of several servers in order to store some of your data on that server. The master or orchestrator (the terminology could be different) will be configured to know which agent to talk to for the requested data. This is a gross simplification of the process, but it should give you an idea of what to expect. Distributed computing has various potential issues as well. If one of your servers crashes, or even if you have to perform some maintenance on a server, how do you continue to retrieve the data stored on that server? Is the data replicated to multiple agents in order to provide simple fail-over capabilities? Do you need to provide your own clustering solution to support the data storage? In some cases, you may even feel that the existing software does not provide a good enough solution, so you need to build your own. Distributed computing is at the center of many solutions when dealing with big data. Knowing more about how things work will give you a better idea of how to architect your solution as well as what failure points may exist.
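The orchestrator-and-agents picture can be sketched in a few lines. This is a toy model under my own assumptions: the `Agent` and `Orchestrator` classes are hypothetical, everything lives in one process, and keys are placed with a simple hash modulo the number of agents. Real datastores typically use more careful placement schemes and have to deal with re-syncing a node that comes back online, which this ignores.

```python
import hashlib


class Agent:
    """Stands in for one storage server."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True


class Orchestrator:
    """Knows which agents hold each key and fails over between replicas."""

    def __init__(self, agents, replicas=2):
        if not 1 <= replicas <= len(agents):
            raise ValueError("replicas must be between 1 and the agent count")
        self.agents = agents
        self.replicas = replicas

    def _owners(self, key):
        # Hash the key to pick a primary, then use the next agents as replicas.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.agents)
        return [self.agents[(start + i) % len(self.agents)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for agent in self._owners(key):
            if agent.online:
                agent.data[key] = value

    def get(self, key):
        # Try each replica in turn; an offline agent is simply skipped.
        for agent in self._owners(key):
            if agent.online and key in agent.data:
                return agent.data[key]
        raise KeyError(key)


if __name__ == "__main__":
    cluster = Orchestrator([Agent("a"), Agent("b"), Agent("c")], replicas=2)
    cluster.put("tweet:42", "hello")
    cluster._owners("tweet:42")[0].online = False   # simulate a crash
    print(cluster.get("tweet:42"))                  # still served by the replica
```

With two copies of every key, losing one server does not lose reads. Lose both owners of a key and the lookup fails, which is the kind of failure point worth knowing about before you pick a datastore.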
Search is a separate field entirely due to its focus on relevance and speed. Speed is critical in search because nobody wants to wait more than a minute for reasonable results. Thanks to Google, the longer a user waits, the higher the expectations will be. For example, if I wait a minute to get results from a search engine, I expect them to be highly relevant to my question. Google’s focus on speed with good enough results definitely changed how we interact with search. If your application will have significant search requirements, you need to look at your data storage to determine whether search is core to its function or whether you need an external solution. In years past, search was the domain of the RDBMS vendors, but the rise of the internet and Google has changed things. Search is no longer just about finding the structured data in your database; it now looks at anything on the internet. There are several search projects at Apache that address different levels of search. Lucene is the core search engine index software and can be considered a low-level search technology. Solr, using the Lucene libraries, provides search through web services in order to keep search as a distinct application outside of your application. Solr and Lucene are focused on keyword searches, just like most search technology. Nutch, also built on Lucene, is Apache’s answer to web crawling, so if you want to search the contents of various web pages, this is the solution for you.
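Keyword search engines are generally built around an inverted index, a map from each term to the documents that contain it. The sketch below shows the idea only. It is not Lucene's or Solr's API; the class, the tokenizer and the scoring (a plain count of matching terms) are simplifications I am assuming for illustration, and real engines use far better relevance ranking.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self):
        # term -> Counter mapping doc_id to how often the term occurs there
        self._postings = defaultdict(Counter)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self._postings[term][doc_id] += 1

    def search(self, query):
        """Return doc ids ranked by total occurrences of the query terms."""
        scores = Counter()
        for term in tokenize(query):
            for doc_id, count in self._postings.get(term, {}).items():
                scores[doc_id] += count
        # Highest score first; ties are broken by doc id for stable output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
        return [doc_id for doc_id, _ in ranked]


if __name__ == "__main__":
    index = InvertedIndex()
    index.add("post-1", "Handling big data with search")
    index.add("post-2", "Search speed matters, and search relevance too")
    print(index.search("search"))
```

Lookups only touch the entries for the query terms, never the full documents, which is a large part of why a dedicated index answers quickly.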
Probability, Statistics and Machine Learning
If you decide to do any analysis of your data, there is a host of information you may need to review. If you are planning to graph trends or even simply report on your data, then you need a basic understanding of some simple statistics. You do not need a deep understanding, but even gaining knowledge of standard deviations could prove valuable. If you decide that you want to take your trends a step further and look at expected trends or even simple prediction, probability will rear its ugly head. Just like statistics, some simple concepts in probability will go a long way for many web applications. However, there are times when statistics and probability do not give you the results or the functionality that you desire. At that point you will need to delve into the realm of machine learning. This is not an area to jump into lightly, as machine learning relies on some advanced statistics, probability and mathematics. In some cases, you may be able to treat things like a black box and implement an algorithm for simple categorization, like naive Bayes, but it may not give you the results you desire. In those cases, you may need to understand more of how these machine learning algorithms work in order to determine what the best approach may be. This may be a difficult area to understand, but you can do some amazing things with machine learning. How cool would it be to personalize your site based on the user’s past behavior without the user needing to explicitly select categories or keywords?
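To give a feel for what "simple categorization" with naive Bayes looks like, here is a small sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. The class name, the whitespace tokenizer and the tiny training set are all assumptions for illustration; a production system would more likely use an established library and far more data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()             # documents seen per label
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.label_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable label, or None if nothing was trained."""
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Work in log space so many small probabilities do not underflow.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = NaiveBayes()
    model.train("the team won the match", "sports")
    model.train("new phone with fast processor", "tech")
    print(model.classify("fast new phone"))
```

The "naive" part is the assumption that words occur independently of each other given the category. That is rarely true of real text, which is one reason the black box may not give you the results you want.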
Do I need a PhD to do all this?
Typical databases are easy to work with. You can use a GUI to create a database and some tables. You can write a query to get back information. Big data changes everything, and there are a lot of technologies that try to make things easier. Thankfully, you do not need a PhD to work with big data, because many tools and libraries have been created to make these technologies more accessible to the typical developer. Sometimes more advanced knowledge would be helpful, but in many cases you might be able to treat the technologies as a black box, just like your old RDBMS. You might also think that your case is special and nobody has done anything like it before. If you are developing a web application, I highly doubt what you are doing is really unknown. It may not be known to you, but there may be academic papers explaining things or even solutions in an unrelated field. Big data did not start with social media; it really started in finance, pharmaceuticals and health information. So, if you can’t find something specific to fit your needs, broaden your search, because the information is probably out there.