The age of Big Data is here. With global Internet traffic now measured in zettabytes per year, analyzing even a small subset of it involves Big Data. The age-old paradigm of sequential programming no longer suffices; it is essential to move to a parallel programming model to meet the needs of Big Data applications. This blog sheds light on some issues that arise when programming multithreaded Big Data applications and offers some observations that might help your parallel Big Data application development.
Consider High Performance Storage Technologies
RAID technologies have been around for years, and there are lots of newer technologies that provide even better performance. But for this discussion, we will consider what RAID 5 offers that can help to solve Big Data problems. The main thing RAID 5 provides is data striping: a file’s data is spread across several physical drives, so multiple drives can read different parts of it simultaneously instead of one drive seeking through all of it. For large reads, this can shorten retrieval time considerably.
Many RAID implementations also include a mechanism for data caching. When this is the case, not only is the data read faster because of the striping, but recently read data is also cached for even faster future access. For Big Data programming, this can be the difference between a system that performs adequately and one whose performance is so poor that it is not a viable solution. If you are starting a Big Data project, especially one that uses a parallel approach, start by analyzing the storage mechanism. Implementing something such as RAID 5 might give your software the boost that it needs.
Make Threading Configurable
Developers usually think they know best when it comes to parallelizing code: split this loop across five threads, spawn two threads to read data from disk, and create two threads to manage data analysis. But developers don’t always know the best strategy when applications run on other systems with widely varying architectures. The effects of these differences are accentuated in Big Data applications because of the significant demands such applications make.
The best way to accommodate the different environments in which an application runs is to make its threading configurable. This will almost always include some benchmarking functionality to evaluate the performance of each piece of the application on the target system. Once the benchmarks have been gathered, the configuration can be automatic or manual. The choice depends on how much faith the developers place in the installers. If the installers are skilled, they can probably configure the application manually with good results. If the installers’ skills are unknown or highly variable, an automatic configuration based on the benchmarks is the safer choice.
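As a rough illustration of benchmark-driven configuration, here is a minimal Python sketch. It times a workload at several thread counts, then picks the fastest unless an installer has supplied a manual value. The function names, the `APP_WORKERS` environment variable, and the simulated workload are all assumptions made for this example, not part of any particular product.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark_workers(task, items, candidates):
    """Time `task` over `items` for each candidate worker count.

    Returns a dict mapping worker count -> elapsed seconds.
    """
    timings = {}
    for workers in candidates:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces every task to finish before the clock stops.
            list(pool.map(task, items))
        timings[workers] = time.perf_counter() - start
    return timings


def choose_workers(timings, override=None):
    """A manual override wins; otherwise use the fastest benchmarked count."""
    if override is not None:
        return override
    return min(timings, key=timings.get)


def simulated_read(_):
    # Stand-in for a blocking disk or network read (hypothetical workload).
    time.sleep(0.01)


if __name__ == "__main__":
    # APP_WORKERS is a hypothetical setting for manual configuration.
    manual = os.environ.get("APP_WORKERS")
    timings = benchmark_workers(simulated_read, range(16), candidates=[1, 2, 4, 8])
    workers = choose_workers(timings, override=int(manual) if manual else None)
    print(f"using {workers} worker threads")
```

A real application would benchmark its own read and analysis stages separately, since the best thread count for one stage is not necessarily the best for another.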
Different Parts of Systems Have Varied Performances
Application development must also consider how the performance of different modules varies when dealing with Big Data. This is especially true of modules that access storage media, which are typically much slower than the code that consumes the data. For this reason, special care must be taken when designing the retrieval architecture of an application. One approach that you should seriously consider is to have a set of synchronized parallel threads manage data retrieval. These threads can queue and cache data ahead of time based on a system of heuristics. The modules that need the data can then retrieve it without the waits they would otherwise have experienced.
In addition to data retrieval, number crunching can introduce bottlenecks in application performance. As with data retrieval, well-planned thread usage can mitigate these situations: slow number crunching can run in parallel with other work, so the application as a whole does not suffer from obvious computational slowdowns.
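The read-ahead idea can be sketched in a few lines of Python. This is only one possible arrangement, and it assumes the simplest heuristic (chunks are needed in sequential order). The `prefetch` function, its `depth` parameter, and the caller-supplied `read_chunk` function are hypothetical names for this example.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(read_chunk, chunk_ids, depth=4):
    """Yield chunks in order while a background thread reads ahead.

    `read_chunk` is a caller-supplied function that loads one chunk;
    `depth` caps how many chunks wait in memory at once.
    """
    buffer = queue.Queue(maxsize=depth)
    errors = []

    def reader():
        try:
            for chunk_id in chunk_ids:
                buffer.put(read_chunk(chunk_id))  # blocks while the buffer is full
        except Exception as exc:
            errors.append(exc)  # hand the failure to the consumer
        finally:
            buffer.put(_DONE)  # always unblock the consumer

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=reader, daemon=True).start()

    while True:
        chunk = buffer.get()  # waits only if the reader has fallen behind
        if chunk is _DONE:
            if errors:
                raise errors[0]
            return
        yield chunk
```

A consumer simply iterates, for example `for chunk in prefetch(load, ids): analyze(chunk)`, where `load` and `analyze` stand for the application's own functions. The bounded queue is the design choice that matters: it lets the reader stay ahead of the consumer without letting it fill memory.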
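One way to overlap slow computation with other work is to hand it to a worker and collect the result only when it is needed. The sketch below uses Python's standard `concurrent.futures` module; `slow_statistic` and `process_batch` are made-up names, and the computation is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_statistic(values):
    # Stand-in for an expensive computation (hypothetical workload).
    return sum(v * v for v in values) / len(values)


def process_batch(values, other_work):
    """Run the slow computation in the background while other work proceeds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_statistic, values)  # queued for the worker thread
        side_result = other_work()                     # runs in the meantime
        return pending.result(), side_result           # wait only when needed
```

One caveat specific to this sketch: in CPython, threads do not execute pure-Python CPU-bound code in parallel, so for heavy number crunching a `ProcessPoolExecutor` (same `submit` interface) or a library that releases the interpreter lock is usually the better fit. The structure of the code stays the same.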
Optimize Memory Allocation
It is possible that a Big Data application will not scale well, regardless of the number of processors and the amount of memory that a system has. For instance, under what you think should be a heavy load, a performance monitor might report 25 percent total utilization, a far cry from what a heavy load should produce. This can sometimes result from locks within the memory allocation functions. When many threads allocate and free memory at the same time, they contend for these locks, and that contention can slow things down significantly.
There are two solutions that should be considered. The first is to use a third-party memory management library such as Hoard, which is designed to greatly reduce allocator lock contention in multithreaded programs. The second is to implement your own memory management, which will also need to avoid the lock contention issue. Both are good solutions, but writing your own memory manager will take serious development and debugging time.
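To make the idea of a contention-free allocator concrete, here is a deliberately simplified Python sketch of a per-thread buffer pool. Each thread keeps its own free list, so acquiring and releasing a buffer never touches a lock shared with other threads. This is an illustration of the general principle only: it is not how Hoard is implemented, the class and its parameters are invented for this example, and a production allocator for a native application would be written in C or C++.

```python
import threading


class ThreadLocalBufferPool:
    """Reuse fixed-size buffers per thread so hot paths avoid a shared lock."""

    def __init__(self, buffer_size, max_cached=8):
        self._buffer_size = buffer_size
        self._max_cached = max_cached
        self._local = threading.local()  # each thread sees its own attributes

    def _free_list(self):
        free = getattr(self._local, "free", None)
        if free is None:
            free = self._local.free = []
        return free

    def acquire(self):
        free = self._free_list()
        # Reuse a cached buffer if this thread has one; otherwise allocate.
        # Note: a reused buffer still holds its previous contents.
        return free.pop() if free else bytearray(self._buffer_size)

    def release(self, buf):
        free = self._free_list()
        if len(free) < self._max_cached:
            free.append(buf)  # kept for reuse by the releasing thread
        # Otherwise the buffer is dropped and reclaimed by the runtime.
```

The `max_cached` limit is the trade-off to tune: a larger value avoids more allocations but holds on to more memory per thread.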
Working with Big Data is a challenge and compels us to explore solutions within the parallel programming paradigm. This blog has suggested several approaches, including high-performance storage devices, configurable threading, attention to the differing performance of the modules in an application, and memory allocation that avoids lock contention. It is clear that continued efforts to optimize Big Data applications will involve parallelization. Since Parallel Studio has parallelization support built in, using it can help address many of the issues discussed in this blog.