New programming language delivers fourfold speedups on Big Data problems

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new programming language, called Milk, that lets application developers manage memory more efficiently in programs that deal with scattered data points in large data sets.
In tests on several common algorithms, programs written in the new language were four times as fast as those written in existing languages. But the researchers believe that further work will yield even larger gains.

The reason that today’s big data sets pose problems for existing memory management techniques, explains Saman Amarasinghe, a professor of electrical engineering and computer science, is not so much that they are large as that they are what computer scientists call “sparse.” That is, with big data, the scale of the solution does not necessarily increase proportionally with the scale of the problem.

“In social settings, we used to look at smaller problems,” Amarasinghe says. “If you look at the people in this [CSAIL] building, we’re all connected. But if you look at the planet scale, I don’t scale my number of friends. The planet has billions of people, but I still have only hundreds of friends. Suddenly you have a very sparse problem.”

Similarly, Amarasinghe says, an online bookseller with, say, 1,000 customers might like to provide its visitors with a list of its 20 most popular books. It doesn’t follow, however, that an online bookseller with a million customers would want to provide its visitors with a list of its 20,000 most popular books.

Batch processing

Milk simply adds a few commands to OpenMP, an extension of languages such as C and Fortran that makes it easier to write code for multicore processors. With Milk, a programmer inserts a couple additional lines of code around any instruction that iterates through a large data collection looking for a comparatively small number of items. Milk’s compiler — the program that converts high-level code into low-level instructions — then figures out how to manage memory accordingly.

With a Milk program, when a core discovers that it needs a piece of data, it doesn’t request it — and a cacheful of adjacent data — from main memory. Instead, it adds the data item’s address to a list of locally stored addresses. When the list is long enough, all the chip’s cores pool their lists, group together those addresses that are near each other, and redistribute them to the cores. That way, each core requests only data items that it knows it needs and that can be retrieved efficiently.

MIT also has a more efficient language for simulations

Simit is a language that can speed up computer simulations 200-fold or reduce the code they require by 90 percent.

In experiments, simulations written in the language were dozens or even hundreds of times as fast as those written in existing simulation languages. But they required only one-tenth as much code as meticulously hand-optimized simulations that could achieve similar execution speeds.