The next step in computing
By Oscar Shaw and Michael Roberts
The most impactful advances in technology are not always the most visible. Although graduate researcher Devin Petersohn’s Modin project may not hold the spotlight like the latest cryptocurrency or machine learning breakthrough, it may be a necessary catalyst for their advancement.
Modin speeds up Pandas, a popular data analysis tool widely used in machine learning work, by taking advantage of the full capabilities of modern computers’ physical hardware. In some cases, Modin runs four times faster than Pandas, or even faster depending on the computer, which can save data scientists hours each day and greatly accelerate the research process.
Although such drastic improvements in computing performance were common a few decades ago, nowadays stories like Modin’s are quite rare. Moore’s law, the observation that the number of transistors on a chip, and with it a computer’s processing power, doubles roughly every two years, held for about fifty years. However, the driving force behind this trend, the shrinking of transistors, has slowed dramatically, as packing smaller transistors more densely concentrates more heat in the same area, and cooling technology has not kept up.
Unfortunately, this indicates that the processing speed of a single core may plateau. Hardware continues to advance in another way, though: computers keep gaining more cores, a progression that Modin exploits to improve performance. A core is a part of the processor that can independently handle a “thread of execution,” i.e., a train of thought, for the computer. This is similar to how a single employee in an office with multiple employees can manage a task individually while periodically reporting to the team to maintain cohesion.
However, in order to take advantage of multiple cores, programs like Modin must be written in a special, “parallel” way, breaking up tasks so that the computer can perform them side by side. This style of programming is notoriously difficult and so error-prone that in many cases it would slow the researcher down rather than speed them up. For this reason, much software intended to run on modern computers, such as Pandas, simply doesn’t take advantage of a computer’s multiple cores, instead running in a “sequential” fashion on only one core. Modin’s accomplishment is that it has rewritten many portions of Pandas’ software so that each operation a researcher asks it to perform is automatically partitioned into subtasks, which can then be run much faster in parallel on multiple cores, with no extra work from the researcher.
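The partition-and-combine idea described above can be sketched in a few lines of Python. This is an illustration only, not Modin's actual implementation: the function names are made up for the example, and a thread pool from Python's standard library stands in for Modin's execution engine. Note that in standard Python, threads will not actually speed up pure arithmetic like this; the sketch is meant to show the structure of the work, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Partition a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk):
    """The subtask: each worker handles one chunk independently."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Split the work, hand the chunks to a pool of workers, and combine the partial results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

The caller simply asks for `parallel_sum_of_squares(values)` and never sees the chunks or the workers, which is the kind of hiding the article attributes to Modin.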
Modin’s ease of use is one of its main selling points, and one that Petersohn set out to achieve from the beginning. Multi-core processing is no easy task and is notoriously the source of many software bugs. Modin’s competitors either fail to simplify this technological burden or make compromising trade-offs.
For example, Spark, an earlier technology for running this kind of data processing on multiple cores, requires data scientists to manually write code to take advantage of those cores, a task which demands considerable specific knowledge of how computers work behind the scenes, and therefore a very different skill set from building machine learning algorithms. Another alternative is Dask, which does not require machine learning engineers to write custom code, but does require them to change pre-existing code because of quirks specific to Dask. Modin, on the other hand, requires engineers to change only one line of code in their entire program.
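The one line in question is the import statement: Modin's documentation describes replacing `import pandas as pd` with `import modin.pandas as pd`. The sketch below shows what that looks like, assuming Modin and one of its execution engines are installed. The fallback to plain Pandas, the function name, and the sample data are assumptions added for illustration so that the example still runs where Modin is not available.

```python
# Assumption: Modin is installed along with one of its execution engines.
# If it is not, fall back to plain Pandas so the sketch still runs.
try:
    import modin.pandas as pd  # the one changed line
except ImportError:
    import pandas as pd  # the original line


def mean_score_by_team(records):
    """Ordinary Pandas-style code; nothing below the import has to change."""
    frame = pd.DataFrame(records)
    means = frame.groupby("team")["score"].mean()
    return {team: float(value) for team, value in means.items()}


# Hypothetical sample data, for illustration only.
SAMPLE = [
    {"team": "red", "score": 1.0},
    {"team": "red", "score": 3.0},
    {"team": "blue", "score": 5.0},
]
```

Calling `mean_score_by_team(SAMPLE)` gives the same answer with either import; the only difference is which library does the work underneath.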
Petersohn expressed that a data scientist’s role is to “extract value from data,” and that the complicated task of crafting high-quality parallel code shouldn’t be a mandatory skill for that.
Adding multi-core support is a well-known headache in the machine learning community. At the many events where Petersohn has introduced Modin, he frequently shares anecdotes of colleagues whose projects ground to a halt once the amount of data they processed exceeded a certain threshold, beyond which single-core processing became an obvious bottleneck.
Stories like these were the main motivation behind Modin’s design, which Petersohn aptly summarizes as developer-focused instead of machine-focused. “The idea that the data scientist is at the center of everything we plan to achieve is core to our principles,” he states. Petersohn continues that his goal with Modin is to optimize for “wall clock time,” the amount of time it takes the user to get what they need done, rather than “CPU time,” the amount of time the computer’s processors spend working. In other words, Modin’s purpose is to benefit the user by optimizing their workflow rather than the computer’s time, which is very cheap.
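The difference between the two kinds of time can be seen with Python's standard `time` module. The helper below is a hypothetical sketch, not part of Modin: `time.perf_counter` tracks elapsed real-world time, while `time.process_time` counts only the time the processor spends on the program, so time spent sleeping or waiting does not count.

```python
import time


def measure(task):
    """Run task() and report (result, wall-clock seconds, CPU seconds)."""
    wall_start = time.perf_counter()  # real-world elapsed time
    cpu_start = time.process_time()   # processor time used by this process
    result = task()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds
```

For a task that only waits, such as `measure(lambda: time.sleep(0.2))`, the wall-clock figure is about a fifth of a second while the CPU figure stays close to zero. A program spread across several cores can show the opposite pattern: more total CPU time spent, but less time on the clock for the person waiting.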
So far, this approach has paid off very well: a sizable community of engineers online asks questions about how to incorporate Modin into their projects and offers numerous suggestions about how to develop it further.
Technology like Modin may be the future of software engineering, as many modern programs which could run on multiple cores currently do not, due to the difficulty involved in writing error-free software in this fashion. “Automatic parallelism,” the ability to have software run on multiple cores automatically, as Modin allows, may let many end users continue to see significant speedups in the applications that they use.
The work that engineers do shapes the world around us. But given the technical nature of that work, non-engineers may not always realize the impact and reach of engineering research. In E185: The Art of STEM Communication, students learn about and practice written and verbal communication skills that can bring the world of engineering to a broader audience. They spend the semester researching projects within the College of Engineering, interviewing professors and graduate students, and ultimately writing about and presenting that work for a general audience. This piece is one of the outcomes of the Fall 2019 E185 course.