Container Revolution Enables Science Breakthroughs
For the science community, there is a lot of excitement around the idea of using deep learning to help researchers find connections hidden within big data. In fact, scientists—in just about every field of science—are drawn toward the potential that the powerful branch of machine learning holds to conduct leading-edge research outside of traditional modeling and simulation techniques.
However, there are many obstacles to scaling machine-learning workloads on high-performance computing systems traditionally built for science. Deep learning software packages are complex and require specific environments to run.
One solution to make deep learning more accessible to science comes from the IT world’s “container revolution.” Our team at the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory is using containers to bundle an operating system and software into a single file, which will make it easier for researchers to run deep learning software on our supercomputers.
Our supercomputers run on a conservatively updated but extremely stable operating system. Newer deep learning software packages, however, assume users are running an up-to-date one. This makes it difficult to install these packages on an enterprise-level system. Containers bridge the divide, allowing us to run deep learning software on top of a supercomputer’s operating system.
We’re using the Singularity container application licensed by Lawrence Berkeley National Laboratory to create customized software stacks that include newer operating systems as well as various deep learning frameworks. Users can then run their own applications inside the container, which lets them use packages that could not previously be run on our systems.
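As a rough illustration of what running an application through a container can look like, the sketch below builds a `singularity exec` command line in Python. The image name `tensorflow.img` and the script name `train.py` are hypothetical placeholders, and the use of the `--nv` GPU option is an assumption about the setup rather than a description of the OLCF configuration.

```python
import shlex


def build_singularity_command(image, script, use_gpu=True):
    """Return the argument list that runs a Python script inside a container image."""
    cmd = ["singularity", "exec"]
    if use_gpu:
        # --nv asks Singularity to expose the host's NVIDIA driver libraries
        # inside the container (assumed to be wanted on a GPU node).
        cmd.append("--nv")
    # The image supplies the operating system and frameworks;
    # the script is the user's own application.
    cmd += [image, "python", script]
    return cmd


if __name__ == "__main__":
    # Hypothetical image and script names, for illustration only.
    command = build_singularity_command("tensorflow.img", "train.py")
    print(shlex.join(command))
```

The function only assembles the command; passing the list to `subprocess.run` would launch it on a machine where Singularity and the image are actually available.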
Additionally, our staff is testing OpenShift, a platform created by the software company Red Hat. OpenShift is an open-source program built on top of the container management system Kubernetes. Within the OLCF environment, OpenShift provides users with a way to execute scientific workflows and other long-running services via containers. Containers give users the freedom to create their own application environment, while OpenShift handles system administration details so that users do not have to.
The real benefit of the container is that our users can interact with an operating system they are familiar with, rather than the enterprise host operating system.
Containers are already proving useful to our researchers. Just this February, some of our staff members successfully ran multiple containers on Summitdev, our early-access version of Summit.
Summit will provide revolutionary performance by making tectonic changes to the current Titan hybrid architecture, making it an ideal follow-on system. Our team is working with scientific users to redesign, port, and optimize application codes for Summit’s heterogeneous CPU–GPU architecture.
Containers will ultimately allow newer systems like Summitdev to run packages like TensorFlow, an open-source library for machine learning. TensorFlow is one of six software packages being evaluated by team members working on the CANcer Distributed Learning Environment (CANDLE), a project that aims to build a deep neural network for studying cancer behavior and treatment outcomes. CANDLE is funded by the Exascale Computing Project, a joint effort between the Office of Science and the National Nuclear Security Administration to develop an exascale computing system and environment. The project includes collaborators from NVIDIA, the National Cancer Institute, and other DOE laboratories.
Arvind Ramanathan, ORNL’s technical lead for CANDLE, says that scaling existing deep learning platforms on newer machines is one of the project’s main goals but that many of the platforms have dependencies, which are programs that an initial software application relies on to do its work.
We cannot expect every user to come up with all the necessary software packages and files needed to run these platforms, and we can’t track everyone and see how each person is individually compiling the software. With containers, we can create a common environment upfront so that the user can run their code efficiently on their own.
Containerizing software will eliminate the hassle of optimizing individual computing environments, allowing users to more readily run deep learning packages on new resources. The team has had success running the deep learning packages MXNet, Caffe, TensorFlow, Theano, PyTorch, and Neon on Summitdev via containers.
Deep learning and machine learning are important themes for the future of leadership computing. At the NVIDIA GPU Technology Conference in May 2017, in addition to supporting a number of machine learning science talks, we organized a panel discussion to engage the community in understanding opportunities for scaling deep learning workloads on high-performance computing systems. We also dedicated a significant portion of our annual user meeting this year to machine learning.
At OLCF, we are trying to support as many popular platforms as possible. Right now, we are trying out different combinations of software and systems to see which ones work best for our users. Now that we have scaled containers on Summitdev, users can begin testing systems themselves to give us an idea of what improvements we need to make.
Our container effort was a pilot project to evaluate these tools in the context of the OLCF’s leadership-class computing resources. Our team will continue testing and developing this technology with an eye toward making it production-ready for the future.
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. As the single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time.