
Exxact Corporation performs benchmarking of multiple High Performance Computing (HPC) software applications in an effort to characterize how specific applications will perform on their systems, and more specifically, how they perform when those systems include multiple NVIDIA Tesla Graphics Processing Units (GPUs).

Recently, Exxact engineers have been characterizing the performance of life-science applications such as RELION, GROMACS, NAMD and Amber, all of which are molecular dynamics simulation applications that model biochemical processes for life science research. These applications run best when leveraging NVIDIA’s CUDA-enabled GPUs, and in most cases application processing is divided across multiple NVIDIA Tesla GPUs in a single system. The test system originally had a single Samsung solid state drive (SSD), and test results showed that scaling from one to two GPUs and from four to eight GPUs was nowhere near linear. Furthermore, performance with four or eight GPUs fell well short of the expected multiple of single-GPU performance; in particular, the gains achieved when scaling from four to eight GPUs were incremental at best.
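
To make “nowhere near linear” concrete, multi-GPU scaling is usually judged by comparing measured runtimes against the ideal 1/N runtime. The sketch below is a minimal illustration of that calculation, not part of Exxact’s benchmark harness, and the runtimes used are placeholder values rather than measured results.

```python
# Minimal sketch: speedup and parallel efficiency for a multi-GPU benchmark.
# The runtimes below are placeholders, NOT measured RELION/GROMACS results.
runtimes_hours = {1: 20.0, 2: 12.5, 4: 8.0, 8: 6.5}  # GPU count -> wall-clock hours

baseline = runtimes_hours[1]
for gpus, hours in sorted(runtimes_hours.items()):
    speedup = baseline / hours   # ideal speedup equals the GPU count
    efficiency = speedup / gpus  # 1.0 would be perfectly linear scaling
    print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Efficiencies that drop well below 100% as GPUs are added are the signature of a shared bottleneck, such as storage I/O, rather than a shortage of compute.
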
Exxact sells custom systems to labs and universities doing research into life sciences, real-time modeling of biological processes, deep learning, Big Data and more. Being able to sell these systems successfully requires expertise in setting up and optimizing both the hardware and the software that runs on it. Del Vecchio, a Sales Engineer, was given the job of creating a demo system that could run various molecular dynamics applications utilizing CUDA-enabled GPUs; in these systems, the CUDA software is used to run simulations of biological processes across multiple NVIDIA Tesla GPUs in a single machine.

Dedicated, specialized motherboards and PCIe bus expansion systems allow for eight or more x16 PCIe slots in a single system. Since communication between the GPUs runs over the PCIe bus rather than an external network, some of the usual challenges of clustered systems, namely the network that connects the nodes, are eliminated. RELION, GROMACS, NAMD and Amber all simulate different biological and chemical processes, and these simulations are so complex that one of the standard measurements is days per nanosecond (days/ns): how many days it takes to simulate one billionth of a second of a biological system in operation. Isolating bottlenecks is an ongoing process; once one bottleneck is found and ameliorated, a new one is usually discovered, and once that choke point is resolved, the next lowest-performing component becomes the bottleneck. Particularly with complex systems like HPC software, optimizing performance becomes a lengthy process, and often must be tailored not only to the clustered operating system but to the specific application that runs on top of it. “During the initial phase of testing it was obvious the application would not scale past four GPUs, regardless of what parameters were used at runtime,” Del Vecchio said. “This suggested there was some I/O bottleneck present that was keeping RELION from obtaining better performance with the eight-GPU configuration.”
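
The days/ns figure translates directly into wall-clock estimates for a simulation of a given length. The short calculation below is only an illustration of that arithmetic; the 0.5 days/ns rate and the 100 ns target are hypothetical numbers, not results from these benchmarks.

```python
# Illustrative arithmetic for the days-per-nanosecond (days/ns) metric.
# Both inputs are hypothetical, chosen only to demonstrate the conversion.
days_per_ns = 0.5    # wall-clock days needed per simulated nanosecond
target_ns = 100.0    # desired length of the simulated trajectory, in ns

wall_clock_days = days_per_ns * target_ns
ns_per_day = 1.0 / days_per_ns  # the inverse (ns/day) is the other common convention

print(f"Simulating {target_ns:.0f} ns at {days_per_ns} days/ns takes about {wall_clock_days:.0f} days")
print(f"Equivalent throughput: {ns_per_day:.1f} ns/day")
```

Lower days/ns (or, equivalently, higher ns/day) is better, which is why even modest scaling losses add weeks of runtime to long trajectories.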

To prove this, he added two more 500GB SSDs to the system and configured all three drives as a RAID-0 stripe set, so that data is striped across all three drives simultaneously, improving throughput by roughly three times compared to a single drive. Upon rerunning the benchmark, he was able to achieve a modest performance gain when moving from four to eight GPUs, indicating that the storage bottleneck had been resolved by the additional drives. While the application does scale up with the addition of GPU compute resources, the gains still aren’t linear; the ideal case would be for eight GPUs to deliver double the performance of four. Reconfiguring the disk I/O subsystem did enable scaling up to eight GPUs, but the performance improvement was only incremental even though the GPU compute resources were essentially doubled at each step.
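
A back-of-the-envelope check shows why striping helps: in the ideal case, RAID-0 read and write bandwidth scales with the number of member drives, so more GPUs consuming data concurrently need more striped drives to stay fed. Every figure in the sketch below is an assumed round number, not a measurement from the Exxact test system.

```python
# Back-of-the-envelope check: can the storage subsystem keep N GPUs fed?
# All numbers are illustrative assumptions, not measurements from the test system.
drive_seq_read_mb_s = 500.0   # assumed sequential read speed of one SATA SSD
drives_in_raid0 = 3           # RAID-0 stripes data across all member drives
per_gpu_demand_mb_s = 300.0   # assumed average data consumption per GPU worker

aggregate_read_mb_s = drive_seq_read_mb_s * drives_in_raid0  # ideal RAID-0 scaling

for gpus in (1, 2, 4, 8):
    demand = per_gpu_demand_mb_s * gpus
    status = "OK" if demand <= aggregate_read_mb_s else "storage-bound"
    print(f"{gpus} GPUs need ~{demand:.0f} MB/s of {aggregate_read_mb_s:.0f} MB/s available: {status}")
```

Under these assumed numbers a single drive saturates well before eight GPUs, which is the qualitative behavior the RELION benchmarks exposed.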
