
EFFICIENCY WHEN TRAINING DEEP LEARNING MODELS: IT'S NOT JUST ABOUT GPUs

EXPERT ADVICE ON DEFINING AN OPTIMAL COMPUTING PLATFORM FOR AI DEVELOPMENT

www.boxx.com

THE AI RENAISSANCE AND COMPUTING CHALLENGES

AI has recently become a field in which research is proceeding at a feverish pace. Organizations with access to valuable data in industries such as healthcare and insurance are racing to derive new intelligence and greater efficiencies by applying the most recent Deep Learning (DL) methods. Leading providers of AI technology are exploring the most effective way to develop DL models to cut the costs of tasks that previously could only be performed by people, or to identify important patterns in data that are too subtle to detect by conventional methods. At the heart of it all is the promise that AI represents a new way of developing software, and that it can solve intractable challenges in programming and algorithm development.

Many data scientists have come to understand that the computing platform they use to train DL models is of critical importance. The right computing power means the difference between a high-accuracy model that is first to market and one that remains an exploratory project. Many factors influence the efficiency of a system designed to support DL training. This short paper outlines some critical points to consider when planning the right computing platform for DL.

THE IMPACT OF STORAGE: GPU POWER VS. STORAGE PERFORMANCE

Currently, GPUs are the most effective way to accelerate the development of DL models. Other types of silicon already offer alternatives, but GPUs are the most widely used because of their broad availability and because of a robust software ecosystem (including NVIDIA's API for DL, cuDNN) that makes it much easier to harness their specialized computing capabilities.

But GPUs represent only one of several subsystems that influence the ultimate performance of a computing platform dedicated to training DL models. Several aspects of a technology stack deployed for DL training put the emphasis on subsystems other than the GPUs. Because of the array of choices available to assemble a set of tools for DL, there tends to be no one-size-fits-all computing platform for AI development. Moreover, a mismatch between subsystems can easily result in poor training efficiency, with lower than expected output in terms of images per second. This ultimately translates into a longer time-to-model.

Here are some suggestions from BOXX and Cirrascale Cloud Services, experts in AI computing infrastructure, for planning a DL development platform:

• Thoroughly understand the characteristics of your chosen DL framework: Actual model training is just one part of the process of developing AI. Pre-processing is a crucial step that puts data in a form a GPU can easily address. Depending on the type of data, pre-processing can involve some or all of these steps: reading data in its original format, loading it in portions that RAM and VRAM can accept, transforming it to make it uniform and correctly sized, and grouping it into mini-batches for efficiency. Some frameworks perform many of these pre-processing steps themselves. These steps are strongly dependent on CPU performance, so despite the heavy focus on GPUs, it is important not to neglect the choice of the correct CPU(s). The sketch below illustrates such a pipeline.
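As a concrete illustration of these pre-processing steps, here is a minimal sketch using PyTorch and torchvision, one possible framework choice; the whitepaper does not prescribe a framework, and the dataset path, image sizes, and worker count below are hypothetical:

```python
# Minimal DL pre-processing pipeline sketch (assumes PyTorch and
# torchvision are installed; the path and sizes are hypothetical).
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Transformations that make every sample uniform and correctly sized.
preprocess = transforms.Compose([
    transforms.Resize(256),       # normalize dimensions
    transforms.CenterCrop(224),   # uniform, correctly sized input
    transforms.ToTensor(),        # a form RAM/VRAM can accept
])

# Reads data in its original format (e.g., JPEG files on local storage).
train_set = datasets.ImageFolder("/data/train", transform=preprocess)

# Groups samples into mini-batches; num_workers sets how many CPU
# processes decode and transform data while the GPU trains, which is
# why the CPU choice matters.
loader = DataLoader(train_set, batch_size=64, shuffle=True,
                    num_workers=8, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # feed the GPU
    # ... forward/backward pass would go here ...
    break
```

If GPU utilization sags while the CPU worker processes are saturated, the bottleneck is this pipeline rather than the GPU, which is exactly why the choice of CPU(s) matters.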

• Dual-CPU systems pose special challenges: Using two CPUs rather than just one sounds like a reasonable way to achieve higher performance, but a two-CPU system depends on the speed of the interconnect between the two CPU sockets. Bottlenecks can appear that prevent full use of the GPUs the two CPUs are feeding data to. Systems that are "single root" avoid this pitfall by connecting all GPUs to a single CPU with the right number of PCIe (PCI Express) lanes, avoiding any reliance on the CPU-to-CPU connection.

• Right-sizing CPUs: While the CPU plays an important role in pre-processing, it is important to understand that there is a limit to how much the CPU can contribute to training efficiency. Allocating more budget to higher-end CPUs may have a limited impact on overall performance and will adversely affect the platform's overall return on investment. Most middle-of-the-road CPUs will be adequate to handle moving the training data, except in very demanding cases.

• Local storage is a major source of bottlenecks in deep learning applications: Local storage must be able to keep up with powerful, modern GPUs. A sub-optimally configured data storage subsystem results in a low number of images processed per second as the very powerful GPU sits idle, waiting for storage to provide the data for the next round of matrix computations.

• Matching storage and GPUs, or choosing NVMe over SATA: Contrary to a common misperception, NVMe is not a type of drive but a very fast storage interface. The difference in throughput between the two technologies is very large: a SATA drive will achieve 6 gigabits per second, while a drive with NVMe will reach 6 gigabytes per second, almost an order of magnitude more (see the back-of-the-envelope calculation below). Flash drives running on NVMe look a lot like memory and can keep up with the high throughput of GPUs. They do have a reputation for being expensive, an issue that can be addressed with careful selection.

• Choosing the right NVMe drives: Flash drives do not have to break the bank. Most flash storage devices using NVMe are designed for enterprise applications such as server or desktop virtualization, applications that put equal focus on 'read' and 'write' performance to support intense I/O; this dual capability is what tends to make NVMe flash drives more expensive. Training a DL model certainly involves reading from storage many times, as training data is loaded repeatedly for each training "epoch", but the frequency of data writes is lower. Because 'write' performance is much less important than 'read' performance when it comes to AI development, it is not necessary to invest in expensive enterprise-class flash drives.
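To make the SATA-versus-NVMe gap concrete, here is a back-of-the-envelope calculation of pure read time per epoch at each interface's peak rate. The 500 GB dataset size is a hypothetical example rather than a figure from the whitepaper, and real sustained throughput will fall below these peak rates:

```python
# Rough per-epoch read-time estimate at peak interface throughput.
# The dataset size is a hypothetical example.

DATASET_GB = 500.0       # training data read once per epoch

SATA_GBPS = 6.0 / 8.0    # 6 gigabits/s = 0.75 gigabytes/s
NVME_GBPS = 6.0          # ~6 gigabytes/s

for name, gbps in [("SATA", SATA_GBPS), ("NVMe", NVME_GBPS)]:
    minutes = DATASET_GB / gbps / 60
    print(f"{name}: {minutes:.1f} minutes of pure reading per epoch")

# SATA: 11.1 minutes of pure reading per epoch
# NVMe: 1.4 minutes of pure reading per epoch
```

Multiplied over the hundreds of epochs a training run can require, the slower interface adds hours during which expensive GPUs may sit idle.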

• Multi-GPU systems have special requirements: It is now firmly established that adding GPUs is an effective way of accelerating DL model development. Training performance typically scales quasi-linearly with the addition of more GPUs (on the order of 95%+). But with a comparatively slow storage subsystem, scaling efficiency can drop dramatically, negating the investment in expensive GPU cards. Adding more GPU cards without upgrading the storage interface is a sure way to create data bottlenecks, so it is important to add NVMe cards as appropriate to maintain balance.

• Number of GPU iterations vs. storage bandwidth: Different types of DL projects have different requirements. Some projects seek to develop algorithms for a very large decision space but may use only a limited amount of data; DeepMind's AlphaGo, characterized by many neural network layers but relatively low-density data sets for training, comes to mind. In such cases, the performance of the storage subsystem is not as critical. On the other hand, models that focus on processing high-resolution video will read a great amount of data into RAM and into GPU memory, and are much more likely to encounter bottlenecks created by inadequate storage performance.

• More storage bandwidth will solve throughput-sensitive use cases, but it will not solve a latency problem: Of the two examples above, the first (AlphaGo) is latency-sensitive while the second (high-resolution video) is throughput-sensitive.

• Aligning data type and storage subsystem performance: The type of data you work with will determine the throughput that the system must be able to handle in order to avoid creating bottlenecks. It is important to scale the storage bandwidth in proportion with the volume of data that needs to be fed to the GPU(s); the sketch at the end of this section illustrates the trade-off.

• Understand the inherent latency of your platform: It is possible to add more PCIe devices to increase throughput to support more GPU cards, but latency is a hard-coded characteristic of the PCIe fabric itself. Although different devices using the PCIe interface can receive priority on the PCIe bus (as GPUs often do), the latency cannot be changed.

DATA RICHNESS AND DL TRAINING PERFORMANCE

The type of data used for training the DL model represents a key factor in defining the right computing platform. The richness of the training data will determine in great part whether adding GPUs will scale with existing storage bandwidth or whether more bandwidth will be required. A mismatched storage subsystem can leave the GPU (or multiple GPUs) idle, which equates to a severe loss of training efficiency.

[Figure: Training data types by density and total volume of information passed to GPUs. Finding the right grade of platform for different DL situations is the key to maximizing model training efficiency.]
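The balance between GPU count and storage bandwidth can be sketched with simple arithmetic. In the model below, the per-GPU throughput, sample size, scaling efficiency, and storage bandwidth are hypothetical placeholders rather than benchmark figures; the point is that effective training throughput is capped by whichever is smaller, the GPUs' aggregate demand or what storage can supply:

```python
# Back-of-the-envelope model of multi-GPU throughput vs. storage
# bandwidth. All numbers are hypothetical placeholders.

GPU_IMAGES_PER_SEC = 1000      # one GPU's training throughput
SCALING_EFFICIENCY = 0.95      # quasi-linear multi-GPU scaling (~95%)
BYTES_PER_IMAGE = 600_000      # ~0.6 MB per pre-processed sample
STORAGE_BYTES_PER_SEC = 2e9    # 2 GB/s sustained read bandwidth

# What storage can deliver, expressed in images per second.
storage_supply = STORAGE_BYTES_PER_SEC / BYTES_PER_IMAGE

for n_gpus in (1, 2, 4, 8):
    # What the GPUs could consume if storage were infinitely fast.
    gpu_demand = n_gpus * GPU_IMAGES_PER_SEC * SCALING_EFFICIENCY
    effective = min(gpu_demand, storage_supply)
    bound = "storage-bound" if storage_supply < gpu_demand else "GPU-bound"
    print(f"{n_gpus} GPU(s): {effective:,.0f} images/s ({bound})")
```

In this hypothetical configuration the run becomes storage-bound at around four GPUs, and additional cards then add nothing; the remedy, as noted above, is to scale storage bandwidth (for example, with additional NVMe devices) in proportion to the data volume.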

WRAPPING IT UP...

AI is experiencing an unexpected renaissance. At this stage of its development, the focus is on training accurate models as fast as possible based on specific data sets, often categorized as "truth data". Even at this predominantly R&D-focused stage, the complexity of the matrix operations performed by the GPU and the number of iterations put special focus on the performance of the computing platform. GPUs are important, but data scientists who wish to achieve the fastest possible time-to-model need to understand all drivers of higher efficiency in their AI development process.

For the most part, it is the richness of the training data that will determine the optimal configuration of the computing system. The data storage subsystem can be the source of serious bottlenecks, especially when working with medium- to high-density data. Data scientists should take care to choose the combination of a deep learning framework and a computing platform that matches the data type they choose to work with.

For more insight into the computing needs of AI development, contact BOXX Technologies or Cirrascale Cloud Services at 877-877-2699 or (512) 835-0400 (outside U.S.A.)

www.boxx.com

