You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?
Pick the 2 correct responses below.
An engineer is reimaging a DGX system in a large cluster. Which method ensures the most efficient and secure remote installation without physical access?
An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?
After Spectrum-X fabric deployment, NCCL tests show intermittent latency spikes. Which network condition most severely impacts East-West bandwidth?
What command is needed to measure BER (Bit Error Rate)?
After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?
During cluster validation, the Cable Validation Tool (CVT) reports "Underperforming (BER)" for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?
After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?
ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?
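Note on units: link speeds such as 400G are quoted in gigabits per second, while NCCL bandwidth figures are reported in gigabytes per second, so the two are only comparable after dividing by 8. The sketch below is a minimal illustration of that conversion. The function names and the count of eight links per node are assumptions made for the example and are not stated in the question.

```python
def gbps_to_gigabytes_per_s(link_gbps: float) -> float:
    """Convert a line rate in Gb/s to GB/s (8 bits per byte)."""
    return link_gbps / 8


def fraction_of_line_rate(measured_gigabytes_per_s: float,
                          link_gbps: float,
                          links: int) -> float:
    """Measured bandwidth as a fraction of the aggregate theoretical rate.

    `links` is the number of fabric ports assumed to carry the traffic;
    it is a hypothetical input, not something the question specifies.
    """
    aggregate = gbps_to_gigabytes_per_s(link_gbps) * links
    return measured_gigabytes_per_s / aggregate


per_link = gbps_to_gigabytes_per_s(400)       # one 400G port in GB/s
share = fraction_of_line_rate(350, 400, 8)    # assumes 8 ports per node
print(per_link, share)
```

Protocol overhead is ignored here, so the fraction is against the raw line rate only.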
What is the primary purpose of running an NCCL burn-in test on a new GPU cluster?
What information does the 'ibnodes' command display?
An engineer must ensure that a BlueField-3 NIC firmware download matches the cluster’s PSID. Which step is critical before installation?
To validate bisection bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?
For an NVIDIA Enterprise AI Factory with 256 GPUs, which storage solution characteristic is most critical to validate during scaling tests?
What command sequence is used to identify the exact name of the server that runs as the master SM in a multi-node fabric?
An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?
An administrator needs to verify HA functionality after configuring BCM (Base Command Manager, formerly Bright Cluster Manager). Which command confirms the active head node and failover readiness?
A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?
A leaf switch shows "FW Version Mismatch" alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?
During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?
A DGX server reports degraded performance and storage alerts. How would you use NVSM and nvidia-smi to troubleshoot both system and GPU issues?
Refer to the output:
~$ sudo nvsm show healthinfo
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
------
BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016]..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated]
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated]
What insights can a system administrator gain regarding the DGX system's health?
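Note: a report in this layout can be scanned mechanically for failed checks. The sketch below is a minimal illustration that assumes each check line ends in a run of dots followed by "Healthy" or "Unhealthy", as in the excerpt above. The function name and regular expression are hypothetical helpers written for this example, not part of NVSM.

```python
import re

# Check lines copied from the excerpt above (truncated lines omitted).
SAMPLE = """\
Verify installed DIMM memory sticks........................Healthy
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
"""

# A check name, at least three dots, then the status at end of line.
CHECK_LINE = re.compile(r"^(?P<name>.+?)\.{3,}(?P<status>Healthy|Unhealthy)\s*$")


def unhealthy_checks(report: str) -> list[str]:
    """Return the names of all checks whose status is 'Unhealthy'."""
    failed = []
    for line in report.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group("status") == "Unhealthy":
            failed.append(match.group("name").strip())
    return failed


print(unhealthy_checks(SAMPLE))
```

Lines without a trailing status, such as the BIOS revision and serial number entries, are skipped by the pattern.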
During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?
An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?
An administrator is configuring node categories in BCM for a DGX BasePOD cluster. They need to group all NVIDIA DGX H200 nodes under a dedicated category for GPU-accelerated workloads. Which approach aligns with NVIDIA's recommended BCM practices?
A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?
An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?
After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?
An engineer needs to completely remove NVIDIA GPU drivers from an Ubuntu 22.04 system to troubleshoot conflicts. Which command sequence ensures all driver components are purged?
After configuring NGC CLI with ngc config set, a user receives "Authentication failed" errors when pulling containers. What step was most likely omitted?
You are evaluating the integration of NVIDIA BlueField DPUs into your data center's storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?
During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?
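Note: the nccl-tests performance documentation relates the two figures reported by all_reduce_perf as busbw = algbw * 2(n-1)/n, where n is the number of ranks. The sketch below is a minimal illustration of that relation. The function name is a hypothetical helper, and the rank counts passed to it are example inputs.

```python
def allreduce_bus_bandwidth(alg_bw: float, ranks: int) -> float:
    """Bus bandwidth for all-reduce, per the nccl-tests relation.

    alg_bw: algorithm bandwidth (data size / time), in any unit.
    ranks:  number of participating ranks (n >= 2).
    """
    if ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    return alg_bw * 2 * (ranks - 1) / ranks


# The factor 2(n-1)/n approaches 2 as the rank count grows.
for n in (16, 64):
    print(n, allreduce_bus_bandwidth(350, n))
```

Because bus bandwidth normalizes out the rank count, it is the figure usually compared against the hardware's peak link bandwidth.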
Your company is planning to expand its AI capabilities significantly over the next five years. To future-proof your storage infrastructure, you need a solution that can scale in both capacity and performance. Which of the following strategies best ensures that your storage infrastructure remains adaptable to future AI demands?
You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?
A customer has just completed the first boot of their DGX system and is prompted to create an administrative user. What is the correct approach for setting up this user to ensure secure BMC and GRUB access?
A system engineer needs to set the vGPU scheduling behavior for all GPUs to share the scheduling equally with the default time slice length. What command should be used?