[Remote] DGX Cloud Performance Engineer - New Grad 2026
Note: This is a remote position open to candidates in the USA. NVIDIA is a leading technology company known for its innovative AI solutions. NVIDIA is seeking a DGX Cloud Performance Engineer to focus on performance analysis, optimization, and modeling for NVIDIA's DGX Cloud clusters, collaborating with cross-functional teams to improve system performance and usability.
Responsibilities
- Develop benchmarks and end-to-end customer applications that run at scale and are instrumented for performance measurement, tracking, and sampling, in order to measure and optimize the performance of important applications and services
- Design careful experiments to analyze and study performance bottlenecks and dependencies from an end-to-end perspective, and develop critical insights from the results
- Develop ideas for improving end-to-end system performance and usability by driving changes in the HW, the SW, or both
- Collaborate with AI researchers, developers, and application service providers to understand internal developer and external customer pain points and requirements, project future needs, and share best practices
- Develop the modeling framework and TCO (total cost of ownership) analysis needed to enable efficient exploration and sweeps of the architecture and design space
- Develop the methodology needed to drive the engineering analysis that informs the architecture, design, and roadmap of DGX Cloud
Skills
- Currently pursuing a Master's or PhD in Engineering, or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)
- Experience working with large-scale parallel and distributed accelerator-based systems
- Experience optimizing the performance of AI workloads on large-scale systems
- Background in performance modeling and benchmarking at scale
- Background in computer architecture, networking, storage systems, and accelerators
- Familiarity with popular AI frameworks such as PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, and vLLM
- Experience with AI/ML models and workloads, particularly LLMs, as well as an understanding of DNNs and their use in emerging AI/ML applications and services
- Proficiency in Python and C/C++
- Expertise with the infrastructure of at least one public cloud service provider (CSP) (GCP, AWS, Azure, OCI, …)
- PhD in a relevant area
- Very high intellectual curiosity; confidence to dig in as needed; not afraid of confronting complexity; able to pick up new areas quickly
- Proficiency in CUDA, XLA
- Excellent interpersonal skills
Benefits
- Equity
- Benefits