Leads technical discussions on the architecture of Graphics and AI user-mode and kernel-mode drivers. Leads by example within the team by producing extensible, maintainable, and efficient code. Analyzes and fixes performance bottlenecks usi...
Deeply understand the pipeline of collecting data, training, evaluating, and serving language models and multimodal models. Have experience working side-by-side with AI researchers and engineers. Thrive in a 0->1, scrappy, innovative env...
Are passionate about the role of data in large-scale AI model training; Will thrive in a highly collaborative, fast-paced environment; Have a high degree of expertise and pay close attention to details; Demonstrate a proactive attitude and ent...
Drive projects and programs related to compute infrastructure, including forecasting and allocating resource needs such as compute, storage, and network. Collaborate with product teams, engineers, researchers, and external partners to identify gap...
Are passionate about advancing the state of post-training research; Have experience with reward modeling, RL, or other post-training techniques; Will thrive in a highly collaborative, fast-paced environment; Have a high degree of craftsmans...
Lead a team of software engineers, hiring, growing, and upskilling their talent. Review the design and code artifacts that make up large-scale distributed cloud services and solutions with a focus on high availability, scalability, robustne...
Develop and tune scalable pretraining software for Nvidia GB200 NVL72 CX8 and AMD MIxxx architectures. Benchmark GB200 and AMD MIxxx GPU clusters. Gather data and insights to develop the pretraining compute roadmap. Care deeply about co...
Works with appropriate stakeholders to determine user requirements for a set of features. Contributes to the identification of dependencies, and the development of design documents for a product area with little oversight. Creates and imple...
Model Bring-Up & Characterization: Lead the bring-up and functional validation of LLMs on custom AI accelerators and GPUs. Develop and maintain detailed performance characterizations across compute, memory, and interconnect domains. Instrume...
Workload performance evaluation for memory disaggregation and evaluation of compression technologies with Compute Express Link (CXL) memory beyond cloud databases. Research Interns put inquiry and theory into practice. Alongside fellow docto...
Develop and tune scalable pretraining software for Nvidia GB200 NVL72 CX8 and AMD MIxxx architectures. Benchmark GB200 and AMD MIxxx GPU clusters. Gather data and insights to develop the pretraining compute roadmap. Care deeply about co...
Develop novel data collection strategies. Improve dataset quality and integrity. Create high-quality datasets for training and evaluation; run experiments on new datasets (data ablations) to assess their impact and determine the most effectiv...
Design and develop large-scale distributed cloud services and solutions with a focus on high availability, scalability, robustness, and observability. Lead project development across the organization and work with subject matter experts an...
Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters. Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems. Opti...
Own and pursue a research agenda to improve model capability and performance for agentive applications. Collaborate closely with the other research and product teams, from pretraining to model hosting, to unlock new model capabilities. Build ...
Develop and tune scalable pretraining software for Nvidia GB200 NVL72 CX8 and AMD MIxxx architectures. Benchmark GB200 and AMD MIxxx GPU clusters. Gather data and insights to develop the pretraining compute roadmap. Care deeply about conve...
Leverage subject matter expertise to improve model quality for interactive and agentive experiences. Oversee data acquisition or generation efforts, ensuring that the data meets the model needs. Generalize machine learning (ML) solutions in...
We seek exceptional individuals who: Bring proven expertise, demonstrated through impactful publications or technical leadership on high-scale projects. Possess strong analytical skills, attention to detail, and a data-driven approach to de...
Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale). Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hard...
Coordinate projects and programs related to AI/ML infrastructure (e.g. pre-training, post-training pipelines, inference & model serving stacks), including end-to-end planning, timelines, milestones, performance metrics, and resource needs. ...