, AI infrastructure, building cluster-scale automation for distributed training and inference workloads, MLOps. You will be a member... for distributed training and inference workloads with AMD's ROCm software. Build cluster-scale automation for distributed training...
system optimized for the Kubernetes platform, along with the supporting cluster management system. Contribute to kernel..., or Python. Deep expertise in orchestrating containerized applications and building scalable cluster management systems...
across service orchestration, job scheduling, cluster management, big-data processing, and other core services that business teams...
benchmarking studies, ISO NE transitional cluster studies, load interconnection studies replicating ISO/Utility practices...
using container-native Hadoop services to work in a Kubernetes cluster. As a Sr. Staff Software Engineer...
, and cluster deployments and lead discussions about network topologies, compute, management, telemetry, and storage fabrics...
. Experience running workloads on large-scale heterogeneous clusters is a plus. Experience optimizing GPU kernels...
. Spring Boot, Redis, MongoDB, Kafka, and microservices architecture. 3. AWS deployments, scaling, and EKS cluster management 4...
, and negotiate staffing plans. Manage the end-to-end cluster development cycle, including forecasting, sourcing, procurement...
personnel to create costed bills of material (BOMs) for rack- and cluster-level solutions. Partner with business development...
stacks, and cluster environments. This role requires good understanding and experience in ROCm, CUDA, GPU architecture, ML...
. Lead and manage interconnection applications and queue positions during the cluster study phases in NYISO/PJM/SERC...
of running AI/HPC workloads at the single-node and cluster level, and develop test suites and performance automation. Lead the debug...
, and automated provisioning. Strong experience in Kafka cluster management, topic configuration, performance tuning, and ensuring...
consistency. Proficiency in monitoring cluster health and resource utilization. Ability to troubleshoot complex database...
. Expertise in Databricks components such as Delta Lake, Notebooks, Pipelines, cluster management, and cloud integration (Azure...
networking. Experience with PCIe, CXL, NVMe interconnects and cluster schedulers (Kubernetes, Slurm). Proven ability...
. Expertise in Databricks components such as Delta Lake, Notebooks, Pipelines, cluster management, and cloud integration (Azure...
. Experience running workloads, especially AI models, on large-scale heterogeneous clusters. Familiarity with clusters...
, EC2, RDS, S3, CloudWatch, IAM) and Kubernetes including multi-cluster management * Strong programming skills (Python...