Opens intel.wd1.myworkdayjobs.com in a new tab
About This Role
- In this position, you will work with a system reliability research team focusing on RAS (Reliability, Availability, Serviceability) and silent data error (SDE) characterization and mitigation on AI and general-purpose compute platforms, including heterogeneous systems (CPU + GPU/accelerators) and large-scale server clusters.
- You will help design and run experiments under representative AI training/inference and cloud workloads, analyze fleet-scale logs/telemetry, and prototype detection/diagnosis methods to improve end-to-end data integrity and platform robustness across the HW/FW/OS/runtime stack.
What You'll Do
- Your responsibilities will include, but are not limited to:
  - Collect, clean, and analyze platform telemetry and error logs from CPU servers and accelerator-enabled nodes (e.g., memory/DDR/HBM, storage, interconnect, PCIe/CXL, fabrics) to identify error signatures and failure patterns.
  - Design and execute fault injection, stress tests, or workload-driven experiments to reproduce silent data corruption scenarios for AI training/inference and general compute workloads, and validate hypotheses.
  - Research and analyze in-field scan and lockstep mode features (coverage, limitations, trigger conditions, and impact on AI/CPU workloads), and help evaluate how they can be leveraged to improve silent error detection and data integrity in production.
  - Research and analyze Silicon Lifecycle Management (SLM) solutions, and integrate them with platform telemetry to enable in-field health monitoring, degradation/trend analysis, and proactive reliability improvements for AI/CPU platforms.
  - Develop scripts/tools (Python preferred) to automate data processing, experiment orchestration, and report generation; build dashboards or repeatable pipelines when needed.
  - Study and evaluate mitigation techniques for AI + CPU platforms (e.g., ECC/CRC/EDAC, scrubbing policies, retry/recovery, checkpoint/restart, end-to-end checks at data/communication boundaries) and quantify effectiveness vs. performance/cost impact.
  - Collaborate with cross-functional teams (HW, FW, OS, driver/runtime, datacenter operations) to trace error propagation paths and drive actionable improvements; document findings and present progress regularly.
Qualifications
- Preference will be given to candidates who are interested in system reliability / data integrity research on AI and general-purpose compute platforms.
- Qualifications include, but are not limited to:
  - PhD students (CS/CE/EE/Math/Statistics or related majors).
  - Solid programming skills in Python; experience with Linux and basic scripting; familiarity with GitHub Copilot is a plus.
  - Strong data analysis skills; experience with pandas/numpy/matplotlib, SQL, or log analytics is a plus.
  - Basic understanding of computer architecture and systems (memory hierarchy, storage, networking) is preferred; familiarity with RAS concepts (ECC, CRC, parity, scrubbing, checkpoints) is a plus.
  - Understanding of the AI system stack is a plus: GPU/accelerators, driver/runtime, distributed training/inference, communication collectives, data pipelines, and performance/reliability trade-offs.
  - Good verbal and written communication skills in both Mandarin and English are required.
  - Research mindset: ability to form hypotheses, design experiments, and write clear technical reports.
Sourced directly from Intel’s career page
Your application goes straight to Intel.
Specialisation
Open roles at Intel
762 positions
Job ID
/job/PRC-Shanghai/Cloud-and-AI-System-Intern_JR0283377
Similar Other roles
Samsung Semiconductor
Staff Engineer, AI System Architect (Hardware)
San Jose, California, United States|Other
Samsung Semiconductor
Sr. Director, Sales
Washington|Other
Samsung Semiconductor
Software Engineer, Trace-Driven Simulator Development
San Jose, California, United States|Other
Samsung Semiconductor
Principal Engineer, CPU Architecture & Performance Research
San Jose, California, United States|Other