AI Software Engineer Intern


About This Role

  • Intel's Data Center Network Edge AI team is responsible for delivering best-in-class AI performance on Intel® architecture.
  • From hyperscale data centers powered by Intel® Xeon® processors to network edge nodes, our performance engineers shape the inner loops of frameworks and operator libraries that millions of developers and customers rely on every day.
  • We are seeking an intern to join our CPU performance engineering team and drive operator-level optimizations for modern AI workloads, including Transformer-based LLMs, multi-modal VLM / VLA models, classical CNNs, and MLPs.
  • You will design, implement, and tune high-performance CPU kernels that translate Intel architectural advantages — AVX-512, Intel® AMX, and VNNI — into measurable end-user value.

Responsibilities

  • Design and hand-tune CPU kernels for Transformer operators (Attention, GEMM, LayerNorm, RMSNorm, RoPE, MoE, Softmax) and classical operators (Conv2D / Conv3D, Depthwise Conv, Winograd, im2col, Pooling, BatchNorm, RNN / LSTM / GRU).
  • Develop SIMD-optimized implementations using Intel® AVX2 / AVX-512 / AMX / VNNI intrinsics, with ARM Neon / SVE as a secondary target where applicable.
  • Apply parallelization strategies (OpenMP, TBB, thread-pool design) and exploit CPU micro-architectural features: cache blocking and tiling, NUMA affinity, prefetching, memory alignment, and false-sharing mitigation.
  • Implement and optimize low-bit quantized kernels (INT8 / INT4 / W4A16 / W8A8) for LLM / VLM inference, leveraging Intel® AMX and VNNI for maximum throughput per watt.
  • Integrate custom operators into production frameworks and runtimes, including Intel® oneDNN, PyTorch CPU backend, ONNX Runtime, llama.cpp, MLC-LLM, and XNNPACK.
  • Conduct systematic performance analysis using Intel® VTune™ Profiler, Linux perf, and roofline modeling; identify bottlenecks and quantify optimization gains.
  • Contribute reusable kernels, optimization templates, and best-practice documentation to Intel's internal performance libraries.
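For context on the cache-blocking and tiling work described above, here is a minimal tiled-GEMM sketch in plain C. It is illustrative only: the tile size is a placeholder, and production kernels (such as those in oneDNN) would add data packing, AVX-512/AMX intrinsics, and OpenMP threading on top of this structure.

```c
#include <stddef.h>

/* Illustrative sketch: a cache-blocked (tiled) GEMM in plain C.
 * TILE is a placeholder; real kernels pick tile sizes to fit the
 * L1/L2 caches and add packing, SIMD intrinsics, and threading. */
#define TILE 64

/* C[M][N] += A[M][K] * B[K][N], all row-major. */
void gemm_tiled(const float *A, const float *B, float *C,
                size_t M, size_t N, size_t K) {
    for (size_t i0 = 0; i0 < M; i0 += TILE)
        for (size_t k0 = 0; k0 < K; k0 += TILE)
            for (size_t j0 = 0; j0 < N; j0 += TILE) {
                size_t i1 = i0 + TILE < M ? i0 + TILE : M;
                size_t k1 = k0 + TILE < K ? k0 + TILE : K;
                size_t j1 = j0 + TILE < N ? j0 + TILE : N;
                /* The hot loops touch only tile-sized working sets,
                 * keeping the A and B blocks resident in cache. */
                for (size_t i = i0; i < i1; ++i)
                    for (size_t k = k0; k < k1; ++k) {
                        float a = A[i * K + k];
                        for (size_t j = j0; j < j1; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

The i-k-j ordering keeps the innermost loop streaming contiguously over rows of B and C, which is the access pattern both auto-vectorizing compilers and hand-written SIMD kernels prefer.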

Requirements

  • The candidate must have the right to work in the country of employment without restriction.
  • Currently pursuing a BS (senior year), MS, or PhD in Computer Science, Electrical Engineering, Computer Engineering, Parallel Computing, or a related technical field.
  • Available for a minimum of 3 months of full-time or near full-time engagement.
  • Strong proficiency in C / C++ and solid understanding of computer architecture, including CPU pipelines, cache hierarchies, memory models, and SIMD execution.
  • Hands-on experience with at least one of:
      ◦ x86 SIMD intrinsics (AVX2 / AVX-512 / AMX)
      ◦ ARM Neon / SVE intrinsics
      ◦ OpenMP / TBB-based multi-threaded optimization
      ◦ High-performance CPU GEMM or convolution implementation (e.g., referencing oneDNN, OpenBLAS, XNNPACK, ggml)
  • Experience with performance profiling tools (Intel® VTune™ Profiler, perf) and the ability to translate profile data into concrete optimizations.

Preferred Qualifications

  • Open-source contributions to projects such as oneDNN, OpenVINO™ toolkit, llama.cpp, ggml, XNNPACK, OpenBLAS, PyTorch, or ONNX Runtime.
  • Familiarity with CNN inference optimizations: Winograd, im2col + GEMM, Direct Conv, NCHW / NHWC layout transforms.
  • Familiarity with LLM inference optimization techniques: KV-cache management, continuous batching, speculative decoding, and low-bit quantization.
  • Experience with compiler infrastructure (LLVM, MLIR, TVM) or auto-tuning frameworks (AutoTVM, Ansor).
  • Edge or on-device deployment experience (ARM servers, AI PCs, embedded SoCs).
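As a point of reference for the low-bit quantization topics above, a symmetric per-tensor INT8 quantizer of the kind used in W8A8 inference might look like the following. This is an illustrative sketch with a hypothetical helper name, not Intel code; production kernels typically use per-channel scales and saturating SIMD conversions (e.g., VNNI dot products).

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: symmetric per-tensor INT8 quantization.
 * Returns the scale so callers can dequantize with x ~= q * scale. */
float quantize_int8(const float *x, int8_t *q, size_t n) {
    float amax = 0.0f;  /* absolute maximum over the tensor */
    for (size_t i = 0; i < n; ++i) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (size_t i = 0; i < n; ++i) {
        float v = x[i] / scale;
        /* round to nearest, then clamp to the symmetric int8 range */
        int r = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
    return scale;
}
```

Restricting values to the symmetric range [-127, 127] (rather than -128) keeps the zero point at exactly zero, which simplifies the integer dot-product accumulation.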

Sourced directly from Intel’s career page


Job ID: /job/PRC-Shanghai/AI-Software-Engineer-Intern_JR0283186-1
