Joy Dong

HELLO

Greetings from Joy! Welcome to my page. My Chinese name is 董珏初 Juechu (pronounced ge ü e, chew). If you find it hard to pronounce my name in Mandarin, I’m totally fine with Joy.😊

Bio:
Juechu (Joy) Dong is a PhD candidate in the University of Michigan CSE department, advised by Prof. Satish Narayanasamy. She studies emerging technologies in computer architecture and systems, with a focus on GPU programming models and confidential computing. Her work seeks to democratize kernel customization by building flexible and adaptive infrastructure for mapping novel algorithms to GPU hardware. Joy received the 2024 Rackham International Student Fellowship and the MLCommons ML and Systems Rising Star Award 2025.

News

[Mar 2025]

Excited to be selected to receive the MLCommons ML and Systems Rising Star Award!!

[Mar 2025]

FlexAttention is accepted to MLSys ‘25. See you in Santa Clara this summer~

[Oct 2024]

Our work mm2-gb is accepted to ACM BCB ‘24, the flagship conference of ACM SIGBio. Join us in Shenzhen, China to see how we accelerate minimap2 on GPUs!

[Aug 2024]

Our work FlexAttention is launched. See our PyTorch Blog and 180k-view X post. Stay tuned for FlexAttention Part II - decoding and paged attention.

[Jun 2024]

Our work Toleo is accepted to ASPLOS ‘24. Its presentation is delayed to ASPLOS ‘25. See you in Rotterdam~

Archived news ...

[May 2024]

I joined the Meta PyTorch Compiler team this summer as a research scientist intern. See you in Menlo Park~

[Mar 2024]

Our work mm2-gb for long-sequence DNA mapping is accepted to BioSys ‘24. Check out our open-source demo. Many thanks to the AMD HPC team! See you in San Diego~

[Jan 2024]

I passed my PhD qualification exam and became a PhD candidate.

[Dec 2023]

I received the Rackham International Student Fellowship for the 2023-2024 academic year.

Selected Honors

2025
Rackham Doctorate Internship Fellowship
2025
MLCommons ML and Systems Rising Stars

These promising researchers, drawn from over 170 applicants, have demonstrated excellence in Machine Learning (ML) and Systems research and stand out for their current and future contributions and potential. news
2024
Rackham International Student Fellowship

The award recognizes her academic excellence and will support her ongoing research in CSE. news

Experience

Education

2022 Sept - exp. 2027
University of Michigan
Ph.D. in Computer Science and Engineering | Computer Architecture & Systems
2022 Apr
University of Michigan
B.S.E. in Computer Engineering

Summa Cum Laude
2022 Aug
Shanghai Jiao Tong University
B.S.E. in Electrical & Computer Engineering

Industry

2024, 2025
Meta
Research Scientist Intern | PyTorch Team

Build Helion, a domain-specific language for authoring machine learning kernels.
Work with the Helion compiler and design in-kernel communication APIs for Helion.
Build a flexible and efficient attention programming model: FlexAttention.
Work with TorchInductor; conduct performance analysis and optimization of attention kernels.
2022 May - 2022 Aug
NVIDIA
Deep Learning Compute Architect Intern | GPU Architecture

Model and analyze new memory features, such as distributed shared memory and TMA, on next-gen GPUs.
Specialize in: GPU architecture, memory hierarchy & multi-device communication

Publications

MLSys ‘25
FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
Joy Dong *, Boyuan Feng *, Driss Guessous *, Yanbo Liang *, Horace He
*authors contributed equally to this work.
[poster] [arxiv] [blog] [github] [citeme]

FlexAttention is a novel compiler-driven programming model for implementing attention variants flexibly and efficiently.
🌟Flexible: Allows users to implement the majority of attention variants in a few lines of idiomatic PyTorch code.
🌟Fast & Efficient: Achieves performance comparable to expert-tuned kernels via JIT compilation with torch.compile.
🌟Block Sparsity: Leverages block sparsity to further improve performance without manual optimization for a specific mask.
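
As a rough illustration of the block-sparsity idea, here is a minimal plain-Python sketch. It is not the FlexAttention API or its implementation: `causal_mask`, `masked_softmax`, and the block size are hypothetical names and values chosen for this example. A mask rule is written as a small function of query and key indices, and any block of the score matrix in which the rule allows nothing is skipped outright.

```python
import math


def causal_mask(q_idx, kv_idx):
    """Hypothetical mask rule: a query attends to itself and earlier positions."""
    return q_idx >= kv_idx


def masked_softmax(scores, mask_fn, block_size):
    """Row-wise softmax over allowed positions, skipping fully masked blocks.

    Returns (weights, skipped_blocks). Illustration only, not FlexAttention.
    """
    n_q, n_kv = len(scores), len(scores[0])

    # Pass 1: record which blocks contain at least one allowed (q, kv) pair.
    active = set()
    skipped_blocks = 0
    for qb in range(0, n_q, block_size):
        for kb in range(0, n_kv, block_size):
            has_allowed = any(
                mask_fn(q, k)
                for q in range(qb, min(qb + block_size, n_q))
                for k in range(kb, min(kb + block_size, n_kv))
            )
            if has_allowed:
                active.add((qb, kb))
            else:
                skipped_blocks += 1

    # Pass 2: softmax each query row over allowed keys in active blocks only.
    weights = [[0.0] * n_kv for _ in range(n_q)]
    for q in range(n_q):
        qb = q - q % block_size
        allowed = [
            k
            for kb in range(0, n_kv, block_size)
            if (qb, kb) in active
            for k in range(kb, min(kb + block_size, n_kv))
            if mask_fn(q, k)
        ]
        if not allowed:
            continue  # fully masked row keeps all-zero weights
        row_max = max(scores[q][k] for k in allowed)
        exps = {k: math.exp(scores[q][k] - row_max) for k in allowed}
        total = sum(exps.values())
        for k in allowed:
            weights[q][k] = exps[k] / total
    return weights, skipped_blocks


weights, skipped = masked_softmax([[0.0] * 4 for _ in range(4)], causal_mask, 2)
print(skipped)      # 1: the upper-right 2x2 block is fully masked
print(weights[1])   # [0.5, 0.5, 0.0, 0.0]
```

With a causal rule, roughly half of the blocks are empty for long sequences, which is where skipping whole blocks pays off. The sketch only shows the bookkeeping; it makes no claim about how the real kernels are generated.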
ASPLOS ‘24
Toleo: Scaling Freshness to Tera-scale Memory Using CXL and PIM
Juechu Dong, Jonah Rosenblum, Satish Narayanasamy
[paper] [github] [poster] [citeme]

🌟Scale trusted memory size from hundreds of MB to tens of TB by expanding the span of trust from a single trusted processor to an entire platform, including intelligent memories.
🌟Design a new scheme of freshness protection that reduces the space requirement by 50x.
🌟Reduce deployment cost by space-sharing one intelligent memory device among multiple CPUs.
Nature Computer Science – under submission
SECRET-GWAS: Confidential Computing for Population-Scale GWAS
Jonah Rosenblum, Juechu Dong, Satish Narayanasamy
[preprint] [code] [citeme]

Develop a thousand-core platform on Azure Confidential Computing to conduct multi-institutional GWAS on millions of patients in less than a minute.
Adapt the Spark-based Hail genomic analysis framework to run in a TEE under an obliviousness requirement.
Parallelize GWAS computation on 1k cores to achieve near-linear speedup.
ACM BCB ‘24
mm2-gb: GPU Accelerated Minimap2 for Long Read DNA Mapping
Juechu Dong *, Xueshen Liu *, Harisankar Sadasivan, Sriranjani Sitaraman, Satish Narayanasamy
*both authors contributed equally to this work.
[paper] [github] [slides] [blog] [citeme]

Performance Boost: Accelerate the bottleneck step (chaining) of the state-of-the-art long-sequence mapping tool minimap2 by 2.57x-5.33x on GPU.
Scales well: Optimized for ultra-long reads of 50kb+ to keep pace with genome sequencing technology trends.
Open-sourced, with active maintenance and optimization! Community contributions are welcome~

Skills

Programming Languages

C/C++, CUDA, (System)Verilog, HIP, bash, Makefile
Technologies/Frameworks

GPU Tuning: nsight-compute/nsight-sys, omniperf/omnitrace/rocprof
Simulators: SniperSim, DRAMSim, pinplay
Confidential Computing: Open Enclave SDK, Intel SGX
Programming Languages & Compilers: torch.compile, Helion, ThunderKittens
Architectures

AMD CDNA2/3 GPU, NVIDIA Hopper GPU, Trusted memory systems

Fun Facts

Bookstores, free markets and cafés are my must-visits while traveling. My recent favorite is Campfire Coffee in Negaunee, MI, a tiny town near Marquette in the Upper Peninsula. Nice place to visit in fall.

Shanghai has only two seasons, winter and summer, and they switch randomly. It is otherwise a wonderful city to live in.

Meet my cat Cuda 😼

Services

Teaching

Graduate Student Instructor: EECS570 Parallel Computer Architecture

Graduate Student Instructor: EECS471 CUDA Programming

Instructional Aide: EECS470 Computer Architecture
