Home Blog

A Survival Guide for Machine Learning Systems Research

Posted on January 3, 2026 by Jiachen Liu

If you've ever felt lost in the rapidly evolving world of Machine Learning Systems research—constantly worried about being scooped, unsure which problems matter, or overwhelmed by the sheer pace of innovation—you're not alone. After years of working on ML systems across both academia and industry, spanning serving infrastructure, pre-training pipelines, and post-training optimization, I've distilled some hard-won insights that I believe can help you not just survive, but thrive in this field.

These aren't just ML-specific tactics. They're meta-research principles that apply broadly to any research discipline. Let's dive in.

The Art of Topic Selection: 50% of Your Success

Here's a truth that took me too long to learn: a great topic selection determines half of your project's success. Before you write a single line of code, before you run your first experiment, the problem you choose to work on has already shaped your destiny.

Through reading and writing numerous papers, I've observed that innovation in ML systems follows three distinct stages:

Stage 1: Zero-to-One Innovation (Be the First)

This is where you're doing something no one has done before. You're the pioneer, the first to eat the crab (as we say in Chinese).

Take vLLM as an example. Before large language models exploded onto the scene, there was no dedicated serving infrastructure for them. vLLM filled that void. It wasn't an incremental improvement—it was a foundational system for an entirely new paradigm.

This stage rewards vision and courage. If you can identify an emerging application that lacks proper systems support, you have a golden opportunity.

Stage 2: One-to-Two Enhancement

Once a pioneering system exists, there's immense room for improvement. Maybe the original was more of a proof-of-concept. Your job now is to make it:

  • More robust across different workloads
  • More efficient under various conditions
  • More generalizable to adjacent use cases

This is enhancement work—taking something from "it works" to "it works beautifully."

Both Stage 1 and Stage 2 are ideal territory for academic researchers. You can make meaningful contributions without needing massive infrastructure or engineering armies.

Stage 3: Industry-Scale Deployment

Here's where things get tricky for academics. At industry scale, research often focuses on squeezing out the last drops of performance—optimizing for efficiency at levels that require:

  • Thousands of GPUs
  • Large engineering teams
  • Problems that only manifest at massive scale (fault tolerance, extreme scalability)

These are challenges you often can't even reproduce in an academic setting, let alone solve. You might not discover the problem exists until you're operating at hyperscale.

My advice: Leave Stage 3 for your industry internships or full-time roles. Trying to compete here as an academic student is like bringing a knife to a gunfight—the engineering resources simply aren't in your favor.

Think Ahead: The Power of Prediction

So how do you find those magical Stage 1 or Stage 2 problems that haven't been claimed yet?

You have to think ahead.

Imagine future scenarios. What applications will become mainstream? What systems challenges will emerge?

Let me share a personal example. In 2023, while everyone was racing to optimize token latency and throughput for LLM serving, I noticed something different: quality of experience matters more in conversational AI. The real question wasn't just "how fast can we generate tokens?" but "how can we serve more users while reducing GPU costs and maintaining a great experience?"

That insight led me to a niche that wasn't crowded with competition.

Future-Gazing Exercise

Here are some thought experiments to spark your imagination:

Robotics Revolution: As robots become ubiquitous, what systems will we need to connect AI algorithms seamlessly with the physical world? How do we enable robots to complete tasks quickly, accurately, and efficiently?

Smart Glasses Era: When everyone wears intelligent glasses, we'll need real-time multimodal processing—systems that can see the world, understand context, and communicate with users through voice or text instantaneously. What infrastructure makes that possible?

Your Turn: Take any emerging technology. Assume it becomes wildly successful and widespread. What new systems challenges arise? That's where your opportunity lives.

Once You've Found Your Problem: The Deep Dive

You've identified a promising problem. Now what?

Here's a principle I hold dear: No system is universal. Every new application scenario presents fresh opportunities for optimization or complete redesign.

Your job is to study the unique characteristics of your chosen domain:

  • What's the objective function? Is it different from existing systems?
  • What are the latency thresholds? Are they stricter or more relaxed?
  • How does the model architecture differ?
  • What do user request patterns look like?

There's always something different. Find it. That difference is your leverage point for innovation.

The Beautiful Predictability of ML Systems Research

Here's what makes this field both challenging and exciting: AI leads, and ML systems follow.

Whether it's new architectures like Mamba, diffusion models, vision-language-action models for robotics, or multimodal applications for smart devices—the AI breakthroughs come first. Systems research responds.

If you develop a keen eye for technological trends, you can predict what systems the world will need before the demand becomes obvious.

I'll give you an example. In 2024, while most discussions centered on standard LLM challenges, I recognized that reasoning and reinforcement learning would become critical. This intuition came months before OpenAI released o1 and made it mainstream knowledge.

Early detection = early positioning = competitive advantage.

The Convergence Reality (And Why Execution Matters)

Here's both the comfort and the challenge of ML Systems research: people converge toward optimal solutions.

Given enough time, researchers will identify the same important problems and arrive at similar solutions. This means there's always some level of competition—being scooped is a perpetual risk during your PhD.

But here's the upside: if you have a strong prediction model (knowing which problems matter), and you combine that with excellent execution and engineering skills, you can stay ahead.

Execution is non-negotiable. The engineering depth in this field is immense. My only advice? Embrace continuous learning. Identify your gaps. Fill them relentlessly. There's no shortcut—just a steep, rewarding climb.