Profiling Mysteries: Debugging Memory Leaks, Deadlocks, and GC Pauses in Production

Performance bugs are among the hardest problems in software engineering. Unlike functional bugs, they often appear under load, in production, and vanish under test conditions. Symptoms range from slow response times to unexplained crashes, but the root causes often hide deep in memory allocation patterns, thread contention, or garbage collector behavior.
This article explores profiling mysteries through real-world case studies of memory leaks, deadlocks, and GC pauses. Each case shows how the problem manifested, how it was diagnosed, and how it was ultimately solved.
Case Study 1: The Slow Memory Leak That Took Weeks to Show
The Symptoms
A Java-based microservice ran smoothly in staging, but in production its memory usage steadily increased. Every 2–3 weeks, the service crashed with an OutOfMemoryError.
Grafana monitoring showed a classic “stair-step” memory usage pattern.
The Investigation
Tools used:
- JVisualVM for heap dump analysis.
- Eclipse MAT (Memory Analyzer Tool) for object reference chains.
- Prometheus JVM metrics for heap occupancy trends.
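Heap dumps can be captured with jmap against the running process, or triggered from inside the JVM via the HotSpot diagnostic MXBean. A minimal sketch of the in-process approach, assuming a HotSpot-based JVM (the HeapDumper class name and the output path are illustrative):
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String path) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // "true" restricts the dump to live objects, keeping the .hprof file smaller
        diagnostics.dumpHeap(path, true);
    }

    public static void main(String[] args) throws IOException {
        dump("leak-investigation.hprof");
    }
}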
Heap dumps revealed millions of HashMap entries in a cache class:
public class Cache {
    private static final Map<String, Object> cache = new HashMap<>();

    public static void put(String key, Object value) {
        cache.put(key, value);
    }

    public static Object get(String key) {
        return cache.get(key);
    }
}
The cache had no eviction policy. Keys accumulated forever.
The Fix
Introduced a Caffeine cache with a size cap and time-based expiration:
Cache<String, Object> cache = Caffeine.newBuilder()
    .expireAfterWrite(10, TimeUnit.MINUTES)
    .maximumSize(10_000)
    .build();
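For context, a sketch of what the original static Cache wrapper could look like after the change; the class name is illustrative and the API mirrors the put/get methods shown earlier:
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;

public class CaffeineBackedCache {
    private static final Cache<String, Object> CACHE = Caffeine.newBuilder()
            .expireAfterWrite(10, TimeUnit.MINUTES) // entries disappear after 10 minutes
            .maximumSize(10_000)                    // hard cap on the number of entries
            .build();

    public static void put(String key, Object value) {
        CACHE.put(key, value);
    }

    public static Object get(String key) {
        return CACHE.getIfPresent(key); // null on a miss, after expiry, or after eviction
    }
}
Callers must handle the null from getIfPresent by reloading the value, which is the behavioral price of eviction.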
Lessons Learned
- Memory leaks often appear only under long-running load.
- Always add eviction to in-memory caches (see the sketch after this list).
- Heap dump analysis is essential for tracing leaks back to code.
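Even without a caching library, the JDK can enforce the eviction rule above: LinkedHashMap's removeEldestEntry hook yields a crude LRU bound in a few lines. A sketch, with an arbitrary cap:
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedLruMap<K, V> extends LinkedHashMap<K, V> {
    private static final int MAX_ENTRIES = 10_000;

    public BoundedLruMap() {
        super(16, 0.75f, true); // accessOrder = true gives least-recently-used ordering
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > MAX_ENTRIES; // drop the least recently used entry past the cap
    }
}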
Case Study 2: The Deadlock Nobody Could Reproduce
The Symptoms
A financial trading system intermittently froze during peak load. CPU usage dropped to near zero, but requests hung indefinitely.
Thread dumps (jstack) revealed threads stuck in nested synchronized blocks:
public void transfer(Account from, Account to, int amount) {
    synchronized (from) {
        synchronized (to) { // lock order depends on the caller's argument order
            from.withdraw(amount);
            to.deposit(amount);
        }
    }
}
Two threads attempted transfers in opposite directions:
- Thread A: locks Account1 → waiting for Account2.
- Thread B: locks Account2 → waiting for Account1.
Classic circular wait deadlock.
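This kind of circular wait can also be caught at runtime: the JVM's ThreadMXBean exposes deadlock detection that a watchdog thread or health-check endpoint can poll. A minimal sketch (the class name is illustrative):
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatchdog {
    public static void check() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null when no deadlock exists
        if (deadlocked != null) {
            for (ThreadInfo info : threads.getThreadInfo(deadlocked)) {
                System.err.printf("Deadlocked: %s is waiting on %s held by %s%n",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName());
            }
        }
    }
}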
The Fix
Use a consistent lock ordering:
public void transfer(Account from, Account to, int amount) {
    // Acquire locks in a globally consistent order (assumes account IDs are unique)
    Account first = from.getId() < to.getId() ? from : to;
    Account second = from.getId() < to.getId() ? to : from;
    synchronized (first) {
        synchronized (second) {
            from.withdraw(amount);
            to.deposit(amount);
        }
    }
}
Lessons Learned
- Deadlocks may appear only under rare interleavings.
- Thread dumps are the best tool for diagnosis.
- Always define a global lock ordering policy.
Case Study 3: The GC Pause That Broke SLAs
The Symptoms
A high-throughput API written in Go exhibited random latency spikes of 500ms–1s. SLAs required p99 latency <100ms.
Grafana showed a “sawtooth” pattern in GC pause times.
The Investigation
- pprof heap and CPU profiles showed frequent allocation of small temporary objects in a hot loop.
- The Go runtime GC was triggered more often under load due to allocation pressure.
Problematic code:
func process(data []string) []string {
    result := []string{}
    for _, d := range data {
        result = append(result, strings.ToUpper(d)) // allocates new strings
    }
    return result
}
Every request allocated thousands of new strings.
The Fix
- Introduced object pooling via sync.Pool:
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func process(data []string) []string {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufPool.Put(buf)
    for i, d := range data {
        if i > 0 {
            buf.WriteString(",") // separator so the result can be split back out
        }
        buf.WriteString(strings.ToUpper(d))
    }
    if buf.Len() == 0 {
        return nil
    }
    return strings.Split(buf.String(), ",") // assumes the input strings contain no commas
}
- Reduced allocation pressure by reusing buffers.
- GC pauses dropped from 500ms to <50ms.
Lessons Learned
- Allocation hotspots cause frequent GC cycles.
- Profiling tools like pprof reveal hidden allocation bottlenecks.
- Object pools are effective in high-throughput systems.
Case Study 4: The Mystery of the Vanishing Throughput
The Symptoms
A Node.js service showed steady CPU usage but throughput decreased over time. Eventually, latency exceeded 1s per request.
The Investigation
- Used Chrome DevTools and clinic.js for profiling.
- Heap snapshots showed closures capturing large objects unintentionally.
Problematic code:
function createHandler(config) {
    return function (req, res) {
        // Captures full config object, even if only timeout is used
        setTimeout(() => res.send("done"), config.timeout);
    };
}
Handlers held references to full config objects, preventing GC.
The Fix
Refactor to capture only required fields:
function createHandler(timeout) {
    return function (req, res) {
        setTimeout(() => res.send("done"), timeout);
    };
}
Lessons Learned
- JavaScript closures can create hidden memory leaks.
- Heap snapshots are invaluable for tracking down retained objects.
- Minimal capture is a good defensive coding practice.
Tools of the Trade
JVM
- jmap, jstack, jvisualvm, Eclipse MAT.
- Profilers: YourKit, JProfiler.
Go
- pprof (go tool pprof).
- Flame graphs via speedscope.
Node.js
- Chrome DevTools, clinic.js.
- Heap dumps via --inspect.
General
- Flame graphs for CPU bottlenecks.
- Heap snapshots for leaks.
- Thread dumps for deadlocks.
Best Practices for Avoiding Profiling Mysteries
- Add Observability Early: Metrics, logs, and tracing help detect issues before customers do.
- Stress-Test in Staging: Many performance issues only show under sustained load.
- Use Timeouts Everywhere: Prevent cascading failures during deadlocks or slowdowns (see the sketch after this list).
- Avoid Premature Optimization: But always review hot paths for allocation patterns.
- Automate Leak Detection: Run soak tests and monitor heap usage over time.
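As a concrete example of the timeout rule applied to locking, java.util.concurrent's tryLock lets a thread give up instead of hanging forever. A minimal sketch; the class name and the two-second budget are illustrative:
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class GuardedSection {
    private final ReentrantLock lock = new ReentrantLock();

    // Returns false instead of blocking forever when the lock cannot be acquired in time.
    public boolean runWithTimeout(Runnable action) throws InterruptedException {
        if (!lock.tryLock(2, TimeUnit.SECONDS)) {
            return false; // caller can retry, shed load, or raise an alert
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}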
Conclusion
Profiling mysteries are engineering detective stories. They test your ability to reason about concurrency, memory, and runtime behavior under pressure.
Whether it is a memory leak that takes weeks to appear, a deadlock triggered under rare conditions, or a garbage collector pause that breaks SLAs, these issues demand tools, patience, and methodical investigation.
The good news: with modern profilers, heap dumps, and flame graphs, engineers have powerful lenses into production systems. The best teams don’t just react to performance issues—they continuously monitor, profile, and tune their systems to stay ahead.