Root cause: torch.load() with mmap=True returns file-backed fp16 tensors,
but load_state_dict() without assign=True copies them into the model's
pre-allocated fp32 parameters, materializing the checkpoint a second time
in anonymous memory (7 GB fp16 ckpt → 14 GB of fp32 params). Combined with
the 2 GB Gradio server baseline, this exceeded the 15 GB physical RAM limit
on the second generation request.
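A minimal sketch of the failing path (model class, sizes, and checkpoint
path are placeholders, not the real pipeline code):

    import torch
    import torch.nn as nn

    # Placeholder stand-ins; the real modules live in pipelines.py and
    # autoencoders/model.py, and the real ckpt is ~7 GB of fp16 weights.
    model = nn.Linear(4096, 4096)   # parameters allocated as fp32 at init
    state = torch.load("model_fp16.ckpt", map_location="cpu", mmap=True)

    # Default load_state_dict copies every fp16 value into the pre-allocated
    # fp32 parameter, so the whole checkpoint is materialized a second time
    # as fp32 anonymous memory.
    model.load_state_dict(state)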
Fix: add assign=True to all load_state_dict calls in pipelines.py and
autoencoders/model.py. With assign=True the mmap fp16 tensors become the
model parameters directly, with no fp16→fp32 copy. When model.to('cuda') is
then called, the mmap pages (file-backed, evictable) are streamed straight
to VRAM, so CPU anon-rss stays near zero. Peak RSS across all rounds is now
~3.9 GB, versus 14.7 GB (OOM-killed) before.
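The fixed path, with the same placeholders (mmap= and assign= both require
a recent PyTorch, 2.1 or later):

    import torch
    import torch.nn as nn

    model = nn.Linear(4096, 4096)   # placeholder model
    state = torch.load("model_fp16.ckpt", map_location="cpu", mmap=True)

    # assign=True makes the mmap-backed fp16 tensors the parameters
    # themselves: no fp16→fp32 copy, and the backing pages stay
    # file-backed and evictable.
    model.load_state_dict(state, assign=True)

    # Moving to CUDA reads the mmap pages straight into device memory; the
    # CPU side never holds a full anonymous copy of the weights.
    model.to("cuda")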
gradio_app.py changes:
- low_vram_mode always takes the full-delete path (never CPU offload)
- glibc malloc tuning at startup (MALLOC_ARENA_MAX=1, malloc_trim)
- preemptive gc.collect(2) + malloc_trim + empty_cache at generation start (sketched below)
- _rlog() memory logging at each major step for monitoring
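A hedged sketch of the malloc tuning and the preemptive cleanup; the helper
names are illustrative, and since glibc only reads the MALLOC_ARENA_MAX
environment variable at process start, the in-process equivalent used here
is mallopt(M_ARENA_MAX, 1) via ctypes:

    import ctypes
    import ctypes.util
    import gc

    import torch

    _libc = ctypes.CDLL(ctypes.util.find_library("c"))
    M_ARENA_MAX = -8  # glibc mallopt() parameter id for the arena cap

    def tune_glibc_malloc():
        # Cap malloc at a single arena so freed chunks consolidate in one
        # heap, then return free heap pages to the kernel.
        _libc.mallopt(M_ARENA_MAX, 1)
        _libc.malloc_trim(0)

    def preempt_cleanup():
        # Run at the start of every generation request: full GC pass, trim
        # the glibc heap, and drop cached CUDA allocator blocks.
        gc.collect(2)
        _libc.malloc_trim(0)
        if torch.cuda.is_available():
            torch.cuda.empty_cache()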
pipelines.py:
- load_state_dict(..., assign=True) for model, vae, conditioner (see sketch after this list)
- del ckpt after state dict assignment to release mmap fd early
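An illustrative shape of the load path after the change; the function
signature and checkpoint keys are assumptions, not the actual pipelines.py
code:

    import torch

    def load_weights(model, vae, conditioner, ckpt_path):
        ckpt = torch.load(ckpt_path, map_location="cpu", mmap=True)
        # assign=True keeps the file-backed fp16 tensors as the parameters,
        # so no fp32 copies land in anonymous memory.
        model.load_state_dict(ckpt["model"], assign=True)
        vae.load_state_dict(ckpt["vae"], assign=True)
        conditioner.load_state_dict(ckpt["conditioner"], assign=True)
        # Drop the checkpoint dict right away so references held only by the
        # dict are released; per the commit, this lets the mmap fd close
        # earlier.
        del ckpt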
autoencoders/model.py:
- load_state_dict(..., assign=True) in from_single_file
- load_state_dict(..., assign=True) in init_from_ckpt
Verified: 4 consecutive Playwright WebUI rounds (shape+texture) pass
with no OOM. API two-round test also passes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>