13 Commits

Author SHA1 Message Date
Akasei
70289d04d7 fix: eliminate OOM on RTX 3080 via load_state_dict(assign=True) + low-VRAM mode
Root cause: torch.load() with mmap=True returns fp16 tensors, but
load_state_dict() without assign=True copies them into the module's
pre-allocated fp32 parameters, so CPU anon-rss ends up holding a full
fp32 copy (7 GB fp16 ckpt → 14 GB fp32 params). Combined with the 2 GB
Gradio server baseline, this exceeded the 15 GB physical RAM limit on
the second generation request.

Fix: add assign=True to all load_state_dict calls in pipelines.py and
autoencoders/model.py. With assign=True the mmap fp16 tensors are
assigned directly as model parameters without any fp16→fp32 copy.
When model.to('cuda') is then called, the mmap pages (file-backed,
evictable) are streamed directly to VRAM — CPU anon-rss stays near 0.

Peak RSS is now ~3.9 GB instead of 14.7 GB (killed) across all rounds.
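
Illustrative sketch of the loading pattern (constructor and checkpoint path
are placeholders, not the exact pipelines.py code):

    import torch

    model = build_model()                            # placeholder fp32 module
    ckpt = torch.load("model.fp16.ckpt", map_location="cpu",
                      mmap=True, weights_only=True)  # file-backed, no heap copy
    state_dict = ckpt.get("state_dict", ckpt)

    # assign=True installs the mmap fp16 tensors as the parameters themselves
    # instead of copying them into the pre-allocated fp32 parameters.
    model.load_state_dict(state_dict, assign=True)
    del ckpt, state_dict                             # drop checkpoint dict references

    model.to("cuda")                                 # pages stream from file to VRAM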

gradio_app.py changes:
- low_vram_mode always takes the full-delete path (never CPU offload)
- glibc malloc tuning at startup (MALLOC_ARENA_MAX=1, malloc_trim)
- preemptive gc.collect(2) + malloc_trim + empty_cache at generation start
- _rlog() memory logging at each major step for monitoring
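
The RSS logging is roughly this shape (reads /proc/self/status; the exact
_rlog() output format may differ):

    def _rlog(tag: str) -> None:
        # Log current resident set size so memory regressions show up per step.
        with open("/proc/self/status") as f:
            fields = dict(line.split(":", 1) for line in f if ":" in line)
        rss_gb = int(fields["VmRSS"].split()[0]) / (1024 * 1024)  # kB to GB
        print(f"[mem] {tag}: RSS={rss_gb:.2f} GB")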

pipelines.py:
- load_state_dict(..., assign=True) for model, vae, conditioner
- del ckpt after state dict assignment to release mmap fd early

autoencoders/model.py:
- load_state_dict(..., assign=True) in from_single_file
- load_state_dict(..., assign=True) in init_from_ckpt

Verified: 4 consecutive Playwright WebUI rounds (shape+texture) pass
with no OOM. API two-round test also passes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-17 02:03:43 +08:00
Akasei
f192c86c60 fix(oom): use mmap=True for checkpoint loading + malloc_trim + expandable_segments
Root cause: torch.load() reads the 6.9GB .ckpt into the Python heap while the
model params also sit in CPU RAM, for a ~14GB peak that, with the server
baseline on top, exhausts the 16GB of system RAM → OOM Killer.

Fix 1 - mmap=True on all torch.load() calls (torch 2.7 supports this):
  With mmap, checkpoint storage is file-backed (not heap). Only the model
  parameters (also ~7GB) exist in physical RAM during loading. Peak RAM
  drops from ~14GB to ~7GB — within safe limits on 16GB machines.
  Files changed: pipelines.py, hunyuan3ddit.py, model.py (×2), flow_matching_sit.py

Fix 2 - malloc_trim(0) after every gc.collect():
  Forces glibc to return freed heap pages to OS immediately, so Python's
  memory pool doesn't hoard freed model memory before the next load.
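
  A minimal sketch of the trim call (Linux/glibc only; elsewhere it degrades
  to a plain gc.collect()):

      import ctypes
      import gc

      def _gc_and_trim() -> None:
          gc.collect()
          try:
              # malloc_trim(0) hands freed heap pages back to the kernel right away.
              ctypes.CDLL("libc.so.6").malloc_trim(0)
          except OSError:
              pass  # non-glibc libc: freed pages stay pooled, as before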

Fix 3 - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True:
  Prevents CUDA allocator fragmentation between model switches.
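
  Can be exported in the launch script or set from Python, as long as it runs
  before the CUDA caching allocator initialises:

      import os
      # Must be in the environment before the first tensor lands on the GPU.
      os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")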

Fix 4 - Adaptive threshold recalculated:
  With mmap loading, a single model load needs ~7.5GB (model params), not
  14GB. The CPU offload threshold is lowered from 16GB → 10.5GB, enabling
  the fast path on machines with more headroom.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 23:18:16 +08:00
Akasei
6534f4ba15 fix: adaptive VRAM strategy + force rembg CPU to prevent OOM
Two root causes of CUDA OOM fixed:

1. onnxruntime-gpu CUDAExecutionProvider pre-allocated ~12GB VRAM arena
   for bria-rmbg background removal, starving PyTorch models.
   Fix: force CPUExecutionProvider in BackgroundRemover (rembg is
   lightweight, runs fine on CPU, frees all VRAM for shape/tex); see the
   sketch just after this list.

2. Previous 'always delete' strategy was wasteful on high-RAM machines.
   New adaptive strategy checks available system RAM at runtime:
   - RAM >= 16GB free: offload i23d to CPU (.to('cpu')) — fast, ~1s
   - RAM <  16GB free: full del + reload from disk — safe, ~20-30s
   This gives instant model switching on 32GB+ machines while keeping
   16GB machines safe from OOM Killer.
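
For fix 1, forcing the CPU provider looks roughly like this (assuming the
rembg session is built on onnxruntime; rmbg_model_path is a placeholder):

    import onnxruntime as ort

    # Keep onnxruntime off the GPU entirely: the background-removal model is
    # light enough on CPU, and all VRAM stays free for the shape/texture models.
    session = ort.InferenceSession(rmbg_model_path,
                                   providers=["CPUExecutionProvider"])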

Helper functions:
- _prepare_for_tex(): adaptive offload/delete based on RAM check
- _ensure_i23d_worker(): restore from CPU (fast) or disk (slow)
- _get_available_ram_gb(): reads /proc/meminfo
- _can_offload_to_cpu(): threshold check with logging
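
The RAM check is essentially this (bodies are illustrative; the real helpers
also log the decision):

    def _get_available_ram_gb() -> float:
        # MemAvailable already accounts for reclaimable page cache.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / (1024 * 1024)  # kB to GB
        return 0.0

    def _can_offload_to_cpu(threshold_gb: float = 16.0) -> bool:
        return _get_available_ram_gb() >= threshold_gb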

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 22:57:32 +08:00
Akasei
3cd767a18d fix(gradio): prevent OOM on 16GB RAM by fully deleting models between uses
Previous hybrid strategy (i23d in CPU RAM, tex del'd) still caused OOM:
- i23d in CPU RAM: ~7GB
- tex loading from disk: ~7GB peak in RAM before GPU transfer
- Total: ~14GB which, with the OS/server baseline on top, exhausts the
  16GB of system RAM → OOM Killer

New strategy: fully delete both models between uses.
Neither model persists in CPU RAM between requests.
Peak RAM during any load: ~7GB (one model staging to GPU).

Changes:
- Replace _offload_i23d_to_cpu/_restore_i23d_to_gpu with
  _unload_i23d_worker/_ensure_i23d_worker (full del + reload)
- Add double gc.collect() + empty_cache before each load
- Skip i23d startup load in low_vram_mode (load on first request)
- Both models reload from local HF cache (~20-30s each)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 22:39:03 +08:00
Akasei
76c36e53eb fix(gradio): fix OOM killer on second request in low_vram_mode
Root cause: _ensure_i23d_worker() reloaded from disk via from_pretrained(),
which loads the ~7GB checkpoint into CPU RAM. If Python GC hadn't freed the
previously del'd tensors yet, the old and new copies coexisted in RAM → OOM Killer.

Fix: hybrid strategy per model type:
  i23d (shape, ~7.25GB VRAM):
    .to('cpu') ↔ .to('cuda') — stays in RAM, no disk IO, fast switch
  tex_pipeline (texture, ~6.59GB VRAM):
    del + gc + empty_cache ↔ reload from HF cache — full VRAM release
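
Simplified sketch of the two paths (the real helpers work on module globals
in gradio_app.py; the holder object here is a placeholder):

    import gc
    import torch

    def _offload_i23d_to_cpu(app) -> None:
        # Shape weights stay resident in CPU RAM, so restoring needs no disk IO.
        app.i23d_worker.to("cpu")
        torch.cuda.empty_cache()

    def _unload_tex_pipeline(app) -> None:
        # Texture pipeline is dropped outright; it reloads from the HF cache later.
        app.tex_pipeline = None
        gc.collect()
        torch.cuda.empty_cache()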

Renamed helpers:
  _unload_i23d_worker()  → _offload_i23d_to_cpu()
  _ensure_i23d_worker()  → _restore_i23d_to_gpu()
  (tex helpers unchanged)

VRAM timeline per request in low_vram_mode:
  shape gen: i23d on GPU (7.25GB), tex unloaded
  → _offload_i23d_to_cpu(): i23d→RAM (0GB VRAM)
  → _ensure_tex_pipeline(): tex loads (6.59GB)
  texture gen: tex on GPU (6.59GB), i23d in RAM
  → _unload_tex_pipeline(): tex del'd (0GB VRAM)
  next request: _restore_i23d_to_gpu(): RAM→GPU (7.25GB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 22:05:08 +08:00
Akasei
9bee8e1844 refactor(gradio): replace CPU offload with direct GPU unload/lazy-load
Instead of .to('cpu') / .to('cuda'), models are now fully del'd from
GPU (no CPU intermediate) and reloaded on demand:

- _unload_i23d_worker(): del + gc.collect() + empty_cache()
- _ensure_i23d_worker(): lazy reload from pretrained if None
- _unload_tex_pipeline(): del + gc.collect() + empty_cache()
- _ensure_tex_pipeline(): lazy load from tex_conf if None
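
The ensure half is roughly this shape (the loader call is a placeholder for
the actual from_pretrained wiring; the holder object is too):

    import gc
    import torch

    def _ensure_i23d_worker(app):
        # Lazy reload from the local HF cache, only if a prior unload dropped it.
        if app.i23d_worker is None:
            gc.collect()
            torch.cuda.empty_cache()
            app.i23d_worker = load_i23d_from_pretrained()  # placeholder loader
        return app.i23d_worker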

generation_all() flow in low_vram_mode:
  shape gen → _unload_i23d_worker → _ensure_tex_pipeline →
  texture gen → _unload_tex_pipeline
  (shape model reloads on next _gen_shape call via _ensure_i23d_worker)

Startup: tex_pipeline NOT loaded in low_vram_mode (only tex_conf stored),
reducing startup VRAM from ~13.5GB to ~7.25GB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 21:15:56 +08:00
Akasei
5d0405dc68 feat(gradio): apply VRAM optimization and fix texture config
- generation_all(): offload i23d_worker to CPU before texture gen,
  restore after — mirrors batch_generate.py sequential strategy.
  Prevents OOM when both models peak simultaneously on RTX 3080; sketched
  after this list.
- Change texture config: max_num_view 8→9, resolution 768→512.
  Resolution 768 OOMs (14.6GB activation); 512 is the practical max for an
  RTX 3080 20GB. max_num_view 9 gives better texture coverage.
- Only active when --low_vram_mode flag is passed.
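
Rough shape of the sequencing referenced above (the texture call signature is
approximate; names follow the commits above, error handling elided):

    if low_vram_mode:
        i23d_worker.to("cpu")         # free VRAM before the texture model peaks
        torch.cuda.empty_cache()
    textured_mesh = tex_pipeline(mesh, image=image)
    if low_vram_mode:
        i23d_worker.to("cuda")        # restore for the next shape request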

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 21:05:14 +08:00
WncFht
00fa3ac012 feat: add enable_flashvdm to gradio_app.py 2025-07-13 11:44:49 +08:00
HuiwenShi
8f7b4be92e Update gradio_app.py 2025-06-16 22:13:47 +08:00
HuiwenShi
3f102487ba Update gradio_app.py 2025-06-16 22:12:54 +08:00
Zeqiang Lai
d2465f0427 Update gradio_app.py 2025-06-14 15:36:20 +08:00
Huiwenshi
dd93e7ce4e fix some 2025-06-14 14:32:20 +08:00
Huiwenshi
c88bee648e init 2025-06-13 23:53:14 +08:00