Root cause: torch.load() with mmap=True returns fp16 tensors, but
load_state_dict() without assign=True copies each one into the module's
pre-allocated fp32 parameters, materializing twice the checkpoint size
in anonymous memory (7 GB fp16 ckpt → 14 GB fp32 params). Combined
with the ~2 GB Gradio server baseline, this exceeded the 15 GB physical
RAM limit on the second generation request.
Fix: add assign=True to all load_state_dict calls in pipelines.py and
autoencoders/model.py. With assign=True the mmap fp16 tensors are
assigned directly as model parameters without any fp16→fp32 copy.
When model.to('cuda') is then called, the mmap pages (file-backed,
evictable) are streamed directly to VRAM — CPU anon-rss stays near 0.
Peak RSS is now ~3.9 GB instead of 14.7 GB (killed) across all rounds.
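A minimal sketch of the resulting load path, assuming an already-constructed
fp32 module (the helper name is illustrative, not the project's actual API):

    import torch
    from torch import nn

    def load_checkpoint_lowmem(model: nn.Module, ckpt_path: str) -> nn.Module:
        # mmap=True keeps the checkpoint storage file-backed (page cache,
        # evictable) instead of deserializing it into the Python heap.
        ckpt = torch.load(ckpt_path, map_location="cpu", mmap=True)
        state_dict = ckpt.get("state_dict", ckpt)

        # assign=True makes the mmap fp16 tensors the module's parameters
        # directly; without it, each tensor is copied into the pre-allocated
        # fp32 parameters, costing roughly 2x the checkpoint size in anon RAM.
        model.load_state_dict(state_dict, assign=True)

        # Drop the last references so the checkpoint mapping can be released
        # early (mirrors the `del ckpt` change in pipelines.py).
        del ckpt, state_dict

        # .to('cuda') streams the file-backed pages straight to VRAM; CPU
        # anon-rss stays near zero throughout.
        return model.to("cuda")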
gradio_app.py changes:
- low_vram_mode always takes the full-delete path (never CPU offload)
- glibc malloc tuning at startup (MALLOC_ARENA_MAX=1, malloc_trim)
- preemptive gc.collect(2) + malloc_trim + empty_cache at generation start
- _rlog() memory logging at each major step for monitoring
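The commit does not show _rlog()'s body; a minimal sketch of what a
per-step RSS logger can look like on Linux (field choice and log format
are assumptions):

    def _rlog(tag: str) -> None:
        # Read the resident set size from /proc/self/status so each major
        # pipeline step can be correlated with actual physical memory use.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    print(f"[mem] {tag}: VmRSS={line.split()[1]} kB")
                    return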
pipelines.py:
- load_state_dict(..., assign=True) for model, vae, conditioner
- del ckpt after state dict assignment to release mmap fd early
autoencoders/model.py:
- load_state_dict(..., assign=True) in from_single_file
- load_state_dict(..., assign=True) in init_from_ckpt
Verified: 4 consecutive Playwright WebUI rounds (shape+texture) pass
with no OOM. API two-round test also passes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause: torch.load() deserializes the 6.9 GB .ckpt into the Python
heap while the model parameters (~7 GB) also sit in CPU RAM, so peak
usage reaches ~14 GB, exceeding the 16 GB of system RAM and triggering
the OOM Killer.
Fix 1 - mmap=True on all torch.load() calls (torch 2.7 supports this):
With mmap, checkpoint storage is file-backed (not heap). Only the model
parameters (also ~7GB) exist in physical RAM during loading. Peak RAM
drops from ~14GB to ~7GB — within safe limits on 16GB machines.
Files changed: pipelines.py, hunyuan3ddit.py, model.py (×2), flow_matching_sit.py
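The change itself is one keyword argument per torch.load() call; a
before/after sketch (the checkpoint path is illustrative):

    import torch

    ckpt_path = "ckpt/model.ckpt"  # illustrative

    # Before: the whole 6.9 GB checkpoint is deserialized into the Python
    # heap and later copied into the model parameters, so both copies
    # (~14 GB) are resident at peak.
    ckpt = torch.load(ckpt_path, map_location="cpu")

    # After: the checkpoint storage stays file-backed (page cache), so only
    # the ~7 GB of model parameters occupy anonymous RAM during loading.
    ckpt = torch.load(ckpt_path, map_location="cpu", mmap=True)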
Fix 2 - malloc_trim(0) after every gc.collect():
Forces glibc to return freed heap pages to the OS immediately, so freed
model memory is not held in the allocator's arenas when the next load
starts.
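On glibc this can be wired up with ctypes; a sketch (helper name is
illustrative):

    import ctypes
    import gc

    _libc = ctypes.CDLL("libc.so.6")  # glibc only; guard on other platforms

    def collect_and_trim() -> None:
        # Full collection, then ask glibc to hand freed heap pages back to
        # the kernel right away instead of keeping them in its arenas.
        gc.collect()
        _libc.malloc_trim(0)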
Fix 3 - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True:
Prevents CUDA allocator fragmentation between model switches.
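The setting only takes effect if it is present before the CUDA caching
allocator initializes, e.g.:

    import os

    # Safest is to set this before `import torch` anywhere in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var on purpose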
Fix 4 - Adaptive threshold recalculated:
With mmap, loading a model needs ~7.5GB (model params) instead of ~14GB.
The CPU offload threshold is therefore lowered from 16GB → 10.5GB, so
the fast path now kicks in on machines with far less free RAM.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two root causes of CUDA OOM fixed:
1. onnxruntime-gpu CUDAExecutionProvider pre-allocated a ~12GB VRAM arena
for bria-rmbg background removal, starving the PyTorch models.
Fix: force CPUExecutionProvider in BackgroundRemover (rembg is
lightweight, runs fine on CPU, frees all VRAM for shape/tex); see the
provider sketch after this list.
2. Previous 'always delete' strategy was wasteful on high-RAM machines.
New adaptive strategy checks available system RAM at runtime:
- RAM >= 16GB free: offload i23d to CPU (.to('cpu')) — fast, ~1s
- RAM < 16GB free: full del + reload from disk — safe, ~20-30s
This gives instant model switching on 32GB+ machines while keeping
16GB machines safe from OOM Killer.
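The provider sketch referenced above: at the onnxruntime level the fix
amounts to pinning the background-removal session to the CPU provider
(the model path below is a placeholder, not the project's actual file):

    import onnxruntime as ort

    # Restrict the session to the CPU execution provider so onnxruntime
    # never pre-allocates a multi-GB CUDA arena for background removal.
    session = ort.InferenceSession(
        "bria-rmbg-2.0.onnx",                # placeholder model path
        providers=["CPUExecutionProvider"],  # deliberately no CUDA provider
    )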
Helper functions:
- _prepare_for_tex(): adaptive offload/delete based on RAM check
- _ensure_i23d_worker(): restore from CPU (fast) or disk (slow)
- _get_available_ram_gb(): reads /proc/meminfo
- _can_offload_to_cpu(): threshold check with logging
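The helper bodies are not part of the message; a plausible minimal sketch
(the log format is an assumption, 16 GB is this commit's threshold):

    OFFLOAD_THRESHOLD_GB = 16.0  # lowered in a later commit once mmap landed

    def _get_available_ram_gb() -> float:
        # MemAvailable is reported in kB in /proc/meminfo.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / (1024 * 1024)
        return 0.0

    def _can_offload_to_cpu() -> bool:
        free_gb = _get_available_ram_gb()
        print(f"[mem] {free_gb:.1f} GB available "
              f"(offload threshold {OFFLOAD_THRESHOLD_GB} GB)")
        return free_gb >= OFFLOAD_THRESHOLD_GB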
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace default u2net with bria-rmbg-2.0 for better quality.
BackgroundRemover now accepts model_name param (defaults to 'bria-rmbg').
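Usage, assuming only the new keyword changed (the import path is a guess,
not confirmed by the commit):

    from hy3dgen.rembg import BackgroundRemover  # hypothetical import path

    remover = BackgroundRemover()                   # bria-rmbg-2.0 default
    legacy = BackgroundRemover(model_name="u2net")  # restores the old model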
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>