Hunyuan3D_2.1_Low_VRAM

Author	SHA1	Message	Date
Akasei	70289d04d7	fix: eliminate OOM on RTX 3080 via load_state_dict(assign=True) + low-VRAM mode Root cause: torch.load() with mmap=True returns fp16 tensors, but load_state_dict() without assign=True widens them fp16→fp32 in-place, doubling CPU anon-rss (7 GB fp16 ckpt → 14 GB fp32 params). Combined with the 2 GB Gradio server baseline, this exceeded the 15 GB physical RAM limit on the second generation request. Fix: add assign=True to all load_state_dict calls in pipelines.py and autoencoders/model.py. With assign=True the mmap fp16 tensors are assigned directly as model parameters without any fp16→fp32 copy. When model.to('cuda') is then called, the mmap pages (file-backed, evictable) are streamed directly to VRAM — CPU anon-rss stays near 0. Peak RSS is now ~3.9 GB instead of 14.7 GB (killed) across all rounds. gradio_app.py changes: - low_vram_mode always takes the full-delete path (never CPU offload) - glibc malloc tuning at startup (MALLOC_ARENA_MAX=1, malloc_trim) - preemptive gc.collect(2) + malloc_trim + empty_cache at generation start - _rlog() memory logging at each major step for monitoring pipelines.py: - load_state_dict(..., assign=True) for model, vae, conditioner - del ckpt after state dict assignment to release mmap fd early autoencoders/model.py: - load_state_dict(..., assign=True) in from_single_file - load_state_dict(..., assign=True) in init_from_ckpt Verified: 4 consecutive Playwright WebUI rounds (shape+texture) pass with no OOM. API two-round test also passes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-17 02:03:43 +08:00
Akasei	5acd0a765b	test: WebUI API end-to-end verification (chair.jpg, 227s, no OOM) Generated via gradio_client /generation_all endpoint: - Shape generation: 104s - Face reduction: 2s - RAM check: 9.4GB < 10.5GB threshold → full delete path - Tex pipeline load: ~15s (from HF cache) - Texture generation: 98s - Post-request VRAM: 361 MiB (tex pipeline unloaded) - Zero OOM kills Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-17 00:15:53 +08:00
Akasei	f651475ec5	test: batch generation 9/9 success with mmap+malloc_trim fixes All 9 images processed successfully: - Phase 1: 9/9 shapes generated - Phase 2: 9/9 textured GLBs generated - Zero OOM kills, zero failures Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 23:42:20 +08:00
Akasei	f192c86c60	fix(oom): use mmap=True for checkpoint loading + malloc_trim + expandable_segments Root cause: torch.load() reads 6.9GB .ckpt into Python heap + model params in CPU RAM = ~14GB peak, exceeding 16GB system RAM → OOM Killer. Fix 1 - mmap=True on all torch.load() calls (torch 2.7 supports this): With mmap, checkpoint storage is file-backed (not heap). Only the model parameters (also ~7GB) exist in physical RAM during loading. Peak RAM drops from ~14GB to ~7GB — within safe limits on 16GB machines. Files changed: pipelines.py, hunyuan3ddit.py, model.py (×2), flow_matching_sit.py Fix 2 - malloc_trim(0) after every gc.collect(): Forces glibc to return freed heap pages to OS immediately, so Python's memory pool doesn't hoard freed model memory before the next load. Fix 3 - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True: Prevents CUDA allocator fragmentation between model switches. Fix 4 - Adaptive threshold recalculated: With mmap loading, loading a model requires ~7.5GB (model params) not 14GB. CPU offload threshold lowered from 16GB → 10.5GB, enabling fast path on machines with more headroom. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 23:18:16 +08:00
Akasei	6534f4ba15	fix: adaptive VRAM strategy + force rembg CPU to prevent OOM Two root causes of CUDA OOM fixed: 1. onnxruntime-gpu CUDAExecutionProvider pre-allocated ~12GB VRAM arena for bria-rmbg background removal, starving PyTorch models. Fix: force CPUExecutionProvider in BackgroundRemover (rembg is lightweight, runs fine on CPU, frees all VRAM for shape/tex). 2. Previous 'always delete' strategy was wasteful on high-RAM machines. New adaptive strategy checks available system RAM at runtime: - RAM >= 16GB free: offload i23d to CPU (.to('cpu')) — fast, ~1s - RAM < 16GB free: full del + reload from disk — safe, ~20-30s This gives instant model switching on 32GB+ machines while keeping 16GB machines safe from OOM Killer. Helper functions: - _prepare_for_tex(): adaptive offload/delete based on RAM check - _ensure_i23d_worker(): restore from CPU (fast) or disk (slow) - _get_available_ram_gb(): reads /proc/meminfo - _can_offload_to_cpu(): threshold check with logging Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 22:57:32 +08:00
Akasei	3cd767a18d	fix(gradio): prevent OOM on 16GB RAM by fully deleting models between uses Previous hybrid strategy (i23d in CPU RAM, tex del'd) still caused OOM: - i23d in CPU RAM: ~7GB - tex loading from disk: ~7GB peak in RAM before GPU transfer - Total: ~14GB > 16GB system RAM → OOM Killer New strategy: fully delete both models between uses. Neither model persists in CPU RAM between requests. Peak RAM during any load: ~7GB (one model staging to GPU). Changes: - Replace _offload_i23d_to_cpu/_restore_i23d_to_gpu with _unload_i23d_worker/_ensure_i23d_worker (full del + reload) - Add double gc.collect() + empty_cache before each load - Skip i23d startup load in low_vram_mode (load on first request) - Both models reload from local HF cache (~20-30s each) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 22:39:03 +08:00
Akasei	474001da6b	feat(rembg): switch background removal to bria-rmbg model Replace default u2net with bria-rmbg-2.0 for better quality. BackgroundRemover now accepts model_name param (defaults to 'bria-rmbg'). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 22:14:21 +08:00
Akasei	76c36e53eb	fix(gradio): fix OOM killer on second request in low_vram_mode Root cause: _ensure_i23d_worker() reloaded from disk via from_pretrained(), which loads the ~7GB checkpoint into CPU RAM. If Python GC hadn't freed previous del'd tensors yet, both old+new copies in RAM → OOM Killer. Fix: hybrid strategy per model type: i23d (shape, ~7.25GB VRAM): .to('cpu') ↔ .to('cuda') — stays in RAM, no disk IO, fast switch tex_pipeline (texture, ~6.59GB VRAM): del + gc + empty_cache ↔ reload from HF cache — full VRAM release Renamed helpers: _unload_i23d_worker() → _offload_i23d_to_cpu() _ensure_i23d_worker() → _restore_i23d_to_gpu() (tex helpers unchanged) VRAM timeline per request in low_vram_mode: shape gen: i23d on GPU (7.25GB), tex unloaded → _offload_i23d_to_cpu(): i23d→RAM (0GB VRAM) → _ensure_tex_pipeline(): tex loads (6.59GB) texture gen: tex on GPU (6.59GB), i23d in RAM → _unload_tex_pipeline(): tex del'd (0GB VRAM) next request: _restore_i23d_to_gpu(): RAM→GPU (7.25GB) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 22:05:08 +08:00
Akasei	9bee8e1844	refactor(gradio): replace CPU offload with direct GPU unload/lazy-load Instead of .to('cpu') / .to('cuda'), models are now fully del'd from GPU (no CPU intermediate) and reloaded on demand: - _unload_i23d_worker(): del + gc.collect() + empty_cache() - _ensure_i23d_worker(): lazy reload from pretrained if None - _unload_tex_pipeline(): del + gc.collect() + empty_cache() - _ensure_tex_pipeline(): lazy load from tex_conf if None generation_all() flow in low_vram_mode: shape gen → _unload_i23d_worker → _ensure_tex_pipeline → texture gen → _unload_tex_pipeline (shape model reloads on next _gen_shape call via _ensure_i23d_worker) Startup: tex_pipeline NOT loaded in low_vram_mode (only tex_conf stored), reducing startup VRAM from ~13.5GB to ~7.25GB. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 21:15:56 +08:00
Akasei	5d0405dc68	feat(gradio): apply VRAM optimization and fix texture config - generation_all(): offload i23d_worker to CPU before texture gen, restore after — mirrors batch_generate.py sequential strategy. Prevents OOM when both models peak simultaneously on RTX 3080. - Change texture config: max_num_view 8→9, resolution 768→512. 768 resolution OOMs (14.6GB activation); 512 is practical max for RTX 3080 20GB. max_views 9 gives better texture coverage. - Only active when --low_vram_mode flag is passed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 21:05:14 +08:00
Akasei	e150058012	feat(batch): use steps=50, resolution=512, max_views=9 for RTX 3080 768 resolution causes OOM (14.6GB model activation) on RTX 3080 20GB. 512 is the practical maximum: texture model uses 6.59GB, leaving sufficient headroom. Increased max_views 6→9 for better texture coverage. Result: 9/9 images → textured GLB in 12.3 min total. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 20:53:12 +08:00
Akasei	b6685c9560	feat: add batch 3D generation script with VRAM optimization - Add batch_generate.py: two-phase pipeline (shape→texture) that loads models sequentially to avoid OOM on RTX 3080 - Fix mesh_utils.py: make bpy import lazy so load_mesh/save_mesh work without Blender installed - Phase 1: shape generation for all images, then unload - Phase 2: texture generation for all meshes, then unload - Skip already-generated outputs for resumability - Tested: 9/9 images successfully generated textured GLB models Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-16 20:20:46 +08:00
Huiwen Shi	82920d643c	Update LICENSE Update License	2025-10-17 18:10:07 +08:00
HuiwenShi	c9b21668e2	Create is_watertight.py	2025-09-24 11:35:53 +08:00
HuiwenShi	5b6885dcf4	Update chamfer_distance.py	2025-09-23 14:10:26 +08:00
HuiwenShi	34746fcbc2	Create chamfer_distance.py	2025-09-23 11:46:01 +08:00
HuiwenShi	a0fd02ea01	Merge pull request #98 from qinmaohui/main [Rhino-Bird Hands-on Issue] Fix error when installing custom_rastorizer on Windows	2025-09-11 22:38:31 +08:00
qinmaohui	663ee27446	Revert the changes to custom_rasterizer_kernel	2025-09-10 22:37:29 +08:00
qinmaohui	928f41b289	Restore the original files and add a new custom_rasterizer_kernel_for_windows folder for the modified files	2025-09-10 09:04:36 +08:00
Xianghui Yang	2eb92bcfd1	Merge pull request #104 from WncFht/feature/add-enable-flashvdm [Rhino-Bird Hands-on Issue] inference speed	2025-09-08 23:16:17 +08:00
HuiwenShi	06ea674535	Merge pull request #137 from ItsThatRandomDev/fix/docker-conda-tos Fix: accept Anaconda ToS in Dockerfile to prevent build failure	2025-09-08 20:29:54 +08:00
Xianghui Yang	840d66abe8	Merge pull request #102 from s572915912/s572915912-patch-1 [Rhino-Bird Hands-on Issue] training: split_sizes error	2025-09-08 19:57:36 +08:00
Xianghui Yang	7cc51b67ef	Update README.md add acknowledgment	2025-08-27 14:52:15 +08:00
Xianghui Yang	3efb87e736	Update README.md add acknowledgment	2025-08-27 14:51:20 +08:00
ItsThatRandomDev	0e9f8d78d4	fix conda ToS acceptance in Dockerfile	2025-08-18 19:00:20 +02:00
s572915912	b3dd50ba37	Update misc.py repair	2025-08-06 01:14:49 +08:00
s572915912	d9fc4d31bf	Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml repair	2025-08-06 01:12:13 +08:00
Xianghui Yang	84e97834c0	Update README.md	2025-07-31 17:21:07 +08:00
Xianghui Yang	cd2feb5f38	Add files via upload	2025-07-30 23:21:51 +08:00
Xianghui Yang	b791e9c22f	Update LICENSE	2025-07-30 23:18:42 +08:00
Xianghui Yang	b521b9f71f	update x.png	2025-07-28 17:44:21 +08:00
Xianghui Yang	b96ea3558d	update x qrcode	2025-07-28 17:40:27 +08:00
Xianghui Yang	b32d264d43	add hunyuan world 1.0	2025-07-28 17:36:11 +08:00
Xianghui Yang	665c38a19a	add hunyuan world 1.0	2025-07-28 17:35:39 +08:00
oakshy	2ca5fd3155	modify model zoo	2025-07-15 11:28:24 +08:00
WncFht	a6509b95fb	feat: add enable_flashvdm to api_server	2025-07-13 11:48:13 +08:00
WncFht	00fa3ac012	feat: add enable_flashvdm to gradio_app.py	2025-07-13 11:44:49 +08:00
s572915912	f4e0307665	Update train_deepspeed.sh	2025-07-11 18:32:16 +08:00
s572915912	8eff6d8233	Delete run_inference_with_fix.py	2025-07-11 16:54:43 +08:00
s572915912	7a9d765627	Update run_inference_with_fix.py	2025-07-11 16:53:19 +08:00
s572915912	f0a008279e	Update pipelines.py	2025-07-11 16:51:33 +08:00
s572915912	dc2ea32d76	Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml	2025-07-11 16:47:40 +08:00
s572915912	96349ad5d0	Update train_deepspeed.sh	2025-07-11 16:43:40 +08:00
s572915912	6726877bbb	Update train_deepspeed.sh	2025-07-11 16:40:01 +08:00
s572915912	c6d4cb89e2	Update train_deepspeed.sh	2025-07-11 16:39:10 +08:00
s572915912	de7996251d	Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml	2025-07-11 16:37:32 +08:00
s572915912	af935af688	Update train_deepspeed.sh	2025-07-11 16:36:46 +08:00
s572915912	f2f19d74a8	Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml add explain	2025-07-11 15:53:01 +08:00
s572915912	8cd92830fb	Update train_deepspeed.sh auto detect	2025-07-11 15:51:55 +08:00
s572915912	e34a3ba752	Create run_inference_with_fix.py	2025-07-11 02:33:30 +08:00

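Commit 6534f4ba15 mentions helpers that read `/proc/meminfo` and compare available RAM against a threshold (16GB at first, lowered to 10.5GB in f192c86c60) to choose between CPU offload and full delete plus reload. A rough sketch of that decision, assuming the `MemAvailable` field is the one consulted; the function names and return strings below are hypothetical and do not reproduce `_get_available_ram_gb()` or `_can_offload_to_cpu()`:

```python
from typing import Optional


def parse_mem_available_gb(meminfo_text: str) -> Optional[float]:
    """Return MemAvailable in GiB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / (1024 ** 2)
    return None


def read_available_ram_gb(path: str = "/proc/meminfo") -> Optional[float]:
    """Read available RAM on Linux; None if the file cannot be read."""
    try:
        with open(path) as fh:
            return parse_mem_available_gb(fh.read())
    except OSError:
        return None


def choose_unload_strategy(available_gb: Optional[float],
                           threshold_gb: float = 10.5) -> str:
    """Pick CPU offload when enough RAM is free, else delete and reload."""
    if available_gb is not None and available_gb >= threshold_gb:
        return "offload_to_cpu"
    return "delete_and_reload"


if __name__ == "__main__":
    print(choose_unload_strategy(read_available_ram_gb()))
```

Falling back to the delete path when the reading is unavailable keeps the conservative behavior on systems without `/proc/meminfo`.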

114 Commits
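
Commit f192c86c60 calls `malloc_trim(0)` after every `gc.collect()` so that glibc returns freed heap pages to the OS. One way this could be done from Python is through `ctypes`; the sketch below is an assumption about the approach (the helper name `collect_and_trim` is hypothetical) and it simply reports `False` on platforms where glibc's `malloc_trim` is not available.

```python
import ctypes
import gc


def collect_and_trim() -> bool:
    """Run a full GC pass, then ask glibc to release free heap pages.

    Returns True if malloc_trim reported that memory was released,
    False otherwise (including on non-glibc platforms).
    """
    gc.collect()
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim.argtypes = [ctypes.c_size_t]
        libc.malloc_trim.restype = ctypes.c_int
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        # libc.so.6 or the malloc_trim symbol is missing (e.g. macOS, musl).
        return False


if __name__ == "__main__":
    print(collect_and_trim())
```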