114 Commits

Author SHA1 Message Date
Akasei
70289d04d7 fix: eliminate OOM on RTX 3080 via load_state_dict(assign=True) + low-VRAM mode
Root cause: torch.load() with mmap=True returns fp16 tensors, but
load_state_dict() without assign=True copies them into the model's
pre-allocated fp32 parameters, doubling CPU anon-rss (7 GB fp16 ckpt →
14 GB fp32 params). Combined with the 2 GB Gradio server baseline, this
exceeded the 15 GB physical RAM limit on the second generation request.

Fix: add assign=True to all load_state_dict calls in pipelines.py and
autoencoders/model.py. With assign=True the mmap fp16 tensors are
assigned directly as model parameters without any fp16→fp32 copy.
When model.to('cuda') is then called, the mmap pages (file-backed,
evictable) are streamed directly to VRAM — CPU anon-rss stays near 0.

Peak RSS is now ~3.9 GB instead of 14.7 GB (killed) across all rounds.

gradio_app.py changes:
- low_vram_mode always takes the full-delete path (never CPU offload)
- glibc malloc tuning at startup (MALLOC_ARENA_MAX=1, malloc_trim)
- preemptive gc.collect(2) + malloc_trim + empty_cache at generation start
- _rlog() memory logging at each major step for monitoring

pipelines.py:
- load_state_dict(..., assign=True) for model, vae, conditioner
- del ckpt after state dict assignment to release mmap fd early

autoencoders/model.py:
- load_state_dict(..., assign=True) in from_single_file
- load_state_dict(..., assign=True) in init_from_ckpt

Verified: 4 consecutive Playwright WebUI rounds (shape+texture) pass
with no OOM. API two-round test also passes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-17 02:03:43 +08:00
Akasei
5acd0a765b test: WebUI API end-to-end verification (chair.jpg, 227s, no OOM)
Generated via gradio_client /generation_all endpoint:
- Shape generation: 104s
- Face reduction: 2s
- RAM check: 9.4GB < 10.5GB threshold → full delete path
- Tex pipeline load: ~15s (from HF cache)
- Texture generation: 98s
- Post-request VRAM: 361 MiB (tex pipeline unloaded)
- Zero OOM kills

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-17 00:15:53 +08:00
Akasei
f651475ec5 test: batch generation 9/9 success with mmap+malloc_trim fixes
All 9 images processed successfully:
- Phase 1: 9/9 shapes generated
- Phase 2: 9/9 textured GLBs generated
- Zero OOM kills, zero failures

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 23:42:20 +08:00
Akasei
f192c86c60 fix(oom): use mmap=True for checkpoint loading + malloc_trim + expandable_segments
Root cause: torch.load() reads the 6.9GB .ckpt into the Python heap while
the model params also sit in CPU RAM, for a ~14GB peak. On a machine with
16GB system RAM that leaves too little headroom → OOM Killer.

Fix 1 - mmap=True on all torch.load() calls (torch 2.7 supports this):
  With mmap, checkpoint storage is file-backed (not heap). Only the model
  parameters (also ~7GB) exist in physical RAM during loading. Peak RAM
  drops from ~14GB to ~7GB — within safe limits on 16GB machines.
  Files changed: pipelines.py, hunyuan3ddit.py, model.py (×2), flow_matching_sit.py

Fix 2 - malloc_trim(0) after every gc.collect():
  Forces glibc to return freed heap pages to OS immediately, so Python's
  memory pool doesn't hoard freed model memory before the next load.

Fix 3 - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True:
  Prevents CUDA allocator fragmentation between model switches.

Fix 4 - Adaptive threshold recalculated:
  With mmap, loading a model requires ~7.5GB (model params), not 14GB.
  CPU offload threshold lowered from 16GB → 10.5GB, enabling the fast
  path on machines with more headroom.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 23:18:16 +08:00
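Fixes 2 and 3 of commit f192c86c60 can be sketched with the standard library alone. This is an illustration under assumptions, not the repository's code: the helper name `trim_heap` is hypothetical, and `malloc_trim` is only available where glibc is the C library.

```python
import ctypes
import gc
import os

# Assumption: this runs before PyTorch initializes its CUDA allocator,
# so in practice it belongs at the very top of the entry script.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def trim_heap() -> bool:
    """Collect garbage, then ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim reported that memory was released, and
    False if nothing was released or glibc is not available.
    """
    gc.collect()
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim.argtypes = [ctypes.c_size_t]
        libc.malloc_trim.restype = ctypes.c_int
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        # Not a glibc system (e.g. macOS, Windows, musl): nothing to trim.
        return False


released = trim_heap()
```

Calling `trim_heap()` after each model is deleted follows the "malloc_trim(0) after every gc.collect()" rule from the commit message; on non-glibc systems the helper degrades to a plain `gc.collect()`.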
Akasei
6534f4ba15 fix: adaptive VRAM strategy + force rembg CPU to prevent OOM
Two root causes of CUDA OOM fixed:

1. onnxruntime-gpu CUDAExecutionProvider pre-allocated ~12GB VRAM arena
   for bria-rmbg background removal, starving PyTorch models.
   Fix: force CPUExecutionProvider in BackgroundRemover (rembg is
   lightweight, runs fine on CPU, frees all VRAM for shape/tex).

2. Previous 'always delete' strategy was wasteful on high-RAM machines.
   New adaptive strategy checks available system RAM at runtime:
   - RAM >= 16GB free: offload i23d to CPU (.to('cpu')) — fast, ~1s
   - RAM <  16GB free: full del + reload from disk — safe, ~20-30s
   This gives instant model switching on 32GB+ machines while keeping
   16GB machines safe from OOM Killer.

Helper functions:
- _prepare_for_tex(): adaptive offload/delete based on RAM check
- _ensure_i23d_worker(): restore from CPU (fast) or disk (slow)
- _get_available_ram_gb(): reads /proc/meminfo
- _can_offload_to_cpu(): threshold check with logging

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 22:57:32 +08:00
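The adaptive decision in commit 6534f4ba15 can be sketched as follows. The function names here are hypothetical and are not the repository's `_get_available_ram_gb()` / `_can_offload_to_cpu()`. The sketch also assumes the check reads the `MemAvailable` field, which the commit message does not state.

```python
from typing import Optional

CPU_OFFLOAD_THRESHOLD_GB = 16.0  # threshold quoted in this commit message


def parse_mem_available_gb(meminfo_text: str) -> Optional[float]:
    """Return MemAvailable from /proc/meminfo content in GiB, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 * 1024)
    return None


def get_available_ram_gb(path: str = "/proc/meminfo") -> Optional[float]:
    """Read the available RAM, or return None where /proc is missing."""
    try:
        with open(path, "r", encoding="ascii") as fh:
            return parse_mem_available_gb(fh.read())
    except OSError:
        return None


def choose_strategy(
    available_gb: Optional[float],
    threshold_gb: float = CPU_OFFLOAD_THRESHOLD_GB,
) -> str:
    """Pick the fast CPU offload only when enough RAM is known to be free."""
    if available_gb is not None and available_gb >= threshold_gb:
        return "offload_to_cpu"
    return "delete_and_reload"


sample = "MemTotal:       16000000 kB\nMemAvailable:    8388608 kB\n"
print(choose_strategy(parse_mem_available_gb(sample)))
```

An unknown RAM reading falls back to the slower delete-and-reload path, which is the conservative choice on machines where `/proc/meminfo` cannot be read.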
Akasei
3cd767a18d fix(gradio): prevent OOM on 16GB RAM by fully deleting models between uses
Previous hybrid strategy (i23d in CPU RAM, tex del'd) still caused OOM:
- i23d in CPU RAM: ~7GB
- tex loading from disk: ~7GB peak in RAM before GPU transfer
- Total: ~14GB of 16GB system RAM, too little headroom → OOM Killer

New strategy: fully delete both models between uses.
Neither model persists in CPU RAM between requests.
Peak RAM during any load: ~7GB (one model staging to GPU).

Changes:
- Replace _offload_i23d_to_cpu/_restore_i23d_to_gpu with
  _unload_i23d_worker/_ensure_i23d_worker (full del + reload)
- Add double gc.collect() + empty_cache before each load
- Skip i23d startup load in low_vram_mode (load on first request)
- Both models reload from local HF cache (~20-30s each)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 22:39:03 +08:00
Akasei
474001da6b feat(rembg): switch background removal to bria-rmbg model
Replace default u2net with bria-rmbg-2.0 for better quality.
BackgroundRemover now accepts model_name param (defaults to 'bria-rmbg').

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 22:14:21 +08:00
Akasei
76c36e53eb fix(gradio): fix OOM killer on second request in low_vram_mode
Root cause: _ensure_i23d_worker() reloaded from disk via from_pretrained(),
which loads the ~7GB checkpoint into CPU RAM. If Python GC hadn't freed
previous del'd tensors yet, both old+new copies in RAM → OOM Killer.

Fix: hybrid strategy per model type:
  i23d (shape, ~7.25GB VRAM):
    .to('cpu') ↔ .to('cuda') — stays in RAM, no disk IO, fast switch
  tex_pipeline (texture, ~6.59GB VRAM):
    del + gc + empty_cache ↔ reload from HF cache — full VRAM release

Renamed helpers:
  _unload_i23d_worker()  → _offload_i23d_to_cpu()
  _ensure_i23d_worker()  → _restore_i23d_to_gpu()
  (tex helpers unchanged)

VRAM timeline per request in low_vram_mode:
  shape gen: i23d on GPU (7.25GB), tex unloaded
  → _offload_i23d_to_cpu(): i23d→RAM (0GB VRAM)
  → _ensure_tex_pipeline(): tex loads (6.59GB)
  texture gen: tex on GPU (6.59GB), i23d in RAM
  → _unload_tex_pipeline(): tex del'd (0GB VRAM)
  next request: _restore_i23d_to_gpu(): RAM→GPU (7.25GB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 22:05:08 +08:00
Akasei
9bee8e1844 refactor(gradio): replace CPU offload with direct GPU unload/lazy-load
Instead of .to('cpu') / .to('cuda'), models are now fully del'd from
GPU (no CPU intermediate) and reloaded on demand:

- _unload_i23d_worker(): del + gc.collect() + empty_cache()
- _ensure_i23d_worker(): lazy reload from pretrained if None
- _unload_tex_pipeline(): del + gc.collect() + empty_cache()
- _ensure_tex_pipeline(): lazy load from tex_conf if None

generation_all() flow in low_vram_mode:
  shape gen → _unload_i23d_worker → _ensure_tex_pipeline →
  texture gen → _unload_tex_pipeline
  (shape model reloads on next _gen_shape call via _ensure_i23d_worker)

Startup: tex_pipeline NOT loaded in low_vram_mode (only tex_conf stored),
reducing startup VRAM from ~13.5GB to ~7.25GB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 21:15:56 +08:00
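The unload / lazy-load pairing introduced in commit 9bee8e1844 can be sketched generically. `LazyModel` and `fake_loader` are hypothetical names used only for illustration; the repository uses module-level helpers such as `_unload_i23d_worker()` and `_ensure_i23d_worker()` whose bodies are not shown in this log.

```python
import gc
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Hold a model that is loaded on first use and can be fully dropped."""

    def __init__(self, loader: Callable[[], T]):
        self._loader = loader
        self._model: Optional[T] = None

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def ensure(self) -> T:
        """Return the model, loading it only if it is not resident."""
        if self._model is None:
            self._model = self._loader()
        return self._model

    def unload(self) -> None:
        """Drop the only reference and collect it immediately."""
        self._model = None
        gc.collect()
        # With PyTorch on a CUDA device, torch.cuda.empty_cache() would be
        # called here to hand cached blocks back to the driver.


calls = []


def fake_loader() -> dict:
    """Stand-in for an expensive from_pretrained()-style load."""
    calls.append("load")
    return {"weights": [0.0] * 4}


shape_model = LazyModel(fake_loader)
shape_model.ensure()
shape_model.ensure()  # already resident, the loader is not called again
shape_model.unload()
shape_model.ensure()  # reloaded on demand
print(len(calls))
```

The script prints `2`: one load at first use and one reload after `unload()`. The pattern only frees memory if no other reference to the model survives elsewhere.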
Akasei
5d0405dc68 feat(gradio): apply VRAM optimization and fix texture config
- generation_all(): offload i23d_worker to CPU before texture gen,
  restore after — mirrors batch_generate.py sequential strategy.
  Prevents OOM when both models peak simultaneously on RTX 3080.
- Change texture config: max_num_view 8→9, resolution 768→512.
  768 resolution OOMs (14.6GB activation); 512 is practical max for
  RTX 3080 20GB. max_views 9 gives better texture coverage.
- Only active when --low_vram_mode flag is passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 21:05:14 +08:00
Akasei
e150058012 feat(batch): use steps=50, resolution=512, max_views=9 for RTX 3080
768 resolution causes OOM (14.6GB model activation) on RTX 3080 20GB.
512 is the practical maximum: texture model uses 6.59GB, leaving
sufficient headroom. Increased max_views 6→9 for better texture coverage.

Result: 9/9 images → textured GLB in 12.3 min total.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 20:53:12 +08:00
Akasei
b6685c9560 feat: add batch 3D generation script with VRAM optimization
- Add batch_generate.py: two-phase pipeline (shape→texture) that loads
  models sequentially to avoid OOM on RTX 3080
- Fix mesh_utils.py: make bpy import lazy so load_mesh/save_mesh work
  without Blender installed
- Phase 1: shape generation for all images, then unload
- Phase 2: texture generation for all meshes, then unload
- Skip already-generated outputs for resumability
- Tested: 9/9 images successfully generated textured GLB models

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 20:20:46 +08:00
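The two-phase, resumable flow described for batch_generate.py in commit b6685c9560 can be sketched as below. This is not the script itself: `run_phase` and `fake_generate` are hypothetical, and the real model loading and unloading is reduced to comments.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def run_phase(
    inputs: Iterable[Path],
    out_dir: Path,
    suffix: str,
    generate: Callable[[Path, Path], None],
) -> List[Path]:
    """Run generate(src, dst) for each input, skipping existing outputs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in inputs:
        dst = out_dir / (src.stem + suffix)
        if not dst.exists():  # resumability: finished work is not redone
            generate(src, dst)
        outputs.append(dst)
    return outputs


generated = []


def fake_generate(src: Path, dst: Path) -> None:
    """Stand-in for a model call; writes a placeholder output file."""
    dst.write_text("generated from " + src.name)
    generated.append(dst.name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    images = [root / "chair.jpg", root / "table.jpg"]
    for _ in range(2):  # the second pass finds every output and skips it
        # Phase 1: the shape model would be loaded here, then unloaded.
        meshes = run_phase(images, root / "shapes", ".glb", fake_generate)
        # Phase 2: the texture model would be loaded here, then unloaded.
        run_phase(meshes, root / "textured", "_textured.glb", fake_generate)

print(generated)
```

Running all shapes before any texture means only one model needs to be resident at a time, and the existence check lets an interrupted batch resume without regenerating finished outputs.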
Huiwen Shi
82920d643c Update LICENSE
Update License
2025-10-17 18:10:07 +08:00
HuiwenShi
c9b21668e2 Create is_watertight.py 2025-09-24 11:35:53 +08:00
HuiwenShi
5b6885dcf4 Update chamfer_distance.py 2025-09-23 14:10:26 +08:00
HuiwenShi
34746fcbc2 Create chamfer_distance.py 2025-09-23 11:46:01 +08:00
HuiwenShi
a0fd02ea01 Merge pull request #98 from qinmaohui/main
[Rhino-Bird hands-on issue] Fix error when installing custom_rasterizer on Windows
2025-09-11 22:38:31 +08:00
qinmaohui
663ee27446 Revert the changes to custom_rasterizer_kernel 2025-09-10 22:37:29 +08:00
qinmaohui
928f41b289 Restore the original files and add a custom_rasterizer_kernel_for_windows folder for the modified files 2025-09-10 09:04:36 +08:00
Xianghui Yang
2eb92bcfd1 Merge pull request #104 from WncFht/feature/add-enable-flashvdm
[Rhino-Bird hands-on issue] inference speed
2025-09-08 23:16:17 +08:00
HuiwenShi
06ea674535 Merge pull request #137 from ItsThatRandomDev/fix/docker-conda-tos
Fix: accept Anaconda ToS in Dockerfile to prevent build failure
2025-09-08 20:29:54 +08:00
Xianghui Yang
840d66abe8 Merge pull request #102 from s572915912/s572915912-patch-1
[Rhino-Bird hands-on issue] training: split_sizes error
2025-09-08 19:57:36 +08:00
Xianghui Yang
7cc51b67ef Update README.md
add acknowledgment
2025-08-27 14:52:15 +08:00
Xianghui Yang
3efb87e736 Update README.md
add acknowledgment
2025-08-27 14:51:20 +08:00
ItsThatRandomDev
0e9f8d78d4 fix conda ToS acceptance in Dockerfile 2025-08-18 19:00:20 +02:00
s572915912
b3dd50ba37 Update misc.py
repair
2025-08-06 01:14:49 +08:00
s572915912
d9fc4d31bf Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml
repair
2025-08-06 01:12:13 +08:00
Xianghui Yang
84e97834c0 Update README.md 2025-07-31 17:21:07 +08:00
Xianghui Yang
cd2feb5f38 Add files via upload 2025-07-30 23:21:51 +08:00
Xianghui Yang
b791e9c22f Update LICENSE 2025-07-30 23:18:42 +08:00
Xianghui Yang
b521b9f71f update x.png 2025-07-28 17:44:21 +08:00
Xianghui Yang
b96ea3558d update x qrcode 2025-07-28 17:40:27 +08:00
Xianghui Yang
b32d264d43 add hunyuan world 1.0 2025-07-28 17:36:11 +08:00
Xianghui Yang
665c38a19a add hunyuan world 1.0 2025-07-28 17:35:39 +08:00
oakshy
2ca5fd3155 modify model zoo 2025-07-15 11:28:24 +08:00
WncFht
a6509b95fb feat: add enable_flashvdm to api_server 2025-07-13 11:48:13 +08:00
WncFht
00fa3ac012 feat: add enable_flashvdm to gradio_app.py 2025-07-13 11:44:49 +08:00
s572915912
f4e0307665 Update train_deepspeed.sh 2025-07-11 18:32:16 +08:00
s572915912
8eff6d8233 Delete run_inference_with_fix.py 2025-07-11 16:54:43 +08:00
s572915912
7a9d765627 Update run_inference_with_fix.py 2025-07-11 16:53:19 +08:00
s572915912
f0a008279e Update pipelines.py 2025-07-11 16:51:33 +08:00
s572915912
dc2ea32d76 Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml 2025-07-11 16:47:40 +08:00
s572915912
96349ad5d0 Update train_deepspeed.sh 2025-07-11 16:43:40 +08:00
s572915912
6726877bbb Update train_deepspeed.sh 2025-07-11 16:40:01 +08:00
s572915912
c6d4cb89e2 Update train_deepspeed.sh 2025-07-11 16:39:10 +08:00
s572915912
de7996251d Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml 2025-07-11 16:37:32 +08:00
s572915912
af935af688 Update train_deepspeed.sh 2025-07-11 16:36:46 +08:00
s572915912
f2f19d74a8 Update hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml
add explanation
2025-07-11 15:53:01 +08:00
s572915912
8cd92830fb Update train_deepspeed.sh
auto detect
2025-07-11 15:51:55 +08:00
s572915912
e34a3ba752 Create run_inference_with_fix.py 2025-07-11 02:33:30 +08:00