Root cause: torch.load() with mmap=True returns fp16 tensors, but
load_state_dict() without assign=True copies them into the model's
pre-allocated fp32 parameters, an fp16→fp32 widening that doubles CPU
anon-rss (7 GB fp16 ckpt → 14 GB fp32 params). Combined
with the 2 GB Gradio server baseline, this exceeded the 15 GB physical
RAM limit on the second generation request.
Fix: add assign=True to all load_state_dict calls in pipelines.py and
autoencoders/model.py. With assign=True the mmap fp16 tensors are
assigned directly as model parameters without any fp16→fp32 copy.
When model.to('cuda') is then called, the mmap pages (file-backed,
evictable) are streamed directly to VRAM — CPU anon-rss stays near 0.
Peak RSS is now ~3.9 GB instead of 14.7 GB (killed) across all rounds.
gradio_app.py changes:
- low_vram_mode always takes the full-delete path (never CPU offload)
- glibc malloc tuning at startup (MALLOC_ARENA_MAX=1, malloc_trim)
- preemptive gc.collect(2) + malloc_trim + empty_cache at generation start
- _rlog() memory logging at each major step for monitoring
pipelines.py:
- load_state_dict(..., assign=True) for model, vae, conditioner
- del ckpt after state dict assignment to release mmap fd early
autoencoders/model.py:
- load_state_dict(..., assign=True) in from_single_file
- load_state_dict(..., assign=True) in init_from_ckpt
Verified: 4 consecutive Playwright WebUI rounds (shape+texture) pass
with no OOM. API two-round test also passes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hunyuan3D-2.1-Shape
Quick Inference
Given a reference image image.png, you can run inference using the following command. The result will be saved as demo.glb.
python3 minimal_demo.py
Memory Recommendation: We recommend using a GPU with at least 10 GB of VRAM.
Training
Here we demonstrate the complete training workflow of DiT on a small dataset.
Data Preprocessing
The rendering and watertight mesh generation process is described in detail in this document. After preprocessing, the dataset directory structure should look like the following:
dataset/preprocessed/{uid}
├── geo_data
│ ├── {uid}_sdf.npz
│ ├── {uid}_surface.npz
│ └── {uid}_watertight.obj
└── render_cond
├── 000.png
├── ...
├── 023.png
├── mesh.ply
└── transforms.json
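A small helper can verify that a preprocessed case matches this layout before launching training. This is our sketch, not part of the repo; it assumes the 24 render views 000.png–023.png shown above:

```python
from pathlib import Path


def missing_files(root, uid):
    """Return the expected files that are absent for one preprocessed case."""
    base = Path(root) / uid
    expected = [
        base / "geo_data" / f"{uid}_sdf.npz",
        base / "geo_data" / f"{uid}_surface.npz",
        base / "geo_data" / f"{uid}_watertight.obj",
        base / "render_cond" / "mesh.ply",
        base / "render_cond" / "transforms.json",
    ]
    # 24 conditioning renders: 000.png ... 023.png
    expected += [base / "render_cond" / f"{i:03d}.png" for i in range(24)]
    return [str(p) for p in expected if not p.is_file()]
```

An empty return value means the case is complete; otherwise the list names exactly what is missing.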
We provide a preprocessed mini_dataset containing 8 cases (all sourced from Objaverse-XL) as tools/mini_trainset, which can be used directly for DiT overfitting training experiments.
Launching Training
We provide example configuration files and launch scripts for reference. By default, the training runs on a single node with 8 GPUs using DeepSpeed. Users can modify the configurations and scripts as needed to suit their environment.
Configuration File
configs/hunyuandit-mini-overfitting-flowmatching-dinog518-bf16-lr1e4-512.yaml
Launch Script
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export num_gpu_per_node=8
export node_num=1
export node_rank=0
export master_ip=0.0.0.0 # set your master_ip
# export config=configs/hunyuandit-finetuning-flowmatching-dinol518-bf16-lr1e5-4096.yaml
# export output_dir=output_folder/dit/finetuning_lr1e5
export config=configs/hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml
export output_dir=output_folder/dit/overfitting_depth_16_token_4096_lr1e4
bash scripts/train_deepspeed.sh $node_num $node_rank $num_gpu_per_node $master_ip $config $output_dir