Akasei f192c86c60 fix(oom): use mmap=True for checkpoint loading + malloc_trim + expandable_segments
Root cause: torch.load() reads the 6.9GB .ckpt into the Python heap while the
model params also sit in CPU RAM → ~14GB peak, exceeding 16GB system RAM → OOM Killer.

Fix 1 - mmap=True on all torch.load() calls (torch 2.7 supports this):
  With mmap, checkpoint storage is file-backed (not heap). Only the model
  parameters (also ~7GB) exist in physical RAM during loading. Peak RAM
  drops from ~14GB to ~7GB — within safe limits on 16GB machines.
  Files changed: pipelines.py, hunyuan3ddit.py, model.py (×2), flow_matching_sit.py

Fix 2 - malloc_trim(0) after every gc.collect():
  Forces glibc to return freed heap pages to OS immediately, so Python's
  memory pool doesn't hoard freed model memory before the next load.
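
A minimal sketch of the collect-then-trim pattern; `collect_and_trim` is a hypothetical helper name, and `malloc_trim` is glibc-specific, so the call is guarded for other platforms:

```python
import ctypes
import gc

def collect_and_trim() -> None:
    """Run a GC pass, then ask glibc to return freed heap pages to the OS.

    gc.collect() alone frees objects back to the allocator's arenas;
    malloc_trim(0) is what actually hands those pages back to the kernel.
    """
    gc.collect()
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)
    except OSError:
        pass  # musl / macOS / Windows: no malloc_trim available
```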

Fix 3 - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True:
  Prevents CUDA allocator fragmentation between model switches.
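
The variable can be exported in the shell, or set from Python before torch initializes CUDA, e.g. at the top of the entry module (a sketch; `setdefault` lets an explicit shell export win):

```python
import os

# Must run before the first CUDA allocation; expandable segments let the
# allocator grow existing memory segments instead of fragmenting new ones.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```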

Fix 4 - Adaptive threshold recalculated:
  With mmap loading, a model load needs ~7.5GB (model params), not 14GB.
  The CPU offload threshold is lowered from 16GB → 10.5GB, enabling the
  fast path on machines with more headroom.
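
The threshold check could look like the following sketch. The function name `should_offload` and the `/proc/meminfo` parsing are illustrative (Linux-only), not the project's actual code; only the 10.5GB figure comes from the commit message:

```python
def should_offload(threshold_gb: float = 10.5) -> bool:
    """Enable CPU offload when available system RAM is below the threshold."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                avail_gb = int(line.split()[1]) / 1024 / 1024  # kB -> GB
                return avail_gb < threshold_gb
    return True  # be conservative if MemAvailable is missing
```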

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-16 23:18:16 +08:00

Hunyuan3D-2.1-Shape

Quick Inference

Given a reference image image.png, you can run inference using the following code. The result will be saved as demo.glb.

python3 minimal_demo.py

Memory Recommendation: We recommend using a GPU with at least 10GB VRAM.

Training

Here we demonstrate the complete training workflow of DiT on a small dataset.

Data Preprocessing

The rendering and watertight mesh generation process is described in detail in this document. After preprocessing, the dataset directory structure should look like the following:

dataset/preprocessed/{uid}
├── geo_data
│   ├── {uid}_sdf.npz
│   ├── {uid}_surface.npz
│   └── {uid}_watertight.obj
└── render_cond
    ├── 000.png
    ├── ...
    ├── 023.png
    ├── mesh.ply
    └── transforms.json
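
The layout above can be sanity-checked per case with a small script. This is a hypothetical helper (`validate_case` is not part of the official tooling); the file names mirror the tree exactly, and the 24 renders follow the 000.png–023.png numbering shown:

```python
from pathlib import Path

def validate_case(case_dir: Path) -> list[str]:
    """Return the list of expected files missing from one preprocessed case."""
    uid = case_dir.name
    expected = [
        case_dir / "geo_data" / f"{uid}_sdf.npz",
        case_dir / "geo_data" / f"{uid}_surface.npz",
        case_dir / "geo_data" / f"{uid}_watertight.obj",
        case_dir / "render_cond" / "mesh.ply",
        case_dir / "render_cond" / "transforms.json",
    ] + [case_dir / "render_cond" / f"{i:03d}.png" for i in range(24)]
    return [str(p) for p in expected if not p.exists()]
```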

We provide a preprocessed mini_dataset containing 8 cases (all sourced from Objaverse-XL) as tools/mini_trainset, which can be used directly for DiT overfitting training experiments.

Launching Training

We provide example configuration files and launch scripts for reference. By default, the training runs on a single node with 8 GPUs using DeepSpeed. Users can modify the configurations and scripts as needed to suit their environment.

Configuration File

configs/hunyuandit-mini-overfitting-flowmatching-dinog518-bf16-lr1e4-512.yaml

Launch Script

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export num_gpu_per_node=8

export node_num=1
export node_rank=0
export master_ip=0.0.0.0 # set your master_ip

# export config=configs/hunyuandit-finetuning-flowmatching-dinol518-bf16-lr1e5-4096.yaml
# export output_dir=output_folder/dit/finetuning_lr1e5
export config=configs/hunyuandit-mini-overfitting-flowmatching-dinol518-bf16-lr1e4-4096.yaml
export output_dir=output_folder/dit/overfitting_depth_16_token_4096_lr1e4

bash scripts/train_deepspeed.sh $node_num $node_rank $num_gpu_per_node $master_ip $config $output_dir