I am currently a Research Scientist at the General Vision Lab of the Beijing Institute for General
Artificial Intelligence (BIGAI). I received my Ph.D. degree in
Computer Science and Technology from Shanghai Jiao Tong
University in June 2024, where I was an active
member of the SJTU MVIG lab under the guidance of Prof.
Cewu
Lu. I have also served as a research intern at Shanghai AI Lab. My research pursuits
predominantly revolve around Computer
Vision, 3D Vision, and Embodied
AI.
Our objective is to develop a policy that enables dexterous robotic hands to accurately replicate
human hand–object interaction trajectories in simulation while satisfying the tasks' semantic
manipulation constraints. The key innovation of ManipTrans is to frame this transfer as a
two-stage
process: first, a pre-training trajectory imitation stage focusing solely on hand motion, and
second, a specific action fine-tuning stage that addresses interaction constraints. By leveraging
ManipTrans, we transfer multiple hand–object datasets to robotic hands, creating DexManipNet, a
large-scale dataset featuring previously unexplored tasks such as pen capping and bottle unscrewing,
which facilitates further policy training for dexterous hands and enables real-world deployments.
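Below is a minimal sketch of this two-stage recipe, assuming a placeholder MLP policy, synthetic observations, and a stand-in interaction term; it only illustrates the structure and is not the ManipTrans implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions: 64-D observation, 24-D joint-target action.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 24))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

obs = torch.randn(512, 64)   # placeholder observations (hand + object state)
ref = torch.randn(512, 24)   # placeholder reference joint targets from human motion

# Stage 1: trajectory-imitation pre-training, supervised by hand motion only.
for _ in range(200):
    loss = F.mse_loss(policy(obs), ref)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tuning with an extra term standing in for the task-specific
# interaction constraints (contact, object pose); the residual here is synthetic.
contact_residual = torch.randn(512, 24)
for _ in range(200):
    act = policy(obs)
    loss = F.mse_loss(act, ref) + 0.1 * F.mse_loss(act, ref + contact_residual)
    opt.zero_grad(); loss.backward(); opt.step()
```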
Generating human grasps involves both object geometry and semantic cues. This paper introduces
SemGrasp, a method that infuses semantic information
into grasp generation, aligning with language
instructions. SemGrasp leverages a unified semantic framework and a Multimodal Large Language Model
(MLLM), and is supported by CapGrasp, a dataset featuring detailed captions and diverse grasps.
Experiments demonstrate SemGrasp's ability to produce
grasps consistent with linguistic intentions,
surpassing shape-only approaches.
Rearranging objects is key in human-environment interaction, and creating natural sequences of such
motions is crucial in AR/VR and CG. Our work presents Favor, a unique dataset that captures
full-body virtual object rearrangement motions through motion capture and AR glasses. We also
introduce a new pipeline, Favorite, for generating
lifelike digital human rearrangement motions
driven by commands. Our experiments show that Favor
and Favorite produce high-fidelity motion
sequences.
We propose a new method, CHORD, which exploits the categorical shape prior for reconstructing the
shapes of intra-class objects. In addition, we construct a new dataset, COMIC, of category-level
hand-object interaction. COMIC encompasses a diverse collection of object instances, materials,
hand interactions, and viewing directions.
Learning how humans manipulate objects requires machines to acquire knowledge from two perspectives:
one for understanding object affordances and the other for learning human interactions based on
affordances. In this work, we propose a multi-modal, richly annotated knowledge repository,
OakInk,
for the visual and cognitive understanding of hand-object interactions. Check our website for more
details!
We propose a lightweight online data enrichment method that boosts articulated hand-object pose
estimation
from the data perspective.
During training, ArtiBoost alternately performs data exploration and synthesis.
Even when paired with a simple baseline, ArtiBoost boosts it to outperform the previous SOTA on
several hand-object benchmarks.
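A toy sketch of that alternation, with stubbed-out synthesis and training steps (the configuration space, weighting rule, and error feedback below are assumptions for illustration, not the released ArtiBoost code):

```python
import random

# Hypothetical configuration space over (hand pose, object, viewpoint) bins.
config_space = [(h, o, v) for h in range(10) for o in range(5) for v in range(4)]
weights = {c: 1.0 for c in config_space}   # exploration weights, updated from errors

def synthesize(config):
    # Placeholder for rendering a synthetic hand-object sample at this configuration.
    return {"config": config}

def train_step(real_batch, synth_batch):
    # Placeholder model update; returns a per-sample error used to re-weight sampling.
    return {s["config"]: random.random() for s in synth_batch}

real_data = [{"config": None} for _ in range(32)]
for epoch in range(5):
    # Exploration: sample configurations proportional to their current weights.
    picked = random.choices(config_space, weights=[weights[c] for c in config_space], k=16)
    synth = [synthesize(c) for c in picked]          # synthesis: render the picked samples
    errors = train_step(real_data, synth)            # train on mixed real + synthetic data
    for c, e in errors.items():
        weights[c] = 0.5 * weights[c] + 0.5 * e      # harder configurations get sampled more often
```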
In this paper, we extend MANO with more Diverse Accessories and Rich Textures, namely DART.
DART comprises 325 exquisite hand-crafted texture maps that vary in appearance and cover
different kinds of blemishes, make-ups, and accessories.
We also generate a large-scale (800K), diverse, and high-fidelity set of hand images, paired with
perfectly aligned 3D labels, called DARTset.
OAKINK2 is a rich dataset focusing on bimanual object manipulation tasks involved in complex daily
activities. It introduces a unique three-tiered abstraction structure—Affordance, Primitive Task,
and Complex Task—to systematically organize task representations. By emphasizing an object-centric
approach, the dataset captures multi-view imagery and precise annotations of human and object poses,
aiding in applications like interaction reconstruction and motion synthesis. Furthermore, we propose
a Complex Task Completion framework that utilizes Large Language Models to break down complex
activities into Primitive Tasks and a Motion Fulfillment Model to generate corresponding bimanual
motions.
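As a rough illustration of how such a framework could be wired together (all function bodies below are hypothetical stubs; the primitive vocabulary and motion output format are assumptions, not OAKINK2's released interface):

```python
# Hypothetical sketch of the Complex Task Completion flow: an LLM-style planner
# decomposes a complex instruction into Primitive Tasks, and a motion model fills
# in bimanual motion for each primitive.
def decompose_with_llm(instruction: str) -> list:
    # Stub: a real system would prompt an LLM with the instruction and scene context.
    return ["grasp kettle with right hand", "open lid with left hand", "pour water"]

def fulfill_motion(primitive_task: str) -> dict:
    # Stub: a real Motion Fulfillment Model would output bimanual hand/object pose sequences.
    return {"task": primitive_task, "frames": 120}

def complete_task(instruction: str) -> list:
    return [fulfill_motion(p) for p in decompose_with_llm(instruction)]

print(complete_task("make a cup of tea"))
```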
Color-NeuS focuses on mesh reconstruction with color. We remove view-dependent color while using a
relighting network to maintain volume rendering performance. The mesh is extracted from the SDF
network, and vertex color is derived from the global color network. We conceived an in-hand object
scanning task and gathered several videos to evaluate Color-NeuS.
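A compact sketch of this decomposition, assuming placeholder MLPs and tensor shapes (not the actual Color-NeuS architecture): geometry lives in an SDF network, the global color network supplies view-independent vertex color, and a relighting branch contributes a view-dependent residual used only during volume rendering.

```python
import torch
import torch.nn as nn

# Placeholder networks; layer sizes and activations are assumptions.
sdf_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))          # geometry (SDF)
global_color_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3)) # view-independent color
relight_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 3))      # point + view direction

def render_color(points, view_dirs):
    base = torch.sigmoid(global_color_net(points))                  # view-independent color
    residual = relight_net(torch.cat([points, view_dirs], dim=-1))  # view-dependent relighting term
    return base + residual   # only this sum supervises volume rendering

def mesh_vertex_color(vertices):
    # The mesh itself would come from sdf_net (e.g. via marching cubes);
    # its vertex colors come from the global color network alone.
    return torch.sigmoid(global_color_net(vertices))

pts, dirs = torch.randn(8, 3), torch.randn(8, 3)
print(sdf_net(pts).shape, render_color(pts, dirs).shape, mesh_vertex_color(pts).shape)
```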
We highlight contact in the hand-object interaction modeling task by proposing an
explicit representation named Contact Potential Field (CPF). In CPF, we treat each contacting
hand-object
vertex pair as a spring-mass system; hence the whole system forms a potential field with minimal
elastic energy at the grasp position.
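As a toy illustration of that spring-mass reading (unit stiffness and random vertex pairs are assumptions; this is not the paper's exact energy), each contacting pair contributes a quadratic spring term, and the total energy can be differentiated to pull the hand toward a lower-energy grasp:

```python
import torch

def elastic_energy(hand_pts, obj_pts, stiffness=1.0):
    # Sum of 0.5 * k * ||p_hand - p_obj||^2 over all contacting vertex pairs.
    return 0.5 * stiffness * ((hand_pts - obj_pts) ** 2).sum(dim=-1).sum()

hand_pts = torch.randn(16, 3, requires_grad=True)   # contacting hand vertices (placeholder)
obj_pts = torch.randn(16, 3)                        # their paired object vertices (placeholder)

energy = elastic_energy(hand_pts, obj_pts)
energy.backward()                                   # gradients can drive the hand toward
print(float(energy), hand_pts.grad.shape)           # a lower-energy (better-contact) pose
```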