Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions

Abstract

Functional grasping is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. Prior work largely focuses on power grasps, which only involve holding an object, or relies on in-domain demonstrations for specific objects. We propose leveraging human grasp information extracted from web images, which capture natural and functional hand–object interactions (HOI). Using a pretrained 3D reconstruction model, we recover 3D human HOI meshes from RGB images. To train on these noisy HOI data, we use: 1) an interaction-centric model that learns the functional interaction pattern between hand and object, and 2) geometry-based filtering to remove infeasible grasps, followed by physical simulation to retain grasps that can resist disturbances. In Isaac Gym simulation, our model trained on reconstructed HOI grasps achieves a 75.8% success rate on objects from the web dataset and generalizes to unseen objects, yielding a 6.7% improvement in success rate and a 1.8$\times$ increase in functionality ratings compared to baselines. In real-world experiments with the LEAP Hand and Inspire Hand, it attains a 77.5% success rate across 12 objects, including challenging ones such as a syringe, spray bottle, knife, and tongs.
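
To make the geometry-based filtering step concrete, below is a minimal sketch of how a reconstructed grasp could be screened for hand–object interpenetration using signed distances. The thresholds and the helper name are illustrative assumptions, not the exact criteria used in the paper.

```python
# A minimal sketch of the geometry-based filter, assuming the reconstructed hand
# and (watertight) object are available as trimesh meshes. Thresholds and the
# helper name `passes_geometry_filter` are illustrative assumptions.
import numpy as np
import trimesh


def passes_geometry_filter(hand_mesh: trimesh.Trimesh,
                           object_mesh: trimesh.Trimesh,
                           max_penetration: float = 0.005,   # 5 mm (assumed)
                           max_contact_gap: float = 0.01):   # 10 mm (assumed)
    """Reject grasps that penetrate the object too deeply or make no contact."""
    # Signed distance of hand vertices w.r.t. the object surface.
    # trimesh convention: positive inside the object, negative outside.
    sd = trimesh.proximity.signed_distance(object_mesh, hand_mesh.vertices)
    deepest_penetration = max(sd.max(), 0.0)   # worst hand vertex inside the object
    nearest_gap = np.abs(sd).min()             # closest hand vertex to the surface
    return deepest_penetration <= max_penetration and nearest_gap <= max_contact_gap
```

Grasps that pass a check of this kind would then be loaded into the physics simulator and kept only if the object stays in the hand under applied disturbances.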

Pipeline Overview

Overview of the Web2Grasp framework
We propose Web2Grasp, a framework for autonomously obtaining robot grasp data by reconstructing hand–object meshes from web images, together with a method for training a dexterous grasping model on the noisy reconstructed HOI data. We demonstrate that the resulting dataset, obtained from web images without any robot-specific teleoperation, enables training a supervised learning model with strong grasping performance both in simulation and in the real world.

Reconstructed HOI from Web Images

We reconstruct human HOI from web images of humans holding objects, retarget the human hand mesh to the multi-fingered robot hand, and align the noisy object meshes with 3D shapes generated by the text-to-3D tool Meshy AI.

Example reconstructed grasps (Grasp 0, Grasp 1, Grasp 2)
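
As a rough illustration of the retargeting step, the sketch below fits robot joint angles so that the robot fingertips match the fingertip positions of the reconstructed human hand. The forward-kinematics callable, joint limits, and least-squares objective are assumptions standing in for the actual retargeting procedure.

```python
# A minimal retargeting sketch, assuming fingertip positions extracted from the
# reconstructed human hand mesh and a forward-kinematics function for the robot
# hand (e.g. built from its URDF). The toy FK in the usage example is a stand-in.
import numpy as np
from scipy.optimize import minimize


def retarget(human_tips, fk_fingertips, q_init, joint_limits):
    """Find robot joint angles whose fingertip positions best match human_tips."""
    def cost(q):
        return float(np.sum((fk_fingertips(q) - human_tips) ** 2))
    res = minimize(cost, q_init, bounds=joint_limits, method="L-BFGS-B")
    return res.x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 3, 16))                  # toy linear "kinematics"
    fk = lambda q: W @ q                             # joint angles -> (5, 3) fingertips
    target = fk(rng.uniform(-0.5, 0.5, size=16))     # pretend human fingertip targets
    q_star = retarget(target, fk, np.zeros(16), [(-1.5, 1.5)] * 16)
```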

Functional Grasp Predictions

Using these reconstructed HOI meshes as training data, we train the interaction-centric grasping model DRO. The results show that our model can predict functional grasps from the reconstructed web-image HOI and generalize to held-out unseen objects.
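
For reference, the core of a D(R,O)-style interaction-centric representation is a pairwise distance matrix between points on the robot hand in its grasp pose and points on the object. The sketch below computes such a matrix from surface samples; the point counts are assumptions, and the actual network architecture and training objective follow the DRO model rather than this snippet.

```python
# A minimal sketch of an interaction-centric D(R,O)-style target: a matrix of
# pairwise distances between robot-hand and object surface points. Point counts
# are assumptions; a grasping network can be trained to predict such a matrix.
import numpy as np
import trimesh


def dro_matrix(hand_mesh: trimesh.Trimesh, object_mesh: trimesh.Trimesh,
               n_hand: int = 512, n_obj: int = 512) -> np.ndarray:
    """Pairwise Euclidean distances between sampled hand and object points."""
    hand_pts = hand_mesh.sample(n_hand)              # (n_hand, 3) surface samples
    obj_pts = object_mesh.sample(n_obj)              # (n_obj, 3) surface samples
    diff = hand_pts[:, None, :] - obj_pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)             # (n_hand, n_obj)
```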

Real-world Demos

All videos are sped up by 2.25 times

Wine Glass

Tongs

Phone

Power Drill

Mug

Microphone

Spray Bottle

Syringe

Bowl

Plate

Fork

Knife