Perception Pipeline
The perception layer of kcare_robot takes a
text query like “apple” and returns a 6-DOF grasp pose in the robot’s
base frame, ready for the arm.
The pipeline
Why open-vocabulary?
A classic robot perception stack ships a fixed set of trained classes. The moment the user says “the red mug”, you’re stuck.
This stack runs GroundingDINO for the detection step — a vision-language model that accepts free-text and returns bounding boxes. The skill author writes the query at call time:
ret = find(inputs='apple') # worksret = find(inputs='red mug on the table') # also worksret = find(inputs='the can') # also worksThe same goes for GroundedSAM (segmentation) and mask2grasps (grasp proposal) — none of them are trained on a fixed class set.
Why TCP, not in-process?
The vision models are heavy (GroundingDINO + SAM ≈ several GB of weights and a GPU). The robot’s onboard compute is not.
robot_agent’s DeviceManager supports TCP as a first-class device
transport. The robot opens a long-lived TCP connection to a GPU host
(192.168.1.11:8805) and pipes detection/segmentation queries over it. Same
behaviour over the same API:
ret = node.agents['vlms'].send({ 'detector': 'grounding_dino', 'image': rgb_jpeg_bytes, 'caption': 'apple',})The vlms service publishes GroundingDINO, GroundedSAM, and
mask2grasps. Adding a new detector is one entry on the GPU side and one
device registration on the robot side.
3D feature extraction — what attach_3d_features does
Source: kcare_robot/skills/recognition.py, 415 lines.
For each detected mask:
- Cluster the masked depth pixels (handles partial-occlusion / noise).
- Compute per-cluster normals via weighted depth gradient.
- Aggregate depth statistics (
min,median,max) per cluster. - Lift to 3D:
Ixy2xyz(u, v, depth, fx, fy, ppx, ppy)— straightforward inverse projection. - Classify pose — lying vs standing from normal-vector dispersion; estimate handle position via mass-center percentages.
The output is a dict per detection:
{ 'caption': 'apple', 'score': 0.78, 'bbox': [u0, v0, u1, v1], 'mask': <bool array>, 'xyz': [0.42, 0.13, 0.88], # base frame 'normal': [0.01, -0.03, 0.99], # in camera frame 'depth': { 'min': 0.41, 'med': 0.44, 'max': 0.48 }, 'lying': False,}Grasp pose — from pixels to 6-DOF
mask2grasps returns the grasp as two 2D pixel endpoints (the gripper’s
two-finger axis). The skill lifts them with three pieces of geometry:
- Depth at the midpoint → grasp Z
- Endpoints + depth → grasp roll (rotation around tool axis)
- Wrist-offset transform → final tool-frame pose
# pseudocode of mask2grasps liftmid = (p1 + p2) / 2z = depth[mid]xyz = Ixy2xyz(mid.u, mid.v, z, fx, fy, ppx, ppy)rpy = pixel_axis_to_rpy(p1, p2, depth_gradient)pose_camera = (xyz, rpy)pose_tool = wrist_offset @ pose_cameraThen the head-to-base calibration transforms it into the base frame the arm can plan to.
Head-to-base calibration
skills/calibrattion.py ships a Head2BaseCalibration class with:
- intrinsic camera parameters (fx, fy, ppx, ppy) per stream
- a 4×4 link-to-base transform
- a 4×4 base-to-lift transform (the lift moves the camera vertically)
- per-mode (front / left / right) error-linear corrections — tiny rotations that compensate for the head’s mechanical play
This is what makes “the apple your wrist camera sees” turn into “an XYZ in the base frame the arm can actually plan a trajectory to.” Skipping it costs you ~3 cm of error at table distance — enough to miss every grasp.
Closed-loop pick with self-correction
skills/pick.py (422 lines) doesn’t just snapshot-and-go. The full pick is:
fine_move() is the closed-loop step — the wrist camera images the object
again at close range and the arm corrects laterally. Two trials are
allowed before the skill gives up; in practice the first trial succeeds
~90 % of the time on graspable everyday objects.
Parallel actuator coordination
Picks aren’t sequential — the arm, lift, and head move simultaneously:
run_parallel_check([ ('lift', {'height': 0.4}), ('movej', {'joints': ARM_PRE_PICK}), ('moveh', {'rz': -30, 'ry': 20}),])run_parallel_check() (from pyconnect) fires three ROS actions in
parallel and waits for all to converge. Sequential ≈ 7 s; parallel ≈ 3 s.