Object Detection and Localization Examples
This section provides practical examples of object detection and localization using NVIDIA Isaac technologies, demonstrating how AI-powered perception systems work in robotics applications.
Introduction to Object Detection in Robotics
Object detection in robotics involves identifying and localizing objects in the robot's environment. This capability is crucial for:
- Navigation and path planning
- Manipulation and grasping
- Scene understanding
- Human-robot interaction
- Autonomous decision making
Isaac ROS Perception Pipeline
Overview of Isaac ROS Perception Stack
Isaac ROS Perception
├── Isaac ROS Image Pipeline
│ ├── Image Proc
│ ├── Rectification
│ └── Format Conversion
├── Isaac ROS Visual SLAM
│ ├── Feature Detection
│ ├── Pose Estimation
│ └── Map Building
├── Isaac ROS Object Detection
│ ├── Deep Learning Models
│ ├── TensorRT Optimization
│ └── Post-processing
├── Isaac ROS Pose Estimation
│ ├── 2D-3D Correspondence
│ ├── PnP Solvers
│ └── Refinement
└── Isaac ROS Bi3D
├── 3D Segmentation
├── Depth Estimation
└── Instance Segmentation
```
## Isaac ROS Object Detection Examples
### 1. Isaac ROS DetectNet
DetectNet is NVIDIA's specialized network for object detection optimized for robotics applications.
#### Basic DetectNet Node Implementation
```cpp
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/image.hpp>
#include <sensor_msgs/msg/camera_info.hpp>
#include <isaac_ros_detectnet_interfaces/msg/detection_array.hpp>
#include <cv_bridge/cv_bridge.h>
#include <opencv2/opencv.hpp>
class IsaacDetectNetNode : public rclcpp::Node
{
public:
IsaacDetectNetNode() : Node("isaac_detectnet_node")
{
// Create subscribers
image_sub_ = this->create_subscription<sensor_msgs::msg::Image>(
"image_input", 10,
std::bind(&IsaacDetectNetNode::imageCallback, this, std::placeholders::_1)
);
camera_info_sub_ = this->create_subscription<sensor_msgs::msg::CameraInfo>(
"camera_info", 10,
std::bind(&IsaacDetectNetNode::cameraInfoCallback, this, std::placeholders::_1)
);
// Create publisher for detections
detection_pub_ = this->create_publisher<isaac_ros_detectnet_interfaces::msg::DetectionArray>(
"detections", 10
);
}
private:
void imageCallback(const sensor_msgs::msg::Image::SharedPtr image_msg)
{
// Convert ROS image to OpenCV
cv_bridge::CvImagePtr cv_ptr;
try {
cv_ptr = cv_bridge::toCvCopy(image_msg, sensor_msgs::image_encodings::BGR8);
} catch (cv_bridge::Exception& e) {
RCLCPP_ERROR(this->get_logger(), "cv_bridge exception: %s", e.what());
return;
}
// Process image through DetectNet model
auto detections = runDetectNetInference(cv_ptr->image);
// Create detection message
auto detection_msg = createDetectionMessage(detections, image_msg->header);
// Publish detections
detection_pub_->publish(detection_msg);
}
void cameraInfoCallback(const sensor_msgs::msg::CameraInfo::SharedPtr info_msg)
{
camera_info_ = *info_msg;
}
std::vector<Detection> runDetectNetInference(const cv::Mat& image)
{
// This would interface with the actual DetectNet model
// In practice, this uses TensorRT for optimized inference
std::vector<Detection> detections;
// Placeholder for actual inference
// In real implementation, this would:
// 1. Preprocess image for the model
// 2. Run inference using TensorRT
// 3. Post-process results
// 4. Apply non-maximum suppression
// 5. Filter by confidence threshold
return detections;
}
isaac_ros_detectnet_interfaces::msg::DetectionArray createDetectionMessage(
const std::vector<Detection>& detections,
const std_msgs::msg::Header& header)
{
isaac_ros_detectnet_interfaces::msg::DetectionArray detection_array;
detection_array.header = header;
for (const auto& detection : detections) {
isaac_ros_detectnet_interfaces::msg::Detection det_msg;
det_msg.label = detection.label;
det_msg.confidence = detection.confidence;
// Bounding box coordinates
det_msg.bbox.center.x = detection.center_x;
det_msg.bbox.center.y = detection.center_y;
det_msg.bbox.size_x = detection.width;
det_msg.bbox.size_y = detection.height;
detection_array.detections.push_back(det_msg);
}
return detection_array;
}
struct Detection {
std::string label;
float confidence;
float center_x, center_y;
float width, height;
};
rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr image_sub_;
rclcpp::Subscription<sensor_msgs::msg::CameraInfo>::SharedPtr camera_info_sub_;
rclcpp::Publisher<isaac_ros_detectnet_interfaces::msg::DetectionArray>::SharedPtr detection_pub_;
sensor_msgs::msg::CameraInfo camera_info_;
};
```
#### DetectNet Launch Configuration
```xml
<!-- detectnet.launch.xml -->
<launch>
<!-- Image rectification -->
<node pkg="isaac_ros_image_proc" exec="isaac_ros_image_proc" name="image_proc">
<param name="input_encoding" value="bgr8"/>
<param name="output_encoding" value="bgr8"/>
</node>
<!-- DetectNet node -->
<node pkg="isaac_ros_detectnet" exec="isaac_ros_detectnet" name="detectnet">
<param name="model_name" value="ssd_mobilenet_v2_coco"/>
<param name="input_topic" value="/image_rect_color"/>
<param name="output_topic" value="/detections"/>
<param name="confidence_threshold" value="0.5"/>
<param name="max_objects" value="10"/>
</node>
<!-- Visualization node -->
<node pkg="isaac_ros_visualization" exec="detection_visualizer" name="detection_visualizer">
<param name="image_topic" value="/image_rect_color"/>
<param name="detection_topic" value="/detections"/>
<param name="output_topic" value="/detection_image"/>
</node>
</launch>
```
### 2. Isaac ROS Bi3D (3D Object Detection)
Bi3D provides 3D object detection and segmentation capabilities.
#### Bi3D Node Implementation
```cpp
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/image.hpp>
#include <stereo_msgs/msg/disparity_image.hpp>
#include <isaac_ros_bi3d_interfaces/msg/bi3_d_inference_array.hpp>
class IsaacBi3DNode : public rclcpp::Node
{
public:
IsaacBi3DNode() : Node("isaac_bi3d_node")
{
// Subscribe to stereo image pair
left_sub_ = this->create_subscription<sensor_msgs::msg::Image>(
"left/image_rect_color", 10,
std::bind(&IsaacBi3DNode::leftImageCallback, this, std::placeholders::_1)
);
right_sub_ = this->create_subscription<sensor_msgs::msg::Image>(
"right/image_rect_color", 10,
std::bind(&IsaacBi3DNode::rightImageCallback, this, std::placeholders::_1)
);
// Subscribe to disparity for depth
disparity_sub_ = this->create_subscription<stereo_msgs::msg::DisparityImage>(
"disparity", 10,
std::bind(&IsaacBi3DNode::disparityCallback, this, std::placeholders::_1)
);
// Publisher for 3D detections
bi3d_pub_ = this->create_publisher<isaac_ros_bi3d_interfaces::msg::Bi3DInferenceArray>(
"bi3d_detections", 10
);
}
private:
void leftImageCallback(const sensor_msgs::msg::Image::SharedPtr msg)
{
if (has_right_image_ && has_disparity_) {
processStereoPair(msg, right_image_, disparity_);
} else {
left_image_ = msg;
}
}
void rightImageCallback(const sensor_msgs::msg::Image::SharedPtr msg)
{
has_right_image_ = true;
right_image_ = msg;
if (has_left_image_ && has_disparity_) {
processStereoPair(left_image_, msg, disparity_);
}
}
void disparityCallback(const stereo_msgs::msg::DisparityImage::SharedPtr msg)
{
has_disparity_ = true;
disparity_ = msg;
if (has_left_image_ && has_right_image_) {
processStereoPair(left_image_, right_image_, msg);
}
}
void processStereoPair(
const sensor_msgs::msg::Image::SharedPtr left,
const sensor_msgs::msg::Image::SharedPtr right,
const stereo_msgs::msg::DisparityImage::SharedPtr disparity)
{
// Run Bi3D inference
auto bi3d_results = runBi3DInference(left, right);
// Create 3D detection message
auto bi3d_msg = createBi3DMessage(bi3d_results, left->header);
// Publish results
bi3d_pub_->publish(bi3d_msg);
// Reset flags
has_left_image_ = false;
has_right_image_ = false;
has_disparity_ = false;
}
std::vector<Bi3DResult> runBi3DInference(
const sensor_msgs::msg::Image::SharedPtr left,
const sensor_msgs::msg::Image::SharedPtr right)
{
// Placeholder for actual Bi3D inference
// This would:
// 1. Process stereo images through Bi3D network
// 2. Generate 3D segmentation masks
// 3. Extract 3D bounding boxes
// 4. Estimate 3D poses
std::vector<Bi3DResult> results;
// Implementation would go here
return results;
}
isaac_ros_bi3d_interfaces::msg::Bi3DInferenceArray createBi3DMessage(
const std::vector<Bi3DResult>& results,
const std_msgs::msg::Header& header)
{
isaac_ros_bi3d_interfaces::msg::Bi3DInferenceArray bi3d_array;
bi3d_array.header = header;
for (const auto& result : results) {
isaac_ros_bi3d_interfaces::msg::Bi3DInference bi3d_msg;
bi3d_msg.class_id = result.class_id;
bi3d_msg.confidence = result.confidence;
// 3D bounding box
bi3d_msg.bounding_box_3d.center.position.x = result.center_x;
bi3d_msg.bounding_box_3d.center.position.y = result.center_y;
bi3d_msg.bounding_box_3d.center.position.z = result.center_z;
// Convert Euler angles to quaternion
tf2::Quaternion q;
q.setRPY(result.roll, result.pitch, result.yaw);
bi3d_msg.bounding_box_3d.center.orientation.x = q.x();
bi3d_msg.bounding_box_3d.center.orientation.y = q.y();
bi3d_msg.bounding_box_3d.center.orientation.z = q.z();
bi3d_msg.bounding_box_3d.center.orientation.w = q.w();
bi3d_msg.bounding_box_3d.size.x = result.size_x;
bi3d_msg.bounding_box_3d.size.y = result.size_y;
bi3d_msg.bounding_box_3d.size.z = result.size_z;
bi3d_array.inferences.push_back(bi3d_msg);
}
return bi3d_array;
}
struct Bi3DResult {
int class_id;
float confidence;
float center_x, center_y, center_z;
float roll, pitch, yaw;
float size_x, size_y, size_z;
};
rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr left_sub_;
rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr right_sub_;
rclcpp::Subscription<stereo_msgs::msg::DisparityImage>::SharedPtr disparity_sub_;
rclcpp::Publisher<isaac_ros_bi3d_interfaces::msg::Bi3DInferenceArray>::SharedPtr bi3d_pub_;
sensor_msgs::msg::Image::SharedPtr left_image_;
sensor_msgs::msg::Image::SharedPtr right_image_;
stereo_msgs::msg::DisparityImage::SharedPtr disparity_;
bool has_left_image_ = false;
bool has_right_image_ = false;
bool has_disparity_ = false;
};
```
## Object Localization Examples
### 1. Camera-Object 3D Localization
```cpp
#include <rclcpp/rclcpp.hpp>
#include <geometry_msgs/msg/point_stamped.hpp>
#include <sensor_msgs/msg/camera_info.hpp>
#include <isaac_ros_detectnet_interfaces/msg/detection_array.hpp>
#include <tf2_ros/transform_listener.h>
#include <tf2_geometry_msgs/tf2_geometry_msgs.h>
class ObjectLocalizationNode : public rclcpp::Node
{
public:
ObjectLocalizationNode() : Node("object_localization_node"), tf_buffer_(this->get_clock())
{
detection_sub_ = this->create_subscription<isaac_ros_detectnet_interfaces::msg::DetectionArray>(
"detections", 10,
std::bind(&ObjectLocalizationNode::detectionCallback, this, std::placeholders::_1)
);
camera_info_sub_ = this->create_subscription<sensor_msgs::msg::CameraInfo>(
"camera_info", 10,
std::bind(&ObjectLocalizationNode::cameraInfoCallback, this, std::placeholders::_1)
);
object_pose_pub_ = this->create_publisher<geometry_msgs::msg::PointStamped>(
"object_3d_position", 10
);
tf_listener_ = std::make_shared<tf2_ros::TransformListener>(tf_buffer_);
}
private:
void detectionCallback(const isaac_ros_detectnet_interfaces::msg::DetectionArray::SharedPtr detections)
{
if (!camera_info_received_ || !has_camera_to_robot_tf_) {
RCLCPP_WARN(this->get_logger(), "Camera info or TF not available yet");
return;
}
for (const auto& detection : detections->detections) {
if (detection.confidence < confidence_threshold_) {
continue; // Skip low-confidence detections
}
// Convert 2D bounding box center to 3D point
geometry_msgs::msg::PointStamped pixel_point;
pixel_point.header = detections->header;
pixel_point.point.x = detection.bbox.center.x;
pixel_point.point.y = detection.bbox.center.y;
pixel_point.point.z = 1.0; // Placeholder depth
// Convert pixel coordinates to 3D camera frame
geometry_msgs::msg::PointStamped camera_point;
camera_point = pixelToCameraFrame(pixel_point);
// Transform to robot base frame
geometry_msgs::msg::PointStamped robot_point;
robot_point = transformToRobotFrame(camera_point);
// Create and publish object position
geometry_msgs::msg::PointStamped object_position;
object_position.header = robot_point.header;
object_position.point = robot_point.point;
// Add object label as metadata (in a real system, you might publish this separately)
RCLCPP_INFO(this->get_logger(),
"Detected %s at position: (%.2f, %.2f, %.2f) with confidence %.2f",
detection.label.c_str(),
object_position.point.x,
object_position.point.y,
object_position.point.z,
detection.confidence
);
object_pose_pub_->publish(object_position);
}
}
geometry_msgs::msg::PointStamped pixelToCameraFrame(const geometry_msgs::msg::PointStamped& pixel_point)
{
geometry_msgs::msg::PointStamped camera_point;
camera_point.header = pixel_point.header; // Keep same timestamp/frame initially
// Convert pixel coordinates to normalized coordinates
double x_norm = (pixel_point.point.x - camera_info_.k[2]) / camera_info_.k[0]; // cx, fx
double y_norm = (pixel_point.point.y - camera_info_.k[5]) / camera_info_.k[4]; // cy, fy
// For this example, assume depth is known from other sources
// In practice, you'd get depth from stereo, LIDAR, or depth sensor
double depth = estimateDepth(pixel_point.point.x, pixel_point.point.y);
camera_point.point.x = x_norm * depth;
camera_point.point.y = y_norm * depth;
camera_point.point.z = depth;
return camera_point;
}
geometry_msgs::msg::PointStamped transformToRobotFrame(const geometry_msgs::msg::PointStamped& camera_point)
{
geometry_msgs::msg::PointStamped robot_point;
try {
// Transform from camera frame to robot base frame
tf_buffer_.transform(camera_point, robot_point, "base_link");
} catch (tf2::TransformException& ex) {
RCLCPP_ERROR(this->get_logger(), "Transform failed: %s", ex.what());
return camera_point; // Return original if transform fails
}
return robot_point;
}
double estimateDepth(double u, double v)
{
// Placeholder depth estimation
// In a real system, this would come from:
// 1. Stereo vision
// 2. Depth sensor (RGB-D camera, LIDAR)
// 3. Monocular depth estimation
// 4. Object size-based estimation (if object size is known)
// For this example, return a fixed depth
// A more realistic approach would use stereo disparity or other depth sources
return 1.0; // 1 meter depth as placeholder
}
void cameraInfoCallback(const sensor_msgs::msg::CameraInfo::SharedPtr msg)
{
camera_info_ = *msg;
camera_info_received_ = true;
}
rclcpp::Subscription<isaac_ros_detectnet_interfaces::msg::DetectionArray>::SharedPtr detection_sub_;
rclcpp::Subscription<sensor_msgs::msg::CameraInfo>::SharedPtr camera_info_sub_;
rclcpp::Publisher<geometry_msgs::msg::PointStamped>::SharedPtr object_pose_pub_;
tf2_ros::Buffer tf_buffer_;
std::shared_ptr<tf2_ros::TransformListener> tf_listener_;
sensor_msgs::msg::CameraInfo camera_info_;
bool camera_info_received_ = false;
bool has_camera_to_robot_tf_ = false;
const double confidence_threshold_ = 0.7;
};
```
### 2. Semantic Segmentation for Object Localization
```cpp
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/image.hpp>
#include <cv_bridge/cv_bridge.h>
#include <opencv2/opencv.hpp>
class SemanticSegmentationNode : public rclcpp::Node
{
public:
SemanticSegmentationNode() : Node("semantic_segmentation_node")
{
image_sub_ = this->create_subscription<sensor_msgs::msg::Image>(
"image_input", 10,
std::bind(&SemanticSegmentationNode::imageCallback, this, std::placeholders::_1)
);
segmentation_pub_ = this->create_publisher<sensor_msgs::msg::Image>(
"segmentation_output", 10
);
}
private:
void imageCallback(const sensor_msgs::msg::Image::SharedPtr image_msg)
{
// Convert ROS image to OpenCV
cv_bridge::CvImagePtr cv_ptr;
try {
cv_ptr = cv_bridge::toCvCopy(image_msg, sensor_msgs::image_encodings::BGR8);
} catch (cv_bridge::Exception& e) {
RCLCPP_ERROR(this->get_logger(), "cv_bridge exception: %s", e.what());
return;
}
// Run semantic segmentation
cv::Mat segmentation_mask = runSegmentationInference(cv_ptr->image);
// Create result image with segmentation overlay
cv::Mat result_image = createSegmentationOverlay(cv_ptr->image, segmentation_mask);
// Publish segmentation result
publishSegmentationResult(result_image, image_msg->header);
}
cv::Mat runSegmentationInference(const cv::Mat& image)
{
// Placeholder for actual segmentation inference
// This would typically use a model like DeepLab, SegNet, or similar
// For Isaac ROS, this might use Isaac ROS Segmentation packages
cv::Mat segmentation_mask;
// In a real implementation, this would:
// 1. Preprocess image for the segmentation model
// 2. Run inference using TensorRT
// 3. Post-process to get class labels for each pixel
// 4. Return a mask where each pixel value represents the class ID
// For this example, return a dummy mask
segmentation_mask = cv::Mat::zeros(image.size(), CV_8UC1);
// Simulate detection of a few classes in specific regions
cv::rectangle(segmentation_mask, cv::Rect(100, 100, 200, 150), cv::Scalar(1), -1); // Class 1
cv::rectangle(segmentation_mask, cv::Rect(300, 200, 150, 100), cv::Scalar(2), -1); // Class 2
return segmentation_mask;
}
cv::Mat createSegmentationOverlay(const cv::Mat& original_image, const cv::Mat& segmentation_mask)
{
cv::Mat overlay = original_image.clone();
// Define colors for different classes
std::vector<cv::Vec3b> class_colors = {
cv::Vec3b(0, 0, 0), // Class 0: background (black)
cv::Vec3b(255, 0, 0), // Class 1: red
cv::Vec3b(0, 255, 0), // Class 2: green
cv::Vec3b(0, 0, 255), // Class 3: blue
cv::Vec3b(255, 255, 0), // Class 4: cyan
cv::Vec3b(255, 0, 255), // Class 5: magenta
};
// Create overlay with transparency
for (int y = 0; y < segmentation_mask.rows; y++) {
for (int x = 0; x < segmentation_mask.cols; x++) {
int class_id = segmentation_mask.at<uchar>(y, x);
if (class_id > 0 && class_id < class_colors.size()) {
// Blend original color with class color
cv::Vec3b& pixel = overlay.at<cv::Vec3b>(y, x);
cv::Vec3b class_color = class_colors[class_id];
// Simple blending (50% original, 50% class color)
pixel = 0.5 * pixel + 0.5 * class_color;
}
}
}
return overlay;
}
void publishSegmentationResult(const cv::Mat& result_image, const std_msgs::msg::Header& header)
{
cv_bridge::CvImage cv_image;
cv_image.header = header;
cv_image.encoding = sensor_msgs::image_encodings::BGR8;
cv_image.image = result_image;
segmentation_pub_->publish(*cv_image.toImageMsg());
}
rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr image_sub_;
rclcpp::Publisher<sensor_msgs::msg::Image>::SharedPtr segmentation_pub_;
};
```
## Isaac Sim Perception Integration
### Isaac Sim Perception Configuration
```python
# Isaac Sim perception setup
import omni
from omni.isaac.core import World
from omni.isaac.core.utils.stage import add_reference_to_stage
from omni.isaac.core.utils.nucleus import get_assets_root_path
from omni.isaac.sensor import Camera
from omni.isaac.range_sensor import RotatingLidarPhysX
import numpy as np
class IsaacSimPerception:
def __init__(self):
self.world = World(stage_units_in_meters=1.0)
self.setup_perception_sensors()
def setup_perception_sensors(self):
# Add a robot to the scene
assets_root_path = get_assets_root_path()
if assets_root_path is None:
print("Could not find Isaac Sim assets")
return
# Add a simple robot with sensors
robot_path = assets_root_path + "/Isaac/Robots/Carter/carter_navigate.usd"
add_reference_to_stage(usd_path=robot_path, prim_path="/World/Carter")
# Add a camera sensor
self.camera = Camera(
prim_path="/World/Carter/chassis/camera",
frequency=30,
resolution=(640, 480)
)
# Add a LIDAR sensor
self.lidar = RotatingLidarPhysX(
prim_path="/World/Carter/chassis/lidar",
translation=np.array([0.0, 0.0, 0.25]),
config="Carter",
rotation_frequency=10,
samples_per_scan=1080
)
# Initialize the world
self.world.reset()
def get_sensor_data(self):
# Get camera data
rgb_data = self.camera.get_rgb()
depth_data = self.camera.get_depth()
seg_data = self.camera.get_semantic_segmentation()
# Get LIDAR data
lidar_data = self.lidar.get_linear_depth_data()
return {
'rgb': rgb_data,
'depth': depth_data,
'segmentation': seg_data,
'lidar': lidar_data
}
def run_perception_pipeline(self):
# Main perception loop
while not self.world.is_stopped():
self.world.step(render=True)
# Get sensor data
sensor_data = self.get_sensor_data()
# Process perception data
objects = self.detect_objects(sensor_data)
# Localize objects in world coordinates
object_poses = self.localize_objects(objects, sensor_data)
# Print results
self.print_perception_results(object_poses)
def detect_objects(self, sensor_data):
# Placeholder for object detection
# In Isaac Sim, this would interface with Isaac ROS perception packages
# or use built-in synthetic data generation
# For this example, return simulated detections
objects = [
{'class': 'box', 'confidence': 0.95, 'bbox': [100, 100, 200, 150]},
{'class': 'cylinder', 'confidence': 0.89, 'bbox': [300, 200, 150, 100]}
]
return objects
def localize_objects(self, objects, sensor_data):
# Convert 2D detections to 3D world coordinates
# This would use depth information and camera parameters
object_poses = []
for obj in objects:
# Convert 2D bbox center to 3D using depth
center_x = (obj['bbox'][0] + obj['bbox'][2]) // 2
center_y = (obj['bbox'][1] + obj['bbox'][3]) // 2
# Get depth at center point
depth = sensor_data['depth'][center_y, center_x]
if depth < 10.0: # Valid depth check
# Convert pixel coordinates to world coordinates
# This requires camera intrinsic parameters
world_pos = self.pixel_to_world(
center_x, center_y, depth,
self.camera.prim.GetAttribute("xformOp:transform").Get()
)
object_poses.append({
'class': obj['class'],
'position': world_pos,
'confidence': obj['confidence']
})
return object_poses
def pixel_to_world(self, u, v, depth, camera_transform):
# Convert pixel coordinates to world coordinates
# This is a simplified version - in practice, you'd use camera intrinsics
# Camera intrinsic parameters (these would come from camera config)
fx = 616.363 # Focal length x
fy = 616.363 # Focal length y
cx = 313.071 # Principal point x
cy = 245.091 # Principal point y
# Convert to camera coordinates
x_cam = (u - cx) * depth / fx
y_cam = (v - cy) * depth / fy
z_cam = depth
# Transform to world coordinates using camera pose
# (simplified - would need proper transformation matrix)
x_world = x_cam # Simplified
y_world = y_cam
z_world = z_cam
return [x_world, y_world, z_world]
def print_perception_results(self, object_poses):
print("Perception Results:")
for obj in object_poses:
print(f" {obj['class']}: ({obj['position'][0]:.2f}, {obj['position'][1]:.2f}, {obj['position'][2]:.2f}), conf: {obj['confidence']:.2f}")
```
## Practical Examples
### Example 1: Person Detection and Localization
```cpp
// Complete example for detecting and localizing people
class PersonDetectionNode : public rclcpp::Node
{
public:
PersonDetectionNode() : Node("person_detection_node")
{
// Subscribe to camera image
image_sub_ = this->create_subscription<sensor_msgs::msg::Image>(
"camera/image_raw", 10,
std::bind(&PersonDetectionNode::imageCallback, this, std::placeholders::_1)
);
// Subscribe to camera info
camera_info_sub_ = this->create_subscription<sensor_msgs::msg::CameraInfo>(
"camera/camera_info", 10,
std::bind(&PersonDetectionNode::cameraInfoCallback, this, std::placeholders::_1)
);
// Publisher for person positions
person_pub_ = this->create_publisher<geometry_msgs::msg::PointStamped>(
"person_position", 10
);
// Publisher for visualization
viz_pub_ = this->create_publisher<sensor_msgs::msg::Image>(
"person_detection_viz", 10
);
}
private:
void imageCallback(const sensor_msgs::msg::Image::SharedPtr image_msg)
{
if (!camera_info_received_) return;
// Convert to OpenCV
cv_bridge::CvImagePtr cv_ptr;
try {
cv_ptr = cv_bridge::toCvCopy(image_msg, sensor_msgs::image_encodings::BGR8);
} catch (cv_bridge::Exception& e) {
RCLCPP_ERROR(this->get_logger(), "cv_bridge exception: %s", e.what());
return;
}
// Run person detection
std::vector<PersonDetection> persons = detectPersons(cv_ptr->image);
// Process each detected person
for (const auto& person : persons) {
if (person.confidence > 0.8) { // Confidence threshold
// Localize person in 3D space
geometry_msgs::msg::PointStamped person_3d = localizePerson(
person, image_msg->header
);
// Publish person position
person_pub_->publish(person_3d);
// Add to visualization
cv::rectangle(cv_ptr->image, person.bbox, cv::Scalar(0, 255, 0), 2);
std::string label = "Person: " + std::to_string(person.confidence);
cv::putText(cv_ptr->image, label,
cv::Point(person.bbox.x, person.bbox.y - 10),
cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 255, 0), 1);
}
}
// Publish visualization image
viz_pub_->publish(*cv_ptr->toImageMsg());
}
std::vector<PersonDetection> detectPersons(const cv::Mat& image)
{
std::vector<PersonDetection> detections;
// In a real implementation, this would run a DNN model
// such as YOLO, SSD MobileNet, or Isaac ROS DetectNet
// For this example, we'll use OpenCV's HOG descriptor
cv::HOGDescriptor hog;
hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());
std::vector<cv::Rect> found_locations;
std::vector<double> found_weights;
hog.detectMultiScale(image, found_locations, found_weights, 0, cv::Size(8,8), cv::Size(32,32), 1.05, 2, false);
for (size_t i = 0; i < found_locations.size(); ++i) {
PersonDetection detection;
detection.bbox = found_locations[i];
detection.confidence = found_weights[i];
detection.center_x = detection.bbox.x + detection.bbox.width / 2.0;
detection.center_y = detection.bbox.y + detection.bbox.height / 2.0;
detections.push_back(detection);
}
return detections;
}
geometry_msgs::msg::PointStamped localizePerson(
const PersonDetection& person,
const std_msgs::msg::Header& header)
{
geometry_msgs::msg::PointStamped person_3d;
person_3d.header = header;
// Estimate depth using simple heuristics
// In practice, you'd use stereo vision or depth sensor
double depth = estimatePersonDepth(person.bbox.height);
// Convert pixel to 3D coordinates
double x_norm = (person.center_x - camera_info_.k[2]) / camera_info_.k[0];
double y_norm = (person.center_y - camera_info_.k[5]) / camera_info_.k[4];
person_3d.point.x = x_norm * depth;
person_3d.point.y = y_norm * depth;
person_3d.point.z = depth;
return person_3d;
}
double estimatePersonDepth(int bbox_height)
{
// Simple depth estimation based on bounding box height
// Assumes average person height is ~1.7m
// height_in_pixels = (focal_length * real_height) / depth
// So depth = (focal_length * real_height) / height_in_pixels
double focal_length = camera_info_.k[0]; // fx
double real_person_height = 1.7; // meters
double pixel_height = static_cast<double>(bbox_height);
return (focal_length * real_person_height) / pixel_height;
}
struct PersonDetection {
cv::Rect bbox;
double confidence;
double center_x, center_y;
};
void cameraInfoCallback(const sensor_msgs::msg::CameraInfo::SharedPtr msg)
{
camera_info_ = *msg;
camera_info_received_ = true;
}
rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr image_sub_;
rclcpp::Subscription<sensor_msgs::msg::CameraInfo>::SharedPtr camera_info_sub_;
rclcpp::Publisher<geometry_msgs::msg::PointStamped>::SharedPtr person_pub_;
rclcpp::Publisher<sensor_msgs::msg::Image>::SharedPtr viz_pub_;
sensor_msgs::msg::CameraInfo camera_info_;
bool camera_info_received_ = false;
};
```
### Example 2: Object Detection with Isaac ROS and Isaac Sim
```python
# Isaac Sim + Isaac ROS integration example
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from isaac_ros_detectnet_interfaces.msg import DetectionArray
from geometry_msgs.msg import PointStamped
import cv2
import numpy as np
from cv_bridge import CvBridge
class IsaacPerceptionPipeline(Node):
def __init__(self):
super().__init__('isaac_perception_pipeline')
# ROS 2 interface
self.bridge = CvBridge()
# Publishers and subscribers
self.image_sub = self.create_subscription(
Image, '/camera/image_raw', self.image_callback, 10
)
self.camera_info_sub = self.create_subscription(
CameraInfo, '/camera/camera_info', self.camera_info_callback, 10
)
self.detection_sub = self.create_subscription(
DetectionArray, '/detectnet/detections', self.detection_callback, 10
)
self.object_pub = self.create_publisher(
PointStamped, '/detected_object_position', 10
)
self.viz_pub = self.create_publisher(
Image, '/perception_visualization', 10
)
# Storage
self.camera_info = None
self.latest_image = None
def image_callback(self, msg):
self.latest_image = msg
def camera_info_callback(self, msg):
self.camera_info = msg
def detection_callback(self, msg):
if self.latest_image is None or self.camera_info is None:
return
# Convert ROS image to OpenCV
cv_image = self.bridge.imgmsg_to_cv2(self.latest_image, "bgr8")
# Process detections
for detection in msg.detections:
if detection.confidence > 0.7: # Confidence threshold
# Localize object in 3D
object_3d = self.localize_object_3d(
detection.bbox.center.x,
detection.bbox.center.y,
detection
)
# Publish 3D position
self.object_pub.publish(object_3d)
# Draw bounding box on image
pt1 = (int(detection.bbox.center.x - detection.bbox.size_x/2),
int(detection.bbox.center.y - detection.bbox.size_y/2))
pt2 = (int(detection.bbox.center.x + detection.bbox.size_x/2),
int(detection.bbox.center.y + detection.bbox.size_y/2))
cv2.rectangle(cv_image, pt1, pt2, (0, 255, 0), 2)
cv2.putText(cv_image, f"{detection.label}: {detection.confidence:.2f}",
(pt1[0], pt1[1]-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
# Publish visualization
viz_msg = self.bridge.cv2_to_imgmsg(cv_image, encoding="bgr8")
viz_msg.header = self.latest_image.header
self.viz_pub.publish(viz_msg)
def localize_object_3d(self, x_2d, y_2d, detection):
# This is a simplified example
# In practice, you'd use depth information from stereo or depth sensor
point_3d = PointStamped()
point_3d.header = detection.header
# Estimate depth based on object size or use depth map
# For this example, assume a fixed depth of 2 meters
estimated_depth = 2.0 # meters
# Convert 2D pixel coordinates to 3D using camera intrinsics
if self.camera_info:
fx = self.camera_info.k[0] # Focal length x
fy = self.camera_info.k[4] # Focal length y
cx = self.camera_info.k[2] # Principal point x
cy = self.camera_info.k[5] # Principal point y
point_3d.point.x = (x_2d - cx) * estimated_depth / fx
point_3d.point.y = (y_2d - cy) * estimated_depth / fy
point_3d.point.z = estimated_depth
return point_3d
def main(args=None):
rclpy.init(args=args)
perception_node = IsaacPerceptionPipeline()
try:
rclpy.spin(perception_node)
except KeyboardInterrupt:
pass
finally:
perception_node.destroy_node()
rclpy.shutdown()
if __name__ == '__main__':
main()
```
## Performance Optimization
### 1. TensorRT Optimization for Deep Learning Models
```cpp
// Example of TensorRT optimization for perception
#include <NvInfer.h>
#include <cuda_runtime_api.h>
class OptimizedPerceptionNode : public rclcpp::Node
{
public:
OptimizedPerceptionNode() : Node("optimized_perception_node")
{
// Initialize TensorRT engine
initializeTensorRTEngine();
}
private:
void initializeTensorRTEngine()
{
// This would load a pre-built TensorRT engine
// for optimized inference of perception models
// The engine would be built offline from ONNX models
}
std::vector<Detection> runOptimizedInference(const cv::Mat& image)
{
// Run inference using TensorRT for maximum performance
// This would include:
// 1. Memory management for GPU
// 2. Batch processing
// 3. Asynchronous execution
// 4. Proper input/output binding
std::vector<Detection> detections;
// Implementation would go here
return detections;
}
nvinfer1::ICudaEngine* engine_;
nvinfer1::IExecutionContext* context_;
cudaStream_t stream_;
void* buffers_[2]; // Input and output buffers
};
```
### 2. Multi-Threaded Perception Pipeline
```cpp
#include <thread>
#include <queue>
#include <mutex>
#include <condition_variable>
class MultiThreadedPerceptionNode : public rclcpp::Node
{
public:
MultiThreadedPerceptionNode() : Node("multithreaded_perception_node")
{
// Create threads for different perception tasks
detection_thread_ = std::thread(&MultiThreadedPerceptionNode::detectionLoop, this);
localization_thread_ = std::thread(&MultiThreadedPerceptionNode::localizationLoop, this);
// Subscribe to image
image_sub_ = this->create_subscription<sensor_msgs::msg::Image>(
"image_input", 10,
std::bind(&MultiThreadedPerceptionNode::imageCallback, this, std::placeholders::_1)
);
}
~MultiThreadedPerceptionNode()
{
running_ = false;
if (detection_thread_.joinable()) detection_thread_.join();
if (localization_thread_.joinable()) localization_thread_.join();
}
private:
void imageCallback(const sensor_msgs::msg::Image::SharedPtr image_msg)
{
std::lock_guard<std::mutex> lock(image_queue_mutex_);
image_queue_.push(image_msg);
if (image_queue_.size() > max_queue_size_) {
image_queue_.pop(); // Drop oldest if queue is full
}
image_queue_cond_.notify_one();
}
void detectionLoop()
{
while (running_) {
sensor_msgs::msg::Image::SharedPtr image_msg;
{
std::unique_lock<std::mutex> lock(image_queue_mutex_);
image_queue_cond_.wait(lock, [this] { return !image_queue_.empty() || !running_; });
if (!running_) break;
image_msg = image_queue_.front();
image_queue_.pop();
}
// Run object detection
auto detections = runDetection(image_msg);
// Add to detection queue
{
std::lock_guard<std::mutex> lock(detection_queue_mutex_);
detection_queue_.push(std::make_pair(image_msg->header, detections));
}
detection_queue_cond_.notify_one();
}
}
void localizationLoop()
{
while (running_) {
std_msgs::msg::Header header;
std::vector<Detection> detections;
{
std::unique_lock<std::mutex> lock(detection_queue_mutex_);
detection_queue_cond_.wait(lock, [this] { return !detection_queue_.empty() || !running_; });
if (!running_) break;
auto detection_pair = detection_queue_.front();
header = detection_pair.first;
detections = detection_pair.second;
detection_queue_.pop();
}
// Run localization
auto object_positions = runLocalization(detections, header);
// Publish results
publishResults(object_positions);
}
}
std::vector<Detection> runDetection(const sensor_msgs::msg::Image::SharedPtr image_msg)
{
// Run object detection on the image
// Implementation would go here
return std::vector<Detection>();
}
std::vector<ObjectPosition> runLocalization(
const std::vector<Detection>& detections,
const std_msgs::msg::Header& header)
{
// Localize objects in 3D space
// Implementation would go here
return std::vector<ObjectPosition>();
}
void publishResults(const std::vector<ObjectPosition>& positions)
{
// Publish localization results
// Implementation would go here
}
struct Detection {
std::string label;
float confidence;
cv::Rect bbox;
};
struct ObjectPosition {
std::string label;
geometry_msgs::msg::Point position;
float confidence;
};
rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr image_sub_;
// Image processing queue
std::queue<sensor_msgs::msg::Image::SharedPtr> image_queue_;
std::mutex image_queue_mutex_;
std::condition_variable image_queue_cond_;
// Detection queue
std::queue<std::pair<std_msgs::msg::Header, std::vector<Detection>>> detection_queue_;
std::mutex detection_queue_mutex_;
std::condition_variable detection_queue_cond_;
std::thread detection_thread_;
std::thread localization_thread_;
std::atomic<bool> running_{true};
const size_t max_queue_size_ = 5;
};
```
## Best Practices
### 1. Confidence Thresholding
Always use confidence thresholds to filter out low-quality detections:
- Set appropriate thresholds based on your application requirements
- Consider using adaptive thresholds based on scene complexity
- Validate detections with geometric consistency checks
### 2. Multi-Sensor Fusion
Combine data from multiple sensors for robust perception:
- Fuse camera, LIDAR, and radar data
- Use Kalman filters or particle filters for tracking
- Implement sensor validation and fault detection
### 3. Performance Monitoring
Monitor perception performance in real-time:
- Track inference time and frame rates
- Monitor memory and GPU usage
- Log detection accuracy and false positive rates
## Exercise
Create a complete perception pipeline that includes:
1. Object detection using Isaac ROS DetectNet
2. 3D localization using stereo vision or depth information
3. Multi-threaded processing for real-time performance
4. Integration with Isaac Sim for testing
5. Visualization of detection results
6. Performance evaluation metrics
Test your pipeline with various objects in different lighting conditions and evaluate its accuracy and performance.