Abstract
Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng allows a 1-to-1 transfer of real traffic into simulation, preserving interactions between agents while enabling realistic and flexible scenario testing. We benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase to support reproducible research and robust validation.
At a Glance
DrivIng is a real-world autonomous driving dataset with a fully geo-referenced CARLA digital twin, designed for systematic sim-to-real evaluation and reproducible benchmarking across matched real and simulated environments. It provides synchronized multi-sensor streams (six RGB cameras, LiDAR, and ADMA GNSS/IMU), together with 3D bounding-box annotations for 3D detection, tracking, and multimodal perception.
Dataset Composition
DrivIng provides a synchronized and spatially calibrated multimodal dataset paired with a geo-referenced CARLA digital twin. The dataset is organized into three continuous sequences recorded under different illumination conditions; a minimal loading sketch follows the list:
- Sequences (3): Day, Dusk, and Night continuous runs.
- Coverage: ~18 km route spanning urban, suburban, and highway segments (unique track length ~16 km).
- Sensors: 6 RGB cameras (360° coverage), 1 LiDAR, and 1 ADMA GNSS/IMU (geo-referencing and motion).
- Annotations: 10 Hz 3D bounding boxes with track IDs in the LiDAR point cloud, aligned to the vehicle reference frame.
- Classes: 12 object categories (3D boxes are class-colored in visualizations for clear distinction).
- Scale: ~63k annotated frames (~378k camera images, ~63k LiDAR frames) and ~1.2M labeled instances.
- Environments: Highway, suburban streets, urban roads, and construction zones.
- Privacy: Faces and license plates are anonymized in camera images.
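For orientation, the sketch below shows how one synchronized sample might be loaded. The directory layout, file names, and field names are illustrative assumptions, not the dataset's actual API; consult the released codebase for the real structure.

```python
import json
from pathlib import Path

import numpy as np

# Hypothetical layout: one directory per sequence ("day", "dusk", "night"),
# with per-frame camera images, LiDAR scans, and JSON annotation files.
# Actual DrivIng paths and field names may differ.
DATA_ROOT = Path("driving_dataset/day")

def load_frame(frame_id: int):
    """Load one synchronized sample: six camera images, LiDAR scan, labels."""
    images = {
        cam: DATA_ROOT / "cameras" / cam / f"{frame_id:06d}.jpg"
        for cam in ["front", "front_left", "front_right",
                    "rear", "rear_left", "rear_right"]
    }
    # LiDAR assumed stored as an (N, 4) float32 array: x, y, z, intensity.
    points = np.fromfile(
        DATA_ROOT / "lidar" / f"{frame_id:06d}.bin", dtype=np.float32
    ).reshape(-1, 4)
    # 10 Hz annotations: 3D boxes with track IDs in the vehicle frame.
    labels = json.loads(
        (DATA_ROOT / "labels" / f"{frame_id:06d}.json").read_text()
    )
    return images, points, labels
```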
Calibration Overview
Each sensor within DrivIng is temporally synchronized and spatially calibrated to support frame-accurate multi-sensor fusion and direct comparison between real-world recordings and the CARLA digital twin. Calibration covers per-camera intrinsics; extrinsics between the cameras, LiDAR, and ADMA GNSS/IMU; and consistent alignment to the vehicle reference frame for reliable 3D annotation projection and evaluation.
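As a concrete illustration, the sketch below projects 3D points from the vehicle frame into one camera image using a pinhole model. The names `T_cam_from_vehicle` and `K` stand in for the extrinsic and intrinsic matrices shipped with the calibration files; lens distortion is ignored, so this is a simplified assumption rather than the dataset's actual projection code.

```python
import numpy as np

def project_to_camera(pts_vehicle: np.ndarray,
                      T_cam_from_vehicle: np.ndarray,
                      K: np.ndarray) -> np.ndarray:
    """Project (N, 3) points from the vehicle frame into pixel coordinates.

    T_cam_from_vehicle: 4x4 rigid extrinsic (vehicle frame -> camera frame).
    K: 3x3 pinhole intrinsic matrix (distortion ignored in this sketch).
    """
    # Homogeneous coordinates: (N, 3) -> (N, 4).
    pts_h = np.hstack([pts_vehicle, np.ones((len(pts_vehicle), 1))])
    # Transform into the camera frame: (3, N).
    pts_cam = (T_cam_from_vehicle @ pts_h.T)[:3]
    # Keep only points in front of the camera (positive depth).
    in_front = pts_cam[2] > 0
    uv = K @ pts_cam[:, in_front]
    # Perspective division yields (M, 2) pixel coordinates.
    return (uv[:2] / uv[2]).T
```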
Coordinate System
All sensor data and 3D annotations are provided in a consistent vehicle-centric reference frame to enable reliable multi-sensor fusion and benchmarking. The dataset includes the necessary rigid transforms to map between the individual sensor frames (cameras, LiDAR, IMU) and the vehicle coordinate frame. The coordinate convention follows a standard automotive setup with a right-handed axis definition.
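A minimal sketch of composing these rigid transforms with homogeneous 4x4 matrices is given below. The numeric values and the assumed axis convention (x forward, y left, z up, as in ISO 8855) are illustrative; the actual rotations and translations come from the dataset's calibration files.

```python
import numpy as np

def rigid_transform(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative values only: real rotations/translations are provided by the
# dataset's calibration files. Right-handed automotive convention assumed:
# x forward, y left, z up.
T_vehicle_from_lidar = rigid_transform(np.eye(3), np.array([1.5, 0.0, 1.8]))

def lidar_to_vehicle(pts_lidar: np.ndarray) -> np.ndarray:
    """Map (N, 3) LiDAR points into the vehicle reference frame."""
    pts_h = np.hstack([pts_lidar, np.ones((len(pts_lidar), 1))])
    return (T_vehicle_from_lidar @ pts_h.T).T[:, :3]
```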
Citation
@misc{rößle2026drivinglargescalemultimodaldriving,
title={DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration},
author={Dominik Rößle and Xujun Xie and Adithya Mohan and Venkatesh Thirugnana Sambandham and Daniel Cremers and Torsten Schön},
year={2026},
eprint={2601.15260},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.15260},
}