Abstract
Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng allows a 1-to-1 transfer of real traffic into simulation, preserving interactions between agents while enabling realistic and flexible scenario testing. We benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase to support reproducible research and robust validation.
At a Glance
DrivIng is a real-world autonomous driving dataset with a fully geo-referenced CARLA digital twin, designed for systematic sim-to-real evaluation and reproducible benchmarking across matched real and simulated environments. It provides synchronized multi-sensor streams (six RGB cameras, LiDAR, and ADMA GNSS/IMU), together with 3D bounding-box annotations for 3D detection, tracking, and multimodal perception.
Dataset Composition
DrivIng provides a synchronized and spatially calibrated multimodal dataset paired with a geo-referenced CARLA digital twin. The dataset is organized into three continuous sequences recorded under different illumination conditions; a minimal loading sketch follows the list:
- Sequences (3): Day, Dusk, and Night continuous runs.
- Coverage: ~18 km route spanning urban, suburban, and highway segments (unique track length ~16 km).
- Sensors: 6 RGB cameras (360° coverage), 1 LiDAR, and 1 ADMA GNSS/IMU (geo-referencing and motion).
- Annotations: 10 Hz 3D bounding boxes with track IDs in the LiDAR point cloud, aligned to the vehicle reference frame.
- Classes: 12 object categories (3D boxes are class-colored in visualizations for clear distinction).
- Scale: ~63k annotated frames (~378k camera images, ~63k LiDAR frames) and ~1.2M labeled instances.
- Environments: Highway, suburban streets, urban roads, and construction zones.
- Privacy: Faces and license plates are anonymized in camera images.
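For orientation, the sketch below shows how one synchronized sample might be loaded. The directory layout, file names, and field names are illustrative assumptions, not the dataset's actual API; consult the released codebase for the real structure.

```python
import json
from pathlib import Path

import numpy as np

# Hypothetical layout: one directory per sequence ("day", "dusk", "night"),
# with per-frame camera images, LiDAR scans, and JSON annotation files.
# Actual DrivIng paths and field names may differ.
DATA_ROOT = Path("driving_dataset/day")

def load_frame(frame_id: int):
    """Load one synchronized sample: six camera images, LiDAR scan, labels."""
    images = {
        cam: DATA_ROOT / "cameras" / cam / f"{frame_id:06d}.jpg"
        for cam in ["front", "front_left", "front_right",
                    "rear", "rear_left", "rear_right"]
    }
    # LiDAR assumed stored as an (N, 4) float32 array: x, y, z, intensity.
    points = np.fromfile(
        DATA_ROOT / "lidar" / f"{frame_id:06d}.bin", dtype=np.float32
    ).reshape(-1, 4)
    # 10 Hz annotations: 3D boxes with track IDs in the vehicle frame.
    labels = json.loads(
        (DATA_ROOT / "labels" / f"{frame_id:06d}.json").read_text()
    )
    return images, points, labels
```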
Calibration Overview
Each sensor within DrivIng is temporally synchronized and spatially calibrated to support frame-accurate multi-sensor fusion and direct comparison between real-world recordings and the CARLA digital twin. Calibration covers per-camera intrinsics; extrinsics between the cameras, LiDAR, and ADMA GNSS/IMU; and consistent alignment to the vehicle reference frame for reliable 3D annotation projection and evaluation.
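As a concrete illustration, the sketch below projects 3D points from the vehicle frame into one camera image using a pinhole model. The names `T_cam_from_vehicle` and `K` stand in for the extrinsic and intrinsic matrices shipped with the calibration files; lens distortion is ignored, so this is a simplified assumption rather than the dataset's actual projection code.

```python
import numpy as np

def project_to_camera(pts_vehicle: np.ndarray,
                      T_cam_from_vehicle: np.ndarray,
                      K: np.ndarray) -> np.ndarray:
    """Project (N, 3) points from the vehicle frame into pixel coordinates.

    T_cam_from_vehicle: 4x4 rigid extrinsic (vehicle frame -> camera frame).
    K: 3x3 pinhole intrinsic matrix (distortion ignored in this sketch).
    """
    # Homogeneous coordinates: (N, 3) -> (N, 4).
    pts_h = np.hstack([pts_vehicle, np.ones((len(pts_vehicle), 1))])
    # Transform into the camera frame: (3, N).
    pts_cam = (T_cam_from_vehicle @ pts_h.T)[:3]
    # Keep only points in front of the camera (positive depth).
    in_front = pts_cam[2] > 0
    uv = K @ pts_cam[:, in_front]
    # Perspective division yields (M, 2) pixel coordinates.
    return (uv[:2] / uv[2]).T
```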
Coordinate System
All sensor data and 3D annotations are provided in a consistent vehicle-centric reference frame to enable reliable multi-sensor fusion and benchmarking. The dataset includes the necessary rigid transforms to map between the individual sensor frames (cameras, LiDAR, IMU) and the vehicle coordinate frame. The coordinate convention follows a standard automotive setup with a right-handed axis definition.
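A minimal sketch of composing these rigid transforms with homogeneous 4x4 matrices is given below. The numeric values and the assumed axis convention (x forward, y left, z up, as in ISO 8855) are illustrative; the actual rotations and translations come from the dataset's calibration files.

```python
import numpy as np

def rigid_transform(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative values only: real rotations/translations are provided by the
# dataset's calibration files. Right-handed automotive convention assumed:
# x forward, y left, z up.
T_vehicle_from_lidar = rigid_transform(np.eye(3), np.array([1.5, 0.0, 1.8]))

def lidar_to_vehicle(pts_lidar: np.ndarray) -> np.ndarray:
    """Map (N, 3) LiDAR points into the vehicle reference frame."""
    pts_h = np.hstack([pts_lidar, np.ones((len(pts_lidar), 1))])
    return (T_vehicle_from_lidar @ pts_h.T).T[:, :3]
```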
Citation
@misc{rößle2026drivinglargescalemultimodaldriving,
title={DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration},
author={Dominik Rößle and Xujun Xie and Adithya Mohan and Venkatesh Thirugnana Sambandham and Daniel Cremers and Torsten Schön},
year={2026},
eprint={2601.15260},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.15260},
}