<p>The safety and
reliability of autonomous driving hinge on the accuracy of the perception and
motion prediction pipelines, which in turn depends primarily on the sensors
deployed onboard. Even slight errors in perception and motion prediction can
result in catastrophic consequences due to misinterpretation in downstream
stages. Therefore, researchers have recently devoted considerable effort
to developing accurate perception and motion prediction models. To that
end, we propose the LiDAR Camera network (LiCaNet), which leverages multi-modal
fusion to further enhance the joint perception and motion prediction
performance achieved in our earlier work. LiCaNet expands on our previous
fusion network by adding a camera image to the fusion of the range view (RV)
image with historical bird's-eye view (BEV) data sourced from a LiDAR
sensor. We present a comprehensive evaluation to
validate the outstanding performance of LiCaNet compared to the
state-of-the-art. Experiments reveal that utilizing a camera sensor yields
a substantial perception gain over our previous fusion network and a sharp
reduction in displacement errors. Moreover, most of the achieved
improvement falls within camera range, with the largest gains observed for
small and distant objects, confirming the significance of incorporating a
camera sensor into a fusion network.</p>
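<p>For intuition only, the following is a minimal PyTorch sketch of the kind of three-stream fusion the abstract describes: camera, RV, and historical BEV inputs encoded separately and fused on a common BEV grid. All module names, channel counts, and the concatenation-based fusion strategy are illustrative assumptions, not the actual LiCaNet architecture.</p>
<pre><code># Illustrative sketch only: layer sizes, names, and the fusion strategy
# (feature concatenation on a shared grid) are assumptions, not the
# paper's exact architecture.
import torch
import torch.nn as nn

class ToyThreeStreamFusion(nn.Module):
    """Hypothetical three-stream fusion: camera + RV + historical BEV."""
    def __init__(self, bev_ch=13, feat_ch=32, num_classes=5):
        super().__init__()
        # One lightweight encoder per modality.
        self.cam_enc = nn.Conv2d(3, feat_ch, 3, padding=1)       # camera image
        self.rv_enc  = nn.Conv2d(5, feat_ch, 3, padding=1)       # LiDAR range view
        self.bev_enc = nn.Conv2d(bev_ch, feat_ch, 3, padding=1)  # stacked BEV frames
        # Shared head over the fused feature map.
        self.head = nn.Conv2d(3 * feat_ch, num_classes, 1)

    def forward(self, cam, rv, bev):
        f_cam = torch.relu(self.cam_enc(cam))
        f_rv  = torch.relu(self.rv_enc(rv))
        f_bev = torch.relu(self.bev_enc(bev))
        # In practice the camera and RV features must be projected into the
        # BEV grid using sensor calibration; here we assume all inputs were
        # pre-warped to a common H x W grid for brevity.
        fused = torch.cat([f_cam, f_rv, f_bev], dim=1)
        return self.head(fused)  # per-cell perception logits

model = ToyThreeStreamFusion()
cam = torch.randn(1, 3, 256, 256)   # camera input, assumed warped to BEV grid
rv  = torch.randn(1, 5, 256, 256)   # RV input, assumed warped to BEV grid
bev = torch.randn(1, 13, 256, 256)  # e.g. 13 historical BEV occupancy maps
print(model(cam, rv, bev).shape)    # torch.Size([1, 5, 256, 256])
</code></pre>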