Abstract: Neither a monocular RGB camera nor a small microphone array is capable, on its own, of accurate three-dimensional (3D) speaker localization. By taking advantage of accurate visual object detection, ...