This article discusses how augmented reality and artificial intelligence technologies capture real audio-visual scenes through digital sensors and render enhanced virtual objects, which users experience by wearing a headset display or smart glasses and listening through headphones or loudspeakers.
Augmented reality headsets in today's consumer electronics market are equipped with micro-electro-mechanical systems (MEMS) sensors with embedded processing capability and artificial intelligence elements. These sensors respond with high precision to stimuli from the actual physical world to blend holographic data with real-world environments.
The environmental data is collected through sensors such as microphones, accelerometers, gyroscopes, and cameras, which detect sound pressure, linear acceleration (including vibration), angular velocity (rotation), and visual information, respectively. The integration of this data is known as multi-sensor data fusion.
As discussed below, sensor output accuracy and data fusion are essential to the perceived performance quality of augmented reality wearable devices. Users have high expectations and demand extremely responsive systems that keep virtual content aligned with dynamic real-world environments.
This consumer demand promotes product standardization and generates economies of scale, presenting both ground-breaking opportunities and significant challenges in harnessing the benefits of intelligent sensing for a wide range of applications. These include audio-visual communication, gaming, navigation, autonomous driving, smart homes, health monitoring, and robotics.
Spatial Audio Augmented Reality Applications and Sensor Performance
Over the last decade or so, there has been intensive augmented reality research. This research has focused mainly on visual perception, whereas the rendering of virtual auditory objects in three-dimensional space has received little attention, particularly with respect to artificial intelligence methods such as deep learning for motion tracking, directional localization, and distance (or depth) detection.
Recent deep learning methods are advancing automatic sound description, classification, and recognition in contexts other than augmented reality. Novel automatic scene description and room layout reconstruction can be used to localize virtual auditory objects when using an augmented reality headset to identify and classify visual features such as room size, sound source angular direction, and distance.
Novel deep learning methods for video captioning have also shown improved performance and discovery of audio-visual correlation and hence their potential for audio-visual localization and synchronization.
There is also increasing interest in spatial audio capture and rendering techniques to automatically adapt augmented audio reproduction over bespoke loudspeaker arrangements using object-based audio content, which involves capturing and controlling the spatial distribution of sound objects in musical events such as orchestral settings.
To this effect, intelligent spherical microphone array technology provides outstanding performance through ultra-small fabrication geometries, low power consumption, and excellent stability of sensor properties in terms of sensitivity, repeatability, and frequency response accuracy.
These three-dimensional microphone configurations are capable of dynamic beamforming, which adjusts microphone directivity patterns. They are inspired by positional tracking (i.e., measuring) techniques such as six degrees of freedom tracking, which detects both rotational and translational movement by using gyroscopes and accelerometers, resembling the ability of the human inner ear to sense body pose (i.e., the combination of position and orientation).
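To make the idea of beamforming more concrete, the following Python sketch shows delay-and-sum beamforming, one of the simplest ways to steer the directivity of a microphone array. It is an illustration only and makes several assumptions that do not come from the products discussed here: a four-microphone linear array (rather than a spherical one), a far-field plane wave, whole-sample delays, and a speed of sound of 343 m/s. All function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed)


def arrival_delays(mic_positions, angle_deg, sample_rate):
    """Arrival delay of a far-field plane wave at each microphone, in whole
    samples, relative to the microphone that is reached first.

    mic_positions are x coordinates in meters along a line; angle_deg is
    measured from broadside, positive toward +x.
    """
    s = math.sin(math.radians(angle_deg))
    raw = [round(-x * s / SPEED_OF_SOUND * sample_rate) for x in mic_positions]
    first = min(raw)
    return [d - first for d in raw]


def simulate_plane_wave(source, delays):
    """Toy capture model: each channel is the source shifted by its delay."""
    return [[source[n - d] if n >= d else 0.0 for n in range(len(source))]
            for d in delays]


def delay_and_sum(channels, delays):
    """Undo each channel's assumed arrival delay, then average the channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]


def mean_power(signal):
    return sum(v * v for v in signal) / len(signal)


# Hypothetical setup: 4 microphones spaced 5 cm apart, 2 kHz tone from 30 degrees.
sample_rate = 48_000
mics = [0.00, 0.05, 0.10, 0.15]
tone = [math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(960)]

true_delays = arrival_delays(mics, 30.0, sample_rate)
channels = simulate_plane_wave(tone, true_delays)

steered = delay_and_sum(channels, true_delays)                         # aimed at the source
broadside = delay_and_sum(channels, arrival_delays(mics, 0.0, sample_rate))  # aimed straight ahead

print("power when steered at the source:", mean_power(steered))
print("power when steered to broadside: ", mean_power(broadside))
```

When the array is steered toward the source, the channels add coherently and the output power is noticeably higher than when it is steered elsewhere. Changing the steering angle at run time is what makes the directivity pattern "dynamic".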
Motion Tracking and Sensor Data Fusion Techniques
As a user of smart glasses or an augmented reality headset can see real and virtual objects simultaneously, positional information and motion tracking are critical to providing meaningful sensory feedback for interactivity.
Tracking a headset pose can be particularly demanding when the device's wearer makes rapid head movements, as visual misalignment may occur due to poor inertial sensor data fusion. Optical tracking techniques can alleviate this problem by using a visual camera, for example, to reconstruct either the pose of the camera in its surroundings or the pose and the spatial depth of the tracked object. Hybrid techniques can also be more efficient by fusing visual with inertial orientation data.
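As a rough illustration of such a hybrid technique, the Python sketch below uses a complementary filter, a simple and widely used fusion scheme, to combine gyroscope rate data with occasional camera-based orientation estimates. It is a minimal sketch under stated assumptions: only one rotation axis (yaw) is tracked, angle wrap-around is ignored, and the function name, the blending weight alpha, and the sample values are all hypothetical.

```python
def fuse_yaw(camera_yaw_deg, gyro_rate_dps, dt, alpha=0.98, initial_yaw=0.0):
    """Complementary filter for a single yaw angle (degrees).

    gyro_rate_dps: angular rate samples in degrees per second.
    camera_yaw_deg: optical yaw estimates, or None when no fix is available.
    alpha: weight given to the gyro prediction (assumed value, 0..1).
    """
    yaw = initial_yaw
    fused = []
    for cam, rate in zip(camera_yaw_deg, gyro_rate_dps):
        predicted = yaw + rate * dt          # fast, but drifts with gyro bias
        if cam is None:
            yaw = predicted                  # rely on inertial data alone
        else:
            yaw = alpha * predicted + (1.0 - alpha) * cam  # camera corrects drift
        fused.append(yaw)
    return fused


# Hypothetical case: the head is still (true yaw 0), but the gyro has a
# constant bias of 1 degree per second. Samples arrive at 100 Hz for 10 s.
steps, dt = 1000, 0.01
gyro = [1.0] * steps
camera = [0.0] * steps

fused = fuse_yaw(camera, gyro, dt)
gyro_only = fuse_yaw([None] * steps, gyro, dt)

print("gyro only, final yaw:", gyro_only[-1])
print("fused,     final yaw:", fused[-1])
```

In this toy case, integrating the gyroscope alone drifts by about 10 degrees, while the fused estimate stays within about half a degree. The weight alpha trades responsiveness to fast head movement against how quickly the optical data pulls the estimate back.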
Microphone arrays can estimate sound source distance for environmental localization. However, the level of signal noise or environmental interference may depend on sensor directivity and on hardware interfacing, such as signal digitization and amplification.
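A common building block for localization with microphone arrays is the time difference of arrival (TDOA) between two microphones. The Python sketch below estimates that delay with a brute-force cross-correlation and converts it to an arrival angle. Note that a single microphone pair yields a direction, not a distance; estimating distance generally requires combining several pairs or additional cues. The function names, the 10 cm spacing, and the far-field assumption are hypothetical choices for illustration.

```python
import math


def estimate_delay(ref, other, max_lag):
    """Lag (in samples) by which `other` trails `ref`, chosen as the lag
    that maximizes the cross-correlation of the two signals."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(ref)):
            m = n + lag
            if 0 <= m < len(other):
                score += ref[n] * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(lag, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Far-field arrival angle from broadside for a two-microphone pair.
    A positive lag means the sound reached the reference microphone first."""
    ratio = lag / sample_rate * speed_of_sound / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))


# Hypothetical test signal: a short pulse that reaches the second
# microphone 5 samples after the first.
ref = [0.0] * 50
ref[20:23] = [1.0, 2.0, 1.0]
other = [0.0] * 5 + ref[:-5]

lag = estimate_delay(ref, other, max_lag=10)
angle = arrival_angle_deg(lag, sample_rate=48_000, mic_spacing=0.10)
print("estimated lag (samples):", lag)
print("estimated arrival angle (degrees):", angle)
```

Because the delay is resolved only to whole samples, the angular resolution depends on the sample rate and the microphone spacing, which is one way in which digitization affects localization accuracy.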
To provide accurate or reliable signal information, popular multi-sensor data fusion schemes such as state-estimation methods that are based on Kalman filtering or particle filtering tend to use a relatively small number of MEMS microphones in symmetrical array configurations. For example, one-dimensional array configurations are common in biologically inspired systems that mimic human binaural perception.
This approach may not enhance the signal-to-noise ratio as efficiently as larger condenser microphone technologies but is more straightforward to configure than other, irregular configurations (e.g., pyramidal or cubic).
This article has explored some current and emerging trends in the performance and configuration of intelligent MEMS sensors. It has also considered multi-sensor data fusion, deep learning, and relevant techniques such as motion tracking in wearable devices used for augmented audio-visual reality and immersive spatial audio.
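To illustrate the state-estimation idea behind Kalman filtering mentioned above, the Python sketch below fuses a stream of noisy readings into a single estimate using a one-dimensional Kalman filter. It is a deliberately minimal example: the tracked quantity is assumed to be a nearly constant scalar (for instance, a source distance in meters), each reading carries its own variance so that different sensors can be mixed, and the function name and all numeric values are hypothetical.

```python
def kalman_1d(readings, process_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a constant-state model.

    readings: iterable of (value, variance) pairs, possibly from
              different sensors with different noise levels.
    process_var: how much the true state is allowed to wander per step.
    x0, p0: initial estimate and its variance.
    """
    x, p = x0, p0
    estimates = []
    for z, meas_var in readings:
        p = p + process_var          # predict: uncertainty grows over time
        k = p / (p + meas_var)       # gain: trust in the new reading
        x = x + k * (z - x)          # update the estimate
        p = (1.0 - k) * p            # update its uncertainty
        estimates.append(x)
    return estimates


# Hypothetical readings of a true value of 2.0 with alternating +/-0.5 error
# and an assumed measurement variance of 0.25.
readings = [(2.0 + (0.5 if i % 2 == 0 else -0.5), 0.25) for i in range(200)]
estimates = kalman_1d(readings, process_var=1e-5)

print("last raw reading:", readings[-1][0])
print("last estimate:   ", estimates[-1])
```

The filter weights each reading by its stated variance, so a noisier sensor automatically contributes less. Practical trackers extend the same predict and update steps to vector states such as position, velocity, and orientation.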
The highlighted developments can help researchers and the broader community in industry and society to better understand some essential concepts that enable manufacturers and developers to create more meaningful interactive experiences.
The growing demand for human-computer interaction and sensor technologies in augmented reality products indicates the importance of these products to societal well-being and of the perceived performance quality of wearable devices, not only for multimedia entertainment and communication applications but also for a wide range of multidisciplinary applications in industries such as medicine, transport, and urban infrastructure.
References and Further Reading
 K. Kim, et al. (2018, Accessed on 8 March 2021). Revisiting trends in augmented reality research: A review of the 2nd decade of ISMAR (2008–2017). IEEE Transactions on Visualization and Computer Graphics 24(11), 2947-2962. Available: https://par.nsf.gov/biblio/10105851-revisiting-trends-augmented-reality-researchreview-decade-ismar
 H. Kim, et al. (2020, Accessed on 8 March 2021). Immersive Virtual Reality Audio Rendering Adapted to the Listener and the Room. Real VR – Immersive Digital Reality: How to Import the Real World into Head-Mounted Immersive Displays, 293-318. Available: https://doi.org/10.1007/978-3-030-41816-8_13
 H. Zhu, et al. (2020, Accessed on 8 March 2021). Deep Audio-Visual Learning: A Survey. arXiv:2001.04758. Available: https://ui.adsabs.harvard.edu/abs/2020arXiv200104758Z
 P. Coleman, et al. (2018, Accessed on 8 March 2021). An Audio-Visual System for Object-Based Audio: From Recording to Listening. IEEE Transactions on Multimedia PP, 1-1. Available: https://www.researchgate.net/deref/http%3A%2F%2Fdx.doi.org%2F10.1109%2FTMM.2018.2794780
 J. Y. Hong, et al. (2017, Accessed on 8 March 2021). Spatial Audio for Soundscape Design: Recording and Reproduction. Applied Sciences 7(6), 627. Available: https://www.mdpi.com/2076-3417/7/6/627
 G. A. Koulieris, et al. (2019, Accessed on 8 March 2021). Near-Eye Display and Tracking Technologies for Virtual and Augmented Reality. Computer Graphics Forum 38(2), 493-519. Available: https://core.ac.uk/display/324166066?recSetID=
 C. Rascón and I. V. M. Ruiz. (2017, Accessed on 8 March 2021). Localization of sound sources in robotics: A review. Robotics Auton. Syst. 96, 184-210. Available: https://www.sciencedirect.com/science/article/pii/S0921889016304742
 F. Castanedo. (2013, Accessed on 8 March 2021). A Review of Data Fusion Techniques. The Scientific World Journal 2013, 704504. Available: https://doi.org/10.1155/2013/704504