Fires can cause major infrastructure damage, loss of life, and disruption to essential services in smart cities. As a result, early detection is a priority, especially in outdoor Internet of Things (IoT) environments where smoke is often the earliest visible sign of danger.
Traditional heat, smoke, gas, and temperature sensors can issue late alerts and are often sensitive to surrounding conditions. They have been known to produce false alarms from harmless triggers such as dust or steam.
In fast-moving outdoor fires, these limitations can have devastating effects.
Computer vision has emerged as a promising alternative. Digital cameras, improved processors, graphics processing units (GPUs), and deep learning have made it more practical to detect smoke and fire directly from images and video in real time.
The Vision Transformer Hybrid Framework
In a paper published in Scientific Reports, researchers developed a vision-based smoke and fire detection framework that combines a Vision Transformer (ViT) with the YOLOv8 detection architecture.
The model divides the task between the two components: ViT acts as a global feature extractor, helping the system capture long-range spatial relationships and broader scene context. YOLOv8 serves as the real-time detection head, identifying and localizing fire and smoke regions within the image.
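This division of labor can be sketched in a few lines. The following is an illustrative toy, not the authors' code: the class names, the dummy feature extraction, and the placeholder detection output are all hypothetical, standing in for a real ViT backbone and a real YOLOv8 head.

```python
class ViTBackbone:
    """Stand-in for a Vision Transformer global feature extractor."""
    def extract(self, image):
        # A real ViT would split the image into patches and apply
        # self-attention; here we just summarize each row as a dummy feature.
        return {"global_features": [sum(row) for row in image]}


class YoloHead:
    """Stand-in for a YOLOv8-style real-time detection head."""
    def detect(self, features):
        # A real head would predict boxes and class scores from the features;
        # here we emit one placeholder detection per feature.
        return [{"box": (0, 0, 1, 1), "label": "smoke", "score": 0.9}
                for _ in features["global_features"]]


class HybridDetector:
    """ViT backbone for context, YOLO head for localization."""
    def __init__(self):
        self.backbone = ViTBackbone()
        self.head = YoloHead()

    def __call__(self, image):
        return self.head.detect(self.backbone.extract(image))


detections = HybridDetector()([[1, 2], [3, 4]])
```

The key design point the sketch captures is the handoff: the backbone produces scene-level features once, and the lightweight head turns them into localized detections.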
The framework was evaluated using the Fire and Smoke Dataset and the Forest Fire Smoke Dataset, with a combined total of 7,720 images drawn from rural and urban scenes.
Their goal was to improve both detection accuracy and speed, strengthen performance under variable lighting and changing smoke or fire appearance, and support real-time use in complex urban settings.
How YOLOv8 and the Vision Transformer Work Together
The framework uses ViT’s self-attention mechanism to capture visual patterns such as color gradients, texture, and intensity. According to the authors, this helps the model recognize subtle spatial relationships that conventional convolutional neural networks (CNNs) may miss.
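The self-attention mechanism at the core of a ViT is standard scaled dot-product attention. A minimal NumPy sketch, with toy dimensions chosen for illustration (four patch tokens of dimension eight), is:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 patch tokens, embedding dim 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Because every token attends to every other token, each output row mixes information from the whole image, which is what gives the ViT its long-range spatial context compared with the local receptive fields of a CNN.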
Those extracted features are then passed to YOLOv8 for fast object localization and classification. The authors selected YOLOv8 because of its strong balance between speed and precision, making it suitable for real-time detection with low latency.
The model was trained on augmented datasets to improve generalization across different smoke densities, lighting conditions, and fire characteristics, reducing false positives and improving overall detection performance.
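The paper does not list its exact augmentation pipeline, but typical photometric and geometric transforms can be sketched as follows; the function below is a hypothetical toy operating on a nested-list grayscale image, not the authors' preprocessing code.

```python
def augment(image, brightness=1.0, hflip=False):
    """Toy augmentation: scale pixel intensities (clipped to 8-bit range)
    and optionally flip horizontally."""
    out = [[min(255, int(pixel * brightness)) for pixel in row]
           for row in image]
    if hflip:
        out = [row[::-1] for row in out]
    return out

# Simulating varied lighting and viewpoint on a tiny 1x2 "image"
bright = augment([[100, 200]], brightness=1.2)
flipped = augment([[100, 200]], hflip=True)
```

Brightness jitter mimics variable lighting, while flips vary the apparent position of smoke plumes, both of which the authors cite as sources of appearance change the model must generalize across.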
Results Above 98% Across Metrics
The paper reports 98.5% precision, 97.8% recall, and an F1-score of 98.1% for the proposed ViT-YOLOv8 model. It also reports an accuracy of 99.2% in the abstract and conclusion, although one results section lists 99.6%, indicating an internal inconsistency in the paper’s reporting.
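As a quick sanity check on the reported figures, the F1-score is the harmonic mean of precision and recall, and the paper's three numbers are mutually consistent:

```python
# F1 = 2PR / (P + R), using the precision and recall reported in the paper
precision = 98.5
recall = 97.8
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 1))  # 98.1, matching the reported F1-score
```

This check works for precision, recall, and F1, but not for the accuracy figure, which depends on the (unreported) class balance of the test set.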
The authors say the framework outperformed conventional CNN-based and YOLO-only approaches, with a reported 4.3% gain in accuracy over comparison methods. They also report low inference latency and qualitative results showing accurate localization of smoke and fire regions.
Taken together, those findings suggest the model could support faster visual fire monitoring and improve emergency response planning in smart city settings.
The paper is careful to note that the system was tested mainly on controlled datasets under laboratory conditions, rather than in fully real-world urban deployments. The authors also acknowledge that performance may be affected by weather, occlusion, darkness, and thick fog.
Another limitation is that the framework relies on visual input alone. The researchers suggest that future systems could be strengthened by combining camera-based detection with thermal imaging and environmental sensors such as temperature and humidity sensors. They recommend testing on streaming video and developing lighter models for edge deployment.
Fire Detection in Smart Cities
The study points to a practical direction for early smoke and fire detection: pairing transformer-based scene understanding with fast object detection. The reported results are strong, and the framework appears well-suited to safety applications that depend on rapid visual analysis.
Journal Reference
Abozeid, A., & Alanazi, R. (2026). An intelligent approach for early smoke/fire detection using vision sensors in smart cities. Scientific Reports. DOI: 10.1038/s41598-026-42762-y
Disclaimer: The views expressed here are those of the author expressed in their private capacity and do not necessarily represent the views of AZoM.com Limited T/A AZoNetwork the owner and operator of this website. This disclaimer forms part of the Terms and conditions of use of this website.