Complications that Sensor Data Provides IoT Designers

Sensors bridge the gap between the virtual workings of devices and the physical world, but extracting useful data from them is not as simple as it sounds. In fact, many designers who are new to the Internet of Things (IoT) are unprepared for how messy a sensor’s data can be.

The IoT motion-sensor company MbientLab faces a near-daily struggle to tactfully teach its customers that the enormous amount of data they are seeing is not due to a fault in its sensors. Rather, it is the fault of the system design that incorporates those sensors, which is missing some crucial step in the data cleaning process.

“I battle this every day,” said MbientLab CEO Laura Kassovic in a recent presentation that warned engineers just how difficult training IoT wearables with machine learning can be. She explained that while tools and hardware have improved over time, a basic understanding of how to process the data is still lagging.

“I applaud users for trying to use sensors to solve problems and research complex topics,” she said. “It’s brave, it’s fun, it’s wild, it’s hard. My issues are with those who blame their failures on our sensors instead of their methodology and failure to solve the real problem. Sensors don’t lie. Sensors aren’t biased. Sensor data is always correct. It is only the user that can misuse or misinterpret the sensor data.”

Sensors cannot always just be picked up and used, however. And just because the data is there does not necessarily mean it is valuable. To make the most of what is collected, the real value has to be separated from the bulk and the rest discarded.

Most sensing is very cheap. There are some exceptions to that, such as artificial eyes. But some also falls into the AI category, such as the wrist watch that picks up various measurements. What kind of insights can you get? Can you predict a heart attack? If you can, that is of pretty high value. So how much would you pay for what? If you have one minute, you can scribble down ‘thank you’ to your wife and that’s about it. If you have an hour, you can call for a medivac. If you have a few hours, the value and risk change again.

Aart de Geus, Chairman and Co-CEO, Synopsys

Data can appear differently depending on the form or application: what is considered clean in one case may need more polish before it can be used in another. Also, some data can be cleaned locally, whereas other data has to be cleaned in a data center.

Let’s say you have a facial recognition application and only certain employees are allowed to enter this building. Every month you update the AI network in the edge device and it will be up-to-date on all the faces. It may do a lot of work because there are a lot of people coming in all the time, but not all of that has to be updated all the time.

Aart de Geus, Chairman and Co-CEO, Synopsys

Some cases may require the data to be scrubbed in real time. A tragic example is the Lion Air crash of a new Boeing 737 MAX 8 aircraft on October 29, 2018, which killed all on board. The current theory is heading toward “the sensor did it.” When the black box was recovered from the flight, the data from one of the two angle-of-attack (AOA) sensors was inconsistent. From the perspective of the computer, half of the data was apparently incorrect, and this was enough to trigger the plane’s anti-stall system into a nose-down action, which the pilots wrestled with all the way into the Java Sea.

Although this is a working theory, it’s still too early to tell what really happened in this case.

It’s not just a sensor. There are multiple aspects of this system. There is a sensing part, a connectivity part, and then a computing part. There is some algorithm that looks at sensor data and determines what is the orientation of the airplane. Multiple features have to work together harmoniously and synchronously to provide the information about the orientation of the airplane.

Mahesh Chowdhary, Director, Strategic Platforms and IoT Excellence Center, STMicroelectronics

But not all data is good, and even data that was picked out as valuable may be corrupted or inaccurate. From apparently simple IoT systems to more complex safety-critical systems, when sensor system designs fail, how often is data, especially dirty data, the culprit? Another issue is how to tell whether it is the sensor or the data that is bad. Or does the problem go deeper, with faulty logic in the algorithms or the firmware that reads and acts upon the data? The first step is forming an agreed definition of what dirty data is.

“It’s an ambiguous area. Is the sensor working right? Well, yeah it is but it’s not working the way you intended it. So, is it user error or is it sensor error? I find the whole concept of dirty data is super ambiguous because if you get the sensors working right, it’s just not working as intended by the user,” said Robert Pohlen, a product line director at TT Electronics, a sensor design company that also helps clients create various sensor-based systems.

The Data Processing Path

To fully comprehend the difference between clean and dirty data, it is important to know how data moves from point A to point B.

To say that data from sensors experiences post-processing is something of an understatement. A basic transducer converts one form of energy to another, with or without assistance from external power, to create a signal: digital or analog. The original conversion is a result of a real-world stimulus in the form of an analog signal—sound, light, temperature, magnetic forces, pressure, etc.

At some point along the line, whether inside the sensor or on the printed circuit board, the original analog signal gets conditioned, or amplified if needed, and converted to a digital signal. After conversion, the data usually is sent to a microchip or some other processor for additional filtering through algorithms that clean out the noise and pull the relevant information into a useful form.
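The chain described above (condition, convert, filter) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gain, the 3.3 V reference, the 10-bit resolution and the four-sample moving average are all hypothetical values, not figures from any sensor discussed in this article.

```python
def condition(volts, gain=2.0, offset=0.0):
    """Amplify (and optionally shift) the raw analog voltage."""
    return gain * volts + offset


def adc_convert(volts, v_ref=3.3, bits=10):
    """Quantize a voltage in [0, v_ref] to an integer ADC code."""
    max_code = (1 << bits) - 1
    clamped = min(max(volts, 0.0), v_ref)  # an ADC cannot see beyond its rails
    return round(clamped / v_ref * max_code)


def moving_average(codes, window=4):
    """Smooth ADC codes with a simple trailing moving average."""
    smoothed = []
    for i in range(len(codes)):
        chunk = codes[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical transducer output in volts, with one noisy spike.
raw_volts = [0.50, 0.52, 0.49, 0.90, 0.51, 0.50]
codes = [adc_convert(condition(v)) for v in raw_volts]
filtered = moving_average(codes)
```

The moving average reduces the spike but also smears it across later samples, which is one reason the choice of filter depends on the application.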

Computer architectures are only now starting to come to grips with this kind of data-first approach, where some data needs to be pre-processed at the edge, while other data can be moved to more powerful servers to be scrubbed.

Edge computation is going to be a big play. The fundamentals are all there. We know what all the base building blocks are. We need to figure out how to efficiently move data around in whatever formats, paying attention to the memory hierarchy of how you move the data the least distance to get it to the computation. These are fundamentals to how to get more efficient computing.

Robert Blake, President and CEO, Achronix

This split is also vital for separating data that must be acted upon straight away from data that could be used to identify trends over time, as well as for removing worthless data. The task gets even more difficult when you consider that there are many different types of data, with some instances requiring multiple data types to navigate the physical world or to form a conclusion about whether someone is about to suffer a medical emergency.

Even data that has started out clean may become dirty, either through updates or viruses.

Globally, all components need to be as secure as possible, so you want to build trust up from the hardware. Once you security boot up, the communication data already has some sort of trust. But there are also insecure, unknown components, and that requires intrusion detection and software analysis on larger sets of data. That allows you to see if anything has been corrupted. In an automotive scenario, you want to detect which part is giving you anomalies or weird data. That’s a security issue, but it’s also a safety issue.

Helena Handschuh, Rambus Fellow

It is clear that dirty data needs to be addressed, but the action that is taken depends on where and how it becomes dirty. It may be that the sensor itself generates dirty raw data, so designers must account for that in the first stages of the process.

Solving a sensor problem requires a lot of domain expertise. It requires knowledge of the sensor at the hardware level, understanding of the data extracted from the sensors and experience with software (algorithm) development.

Laura Kassovic, CEO, MbientLab

It is important to differentiate between the sources of the data: don’t mistake data from an accelerometer for data from a GPS.

An accelerometer only measures acceleration of a body. What most fail to understand is that it is not a substitute for a GPS, which outputs the absolute position of a body in space. Every single application is unique enough that it requires a unique approach to most optimally extract the correct end metric. I am always perplexed by the number of users who think the data coming from the sensors should look exactly like their college textbook. Real-world sensor data is imperfect. When you open your physics, engineering, or computer science textbook, it is littered with perfect curves of bodies in motion. When you take data from the real world, those same curves are going to look quite different. There is noise and error in the real world.

Laura Kassovic, CEO, MbientLab
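To illustrate why an accelerometer is not a substitute for absolute position, the sketch below integrates acceleration twice for a device that is actually sitting still. The sampling rate, bias and noise level are assumed values chosen for illustration, not measurements from any real sensor.

```python
import random


def integrate_position(accels, dt):
    """Integrate acceleration twice (simple Euler steps) to estimate position."""
    velocity = 0.0
    position = 0.0
    for a in accels:
        velocity += a * dt
        position += velocity * dt
    return position


random.seed(0)   # fixed seed so the run is repeatable
dt = 0.01        # 100 Hz sampling (assumed)
n = 6000         # 60 seconds of samples
bias = 0.05      # constant sensor bias in m/s^2 (assumed)
noise_sd = 0.2   # noise standard deviation in m/s^2 (assumed)

# The device is at rest, so the true acceleration is zero throughout.
ideal = [0.0] * n
measured = [bias + random.gauss(0.0, noise_sd) for _ in range(n)]

ideal_position = integrate_position(ideal, dt)      # stays at zero
drifted_position = integrate_position(measured, dt)  # wanders far from zero
```

The bias term alone contributes roughly 0.5 × 0.05 × 60² = 90 m of apparent travel after one minute, even though the device never moved. The textbook curve corresponds to `ideal`; real data looks more like `measured`.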

Understanding Data

So what is the best course of action for dealing with dirty data? The first step is at the source, the sensor, and how the output is understood and interpreted. Sensor data is usually relative rather than absolute, and real-world sensor readings aren’t always perfect.

In the eyes of sensor makers, the basic issues lie with noise, filters and algorithms, so they often provide tools to help. At the user end of the data-handling system, some system designers and platform vendors see valid data populating their databases incorrectly, so they keep a watchful eye and provide tools of their own.

I see dirty data on the analog side, not on the digital side. Dirty data is noisy data. Noise would be my biggest concern. Noise can be induced from lots of different sources. You could just have electrical noise that’s being picked up from your wiring harnesses or caused by components going bad.

Robert Pohlen, Product Line Director, TT Electronics

In Pohlen's eyes, data doesn't qualify as dirty if there is noise caused by some kind of external influence on the actual sensing mechanism.

You know, for example, it’s a light sensor and you have an ambient source of light. I wouldn’t consider that dirty data because that’s really not truly what you’re trying to measure but it is measuring it correctly.

Robert Pohlen, Product Line Director, TT Electronics

Calibration is also important, as uncalibrated sensors produce more dirty data than calibrated ones.

Raw sensor data that is not calibrated, or that has a lot of noise on it, is generally what is referred to as dirty data. Besides the physical part of sensors using some phenomena, like measuring Coriolis acceleration to detect rotation of a device, a user, or a phone, you have signal conditioning blocks. These signal conditioning blocks operate under different conditions, such as a low power mode, where the objective of the designer is to minimize the current consumption of the sensor. If you do that, the noise on the sensor data goes up, because the more power you apply to signal conditioning, the cleaner your data is.

Considering these different aspects, dirty data is sensor data that is not calibrated, sensor data that has been impacted by input of noise, whether the noise is due to purely signal conditioning blocks or from external disturbances.

Mahesh Chowdhary, Director, Strategic Platforms and IoT Excellence Center, STMicroelectronics

He classifies external disturbances as dirty data, such as when a magnetometer is affected by an external magnetic field.

You know that data can all be clumped together and categorized as dirty data.

Mahesh Chowdhary, Director, Strategic Platforms and IoT Excellence Center, STMicroelectronics

Sensors can vary in quality due to the nature of manufacturing; even those within the same batch may have slight differences. Sensors can also pick up damage or be blocked in the field: a ground crew can damage a plane’s sensor, even an AOA sensor; parts can go bad or wear out; and sensors have to be re-calibrated.
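One common way to correct unit-to-unit differences is a two-point calibration against known references. The sketch below is a minimal illustration, assuming a sensor whose error is linear (a gain error plus an offset); the temperature readings are hypothetical.

```python
def two_point_calibration(raw_low, raw_high, ref_low, ref_high):
    """Return (gain, offset) that maps raw readings onto reference values."""
    if raw_high == raw_low:
        raise ValueError("calibration points must differ")
    gain = (ref_high - ref_low) / (raw_high - raw_low)
    offset = ref_low - gain * raw_low
    return gain, offset


def apply_calibration(raw, gain, offset):
    """Correct a raw reading using the stored calibration constants."""
    return gain * raw + offset


# Hypothetical temperature sensor checked against an ice bath (0 C)
# and boiling water (100 C); it read 1.5 and 98.5 respectively.
gain, offset = two_point_calibration(1.5, 98.5, 0.0, 100.0)
corrected = apply_calibration(50.0, gain, offset)
```

Because sensors age and get knocked about in the field, the constants would need to be re-derived periodically rather than computed once and trusted forever.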

Pratik Parikh offers an enterprise point of view on making sense of the data: “In sensor-based device networks, dirty data can be the product of one or many issues. Issues can be caused by, but are not limited to, time series lapses, sensor unit measurement, date/time calibration, inappropriate associations of sensors, improper aggregation of data points across regions, etc. Dirty data could also be as simple as data produced not meeting the business objective and thus being unstable, unusable or invalid.”

Parikh is the director of product marketing at Liaison Technologies, a company that helps put the usable data on a platform for enterprises to use.
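Two of the issues Parikh lists, time series lapses and unit mismatches, lend themselves to simple automated checks. The sketch below is an assumption-laden illustration: the function names, the unit codes and the 50% tolerance are hypothetical and do not describe Liaison Technologies' platform.

```python
def to_celsius(value, unit):
    """Normalize a temperature reading to degrees Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit!r}")


def find_time_issues(timestamps, expected_interval, tolerance=0.5):
    """Flag gaps and out-of-order samples in a list of epoch seconds."""
    issues = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta <= 0:
            issues.append(("out_of_order", prev, curr))
        elif delta > expected_interval * (1 + tolerance):
            issues.append(("gap", prev, curr))
    return issues


# Hypothetical feed that should report every 10 seconds.
timestamps = [0, 10, 20, 50, 45, 55]
issues = find_time_issues(timestamps, expected_interval=10)
# issues == [("gap", 20, 50), ("out_of_order", 50, 45)]
```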

There are other, specific definitions in industry.

Dirty data is well-formed data reported by your devices that is invalid in some way. It doesn’t immediately get flagged as this is garbage that we can’t even interpret. You can totally read it in, but you find out at some point that that data is actually completely invalid.

James Branigan, Co-Founder, Bright Wolf

In the IIoT and IoT, dirty data risks contaminating a company’s data lake and triggering other problems, as well as wasting money.

The reason it is a problem is because in all these IoT systems, as you look for value in the data and you make programmatic analytics that are going to run over those incoming data values, you are going to connect those analytical outputs to your enterprise system in some way. There’s some interesting event that is going to happen as the output of all this. And if you base that interesting event on bad assumptions, dirty data that came in, you get into that classic garbage in, garbage out. Dirty data can cause you real harm where you are starting to incur real economic cost, because these automated actions are being kicked off by data that is actually not valid.

James Branigan, Co-Founder, Bright Wolf

Branigan outlines three key sources of dirty data.

One, something is physically wrong with the sensor. Either the environment has changed or the sensor is having an error that it cannot detect itself, and it is giving you well-formed but completely garbage data. The next category involves whether the firmware that runs on the device has software bugs. Even newer versions of firmware can cause different issues where well-formed data is reported in that is totally erroneous. The third category, which is really nefarious, is where you need very specific knowledge of the machine operations in order to understand how to interpret the data that comes in. Without that knowledge you may interpret a data packet as valid, when some other part of the system did not intend it to be interpreted that way.

James Branigan, Co-Founder, Bright Wolf

So is dirty data clear as mud? Perhaps the term is too general to be useful?

Help with Cleaning Chores

Don't despair: a number of tools are available to help with data cleanup.

There are so many great tools out there. MATLAB, LabVIEW and Python are the most popular. Our very own MetaWear APIs support filters in all major coding languages. I typically recommend that our users use the tools they are most comfortable with. Python is a great tool because it has many machine learning libraries available that are open source, easy to use, and well documented.

Laura Kassovic, CEO, MbientLab

MbientLab also uses Bosch’s FusionLab as they carry a Bosch sensor as well as their own.

Bosch Sensortec, which also provides drivers and libraries for its sensors, wants a sensor system that detects, interprets, monitors, is context-aware and has prediction intent, writes Marcellino Gemelli, who is in charge of business development for Bosch Sensortec’s MEMS product portfolio. The company provides libraries, drivers and tools for setting up sensors, along with microcontrollers for streamlining assistance.

Kassovic thinks that one of the most important factors is the right person with the right expertise.

What I firmly believe today is you can’t send a software engineer to do a firmware engineer’s job.

Laura Kassovic, CEO, MbientLab

On the enterprise side, having a data scientist in the loop for data clean-up takes up too much time.

With machines generating the data, whole new classes of dirtiness can happen beyond human generated data. That is really what the focus of cleaning your dirty data needs to be. There are lots of big data cleaning tools in the big data market place but those are centered around the data scientist. You get a fairly static data set, you need to go and clean it and you need to go and analyze it to look for something interesting. That approach really works well at the rate humans generate data. At the rate that machines generate data, that approach doesn’t scale. It’s not even possible. You end up having these ingestion systems that are taking live feeds from the devices, streaming analytics over them and then hooking those outputs up to some enterprise system so the action happens automatically.

James Branigan, Co-Founder, Bright Wolf
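At machine rates, checks have to run on each reading as it arrives rather than on a static data set. The generator below is a minimal sketch of that idea; the range limits and the maximum plausible step between readings are assumed values for a hypothetical temperature feed, and the function does not represent Bright Wolf's software.

```python
def clean_stream(readings, low, high, max_step):
    """Yield (value, status) for each reading as it arrives.

    status is "ok", "out_of_range", or "spike". Only "ok" readings
    update the baseline used for the spike check.
    """
    last_good = None
    for value in readings:
        if not (low <= value <= high):
            yield value, "out_of_range"
        elif last_good is not None and abs(value - last_good) > max_step:
            yield value, "spike"
        else:
            last_good = value
            yield value, "ok"


# Hypothetical live feed: well-formed numbers, some of them invalid.
feed = iter([20.1, 20.3, 85.0, 20.4, -999.0, 20.6])
results = list(clean_stream(feed, low=-40.0, high=125.0, max_step=5.0))
accepted = [value for value, status in results if status == "ok"]
# accepted == [20.1, 20.3, 20.4, 20.6]
```

One limitation of this simple design is that a genuine step change larger than `max_step` would be rejected indefinitely, so a production system would need some way to re-establish the baseline.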

Moving to digital may help.

Moving towards digital communications definitely helps. All things being considered, you are assuming the sensor is getting good data, so the question about the data you’re collecting is whether it is noisy due to analog. I see the natural trend would move towards digital where you could have error-checking built in. There is some room for noise in the digital system. If this noise is on the lines, who cares really because it’s either high or low and then you have some kind of error check to go along with it. If that’s the case, you can just throw the data out.

Robert Pohlen, Product Line Director, TT Electronics
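The kind of built-in error check Pohlen describes can be illustrated with a CRC appended to each digital frame. The frame layout below (payload followed by a 4-byte big-endian CRC-32) is an assumption made for this sketch and is not a format used by any particular sensor bus.

```python
import struct
import zlib


def build_frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def parse_frame(frame: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    if len(frame) < 4:
        return None
    payload = frame[:-4]
    received = struct.unpack(">I", frame[-4:])[0]
    if zlib.crc32(payload) != received:
        return None  # corrupted in transit: throw the data out
    return payload


# Hypothetical reading: 23.15 C sent as a signed 16-bit count of centidegrees.
frame = build_frame(struct.pack(">h", 2315))
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit on the line

good = parse_frame(frame)       # returns the original two payload bytes
bad = parse_frame(corrupted)    # returns None, so the frame is discarded
```

A check like this catches corruption on the wire, but it cannot tell whether the sensor measured the right thing in the first place.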

“Although raw data may be filtered, compensated and corrected, in most cases there are definite limits to what a user can do with it,” writes Marcellino Gemelli, who is responsible for business development of Bosch Sensortec’s MEMS product portfolio, in a recent article.

The first step to overcome these challenges is implementing and integrating proper sanitation tools. These sanitation tools not only have to deal with the quality of data but also with validation of identity, trust, time series, and each data point from the perspective of the project. Each project has unique requirements. The project implementer can and should use common technology features but must be ready to do mass customization as needed to achieve business objectives.

Pratik Parikh, Director of Product Marketing, Liaison Technologies

Along with de-duplication detection, Liaison Technologies provides data cleansing, filtering, and management. “One of the key features we provide is the tracking of data lineage, which allows us to track the data from its raw introduction to a cleansed structured format. Customers can trace and monitor the data lineage, and if they need to make a course correction, they can replay the data after making appropriate changes to business logic.”

An expensive but worthwhile solution for safety-critical systems is redundancy.

Everybody wants to get to a higher ASIL rating, but do they necessarily want to commit to having more sensing? Again, it all comes down to it might be correct data, it might be incorrect data, but on the back end, how do you interpret that data. Unless you have some kind of self-diagnostic within your sensor, the best way is redundancy.

Robert Pohlen, Product Line Director, TT Electronics
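A simple form of redundancy is to fit three sensors and take the median, so that one faulty unit is outvoted. The sketch below is illustrative only; the readings and the deviation threshold are hypothetical, and real safety-critical designs involve far more than a vote.

```python
from statistics import median


def vote(readings, max_deviation):
    """Median-vote across redundant sensors.

    Returns (value, suspects), where suspects lists the indices of
    sensors that disagree with the median by more than max_deviation.
    """
    if len(readings) < 3:
        raise ValueError("need at least three sensors to outvote one")
    mid = median(readings)
    suspects = [i for i, r in enumerate(readings)
                if abs(r - mid) > max_deviation]
    return mid, suspects


# Hypothetical angle readings in degrees; the third sensor is faulty.
value, suspects = vote([4.9, 5.1, 22.0], max_deviation=1.0)
# value == 5.1, suspects == [2]
```

With only two sensors, a disagreement can be detected but voting cannot decide which one is wrong, which is why the function insists on at least three.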

This information has been sourced, reviewed and adapted from materials provided by TT Electronics plc.

For more information on this source, please visit TT Electronics plc.


Cite this article as: TT Electronics plc. (2019, October 29). Complications that Sensor Data Provides IoT Designers. AZoSensors.
