
I want to algorithmically process energy meter data. The energy meter measures either a heat or power producer or a heat or power consumer (but not both, so the measured energy always has a positive sign). No additional information is known about the energy system (like maximum load), nor about the type of energy meter; only data stored in a database can be accessed. Processing will be done by an algorithm looking at data for a given time interval (batch processing, not live processing).

Usually, the data are weakly monotonic, of the form:

2015-04-01 00:00 20.78 kWh
2015-04-01 00:05 30.80 kWh
2015-04-01 00:10 73.99 kWh
2015-04-01 00:20 82.30 kWh
2015-04-01 00:25 82.30 kWh
2015-04-01 00:30 83.44 kWh
...

The energy produced or consumed in a given period is simply the difference of the energy meter counts, as in the sketch below. So far, so good.
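A minimal sketch of this computation, assuming pandas; the column names and parsing are illustrative:

```python
# A minimal sketch, assuming pandas; column names are illustrative.
import pandas as pd

readings = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2015-04-01 00:00", "2015-04-01 00:05", "2015-04-01 00:10",
             "2015-04-01 00:20", "2015-04-01 00:25", "2015-04-01 00:30"]
        ),
        "kwh": [20.78, 30.80, 73.99, 82.30, 82.30, 83.44],
    }
).set_index("timestamp")

# Energy produced/consumed per interval = increase of the counter.
interval_energy = readings["kwh"].diff()
print(interval_energy)
```

However, the algorithm has to deal with the following three problems: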

1. Outliers "above" have to be detected as invalid data.

2015-04-01 00:00 20.78 kWh
2015-04-01 00:05 30.80 kWh
2015-04-01 00:10 500 kWh
2015-04-01 00:20 82.30 kWh
2015-04-01 00:25 82.30 kWh
2015-04-01 00:30 83.44 kWh
...

2. Outliers "below" have to be detected as invalid data.

2015-04-01 00:00 20.78 kWh
2015-04-01 00:05 30.80 kWh
2015-04-01 00:10 20 kWh
2015-04-01 00:20 82.30 kWh
2015-04-01 00:25 82.30 kWh
2015-04-01 00:30 83.44 kWh
...

In rare cases, there might be several consecutive outliers above or below, or a combination of both.

3. A reset of the energy meter has to be detected automatically.

2015-04-01 00:00 20.78 kWh
2015-04-01 00:05 30.80 kWh
2015-04-01 00:10 3.99 kWh
2015-04-01 00:20 12.30 kWh
2015-04-01 00:25 12.30 kWh
2015-04-01 00:30 13.44 kWh
...

After a reset, counting starts again from another level (a reset is simply a level shift). The level from which counting resumes after the reset is often zero, but it can also be any other positive number. A reset can occur at an arbitrary point in time (though usually not too often).
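For illustration only: once the position of a reset is known, the level shift can be compensated by re-adding the lost offset. A minimal sketch, where detecting the reset in the first place is taken as given (`reset_positions` is a hypothetical input):

```python
# Purely illustrative: `reset_positions` marks the first reading *after*
# each reset and is assumed to have been detected elsewhere.
import pandas as pd

def compensate_resets(counts: pd.Series, reset_positions: list) -> pd.Series:
    """Re-add the level lost at each known reset so the series is continuous.

    Assumes zero consumption during the reset interval itself and that
    `reset_positions` is sorted in ascending order.
    """
    adjusted = counts.copy()
    for pos in reset_positions:
        # Offset = adjusted level just before the reset minus the restart level.
        offset = adjusted.iloc[pos - 1] - counts.iloc[pos]
        adjusted.iloc[pos:] = (counts.iloc[pos:] + offset).to_numpy()
    return adjusted

counts = pd.Series([20.78, 30.80, 3.99, 12.30, 12.30, 13.44])
print(compensate_resets(counts, reset_positions=[2]).tolist())
# ~ [20.78, 30.8, 30.8, 39.11, 39.11, 40.25]
```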

To my eyes, problems 1-3 seem ubiquitous in measurement engineering and must already have been addressed. Nevertheless, I couldn't find any literature on the topic. Does anybody know of existing solutions to this problem? Any help will be highly appreciated.

Daniel

2 Answers


There are two ways to do it.

The old way

The traditional way is to develop a set of somewhat arbitrary rules based on the errors you manually classify. You filter out non-monotonicity (easy), identify resets (easy), and try to spot other bad values (trickier). That gives you a set of values to mark as missing, and then you analyse the rest of the data. This method is not well grounded in theory, but you will have the (somewhat unsatisfactory) defence that "it's how lots of other people do it".
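As a concrete sketch of such a rule set: the spike threshold and the look-ahead below are exactly the kind of arbitrary choices this approach involves (both are assumptions, not values from the question):

```python
# A crude rule set in the spirit described above.  The threshold and the
# look-ahead are arbitrary choices (assumptions, not from the answer).
import pandas as pd

MAX_PLAUSIBLE_KWH_PER_STEP = 60.0   # ad-hoc spike threshold
RESET_LOOKAHEAD = 2                 # readings that must confirm a new level

def classify(counts: pd.Series) -> pd.Series:
    """Label each reading 'ok', 'outlier', or 'reset' using ad-hoc rules."""
    labels = pd.Series("ok", index=counts.index)
    last_good = counts.iloc[0]
    for i in range(1, len(counts)):
        delta = counts.iloc[i] - last_good
        if delta < 0:
            # A drop below the running level is a reset if the following
            # readings carry on from the new, lower level; else an outlier.
            nxt = counts.iloc[i + 1 : i + 1 + RESET_LOOKAHEAD]
            if len(nxt) and (nxt >= counts.iloc[i]).all() and (nxt < last_good).all():
                labels.iloc[i] = "reset"
                last_good = counts.iloc[i]
            else:
                labels.iloc[i] = "outlier"      # dip below, the question's case 2
        elif delta > MAX_PLAUSIBLE_KWH_PER_STEP:
            labels.iloc[i] = "outlier"          # spike above, the question's case 1
        else:
            last_good = counts.iloc[i]
    return labels

case2 = pd.Series([20.78, 30.80, 20.0, 82.30, 82.30, 83.44])
print(classify(case2).tolist())   # ['ok', 'ok', 'outlier', 'ok', 'ok', 'ok']
```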

Best practice

The best-practice way to do it is to write down the probability of everything relevant, and then apply Bayes' Theorem to work out what the most likely real time series was, given your recorded observations.

You start with a prior distribution for the rate of energy use, based on preceding work.

And then create probability distributions for the ways that errors can happen: a meter reset, a dropped decimal point in recording, a dropped digit, a completely junk reading. Add in a distribution for the measurement error of the meter itself: there's usually either a datasheet or an accredited standard that defines its error range.

The statistics should account for things like a real usage spike and a reset coinciding. You might need to specify a joint distribution if they are linked: for example, a power cut could conceivably result in a meter reset and a power surge, as things like heaters, fridges and freezers would all come back on at full power when power is restored.

And then you calculate a posterior distribution for actual energy use, which is the thing you're interested in.
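To make the idea concrete, here is a deliberately tiny sketch of that machinery for a single interval: a prior over the interval's true energy use, a mixture observation model (good reading vs. junk reading), and a grid-evaluated posterior. All distributions and numbers are illustrative assumptions:

```python
# A deliberately tiny grid approximation; all numbers and distributions
# below are illustrative assumptions, not values from the answer.
import numpy as np
from scipy import stats

# Grid over the true energy used in one interval (kWh).
true_kwh = np.linspace(0.0, 30.0, 601)

# Prior for the interval's energy use, e.g. fitted to historical data.
prior = stats.gamma(a=2.0, scale=4.0).pdf(true_kwh)

# Observation model: with prob. 0.95 the meter reports the truth plus a
# small Gaussian error; with prob. 0.05 the reading is junk (flat density).
observed_delta = 469.2   # the jump implied by the 500 kWh spike in case 1
likelihood = (
    0.95 * stats.norm(loc=true_kwh, scale=0.5).pdf(observed_delta)
    + 0.05 * (1.0 / (true_kwh[-1] - true_kwh[0]))
)

# Bayes' theorem on the grid: posterior is proportional to prior x likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

# The junk component dominates a wildly implausible reading, so the
# posterior stays close to the prior instead of chasing the spike.
print("posterior mean (kWh):", (true_kwh * posterior).sum())
```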

Pros and cons

The second method has the advantage of being rooted in rock-solid theory. It is, however, quite a lot of work to set up the distributions; and in pretty much every real-world case there is no analytic solution, so you have to settle for a numerical one (e.g. using Markov chain Monte Carlo). Software packages such as Andrew Gelman's Stan will do that part of the work for you.

Before you start, chart

Either way, start by charting your raw data. The eye will pick up informative patterns.
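For instance, a minimal plotting sketch, assuming pandas and matplotlib (`meter.csv` with `timestamp`/`kwh` columns is a hypothetical input file):

```python
# A minimal plotting sketch; "meter.csv" is a hypothetical input file.
import matplotlib.pyplot as plt
import pandas as pd

readings = pd.read_csv("meter.csv", parse_dates=["timestamp"], index_col="timestamp")
readings["kwh"].plot(marker="o", title="Raw meter counts")
plt.show()
```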

410 gone

Discarding outliers is easy: you simply discard (or ignore) them. The hard part is deciding what is an outlier and what is valid data. This comes down to defining what is possible and likely.

One obvious check in your case is against the maximum power the system can consume or produce. If the delta between two readings exceeds that, then something is wrong somewhere. For example, if the system being measured can't consume more than 200 kW, then the meter can't increase by more than about 16.7 kWh in a 5-minute period.

Note that the maximum power production can be different from the maximum power consumption. Maybe your system can consume up to 200 kW, but can't ever produce more than 50 kW. That means in 5 minutes the possible valid range is +16.7 kWh to -4.2 kWh.

You may know other things about your system, which you can use to detect invalid readings. For example, even though its power consumption is bounded by -50 to +200 kW, maybe it can't change faster than 10 kW per 5 minutes. Or, if the power is being produced by solar cells, then getting power production at night must be wrong. Surely there are things you know your system can't do.
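A minimal sketch of such plausibility checks; every limit below is an assumption about a hypothetical system and would have to come from its datasheet or from domain knowledge:

```python
# Illustrative limits -- assumptions about a hypothetical system,
# not values from the question:
import pandas as pd

MAX_CONSUME_KW = 200.0   # maximum consumption
MAX_PRODUCE_KW = 50.0    # maximum production
MAX_RAMP_KW = 10.0       # max change in power between readings

def implausible(readings: pd.Series) -> pd.Series:
    """Flag readings whose implied power violates the assumed system limits.

    `readings` is a cumulative kWh series indexed by timestamp; production
    shows up as a negative delta, following this answer's sign convention.
    """
    hours = readings.index.to_series().diff().dt.total_seconds() / 3600.0
    power_kw = readings.diff() / hours              # mean power per interval
    out_of_range = (power_kw > MAX_CONSUME_KW) | (power_kw < -MAX_PRODUCE_KW)
    too_fast = power_kw.diff().abs() > MAX_RAMP_KW
    # Example domain rule: solar production at night must be wrong.
    night = pd.Series((readings.index.hour >= 22) | (readings.index.hour < 5),
                      index=readings.index)
    impossible_solar = night & (power_kw < 0)
    return out_of_range | too_fast | impossible_solar
```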

Again, it all comes down to defining what valid means. For example, you say the third reading in case 2 is wrong. How do you know that? If it's possible for the system to generate 10.8 kWh during a 5-minute interval, then consume 62.3 kWh during the next 10 minutes, then case 2 may be all correct readings after all. You haven't said anything about your system that rules this out.

Olin Lathrop