
I am running a query against the following dataset: https://www.kaggle.com/datasets/census/population-time-series-data.

Here is the code:

import calendar
import datetime
import griddb_python as griddb

year_in_mili = 31536000000  # milliseconds in a 365-day year

ts = store.get_container("population")
query = ts.query("select * from population where value > 280000")
rs = query.fetch()

data = rs.next()
timestamp = calendar.timegm(data[0].timetuple())
gsTS = griddb.TimestampUtils.get_time_millis(timestamp)
time = datetime.datetime.fromtimestamp(gsTS / 1000.0)

added = gsTS + (year_in_mili * 7)
addedTime = datetime.datetime.fromtimestamp(added / 1000.0)

variance = ts.aggregate_time_series(time, addedTime, griddb.Aggregation.VARIANCE, "value")
print("VARIANCE: ", variance.get(griddb.Type.DOUBLE))

stdDev = ts.aggregate_time_series(time, addedTime, griddb.Aggregation.STANDARD_DEVIATION, "value")
print("STANDARD DEVIATION: ", stdDev.get(griddb.Type.LONG))

All of the results are correct except for stdDev and variance:

TOTAL:  48714984
AVERAGE:  289970
VARIANCE:  -84078718183.5204
STANDARD DEVIATION:  -9223372036854775808
COUNT  168
WEIGHTED AVERAGE:  289970

Obviously they should not be negative. I ran the same numbers in Excel as a check; these are the results:

var: 31045317.27 
std dev: 5571.832488
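(As a quick consistency check on the Excel figures: the standard deviation is just the square root of the variance, and the two values quoted above do agree.)

```python
import math

# Excel results quoted above
excel_variance = 31045317.27
excel_std_dev = 5571.832488

# The standard deviation should equal the square root of the variance.
print(math.sqrt(excel_variance))  # ~5571.832488
```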

1 Answer

I'd guess, from the extreme magnitudes of stdDev and variance, that an integer overflow is happening somewhere, possibly because of missing values. The original post your code comes from mentions "avoiding the missing data from before 1970" as the reason for querying populations over 280000, but its sample code actually queries populations over 327000. Try changing 280000 to 327000, or otherwise check for any missing time data.
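One way to confirm this is a client-side cross-check: fetch the rows, drop any missing or sentinel values, and compute the statistics in Python rather than on the server. This is a minimal sketch, with made-up sample values standing in for the rows your query would return:

```python
import math
import statistics

# Hypothetical monthly population values standing in for the rows
# returned by rs.next() in the question; real data comes from the container.
values = [283000, 284500, None, 286000, 287900, 289970, 291500, 293000]

# Drop missing/sentinel entries first -- a single bad value can
# corrupt an aggregate computed over the raw series.
clean = [v for v in values if v is not None and v > 0]

variance = statistics.pvariance(clean)  # population variance, like VARIANCE
std_dev = math.sqrt(variance)           # equivalent to statistics.pstdev(clean)

print("VARIANCE:", variance)
print("STANDARD DEVIATION:", std_dev)
```

If the Python numbers match Excel but not GridDB, the problem is in the data the server-side aggregation is seeing over that time range.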

umayr