More direct file storage research

I was surprised to find how easy it was to use flat file storage for feed data using PHP file access commands and how fast this approach could be.

While reading up on indexes I realised that the timestamp column in a feed data table is its own index as it is naturally sorted in ascending order and each row (datapoint) should be unique. A data-point can then be searched for efficiently using binary search which I remember covering in A-level computing. The feed data in mysql had a binary tree index which If I understand correctly is similar but it is implemented using a separate index layer which uses quite a bit of extra disk space.

I had a go at implementing the get_feed_data function used in emoncms to select a given number of datapoint's over a timewindow used for drawing graphs. An example of the standalone function can be found here:

https://github.com/emoncms/experimental/blob/master/storage/directfiles/get_feed_data.php

A development branch of emoncms that uses this flat file approach and includes this function can be found here (inserting data, input processing such as power to kwhd and visualisation all work, but its still quite conceptual)

https://github.com/emoncms/emoncms/tree/flatfilestore

The get_feed_data function as implemented above takes roughly 120-230ms on a RaspberryPI to select 1000 datapoints over 1 to 300 days with a feed table with over 9 million rows.

Thats much better than the 900-2700ms achieved at similar ranges with the current mysql implementation.

The feed table in mysql used 178Mb of disk space. The same feed with no loss of data stored without an index and accessed as above takes up 67Mb so that's a considerable saving. Interestingly a 67Mb feed can be compressed to 18.5Mb with tar.gz compression.

One of the issues with the above get_feed_data query is that it needs to know the data interval to do the get a datapoint every x number of datapoints approach. We could use binary search to find every datapoint but this would be slower although maybe worth trying to get a benchmark so that it can be compared.

The other issue is that the datapoints selected may not be representative of the window they represent as they are just one random datapoint at a particular point in time. Which is the problem that the averaging approach used by Mike Stirling in Timestore and by Frank Oakner in EmonWeb solves.

Timestore is also a fair bit faster than the above get_feed_data function returning 1000 datapoints in 45ms.

The advantage of the above approach is that it can fit into emoncms without having to change the current implementation too much, the feed data retains its timestamps, input processing is used in the same way.

Not storing timestamps as timestore does could also be an advantage as it helps keep data quality high: fixed interval datapoints should be easier to use for comparison's, mathematical operations between feeds, it gives you higher certainty when querying the data, fetching data is faster and disk use is potentially half the size of the above approach if the values are stored as 4 byte floats as above rather than the default 8 byte double. This coupled with averaged layers provides data that is representative at all time scales and datapoint number requests.

My next step will therefore be to explore timestore further, first creating a script to export data from emoncms into timestore. The script needs to analyse the emoncms feed to work out the most common data interval. It needs to check for missing data, If a monitor went offline for an extended length of time it needs to give you the option to take this into account. It then needs to export and import into timestore as efficiently as possible. To engage in discussion regarding this post, please post on our Community Forum.