Rethinking the data input and storage core of emoncms: benchmarks

Over the last few days I've been looking again at the core data input, storage and access part of emoncms. There is definitely a lot of opportunity to improve performance and there are a lot of options so I thought I would start to do some more systematic benchmarking.

So here are some initial benchmarks of feed data storage in different storage engines: mysql (myisam vs innodb), timestore and direct file access. I also thought Id have a go at writing the current implementation of input processing in both python and nodejs in addition to php to learn a bit more about these languages as they are being used and favoured by others in the community such as Jerome (python), Houseahedron (python) and Jean Claude Wippler of Jeelabs (nodejs). Id like to see if there is any measurable difference in performance between these different languages for the kind of application that we are developing and if there are any other benefits: easier to do certain things etc.

Housemon by Jean Claude Wippler is a good example of how a timeseries data storage and visualisation application can be implemented in a different way by using a mixture of direct file storage and a redis in-memory database with the server side part of the application written in nodejs.

Intrigued by the idea of using direct file storage as Jean Claude Wippler does in Housemon and following the approach used by Mike Stirling in timestore of using a fixed time interval to simplify and speed up searching I had a go at writing a basic implementation using php file access and the results are good.

Storage engine test

All tested on a raspberrypi, running off the standard SanDisk SDHC 4Gb SD Card. 

  • InnoDB INSERT 1000 points 21s,25s,20s (Normalised to 100,000 ~ 2200s)
  • InnoDB INSERT 10000 points 167s,183s (Normalised to 100,000 ~ 1750s)
  • MYISAM INSERT 10000 points 15-17s (Normalised to 100,000 ~ 160s)
  • MYISAM INSERT 100000 points 165s

Benchmark of current emoncms mysql read function that selects given number of datapoints over a time window.

MYISAM results on the left | INNODB results on the right

10000 datapoint table:
  • 1000dp over 5 hours (average method) 232ms | 391ms
  • 1000dp over 24 hours (average method) 424ms | 675ms
1000000 datapoint table: (115 days @ 10s)
  • all 0.2 hours (all method) 40ms | 38ms
  • all 0.5 hours (all method) 58ms | 55ms
  • all over 1 hours (all method) 90ms | 82ms
  • all over 1.3 hours (all method) 108ms | 100ms
  • 1000dp over 3 hours (average method) 237ms | 272ms
  • 1000dp over 5 hours (average method) 280ms | 327ms
  • 1000dp over 24 hours (average method) 726 ms | 949ms
  • 1000dp over 48 hours (average method) 1303 ms | 1767ms
  • 1000dp over 52 hours (php loop method) 2875 ms | 2650ms
  • 1000dp over 100 hours (php loop method) 3124 ms | 2882ms
  • 1000dp over 200 hours (php loop method) 2934 ms | 2689ms
  • 1000dp over 400 hours (php loop method) 2973 ms | 2749ms
  • 1000dp over 2000 hours (php loop method) 2956 ms | 2762ms
  • 1000dp over 2600 hours (php loop method) 2969 ms | 2767ms
PHP loop method timing may be quite a bit longer if the server is under heavy load as it involves making many separate mysql queries, each query needs to wait for other queries in the mysql process list to complete.

Timestore is a promising solution, developed specifically for timeseries data, written by Mike Stirling.
Blog post on timestore: Timestore timeseries database
  • 10000 inserts 52s
  • 100,000 inserts 524s
  • Read 1000 datapoints over 5 hours: 45ms
  • Read 10 datapoints over 5 hours 20ms
Includes layer averaging and multiple layers so there is quite a bit more going on (that would still need to be added to other implementations like direct file and mysql above), so benchmarks are not directly comparable.

Direct file
For some reason I did not think this method would work as well as the benchmarks show but its great that it does because from an implementation point of view its really simple and very flexible as its easy to modify the code to do what you want, see the examples linked:
  • Direct file write 100,000: 6-7s
  • Direct file write 100,000 open and close each time: 27,24,26s
  • Direct file read 1000 datapoints over 5 hours of 10 second data in 85-88ms
  • Reads 1000 datapoints over 200 hours of 10 second data in 93ms
  • Reads 1000 datapoints over 2000 hours of 10 second data in 130ms
  • Reads 1000 datapoints over 2600 hours of 10 second data in 124ms
For a short term storage, storage to disk may not be necessary, instead we can store data in memory using an in-memory database like redis. Benchmarks to add.

Blog post: Redis idea

Other ideas for storage format
What about the programming language? No benchmarks yet but interesting to look at the difference in how the code looks. I found each language pretty straightforward to use and online resources to get me passed the bits I didn't know where readily available. The language links below show the core parts of the input processing stage of emoncms written in php, nodejs and python. I've also linked to emonweb a port of emoncms (or more a build in its own right be now) by Frank Oxener in ruby on rails.
Servers stats

HouseMon by Jean Claude Wippler stores data in 3 forms: 
  • Raw log of the serial data received to file (compressed daily) 
  • Redis in-memory storage for last 48 hours which makes for quick access of most recent data. 
  • Archival storage via direct file access for data older than 48 hours, the archive is hourly aggregated data (hourly - unless a use case demands finer resolution at which point the archive can be rebuilt from the raw logs).

Its quite clear from some of the above tests that the housemon implementation is going to be fast in terms of data access speeds (with redis storing everything in memory for the last 48 hours) and efficient in terms of data storage (binary files – hourly data), the big difference is that full resolution data is not available after 48 hours but Jean Claude Wippler argues that it would be better to wait for a use case rather than implement higher resolution for higher resolution sake and that logs can be used to rebuild archives at higher resolution if needed anyway.

Next steps

If you have a standard emoncms raspberrypi install, changing the mysql storage engine to myisam should bring immediate performance improvements, especially if you have a lot of feeds being recorded, I will try and put together a script to make this easier and also update the ready to go image.

The next development step I think is to integrate redis into emoncms by rebuilding the input processing implementation to use redis rather than go to disk to get the last feed and input values. Then it would be good to test both timestore and the integrated direct file storage in action on several parallel raspberrypi's, keep benchmarking the differences and then see where that gets us. To engage in discussion regarding this post, please post on our Community Forum.