Programmable Access Libraries Status 2019-04-10

Some ideas and arbitrarily-ish chosen decisions.

Plan

Date	Day	Description	Task/Milestone
19/04/12	Fri	Gather an example workload of HDF5 and implement HDF5 VOL Passthrough	Task
19/04/15	Mon	Gather additional bioinformatics use cases from Genomics Institute	Task
19/04/19	Fri	Have workloads working on HDF5 files on top of librados	Milestone
19/05/03	Fri	Have workloads working on HDF5 files on top of HDF5-RADOS interface	Milestone

For Friday (05/03), I would like to have a naive/simple implementation of what Carlos had mentioned, where HDF5 VOL passes API calls to another HDF5 VOL server/endpoint which is closely integrated with RADOS (via objecter class/code) so that I can compare the overhead of using librados (just VOL call out to object store) and the overhead of the object store “incorporating” operations from an access library.

I think at this point, I would be able to do a majority of a project for Carlos’s class, which more explicitly mentions benchmarking the comparative performance of HDF5 over a local file system to HDF5 operations distributed to multiple, remote file systems:

Build a HDF5/VOL plugin that maps to HDF5. Measure performance of HDF5 over a regular file system vs HDF5/VOL to HDF5 over a regular file system. What is the overhead of this indirection? Then map HDF5/VOL via plugin to HDF5 on multiple servers. How many servers do you need to be faster than HDF5 over a regular file system?

Pending time I want to merge some ideas here with ideas for a microservice architecture for a bioinformatics application.