Sandbox for running code on data where the data lives


#1

Data is larger than programs; programs have dependencies (software/platform). To what extent is it of interest to have sandboxes where the data lives, so that code can be run on the data without worrying about download or storage? Does this topic overlap with the discussion of containers?


#2

Hi Ramy–

Would you mind being a little more specific about what you are imagining when you describe sandboxes? Like virtual machines?


#3

I am imagining somewhere to upload code (e.g. a Python script) to a file system that is local to the data. Ideally it already has the appropriate interpreter and dependencies; less ideally, it lets me install them. I would run the code there on the data and collect the output, and then the environment would be wiped clean when I am done, to make room for someone else.
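To make the workflow concrete, here is a rough Python sketch of the upload, run, collect, and wipe cycle. It is only an illustration under some assumptions: the function name `run_in_scratch` and the file name `job.py` are made up for this example, the script is assumed to take the data path as its first argument, and a temporary directory stands in for the sandbox. It provides no real isolation; a container or VM would be needed for that, and dependency installation is not shown.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch(script_text: str, data_path: str, timeout_s: int = 60) -> str:
    """Run an uploaded script next to the data and return its standard output.

    Hypothetical helper for illustration only; not a secure sandbox.
    """
    # The scratch directory is deleted when the 'with' block exits,
    # which plays the role of "wiped clean" for the next user.
    with tempfile.TemporaryDirectory(prefix="sandbox_") as scratch:
        script = Path(scratch) / "job.py"  # hypothetical file name
        script.write_text(script_text)

        # Run the script with the same interpreter, passing the data
        # location as an argument so the data itself is never copied.
        result = subprocess.run(
            [sys.executable, str(script), data_path],
            cwd=scratch,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,  # raise CalledProcessError if the script fails
        )
        return result.stdout
```

For example, calling `run_in_scratch` with a script that counts the lines of the file named in `sys.argv[1]` would return that count as text, and nothing from the run would remain on disk afterwards.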


#4

Is the motivation that the data is too large or too sensitive to be moved? Are there throughput concerns? There are several very good execution engines out there today. Is there underlying domain functionality that they do not provide, or is this something you feel should be part of the scientific process for anyone doing collaborative work in the field, à la continuous integration?


#5

The motivation is size, download times, and local storage space. Can you provide a reference/pointer for “execution engines”?


#6

Here are some representative, popular projects, each of which will accomplish what you want in the context of a very different use case:

  • Agave: PaaS, hybrid, and self-hosted Science-as-a-Service solution supporting interactive, batch, and event-driven tasks on HPC, HTC, cloud, and container systems.
  • Taverna: SaaS and self-hosted managed workflow execution.
  • Mesos: SaaS and self-hosted distributed data center orchestration natively supporting most forms of invoking tasks.
  • Jupyter: Web application with distributed remote code execution support in several languages.
  • Ansible: Python-based automation framework with remote task execution baked in.
  • Spring Batch: Batch execution framework with native support for distributed task execution and orchestration built in.
  • Jenkins: SaaS or self-hosted automation server with plugins to perform remote task execution, reporting, and orchestrated tasks.