<aside> 📢 Target audience: Data scientists of any level interested in an easy, robust, reproducible, language- and platform-agnostic setup for data processing, analysis, and visualization.

</aside>

📖 Storytime

When I joined a bioinformatics (= data science performed on biomedical data) research lab as a predoctoral fellow in 2020, I was told that we had complete freedom in choosing a computational setup for our work. There were no restrictions concerning programming languages, development environments, or specific software. As so often with freedom of choice, I felt overwhelmed by all the potential options. I started talking to my new colleagues and found out that almost everyone had set up their own system, with some overlap between them.

After some time on Google and discussions about best practices, I came up with my own combination of tools, which forms the basis of the data science setup presented here.

Shortly after me, a postdoctoral scientist joined our lab and faced the same challenge. After sharing my setup and watching him go around asking for opinions to come up with his own, I realized that a simple blog post could solve this problem for future lab members and everyone else interested in data science. So, here we are. I hope you enjoy the ride and find it helpful.

<aside> 📌 I will keep improving this blog post over time, according to the feedback I receive.

</aside>

<aside> ❗ TL;DR: This is a tutorial and resource describing an easy, robust, reproducible, language- and platform-agnostic setup for data processing, analysis, and visualization. It combines package and environment management (conda); software development with an integrated development environment (JupyterLab); distribution and version control (GitHub); workflow management (Snakemake); and containerization (Singularity).

</aside>

Introduction

In this blog post, I will describe the general components of a setup to perform data science adhering to the following requirements. A good setup should...

...be easy & fast to set up, use and maintain.
...be compatible with widely used infrastructure (e.g., common operating systems).
...be agnostic of (programming) language.
...consist of components that seamlessly work together.
...guarantee reproducibility.
...facilitate collaboration.
...be future-proof (i.e., robust toward changes and open to extensions, e.g., by being open source).

Every section will contain a formal definition 🔍, multiple solutions ✅, my weapon of choice 🏁 (including arguments as to why ❓), concrete instructions for the setup and usage 📋, synergistic integration with the other components 🧩, a minimal working example 💻 (which is gradually extended over the course of this post), resources 📚, and a hidden gem 💎.

The order of the sections is deliberate: every component builds on the previous one and integrates synergistically into the setup, and the examples follow the same progression.

<aside> 📌 Follow along in the recommended order to set up your own system (as a tutorial), or cherry-pick sections for specific applications (as a resource). The post should support both.

</aside>