Bioinformatics Project Template

Author

David Martín-Gálvez

Published

January 12, 2026

1 Introduction

This document describes a generic, flexible template to structure any bioinformatics analysis project. It is designed to work across different project types (RNA-seq, ChIP-seq, metagenomics, variant calling, proteomics, single-cell analysis, long-read sequencing, computational chemistry, etc.) and different computing environments (local workstations, HPC clusters, cloud).

The template implements:

  • A standard folder layout for organizing code, data, and results.
  • Minimal software requirements and installation instructions.
  • Initialization scripts to quickly scaffold new projects.
  • Git + SSH key setup and basic cluster access.
  • Separation of shared configuration and machine-specific configuration.
  • A portable path system based on BASE_DIR and REPO_DIR.
  • A repository that stores scripts and configuration only, never large data.
  • A project structure designed for use on local machines, cluster, and NAS.
  • SLURM array job architecture for scalable HPC execution.
  • Multi-environment storage strategy (macOS, HOME, fstrat, NAS).

1.1 Configuration System

The configuration system consists of:

  • config.yaml → shared, versioned configuration; same for all users and machines.
  • config_local.yaml → private, unversioned configuration; machine-specific.
  • load_config.sh → loader that reads both YAML files with yq and exposes variables to Bash/SLURM.

This ensures complete reproducibility, portability, and clarity across different computing environments.

1.2 Key Principles

  • Scripts never hard-code paths — all paths are derived from configuration.
  • Data lives outside the repository — only scripts and configuration are version-controlled.
  • One-sample Bash scripts can be tested locally or used in SLURM array jobs.
  • Clear separation between DATA (inputs) and ANALYSES (outputs).
  • Portable across environments — only BASE_DIR and REPO_DIR change between machines.

1.3 Why This Template is Generic

This template is not specific to any analysis type. Instead, it provides:

  • Flexible naming for scripts and directories (adapt 01_analysis_one_sample.sh to your analysis).
  • Customizable paths in config.yaml (use raw_data, sequences, alignments, or your own naming).
  • Agnostic architecture (works with any tool: FastQC, STAR, samtools, bcftools, Mothur, QIIME2, R, Python, etc.).
  • Scalable design (from single-machine to HPC array jobs).

Examples of how to adapt for different projects:

Project Type      Sample Definition   Input Files       Analysis Steps
RNA-seq           Sample ID           FASTQ files       QC → Trim → Map → Count → DE
ChIP-seq          Replicate ID        BAM files         QC → Peak call → Annotate
Metagenomics      Sample ID           FASTQ files       QC → Denoise → Taxonomy → Abundance
Variant calling   Sample ID           FASTQ/BAM files   QC → Map → Call → Annotate → Filter
Single-cell       Cell barcode        h5ad/mtx files    QC → Normalize → Cluster → Annotate

1.4 How to Use This Template

This repository is configured as a GitHub Template Repository, which allows you to create new projects without copying the full Git history.

1.4.1 Creating a New Project from Template

  1. On GitHub:

    • Navigate to this page.
    • Click the green “Use this template” button (top right)
    • Select “Create a new repository”
    • Fill in:
      • Owner: Your username or organization
      • Repository name: Your project name (e.g., sparrow-rnaseq-analysis)
      • Description: Brief project description
      • Visibility: Public or Private (your choice)
    • Click “Create repository”
  2. Clone your new repository:

    git clone git@github.com:<your-username>/<your-project>.git
    cd <your-project>
  3. Customize for your project:

    • Edit README.md: Update title, description, and project-specific information
    • Edit config/config.yaml: Adjust paths and parameters for your analysis type
    • Create config/config_local.yaml: Add your machine-specific paths
    • Update docs/: Rename QMD files and customize documentation
    • Adapt scripts in scripts/<step>/: Rename and modify for your specific analysis steps
    • Create sample list: Generate config/samples_names.txt with your sample identifiers
  4. Initialize project structure:

    # Specify parent path and analysis folder name
    bash scripts/utils/init_project.sh -p /path/to/parent -n analysis_name
    
    # Example: Create analysis folder 'run_20260112' in your project area
    bash scripts/utils/init_project.sh -p /fstrat/username/my-project -n run_20260112

1.4.2 Advantages of Template Repository vs Manual Cloning

Aspect              Template Repository     Manual Clone
Git history         ✅ Clean start          ❌ Inherits all template history
GitHub connection   ✅ Automatic            ❌ Requires remote reset
Simplicity          ✅ One-click creation   ❌ Manual rm -rf .git needed
Updates             ✅ Independent          ❌ Conflicts with template updates
Best practice       ✅ Recommended          ❌ Not recommended

1.4.3 After Creating Your Project

  1. Configure Git identity (if not already set globally):

    git config user.name "Your Name"
    git config user.email "your.email@example.com"
  2. Set up SSH keys (see Section 4 for detailed instructions)

  3. Initialize the analysis folder (see Section 10, Initialize the Project, for detailed instructions):

    # Create your analysis folder with appropriate name
    bash scripts/utils/init_project.sh -p /path/to/parent -n analysis_folder

    This will automatically create config/config_local.yaml with correct paths.

  4. Start developing: Add your data paths, customize scripts, and begin analysis


2 Minimal Software

2.1 Required Software

  • Shell & CLI:
    • bash ≥ 4.0, coreutils, awk, sed, grep, rsync, tar, gzip
  • YAML:
    • yq (mikefarah/yq) to read YAML with bash
  • Documentation:
    • Quarto (optional but recommended): https://quarto.org
  • VCS:
    • Git ≥ 2.30
  • Optional (depending on analysis):
    • Conda/Mamba for reproducible environments
    • Snakemake or Nextflow for workflows
    • R (≥4.2) and/or Python (≥3.10)

2.2 Quick Install (macOS)

First, install Homebrew if you don’t have it (https://brew.sh):

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then install the required packages:

brew install yq git quarto rsync gnu-sed gawk

3 Initialize Repository

3.1 Initialize and Connect to the Remote

Initialize and connect to remote:

git init
# create the remote on GitHub/GitLab, then add it
# (SSH form once keys are set up: git@github.com:<user>/<repo>.git)
git remote add origin https://github.com/<user>/<repo>.git
git config user.name "Your Name"
git config user.email "you@email"
git add .
git commit -m "Initial template"
git push -u origin main

3.2 Configure .gitignore

Ignore local files:

config/config_local.yaml
analyses/
data/processed/
logs/

Add to .gitignore:

echo "config/config_local.yaml" >> .gitignore

Important: config_local.yaml must never be committed as it contains machine-specific paths.
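The entries above can be appended in one pass. A minimal sketch that skips lines already present, so it is safe to re-run:

```shell
# Append each ignore rule to .gitignore unless it is already there
for entry in "config/config_local.yaml" "analyses/" "data/processed/" "logs/"; do
    grep -qxF "$entry" .gitignore 2>/dev/null || echo "$entry" >> .gitignore
done
```

Re-running the loop leaves .gitignore unchanged, so it can live in a setup script.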


4 SSH Keys (Git, Cluster, and NAS)

SSH keys enable secure, password-free authentication to remote servers. This section covers setup for GitHub, HPC clusters, and NAS storage.

4.1 Generate SSH Keys

Generate an ED25519 key pair (modern, secure standard):

ssh-keygen -t ed25519 -C "your-email@example.com" -f ~/.ssh/id_ed25519
# Start the SSH agent (eval applies the environment it prints) and add the key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# Display public key (copy this to remote servers)
cat ~/.ssh/id_ed25519.pub

Important: Keep ~/.ssh/id_ed25519 private. Only share ~/.ssh/id_ed25519.pub.

4.2 Add Public Key to Remote Services

4.2.1 GitHub/GitLab SSH Setup

  1. Copy your public key from above (cat ~/.ssh/id_ed25519.pub)
  2. Add to GitHub:
    • Go to GitHub → Settings → SSH and GPG keys
    • Click “New SSH key” and paste your public key
  3. Test connection:
ssh -T git@github.com
# Expected: "Hi <username>! You've successfully authenticated..."

4.2.2 HPC Cluster SSH Setup

Add your public key to the cluster’s ~/.ssh/authorized_keys:

# On your local machine, copy key to cluster
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@cluster.example.com
# Or manually:
cat ~/.ssh/id_ed25519.pub | ssh user@cluster.example.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

4.2.3 NAS Storage SSH Setup (Restricted Access)

For security on shared NAS systems, create a restricted user with limited permissions:

# On NAS (typically done by admin)
# Create user without an interactive login shell
useradd -m -s /usr/sbin/nologin -c "Restricted SSH user" data_ssh

# Set ACL permissions to restrict access to specific folders
setfacl -m u:data_ssh:rx /volume/shared_data/
# This user can only read/execute, not write to parent directories

4.3 Configure SSH Client (~/.ssh/config)

Create or edit ~/.ssh/config to manage multiple SSH connections with appropriate settings for HPC:

# GitHub
Host github.com
    HostName github.com
    User git
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes

# HPC Cluster (Picasso, XSEDE, etc.)
Host cluster
    HostName cluster.example.com
    User your_username
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
    # Keep connection alive (prevent disconnections)
    ServerAliveInterval 60
    ServerAliveCountMax 5
    # Prevent "Too many authentication methods" error
    PreferredAuthentications publickey
    # Required for VS Code Remote-SSH
    ForwardAgent yes

# NAS Storage
Host nas
    HostName nas.example.com
    User data_ssh
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
    StrictHostKeyChecking accept-new

4.4 Test SSH Connections

# Test cluster
ssh cluster
# Should connect without password

# Test NAS
ssh nas
# Should connect without password

# Test GitHub
ssh -T git@github.com

4.5 Securing Permissions

Ensure SSH directory permissions are correct:

# On both local and remote systems
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_ed25519
chmod 644 ~/.ssh/id_ed25519.pub
chmod 600 ~/.ssh/authorized_keys
chmod 644 ~/.ssh/config

5 Repository Structure

This repository contains only scripts and configuration, never data or results.

bioinfo-project-repo/
│
├── config/
│   ├── config.yaml              # Shared, versioned, portable paths
│   ├── config_local.yaml        # Local-only, ignored by Git
│   └── samples_names.txt         # Sample identifiers list
│
├── scripts/
│   ├── 01_qc/                   # Quality control analysis scripts
│   │   ├── slurm_out/           # SLURM log files
│   │   ├── 01_fastqc_one_sample.sh
│   │   ├── 01_fastqc_array.sbatch
│   │   └── README.md
│   ├── templates/               # Reusable script templates
│   │   ├── analysis_script_template.sh
│   │   ├── array_job_template.sbatch
│   │   ├── job_slurm_template.sh
│   │   └── slurm_out/
│   └── utils/
│       ├── init_project.sh      # Scaffolds folders and templates
│       ├── load_config.sh       # Loads YAML configuration
│       └── test_load_config.sh
│
├── docs/
│   ├── bioinformatics_project_template.qmd  # Project documentation
│   ├── bioinformatics_project_template.html
│   └── index.html
│
├── logs/                        # Job logs and outputs
│
└── README.md

5.1 Key Principles

  • No data and no results in the repository.
  • Scripts are portable and reusable across machines.
  • Configuration is split between shared (config.yaml) and local (config_local.yaml).
  • config_local.yaml is never committed (add to .gitignore).

6 Configuration System

6.1 config.yaml (Shared, Versioned)

Path: config/config.yaml

This file contains only relative paths and pipeline parameters, never absolute paths.

Generic example (adapt to your analysis type):

paths:
  fastq_raw:       "1_DATA/FASTQ"
  fastq_processed: "2_ANALYSES/Results/02_preprocessing"
  reference:       "1_DATA/REFERENCE"
  analyses:        "2_ANALYSES"

cluster:
  threads: 8
  memory: "32G"
  queue: "short"
  time_limit: "02:00:00"

6.1.1 Key Rules

  • No absolute paths.
  • Stable across macOS, HOME, fstrat, NAS.
  • Versioned in Git.
  • Customize path names for your specific analysis type.

6.2 config_local.yaml (Machine-Specific, Not Versioned)

Path: config/config_local.yaml

Examples:

6.2.1 macOS

base_dir: "/Users/david/my-analysis"
repo_dir: "/Users/david/Repositories/my-analysis-repo"

6.2.2 Cluster HOME

base_dir: "/mnt/home/users/.../my-analysis"
repo_dir: "/mnt/home/users/.../my-analysis-repo"

6.2.3 fstrat (execution workspace)

base_dir: "/fstrat/dmartin/my-analysis/run_20251209"
repo_dir: "/mnt/home/users/dba_001_uma/dmartin/my-analysis-repo"

6.2.4 Add to .gitignore:

echo "config/config_local.yaml" >> .gitignore

This file must never be committed.

6.3 load_config.sh

Path: scripts/utils/load_config.sh

#!/bin/bash

MAIN_CONFIG="config/config.yaml"
LOCAL_CONFIG="config/config_local.yaml"

if [[ ! -f "$MAIN_CONFIG" ]]; then
    echo "ERROR: Missing config.yaml" >&2; exit 1
fi
if [[ ! -f "$LOCAL_CONFIG" ]]; then
    echo "ERROR: Missing config_local.yaml" >&2; exit 1
fi

# Machine-specific base directory
BASE_DIR=$(yq -r '.base_dir' "$LOCAL_CONFIG")
REPO_DIR=$(yq -r '.repo_dir' "$LOCAL_CONFIG")

# Relative paths from config.yaml
FASTQ_RAW_REL=$(yq -r '.paths.fastq_raw' "$MAIN_CONFIG")
FASTQ_PROC_REL=$(yq -r '.paths.fastq_processed' "$MAIN_CONFIG")
REF_REL=$(yq -r '.paths.reference' "$MAIN_CONFIG")
ANALYSES_REL=$(yq -r '.paths.analyses' "$MAIN_CONFIG")

# Build absolute paths
FASTQ_RAW_DIR="$BASE_DIR/$FASTQ_RAW_REL"
FASTQ_PROC_DIR="$BASE_DIR/$FASTQ_PROC_REL"
REFERENCE_DIR="$BASE_DIR/$REF_REL"
ANALYSES_DIR="$BASE_DIR/$ANALYSES_REL"

# Cluster parameters
THREADS=$(yq -r '.cluster.threads' "$MAIN_CONFIG")
MEMORY=$(yq -r '.cluster.memory' "$MAIN_CONFIG")
QUEUE=$(yq -r '.cluster.queue' "$MAIN_CONFIG")
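To illustrate the composition performed above, here is the same logic with the yq lookups replaced by literal placeholder values:

```shell
# Pure-bash illustration of the path composition (no yq needed)
BASE_DIR="/fstrat/user/my-analysis"     # would come from config_local.yaml
FASTQ_RAW_REL="1_DATA/FASTQ"            # would come from config.yaml
FASTQ_RAW_DIR="$BASE_DIR/$FASTQ_RAW_REL"
echo "$FASTQ_RAW_DIR"
```

Changing base_dir in config_local.yaml is therefore enough to relocate every derived path on a new machine.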

7 Project Structure (Data + Results)

This structure exists outside the repository, typically on NAS or local storage. For instance:

my-analysis/
│
├── 1_DATA/
│   ├── FASTQ/                # Original FASTQ files (read-only)
│   │   ├── sample_1_R1.fastq.gz
│   │   ├── sample_1_R2.fastq.gz
│   │   ├── sample_2_R1.fastq.gz
│   │   └── sample_2_R2.fastq.gz
│   │
│   ├── REFERENCE/            # Reference data and indices
│   │   ├── GENOME/           # Reference sequences (FASTA)
│   │   ├── ANNOTATION/       # Gene annotations (GTF/GFF3/BED)
│   │   └── INDEXES/          # Pre-built indices (tool-specific)
│
├── 2_ANALYSES/
│   ├── Scripts/              # symlink to repository scripts
│   └── Results/
│       ├── 01_qc/            # Quality control outputs (FastQC, etc.)
│       ├── 02_preprocessing/ # Processed FASTQ, trimming outputs
│       ├── 03_main_analysis/ # Alignment, quantification, etc.
│       ├── 04_tables/        # Results tables
│       └── 05_figures/       # Plots and visualizations

Note: create these folders incrementally, as analysis steps are executed and actually need them. There is no need to pre-create the entire structure; scripts can create required paths on demand (for example, init_project.sh and analysis scripts create 2_ANALYSES/Results/<step> when appropriate). This avoids empty directories, reduces organizational errors, and improves traceability.
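The on-demand creation described above amounts to an idempotent mkdir -p inside each script; a minimal sketch:

```shell
# Create a step's output directory only when the step actually runs.
# mkdir -p is idempotent: re-running never fails or clobbers existing files.
OUTDIR="2_ANALYSES/Results/01_qc"
mkdir -p "$OUTDIR"
```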

7.1 Key Principles

  • DATA (1_DATA/): Contains inputs, normally not modified by the pipeline.
  • ANALYSES (2_ANALYSES/): Contains all outputs, can be regenerated.
  • Clear separation enables reproducibility and clean HPC workflows.

8 Script Architecture

8.1 One-Sample Bash Scripts

Path: scripts/<step>/<step>_one_sample.sh (e.g., scripts/01_qc/01_fastqc_one_sample.sh)

Each script processes a single sample and is independent of SLURM. This enables:

  • Local testing: Test on macOS before cluster submission
  • Reusability: Same script runs locally or in HPC arrays
  • Debugging: Easy to troubleshoot single-sample issues

8.1.1 Usage

# Test locally
bash scripts/01_qc/01_fastqc_one_sample.sh sample_1

# This reads from: 1_DATA/FASTQ/sample_1*.fastq.gz
# And writes to:  2_ANALYSES/Results/01_qc/

8.1.2 Example Implementation

#!/bin/bash
set -euo pipefail

SAMPLE_ID="$1"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$SCRIPT_DIR/../.."
cd "$REPO_ROOT" || exit 1

source scripts/utils/load_config.sh

# Create output directory
OUTDIR="$ANALYSES_DIR/Results/01_qc"
mkdir -p "$OUTDIR"

# Load module if available
if command -v module &> /dev/null; then
    module load fastqc
fi

# Run analysis
fastqc -t "$THREADS" -o "$OUTDIR" "$FASTQ_RAW_DIR/${SAMPLE_ID}"*.fastq.gz

echo "FastQC completed for $SAMPLE_ID"

8.1.3 Sample List Format

Create config/samples_names.txt with one sample ID per line:

sample_1
sample_2
sample_3

Array jobs use this file to determine:

  • Number of tasks: --array=1-N where N = line count
  • Sample assignment: Each task gets one line based on SLURM_ARRAY_TASK_ID
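One way to generate this file is to derive IDs from the R1 FASTQ file names. The directory and the `_R1.fastq.gz` suffix below are assumptions, so adapt them to your naming scheme (the touch lines only fabricate demo inputs):

```shell
# Demo setup only: fabricate two empty R1 files to list
FASTQ_RAW_DIR="demo_fastq"
mkdir -p "$FASTQ_RAW_DIR" config
touch "$FASTQ_RAW_DIR/sample_1_R1.fastq.gz" "$FASTQ_RAW_DIR/sample_2_R1.fastq.gz"

# Strip directory and suffix from each R1 file: one sample ID per line
for f in "$FASTQ_RAW_DIR"/*_R1.fastq.gz; do
    basename "$f" _R1.fastq.gz
done | sort -u > config/samples_names.txt
```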


8.2 SLURM Array Jobs

Path: scripts/<step>/<step>_array.sbatch (e.g., scripts/01_qc/01_fastqc_array.sbatch)

Array jobs run one-sample scripts in parallel on HPC clusters. The structure is simple:

scripts/<step>/
├── slurm_out/                # Log files only
├── <step>_one_sample.sh      # Bash script - process one sample
└── <step>_array.sbatch       # SLURM script - submit array job

8.2.1 Quick Start

# Submit all samples as parallel tasks (from repository root)
cd /path/to/repository
sbatch scripts/01_qc/01_fastqc_array.sbatch

# Monitor
squeue -j <job_id>
tail -f scripts/01_qc/slurm_out/fastqc_array_<job_id>_1.out

8.2.2 How It Works

The array job:

  1. Reads sample names from config/samples_names.txt
  2. Creates N parallel SLURM tasks (N = sample count)
  3. Each task calls: bash 01_fastqc_one_sample.sh <SAMPLE_ID>
  4. Results are combined in 2_ANALYSES/Results/01_qc/
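Rather than editing --array=1-N by hand, the array size can be computed from the sample list at submission time; a sketch (the echo stands in for the real sbatch call, and the printf fabricates a demo list):

```shell
# Demo list; in practice this is config/samples_names.txt
printf 'sample_1\nsample_2\nsample_3\n' > samples_names.txt

# Count lines; arithmetic expansion strips any whitespace wc may emit
N=$(( $(wc -l < samples_names.txt) ))
echo "would run: sbatch --array=1-${N} scripts/01_qc/01_fastqc_array.sbatch"
```

Passing --array on the command line overrides the #SBATCH --array directive in the script.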

8.2.3 Example SLURM Script

#!/bin/bash
#SBATCH -J fastqc_array
#SBATCH -o scripts/01_qc/slurm_out/%x_%A_%a.out
#SBATCH -e scripts/01_qc/slurm_out/%x_%A_%a.err
#SBATCH -c 8
#SBATCH --mem=16G
#SBATCH -t 3-00:00:00
#SBATCH --constraint=cal
#SBATCH --array=1-N      # N = number of samples

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$SCRIPT_DIR/../.."
cd "$REPO_ROOT" || exit 1

source scripts/utils/load_config.sh

SAMPLES_FILE="config/samples_names.txt"
mapfile -t SAMPLES < "$SAMPLES_FILE"
TASK_ID=$((SLURM_ARRAY_TASK_ID - 1))
SAMPLE_ID="${SAMPLES[$TASK_ID]}"

echo "Processing sample: $SAMPLE_ID (task ${SLURM_ARRAY_TASK_ID})"

# Call one-sample script
bash scripts/01_qc/01_fastqc_one_sample.sh "$SAMPLE_ID"

8.2.4 Key Features

  • One sample per task: SLURM_ARRAY_TASK_ID indexes into sample list
  • Parallel execution: All samples run simultaneously on HPC
  • Simple submission: sbatch scripts/01_qc/01_fastqc_array.sbatch
  • Reusability: Same one-sample script for local testing and HPC
  • Log separation: Each task has its own log file in slurm_out/

8.2.5 Important: Log Directory Requirements

⚠️ Critical: SLURM requires the log directory to exist before job submission.

  • The repository includes .gitkeep files in each slurm_out/ directory
  • This ensures slurm_out/ exists when you clone the repository
  • SLURM cannot create the log directory automatically
  • If the directory doesn’t exist, job submission will fail

Why .gitkeep? Git doesn’t track empty directories. The .gitkeep file forces Git to include the slurm_out/ directory in the repository, ensuring it exists for all users.

For new analysis steps: When creating a new step (e.g., scripts/02_preprocessing/), always:

mkdir -p scripts/02_preprocessing/slurm_out
touch scripts/02_preprocessing/slurm_out/.gitkeep
git add scripts/02_preprocessing/slurm_out/.gitkeep

9 Storage Architecture (macOS, HOME, fstrat, NAS)

9.1 Summary

  • macOS: Development, testing, documentation.
  • HOME: Repository + lightweight files.
  • fstrat: High-performance execution workspace (temporary).
  • NAS: Permanent project archive.

All scripts behave identically because they rely on BASE_DIR and REPO_DIR.

9.2 fstrat Execution Workspace (Per-Run)

Within each run directory:

/fstrat/dmartin/my-analysis/run_20251209/
│
├── 1_DATA/        # minimal replicated inputs
└── 2_ANALYSES/
    ├── Scripts/   # symlink to repository scripts
    └── Results/   # all outputs generated here

Create symlink:

ln -s "$REPO_DIR/scripts" "$BASE_DIR/2_ANALYSES/Scripts/scripts_repo"

This ensures traceability without duplicating code.
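A minimal sketch of that symlink step with placeholder directories; ln -sfn replaces a stale link instead of failing, and an absolute target keeps the link valid from any working directory:

```shell
# Placeholder layout standing in for REPO_DIR and BASE_DIR
REPO_DIR="$PWD/demo_repo"
BASE_DIR="$PWD/demo_base"
mkdir -p "$REPO_DIR/scripts" "$BASE_DIR/2_ANALYSES/Scripts"

# -s symbolic, -f force-replace an existing link, -n treat link name as a file
ln -sfn "$REPO_DIR/scripts" "$BASE_DIR/2_ANALYSES/Scripts/scripts_repo"
```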

9.3 Synchronizing Results from fstrat to NAS

After execution:

rsync -av /fstrat/.../2_ANALYSES/ /NAS/my-analysis/2_ANALYSES/

This pattern allows:

  • Fast execution on cluster.
  • Durable storage on NAS.
  • Clean reproducibility.

9.4 Environment-Dependent Configuration

Only two variables change across machines:

base_dir: "path/to/data_and_results"   # fstrat, NAS, or local path
repo_dir: "path/to/repository"         # HOME on cluster, local path on macOS

All scripts use these paths indirectly through load_config.sh.


10 Initialize the Project

Use scripts/utils/init_project.sh to create folders and templates:

# View help and options
bash scripts/utils/init_project.sh --help

# Initialize project with specified parent path and folder name
bash scripts/utils/init_project.sh -p <parent-path> -n <folder-name>

# Example: Create analysis folder 'run_20260112' in '/fstrat/dmartin/sparrow-analysis'
bash scripts/utils/init_project.sh -p /fstrat/dmartin/sparrow-analysis -n run_20260112

Required Arguments:

  • -p, --parent-path PATH: Parent directory where the analysis folder will be created
  • -n, --folder-name NAME: Name of the analysis folder (e.g., run_20260112, experiment_01)

What the script does:

  1. Creates the analysis folder structure:

    <parent-path>/<folder-name>/
    ├── 1_DATA/              # Minimal replicated inputs
    └── 2_ANALYSES/
        ├── Scripts/         # Symlink to repository scripts
        └── Results/         # All outputs generated here
  2. Creates a symbolic link: 2_ANALYSES/Scripts/ → repository scripts/ folder

  3. Creates repository configuration files if missing:

    • config/config.yaml → shared, versioned configuration
    • config/config_local.yaml → private, machine-specific configuration
  4. Creates script templates in scripts/templates/:

    • array_job_template.sbatch → SLURM array job template
    • analysis_script_template.sh → Generic one-sample script template

Structure Philosophy:

  • Repository folder (this repo): Contains scripts, config, docs (version-controlled)
  • Analysis folder (created by script): Contains data and results (not version-controlled)
  • Separation of concerns: Code vs. Data
  • Portability: Only analysis folder location changes between machines
  • Symlink advantage: Scripts always up-to-date with repository changes

11 Creating New Analysis Steps

To add a new analysis step (e.g., 02_preprocessing):

11.1 Create Directory Structure

mkdir -p scripts/02_preprocessing/slurm_out
touch scripts/02_preprocessing/slurm_out/.gitkeep
git add scripts/02_preprocessing/slurm_out/.gitkeep

Note: The .gitkeep file is required so Git tracks the slurm_out/ directory. SLURM needs this directory to exist before job submission.

11.2 Create One-Sample Script

Copy and adapt the template:

cp scripts/templates/analysis_script_template.sh scripts/02_preprocessing/02_trim_one_sample.sh

Edit the script to:

  • Process your specific analysis (trimming, alignment, etc.)
  • Use appropriate input/output paths from load_config.sh
  • Load required modules or tools

11.3 Create SLURM Array Script

Copy and adapt the template:

cp scripts/templates/array_job_template.sbatch scripts/02_preprocessing/02_trim_array.sbatch

Update:

  • Job name: #SBATCH -J trim_array
  • Array size: #SBATCH --array=1-N (N = sample count)
  • Script call: bash scripts/02_preprocessing/02_trim_one_sample.sh "$SAMPLE_ID"

11.4 Test Locally

bash scripts/02_preprocessing/02_trim_one_sample.sh sample_1

11.5 Submit to HPC

# Execute from repository root
cd /path/to/repository
sbatch scripts/02_preprocessing/02_trim_array.sbatch

Important: Always execute sbatch from the repository root directory. Log paths in SLURM directives (#SBATCH -o and -e) are relative to the repository root.


12 Best Practices

  • Always source scripts/utils/load_config.sh at the start of analysis scripts.
  • Never modify 1_DATA/ during analyses; write outputs to 2_ANALYSES/.
  • Track only code and generic config; keep large data out of the repo.
  • Document in docs/ and publish reports with Quarto.
  • Use SLURM array jobs for scalable parallel processing.
  • Test one-sample scripts locally before submitting array jobs.
  • Synchronize results from fstrat to NAS after each run.

13 Summary

  • config.yaml → shared, versioned, general settings.
  • config_local.yaml → local, private, machine-specific.
  • load_config.sh → merges both and exposes variables.
  • All scripts use source load_config.sh for consistent paths.
  • One-sample Bash scripts can be tested locally or used in SLURM arrays.
  • Clear separation between DATA (inputs) and ANALYSES (outputs).
  • Portable across macOS ↔︎ HOME ↔︎ fstrat ↔︎ NAS.
  • Ensures full reproducibility, separation of concerns, and portability.

14 Next Steps

Before you start:

  • Configure SSH keys for secure, password-free access to GitHub, cluster, and NAS (see Section 4).

  • Set up .ssh/config to manage multiple connections efficiently.

  • Install VS Code Remote-SSH for seamless cluster interaction.

For your project:

  • Adapt scripts to your specific analysis pipeline.

  • Add workflows (Snakemake/Nextflow) for complex dependencies.

  • Create reproducible environments (environment.yml, renv.lock, or requirements.txt).

  • Document analysis steps and parameters in Quarto reports.