Bioinformatics Project Template
1 Introduction
This document describes a generic, flexible template to structure any bioinformatics analysis project. It is designed to work across different project types (RNA-seq, ChIP-seq, metagenomics, variant calling, proteomics, single-cell analysis, long-read sequencing, computational chemistry, etc.) and different computing environments (local workstations, HPC clusters, cloud).
The template implements:
- A standard folder layout for organizing code, data, and results.
- Minimal software requirements and installation instructions.
- Initialization scripts to quickly scaffold new projects.
- Git + SSH key setup and basic cluster access.
- Separation of shared configuration and machine-specific configuration.
- A portable path system based on BASE_DIR and REPO_DIR.
- A repository that stores scripts and configuration only, never large data.
- A project structure designed for use on local machines, cluster, and NAS.
- SLURM array job architecture for scalable HPC execution.
- Multi-environment storage strategy (macOS, HOME, fstrat, NAS).
1.1 Configuration System
The configuration system consists of:
- config.yaml → shared, versioned configuration; same for all users and machines.
- config_local.yaml → private, unversioned configuration; machine-specific.
- load_config.sh → loader that reads both YAML files with yq and exposes variables to Bash/SLURM.
This ensures complete reproducibility, portability, and clarity across different computing environments.
1.2 Key Principles
- Scripts never hard-code paths — all paths are derived from configuration.
- Data lives outside the repository — only scripts and configuration are version-controlled.
- One-sample Bash scripts can be tested locally or used in SLURM array jobs.
- Clear separation between DATA (inputs) and ANALYSES (outputs).
- Portable across environments — only BASE_DIR and REPO_DIR change between machines.
1.3 Why This Template is Generic
This template is not specific to any analysis type. Instead, it provides:
- Flexible naming for scripts and directories (adapt 01_analysis_one_sample.sh to your analysis).
- Customizable paths in config.yaml (use raw_data, sequences, alignments, or your own naming).
- Agnostic architecture (works with any tool: FastQC, STAR, samtools, bcftools, Mothur, QIIME2, R, Python, etc.).
- Scalable design (from single-machine to HPC array jobs).
Examples of how to adapt for different projects:
| Project Type | Sample Definition | Input Files | Analysis Steps |
|---|---|---|---|
| RNA-seq | Sample ID | FASTQ files | QC → Trim → Map → Count → DE |
| ChIP-seq | Replicate ID | BAM files | QC → Peak call → Annotate |
| Metagenomics | Sample ID | FASTQ files | QC → Denoise → Taxonomy → Abundance |
| Variant calling | Sample ID | FASTQ/BAM files | QC → Map → Call → Annotate → Filter |
| Single-cell | Cell barcode | h5ad/mtx files | QC → Normalize → Cluster → Annotate |
1.4 How to Use This Template
This repository is configured as a GitHub Template Repository, which allows you to create new projects without copying the full Git history.
1.4.1 Creating a New Project from Template
On GitHub:
- Navigate to the template repository page on GitHub.
- Click the green “Use this template” button (top right)
- Select “Create a new repository”
- Fill in:
- Owner: Your username or organization
- Repository name: Your project name (e.g., sparrow-rnaseq-analysis)
- Description: Brief project description
- Visibility: Public or Private (your choice)
- Click “Create repository”
Clone your new repository:
git clone git@github.com:<your-username>/<your-project>.git
cd <your-project>
Customize for your project:
- Edit README.md: Update title, description, and project-specific information
- Edit config/config.yaml: Adjust paths and parameters for your analysis type
- Create config/config_local.yaml: Add your machine-specific paths
- Update docs/: Rename QMD files and customize documentation
- Adapt scripts in scripts/<step>/: Rename and modify for your specific analysis steps
- Create sample list: Generate config/samples_names.txt with your sample identifiers
Initialize project structure:
# Specify parent path and analysis folder name
bash scripts/utils/init_project.sh -p /path/to/parent -n analysis_name
# Example: Create analysis folder 'run_20260112' in your project area
bash scripts/utils/init_project.sh -p /fstrat/username/my-project -n run_20260112
1.4.2 Advantages of Template Repository vs Manual Cloning
| Aspect | Template Repository | Manual Clone |
|---|---|---|
| Git history | ✅ Clean start | ❌ Inherits all template history |
| GitHub connection | ✅ Automatic | ❌ Requires remote reset |
| Simplicity | ✅ One-click creation | ❌ Manual rm -rf .git needed |
| Updates | ✅ Independent | ❌ Conflicts with template updates |
| Best practice | ✅ Recommended | ❌ Not recommended |
1.4.3 After Creating Your Project
Configure Git identity (if not already set globally):
git config user.name "Your Name"
git config user.email "your.email@example.com"
Set up SSH keys (see Section 4 for detailed instructions)
Initialize analysis folder (see detailed instructions in Section on Initialize the Project):
# Create your analysis folder with appropriate name
bash scripts/utils/init_project.sh -p /path/to/parent -n analysis_folder
This will automatically create config/config_local.yaml with correct paths.
Start developing: Add your data paths, customize scripts, and begin analysis
2 Minimal Software
2.1 Required Software
- Shell & CLI:
  - bash ≥ 4.0, coreutils, awk, sed, grep, rsync, tar, gzip
- YAML:
  - yq (mikefarah/yq) to read YAML from Bash
- Documentation:
- Quarto (optional but recommended): https://quarto.org
- VCS:
- Git ≥ 2.30
- Optional (depending on analysis):
- Conda/Mamba for reproducible environments
- Snakemake or Nextflow for workflows
- R (≥4.2) and/or Python (≥3.10)
2.2 Quick Install (macOS)
First, install Homebrew if you don’t have it (https://brew.sh):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then install the required packages:
brew install yq git quarto rsync gnu-sed gawk
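After installation, it is worth confirming that everything is on your PATH. A small check loop (the tool list mirrors Section 2.1; adjust it to your needs):

```shell
# Report which required tools are available and which are missing
for tool in bash git awk sed grep rsync tar gzip yq quarto; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool"
  else
    echo "MISSING: $tool"
  fi
done
```

Run it on every new machine (macOS, cluster login node) before starting an analysis.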
3 Initialize Repository
3.1 Initialize and Connect to the Remote
Initialize and connect to remote:
git init
# create the remote on GitHub/GitLab, then:
git remote add origin https://github.com/<user>/<repo>.git
git config user.name "Your Name"
git config user.email "you@email"
git add .
git commit -m "Initial template"
git branch -M main
git push -u origin main
3.2 Configure .gitignore
Ignore local files:
config/config_local.yaml
analyses/
data/processed/
logs/
Add to .gitignore:
echo "config/config_local.yaml" >> .gitignore
Important: config_local.yaml must never be committed as it contains machine-specific paths.
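You can confirm the rule actually matches with git check-ignore. The snippet below demonstrates it in a throwaway repository so it is safe to run anywhere; in your project, run only the last command from the repository root:

```shell
# Demo in a scratch repo: verify the ignore rule catches config_local.yaml
cd "$(mktemp -d)"
git init -q .
mkdir -p config
echo "config/config_local.yaml" > .gitignore
touch config/config_local.yaml
# Prints the matching rule (e.g. ".gitignore:1:config/config_local.yaml ...")
git check-ignore -v config/config_local.yaml
```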
4 SSH Keys (Git, Cluster, and NAS)
SSH keys enable secure, password-free authentication to remote servers. This section covers setup for GitHub, HPC clusters, and NAS storage.
4.1 Generate SSH Keys
Generate an ED25519 key pair (modern, secure standard):
ssh-keygen -t ed25519 -C "your-email@example.com" -f ~/.ssh/id_ed25519
# Start SSH agent and add key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# Display public key (copy this to remote servers)
cat ~/.ssh/id_ed25519.pub
Important: Keep ~/.ssh/id_ed25519 private. Only share ~/.ssh/id_ed25519.pub.
4.2 Add Public Key to Remote Services
4.2.1 GitHub/GitLab SSH Setup
- Copy your public key from above (
cat ~/.ssh/id_ed25519.pub) - Add to GitHub:
- Go to GitHub → Settings → SSH and GPG keys
- Click “New SSH key” and paste your public key
- Test connection:
ssh -T git@github.com
# Expected: "Hi <username>! You've successfully authenticated..."
4.2.2 HPC Cluster SSH Setup
Add your public key to the cluster’s ~/.ssh/authorized_keys:
# On your local machine, copy key to cluster
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@cluster.example.com
# Or manually:
cat ~/.ssh/id_ed25519.pub | ssh user@cluster.example.com "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
4.2.3 NAS Storage SSH Setup (Restricted Access)
For security on shared NAS systems, create a restricted user with limited permissions:
# On NAS (typically done by admin)
# Create user with SSH-only access, no shell
useradd -m -s /usr/sbin/nologin -c "Restricted SSH user" data_ssh
# Set ACL permissions to restrict access to specific folders
setfacl -m u:data_ssh:rx /volume/shared_data/
# This user can only read/execute, not write to parent directories
4.3 Configure SSH Client (~/.ssh/config)
Create or edit ~/.ssh/config to manage multiple SSH connections with appropriate settings for HPC:
# GitHub
Host github.com
HostName github.com
User git
IdentityFile ~/.ssh/id_ed25519
IdentitiesOnly yes
# HPC Cluster (Picasso, XSEDE, etc.)
Host cluster
HostName cluster.example.com
User your_username
IdentityFile ~/.ssh/id_ed25519
IdentitiesOnly yes
# Keep connection alive (prevent disconnections)
ServerAliveInterval 60
ServerAliveCountMax 5
# Prevent "Too many authentication methods" error
PreferredAuthentications publickey
# Required for VS Code Remote-SSH
ForwardAgent yes
# NAS Storage
Host nas
HostName nas.example.com
User data_ssh
IdentityFile ~/.ssh/id_ed25519
IdentitiesOnly yes
StrictHostKeyChecking accept-new
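To debug an entry without connecting, ssh -G prints the effective options for a host alias after all config files are merged (here using the cluster alias defined above):

```shell
# Show the options SSH would apply for the "cluster" alias; no connection is made
ssh -G cluster | grep -iE '^(hostname|user|identityfile|serveraliveinterval) '
```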
4.4 Test SSH Connections
# Test cluster
ssh cluster
# Should connect without password
# Test NAS
ssh nas
# Should connect without password
# Test GitHub
ssh -T github.com
4.5 VS Code Remote-SSH (Recommended for HPC)
VS Code Remote-SSH allows you to edit files and run terminals directly on the cluster without mounting via SSHFS (which is unstable on HPC systems).
4.5.1 Setup
Install VS Code extensions:
- Remote - SSH
- Remote - SSH: Editing Configuration Files (optional)
VS Code will auto-detect entries from your ~/.ssh/config.
Connect to cluster:
- Press F1 (or Cmd+Shift+P on macOS)
- Type: Remote-SSH: Connect to Host…
- Select your cluster entry
- VS Code opens a new window and installs VS Code Server on the remote host
Open remote folder:
- In the left sidebar, click “Open Folder”
- Navigate to your project directory (e.g., /home/user/projects/my-analysis-repo)
- You can now edit files directly on the cluster
Open remote terminal:
- Press Ctrl+` (or Cmd+` on macOS)
- Commands execute on the cluster, not locally
4.5.2 Benefits of Remote-SSH vs SSHFS
| Feature | Remote-SSH | SSHFS |
|---|---|---|
| Stability | ✅ Highly stable | ❌ Prone to failures |
| Setup | ✅ None needed | ❌ Requires mounting |
| Performance | ✅ Fast | ❌ Slow with many files |
| Permissions | ✅ Correct handling | ❌ Can get corrupted |
| Extensions | ✅ Work normally | ❌ Often fail |
| HPC standard | ✅ Recommended | ❌ Problematic on clusters |
4.6 Securing Permissions
Ensure SSH directory permissions are correct:
# On both local and remote systems
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_ed25519
chmod 644 ~/.ssh/id_ed25519.pub
chmod 600 ~/.ssh/authorized_keys
chmod 644 ~/.ssh/config
5 Repository Structure
This repository contains only scripts and configuration, never data or results.
bioinfo-project-repo/
│
├── config/
│ ├── config.yaml # Shared, versioned, portable paths
│ ├── config_local.yaml # Local-only, ignored by Git
│ └── samples_names.txt # Sample identifiers list
│
├── scripts/
│ ├── 01_qc/ # Quality control analysis scripts
│ │ ├── slurm_out/ # SLURM log files
│ │ ├── 01_fastqc_one_sample.sh
│ │ ├── 01_fastqc_array.sbatch
│ │ └── README.md
│ ├── templates/ # Reusable script templates
│ │ ├── analysis_script_template.sh
│ │ ├── array_job_template.sbatch
│ │ ├── job_slurm_template.sh
│ │ └── slurm_out/
│ └── utils/
│ ├── init_project.sh # Scaffolds folders and templates
│ ├── load_config.sh # Loads YAML configuration
│ └── test_load_config.sh
│
├── docs/
│ ├── bioinformatics_project_template.qmd # Project documentation
│ ├── bioinformatics_project_template.html
│ └── index.html
│
├── logs/ # Job logs and outputs
│
└── README.md
5.1 Key Principles
- No data and no results in the repository.
- Scripts are portable and reusable across machines.
- Configuration is split between shared (config.yaml) and local (config_local.yaml).
- config_local.yaml is never committed (add it to .gitignore).
6 Configuration System
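6.1 config.yaml (Shared, Versioned)
Path: config/config.yaml
The shared file holds relative paths and generic parameters; its exact contents depend on your analysis. The sketch below is illustrative, with keys matching those read by load_config.sh (paths.*, cluster.*):

```yaml
# config/config.yaml: shared, versioned configuration (illustrative values)
paths:
  fastq_raw: "1_DATA/FASTQ"
  fastq_processed: "2_ANALYSES/Results/02_preprocessing"
  reference: "1_DATA/REFERENCE"
  analyses: "2_ANALYSES"
cluster:
  threads: 8
  memory: "16G"
  queue: "normal"
```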
6.2 config_local.yaml (Machine-Specific, Not Versioned)
Path: config/config_local.yaml
Examples:
6.2.1 macOS
base_dir: "/Users/david/my-analysis"
repo_dir: "/Users/david/Repositories/my-analysis-repo"
6.2.2 Cluster HOME
base_dir: "/mnt/home/users/.../my-analysis"
repo_dir: "/mnt/home/users/.../my-analysis-repo"
6.2.3 fstrat (execution workspace)
base_dir: "/fstrat/dmartin/my-analysis/run_20251209"
repo_dir: "/mnt/home/users/dba_001_uma/dmartin/my-analysis-repo"
6.2.4 Add to .gitignore:
echo "config/config_local.yaml" >> .gitignore
This file must never be committed.
6.3 load_config.sh
Path: scripts/utils/load_config.sh
#!/bin/bash
MAIN_CONFIG="config/config.yaml"
LOCAL_CONFIG="config/config_local.yaml"
if [[ ! -f "$MAIN_CONFIG" ]]; then
echo "ERROR: Missing config.yaml" >&2; exit 1
fi
if [[ ! -f "$LOCAL_CONFIG" ]]; then
echo "ERROR: Missing config_local.yaml" >&2; exit 1
fi
# Machine-specific base directory
BASE_DIR=$(yq -r '.base_dir' "$LOCAL_CONFIG")
REPO_DIR=$(yq -r '.repo_dir' "$LOCAL_CONFIG")
# Relative paths from config.yaml
FASTQ_RAW_REL=$(yq -r '.paths.fastq_raw' "$MAIN_CONFIG")
FASTQ_PROC_REL=$(yq -r '.paths.fastq_processed' "$MAIN_CONFIG")
REF_REL=$(yq -r '.paths.reference' "$MAIN_CONFIG")
ANALYSES_REL=$(yq -r '.paths.analyses' "$MAIN_CONFIG")
# Build absolute paths
FASTQ_RAW_DIR="$BASE_DIR/$FASTQ_RAW_REL"
FASTQ_PROC_DIR="$BASE_DIR/$FASTQ_PROC_REL"
REFERENCE_DIR="$BASE_DIR/$REF_REL"
ANALYSES_DIR="$BASE_DIR/$ANALYSES_REL"
# Cluster parameters
THREADS=$(yq -r '.cluster.threads' "$MAIN_CONFIG")
MEMORY=$(yq -r '.cluster.memory' "$MAIN_CONFIG")
QUEUE=$(yq -r '.cluster.queue' "$MAIN_CONFIG")
7 Project Structure (Data + Results)
This structure exists outside the repository, typically on NAS or local storage. For instance:
my-analysis/
│
├── 1_DATA/
│ ├── FASTQ/ # Original FASTQ files (read-only)
│ │ ├── sample_1_R1.fastq.gz
│ │ ├── sample_1_R2.fastq.gz
│ │ ├── sample_2_R1.fastq.gz
│ │ └── sample_2_R2.fastq.gz
│ │
│ ├── REFERENCE/ # Reference data and indices
│ │ ├── GENOME/ # Reference sequences (FASTA)
│ │ ├── ANNOTATION/ # Gene annotations (GTF/GFF3/BED)
│ │ └── INDEXES/ # Pre-built indices (tool-specific)
│
├── 2_ANALYSES/
│ ├── Scripts/ # symlink to repository scripts
│ └── Results/
│ ├── 01_qc/ # Quality control outputs (FastQC, etc.)
│ ├── 02_preprocessing/ # Processed FASTQ, trimming outputs
│ ├── 03_main_analysis/ # Alignment, quantification, etc.
│ ├── 04_tables/ # Results tables
│ └── 05_figures/ # Plots and visualizations
Note: prefer creating these folders incrementally as bioinformatics analysis steps are executed and needed. There is no need to pre-create the entire structure; scripts can create required paths on demand (for example, init_project.sh and analysis scripts create 2_ANALYSES/Results/<step> when appropriate). This approach avoids empty directories, reduces organizational errors, and improves traceability.
7.1 Key Principles
- DATA (
1_DATA/): Contains inputs, normally not modified by the pipeline. - ANALYSES (
2_ANALYSES/): Contains all outputs, can be regenerated. - Clear separation enables reproducibility and clean HPC workflows.
8 Script Architecture
8.1 One-Sample Bash Scripts
Path: scripts/<step>/<step>_one_sample.sh (e.g., scripts/01_qc/01_fastqc_one_sample.sh)
Each script processes a single sample and is independent of SLURM. This enables:
- Local testing: Test on macOS before cluster submission
- Reusability: Same script runs locally or in HPC arrays
- Debugging: Easy to troubleshoot single-sample issues
8.1.1 Usage
# Test locally
bash scripts/01_qc/01_fastqc_one_sample.sh sample_1
# This reads from: 1_DATA/FASTQ/sample_1*.fastq.gz
# And writes to: 2_ANALYSES/Results/01_qc/
8.1.2 Example Implementation
#!/bin/bash
set -euo pipefail
SAMPLE_ID="$1"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$SCRIPT_DIR/../.."
cd "$REPO_ROOT" || exit 1
source scripts/utils/load_config.sh
# Create output directory
OUTDIR="$ANALYSES_DIR/Results/01_qc"
mkdir -p "$OUTDIR"
# Load module if available
if command -v module &> /dev/null; then
module load fastqc
fi
# Run analysis
fastqc -t "$THREADS" -o "$OUTDIR" "$FASTQ_RAW_DIR/${SAMPLE_ID}"*.fastq.gz
echo "FastQC completed for $SAMPLE_ID"
8.1.3 Sample List Format
Create config/samples_names.txt with one sample ID per line:
sample_1
sample_2
sample_3
Array jobs use this file to determine:
- Number of tasks: --array=1-N where N = line count
- Sample assignment: Each task gets one line based on SLURM_ARRAY_TASK_ID
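Rather than editing N by hand, the array size can be derived from the sample list at submission time, since sbatch command-line options override #SBATCH directives. A self-contained sketch (it fabricates a tiny sample list in a temporary directory; in a real project you would run the wc and sbatch lines from the repository root):

```shell
# Size the SLURM array from the sample list instead of hard-coding N
cd "$(mktemp -d)"
mkdir -p config
printf 'sample_1\nsample_2\nsample_3\n' > config/samples_names.txt
N=$(wc -l < config/samples_names.txt)
# On the cluster: sbatch --array=1-"$N" scripts/01_qc/01_fastqc_array.sbatch
echo "array size: $N"
```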
8.2 SLURM Array Jobs
Path: scripts/<step>/<step>_array.sbatch (e.g., scripts/01_qc/01_fastqc_array.sbatch)
Array jobs run one-sample scripts in parallel on HPC clusters. The structure is simple:
scripts/<step>/
├── slurm_out/ # Log files only
├── <step>_one_sample.sh # Bash script - process one sample
└── <step>_array.sbatch # SLURM script - submit array job
8.2.1 Quick Start
# Submit all samples as parallel tasks (from repository root)
cd /path/to/repository
sbatch scripts/01_qc/01_fastqc_array.sbatch
# Monitor
squeue -j <job_id>
tail -f scripts/01_qc/slurm_out/fastqc_array_<job_id>_1.out
8.2.2 How It Works
The array job:
1. Reads sample names from config/samples_names.txt
2. Creates N parallel SLURM tasks (N = sample count)
3. Each task calls: bash 01_fastqc_one_sample.sh <SAMPLE_ID>
4. Results are combined in 2_ANALYSES/Results/01_qc/
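The task-to-sample mapping can be exercised without SLURM by setting SLURM_ARRAY_TASK_ID by hand, which is useful when debugging the indexing logic locally. A self-contained sketch:

```shell
# Simulate the array-task indexing (SLURM sets SLURM_ARRAY_TASK_ID in real jobs)
cd "$(mktemp -d)"
mkdir -p config
printf 'sample_1\nsample_2\nsample_3\n' > config/samples_names.txt
SLURM_ARRAY_TASK_ID=2                      # pretend we are array task 2
mapfile -t SAMPLES < config/samples_names.txt
SAMPLE_ID="${SAMPLES[$((SLURM_ARRAY_TASK_ID - 1))]}"
echo "task $SLURM_ARRAY_TASK_ID -> $SAMPLE_ID"
```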
8.2.3 Example SLURM Script
#!/bin/bash
#SBATCH -J fastqc_array
#SBATCH -o scripts/01_qc/slurm_out/%x_%A_%a.out
#SBATCH -e scripts/01_qc/slurm_out/%x_%A_%a.err
#SBATCH -c 8
#SBATCH --mem=16G
#SBATCH -t 3-00:00:00
#SBATCH --constraint=cal
#SBATCH --array=1-N # N = number of samples
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$SCRIPT_DIR/../.."
cd "$REPO_ROOT" || exit 1
source scripts/utils/load_config.sh
SAMPLES_FILE="config/samples_names.txt"
mapfile -t SAMPLES < "$SAMPLES_FILE"
TASK_ID=$((SLURM_ARRAY_TASK_ID - 1))
SAMPLE_ID="${SAMPLES[$TASK_ID]}"
echo "Processing sample: $SAMPLE_ID (task ${SLURM_ARRAY_TASK_ID})"
# Call one-sample script
bash scripts/01_qc/01_fastqc_one_sample.sh "$SAMPLE_ID"
8.2.4 Key Features
- One sample per task: SLURM_ARRAY_TASK_ID indexes into the sample list
- Parallel execution: All samples run simultaneously on HPC
- Simple submission: sbatch scripts/01_qc/01_fastqc_array.sbatch
- Reusability: Same one-sample script for local testing and HPC
- Log separation: Each task has its own log file in slurm_out/
8.2.5 Important: Log Directory Requirements
⚠️ Critical: SLURM requires the log directory to exist before job submission.
- The repository includes .gitkeep files in each slurm_out/ directory
- This ensures slurm_out/ exists when you clone the repository
- SLURM cannot create the log directory automatically
- If the directory doesn’t exist, job submission will fail
Why .gitkeep? Git doesn’t track empty directories. The .gitkeep file forces Git to include the slurm_out/ directory in the repository, ensuring it exists for all users.
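The behaviour is easy to verify in a scratch repository (assumes git is installed):

```shell
# Demo: an empty directory is invisible to Git until it contains a file
cd "$(mktemp -d)"
git init -q .
mkdir -p scripts/02_preprocessing/slurm_out
git add -A
git status --porcelain          # prints nothing: the empty directory is not staged
touch scripts/02_preprocessing/slurm_out/.gitkeep
git add -A
git status --porcelain          # now shows the staged .gitkeep
```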
For new analysis steps: When creating a new step (e.g., scripts/02_preprocessing/), always:
mkdir -p scripts/02_preprocessing/slurm_out
touch scripts/02_preprocessing/slurm_out/.gitkeep
git add scripts/02_preprocessing/slurm_out/.gitkeep
9 Storage Architecture (macOS, HOME, fstrat, NAS)
9.1 Summary
- macOS: Development, testing, documentation.
- HOME: Repository + lightweight files.
- fstrat: High-performance execution workspace (temporary).
- NAS: Permanent project archive.
All scripts behave identically because they rely on BASE_DIR and REPO_DIR.
9.2 fstrat Execution Workspace (Per-Run)
Within each run directory:
/fstrat/dmartin/my-analysis/run_20251209/
│
├── 1_DATA/ # minimal replicated inputs
└── 2_ANALYSES/
├── Scripts/ # symlink to repository scripts
└── Results/ # all outputs generated here
Create symlink:
ln -s "$REPO_DIR/scripts" "$BASE_DIR/2_ANALYSES/Scripts/scripts_repo"
This ensures traceability without duplicating code.
9.3 Synchronizing Results from fstrat to NAS
After execution:
rsync -av /fstrat/.../2_ANALYSES/ /NAS/my-analysis/2_ANALYSES/
This pattern allows:
- Fast execution on cluster.
- Durable storage on NAS.
- Clean reproducibility.
9.4 Environment-Dependent Configuration
Only two variables change across machines:
base_dir: "path/to/data_and_results" # fstrat, NAS, or local path
repo_dir: "path/to/repository"       # HOME on cluster, local path on macOS
All scripts use these paths indirectly through load_config.sh.
10 Initialize the Project
Use scripts/utils/init_project.sh to create folders and templates:
# View help and options
bash scripts/utils/init_project.sh --help
# Initialize project with specified parent path and folder name
bash scripts/utils/init_project.sh -p <parent-path> -n <folder-name>
# Example: Create analysis folder 'run_20260112' in '/fstrat/dmartin/sparrow-analysis'
bash scripts/utils/init_project.sh -p /fstrat/dmartin/sparrow-analysis -n run_20260112
Required Arguments:
- -p, --parent-path PATH: Parent directory where the analysis folder will be created
- -n, --folder-name NAME: Name of the analysis folder (e.g., run_20260112, experiment_01)
What the script does:
Creates the analysis folder structure:
<parent-path>/<folder-name>/
├── 1_DATA/                  # Minimal replicated inputs
└── 2_ANALYSES/
    ├── Scripts/             # Symlink to repository scripts
    └── Results/             # All outputs generated here
Creates a symbolic link:
- 2_ANALYSES/Scripts/ → repository scripts/ folder
Creates repository configuration files if missing:
- config/config.yaml → shared, versioned configuration
- config/config_local.yaml → private, machine-specific configuration
Creates script templates in scripts/templates/:
- array_job_template.sbatch → SLURM array job template
- analysis_script_template.sh → generic one-sample script template
Structure Philosophy:
- Repository folder (this repo): Contains scripts, config, docs (version-controlled)
- Analysis folder (created by script): Contains data and results (not version-controlled)
- Separation of concerns: Code vs. Data
- Portability: Only analysis folder location changes between machines
- Symlink advantage: Scripts always up-to-date with repository changes
11 Creating New Analysis Steps
To add a new analysis step (e.g., 02_preprocessing):
11.1 Create Directory Structure
mkdir -p scripts/02_preprocessing/slurm_out
touch scripts/02_preprocessing/slurm_out/.gitkeep
git add scripts/02_preprocessing/slurm_out/.gitkeep
Note: The .gitkeep file is required so Git tracks the slurm_out/ directory. SLURM needs this directory to exist before job submission.
11.2 Create One-Sample Script
Copy and adapt the template:
cp scripts/templates/analysis_script_template.sh scripts/02_preprocessing/02_trim_one_sample.sh
Edit the script to:
- Process your specific analysis (trimming, alignment, etc.)
- Use appropriate input/output paths from load_config.sh
- Load required modules or tools
11.3 Create SLURM Array Script
Copy and adapt the template:
cp scripts/templates/array_job_template.sbatch scripts/02_preprocessing/02_trim_array.sbatch
Update:
- Job name: #SBATCH -J trim_array
- Array size: #SBATCH --array=1-N (N = sample count)
- Script call: bash scripts/02_preprocessing/02_trim_one_sample.sh "$SAMPLE_ID"
11.4 Test Locally
bash scripts/02_preprocessing/02_trim_one_sample.sh sample_1
11.5 Submit to HPC
# Execute from repository root
cd /path/to/repository
sbatch scripts/02_preprocessing/02_trim_array.sbatch
Important: Always execute sbatch from the repository root directory. Log paths in SLURM directives (#SBATCH -o and -e) are relative to the repository root.
12 Best Practices
- Always source scripts/utils/load_config.sh at the start of analysis scripts.
- Never modify 1_DATA/ during analyses; write outputs to 2_ANALYSES/.
- Track only code and generic config; keep large data out of the repo.
- Document in docs/ and publish reports with Quarto.
- Use SLURM array jobs for scalable parallel processing.
- Test one-sample scripts locally before submitting array jobs.
- Synchronize results from fstrat to NAS after each run.
13 Summary
- config.yaml → shared, versioned, general settings.
- config_local.yaml → local, private, machine-specific.
- load_config.sh → merges both and exposes variables.
- All scripts use source load_config.sh for consistent paths.
- One-sample Bash scripts can be tested locally or used in SLURM arrays.
- Clear separation between DATA (inputs) and ANALYSES (outputs).
- Portable across macOS ↔︎ HOME ↔︎ fstrat ↔︎ NAS.
- Ensures full reproducibility, separation of concerns, and portability.
14 Next Steps
Before you start:
Configure SSH keys for secure, password-free access to GitHub, cluster, and NAS (see Section 4).
Set up ~/.ssh/config to manage multiple connections efficiently.
Install VS Code Remote-SSH for seamless cluster interaction.
For your project:
Adapt scripts to your specific analysis pipeline.
Add workflows (Snakemake/Nextflow) for complex dependencies.
Create reproducible environments (environment.yml, renv.lock, or requirements.txt).
Document analysis steps and parameters in Quarto reports.