What I Use
Languages:
- Python
- Bash
- R
- C
- Nextflow
- Java
Databases and Query Languages:
- SQL
- MySQL
- Redis
Data Analysis, Machine Learning and Deep Learning:
- Numpy
- Pandas
- Matplotlib
- Seaborn
- Keras
- SciKit-Learn
- PyTorch
- TensorFlow
- PyCaret
Miscellaneous Skills:
- Git
- Jupyter Notebook
- Visual Studio
- R Studio
- PyCharm
- Terra
- Posit
- Bash Terminal
Microarray Analysis:
- limma
- gProfileR
RNA Sequencing Analysis:
- FastQC
- Trimmomatic
- Cutadapt
- Fastp
- STAR
- HISAT2
- DESeq2
- ClusterProfiler
- ggplot2
- Bioconductor
Key Bash Commands and Tools for Bioinformatics:
1. File Management and Text Processing:
- grep: For searching and filtering text data from large files (e.g., logs, sequence data).
- awk: Text processing and pattern scanning, often on sequence data files like FASTA, FASTQ, or SAM.
- sed: Stream editor for basic text transformations in bioinformatics pipelines.
- cut: Extracting columns from tabular text files (e.g., VCF or expression matrices).
- sort and uniq: Sorting and filtering unique lines in large datasets.
- find: Locating files and directories by name, size, or modification time.
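The grep / cut / sort | uniq -c pattern above can be sketched in pure Python; the tab-separated rows (chrom, pos, gene) below are made up for illustration.

```python
# A pure-Python sketch of the grep / cut / sort | uniq -c pattern,
# using a small made-up tab-separated table (chrom, pos, gene).
from collections import Counter

rows = [
    "chr1\t1000\tTP53",
    "chr1\t2000\tBRCA1",
    "chr2\t1500\tTP53",
    "chrX\t900\tMECP2",
]

# grep 'chr1': keep only lines matching a pattern
chr1_rows = [line for line in rows if line.startswith("chr1")]

# cut -f3: extract the third tab-separated column
genes = [line.split("\t")[2] for line in rows]

# sort | uniq -c: count occurrences of each unique value
gene_counts = Counter(genes)

print(chr1_rows)            # the two chr1 records
print(gene_counts["TP53"])  # TP53 appears twice
```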
2. Sequence Data Handling:
- fastqc: Quality control tool for high-throughput sequence data (FASTQ files).
- multiqc: Aggregates results from multiple FastQC reports into a single summary report.
- bwa: Burrows-Wheeler Aligner for mapping low-divergent sequences against a large reference genome.
- samtools: Utilities for processing SAM, BAM, and CRAM files (e.g., sorting, indexing, and converting formats).
- bcftools: Tools for calling variants and manipulating VCF files, used in variant calling workflows.
- bedtools: Suite of tools for a wide range of genomic operations on BED files.
- cutadapt: Trimming adapter sequences from reads in FASTQ files.
- trimmomatic: Trimming low-quality regions and adapters from high-throughput sequencing reads.
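The quality-filtering step that tools like fastp and Trimmomatic perform can be illustrated with a minimal pure-Python sketch; the reads and the Q20 mean-quality threshold here are made up, not any tool's defaults.

```python
# Minimal sketch of mean-quality read filtering, the kind of step
# fastp or Trimmomatic performs. Reads and threshold are made up.
def phred_scores(qual_string, offset=33):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - offset for c in qual_string]

def mean_quality(qual_string):
    scores = phred_scores(qual_string)
    return sum(scores) / len(scores)

# (read_id, sequence, quality) triples from a toy FASTQ
reads = [
    ("read1", "ACGT", "IIII"),  # 'I' = Q40: high quality
    ("read2", "ACGT", "!!!!"),  # '!' = Q0: very low quality
]

passed = [r for r in reads if mean_quality(r[2]) >= 20]
print([r[0] for r in passed])  # only read1 survives
```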
3. Data Compression and Decompression:
- gzip/gunzip: Compressing and decompressing sequence files (e.g., FASTQ, BAM).
- tar: Compressing and extracting multiple files or directories (often used with gzip).
- bgzip: Compressing VCF and other large text files, often combined with tabix for indexing.
- zcat/bzcat: Reading compressed files without decompressing them.
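The zcat idea, streaming a compressed file without writing a decompressed copy to disk, maps directly onto Python's gzip module; the tiny FASTQ record below is made up for the demo.

```python
# Reading a gzip-compressed file without decompressing it to disk,
# the same idea as `zcat file.fastq.gz`. Uses a temp file for the demo.
import gzip
import os
import tempfile

# Write a tiny compressed "FASTQ" record
tmp = tempfile.NamedTemporaryFile(suffix=".fastq.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt") as fh:
    fh.write("@read1\nACGT\n+\nIIII\n")

# Stream it back, decompressing on the fly
with gzip.open(tmp.name, "rt") as fh:
    lines = fh.read().splitlines()

os.unlink(tmp.name)
print(lines[0])  # "@read1"
```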
4. Pipeline and Workflow Automation:
Nextflow: Workflow manager for reproducible and scalable bioinformatics pipelines, designed to work with cloud resources and clusters.
5. Genomic Data Handling:
- bcl2fastq: Converts raw Illumina BCL data files into FASTQ format for downstream analysis.
- kraken2: Taxonomic classification of metagenomic sequence data.
- prokka: Genome annotation of bacterial sequences.
- pilon: Improving genome assemblies by correcting sequence errors.
- unicycler: Hybrid assembly pipeline optimized for bacterial genomes.
- quast: Quality assessment of genome assemblies.
6. Visualization:
- IGV: The Integrative Genomics Viewer (including its command-line interface), useful for visualizing sequence data.
- samtools tview: A text-based alignment viewer for BAM files.
7. Monitoring and Resource Management:
- top: Monitor system resource usage (CPU, memory).
- htop: A more interactive version of top for monitoring system performance.
- df and du: Checking disk space and usage, ensuring enough storage is available for large sequencing datasets.
8. Software Installation:
- conda: Package and environment manager for installing bioinformatics tools and managing dependencies.
- docker: Containerization platform for running bioinformatics tools in isolated environments.
Biopython: A comprehensive library for biological computation, supporting sequence analysis, file parsing (FASTA, GenBank), phylogenetics, and more. Widely used in genomics, protein analysis, and molecular biology. Key modules and tools:
- Sequence objects: Handling biological sequences using the Seq object.
- Sequence annotation objects: Annotating sequences with SeqRecord and SeqFeature objects.
- Sequence input/output: SeqIO and AlignIO for reading and writing sequence file formats (e.g., FASTA, GenBank).
- Sequence alignments: Working with sequence alignment objects using AlignIO.
- Pairwise sequence alignment: The Bio.pairwise2 module for flexible pairwise alignment.
- Multiple sequence alignment objects: Handling and manipulating MSAs using MultipleSeqAlignment and AlignIO.
- BLAST: Accessing NCBI BLAST services with NCBIWWW and parsing BLAST output with NCBIXML.
- Entrez databases: Bio.Entrez for accessing and retrieving data from NCBI's Entrez databases (e.g., PubMed, GenBank).
- Swiss-Prot and ExPASy: Tools for accessing the Swiss-Prot (UniProt) and ExPASy databases for protein sequences.
- Going 3D: Bio.PDB for parsing and analyzing 3D protein structures.
- Population genetics: Tools for population genetics analysis in the Bio.PopGen module.
- Phylogenetics: Bio.Phylo for parsing, constructing, and analyzing phylogenetic trees.
- Sequence motif analysis: Finding and analyzing sequence motifs with the Bio.motifs module.
- Cluster analysis: Performing cluster analysis of biological data.
- Graphics: Creating high-quality genomic diagrams and visualizations with GenomeDiagram.
- KEGG: Tools for interacting with the Kyoto Encyclopedia of Genes and Genomes via Bio.KEGG.
- Phenotype data: Analyzing phenotypic data with Bio.phenotype.
PyBigWig: For reading and writing BigWig files, commonly used for storing dense, continuous data such as coverage tracks from sequencing experiments.
HTSeq: A Python framework to process high-throughput sequencing data, especially for RNA-Seq counting and alignment.
Pysam: A Python module for reading, manipulating, and writing genomic data in SAM/BAM/VCF format.
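To show the kind of task Biopython's SeqIO and Seq objects handle, here is a standard-library-only sketch of FASTA parsing and reverse complementation; the records are made up, and real code would use Bio.SeqIO.parse and Seq.reverse_complement instead.

```python
# Standard-library sketch of what Bio.SeqIO.parse and
# Seq.reverse_complement do; the FASTA records are made up.
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]

fasta = ">seq1 demo\nACGTT\n>seq2\nGGCC\n"
records = dict(parse_fasta(fasta))
print(records["seq1 demo"])                       # ACGTT
print(reverse_complement(records["seq1 demo"]))   # AACGT
```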
Additional Tools for Specialized Tasks:
- PyCaret: An open-source, low-code machine learning library that automates machine learning workflows for classification, regression, and clustering.
- TensorFlow: An end-to-end open-source platform for machine learning, especially for building and training deep learning models.
- Keras: A high-level API for building and training neural networks, running on top of TensorFlow for deep learning tasks.
Data Wrangling
Data Visualization
- ggplot2 for the majority of the graphics, together with hrbrthemes for styling.
- patchwork to combine graphics.
- ggraph and igraph for most network-related graphics.
- plotly and other HTML widgets for interactive graphics.
- RColorBrewer, viridis, and colormap for color control in charts.
- ggrepel and other ggplot2 extensions for simplifying plotting tasks.
- heatmaply for heatmaps.
Reproducible Research
- R Markdown to produce statistical reports.
- Quarto to build websites for courses and more.
Statistical Modeling
Statistical modeling in SPSS involves building and evaluating models that represent relationships within a dataset at a particular point in time. Common methods include descriptive statistics, ANOVA, and regression analysis. The process covers data preparation, variable selection, model construction, and evaluation; coefficients, p-values, and R-squared values help interpret the output and judge the significance and strength of relationships. Applications include business analytics, the social sciences, and market research.
Workflow Management and Pipeline Development:
- Nextflow is used to create reproducible and scalable data analysis pipelines.
- Integrates with software such as Docker, Singularity, and Conda for environment management.
- Ideal for processing large-scale data, especially in genomics and bioinformatics.
- Can manage complex workflows involving multiple tools like FastQC, STAR, HISAT2, DESeq2, and more.
Key Tools in Nextflow Pipelines:
- FastQC: For quality control of raw sequence data.
- Trimmomatic: For trimming low-quality reads and adapters.
- STAR/HISAT2: For read alignment to reference genomes.
- DESeq2/edgeR: For differential expression analysis.
- MultiQC: To aggregate results across multiple samples for easy interpretation.
- ARIBA: Rapid antimicrobial resistance genotyping directly from sequencing reads.
- BCFtools: For variant calling and manipulating VCF/BCF files.
- SAMtools: For handling BAM/SAM files.
- BWA: For sequence alignment.
- Docker Images (StaPH-B): Pre-built containerized environments with bioinformatics tools like ARIBA, BCFtools, Kraken 2, and more.
- fastp: Ultra-fast all-in-one FASTQ preprocessor.
- Kraken 2: For metagenomic analysis and taxonomy assignment.
- mlst: For multi-locus sequence typing.
- PopPUNK: Bacterial genomic epidemiology analysis tool.
- QUAST: For genome assembly evaluation.
- SeroBA: High-throughput serotyping of Streptococcus pneumoniae.
- Unicycler: Bacterial genome assembly tool.
- SPN-PBP-AMR: Predicts penicillin resistance in Streptococcus pneumoniae.
Example Use Cases:
- GPS Pipeline
- KPN pipeline
- RNA-Seq pipelines
- Whole Genome Sequencing (WGS)
- Single-cell RNA-Seq analysis
- Metagenomics analysis
Deep Learning Frameworks and Libraries:
- TensorFlow: An open-source framework for deep learning developed by Google. TensorFlow supports building and training neural networks, including tasks such as image classification, natural language processing, and more.
- Keras: A high-level neural networks API that runs on top of TensorFlow, simplifying the construction of deep learning models with minimal code.
- PyTorch: Developed by Facebook, PyTorch is a widely used deep learning library with a dynamic computational graph, making it easier for research and experimentation. It’s highly popular in the research community for developing new models.
- Theano: An older deep learning framework, originally developed to enable efficient computation of mathematical expressions, including neural networks. While it’s less commonly used now, Theano laid the foundation for many modern deep learning tools.
- MXNet: A scalable deep learning framework supporting both imperative and symbolic programming. It’s often used for training deep learning models on large datasets in distributed environments.
- Chainer: A flexible deep learning framework known for its support of dynamic computation graphs, which makes it easier to implement complex models.
- Fastai: A deep learning library that simplifies training models using PyTorch. It provides easy-to-use functions and pre-built models for tasks such as computer vision and natural language processing.
- DLib: A toolkit for developing machine learning and deep learning models, often used for computer vision tasks like face recognition and object detection.
Deep Learning Tools and Applications:
- Autoencoders: Used for unsupervised learning and dimensionality reduction, autoencoders learn compressed representations of data. They are often used in tasks such as anomaly detection, data denoising, and generative models.
- Convolutional Neural Networks (CNNs): Deep learning models primarily used for tasks like image classification, object detection, and image segmentation. CNNs excel in extracting spatial features from images.
- Recurrent Neural Networks (RNNs): A type of neural network used for sequence data, such as time-series analysis, language modeling, and speech recognition. Variants like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) help in capturing long-term dependencies.
- Generative Adversarial Networks (GANs): A powerful framework for generating new data that mimics real datasets. GANs are used for generating images, improving image resolution, and generating synthetic datasets.
- Transfer Learning: A technique where pre-trained deep learning models are fine-tuned for a specific task with fewer data. Popular models include VGG, ResNet, Inception, and BERT.
- Reinforcement Learning: Deep learning models that learn by interacting with environments, often used in robotics, game AI, and autonomous systems.
- Transformers: A deep learning model architecture used for tasks like natural language processing (NLP). Models like BERT, GPT, and T5 are transformer-based models.
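All of the architectures above are trained by the same core mechanism: adjust weights by gradient descent to reduce a loss. A minimal hand-rolled sketch for a single linear neuron on made-up data (y = 2x) with mean-squared-error loss, standing in for what TensorFlow or PyTorch automate:

```python
# Gradient descent for a single linear neuron on made-up data y = 2x,
# with MSE loss. Frameworks automate exactly this loop at scale.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0    # single weight, no bias
lr = 0.05  # learning rate

for epoch in range(200):
    # dL/dw for L = mean((w*x - y)^2) is mean(2*(w*x - y)*x)
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges near 2.0
```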
Additional Libraries for Deep Learning:
- Horovod: A distributed training framework for TensorFlow, PyTorch, and Keras, designed to enable efficient scaling of deep learning models across multiple GPUs.
- ONNX (Open Neural Network Exchange): An open-source format that allows interoperability between different deep learning frameworks, enabling the movement of models between PyTorch, TensorFlow, MXNet, and other frameworks.
- OpenCV: A computer vision library that integrates with deep learning frameworks like TensorFlow and PyTorch for tasks like image recognition, object detection, and video analysis.
- Caffe: A deep learning framework that is particularly popular for computer vision tasks and is known for its speed in deploying deep learning models.
Use Cases in Deep Learning:
- Image Classification: Using CNNs and pre-trained models like VGG or ResNet to classify images into categories.
- Object Detection: Models like YOLO (You Only Look Once) and Faster R-CNN used for identifying and classifying objects within images.
- Natural Language Processing (NLP): Deep learning models for tasks like sentiment analysis, language translation, and chatbot development using LSTMs, GRUs, or transformers like BERT and GPT.
- Speech Recognition: Deep learning models like DeepSpeech are widely used in converting speech to text and understanding spoken commands.
- Medical Imaging: Using deep learning for detecting and diagnosing diseases through X-rays, MRIs, and CT scans. CNNs and GANs are often applied for tasks like tumor detection and segmentation.
- Autonomous Vehicles: Deep learning models are used in self-driving cars for tasks like object detection, lane detection, and decision-making processes.
Molecular Biology:
- Agarose & Polyacrylamide Gel Electrophoresis and Imaging
- Blotting techniques
- DNA & RNA extraction
- Molecular cloning & Restriction enzymes
- Conventional and Real-time PCR
- ELISA & ICT
- Vaccine trial & microsurgery on mouse model
- Library preparation and pooling for Bacterial and Viral sequencing (Illumina)
- Library preparation and pooling (Nanopore)
- DNA/RNA extraction, PCR, Library Preparation (Illumina/Nanopore), Tapestation
Microbiology:
- Culture media preparation (Blood, MacConkey, Mueller-Hinton, and Chocolate agar)
- Gram Staining
- Catalase, Coagulase, Optochin sensitivity, CAMP, Bile solubility, Oxidase, Biochemical (TSI, MIU, Citrate)
- Satellitism, X-V factor
- Antibiotic susceptibility test
Genomics
- Illumina: Sample sheet preparation, conversion (bcl2fastq), quality checking (FastQC, MultiQC, QUAST), quality control (Trimmomatic), assembly (Unicycler, SPAdes, MEGAHIT).
- Annotation and typing: Kraken2, Prokka, SeroBA, AMRFinderPlus, ABRicate, SRST, MLST, Snippy, MAFFT, fasta2phylip, RAxML-NG, PopPUNK, PlasmidFinder, ResFinder, BLAST, Pharokka.
Data Sources
- NCBI
- The Cancer Genome Atlas (TCGA): Comprehensive multi-dimensional cancer genomics data.
- Gene Expression Omnibus (GEO): Public repository for gene expression data, including cancer datasets.
- CELLXGENE
Single Cell RNA Sequence Analysis
1. Data Pre-processing:
- Scanpy: Preprocessing single-cell RNA-seq data with functions like pp.filter_genes, pp.normalize_total, and pp.log1p for filtering, normalization, and scaling.
- Seurat: Preprocessing with functions such as NormalizeData, ScaleData, and FindVariableFeatures to prepare the data for downstream analysis.
- Cell Ranger: Primary analysis pipeline from 10x Genomics for generating FASTQ files, aligning reads, and creating expression matrices.
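The normalization step that pp.normalize_total followed by pp.log1p performs can be sketched in pure Python on a made-up 2-cell x 3-gene count matrix: scale each cell to a common total, then take log(1 + x).

```python
# Pure-Python sketch of normalize-total + log1p preprocessing on a
# made-up count matrix (rows = cells, columns = genes).
import math

counts = [
    [10, 0, 10],  # cell 1: 20 total counts
    [5, 5, 0],    # cell 2: 10 total counts
]
target_sum = 10.0

# Scale every cell so its counts sum to target_sum
normalized = [
    [c * target_sum / sum(cell) for c in cell]
    for cell in counts
]

# Apply log(1 + x) to stabilize variance
logged = [[math.log1p(c) for c in cell] for cell in normalized]

print(normalized[0])  # [5.0, 0.0, 5.0] -- both cells now sum to 10
```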
2. Clustering and Cell Annotation:
- Scanpy: Clustering with algorithms like Louvain (tl.louvain) and Leiden (tl.leiden); annotation with marker genes using tl.rank_genes_groups.
- Seurat: Clustering with FindClusters and RunPCA for dimensionality reduction and visualization; annotation via FindAllMarkers to identify cell-type-specific markers.
- Harmony: Batch correction and integration during clustering to harmonize multiple datasets.
3. Integration and Batch Correction:
- Scanpy: Integration tools like external.pp.bbknn for batch correction using nearest neighbors and external.pp.harmony_integrate for Harmony integration.
- Seurat: Integration tools such as Canonical Correlation Analysis (FindIntegrationAnchors) and Reciprocal PCA, followed by IntegrateData, for batch correction across datasets.
- Harmony: A flexible tool for integration and batch-effect correction during analysis.
- scVI: Scalable Variational Inference for integrating and analyzing multi-modal single-cell datasets.
4. Cell-Cell Communication:
- CellPhoneDB: A Python tool for inferring cell-cell communication based on ligand-receptor interactions.
- NATMI: For predicting ligand-receptor interactions between cell populations.
- iTALK: R package for analyzing and visualizing cell-cell communication using scRNA-seq data.
- CellChat: R package for inferring and visualizing intercellular communication networks.
- SingleCellSignalR: An R package for identifying cell-cell communication networks based on receptor-ligand interaction analysis.
5. BCR Background and 10x Analysis:
- Cell Ranger: For reconstructing BCR and TCR sequences from 10x Genomics single-cell data.
- VDJtools: Analyzing BCR/TCR repertoires from single-cell RNA-seq data.
- scRepertoire: R package for analyzing TCR/BCR sequences in single-cell datasets, providing clonotype tracking and diversity analysis.
6. Trajectory Inference:
- Monocle3: R package for performing pseudotime analysis and reconstructing cell trajectories.
- Slingshot: R package for lineage inference and trajectory reconstruction from single-cell RNA-seq data.
- Scanpy: Trajectory inference with Palantir (available through scanpy.external) and pseudotime analysis with tl.dpt.
- PAGA: Partition-based graph abstraction (tl.paga) for inferring lineage relationships and transitions between cell states in Scanpy.
- Velocyto: A tool for estimating RNA velocity in single cells, predicting the future states of cells in trajectory analysis.
- dynverse: A collection of methods and packages for trajectory inference, integrated with both Seurat and Scanpy workflows.
7. Differential Abundance:
- DAseq: R package for differential abundance analysis of single-cell datasets, identifying significant differences in cell populations.
- Milo: For testing differential abundance of cell neighborhoods across conditions.
- edgeR: For differential expression analysis in scRNA-seq data.
- DESeq2: R package for differential expression analysis in bulk RNA-seq and scRNA-seq data.
8. Multiomic scATAC:
- Signac: An R package built on top of Seurat for integrating scRNA-seq and scATAC-seq data.
- ArchR: A comprehensive tool for single-cell ATAC-seq data analysis, including peak calling, dimensionality reduction, and clustering.
- Cicero: Tool for inferring co-accessibility networks from scATAC-seq data.
- SnapATAC: For scATAC-seq clustering, visualization, and integration with scRNA-seq datasets.
- scVI: Probabilistic modeling for multi-modal single-cell data, supporting scRNA-seq and scATAC-seq integration.
- Research Fellow: Developed ML-based pipelines for shotgun metagenomic sequencing, 16S rRNA sequencing (QIIME2), and RNA-seq (DESeq2), with statistics and visualization using ggplot2, corrplot, pheatmap, EDASeq, and gProfileR.