What I Use
Languages:
- Python
- Bash
- R
- C
- Nextflow
- Java
Databases and Query Languages:
- SQL
- MySQL
- Redis
Data Analysis, Machine Learning and Deep Learning:
- Numpy
- Pandas
- Matplotlib
- Seaborn
- Keras
- SciKit-Learn
- PyTorch
- TensorFlow
- PyCaret
Miscellaneous Skills:
- Git
- Jupyter Notebook
- Visual Studio
- R Studio
- PyCharm
- Terra
- Posit
- Bash Terminal
Microarray Analysis:
- limma
- gProfileR
RNA Sequencing Analysis:
- FastQC
- Trimmomatic
- Cutadapt
- Fastp
- STAR
- HISAT2
- DESeq2
- ClusterProfiler
- ggplot2
- Bioconductor
Key Bash Commands and Tools for Bioinformatics:
1. File Management and Text Processing:
- grep: For searching and filtering text data from large files (e.g., logs, sequence data).
- awk: Text processing and pattern scanning, often on sequence data files like FASTA, FASTQ, or SAM.
- sed: Stream editor for basic text transformations in bioinformatics pipelines.
- cut: Extracting columns from tabular text files (e.g., VCF or expression matrices).
- sort and uniq: Sorting and filtering unique lines in large datasets.
- find: Locating files and directories by name, size, or modification time.
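The grep / cut / sort | uniq -c pattern above can be sketched in pure Python; the tab-separated rows (chrom, pos, gene) below are made up for illustration.

```python
# A pure-Python sketch of the grep / cut / sort | uniq -c pattern,
# using a small made-up tab-separated table (chrom, pos, gene).
from collections import Counter

rows = [
    "chr1\t1000\tTP53",
    "chr1\t2000\tBRCA1",
    "chr2\t1500\tTP53",
    "chrX\t900\tMECP2",
]

# grep 'chr1': keep only lines matching a pattern
chr1_rows = [line for line in rows if line.startswith("chr1")]

# cut -f3: extract the third tab-separated column
genes = [line.split("\t")[2] for line in rows]

# sort | uniq -c: count occurrences of each unique value
gene_counts = Counter(genes)

print(chr1_rows)            # the two chr1 records
print(gene_counts["TP53"])  # TP53 appears twice
```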
2. Sequence Data Handling:
- fastqc: Quality control tool for high-throughput sequence data (FASTQ files).
- multiqc: Aggregates results from multiple FastQC reports into a single summary report.
- bwa: Burrows-Wheeler Aligner for mapping low-divergent sequences against a large reference genome.
- samtools: Utilities for processing SAM, BAM, and CRAM files (e.g., sorting, indexing, and converting formats).
- bcftools: Tools for calling variants and manipulating VCF files, used in variant calling workflows.
- bedtools: Suite of tools for a wide range of genomic operations on BED files.
- cutadapt: Trimming adapter sequences from reads in FASTQ files.
- trimmomatic: Trimming low-quality regions and adapters from high-throughput sequencing reads.
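The quality-filtering step that tools like fastp and Trimmomatic perform can be illustrated with a minimal pure-Python sketch; the reads and the Q20 mean-quality threshold here are made up, not any tool's defaults.

```python
# Minimal sketch of mean-quality read filtering, the kind of step
# fastp or Trimmomatic performs. Reads and threshold are made up.
def phred_scores(qual_string, offset=33):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - offset for c in qual_string]

def mean_quality(qual_string):
    scores = phred_scores(qual_string)
    return sum(scores) / len(scores)

# (read_id, sequence, quality) triples from a toy FASTQ
reads = [
    ("read1", "ACGT", "IIII"),  # 'I' = Q40: high quality
    ("read2", "ACGT", "!!!!"),  # '!' = Q0: very low quality
]

passed = [r for r in reads if mean_quality(r[2]) >= 20]
print([r[0] for r in passed])  # only read1 survives
```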
3. Data Compression and Decompression:
- gzip/gunzip: Compressing and decompressing sequence files (e.g., FASTQ, BAM).
- tar: Compressing and extracting multiple files or directories (often used with gzip).
- bgzip: Compressing VCF and other large text files, often combined with tabix for indexing.
- zcat/bzcat: Reading compressed files without decompressing them.
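The zcat idea, streaming a compressed file without writing a decompressed copy to disk, maps directly onto Python's gzip module; the tiny FASTQ record below is made up for the demo.

```python
# Reading a gzip-compressed file without decompressing it to disk,
# the same idea as `zcat file.fastq.gz`. Uses a temp file for the demo.
import gzip
import os
import tempfile

# Write a tiny compressed "FASTQ" record
tmp = tempfile.NamedTemporaryFile(suffix=".fastq.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt") as fh:
    fh.write("@read1\nACGT\n+\nIIII\n")

# Stream it back, decompressing on the fly
with gzip.open(tmp.name, "rt") as fh:
    lines = fh.read().splitlines()

os.unlink(tmp.name)
print(lines[0])  # "@read1"
```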
4. Pipeline and Workflow Automation:
Nextflow: Workflow manager for reproducible and scalable bioinformatics pipelines, designed to work with cloud resources and clusters.
5. Genomic Data Handling:
- bcl2fastq: Converts raw Illumina BCL data files into FASTQ format for downstream analysis.
- kraken2: Taxonomic classification of metagenomic sequence data.
- prokka: Genome annotation of bacterial sequences.
- pilon: Improving genome assemblies by correcting sequence errors.
- unicycler: Hybrid assembly pipeline optimized for bacterial genomes.
- quast: Quality assessment of genome assemblies.
6. Visualization:
- IGV: The Integrative Genomics Viewer (including its command-line interface), useful for visualizing sequence data.
- samtools tview: A text-based alignment viewer for BAM files.
7. Monitoring and Resource Management:
- top: Monitor system resource usage (CPU, memory).
- htop: A more interactive version of top for monitoring system performance.
- df and du: Checking disk space and usage, ensuring enough storage is available for large sequencing datasets.
8. Software Installation:
- conda: Package and environment manager for installing bioinformatics tools and managing dependencies.
- docker: Containerization platform for running bioinformatics tools in isolated environments.
Biopython: A comprehensive library for biological computation, supporting sequence analysis, file parsing (FASTA, GenBank), phylogenetics, and more. Widely used in genomics, protein analysis, and molecular biology. Key modules and tools:
- Sequence objects: Handling biological sequences using the Seq object.
- Sequence annotation objects: Annotating sequences with SeqRecord and SeqFeature objects.
- Sequence input/output: SeqIO and AlignIO for reading and writing sequence file formats (e.g., FASTA, GenBank).
- Sequence alignments: Working with sequence alignment objects using AlignIO.
- Pairwise sequence alignment: The Bio.pairwise2 module for flexible pairwise alignment.
- Multiple sequence alignment objects: Handling and manipulating MSAs using MultipleSeqAlignment and AlignIO.
- BLAST: Accessing NCBI BLAST services with NCBIWWW and parsing BLAST output with NCBIXML.
- Entrez databases: Bio.Entrez for accessing and retrieving data from NCBI's Entrez databases (e.g., PubMed, GenBank).
- Swiss-Prot and ExPASy: Tools for accessing the Swiss-Prot (UniProt) and ExPASy databases for protein sequences.
- Going 3D: Bio.PDB for parsing and analyzing 3D protein structures.
- Population genetics: Tools for population genetics analysis in the Bio.PopGen module.
- Phylogenetics: Bio.Phylo for parsing, constructing, and analyzing phylogenetic trees.
- Sequence motif analysis: Finding and analyzing sequence motifs with the Bio.motifs module.
- Cluster analysis: Performing cluster analysis of biological data.
- Graphics: Creating high-quality genomic diagrams and visualizations with GenomeDiagram.
- KEGG: Tools for interacting with the Kyoto Encyclopedia of Genes and Genomes via Bio.KEGG.
- Phenotype data: Analyzing phenotypic data with Bio.phenotype.
PyBigWig: For reading and writing BigWig files, commonly used for storing dense, continuous data such as coverage tracks from sequencing experiments.
HTSeq: A Python framework to process high-throughput sequencing data, especially for RNA-Seq counting and alignment.
Pysam: A Python module for reading, manipulating, and writing genomic data in SAM/BAM/VCF format.
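To show the kind of task Biopython's SeqIO and Seq objects handle, here is a standard-library-only sketch of FASTA parsing and reverse complementation; the records are made up, and real code would use Bio.SeqIO.parse and Seq.reverse_complement instead.

```python
# Standard-library sketch of what Bio.SeqIO.parse and
# Seq.reverse_complement do; the FASTA records are made up.
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]

fasta = ">seq1 demo\nACGTT\n>seq2\nGGCC\n"
records = dict(parse_fasta(fasta))
print(records["seq1 demo"])                       # ACGTT
print(reverse_complement(records["seq1 demo"]))   # AACGT
```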
Additional Tools for Specialized Tasks:
- PyCaret: An open-source, low-code machine learning library that automates machine learning workflows for classification, regression, and clustering.
- TensorFlow: An end-to-end open-source platform for machine learning, especially for building and training deep learning models.
- Keras: A high-level API for building and training neural networks, running on top of TensorFlow for deep learning tasks.
Data Wrangling
Data Visualization
- ggplot2 for the majority of the graphics, together with hrbrthemes for styling.
- patchwork to combine graphics.
- ggraph and igraph for most network-related graphics.
- plotly and other HTML widgets for interactive graphics.
- RColorBrewer, viridis, and colormap for color control in charts.
- ggrepel and other ggplot2 extensions for simplifying plotting tasks.
- heatmaply for heatmaps.
Reproducible Research
- R Markdown to produce statistical reports.
- Quarto to build websites for courses and more.
Statistical Modeling
Statistical modeling in SPSS involves building and evaluating models that represent relationships within a dataset at a particular point in time. Common methods include descriptive statistics, ANOVA, and regression analysis. The process covers data preparation, variable selection, model construction, and evaluation; coefficients, p-values, and R-squared values help interpret the output and judge the significance and strength of relationships. Applications include business analytics, the social sciences, and market research.
Workflow Management and Pipeline Development:
- Nextflow is used to create reproducible and scalable data analysis pipelines.
- Integrates with software such as Docker, Singularity, and Conda for environment management.
- Ideal for processing large-scale data, especially in genomics and bioinformatics.
- Can manage complex workflows involving multiple tools like FastQC, STAR, HISAT2, DESeq2, and more.
Key Tools in Nextflow Pipelines:
- FastQC: For quality control of raw sequence data.
- Trimmomatic: For trimming low-quality reads and adapters.
- STAR/HISAT2: For read alignment to reference genomes.
- DESeq2/edgeR: For differential expression analysis.
- MultiQC: To aggregate results across multiple samples for easy interpretation.
- ARIBA: Rapid antimicrobial resistance genotyping directly from sequencing reads.
- BCFtools: For variant calling and manipulating VCF/BCF files.
- SAMtools: For handling BAM/SAM files.
- BWA: For sequence alignment.
- Docker Images (StaPH-B): Pre-built containerized environments with bioinformatics tools like ARIBA, BCFtools, Kraken 2, and more.
- fastp: Ultra-fast all-in-one FASTQ preprocessor.
- Kraken 2: For metagenomic analysis and taxonomy assignment.
- mlst: For multi-locus sequence typing.
- PopPUNK: Bacterial genomic epidemiology analysis tool.
- QUAST: For genome assembly evaluation.
- SeroBA: High-throughput serotyping of Streptococcus pneumoniae.
- Unicycler: Bacterial genome assembly tool.
- SPN-PBP-AMR: Predicts penicillin resistance in Streptococcus pneumoniae.
Example Use Cases:
- GPS Pipeline
- KPN pipeline
- RNA-Seq pipelines
- Whole Genome Sequencing (WGS)
- Single-cell RNA-Seq analysis
- Metagenomics analysis
Deep Learning Frameworks and Libraries:
- TensorFlow: An open-source framework for deep learning developed by Google. TensorFlow supports building and training neural networks, including tasks such as image classification, natural language processing, and more.
- Keras: A high-level neural networks API that runs on top of TensorFlow, simplifying the construction of deep learning models with minimal code.
- PyTorch: Developed by Facebook, PyTorch is a widely used deep learning library with a dynamic computational graph, making it easier for research and experimentation. It’s highly popular in the research community for developing new models.
- Theano: An older deep learning framework, originally developed to enable efficient computation of mathematical expressions, including neural networks. While it’s less commonly used now, Theano laid the foundation for many modern deep learning tools.
- MXNet: A scalable deep learning framework supporting both imperative and symbolic programming. It’s often used for training deep learning models on large datasets in distributed environments.
- Chainer: A flexible deep learning framework known for its support of dynamic computation graphs, which makes it easier to implement complex models.
- Fastai: A deep learning library that simplifies training models using PyTorch. It provides easy-to-use functions and pre-built models for tasks such as computer vision and natural language processing.
- DLib: A toolkit for developing machine learning and deep learning models, often used for computer vision tasks like face recognition and object detection.
Deep Learning Tools and Applications:
- Autoencoders: Used for unsupervised learning and dimensionality reduction, autoencoders learn compressed representations of data. They are often used in tasks such as anomaly detection, data denoising, and generative models.
- Convolutional Neural Networks (CNNs): Deep learning models primarily used for tasks like image classification, object detection, and image segmentation. CNNs excel in extracting spatial features from images.
- Recurrent Neural Networks (RNNs): A type of neural network used for sequence data, such as time-series analysis, language modeling, and speech recognition. Variants like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) help in capturing long-term dependencies.
- Generative Adversarial Networks (GANs): A powerful framework for generating new data that mimics real datasets. GANs are used for generating images, improving image resolution, and generating synthetic datasets.
- Transfer Learning: A technique where pre-trained deep learning models are fine-tuned for a specific task with fewer data. Popular models include VGG, ResNet, Inception, and BERT.
- Reinforcement Learning: Deep learning models that learn by interacting with environments, often used in robotics, game AI, and autonomous systems.
- Transformers: A deep learning model architecture used for tasks like natural language processing (NLP). Models like BERT, GPT, and T5 are transformer-based models.
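All of the architectures above are trained by the same core mechanism: adjust weights by gradient descent to reduce a loss. A minimal hand-rolled sketch for a single linear neuron on made-up data (y = 2x) with mean-squared-error loss, standing in for what TensorFlow or PyTorch automate:

```python
# Gradient descent for a single linear neuron on made-up data y = 2x,
# with MSE loss. Frameworks automate exactly this loop at scale.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0    # single weight, no bias
lr = 0.05  # learning rate

for epoch in range(200):
    # dL/dw for L = mean((w*x - y)^2) is mean(2*(w*x - y)*x)
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges near 2.0
```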
Additional Libraries for Deep Learning:
- Horovod: A distributed training framework for TensorFlow, PyTorch, and Keras, designed to enable efficient scaling of deep learning models across multiple GPUs.
- ONNX (Open Neural Network Exchange): An open-source format that allows interoperability between different deep learning frameworks, enabling the movement of models between PyTorch, TensorFlow, MXNet, and other frameworks.
- OpenCV: A computer vision library that integrates with deep learning frameworks like TensorFlow and PyTorch for tasks like image recognition, object detection, and video analysis.
- Caffe: A deep learning framework that is particularly popular for computer vision tasks and is known for its speed in deploying deep learning models.
Use Cases in Deep Learning:
- Image Classification: Using CNNs and pre-trained models like VGG or ResNet to classify images into categories.
- Object Detection: Models like YOLO (You Only Look Once) and Faster R-CNN used for identifying and classifying objects within images.
- Natural Language Processing (NLP): Deep learning models for tasks like sentiment analysis, language translation, and chatbot development using LSTMs, GRUs, or transformers like BERT and GPT.
- Speech Recognition: Deep learning models like DeepSpeech are widely used in converting speech to text and understanding spoken commands.
- Medical Imaging: Using deep learning for detecting and diagnosing diseases through X-rays, MRIs, and CT scans. CNNs and GANs are often applied for tasks like tumor detection and segmentation.
- Autonomous Vehicles: Deep learning models are used in self-driving cars for tasks like object detection, lane detection, and decision-making processes.
Molecular Biology:
- Agarose & Polyacrylamide Gel Electrophoresis and Imaging
- Blotting techniques
- DNA & RNA extraction
- Molecular cloning & Restriction enzymes
- Conventional and Real-time PCR
- ELISA & ICT
- Vaccine trial & microsurgery on mouse model
- Library preparation and pooling for Bacterial and Viral sequencing (Illumina)
- Library preparation and pooling (Nanopore)
- DNA/RNA extraction, PCR, Library Preparation (Illumina/Nanopore), Tapestation
Microbiology:
- Culture media preparation (Blood, MacConkey, Mueller-Hinton, and Chocolate agar)
- Gram Staining
- Catalase, Coagulase, Optochin sensitivity, CAMP, Bile solubility, Oxidase, Biochemical (TSI, MIU, Citrate)
- Satellitism, X-V factor
- Antibiotic susceptibility test
Genomics
- Illumina: Sample sheet preparation, conversion (bcl2fastq), quality checking (FastQC, MultiQC, QUAST), quality control (Trimmomatic), assembly (Unicycler, SPAdes, MEGAHIT).
- Annotation and typing: Kraken2, Prokka, SeroBA, AMRFinderPlus, ABRicate, SRST, MLST, Snippy, MAFFT, fasta2phylip, RAxML-NG, PopPUNK, PlasmidFinder, ResFinder, BLAST, Pharokka.
Data Sources
- NCBI
- The Cancer Genome Atlas (TCGA): Comprehensive multi-dimensional cancer genomics data.
- Gene Expression Omnibus (GEO): Public repository for gene expression data, including cancer datasets.
- CELLXGENE
Single Cell RNA Sequence Analysis
1. Data Pre-processing:
- Scanpy: Preprocessing single-cell RNA-seq data with functions like pp.filter_genes, pp.normalize_total, and pp.log1p for filtering, normalization, and scaling.
- Seurat: Preprocessing with functions such as NormalizeData, ScaleData, and FindVariableFeatures to prepare the data for downstream analysis.
- Cell Ranger: Primary analysis pipeline from 10x Genomics for generating FASTQ files, aligning reads, and creating expression matrices.
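The normalization step that pp.normalize_total followed by pp.log1p performs can be sketched in pure Python on a made-up 2-cell x 3-gene count matrix: scale each cell to a common total, then take log(1 + x).

```python
# Pure-Python sketch of normalize-total + log1p preprocessing on a
# made-up count matrix (rows = cells, columns = genes).
import math

counts = [
    [10, 0, 10],  # cell 1: 20 total counts
    [5, 5, 0],    # cell 2: 10 total counts
]
target_sum = 10.0

# Scale every cell so its counts sum to target_sum
normalized = [
    [c * target_sum / sum(cell) for c in cell]
    for cell in counts
]

# Apply log(1 + x) to stabilize variance
logged = [[math.log1p(c) for c in cell] for cell in normalized]

print(normalized[0])  # [5.0, 0.0, 5.0] -- both cells now sum to 10
```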
2. Clustering and Cell Annotation:
- Scanpy: Clustering with algorithms like Louvain (tl.louvain) and Leiden (tl.leiden); annotation with marker genes using tl.rank_genes_groups.
- Seurat: Clustering with FindClusters and RunPCA for dimensionality reduction and visualization; annotation via FindAllMarkers to identify cell-type-specific markers.
- Harmony: Batch correction and integration during clustering to harmonize multiple datasets.
3. Integration and Batch Correction:
- Scanpy: Integration tools like external.pp.bbknn for batch correction using nearest neighbors and external.pp.harmony_integrate for Harmony integration.
- Seurat: Integration tools such as Canonical Correlation Analysis (FindIntegrationAnchors) and Reciprocal PCA, followed by IntegrateData, for batch correction across datasets.
- Harmony: A flexible tool for integration and batch-effect correction during analysis.
- scVI: Scalable Variational Inference for integrating and analyzing multi-modal single-cell datasets.
4. Cell-Cell Communication:
- CellPhoneDB: A Python tool for inferring cell-cell communication based on ligand-receptor interactions.
- NATMI: For predicting ligand-receptor interactions between cell populations.
- iTALK: R package for analyzing and visualizing cell-cell communication using scRNA-seq data.
- CellChat: R package for inferring and visualizing intercellular communication networks.
- SingleCellSignalR: An R package for identifying cell-cell communication networks based on receptor-ligand interaction analysis.
5. BCR Background and 10x Analysis:
- Cell Ranger: For reconstructing BCR and TCR sequences from 10x Genomics single-cell data.
- VDJtools: Analyzing BCR/TCR repertoires from single-cell RNA-seq data.
- scRepertoire: R package for analyzing TCR/BCR sequences in single-cell datasets, providing clonotype tracking and diversity analysis.
6. Trajectory Inference:
- Monocle3: R package for performing pseudotime analysis and reconstructing cell trajectories.
- Slingshot: R package for lineage inference and trajectory reconstruction from single-cell RNA-seq data.
- Scanpy: Trajectory inference with Palantir (available through scanpy.external) and pseudotime analysis with tl.dpt.
- PAGA: Partition-based graph abstraction (tl.paga) for inferring lineage relationships and transitions between cell states in Scanpy.
- Velocyto: A tool for estimating RNA velocity in single cells, predicting the future states of cells in trajectory analysis.
- dynverse: A collection of methods and packages for trajectory inference, integrated with both Seurat and Scanpy workflows.
7. Differential Abundance:
- DAseq: R package for differential abundance analysis of single-cell datasets, identifying significant differences in cell populations.
- Milo: For testing differential abundance of cell neighborhoods across conditions.
- edgeR: For differential expression analysis in scRNA-seq data.
- DESeq2: R package for differential expression analysis in bulk RNA-seq and scRNA-seq data.
8. Multiomic scATAC:
- Signac: An R package built on top of Seurat for integrating scRNA-seq and scATAC-seq data.
- ArchR: A comprehensive tool for single-cell ATAC-seq data analysis, including peak calling, dimensionality reduction, and clustering.
- Cicero: Tool for inferring co-accessibility networks from scATAC-seq data.
- SnapATAC: For scATAC-seq clustering, visualization, and integration with scRNA-seq datasets.
- scVI: Probabilistic modeling for multi-modal single-cell data, supporting scRNA-seq and scATAC-seq integration.
- Research Fellow: Developed ML-based pipelines for shotgun metagenomic sequencing, 16S rRNA sequencing (QIIME2), and RNA-seq (DESeq2), with statistics and visualization using ggplot2, corrplot, pheatmap, EDASeq, and gProfileR.