# Hugging Science — AI for Science Resource Index

> A curated catalog of scientific datasets, models, and blog posts for ML researchers.
> Browse by topic using the /topics/{tag}.md files listed below, or fetch /llms-full.txt for everything at once.
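The per-topic files follow a predictable URL pattern, so an agent can construct and fetch them directly. A minimal sketch, assuming only the base URL and path shape visible in the links below (the `topic_url` helper name is illustrative, not part of the index):

```python
from urllib.parse import urljoin

BASE = "https://huggingscience.co/"

def topic_url(tag: str) -> str:
    """Build the per-topic markdown URL for a tag slug (e.g. 'earth-science')."""
    return urljoin(BASE, f"topics/{tag}.md")

# Fetching a single topic file (requires network access):
# import urllib.request
# text = urllib.request.urlopen(topic_url("astronomy")).read().decode("utf-8")
```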

## Topic Files

- [Astronomy](https://huggingscience.co/topics/astronomy.md): Space science and astrophysics
- [Benchmark](https://huggingscience.co/topics/benchmark.md): Evaluation and benchmarking datasets
- [Biology](https://huggingscience.co/topics/biology.md): Life sciences, genomics, and biological systems
- [Biotechnology](https://huggingscience.co/topics/biotechnology.md): Biological engineering and synthetic biology
- [Chemistry](https://huggingscience.co/topics/chemistry.md): Molecular science, reactions, and materials
- [Climate](https://huggingscience.co/topics/climate.md): Climate science and environmental modeling
- [Conservation](https://huggingscience.co/topics/conservation.md): Wildlife and habitat preservation
- [Earth Science](https://huggingscience.co/topics/earth-science.md): Geology, oceanography, and planetary science
- [Ecology](https://huggingscience.co/topics/ecology.md): Ecosystems and environmental biology
- [Energy](https://huggingscience.co/topics/energy.md): Energy systems and sustainability
- [Engineering](https://huggingscience.co/topics/engineering.md): Applied science and technical systems
- [Genomics](https://huggingscience.co/topics/genomics.md): DNA, RNA, and genetic analysis
- [Materials Science](https://huggingscience.co/topics/materials-science.md): Material properties and discovery
- [Mathematics](https://huggingscience.co/topics/mathematics.md): Mathematical modeling and computational methods
- [Medicine](https://huggingscience.co/topics/medicine.md): Healthcare, drug discovery, and clinical research
- [Physics](https://huggingscience.co/topics/physics.md): Fundamental forces, particles, and physical systems
- [Scientific Reasoning](https://huggingscience.co/topics/scientific-reasoning.md): Scientific QA, theorem proving, and multi-step problem-solving datasets

## Datasets

- [arcinstitute/opengenome2](https://huggingface.co/datasets/arcinstitute/opengenome2): Curated collection of prokaryotic and eukaryotic genomic sequences for training and benchmarking large-scale biological foundation models. [Biology, Genomics, Medicine]
- [arcinstitute/SE-167M-Human](https://huggingface.co/datasets/arcinstitute/SE-167M-Human): 167M human single-cell RNA expression profiles across diverse tissues and cell types, used for training STACK and SE single-cell foundation models. [Biology, Genomics, Medicine]
- [arcinstitute/Stack-CellxGene45M](https://huggingface.co/datasets/arcinstitute/Stack-CellxGene45M): 45M curated single-cell profiles drawn from the CellxGene corpus, standardised for in-context learning and cross-study perturbation analysis. [Biology, Genomics, Medicine]
- [polymathic-ai/active_matter](https://huggingface.co/datasets/polymathic-ai/active_matter): High-fidelity simulations of self-propelled particle systems for benchmarking learned PDE solvers and emergent collective behaviour models. [Physics, Engineering, Benchmark]
- [polymathic-ai/MHD_64](https://huggingface.co/datasets/polymathic-ai/MHD_64): 3D magnetohydrodynamics turbulence simulations at 64³ resolution for training and benchmarking physics-informed neural operators. [Physics, Engineering, Benchmark]
- [polymathic-ai/planetswe](https://huggingface.co/datasets/polymathic-ai/planetswe): Spherical shallow-water equation simulations modelling large-scale planetary atmospheric dynamics for weather and climate surrogate models. [Physics, Earth Science]
- [polymathic-ai/rayleigh_benard](https://huggingface.co/datasets/polymathic-ai/rayleigh_benard): Rayleigh–Bénard thermal convection simulations at varying Rayleigh and Prandtl numbers for benchmarking turbulence and heat transfer models. [Physics, Engineering, Benchmark]
- [polymathic-ai/supernova_explosion_64](https://huggingface.co/datasets/polymathic-ai/supernova_explosion_64): Hydrodynamic simulations of core-collapse supernova explosions at 64³ resolution, spanning diverse progenitor masses and explosion energies. [Physics, Astronomy]
- [ginkgo-datapoints/GDPa1](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1): Antibody developability dataset with biophysical assay data for 242 antibodies across 9 assays. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx1](https://huggingface.co/datasets/ginkgo-datapoints/GDPx1): DRUG-seq functional genomics dataset with chemical perturbation experiments in A549 cells. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx2](https://huggingface.co/datasets/ginkgo-datapoints/GDPx2): DRUG-seq transcriptomic profiling across 4 primary human cell types with 85 compounds. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx3](https://huggingface.co/datasets/ginkgo-datapoints/GDPx3): High-content Cell Painting imaging dataset for AI/ML model training in drug discovery. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx4](https://huggingface.co/datasets/ginkgo-datapoints/GDPx4): DRUG-seq transcriptomic profiling in engineered HEK293 cells with inducible gene overexpression, enabling systematic study of gene-drug interactions. [Biology, Biotechnology]
- [eve-bio/drug-target-activity](https://huggingface.co/datasets/eve-bio/drug-target-activity): Drug-target interaction measurements for 1,397 FDA-approved small molecule drugs. [Biology, Medicine, Chemistry]
- [nasa-impact/WxC-Bench](https://huggingface.co/datasets/nasa-impact/WxC-Bench): Standardised benchmark for evaluating AI models across six atmospheric and earth science tasks including gravity wave parameterisation, turbulence prediction, and hurricane track forecasting. [Earth Science, Climate, Physics, Benchmark]
- [nasa-impact/EO-via-NLP](https://huggingface.co/datasets/nasa-impact/EO-via-NLP): Paired earth observation imagery and natural-language descriptions for training and evaluating multimodal models on remote sensing understanding tasks. [Earth Science, Climate]
- [proxima-fusion/constellaration](https://huggingface.co/datasets/proxima-fusion/constellaration): Large-scale dataset of quasi-isodynamic stellarator designs with MHD equilibria for fusion energy research. [Physics, Energy, Engineering]
- [EarthSpeciesProject/BEANS-Zero](https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero): Zero-shot bioacoustics benchmark evaluating audio-language models on species detection, classification, and captioning across diverse animal taxa. [Biology, Ecology, Conservation, Benchmark, Earth Science]
- [SandboxAQ/SAIR](https://huggingface.co/datasets/SandboxAQ/SAIR): Largest public dataset of protein-ligand 3D structures with binding affinity measurements (1M+ pairs). [Chemistry, Medicine, Biology]
- [SandboxAQ/aqcat25-dataset](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset): 13.5M DFT calculation trajectories for heterogeneous catalysis and ML potential training. [Chemistry, Materials Science, Engineering]
- [jablonkagroup/chempile-mlift](https://huggingface.co/datasets/jablonkagroup/chempile-mlift): Curated lift-off subset of the ChemPile corpus for instruction-tuning and benchmarking chemistry language models across synthesis, property prediction, and reaction tasks. [Chemistry]
- [jablonkagroup/ChemBench](https://huggingface.co/datasets/jablonkagroup/ChemBench): Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs. [Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics]
- [jablonkagroup/chempile-paper](https://huggingface.co/datasets/jablonkagroup/chempile-paper): Large corpus of peer-reviewed chemistry papers and preprints for pre-training and fine-tuning chemistry language models. [Chemistry]
- [AI-MO/aops_raw](https://huggingface.co/datasets/AI-MO/aops_raw): Raw problem posts and discussion threads from the Art of Problem Solving forums, spanning AMC, AIME, and international olympiad competitions. [Mathematics]
- [AI-MO/olympiads-ref-base](https://huggingface.co/datasets/AI-MO/olympiads-ref-base): Canonical reference set of international and national mathematical olympiad problems, used as the base for downstream NuminaMath training splits. [Mathematics]
- [AI-MO/olympiads-ref](https://huggingface.co/datasets/AI-MO/olympiads-ref): Extended reference set of olympiad problems with verified step-by-step solutions, used for Chain-of-Thought and formal reasoning training. [Mathematics, Scientific Reasoning]
- [AI-MO/Kimina-Prover-Promptset](https://huggingface.co/datasets/AI-MO/Kimina-Prover-Promptset): Prompt-set for training and evaluating Kimina, a Lean 4 theorem prover that uses reinforcement learning over formal mathematical proofs. [Mathematics, Scientific Reasoning]
- [AI-MO/NuminaMath-LEAN](https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN): Mathematical problems formalised in the Lean proof assistant. [Mathematics]
- [AI-MO/GeometryLeanBench](https://huggingface.co/datasets/AI-MO/GeometryLeanBench): Geometry theorem proving problems formalised in Lean 4, covering Euclidean, affine, and metric geometry for automated reasoning evaluation. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/CombiBench](https://huggingface.co/datasets/AI-MO/CombiBench): Combinatorics problems drawn from AMC, AIME, and olympiad competitions, formalised for benchmarking discrete-mathematics reasoning in language models. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/minif2f_test](https://huggingface.co/datasets/AI-MO/minif2f_test): Test split of the miniF2F formal mathematics benchmark. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/aimo-validation-amc](https://huggingface.co/datasets/AI-MO/aimo-validation-amc): AMC 10/12 competition problems reformatted for AIMO challenge validation, covering algebra, geometry, and number theory at difficulty levels 1–5. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/aimo-validation-aime](https://huggingface.co/datasets/AI-MO/aimo-validation-aime): AIME I/II problems reformatted for AIMO challenge validation — 15-question integer-answer format, covering competition math at difficulty levels 5–9. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/NuminaMath-1.5](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5): 860K+ competition math problems from 17 sources with verified solutions — the training backbone of the gold-medal solution at the 2024 AI Mathematical Olympiad. [Mathematics]
- [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR): NuminaMath with Tool-Integrated Reasoning annotations. [Mathematics, Scientific Reasoning]
- [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT): NuminaMath with Chain-of-Thought reasoning annotations. [Mathematics, Scientific Reasoning]
- [AI-MO/aimo-validation-math-level-4](https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-4): Level-4 MATH benchmark problems (pre-calculus difficulty) used for AIMO challenge validation and fine-grained model evaluation. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/aimo-validation-math-level-5](https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-5): Level-5 MATH benchmark problems (highest difficulty) used for AIMO challenge validation and measuring the ceiling of model mathematical reasoning. [Mathematics, Benchmark, Scientific Reasoning]
- [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA): Mathematical question-answering dataset for training and evaluating math reasoning. [Mathematics]
- [google/spiqa](https://huggingface.co/datasets/google/spiqa): Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains. [Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics]
- [nasa-ibm-ai4science/surya-bench-flare-forecasting](https://huggingface.co/datasets/nasa-ibm-ai4science/surya-bench-flare-forecasting): Full-disk solar flare forecasting dataset from NOAA GOES observations, providing multi-hour-ahead flare probability labels for heliophysics model evaluation. [Astronomy, Physics, Benchmark]
- [nasa-ibm-ai4science/core-sdo](https://huggingface.co/datasets/nasa-ibm-ai4science/core-sdo): Multi-modal Solar Dynamics Observatory dataset combining EUV imagery, magnetograms, and irradiance spectra for solar foundation model pre-training. [Astronomy, Physics]
- [LeMaterial/LeMat-Bulk-MLIP-Hull](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-MLIP-Hull): Convex hull data for bulk materials from MLIP calculations. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Bulk-DFT-Hull-All](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull-All): Complete DFT convex hull dataset for bulk materials discovery. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Bulk-DFT-Hull](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull): DFT convex hull reference data for materials stability analysis. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Bulk](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk): Primary bulk materials database aggregating 1M+ crystal structures with DFT-computed formation energies, band gaps, and elastic properties for materials discovery. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Traj](https://huggingface.co/datasets/LeMaterial/LeMat-Traj): Large-scale molecular dynamics trajectory dataset for training machine learning interatomic potentials across diverse bulk material compositions. [Materials Science, Chemistry, Physics, Engineering]
- [openadmet/openadmet-expansionrx-challenge-train-data](https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-train-data): Training data for the OpenADMET ExpansionRx ADMET prediction challenge. [Medicine, Chemistry]
- [openadmet/openadmet-expansionrx-challenge-data](https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-data): Full ExpansionRx challenge dataset of RNA-targeted small-molecule compounds with measured ADMET properties for open pharmacokinetics benchmarking. [Medicine, Chemistry, Benchmark]
- [openadmet/Octant_CYP_inhibition_reactivity_blog_release](https://huggingface.co/datasets/openadmet/Octant_CYP_inhibition_reactivity_blog_release): Octant CYP inhibition and chemical reactivity dataset measuring cytochrome P450 activity across a diverse compound library for ADMET modelling. [Medicine, Chemistry]
- [InstaDeepAI/NTv3_benchmark_dataset](https://huggingface.co/datasets/InstaDeepAI/NTv3_benchmark_dataset): Benchmark dataset with functional tracks and genome annotations across 7 species. [Biology, Genomics, Benchmark]
- [InstaDeepAI/nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks): 18 genomic prediction benchmark tasks covering histone marks, regulatory regions, splice sites, and promoter activity across human and multi-species genomes. [Biology, Genomics, Benchmark]
- [InstaDeepAI/multi_species_genomes](https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes): Whole-genome sequences for 850 species spanning bacteria, fungi, plants, and animals — the pre-training corpus for the Nucleotide Transformer model family. [Biology, Genomics]
- [InstaDeepAI/plant-genomic-benchmark](https://huggingface.co/datasets/InstaDeepAI/plant-genomic-benchmark): Plant genomics benchmark spanning gene expression, chromatin accessibility, and agronomic trait prediction tasks across multiple crop and model plant species. [Biology, Genomics, Benchmark]
- [InstaDeepAI/winnow-ms-datasets](https://huggingface.co/datasets/InstaDeepAI/winnow-ms-datasets): Mass spectrometry datasets for protein analysis and ML model training. [Biology, Chemistry]
- [InstaDeepAI/true-cds-protein-tasks](https://huggingface.co/datasets/InstaDeepAI/true-cds-protein-tasks): Coding sequence and protein function prediction benchmark tasks. [Biology, Genomics, Benchmark]
- [facebook/principia-collection](https://huggingface.co/datasets/facebook/principia-collection): Large-scale STEM reasoning dataset from Meta covering mathematics, physics, chemistry, and biology problems for training and evaluating scientific reasoning in language models. [Mathematics, Physics, Chemistry, Scientific Reasoning]
- [facebook/principia-bench](https://huggingface.co/datasets/facebook/principia-bench): Curated benchmark of challenging STEM problems requiring multi-step reasoning, quantitative analysis, and domain knowledge across natural sciences. [Mathematics, Physics, Chemistry, Benchmark, Scientific Reasoning]
- [futurehouse/BixBench](https://huggingface.co/datasets/futurehouse/BixBench): Benchmark with 205 reproducible research questions paired with data capsules for AI evaluation. [Biology, Chemistry, Benchmark, Scientific Reasoning, Mathematics]
- [futurehouse/lab-bench](https://huggingface.co/datasets/futurehouse/lab-bench): Language Agent Biology Benchmark: 8 categories of scientific research tasks including cloning, figures, and protocols. [Biology, Benchmark, Scientific Reasoning, Mathematics]
- [futurehouse/ether0-benchmark](https://huggingface.co/datasets/futurehouse/ether0-benchmark): Chemistry reasoning benchmark covering SMILES-based tasks including reaction prediction, retrosynthesis, and molecular property estimation for evaluating chemistry LLMs. [Chemistry, Medicine, Benchmark, Scientific Reasoning, Mathematics]
- [ONERA/SARLO-80](https://huggingface.co/datasets/ONERA/SARLO-80): 119K paired SAR/optical images with text captions at 80cm resolution for multimodal learning. [Earth Science, Engineering]
- [tahoebio/Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M): Giga-scale perturbation atlas with 100M+ single-cell profiles from 50 cancer cell lines and 1,100 drugs. [Biology, Medicine, Genomics]
- [tahoebio/Tahoe-x1-embeddings](https://huggingface.co/datasets/tahoebio/Tahoe-x1-embeddings): Pre-computed cell and gene embeddings from the Tahoe-x1 foundation model. [Biology, Medicine, Genomics]
- [owkin/plism-dataset-tiles](https://huggingface.co/datasets/owkin/plism-dataset-tiles): Large-scale histopathology tile dataset for benchmarking robustness of pathology foundation models across staining and scanner variability. [Medicine, Biology]
- [owkin/nct-crc-he](https://huggingface.co/datasets/owkin/nct-crc-he): Colorectal cancer tissue classification dataset with H&E-stained patches across 9 tissue classes, widely used for benchmarking pathology models. [Medicine, Biology, Benchmark]
- [owkin/camelyon16-features](https://huggingface.co/datasets/owkin/camelyon16-features): Pre-extracted features from the CAMELYON16 breast cancer lymph node metastasis detection challenge, enabling efficient benchmarking of MIL methods. [Medicine, Biology, Benchmark]
- [owkin/her2-challenge-2026](https://huggingface.co/datasets/owkin/her2-challenge-2026): HER2 scoring challenge dataset with H&E-stained whole-slide images for evaluating AI-based HER2 status prediction in breast cancer. [Medicine, Biology, Benchmark]
- [Xaira-Therapeutics/X-Atlas-Orion](https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Orion): Large-scale single-cell transcriptomics atlas with millions of cell profiles from diverse human tissues, designed for training perturbation-aware foundation models. [Biology, Medicine, Genomics]
- [Xaira-Therapeutics/X-Atlas-Pisces](https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Pisces): CRISPRi perturbation single-cell dataset pairing genetic knockdowns with transcriptomic responses, used for training and evaluating the X-Cell model. [Biology, Medicine, Genomics]
- [SAIRfoundation/equational-theories-selected-problems](https://huggingface.co/datasets/SAIRfoundation/equational-theories-selected-problems): Curated selection of equational theory problems for benchmarking LLM mathematical reasoning and automated theorem proving. [Mathematics, Scientific Reasoning, Benchmark]
- [SAIRfoundation/equational-theories-benchmark](https://huggingface.co/datasets/SAIRfoundation/equational-theories-benchmark): Full benchmark suite of equational theory problems spanning algebraic structures, designed to evaluate formal reasoning capabilities of AI models. [Mathematics, Scientific Reasoning, Benchmark]
- [AllTheBacteria/ATB](https://huggingface.co/datasets/AllTheBacteria/ATB): AllTheBacteria: a comprehensive collection of ~2 million bacterial genome assemblies from public sequence databases, standardised for large-scale genomic analysis. [Biology, Genomics]
- [AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity](https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity): High-diversity corpus of bacterial protein sequences derived from the ATB collection, filtered for maximum sequence diversity to support protein language model pretraining. [Biology, Genomics]
- [AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity](https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity): High-diversity corpus of bacterial intergenic DNA sequences for training DNA language models on non-coding regulatory regions. [Biology, Genomics]
- [AllTheBacteria/SPIRE](https://huggingface.co/datasets/AllTheBacteria/SPIRE): Searchable Planetary-scale mIcrobiome REsource: a large-scale metagenomics resource aggregating environmental microbiome samples from diverse global habitats. [Biology, Genomics, Ecology, Earth Science]
- [isp-uv-es/WorldFloodsv2](https://huggingface.co/datasets/isp-uv-es/WorldFloodsv2): Global flood mapping dataset with Sentinel-1/2 and Landsat imagery paired with flood extent labels across hundreds of flood events worldwide. [Earth Science, Climate]
- [isp-uv-es/CloudSEN12Plus](https://huggingface.co/datasets/isp-uv-es/CloudSEN12Plus): Large-scale cloud detection dataset with 49,000+ Sentinel-2 patches and expert-quality cloud/shadow annotations across global biomes and seasons. [Earth Science, Climate, Benchmark]
- [isp-uv-es/rtm_emulation](https://huggingface.co/datasets/isp-uv-es/rtm_emulation): Atmospheric radiative transfer model emulation dataset for training fast neural surrogates to replace computationally expensive RTM simulations in satellite data processing. [Earth Science, Climate, Physics]
- [isp-uv-es/opensr-test](https://huggingface.co/datasets/isp-uv-es/opensr-test): Benchmark dataset for real-world Sentinel-2 super-resolution, with paired low/high-resolution imagery and perceptual quality metrics. [Earth Science, Benchmark]
- [opig/OAS](https://huggingface.co/datasets/opig/OAS): Observed Antibody Space: a curated database of over one billion antibody sequences from immune repertoire sequencing studies, the standard resource for antibody ML. [Biology, Medicine, Chemistry]
- [UniverseTBD/arxiv-abstracts-large](https://huggingface.co/datasets/UniverseTBD/arxiv-abstracts-large): 1.7 million scholarly article abstracts spanning physics, computer science, and statistics from arXiv, structured for pretraining and fine-tuning astronomy and scientific language models. [Astronomy, Physics]
- [UniverseTBD/AstroLLaVA_convos](https://huggingface.co/datasets/UniverseTBD/AstroLLaVA_convos): Astronomical images paired with detailed captions and question-answer pairs sourced from APOD, ESO, and ESA Hubble archives, for training multimodal vision-language models on astrophysics. [Astronomy, Physics]
- [openai/healthbench](https://huggingface.co/datasets/openai/healthbench): Realistic multi-turn health conversations graded against physician-written rubrics across multiple axes (accuracy, completeness, communication) — an open evaluation benchmark for AI assistants in medicine. [Medicine, Benchmark, Scientific Reasoning]
- [openai/healthbench-professional](https://huggingface.co/datasets/openai/healthbench-professional): Professional-graded subset of HealthBench: physician evaluators score model responses to clinically realistic conversations, targeting expert-level health assessment. [Medicine, Benchmark, Scientific Reasoning]
- [openai/frontierscience](https://huggingface.co/datasets/openai/frontierscience): Frontier science evaluation benchmark probing model capabilities on expert-level reasoning across natural sciences — designed to surface what AI systems can and cannot do at the research frontier. [Scientific Reasoning, Benchmark]
- [wanglab/CT_DeepLesion-MedSAM2](https://huggingface.co/datasets/wanglab/CT_DeepLesion-MedSAM2): CT volumes from the DeepLesion benchmark with mask annotations restructured for training and evaluating MedSAM2, the universal medical image segmentation foundation model. [Medicine, Biology]
- [wanglab/img_virus_plasmid](https://huggingface.co/datasets/wanglab/img_virus_plasmid): Combined IMG/VR (uncultivated virus genomes) and IMG/PR (plasmids from genomes and metagenomes) catalog with rich functional, taxonomic, and ecological metadata. [Biology, Genomics, Biotechnology]
- [wanglab/kegg](https://huggingface.co/datasets/wanglab/kegg): KEGG pathway entries paired with variant annotations for training and evaluating multimodal biological reasoning models (used by the BioReason work). [Biology, Genomics, Scientific Reasoning]
- [AI-MO/olympiads](https://huggingface.co/datasets/AI-MO/olympiads): Olympiad-level mathematical problems collected from international and national competitions, formatted for training and evaluating mathematical reasoning models. [Mathematics, Scientific Reasoning]
- [OpenMed/MedDialog](https://huggingface.co/datasets/OpenMed/MedDialog): Doctor-patient medical dialogue dataset for training and evaluating clinical conversation models — covers triage, symptom checking, and diagnostic reasoning. [Medicine, Scientific Reasoning]
- [OpenMed/Medical-Reasoning-SFT-Mega](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Mega): Large supervised fine-tuning corpus for clinical reasoning — multi-step medical question-answer chains with rationales for training instruction-following medical LLMs. [Medicine, Scientific Reasoning]
- [OpenMed/synthvision-annotated-qwen](https://huggingface.co/datasets/OpenMed/synthvision-annotated-qwen): Synthetic medical-imaging dataset annotated by Qwen — used in OpenMed’s SynthVision pipeline for training and validating medical multimodal models. [Medicine, Biology]
- [OpenMed/synthvision-seeds](https://huggingface.co/datasets/OpenMed/synthvision-seeds): Seed prompts and source imagery feeding the SynthVision generation pipeline that produces OpenMed’s annotated medical-imaging training corpora. [Medicine, Biology]
- [OpenMed/synthvision-annotated-kimi](https://huggingface.co/datasets/OpenMed/synthvision-annotated-kimi): Synthetic medical-imaging dataset annotated by Kimi — sister set to the Qwen-annotated split, supporting cross-annotator validation in the SynthVision pipeline. [Medicine, Biology]
- [allenai/peS2o](https://huggingface.co/datasets/allenai/peS2o): Approximately 40M cleaned, filtered, and formatted open-access academic papers derived from S2ORC — a large multi-domain pretraining corpus for science-aware language models, spanning biology, chemistry, engineering, computer science, and physics. [Scientific Reasoning, Biology, Chemistry, Physics, Engineering]
- [Anthropic/BioMysteryBench-preview](https://huggingface.co/datasets/Anthropic/BioMysteryBench-preview): Preview slice of BioMysteryBench — challenging, expert-curated biology problems for evaluating AI scientific reasoning capability. [Biology, Medicine, Scientific Reasoning, Benchmark]
- [Anthropic/BioMysteryBench-full](https://huggingface.co/datasets/Anthropic/BioMysteryBench-full): Full BioMysteryBench evaluation set — challenging biology problems used to probe expert-level scientific reasoning in frontier models. [Biology, Medicine, Scientific Reasoning, Benchmark]
- [neashton/drivaerml](https://huggingface.co/datasets/neashton/drivaerml): High-fidelity CFD simulation dataset of the DrivAer reference automotive geometry — resolved-flow data for training ML models on aerodynamics prediction (drag, downforce, surface pressure). [Engineering, Physics]
- [PLAID-datasets/AirfRANS_original](https://huggingface.co/datasets/PLAID-datasets/AirfRANS_original): Original AirfRANS airfoil RANS simulation dataset — graph-structured CFD over NACA airfoils for benchmarking physics-informed and graph neural networks. [Physics, Engineering, Scientific Reasoning]
- [luminary-shift/SUV](https://huggingface.co/datasets/luminary-shift/SUV): Large-scale CFD dataset of SUV-class vehicles for training ML models on automotive aerodynamics — surface pressures, wake structures, and aerodynamic performance metrics. [Engineering, Physics]
- [luminary-shift/Pump](https://huggingface.co/datasets/luminary-shift/Pump): CFD simulations of centrifugal pumps spanning operating conditions — for training ML surrogates of turbomachinery flow and performance. [Engineering, Physics]
- [luminary-shift/SHIFT-Crash](https://huggingface.co/datasets/luminary-shift/SHIFT-Crash): Vehicle crash-simulation dataset capturing structural deformation under impact — for ML-based safety and structural-mechanics modelling. [Engineering, Physics]
- [luminary-shift/WING](https://huggingface.co/datasets/luminary-shift/WING): Wing-flow CFD dataset for ML-driven aerodynamics — covers a range of geometries and flight conditions for surrogate modelling. [Engineering, Physics]
- [luminary-shift/CCA](https://huggingface.co/datasets/luminary-shift/CCA): Common Compressor Aero (CCA) dataset — compressor and turbomachinery simulations for ML-augmented aerospace design workflows. [Engineering, Physics]
- [luminary-shift/Submarine](https://huggingface.co/datasets/luminary-shift/Submarine): Submarine hydrodynamics CFD dataset — submerged-body flow simulations for ML-based marine engineering and naval design. [Engineering, Physics]
- [jablonkagroup/chempile-instruction](https://huggingface.co/datasets/jablonkagroup/chempile-instruction): Instruction-tuning corpus for chemistry — curated Q&A and dialogue traces drawn from chemical literature and educational sources for training chemistry-specialist LLMs. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-reasoning](https://huggingface.co/datasets/jablonkagroup/chempile-reasoning): Multi-step chemistry reasoning corpus — open-domain QA, NLI, and multiple-choice items with chains of reasoning for training and evaluating chemical reasoning models. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-lift](https://huggingface.co/datasets/jablonkagroup/chempile-lift): ChemPile-LIFT — large-scale language-modelling dataset combining curated chemistry literature and structured chemical knowledge for foundation-model pretraining. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-education](https://huggingface.co/datasets/jablonkagroup/chempile-education): Educational chemistry corpus — multiple-choice and open-ended items spanning introductory through graduate chemistry for assessing model educational capability. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-caption](https://huggingface.co/datasets/jablonkagroup/chempile-caption): Image-to-text dataset of chemistry figures (molecular structures, reaction schemes, plots) with expert captions for training multimodal chemistry models. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-code](https://huggingface.co/datasets/jablonkagroup/chempile-code): Curated chemistry-relevant code (RDKit, ASE, simulation tooling) drawn from The Stack — supports training models that can read and write computational chemistry workflows. [Chemistry, Scientific Reasoning]
- [jablonkagroup/MaCBench](https://huggingface.co/datasets/jablonkagroup/MaCBench): Materials Chemistry Benchmark — multimodal QA, multiple-choice, and visual-question-answering items for evaluating LLMs on materials and inorganic chemistry tasks. [Chemistry, Materials Science, Benchmark]
- [miriad/miriad-5.8M](https://huggingface.co/datasets/miriad/miriad-5.8M): 5.8M-example medical instruction-tuning and reasoning corpus curated from clinical literature for training healthcare LLMs at scale. [Medicine, Scientific Reasoning]
- [miriad/miriad-4.4M](https://huggingface.co/datasets/miriad/miriad-4.4M): 4.4M-example medical reasoning subset of MIRIAD — earlier release used for benchmarking medical instruction-tuning workflows. [Medicine, Scientific Reasoning]
- [maomlab/Molecule3D](https://huggingface.co/datasets/maomlab/Molecule3D): Curated 3D molecular structures with computed properties — supports geometric deep learning for property prediction and conformer-aware modelling. [Chemistry, Biology]
- [maomlab/TDC](https://huggingface.co/datasets/maomlab/TDC): Therapeutics Data Commons subset — drug-discovery tasks (ADMET, drug-target interaction, generation) curated for benchmarking molecular ML. [Medicine, Chemistry, Biology]
- [maomlab/B3DB](https://huggingface.co/datasets/maomlab/B3DB): Blood-Brain Barrier Database (B3DB) — curated permeability measurements for compounds, supporting CNS drug-discovery ML benchmarks. [Medicine, Chemistry]
- [maomlab/ChAFF](https://huggingface.co/datasets/maomlab/ChAFF): ChAFF — chemistry dataset for ML benchmarking on filtered/curated molecular properties, part of the Maom Lab pharmacology suite. [Chemistry]
- [maomlab/CryptoCEN](https://huggingface.co/datasets/maomlab/CryptoCEN): CryptoCEN — Cryptococcus coexpression network dataset for fungal pathogen biology and drug-target prioritisation. [Biology, Medicine]
- [imageomics/TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M): Foundational 200M-image dataset for organismal biology — multilingual species labels (en, la) at biodiversity scale, used to train BioCLIP-2 for zero-shot species classification. [Biology, Ecology, Conservation]
- [microsoft/msr-acc-tae25](https://huggingface.co/datasets/microsoft/msr-acc-tae25): Microsoft Research Accurate Chemistry Collection — large dataset of high-accuracy electronic-structure calculations (TAE25: total atomization energies) for training and evaluating quantum-chemistry ML models. [Chemistry, Physics]
- [Aignostics/OpenTME](https://huggingface.co/datasets/Aignostics/OpenTME): Pre-analyzed H&E whole-slide images from TCGA across breast, bladder, colorectal, liver, and lung cancers — cell-level annotations and tumour-microenvironment spatial features generated by Atlas H&E-TME. [Medicine, Biology]
- [Orbital-Materials/MofasaDB](https://huggingface.co/datasets/Orbital-Materials/MofasaDB): Metal-organic framework dataset from Orbital — large-scale curated MOF structures for materials-discovery ML and synthetic chemistry workflows. [Materials Science, Chemistry]
- [wanglab/bioreason-pro-sft-reasoning-data](https://huggingface.co/datasets/wanglab/bioreason-pro-sft-reasoning-data): Reasoning trace dataset used to supervised-fine-tune BioReason-Pro — multimodal biological problems with rationales over genomic variants and pathway data. [Biology, Genomics, Scientific Reasoning]
- [foundry-ml/foundry_oqmd_band_gaps_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_oqmd_band_gaps_v1-1): Band-gap values from the Open Quantum Materials Database (OQMD), prepared for ML benchmarking on inorganic crystal electronic structure. [Materials Science, Physics, Chemistry]
- [foundry-ml/foundry_aflow_band_gaps_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_aflow_band_gaps_v1-1): Band-gap values from the AFLOW high-throughput materials database, formatted for ML model training and evaluation. [Materials Science, Physics, Chemistry]
- [foundry-ml/foundry_mp_band_gaps_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_mp_band_gaps_v1-1): Band-gap values curated from the Materials Project for ML benchmarking on inorganic electronic structure. [Materials Science, Physics, Chemistry]
- [foundry-ml/double_perovskite_bandgap_v1-1](https://huggingface.co/datasets/foundry-ml/double_perovskite_bandgap_v1-1): Computed band gaps for double-perovskite compounds — supports ML-based screening for photovoltaic and optoelectronic applications. [Materials Science, Physics, Chemistry, Energy]
- [foundry-ml/wolverton_oxides_v1-1](https://huggingface.co/datasets/foundry-ml/wolverton_oxides_v1-1): Wolverton oxide property dataset — DFT-computed properties for binary and ternary oxides, used for ML benchmarking on inorganic chemistry. [Materials Science, Chemistry]
- [foundry-ml/dataset_perovskite_formatione](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_formatione): Formation energies for perovskite compounds — supports ML screening for stability and synthesisability. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_stability_updated](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_stability_updated): Curated perovskite stability data (updated release) for benchmarking ML models on photovoltaic-material durability prediction. [Materials Science, Chemistry, Energy]
- [foundry-ml/perovskite_stability_v1-1](https://huggingface.co/datasets/foundry-ml/perovskite_stability_v1-1): Perovskite stability dataset (v1.1 release) — paired structure and stability labels for ML benchmarking. [Materials Science, Chemistry, Energy]
- [foundry-ml/perovskite_opbandcenter_v1-1](https://huggingface.co/datasets/foundry-ml/perovskite_opbandcenter_v1-1): O p-band center values for perovskite oxides — descriptors for catalytic activity prediction in oxygen-evolution reactions. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_conductivity](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_conductivity): Ionic and electronic conductivity measurements for perovskite materials — supports ML screening for solid-oxide fuel cell electrolytes. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_habs](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_habs): Habs measurements for perovskite materials — part of the Foundry solid-oxide fuel cell property suite alongside the ASR, TEC, and conductivity sets, curated for ML property prediction. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_tec](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_tec): Thermal expansion coefficients for perovskite materials — curated for ML thermal-property prediction. [Materials Science, Chemistry, Physics]
- [foundry-ml/dataset_perovskite_asr](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_asr): Area-specific resistance (ASR) data for perovskite electrodes — used in solid-oxide fuel cell ML modelling. [Materials Science, Chemistry, Energy]
- [foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1](https://huggingface.co/datasets/foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1): Simulated STM images for 2D materials with unique chemical compositions — supports ML on atomic-resolution microscopy. [Materials Science, Physics]
- [foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba](https://huggingface.co/datasets/foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba): Simulated STEM images for 2D materials — paired with structure metadata for training ML models on electron microscopy. [Materials Science, Physics]
- [foundry-ml/training_locating_atoms_stem_images_v1-2](https://huggingface.co/datasets/foundry-ml/training_locating_atoms_stem_images_v1-2): STEM image training set for atomic-position localisation — supports ML pipelines for automated microscopy analysis. [Materials Science, Physics]
- [foundry-ml/mask_rcnn_defect_detection_v1-1](https://huggingface.co/datasets/foundry-ml/mask_rcnn_defect_detection_v1-1): Microscopy image dataset annotated for instance-segmentation defect detection — Mask R-CNN training data for materials inspection. [Materials Science, Engineering]
- [foundry-ml/foundry_stan_segmentation_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_stan_segmentation_v1-1): Segmentation dataset (STAN) for materials microscopy images — supports ML feature extraction from electron-microscopy data. [Materials Science, Engineering]
- [foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1](https://huggingface.co/datasets/foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1): Simulated readout images from a Celeritas XS direct-electron detector — training data for electron-counting models in cryo-EM and STEM. [Materials Science, Physics]
- [foundry-ml/elastic_tensor_v1-1](https://huggingface.co/datasets/foundry-ml/elastic_tensor_v1-1): Elastic tensor data for inorganic materials — supports ML prediction of bulk and shear moduli. [Materials Science, Physics, Engineering]
- [foundry-ml/piezoelectric_tensor_v1-1](https://huggingface.co/datasets/foundry-ml/piezoelectric_tensor_v1-1): Piezoelectric tensor data for inorganic materials — supports ML for sensor and actuator material design. [Materials Science, Physics, Engineering]
- [foundry-ml/dielectric_constant_v1-1](https://huggingface.co/datasets/foundry-ml/dielectric_constant_v1-1): Dielectric-constant values for inorganic compounds — supports ML screening of high-k materials for capacitors and devices. [Materials Science, Physics]
- [foundry-ml/semiconductor_defectlevels_v1-1](https://huggingface.co/datasets/foundry-ml/semiconductor_defectlevels_v1-1): Computed defect-energy levels in semiconductors — descriptors for ML doping and trap-state prediction. [Materials Science, Physics]
- [foundry-ml/superconductivity_v1-1](https://huggingface.co/datasets/foundry-ml/superconductivity_v1-1): Curated superconductor dataset — measured Tc values for ML-based discovery of new superconducting materials. [Materials Science, Physics, Energy]
- [foundry-ml/electromigration_v1-1](https://huggingface.co/datasets/foundry-ml/electromigration_v1-1): Electromigration data for interconnect materials — supports ML prediction of failure rates in microelectronic devices. [Materials Science, Engineering]
- [foundry-ml/steel_strength_v1-1](https://huggingface.co/datasets/foundry-ml/steel_strength_v1-1): Steel strength dataset — composition-property pairs for ML-based alloy design and high-strength materials. [Materials Science, Engineering]
- [foundry-ml/dataset_mg_alloy](https://huggingface.co/datasets/foundry-ml/dataset_mg_alloy): Magnesium alloy dataset — composition and property data for ML modelling of lightweight structural alloys. [Materials Science, Engineering]
- [foundry-ml/dataset_metallicglass_rc](https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc): Critical cooling rate (Rc) data for metallic glasses — supports ML prediction of glass-forming ability. [Materials Science, Engineering]
- [foundry-ml/dataset_metallicglass_rc_llm](https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc_llm): LLM-extracted critical cooling rate data for metallic glasses — text-mined complement to the structured Rc dataset. [Materials Science, Engineering, Scientific Reasoning]
- [foundry-ml/dataset_metallicglass_dmax](https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_dmax): Maximum glass-forming diameter (Dmax) data for bulk metallic glasses — for ML screening of casting feasibility. [Materials Science, Engineering]
- [foundry-ml/dataset_concrete_compressive_strength](https://huggingface.co/datasets/foundry-ml/dataset_concrete_compressive_strength): Concrete compressive-strength dataset — mix-design and test data for ML-based civil-engineering material modelling. [Materials Science, Engineering]
- [foundry-ml/dataset_rpv_tts](https://huggingface.co/datasets/foundry-ml/dataset_rpv_tts): Reactor pressure-vessel (RPV) transition-temperature shift dataset — supports ML prediction of irradiation embrittlement. [Materials Science, Engineering, Physics]
- [foundry-ml/dataset_exfoliatione](https://huggingface.co/datasets/foundry-ml/dataset_exfoliatione): Exfoliation energy dataset for 2D materials — supports ML-driven discovery of layered compounds suitable for monolayer isolation. [Materials Science, Physics, Chemistry]
- [foundry-ml/dataset_thermalexp_aflow](https://huggingface.co/datasets/foundry-ml/dataset_thermalexp_aflow): Thermal expansion coefficients from the AFLOW database — for ML thermal-mechanical modelling of inorganic materials. [Materials Science, Physics]
- [foundry-ml/dataset_thermalcond_aflow](https://huggingface.co/datasets/foundry-ml/dataset_thermalcond_aflow): Thermal conductivity values from the AFLOW database — supports ML-based screening of thermal management materials. [Materials Science, Physics]
- [foundry-ml/dataset_debyet_aflow](https://huggingface.co/datasets/foundry-ml/dataset_debyet_aflow): Debye temperature data from the AFLOW database — fundamental thermal-vibrational descriptor for ML materials property prediction. [Materials Science, Physics]
- [foundry-ml/heusler_magnetization_v1-1](https://huggingface.co/datasets/foundry-ml/heusler_magnetization_v1-1): Magnetisation data for Heusler-alloy compounds — supports ML discovery of half-metallic and magnetocaloric materials. [Materials Science, Physics]
- [foundry-ml/dataset_li_conductivity](https://huggingface.co/datasets/foundry-ml/dataset_li_conductivity): Lithium-ion conductivity dataset for solid electrolytes — supports ML discovery of next-generation battery materials. [Materials Science, Chemistry, Energy]
- [foundry-ml/elwood_md_v1-2](https://huggingface.co/datasets/foundry-ml/elwood_md_v1-2): Elwood molecular-dynamics simulation set — trajectory and energy data for ML molecular-property prediction. [Chemistry, Materials Science]
- [foundry-ml/foundry_g4mp2_solvation_v1-2](https://huggingface.co/datasets/foundry-ml/foundry_g4mp2_solvation_v1-2): High-accuracy G4MP2 solvation-energy data — supports ML for quantum-chemical accuracy on aqueous and organic systems. [Chemistry, Physics]
- [foundry-ml/foundry_moses_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_moses_v1-1): Foundry mirror of MOSES — molecular sets benchmark for evaluating generative chemistry models on drug-like molecule generation. [Chemistry, Medicine]
- [foundry-ml/foundry_osdb_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_osdb_v1-1): Organic Semiconductor Database (OSDB) curated for ML — supports property prediction and screening of organic optoelectronic materials. [Chemistry, Materials Science, Energy]
- [foundry-ml/foundry_qmc_ml_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_qmc_ml_v1-1): Quantum Monte Carlo (QMC) reference data for ML benchmarking — high-accuracy electronic structure calculations on small molecules. [Chemistry, Physics]
- [foundry-ml/diffusion_v1-4](https://huggingface.co/datasets/foundry-ml/diffusion_v1-4): Diffusion-coefficient dataset for inorganic systems — supports ML modelling of solid-state ion transport and electrolyte design. [Materials Science, Chemistry, Energy]
- [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT): Medical chain-of-thought reasoning dataset (o1-style) for supervised fine-tuning of medical LLMs — one of the most-liked medical training corpora on Hugging Face (1000+ likes). [Medicine, Scientific Reasoning]
- [FreedomIntelligence/medical-o1-verifiable-problem](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem): Verifiable medical reasoning problems with checker functions — supports RL/reward-model training for medical-LLM alignment beyond static SFT. [Medicine, Scientific Reasoning]
- [tattabio/OMG](https://huggingface.co/datasets/tattabio/OMG): Open Mixed Genomes (OMG) — large mixed-organism nucleotide corpus underpinning Tatta Bio’s gLM2 genomic foundation models. [Biology, Genomics]
- [tattabio/OG](https://huggingface.co/datasets/tattabio/OG): Open Genomes (OG) — curated genome-sequence corpus from Tatta Bio for genomic ML pretraining and benchmarking. [Biology, Genomics]
- [mist-models/excess-properties](https://huggingface.co/datasets/mist-models/excess-properties): Excess-property dataset for binary/ternary chemical mixtures — used to fine-tune MIST mixtures models on thermodynamic deviations from ideal mixing. [Chemistry, Materials Science]
- [ADSKAILab/ABC-1M](https://huggingface.co/datasets/ADSKAILab/ABC-1M): One million CAD-quality 3D shapes drawn from the ABC dataset — the foundation training corpus for the Make-A-Shape and WaLa generative models. [Engineering, Materials Science]
- [ADSKAILab/Zero-To-CAD-1m](https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-1m): 1M paired image-and-CAD-program examples for training vision-language models that synthesise parametric CAD from images. [Engineering, Scientific Reasoning]
- [ADSKAILab/Zero-To-CAD-100k](https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-100k): Curated 100K-example subset of Zero-To-CAD — useful for benchmarking and lightweight fine-tuning of CAD-from-image models. [Engineering, Scientific Reasoning]
- [ADSKAILab/LLM-narrative-planning-taskset](https://huggingface.co/datasets/ADSKAILab/LLM-narrative-planning-taskset): Narrative planning task set for evaluating LLM planning and reasoning over multi-step design and engineering scenarios. [Engineering, Scientific Reasoning]
- [ADSKAILab/codeparrot_megatron](https://huggingface.co/datasets/ADSKAILab/codeparrot_megatron): Megatron-formatted CodeParrot release used for large-scale code language-model pretraining experiments at Autodesk AI Lab. [Engineering, Scientific Reasoning]
- [recursionpharma/rxrx3](https://huggingface.co/datasets/recursionpharma/rxrx3): Full RxRx3 release — multi-million image high-content microscopy dataset spanning genetic and chemical perturbations across human cell lines, paired with rich text annotations for image-based drug discovery. [Biology, Medicine, Chemistry]
- [recursionpharma/rxrx3-core](https://huggingface.co/datasets/recursionpharma/rxrx3-core): Curated core subset of RxRx3 — high-quality phenomics images for benchmarking and lower-cost training of phenomic foundation models like OpenPhenom. [Biology, Medicine, Chemistry]
- [arcinstitute/Perturb-Sapiens](https://huggingface.co/datasets/arcinstitute/Perturb-Sapiens): Large-scale human single-cell perturbation dataset used in the STACK foundation-model lineage — paired baseline and perturbed expression profiles for genetic perturbation screens. [Biology, Genomics, Medicine]
- [arcinstitute/Replogle-Nadig-Preprint](https://huggingface.co/datasets/arcinstitute/Replogle-Nadig-Preprint): Replogle-Nadig single-cell perturbation dataset (preprint release) — Perturb-seq screens used in the STATE single-cell embedding work for perturbation-response modelling. [Biology, Genomics, Medicine]
- [arcinstitute/State-Tahoe-Filtered](https://huggingface.co/datasets/arcinstitute/State-Tahoe-Filtered): Filtered Tahoe-100M slice used in the STATE workflow — high-quality single-cell perturbation profiles for training and benchmarking cross-study cell-state models. [Biology, Genomics, Medicine]
- [Ahmad0067/MedSynth](https://huggingface.co/datasets/Ahmad0067/MedSynth): Realistic synthetic medical dialogue–SOAP note pairs generated to support training and evaluation of clinical documentation models without exposing real patient data. [Medicine]

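Every entry above follows the same `- [repo_id](url): description [Tag1, Tag2]` line format, so the index can be filtered programmatically without any special tooling. A minimal sketch in Python — the parsing regex, the `entries_by_tag` helper, and the sample line are illustrative only, not part of any published API:

```python
import re

# Each catalog line has the shape:
#   - [repo_id](url): description [Tag1, Tag2]
ENTRY = re.compile(
    r"^- \[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\): (?P<desc>.*) \[(?P<tags>[^\]]+)\]$"
)

def entries_by_tag(lines, tag):
    """Return (name, url) pairs whose trailing bracketed tag list contains `tag`."""
    out = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m and tag in [t.strip() for t in m.group("tags").split(",")]:
            out.append((m.group("name"), m.group("url")))
    return out

sample = [
    "- [tattabio/OG](https://huggingface.co/datasets/tattabio/OG): "
    "Open Genomes (OG) - curated corpus. [Biology, Genomics]",
]
print(entries_by_tag(sample, "Genomics"))
```

The greedy `desc` group deliberately backtracks to the last ` [` on the line, so parentheses or hyphens inside descriptions do not break the match.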
## Models

- [Evo-2 40B](https://huggingface.co/arcinstitute/evo2_40b): 40B-parameter DNA language model trained on 9.3 trillion nucleotides across all domains of life — zero-shot function prediction, variant effect scoring, and sequence generation. [Biology, Genomics, Medicine]
- [Evo-2 7B](https://huggingface.co/arcinstitute/evo2_7b): 7B-parameter DNA language model — a lighter-weight Evo 2 variant for gene function prediction, CRISPR guide design, and cross-species sequence analysis. [Biology, Genomics, Medicine]
- [STACK Large](https://huggingface.co/arcinstitute/Stack-Large): Large-scale single-cell transcriptomics foundation model supporting in-context learning across cell types and perturbation states. [Biology, Genomics, Medicine]
- [FNO Active Matter](https://huggingface.co/polymathic-ai/FNO-active_matter): Fourier Neural Operator baseline for forecasting active-matter continuum dynamics from simulation data. [Physics, Engineering]
- [Aion Base](https://huggingface.co/polymathic-ai/aion-base): Multimodal scientific foundation model integrating heterogeneous astronomical observations into a shared representation for downstream prediction tasks. [Physics, Astronomy, Engineering]
- [WALRUS](https://huggingface.co/polymathic-ai/walrus): Foundation model for continuum dynamics pre-trained across 15 physics simulation datasets, enabling zero-shot and few-shot PDE generalisation. [Physics, Engineering]
- [AstroCLIP](https://huggingface.co/polymathic-ai/astroclip): Multimodal astronomy model aligning galaxy spectra and images into a shared embedding space for downstream astrophysical property prediction. [Astronomy, Physics]
- [TEDDY](https://huggingface.co/Merck/TEDDY): Transformer for Enabling Drug Discovery - foundation models trained on 116M single cells for genomics and drug discovery. [Biology, Genomics, Medicine]
- [NatureLM-audio](https://huggingface.co/EarthSpeciesProject/NatureLM-audio): First audio-language foundation model for bioacoustics - species classification, detection, and captioning of animal vocalizations. [Biology, Ecology, Conservation, Earth Science]
- [AVES2-BEATs](https://huggingface.co/EarthSpeciesProject/esp-aves2-sl-beats-all): Self-supervised BEATs-based audio encoder trained on broad bioacoustic data for species detection, classification, and embedding across animal taxa. [Biology, Ecology, Conservation, Earth Science]
- [AQAffinity](https://huggingface.co/SandboxAQ/AQAffinity): Open-source protein-ligand binding affinity prediction model for drug discovery. [Chemistry, Medicine, Biology]
- [HiRO-ACE](https://huggingface.co/allenai/HiRO-ACE): AI framework for efficient climate and weather simulation with kilometer-scale precipitation downscaling. [Earth Science, Climate]
- [ACE2-ERA5](https://huggingface.co/allenai/ACE2-ERA5): Ai2 Climate Emulator v2 trained on ERA5 reanalysis — fast, stable atmospheric simulation at global scale for multi-year climate projections. [Earth Science, Climate]
- [FourCastNet 3](https://huggingface.co/nvidia/fourcastnet3): Advanced ML model for global weather forecasting - produces 60-day forecasts in under 4 minutes on a single GPU. [Earth Science, Climate, Physics]
- [cBottle](https://huggingface.co/nvidia/cbottle): Diffusion-based generative model that generates atmospheric states at kilometer resolution. [Earth Science, Climate]
- [StormCast V1](https://huggingface.co/nvidia/stormcast-v1-era5-hrrr): Mesoscale ML model for convection-allowing weather forecasting at kilometer-scale resolution. [Earth Science, Climate, Physics]
- [Surya 1.0](https://huggingface.co/nasa-ibm-ai4science/Surya-1.0): First open-source AI foundation model for heliophysics - solar flare forecasting and space weather prediction. [Astronomy, Physics]
- [Surya Solar Flares](https://huggingface.co/nasa-ibm-ai4science/solar_flares_surya): Surya-1.0 fine-tuned for solar flare prediction from full-disk magnetogram and EUV time series. [Astronomy, Physics]
- [Surya Solar Wind](https://huggingface.co/nasa-ibm-ai4science/solar_wind_surya): Surya-1.0 fine-tuned for solar wind plasma and interplanetary magnetic field forecasting at the L1 Lagrange point. [Astronomy, Physics]
- [MedGemma 1.5 4B](https://huggingface.co/google/medgemma-1.5-4b-it): Multimodal medical AI model for medical imaging and clinical text understanding. [Medicine, Biology]
- [MedGemma 27B](https://huggingface.co/google/medgemma-27b-it): Large-scale instruction-tuned medical AI for radiology report generation, pathology image analysis, dermatology, and clinical question answering. [Medicine, Biology]
- [AlphaGenome](https://huggingface.co/google/alphagenome-all-folds): Google DeepMind model predicting DNA regulatory features — gene expression, chromatin accessibility, and TF binding — at single-nucleotide resolution. [Biology, Genomics]
- [MedASR](https://huggingface.co/google/medasr): Medical automatic speech recognition model for clinical documentation. [Medicine]
- [MedSigLIP](https://huggingface.co/google/medsiglip-448): Medical image-language model for visual understanding in healthcare. [Medicine, Biology]
- [TxGemma 2B](https://huggingface.co/google/txgemma-2b-predict): Lightweight therapeutic prediction model for drug discovery tasks. [Medicine, Chemistry, Biology]
- [TxGemma 9B Predict](https://huggingface.co/google/txgemma-9b-predict): Mid-size therapeutic prediction model for drug property prediction. [Medicine, Chemistry, Biology]
- [TxGemma 9B Chat](https://huggingface.co/google/txgemma-9b-chat): Conversational therapeutic model for drug discovery with reasoning explanations. [Medicine, Chemistry, Biology]
- [TxGemma 27B Predict](https://huggingface.co/google/txgemma-27b-predict): Large therapeutic prediction model achieving best-in-class performance on 66 tasks. [Medicine, Chemistry, Biology]
- [TxGemma 27B Chat](https://huggingface.co/google/txgemma-27b-chat): Large conversational therapeutic model with advanced reasoning capabilities. [Medicine, Chemistry, Biology]
- [Path Foundation](https://huggingface.co/google/path-foundation): Vision transformer for histopathology image embeddings - trained on 60M patches from TCGA. [Medicine, Biology]
- [NTv3 650M](https://huggingface.co/InstaDeepAI/NTv3_650M_post): Multi-species genomics foundation model handling 1Mb context for functional track prediction. [Biology, Genomics]
- [Nucleotide Transformer v2 500M](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species): 500M multi-species DNA language model with improved tokenisation and benchmark performance across 18 genomic prediction tasks. [Biology, Genomics]
- [Nucleotide Transformer 2.5B](https://huggingface.co/InstaDeepAI/nucleotide-transformer-2.5b-multi-species): 2.5B-parameter DNA language model trained on 850 species genomes — state-of-the-art on promoter, enhancer, and splice site prediction tasks. [Biology, Genomics]
- [ChatNT](https://huggingface.co/InstaDeepAI/ChatNT): 8B multimodal conversational model for DNA, RNA, and protein tasks — instruction-following for sequence annotation, classification, and generation. [Biology, Genomics]
- [Isoformer](https://huggingface.co/InstaDeepAI/isoformer): Transformer model integrating DNA sequence, RNA expression, and protein context for isoform-level gene expression prediction. [Biology, Genomics]
- [ether0](https://huggingface.co/futurehouse/ether0): 24B parameter model for molecular reasoning - SMILES generation, property prediction, and retrosynthesis. [Chemistry, Medicine]
- [NASA-SMD-IBM](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1): RoBERTa-based language model pre-trained on NASA Science Mission Directorate literature for earth and space science information extraction. [Earth Science, Physics, Astronomy]
- [Indus SDE v0.2](https://huggingface.co/nasa-impact/indus-sde-v0.2): Science domain extraction model for identifying and classifying scientific concepts, variables, and entities from geoscience and atmospheric science text. [Earth Science, Climate]
- [CYP Inhibition Model](https://huggingface.co/openadmet/cyp1a2-cyp2d6-cyp3a4-cyp3c9-chemeleon-baseline): Multi-task model predicting inhibition of four major cytochrome P450 isoforms (CYP1A2, CYP2D6, CYP3A4, CYP2C9) critical for drug-metabolism assessment. [Medicine, Chemistry]
- [PXR Activation Model](https://huggingface.co/openadmet/pxr-chemeleon-baseline): Pregnane X receptor (PXR) activation predictor for early identification of drug-drug interaction liability via nuclear receptor-mediated CYP induction. [Medicine, Chemistry]
- [ESM2 650M](https://huggingface.co/facebook/esm2_t33_650M_UR50D): 650M-parameter protein language model trained on UniRef50 — state-of-the-art embeddings for structure prediction, function annotation, and mutation effect scoring. [Biology, Chemistry, Medicine]
- [OMat24](https://huggingface.co/facebook/OMAT24): Machine learning models for predicting inorganic material properties using EquiformerV2 and eSEN architectures. [Materials Science, Chemistry, Physics, Engineering]
- [OMol25](https://huggingface.co/facebook/OMol25): Open Molecules 2025 - dataset and models for molecular property prediction including polymer extensions. [Chemistry, Materials Science, Engineering]
- [UMA](https://huggingface.co/facebook/UMA): Universal Models for Atoms - mixture-of-experts graph network trained on billions of atoms across 5 datasets. [Chemistry, Materials Science, Physics, Engineering]
- [Tahoe-x1](https://huggingface.co/tahoebio/Tahoe-x1): Perturbation-trained single-cell foundation models (70M-3B) for cancer research and drug discovery. [Biology, Medicine, Genomics]
- [Tahoe-100M-SCVI](https://huggingface.co/tahoebio/Tahoe-100M-SCVI-v1): scVI-based variational autoencoder trained on the full Tahoe-100M atlas of 100M+ single-cell profiles across 50 cancer lines and 1,100 drug perturbations. [Biology, Medicine, Genomics]
- [PeptiVerse](https://huggingface.co/ChatterjeeLab/PeptiVerse): Foundation model for peptide design and analysis. [Biology, Chemistry, Medicine]
- [CoLiPRI](https://huggingface.co/microsoft/colipri): Contrastive learning model for protein-ligand interaction prediction. [Biology, Chemistry, Medicine]
- [Phikon-v2](https://huggingface.co/owkin/phikon-v2): State-of-the-art histopathology vision foundation model trained with DINOv2 on 460K whole-slide images, achieving top performance on cancer subtyping and survival prediction. [Medicine, Biology]
- [Phikon](https://huggingface.co/owkin/phikon): ViT-based pathology foundation model trained on TCGA and other large histopathology cohorts via self-supervised learning for cancer tissue representation. [Medicine, Biology]
- [X-Cell](https://huggingface.co/Xaira-Therapeutics/X-Cell): Diffusion-based model for predicting transcriptomic responses to CRISPRi perturbations at single-cell resolution, trained on the X-Atlas-Pisces dataset. [Biology, Medicine, Genomics]
- [SuperIX](https://huggingface.co/isp-uv-es/superIX): Explainable AI super-resolution model for Sentinel-2 imagery, enhancing 10m resolution to finer scales with interpretable uncertainty estimates. [Earth Science, Climate]
- [ML4Floods](https://huggingface.co/isp-uv-es/ml4floods): Image segmentation model for near-real-time flood extent mapping from Sentinel-2 and Landsat imagery, supporting disaster response and humanitarian aid. [Earth Science, Climate]
- [StarCOP](https://huggingface.co/isp-uv-es/starcop): Methane plume detection model for EMIT and AVIRIS hyperspectral imagery, enabling automated identification of point-source greenhouse gas emissions from space. [Earth Science, Climate, Engineering]
- [p-IgGen](https://huggingface.co/opig/p-IgGen): GPT-NeoX-based generative language model for antibody sequence design, trained on the Observed Antibody Space to generate diverse immunoglobulin heavy and light chains. [Biology, Medicine, Chemistry]
- [AstroLLaMA](https://huggingface.co/UniverseTBD/astrollama): Llama-2 7B fine-tuned on 300K+ astronomy arXiv abstracts for astrophysics text generation, literature summarization, and hypothesis completion — first open LLM specialized for astronomy. [Astronomy, Physics]
- [OpenFold3](https://huggingface.co/OpenFold/OpenFold3): Open replication of AlphaFold3 — predicts structures of proteins, nucleic acids, ligands, and their complexes for drug discovery and structural biology. [Biology, Medicine, Chemistry]
- [MedSAM](https://huggingface.co/wanglab/medsam-vit-base): SAM ViT-Base finetuned on a large-scale dataset of CT, MRI, X-ray, ultrasound, and histology — a universal promptable foundation model for medical image segmentation. [Medicine, Biology]
- [Clinical Camel 70B](https://huggingface.co/wanglab/ClinicalCamel-70B): Llama-2 70B finetuned with QLoRA on physician-patient dialogues, clinical articles, and MedQA-style reasoning chains for medical conversation and decision support. [Medicine, Scientific Reasoning]
- [GO-GPT](https://huggingface.co/wanglab/gogpt): Generative model that predicts Gene Ontology functional annotations directly from protein sequences — bringing LLM-style decoding to functional protein characterisation. [Biology, Genomics, Medicine]
- [Kimina-Prover Preview Distill 7B](https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-7B): Distilled 7B preview of Kimina-Prover — a reinforcement-learning-trained model that generates Lean 4 proofs for olympiad-level mathematics problems. [Mathematics, Scientific Reasoning]
- [Kimina-Prover Distill 1.7B](https://huggingface.co/AI-MO/Kimina-Prover-Distill-1.7B): Compact 1.7B distilled Kimina-Prover variant for Lean 4 proof generation on olympiad-level theorems — runs on a single consumer GPU. [Mathematics, Scientific Reasoning]
- [Kimina-Prover Distill 8B](https://huggingface.co/AI-MO/Kimina-Prover-Distill-8B): 8B distilled Kimina-Prover variant — Lean 4 theorem-proving model trained on olympiad-level mathematical problems with reinforcement learning over proof traces. [Mathematics, Scientific Reasoning]
- [Equiformer v3](https://huggingface.co/mirror-physics/equiformer_v3): Equivariant graph transformer for molecular and materials modeling — predicts energies, forces, and properties on molecular structures and crystals. [Chemistry, Physics, Materials Science]
- [OpenMed PharmaDetect](https://huggingface.co/OpenMed/OpenMed-NER-PharmaDetect-SuperClinical-434M): Token-classification model for pharmaceutical entity recognition in clinical text — built on the SuperClinical 434M backbone for high-recall drug, dose, and regimen extraction. [Medicine, Biology]
- [OpenMed BloodCancerDetect](https://huggingface.co/OpenMed/OpenMed-NER-BloodCancerDetect-TinyMed-65M): Compact 65M token-classification model that identifies haematologic malignancy mentions (leukaemia, lymphoma, myeloma subtypes) in clinical and biomedical text. [Medicine, Biology]
- [OpenMed ChemicalDetect](https://huggingface.co/OpenMed/OpenMed-NER-ChemicalDetect-ModernMed-149M): Chemical-entity NER over biomedical literature — identifies drug names, compounds, and chemical substances using the ModernMed 149M backbone. [Medicine, Chemistry, Biology]
- [OpenMed SpeciesDetect](https://huggingface.co/OpenMed/OpenMed-NER-SpeciesDetect-ElectraMed-109M): Species-mention NER over biomedical literature — identifies organisms and taxonomic references using the ElectraMed 109M backbone. [Biology, Medicine]
- [OpenMed DNADetect](https://huggingface.co/OpenMed/OpenMed-NER-DNADetect-SuperMedical-125M): DNA-mention NER for biomedical text — extracts gene-level DNA sequence references and locus identifiers using the SuperMedical 125M backbone. [Biology, Genomics, Medicine]
- [OpenMed PathologyDetect](https://huggingface.co/OpenMed/OpenMed-NER-PathologyDetect-TinyMed-135M): Pathology-finding NER over clinical and biomedical text — surfaces histopathological observations, lesion descriptions, and tissue-level abnormalities. [Medicine, Biology]
- [OpenMed AnatomyDetect](https://huggingface.co/OpenMed/OpenMed-NER-AnatomyDetect-ElectraMed-109M): Anatomical-entity NER for biomedical text — labels body parts, organ systems, and tissue references using the ElectraMed 109M backbone. [Medicine, Biology]
- [OpenMed OncologyDetect](https://huggingface.co/OpenMed/OpenMed-NER-OncologyDetect-MultiMed-568M): Oncology-focused NER that identifies cancer-type mentions, tumour grading, and staging language across clinical and biomedical literature. [Medicine, Biology]
- [OpenMed OrganismDetect](https://huggingface.co/OpenMed/OpenMed-NER-OrganismDetect-TinyMed-82M): Organism-mention NER for biomedical text — broader than SpeciesDetect, also picking up genera, strains, and informal organism references. [Biology, Medicine]
- [OpenMed DiseaseDetect](https://huggingface.co/OpenMed/OpenMed-NER-DiseaseDetect-BioMed-335M): Disease-mention NER trained on the BioMed 335M backbone — recognises disease names, syndromes, and condition references in clinical and biomedical literature. [Medicine, Biology]
- [OpenMed GenomicDetect](https://huggingface.co/OpenMed/OpenMed-NER-GenomicDetect-PubMed-335M): Genomic-entity NER over PubMed-style text — labels genes, transcripts, and other genomic references for downstream knowledge extraction. [Biology, Genomics, Medicine]
- [OpenMed ProteinDetect](https://huggingface.co/OpenMed/OpenMed-NER-ProteinDetect-SuperClinical-141M): Protein-mention NER for biomedical and clinical text — extracts protein names, family references, and post-translational modification descriptors. [Biology, Medicine]
- [OpenMed GenomeDetect](https://huggingface.co/OpenMed/OpenMed-NER-GenomeDetect-ModernMed-149M): Genome-mention NER complementary to GenomicDetect — focuses on whole-genome and assembly-level references in biomedical text. [Biology, Genomics, Medicine]
- [BioReason-Pro SFT](https://huggingface.co/wanglab/bioreason-pro-sft): Supervised fine-tuned variant of BioReason-Pro — multimodal biological reasoning over genomic variants and pathway data with chain-of-thought rationales. [Biology, Genomics, Scientific Reasoning]
- [BioReason-Pro RL](https://huggingface.co/wanglab/bioreason-pro-rl): RL-tuned variant of BioReason-Pro — reinforcement-learning fine-tuning over BioReason’s SFT base for sharper biological reasoning across KEGG pathways and variant data. [Biology, Genomics, Scientific Reasoning]
- [NexaMass V3 Struct](https://huggingface.co/AethronPhantom/NexaMass-V3-Struct): Self-supervised representation model for MS/MS spectra in metabolomics — learns molecular fingerprints to support compound identification and structure inference. [Chemistry, Biology, Scientific Reasoning]
- [MMPT-FM](https://huggingface.co/Merck/MMPT-FM): Multi-modal pharma foundation model from Merck — integrates molecular and biological signals for drug discovery and target prediction. [Biology, Medicine, Chemistry]
- [OC25](https://huggingface.co/facebook/OC25): Open Catalyst 2025 — successor to OC22, modelling explicit-solvent and catalyst systems for electrochemistry and energy applications. [Chemistry, Materials Science, Energy]
- [OMC25](https://huggingface.co/facebook/OMC25): Open Molecular Crystals 2025 — Meta FAIR Chemistry release for predicting properties of organic molecular crystals (pharmaceutical polymorphs, energetic materials, OLEDs). [Chemistry, Materials Science]
- [BioCLIP 2](https://huggingface.co/imageomics/bioclip-2): OpenCLIP-based foundation model for organismal biology — zero-shot species classification from photographs across the tree of life, trained on TreeOfLife-200M. [Biology, Ecology, Conservation]
- [Skala 1.1](https://huggingface.co/microsoft/skala-1.1): Deep-learning exchange-correlation functional for density functional theory — covers main-group thermochemistry, reaction kinetics, noncovalent interactions, and molecular geometries. [Chemistry, Physics]
- [Aurora](https://huggingface.co/microsoft/aurora): Foundation model for the Earth system — global weather forecasting, atmospheric chemistry, ocean waves, and tropical-cyclone tracking from a single shared backbone. [Climate, Earth Science, Physics]
- [BioEmu](https://huggingface.co/microsoft/bioemu): Generative model for protein structural ensembles — emulates conformational dynamics for drug discovery and structural biology beyond static AlphaFold-style predictions. [Biology, Medicine, Chemistry]
- [MatterGen](https://huggingface.co/microsoft/mattergen): Generative AI for materials design — proposes novel inorganic crystal structures with specified properties for energy, catalysis, and functional-materials research. [Materials Science, Chemistry, Energy]
- [MatterSim](https://huggingface.co/microsoft/mattersim): Foundation-model atomistic simulator for materials over a wide range of temperatures and pressures — drop-in replacement for ab-initio MD for property prediction. [Materials Science, Chemistry, Physics]
- [OrbMol](https://huggingface.co/Orbital-Materials/OrbMol): Foundation-model potential for molecular systems — energies, forces, and properties for organic and metal-organic chemistry, supporting catalyst and pharma workflows. [Chemistry, Materials Science]
- [OneGenome-Rice](https://huggingface.co/ZhejiangLab/OneGenome-Rice): Mixtral-architecture genomic foundation model specialised for rice (Oryza sativa) — supports variant analysis, expression prediction, and breeding-relevant trait modelling. [Biology, Genomics]
- [Genos 1.2B](https://huggingface.co/ZhejiangLab/Genos-1.2B): General-purpose 1.2B-parameter genomic foundation model spanning multiple organisms — base model for downstream gene-level and sequence-level prediction tasks. [Biology, Genomics]
- [eva-rna](https://huggingface.co/ScientaLab/eva-rna): Transformer foundation model producing sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, pseudobulked single-cell) in human and mouse. [Biology, Genomics, Medicine]
- [Skala 1.0](https://huggingface.co/microsoft/skala-1.0): First release of Skala — deep-learning exchange-correlation functional for density functional theory, predecessor to Skala 1.1. [Chemistry, Physics]
- [AIMNet2-rxn](https://huggingface.co/isayevlab/aimnet2-rxn): AIMNet2 trained on reaction data — neural-network interatomic potential supporting reactive molecular simulations. [Chemistry, Physics]
- [AIMNet2 ωB97M-D3](https://huggingface.co/isayevlab/aimnet2-wb97m-d3): Neural network interatomic potential for fast and accurate molecular simulations, trained at the ωB97M-D3 level of theory. [Chemistry, Physics]
- [AIMNet2 (B97-3c, 2025)](https://huggingface.co/isayevlab/aimnet2-2025): AIMNet2 retrained at the B97-3c level of theory — 2025 release with improved coverage and accuracy. [Chemistry, Physics]
- [AIMNet2-NSE](https://huggingface.co/isayevlab/aimnet2-nse): AIMNet2 specialised for open-shell chemistry (radicals, transition states) — neural network interatomic potential for non-singlet electronic states. [Chemistry, Physics]
- [AIMNet2-Pd](https://huggingface.co/isayevlab/aimnet2-pd): AIMNet2 specialised for palladium-containing organometallic systems — supports homogeneous catalysis simulation at near-DFT accuracy. [Chemistry, Materials Science, Physics]
- [MACE-MP-0](https://huggingface.co/mace-foundations/mace-mp-0): MACE foundation model trained on the Materials Project — equivariant message-passing potential for inorganic crystal simulation across most of the periodic table. [Materials Science, Chemistry, Physics]
- [MACE-MPA-0](https://huggingface.co/mace-foundations/mace-mpa-0): MACE foundation model trained on the Materials Project + Alexandria datasets — broader coverage variant for inorganic-materials simulation. [Materials Science, Chemistry, Physics]
- [MACE-MH-0](https://huggingface.co/mace-foundations/mace-mh-0): MACE foundation model targeting molecular and hybrid systems — equivariant potential trained on a unified molecular/materials dataset. [Materials Science, Chemistry, Physics]
- [MACE-MH-1](https://huggingface.co/mace-foundations/mace-mh-1): Updated MACE-MH foundation potential with refined molecular/materials hybrid training — successor to MACE-MH-0. [Materials Science, Chemistry, Physics]
- [GENA-LM BERT large (T2T)](https://huggingface.co/AIRI-Institute/gena-lm-bert-large-t2t): BERT-large-style genomic foundation model trained on telomere-to-telomere human assemblies — supports variant interpretation, regulatory prediction, and downstream genomic tasks. [Biology, Genomics, Medicine]
- [GENA-LM BERT base (T2T)](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t): BERT-base-style genomic foundation model trained on T2T assemblies — lighter-weight backbone for genomic sequence understanding. [Biology, Genomics, Medicine]
- [ModernGENA large](https://huggingface.co/AIRI-Institute/moderngena-large): GENA-LM rebuilt on the ModernBERT architecture — larger, longer-context, RoPE-equipped genomic foundation model. [Biology, Genomics, Medicine]
- [ModernGENA base](https://huggingface.co/AIRI-Institute/moderngena-base): Compact ModernBERT-based GENA-LM variant — efficient genomic foundation model for downstream variant and expression tasks. [Biology, Genomics, Medicine]
- [HuatuoGPT-Vision 7B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-Vision-7B): Medical multimodal LLM from the HuatuoGPT family — answers clinical questions over medical imagery (radiology, pathology, dermatology) using a 7B vision-language backbone. [Medicine, Biology, Scientific Reasoning]
- [FlashPPI](https://huggingface.co/tattabio/flashppi): Fast protein-protein interaction prediction model — trained for high-throughput screening of interaction networks. [Biology, Medicine]
- [gLM2 650M](https://huggingface.co/tattabio/gLM2_650M): 650M-parameter genomic foundation model from Tatta Bio — trained on the OMG open-mixed-genomes corpus for sequence-level biological reasoning. [Biology, Genomics]
- [MIST 28M base](https://huggingface.co/mist-models/mist-28M-ti624ev1): MIST 28M base — pretrained molecular language model (fill-mask) used as the starting point for downstream property-prediction fine-tunes. [Chemistry]
- [MIST 1.8B base](https://huggingface.co/mist-models/mist-1.8B-dh61satt): MIST 1.8B base — large pretrained molecular language model (fill-mask) for downstream chemistry property prediction at scale. [Chemistry]
- [MIST mixtures](https://huggingface.co/mist-models/mist-mixtures-zffffbex): MIST mixtures variant — pretrained on chemical mixtures rather than individual molecules. [Chemistry]
- [MIST 28M · QM9](https://huggingface.co/mist-models/mist-28M-kkgx0omx-qm9): MIST 28M fine-tuned on QM9 — quantum-mechanical property prediction over small organic molecules. [Chemistry, Physics]
- [MIST 28M · QM8](https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8): MIST 28M fine-tuned on QM8 — electronic-spectra property prediction over small organic molecules. [Chemistry, Physics]
- [MIST 28M · Tox21](https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21): MIST 28M fine-tuned on Tox21 — toxicity classification across 12 nuclear-receptor and stress-response assays. [Chemistry, Medicine]
- [MIST 28M · ClinTox](https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox): MIST 28M fine-tuned on ClinTox — clinical toxicity classification of FDA-approved drugs and failed candidates. [Chemistry, Medicine]
- [MIST 28M · SIDER](https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider): MIST 28M fine-tuned on SIDER — side-effect prediction across 27 system-organ classes for marketed drugs. [Chemistry, Medicine]
- [MIST 28M · BBBP](https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp): MIST 28M fine-tuned on BBBP — blood-brain-barrier permeability classification for CNS drug candidates. [Chemistry, Medicine]
- [MIST 28M · HIV](https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv): MIST 28M fine-tuned on HIV — anti-HIV activity classification from MoleculeNet. [Chemistry, Medicine]
- [MIST 28M · Lipo](https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo): MIST 28M fine-tuned on Lipophilicity — octanol/water distribution coefficient prediction. [Chemistry, Medicine]
- [MIST 28M · ToxCast](https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast): MIST 28M fine-tuned on ToxCast — multi-task toxicity prediction across hundreds of in-vitro assays. [Chemistry, Medicine]
- [MIST 28M · BACE](https://huggingface.co/mist-models/mist-28M-8loj3bab-bace): MIST 28M fine-tuned on BACE — beta-secretase 1 (Alzheimer target) inhibition classification. [Chemistry, Medicine]
- [MIST 28M · MUV](https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv): MIST 28M fine-tuned on MUV — maximum-unbiased-validation virtual-screening benchmark. [Chemistry, Medicine]
- [MIST 28M · ESOL](https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol): MIST 28M fine-tuned on ESOL — aqueous solubility regression (Delaney dataset). [Chemistry]
- [MIST 28M · FreeSolv](https://huggingface.co/mist-models/mist-28M-0uiq7o7m-freesolv): MIST 28M fine-tuned on FreeSolv — hydration free-energy regression for small molecules. [Chemistry]
- [MIST 28M · tmQM](https://huggingface.co/mist-models/mist-28M-ggd8iisr-tmQM): MIST 28M fine-tuned on tmQM — quantum-mechanical property prediction for transition-metal complexes. [Chemistry, Materials Science]
- [MIST 28M · pKa](https://huggingface.co/mist-models/mist-28M-6zlgl2qn-pKa): MIST 28M fine-tuned for pKa — acid-dissociation-constant prediction. [Chemistry]
- [MIST 28M · solvent properties](https://huggingface.co/mist-models/mist-28M-solvent-properties): MIST 28M fine-tuned for solvent-property prediction — bulk physical descriptors of organic solvents. [Chemistry]
- [MIST 26.9M · melting point](https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp): MIST 26.9M fine-tuned for melting-point regression. [Chemistry, Materials Science]
- [MIST 26.9M · boiling point](https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp): MIST 26.9M fine-tuned for boiling-point regression. [Chemistry, Materials Science]
- [MIST 26.9M · flash point](https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp): MIST 26.9M fine-tuned for flash-point regression. [Chemistry, Engineering]
- [MIST 26.9M · odour](https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour): MIST 26.9M fine-tuned for odour-quality prediction. [Chemistry]
- [MIST 26.9M · dn](https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn): MIST 26.9M fine-tuned for dn property regression. [Chemistry]
- [MIST 27.0M · conductivity](https://huggingface.co/mist-models/mist-conductivity-27.0M-2mpg8dcd): MIST 27.0M fine-tuned for ionic-conductivity prediction in chemical mixtures and electrolytes. [Chemistry, Materials Science, Energy]
- [MIST 27.1M · ETN](https://huggingface.co/mist-models/mist-27.1M-1gcxtg8y-ETN): MIST 27.1M fine-tuned on the ETN benchmark — prediction of the normalised empirical solvent polarity parameter E_T(N). [Chemistry, Materials Science]
- [MIST 1.8B · G298](https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298): MIST 1.8B fine-tuned for G298 — Gibbs free energy at 298 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · H298](https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298): MIST 1.8B fine-tuned for H298 — enthalpy at 298 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · U298](https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298): MIST 1.8B fine-tuned for U298 — internal energy at 298 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · U0](https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0): MIST 1.8B fine-tuned for U0 — internal energy at 0 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · μ (dipole)](https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu): MIST 1.8B fine-tuned for dipole moment from QM9. [Chemistry, Physics]
- [MIST 1.8B · α (polarizability)](https://huggingface.co/mist-models/mist-1.8B-rcwary93-alpha): MIST 1.8B fine-tuned for isotropic polarizability from QM9. [Chemistry, Physics]
- [MIST 1.8B · HOMO](https://huggingface.co/mist-models/mist-1.8B-jmjosq12-homo): MIST 1.8B fine-tuned for HOMO energy from QM9. [Chemistry, Physics]
- [MIST 1.8B · LUMO](https://huggingface.co/mist-models/mist-1.8B-n14wshc9-lumo): MIST 1.8B fine-tuned for LUMO energy from QM9. [Chemistry, Physics]
- [MIST 1.8B · HOMO-LUMO gap](https://huggingface.co/mist-models/mist-1.8B-kayun6v3-gap): MIST 1.8B fine-tuned for HOMO-LUMO gap from QM9. [Chemistry, Physics]
- [MIST 1.8B · ZPVE](https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve): MIST 1.8B fine-tuned for zero-point vibrational energy from QM9. [Chemistry, Physics]
- [MIST 1.8B · ⟨R²⟩](https://huggingface.co/mist-models/mist-1.8B-xxe7t35e-r2): MIST 1.8B fine-tuned for electronic spatial extent from QM9. [Chemistry, Physics]
- [MIST 1.8B · Cv](https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv): MIST 1.8B fine-tuned for heat capacity Cv from QM9. [Chemistry, Physics]
- [MIST 1.8B · QM8](https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8): MIST 1.8B fine-tuned on QM8 — electronic-spectra prediction at scale. [Chemistry, Physics]
- [MIST 1.8B · Tox21](https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21): MIST 1.8B fine-tuned on Tox21 — large-scale toxicity classification across nuclear-receptor and stress assays. [Chemistry, Medicine]
- [MIST 1.8B · ClinTox](https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox): MIST 1.8B fine-tuned on ClinTox — clinical toxicity classification. [Chemistry, Medicine]
- [MIST 1.8B · SIDER](https://huggingface.co/mist-models/mist-1.8B-l1wfo7oa-sider): MIST 1.8B fine-tuned on SIDER — side-effect prediction. [Chemistry, Medicine]
- [MIST 1.8B · BBBP](https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp): MIST 1.8B fine-tuned on BBBP — blood-brain-barrier permeability. [Chemistry, Medicine]
- [MIST 1.8B · HIV](https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv): MIST 1.8B fine-tuned on HIV — anti-HIV activity classification. [Chemistry, Medicine]
- [MIST 1.8B · Lipo](https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo): MIST 1.8B fine-tuned on Lipophilicity — large-scale logD prediction. [Chemistry, Medicine]
- [MIST 1.8B · BACE](https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace): MIST 1.8B fine-tuned on BACE — Alzheimer-target inhibition classification. [Chemistry, Medicine]
- [MIST 1.8B · ESOL](https://huggingface.co/mist-models/mist-1.8B-hxiygjsm-esol): MIST 1.8B fine-tuned on ESOL — aqueous solubility regression. [Chemistry]
- [MIST 1.8B · FreeSolv](https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv): MIST 1.8B fine-tuned on FreeSolv — hydration free-energy regression. [Chemistry]
- [Zero-To-CAD Qwen3-VL 2B](https://huggingface.co/ADSKAILab/Zero-To-CAD-Qwen3-VL-2B): Qwen3-VL fine-tuned to generate parametric CAD models directly from images — bridges vision-language reasoning and engineering geometry synthesis. [Engineering, Scientific Reasoning]
- [Make-A-Shape · single-view 20M](https://huggingface.co/ADSKAILab/Make-A-Shape-single-view-20m): Make-A-Shape variant trained to generate 3D geometry from a single 2D image — supports CAD reconstruction and engineering shape synthesis. [Engineering, Materials Science]
- [Make-A-Shape · multi-view 20M](https://huggingface.co/ADSKAILab/Make-A-Shape-multi-view-20m): Make-A-Shape multi-view variant — generates 3D geometry from multiple 2D image perspectives for higher-fidelity CAD reconstruction. [Engineering, Materials Science]
- [Make-A-Shape · point-cloud 20M](https://huggingface.co/ADSKAILab/Make-A-Shape-point-cloud-20m): Make-A-Shape point-cloud variant — completes and refines 3D geometry from sparse point-cloud input. [Engineering, Materials Science]
- [Make-A-Shape · voxel 32³](https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-32res-20m): Make-A-Shape voxel variant at 32³ resolution — generates voxelised 3D geometries for low-resolution shape exploration. [Engineering, Materials Science]
- [Make-A-Shape · voxel 16³](https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-16res-20m): Coarser 16³ voxel variant of Make-A-Shape for fast prototyping of 3D geometries. [Engineering, Materials Science]
- [WaLa SV 1B](https://huggingface.co/ADSKAILab/WaLa-SV-1B): WaLa (Wavelet-Latent) 1B model conditioned on single-view input — large-scale wavelet-domain 3D shape generation. [Engineering, Materials Science]
- [WaLa RGB4 1B](https://huggingface.co/ADSKAILab/WaLa-RGB4-1B): WaLa 1B variant conditioned on four RGB views — multi-view colour-image-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa DM4 1B](https://huggingface.co/ADSKAILab/WaLa-DM4-1B): WaLa 1B variant conditioned on four depth maps — depth-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa DM6 1B](https://huggingface.co/ADSKAILab/WaLa-DM6-1B): WaLa 1B variant conditioned on six depth maps for high-coverage depth-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa PC 1B](https://huggingface.co/ADSKAILab/WaLa-PC-1B): WaLa 1B variant conditioned on point clouds — wavelet-latent shape completion from sparse point input. [Engineering, Materials Science]
- [WaLa VX16 1B](https://huggingface.co/ADSKAILab/WaLa-VX16-1B): WaLa 1B variant conditioned on 16³ voxel grids — coarse-voxel-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa UN 1B](https://huggingface.co/ADSKAILab/WaLa-UN-1B): WaLa 1B unconditional variant — generates 3D shapes from noise alone for design-space exploration. [Engineering, Materials Science]
- [WaLa SK 1B](https://huggingface.co/ADSKAILab/WaLa-SK-1B): WaLa 1B variant conditioned on sketches — supports designer-driven shape generation from line art. [Engineering, Materials Science]
- [WaLa DM1 1B](https://huggingface.co/ADSKAILab/WaLa-DM1-1B): WaLa 1B variant conditioned on a single depth map — minimal-input depth-to-shape generation. [Engineering, Materials Science]
- [WaLa MVDream RGB4](https://huggingface.co/ADSKAILab/WaLa-MVDream-RGB4): WaLa coupled with MVDream for text-conditioned 3D shape generation via four RGB-view diffusion. [Engineering, Materials Science]
- [WaLa MVDream DM6](https://huggingface.co/ADSKAILab/WaLa-MVDream-DM6): WaLa coupled with MVDream and six depth views for text-conditioned 3D geometry generation. [Engineering, Materials Science]
- [OpenPhenom](https://huggingface.co/recursionpharma/OpenPhenom): Masked-autoencoder foundation model for high-content cell imaging — learns phenomic embeddings from millions of microscopy images for downstream drug-discovery and perturbation analysis. [Biology, Medicine, Chemistry]
- [Stack-Large Aligned](https://huggingface.co/arcinstitute/Stack-Large-Aligned): Aligned variant of STACK-Large — single-cell foundation model fine-tuned for cross-batch consistency, supporting multi-study perturbation analysis and downstream alignment tasks. [Biology, Genomics, Medicine]
- [SE-600M](https://huggingface.co/arcinstitute/SE-600M): 600M-parameter Single-cell Embeddings model from the STATE collection — generates embeddings for human single-cell RNA expression profiles to support cell-state and perturbation analysis. [Biology, Genomics, Medicine]

## Blog Posts

- [AI for PDEs](https://huggingface.co/blog/hugging-science/pde) — 2025-01-01 — hugging-science: Exploring AI approaches to solving partial differential equations. [Physics, Mathematics, Engineering]
- [SARLO-80: SAR Optic Language Dataset](https://huggingface.co/blog/hugging-science/sarlo-80-sar-optic-language-dataset) — 2025-01-01 — hugging-science: Introducing a large-scale dataset for SAR and optical remote sensing with language descriptions. [Earth Science, Climate]
- [Eve Bio: Mapping the Pharmone Drug Interaction](https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction) — 2025-01-01 — hugging-science: Understanding drug interactions through AI-powered pharmacogenomics. [Medicine, Biology, Chemistry]
- [The ExpansionRx OpenADMET Blind Challenge](https://huggingface.co/blog/hugging-science/the-expansionrx-openadmet-blind-challenge) — 2025-01-01 — hugging-science: A blind challenge for predicting ADMET properties in drug discovery. [Medicine, Chemistry]
- [PromoterGPT](https://huggingface.co/blog/hugging-science/promoter-gpt) — 2025-01-01 — hugging-science: AI-powered promoter sequence design and analysis. [Biology, Genomics]
- [AI for Food Allergies](https://huggingface.co/blog/hugging-science/ai-for-food-allergies) — 2025-01-01 — hugging-science: Applying AI to understand and predict food allergies. [Medicine, Biology]
- [GDP: Generative Design for Proteins](https://huggingface.co/blog/cgeorgiaw/gdp) — 2025-01-01 — cgeorgiaw: Generative models for protein design and engineering. [Biology, Chemistry]
- [ConStellaration Fusion Challenge](https://huggingface.co/blog/cgeorgiaw/constellaration-fusion-challenge) — 2025-01-01 — cgeorgiaw: A community challenge built on the ConStellaration dataset of stellarator plasma-boundary shapes, advancing fusion energy through AI. [Physics, Energy, Engineering]
- [Making Antibody Embeddings and Predictions](https://huggingface.co/blog/ginkgo-datapoints/making-antibody-embeddings-and-predictions) — 2025-01-01 — ginkgo-datapoints: How to create and use antibody embeddings for therapeutic applications. [Biology, Medicine, Biotechnology]
- [LeMaterial: An Open-Source Initiative to Accelerate Materials Discovery](https://huggingface.co/blog/lematerial) — 2024-12-10 — lvwerra: Introducing LeMaterial, a community effort to build the largest open database of materials and accelerate AI-driven discovery of new compounds and structures. [Materials Science, Chemistry, Engineering]
- [SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence](https://huggingface.co/blog/SandboxAQ/sair-data-accelerating-drug-discovery-with-ai) — 2025-09-06 — SandboxAQ: How SandboxAQ's SAIR dataset of 1M+ protein–ligand structures is enabling AI-powered drug discovery with unprecedented structural coverage. [Chemistry, Medicine, Biology]
- [How to Build a Benchmark with a Private Test Set on Hugging Face](https://huggingface.co/blog/hugging-science/building-a-benchmark-or-challenge) — 2026-02-16 — hugging-science: A step-by-step guide to creating, hosting, and managing a benchmark challenge with a hidden test set on Hugging Face. [Benchmark]
- [Open-R1: A Fully Open Reproduction of DeepSeek-R1](https://huggingface.co/blog/open-r1) — 2025-01-28 — lvwerra: A fully open reproduction of DeepSeek-R1's math reasoning training pipeline — data, code, and models — bringing transparent reasoning model training to the community. [Mathematics]
- [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf) — 2022-12-09 — natolambert: A clear, illustrated walkthrough of how RLHF works — the technique behind ChatGPT and modern instruction-following models. One of HF's most-read posts. []
- [Tropical Quivers for Modern AI: A Guided Tour of a Research Program](https://huggingface.co/blog/AmelieSchreiber/tropical-quivers-of-archs) — 2026-03-22 — AmelieSchreiber: A tour of tropical quiver representations and how their combinatorial structure connects to modern AI architectures. [Mathematics]
- [Surface Orders, Cyclic Time, and a Concrete Hilbert–Pólya Framework](https://huggingface.co/blog/AmelieSchreiber/hilbert-polya-for-grh) — 2026-03-17 — AmelieSchreiber: A concrete construction toward the Hilbert–Pólya conjecture using surface orders and cyclic-time symmetry as a route to the Riemann Hypothesis. [Mathematics]
- [ThermoGFN-IF for Catalysis](https://huggingface.co/blog/AmelieSchreiber/thermogfn-if) — 2026-03-10 — AmelieSchreiber: A protein sequence design model fine-tuned with GFlowNets for thermostable and kinetically-aware enzyme engineering. [Biology, Chemistry, Medicine]
- [A New Era in Multistep Enzyme Design](https://huggingface.co/blog/AmelieSchreiber/a-new-era-of-enzyme-engineering) — 2024-10-16 — AmelieSchreiber: Exploring generative AI approaches for designing multistep enzymatic pathways for biosynthesis and biocatalysis. [Biology, Chemistry]
- [A Guide to Designing New Functional Proteins](https://huggingface.co/blog/AmelieSchreiber/protein-optimization-and-design) — 2024-07-02 — AmelieSchreiber: A comprehensive guide to improving protein function, stability, and diversity using generative AI and ESM-2. [Biology, Chemistry]
- [RFDiffusion Potentials](https://huggingface.co/blog/AmelieSchreiber/rfdiffusion-potentials) — 2024-05-14 — AmelieSchreiber: Using RFDiffusion with custom guiding potentials to steer protein structure generation toward desired functional properties. [Biology, Chemistry]
- [Predicting the Effects of Mutations on Protein Function with ESM-2](https://huggingface.co/blog/AmelieSchreiber/mutation-scoring) — 2023-12-13 — AmelieSchreiber: Using ESM-2 protein language model embeddings to score and predict the functional impact of point mutations. [Biology, Genomics]
- [Faster Persistent Homology Alignment and Protein Complex Clustering](https://huggingface.co/blog/AmelieSchreiber/faster-pha) — 2023-11-30 — AmelieSchreiber: Accelerating persistent homology alignment with ESM-2 embeddings and persistence landscapes for protein complex clustering. [Biology, Mathematics]
- [Clustering Protein Complexes using Persistent Homology](https://huggingface.co/blog/AmelieSchreiber/esm-ppi) — 2023-11-29 — AmelieSchreiber: Combining persistent homology with ESM-2 fine-tuning for protein–protein interaction network prediction and complex clustering. [Biology, Chemistry]
- [ESM-2 for Generating and Optimizing Peptide Binders](https://huggingface.co/blog/AmelieSchreiber/esm-interact) — 2023-11-23 — AmelieSchreiber: Generating and optimizing peptide binders for target proteins using ESM-2 embeddings and directed evolution. [Biology, Medicine]
- [Persistent Homology Alignment: Replacing Multiple Sequence Alignments](https://huggingface.co/blog/AmelieSchreiber/plm-persistent-homology-msa-replacement) — 2023-11-15 — AmelieSchreiber: Replacing traditional multiple sequence alignments with ESM-2 embeddings and persistent homology for structure-aware protein comparison. [Biology, Mathematics]
- [In Silico Directed Evolution of Protein Sequences with ESM-2](https://huggingface.co/blog/AmelieSchreiber/directed-evolution-with-esm2) — 2023-11-13 — AmelieSchreiber: Using ESM-2 and EvoProtGrad to simulate directed evolution in silico, optimizing protein sequences for target properties. [Biology, Chemistry]
- [QLoRA for ESM-2 and Post Translational Modification Site Prediction](https://huggingface.co/blog/AmelieSchreiber/esm2-ptm) — 2023-11-11 — AmelieSchreiber: Applying QLoRA fine-tuning to ESM-2 for accurate prediction of post-translational modification sites across protein sequences. [Biology, Genomics]
- [Estimating the Intrinsic Dimension of Protein Sequence Embeddings](https://huggingface.co/blog/AmelieSchreiber/intrinsic-dimension-of-proteins) — 2023-10-18 — AmelieSchreiber: Measuring the intrinsic dimensionality of ESM-2 protein embeddings to understand the geometric structure of protein sequence space. [Biology, Mathematics]
- [Predicting Protein–Protein Interactions Using a Protein Language Model](https://huggingface.co/blog/AmelieSchreiber/protein-binding-partners-with-esm2) — 2023-10-15 — AmelieSchreiber: Using ESM-2 embeddings and linear sum assignment to predict protein–protein binding partners at scale. [Biology, Chemistry]
- [ESMBind Ensemble Models](https://huggingface.co/blog/AmelieSchreiber/esmbind-ensemble) — 2023-09-22 — AmelieSchreiber: Ensemble methods for ESMBind models to improve binding site prediction accuracy and robustness across protein families. [Biology, Genomics]
- [ESMBind: Low Rank Adaptation of ESM-2 for Protein Binding Site Prediction](https://huggingface.co/blog/AmelieSchreiber/esmbind) — 2023-09-15 — AmelieSchreiber: Fine-tuning ESM-2 with LoRA adapters to predict protein binding sites with high accuracy and parameter efficiency. [Biology, Genomics]
- [Physics Informed Neural Networks (PINNs): An Intuitive Guide](https://towardsdatascience.com/physics-informed-neural-networks-pinns-an-intuitive-guide-fff138069563/) — 2025-01-28 — towardsdatascience.com: A clear, intuitive walkthrough of how PINNs embed physical laws directly into neural network training — bridging traditional PDE-based modeling with data-driven deep learning. [Physics, Mathematics, Engineering]
- [A Living Review of Machine Learning for Particle Physics](https://iml-wg.github.io/HEPML-LivingReview/) — 2020-06-01 — iml-wg.github.io: A continuously updated, near-comprehensive survey of ML techniques applied to experimental, phenomenological, and theoretical high-energy physics — maintained by the Inter-Experimental LHC ML Working Group. [Physics]
- [Did GPT-5.2 Make a Breakthrough Discovery in Theoretical Physics?](https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons) — 2026-02-01 — dlouapre: GPT-5.2 conjectured a compact formula for single-minus gluon tree amplitudes that had been assumed to vanish for 40 years — a striking example of AI contributing to original theoretical physics. [Physics, Mathematics]
- [A Comprehensive Introduction to AI for Proteins (2026)](https://www.tamarind.bio/blog/a-comprehensive-introduction-to-ai-for-proteins) — 2026-01-01 — tamarind.bio: A thorough primer on the state of AI for protein science — covering structure prediction, protein language models, generative design, and the full open-source model landscape. [Biology, Chemistry, Medicine]
- [Boltz-2: State of the Art Structure and Binding Affinity Prediction](https://www.tamarind.bio/blog/boltz2-state-of-the-art-structure-and-binding-affinity-prediction) — 2025-06-18 — tamarind.bio: Boltz-2 outperforms AlphaFold3 on antibody–antigen interfaces and sets a new state of the art for protein–ligand binding affinity prediction. [Biology, Chemistry, Medicine]
- [Boltzdesign1: Designing De Novo Binders to More Than Just Proteins](https://www.tamarind.bio/blog/boltzdesign1-small-molecule-rna-dna-protein-metal-binder-design) — 2025-06-01 — tamarind.bio: BoltzDesign1 extends de novo binder design beyond protein targets to small molecules, RNA, DNA, and metal ions. [Biology, Chemistry, Medicine]
- [OpenFold3 and The Future of Protein Folding](https://www.tamarind.bio/blog/openfold3-fully-open-alphafold3-alternative) — 2025-04-01 — tamarind.bio: OpenFold3 is a fully open-source, commercially usable AlphaFold3 alternative backed by the OpenFold Consortium — enabling unrestricted biomolecular structure prediction. [Biology, Chemistry]
- [IntFold: A New Best Structure Prediction Protocol](https://www.tamarind.bio/blog/intfold-a-new-state-of-the-art) — 2025-03-01 — tamarind.bio: IntFold sets a new state of the art for biomolecular complex structure prediction across standard benchmarks. [Biology, Chemistry]
- [Chai-1r: AlphaFold3 Level Performance, Now Completely Open Source](https://www.tamarind.bio/blog/chai-1-alphafold3-level-performance-now-completely-open-source) — 2025-02-01 — tamarind.bio: Chai-1r achieves AlphaFold3-level accuracy on protein–protein and antibody–antigen complexes with fully open weights and no usage restrictions. [Biology, Chemistry, Medicine]
- [Computational De Novo Design of Antibodies and Nanobodies](https://www.tamarind.bio/blog/de-novo-antibody-nanobody-vhh-scfv-rfdiffusion) — 2025-01-01 — tamarind.bio: A practical guide to designing antibody VHHs and scFvs de novo using RFdiffusion and ProteinMPNN, from target epitope to validated sequence. [Biology, Medicine, Chemistry, Biotechnology]
- [Predicting Antibody Properties & Developability](https://www.tamarind.bio/blog/predicting-antibody-properties-developability) — 2025-01-01 — tamarind.bio: ML approaches for predicting key biophysical properties of therapeutic antibody candidates — stability, solubility, and immunogenicity — before wet-lab validation. [Biology, Medicine, Chemistry, Biotechnology]
- [Are Mini Proteins the Next Antibodies?](https://www.tamarind.bio/blog/mini-protein-antibodies) — 2025-01-01 — tamarind.bio: Examining the therapeutic potential of computationally designed miniproteins as a next-generation alternative to traditional antibody drugs. [Biology, Medicine, Chemistry]
- [Computational De Novo Miniproteins As Therapeutics](https://www.tamarind.bio/blog/computationaly-de-novo-minibinders-therapeutic-applications) — 2024-12-01 — tamarind.bio: How computationally designed de novo miniproteins and minibinders are being developed as a new class of targeted therapeutics. [Biology, Medicine, Chemistry]
- [Computational Protein–Protein Interaction Screening](https://www.tamarind.bio/blog/ppi-screen) — 2024-12-01 — tamarind.bio: A practical guide to screening for protein–protein interactions (PPIs) as drug discovery targets using structure prediction and ML scoring. [Biology, Medicine, Chemistry]
- [Boltz-1: AlphaFold3 Level Performance, Truly Open Source](https://www.tamarind.bio/blog/boltz-1-alphafold3-level-performance-truly-open-source-and-commercially-available) — 2024-11-01 — tamarind.bio: Boltz-1 from MIT achieves AlphaFold3-level accuracy on protein and protein–ligand structure prediction with no restrictions on commercial use or input types. [Biology, Chemistry]