# Hugging Science — AI for Science Resource Index

> A curated catalog of scientific datasets, models, and blog posts for ML researchers.
> Browse by topic using the /topics/{tag}.md files listed below, or fetch /llms-full.txt for everything at once.

## Topic Files

- [Astronomy](https://huggingscience.co/topics/astronomy.md): Space science and astrophysics
- [Benchmark](https://huggingscience.co/topics/benchmark.md): Evaluation and benchmarking datasets
- [Biology](https://huggingscience.co/topics/biology.md): Life sciences, genomics, and biological systems
- [Biotechnology](https://huggingscience.co/topics/biotechnology.md): Biological engineering and synthetic biology
- [Chemistry](https://huggingscience.co/topics/chemistry.md): Molecular science, reactions, and materials
- [Climate](https://huggingscience.co/topics/climate.md): Climate science and environmental modeling
- [Conservation](https://huggingscience.co/topics/conservation.md): Wildlife and habitat preservation
- [Earth Science](https://huggingscience.co/topics/earth-science.md): Geology, oceanography, and planetary science
- [Ecology](https://huggingscience.co/topics/ecology.md): Ecosystems and environmental biology
- [Energy](https://huggingscience.co/topics/energy.md): Energy systems and sustainability
- [Engineering](https://huggingscience.co/topics/engineering.md): Applied science and technical systems
- [Genomics](https://huggingscience.co/topics/genomics.md): DNA, RNA, and genetic analysis
- [Materials Science](https://huggingscience.co/topics/materials-science.md): Material properties and discovery
- [Mathematics](https://huggingscience.co/topics/mathematics.md): Mathematical modeling and computational methods
- [Medicine](https://huggingscience.co/topics/medicine.md): Healthcare, drug discovery, and clinical research
- [Physics](https://huggingscience.co/topics/physics.md): Fundamental forces, particles, and physical systems
- [Scientific Reasoning](https://huggingscience.co/topics/scientific-reasoning.md): Scientific QA, theorem proving, and multi-step problem-solving datasets
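
The `/topics/{tag}.md` pattern can be sketched in Python as below. The slug rule (lowercase, spaces replaced by hyphens) is an assumption inferred from the links in this list, not a documented contract, and `topic_url` is a hypothetical helper name.

```python
import re

# Host taken from the topic links listed above.
BASE_URL = "https://huggingscience.co"


def topic_url(topic_name: str) -> str:
    """Build the /topics/{tag}.md URL for a topic display name.

    Assumed slug rule (inferred from the listed links): lowercase the name
    and collapse every run of non-alphanumeric characters into one hyphen.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", topic_name.strip().lower()).strip("-")
    return f"{BASE_URL}/topics/{slug}.md"


if __name__ == "__main__":
    for name in ("Astronomy", "Earth Science", "Scientific Reasoning"):
        print(topic_url(name))
```

If a topic's slug ever departs from this rule, the link given in the list is the one to trust.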

## Datasets

- [arcinstitute/opengenome2](https://huggingface.co/datasets/arcinstitute/opengenome2): Curated collection of prokaryotic and eukaryotic genomic sequences for training and benchmarking large-scale biological foundation models. [Biology, Genomics, Medicine]
- [arcinstitute/SE-167M-Human](https://huggingface.co/datasets/arcinstitute/SE-167M-Human): 167M human single-cell RNA expression profiles across diverse tissues and cell types, used for training STACK and SE single-cell foundation models. [Biology, Genomics, Medicine]
- [arcinstitute/Stack-CellxGene45M](https://huggingface.co/datasets/arcinstitute/Stack-CellxGene45M): 45M curated single-cell profiles drawn from the CellxGene corpus, standardised for in-context learning and cross-study perturbation analysis. [Biology, Genomics, Medicine]
- [polymathic-ai/active_matter](https://huggingface.co/datasets/polymathic-ai/active_matter): High-fidelity simulations of self-propelled particle systems for benchmarking learned PDE solvers and emergent collective behaviour models. [Physics, Engineering, Benchmark]
- [polymathic-ai/MHD_64](https://huggingface.co/datasets/polymathic-ai/MHD_64): 3D magnetohydrodynamics turbulence simulations at 64³ resolution for training and benchmarking physics-informed neural operators. [Physics, Engineering, Benchmark]
- [polymathic-ai/planetswe](https://huggingface.co/datasets/polymathic-ai/planetswe): Spherical shallow-water equation simulations modelling large-scale planetary atmospheric dynamics for weather and climate surrogate models. [Physics, Earth Science]
- [polymathic-ai/rayleigh_benard](https://huggingface.co/datasets/polymathic-ai/rayleigh_benard): Rayleigh–Bénard thermal convection simulations at varying Rayleigh and Prandtl numbers for benchmarking turbulence and heat transfer models. [Physics, Engineering, Benchmark]
- [polymathic-ai/supernova_explosion_64](https://huggingface.co/datasets/polymathic-ai/supernova_explosion_64): Hydrodynamic simulations of core-collapse supernova explosions at 64³ resolution, spanning diverse progenitor masses and explosion energies. [Physics, Astronomy]
- [ginkgo-datapoints/GDPa1](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1): Antibody developability dataset with biophysical assay data for 242 antibodies across 9 assays. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx1](https://huggingface.co/datasets/ginkgo-datapoints/GDPx1): DRUG-seq functional genomics dataset with chemical perturbation experiments in A549 cells. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx2](https://huggingface.co/datasets/ginkgo-datapoints/GDPx2): DRUG-seq transcriptomic profiling across 4 primary human cell types with 85 compounds. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx3](https://huggingface.co/datasets/ginkgo-datapoints/GDPx3): High-content Cell Painting imaging dataset for AI/ML model training in drug discovery. [Biology, Biotechnology]
- [ginkgo-datapoints/GDPx4](https://huggingface.co/datasets/ginkgo-datapoints/GDPx4): DRUG-seq transcriptomic profiling in engineered HEK293 cells with inducible gene overexpression, enabling systematic study of gene-drug interactions. [Biology, Biotechnology]
- [eve-bio/drug-target-activity](https://huggingface.co/datasets/eve-bio/drug-target-activity): Drug-target interaction measurements for 1,397 FDA-approved small molecule drugs. [Biology, Medicine, Chemistry]
- [nasa-impact/WxC-Bench](https://huggingface.co/datasets/nasa-impact/WxC-Bench): Standardised benchmark for evaluating AI models across six atmospheric and earth science tasks including gravity wave parameterisation, turbulence prediction, and hurricane track forecasting. [Earth Science, Climate, Physics, Benchmark]
- [nasa-impact/EO-via-NLP](https://huggingface.co/datasets/nasa-impact/EO-via-NLP): Paired earth observation imagery and natural-language descriptions for training and evaluating multimodal models on remote sensing understanding tasks. [Earth Science, Climate]
- [proxima-fusion/constellaration](https://huggingface.co/datasets/proxima-fusion/constellaration): Large-scale dataset of quasi-isodynamic stellarator designs with MHD equilibria for fusion energy research. [Physics, Energy, Engineering]
- [EarthSpeciesProject/BEANS-Zero](https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero): Zero-shot bioacoustics benchmark evaluating audio-language models on species detection, classification, and captioning across diverse animal taxa. [Biology, Ecology, Conservation, Benchmark, Earth Science]
- [SandboxAQ/SAIR](https://huggingface.co/datasets/SandboxAQ/SAIR): Largest public dataset of protein-ligand 3D structures with binding affinity measurements (1M+ pairs). [Chemistry, Medicine, Biology]
- [SandboxAQ/aqcat25-dataset](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset): 13.5M DFT calculation trajectories for heterogeneous catalysis and ML potential training. [Chemistry, Materials Science, Engineering]
- [jablonkagroup/chempile-mlift](https://huggingface.co/datasets/jablonkagroup/chempile-mlift): Curated lift-off subset of the ChemPile corpus for instruction-tuning and benchmarking chemistry language models across synthesis, property prediction, and reaction tasks. [Chemistry]
- [jablonkagroup/ChemBench](https://huggingface.co/datasets/jablonkagroup/ChemBench): Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs. [Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics]
- [jablonkagroup/chempile-paper](https://huggingface.co/datasets/jablonkagroup/chempile-paper): Large corpus of peer-reviewed chemistry papers and preprints for pre-training and fine-tuning chemistry language models. [Chemistry]
- [AI-MO/aops_raw](https://huggingface.co/datasets/AI-MO/aops_raw): Raw problem posts and discussion threads from the Art of Problem Solving forums, spanning AMC, AIME, and international olympiad competitions. [Mathematics]
- [AI-MO/olympiads-ref-base](https://huggingface.co/datasets/AI-MO/olympiads-ref-base): Canonical reference set of international and national mathematical olympiad problems, used as the base for downstream NuminaMath training splits. [Mathematics]
- [AI-MO/olympiads-ref](https://huggingface.co/datasets/AI-MO/olympiads-ref): Extended reference set of olympiad problems with verified step-by-step solutions, used for Chain-of-Thought and formal reasoning training. [Mathematics, Scientific Reasoning]
- [AI-MO/Kimina-Prover-Promptset](https://huggingface.co/datasets/AI-MO/Kimina-Prover-Promptset): Prompt-set for training and evaluating Kimina, a Lean 4 theorem prover that uses reinforcement learning over formal mathematical proofs. [Mathematics, Scientific Reasoning]
- [AI-MO/NuminaMath-LEAN](https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN): Mathematical problems formalized in the Lean proof assistant. [Mathematics]
- [AI-MO/GeometryLeanBench](https://huggingface.co/datasets/AI-MO/GeometryLeanBench): Geometry theorem proving problems formalised in Lean 4, covering Euclidean, affine, and metric geometry for automated reasoning evaluation. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/CombiBench](https://huggingface.co/datasets/AI-MO/CombiBench): Combinatorics problems drawn from AMC, AIME, and olympiad competitions, formalised for benchmarking discrete-mathematics reasoning in language models. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/minif2f_test](https://huggingface.co/datasets/AI-MO/minif2f_test): Test set for the miniF2F formal mathematics benchmark. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/aimo-validation-amc](https://huggingface.co/datasets/AI-MO/aimo-validation-amc): AMC 10/12 competition problems reformatted for AIMO challenge validation, covering algebra, geometry, and number theory at difficulty levels 1–5. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/aimo-validation-aime](https://huggingface.co/datasets/AI-MO/aimo-validation-aime): AIME I/II problems reformatted for AIMO challenge validation — 15-question integer-answer format, covering competition math at difficulty levels 5–9. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/NuminaMath-1.5](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5): 860K+ competition math problems from 17 sources with verified solutions — the training backbone of the gold-medal solution at the 2024 AI Mathematical Olympiad. [Mathematics]
- [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR): NuminaMath with Tool-Integrated Reasoning annotations. [Mathematics, Scientific Reasoning]
- [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT): NuminaMath with Chain-of-Thought reasoning annotations. [Mathematics, Scientific Reasoning]
- [AI-MO/aimo-validation-math-level-4](https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-4): Level-4 MATH benchmark problems (pre-calculus difficulty) used for AIMO challenge validation and fine-grained model evaluation. [Mathematics, Benchmark, Scientific Reasoning]
- [AI-MO/aimo-validation-math-level-5](https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-5): Level-5 MATH benchmark problems (highest difficulty) used for AIMO challenge validation and measuring the ceiling of model mathematical reasoning. [Mathematics, Benchmark, Scientific Reasoning]
- [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA): Mathematical question-answering dataset for training and evaluating math reasoning. [Mathematics]
- [google/spiqa](https://huggingface.co/datasets/google/spiqa): Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains. [Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics]
- [nasa-ibm-ai4science/surya-bench-flare-forecasting](https://huggingface.co/datasets/nasa-ibm-ai4science/surya-bench-flare-forecasting): Full-disk solar flare forecasting dataset from NOAA GOES observations, providing multi-hour-ahead flare probability labels for heliophysics model evaluation. [Astronomy, Physics, Benchmark]
- [nasa-ibm-ai4science/core-sdo](https://huggingface.co/datasets/nasa-ibm-ai4science/core-sdo): Multi-modal Solar Dynamics Observatory dataset combining EUV imagery, magnetograms, and irradiance spectra for solar foundation model pre-training. [Astronomy, Physics]
- [LeMaterial/LeMat-Bulk-MLIP-Hull](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-MLIP-Hull): Convex hull data for bulk materials from MLIP calculations. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Bulk-DFT-Hull-All](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull-All): Complete DFT convex hull dataset for bulk materials discovery. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Bulk-DFT-Hull](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull): DFT convex hull reference data for materials stability analysis. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Bulk](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk): Primary bulk materials database aggregating 1M+ crystal structures with DFT-computed formation energies, band gaps, and elastic properties for materials discovery. [Materials Science, Chemistry, Physics, Engineering]
- [LeMaterial/LeMat-Traj](https://huggingface.co/datasets/LeMaterial/LeMat-Traj): Large-scale molecular dynamics trajectory dataset for training machine learning interatomic potentials across diverse bulk material compositions. [Materials Science, Chemistry, Physics, Engineering]
- [openadmet/openadmet-expansionrx-challenge-train-data](https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-train-data): Training data for the OpenADMET ExpansionRx ADMET prediction challenge. [Medicine, Chemistry]
- [openadmet/openadmet-expansionrx-challenge-data](https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-data): Full ExpansionRx challenge dataset of RNA-targeted small-molecule compounds with measured ADMET properties for open pharmacokinetics benchmarking. [Medicine, Chemistry, Benchmark]
- [openadmet/Octant_CYP_inhibition_reactivity_blog_release](https://huggingface.co/datasets/openadmet/Octant_CYP_inhibition_reactivity_blog_release): Octant CYP inhibition and chemical reactivity dataset measuring cytochrome P450 activity across a diverse compound library for ADMET modelling. [Medicine, Chemistry]
- [InstaDeepAI/NTv3_benchmark_dataset](https://huggingface.co/datasets/InstaDeepAI/NTv3_benchmark_dataset): Benchmark dataset with functional tracks and genome annotations across 7 species. [Biology, Genomics, Benchmark]
- [InstaDeepAI/nucleotide_transformer_downstream_tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks): 18 genomic prediction benchmark tasks covering histone marks, regulatory regions, splice sites, and promoter activity across human and multi-species genomes. [Biology, Genomics, Benchmark]
- [InstaDeepAI/multi_species_genomes](https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes): Whole-genome sequences for 850 species spanning bacteria, fungi, plants, and animals — the pre-training corpus for the Nucleotide Transformer model family. [Biology, Genomics]
- [InstaDeepAI/plant-genomic-benchmark](https://huggingface.co/datasets/InstaDeepAI/plant-genomic-benchmark): Plant genomics benchmark spanning gene expression, chromatin accessibility, and agronomic trait prediction tasks across multiple crop and model plant species. [Biology, Genomics, Benchmark]
- [InstaDeepAI/winnow-ms-datasets](https://huggingface.co/datasets/InstaDeepAI/winnow-ms-datasets): Mass spectrometry datasets for protein analysis and ML model training. [Biology, Chemistry]
- [InstaDeepAI/true-cds-protein-tasks](https://huggingface.co/datasets/InstaDeepAI/true-cds-protein-tasks): Coding sequence and protein function prediction benchmark tasks. [Biology, Genomics, Benchmark]
- [facebook/principia-collection](https://huggingface.co/datasets/facebook/principia-collection): Large-scale STEM reasoning dataset from Meta covering mathematics, physics, chemistry, and biology problems for training and evaluating scientific reasoning in language models. [Mathematics, Physics, Chemistry, Scientific Reasoning]
- [facebook/principia-bench](https://huggingface.co/datasets/facebook/principia-bench): Curated benchmark of challenging STEM problems requiring multi-step reasoning, quantitative analysis, and domain knowledge across natural sciences. [Mathematics, Physics, Chemistry, Benchmark, Scientific Reasoning]
- [futurehouse/BixBench](https://huggingface.co/datasets/futurehouse/BixBench): Benchmark with 205 reproducible research questions paired with data capsules for AI evaluation. [Biology, Chemistry, Benchmark, Scientific Reasoning, Mathematics]
- [futurehouse/lab-bench](https://huggingface.co/datasets/futurehouse/lab-bench): Language Agent Biology Benchmark: 8 categories of scientific research tasks, including cloning, figures, and protocols. [Biology, Benchmark, Scientific Reasoning, Mathematics]
- [futurehouse/ether0-benchmark](https://huggingface.co/datasets/futurehouse/ether0-benchmark): Chemistry reasoning benchmark covering SMILES-based tasks including reaction prediction, retrosynthesis, and molecular property estimation for evaluating chemistry LLMs. [Chemistry, Medicine, Benchmark, Scientific Reasoning, Mathematics]
- [ONERA/SARLO-80](https://huggingface.co/datasets/ONERA/SARLO-80): 119K paired SAR/optical images with text captions at 80 cm resolution for multimodal learning. [Earth Science, Engineering]
- [tahoebio/Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M): Giga-scale perturbation atlas with 100M+ single-cell profiles from 50 cancer cell lines and 1,100 drugs. [Biology, Medicine, Genomics]
- [tahoebio/Tahoe-x1-embeddings](https://huggingface.co/datasets/tahoebio/Tahoe-x1-embeddings): Pre-computed cell and gene embeddings from the Tahoe-x1 foundation model. [Biology, Medicine, Genomics]
- [owkin/plism-dataset-tiles](https://huggingface.co/datasets/owkin/plism-dataset-tiles): Large-scale histopathology tile dataset for benchmarking robustness of pathology foundation models across staining and scanner variability. [Medicine, Biology]
- [owkin/nct-crc-he](https://huggingface.co/datasets/owkin/nct-crc-he): Colorectal cancer tissue classification dataset with H&E-stained patches across 9 tissue classes, widely used for benchmarking pathology models. [Medicine, Biology, Benchmark]
- [owkin/camelyon16-features](https://huggingface.co/datasets/owkin/camelyon16-features): Pre-extracted features from the CAMELYON16 breast cancer lymph node metastasis detection challenge, enabling efficient benchmarking of MIL methods. [Medicine, Biology, Benchmark]
- [owkin/her2-challenge-2026](https://huggingface.co/datasets/owkin/her2-challenge-2026): HER2 scoring challenge dataset with H&E-stained whole-slide images for evaluating AI-based HER2 status prediction in breast cancer. [Medicine, Biology, Benchmark]
- [Xaira-Therapeutics/X-Atlas-Orion](https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Orion): Large-scale single-cell transcriptomics atlas with millions of cell profiles from diverse human tissues, designed for training perturbation-aware foundation models. [Biology, Medicine, Genomics]
- [Xaira-Therapeutics/X-Atlas-Pisces](https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Pisces): CRISPRi perturbation single-cell dataset pairing genetic knockdowns with transcriptomic responses, used for training and evaluating the X-Cell model. [Biology, Medicine, Genomics]
- [SAIRfoundation/equational-theories-selected-problems](https://huggingface.co/datasets/SAIRfoundation/equational-theories-selected-problems): Curated selection of equational theory problems for benchmarking LLM mathematical reasoning and automated theorem proving. [Mathematics, Scientific Reasoning, Benchmark]
- [SAIRfoundation/equational-theories-benchmark](https://huggingface.co/datasets/SAIRfoundation/equational-theories-benchmark): Full benchmark suite of equational theory problems spanning algebraic structures, designed to evaluate formal reasoning capabilities of AI models. [Mathematics, Scientific Reasoning, Benchmark]
- [AllTheBacteria/ATB](https://huggingface.co/datasets/AllTheBacteria/ATB): AllTheBacteria: a comprehensive collection of ~2 million bacterial genome assemblies from public sequence databases, standardized for large-scale genomic analysis. [Biology, Genomics]
- [AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity](https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity): High-diversity corpus of bacterial protein sequences derived from the ATB collection, filtered for maximum sequence diversity to support protein language model pretraining. [Biology, Genomics]
- [AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity](https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity): High-diversity corpus of bacterial intergenic DNA sequences for training DNA language models on non-coding regulatory regions. [Biology, Genomics]
- [AllTheBacteria/SPIRE](https://huggingface.co/datasets/AllTheBacteria/SPIRE): Searchable Planetary-scale mIcrobiome REsource: a large-scale metagenomics resource aggregating environmental microbiome samples from diverse global habitats. [Biology, Genomics, Ecology, Earth Science]
- [isp-uv-es/WorldFloodsv2](https://huggingface.co/datasets/isp-uv-es/WorldFloodsv2): Global flood mapping dataset with Sentinel-1/2 and Landsat imagery paired with flood extent labels across hundreds of flood events worldwide. [Earth Science, Climate]
- [isp-uv-es/CloudSEN12Plus](https://huggingface.co/datasets/isp-uv-es/CloudSEN12Plus): Large-scale cloud detection dataset with 49,000+ Sentinel-2 patches and expert-quality cloud/shadow annotations across global biomes and seasons. [Earth Science, Climate, Benchmark]
- [isp-uv-es/rtm_emulation](https://huggingface.co/datasets/isp-uv-es/rtm_emulation): Atmospheric radiative transfer model emulation dataset for training fast neural surrogates to replace computationally expensive RTM simulations in satellite data processing. [Earth Science, Climate, Physics]
- [isp-uv-es/opensr-test](https://huggingface.co/datasets/isp-uv-es/opensr-test): Benchmark dataset for real-world Sentinel-2 super-resolution, with paired low/high-resolution imagery and perceptual quality metrics. [Earth Science, Benchmark]
- [opig/OAS](https://huggingface.co/datasets/opig/OAS): Observed Antibody Space: a curated database of over one billion antibody sequences from immune repertoire sequencing studies, a widely used resource for antibody ML. [Biology, Medicine, Chemistry]
- [UniverseTBD/arxiv-abstracts-large](https://huggingface.co/datasets/UniverseTBD/arxiv-abstracts-large): 1.7 million scholarly article abstracts spanning physics, computer science, and statistics from arXiv, structured for pretraining and fine-tuning astronomy and scientific language models. [Astronomy, Physics]
- [UniverseTBD/AstroLLaVA_convos](https://huggingface.co/datasets/UniverseTBD/AstroLLaVA_convos): Astronomical images paired with detailed captions and question-answer pairs sourced from APOD, ESO, and ESA Hubble archives, for training multimodal vision-language models on astrophysics. [Astronomy, Physics]
- [openai/healthbench](https://huggingface.co/datasets/openai/healthbench): Realistic multi-turn health conversations graded against physician-written rubrics across multiple axes (accuracy, completeness, communication) — an open evaluation benchmark for AI assistants in medicine. [Medicine, Benchmark, Scientific Reasoning]
- [openai/healthbench-professional](https://huggingface.co/datasets/openai/healthbench-professional): Professional-graded subset of HealthBench: physician evaluators score model responses to clinically realistic conversations, targeting expert-level health assessment. [Medicine, Benchmark, Scientific Reasoning]
- [openai/frontierscience](https://huggingface.co/datasets/openai/frontierscience): Frontier science evaluation benchmark probing model capabilities on expert-level reasoning across natural sciences — designed to surface what AI systems can and cannot do at the research frontier. [Scientific Reasoning, Benchmark]
- [wanglab/CT_DeepLesion-MedSAM2](https://huggingface.co/datasets/wanglab/CT_DeepLesion-MedSAM2): CT volumes from the DeepLesion benchmark with mask annotations restructured for training and evaluating MedSAM2, a universal medical image segmentation foundation model. [Medicine, Biology]
- [wanglab/img_virus_plasmid](https://huggingface.co/datasets/wanglab/img_virus_plasmid): Combined IMG/VR (uncultivated virus genomes) and IMG/PR (plasmids from genomes and metagenomes) catalog with rich functional, taxonomic, and ecological metadata. [Biology, Genomics, Biotechnology]
- [wanglab/kegg](https://huggingface.co/datasets/wanglab/kegg): KEGG pathway entries paired with variant annotations for training and evaluating multimodal biological reasoning models (used by the BioReason work). [Biology, Genomics, Scientific Reasoning]
- [AI-MO/olympiads](https://huggingface.co/datasets/AI-MO/olympiads): Olympiad-level mathematical problems collected from international and national competitions, formatted for training and evaluating mathematical reasoning models. [Mathematics, Scientific Reasoning]
- [OpenMed/MedDialog](https://huggingface.co/datasets/OpenMed/MedDialog): Doctor-patient medical dialogue dataset for training and evaluating clinical conversation models — covers triage, symptom checking, and diagnostic reasoning. [Medicine, Scientific Reasoning]
- [OpenMed/Medical-Reasoning-SFT-Mega](https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Mega): Large supervised fine-tuning corpus for clinical reasoning — multi-step medical question-answer chains with rationales for training instruction-following medical LLMs. [Medicine, Scientific Reasoning]
- [OpenMed/synthvision-annotated-qwen](https://huggingface.co/datasets/OpenMed/synthvision-annotated-qwen): Synthetic medical-imaging dataset annotated by Qwen — used in OpenMed’s SynthVision pipeline for training and validating medical multimodal models. [Medicine, Biology]
- [OpenMed/synthvision-seeds](https://huggingface.co/datasets/OpenMed/synthvision-seeds): Seed prompts and source imagery feeding the SynthVision generation pipeline that produces OpenMed’s annotated medical-imaging training corpora. [Medicine, Biology]
- [OpenMed/synthvision-annotated-kimi](https://huggingface.co/datasets/OpenMed/synthvision-annotated-kimi): Synthetic medical-imaging dataset annotated by Kimi — sister set to the Qwen-annotated split, supporting cross-annotator validation in the SynthVision pipeline. [Medicine, Biology]
- [allenai/peS2o](https://huggingface.co/datasets/allenai/peS2o): Approximately 40M cleaned, filtered, and formatted open-access academic papers derived from S2ORC — a large multi-domain pretraining corpus for science-aware language models, spanning biology, chemistry, engineering, computer science, and physics. [Scientific Reasoning, Biology, Chemistry, Physics, Engineering]
- [Anthropic/BioMysteryBench-preview](https://huggingface.co/datasets/Anthropic/BioMysteryBench-preview): Preview slice of BioMysteryBench — challenging, expert-curated biology problems for evaluating AI scientific reasoning capability. [Biology, Medicine, Scientific Reasoning, Benchmark]
- [Anthropic/BioMysteryBench-full](https://huggingface.co/datasets/Anthropic/BioMysteryBench-full): Full BioMysteryBench evaluation set — challenging biology problems used to probe expert-level scientific reasoning in frontier models. [Biology, Medicine, Scientific Reasoning, Benchmark]
- [neashton/drivaerml](https://huggingface.co/datasets/neashton/drivaerml): High-fidelity CFD simulation dataset of the DrivAer reference automotive geometry — resolved-flow data for training ML models on aerodynamics prediction (drag, downforce, surface pressure). [Engineering, Physics]
- [PLAID-datasets/AirfRANS_original](https://huggingface.co/datasets/PLAID-datasets/AirfRANS_original): Original AirfRANS airfoil RANS simulation dataset — graph-structured CFD over NACA airfoils for benchmarking physics-informed and graph neural networks. [Physics, Engineering, Scientific Reasoning]
- [luminary-shift/SUV](https://huggingface.co/datasets/luminary-shift/SUV): Large-scale CFD dataset of SUV-class vehicles for training ML models on automotive aerodynamics — surface pressures, wake structures, and aerodynamic performance metrics. [Engineering, Physics]
- [luminary-shift/Pump](https://huggingface.co/datasets/luminary-shift/Pump): CFD simulations of centrifugal pumps spanning operating conditions — for training ML surrogates of turbomachinery flow and performance. [Engineering, Physics]
- [luminary-shift/SHIFT-Crash](https://huggingface.co/datasets/luminary-shift/SHIFT-Crash): Vehicle crash-simulation dataset capturing structural deformation under impact — for ML-based safety and structural-mechanics modelling. [Engineering, Physics]
- [luminary-shift/WING](https://huggingface.co/datasets/luminary-shift/WING): Wing-flow CFD dataset for ML-driven aerodynamics — covers a range of geometries and flight conditions for surrogate modelling. [Engineering, Physics]
- [luminary-shift/CCA](https://huggingface.co/datasets/luminary-shift/CCA): Common Compressor Aero (CCA) dataset — compressor and turbomachinery simulations for ML-augmented aerospace design workflows. [Engineering, Physics]
- [luminary-shift/Submarine](https://huggingface.co/datasets/luminary-shift/Submarine): Submarine hydrodynamics CFD dataset — submerged-body flow simulations for ML-based marine engineering and naval design. [Engineering, Physics]
- [jablonkagroup/chempile-instruction](https://huggingface.co/datasets/jablonkagroup/chempile-instruction): Instruction-tuning corpus for chemistry — curated Q&A and dialogue traces drawn from chemical literature and educational sources for training chemistry-specialist LLMs. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-reasoning](https://huggingface.co/datasets/jablonkagroup/chempile-reasoning): Multi-step chemistry reasoning corpus — open-domain QA, NLI, and multiple-choice items with chains of reasoning for training and evaluating chemical reasoning models. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-lift](https://huggingface.co/datasets/jablonkagroup/chempile-lift): ChemPile-LIFT — large-scale language-modelling dataset combining curated chemistry literature and structured chemical knowledge for foundation-model pretraining. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-education](https://huggingface.co/datasets/jablonkagroup/chempile-education): Educational chemistry corpus — multiple-choice and open-ended items spanning introductory through graduate chemistry for assessing model educational capability. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-caption](https://huggingface.co/datasets/jablonkagroup/chempile-caption): Image-to-text dataset of chemistry figures (molecular structures, reaction schemes, plots) with expert captions for training multimodal chemistry models. [Chemistry, Scientific Reasoning]
- [jablonkagroup/chempile-code](https://huggingface.co/datasets/jablonkagroup/chempile-code): Curated chemistry-relevant code (RDKit, ASE, simulation tooling) drawn from The Stack — supports training models that can read and write computational chemistry workflows. [Chemistry, Scientific Reasoning]
- [jablonkagroup/MaCBench](https://huggingface.co/datasets/jablonkagroup/MaCBench): Materials Chemistry Benchmark — multimodal QA, multiple-choice, and visual-question-answering items for evaluating LLMs on materials and inorganic chemistry tasks. [Chemistry, Materials Science, Benchmark]
- [miriad/miriad-5.8M](https://huggingface.co/datasets/miriad/miriad-5.8M): 5.8M-example medical instruction-tuning and reasoning corpus curated from clinical literature for training healthcare LLMs at scale. [Medicine, Scientific Reasoning]
- [miriad/miriad-4.4M](https://huggingface.co/datasets/miriad/miriad-4.4M): 4.4M-example medical reasoning subset of MIRIAD — earlier release used for benchmarking medical instruction-tuning workflows. [Medicine, Scientific Reasoning]
- [maomlab/Molecule3D](https://huggingface.co/datasets/maomlab/Molecule3D): Curated 3D molecular structures with computed properties — supports geometric deep learning for property prediction and conformer-aware modelling. [Chemistry, Biology]
- [maomlab/TDC](https://huggingface.co/datasets/maomlab/TDC): Therapeutics Data Commons subset — drug-discovery tasks (ADMET, drug-target interaction, generation) curated for benchmarking molecular ML. [Medicine, Chemistry, Biology]
- [maomlab/B3DB](https://huggingface.co/datasets/maomlab/B3DB): Blood-Brain Barrier Database (B3DB) — curated permeability measurements for compounds, supporting CNS drug-discovery ML benchmarks. [Medicine, Chemistry]
- [maomlab/ChAFF](https://huggingface.co/datasets/maomlab/ChAFF): ChAFF — chemistry dataset for ML benchmarking on filtered/curated molecular properties, part of the Maom Lab pharmacology suite. [Chemistry]
- [maomlab/CryptoCEN](https://huggingface.co/datasets/maomlab/CryptoCEN): CryptoCEN — Cryptococcus coexpression network dataset for fungal pathogen biology and drug-target prioritisation. [Biology, Medicine]
- [imageomics/TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M): Foundational 200M-image dataset for organismal biology — multilingual species labels (en, la) at biodiversity scale, used to train BioCLIP-2 for zero-shot species classification. [Biology, Ecology, Conservation]
- [microsoft/msr-acc-tae25](https://huggingface.co/datasets/microsoft/msr-acc-tae25): Microsoft Research Accurate Chemistry Collection — large dataset of high-accuracy electronic-structure calculations (TAE25 split) for training and evaluating quantum-chemistry ML models. [Chemistry, Physics]
- [Aignostics/OpenTME](https://huggingface.co/datasets/Aignostics/OpenTME): Pre-analyzed H&E whole-slide images from TCGA across breast, bladder, colorectal, liver, and lung cancers — cell-level annotations and tumour-microenvironment spatial features generated by Atlas H&E-TME. [Medicine, Biology]
- [Orbital-Materials/MofasaDB](https://huggingface.co/datasets/Orbital-Materials/MofasaDB): Metal-organic framework dataset from Orbital — large-scale curated MOF structures for materials-discovery ML and synthetic chemistry workflows. [Materials Science, Chemistry]
- [wanglab/bioreason-pro-sft-reasoning-data](https://huggingface.co/datasets/wanglab/bioreason-pro-sft-reasoning-data): Reasoning trace dataset used to supervised-fine-tune BioReason-Pro — multimodal biological problems with rationales over genomic variants and pathway data. [Biology, Genomics, Scientific Reasoning]
- [foundry-ml/foundry_oqmd_band_gaps_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_oqmd_band_gaps_v1-1): Band-gap values from the Open Quantum Materials Database (OQMD), prepared for ML benchmarking on inorganic crystal electronic structure. [Materials Science, Physics, Chemistry]
- [foundry-ml/foundry_aflow_band_gaps_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_aflow_band_gaps_v1-1): Band-gap values from the AFLOW high-throughput materials database, formatted for ML model training and evaluation. [Materials Science, Physics, Chemistry]
- [foundry-ml/foundry_mp_band_gaps_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_mp_band_gaps_v1-1): Band-gap values curated from the Materials Project for ML benchmarking on inorganic electronic structure. [Materials Science, Physics, Chemistry]
- [foundry-ml/double_perovskite_bandgap_v1-1](https://huggingface.co/datasets/foundry-ml/double_perovskite_bandgap_v1-1): Computed band gaps for double-perovskite compounds — supports ML-based screening for photovoltaic and optoelectronic applications. [Materials Science, Physics, Chemistry, Energy]
- [foundry-ml/wolverton_oxides_v1-1](https://huggingface.co/datasets/foundry-ml/wolverton_oxides_v1-1): Wolverton oxide property dataset — DFT-computed properties for binary and ternary oxides, used for ML benchmarking on inorganic chemistry. [Materials Science, Chemistry]
- [foundry-ml/dataset_perovskite_formatione](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_formatione): Formation energies for perovskite compounds — supports ML screening for stability and synthesisability. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_stability_updated](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_stability_updated): Curated perovskite stability data (updated release) for benchmarking ML models on photovoltaic-material durability prediction. [Materials Science, Chemistry, Energy]
- [foundry-ml/perovskite_stability_v1-1](https://huggingface.co/datasets/foundry-ml/perovskite_stability_v1-1): Perovskite stability dataset (v1.1 release) — paired structure and stability labels for ML benchmarking. [Materials Science, Chemistry, Energy]
- [foundry-ml/perovskite_opbandcenter_v1-1](https://huggingface.co/datasets/foundry-ml/perovskite_opbandcenter_v1-1): O p-band center values for perovskite oxides — descriptors for catalytic activity prediction in oxygen-evolution reactions. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_conductivity](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_conductivity): Ionic and electronic conductivity measurements for perovskite materials — supports ML screening for solid-oxide fuel cell electrolytes. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_habs](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_habs): Hot-air-balance (HABS) data for perovskite materials — thermal-stability characterisation supporting durability ML. [Materials Science, Chemistry, Energy]
- [foundry-ml/dataset_perovskite_tec](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_tec): Thermal expansion coefficients for perovskite materials — curated for ML thermal-property prediction. [Materials Science, Chemistry, Physics]
- [foundry-ml/dataset_perovskite_asr](https://huggingface.co/datasets/foundry-ml/dataset_perovskite_asr): Area-specific resistance (ASR) data for perovskite electrodes — used in solid-oxide fuel cell ML modelling. [Materials Science, Chemistry, Energy]
- [foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1](https://huggingface.co/datasets/foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1): Simulated STM images for 2D materials with unique chemical compositions — supports ML on atomic-resolution microscopy. [Materials Science, Physics]
- [foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba](https://huggingface.co/datasets/foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba): Simulated STEM images for 2D materials — paired with structure metadata for training ML models on electron microscopy. [Materials Science, Physics]
- [foundry-ml/training_locating_atoms_stem_images_v1-2](https://huggingface.co/datasets/foundry-ml/training_locating_atoms_stem_images_v1-2): STEM image training set for atomic-position localisation — supports ML pipelines for automated microscopy analysis. [Materials Science, Physics]
- [foundry-ml/mask_rcnn_defect_detection_v1-1](https://huggingface.co/datasets/foundry-ml/mask_rcnn_defect_detection_v1-1): Microscopy image dataset annotated for instance-segmentation defect detection — Mask R-CNN training data for materials inspection. [Materials Science, Engineering]
- [foundry-ml/foundry_stan_segmentation_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_stan_segmentation_v1-1): Segmentation dataset (STAN) for materials microscopy images — supports ML feature extraction from electron-microscopy data. [Materials Science, Engineering]
- [foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1](https://huggingface.co/datasets/foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1): Simulated readout images from a Celeritas XS direct-electron detector — training data for electron-counting models in cryo-EM and STEM. [Materials Science, Physics]
- [foundry-ml/elastic_tensor_v1-1](https://huggingface.co/datasets/foundry-ml/elastic_tensor_v1-1): Elastic tensor data for inorganic materials — supports ML prediction of bulk and shear moduli. [Materials Science, Physics, Engineering]
- [foundry-ml/piezoelectric_tensor_v1-1](https://huggingface.co/datasets/foundry-ml/piezoelectric_tensor_v1-1): Piezoelectric tensor data for inorganic materials — supports ML for sensor and actuator material design. [Materials Science, Physics, Engineering]
- [foundry-ml/dielectric_constant_v1-1](https://huggingface.co/datasets/foundry-ml/dielectric_constant_v1-1): Dielectric-constant values for inorganic compounds — supports ML screening of high-k materials for capacitors and devices. [Materials Science, Physics]
- [foundry-ml/semiconductor_defectlevels_v1-1](https://huggingface.co/datasets/foundry-ml/semiconductor_defectlevels_v1-1): Computed defect-energy levels in semiconductors — descriptors for ML doping and trap-state prediction. [Materials Science, Physics]
- [foundry-ml/superconductivity_v1-1](https://huggingface.co/datasets/foundry-ml/superconductivity_v1-1): Curated superconductor dataset — measured Tc values for ML-based discovery of new superconducting materials. [Materials Science, Physics, Energy]
- [foundry-ml/electromigration_v1-1](https://huggingface.co/datasets/foundry-ml/electromigration_v1-1): Electromigration data for interconnect materials — supports ML prediction of failure rates in microelectronic devices. [Materials Science, Engineering]
- [foundry-ml/steel_strength_v1-1](https://huggingface.co/datasets/foundry-ml/steel_strength_v1-1): Steel strength dataset — composition-property pairs for ML-based alloy design and high-strength materials. [Materials Science, Engineering]
- [foundry-ml/dataset_mg_alloy](https://huggingface.co/datasets/foundry-ml/dataset_mg_alloy): Magnesium alloy dataset — composition and property data for ML modelling of lightweight structural alloys. [Materials Science, Engineering]
- [foundry-ml/dataset_metallicglass_rc](https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc): Critical cooling rate (Rc) data for metallic glasses — supports ML prediction of glass-forming ability. [Materials Science, Engineering]
- [foundry-ml/dataset_metallicglass_rc_llm](https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc_llm): LLM-extracted critical cooling rate data for metallic glasses — text-mined complement to the structured Rc dataset. [Materials Science, Engineering, Scientific Reasoning]
- [foundry-ml/dataset_metallicglass_dmax](https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_dmax): Maximum glass-forming diameter (Dmax) data for bulk metallic glasses — for ML screening of casting feasibility. [Materials Science, Engineering]
- [foundry-ml/dataset_concrete_compressive_strength](https://huggingface.co/datasets/foundry-ml/dataset_concrete_compressive_strength): Concrete compressive-strength dataset — mix-design and test data for ML-based civil-engineering material modelling. [Materials Science, Engineering]
- [foundry-ml/dataset_rpv_tts](https://huggingface.co/datasets/foundry-ml/dataset_rpv_tts): Reactor pressure-vessel (RPV) transition-temperature shift dataset — supports ML prediction of irradiation embrittlement. [Materials Science, Engineering, Physics]
- [foundry-ml/dataset_exfoliatione](https://huggingface.co/datasets/foundry-ml/dataset_exfoliatione): Exfoliation energy dataset for 2D materials — supports ML-driven discovery of layered compounds suitable for monolayer isolation. [Materials Science, Physics, Chemistry]
- [foundry-ml/dataset_thermalexp_aflow](https://huggingface.co/datasets/foundry-ml/dataset_thermalexp_aflow): Thermal expansion coefficients from the AFLOW database — for ML thermal-mechanical modelling of inorganic materials. [Materials Science, Physics]
- [foundry-ml/dataset_thermalcond_aflow](https://huggingface.co/datasets/foundry-ml/dataset_thermalcond_aflow): Thermal conductivity values from the AFLOW database — supports ML-based screening of thermal management materials. [Materials Science, Physics]
- [foundry-ml/dataset_debyet_aflow](https://huggingface.co/datasets/foundry-ml/dataset_debyet_aflow): Debye temperature data from the AFLOW database — fundamental thermal-vibrational descriptor for ML materials property prediction. [Materials Science, Physics]
- [foundry-ml/heusler_magnetization_v1-1](https://huggingface.co/datasets/foundry-ml/heusler_magnetization_v1-1): Magnetisation data for Heusler-alloy compounds — supports ML discovery of half-metallic and magnetocaloric materials. [Materials Science, Physics]
- [foundry-ml/dataset_li_conductivity](https://huggingface.co/datasets/foundry-ml/dataset_li_conductivity): Lithium-ion conductivity dataset for solid electrolytes — supports ML discovery of next-generation battery materials. [Materials Science, Chemistry, Energy]
- [foundry-ml/elwood_md_v1-2](https://huggingface.co/datasets/foundry-ml/elwood_md_v1-2): Elwood molecular-dynamics simulation set — trajectory and energy data for ML molecular-property prediction. [Chemistry, Materials Science]
- [foundry-ml/foundry_g4mp2_solvation_v1-2](https://huggingface.co/datasets/foundry-ml/foundry_g4mp2_solvation_v1-2): High-accuracy G4MP2 solvation-energy data — supports ML for quantum-chemical accuracy on aqueous and organic systems. [Chemistry, Physics]
- [foundry-ml/foundry_moses_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_moses_v1-1): Foundry mirror of MOSES — molecular sets benchmark for evaluating generative chemistry models on drug-like molecule generation. [Chemistry, Medicine]
- [foundry-ml/foundry_osdb_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_osdb_v1-1): Organic Semiconductor Database (OSDB) curated for ML — supports property prediction and screening of organic optoelectronic materials. [Chemistry, Materials Science, Energy]
- [foundry-ml/foundry_qmc_ml_v1-1](https://huggingface.co/datasets/foundry-ml/foundry_qmc_ml_v1-1): Quantum Monte Carlo (QMC) reference data for ML benchmarking — high-accuracy electronic structure calculations on small molecules. [Chemistry, Physics]
- [foundry-ml/diffusion_v1-4](https://huggingface.co/datasets/foundry-ml/diffusion_v1-4): Diffusion-coefficient dataset for inorganic systems — supports ML modelling of solid-state ion transport and electrolyte design. [Materials Science, Chemistry, Energy]
- [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT): Medical chain-of-thought reasoning dataset (o1-style) for supervised fine-tuning of medical LLMs — one of the most-liked medical training corpora on Hugging Face (1000+ likes). [Medicine, Scientific Reasoning]
- [FreedomIntelligence/medical-o1-verifiable-problem](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem): Verifiable medical reasoning problems with checker functions — supports RL/reward-model training for medical-LLM alignment beyond static SFT. [Medicine, Scientific Reasoning]
- [tattabio/OMG](https://huggingface.co/datasets/tattabio/OMG): Open Mixed Genomes (OMG) — large mixed-organism nucleotide corpus underpinning Tatta Bio’s gLM2 genomic foundation models. [Biology, Genomics]
- [tattabio/OG](https://huggingface.co/datasets/tattabio/OG): Open Genomes (OG) — curated genome-sequence corpus from Tatta Bio for genomic ML pretraining and benchmarking. [Biology, Genomics]
- [mist-models/excess-properties](https://huggingface.co/datasets/mist-models/excess-properties): Excess-property dataset for binary/ternary chemical mixtures — used to fine-tune MIST mixtures models on thermodynamic deviations from ideal mixing. [Chemistry, Materials Science]
- [ADSKAILab/ABC-1M](https://huggingface.co/datasets/ADSKAILab/ABC-1M): One million CAD-quality 3D shapes drawn from the ABC dataset — the foundation training corpus for the Make-A-Shape and WaLa generative models. [Engineering, Materials Science]
- [ADSKAILab/Zero-To-CAD-1m](https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-1m): 1M paired image-and-CAD-program examples for training vision-language models that synthesise parametric CAD from images. [Engineering, Scientific Reasoning]
- [ADSKAILab/Zero-To-CAD-100k](https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-100k): Curated 100K-example subset of Zero-To-CAD — useful for benchmarking and lightweight fine-tuning of CAD-from-image models. [Engineering, Scientific Reasoning]
- [ADSKAILab/LLM-narrative-planning-taskset](https://huggingface.co/datasets/ADSKAILab/LLM-narrative-planning-taskset): Narrative planning task set for evaluating LLM planning and reasoning over multi-step design and engineering scenarios. [Engineering, Scientific Reasoning]
- [ADSKAILab/codeparrot_megatron](https://huggingface.co/datasets/ADSKAILab/codeparrot_megatron): Megatron-formatted CodeParrot release used for large-scale code language-model pretraining experiments at Autodesk AI Lab. [Engineering, Scientific Reasoning]
- [recursionpharma/rxrx3](https://huggingface.co/datasets/recursionpharma/rxrx3): Full RxRx3 release — multi-million image high-content microscopy dataset spanning genetic and chemical perturbations across human cell lines, paired with rich text annotations for image-based drug discovery. [Biology, Medicine, Chemistry]
- [recursionpharma/rxrx3-core](https://huggingface.co/datasets/recursionpharma/rxrx3-core): Curated core subset of RxRx3 — high-quality phenomics images for benchmarking and lower-cost training of phenomic foundation models like OpenPhenom. [Biology, Medicine, Chemistry]
- [arcinstitute/Perturb-Sapiens](https://huggingface.co/datasets/arcinstitute/Perturb-Sapiens): Large-scale human single-cell perturbation dataset used in the STACK foundation-model lineage — paired baseline and perturbed expression profiles for genetic perturbation screens. [Biology, Genomics, Medicine]
- [arcinstitute/Replogle-Nadig-Preprint](https://huggingface.co/datasets/arcinstitute/Replogle-Nadig-Preprint): Replogle-Nadig single-cell perturbation dataset (preprint release) — Perturb-seq screens used in the STATE single-cell embedding work for perturbation-response modelling. [Biology, Genomics, Medicine]
- [arcinstitute/State-Tahoe-Filtered](https://huggingface.co/datasets/arcinstitute/State-Tahoe-Filtered): Filtered Tahoe-100M slice used in the STATE workflow — high-quality single-cell perturbation profiles for training and benchmarking cross-study cell-state models. [Biology, Genomics, Medicine]
- [Ahmad0067/MedSynth](https://huggingface.co/datasets/Ahmad0067/MedSynth): Realistic synthetic medical dialogue–SOAP note pairs generated to support training and evaluation of clinical documentation models without exposing real patient data. [Medicine]

## Models

- [Evo-2 40B](https://huggingface.co/arcinstitute/evo2_40b): 40B-parameter DNA language model trained on 9.3 trillion nucleotides across all domains of life — zero-shot function prediction, variant effect scoring, and sequence generation. [Biology, Genomics, Medicine]
- [Evo-2 7B](https://huggingface.co/arcinstitute/evo2_7b): 7B-parameter instruction-tuned DNA language model for gene function prediction, CRISPR guide design, and cross-species sequence analysis. [Biology, Genomics, Medicine]
- [STACK Large](https://huggingface.co/arcinstitute/Stack-Large): Large-scale single-cell transcriptomics foundation model supporting in-context learning across cell types and perturbation states. [Biology, Genomics, Medicine]
- [FNO Active Matter](https://huggingface.co/polymathic-ai/FNO-active_matter): Fourier Neural Operator for active matter prediction. [Physics, Engineering]
- [Aion Base](https://huggingface.co/polymathic-ai/aion-base): Multi-domain scientific foundation model. [Physics, Astronomy, Engineering]
- [WALRUS](https://huggingface.co/polymathic-ai/walrus): Foundation model for continuum dynamics pre-trained across 15 physics simulation datasets, enabling zero-shot and few-shot PDE generalisation. [Physics, Engineering]
- [AstroCLIP](https://huggingface.co/polymathic-ai/astroclip): Multimodal astronomy model aligning galaxy spectra and images into a shared embedding space for downstream astrophysical property prediction. [Astronomy, Physics]
- [TEDDY](https://huggingface.co/Merck/TEDDY): Transformer for Enabling Drug Discovery - foundation models trained on 116M single cells for genomics and drug discovery. [Biology, Genomics, Medicine]
- [NatureLM-audio](https://huggingface.co/EarthSpeciesProject/NatureLM-audio): First audio-language foundation model for bioacoustics - species classification, detection, and captioning of animal vocalizations. [Biology, Ecology, Conservation, Earth Science]
- [AVES2-BEATs](https://huggingface.co/EarthSpeciesProject/esp-aves2-sl-beats-all): Self-supervised BEATs-based audio encoder trained on broad bioacoustic data for species detection, classification, and embedding across animal taxa. [Biology, Ecology, Conservation, Earth Science]
- [AQAffinity](https://huggingface.co/SandboxAQ/AQAffinity): Open-source protein-ligand binding affinity prediction model for drug discovery. [Chemistry, Medicine, Biology]
- [HiRO-ACE](https://huggingface.co/allenai/HiRO-ACE): AI framework for efficient climate and weather simulation with kilometer-scale precipitation downscaling. [Earth Science, Climate]
- [ACE2-ERA5](https://huggingface.co/allenai/ACE2-ERA5): Ai2 Climate Emulator v2 trained on ERA5 reanalysis — fast, stable atmospheric simulation at global scale for multi-year climate projections. [Earth Science, Climate]
- [FourCastNet 3](https://huggingface.co/nvidia/fourcastnet3): Advanced ML model for global weather forecasting - produces 60-day forecasts in under 4 minutes on a single GPU. [Earth Science, Climate, Physics]
- [cBottle](https://huggingface.co/nvidia/cbottle): Diffusion-based generative model that generates atmospheric states at kilometer resolution. [Earth Science, Climate]
- [StormCast V1](https://huggingface.co/nvidia/stormcast-v1-era5-hrrr): Mesoscale ML model for convection-allowing weather forecasting at kilometer-scale resolution. [Earth Science, Climate, Physics]
- [Surya 1.0](https://huggingface.co/nasa-ibm-ai4science/Surya-1.0): First open-source AI foundation model for heliophysics - solar flare forecasting and space weather prediction. [Astronomy, Physics]
- [Surya Solar Flares](https://huggingface.co/nasa-ibm-ai4science/solar_flares_surya): Surya-1.0 fine-tuned for solar flare prediction from full-disk magnetogram and EUV time series. [Astronomy, Physics]
- [Surya Solar Wind](https://huggingface.co/nasa-ibm-ai4science/solar_wind_surya): Surya-1.0 fine-tuned for solar wind plasma and interplanetary magnetic field forecasting at the L1 Lagrange point. [Astronomy, Physics]
- [MedGemma 1.5 4B](https://huggingface.co/google/medgemma-1.5-4b-it): Multimodal medical AI model for medical imaging and clinical text understanding. [Medicine, Biology]
- [MedGemma 27B](https://huggingface.co/google/medgemma-27b-it): Large-scale instruction-tuned medical AI for radiology report generation, pathology image analysis, dermatology, and clinical question answering. [Medicine, Biology]
- [AlphaGenome](https://huggingface.co/google/alphagenome-all-folds): Google DeepMind model predicting DNA regulatory features — gene expression, chromatin accessibility, and TF binding — at single-nucleotide resolution. [Biology, Genomics]
- [MedASR](https://huggingface.co/google/medasr): Medical automatic speech recognition model for clinical documentation. [Medicine]
- [MedSigLIP](https://huggingface.co/google/medsiglip-448): Medical image-language model for visual understanding in healthcare. [Medicine, Biology]
- [TxGemma 2B](https://huggingface.co/google/txgemma-2b-predict): Lightweight therapeutic prediction model for drug discovery tasks. [Medicine, Chemistry, Biology]
- [TxGemma 9B Predict](https://huggingface.co/google/txgemma-9b-predict): Mid-size therapeutic prediction model for drug property prediction. [Medicine, Chemistry, Biology]
- [TxGemma 9B Chat](https://huggingface.co/google/txgemma-9b-chat): Conversational therapeutic model for drug discovery with reasoning explanations. [Medicine, Chemistry, Biology]
- [TxGemma 27B Predict](https://huggingface.co/google/txgemma-27b-predict): Large therapeutic prediction model achieving best-in-class performance on 66 tasks. [Medicine, Chemistry, Biology]
- [TxGemma 27B Chat](https://huggingface.co/google/txgemma-27b-chat): Large conversational therapeutic model with advanced reasoning capabilities. [Medicine, Chemistry, Biology]
- [Path Foundation](https://huggingface.co/google/path-foundation): Vision transformer for histopathology image embeddings - trained on 60M patches from TCGA. [Medicine, Biology]
- [NTv3 650M](https://huggingface.co/InstaDeepAI/NTv3_650M_post): Multi-species genomics foundation model handling 1Mb context for functional track prediction. [Biology, Genomics]
- [Nucleotide Transformer v2 500M](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species): 500M multi-species DNA language model with improved tokenisation and benchmark performance across 18 genomic prediction tasks. [Biology, Genomics]
- [Nucleotide Transformer 2.5B](https://huggingface.co/InstaDeepAI/nucleotide-transformer-2.5b-multi-species): 2.5B-parameter DNA language model trained on 850 species genomes — state-of-the-art on promoter, enhancer, and splice site prediction tasks. [Biology, Genomics]
- [ChatNT](https://huggingface.co/InstaDeepAI/ChatNT): 8B multimodal conversational model for DNA, RNA, and protein tasks — instruction-following for sequence annotation, classification, and generation. [Biology, Genomics]
- [Isoformer](https://huggingface.co/InstaDeepAI/isoformer): Transformer model integrating DNA sequence, RNA expression, and protein context for isoform-level gene expression prediction. [Biology, Genomics]
- [ether0](https://huggingface.co/futurehouse/ether0): 24B parameter model for molecular reasoning - SMILES generation, property prediction, and retrosynthesis. [Chemistry, Medicine]
- [NASA-SMD-IBM](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1): RoBERTa-based language model pre-trained on NASA Science Mission Directorate literature for earth and space science information extraction. [Earth Science, Physics, Astronomy]
- [Indus SDE v0.2](https://huggingface.co/nasa-impact/indus-sde-v0.2): Science domain extraction model for identifying and classifying scientific concepts, variables, and entities from geoscience and atmospheric science text. [Earth Science, Climate]
- [CYP Inhibition Model](https://huggingface.co/openadmet/cyp1a2-cyp2d6-cyp3a4-cyp3c9-chemeleon-baseline): Multi-task model predicting inhibition of four major cytochrome P450 isoforms (CYP1A2, CYP2C9, CYP2D6, CYP3A4) critical for drug metabolism assessment. [Medicine, Chemistry]
- [PXR Activation Model](https://huggingface.co/openadmet/pxr-chemeleon-baseline): Pregnane X receptor (PXR) activation predictor for early identification of drug-drug interaction liability via nuclear receptor-mediated CYP induction. [Medicine, Chemistry]
- [ESM2 650M](https://huggingface.co/facebook/esm2_t33_650M_UR50D): 650M-parameter protein language model trained on UniRef50 — state-of-the-art embeddings for structure prediction, function annotation, and mutation effect scoring. [Biology, Chemistry, Medicine]
- [OMat24](https://huggingface.co/facebook/OMAT24): Machine learning models for predicting inorganic material properties using EquiformerV2 and eSEN architectures. [Materials Science, Chemistry, Physics, Engineering]
- [OMol25](https://huggingface.co/facebook/OMol25): Open Molecules 2025 - dataset and models for molecular property prediction including polymer extensions. [Chemistry, Materials Science, Engineering]
- [UMA](https://huggingface.co/facebook/UMA): Universal Models for Atoms - mixture-of-experts graph network trained on billions of atoms across 5 datasets. [Chemistry, Materials Science, Physics, Engineering]
- [Tahoe-x1](https://huggingface.co/tahoebio/Tahoe-x1): Perturbation-trained single-cell foundation models (70M-3B) for cancer research and drug discovery. [Biology, Medicine, Genomics]
- [Tahoe-100M-SCVI](https://huggingface.co/tahoebio/Tahoe-100M-SCVI-v1): scVI-based variational autoencoder trained on the full Tahoe-100M atlas of 100M+ single-cell profiles across 50 cancer cell lines and 1,100 drug perturbations. [Biology, Medicine, Genomics]
- [PeptiVerse](https://huggingface.co/ChatterjeeLab/PeptiVerse): Foundation model for peptide design and analysis. [Biology, Chemistry, Medicine]
- [CoLiPRI](https://huggingface.co/microsoft/colipri): Contrastive learning model for protein-ligand interaction prediction. [Biology, Chemistry, Medicine]
- [Phikon-v2](https://huggingface.co/owkin/phikon-v2): State-of-the-art histopathology vision foundation model trained with DINOv2 on 460 million pathology tiles, achieving top performance on cancer subtyping and survival prediction. [Medicine, Biology]
- [Phikon](https://huggingface.co/owkin/phikon): ViT-based pathology foundation model trained on TCGA and other large histopathology cohorts via self-supervised learning for cancer tissue representation. [Medicine, Biology]
- [X-Cell](https://huggingface.co/Xaira-Therapeutics/X-Cell): Diffusion-based model for predicting transcriptomic responses to CRISPRi perturbations at single-cell resolution, trained on the X-Atlas-Pisces dataset. [Biology, Medicine, Genomics]
- [SuperIX](https://huggingface.co/isp-uv-es/superIX): Explainable AI super-resolution model for Sentinel-2 imagery, enhancing 10m resolution to finer scales with interpretable uncertainty estimates. [Earth Science, Climate]
- [ML4Floods](https://huggingface.co/isp-uv-es/ml4floods): Image segmentation model for near-real-time flood extent mapping from Sentinel-2 and Landsat imagery, supporting disaster response and humanitarian aid. [Earth Science, Climate]
- [StarCOP](https://huggingface.co/isp-uv-es/starcop): Methane plume detection model for EMIT and AVIRIS hyperspectral imagery, enabling automated identification of point-source greenhouse gas emissions from space. [Earth Science, Climate, Engineering]
- [p-IgGen](https://huggingface.co/opig/p-IgGen): GPT-NeoX-based generative language model for antibody sequence design, trained on the Observed Antibody Space to generate diverse immunoglobulin heavy and light chains. [Biology, Medicine, Chemistry]
- [AstroLLaMA](https://huggingface.co/UniverseTBD/astrollama): Llama-2 7B fine-tuned on 300K+ astronomy arXiv abstracts for astrophysics text generation, literature summarization, and hypothesis completion — first open LLM specialized for astronomy. [Astronomy, Physics]
- [OpenFold3](https://huggingface.co/OpenFold/OpenFold3): Open replication of AlphaFold3 — predicts structures of proteins, nucleic acids, ligands, and their complexes for drug discovery and structural biology. [Biology, Medicine, Chemistry]
- [MedSAM](https://huggingface.co/wanglab/medsam-vit-base): SAM ViT-Base fine-tuned on a large-scale dataset of CT, MRI, X-ray, ultrasound, and histology; a universal promptable foundation model for medical image segmentation. [Medicine, Biology]
- [Clinical Camel 70B](https://huggingface.co/wanglab/ClinicalCamel-70B): Llama-2 70B fine-tuned with QLoRA on physician-patient dialogues, clinical articles, and MedQA-style reasoning chains for medical conversation and decision support. [Medicine, Scientific Reasoning]
- [GO-GPT](https://huggingface.co/wanglab/gogpt): Generative model that predicts Gene Ontology functional annotations directly from protein sequences — bringing LLM-style decoding to functional protein characterisation. [Biology, Genomics, Medicine]
- [Kimina-Prover Preview Distill 7B](https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-7B): Distilled 7B preview of Kimina-Prover — a reinforcement-learning-trained model that generates Lean 4 proofs for olympiad-level mathematics problems. [Mathematics, Scientific Reasoning]
- [Kimina-Prover Distill 1.7B](https://huggingface.co/AI-MO/Kimina-Prover-Distill-1.7B): Compact 1.7B distilled Kimina-Prover variant for Lean 4 proof generation on olympiad-level theorems — runs on a single consumer GPU. [Mathematics, Scientific Reasoning]
- [Kimina-Prover Distill 8B](https://huggingface.co/AI-MO/Kimina-Prover-Distill-8B): 8B distilled Kimina-Prover variant — Lean 4 theorem-proving model trained on olympiad-level mathematical problems with reinforcement learning over proof traces. [Mathematics, Scientific Reasoning]
- [Equiformer v3](https://huggingface.co/mirror-physics/equiformer_v3): Equivariant graph transformer for molecular and materials modeling — predicts energies, forces, and properties on molecular structures and crystals. [Chemistry, Physics, Materials Science]
- [OpenMed PharmaDetect](https://huggingface.co/OpenMed/OpenMed-NER-PharmaDetect-SuperClinical-434M): Token-classification model for pharmaceutical entity recognition in clinical text — built on the SuperClinical 434M backbone for high-recall drug, dose, and regimen extraction. [Medicine, Biology]
- [OpenMed BloodCancerDetect](https://huggingface.co/OpenMed/OpenMed-NER-BloodCancerDetect-TinyMed-65M): Compact 65M token-classification model that identifies haematologic malignancy mentions (leukaemia, lymphoma, myeloma subtypes) in clinical and biomedical text. [Medicine, Biology]
- [OpenMed ChemicalDetect](https://huggingface.co/OpenMed/OpenMed-NER-ChemicalDetect-ModernMed-149M): Chemical-entity NER over biomedical literature — identifies drug names, compounds, and chemical substances using the ModernMed 149M backbone. [Medicine, Chemistry, Biology]
- [OpenMed SpeciesDetect](https://huggingface.co/OpenMed/OpenMed-NER-SpeciesDetect-ElectraMed-109M): Species-mention NER over biomedical literature — identifies organisms and taxonomic references using the ElectraMed 109M backbone. [Biology, Medicine]
- [OpenMed DNADetect](https://huggingface.co/OpenMed/OpenMed-NER-DNADetect-SuperMedical-125M): DNA-mention NER for biomedical text — extracts gene-level DNA sequence references and locus identifiers using the SuperMedical 125M backbone. [Biology, Genomics, Medicine]
- [OpenMed PathologyDetect](https://huggingface.co/OpenMed/OpenMed-NER-PathologyDetect-TinyMed-135M): Pathology-finding NER over clinical and biomedical text — surfaces histopathological observations, lesion descriptions, and tissue-level abnormalities. [Medicine, Biology]
- [OpenMed AnatomyDetect](https://huggingface.co/OpenMed/OpenMed-NER-AnatomyDetect-ElectraMed-109M): Anatomical-entity NER for biomedical text — labels body parts, organ systems, and tissue references using the ElectraMed 109M backbone. [Medicine, Biology]
- [OpenMed OncologyDetect](https://huggingface.co/OpenMed/OpenMed-NER-OncologyDetect-MultiMed-568M): Oncology-focused NER that identifies cancer-type mentions, tumour grading, and staging language across clinical and biomedical literature. [Medicine, Biology]
- [OpenMed OrganismDetect](https://huggingface.co/OpenMed/OpenMed-NER-OrganismDetect-TinyMed-82M): Organism-mention NER for biomedical text — broader than SpeciesDetect, also picking up genera, strains, and informal organism references. [Biology, Medicine]
- [OpenMed DiseaseDetect](https://huggingface.co/OpenMed/OpenMed-NER-DiseaseDetect-BioMed-335M): Disease-mention NER trained on the BioMed 335M backbone — recognises disease names, syndromes, and condition references in clinical and biomedical literature. [Medicine, Biology]
- [OpenMed GenomicDetect](https://huggingface.co/OpenMed/OpenMed-NER-GenomicDetect-PubMed-335M): Genomic-entity NER over PubMed-style text — labels genes, transcripts, and other genomic references for downstream knowledge extraction. [Biology, Genomics, Medicine]
- [OpenMed ProteinDetect](https://huggingface.co/OpenMed/OpenMed-NER-ProteinDetect-SuperClinical-141M): Protein-mention NER for biomedical and clinical text — extracts protein names, family references, and post-translational modification descriptors. [Biology, Medicine]
- [OpenMed GenomeDetect](https://huggingface.co/OpenMed/OpenMed-NER-GenomeDetect-ModernMed-149M): Genome-mention NER complementary to GenomicDetect — focuses on whole-genome and assembly-level references in biomedical text. [Biology, Genomics, Medicine]
- [BioReason-Pro SFT](https://huggingface.co/wanglab/bioreason-pro-sft): Supervised fine-tuned variant of BioReason-Pro — multimodal biological reasoning over genomic variants and pathway data with chain-of-thought rationales. [Biology, Genomics, Scientific Reasoning]
- [BioReason-Pro RL](https://huggingface.co/wanglab/bioreason-pro-rl): RL-tuned variant of BioReason-Pro, trained with reinforcement learning on top of BioReason’s SFT base for sharper biological reasoning across KEGG pathways and variant data. [Biology, Genomics, Scientific Reasoning]
- [NexaMass V3 Struct](https://huggingface.co/AethronPhantom/NexaMass-V3-Struct): Self-supervised representation model for MS/MS spectra in metabolomics — learns molecular fingerprints to support compound identification and structure inference. [Chemistry, Biology, Scientific Reasoning]
- [MMPT-FM](https://huggingface.co/Merck/MMPT-FM): Multi-modal pharma foundation model from Merck — integrates molecular and biological signals for drug discovery and target prediction. [Biology, Medicine, Chemistry]
- [OC25](https://huggingface.co/facebook/OC25): Open Catalyst 2025 — successor to OC22, modelling explicit-solvent and catalyst systems for electrochemistry and energy applications. [Chemistry, Materials Science, Energy]
- [OMC25](https://huggingface.co/facebook/OMC25): Open Molecular Crystals 2025 — Meta FAIR Chemistry release for predicting properties of organic molecular crystals (pharmaceutical polymorphs, energetic materials, OLEDs). [Chemistry, Materials Science]
- [BioCLIP 2](https://huggingface.co/imageomics/bioclip-2): OpenCLIP-based foundation model for organismal biology — zero-shot species classification from photographs across the tree of life, trained on TreeOfLife-200M. [Biology, Ecology, Conservation]
- [Skala 1.1](https://huggingface.co/microsoft/skala-1.1): Deep-learning exchange-correlation functional for density functional theory — covers main-group thermochemistry, reaction kinetics, noncovalent interactions, and molecular geometries. [Chemistry, Physics]
- [Aurora](https://huggingface.co/microsoft/aurora): Foundation model for the Earth system — global weather forecasting, atmospheric chemistry, ocean waves, and tropical-cyclone tracking from a single shared backbone. [Climate, Earth Science, Physics]
- [BioEmu](https://huggingface.co/microsoft/bioemu): Generative model for protein structural ensembles — emulates conformational dynamics for drug discovery and structural biology beyond static AlphaFold-style predictions. [Biology, Medicine, Chemistry]
- [MatterGen](https://huggingface.co/microsoft/mattergen): Generative AI for materials design — proposes novel inorganic crystal structures with specified properties for energy, catalysis, and functional-materials research. [Materials Science, Chemistry, Energy]
- [MatterSim](https://huggingface.co/microsoft/mattersim): Foundation-model atomistic simulator for materials over a wide range of temperatures and pressures, intended as a fast surrogate for ab initio MD in property prediction. [Materials Science, Chemistry, Physics]
- [OrbMol](https://huggingface.co/Orbital-Materials/OrbMol): Foundation-model potential for molecular systems — energies, forces, and properties for organic and metal-organic chemistry, supporting catalyst and pharma workflows. [Chemistry, Materials Science]
- [OneGenome-Rice](https://huggingface.co/ZhejiangLab/OneGenome-Rice): Mixtral-architecture genomic foundation model specialised for rice (Oryza sativa) — supports variant analysis, expression prediction, and breeding-relevant trait modelling. [Biology, Genomics]
- [Genos 1.2B](https://huggingface.co/ZhejiangLab/Genos-1.2B): General-purpose 1.2B-parameter genomic foundation model spanning multiple organisms — base model for downstream gene-level and sequence-level prediction tasks. [Biology, Genomics]
- [eva-rna](https://huggingface.co/ScientaLab/eva-rna): Transformer foundation model producing sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, pseudobulked single-cell) in human and mouse. [Biology, Genomics, Medicine]
- [Skala 1.0](https://huggingface.co/microsoft/skala-1.0): First release of Skala — deep-learning exchange-correlation functional for density functional theory, predecessor to Skala 1.1. [Chemistry, Physics]
- [AIMNet2-rxn](https://huggingface.co/isayevlab/aimnet2-rxn): AIMNet2 trained on reaction data — neural-network interatomic potential supporting reactive molecular simulations. [Chemistry, Physics]
- [AIMNet2 ωB97M-D3](https://huggingface.co/isayevlab/aimnet2-wb97m-d3): Neural network interatomic potential for fast and accurate molecular simulations, trained at the ωB97M-D3 level of theory. [Chemistry, Physics]
- [AIMNet2 (B97-3c, 2025)](https://huggingface.co/isayevlab/aimnet2-2025): AIMNet2 retrained at the B97-3c level of theory — 2025 release with improved coverage and accuracy. [Chemistry, Physics]
- [AIMNet2-NSE](https://huggingface.co/isayevlab/aimnet2-nse): AIMNet2 specialised for open-shell chemistry (radicals, transition states) — neural network interatomic potential for non-singlet electronic states. [Chemistry, Physics]
- [AIMNet2-Pd](https://huggingface.co/isayevlab/aimnet2-pd): AIMNet2 specialised for palladium-containing organometallic systems — supports homogeneous catalysis simulation at near-DFT accuracy. [Chemistry, Materials Science, Physics]
- [MACE-MP-0](https://huggingface.co/mace-foundations/mace-mp-0): MACE foundation model trained on the Materials Project — equivariant message-passing potential for inorganic crystal simulation across most of the periodic table. [Materials Science, Chemistry, Physics]
- [MACE-MPA-0](https://huggingface.co/mace-foundations/mace-mpa-0): MACE foundation model trained on the Materials Project + Alexandria datasets — broader coverage variant for inorganic-materials simulation. [Materials Science, Chemistry, Physics]
- [MACE-MH-0](https://huggingface.co/mace-foundations/mace-mh-0): MACE foundation model targeting molecular and hybrid systems — equivariant potential trained on a unified molecular/materials dataset. [Materials Science, Chemistry, Physics]
- [MACE-MH-1](https://huggingface.co/mace-foundations/mace-mh-1): Updated MACE-MH foundation potential with refined molecular/materials hybrid training — successor to MACE-MH-0. [Materials Science, Chemistry, Physics]
- [GENA-LM BERT large (T2T)](https://huggingface.co/AIRI-Institute/gena-lm-bert-large-t2t): BERT-large-style genomic foundation model trained on telomere-to-telomere human assemblies — supports variant interpretation, regulatory prediction, and downstream genomic tasks. [Biology, Genomics, Medicine]
- [GENA-LM BERT base (T2T)](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t): BERT-base-style genomic foundation model trained on T2T assemblies — lighter-weight backbone for genomic sequence understanding. [Biology, Genomics, Medicine]
- [ModernGENA large](https://huggingface.co/AIRI-Institute/moderngena-large): GENA-LM rebuilt on the ModernBERT architecture — larger, longer-context, RoPE-equipped genomic foundation model. [Biology, Genomics, Medicine]
- [ModernGENA base](https://huggingface.co/AIRI-Institute/moderngena-base): Compact ModernBERT-based GENA-LM variant — efficient genomic foundation model for downstream variant and expression tasks. [Biology, Genomics, Medicine]
- [HuatuoGPT-Vision 7B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-Vision-7B): Medical multimodal LLM from the HuatuoGPT family — answers clinical questions over medical imagery (radiology, pathology, dermatology) using a 7B vision-language backbone. [Medicine, Biology, Scientific Reasoning]
- [FlashPPI](https://huggingface.co/tattabio/flashppi): Fast protein-protein interaction prediction model — trained for high-throughput screening of interaction networks. [Biology, Medicine]
- [gLM2 650M](https://huggingface.co/tattabio/gLM2_650M): 650M-parameter genomic foundation model from Tatta Bio, trained on the OMG (Open MetaGenomic) corpus for sequence-level biological reasoning. [Biology, Genomics]
- [MIST 28M base](https://huggingface.co/mist-models/mist-28M-ti624ev1): Pretrained molecular language model (fill-mask) used as the starting point for downstream property-prediction fine-tunes. [Chemistry]
- [MIST 1.8B base](https://huggingface.co/mist-models/mist-1.8B-dh61satt): Large pretrained molecular language model (fill-mask) for downstream chemistry property prediction at scale. [Chemistry]
- [MIST mixtures](https://huggingface.co/mist-models/mist-mixtures-zffffbex): MIST variant pretrained on chemical mixtures rather than individual molecules. [Chemistry]
- [MIST 28M · QM9](https://huggingface.co/mist-models/mist-28M-kkgx0omx-qm9): MIST 28M fine-tuned on QM9 — quantum-mechanical property prediction over small organic molecules. [Chemistry, Physics]
- [MIST 28M · QM8](https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8): MIST 28M fine-tuned on QM8 — electronic-spectra property prediction over small organic molecules. [Chemistry, Physics]
- [MIST 28M · Tox21](https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21): MIST 28M fine-tuned on Tox21 — toxicity classification across 12 nuclear-receptor and stress-response assays. [Chemistry, Medicine]
- [MIST 28M · ClinTox](https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox): MIST 28M fine-tuned on ClinTox — clinical toxicity classification of FDA-approved drugs and failed candidates. [Chemistry, Medicine]
- [MIST 28M · SIDER](https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider): MIST 28M fine-tuned on SIDER — side-effect prediction across 27 system-organ classes for marketed drugs. [Chemistry, Medicine]
- [MIST 28M · BBBP](https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp): MIST 28M fine-tuned on BBBP — blood-brain-barrier permeability classification for CNS drug candidates. [Chemistry, Medicine]
- [MIST 28M · HIV](https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv): MIST 28M fine-tuned on HIV — anti-HIV activity classification from MoleculeNet. [Chemistry, Medicine]
- [MIST 28M · Lipo](https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo): MIST 28M fine-tuned on Lipophilicity — octanol/water distribution coefficient prediction. [Chemistry, Medicine]
- [MIST 28M · ToxCast](https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast): MIST 28M fine-tuned on ToxCast — multi-task toxicity prediction across hundreds of in-vitro assays. [Chemistry, Medicine]
- [MIST 28M · BACE](https://huggingface.co/mist-models/mist-28M-8loj3bab-bace): MIST 28M fine-tuned on BACE — beta-secretase 1 (Alzheimer target) inhibition classification. [Chemistry, Medicine]
- [MIST 28M · MUV](https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv): MIST 28M fine-tuned on MUV — maximum-unbiased-validation virtual-screening benchmark. [Chemistry, Medicine]
- [MIST 28M · ESOL](https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol): MIST 28M fine-tuned on ESOL — aqueous solubility regression (Delaney dataset). [Chemistry]
- [MIST 28M · FreeSolv](https://huggingface.co/mist-models/mist-28M-0uiq7o7m-freesolv): MIST 28M fine-tuned on FreeSolv — hydration free-energy regression for small molecules. [Chemistry]
- [MIST 28M · tmQM](https://huggingface.co/mist-models/mist-28M-ggd8iisr-tmQM): MIST 28M fine-tuned on tmQM — quantum-mechanical property prediction for transition-metal complexes. [Chemistry, Materials Science]
- [MIST 28M · pKa](https://huggingface.co/mist-models/mist-28M-6zlgl2qn-pKa): MIST 28M fine-tuned for pKa — acid-dissociation-constant prediction. [Chemistry]
- [MIST 28M · solvent properties](https://huggingface.co/mist-models/mist-28M-solvent-properties): MIST 28M fine-tuned for solvent-property prediction — bulk physical descriptors of organic solvents. [Chemistry]
- [MIST 26.9M · melting point](https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp): MIST 26.9M fine-tuned for melting-point regression. [Chemistry, Materials Science]
- [MIST 26.9M · boiling point](https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp): MIST 26.9M fine-tuned for boiling-point regression. [Chemistry, Materials Science]
- [MIST 26.9M · flash point](https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp): MIST 26.9M fine-tuned for flash-point regression. [Chemistry, Engineering]
- [MIST 26.9M · odour](https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour): MIST 26.9M fine-tuned for odour-quality prediction. [Chemistry]
- [MIST 26.9M · dn](https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn): MIST 26.9M fine-tuned for dn property regression. [Chemistry]
- [MIST 27.0M · conductivity](https://huggingface.co/mist-models/mist-conductivity-27.0M-2mpg8dcd): MIST 27.0M fine-tuned for ionic-conductivity prediction in chemical mixtures and electrolytes. [Chemistry, Materials Science, Energy]
- [MIST 27.1M · ETN](https://huggingface.co/mist-models/mist-27.1M-1gcxtg8y-ETN): MIST 27.1M fine-tuned on the ETN (empirical thermodynamic network) benchmark. [Chemistry, Materials Science]
- [MIST 1.8B · G298](https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298): MIST 1.8B fine-tuned for G298 — Gibbs free energy at 298 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · H298](https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298): MIST 1.8B fine-tuned for H298 — enthalpy at 298 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · U298](https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298): MIST 1.8B fine-tuned for U298 — internal energy at 298 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · U0](https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0): MIST 1.8B fine-tuned for U0 — internal energy at 0 K from QM9. [Chemistry, Physics]
- [MIST 1.8B · μ (dipole)](https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu): MIST 1.8B fine-tuned for dipole moment from QM9. [Chemistry, Physics]
- [MIST 1.8B · α (polarizability)](https://huggingface.co/mist-models/mist-1.8B-rcwary93-alpha): MIST 1.8B fine-tuned for isotropic polarizability from QM9. [Chemistry, Physics]
- [MIST 1.8B · HOMO](https://huggingface.co/mist-models/mist-1.8B-jmjosq12-homo): MIST 1.8B fine-tuned for HOMO energy from QM9. [Chemistry, Physics]
- [MIST 1.8B · LUMO](https://huggingface.co/mist-models/mist-1.8B-n14wshc9-lumo): MIST 1.8B fine-tuned for LUMO energy from QM9. [Chemistry, Physics]
- [MIST 1.8B · HOMO-LUMO gap](https://huggingface.co/mist-models/mist-1.8B-kayun6v3-gap): MIST 1.8B fine-tuned for HOMO-LUMO gap from QM9. [Chemistry, Physics]
- [MIST 1.8B · ZPVE](https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve): MIST 1.8B fine-tuned for zero-point vibrational energy from QM9. [Chemistry, Physics]
- [MIST 1.8B · ⟨R²⟩](https://huggingface.co/mist-models/mist-1.8B-xxe7t35e-r2): MIST 1.8B fine-tuned for electronic spatial extent from QM9. [Chemistry, Physics]
- [MIST 1.8B · Cv](https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv): MIST 1.8B fine-tuned for heat capacity Cv from QM9. [Chemistry, Physics]
- [MIST 1.8B · QM8](https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8): MIST 1.8B fine-tuned on QM8 — electronic-spectra prediction at scale. [Chemistry, Physics]
- [MIST 1.8B · Tox21](https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21): MIST 1.8B fine-tuned on Tox21 — large-scale toxicity classification across nuclear-receptor and stress assays. [Chemistry, Medicine]
- [MIST 1.8B · ClinTox](https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox): MIST 1.8B fine-tuned on ClinTox — clinical toxicity classification. [Chemistry, Medicine]
- [MIST 1.8B · SIDER](https://huggingface.co/mist-models/mist-1.8B-l1wfo7oa-sider): MIST 1.8B fine-tuned on SIDER — side-effect prediction. [Chemistry, Medicine]
- [MIST 1.8B · BBBP](https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp): MIST 1.8B fine-tuned on BBBP — blood-brain-barrier permeability. [Chemistry, Medicine]
- [MIST 1.8B · HIV](https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv): MIST 1.8B fine-tuned on HIV — anti-HIV activity classification. [Chemistry, Medicine]
- [MIST 1.8B · Lipo](https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo): MIST 1.8B fine-tuned on Lipophilicity — large-scale logD prediction. [Chemistry, Medicine]
- [MIST 1.8B · BACE](https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace): MIST 1.8B fine-tuned on BACE — Alzheimer-target inhibition classification. [Chemistry, Medicine]
- [MIST 1.8B · ESOL](https://huggingface.co/mist-models/mist-1.8B-hxiygjsm-esol): MIST 1.8B fine-tuned on ESOL — aqueous solubility regression. [Chemistry]
- [MIST 1.8B · FreeSolv](https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv): MIST 1.8B fine-tuned on FreeSolv — hydration free-energy regression. [Chemistry]
- [Zero-To-CAD Qwen3-VL 2B](https://huggingface.co/ADSKAILab/Zero-To-CAD-Qwen3-VL-2B): Qwen3-VL fine-tuned to generate parametric CAD models directly from images — bridges vision-language reasoning and engineering geometry synthesis. [Engineering, Scientific Reasoning]
- [Make-A-Shape · single-view 20M](https://huggingface.co/ADSKAILab/Make-A-Shape-single-view-20m): Make-A-Shape variant trained to generate 3D geometry from a single 2D image — supports CAD reconstruction and engineering shape synthesis. [Engineering, Materials Science]
- [Make-A-Shape · multi-view 20M](https://huggingface.co/ADSKAILab/Make-A-Shape-multi-view-20m): Make-A-Shape multi-view variant — generates 3D geometry from multiple 2D image perspectives for higher-fidelity CAD reconstruction. [Engineering, Materials Science]
- [Make-A-Shape · point-cloud 20M](https://huggingface.co/ADSKAILab/Make-A-Shape-point-cloud-20m): Make-A-Shape point-cloud variant — completes and refines 3D geometry from sparse point-cloud input. [Engineering, Materials Science]
- [Make-A-Shape · voxel 32³](https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-32res-20m): Make-A-Shape voxel variant at 32³ resolution — generates voxelised 3D geometries for low-resolution shape exploration. [Engineering, Materials Science]
- [Make-A-Shape · voxel 16³](https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-16res-20m): Coarser 16³ voxel variant of Make-A-Shape for fast prototyping of 3D geometries. [Engineering, Materials Science]
- [WaLa SV 1B](https://huggingface.co/ADSKAILab/WaLa-SV-1B): WaLa (Wavelet-Latent) 1B model conditioned on single-view input — large-scale wavelet-domain 3D shape generation. [Engineering, Materials Science]
- [WaLa RGB4 1B](https://huggingface.co/ADSKAILab/WaLa-RGB4-1B): WaLa 1B variant conditioned on four RGB views — multi-view colour-image-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa DM4 1B](https://huggingface.co/ADSKAILab/WaLa-DM4-1B): WaLa 1B variant conditioned on four depth maps — depth-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa DM6 1B](https://huggingface.co/ADSKAILab/WaLa-DM6-1B): WaLa 1B variant conditioned on six depth maps for high-coverage depth-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa PC 1B](https://huggingface.co/ADSKAILab/WaLa-PC-1B): WaLa 1B variant conditioned on point clouds — wavelet-latent shape completion from sparse point input. [Engineering, Materials Science]
- [WaLa VX16 1B](https://huggingface.co/ADSKAILab/WaLa-VX16-1B): WaLa 1B variant conditioned on 16³ voxel grids — coarse-voxel-driven 3D shape generation. [Engineering, Materials Science]
- [WaLa UN 1B](https://huggingface.co/ADSKAILab/WaLa-UN-1B): WaLa 1B unconditional variant — generates 3D shapes from noise alone for design-space exploration. [Engineering, Materials Science]
- [WaLa SK 1B](https://huggingface.co/ADSKAILab/WaLa-SK-1B): WaLa 1B variant conditioned on sketches — supports designer-driven shape generation from line art. [Engineering, Materials Science]
- [WaLa DM1 1B](https://huggingface.co/ADSKAILab/WaLa-DM1-1B): WaLa 1B variant conditioned on a single depth map — minimal-input depth-to-shape generation. [Engineering, Materials Science]
- [WaLa MVDream RGB4](https://huggingface.co/ADSKAILab/WaLa-MVDream-RGB4): WaLa coupled with MVDream for text-conditioned 3D shape generation via four RGB-view diffusion. [Engineering, Materials Science]
- [WaLa MVDream DM6](https://huggingface.co/ADSKAILab/WaLa-MVDream-DM6): WaLa coupled with MVDream and six depth views for text-conditioned 3D geometry generation. [Engineering, Materials Science]
- [OpenPhenom](https://huggingface.co/recursionpharma/OpenPhenom): Masked-autoencoder foundation model for high-content cell imaging — learns phenomic embeddings from millions of microscopy images for downstream drug-discovery and perturbation analysis. [Biology, Medicine, Chemistry]
- [Stack-Large Aligned](https://huggingface.co/arcinstitute/Stack-Large-Aligned): Aligned variant of STACK-Large — single-cell foundation model fine-tuned for cross-batch consistency, supporting multi-study perturbation analysis and downstream alignment tasks. [Biology, Genomics, Medicine]
- [SE-600M](https://huggingface.co/arcinstitute/SE-600M): 600M-parameter State Embedding (SE) model from the STATE collection, which generates embeddings for human single-cell RNA expression profiles to support cell-state and perturbation analysis. [Biology, Genomics, Medicine]

## Blog Posts

- [AI for PDEs](https://huggingface.co/blog/hugging-science/pde) — 2025-01-01 — hugging-science: Exploring AI approaches to solving partial differential equations. [Physics, Mathematics, Engineering]
- [SARLO-80: SAR Optic Language Dataset](https://huggingface.co/blog/hugging-science/sarlo-80-sar-optic-language-dataset) — 2025-01-01 — hugging-science: Introducing a large-scale dataset for SAR and optical remote sensing with language descriptions. [Earth Science, Climate]
- [Eve Bio: Mapping the Pharmone Drug Interaction](https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction) — 2025-01-01 — hugging-science: Understanding drug interactions through AI-powered pharmacogenomics. [Medicine, Biology, Chemistry]
- [The ExpansionRx OpenADMET Blind Challenge](https://huggingface.co/blog/hugging-science/the-expansionrx-openadmet-blind-challenge) — 2025-01-01 — hugging-science: A blind challenge for predicting ADMET properties in drug discovery. [Medicine, Chemistry]
- [PromoterGPT](https://huggingface.co/blog/hugging-science/promoter-gpt) — 2025-01-01 — hugging-science: AI-powered promoter sequence design and analysis. [Biology, Genomics]
- [AI for Food Allergies](https://huggingface.co/blog/hugging-science/ai-for-food-allergies) — 2025-01-01 — hugging-science: Applying AI to understand and predict food allergies. [Medicine, Biology]
- [GDP: Generative Design for Proteins](https://huggingface.co/blog/cgeorgiaw/gdp) — 2025-01-01 — cgeorgiaw: Generative models for protein design and engineering. [Biology, Chemistry]
- [ConStellaration Fusion Challenge](https://huggingface.co/blog/cgeorgiaw/constellaration-fusion-challenge) — 2025-01-01 — cgeorgiaw: A challenge for advancing fusion energy through AI. [Physics, Energy, Engineering]
- [Making Antibody Embeddings and Predictions](https://huggingface.co/blog/ginkgo-datapoints/making-antibody-embeddings-and-predictions) — 2025-01-01 — ginkgo-datapoints: How to create and use antibody embeddings for therapeutic applications. [Biology, Medicine, Biotechnology]
- [LeMaterial: An Open-Source Initiative to Accelerate Materials Discovery](https://huggingface.co/blog/lematerial) — 2024-12-10 — lvwerra: Introducing LeMaterial, a community effort to build the largest open database of materials and accelerate AI-driven discovery of new compounds and structures. [Materials Science, Chemistry, Engineering]
- [SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence](https://huggingface.co/blog/SandboxAQ/sair-data-accelerating-drug-discovery-with-ai) — 2025-09-06 — SandboxAQ: How SandboxAQ's SAIR dataset of 1M+ protein–ligand structures is enabling AI-powered drug discovery with unprecedented structural coverage. [Chemistry, Medicine, Biology]
- [How to Build a Benchmark with a Private Test Set on Hugging Face](https://huggingface.co/blog/hugging-science/building-a-benchmark-or-challenge) — 2026-02-16 — hugging-science: A step-by-step guide to creating, hosting, and managing a benchmark challenge with a hidden test set on Hugging Face. []
- [Open-R1: A Fully Open Reproduction of DeepSeek-R1](https://huggingface.co/blog/open-r1) — 2025-01-28 — lvwerra: A fully open reproduction of DeepSeek-R1's math reasoning training pipeline — data, code, and models — bringing transparent reasoning model training to the community. [Mathematics]
- [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf) — 2022-12-09 — natolambert: A clear, illustrated walkthrough of how RLHF works — the technique behind ChatGPT and modern instruction-following models. One of HF's most-read posts. []
- [Tropical Quivers for Modern AI: A Guided Tour of a Research Program](https://huggingface.co/blog/AmelieSchreiber/tropical-quivers-of-archs) — 2026-03-22 — AmelieSchreiber: A tour of tropical quiver representations and how their combinatorial structure connects to modern AI architectures. [Mathematics]
- [Surface Orders, Cyclic Time, and a Concrete Hilbert–Pólya Framework](https://huggingface.co/blog/AmelieSchreiber/hilbert-polya-for-grh) — 2026-03-17 — AmelieSchreiber: A concrete construction toward the Hilbert–Pólya conjecture using surface orders and cyclic-time symmetry as a route to the Riemann Hypothesis. [Mathematics]
- [ThermoGFN-IF for Catalysis](https://huggingface.co/blog/AmelieSchreiber/thermogfn-if) — 2026-03-10 — AmelieSchreiber: A protein sequence design model fine-tuned with GFlowNets for thermostable and kinetically aware enzyme engineering. [Biology, Chemistry, Medicine]
- [A New Era in Multistep Enzyme Design](https://huggingface.co/blog/AmelieSchreiber/a-new-era-of-enzyme-engineering) — 2024-10-16 — AmelieSchreiber: Exploring generative AI approaches for designing multistep enzymatic pathways for biosynthesis and biocatalysis. [Biology, Chemistry]
- [A Guide to Designing New Functional Proteins](https://huggingface.co/blog/AmelieSchreiber/protein-optimization-and-design) — 2024-07-02 — AmelieSchreiber: A comprehensive guide to improving protein function, stability, and diversity using generative AI and ESM-2. [Biology, Chemistry]
- [RFDiffusion Potentials](https://huggingface.co/blog/AmelieSchreiber/rfdiffusion-potentials) — 2024-05-14 — AmelieSchreiber: Using RFDiffusion with custom guiding potentials to steer protein structure generation toward desired functional properties. [Biology, Chemistry]
- [Predicting the Effects of Mutations on Protein Function with ESM-2](https://huggingface.co/blog/AmelieSchreiber/mutation-scoring) — 2023-12-13 — AmelieSchreiber: Using ESM-2 protein language model embeddings to score and predict the functional impact of point mutations. [Biology, Genomics]
- [Faster Persistent Homology Alignment and Protein Complex Clustering](https://huggingface.co/blog/AmelieSchreiber/faster-pha) — 2023-11-30 — AmelieSchreiber: Accelerating persistent homology alignment with ESM-2 embeddings and persistence landscapes for protein complex clustering. [Biology, Mathematics]
- [Clustering Protein Complexes using Persistent Homology](https://huggingface.co/blog/AmelieSchreiber/esm-ppi) — 2023-11-29 — AmelieSchreiber: Combining persistent homology with ESM-2 fine-tuning for protein–protein interaction network prediction and complex clustering. [Biology, Chemistry]
- [ESM-2 for Generating and Optimizing Peptide Binders](https://huggingface.co/blog/AmelieSchreiber/esm-interact) — 2023-11-23 — AmelieSchreiber: Generating and optimizing peptide binders for target proteins using ESM-2 embeddings and directed evolution. [Biology, Medicine]
- [Persistent Homology Alignment: Replacing Multiple Sequence Alignments](https://huggingface.co/blog/AmelieSchreiber/plm-persistent-homology-msa-replacement) — 2023-11-15 — AmelieSchreiber: Replacing traditional multiple sequence alignments with ESM-2 embeddings and persistent homology for structure-aware protein comparison. [Biology, Mathematics]
- [In Silico Directed Evolution of Protein Sequences with ESM-2](https://huggingface.co/blog/AmelieSchreiber/directed-evolution-with-esm2) — 2023-11-13 — AmelieSchreiber: Using ESM-2 and EvoProtGrad to simulate directed evolution in silico, optimizing protein sequences for target properties. [Biology, Chemistry]
- [QLoRA for ESM-2 and Post Translational Modification Site Prediction](https://huggingface.co/blog/AmelieSchreiber/esm2-ptm) — 2023-11-11 — AmelieSchreiber: Applying QLoRA fine-tuning to ESM-2 for accurate prediction of post-translational modification sites across protein sequences. [Biology, Genomics]
- [Estimating the Intrinsic Dimension of Protein Sequence Embeddings](https://huggingface.co/blog/AmelieSchreiber/intrinsic-dimension-of-proteins) — 2023-10-18 — AmelieSchreiber: Measuring the intrinsic dimensionality of ESM-2 protein embeddings to understand the geometric structure of protein sequence space. [Biology, Mathematics]
- [Predicting Protein–Protein Interactions Using a Protein Language Model](https://huggingface.co/blog/AmelieSchreiber/protein-binding-partners-with-esm2) — 2023-10-15 — AmelieSchreiber: Using ESM-2 embeddings and linear sum assignment to predict protein–protein binding partners at scale. [Biology, Chemistry]
- [ESMBind Ensemble Models](https://huggingface.co/blog/AmelieSchreiber/esmbind-ensemble) — 2023-09-22 — AmelieSchreiber: Ensemble methods for ESMBind models to improve binding site prediction accuracy and robustness across protein families. [Biology, Genomics]
- [ESMBind: Low Rank Adaptation of ESM-2 for Protein Binding Site Prediction](https://huggingface.co/blog/AmelieSchreiber/esmbind) — 2023-09-15 — AmelieSchreiber: Fine-tuning ESM-2 with LoRA adapters to predict protein binding sites with high accuracy and parameter efficiency. [Biology, Genomics]
- [Physics Informed Neural Networks (PINNs): An Intuitive Guide](https://towardsdatascience.com/physics-informed-neural-networks-pinns-an-intuitive-guide-fff138069563/) — 2025-01-28 — towardsdatascience.com: A clear, intuitive walkthrough of how PINNs embed physical laws directly into neural network training — bridging traditional PDE-based modeling with data-driven deep learning. [Physics, Mathematics, Engineering]
- [A Living Review of Machine Learning for Particle Physics](https://iml-wg.github.io/HEPML-LivingReview/) — 2020-06-01 — iml-wg.github.io: A continuously updated, near-comprehensive survey of ML techniques applied to experimental, phenomenological, and theoretical high-energy physics — maintained by the Inter-Experimental LHC ML Working Group. [Physics]
- [Did GPT-5.2 Make a Breakthrough Discovery in Theoretical Physics?](https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons) — 2026-02-01 — dlouapre: GPT-5.2 conjectured a compact formula for single-minus gluon tree amplitudes that had been assumed to vanish for 40 years, a striking example of AI contributing to original theoretical physics. [Physics, Mathematics]
- [A Comprehensive Introduction to AI for Proteins (2026)](https://www.tamarind.bio/blog/a-comprehensive-introduction-to-ai-for-proteins) — 2026-01-01 — tamarind.bio: A thorough primer on the state of AI for protein science — covering structure prediction, protein language models, generative design, and the full open-source model landscape. [Biology, Chemistry, Medicine]
- [Boltz-2: State of the Art Structure and Binding Affinity Prediction](https://www.tamarind.bio/blog/boltz2-state-of-the-art-structure-and-binding-affinity-prediction) — 2025-06-18 — tamarind.bio: Boltz-2 outperforms AlphaFold3 on antibody-antigen interfaces and sets a new state of the art for protein-ligand binding affinity prediction. [Biology, Chemistry, Medicine]
- [Boltzdesign1: Designing De Novo Binders to More Than Just Proteins](https://www.tamarind.bio/blog/boltzdesign1-small-molecule-rna-dna-protein-metal-binder-design) — 2025-06-01 — tamarind.bio: BoltzDesign1 extends de novo binder design beyond protein targets to small molecules, RNA, DNA, and metal ions. [Biology, Chemistry, Medicine]
- [OpenFold3 and The Future of Protein Folding](https://www.tamarind.bio/blog/openfold3-fully-open-alphafold3-alternative) — 2025-04-01 — tamarind.bio: OpenFold3 is a fully open-source, commercially available AlphaFold3 alternative backed by the OpenFold Consortium — enabling unrestricted biomolecular structure prediction. [Biology, Chemistry]
- [IntFold: A New Best Structure Prediction Protocol](https://www.tamarind.bio/blog/intfold-a-new-state-of-the-art) — 2025-03-01 — tamarind.bio: IntFold establishes a new state-of-the-art protocol for biomolecular complex structure prediction, setting records across standard benchmarks. [Biology, Chemistry]
- [Chai-1r: AlphaFold3 Level Performance, Now Completely Open Source](https://www.tamarind.bio/blog/chai-1-alphafold3-level-performance-now-completely-open-source) — 2025-02-01 — tamarind.bio: Chai-1r achieves AlphaFold3-level accuracy on protein-protein and antibody-antigen complexes with fully open weights and no usage restrictions. [Biology, Chemistry, Medicine]
- [Computational De Novo Design of Antibodies and Nanobodies](https://www.tamarind.bio/blog/de-novo-antibody-nanobody-vhh-scfv-rfdiffusion) — 2025-01-01 — tamarind.bio: A practical guide to designing antibody VHHs and scFvs de novo using RFdiffusion and ProteinMPNN, from target epitope to validated sequence. [Biology, Medicine, Chemistry, Biotechnology]
- [Predicting Antibody Properties & Developability](https://www.tamarind.bio/blog/predicting-antibody-properties-developability) — 2025-01-01 — tamarind.bio: ML approaches for predicting key biophysical properties of therapeutic antibody candidates — stability, solubility, and immunogenicity — before wet-lab validation. [Biology, Medicine, Chemistry, Biotechnology]
- [Are Mini Proteins the Next Antibodies?](https://www.tamarind.bio/blog/mini-protein-antibodies) — 2025-01-01 — tamarind.bio: Examining the therapeutic potential of computationally designed miniproteins as a next-generation alternative to traditional antibody drugs. [Biology, Medicine, Chemistry]
- [Computational De Novo Miniproteins As Therapeutics](https://www.tamarind.bio/blog/computationaly-de-novo-minibinders-therapeutic-applications) — 2024-12-01 — tamarind.bio: How computationally designed de novo miniproteins and minibinders are being developed as a new class of targeted therapeutics. [Biology, Medicine, Chemistry]
- [Computational Protein–Protein Interaction Screening](https://www.tamarind.bio/blog/ppi-screen) — 2024-12-01 — tamarind.bio: A practical guide to screening for protein–protein interactions (PPIs) as drug discovery targets using structure prediction and ML scoring. [Biology, Medicine, Chemistry]
- [Boltz-1: AlphaFold3 Level Performance, Truly Open Source](https://www.tamarind.bio/blog/boltz-1-alphafold3-level-performance-truly-open-source-and-commercially-available) — 2024-11-01 — tamarind.bio: Boltz-1 from MIT achieves AlphaFold3-level accuracy on protein and protein-ligand structure prediction with no restrictions on commercial use or input types. [Biology, Chemistry]


================================================================================
## Topic: Astronomy (/topics/astronomy.md)
================================================================================

# Astronomy — Hugging Science

> Space science and astrophysics

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (5)

### polymathic-ai/supernova_explosion_64
- **Type**: Astrophysics Simulation
- **Tags**: Physics, Astronomy
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/supernova_explosion_64

Hydrodynamic simulations of core-collapse supernova explosions at 64³ resolution, spanning diverse progenitor masses and explosion energies.

### nasa-ibm-ai4science/surya-bench-flare-forecasting
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-ibm-ai4science/surya-bench-flare-forecasting

Full-disk solar flare forecasting dataset from NOAA GOES observations, providing multi-hour-ahead flare probability labels for heliophysics model evaluation.

### nasa-ibm-ai4science/core-sdo
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/datasets/nasa-ibm-ai4science/core-sdo

Multi-modal Solar Dynamics Observatory dataset combining EUV imagery, magnetograms, and irradiance spectra for solar foundation model pre-training.

### UniverseTBD/arxiv-abstracts-large
- **Type**: Scientific Literature
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/datasets/UniverseTBD/arxiv-abstracts-large

1.7 million scholarly article abstracts spanning physics, computer science, and statistics from arXiv, structured for pretraining and fine-tuning astronomy and scientific language models.

### UniverseTBD/AstroLLaVA_convos
- **Type**: Multimodal Astronomy
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/datasets/UniverseTBD/AstroLLaVA_convos

Astronomical images paired with detailed captions and question-answer pairs sourced from APOD, ESO, and ESA Hubble archives, for training multimodal vision-language models on astrophysics.

## Models (7)

### Aion Base
- **Type**: Foundation Model
- **Tags**: Physics, Astronomy, Engineering
- **HuggingFace**: https://huggingface.co/polymathic-ai/aion-base

Multi-domain scientific foundation model.

### AstroCLIP
- **Type**: Astronomy Foundation Model
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/polymathic-ai/astroclip

Multimodal astronomy model aligning galaxy spectra and images into a shared embedding space for downstream astrophysical property prediction.

### Surya 1.0
- **Type**: Heliophysics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/nasa-ibm-ai4science/Surya-1.0

First open-source AI foundation model for heliophysics, supporting solar flare forecasting and space weather prediction.

### Surya Solar Flares
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/nasa-ibm-ai4science/solar_flares_surya

Surya-1.0 fine-tuned for solar flare prediction from full-disk magnetogram and EUV time series.

### Surya Solar Wind
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/nasa-ibm-ai4science/solar_wind_surya

Surya-1.0 fine-tuned for solar wind plasma and interplanetary magnetic field forecasting at the L1 Lagrange point.

### NASA-SMD-IBM
- **Type**: Earth Science NLP
- **Tags**: Earth Science, Physics, Astronomy
- **HuggingFace**: https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1

RoBERTa-based language model pre-trained on NASA Science Mission Directorate literature for earth and space science information extraction.

### AstroLLaMA
- **Type**: Astronomy Language Model
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/UniverseTBD/astrollama

Llama-2 7B fine-tuned on 300K+ astronomy arXiv abstracts for astrophysics text generation, literature summarization, and hypothesis completion — first open LLM specialized for astronomy.


================================================================================
## Topic: Benchmark (/topics/benchmark.md)
================================================================================

# Benchmark — Hugging Science

> Evaluation and benchmarking datasets

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (37)

### polymathic-ai/active_matter
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/active_matter

High-fidelity simulations of self-propelled particle systems for benchmarking learned PDE solvers and emergent collective behavior models.

### polymathic-ai/MHD_64
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/MHD_64

3D magnetohydrodynamics turbulence simulations at 64³ resolution for training and benchmarking physics-informed neural operators.

### polymathic-ai/rayleigh_benard
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/rayleigh_benard

Rayleigh–Bénard thermal convection simulations at varying Rayleigh and Prandtl numbers for benchmarking turbulence and heat transfer models.

### nasa-impact/WxC-Bench
- **Type**: Climate Benchmark
- **Tags**: Earth Science, Climate, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-impact/WxC-Bench

Standardized benchmark for evaluating AI models across six atmospheric and earth science tasks, including gravity wave parameterization, turbulence prediction, and hurricane track forecasting.

### EarthSpeciesProject/BEANS-Zero
- **Type**: Bioacoustics Benchmark
- **Tags**: Biology, Ecology, Conservation, Benchmark, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero

Zero-shot bioacoustics benchmark evaluating audio-language models on species detection, classification, and captioning across diverse animal taxa.

### jablonkagroup/ChemBench
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/ChemBench

Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs.

### AI-MO/GeometryLeanBench
- **Type**: Theorem Proving
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/GeometryLeanBench

Geometry theorem-proving problems formalized in Lean 4, covering Euclidean, affine, and metric geometry for automated reasoning evaluation.

### AI-MO/CombiBench
- **Type**: Combinatorics
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/CombiBench

Combinatorics problems drawn from AMC, AIME, and olympiad competitions, formalized for benchmarking discrete-mathematics reasoning in language models.

### AI-MO/minif2f_test
- **Type**: Theorem Proving
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/minif2f_test

Test set for the miniF2F formal mathematics benchmark.

### AI-MO/aimo-validation-amc
- **Type**: Competition Math
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-amc

AMC 10/12 competition problems reformatted for AIMO challenge validation, covering algebra, geometry, and number theory at difficulty levels 1–5.

### AI-MO/aimo-validation-aime
- **Type**: Competition Math
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-aime

AIME I/II problems reformatted for AIMO challenge validation — 15-question integer-answer format, covering competition math at difficulty levels 5–9.

### AI-MO/aimo-validation-math-level-4
- **Type**: Math Problems
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-4

Level-4 MATH benchmark problems (second-highest difficulty tier) used for AIMO challenge validation and fine-grained model evaluation.

### AI-MO/aimo-validation-math-level-5
- **Type**: Math Problems
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-5

Level-5 MATH benchmark problems (highest difficulty) used for AIMO challenge validation and measuring the ceiling of model mathematical reasoning.

### google/spiqa
- **Type**: Scientific Benchmark
- **Tags**: Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/google/spiqa

Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains.

### nasa-ibm-ai4science/surya-bench-flare-forecasting
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-ibm-ai4science/surya-bench-flare-forecasting

Full-disk solar flare forecasting dataset from NOAA GOES observations, providing multi-hour-ahead flare probability labels for heliophysics model evaluation.

### openadmet/openadmet-expansionrx-challenge-data
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-data

Full ExpansionRx challenge dataset of RNA-targeted small-molecule compounds with measured ADMET properties for open pharmacokinetics benchmarking.

### InstaDeepAI/NTv3_benchmark_dataset
- **Type**: Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/NTv3_benchmark_dataset

Benchmark dataset with functional tracks and genome annotations across 7 species.

### InstaDeepAI/nucleotide_transformer_downstream_tasks
- **Type**: Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks

18 genomic prediction benchmark tasks covering histone marks, regulatory regions, splice sites, and promoter activity across human and multi-species genomes.

### InstaDeepAI/plant-genomic-benchmark
- **Type**: Plant Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/plant-genomic-benchmark

Plant genomics benchmark spanning gene expression, chromatin accessibility, and agronomic trait prediction tasks across multiple crop and model plant species.

### InstaDeepAI/true-cds-protein-tasks
- **Type**: Protein Tasks
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/true-cds-protein-tasks

Coding sequence and protein function prediction benchmark tasks.

### facebook/principia-bench
- **Type**: STEM Benchmark
- **Tags**: Mathematics, Physics, Chemistry, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-bench

Curated benchmark of challenging STEM problems requiring multi-step reasoning, quantitative analysis, and domain knowledge across natural sciences.

### futurehouse/BixBench
- **Type**: Research Benchmark
- **Tags**: Biology, Chemistry, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/BixBench

Benchmark with 205 reproducible research questions paired with data capsules for AI evaluation.

### futurehouse/lab-bench
- **Type**: Research Benchmark
- **Tags**: Biology, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/lab-bench

Language Agent Biology Benchmark (LAB-Bench): 8 categories of scientific research tasks, including cloning, figures, and protocols.

### futurehouse/ether0-benchmark
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Medicine, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/ether0-benchmark

Chemistry reasoning benchmark covering SMILES-based tasks including reaction prediction, retrosynthesis, and molecular property estimation for evaluating chemistry LLMs.

### owkin/nct-crc-he
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/nct-crc-he

Colorectal cancer tissue classification dataset with H&E-stained patches across 9 tissue classes, widely used for benchmarking pathology models.

### owkin/camelyon16-features
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/camelyon16-features

Pre-extracted features from the CAMELYON16 breast cancer lymph node metastasis detection challenge, enabling efficient benchmarking of multiple-instance learning (MIL) methods.

### owkin/her2-challenge-2026
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/her2-challenge-2026

HER2 scoring challenge dataset with H&E-stained whole-slide images for evaluating AI-based HER2 status prediction in breast cancer.

### SAIRfoundation/equational-theories-selected-problems
- **Type**: Mathematical Reasoning
- **Tags**: Mathematics, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/SAIRfoundation/equational-theories-selected-problems

Curated selection of equational theory problems for benchmarking LLM mathematical reasoning and automated theorem proving.

### SAIRfoundation/equational-theories-benchmark
- **Type**: Mathematical Reasoning
- **Tags**: Mathematics, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/SAIRfoundation/equational-theories-benchmark

Full benchmark suite of equational theory problems spanning algebraic structures, designed to evaluate formal reasoning capabilities of AI models.

### isp-uv-es/CloudSEN12Plus
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/CloudSEN12Plus

Large-scale cloud detection dataset with 49,000+ Sentinel-2 patches and expert-quality cloud/shadow annotations across global biomes and seasons.

### isp-uv-es/opensr-test
- **Type**: Earth Observation
- **Tags**: Earth Science, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/opensr-test

Benchmark dataset for real-world Sentinel-2 super-resolution, with paired low/high-resolution imagery and perceptual quality metrics.

### openai/healthbench
- **Type**: Medical Benchmark
- **Tags**: Medicine, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/openai/healthbench

Realistic multi-turn health conversations graded against physician-written rubrics across multiple axes (accuracy, completeness, communication) — an open evaluation benchmark for AI assistants in medicine.

### openai/healthbench-professional
- **Type**: Medical Benchmark
- **Tags**: Medicine, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/openai/healthbench-professional

Professional-graded subset of HealthBench: physician evaluators score model responses to clinically realistic conversations, targeting expert-level health assessment.

### openai/frontierscience
- **Type**: Scientific Reasoning Benchmark
- **Tags**: Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/openai/frontierscience

Frontier science evaluation benchmark probing model capabilities on expert-level reasoning across natural sciences — designed to surface what AI systems can and cannot do at the research frontier.

### Anthropic/BioMysteryBench-preview
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-preview

Preview slice of BioMysteryBench — challenging, expert-curated biology problems for evaluating AI scientific reasoning capability.

### Anthropic/BioMysteryBench-full
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-full

Full BioMysteryBench evaluation set — challenging biology problems used to probe expert-level scientific reasoning in frontier models.

### jablonkagroup/MaCBench
- **Type**: Materials Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/MaCBench

Materials Chemistry Benchmark — multimodal QA, multiple-choice, and visual-question-answering items for evaluating LLMs on materials and inorganic chemistry tasks.


================================================================================
## Topic: Biology (/topics/biology.md)
================================================================================

# Biology — Hugging Science

> Life sciences, genomics, and biological systems

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (55)

### arcinstitute/opengenome2
- **Type**: Genomics
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/opengenome2

Curated collection of prokaryotic and eukaryotic genomic sequences for training and benchmarking large-scale biological foundation models.

### arcinstitute/SE-167M-Human
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/SE-167M-Human

167M human single-cell RNA expression profiles across diverse tissues and cell types, used for training STACK and SE single-cell foundation models.

### arcinstitute/Stack-CellxGene45M
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Stack-CellxGene45M

45M curated single-cell profiles drawn from the CellxGene corpus, standardized for in-context learning and cross-study perturbation analysis.

### ginkgo-datapoints/GDPa1
- **Type**: Antibody Developability
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPa1

Antibody developability dataset with biophysical assay data for 242 antibodies across 9 assays.

### ginkgo-datapoints/GDPx1
- **Type**: Functional Genomics
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx1

DRUG-seq functional genomics dataset with chemical perturbation experiments in A549 cells.

### ginkgo-datapoints/GDPx2
- **Type**: Functional Genomics
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx2

DRUG-seq transcriptomic profiling across 4 primary human cell types with 85 compounds.

### ginkgo-datapoints/GDPx3
- **Type**: Cell Imaging
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx3

High-content Cell Painting imaging dataset for AI/ML model training in drug discovery.

### ginkgo-datapoints/GDPx4
- **Type**: Functional Genomics
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx4

DRUG-seq transcriptomic profiling in engineered HEK293 cells with inducible gene overexpression, enabling systematic study of gene-drug interactions.

### eve-bio/drug-target-activity
- **Type**: Drug Discovery
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/eve-bio/drug-target-activity

Drug-target interaction measurements for 1,397 FDA-approved small molecule drugs.

### EarthSpeciesProject/BEANS-Zero
- **Type**: Bioacoustics Benchmark
- **Tags**: Biology, Ecology, Conservation, Benchmark, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero

Zero-shot bioacoustics benchmark evaluating audio-language models on species detection, classification, and captioning across diverse animal taxa.

### SandboxAQ/SAIR
- **Type**: Drug Discovery
- **Tags**: Chemistry, Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/SandboxAQ/SAIR

Largest public dataset of protein-ligand 3D structures with binding affinity measurements (1M+ pairs).

### google/spiqa
- **Type**: Scientific Benchmark
- **Tags**: Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/google/spiqa

Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains.

### InstaDeepAI/NTv3_benchmark_dataset
- **Type**: Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/NTv3_benchmark_dataset

Benchmark dataset with functional tracks and genome annotations across 7 species.

### InstaDeepAI/nucleotide_transformer_downstream_tasks
- **Type**: Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks

18 genomic prediction benchmark tasks covering histone marks, regulatory regions, splice sites, and promoter activity across human and multi-species genomes.

### InstaDeepAI/multi_species_genomes
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes

Whole-genome sequences for 850 species spanning bacteria, fungi, plants, and animals — the pre-training corpus for the Nucleotide Transformer model family.

### InstaDeepAI/plant-genomic-benchmark
- **Type**: Plant Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/plant-genomic-benchmark

Plant genomics benchmark spanning gene expression, chromatin accessibility, and agronomic trait prediction tasks across multiple crop and model plant species.

### InstaDeepAI/winnow-ms-datasets
- **Type**: Proteomics
- **Tags**: Biology, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/winnow-ms-datasets

Mass spectrometry datasets for protein analysis and ML model training.

### InstaDeepAI/true-cds-protein-tasks
- **Type**: Protein Tasks
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/true-cds-protein-tasks

Coding sequence and protein function prediction benchmark tasks.

### futurehouse/BixBench
- **Type**: Research Benchmark
- **Tags**: Biology, Chemistry, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/BixBench

Benchmark with 205 reproducible research questions paired with data capsules for AI evaluation.

### futurehouse/lab-bench
- **Type**: Research Benchmark
- **Tags**: Biology, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/lab-bench

Language Agent Biology Benchmark (LAB-Bench): 8 categories of scientific research tasks, including cloning, figures, and protocols.

### tahoebio/Tahoe-100M
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tahoebio/Tahoe-100M

Giga-scale perturbation atlas with 100M+ single-cell profiles from 50 cancer cell lines and 1,100 drugs.

### tahoebio/Tahoe-x1-embeddings
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tahoebio/Tahoe-x1-embeddings

Pre-computed cell and gene embeddings from the Tahoe-x1 foundation model.

### owkin/plism-dataset-tiles
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/owkin/plism-dataset-tiles

Large-scale histopathology tile dataset for benchmarking robustness of pathology foundation models across staining and scanner variability.

### owkin/nct-crc-he
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/nct-crc-he

Colorectal cancer tissue classification dataset with H&E-stained patches across 9 tissue classes, widely used for benchmarking pathology models.

### owkin/camelyon16-features
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/camelyon16-features

Pre-extracted features from the CAMELYON16 breast cancer lymph node metastasis detection challenge, enabling efficient benchmarking of multiple-instance learning (MIL) methods.

### owkin/her2-challenge-2026
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/her2-challenge-2026

HER2 scoring challenge dataset with H&E-stained whole-slide images for evaluating AI-based HER2 status prediction in breast cancer.

### Xaira-Therapeutics/X-Atlas-Orion
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Orion

Large-scale single-cell transcriptomics atlas with millions of cell profiles from diverse human tissues, designed for training perturbation-aware foundation models.

### Xaira-Therapeutics/X-Atlas-Pisces
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Pisces

CRISPRi perturbation single-cell dataset pairing genetic knockdowns with transcriptomic responses, used for training and evaluating the X-Cell model.

### AllTheBacteria/ATB
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/ATB

AllTheBacteria: a comprehensive collection of ~2 million bacterial genome assemblies from public sequence databases, standardized for large-scale genomic analysis.

### AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity

High-diversity corpus of bacterial protein sequences derived from the ATB collection, filtered for maximum sequence diversity to support protein language model pretraining.

### AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity

High-diversity corpus of bacterial intergenic DNA sequences for training DNA language models on non-coding regulatory regions.

### AllTheBacteria/SPIRE
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics, Ecology, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/SPIRE

Searchable Planetary-scale mIcrobiome REsource: a large-scale metagenomics resource aggregating environmental microbiome samples from diverse global habitats.

### opig/OAS
- **Type**: Antibody Sequences
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/opig/OAS

Observed Antibody Space: a curated database of over one billion antibody sequences from immune repertoire sequencing studies, the standard resource for antibody ML.

### wanglab/CT_DeepLesion-MedSAM2
- **Type**: Medical Imaging
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/wanglab/CT_DeepLesion-MedSAM2

CT volumes from the DeepLesion benchmark with mask annotations restructured for training and evaluating MedSAM2, a universal medical image segmentation foundation model.

### wanglab/img_virus_plasmid
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/wanglab/img_virus_plasmid

Combined IMG/VR (uncultivated virus genomes) and IMG/PR (plasmids from genomes and metagenomes) catalog with rich functional, taxonomic, and ecological metadata.

### wanglab/kegg
- **Type**: Biological Reasoning
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/wanglab/kegg

KEGG pathway entries paired with variant annotations for training and evaluating multimodal biological reasoning models (used by the BioReason work).

### OpenMed/synthvision-annotated-qwen
- **Type**: Synthetic Medical Vision
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/synthvision-annotated-qwen

Synthetic medical-imaging dataset annotated by Qwen — used in OpenMed’s SynthVision pipeline for training and validating medical multimodal models.

### OpenMed/synthvision-seeds
- **Type**: Synthetic Medical Vision
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/synthvision-seeds

Seed prompts and source imagery feeding the SynthVision generation pipeline that produces OpenMed’s annotated medical-imaging training corpora.

### OpenMed/synthvision-annotated-kimi
- **Type**: Synthetic Medical Vision
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/synthvision-annotated-kimi

Synthetic medical-imaging dataset annotated by Kimi — sister set to the Qwen-annotated split, supporting cross-annotator validation in the SynthVision pipeline.

### allenai/peS2o
- **Type**: Pretraining Corpus
- **Tags**: Scientific Reasoning, Biology, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/allenai/peS2o

Approximately 40M cleaned, filtered, and formatted open-access academic papers derived from S2ORC — a large multi-domain pretraining corpus for science-aware language models, spanning biology, chemistry, engineering, computer science, and physics.

### Anthropic/BioMysteryBench-preview
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-preview

Preview slice of BioMysteryBench — challenging, expert-curated biology problems for evaluating AI scientific reasoning capability.

### Anthropic/BioMysteryBench-full
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-full

Full BioMysteryBench evaluation set — challenging biology problems used to probe expert-level scientific reasoning in frontier models.

### maomlab/Molecule3D
- **Type**: Molecular Properties
- **Tags**: Chemistry, Biology
- **HuggingFace**: https://huggingface.co/datasets/maomlab/Molecule3D

Curated 3D molecular structures with computed properties — supports geometric deep learning for property prediction and conformer-aware modelling.

### maomlab/TDC
- **Type**: Therapeutics Benchmark
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/datasets/maomlab/TDC

Therapeutics Data Commons subset — drug-discovery tasks (ADMET, drug-target interaction, generation) curated for benchmarking molecular ML.

### maomlab/CryptoCEN
- **Type**: Coexpression Network
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/datasets/maomlab/CryptoCEN

CryptoCEN — Cryptococcus coexpression network dataset for fungal pathogen biology and drug-target prioritisation.

### imageomics/TreeOfLife-200M
- **Type**: Biodiversity Image Corpus
- **Tags**: Biology, Ecology, Conservation
- **HuggingFace**: https://huggingface.co/datasets/imageomics/TreeOfLife-200M

Foundational 200M-image dataset for organismal biology with multilingual species labels (en, la) at biodiversity scale, used to train BioCLIP 2 for zero-shot species classification.

### Aignostics/OpenTME
- **Type**: Digital Pathology
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/Aignostics/OpenTME

Pre-analyzed H&E whole-slide images from TCGA across breast, bladder, colorectal, liver, and lung cancers — cell-level annotations and tumour-microenvironment spatial features generated by Atlas H&E-TME.

### wanglab/bioreason-pro-sft-reasoning-data
- **Type**: Biological Reasoning Corpus
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/wanglab/bioreason-pro-sft-reasoning-data

Reasoning-trace dataset used for supervised fine-tuning of BioReason-Pro: multimodal biological problems with rationales over genomic variants and pathway data.

### tattabio/OMG
- **Type**: Genomic Corpus
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tattabio/OMG

Open MetaGenomic (OMG) corpus: a large mixed-modality metagenomic corpus underpinning Tatta Bio's gLM2 genomic language models.

### tattabio/OG
- **Type**: Genomic Corpus
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tattabio/OG

Open Genomes (OG) — curated genome-sequence corpus from Tatta Bio for genomic ML pretraining and benchmarking.

### recursionpharma/rxrx3
- **Type**: Phenomics Imaging
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/recursionpharma/rxrx3

Full RxRx3 release — multi-million image high-content microscopy dataset spanning genetic and chemical perturbations across human cell lines, paired with rich text annotations for image-based drug discovery.

### recursionpharma/rxrx3-core
- **Type**: Phenomics Imaging
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/recursionpharma/rxrx3-core

Curated core subset of RxRx3 — high-quality phenomics images for benchmarking and lower-cost training of phenomic foundation models like OpenPhenom.

### arcinstitute/Perturb-Sapiens
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Perturb-Sapiens

Large-scale human single-cell perturbation dataset used in the STACK foundation-model lineage — paired baseline and perturbed expression profiles for genetic perturbation screens.

### arcinstitute/Replogle-Nadig-Preprint
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Replogle-Nadig-Preprint

Replogle-Nadig single-cell perturbation dataset (preprint release) — Perturb-seq screens used in the STATE single-cell embedding work for perturbation-response modelling.

### arcinstitute/State-Tahoe-Filtered
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/State-Tahoe-Filtered

Filtered Tahoe-100M slice used in the STATE workflow — high-quality single-cell perturbation profiles for training and benchmarking cross-study cell-state models.

## Models (66)

### Evo-2 40B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/evo2_40b

40B-parameter DNA language model trained on 9.3 trillion nucleotides across all domains of life — zero-shot function prediction, variant effect scoring, and sequence generation.

### Evo-2 7B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/evo2_7b

7B-parameter DNA language model for gene function prediction, CRISPR guide design, and cross-species sequence analysis.

### STACK Large
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/Stack-Large

Large-scale single-cell transcriptomics foundation model supporting in-context learning across cell types and perturbation states.

### TEDDY
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/Merck/TEDDY

Transformer for Enabling Drug Discovery (TEDDY): foundation models trained on 116M single cells for genomics and drug discovery.

### NatureLM-audio
- **Type**: Audio-Language Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/NatureLM-audio

The first audio-language foundation model for bioacoustics, supporting species classification, detection, and captioning of animal vocalizations.

### AVES2-BEATs
- **Type**: Bioacoustics Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/esp-aves2-sl-beats-all

Self-supervised BEATs-based audio encoder trained on broad bioacoustic data for species detection, classification, and embedding across animal taxa.

### AQAffinity
- **Type**: Drug Discovery
- **Tags**: Chemistry, Medicine, Biology
- **HuggingFace**: https://huggingface.co/SandboxAQ/AQAffinity

Open-source protein-ligand binding affinity prediction model for drug discovery.

### MedGemma 1.5 4B
- **Type**: Medical AI
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/medgemma-1.5-4b-it

Multimodal medical AI model for medical imaging and clinical text understanding.

### MedGemma 27B
- **Type**: Medical AI
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/medgemma-27b-it

Large-scale instruction-tuned medical AI for radiology report generation, pathology image analysis, dermatology, and clinical question answering.

### AlphaGenome
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/google/alphagenome-all-folds

Google DeepMind model predicting DNA regulatory features — gene expression, chromatin accessibility, and TF binding — at single-nucleotide resolution.

### MedSigLIP
- **Type**: Medical Imaging
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/medsiglip-448

Medical image-language model for visual understanding in healthcare.

### TxGemma 2B
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-2b-predict

Lightweight therapeutic prediction model for drug discovery tasks.

### TxGemma 9B Predict
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-9b-predict

Mid-size therapeutic prediction model for drug property prediction.

### TxGemma 9B Chat
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-9b-chat

Conversational therapeutic model for drug discovery with reasoning explanations.

### TxGemma 27B Predict
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-27b-predict

Large therapeutic prediction model with strong reported performance across 66 therapeutic development tasks.

### TxGemma 27B Chat
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-27b-chat

Large conversational therapeutic model with advanced reasoning capabilities.

### Path Foundation
- **Type**: Pathology
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/path-foundation

Vision transformer for histopathology image embeddings, trained on 60M patches from TCGA.

### NTv3 650M
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/NTv3_650M_post

Multi-species genomics foundation model handling 1Mb context for functional track prediction.

### Nucleotide Transformer v2 500M
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species

500M-parameter multi-species DNA language model with improved tokenisation and benchmark performance across 18 genomic prediction tasks.

### Nucleotide Transformer 2.5B
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/nucleotide-transformer-2.5b-multi-species

2.5B-parameter DNA language model trained on genomes from 850 species; state-of-the-art on promoter, enhancer, and splice-site prediction tasks.

### ChatNT
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/ChatNT

8B multimodal conversational model for DNA, RNA, and protein tasks — instruction-following for sequence annotation, classification, and generation.

### Isoformer
- **Type**: Multi-Omics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/isoformer

Transformer model integrating DNA sequence, RNA expression, and protein context for isoform-level gene expression prediction.

### ESM2 650M
- **Type**: Protein Language Model
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/facebook/esm2_t33_650M_UR50D

650M-parameter protein language model trained on UniRef50 — state-of-the-art embeddings for structure prediction, function annotation, and mutation effect scoring.

### Tahoe-x1
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/tahoebio/Tahoe-x1

Perturbation-trained single-cell foundation models (70M to 3B parameters) for cancer research and drug discovery.

### Tahoe-100M-SCVI
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/tahoebio/Tahoe-100M-SCVI-v1

scVI-based variational autoencoder trained on the full Tahoe-100M atlas of 100M+ single-cell profiles across 50 cancer cell lines and 1,100 drug perturbations.

### PeptiVerse
- **Type**: Peptide Design
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/ChatterjeeLab/PeptiVerse

Foundation model for peptide design and analysis.

### CoLiPRI
- **Type**: Protein-Ligand Interaction
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/microsoft/colipri

Contrastive learning model for protein-ligand interaction prediction.

### Phikon-v2
- **Type**: Pathology Foundation Model
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/owkin/phikon-v2

State-of-the-art histopathology vision foundation model trained with DINOv2 on roughly 460 million pathology tiles, achieving top performance on cancer subtyping and survival prediction.

### Phikon
- **Type**: Pathology Foundation Model
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/owkin/phikon

ViT-based pathology foundation model trained on TCGA and other large histopathology cohorts via self-supervised learning for cancer tissue representation.

### X-Cell
- **Type**: Single-Cell Perturbation Model
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/Xaira-Therapeutics/X-Cell

Diffusion-based model for predicting transcriptomic responses to CRISPRi perturbations at single-cell resolution, trained on the X-Atlas-Pisces dataset.

### p-IgGen
- **Type**: Antibody Language Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/opig/p-IgGen

GPT-NeoX-based generative language model for antibody sequence design, trained on the Observed Antibody Space to generate diverse immunoglobulin heavy and light chains.

### OpenFold3
- **Type**: Protein Structure Prediction
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/OpenFold/OpenFold3

Open replication of AlphaFold3 — predicts structures of proteins, nucleic acids, ligands, and their complexes for drug discovery and structural biology.

### MedSAM
- **Type**: Medical Image Segmentation
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/wanglab/medsam-vit-base

SAM ViT-Base fine-tuned on a large-scale dataset of CT, MRI, X-ray, ultrasound, and histology images, serving as a universal promptable foundation model for medical image segmentation.

### GO-GPT
- **Type**: Protein Function Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/wanglab/gogpt

Generative model that predicts Gene Ontology functional annotations directly from protein sequences — bringing LLM-style decoding to functional protein characterisation.

### OpenMed PharmaDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-PharmaDetect-SuperClinical-434M

Token-classification model for pharmaceutical entity recognition in clinical text — built on the SuperClinical 434M backbone for high-recall drug, dose, and regimen extraction.

### OpenMed BloodCancerDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-BloodCancerDetect-TinyMed-65M

Compact 65M token-classification model that identifies haematologic malignancy mentions (leukaemia, lymphoma, myeloma subtypes) in clinical and biomedical text.

### OpenMed ChemicalDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-ChemicalDetect-ModernMed-149M

Chemical-entity NER over biomedical literature — identifies drug names, compounds, and chemical substances using the ModernMed 149M backbone.

### OpenMed SpeciesDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-SpeciesDetect-ElectraMed-109M

Species-mention NER over biomedical literature — identifies organisms and taxonomic references using the ElectraMed 109M backbone.

### OpenMed DNADetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-DNADetect-SuperMedical-125M

DNA-mention NER for biomedical text — extracts gene-level DNA sequence references and locus identifiers using the SuperMedical 125M backbone.

### OpenMed PathologyDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-PathologyDetect-TinyMed-135M

Pathology-finding NER over clinical and biomedical text — surfaces histopathological observations, lesion descriptions, and tissue-level abnormalities.

### OpenMed AnatomyDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-AnatomyDetect-ElectraMed-109M

Anatomical-entity NER for biomedical text — labels body parts, organ systems, and tissue references using the ElectraMed 109M backbone.

### OpenMed OncologyDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-OncologyDetect-MultiMed-568M

Oncology-focused NER that identifies cancer-type mentions, tumour grading, and staging language across clinical and biomedical literature.

### OpenMed OrganismDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-OrganismDetect-TinyMed-82M

Organism-mention NER for biomedical text — broader than SpeciesDetect, also picking up genera, strains, and informal organism references.

### OpenMed DiseaseDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-DiseaseDetect-BioMed-335M

Disease-mention NER built on the BioMed 335M backbone; recognises disease names, syndromes, and condition references in clinical and biomedical literature.

### OpenMed GenomicDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-GenomicDetect-PubMed-335M

Genomic-entity NER over PubMed-style text — labels genes, transcripts, and other genomic references for downstream knowledge extraction.

### OpenMed ProteinDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-ProteinDetect-SuperClinical-141M

Protein-mention NER for biomedical and clinical text — extracts protein names, family references, and post-translational modification descriptors.

### OpenMed GenomeDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-GenomeDetect-ModernMed-149M

Genome-mention NER complementary to GenomicDetect — focuses on whole-genome and assembly-level references in biomedical text.

### BioReason-Pro SFT
- **Type**: Biological Reasoning Model
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/bioreason-pro-sft

Supervised fine-tuned variant of BioReason-Pro — multimodal biological reasoning over genomic variants and pathway data with chain-of-thought rationales.

### BioReason-Pro RL
- **Type**: Biological Reasoning Model
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/bioreason-pro-rl

RL-tuned variant of BioReason-Pro: reinforcement learning applied on top of the BioReason-Pro SFT base for sharper biological reasoning across KEGG pathways and variant data.

### NexaMass V3 Struct
- **Type**: Mass Spectrometry Model
- **Tags**: Chemistry, Biology, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AethronPhantom/NexaMass-V3-Struct

Self-supervised representation model for MS/MS spectra in metabolomics — learns molecular fingerprints to support compound identification and structure inference.

### MMPT-FM
- **Type**: Pharma Foundation Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/Merck/MMPT-FM

Multi-modal pharma foundation model from Merck — integrates molecular and biological signals for drug discovery and target prediction.

### BioCLIP 2
- **Type**: Vision-Language Model
- **Tags**: Biology, Ecology, Conservation
- **HuggingFace**: https://huggingface.co/imageomics/bioclip-2

OpenCLIP-based foundation model for organismal biology — zero-shot species classification from photographs across the tree of life, trained on TreeOfLife-200M.

### BioEmu
- **Type**: Protein Dynamics Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/microsoft/bioemu

Generative model for protein structural ensembles — emulates conformational dynamics for drug discovery and structural biology beyond static AlphaFold-style predictions.

### OneGenome-Rice
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/ZhejiangLab/OneGenome-Rice

Mixtral-architecture genomic foundation model specialised for rice (Oryza sativa) — supports variant analysis, expression prediction, and breeding-relevant trait modelling.

### Genos 1.2B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/ZhejiangLab/Genos-1.2B

General-purpose 1.2B-parameter genomic foundation model spanning multiple organisms — base model for downstream gene-level and sequence-level prediction tasks.

### eva-rna
- **Type**: Transcriptomics Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/ScientaLab/eva-rna

Transformer foundation model producing sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, pseudobulked single-cell) in human and mouse.

### GENA-LM BERT large (T2T)
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/gena-lm-bert-large-t2t

BERT-large-style genomic foundation model trained on telomere-to-telomere human assemblies — supports variant interpretation, regulatory prediction, and downstream genomic tasks.

### GENA-LM BERT base (T2T)
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t

BERT-base-style genomic foundation model trained on T2T assemblies — lighter-weight backbone for genomic sequence understanding.

### ModernGENA large
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/moderngena-large

GENA-LM rebuilt on the ModernBERT architecture — larger, longer-context, RoPE-equipped genomic foundation model.

### ModernGENA base
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/moderngena-base

Compact ModernBERT-based GENA-LM variant — efficient genomic foundation model for downstream variant and expression tasks.

### HuatuoGPT-Vision 7B
- **Type**: Medical Vision-Language Model
- **Tags**: Medicine, Biology, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/FreedomIntelligence/HuatuoGPT-Vision-7B

Medical multimodal LLM from the HuatuoGPT family — answers clinical questions over medical imagery (radiology, pathology, dermatology) using a 7B vision-language backbone.

### FlashPPI
- **Type**: Protein-Protein Interaction Model
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/tattabio/flashppi

Fast protein-protein interaction prediction model — trained for high-throughput screening of interaction networks.

### gLM2 650M
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/tattabio/gLM2_650M

650M-parameter genomic foundation model from Tatta Bio, trained on the Open MetaGenomic (OMG) corpus for sequence-level biological reasoning.

### OpenPhenom
- **Type**: Phenomics Foundation Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/recursionpharma/OpenPhenom

Masked-autoencoder foundation model for high-content cell imaging — learns phenomic embeddings from millions of microscopy images for downstream drug-discovery and perturbation analysis.

### Stack-Large Aligned
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/Stack-Large-Aligned

Aligned variant of STACK-Large — single-cell foundation model fine-tuned for cross-batch consistency, supporting multi-study perturbation analysis and downstream alignment tasks.

### SE-600M
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/SE-600M

600M-parameter State Embedding (SE) model from the STATE collection; generates embeddings for human single-cell RNA expression profiles to support cell-state and perturbation analysis.

## Blog Posts (33)

### Eve Bio: Mapping the Pharmome Drug Interaction
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Medicine, Biology, Chemistry
- **Link**: https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction

Understanding drug interactions through AI-powered pharmacogenomics.

### PromoterGPT
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/hugging-science/promoter-gpt

AI-powered promoter sequence design and analysis.

### AI for Food Allergies
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Medicine, Biology
- **Link**: https://huggingface.co/blog/hugging-science/ai-for-food-allergies

Applying AI to understand and predict food allergies.

### GDP: Generative Design for Proteins
- **Author**: cgeorgiaw
- **Date**: 2025-01-01
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/cgeorgiaw/gdp

Generative models for protein design and engineering.

### Making Antibody Embeddings and Predictions
- **Author**: ginkgo-datapoints
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Biotechnology
- **Link**: https://huggingface.co/blog/ginkgo-datapoints/making-antibody-embeddings-and-predictions

How to create and use antibody embeddings for therapeutic applications.

### SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence
- **Author**: SandboxAQ
- **Date**: 2025-09-06
- **Tags**: Chemistry, Medicine, Biology
- **Link**: https://huggingface.co/blog/SandboxAQ/sair-data-accelerating-drug-discovery-with-ai

How SandboxAQ's SAIR dataset of 1M+ protein–ligand structures is enabling AI-powered drug discovery with unprecedented structural coverage.

### ThermoGFN-IF for Catalysis
- **Author**: AmelieSchreiber
- **Date**: 2026-03-10
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://huggingface.co/blog/AmelieSchreiber/thermogfn-if

A protein sequence design model fine-tuned with GFlowNets for thermostable and kinetically aware enzyme engineering.

### A New Era in Multistep Enzyme Design
- **Author**: AmelieSchreiber
- **Date**: 2024-10-16
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/a-new-era-of-enzyme-engineering

Exploring generative AI approaches for designing multistep enzymatic pathways for biosynthesis and biocatalysis.

### A Guide to Designing New Functional Proteins
- **Author**: AmelieSchreiber
- **Date**: 2024-07-02
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/protein-optimization-and-design

A comprehensive guide to improving protein function, stability, and diversity using generative AI and ESM-2.

### RFDiffusion Potentials
- **Author**: AmelieSchreiber
- **Date**: 2024-05-14
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/rfdiffusion-potentials

Using RFDiffusion with custom guiding potentials to steer protein structure generation toward desired functional properties.

### Predicting the Effects of Mutations on Protein Function with ESM-2
- **Author**: AmelieSchreiber
- **Date**: 2023-12-13
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/mutation-scoring

Using ESM-2 protein language model embeddings to score and predict the functional impact of point mutations.

### Faster Persistent Homology Alignment and Protein Complex Clustering
- **Author**: AmelieSchreiber
- **Date**: 2023-11-30
- **Tags**: Biology, Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/faster-pha

Accelerating persistent homology alignment with ESM-2 embeddings and persistence landscapes for protein complex clustering.

### Clustering Protein Complexes using Persistent Homology
- **Author**: AmelieSchreiber
- **Date**: 2023-11-29
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esm-ppi

Combining persistent homology with ESM-2 fine-tuning for protein–protein interaction network prediction and complex clustering.

### ESM-2 for Generating and Optimizing Peptide Binders
- **Author**: AmelieSchreiber
- **Date**: 2023-11-23
- **Tags**: Biology, Medicine
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esm-interact

Generating and optimizing peptide binders for target proteins using ESM-2 embeddings and directed evolution.

### Persistent Homology Alignment: Replacing Multiple Sequence Alignments
- **Author**: AmelieSchreiber
- **Date**: 2023-11-15
- **Tags**: Biology, Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/plm-persistent-homology-msa-replacement

Replacing traditional multiple sequence alignments with ESM-2 embeddings and persistent homology for structure-aware protein comparison.

### In Silico Directed Evolution of Protein Sequences with ESM-2
- **Author**: AmelieSchreiber
- **Date**: 2023-11-13
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/directed-evolution-with-esm2

Using ESM-2 and EvoProtGrad to simulate directed evolution in silico, optimising protein sequences for target properties.

### QLoRA for ESM-2 and Post Translational Modification Site Prediction
- **Author**: AmelieSchreiber
- **Date**: 2023-11-11
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esm2-ptm

Applying QLoRA fine-tuning to ESM-2 for accurate prediction of post-translational modification sites across protein sequences.

### Estimating the Intrinsic Dimension of Protein Sequence Embeddings
- **Author**: AmelieSchreiber
- **Date**: 2023-10-18
- **Tags**: Biology, Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/intrinsic-dimension-of-proteins

Measuring the intrinsic dimensionality of ESM-2 protein embeddings to understand the geometric structure of protein sequence space.

### Predicting Protein–Protein Interactions Using a Protein Language Model
- **Author**: AmelieSchreiber
- **Date**: 2023-10-15
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/protein-binding-partners-with-esm2

Using ESM-2 embeddings and linear sum assignment to predict protein–protein binding partners at scale.

### ESMBind Ensemble Models
- **Author**: AmelieSchreiber
- **Date**: 2023-09-22
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esmbind-ensemble

Ensemble methods for ESMBind models to improve binding site prediction accuracy and robustness across protein families.

### ESMBind: Low Rank Adaptation of ESM-2 for Protein Binding Site Prediction
- **Author**: AmelieSchreiber
- **Date**: 2023-09-15
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esmbind

Fine-tuning ESM-2 with LoRA adapters to predict protein binding sites with high accuracy and parameter efficiency.

### A Comprehensive Introduction to AI for Proteins (2026)
- **Author**: tamarind.bio
- **Date**: 2026-01-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/a-comprehensive-introduction-to-ai-for-proteins

A thorough primer on the state of AI for protein science — covering structure prediction, protein language models, generative design, and the full open-source model landscape.

### Boltz-2: State of the Art Structure and Binding Affinity Prediction
- **Author**: tamarind.bio
- **Date**: 2025-06-18
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/boltz2-state-of-the-art-structure-and-binding-affinity-prediction

Boltz-2 outperforms AlphaFold3 on antibody-antigen interfaces and sets a new state of the art for protein-ligand binding affinity prediction.

### BoltzDesign1: Designing De Novo Binders to More Than Just Proteins
- **Author**: tamarind.bio
- **Date**: 2025-06-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/boltzdesign1-small-molecule-rna-dna-protein-metal-binder-design

BoltzDesign1 extends de novo binder design beyond protein targets to small molecules, RNA, DNA, and metal ions.

### OpenFold3 and The Future of Protein Folding
- **Author**: tamarind.bio
- **Date**: 2025-04-01
- **Tags**: Biology, Chemistry
- **Link**: https://www.tamarind.bio/blog/openfold3-fully-open-alphafold3-alternative

OpenFold3 is a fully open-source AlphaFold3 alternative that is available for commercial use and backed by the OpenFold Consortium, enabling unrestricted biomolecular structure prediction.

### IntFold: A New Best Structure Prediction Protocol
- **Author**: tamarind.bio
- **Date**: 2025-03-01
- **Tags**: Biology, Chemistry
- **Link**: https://www.tamarind.bio/blog/intfold-a-new-state-of-the-art

IntFold establishes a new state-of-the-art protocol for biomolecular complex structure prediction, setting records across standard benchmarks.

### Chai-1r: AlphaFold3 Level Performance, Now Completely Open Source
- **Author**: tamarind.bio
- **Date**: 2025-02-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/chai-1-alphafold3-level-performance-now-completely-open-source

Chai-1r achieves AlphaFold3-level accuracy on protein-protein and antibody-antigen complexes with fully open weights and no usage restrictions.

### Computational De Novo Design of Antibodies and Nanobodies
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/de-novo-antibody-nanobody-vhh-scfv-rfdiffusion

A practical guide to designing antibody VHHs and scFvs de novo using RFdiffusion and ProteinMPNN, from target epitope to validated sequence.

### Predicting Antibody Properties & Developability
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/predicting-antibody-properties-developability

ML approaches for predicting key biophysical properties of therapeutic antibody candidates — stability, solubility, and immunogenicity — before wet-lab validation.

### Are Mini Proteins the Next Antibodies?
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/mini-protein-antibodies

Examining the therapeutic potential of computationally designed miniproteins as a next-generation alternative to traditional antibody drugs.

### Boltz-1: AlphaFold3 Level Performance, Truly Open Source
- **Author**: tamarind.bio
- **Date**: 2024-11-01
- **Tags**: Biology, Chemistry
- **Link**: https://www.tamarind.bio/blog/boltz-1-alphafold3-level-performance-truly-open-source-and-commercially-available

Boltz-1 from MIT achieves AlphaFold3-level accuracy on protein and protein-ligand structure prediction with no restrictions on commercial use or input types.

### Computational De Novo Miniproteins As Therapeutics
- **Author**: tamarind.bio
- **Date**: 2024-12-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/computationaly-de-novo-minibinders-therapeutic-applications

How computationally designed de novo miniproteins and minibinders are being developed as a new class of targeted therapeutics.

### Computational Protein–Protein Interaction Screening
- **Author**: tamarind.bio
- **Date**: 2024-12-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/ppi-screen

A practical guide to screening for protein–protein interactions (PPIs) as drug discovery targets using structure prediction and ML scoring.


================================================================================
## Topic: Biotechnology (/topics/biotechnology.md)
================================================================================

# Biotechnology — Hugging Science

> Biological engineering and synthetic biology

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
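
Every entry in these topic files has the same shape: a `###` heading, a bulleted list of `**Key**: value` fields, and a one-line description. The sketch below shows one way to read that layout into plain dictionaries using only the standard library. The function name `parse_entries` and the output keys (`title`, `fields`, `description`) are illustrative choices, not part of the site.

```python
import re

# Matches metadata bullets such as "- **Tags**: Biology, Biotechnology".
FIELD_RE = re.compile(r"^- \*\*(?P<key>[^*]+)\*\*: (?P<value>.+)$")


def parse_entries(markdown: str) -> list[dict]:
    """Split topic-file text into entries keyed by their `###` headings."""
    entries = []
    current = None
    for line in markdown.splitlines():
        if line.startswith("### "):
            current = {"title": line[4:].strip(), "fields": {}, "description": ""}
            entries.append(current)
        elif line.strip() and set(line.strip()) == {"="}:
            current = None  # "=====" separator between concatenated topic files
        elif line.startswith("## ") or line.startswith("# "):
            current = None  # a section or page heading closes the previous entry
        elif current is not None:
            match = FIELD_RE.match(line)
            if match:
                current["fields"][match["key"]] = match["value"].strip()
            elif line.strip():
                # Any other non-blank line is treated as description text.
                joined = current["description"] + " " + line.strip()
                current["description"] = joined.strip()
    return entries


sample = """## Datasets (6)

### ginkgo-datapoints/GDPa1
- **Type**: Antibody Developability
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPa1

Antibody developability dataset with biophysical assay data for 242 antibodies across 9 assays.
"""

entries = parse_entries(sample)
print(entries[0]["title"], entries[0]["fields"]["Tags"].split(", "))
```

Splitting the `Tags` value on `", "` gives the list of topics an entry is filed under.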

## Datasets (6)

### ginkgo-datapoints/GDPa1
- **Type**: Antibody Developability
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPa1

Antibody developability dataset with biophysical assay data for 242 antibodies across 9 assays.

### ginkgo-datapoints/GDPx1
- **Type**: Functional Genomics
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx1

DRUG-seq functional genomics dataset with chemical perturbation experiments in A549 cells.

### ginkgo-datapoints/GDPx2
- **Type**: Functional Genomics
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx2

DRUG-seq transcriptomic profiling across 4 primary human cell types with 85 compounds.

### ginkgo-datapoints/GDPx3
- **Type**: Cell Imaging
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx3

High-content Cell Painting imaging dataset for AI/ML model training in drug discovery.

### ginkgo-datapoints/GDPx4
- **Type**: Functional Genomics
- **Tags**: Biology, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/ginkgo-datapoints/GDPx4

DRUG-seq transcriptomic profiling in engineered HEK293 cells with inducible gene overexpression, enabling systematic study of gene-drug interactions.

### wanglab/img_virus_plasmid
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/wanglab/img_virus_plasmid

Combined IMG/VR (uncultivated virus genomes) and IMG/PR (plasmids from genomes and metagenomes) catalog with rich functional, taxonomic, and ecological metadata.

## Blog Posts (3)

### Making Antibody Embeddings and Predictions
- **Author**: ginkgo-datapoints
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Biotechnology
- **Link**: https://huggingface.co/blog/ginkgo-datapoints/making-antibody-embeddings-and-predictions

How to create and use antibody embeddings for therapeutic applications.

### Computational De Novo Design of Antibodies and Nanobodies
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/de-novo-antibody-nanobody-vhh-scfv-rfdiffusion

A practical guide to designing antibody VHHs and scFvs de novo using RFdiffusion and ProteinMPNN, from target epitope to validated sequence.

### Predicting Antibody Properties & Developability
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/predicting-antibody-properties-developability

ML approaches for predicting key biophysical properties of therapeutic antibody candidates — stability, solubility, and immunogenicity — before wet-lab validation.
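
Because entries are filed under every topic they are tagged with, the same link can appear in several topic files; the three blog posts above, for example, also carry the Biology tag. The sketch below shows one possible way to collapse such repeats when reading more than one topic file. The record fields (`title`, `link`, `topic`) and the function name `merge_by_link` are hypothetical, chosen only for this example.

```python
def merge_by_link(records: list[dict]) -> list[dict]:
    """Collapse records that share a link, remembering each topic they came from."""
    merged: dict[str, dict] = {}
    for record in records:
        link = record["link"]
        if link not in merged:
            # First time this link is seen: keep its title and start a topic list.
            merged[link] = {"title": record["title"], "link": link, "topics": []}
        if record["topic"] not in merged[link]["topics"]:
            merged[link]["topics"].append(record["topic"])
    # Dicts preserve insertion order, so first-seen order is kept.
    return list(merged.values())


# Illustrative input: the `topic` value is supplied by whoever reads the files.
records = [
    {
        "title": "Making Antibody Embeddings and Predictions",
        "link": "https://huggingface.co/blog/ginkgo-datapoints/making-antibody-embeddings-and-predictions",
        "topic": "Biology",
    },
    {
        "title": "Making Antibody Embeddings and Predictions",
        "link": "https://huggingface.co/blog/ginkgo-datapoints/making-antibody-embeddings-and-predictions",
        "topic": "Biotechnology",
    },
    {
        "title": "Predicting Antibody Properties & Developability",
        "link": "https://www.tamarind.bio/blog/predicting-antibody-properties-developability",
        "topic": "Biotechnology",
    },
]

unique = merge_by_link(records)
for item in unique:
    print(item["title"], item["topics"])
```

Keying on the link rather than the title avoids merging two different posts that happen to share a name.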


================================================================================
## Topic: Chemistry (/topics/chemistry.md)
================================================================================

# Chemistry — Hugging Science

> Molecular science, reactions, and materials

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
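
Section headings in a topic file declare how many entries follow, as in `## Datasets (59)`. A small consistency check, sketched here under the assumption that it is run on one topic file at a time (section names repeat across topics), compares each declared count with the number of `###` headings found. The name `check_section_counts` is illustrative.

```python
import re

# Matches counted section headings such as "## Datasets (59)".
SECTION_RE = re.compile(r"^## (?P<name>.+) \((?P<count>\d+)\)$")


def check_section_counts(markdown: str) -> dict[str, tuple[int, int]]:
    """Map each counted section name to (declared count, entries found)."""
    results: dict[str, tuple[int, int]] = {}
    name = None
    for line in markdown.splitlines():
        match = SECTION_RE.match(line)
        if match:
            name = match["name"]
            results[name] = (int(match["count"]), 0)
        elif line.startswith("## ") or line.startswith("# "):
            name = None  # an uncounted heading ends the current section
        elif line.startswith("### ") and name is not None:
            declared, found = results[name]
            results[name] = (declared, found + 1)
    return results


sample = """## Datasets (2)

### ginkgo-datapoints/GDPx1
- **Type**: Functional Genomics

### ginkgo-datapoints/GDPx2
- **Type**: Functional Genomics

## Blog Posts (1)

### Making Antibody Embeddings and Predictions
- **Author**: ginkgo-datapoints
"""

counts = check_section_counts(sample)
mismatches = {k: v for k, v in counts.items() if v[0] != v[1]}
print(counts, mismatches)
```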

## Datasets (59)

### eve-bio/drug-target-activity
- **Type**: Drug Discovery
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/eve-bio/drug-target-activity

Drug-target interaction measurements for 1,397 FDA-approved small molecule drugs.

### SandboxAQ/SAIR
- **Type**: Drug Discovery
- **Tags**: Chemistry, Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/SandboxAQ/SAIR

Largest public dataset of protein-ligand 3D structures with binding affinity measurements (1M+ pairs).

### SandboxAQ/aqcat25-dataset
- **Type**: Computational Chemistry
- **Tags**: Chemistry, Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset

13.5M DFT calculation trajectories for heterogeneous catalysis and ML potential training.

### jablonkagroup/chempile-mlift
- **Type**: Molecular
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-mlift

Curated lift-off subset of the ChemPile corpus for instruction-tuning and benchmarking chemistry language models across synthesis, property prediction, and reaction tasks.

### jablonkagroup/ChemBench
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/ChemBench

Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs.

### jablonkagroup/chempile-paper
- **Type**: Scientific Literature
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-paper

Large corpus of peer-reviewed chemistry papers and preprints for pre-training and fine-tuning chemistry language models.

### google/spiqa
- **Type**: Scientific Benchmark
- **Tags**: Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/google/spiqa

Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains.

### LeMaterial/LeMat-Bulk-MLIP-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-MLIP-Hull

Convex hull data for bulk materials from MLIP calculations.

### LeMaterial/LeMat-Bulk-DFT-Hull-All
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull-All

Complete DFT convex hull dataset for bulk materials discovery.

### LeMaterial/LeMat-Bulk-DFT-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull

DFT convex hull reference data for materials stability analysis.

### LeMaterial/LeMat-Bulk
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk

Primary bulk materials database aggregating 1M+ crystal structures with DFT-computed formation energies, band gaps, and elastic properties for materials discovery.

### LeMaterial/LeMat-Traj
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Traj

Large-scale molecular dynamics trajectory dataset for training machine learning interatomic potentials across diverse bulk material compositions.

### openadmet/openadmet-expansionrx-challenge-train-data
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-train-data

Training data for the OpenADMET ExpansionRx ADMET prediction challenge.

### openadmet/openadmet-expansionrx-challenge-data
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-data

Full ExpansionRx challenge dataset of RNA-targeted small-molecule compounds with measured ADMET properties for open pharmacokinetics benchmarking.

### openadmet/Octant_CYP_inhibition_reactivity_blog_release
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/openadmet/Octant_CYP_inhibition_reactivity_blog_release

Octant CYP inhibition and chemical reactivity dataset measuring cytochrome P450 activity across a diverse compound library for ADMET modelling.

### InstaDeepAI/winnow-ms-datasets
- **Type**: Proteomics
- **Tags**: Biology, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/winnow-ms-datasets

Mass spectrometry datasets for protein analysis and ML model training.

### facebook/principia-collection
- **Type**: STEM Reasoning
- **Tags**: Mathematics, Physics, Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-collection

Large-scale STEM reasoning dataset from Meta covering mathematics, physics, chemistry, and biology problems for training and evaluating scientific reasoning in language models.

### facebook/principia-bench
- **Type**: STEM Benchmark
- **Tags**: Mathematics, Physics, Chemistry, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-bench

Curated benchmark of challenging STEM problems requiring multi-step reasoning, quantitative analysis, and domain knowledge across natural sciences.

### futurehouse/BixBench
- **Type**: Research Benchmark
- **Tags**: Biology, Chemistry, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/BixBench

Benchmark with 205 reproducible research questions paired with data capsules for AI evaluation.

### futurehouse/ether0-benchmark
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Medicine, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/ether0-benchmark

Chemistry reasoning benchmark covering SMILES-based tasks including reaction prediction, retrosynthesis, and molecular property estimation for evaluating chemistry LLMs.

### opig/OAS
- **Type**: Antibody Sequences
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/opig/OAS

Observed Antibody Space: a curated database of over one billion antibody sequences from immune repertoire sequencing studies, the standard resource for antibody ML.

### allenai/peS2o
- **Type**: Pretraining Corpus
- **Tags**: Scientific Reasoning, Biology, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/allenai/peS2o

Approximately 40M cleaned, filtered, and formatted open-access academic papers derived from S2ORC — a large multi-domain pretraining corpus for science-aware language models, spanning biology, chemistry, engineering, computer science, and physics.

### jablonkagroup/chempile-instruction
- **Type**: Chemistry Instruction Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-instruction

Instruction-tuning corpus for chemistry — curated Q&A and dialogue traces drawn from chemical literature and educational sources for training chemistry-specialist LLMs.

### jablonkagroup/chempile-reasoning
- **Type**: Chemistry Reasoning Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-reasoning

Multi-step chemistry reasoning corpus — open-domain QA, NLI, and multiple-choice items with chains of reasoning for training and evaluating chemical reasoning models.

### jablonkagroup/chempile-lift
- **Type**: Chemistry Pretraining
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-lift

ChemPile-LIFT — large-scale language-modelling dataset combining curated chemistry literature and structured chemical knowledge for foundation-model pretraining.

### jablonkagroup/chempile-education
- **Type**: Chemistry Education Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-education

Educational chemistry corpus: multiple-choice and open-ended items spanning introductory through graduate chemistry, for assessing how well models handle educational chemistry material.

### jablonkagroup/chempile-caption
- **Type**: Chemistry Captioning
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-caption

Image-to-text dataset of chemistry figures (molecular structures, reaction schemes, plots) with expert captions for training multimodal chemistry models.

### jablonkagroup/chempile-code
- **Type**: Chemistry Code Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-code

Curated chemistry-relevant code (RDKit, ASE, simulation tooling) drawn from The Stack — supports training models that can read and write computational chemistry workflows.

### jablonkagroup/MaCBench
- **Type**: Materials Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/MaCBench

Materials Chemistry Benchmark — multimodal QA, multiple-choice, and visual-question-answering items for evaluating LLMs on materials and inorganic chemistry tasks.

### maomlab/Molecule3D
- **Type**: Molecular Properties
- **Tags**: Chemistry, Biology
- **HuggingFace**: https://huggingface.co/datasets/maomlab/Molecule3D

Curated 3D molecular structures with computed properties — supports geometric deep learning for property prediction and conformer-aware modelling.

### maomlab/TDC
- **Type**: Therapeutics Benchmark
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/datasets/maomlab/TDC

Therapeutics Data Commons subset — drug-discovery tasks (ADMET, drug-target interaction, generation) curated for benchmarking molecular ML.

### maomlab/B3DB
- **Type**: BBB Permeability
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/maomlab/B3DB

Blood-Brain Barrier Database (B3DB) — curated permeability measurements for compounds, supporting CNS drug-discovery ML benchmarks.

### maomlab/ChAFF
- **Type**: Chemistry Dataset
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/datasets/maomlab/ChAFF

ChAFF — chemistry dataset for ML benchmarking on filtered/curated molecular properties, part of the Maom Lab pharmacology suite.

### microsoft/msr-acc-tae25
- **Type**: Quantum Chemistry Dataset
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/microsoft/msr-acc-tae25

Microsoft Research Accurate Chemistry Collection — large dataset of high-accuracy electronic-structure calculations (TAE25 split) for training and evaluating quantum-chemistry ML models.

### Orbital-Materials/MofasaDB
- **Type**: MOF Dataset
- **Tags**: Materials Science, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/Orbital-Materials/MofasaDB

Metal-organic framework (MOF) dataset from Orbital Materials: large-scale curated MOF structures for materials-discovery ML and synthetic chemistry workflows.

### foundry-ml/foundry_oqmd_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_oqmd_band_gaps_v1-1

Band-gap values from the Open Quantum Materials Database (OQMD), prepared for ML benchmarking on inorganic crystal electronic structure.

### foundry-ml/foundry_aflow_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_aflow_band_gaps_v1-1

Band-gap values from the AFLOW high-throughput materials database, formatted for ML model training and evaluation.

### foundry-ml/foundry_mp_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_mp_band_gaps_v1-1

Band-gap values curated from the Materials Project for ML benchmarking on inorganic electronic structure.

### foundry-ml/double_perovskite_bandgap_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/double_perovskite_bandgap_v1-1

Computed band gaps for double-perovskite compounds — supports ML-based screening for photovoltaic and optoelectronic applications.

### foundry-ml/wolverton_oxides_v1-1
- **Type**: Oxide Properties
- **Tags**: Materials Science, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/wolverton_oxides_v1-1

Wolverton oxide property dataset — DFT-computed properties for binary and ternary oxides, used for ML benchmarking on inorganic chemistry.

### foundry-ml/dataset_perovskite_formatione
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_formatione

Formation energies for perovskite compounds — supports ML screening for stability and synthesisability.

### foundry-ml/dataset_perovskite_stability_updated
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_stability_updated

Curated perovskite stability data (updated release) for benchmarking ML models on photovoltaic-material durability prediction.

### foundry-ml/perovskite_stability_v1-1
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/perovskite_stability_v1-1

Perovskite stability dataset (v1.1 release) — paired structure and stability labels for ML benchmarking.

### foundry-ml/perovskite_opbandcenter_v1-1
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/perovskite_opbandcenter_v1-1

O p-band center values for perovskite oxides — descriptors for catalytic activity prediction in oxygen-evolution reactions.

### foundry-ml/dataset_perovskite_conductivity
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_conductivity

Ionic and electronic conductivity measurements for perovskite materials — supports ML screening for solid-oxide fuel cell electrolytes.

### foundry-ml/dataset_perovskite_habs
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_habs

Hot-air-balance (HABS) data for perovskite materials — thermal-stability characterisation supporting durability ML.

### foundry-ml/dataset_perovskite_tec
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_tec

Thermal expansion coefficients for perovskite materials — curated for ML thermal-property prediction.

### foundry-ml/dataset_perovskite_asr
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_asr

Area-specific resistance (ASR) data for perovskite electrodes — used in solid-oxide fuel cell ML modelling.

### foundry-ml/dataset_exfoliatione
- **Type**: 2D Materials
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_exfoliatione

Exfoliation energy dataset for 2D materials — supports ML-driven discovery of layered compounds suitable for monolayer isolation.

### foundry-ml/dataset_li_conductivity
- **Type**: Battery Materials
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_li_conductivity

Lithium-ion conductivity dataset for solid electrolytes — supports ML discovery of next-generation battery materials.

### foundry-ml/elwood_md_v1-2
- **Type**: Molecular Dynamics
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/elwood_md_v1-2

Elwood molecular-dynamics simulation set — trajectory and energy data for ML molecular-property prediction.

### foundry-ml/foundry_g4mp2_solvation_v1-2
- **Type**: Solvation Energies
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_g4mp2_solvation_v1-2

High-accuracy G4MP2 solvation-energy data — supports ML for quantum-chemical accuracy on aqueous and organic systems.

### foundry-ml/foundry_moses_v1-1
- **Type**: Molecular Generation
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_moses_v1-1

Foundry mirror of MOSES (Molecular Sets), a benchmark for evaluating generative chemistry models on drug-like molecule generation.

### foundry-ml/foundry_osdb_v1-1
- **Type**: Organic Semiconductors
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_osdb_v1-1

Organic Semiconductor Database (OSDB) curated for ML — supports property prediction and screening of organic optoelectronic materials.

### foundry-ml/foundry_qmc_ml_v1-1
- **Type**: Quantum Chemistry
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_qmc_ml_v1-1

Quantum Monte Carlo (QMC) reference data for ML benchmarking — high-accuracy electronic structure calculations on small molecules.

### foundry-ml/diffusion_v1-4
- **Type**: Diffusion Coefficients
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/diffusion_v1-4

Diffusion-coefficient dataset for inorganic systems — supports ML modelling of solid-state ion transport and electrolyte design.

### mist-models/excess-properties
- **Type**: Mixture Properties
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/datasets/mist-models/excess-properties

Excess-property dataset for binary/ternary chemical mixtures — used to fine-tune MIST mixtures models on thermodynamic deviations from ideal mixing.
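
As a rough illustration of "deviation from ideal mixing", the sketch below computes an excess property for quantities such as molar volume, where the ideal value is the mole-fraction-weighted average of the pure-component values. The function name and all numbers are hypothetical and are not taken from the dataset.

```python
def excess_property(mixture_value, mole_fractions, pure_values):
    """Excess property = measured mixture value minus the ideal (linear-mixing) value.

    Assumes a property whose ideal-mixture value is the mole-fraction-weighted
    average of the pure-component values (e.g. molar volume).
    """
    if len(mole_fractions) != len(pure_values):
        raise ValueError("need one pure-component value per mole fraction")
    if abs(sum(mole_fractions) - 1.0) > 1e-9:
        raise ValueError("mole fractions must sum to 1")
    ideal = sum(x * p for x, p in zip(mole_fractions, pure_values))
    return mixture_value - ideal


# Hypothetical binary mixture: illustrative molar volumes in cm^3/mol.
v_excess = excess_property(37.0, [0.5, 0.5], [18.0, 58.0])
print(round(v_excess, 3))
```

A negative result means the mixture occupies less volume than ideal mixing would predict; the same function accepts ternary mixtures by passing three mole fractions.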

### recursionpharma/rxrx3
- **Type**: Phenomics Imaging
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/recursionpharma/rxrx3

Full RxRx3 release — multi-million image high-content microscopy dataset spanning genetic and chemical perturbations across human cell lines, paired with rich text annotations for image-based drug discovery.

### recursionpharma/rxrx3-core
- **Type**: Phenomics Imaging
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/recursionpharma/rxrx3-core

Curated core subset of RxRx3 — high-quality phenomics images for benchmarking and lower-cost training of phenomic foundation models like OpenPhenom.

## Models (87)

### AQAffinity
- **Type**: Drug Discovery
- **Tags**: Chemistry, Medicine, Biology
- **HuggingFace**: https://huggingface.co/SandboxAQ/AQAffinity

Open-source protein-ligand binding affinity prediction model for drug discovery.

### TxGemma 2B
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-2b-predict

Lightweight therapeutic prediction model for drug discovery tasks.

### TxGemma 9B Predict
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-9b-predict

Mid-size therapeutic prediction model for drug property prediction.

### TxGemma 9B Chat
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-9b-chat

Conversational therapeutic model for drug discovery with reasoning explanations.

### TxGemma 27B Predict
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-27b-predict

Large therapeutic prediction model achieving best-in-class performance on 66 tasks.

### TxGemma 27B Chat
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-27b-chat

Large conversational therapeutic model with advanced reasoning capabilities.

### ether0
- **Type**: Chemistry
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/futurehouse/ether0

24B-parameter model for molecular reasoning: SMILES generation, property prediction, and retrosynthesis.

### CYP Inhibition Model
- **Type**: ADMET Prediction
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/openadmet/cyp1a2-cyp2d6-cyp3a4-cyp3c9-chemeleon-baseline

Multi-task model predicting inhibition of four major cytochrome P450 isoforms (CYP1A2, CYP2D6, CYP3A4, CYP2C9) critical for drug metabolism assessment.

### PXR Activation Model
- **Type**: ADMET Prediction
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/openadmet/pxr-chemeleon-baseline

Pregnane X receptor (PXR) activation predictor for early identification of drug-drug interaction liability via nuclear receptor-mediated CYP induction.

### ESM2 650M
- **Type**: Protein Language Model
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/facebook/esm2_t33_650M_UR50D

650M-parameter protein language model trained on UniRef50 — state-of-the-art embeddings for structure prediction, function annotation, and mutation effect scoring.

### OMat24
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/OMAT24

Machine learning models for predicting inorganic material properties using EquiformerV2 and eSEN architectures.

### OMol25
- **Type**: Molecular
- **Tags**: Chemistry, Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/facebook/OMol25

Open Molecules 2025: dataset and models for molecular property prediction, including polymer extensions.

### UMA
- **Type**: Molecular
- **Tags**: Chemistry, Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/UMA

Universal Models for Atoms: mixture-of-experts graph network trained on billions of atoms across five datasets.

### PeptiVerse
- **Type**: Peptide Design
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/ChatterjeeLab/PeptiVerse

Foundation model for peptide design and analysis.

### CoLiPRI
- **Type**: Protein-Ligand Interaction
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/microsoft/colipri

Contrastive learning model for protein-ligand interaction prediction.

### p-IgGen
- **Type**: Antibody Language Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/opig/p-IgGen

GPT-NeoX-based generative language model for antibody sequence design, trained on the Observed Antibody Space to generate diverse immunoglobulin heavy and light chains.

### OpenFold3
- **Type**: Protein Structure Prediction
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/OpenFold/OpenFold3

Open replication of AlphaFold3 — predicts structures of proteins, nucleic acids, ligands, and their complexes for drug discovery and structural biology.

### Equiformer v3
- **Type**: Equivariant GNN
- **Tags**: Chemistry, Physics, Materials Science
- **HuggingFace**: https://huggingface.co/mirror-physics/equiformer_v3

Equivariant graph transformer for molecular and materials modeling — predicts energies, forces, and properties on molecular structures and crystals.

### OpenMed ChemicalDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-ChemicalDetect-ModernMed-149M

Chemical-entity NER over biomedical literature — identifies drug names, compounds, and chemical substances using the ModernMed 149M backbone.

### NexaMass V3 Struct
- **Type**: Mass Spectrometry Model
- **Tags**: Chemistry, Biology, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AethronPhantom/NexaMass-V3-Struct

Self-supervised representation model for MS/MS spectra in metabolomics — learns molecular fingerprints to support compound identification and structure inference.

### MMPT-FM
- **Type**: Pharma Foundation Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/Merck/MMPT-FM

Multi-modal pharma foundation model from Merck — integrates molecular and biological signals for drug discovery and target prediction.

### OC25
- **Type**: Catalysis Model
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/facebook/OC25

Open Catalyst 2025 — successor to OC22, modelling explicit-solvent and catalyst systems for electrochemistry and energy applications.

### OMC25
- **Type**: Materials Model
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/facebook/OMC25

Open Molecular Crystals 2025 — Meta FAIR Chemistry release for predicting properties of organic molecular crystals (pharmaceutical polymorphs, energetic materials, OLEDs).

### Skala 1.1
- **Type**: DFT Functional
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/microsoft/skala-1.1

Deep-learning exchange-correlation functional for density functional theory — covers main-group thermochemistry, reaction kinetics, noncovalent interactions, and molecular geometries.

### BioEmu
- **Type**: Protein Dynamics Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/microsoft/bioemu

Generative model for protein structural ensembles — emulates conformational dynamics for drug discovery and structural biology beyond static AlphaFold-style predictions.

### MatterGen
- **Type**: Generative Materials Model
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/microsoft/mattergen

Generative AI for materials design — proposes novel inorganic crystal structures with specified properties for energy, catalysis, and functional-materials research.

### MatterSim
- **Type**: Materials Simulator
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/microsoft/mattersim

Foundation-model atomistic simulator for materials across a wide range of temperatures and pressures, intended as a faster surrogate for ab-initio MD in property prediction.

### OrbMol
- **Type**: Molecular Foundation Model
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/Orbital-Materials/OrbMol

Foundation-model potential for molecular systems — energies, forces, and properties for organic and metal-organic chemistry, supporting catalyst and pharma workflows.

### Skala 1.0
- **Type**: DFT Functional
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/microsoft/skala-1.0

First release of Skala — deep-learning exchange-correlation functional for density functional theory, predecessor to Skala 1.1.

### AIMNet2-rxn
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-rxn

AIMNet2 trained on reaction data — neural-network interatomic potential supporting reactive molecular simulations.

### AIMNet2 ωB97M-D3
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-wb97m-d3

Neural network interatomic potential for fast and accurate molecular simulations, trained at the ωB97M-D3 level of theory.

### AIMNet2 (B97-3c, 2025)
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-2025

AIMNet2 retrained at the B97-3c level of theory — 2025 release with improved coverage and accuracy.

### AIMNet2-NSE
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-nse

AIMNet2 specialised for open-shell chemistry (radicals, transition states) — neural network interatomic potential for non-singlet electronic states.

### AIMNet2-Pd
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Materials Science, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-pd

AIMNet2 specialised for palladium-containing organometallic systems — supports homogeneous catalysis simulation at near-DFT accuracy.

### MACE-MP-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mp-0

MACE foundation model trained on the Materials Project — equivariant message-passing potential for inorganic crystal simulation across most of the periodic table.

### MACE-MPA-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mpa-0

MACE foundation model trained on the Materials Project + Alexandria datasets — broader coverage variant for inorganic-materials simulation.

### MACE-MH-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mh-0

MACE foundation model targeting molecular and hybrid systems — equivariant potential trained on a unified molecular/materials dataset.

### MACE-MH-1
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mh-1

Updated MACE-MH foundation potential with refined molecular/materials hybrid training — successor to MACE-MH-0.

### MIST 28M base
- **Type**: Molecular Language Model
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-ti624ev1

MIST 28M base — pretrained molecular language model (fill-mask) used as the starting point for downstream property-prediction fine-tunes.

### MIST 1.8B base
- **Type**: Molecular Language Model
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-dh61satt

MIST 1.8B base — large pretrained molecular language model (fill-mask) for downstream chemistry property prediction at scale.

### MIST mixtures
- **Type**: Mixtures Foundation Model
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-mixtures-zffffbex

MIST mixtures variant — pretrained on chemical mixtures rather than individual molecules.

### MIST 28M · QM9
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-kkgx0omx-qm9

MIST 28M fine-tuned on QM9 — quantum-mechanical property prediction over small organic molecules.

### MIST 28M · QM8
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8

MIST 28M fine-tuned on QM8 — electronic-spectra property prediction over small organic molecules.

### MIST 28M · Tox21
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21

MIST 28M fine-tuned on Tox21 — toxicity classification across 12 nuclear-receptor and stress-response assays.

### MIST 28M · ClinTox
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox

MIST 28M fine-tuned on ClinTox — clinical toxicity classification of FDA-approved drugs and failed candidates.

### MIST 28M · SIDER
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider

MIST 28M fine-tuned on SIDER — side-effect prediction across 27 system-organ classes for marketed drugs.

### MIST 28M · BBBP
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp

MIST 28M fine-tuned on BBBP — blood-brain-barrier permeability classification for CNS drug candidates.

### MIST 28M · HIV
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv

MIST 28M fine-tuned on HIV — anti-HIV activity classification from MoleculeNet.

### MIST 28M · Lipo
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo

MIST 28M fine-tuned on Lipophilicity — octanol/water distribution coefficient prediction.

### MIST 28M · ToxCast
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast

MIST 28M fine-tuned on ToxCast — multi-task toxicity prediction across hundreds of in-vitro assays.

### MIST 28M · BACE
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-8loj3bab-bace

MIST 28M fine-tuned on BACE — beta-secretase 1 (Alzheimer target) inhibition classification.

### MIST 28M · MUV
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv

MIST 28M fine-tuned on MUV — maximum-unbiased-validation virtual-screening benchmark.

### MIST 28M · ESOL
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-kcwb9le5-esol

MIST 28M fine-tuned on ESOL — aqueous solubility regression (Delaney dataset).

### MIST 28M · FreeSolv
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-0uiq7o7m-freesolv

MIST 28M fine-tuned on FreeSolv — hydration free-energy regression for small molecules.

### MIST 28M · tmQM
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-ggd8iisr-tmQM

MIST 28M fine-tuned on tmQM — quantum-mechanical property prediction for transition-metal complexes.

### MIST 28M · pKa
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-6zlgl2qn-pKa

MIST 28M fine-tuned for pKa — acid-dissociation-constant prediction.

### MIST 28M · solvent properties
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-solvent-properties

MIST 28M fine-tuned for solvent-property prediction — bulk physical descriptors of organic solvents.

### MIST 26.9M · melting point
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp

MIST 26.9M fine-tuned for melting-point regression.

### MIST 26.9M · boiling point
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp

MIST 26.9M fine-tuned for boiling-point regression.

### MIST 26.9M · flash point
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Engineering
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp

MIST 26.9M fine-tuned for flash-point regression.

### MIST 26.9M · odour
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-48kpooqf-odour

MIST 26.9M fine-tuned for odour-quality prediction.

### MIST 26.9M · dn
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-6hk5coof-dn

MIST 26.9M fine-tuned for dn property regression.

### MIST 27.0M · conductivity
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/mist-models/mist-conductivity-27.0M-2mpg8dcd

MIST 27.0M fine-tuned for ionic-conductivity prediction in chemical mixtures and electrolytes.

### MIST 27.1M · ETN
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-27.1M-1gcxtg8y-ETN

MIST 27.1M fine-tuned on the ETN (empirical thermodynamic network) benchmark.

### MIST 1.8B · G298
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298

MIST 1.8B fine-tuned for G298 — Gibbs free energy at 298 K from QM9.

### MIST 1.8B · H298
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298

MIST 1.8B fine-tuned for H298 — enthalpy at 298 K from QM9.

### MIST 1.8B · U298
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298

MIST 1.8B fine-tuned for U298 — internal energy at 298 K from QM9.

### MIST 1.8B · U0
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0

MIST 1.8B fine-tuned for U0 — internal energy at 0 K from QM9.

### MIST 1.8B · μ (dipole)
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu

MIST 1.8B fine-tuned for dipole moment from QM9.

### MIST 1.8B · α (polarizability)
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-rcwary93-alpha

MIST 1.8B fine-tuned for isotropic polarizability from QM9.

### MIST 1.8B · HOMO
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-jmjosq12-homo

MIST 1.8B fine-tuned for HOMO energy from QM9.

### MIST 1.8B · LUMO
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-n14wshc9-lumo

MIST 1.8B fine-tuned for LUMO energy from QM9.

### MIST 1.8B · HOMO-LUMO gap
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-kayun6v3-gap

MIST 1.8B fine-tuned for HOMO-LUMO gap from QM9.

### MIST 1.8B · ZPVE
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve

MIST 1.8B fine-tuned for zero-point vibrational energy from QM9.

### MIST 1.8B · ⟨R²⟩
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-xxe7t35e-r2

MIST 1.8B fine-tuned for electronic spatial extent from QM9.

### MIST 1.8B · Cv
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv

MIST 1.8B fine-tuned for heat capacity Cv from QM9.

### MIST 1.8B · QM8
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8

MIST 1.8B fine-tuned on QM8 — electronic-spectra prediction at scale.

### MIST 1.8B · Tox21
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21

MIST 1.8B fine-tuned on Tox21 — large-scale toxicity classification across nuclear-receptor and stress assays.

### MIST 1.8B · ClinTox
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox

MIST 1.8B fine-tuned on ClinTox — clinical toxicity classification.

### MIST 1.8B · SIDER
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-l1wfo7oa-sider

MIST 1.8B fine-tuned on SIDER — side-effect prediction.

### MIST 1.8B · BBBP
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp

MIST 1.8B fine-tuned on BBBP — blood-brain-barrier permeability.

### MIST 1.8B · HIV
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv

MIST 1.8B fine-tuned on HIV — anti-HIV activity classification.

### MIST 1.8B · Lipo
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo

MIST 1.8B fine-tuned on Lipophilicity — large-scale logD prediction.

### MIST 1.8B · BACE
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace

MIST 1.8B fine-tuned on BACE — Alzheimer-target inhibition classification.

### MIST 1.8B · ESOL
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-hxiygjsm-esol

MIST 1.8B fine-tuned on ESOL — aqueous solubility regression.

### MIST 1.8B · FreeSolv
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-iwqj2cld-freesolv

MIST 1.8B fine-tuned on FreeSolv — hydration free-energy regression.

### OpenPhenom
- **Type**: Phenomics Foundation Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/recursionpharma/OpenPhenom

Masked-autoencoder foundation model for high-content cell imaging — learns phenomic embeddings from millions of microscopy images for downstream drug-discovery and perturbation analysis.

## Blog Posts (24)

### Eve Bio: Mapping the Pharmone Drug Interaction
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Medicine, Biology, Chemistry
- **Link**: https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction

Understanding drug interactions through AI-powered pharmacogenomics.

### The ExpansionRx OpenADMET Blind Challenge
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Medicine, Chemistry
- **Link**: https://huggingface.co/blog/hugging-science/the-expansionrx-openadmet-blind-challenge

A blind challenge for predicting ADMET properties in drug discovery.

### GDP: Generative Design for Proteins
- **Author**: cgeorgiaw
- **Date**: 2025-01-01
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/cgeorgiaw/gdp

Generative models for protein design and engineering.

### LeMaterial: An Open-Source Initiative to Accelerate Materials Discovery
- **Author**: lvwerra
- **Date**: 2024-12-10
- **Tags**: Materials Science, Chemistry, Engineering
- **Link**: https://huggingface.co/blog/lematerial

Introducing LeMaterial, a community effort to build the largest open database of materials and accelerate AI-driven discovery of new compounds and structures.

### SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence
- **Author**: SandboxAQ
- **Date**: 2025-09-06
- **Tags**: Chemistry, Medicine, Biology
- **Link**: https://huggingface.co/blog/SandboxAQ/sair-data-accelerating-drug-discovery-with-ai

How SandboxAQ's SAIR dataset of 1M+ protein–ligand structures is enabling AI-powered drug discovery with unprecedented structural coverage.

### ThermoGFN-IF for Catalysis
- **Author**: AmelieSchreiber
- **Date**: 2026-03-10
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://huggingface.co/blog/AmelieSchreiber/thermogfn-if

A protein sequence design model fine-tuned with GFlowNets for thermostable and kinetically-aware enzyme engineering.

### A New Era in Multistep Enzyme Design
- **Author**: AmelieSchreiber
- **Date**: 2024-10-16
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/a-new-era-of-enzyme-engineering

Exploring generative AI approaches for designing multistep enzymatic pathways for biosynthesis and biocatalysis.

### A Guide to Designing New Functional Proteins
- **Author**: AmelieSchreiber
- **Date**: 2024-07-02
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/protein-optimization-and-design

A comprehensive guide to improving protein function, stability, and diversity using generative AI and ESM-2.

### RFDiffusion Potentials
- **Author**: AmelieSchreiber
- **Date**: 2024-05-14
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/rfdiffusion-potentials

Using RFDiffusion with custom guiding potentials to steer protein structure generation toward desired functional properties.

### Clustering Protein Complexes using Persistent Homology
- **Author**: AmelieSchreiber
- **Date**: 2023-11-29
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esm-ppi

Combining persistent homology with ESM-2 fine-tuning for protein–protein interaction network prediction and complex clustering.

### In Silico Directed Evolution of Protein Sequences with ESM-2
- **Author**: AmelieSchreiber
- **Date**: 2023-11-13
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/directed-evolution-with-esm2

Using ESM-2 and EvoProtGrad to simulate directed evolution in silico, optimising protein sequences for target properties.
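
The general idea of in silico directed evolution (propose mutations, score them, keep the better variants) can be sketched as a toy loop. This is only an illustration under stated assumptions: `toy_fitness` is a hypothetical stand-in for a model-based score, the sequences are made up, and the greedy accept rule shown here does not reproduce EvoProtGrad's sampler or its API.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq, target):
    """Hypothetical score: fraction of positions matching a target sequence."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def evolve(seq, target, rounds=200, seed=0):
    """Greedy mutate-and-select: accept a point mutant only if it scores no worse."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    best, best_score = seq, toy_fitness(seq, target)
    for _ in range(rounds):
        pos = rng.randrange(len(best))
        mutant = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        score = toy_fitness(mutant, target)
        if score >= best_score:
            best, best_score = mutant, score
    return best, best_score


start, target = "AAAAAAAA", "MKTAYIAK"  # made-up sequences
final, score = evolve(start, target)
print(final, score)
```

In a real workflow the scoring function would be replaced by a protein language model or a property predictor, and the accept rule would typically be stochastic rather than greedy.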

### Predicting Protein–Protein Interactions Using a Protein Language Model
- **Author**: AmelieSchreiber
- **Date**: 2023-10-15
- **Tags**: Biology, Chemistry
- **Link**: https://huggingface.co/blog/AmelieSchreiber/protein-binding-partners-with-esm2

Using ESM-2 embeddings and linear sum assignment to predict protein–protein binding partners at scale.
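
Linear sum assignment picks the one-to-one pairing that maximizes (or minimizes) the total score over a score matrix. The sketch below solves a tiny instance by brute force using only the standard library; the score matrix is hypothetical, and the function name is illustrative. For realistic sizes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` is the usual choice.

```python
from itertools import permutations


def best_assignment(score):
    """Return (pairs, total) for the one-to-one matching with the highest summed score.

    Brute force over all permutations, so only suitable for tiny square matrices.
    """
    n = len(score)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(score[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(enumerate(best_perm)), best_total


# Hypothetical similarity scores: rows are proteins, columns are candidate partners.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.60],
]
pairs, total = best_assignment(scores)
print(pairs)  # [(0, 1), (1, 0), (2, 2)]
```

Picking the best partner row by row would pair protein 0 with partner 0 and leave protein 1 with a poor match; the global assignment gives up a little on row 0 to gain much more on row 1.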

### A Comprehensive Introduction to AI for Proteins (2026)
- **Author**: tamarind.bio
- **Date**: 2026-01-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/a-comprehensive-introduction-to-ai-for-proteins

A thorough primer on the state of AI for protein science — covering structure prediction, protein language models, generative design, and the full open-source model landscape.

### Boltz-2: State of the Art Structure and Binding Affinity Prediction
- **Author**: tamarind.bio
- **Date**: 2025-06-18
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/boltz2-state-of-the-art-structure-and-binding-affinity-prediction

Boltz-2 outperforms AlphaFold3 on antibody-antigen interfaces and sets a new state of the art for protein-ligand binding affinity prediction.

### BoltzDesign1: Designing De Novo Binders to More Than Just Proteins
- **Author**: tamarind.bio
- **Date**: 2025-06-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/boltzdesign1-small-molecule-rna-dna-protein-metal-binder-design

BoltzDesign1 extends de novo binder design beyond protein targets to small molecules, RNA, DNA, and metal ions.

### OpenFold3 and The Future of Protein Folding
- **Author**: tamarind.bio
- **Date**: 2025-04-01
- **Tags**: Biology, Chemistry
- **Link**: https://www.tamarind.bio/blog/openfold3-fully-open-alphafold3-alternative

OpenFold3 is a fully open-source AlphaFold3 alternative, available for commercial use and backed by the OpenFold Consortium, enabling unrestricted biomolecular structure prediction.

### IntFold: A New Best Structure Prediction Protocol
- **Author**: tamarind.bio
- **Date**: 2025-03-01
- **Tags**: Biology, Chemistry
- **Link**: https://www.tamarind.bio/blog/intfold-a-new-state-of-the-art

IntFold establishes a new state-of-the-art protocol for biomolecular complex structure prediction, setting records across standard benchmarks.

### Chai-1r: AlphaFold3 Level Performance, Now Completely Open Source
- **Author**: tamarind.bio
- **Date**: 2025-02-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/chai-1-alphafold3-level-performance-now-completely-open-source

Chai-1r achieves AlphaFold3-level accuracy on protein-protein and antibody-antigen complexes with fully open weights and no usage restrictions.

### Computational De Novo Design of Antibodies and Nanobodies
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/de-novo-antibody-nanobody-vhh-scfv-rfdiffusion

A practical guide to designing antibody VHHs and scFvs de novo using RFdiffusion and ProteinMPNN, from target epitope to validated sequence.

### Predicting Antibody Properties & Developability
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/predicting-antibody-properties-developability

ML approaches for predicting key biophysical properties of therapeutic antibody candidates — stability, solubility, and immunogenicity — before wet-lab validation.

### Are Mini Proteins the Next Antibodies?
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/mini-protein-antibodies

Examining the therapeutic potential of computationally designed miniproteins as a next-generation alternative to traditional antibody drugs.

### Boltz-1: AlphaFold3 Level Performance, Truly Open Source
- **Author**: tamarind.bio
- **Date**: 2024-11-01
- **Tags**: Biology, Chemistry
- **Link**: https://www.tamarind.bio/blog/boltz-1-alphafold3-level-performance-truly-open-source-and-commercially-available

Boltz-1 from MIT achieves AlphaFold3-level accuracy on protein and protein-ligand structure prediction with no restrictions on commercial use or input types.

### Computational De Novo Miniproteins As Therapeutics
- **Author**: tamarind.bio
- **Date**: 2024-12-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/computationaly-de-novo-minibinders-therapeutic-applications

How computationally designed de novo miniproteins and minibinders are being developed as a new class of targeted therapeutics.

### Computational Protein–Protein Interaction Screening
- **Author**: tamarind.bio
- **Date**: 2024-12-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/ppi-screen

A practical guide to screening for protein–protein interactions (PPIs) as drug discovery targets using structure prediction and ML scoring.


================================================================================
## Topic: Climate (/topics/climate.md)
================================================================================

# Climate — Hugging Science

> Climate science and environmental modeling

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
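
The topic paths in this index appear to follow one pattern: the tag name is lowercased and its spaces become hyphens (Earth Science maps to `/topics/earth-science.md`). Below is a minimal Python sketch of that mapping, assuming the pattern holds for every tag. `BASE_URL` and `topic_url` are illustrative names, not part of the site.

```python
# Base URL taken from the topic links in this index.
BASE_URL = "https://huggingscience.co"


def topic_url(tag: str) -> str:
    """Build the topic-file URL for a tag such as 'Earth Science'.

    Assumption: the slug is the lowercased tag with runs of
    whitespace replaced by single hyphens.
    """
    slug = "-".join(tag.lower().split())
    return f"{BASE_URL}/topics/{slug}.md"


if __name__ == "__main__":
    for tag in ("Climate", "Earth Science", "Materials Science"):
        print(topic_url(tag))
```

If a tag ever contains punctuation, this rule may not match the real path, so treat the result as a guess to verify against the Topic Files list.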

## Datasets (5)

### nasa-impact/WxC-Bench
- **Type**: Climate Benchmark
- **Tags**: Earth Science, Climate, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-impact/WxC-Bench

Standardised benchmark for evaluating AI models across six atmospheric and earth science tasks including gravity wave parameterisation, turbulence prediction, and hurricane track forecasting.

### nasa-impact/EO-via-NLP
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/datasets/nasa-impact/EO-via-NLP

Paired earth observation imagery and natural-language descriptions for training and evaluating multimodal models on remote sensing understanding tasks.

### isp-uv-es/WorldFloodsv2
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/WorldFloodsv2

Global flood mapping dataset with Sentinel-1/2 and Landsat imagery paired with flood extent labels across hundreds of flood events worldwide.

### isp-uv-es/CloudSEN12Plus
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/CloudSEN12Plus

Large-scale cloud detection dataset with 49,000+ Sentinel-2 patches and expert-quality cloud/shadow annotations across global biomes and seasons.

### isp-uv-es/rtm_emulation
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/rtm_emulation

Atmospheric radiative transfer model emulation dataset for training fast neural surrogates to replace computationally expensive RTM simulations in satellite data processing.

## Models (10)

### HiRO-ACE
- **Type**: Climate Emulation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/allenai/HiRO-ACE

AI framework for efficient climate and weather simulation with kilometer-scale precipitation downscaling.

### ACE2-ERA5
- **Type**: Climate Emulation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/allenai/ACE2-ERA5

Ai2 Climate Emulator v2 trained on ERA5 reanalysis — fast, stable atmospheric simulation at global scale for multi-year climate projections.

### FourCastNet 3
- **Type**: Weather Prediction
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/nvidia/fourcastnet3

Advanced ML model for global weather forecasting that produces 60-day forecasts in under 4 minutes on a single GPU.

### cBottle
- **Type**: Climate Modeling
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/nvidia/cbottle

Diffusion-based generative model that synthesizes atmospheric states at kilometer-scale resolution.

### StormCast V1
- **Type**: Weather Prediction
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/nvidia/stormcast-v1-era5-hrrr

Mesoscale ML model for convection-allowing weather forecasting at kilometer-scale resolution.

### Indus SDE v0.2
- **Type**: Earth Science NLP
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/nasa-impact/indus-sde-v0.2

Science domain extraction model for identifying and classifying scientific concepts, variables, and entities from geoscience and atmospheric science text.

### SuperIX
- **Type**: Satellite Super-Resolution
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/isp-uv-es/superIX

Explainable AI super-resolution model for Sentinel-2 imagery, enhancing 10 m resolution to finer scales with interpretable uncertainty estimates.

### ML4Floods
- **Type**: Flood Detection
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/isp-uv-es/ml4floods

Image segmentation model for near-real-time flood extent mapping from Sentinel-2 and Landsat imagery, supporting disaster response and humanitarian aid.

### StarCOP
- **Type**: GHG Detection
- **Tags**: Earth Science, Climate, Engineering
- **HuggingFace**: https://huggingface.co/isp-uv-es/starcop

Methane plume detection model for EMIT and AVIRIS hyperspectral imagery, enabling automated identification of point-source greenhouse gas emissions from space.

### Aurora
- **Type**: Weather Foundation Model
- **Tags**: Climate, Earth Science, Physics
- **HuggingFace**: https://huggingface.co/microsoft/aurora

Foundation model for the Earth system — global weather forecasting, atmospheric chemistry, ocean waves, and tropical-cyclone tracking from a single shared backbone.

## Blog Posts (1)

### SARLO-80: SAR Optic Language Dataset
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Earth Science, Climate
- **Link**: https://huggingface.co/blog/hugging-science/sarlo-80-sar-optic-language-dataset

Introducing a large-scale dataset for SAR and optical remote sensing with language descriptions.


================================================================================
## Topic: Conservation (/topics/conservation.md)
================================================================================

# Conservation — Hugging Science

> Wildlife and habitat preservation

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
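
Entries in these topic sections share a regular layout: a `###` heading with the resource name, a few `- **Key**: value` metadata lines, a blank line, and a one-line description. The sketch below shows one way to read that layout into Python dictionaries. The function name `parse_entries` and the output field names are hypothetical choices, and the sample text is an abbreviated entry, not the full catalog.

```python
import re

# Matches metadata lines such as "- **Type**: Climate Benchmark".
FIELD = re.compile(r"^- \*\*(?P<key>[^*]+)\*\*: (?P<value>.*)$")


def parse_entries(text: str) -> list[dict]:
    """Parse '###' catalog entries into dicts (illustrative, not an official parser)."""
    entries = []
    current = None
    for line in text.splitlines():
        if line.startswith("### "):
            # A level-3 heading starts a new entry.
            current = {"name": line[4:].strip(), "description": ""}
            entries.append(current)
        elif line.startswith("#") or line.startswith("="):
            # Any other heading or a separator rule ends the current entry.
            current = None
        elif current is not None:
            match = FIELD.match(line)
            if match:
                key = match.group("key").lower()
                value = match.group("value").strip()
                # Tags are comma-separated; other fields stay as plain strings.
                current[key] = [t.strip() for t in value.split(",")] if key == "tags" else value
            elif line.strip():
                current["description"] = (current["description"] + " " + line.strip()).strip()
    return entries


SAMPLE = """\
### nasa-impact/WxC-Bench
- **Type**: Climate Benchmark
- **Tags**: Earth Science, Climate, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-impact/WxC-Bench

Standardised benchmark for atmospheric and earth science tasks.

## Models (0)
"""

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["name"], entry["tags"])
```

Keying on the line prefixes rather than on fixed positions keeps the sketch tolerant of entries that carry different metadata fields, such as `Author`, `Date`, and `Link` on blog posts.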

## Datasets (2)

### EarthSpeciesProject/BEANS-Zero
- **Type**: Bioacoustics Benchmark
- **Tags**: Biology, Ecology, Conservation, Benchmark, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero

Zero-shot bioacoustics benchmark evaluating audio-language models on species detection, classification, and captioning across diverse animal taxa.

### imageomics/TreeOfLife-200M
- **Type**: Biodiversity Image Corpus
- **Tags**: Biology, Ecology, Conservation
- **HuggingFace**: https://huggingface.co/datasets/imageomics/TreeOfLife-200M

Foundational 200M-image dataset for organismal biology — multilingual species labels (en, la) at biodiversity scale, used to train BioCLIP-2 for zero-shot species classification.

## Models (3)

### NatureLM-audio
- **Type**: Audio-Language Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/NatureLM-audio

First audio-language foundation model for bioacoustics, covering species classification, detection, and captioning of animal vocalizations.

### AVES2-BEATs
- **Type**: Bioacoustics Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/esp-aves2-sl-beats-all

Self-supervised BEATs-based audio encoder trained on broad bioacoustic data for species detection, classification, and embedding across animal taxa.

### BioCLIP 2
- **Type**: Vision-Language Model
- **Tags**: Biology, Ecology, Conservation
- **HuggingFace**: https://huggingface.co/imageomics/bioclip-2

OpenCLIP-based foundation model for organismal biology — zero-shot species classification from photographs across the tree of life, trained on TreeOfLife-200M.


================================================================================
## Topic: Earth Science (/topics/earth-science.md)
================================================================================

# Earth Science — Hugging Science

> Geology, oceanography, and planetary science

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (10)

### polymathic-ai/planetswe
- **Type**: Physics Simulation
- **Tags**: Physics, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/planetswe

Spherical shallow-water equation simulations modelling large-scale planetary atmospheric dynamics for weather and climate surrogate models.

### nasa-impact/WxC-Bench
- **Type**: Climate Benchmark
- **Tags**: Earth Science, Climate, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-impact/WxC-Bench

Standardised benchmark for evaluating AI models across six atmospheric and earth science tasks including gravity wave parameterisation, turbulence prediction, and hurricane track forecasting.

### nasa-impact/EO-via-NLP
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/datasets/nasa-impact/EO-via-NLP

Paired earth observation imagery and natural-language descriptions for training and evaluating multimodal models on remote sensing understanding tasks.

### EarthSpeciesProject/BEANS-Zero
- **Type**: Bioacoustics Benchmark
- **Tags**: Biology, Ecology, Conservation, Benchmark, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero

Zero-shot bioacoustics benchmark evaluating audio-language models on species detection, classification, and captioning across diverse animal taxa.

### ONERA/SARLO-80
- **Type**: Remote Sensing
- **Tags**: Earth Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/ONERA/SARLO-80

119K paired SAR/optical images with text captions at 80 cm resolution for multimodal learning.

### AllTheBacteria/SPIRE
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics, Ecology, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/SPIRE

Searchable Planetary-scale mIcrobiome REsource: a large-scale metagenomics resource aggregating environmental microbiome samples from diverse global habitats.

### isp-uv-es/WorldFloodsv2
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/WorldFloodsv2

Global flood mapping dataset with Sentinel-1/2 and Landsat imagery paired with flood extent labels across hundreds of flood events worldwide.

### isp-uv-es/CloudSEN12Plus
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/CloudSEN12Plus

Large-scale cloud detection dataset with 49,000+ Sentinel-2 patches and expert-quality cloud/shadow annotations across global biomes and seasons.

### isp-uv-es/rtm_emulation
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/rtm_emulation

Atmospheric radiative transfer model emulation dataset for training fast neural surrogates to replace computationally expensive RTM simulations in satellite data processing.

### isp-uv-es/opensr-test
- **Type**: Earth Observation
- **Tags**: Earth Science, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/opensr-test

Benchmark dataset for real-world Sentinel-2 super-resolution, with paired low/high-resolution imagery and perceptual quality metrics.

## Models (13)

### NatureLM-audio
- **Type**: Audio-Language Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/NatureLM-audio

First audio-language foundation model for bioacoustics, covering species classification, detection, and captioning of animal vocalizations.

### AVES2-BEATs
- **Type**: Bioacoustics Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/esp-aves2-sl-beats-all

Self-supervised BEATs-based audio encoder trained on broad bioacoustic data for species detection, classification, and embedding across animal taxa.

### HiRO-ACE
- **Type**: Climate Emulation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/allenai/HiRO-ACE

AI framework for efficient climate and weather simulation with kilometer-scale precipitation downscaling.

### ACE2-ERA5
- **Type**: Climate Emulation
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/allenai/ACE2-ERA5

Ai2 Climate Emulator v2 trained on ERA5 reanalysis — fast, stable atmospheric simulation at global scale for multi-year climate projections.

### FourCastNet 3
- **Type**: Weather Prediction
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/nvidia/fourcastnet3

Advanced ML model for global weather forecasting that produces 60-day forecasts in under 4 minutes on a single GPU.

### cBottle
- **Type**: Climate Modeling
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/nvidia/cbottle

Diffusion-based generative model that synthesizes atmospheric states at kilometer-scale resolution.

### StormCast V1
- **Type**: Weather Prediction
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/nvidia/stormcast-v1-era5-hrrr

Mesoscale ML model for convection-allowing weather forecasting at kilometer-scale resolution.

### NASA-SMD-IBM
- **Type**: Earth Science NLP
- **Tags**: Earth Science, Physics, Astronomy
- **HuggingFace**: https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1

RoBERTa-based language model pre-trained on NASA Science Mission Directorate literature for earth and space science information extraction.

### Indus SDE v0.2
- **Type**: Earth Science NLP
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/nasa-impact/indus-sde-v0.2

Science domain extraction model for identifying and classifying scientific concepts, variables, and entities from geoscience and atmospheric science text.

### SuperIX
- **Type**: Satellite Super-Resolution
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/isp-uv-es/superIX

Explainable AI super-resolution model for Sentinel-2 imagery, enhancing 10 m resolution to finer scales with interpretable uncertainty estimates.

### ML4Floods
- **Type**: Flood Detection
- **Tags**: Earth Science, Climate
- **HuggingFace**: https://huggingface.co/isp-uv-es/ml4floods

Image segmentation model for near-real-time flood extent mapping from Sentinel-2 and Landsat imagery, supporting disaster response and humanitarian aid.

### StarCOP
- **Type**: GHG Detection
- **Tags**: Earth Science, Climate, Engineering
- **HuggingFace**: https://huggingface.co/isp-uv-es/starcop

Methane plume detection model for EMIT and AVIRIS hyperspectral imagery, enabling automated identification of point-source greenhouse gas emissions from space.

### Aurora
- **Type**: Weather Foundation Model
- **Tags**: Climate, Earth Science, Physics
- **HuggingFace**: https://huggingface.co/microsoft/aurora

Foundation model for the Earth system — global weather forecasting, atmospheric chemistry, ocean waves, and tropical-cyclone tracking from a single shared backbone.

## Blog Posts (1)

### SARLO-80: SAR Optic Language Dataset
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Earth Science, Climate
- **Link**: https://huggingface.co/blog/hugging-science/sarlo-80-sar-optic-language-dataset

Introducing a large-scale dataset for SAR and optical remote sensing with language descriptions.


================================================================================
## Topic: Ecology (/topics/ecology.md)
================================================================================

# Ecology — Hugging Science

> Ecosystems and environmental biology

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (3)

### EarthSpeciesProject/BEANS-Zero
- **Type**: Bioacoustics Benchmark
- **Tags**: Biology, Ecology, Conservation, Benchmark, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/EarthSpeciesProject/BEANS-Zero

Zero-shot bioacoustics benchmark evaluating audio-language models on species detection, classification, and captioning across diverse animal taxa.

### AllTheBacteria/SPIRE
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics, Ecology, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/SPIRE

Searchable Planetary-scale mIcrobiome REsource: a large-scale metagenomics resource aggregating environmental microbiome samples from diverse global habitats.

### imageomics/TreeOfLife-200M
- **Type**: Biodiversity Image Corpus
- **Tags**: Biology, Ecology, Conservation
- **HuggingFace**: https://huggingface.co/datasets/imageomics/TreeOfLife-200M

Foundational 200M-image dataset for organismal biology — multilingual species labels (en, la) at biodiversity scale, used to train BioCLIP-2 for zero-shot species classification.

## Models (3)

### NatureLM-audio
- **Type**: Audio-Language Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/NatureLM-audio

First audio-language foundation model for bioacoustics, covering species classification, detection, and captioning of animal vocalizations.

### AVES2-BEATs
- **Type**: Bioacoustics Model
- **Tags**: Biology, Ecology, Conservation, Earth Science
- **HuggingFace**: https://huggingface.co/EarthSpeciesProject/esp-aves2-sl-beats-all

Self-supervised BEATs-based audio encoder trained on broad bioacoustic data for species detection, classification, and embedding across animal taxa.

### BioCLIP 2
- **Type**: Vision-Language Model
- **Tags**: Biology, Ecology, Conservation
- **HuggingFace**: https://huggingface.co/imageomics/bioclip-2

OpenCLIP-based foundation model for organismal biology — zero-shot species classification from photographs across the tree of life, trained on TreeOfLife-200M.


================================================================================
## Topic: Energy (/topics/energy.md)
================================================================================

# Energy — Hugging Science

> Energy systems and sustainability

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (13)

### proxima-fusion/constellaration
- **Type**: Fusion Physics
- **Tags**: Physics, Energy, Engineering
- **HuggingFace**: https://huggingface.co/datasets/proxima-fusion/constellaration

Large-scale dataset of quasi-isodynamic stellarator designs with MHD equilibria for fusion energy research.

### foundry-ml/double_perovskite_bandgap_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/double_perovskite_bandgap_v1-1

Computed band gaps for double-perovskite compounds — supports ML-based screening for photovoltaic and optoelectronic applications.

### foundry-ml/dataset_perovskite_formatione
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_formatione

Formation energies for perovskite compounds — supports ML screening for stability and synthesisability.

### foundry-ml/dataset_perovskite_stability_updated
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_stability_updated

Curated perovskite stability data (updated release) for benchmarking ML models on photovoltaic-material durability prediction.

### foundry-ml/perovskite_stability_v1-1
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/perovskite_stability_v1-1

Perovskite stability dataset (v1.1 release) — paired structure and stability labels for ML benchmarking.

### foundry-ml/perovskite_opbandcenter_v1-1
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/perovskite_opbandcenter_v1-1

O p-band center values for perovskite oxides — descriptors for catalytic activity prediction in oxygen-evolution reactions.

### foundry-ml/dataset_perovskite_conductivity
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_conductivity

Ionic and electronic conductivity measurements for perovskite materials — supports ML screening for solid-oxide fuel cell electrolytes.

### foundry-ml/dataset_perovskite_habs
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_habs

Hot-air-balance (HABS) data for perovskite materials — thermal-stability characterisation supporting durability ML.

### foundry-ml/dataset_perovskite_asr
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_asr

Area-specific resistance (ASR) data for perovskite electrodes — used in solid-oxide fuel cell ML modelling.

### foundry-ml/superconductivity_v1-1
- **Type**: Superconductivity
- **Tags**: Materials Science, Physics, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/superconductivity_v1-1

Curated superconductor dataset — measured Tc values for ML-based discovery of new superconducting materials.

### foundry-ml/dataset_li_conductivity
- **Type**: Battery Materials
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_li_conductivity

Lithium-ion conductivity dataset for solid electrolytes — supports ML discovery of next-generation battery materials.

### foundry-ml/foundry_osdb_v1-1
- **Type**: Organic Semiconductors
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_osdb_v1-1

Organic Semiconductor Database (OSDB) curated for ML — supports property prediction and screening of organic optoelectronic materials.

### foundry-ml/diffusion_v1-4
- **Type**: Diffusion Coefficients
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/diffusion_v1-4

Diffusion-coefficient dataset for inorganic systems — supports ML modelling of solid-state ion transport and electrolyte design.

## Models (3)

### OC25
- **Type**: Catalysis Model
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/facebook/OC25

Open Catalyst 2025 — successor to OC22, modelling explicit-solvent and catalyst systems for electrochemistry and energy applications.

### MatterGen
- **Type**: Generative Materials Model
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/microsoft/mattergen

Generative AI for materials design — proposes novel inorganic crystal structures with specified properties for energy, catalysis, and functional-materials research.

### MIST 27.0M · conductivity
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/mist-models/mist-conductivity-27.0M-2mpg8dcd

MIST 27.0M fine-tuned for ionic-conductivity prediction in chemical mixtures and electrolytes.

## Blog Posts (1)

### ConStellaration Fusion Challenge
- **Author**: cgeorgiaw
- **Date**: 2025-01-01
- **Tags**: Physics, Energy, Engineering
- **Link**: https://huggingface.co/blog/cgeorgiaw/constellaration-fusion-challenge

A challenge for advancing fusion energy through AI.


================================================================================
## Topic: Engineering (/topics/engineering.md)
================================================================================

# Engineering — Hugging Science

> Applied science and technical systems

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (38)

### polymathic-ai/active_matter
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/active_matter

High-fidelity simulations of self-propelled particle systems for benchmarking learned PDE solvers and emergent collective behaviour models.

### polymathic-ai/MHD_64
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/MHD_64

3D magnetohydrodynamics turbulence simulations at 64³ resolution for training and benchmarking physics-informed neural operators.

### polymathic-ai/rayleigh_benard
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/rayleigh_benard

Rayleigh–Bénard thermal convection simulations at varying Rayleigh and Prandtl numbers for benchmarking turbulence and heat transfer models.

### proxima-fusion/constellaration
- **Type**: Fusion Physics
- **Tags**: Physics, Energy, Engineering
- **HuggingFace**: https://huggingface.co/datasets/proxima-fusion/constellaration

Large-scale dataset of quasi-isodynamic stellarator designs with MHD equilibria for fusion energy research.

### SandboxAQ/aqcat25-dataset
- **Type**: Computational Chemistry
- **Tags**: Chemistry, Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset

13.5M DFT calculation trajectories for heterogeneous catalysis and ML potential training.

### jablonkagroup/ChemBench
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/ChemBench

Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs.

### LeMaterial/LeMat-Bulk-MLIP-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-MLIP-Hull

Convex hull data for bulk materials from MLIP calculations.

### LeMaterial/LeMat-Bulk-DFT-Hull-All
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull-All

Complete DFT convex hull dataset for bulk materials discovery.

### LeMaterial/LeMat-Bulk-DFT-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull

DFT convex hull reference data for materials stability analysis.

### LeMaterial/LeMat-Bulk
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk

Primary bulk materials database aggregating 1M+ crystal structures with DFT-computed formation energies, band gaps, and elastic properties for materials discovery.

### LeMaterial/LeMat-Traj
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Traj

Large-scale molecular dynamics trajectory dataset for training machine learning interatomic potentials across diverse bulk material compositions.

### ONERA/SARLO-80
- **Type**: Remote Sensing
- **Tags**: Earth Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/ONERA/SARLO-80

119K paired SAR/optical images with text captions at 80 cm resolution for multimodal learning.

### allenai/peS2o
- **Type**: Pretraining Corpus
- **Tags**: Scientific Reasoning, Biology, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/allenai/peS2o

Approximately 40M cleaned, filtered, and formatted open-access academic papers derived from S2ORC — a large multi-domain pretraining corpus for science-aware language models, spanning biology, chemistry, engineering, computer science, and physics.

### neashton/drivaerml
- **Type**: Automotive CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/neashton/drivaerml

High-fidelity CFD simulation dataset of the DrivAer reference automotive geometry — resolved-flow data for training ML models on aerodynamics prediction (drag, downforce, surface pressure).

### PLAID-datasets/AirfRANS_original
- **Type**: Aerodynamics CFD
- **Tags**: Physics, Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/PLAID-datasets/AirfRANS_original

Original AirfRANS airfoil RANS simulation dataset — graph-structured CFD over NACA airfoils for benchmarking physics-informed and graph neural networks.

### luminary-shift/SUV
- **Type**: Automotive CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/SUV

Large-scale CFD dataset of SUV-class vehicles for training ML models on automotive aerodynamics — surface pressures, wake structures, and aerodynamic performance metrics.

### luminary-shift/Pump
- **Type**: Turbomachinery CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/Pump

CFD simulations of centrifugal pumps spanning operating conditions — for training ML surrogates of turbomachinery flow and performance.

### luminary-shift/SHIFT-Crash
- **Type**: Crash Simulation
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/SHIFT-Crash

Vehicle crash-simulation dataset capturing structural deformation under impact — for ML-based safety and structural-mechanics modelling.

### luminary-shift/WING
- **Type**: Aerospace CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/WING

Wing-flow CFD dataset for ML-driven aerodynamics — covers a range of geometries and flight conditions for surrogate modelling.

### luminary-shift/CCA
- **Type**: Aerospace CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/CCA

Common Compressor Aero (CCA) dataset — compressor and turbomachinery simulations for ML-augmented aerospace design workflows.

### luminary-shift/Submarine
- **Type**: Marine CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/Submarine

Submarine hydrodynamics CFD dataset — submerged-body flow simulations for ML-based marine engineering and naval design.

### foundry-ml/mask_rcnn_defect_detection_v1-1
- **Type**: Defect Detection
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/mask_rcnn_defect_detection_v1-1

Microscopy image dataset annotated for instance-segmentation defect detection — Mask R-CNN training data for materials inspection.

### foundry-ml/foundry_stan_segmentation_v1-1
- **Type**: Microscopy Segmentation
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_stan_segmentation_v1-1

Segmentation dataset (STAN) for materials microscopy images — supports ML feature extraction from electron-microscopy data.

### foundry-ml/elastic_tensor_v1-1
- **Type**: Mechanical Properties
- **Tags**: Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/elastic_tensor_v1-1

Elastic tensor data for inorganic materials — supports ML prediction of bulk and shear moduli.

### foundry-ml/piezoelectric_tensor_v1-1
- **Type**: Electromechanical Properties
- **Tags**: Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/piezoelectric_tensor_v1-1

Piezoelectric tensor data for inorganic materials — supports ML for sensor and actuator material design.

### foundry-ml/electromigration_v1-1
- **Type**: Failure Mechanisms
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/electromigration_v1-1

Electromigration data for interconnect materials — supports ML prediction of failure rates in microelectronic devices.

### foundry-ml/steel_strength_v1-1
- **Type**: Alloy Mechanical Properties
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/steel_strength_v1-1

Steel strength dataset — composition-property pairs for ML-based alloy design and high-strength materials.

### foundry-ml/dataset_mg_alloy
- **Type**: Alloy Properties
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_mg_alloy

Magnesium alloy dataset — composition and property data for ML modelling of lightweight structural alloys.

### foundry-ml/dataset_metallicglass_rc
- **Type**: Metallic Glass
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc

Critical cooling rate (Rc) data for metallic glasses — supports ML prediction of glass-forming ability.

### foundry-ml/dataset_metallicglass_rc_llm
- **Type**: Metallic Glass
- **Tags**: Materials Science, Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc_llm

LLM-extracted critical cooling rate data for metallic glasses — text-mined complement to the structured Rc dataset.

### foundry-ml/dataset_metallicglass_dmax
- **Type**: Metallic Glass
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_dmax

Maximum glass-forming diameter (Dmax) data for bulk metallic glasses — for ML screening of casting feasibility.

### foundry-ml/dataset_concrete_compressive_strength
- **Type**: Construction Materials
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_concrete_compressive_strength

Concrete compressive-strength dataset — mix-design and test data for ML-based civil-engineering material modelling.

### foundry-ml/dataset_rpv_tts
- **Type**: Reactor Materials
- **Tags**: Materials Science, Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_rpv_tts

Reactor pressure-vessel (RPV) transition-temperature shift dataset — supports ML prediction of irradiation embrittlement.

### ADSKAILab/ABC-1M
- **Type**: CAD Geometry Corpus
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/ABC-1M

One million CAD-quality 3D shapes drawn from the ABC dataset — the foundation training corpus for the Make-A-Shape and WaLa generative models.

### ADSKAILab/Zero-To-CAD-1m
- **Type**: CAD Vision-Language Corpus
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-1m

1M paired image-and-CAD-program examples for training vision-language models that synthesise parametric CAD from images.

### ADSKAILab/Zero-To-CAD-100k
- **Type**: CAD Vision-Language Corpus
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-100k

Curated 100K-example subset of Zero-To-CAD — useful for benchmarking and lightweight fine-tuning of CAD-from-image models.

### ADSKAILab/LLM-narrative-planning-taskset
- **Type**: Planning Benchmark
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/LLM-narrative-planning-taskset

Narrative planning task set for evaluating LLM planning and reasoning over multi-step design and engineering scenarios.

### ADSKAILab/codeparrot_megatron
- **Type**: Code Pretraining
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/codeparrot_megatron

Megatron-formatted CodeParrot release used for large-scale code language-model pretraining experiments at Autodesk AI Lab.

## Models (25)

### FNO Active Matter
- **Type**: Physics Foundation Model
- **Tags**: Physics, Engineering
- **HuggingFace**: https://huggingface.co/polymathic-ai/FNO-active_matter

Fourier Neural Operator (FNO) trained on the active_matter dataset from The Well for predicting active-matter dynamics.

### Aion Base
- **Type**: Foundation Model
- **Tags**: Physics, Astronomy, Engineering
- **HuggingFace**: https://huggingface.co/polymathic-ai/aion-base

Base-size AION-1 model: a multimodal foundation model for astronomy trained jointly on survey images, spectra, and catalogue measurements.

### WALRUS
- **Type**: Physics Foundation Model
- **Tags**: Physics, Engineering
- **HuggingFace**: https://huggingface.co/polymathic-ai/walrus

Foundation model for continuum dynamics pre-trained across 15 physics simulation datasets, enabling zero-shot and few-shot PDE generalisation.

### OMat24
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/OMAT24

Machine learning models for predicting inorganic material properties using EquiformerV2 and eSEN architectures.

### OMol25
- **Type**: Molecular
- **Tags**: Chemistry, Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/facebook/OMol25

Open Molecules 2025: dataset and models for molecular property prediction, including polymer extensions.

### UMA
- **Type**: Molecular
- **Tags**: Chemistry, Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/UMA

Universal Models for Atoms: mixture-of-experts graph network trained on billions of atoms across 5 datasets.

### StarCOP
- **Type**: GHG Detection
- **Tags**: Earth Science, Climate, Engineering
- **HuggingFace**: https://huggingface.co/isp-uv-es/starcop

Methane plume detection model for EMIT and AVIRIS hyperspectral imagery, enabling automated identification of point-source greenhouse gas emissions from space.

### MIST 26.9M · flash point
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Engineering
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-cyuo2xb6-fp

MIST 26.9M fine-tuned for flash-point regression.

### Zero-To-CAD Qwen3-VL 2B
- **Type**: CAD Vision-Language Model
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/ADSKAILab/Zero-To-CAD-Qwen3-VL-2B

Qwen3-VL fine-tuned to generate parametric CAD models directly from images — bridges vision-language reasoning and engineering geometry synthesis.

### Make-A-Shape · single-view 20M
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-single-view-20m

Make-A-Shape variant trained to generate 3D geometry from a single 2D image — supports CAD reconstruction and engineering shape synthesis.

### Make-A-Shape · multi-view 20M
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-multi-view-20m

Make-A-Shape multi-view variant — generates 3D geometry from multiple 2D image perspectives for higher-fidelity CAD reconstruction.

### Make-A-Shape · point-cloud 20M
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-point-cloud-20m

Make-A-Shape point-cloud variant — completes and refines 3D geometry from sparse point-cloud input.

### Make-A-Shape · voxel 32³
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-32res-20m

Make-A-Shape voxel variant conditioned on 32³ voxel grids: generates 3D geometry from low-resolution voxel input.

### Make-A-Shape · voxel 16³
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-16res-20m

Coarser 16³ voxel variant of Make-A-Shape for fast prototyping of 3D geometries.

### WaLa SV 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-SV-1B

WaLa (Wavelet-Latent) 1B model conditioned on single-view input — large-scale wavelet-domain 3D shape generation.

### WaLa RGB4 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-RGB4-1B

WaLa 1B variant conditioned on four RGB views — multi-view colour-image-driven 3D shape generation.

### WaLa DM4 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-DM4-1B

WaLa 1B variant conditioned on four depth maps — depth-driven 3D shape generation.

### WaLa DM6 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-DM6-1B

WaLa 1B variant conditioned on six depth maps for high-coverage depth-driven 3D shape generation.

### WaLa PC 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-PC-1B

WaLa 1B variant conditioned on point clouds — wavelet-latent shape completion from sparse point input.

### WaLa VX16 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-VX16-1B

WaLa 1B variant conditioned on 16³ voxel grids — coarse-voxel-driven 3D shape generation.

### WaLa UN 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-UN-1B

WaLa 1B unconditional variant — generates 3D shapes from noise alone for design-space exploration.

### WaLa SK 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-SK-1B

WaLa 1B variant conditioned on sketches — supports designer-driven shape generation from line art.

### WaLa DM1 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-DM1-1B

WaLa 1B variant conditioned on a single depth map — minimal-input depth-to-shape generation.

### WaLa MVDream RGB4
- **Type**: Text-to-3D
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-MVDream-RGB4

WaLa coupled with MVDream for text-conditioned 3D shape generation via four RGB-view diffusion.

### WaLa MVDream DM6
- **Type**: Text-to-3D
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-MVDream-DM6

WaLa coupled with MVDream and six depth views for text-conditioned 3D geometry generation.

## Blog Posts (4)

### AI for PDEs
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Physics, Mathematics, Engineering
- **Link**: https://huggingface.co/blog/hugging-science/pde

Exploring AI approaches to solving partial differential equations.

### ConStellaration Fusion Challenge
- **Author**: cgeorgiaw
- **Date**: 2025-01-01
- **Tags**: Physics, Energy, Engineering
- **Link**: https://huggingface.co/blog/cgeorgiaw/constellaration-fusion-challenge

A machine-learning challenge on stellarator design for advancing fusion energy.

### LeMaterial: An Open-Source Initiative to Accelerate Materials Discovery
- **Author**: lvwerra
- **Date**: 2024-12-10
- **Tags**: Materials Science, Chemistry, Engineering
- **Link**: https://huggingface.co/blog/lematerial

Introducing LeMaterial, a community effort to build the largest open database of materials and accelerate AI-driven discovery of new compounds and structures.

### Physics Informed Neural Networks (PINNs): An Intuitive Guide
- **Author**: towardsdatascience.com
- **Date**: 2025-01-28
- **Tags**: Physics, Mathematics, Engineering
- **Link**: https://towardsdatascience.com/physics-informed-neural-networks-pinns-an-intuitive-guide-fff138069563/

A clear, intuitive walkthrough of how PINNs embed physical laws directly into neural network training — bridging traditional PDE-based modeling with data-driven deep learning.


================================================================================
## Topic: Genomics (/topics/genomics.md)
================================================================================

# Genomics — Hugging Science

> DNA, RNA, and genetic analysis

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
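
Every entry in this file follows the same layout: a `###` heading, a few `- **Key**: value` bullets, and a one-line description. A minimal Python sketch for reading that layout is given here, assuming the layout stays as it appears in this file. The function name `parse_entries` and the output dictionary shape are hypothetical and not part of the site.

```python
import re

# Patterns inferred from the entry layout used in this file (an assumption,
# not a published schema).
ENTRY_HEADING = re.compile(r"^### (?P<name>.+)$")
FIELD = re.compile(r"^- \*\*(?P<key>[^*]+)\*\*: (?P<value>.+)$")


def parse_entries(text):
    """Parse catalog entries into a list of dicts (hypothetical helper)."""
    entries = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        heading = ENTRY_HEADING.match(line)
        if heading:
            # A "###" heading starts a new entry.
            current = {"name": heading.group("name"), "fields": {}, "description": ""}
            entries.append(current)
            continue
        if line.startswith(("# ", "## ", "===")):
            # Section headings and separator rules end the current entry.
            current = None
            continue
        if current is None or not line:
            continue
        field = FIELD.match(line)
        if field:
            key, value = field.group("key"), field.group("value")
            if key == "Tags":
                # Tags are a comma-separated list.
                value = [tag.strip() for tag in value.split(",")]
            current["fields"][key] = value
        else:
            # Any other non-empty line belongs to the description.
            current["description"] = (current["description"] + " " + line).strip()
    return entries


sample = """\
### arcinstitute/opengenome2
- **Type**: Genomics
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/opengenome2

Curated collection of prokaryotic and eukaryotic genomic sequences for training and benchmarking large-scale biological foundation models.
"""

entries = parse_entries(sample)
print(entries[0]["name"], entries[0]["fields"]["Tags"])
```

The same function can be run over a whole topic file; entries can then be filtered by tag, for example `[e for e in entries if "Benchmark" in e["fields"].get("Tags", [])]`.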

## Datasets (24)

### arcinstitute/opengenome2
- **Type**: Genomics
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/opengenome2

Curated collection of prokaryotic and eukaryotic genomic sequences for training and benchmarking large-scale biological foundation models.

### arcinstitute/SE-167M-Human
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/SE-167M-Human

167M human single-cell RNA expression profiles across diverse tissues and cell types, used for training STACK and SE single-cell foundation models.

### arcinstitute/Stack-CellxGene45M
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Stack-CellxGene45M

45M curated single-cell profiles drawn from the CellxGene corpus, standardised for in-context learning and cross-study perturbation analysis.

### InstaDeepAI/NTv3_benchmark_dataset
- **Type**: Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/NTv3_benchmark_dataset

Benchmark dataset with functional tracks and genome annotations across 7 species.

### InstaDeepAI/nucleotide_transformer_downstream_tasks
- **Type**: Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks

18 genomic prediction benchmark tasks covering histone marks, regulatory regions, splice sites, and promoter activity across human and multi-species genomes.

### InstaDeepAI/multi_species_genomes
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes

Whole-genome sequences for 850 species spanning bacteria, fungi, plants, and animals — the pre-training corpus for the Nucleotide Transformer model family.

### InstaDeepAI/plant-genomic-benchmark
- **Type**: Plant Genomics
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/plant-genomic-benchmark

Plant genomics benchmark spanning gene expression, chromatin accessibility, and agronomic trait prediction tasks across multiple crop and model plant species.

### InstaDeepAI/true-cds-protein-tasks
- **Type**: Protein Tasks
- **Tags**: Biology, Genomics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/InstaDeepAI/true-cds-protein-tasks

Coding sequence and protein function prediction benchmark tasks.

### tahoebio/Tahoe-100M
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tahoebio/Tahoe-100M

Giga-scale perturbation atlas with 100M+ single-cell profiles from 50 cancer cell lines and 1,100 drugs.

### tahoebio/Tahoe-x1-embeddings
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tahoebio/Tahoe-x1-embeddings

Pre-computed cell and gene embeddings from the Tahoe-x1 foundation model.

### Xaira-Therapeutics/X-Atlas-Orion
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Orion

Large-scale single-cell transcriptomics atlas with millions of cell profiles from diverse human tissues, designed for training perturbation-aware foundation models.

### Xaira-Therapeutics/X-Atlas-Pisces
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Pisces

CRISPRi perturbation single-cell dataset pairing genetic knockdowns with transcriptomic responses, used for training and evaluating the X-Cell model.

### AllTheBacteria/ATB
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/ATB

AllTheBacteria: a comprehensive collection of ~2 million bacterial genome assemblies from public sequence databases, standardised for large-scale genomic analysis.

### AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-protein-sequences-high-diversity

High-diversity corpus of bacterial protein sequences derived from the ATB collection, filtered for maximum sequence diversity to support protein language model pretraining.

### AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/Bac-Corpus-dna-intergenic-sequences-high-diversity

High-diversity corpus of bacterial intergenic DNA sequences for training DNA language models on non-coding regulatory regions.

### AllTheBacteria/SPIRE
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics, Ecology, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/AllTheBacteria/SPIRE

Searchable Planetary-scale mIcrobiome REsource: a large-scale metagenomics resource aggregating environmental microbiome samples from diverse global habitats.

### wanglab/img_virus_plasmid
- **Type**: Microbial Genomics
- **Tags**: Biology, Genomics, Biotechnology
- **HuggingFace**: https://huggingface.co/datasets/wanglab/img_virus_plasmid

Combined IMG/VR (uncultivated virus genomes) and IMG/PR (plasmids from genomes and metagenomes) catalog with rich functional, taxonomic, and ecological metadata.

### wanglab/kegg
- **Type**: Biological Reasoning
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/wanglab/kegg

KEGG pathway entries paired with variant annotations for training and evaluating multimodal biological reasoning models (used by the BioReason work).

### wanglab/bioreason-pro-sft-reasoning-data
- **Type**: Biological Reasoning Corpus
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/wanglab/bioreason-pro-sft-reasoning-data

Reasoning trace dataset used to supervised-fine-tune BioReason-Pro — multimodal biological problems with rationales over genomic variants and pathway data.

### tattabio/OMG
- **Type**: Genomic Corpus
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tattabio/OMG

Open MetaGenomic (OMG) corpus: large mixed-modality metagenomic corpus underpinning Tatta Bio’s gLM2 genomic foundation models.

### tattabio/OG
- **Type**: Genomic Corpus
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tattabio/OG

Open Genomes (OG) — curated genome-sequence corpus from Tatta Bio for genomic ML pretraining and benchmarking.

### arcinstitute/Perturb-Sapiens
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Perturb-Sapiens

Large-scale human single-cell perturbation dataset used in the STACK foundation-model lineage — paired baseline and perturbed expression profiles for genetic perturbation screens.

### arcinstitute/Replogle-Nadig-Preprint
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Replogle-Nadig-Preprint

Replogle-Nadig single-cell perturbation dataset (preprint release) — Perturb-seq screens used in the STATE single-cell embedding work for perturbation-response modelling.

### arcinstitute/State-Tahoe-Filtered
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/State-Tahoe-Filtered

Filtered Tahoe-100M slice used in the STATE workflow — high-quality single-cell perturbation profiles for training and benchmarking cross-study cell-state models.

## Models (29)

### Evo-2 40B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/evo2_40b

40B-parameter DNA language model trained on 9.3 trillion nucleotides across all domains of life — zero-shot function prediction, variant effect scoring, and sequence generation.

### Evo-2 7B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/evo2_7b

7B-parameter DNA language model for gene function prediction, CRISPR guide design, and cross-species sequence analysis.

### STACK Large
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/Stack-Large

Large-scale single-cell transcriptomics foundation model supporting in-context learning across cell types and perturbation states.

### TEDDY
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/Merck/TEDDY

Transformer for Enabling Drug Discovery: foundation models trained on 116M single cells for genomics and drug discovery.

### AlphaGenome
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/google/alphagenome-all-folds

Google DeepMind model predicting DNA regulatory features — gene expression, chromatin accessibility, and TF binding — at single-nucleotide resolution.

### NTv3 650M
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/NTv3_650M_post

Multi-species genomics foundation model handling 1Mb context for functional track prediction.

### Nucleotide Transformer v2 500M
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species

500M-parameter multi-species DNA language model with improved tokenisation and benchmark performance across 18 genomic prediction tasks.

### Nucleotide Transformer 2.5B
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/nucleotide-transformer-2.5b-multi-species

2.5B-parameter DNA language model trained on genomes from 850 species, with state-of-the-art results on promoter, enhancer, and splice site prediction tasks.

### ChatNT
- **Type**: Genomics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/ChatNT

8B multimodal conversational model for DNA, RNA, and protein tasks — instruction-following for sequence annotation, classification, and generation.

### Isoformer
- **Type**: Multi-Omics
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/InstaDeepAI/isoformer

Transformer model integrating DNA sequence, RNA expression, and protein context for isoform-level gene expression prediction.

### Tahoe-x1
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/tahoebio/Tahoe-x1

Perturbation-trained single-cell foundation models (70M to 3B parameters) for cancer research and drug discovery.

### Tahoe-100M-SCVI
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/tahoebio/Tahoe-100M-SCVI-v1

scVI-based variational autoencoder trained on the full Tahoe-100M atlas of 100M+ single-cell profiles across 50 cancer cell lines and 1,100 drug perturbations.

### X-Cell
- **Type**: Single-Cell Perturbation Model
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/Xaira-Therapeutics/X-Cell

Diffusion-based model for predicting transcriptomic responses to CRISPRi perturbations at single-cell resolution, trained on the X-Atlas-Pisces dataset.

### GO-GPT
- **Type**: Protein Function Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/wanglab/gogpt

Generative model that predicts Gene Ontology functional annotations directly from protein sequences — bringing LLM-style decoding to functional protein characterisation.

### OpenMed DNADetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-DNADetect-SuperMedical-125M

DNA-mention NER for biomedical text — extracts gene-level DNA sequence references and locus identifiers using the SuperMedical 125M backbone.

### OpenMed GenomicDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-GenomicDetect-PubMed-335M

Genomic-entity NER over PubMed-style text — labels genes, transcripts, and other genomic references for downstream knowledge extraction.

### OpenMed GenomeDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-GenomeDetect-ModernMed-149M

Genome-mention NER complementary to GenomicDetect — focuses on whole-genome and assembly-level references in biomedical text.

### BioReason-Pro SFT
- **Type**: Biological Reasoning Model
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/bioreason-pro-sft

Supervised fine-tuned variant of BioReason-Pro — multimodal biological reasoning over genomic variants and pathway data with chain-of-thought rationales.

### BioReason-Pro RL
- **Type**: Biological Reasoning Model
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/bioreason-pro-rl

Reinforcement-learning-tuned variant of BioReason-Pro, trained on top of the BioReason-Pro SFT model for sharper biological reasoning across KEGG pathways and variant data.

### OneGenome-Rice
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/ZhejiangLab/OneGenome-Rice

Mixtral-architecture genomic foundation model specialised for rice (Oryza sativa) — supports variant analysis, expression prediction, and breeding-relevant trait modelling.

### Genos 1.2B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/ZhejiangLab/Genos-1.2B

General-purpose 1.2B-parameter genomic foundation model spanning multiple organisms — base model for downstream gene-level and sequence-level prediction tasks.

### eva-rna
- **Type**: Transcriptomics Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/ScientaLab/eva-rna

Transformer foundation model producing sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, pseudobulked single-cell) in human and mouse.

### GENA-LM BERT large (T2T)
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/gena-lm-bert-large-t2t

BERT-large-style genomic foundation model trained on telomere-to-telomere human assemblies — supports variant interpretation, regulatory prediction, and downstream genomic tasks.

### GENA-LM BERT base (T2T)
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t

BERT-base-style genomic foundation model trained on T2T assemblies — lighter-weight backbone for genomic sequence understanding.

### ModernGENA large
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/moderngena-large

GENA-LM rebuilt on the ModernBERT architecture — larger, longer-context, RoPE-equipped genomic foundation model.

### ModernGENA base
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/moderngena-base

Compact ModernBERT-based GENA-LM variant — efficient genomic foundation model for downstream variant and expression tasks.

### gLM2 650M
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics
- **HuggingFace**: https://huggingface.co/tattabio/gLM2_650M

650M-parameter genomic foundation model from Tatta Bio, trained on the OMG (Open MetaGenomic) corpus for sequence-level biological reasoning.

### Stack-Large Aligned
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/Stack-Large-Aligned

Aligned variant of STACK-Large — single-cell foundation model fine-tuned for cross-batch consistency, supporting multi-study perturbation analysis and downstream alignment tasks.

### SE-600M
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/SE-600M

600M-parameter Single-cell Embeddings model from the STATE collection — generates embeddings for human single-cell RNA expression profiles to support cell-state and perturbation analysis.

## Blog Posts (5)

### PromoterGPT
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/hugging-science/promoter-gpt

AI-powered promoter sequence design and analysis.

### Predicting the Effects of Mutations on Protein Function with ESM-2
- **Author**: AmelieSchreiber
- **Date**: 2023-12-13
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/mutation-scoring

Using ESM-2 protein language model embeddings to score and predict the functional impact of point mutations.

### QLoRA for ESM-2 and Post Translational Modification Site Prediction
- **Author**: AmelieSchreiber
- **Date**: 2023-11-11
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esm2-ptm

Applying QLoRA fine-tuning to ESM-2 for accurate prediction of post-translational modification sites across protein sequences.

### ESMBind Ensemble Models
- **Author**: AmelieSchreiber
- **Date**: 2023-09-22
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esmbind-ensemble

Ensemble methods for ESMBind models to improve binding site prediction accuracy and robustness across protein families.

### ESMBind: Low Rank Adaptation of ESM-2 for Protein Binding Site Prediction
- **Author**: AmelieSchreiber
- **Date**: 2023-09-15
- **Tags**: Biology, Genomics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esmbind

Fine-tuning ESM-2 with LoRA adapters to predict protein binding sites with high accuracy and parameter efficiency.


================================================================================
## Topic: Materials Science (/topics/materials-science.md)
================================================================================

# Materials Science — Hugging Science

> Material properties and discovery

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
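
Entries in this file follow a fixed layout: a `###` heading, a few `- **Key**: value` bullets, and a one-line description. The sketch below shows one way to turn that layout into records, assuming the layout holds throughout; `parse_entries` is a hypothetical helper written for illustration and is not part of any Hugging Science tooling.

```python
import re

# Matches bullets such as "- **Type**: Materials Science".
FIELD_RE = re.compile(r"^- \*\*(?P<key>[^*]+)\*\*:\s*(?P<value>.+)$")


def parse_entries(markdown: str) -> list[dict]:
    """Parse '### name' blocks into dicts (hypothetical helper)."""
    entries = []
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            current = {"name": line[4:].strip(), "description": ""}
            entries.append(current)
        elif line.startswith(("## ", "# ", "===")):
            current = None  # section boundary: stop attaching lines to the last entry
        elif current is None or not line:
            continue
        else:
            match = FIELD_RE.match(line)
            if match:
                key = match.group("key").strip().lower()
                value = match.group("value").strip()
                # Tags are comma-separated; every other field stays a plain string.
                current[key] = [t.strip() for t in value.split(",")] if key == "tags" else value
            else:
                current["description"] = (current["description"] + " " + line).strip()
    return entries


SAMPLE = """\
## Datasets (52)

### LeMaterial/LeMat-Bulk-DFT-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull

DFT convex hull reference data for materials stability analysis.
"""

entries = parse_entries(SAMPLE)
for entry in entries:
    print(entry["name"], entry["tags"])
```

Filtering by tag is then a list comprehension over `entries`, for example `[e for e in entries if "Physics" in e["tags"]]`.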

## Datasets (52)

### SandboxAQ/aqcat25-dataset
- **Type**: Computational Chemistry
- **Tags**: Chemistry, Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset

13.5M DFT calculation trajectories for heterogeneous catalysis and ML potential training.

### jablonkagroup/ChemBench
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/ChemBench

Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs.

### LeMaterial/LeMat-Bulk-MLIP-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-MLIP-Hull

Convex hull data for bulk materials, computed with machine-learning interatomic potentials (MLIPs).

### LeMaterial/LeMat-Bulk-DFT-Hull-All
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull-All

Complete DFT convex hull dataset for bulk materials discovery.

### LeMaterial/LeMat-Bulk-DFT-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull

DFT convex hull reference data for materials stability analysis.
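
Convex hull data of this kind is commonly used to compute an energy above hull, the distance between a phase's formation energy and the lowest-energy combination of competing phases at the same composition. The sketch below illustrates the idea for a binary A-B system. The phase names, energies, and function names are made-up assumptions for illustration; they are not taken from the LeMat-Bulk datasets, and real workflows usually handle multi-component systems with a dedicated library.

```python
def lower_hull(points):
    """Lower convex hull of (fraction_B, formation_energy) points (monotone chain)."""
    # Keep only the lowest-energy phase at each composition.
    best = {}
    for x, energy in points:
        if x not in best or energy < best[x]:
            best[x] = energy
    hull = []
    for p in sorted(best.items()):
        # Pop the last vertex while it lies on or above the line from hull[-2] to p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def energy_above_hull(x, energy, hull):
    """Energy of a phase minus the hull energy interpolated at composition x."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            hull_energy = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
            return energy - hull_energy
    raise ValueError("composition lies outside the hull range")


# Illustrative formation energies in eV/atom (hypothetical values).
PHASES = {
    "A": (0.0, 0.0),
    "A3B": (0.25, -0.1),
    "AB": (0.5, -0.4),
    "AB3": (0.75, -0.3),
    "B": (1.0, 0.0),
}

hull = lower_hull(PHASES.values())
e_hull = {name: energy_above_hull(x, e, hull) for name, (x, e) in PHASES.items()}
for name, value in e_hull.items():
    print(f"{name}: {value:.3f} eV/atom above hull")
```

With these made-up numbers, A3B sits about 0.1 eV/atom above the hull, while AB and AB3 lie on it and would be counted as stable.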

### LeMaterial/LeMat-Bulk
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk

Primary bulk materials database aggregating 1M+ crystal structures with DFT-computed formation energies, band gaps, and elastic properties for materials discovery.

### LeMaterial/LeMat-Traj
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Traj

Large-scale molecular dynamics trajectory dataset for training machine learning interatomic potentials across diverse bulk material compositions.

### jablonkagroup/MaCBench
- **Type**: Materials Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/MaCBench

Materials Chemistry Benchmark — multimodal QA, multiple-choice, and visual-question-answering items for evaluating LLMs on materials and inorganic chemistry tasks.

### Orbital-Materials/MofasaDB
- **Type**: MOF Dataset
- **Tags**: Materials Science, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/Orbital-Materials/MofasaDB

Metal-organic framework (MOF) dataset from Orbital Materials: large-scale curated MOF structures for materials-discovery ML and synthetic chemistry workflows.

### foundry-ml/foundry_oqmd_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_oqmd_band_gaps_v1-1

Band-gap values from the Open Quantum Materials Database (OQMD), prepared for ML benchmarking on inorganic crystal electronic structure.

### foundry-ml/foundry_aflow_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_aflow_band_gaps_v1-1

Band-gap values from the AFLOW high-throughput materials database, formatted for ML model training and evaluation.

### foundry-ml/foundry_mp_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_mp_band_gaps_v1-1

Band-gap values curated from the Materials Project for ML benchmarking on inorganic electronic structure.

### foundry-ml/double_perovskite_bandgap_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/double_perovskite_bandgap_v1-1

Computed band gaps for double-perovskite compounds — supports ML-based screening for photovoltaic and optoelectronic applications.

### foundry-ml/wolverton_oxides_v1-1
- **Type**: Oxide Properties
- **Tags**: Materials Science, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/wolverton_oxides_v1-1

Wolverton oxide property dataset — DFT-computed properties for binary and ternary oxides, used for ML benchmarking on inorganic chemistry.

### foundry-ml/dataset_perovskite_formatione
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_formatione

Formation energies for perovskite compounds — supports ML screening for stability and synthesisability.

### foundry-ml/dataset_perovskite_stability_updated
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_stability_updated

Curated perovskite stability data (updated release) for benchmarking ML models on photovoltaic-material durability prediction.

### foundry-ml/perovskite_stability_v1-1
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/perovskite_stability_v1-1

Perovskite stability dataset (v1.1 release) — paired structure and stability labels for ML benchmarking.

### foundry-ml/perovskite_opbandcenter_v1-1
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/perovskite_opbandcenter_v1-1

O p-band center values for perovskite oxides — descriptors for catalytic activity prediction in oxygen-evolution reactions.

### foundry-ml/dataset_perovskite_conductivity
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_conductivity

Ionic and electronic conductivity measurements for perovskite materials — supports ML screening for solid-oxide fuel cell electrolytes.

### foundry-ml/dataset_perovskite_habs
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_habs

Hot-air-balance (HABS) data for perovskite materials — thermal-stability characterisation supporting durability ML.

### foundry-ml/dataset_perovskite_tec
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_tec

Thermal expansion coefficients for perovskite materials — curated for ML thermal-property prediction.

### foundry-ml/dataset_perovskite_asr
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_asr

Area-specific resistance (ASR) data for perovskite electrodes — used in solid-oxide fuel cell ML modelling.

### foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1
- **Type**: STM Imaging
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1

Simulated STM images for 2D materials with unique chemical compositions — supports ML on atomic-resolution microscopy.

### foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba
- **Type**: STEM Imaging
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba

Simulated STEM images for 2D materials — paired with structure metadata for training ML models on electron microscopy.

### foundry-ml/training_locating_atoms_stem_images_v1-2
- **Type**: STEM Imaging
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/training_locating_atoms_stem_images_v1-2

STEM image training set for atomic-position localisation — supports ML pipelines for automated microscopy analysis.

### foundry-ml/mask_rcnn_defect_detection_v1-1
- **Type**: Defect Detection
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/mask_rcnn_defect_detection_v1-1

Microscopy image dataset annotated for instance-segmentation defect detection — Mask R-CNN training data for materials inspection.

### foundry-ml/foundry_stan_segmentation_v1-1
- **Type**: Microscopy Segmentation
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_stan_segmentation_v1-1

Segmentation dataset (STAN) for materials microscopy images — supports ML feature extraction from electron-microscopy data.

### foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1
- **Type**: Electron Microscopy
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1

Simulated readout images from a Celeritas XS direct-electron detector — training data for electron-counting models in cryo-EM and STEM.

### foundry-ml/elastic_tensor_v1-1
- **Type**: Mechanical Properties
- **Tags**: Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/elastic_tensor_v1-1

Elastic tensor data for inorganic materials — supports ML prediction of bulk and shear moduli.
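
Bulk and shear moduli can be derived from an elastic tensor with the standard Voigt averages. The sketch below assumes the tensor is given as a 6x6 stiffness matrix in Voigt notation; the cubic constants are round illustrative numbers, and the column layout of the actual dataset is not assumed here.

```python
def voigt_moduli(c):
    """Return (bulk, shear) Voigt averages from a 6x6 stiffness matrix."""
    diag = c[0][0] + c[1][1] + c[2][2]          # C11 + C22 + C33
    off = c[0][1] + c[1][2] + c[0][2]           # C12 + C23 + C13
    shear_terms = c[3][3] + c[4][4] + c[5][5]   # C44 + C55 + C66
    bulk = (diag + 2.0 * off) / 9.0
    shear = (diag - off + 3.0 * shear_terms) / 15.0
    return bulk, shear


def cubic_stiffness(c11, c12, c44):
    """Build the 6x6 stiffness matrix of a cubic crystal from its three constants."""
    c = [[0.0] * 6 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            c[i][j] = c11 if i == j else c12
        c[i + 3][i + 3] = c44
    return c


# Illustrative constants in GPa (not taken from the dataset).
bulk, shear = voigt_moduli(cubic_stiffness(200.0, 100.0, 50.0))
print(f"K_V = {bulk:.1f} GPa, G_V = {shear:.1f} GPa")
```

For a cubic crystal the Voigt expressions reduce to K = (C11 + 2 C12) / 3 and G = (C11 - C12 + 3 C44) / 5, which gives a quick sanity check on the result.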

### foundry-ml/piezoelectric_tensor_v1-1
- **Type**: Electromechanical Properties
- **Tags**: Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/piezoelectric_tensor_v1-1

Piezoelectric tensor data for inorganic materials — supports ML for sensor and actuator material design.

### foundry-ml/dielectric_constant_v1-1
- **Type**: Dielectric Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dielectric_constant_v1-1

Dielectric-constant values for inorganic compounds — supports ML screening of high-k materials for capacitors and devices.

### foundry-ml/semiconductor_defectlevels_v1-1
- **Type**: Defect Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/semiconductor_defectlevels_v1-1

Computed defect-energy levels in semiconductors — descriptors for ML doping and trap-state prediction.

### foundry-ml/superconductivity_v1-1
- **Type**: Superconductivity
- **Tags**: Materials Science, Physics, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/superconductivity_v1-1

Curated superconductor dataset — measured Tc values for ML-based discovery of new superconducting materials.

### foundry-ml/electromigration_v1-1
- **Type**: Failure Mechanisms
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/electromigration_v1-1

Electromigration data for interconnect materials — supports ML prediction of failure rates in microelectronic devices.

### foundry-ml/steel_strength_v1-1
- **Type**: Alloy Mechanical Properties
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/steel_strength_v1-1

Steel strength dataset — composition-property pairs for ML-based alloy design and high-strength materials.

### foundry-ml/dataset_mg_alloy
- **Type**: Alloy Properties
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_mg_alloy

Magnesium alloy dataset — composition and property data for ML modelling of lightweight structural alloys.

### foundry-ml/dataset_metallicglass_rc
- **Type**: Metallic Glass
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc

Critical cooling rate (Rc) data for metallic glasses — supports ML prediction of glass-forming ability.

### foundry-ml/dataset_metallicglass_rc_llm
- **Type**: Metallic Glass
- **Tags**: Materials Science, Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc_llm

LLM-extracted critical cooling rate data for metallic glasses — text-mined complement to the structured Rc dataset.

### foundry-ml/dataset_metallicglass_dmax
- **Type**: Metallic Glass
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_dmax

Maximum glass-forming diameter (Dmax) data for bulk metallic glasses — for ML screening of casting feasibility.

### foundry-ml/dataset_concrete_compressive_strength
- **Type**: Construction Materials
- **Tags**: Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_concrete_compressive_strength

Concrete compressive-strength dataset — mix-design and test data for ML-based civil-engineering material modelling.

### foundry-ml/dataset_rpv_tts
- **Type**: Reactor Materials
- **Tags**: Materials Science, Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_rpv_tts

Reactor pressure-vessel (RPV) transition-temperature shift dataset — supports ML prediction of irradiation embrittlement.

### foundry-ml/dataset_exfoliatione
- **Type**: 2D Materials
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_exfoliatione

Exfoliation energy dataset for 2D materials — supports ML-driven discovery of layered compounds suitable for monolayer isolation.

### foundry-ml/dataset_thermalexp_aflow
- **Type**: Thermal Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_thermalexp_aflow

Thermal expansion coefficients from the AFLOW database — for ML thermal-mechanical modelling of inorganic materials.

### foundry-ml/dataset_thermalcond_aflow
- **Type**: Thermal Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_thermalcond_aflow

Thermal conductivity values from the AFLOW database — supports ML-based screening of thermal management materials.

### foundry-ml/dataset_debyet_aflow
- **Type**: Thermal Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_debyet_aflow

Debye temperature data from the AFLOW database — fundamental thermal-vibrational descriptor for ML materials property prediction.

### foundry-ml/heusler_magnetization_v1-1
- **Type**: Magnetic Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/heusler_magnetization_v1-1

Magnetisation data for Heusler-alloy compounds — supports ML discovery of half-metallic and magnetocaloric materials.

### foundry-ml/dataset_li_conductivity
- **Type**: Battery Materials
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_li_conductivity

Lithium-ion conductivity dataset for solid electrolytes — supports ML discovery of next-generation battery materials.

### foundry-ml/elwood_md_v1-2
- **Type**: Molecular Dynamics
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/elwood_md_v1-2

Elwood molecular-dynamics simulation set — trajectory and energy data for ML molecular-property prediction.

### foundry-ml/foundry_osdb_v1-1
- **Type**: Organic Semiconductors
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_osdb_v1-1

Organic Semiconductor Database (OSDB) curated for ML — supports property prediction and screening of organic optoelectronic materials.

### foundry-ml/diffusion_v1-4
- **Type**: Diffusion Coefficients
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/diffusion_v1-4

Diffusion-coefficient dataset for inorganic systems — supports ML modelling of solid-state ion transport and electrolyte design.

### mist-models/excess-properties
- **Type**: Mixture Properties
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/datasets/mist-models/excess-properties

Excess-property dataset for binary/ternary chemical mixtures — used to fine-tune MIST mixtures models on thermodynamic deviations from ideal mixing.

### ADSKAILab/ABC-1M
- **Type**: CAD Geometry Corpus
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/ABC-1M

One million CAD-quality 3D shapes drawn from the ABC dataset — the foundation training corpus for the Make-A-Shape and WaLa generative models.

## Models (35)

### OMat24
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/OMAT24

Machine learning models for predicting inorganic material properties using EquiformerV2 and eSEN architectures.

### OMol25
- **Type**: Molecular
- **Tags**: Chemistry, Materials Science, Engineering
- **HuggingFace**: https://huggingface.co/facebook/OMol25

Open Molecules 2025 - dataset and models for molecular property prediction including polymer extensions.

### UMA
- **Type**: Molecular
- **Tags**: Chemistry, Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/UMA

Universal Models for Atoms - mixture-of-experts graph network trained on billions of atoms across 5 datasets.

### Equiformer v3
- **Type**: Equivariant GNN
- **Tags**: Chemistry, Physics, Materials Science
- **HuggingFace**: https://huggingface.co/mirror-physics/equiformer_v3

Equivariant graph transformer for molecular and materials modeling — predicts energies, forces, and properties on molecular structures and crystals.

### OC25
- **Type**: Catalysis Model
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/facebook/OC25

Open Catalyst 2025 — successor to OC22, modelling explicit-solvent and catalyst systems for electrochemistry and energy applications.

### OMC25
- **Type**: Materials Model
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/facebook/OMC25

Open Molecular Crystals 2025 — Meta FAIR Chemistry release for predicting properties of organic molecular crystals (pharmaceutical polymorphs, energetic materials, OLEDs).

### MatterGen
- **Type**: Generative Materials Model
- **Tags**: Materials Science, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/microsoft/mattergen

Generative AI for materials design — proposes novel inorganic crystal structures with specified properties for energy, catalysis, and functional-materials research.

### MatterSim
- **Type**: Materials Simulator
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/microsoft/mattersim

Foundation-model atomistic simulator for materials over a wide range of temperatures and pressures, intended as a faster surrogate for ab initio MD in property prediction.

### OrbMol
- **Type**: Molecular Foundation Model
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/Orbital-Materials/OrbMol

Foundation-model potential for molecular systems — energies, forces, and properties for organic and metal-organic chemistry, supporting catalyst and pharma workflows.

### AIMNet2-Pd
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Materials Science, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-pd

AIMNet2 specialised for palladium-containing organometallic systems — supports homogeneous catalysis simulation at near-DFT accuracy.

### MACE-MP-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mp-0

MACE foundation model trained on the Materials Project — equivariant message-passing potential for inorganic crystal simulation across most of the periodic table.

### MACE-MPA-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mpa-0

MACE foundation model trained on the Materials Project + Alexandria datasets — broader coverage variant for inorganic-materials simulation.

### MACE-MH-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mh-0

MACE foundation model targeting molecular and hybrid systems — equivariant potential trained on a unified molecular/materials dataset.

### MACE-MH-1
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mh-1

Updated MACE-MH foundation potential with refined molecular/materials hybrid training — successor to MACE-MH-0.

### MIST 28M · tmQM
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-ggd8iisr-tmQM

MIST 28M fine-tuned on tmQM — quantum-mechanical property prediction for transition-metal complexes.

### MIST 26.9M · melting point
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-y3ge5pf9-mp

MIST 26.9M fine-tuned for melting-point regression.

### MIST 26.9M · boiling point
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-26.9M-b302p09x-bp

MIST 26.9M fine-tuned for boiling-point regression.

### MIST 27.0M · conductivity
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science, Energy
- **HuggingFace**: https://huggingface.co/mist-models/mist-conductivity-27.0M-2mpg8dcd

MIST 27.0M fine-tuned for ionic-conductivity prediction in chemical mixtures and electrolytes.

### MIST 27.1M · ETN
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Materials Science
- **HuggingFace**: https://huggingface.co/mist-models/mist-27.1M-1gcxtg8y-ETN

MIST 27.1M fine-tuned on the ETN (empirical thermodynamic network) benchmark.

### Make-A-Shape · single-view 20M
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-single-view-20m

Make-A-Shape variant trained to generate 3D geometry from a single 2D image — supports CAD reconstruction and engineering shape synthesis.

### Make-A-Shape · multi-view 20M
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-multi-view-20m

Make-A-Shape multi-view variant — generates 3D geometry from multiple 2D image perspectives for higher-fidelity CAD reconstruction.

### Make-A-Shape · point-cloud 20M
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-point-cloud-20m

Make-A-Shape point-cloud variant — completes and refines 3D geometry from sparse point-cloud input.

### Make-A-Shape · voxel 32³
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-32res-20m

Make-A-Shape voxel variant at 32³ resolution — generates voxelised 3D geometries for low-resolution shape exploration.

### Make-A-Shape · voxel 16³
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/Make-A-Shape-voxel-16res-20m

Coarser 16³ voxel variant of Make-A-Shape for fast prototyping of 3D geometries.

### WaLa SV 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-SV-1B

WaLa (Wavelet-Latent) 1B model conditioned on single-view input — large-scale wavelet-domain 3D shape generation.

### WaLa RGB4 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-RGB4-1B

WaLa 1B variant conditioned on four RGB views — multi-view colour-image-driven 3D shape generation.

### WaLa DM4 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-DM4-1B

WaLa 1B variant conditioned on four depth maps — depth-driven 3D shape generation.

### WaLa DM6 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-DM6-1B

WaLa 1B variant conditioned on six depth maps for high-coverage depth-driven 3D shape generation.

### WaLa PC 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-PC-1B

WaLa 1B variant conditioned on point clouds — wavelet-latent shape completion from sparse point input.

### WaLa VX16 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-VX16-1B

WaLa 1B variant conditioned on 16³ voxel grids — coarse-voxel-driven 3D shape generation.

### WaLa UN 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-UN-1B

WaLa 1B unconditional variant — generates 3D shapes from noise alone for design-space exploration.

### WaLa SK 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-SK-1B

WaLa 1B variant conditioned on sketches — supports designer-driven shape generation from line art.

### WaLa DM1 1B
- **Type**: 3D Shape Generation
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-DM1-1B

WaLa 1B variant conditioned on a single depth map — minimal-input depth-to-shape generation.

### WaLa MVDream RGB4
- **Type**: Text-to-3D
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-MVDream-RGB4

WaLa coupled with MVDream for text-conditioned 3D shape generation via four RGB-view diffusion.

### WaLa MVDream DM6
- **Type**: Text-to-3D
- **Tags**: Engineering, Materials Science
- **HuggingFace**: https://huggingface.co/ADSKAILab/WaLa-MVDream-DM6

WaLa coupled with MVDream and six depth views for text-conditioned 3D geometry generation.

## Blog Posts (1)

### LeMaterial: An Open-Source Initiative to Accelerate Materials Discovery
- **Author**: lvwerra
- **Date**: 2024-12-10
- **Tags**: Materials Science, Chemistry, Engineering
- **Link**: https://huggingface.co/blog/lematerial

Introducing LeMaterial, a community effort to build the largest open database of materials and accelerate AI-driven discovery of new compounds and structures.


================================================================================
## Topic: Mathematics (/topics/mathematics.md)
================================================================================

# Mathematics — Hugging Science

> Mathematical modeling and computational methods

_Auto-generated from site data. Fetch `/llms.txt` for the full index._

## Datasets (26)

### jablonkagroup/ChemBench
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/ChemBench

Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs.

### AI-MO/aops_raw
- **Type**: Competition Math
- **Tags**: Mathematics
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aops_raw

Raw problem posts and discussion threads from the Art of Problem Solving forums, spanning AMC, AIME, and international olympiad competitions.

### AI-MO/olympiads-ref-base
- **Type**: Competition Math
- **Tags**: Mathematics
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/olympiads-ref-base

Canonical reference set of international and national mathematical olympiad problems, used as the base for downstream NuminaMath training splits.

### AI-MO/olympiads-ref
- **Type**: Competition Math
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/olympiads-ref

Extended reference set of olympiad problems with verified step-by-step solutions, used for Chain-of-Thought and formal reasoning training.

### AI-MO/Kimina-Prover-Promptset
- **Type**: Theorem Proving
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/Kimina-Prover-Promptset

Prompt set for training and evaluating Kimina, a Lean 4 theorem prover that uses reinforcement learning over formal mathematical proofs.

### AI-MO/NuminaMath-LEAN
- **Type**: Theorem Proving
- **Tags**: Mathematics
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/NuminaMath-LEAN

Mathematical problems formalized in the Lean proof assistant.

### AI-MO/GeometryLeanBench
- **Type**: Theorem Proving
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/GeometryLeanBench

Geometry theorem proving problems formalised in Lean 4, covering Euclidean, affine, and metric geometry for automated reasoning evaluation.

### AI-MO/CombiBench
- **Type**: Combinatorics
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/CombiBench

Combinatorics problems drawn from AMC, AIME, and olympiad competitions, formalised for benchmarking discrete-mathematics reasoning in language models.

### AI-MO/minif2f_test
- **Type**: Theorem Proving
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/minif2f_test

Test set for the miniF2F formal mathematics benchmark.

### AI-MO/aimo-validation-amc
- **Type**: Competition Math
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-amc

AMC 10/12 competition problems reformatted for AIMO challenge validation, covering algebra, geometry, and number theory at difficulty levels 1–5.

### AI-MO/aimo-validation-aime
- **Type**: Competition Math
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-aime

AIME I/II problems reformatted for AIMO challenge validation — 15-question integer-answer format, covering competition math at difficulty levels 5–9.

### AI-MO/NuminaMath-1.5
- **Type**: Math Problems
- **Tags**: Mathematics
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/NuminaMath-1.5

860K+ competition math problems from 17 sources with verified solutions — the training backbone of the gold-medal solution at the 2024 AI Mathematical Olympiad.

### AI-MO/NuminaMath-TIR
- **Type**: Math Problems
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

NuminaMath with Tool-Integrated Reasoning annotations.

### AI-MO/NuminaMath-CoT
- **Type**: Math Problems
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT

NuminaMath with Chain-of-Thought reasoning annotations.

### AI-MO/aimo-validation-math-level-4
- **Type**: Math Problems
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-4

Level-4 MATH benchmark problems (pre-calculus difficulty) used for AIMO challenge validation and fine-grained model evaluation.

### AI-MO/aimo-validation-math-level-5
- **Type**: Math Problems
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-5

Level-5 MATH benchmark problems (highest difficulty) used for AIMO challenge validation and measuring the ceiling of model mathematical reasoning.

### meta-math/MetaMathQA
- **Type**: Math Problems
- **Tags**: Mathematics
- **HuggingFace**: https://huggingface.co/datasets/meta-math/MetaMathQA

Mathematical question-answering dataset for training and evaluating math reasoning.

### google/spiqa
- **Type**: Scientific Benchmark
- **Tags**: Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/google/spiqa

Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains.

### facebook/principia-collection
- **Type**: STEM Reasoning
- **Tags**: Mathematics, Physics, Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-collection

Large-scale STEM reasoning dataset from Meta covering mathematics, physics, chemistry, and biology problems for training and evaluating scientific reasoning in language models.

### facebook/principia-bench
- **Type**: STEM Benchmark
- **Tags**: Mathematics, Physics, Chemistry, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-bench

Curated benchmark of challenging STEM problems requiring multi-step reasoning, quantitative analysis, and domain knowledge across natural sciences.

### futurehouse/BixBench
- **Type**: Research Benchmark
- **Tags**: Biology, Chemistry, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/BixBench

Benchmark with 205 reproducible research questions paired with data capsules for AI evaluation.

### futurehouse/lab-bench
- **Type**: Research Benchmark
- **Tags**: Biology, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/lab-bench

Language Agent Biology Benchmark - 8 categories of scientific research tasks including cloning, figures, and protocols.

### futurehouse/ether0-benchmark
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Medicine, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/ether0-benchmark

Chemistry reasoning benchmark covering SMILES-based tasks including reaction prediction, retrosynthesis, and molecular property estimation for evaluating chemistry LLMs.

### SAIRfoundation/equational-theories-selected-problems
- **Type**: Mathematical Reasoning
- **Tags**: Mathematics, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/SAIRfoundation/equational-theories-selected-problems

Curated selection of equational theory problems for benchmarking LLM mathematical reasoning and automated theorem proving.

### SAIRfoundation/equational-theories-benchmark
- **Type**: Mathematical Reasoning
- **Tags**: Mathematics, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/SAIRfoundation/equational-theories-benchmark

Full benchmark suite of equational theory problems spanning algebraic structures, designed to evaluate formal reasoning capabilities of AI models.

### AI-MO/olympiads
- **Type**: Math Reasoning Corpus
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/olympiads

Olympiad-level mathematical problems collected from international and national competitions, formatted for training and evaluating mathematical reasoning models.

## Models (3)

### Kimina-Prover Preview Distill 7B
- **Type**: Lean 4 Theorem Prover
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-7B

Distilled 7B preview of Kimina-Prover — a reinforcement-learning-trained model that generates Lean 4 proofs for olympiad-level mathematics problems.

### Kimina-Prover Distill 1.7B
- **Type**: Lean 4 Theorem Prover
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AI-MO/Kimina-Prover-Distill-1.7B

Compact 1.7B distilled Kimina-Prover variant for Lean 4 proof generation on olympiad-level theorems — runs on a single consumer GPU.

### Kimina-Prover Distill 8B
- **Type**: Lean 4 Theorem Prover
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AI-MO/Kimina-Prover-Distill-8B

8B distilled Kimina-Prover variant — Lean 4 theorem-proving model trained on olympiad-level mathematical problems with reinforcement learning over proof traces.

## Blog Posts (9)

### AI for PDEs
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Physics, Mathematics, Engineering
- **Link**: https://huggingface.co/blog/hugging-science/pde

Exploring AI approaches to solving partial differential equations.

### Open-R1: A Fully Open Reproduction of DeepSeek-R1
- **Author**: lvwerra
- **Date**: 2025-01-28
- **Tags**: Mathematics
- **Link**: https://huggingface.co/blog/open-r1

A fully open reproduction of DeepSeek-R1's math reasoning training pipeline — data, code, and models — bringing transparent reasoning model training to the community.

### Tropical Quivers for Modern AI: A Guided Tour of a Research Program
- **Author**: AmelieSchreiber
- **Date**: 2026-03-22
- **Tags**: Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/tropical-quivers-of-archs

A tour of tropical quiver representations and how their combinatorial structure connects to modern AI architectures.

### Surface Orders, Cyclic Time, and a Concrete Hilbert–Pólya Framework
- **Author**: AmelieSchreiber
- **Date**: 2026-03-17
- **Tags**: Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/hilbert-polya-for-grh

A concrete construction toward the Hilbert–Pólya conjecture using surface orders and cyclic-time symmetry as a route to the Riemann Hypothesis.

### Faster Persistent Homology Alignment and Protein Complex Clustering
- **Author**: AmelieSchreiber
- **Date**: 2023-11-30
- **Tags**: Biology, Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/faster-pha

Accelerating persistent homology alignment with ESM-2 embeddings and persistence landscapes for protein complex clustering.

### Persistent Homology Alignment: Replacing Multiple Sequence Alignments
- **Author**: AmelieSchreiber
- **Date**: 2023-11-15
- **Tags**: Biology, Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/plm-persistent-homology-msa-replacement

Replacing traditional multiple sequence alignments with ESM-2 embeddings and persistent homology for structure-aware protein comparison.

### Estimating the Intrinsic Dimension of Protein Sequence Embeddings
- **Author**: AmelieSchreiber
- **Date**: 2023-10-18
- **Tags**: Biology, Mathematics
- **Link**: https://huggingface.co/blog/AmelieSchreiber/intrinsic-dimension-of-proteins

Measuring the intrinsic dimensionality of ESM-2 protein embeddings to understand the geometric structure of protein sequence space.

### Physics Informed Neural Networks (PINNs): An Intuitive Guide
- **Author**: towardsdatascience.com
- **Date**: 2025-01-28
- **Tags**: Physics, Mathematics, Engineering
- **Link**: https://towardsdatascience.com/physics-informed-neural-networks-pinns-an-intuitive-guide-fff138069563/

A clear, intuitive walkthrough of how PINNs embed physical laws directly into neural network training — bridging traditional PDE-based modeling with data-driven deep learning.

### Did GPT-5.2 Make a Breakthrough Discovery in Theoretical Physics?
- **Author**: dlouapre
- **Date**: 2026-02-01
- **Tags**: Physics, Mathematics
- **Link**: https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons

GPT-5.2 conjectured a compact formula for single-minus gluon tree amplitudes, which had been assumed to be zero for 40 years. The result is a striking example of AI contributing to original theoretical physics.


================================================================================
## Topic: Medicine (/topics/medicine.md)
================================================================================

# Medicine — Hugging Science

> Healthcare, drug discovery, and clinical research

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
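
Every entry in these topic files follows the same layout: a `###` heading, a few `- **Key**: value` bullets, then a short description. The sketch below shows one way to turn that layout into Python dictionaries. It is only an illustration under that assumption; `parse_entries` and `FIELD_RE` are hypothetical names, not part of the site or of any Hugging Face library.

```python
import re

# Matches bullets of the form "- **Key**: value" used throughout the topic files.
FIELD_RE = re.compile(r"^- \*\*(?P<key>[^*]+)\*\*: (?P<value>.+)$")


def parse_entries(text):
    """Parse catalog text into a list of dicts, one per '### ' entry.

    Hypothetical helper: assumes each entry is a '### ' heading followed by
    '- **Key**: value' bullets and free-text description lines.
    """
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            current = {"name": line[4:], "description": ""}
            entries.append(current)
        elif line.startswith("## ") or line.startswith("# ") or line.startswith("==="):
            # A section heading or separator closes the current entry.
            current = None
        elif current is not None and line:
            match = FIELD_RE.match(line)
            if match:
                key = match.group("key").lower()
                value = match.group("value")
                # Tags are comma-separated; every other field stays a string.
                if key == "tags":
                    current[key] = [tag.strip() for tag in value.split(",")]
                else:
                    current[key] = value
            else:
                current["description"] = (current["description"] + " " + line).strip()
    return entries


sample = """\
## Datasets (43)

### eve-bio/drug-target-activity
- **Type**: Drug Discovery
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/eve-bio/drug-target-activity

Drug-target interaction measurements for 1,397 FDA-approved small molecule drugs.
"""

entries = parse_entries(sample)
print(entries[0]["name"], entries[0]["tags"])
```

Field names are lower-cased so that `entry["tags"]` and `entry["huggingface"]` work the same way for datasets and models; blog-post entries would expose `author`, `date`, and `link` instead.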

## Datasets (43)

### arcinstitute/opengenome2
- **Type**: Genomics
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/opengenome2

Curated collection of prokaryotic and eukaryotic genomic sequences for training and benchmarking large-scale biological foundation models.

### arcinstitute/SE-167M-Human
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/SE-167M-Human

167M human single-cell RNA expression profiles across diverse tissues and cell types, used for training STACK and SE single-cell foundation models.

### arcinstitute/Stack-CellxGene45M
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Stack-CellxGene45M

45M curated single-cell profiles drawn from the CellxGene corpus, standardised for in-context learning and cross-study perturbation analysis.

### eve-bio/drug-target-activity
- **Type**: Drug Discovery
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/eve-bio/drug-target-activity

Drug-target interaction measurements for 1,397 FDA-approved small molecule drugs.

### SandboxAQ/SAIR
- **Type**: Drug Discovery
- **Tags**: Chemistry, Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/SandboxAQ/SAIR

Largest public dataset of protein-ligand 3D structures with binding affinity measurements (1M+ pairs).

### openadmet/openadmet-expansionrx-challenge-train-data
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-train-data

Training data for the OpenADMET ExpansionRx ADMET prediction challenge.

### openadmet/openadmet-expansionrx-challenge-data
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/openadmet/openadmet-expansionrx-challenge-data

Full ExpansionRx challenge dataset of RNA-targeted small-molecule compounds with measured ADMET properties for open pharmacokinetics benchmarking.

### openadmet/Octant_CYP_inhibition_reactivity_blog_release
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/openadmet/Octant_CYP_inhibition_reactivity_blog_release

Octant CYP inhibition and chemical reactivity dataset measuring cytochrome P450 activity across a diverse compound library for ADMET modelling.

### futurehouse/ether0-benchmark
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Medicine, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/ether0-benchmark

Chemistry reasoning benchmark covering SMILES-based tasks including reaction prediction, retrosynthesis, and molecular property estimation for evaluating chemistry LLMs.

### tahoebio/Tahoe-100M
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tahoebio/Tahoe-100M

Giga-scale perturbation atlas with 100M+ single-cell profiles from 50 cancer cell lines and 1,100 drugs.

### tahoebio/Tahoe-x1-embeddings
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/tahoebio/Tahoe-x1-embeddings

Pre-computed cell and gene embeddings from the Tahoe-x1 foundation model.

### owkin/plism-dataset-tiles
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/owkin/plism-dataset-tiles

Large-scale histopathology tile dataset for benchmarking robustness of pathology foundation models across staining and scanner variability.

### owkin/nct-crc-he
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/nct-crc-he

Colorectal cancer tissue classification dataset with H&E-stained patches across 9 tissue classes, widely used for benchmarking pathology models.

### owkin/camelyon16-features
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/camelyon16-features

Pre-extracted features from the CAMELYON16 breast cancer lymph node metastasis detection challenge, enabling efficient benchmarking of MIL methods.

### owkin/her2-challenge-2026
- **Type**: Computational Pathology
- **Tags**: Medicine, Biology, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/owkin/her2-challenge-2026

HER2 scoring challenge dataset with H&E-stained whole-slide images for evaluating AI-based HER2 status prediction in breast cancer.

### Xaira-Therapeutics/X-Atlas-Orion
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Orion

Large-scale single-cell transcriptomics atlas with millions of cell profiles from diverse human tissues, designed for training perturbation-aware foundation models.

### Xaira-Therapeutics/X-Atlas-Pisces
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/datasets/Xaira-Therapeutics/X-Atlas-Pisces

CRISPRi perturbation single-cell dataset pairing genetic knockdowns with transcriptomic responses, used for training and evaluating the X-Cell model.

### opig/OAS
- **Type**: Antibody Sequences
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/opig/OAS

Observed Antibody Space: a curated database of over one billion antibody sequences from immune repertoire sequencing studies, the standard resource for antibody ML.

### openai/healthbench
- **Type**: Medical Benchmark
- **Tags**: Medicine, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/openai/healthbench

Realistic multi-turn health conversations graded against physician-written rubrics across multiple axes (accuracy, completeness, communication) — an open evaluation benchmark for AI assistants in medicine.

### openai/healthbench-professional
- **Type**: Medical Benchmark
- **Tags**: Medicine, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/openai/healthbench-professional

Professional-graded subset of HealthBench: physician evaluators score model responses to clinically realistic conversations, targeting expert-level health assessment.

### wanglab/CT_DeepLesion-MedSAM2
- **Type**: Medical Imaging
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/wanglab/CT_DeepLesion-MedSAM2

CT volumes from the DeepLesion benchmark with mask annotations restructured for training and evaluating MedSAM2, the universal medical image segmentation foundation model.

### OpenMed/MedDialog
- **Type**: Medical Dialogue
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/MedDialog

Doctor-patient medical dialogue dataset for training and evaluating clinical conversation models — covers triage, symptom checking, and diagnostic reasoning.

### OpenMed/Medical-Reasoning-SFT-Mega
- **Type**: SFT Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Mega

Large supervised fine-tuning corpus for clinical reasoning — multi-step medical question-answer chains with rationales for training instruction-following medical LLMs.

### OpenMed/synthvision-annotated-qwen
- **Type**: Synthetic Medical Vision
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/synthvision-annotated-qwen

Synthetic medical-imaging dataset annotated by Qwen — used in OpenMed’s SynthVision pipeline for training and validating medical multimodal models.

### OpenMed/synthvision-seeds
- **Type**: Synthetic Medical Vision
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/synthvision-seeds

Seed prompts and source imagery feeding the SynthVision generation pipeline that produces OpenMed’s annotated medical-imaging training corpora.

### OpenMed/synthvision-annotated-kimi
- **Type**: Synthetic Medical Vision
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/synthvision-annotated-kimi

Synthetic medical-imaging dataset annotated by Kimi — sister set to the Qwen-annotated split, supporting cross-annotator validation in the SynthVision pipeline.

### Anthropic/BioMysteryBench-preview
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-preview

Preview slice of BioMysteryBench — challenging, expert-curated biology problems for evaluating AI scientific reasoning capability.

### Anthropic/BioMysteryBench-full
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-full

Full BioMysteryBench evaluation set — challenging biology problems used to probe expert-level scientific reasoning in frontier models.

### miriad/miriad-5.8M
- **Type**: Medical Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/miriad/miriad-5.8M

5.8M-example medical instruction-tuning and reasoning corpus curated from clinical literature for training healthcare LLMs at scale.

### miriad/miriad-4.4M
- **Type**: Medical Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/miriad/miriad-4.4M

4.4M-example medical reasoning subset of MIRIAD — earlier release used for benchmarking medical instruction-tuning workflows.

### maomlab/TDC
- **Type**: Therapeutics Benchmark
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/datasets/maomlab/TDC

Therapeutics Data Commons subset — drug-discovery tasks (ADMET, drug-target interaction, generation) curated for benchmarking molecular ML.

### maomlab/B3DB
- **Type**: BBB Permeability
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/maomlab/B3DB

Blood-Brain Barrier Database (B3DB) — curated permeability measurements for compounds, supporting CNS drug-discovery ML benchmarks.

### maomlab/CryptoCEN
- **Type**: Coexpression Network
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/datasets/maomlab/CryptoCEN

CryptoCEN — Cryptococcus coexpression network dataset for fungal pathogen biology and drug-target prioritisation.

### Aignostics/OpenTME
- **Type**: Digital Pathology
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/datasets/Aignostics/OpenTME

Pre-analyzed H&E whole-slide images from TCGA across breast, bladder, colorectal, liver, and lung cancers — cell-level annotations and tumour-microenvironment spatial features generated by Atlas H&E-TME.

### foundry-ml/foundry_moses_v1-1
- **Type**: Molecular Generation
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_moses_v1-1

Foundry mirror of MOSES — molecular sets benchmark for evaluating generative chemistry models on drug-like molecule generation.

### FreedomIntelligence/medical-o1-reasoning-SFT
- **Type**: Medical Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT

Medical chain-of-thought reasoning dataset (o1-style) for supervised fine-tuning of medical LLMs — one of the most-liked medical training corpora on Hugging Face (1000+ likes).

### FreedomIntelligence/medical-o1-verifiable-problem
- **Type**: Medical RL Reward Data
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem

Verifiable medical reasoning problems with checker functions — supports RL/reward-model training for medical-LLM alignment beyond static SFT.

### recursionpharma/rxrx3
- **Type**: Phenomics Imaging
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/recursionpharma/rxrx3

Full RxRx3 release — multi-million image high-content microscopy dataset spanning genetic and chemical perturbations across human cell lines, paired with rich text annotations for image-based drug discovery.

### recursionpharma/rxrx3-core
- **Type**: Phenomics Imaging
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/recursionpharma/rxrx3-core

Curated core subset of RxRx3 — high-quality phenomics images for benchmarking and lower-cost training of phenomic foundation models like OpenPhenom.

### arcinstitute/Perturb-Sapiens
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Perturb-Sapiens

Large-scale human single-cell perturbation dataset used in the STACK foundation-model lineage — paired baseline and perturbed expression profiles for genetic perturbation screens.

### arcinstitute/Replogle-Nadig-Preprint
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/Replogle-Nadig-Preprint

Replogle-Nadig single-cell perturbation dataset (preprint release) — Perturb-seq screens used in the STATE single-cell embedding work for perturbation-response modelling.

### arcinstitute/State-Tahoe-Filtered
- **Type**: Single-Cell Perturbation
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/datasets/arcinstitute/State-Tahoe-Filtered

Filtered Tahoe-100M slice used in the STATE workflow — high-quality single-cell perturbation profiles for training and benchmarking cross-study cell-state models.

### Ahmad0067/MedSynth
- **Type**: Clinical NLP
- **Tags**: Medicine
- **HuggingFace**: https://huggingface.co/datasets/Ahmad0067/MedSynth

Realistic synthetic medical dialogue–SOAP note pairs generated to support training and evaluation of clinical documentation models without exposing real patient data.

## Models (72)

### Evo-2 40B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/evo2_40b

40B-parameter DNA language model trained on 9.3 trillion nucleotides across all domains of life — zero-shot function prediction, variant effect scoring, and sequence generation.

### Evo-2 7B
- **Type**: Genomic Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/evo2_7b

7B-parameter DNA language model for gene function prediction, CRISPR guide design, and cross-species sequence analysis.

### STACK Large
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/Stack-Large

Large-scale single-cell transcriptomics foundation model supporting in-context learning across cell types and perturbation states.

### TEDDY
- **Type**: Single-Cell Biology
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/Merck/TEDDY

Transformer for Enabling Drug Discovery - foundation models trained on 116M single cells for genomics and drug discovery.

### AQAffinity
- **Type**: Drug Discovery
- **Tags**: Chemistry, Medicine, Biology
- **HuggingFace**: https://huggingface.co/SandboxAQ/AQAffinity

Open-source protein-ligand binding affinity prediction model for drug discovery.

### MedGemma 1.5 4B
- **Type**: Medical AI
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/medgemma-1.5-4b-it

Multimodal medical AI model for medical imaging and clinical text understanding.

### MedGemma 27B
- **Type**: Medical AI
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/medgemma-27b-it

Large-scale instruction-tuned medical AI for radiology report generation, pathology image analysis, dermatology, and clinical question answering.

### MedASR
- **Type**: Medical AI
- **Tags**: Medicine
- **HuggingFace**: https://huggingface.co/google/medasr

Medical automatic speech recognition model for clinical documentation.

### MedSigLIP
- **Type**: Medical Imaging
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/medsiglip-448

Medical image-language model for visual understanding in healthcare.

### TxGemma 2B
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-2b-predict

Lightweight therapeutic prediction model for drug discovery tasks.

### TxGemma 9B Predict
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-9b-predict

Mid-size therapeutic prediction model for drug property prediction.

### TxGemma 9B Chat
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-9b-chat

Conversational therapeutic model for drug discovery with reasoning explanations.

### TxGemma 27B Predict
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-27b-predict

Large therapeutic prediction model achieving best-in-class performance on 66 tasks.

### TxGemma 27B Chat
- **Type**: Drug Discovery
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/google/txgemma-27b-chat

Large conversational therapeutic model with advanced reasoning capabilities.

### Path Foundation
- **Type**: Pathology
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/google/path-foundation

Vision transformer for histopathology image embeddings - trained on 60M patches from TCGA.

### ether0
- **Type**: Chemistry
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/futurehouse/ether0

24B parameter model for molecular reasoning - SMILES generation, property prediction, and retrosynthesis.

### CYP Inhibition Model
- **Type**: ADMET Prediction
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/openadmet/cyp1a2-cyp2d6-cyp3a4-cyp3c9-chemeleon-baseline

Multi-task model predicting inhibition of four major cytochrome P450 isoforms (CYP1A2, CYP2D6, CYP3A4, CYP2C9) critical for drug metabolism assessment.

### PXR Activation Model
- **Type**: ADMET Prediction
- **Tags**: Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/openadmet/pxr-chemeleon-baseline

Pregnane X receptor (PXR) activation predictor for early identification of drug-drug interaction liability via nuclear receptor-mediated CYP induction.

### ESM2 650M
- **Type**: Protein Language Model
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/facebook/esm2_t33_650M_UR50D

650M-parameter protein language model trained on UniRef50 — state-of-the-art embeddings for structure prediction, function annotation, and mutation effect scoring.

### Tahoe-x1
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/tahoebio/Tahoe-x1

Perturbation-trained single-cell foundation models (70M-3B) for cancer research and drug discovery.

### Tahoe-100M-SCVI
- **Type**: Single-Cell Biology
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/tahoebio/Tahoe-100M-SCVI-v1

scVI-based variational autoencoder trained on the full Tahoe-100M atlas of 100M+ single-cell profiles across 50 cancer lines and 1,100 drug perturbations.

### PeptiVerse
- **Type**: Peptide Design
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/ChatterjeeLab/PeptiVerse

Foundation model for peptide design and analysis.

### CoLiPRI
- **Type**: Protein-Ligand Interaction
- **Tags**: Biology, Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/microsoft/colipri

Contrastive learning model for protein-ligand interaction prediction.

### Phikon-v2
- **Type**: Pathology Foundation Model
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/owkin/phikon-v2

State-of-the-art histopathology vision foundation model trained with DINOv2 on 460K whole-slide images, achieving top performance on cancer subtyping and survival prediction.

### Phikon
- **Type**: Pathology Foundation Model
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/owkin/phikon

ViT-based pathology foundation model trained on TCGA and other large histopathology cohorts via self-supervised learning for cancer tissue representation.

### X-Cell
- **Type**: Single-Cell Perturbation Model
- **Tags**: Biology, Medicine, Genomics
- **HuggingFace**: https://huggingface.co/Xaira-Therapeutics/X-Cell

Diffusion-based model for predicting transcriptomic responses to CRISPRi perturbations at single-cell resolution, trained on the X-Atlas-Pisces dataset.

### p-IgGen
- **Type**: Antibody Language Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/opig/p-IgGen

GPT-NeoX-based generative language model for antibody sequence design, trained on the Observed Antibody Space to generate diverse immunoglobulin heavy and light chains.

### OpenFold3
- **Type**: Protein Structure Prediction
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/OpenFold/OpenFold3

Open replication of AlphaFold3 — predicts structures of proteins, nucleic acids, ligands, and their complexes for drug discovery and structural biology.

### MedSAM
- **Type**: Medical Image Segmentation
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/wanglab/medsam-vit-base

SAM ViT-Base finetuned on a large-scale dataset of CT, MRI, X-ray, ultrasound, and histology — a universal promptable foundation model for medical image segmentation.

### Clinical Camel 70B
- **Type**: Clinical LLM
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/ClinicalCamel-70B

Llama-2 70B finetuned with QLoRA on physician-patient dialogues, clinical articles, and MedQA-style reasoning chains for medical conversation and decision support.

### GO-GPT
- **Type**: Protein Function Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/wanglab/gogpt

Generative model that predicts Gene Ontology functional annotations directly from protein sequences — bringing LLM-style decoding to functional protein characterisation.

### OpenMed PharmaDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-PharmaDetect-SuperClinical-434M

Token-classification model for pharmaceutical entity recognition in clinical text — built on the SuperClinical 434M backbone for high-recall drug, dose, and regimen extraction.
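
Token-classification models like this one label each token separately, so entity spans have to be recovered by grouping consecutive labels. The following is a minimal sketch of that grouping step, assuming BIO-style labels. The `B-DRUG`/`I-DRUG`/`B-DOSE` label names and the `bio_to_spans` helper are hypothetical illustrations and are not taken from the model card.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the open span, if any, and reset the accumulator.
        nonlocal current_type, current_tokens
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the span and skip this token.
            flush()
    flush()
    return spans


# Hypothetical example sentence and labels (not model output).
tokens = ["Start", "imatinib", "mesylate", "400", "mg", "daily"]
labels = ["O", "B-DRUG", "I-DRUG", "B-DOSE", "I-DOSE", "O"]
spans = bio_to_spans(tokens, labels)
print(spans)
```

In practice the labels would come from the model's own label set, which may use a different tagging scheme; check the model card before relying on this grouping.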

### OpenMed BloodCancerDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-BloodCancerDetect-TinyMed-65M

Compact 65M-parameter token-classification model that identifies haematologic malignancy mentions (leukaemia, lymphoma, myeloma subtypes) in clinical and biomedical text.

### OpenMed ChemicalDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Chemistry, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-ChemicalDetect-ModernMed-149M

Chemical-entity NER over biomedical literature — identifies drug names, compounds, and chemical substances using the ModernMed 149M backbone.

### OpenMed SpeciesDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-SpeciesDetect-ElectraMed-109M

Species-mention NER over biomedical literature — identifies organisms and taxonomic references using the ElectraMed 109M backbone.

### OpenMed DNADetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-DNADetect-SuperMedical-125M

DNA-mention NER for biomedical text — extracts gene-level DNA sequence references and locus identifiers using the SuperMedical 125M backbone.

### OpenMed PathologyDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-PathologyDetect-TinyMed-135M

Pathology-finding NER over clinical and biomedical text — surfaces histopathological observations, lesion descriptions, and tissue-level abnormalities.

### OpenMed AnatomyDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-AnatomyDetect-ElectraMed-109M

Anatomical-entity NER for biomedical text — labels body parts, organ systems, and tissue references using the ElectraMed 109M backbone.

### OpenMed OncologyDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-OncologyDetect-MultiMed-568M

Oncology-focused NER that identifies cancer-type mentions, tumour grading, and staging language across clinical and biomedical literature.

### OpenMed OrganismDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-OrganismDetect-TinyMed-82M

Organism-mention NER for biomedical text — broader than SpeciesDetect, also picking up genera, strains, and informal organism references.

### OpenMed DiseaseDetect
- **Type**: Biomedical NER
- **Tags**: Medicine, Biology
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-DiseaseDetect-BioMed-335M

Disease-mention NER built on the BioMed 335M backbone; recognises disease names, syndromes, and condition references in clinical and biomedical literature.

### OpenMed GenomicDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-GenomicDetect-PubMed-335M

Genomic-entity NER over PubMed-style text — labels genes, transcripts, and other genomic references for downstream knowledge extraction.

### OpenMed ProteinDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-ProteinDetect-SuperClinical-141M

Protein-mention NER for biomedical and clinical text — extracts protein names, family references, and post-translational modification descriptors.

### OpenMed GenomeDetect
- **Type**: Biomedical NER
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/OpenMed/OpenMed-NER-GenomeDetect-ModernMed-149M

Genome-mention NER complementary to GenomicDetect — focuses on whole-genome and assembly-level references in biomedical text.

### MMPT-FM
- **Type**: Pharma Foundation Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/Merck/MMPT-FM

Multi-modal pharma foundation model from Merck — integrates molecular and biological signals for drug discovery and target prediction.

### BioEmu
- **Type**: Protein Dynamics Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/microsoft/bioemu

Generative model for protein structural ensembles — emulates conformational dynamics for drug discovery and structural biology beyond static AlphaFold-style predictions.

### eva-rna
- **Type**: Transcriptomics Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/ScientaLab/eva-rna

Transformer foundation model producing sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, pseudobulked single-cell) in human and mouse.

### GENA-LM BERT large (T2T)
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/gena-lm-bert-large-t2t

BERT-large-style genomic foundation model trained on telomere-to-telomere human assemblies — supports variant interpretation, regulatory prediction, and downstream genomic tasks.

### GENA-LM BERT base (T2T)
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t

BERT-base-style genomic foundation model trained on T2T assemblies — lighter-weight backbone for genomic sequence understanding.

### ModernGENA large
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/moderngena-large

GENA-LM rebuilt on the ModernBERT architecture — larger, longer-context, RoPE-equipped genomic foundation model.

### ModernGENA base
- **Type**: Genomic Language Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/AIRI-Institute/moderngena-base

Compact ModernBERT-based GENA-LM variant — efficient genomic foundation model for downstream variant and expression tasks.

### HuatuoGPT-Vision 7B
- **Type**: Medical Vision-Language Model
- **Tags**: Medicine, Biology, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/FreedomIntelligence/HuatuoGPT-Vision-7B

Medical multimodal LLM from the HuatuoGPT family — answers clinical questions over medical imagery (radiology, pathology, dermatology) using a 7B vision-language backbone.

### FlashPPI
- **Type**: Protein-Protein Interaction Model
- **Tags**: Biology, Medicine
- **HuggingFace**: https://huggingface.co/tattabio/flashppi

Fast protein-protein interaction prediction model — trained for high-throughput screening of interaction networks.

### MIST 28M · Tox21
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-kw4ks27p-tox21

MIST 28M fine-tuned on Tox21 — toxicity classification across 12 nuclear-receptor and stress-response assays.

### MIST 28M · ClinTox
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-97vfcykk-clintox

MIST 28M fine-tuned on ClinTox — clinical toxicity classification of FDA-approved drugs and failed candidates.

### MIST 28M · SIDER
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-z8qo16uy-sider

MIST 28M fine-tuned on SIDER — side-effect prediction across 27 system-organ classes for marketed drugs.

### MIST 28M · BBBP
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-3xpfhv48-bbbp

MIST 28M fine-tuned on BBBP — blood-brain-barrier permeability classification for CNS drug candidates.

### MIST 28M · HIV
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-8fh43gke-hiv

MIST 28M fine-tuned on HIV — anti-HIV activity classification from MoleculeNet.

### MIST 28M · Lipo
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-xzr5ulva-lipo

MIST 28M fine-tuned on Lipophilicity — octanol/water distribution coefficient prediction.

### MIST 28M · ToxCast
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-ttqcvt6fs-toxcast

MIST 28M fine-tuned on ToxCast — multi-task toxicity prediction across hundreds of in-vitro assays.

### MIST 28M · BACE
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-8loj3bab-bace

MIST 28M fine-tuned on BACE: inhibition classification for beta-secretase 1 (BACE-1), an Alzheimer's disease drug target.

### MIST 28M · MUV
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-yr1urd2c-muv

MIST 28M fine-tuned on MUV — maximum-unbiased-validation virtual-screening benchmark.

### MIST 1.8B · Tox21
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-uop1z0dc-tox21

MIST 1.8B fine-tuned on Tox21 — large-scale toxicity classification across nuclear-receptor and stress assays.

### MIST 1.8B · ClinTox
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-lu1l5ieh-clintox

MIST 1.8B fine-tuned on ClinTox — clinical toxicity classification.

### MIST 1.8B · SIDER
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-l1wfo7oa-sider

MIST 1.8B fine-tuned on SIDER — side-effect prediction.

### MIST 1.8B · BBBP
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-fbdn8e35-bbbp

MIST 1.8B fine-tuned on BBBP — blood-brain-barrier permeability.

### MIST 1.8B · HIV
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-1a4puhg2-hiv

MIST 1.8B fine-tuned on HIV — anti-HIV activity classification.

### MIST 1.8B · Lipo
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-jvt4azpz-lipo

MIST 1.8B fine-tuned on Lipophilicity — large-scale logD prediction.

### MIST 1.8B · BACE
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Medicine
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-m50jgolp-bace

MIST 1.8B fine-tuned on BACE — Alzheimer-target inhibition classification.

### OpenPhenom
- **Type**: Phenomics Foundation Model
- **Tags**: Biology, Medicine, Chemistry
- **HuggingFace**: https://huggingface.co/recursionpharma/OpenPhenom

Masked-autoencoder foundation model for high-content cell imaging — learns phenomic embeddings from millions of microscopy images for downstream drug-discovery and perturbation analysis.

### Stack-Large Aligned
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/Stack-Large-Aligned

Aligned variant of STACK-Large — single-cell foundation model fine-tuned for cross-batch consistency, supporting multi-study perturbation analysis and downstream alignment tasks.

### SE-600M
- **Type**: Single-Cell Foundation Model
- **Tags**: Biology, Genomics, Medicine
- **HuggingFace**: https://huggingface.co/arcinstitute/SE-600M

600M-parameter Single-cell Embeddings model from the STATE collection — generates embeddings for human single-cell RNA expression profiles to support cell-state and perturbation analysis.

## Blog Posts (16)

### Eve Bio: Mapping the Pharmone Drug Interaction
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Medicine, Biology, Chemistry
- **Link**: https://huggingface.co/blog/hugging-science/eve-bio-mapping-the-pharmone-drug-interaction

Understanding drug interactions through AI-powered pharmacogenomics.

### The ExpansionRx OpenADMET Blind Challenge
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Medicine, Chemistry
- **Link**: https://huggingface.co/blog/hugging-science/the-expansionrx-openadmet-blind-challenge

A blind challenge for predicting ADMET properties in drug discovery.

### AI for Food Allergies
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Medicine, Biology
- **Link**: https://huggingface.co/blog/hugging-science/ai-for-food-allergies

Applying AI to understand and predict food allergies.

### Making Antibody Embeddings and Predictions
- **Author**: ginkgo-datapoints
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Biotechnology
- **Link**: https://huggingface.co/blog/ginkgo-datapoints/making-antibody-embeddings-and-predictions

How to create and use antibody embeddings for therapeutic applications.

### SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence
- **Author**: SandboxAQ
- **Date**: 2025-09-06
- **Tags**: Chemistry, Medicine, Biology
- **Link**: https://huggingface.co/blog/SandboxAQ/sair-data-accelerating-drug-discovery-with-ai

How SandboxAQ's SAIR dataset of 1M+ protein–ligand structures is enabling AI-powered drug discovery with unprecedented structural coverage.

### ThermoGFN-IF for Catalysis
- **Author**: AmelieSchreiber
- **Date**: 2026-03-10
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://huggingface.co/blog/AmelieSchreiber/thermogfn-if

A protein sequence design model fine-tuned with GFlowNets for thermostable and kinetically-aware enzyme engineering.

### ESM-2 for Generating and Optimizing Peptide Binders
- **Author**: AmelieSchreiber
- **Date**: 2023-11-23
- **Tags**: Biology, Medicine
- **Link**: https://huggingface.co/blog/AmelieSchreiber/esm-interact

Generating and optimising peptide binders for target proteins using ESM-2 embeddings and directed evolution.

### A Comprehensive Introduction to AI for Proteins (2026)
- **Author**: tamarind.bio
- **Date**: 2026-01-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/a-comprehensive-introduction-to-ai-for-proteins

A thorough primer on the state of AI for protein science — covering structure prediction, protein language models, generative design, and the full open-source model landscape.

### Boltz-2: State of the Art Structure and Binding Affinity Prediction
- **Author**: tamarind.bio
- **Date**: 2025-06-18
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/boltz2-state-of-the-art-structure-and-binding-affinity-prediction

Boltz-2 outperforms AlphaFold3 on antibody-antigen interfaces and sets a new state of the art for protein-ligand binding affinity prediction.

### BoltzDesign1: Designing De Novo Binders to More Than Just Proteins
- **Author**: tamarind.bio
- **Date**: 2025-06-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/boltzdesign1-small-molecule-rna-dna-protein-metal-binder-design

BoltzDesign1 extends de novo binder design beyond protein targets to small molecules, RNA, DNA, and metal ions.

### Chai-1r: AlphaFold3 Level Performance, Now Completely Open Source
- **Author**: tamarind.bio
- **Date**: 2025-02-01
- **Tags**: Biology, Chemistry, Medicine
- **Link**: https://www.tamarind.bio/blog/chai-1-alphafold3-level-performance-now-completely-open-source

Chai-1r achieves AlphaFold3-level accuracy on protein-protein and antibody-antigen complexes with fully open weights and no usage restrictions.

### Computational De Novo Design of Antibodies and Nanobodies
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/de-novo-antibody-nanobody-vhh-scfv-rfdiffusion

A practical guide to designing antibody VHHs and scFvs de novo using RFdiffusion and ProteinMPNN, from target epitope to validated sequence.

### Predicting Antibody Properties & Developability
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry, Biotechnology
- **Link**: https://www.tamarind.bio/blog/predicting-antibody-properties-developability

ML approaches for predicting key biophysical properties of therapeutic antibody candidates — stability, solubility, and immunogenicity — before wet-lab validation.

### Are Mini Proteins the Next Antibodies?
- **Author**: tamarind.bio
- **Date**: 2025-01-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/mini-protein-antibodies

Examining the therapeutic potential of computationally designed miniproteins as a next-generation alternative to traditional antibody drugs.

### Computational De Novo Miniproteins As Therapeutics
- **Author**: tamarind.bio
- **Date**: 2024-12-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/computationaly-de-novo-minibinders-therapeutic-applications

How computationally designed de novo miniproteins and minibinders are being developed as a new class of targeted therapeutics.

### Computational Protein–Protein Interaction Screening
- **Author**: tamarind.bio
- **Date**: 2024-12-01
- **Tags**: Biology, Medicine, Chemistry
- **Link**: https://www.tamarind.bio/blog/ppi-screen

A practical guide to screening for protein–protein interactions (PPIs) as drug discovery targets using structure prediction and ML scoring.


================================================================================
## Topic: Physics (/topics/physics.md)
================================================================================

# Physics — Hugging Science

> Fundamental forces, particles, and physical systems

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
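
Entries in this file follow a fixed layout: a `###` heading, a few `- **Key**: value` bullets, and a one-paragraph description. The sketch below shows one way such a topic file might be parsed into records, assuming the layout stays as it appears here. `parse_entries` is a hypothetical helper and not part of the site.

```python
import re

# Matches bullets such as "- **Type**: Physics Simulation".
FIELD = re.compile(r"^- \*\*(?P<key>[^*]+)\*\*: (?P<value>.*)$")


def parse_entries(text):
    """Parse '### name' blocks into dicts (hypothetical helper)."""
    entries = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if line.startswith("### "):
            current = {"name": line[4:].strip(), "description": ""}
            entries.append(current)
        elif line.startswith("#") or (stripped and set(stripped) == {"="}):
            # Any other heading or a "=====" separator ends the current entry.
            current = None
        elif current is not None:
            match = FIELD.match(line)
            if match:
                key = match.group("key").lower()
                value = match.group("value").strip()
                if key == "tags":
                    value = [tag.strip() for tag in value.split(",")]
                current[key] = value
            elif stripped:
                # Free text after the bullets is the entry description.
                current["description"] = (current["description"] + " " + stripped).strip()
    return entries


SAMPLE = """## Datasets (52)

### polymathic-ai/MHD_64
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/MHD_64

3D magnetohydrodynamics turbulence simulations at 64³ resolution.
"""

entries = parse_entries(SAMPLE)
print(entries[0]["name"], entries[0]["tags"])
```

Keys are lower-cased, so the link of a dataset entry ends up under `"huggingface"` and that of a blog post under `"link"`.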

## Datasets (52)

### polymathic-ai/active_matter
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/active_matter

High-fidelity simulations of self-propelled particle systems for benchmarking learned PDE solvers and emergent collective behaviour models.

### polymathic-ai/MHD_64
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/MHD_64

3D magnetohydrodynamics turbulence simulations at 64³ resolution for training and benchmarking physics-informed neural operators.

### polymathic-ai/planetswe
- **Type**: Physics Simulation
- **Tags**: Physics, Earth Science
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/planetswe

Spherical shallow-water equation simulations modelling large-scale planetary atmospheric dynamics for weather and climate surrogate models.

### polymathic-ai/rayleigh_benard
- **Type**: Physics Simulation
- **Tags**: Physics, Engineering, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/rayleigh_benard

Rayleigh–Bénard thermal convection simulations at varying Rayleigh and Prandtl numbers for benchmarking turbulence and heat transfer models.

### polymathic-ai/supernova_explosion_64
- **Type**: Astrophysics Simulation
- **Tags**: Physics, Astronomy
- **HuggingFace**: https://huggingface.co/datasets/polymathic-ai/supernova_explosion_64

Hydrodynamic simulations of core-collapse supernova explosions at 64³ resolution, spanning diverse progenitor masses and explosion energies.

### nasa-impact/WxC-Bench
- **Type**: Climate Benchmark
- **Tags**: Earth Science, Climate, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-impact/WxC-Bench

Standardised benchmark for evaluating AI models across six atmospheric and earth science tasks including gravity wave parameterisation, turbulence prediction, and hurricane track forecasting.

### proxima-fusion/constellaration
- **Type**: Fusion Physics
- **Tags**: Physics, Energy, Engineering
- **HuggingFace**: https://huggingface.co/datasets/proxima-fusion/constellaration

Large-scale dataset of quasi-isodynamic stellarator designs with MHD equilibria for fusion energy research.

### google/spiqa
- **Type**: Scientific Benchmark
- **Tags**: Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/google/spiqa

Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains.

### nasa-ibm-ai4science/surya-bench-flare-forecasting
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/nasa-ibm-ai4science/surya-bench-flare-forecasting

Full-disk solar flare forecasting dataset from NOAA GOES observations, providing multi-hour-ahead flare probability labels for heliophysics model evaluation.

### nasa-ibm-ai4science/core-sdo
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/datasets/nasa-ibm-ai4science/core-sdo

Multi-modal Solar Dynamics Observatory dataset combining EUV imagery, magnetograms, and irradiance spectra for solar foundation model pre-training.

### LeMaterial/LeMat-Bulk-MLIP-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-MLIP-Hull

Convex hull data for bulk materials computed with machine-learning interatomic potentials (MLIPs).

### LeMaterial/LeMat-Bulk-DFT-Hull-All
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull-All

Complete DFT convex hull dataset for bulk materials discovery.

### LeMaterial/LeMat-Bulk-DFT-Hull
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk-DFT-Hull

DFT convex hull reference data for materials stability analysis.
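
Stability analysis with a convex hull usually means computing a phase's energy above the hull: how far its formation energy sits above the lowest-energy combination of competing phases at the same composition. The sketch below illustrates the idea for a two-element system only, assuming formation energies in eV/atom and elemental references at zero. The numbers are invented for illustration and are not drawn from the dataset, and both function names are hypothetical.

```python
def lower_hull(points):
    """Lower convex hull of (composition, energy) points via a monotone-chain scan."""
    # Keep only the lowest-energy entry at each composition.
    best = {}
    for x, e in points:
        best[x] = min(e, best.get(x, e))
    hull = []
    for p in sorted(best.items()):
        # Pop the last vertex while it lies on or above the line from hull[-2] to p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross > 0:
                break
            hull.pop()
        hull.append(p)
    return hull


def energy_above_hull(x, energy, hull):
    """Energy of a phase relative to the hull at composition x (0 for hull phases)."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            hull_energy = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
            return energy - hull_energy
    raise ValueError("composition lies outside the hull range")


# Illustrative phases: (fraction of element B, formation energy in eV/atom).
phases = [(0.0, 0.0), (1.0, 0.0), (0.5, -1.0), (0.25, -0.2), (0.75, -0.4)]
hull = lower_hull(phases)
for x, e in phases:
    print(x, round(energy_above_hull(x, e, hull), 3))
```

Real materials systems have more than two elements, where the hull is built in higher dimensions; dedicated materials-analysis libraries handle that case.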

### LeMaterial/LeMat-Bulk
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Bulk

Primary bulk materials database aggregating 1M+ crystal structures with DFT-computed formation energies, band gaps, and elastic properties for materials discovery.

### LeMaterial/LeMat-Traj
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/LeMaterial/LeMat-Traj

Large-scale molecular dynamics trajectory dataset for training machine learning interatomic potentials across diverse bulk material compositions.

### facebook/principia-collection
- **Type**: STEM Reasoning
- **Tags**: Mathematics, Physics, Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-collection

Large-scale STEM reasoning dataset from Meta covering mathematics, physics, chemistry, and biology problems for training and evaluating scientific reasoning in language models.

### facebook/principia-bench
- **Type**: STEM Benchmark
- **Tags**: Mathematics, Physics, Chemistry, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-bench

Curated benchmark of challenging STEM problems requiring multi-step reasoning, quantitative analysis, and domain knowledge across natural sciences.

### isp-uv-es/rtm_emulation
- **Type**: Earth Observation
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/datasets/isp-uv-es/rtm_emulation

Atmospheric radiative transfer model emulation dataset for training fast neural surrogates to replace computationally expensive RTM simulations in satellite data processing.

### UniverseTBD/arxiv-abstracts-large
- **Type**: Scientific Literature
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/datasets/UniverseTBD/arxiv-abstracts-large

1.7 million scholarly article abstracts spanning physics, computer science, and statistics from arXiv, structured for pretraining and fine-tuning astronomy and scientific language models.

### UniverseTBD/AstroLLaVA_convos
- **Type**: Multimodal Astronomy
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/datasets/UniverseTBD/AstroLLaVA_convos

Astronomical images paired with detailed captions and question-answer pairs sourced from APOD, ESO, and ESA Hubble archives, for training multimodal vision-language models on astrophysics.

### allenai/peS2o
- **Type**: Pretraining Corpus
- **Tags**: Scientific Reasoning, Biology, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/allenai/peS2o

Approximately 40M cleaned, filtered, and formatted open-access academic papers derived from S2ORC — a large multi-domain pretraining corpus for science-aware language models, spanning biology, chemistry, engineering, computer science, and physics.

### neashton/drivaerml
- **Type**: Automotive CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/neashton/drivaerml

High-fidelity CFD simulation dataset of the DrivAer reference automotive geometry — resolved-flow data for training ML models on aerodynamics prediction (drag, downforce, surface pressure).

### PLAID-datasets/AirfRANS_original
- **Type**: Aerodynamics CFD
- **Tags**: Physics, Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/PLAID-datasets/AirfRANS_original

Original AirfRANS airfoil RANS simulation dataset — graph-structured CFD over NACA airfoils for benchmarking physics-informed and graph neural networks.

### luminary-shift/SUV
- **Type**: Automotive CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/SUV

Large-scale CFD dataset of SUV-class vehicles for training ML models on automotive aerodynamics — surface pressures, wake structures, and aerodynamic performance metrics.

### luminary-shift/Pump
- **Type**: Turbomachinery CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/Pump

CFD simulations of centrifugal pumps spanning operating conditions — for training ML surrogates of turbomachinery flow and performance.

### luminary-shift/SHIFT-Crash
- **Type**: Crash Simulation
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/SHIFT-Crash

Vehicle crash-simulation dataset capturing structural deformation under impact — for ML-based safety and structural-mechanics modelling.

### luminary-shift/WING
- **Type**: Aerospace CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/WING

Wing-flow CFD dataset for ML-driven aerodynamics — covers a range of geometries and flight conditions for surrogate modelling.

### luminary-shift/CCA
- **Type**: Aerospace CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/CCA

Common Compressor Aero (CCA) dataset — compressor and turbomachinery simulations for ML-augmented aerospace design workflows.

### luminary-shift/Submarine
- **Type**: Marine CFD
- **Tags**: Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/luminary-shift/Submarine

Submarine hydrodynamics CFD dataset — submerged-body flow simulations for ML-based marine engineering and naval design.

### microsoft/msr-acc-tae25
- **Type**: Quantum Chemistry Dataset
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/microsoft/msr-acc-tae25

Microsoft Research Accurate Chemistry Collection — large dataset of high-accuracy electronic-structure calculations (TAE25 split) for training and evaluating quantum-chemistry ML models.

### foundry-ml/foundry_oqmd_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_oqmd_band_gaps_v1-1

Band-gap values from the Open Quantum Materials Database (OQMD), prepared for ML benchmarking on inorganic crystal electronic structure.

### foundry-ml/foundry_aflow_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_aflow_band_gaps_v1-1

Band-gap values from the AFLOW high-throughput materials database, formatted for ML model training and evaluation.

### foundry-ml/foundry_mp_band_gaps_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_mp_band_gaps_v1-1

Band-gap values curated from the Materials Project for ML benchmarking on inorganic electronic structure.

### foundry-ml/double_perovskite_bandgap_v1-1
- **Type**: Electronic Structure
- **Tags**: Materials Science, Physics, Chemistry, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/double_perovskite_bandgap_v1-1

Computed band gaps for double-perovskite compounds — supports ML-based screening for photovoltaic and optoelectronic applications.

### foundry-ml/dataset_perovskite_tec
- **Type**: Perovskite Properties
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_perovskite_tec

Thermal expansion coefficients for perovskite materials — curated for ML thermal-property prediction.

### foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1
- **Type**: STM Imaging
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/atomvison_atomistic_stm_images_2d_materials_unique_chemical_compositions_structure_v1-1

Simulated STM images for 2D materials with unique chemical compositions — supports ML on atomic-resolution microscopy.

### foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba
- **Type**: STEM Imaging
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/atomvison_simulated_atomistic_stem_images_2d_materials_unique_chemical_compositions_structure_ba

Simulated STEM images for 2D materials — paired with structure metadata for training ML models on electron microscopy.

### foundry-ml/training_locating_atoms_stem_images_v1-2
- **Type**: STEM Imaging
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/training_locating_atoms_stem_images_v1-2

STEM image training set for atomic-position localisation — supports ML pipelines for automated microscopy analysis.

### foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1
- **Type**: Electron Microscopy
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/direct_electron_detectorceleritas_xs_simulated_readout_images_electron_counting_model_v1-1

Simulated readout images from a Celeritas XS direct-electron detector — training data for electron-counting models in cryo-EM and STEM.

### foundry-ml/elastic_tensor_v1-1
- **Type**: Mechanical Properties
- **Tags**: Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/elastic_tensor_v1-1

Elastic tensor data for inorganic materials — supports ML prediction of bulk and shear moduli.

### foundry-ml/piezoelectric_tensor_v1-1
- **Type**: Electromechanical Properties
- **Tags**: Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/piezoelectric_tensor_v1-1

Piezoelectric tensor data for inorganic materials — supports ML for sensor and actuator material design.

### foundry-ml/dielectric_constant_v1-1
- **Type**: Dielectric Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dielectric_constant_v1-1

Dielectric-constant values for inorganic compounds — supports ML screening of high-k materials for capacitors and devices.

### foundry-ml/semiconductor_defectlevels_v1-1
- **Type**: Defect Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/semiconductor_defectlevels_v1-1

Computed defect-energy levels in semiconductors — descriptors for ML doping and trap-state prediction.

### foundry-ml/superconductivity_v1-1
- **Type**: Superconductivity
- **Tags**: Materials Science, Physics, Energy
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/superconductivity_v1-1

Curated superconductor dataset — measured Tc values for ML-based discovery of new superconducting materials.

### foundry-ml/dataset_rpv_tts
- **Type**: Reactor Materials
- **Tags**: Materials Science, Engineering, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_rpv_tts

Reactor pressure-vessel (RPV) transition-temperature shift dataset — supports ML prediction of irradiation embrittlement.

### foundry-ml/dataset_exfoliatione
- **Type**: 2D Materials
- **Tags**: Materials Science, Physics, Chemistry
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_exfoliatione

Exfoliation energy dataset for 2D materials — supports ML-driven discovery of layered compounds suitable for monolayer isolation.

### foundry-ml/dataset_thermalexp_aflow
- **Type**: Thermal Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_thermalexp_aflow

Thermal expansion coefficients from the AFLOW database — for ML thermal-mechanical modelling of inorganic materials.

### foundry-ml/dataset_thermalcond_aflow
- **Type**: Thermal Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_thermalcond_aflow

Thermal conductivity values from the AFLOW database — supports ML-based screening of thermal management materials.

### foundry-ml/dataset_debyet_aflow
- **Type**: Thermal Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_debyet_aflow

Debye temperature data from the AFLOW database — fundamental thermal-vibrational descriptor for ML materials property prediction.

### foundry-ml/heusler_magnetization_v1-1
- **Type**: Magnetic Properties
- **Tags**: Materials Science, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/heusler_magnetization_v1-1

Magnetisation data for Heusler-alloy compounds — supports ML discovery of half-metallic and magnetocaloric materials.

### foundry-ml/foundry_g4mp2_solvation_v1-2
- **Type**: Solvation Energies
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_g4mp2_solvation_v1-2

High-accuracy G4MP2 solvation-energy data — supports ML for quantum-chemical accuracy on aqueous and organic systems.

### foundry-ml/foundry_qmc_ml_v1-1
- **Type**: Quantum Chemistry
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/foundry_qmc_ml_v1-1

Quantum Monte Carlo (QMC) reference data for ML benchmarking — high-accuracy electronic structure calculations on small molecules.

## Models (42)

### FNO Active Matter
- **Type**: Physics Foundation Model
- **Tags**: Physics, Engineering
- **HuggingFace**: https://huggingface.co/polymathic-ai/FNO-active_matter

Fourier Neural Operator for active matter prediction.

### Aion Base
- **Type**: Foundation Model
- **Tags**: Physics, Astronomy, Engineering
- **HuggingFace**: https://huggingface.co/polymathic-ai/aion-base

Multi-domain scientific foundation model.

### WALRUS
- **Type**: Physics Foundation Model
- **Tags**: Physics, Engineering
- **HuggingFace**: https://huggingface.co/polymathic-ai/walrus

Foundation model for continuum dynamics pre-trained across 15 physics simulation datasets, enabling zero-shot and few-shot PDE generalisation.

### AstroCLIP
- **Type**: Astronomy Foundation Model
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/polymathic-ai/astroclip

Multimodal astronomy model aligning galaxy spectra and images into a shared embedding space for downstream astrophysical property prediction.

### FourCastNet 3
- **Type**: Weather Prediction
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/nvidia/fourcastnet3

Advanced ML model for global weather forecasting that produces 60-day forecasts in under 4 minutes on a single GPU.

### StormCast V1
- **Type**: Weather Prediction
- **Tags**: Earth Science, Climate, Physics
- **HuggingFace**: https://huggingface.co/nvidia/stormcast-v1-era5-hrrr

Mesoscale ML model for convection-allowing weather forecasting at kilometer-scale resolution.

### Surya 1.0
- **Type**: Heliophysics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/nasa-ibm-ai4science/Surya-1.0

First open-source AI foundation model for heliophysics, supporting solar flare forecasting and space weather prediction.

### Surya Solar Flares
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/nasa-ibm-ai4science/solar_flares_surya

Surya-1.0 fine-tuned for solar flare prediction from full-disk magnetogram and EUV time series.

### Surya Solar Wind
- **Type**: Solar Physics
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/nasa-ibm-ai4science/solar_wind_surya

Surya-1.0 fine-tuned for solar wind plasma and interplanetary magnetic field forecasting at the L1 Lagrange point.

### NASA-SMD-IBM
- **Type**: Earth Science NLP
- **Tags**: Earth Science, Physics, Astronomy
- **HuggingFace**: https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1

RoBERTa-based language model pre-trained on NASA Science Mission Directorate literature for earth and space science information extraction.

### OMat24
- **Type**: Materials Science
- **Tags**: Materials Science, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/OMAT24

Machine learning models for predicting inorganic material properties using EquiformerV2 and eSEN architectures.

### UMA
- **Type**: Molecular
- **Tags**: Chemistry, Materials Science, Physics, Engineering
- **HuggingFace**: https://huggingface.co/facebook/UMA

Universal Models for Atoms: a mixture-of-experts graph network trained on billions of atoms across 5 datasets.

### AstroLLaMA
- **Type**: Astronomy Language Model
- **Tags**: Astronomy, Physics
- **HuggingFace**: https://huggingface.co/UniverseTBD/astrollama

Llama-2 7B fine-tuned on 300K+ astronomy arXiv abstracts for astrophysics text generation, literature summarization, and hypothesis completion — first open LLM specialized for astronomy.

### Equiformer v3
- **Type**: Equivariant GNN
- **Tags**: Chemistry, Physics, Materials Science
- **HuggingFace**: https://huggingface.co/mirror-physics/equiformer_v3

Equivariant graph transformer for molecular and materials modeling — predicts energies, forces, and properties on molecular structures and crystals.

### Skala 1.1
- **Type**: DFT Functional
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/microsoft/skala-1.1

Deep-learning exchange-correlation functional for density functional theory — covers main-group thermochemistry, reaction kinetics, noncovalent interactions, and molecular geometries.

### Aurora
- **Type**: Weather Foundation Model
- **Tags**: Climate, Earth Science, Physics
- **HuggingFace**: https://huggingface.co/microsoft/aurora

Foundation model for the Earth system — global weather forecasting, atmospheric chemistry, ocean waves, and tropical-cyclone tracking from a single shared backbone.

### MatterSim
- **Type**: Materials Simulator
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/microsoft/mattersim

Foundation-model atomistic simulator for materials across a wide range of temperatures and pressures; positioned as a drop-in replacement for ab-initio MD in property prediction.

### Skala 1.0
- **Type**: DFT Functional
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/microsoft/skala-1.0

First release of Skala — deep-learning exchange-correlation functional for density functional theory, predecessor to Skala 1.1.

### AIMNet2-rxn
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-rxn

AIMNet2 trained on reaction data — neural-network interatomic potential supporting reactive molecular simulations.

### AIMNet2 ωB97M-D3
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-wb97m-d3

Neural network interatomic potential for fast and accurate molecular simulations, trained at the ωB97M-D3 level of theory.

### AIMNet2 (B97-3c, 2025)
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-2025

AIMNet2 retrained at the B97-3c level of theory — 2025 release with improved coverage and accuracy.

### AIMNet2-NSE
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-nse

AIMNet2 specialised for open-shell chemistry (radicals, transition states) — neural network interatomic potential for non-singlet electronic states.

### AIMNet2-Pd
- **Type**: Neural Interatomic Potential
- **Tags**: Chemistry, Materials Science, Physics
- **HuggingFace**: https://huggingface.co/isayevlab/aimnet2-pd

AIMNet2 specialised for palladium-containing organometallic systems — supports homogeneous catalysis simulation at near-DFT accuracy.

### MACE-MP-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mp-0

MACE foundation model trained on the Materials Project — equivariant message-passing potential for inorganic crystal simulation across most of the periodic table.

### MACE-MPA-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mpa-0

MACE foundation model trained on the Materials Project + Alexandria datasets — broader coverage variant for inorganic-materials simulation.

### MACE-MH-0
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mh-0

MACE foundation model targeting molecular and hybrid systems — equivariant potential trained on a unified molecular/materials dataset.

### MACE-MH-1
- **Type**: Foundation Potential
- **Tags**: Materials Science, Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mace-foundations/mace-mh-1

Updated MACE-MH foundation potential with refined molecular/materials hybrid training — successor to MACE-MH-0.

### MIST 28M · QM9
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-kkgx0omx-qm9

MIST 28M fine-tuned on QM9 — quantum-mechanical property prediction over small organic molecules.

### MIST 28M · QM8
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-28M-gzwqzpcr-qm8

MIST 28M fine-tuned on QM8 — electronic-spectra property prediction over small organic molecules.

### MIST 1.8B · G298
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-09sntn03-g298

MIST 1.8B fine-tuned for G298 — Gibbs free energy at 298 K from QM9.

### MIST 1.8B · H298
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-3fbbz4is-h298

MIST 1.8B fine-tuned for H298 — enthalpy at 298 K from QM9.

### MIST 1.8B · U298
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-85f24xkj-u298

MIST 1.8B fine-tuned for U298 — internal energy at 298 K from QM9.

### MIST 1.8B · U0
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-a7akimjj-u0

MIST 1.8B fine-tuned for U0 — internal energy at 0 K from QM9.

### MIST 1.8B · μ (dipole)
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-ez05expv-mu

MIST 1.8B fine-tuned for dipole moment from QM9.

### MIST 1.8B · α (polarizability)
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-rcwary93-alpha

MIST 1.8B fine-tuned for isotropic polarizability from QM9.

### MIST 1.8B · HOMO
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-jmjosq12-homo

MIST 1.8B fine-tuned for HOMO energy from QM9.

### MIST 1.8B · LUMO
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-n14wshc9-lumo

MIST 1.8B fine-tuned for LUMO energy from QM9.

### MIST 1.8B · HOMO-LUMO gap
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-kayun6v3-gap

MIST 1.8B fine-tuned for HOMO-LUMO gap from QM9.

### MIST 1.8B · ZPVE
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-6nmcwyrp-zpve

MIST 1.8B fine-tuned for zero-point vibrational energy from QM9.

### MIST 1.8B · ⟨R²⟩
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-xxe7t35e-r2

MIST 1.8B fine-tuned for electronic spatial extent from QM9.

### MIST 1.8B · Cv
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-j356b3nf-cv

MIST 1.8B fine-tuned for heat capacity Cv from QM9.

### MIST 1.8B · QM8
- **Type**: Molecular Property Predictor
- **Tags**: Chemistry, Physics
- **HuggingFace**: https://huggingface.co/mist-models/mist-1.8B-8nd1ot5j-qm8

MIST 1.8B fine-tuned on QM8 — electronic-spectra prediction at scale.

## Blog Posts (5)

### AI for PDEs
- **Author**: hugging-science
- **Date**: 2025-01-01
- **Tags**: Physics, Mathematics, Engineering
- **Link**: https://huggingface.co/blog/hugging-science/pde

Exploring AI approaches to solving partial differential equations.

### ConStellaration Fusion Challenge
- **Author**: cgeorgiaw
- **Date**: 2025-01-01
- **Tags**: Physics, Energy, Engineering
- **Link**: https://huggingface.co/blog/cgeorgiaw/constellaration-fusion-challenge

A challenge for advancing fusion energy through AI.

### Physics Informed Neural Networks (PINNs): An Intuitive Guide
- **Author**: towardsdatascience.com
- **Date**: 2025-01-28
- **Tags**: Physics, Mathematics, Engineering
- **Link**: https://towardsdatascience.com/physics-informed-neural-networks-pinns-an-intuitive-guide-fff138069563/

A clear, intuitive walkthrough of how PINNs embed physical laws directly into neural network training — bridging traditional PDE-based modeling with data-driven deep learning.

### A Living Review of Machine Learning for Particle Physics
- **Author**: iml-wg.github.io
- **Date**: 2020-06-01
- **Tags**: Physics
- **Link**: https://iml-wg.github.io/HEPML-LivingReview/

A continuously updated, near-comprehensive survey of ML techniques applied to experimental, phenomenological, and theoretical high-energy physics — maintained by the Inter-Experimental LHC ML Working Group.

### Did GPT-5.2 Make a Breakthrough Discovery in Theoretical Physics?
- **Author**: dlouapre
- **Date**: 2026-02-01
- **Tags**: Physics, Mathematics
- **Link**: https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons

GPT-5.2 conjectured a compact formula for single-minus gluon tree amplitudes, which had been assumed to be zero for 40 years. A striking example of AI contributing to original theoretical physics.


================================================================================
## Topic: Scientific Reasoning (/topics/scientific-reasoning.md)
================================================================================

# Scientific Reasoning — Hugging Science

> Scientific QA, theorem proving, and multi-step problem-solving datasets

_Auto-generated from site data. Fetch `/llms.txt` for the full index._
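
Because these files are auto-generated, every entry follows the same shape: a `### ` heading, a few `- **Key**: value` lines, and a short description. The sketch below shows one way to turn that text into records using only the Python standard library. The function name `parse_entries` and the record layout are assumptions made for illustration, not part of the site, and the sample text is a two-entry excerpt rather than a fetched file.

```python
import re

# Field lines look like "- **Type**: Combinatorics".
FIELD = re.compile(r"^- \*\*(?P<key>[^*]+)\*\*:\s*(?P<value>.*)$")


def parse_entries(text):
    """Split catalog text into one dict per '### ' entry."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            current = {"name": line[4:], "fields": {}, "description": ""}
            entries.append(current)
        elif line.startswith("## ") or set(line) == {"="}:
            current = None  # a section heading or banner ends the entry
        elif current is None or not line:
            continue
        else:
            match = FIELD.match(line)
            if match:
                current["fields"][match["key"]] = match["value"]
            else:
                # Anything else is treated as free-text description.
                joined = current["description"] + " " + line
                current["description"] = joined.strip()
    # Tags are comma-separated; split them for easier filtering.
    for entry in entries:
        tags = entry["fields"].get("Tags", "")
        entry["tags"] = [t.strip() for t in tags.split(",") if t.strip()]
    return entries


SAMPLE = """\
## Datasets (2)

### AI-MO/CombiBench
- **Type**: Combinatorics
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/CombiBench

Combinatorics problems drawn from AMC, AIME, and olympiad competitions.

### AI-MO/NuminaMath-CoT
- **Type**: Math Problems
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT

NuminaMath with Chain-of-Thought reasoning annotations.
"""

entries = parse_entries(SAMPLE)
benchmarks = [e["name"] for e in entries if "Benchmark" in e["tags"]]
print(benchmarks)
```

Filtering on the parsed `tags` list, as in the last lines, is a lightweight way to narrow a topic file to benchmarks only.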

## Datasets (47)

### jablonkagroup/ChemBench
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Materials Science, Benchmark, Scientific Reasoning, Engineering, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/ChemBench

Manually curated benchmark of 3,000+ chemistry and materials science questions across spectroscopy, reactivity, synthesis, and property prediction for evaluating LLMs.

### AI-MO/olympiads-ref
- **Type**: Competition Math
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/olympiads-ref

Extended reference set of olympiad problems with verified step-by-step solutions, used for Chain-of-Thought and formal reasoning training.

### AI-MO/Kimina-Prover-Promptset
- **Type**: Theorem Proving
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/Kimina-Prover-Promptset

Prompt-set for training and evaluating Kimina, a Lean 4 theorem prover that uses reinforcement learning over formal mathematical proofs.

### AI-MO/GeometryLeanBench
- **Type**: Theorem Proving
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/GeometryLeanBench

Geometry theorem proving problems formalised in Lean 4, covering Euclidean, affine, and metric geometry for automated reasoning evaluation.

### AI-MO/CombiBench
- **Type**: Combinatorics
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/CombiBench

Combinatorics problems drawn from AMC, AIME, and olympiad competitions, formalised for benchmarking discrete-mathematics reasoning in language models.

### AI-MO/minif2f_test
- **Type**: Theorem Proving
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/minif2f_test

Test split of the miniF2F formal mathematics benchmark.

### AI-MO/aimo-validation-amc
- **Type**: Competition Math
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-amc

AMC 10/12 competition problems reformatted for AIMO challenge validation, covering algebra, geometry, and number theory at difficulty levels 1–5.

### AI-MO/aimo-validation-aime
- **Type**: Competition Math
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-aime

AIME I/II problems reformatted for AIMO challenge validation — 15-question integer-answer format, covering competition math at difficulty levels 5–9.

### AI-MO/NuminaMath-TIR
- **Type**: Math Problems
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

NuminaMath with Tool-Integrated Reasoning annotations.

### AI-MO/NuminaMath-CoT
- **Type**: Math Problems
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT

NuminaMath with Chain-of-Thought reasoning annotations.

### AI-MO/aimo-validation-math-level-4
- **Type**: Math Problems
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-4

Level-4 MATH benchmark problems (pre-calculus difficulty) used for AIMO challenge validation and fine-grained model evaluation.

### AI-MO/aimo-validation-math-level-5
- **Type**: Math Problems
- **Tags**: Mathematics, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/aimo-validation-math-level-5

Level-5 MATH benchmark problems (highest difficulty) used for AIMO challenge validation and measuring the ceiling of model mathematical reasoning.

### google/spiqa
- **Type**: Scientific Benchmark
- **Tags**: Biology, Chemistry, Physics, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/google/spiqa

Scientific Paper Image Question Answering benchmark requiring multimodal reasoning over figures, charts, and diagrams from research papers across scientific domains.

### facebook/principia-collection
- **Type**: STEM Reasoning
- **Tags**: Mathematics, Physics, Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-collection

Large-scale STEM reasoning dataset from Meta covering mathematics, physics, chemistry, and biology problems for training and evaluating scientific reasoning in language models.

### facebook/principia-bench
- **Type**: STEM Benchmark
- **Tags**: Mathematics, Physics, Chemistry, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/facebook/principia-bench

Curated benchmark of challenging STEM problems requiring multi-step reasoning, quantitative analysis, and domain knowledge across natural sciences.

### futurehouse/BixBench
- **Type**: Research Benchmark
- **Tags**: Biology, Chemistry, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/BixBench

Benchmark with 205 reproducible research questions paired with data capsules for AI evaluation.

### futurehouse/lab-bench
- **Type**: Research Benchmark
- **Tags**: Biology, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/lab-bench

Language Agent Biology Benchmark: 8 categories of scientific research tasks, including cloning, figures, and protocols.

### futurehouse/ether0-benchmark
- **Type**: Chemistry Benchmark
- **Tags**: Chemistry, Medicine, Benchmark, Scientific Reasoning, Mathematics
- **HuggingFace**: https://huggingface.co/datasets/futurehouse/ether0-benchmark

Chemistry reasoning benchmark covering SMILES-based tasks including reaction prediction, retrosynthesis, and molecular property estimation for evaluating chemistry LLMs.

### SAIRfoundation/equational-theories-selected-problems
- **Type**: Mathematical Reasoning
- **Tags**: Mathematics, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/SAIRfoundation/equational-theories-selected-problems

Curated selection of equational theory problems for benchmarking LLM mathematical reasoning and automated theorem proving.

### SAIRfoundation/equational-theories-benchmark
- **Type**: Mathematical Reasoning
- **Tags**: Mathematics, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/SAIRfoundation/equational-theories-benchmark

Full benchmark suite of equational theory problems spanning algebraic structures, designed to evaluate formal reasoning capabilities of AI models.

### openai/healthbench
- **Type**: Medical Benchmark
- **Tags**: Medicine, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/openai/healthbench

Realistic multi-turn health conversations graded against physician-written rubrics across multiple axes (accuracy, completeness, communication) — an open evaluation benchmark for AI assistants in medicine.

### openai/healthbench-professional
- **Type**: Medical Benchmark
- **Tags**: Medicine, Benchmark, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/openai/healthbench-professional

Professional-graded subset of HealthBench: physician evaluators score model responses to clinically realistic conversations, targeting expert-level health assessment.

### openai/frontierscience
- **Type**: Scientific Reasoning Benchmark
- **Tags**: Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/openai/frontierscience

Frontier science evaluation benchmark probing model capabilities on expert-level reasoning across natural sciences — designed to surface what AI systems can and cannot do at the research frontier.

### wanglab/kegg
- **Type**: Biological Reasoning
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/wanglab/kegg

KEGG pathway entries paired with variant annotations for training and evaluating multimodal biological reasoning models (used by the BioReason work).

### AI-MO/olympiads
- **Type**: Math Reasoning Corpus
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/AI-MO/olympiads

Olympiad-level mathematical problems collected from international and national competitions, formatted for training and evaluating mathematical reasoning models.

### OpenMed/MedDialog
- **Type**: Medical Dialogue
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/MedDialog

Doctor-patient medical dialogue dataset for training and evaluating clinical conversation models — covers triage, symptom checking, and diagnostic reasoning.

### OpenMed/Medical-Reasoning-SFT-Mega
- **Type**: SFT Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Mega

Large supervised fine-tuning corpus for clinical reasoning — multi-step medical question-answer chains with rationales for training instruction-following medical LLMs.

### allenai/peS2o
- **Type**: Pretraining Corpus
- **Tags**: Scientific Reasoning, Biology, Chemistry, Physics, Engineering
- **HuggingFace**: https://huggingface.co/datasets/allenai/peS2o

Approximately 40M cleaned, filtered, and formatted open-access academic papers derived from S2ORC — a large multi-domain pretraining corpus for science-aware language models, spanning biology, chemistry, engineering, computer science, and physics.

### Anthropic/BioMysteryBench-preview
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-preview

Preview slice of BioMysteryBench — challenging, expert-curated biology problems for evaluating AI scientific reasoning capability.

### Anthropic/BioMysteryBench-full
- **Type**: Biology Benchmark
- **Tags**: Biology, Medicine, Scientific Reasoning, Benchmark
- **HuggingFace**: https://huggingface.co/datasets/Anthropic/BioMysteryBench-full

Full BioMysteryBench evaluation set — challenging biology problems used to probe expert-level scientific reasoning in frontier models.

### PLAID-datasets/AirfRANS_original
- **Type**: Aerodynamics CFD
- **Tags**: Physics, Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/PLAID-datasets/AirfRANS_original

Original AirfRANS airfoil RANS simulation dataset — graph-structured CFD over NACA airfoils for benchmarking physics-informed and graph neural networks.

### jablonkagroup/chempile-instruction
- **Type**: Chemistry Instruction Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-instruction

Instruction-tuning corpus for chemistry — curated Q&A and dialogue traces drawn from chemical literature and educational sources for training chemistry-specialist LLMs.

### jablonkagroup/chempile-reasoning
- **Type**: Chemistry Reasoning Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-reasoning

Multi-step chemistry reasoning corpus — open-domain QA, NLI, and multiple-choice items with chains of reasoning for training and evaluating chemical reasoning models.

### jablonkagroup/chempile-lift
- **Type**: Chemistry Pretraining
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-lift

ChemPile-LIFT — large-scale language-modelling dataset combining curated chemistry literature and structured chemical knowledge for foundation-model pretraining.

### jablonkagroup/chempile-education
- **Type**: Chemistry Education Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-education

Educational chemistry corpus — multiple-choice and open-ended items spanning introductory through graduate chemistry for assessing model educational capability.

### jablonkagroup/chempile-caption
- **Type**: Chemistry Captioning
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-caption

Image-to-text dataset of chemistry figures (molecular structures, reaction schemes, plots) with expert captions for training multimodal chemistry models.

### jablonkagroup/chempile-code
- **Type**: Chemistry Code Corpus
- **Tags**: Chemistry, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/jablonkagroup/chempile-code

Curated chemistry-relevant code (RDKit, ASE, simulation tooling) drawn from The Stack — supports training models that can read and write computational chemistry workflows.

### miriad/miriad-5.8M
- **Type**: Medical Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/miriad/miriad-5.8M

5.8M-example medical instruction-tuning and reasoning corpus curated from clinical literature for training healthcare LLMs at scale.

### miriad/miriad-4.4M
- **Type**: Medical Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/miriad/miriad-4.4M

4.4M-example medical reasoning subset of MIRIAD — earlier release used for benchmarking medical instruction-tuning workflows.

### wanglab/bioreason-pro-sft-reasoning-data
- **Type**: Biological Reasoning Corpus
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/wanglab/bioreason-pro-sft-reasoning-data

Reasoning-trace dataset used for supervised fine-tuning of BioReason-Pro: multimodal biological problems with rationales over genomic variants and pathway data.

### foundry-ml/dataset_metallicglass_rc_llm
- **Type**: Metallic Glass
- **Tags**: Materials Science, Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/foundry-ml/dataset_metallicglass_rc_llm

LLM-extracted critical cooling rate data for metallic glasses — text-mined complement to the structured Rc dataset.

### FreedomIntelligence/medical-o1-reasoning-SFT
- **Type**: Medical Reasoning Corpus
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT

Medical chain-of-thought reasoning dataset (o1-style) for supervised fine-tuning of medical LLMs — one of the most-liked medical training corpora on Hugging Face (1000+ likes).

### FreedomIntelligence/medical-o1-verifiable-problem
- **Type**: Medical RL Reward Data
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem

Verifiable medical reasoning problems, each pairing an open-ended question with a ground-truth answer that model outputs can be checked against. Supports RL and reward-model training for medical-LLM alignment beyond static SFT.

### ADSKAILab/Zero-To-CAD-1m
- **Type**: CAD Vision-Language Corpus
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-1m

1M paired image-and-CAD-program examples for training vision-language models that synthesise parametric CAD from images.

### ADSKAILab/Zero-To-CAD-100k
- **Type**: CAD Vision-Language Corpus
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-100k

Curated 100K-example subset of Zero-To-CAD — useful for benchmarking and lightweight fine-tuning of CAD-from-image models.

### ADSKAILab/LLM-narrative-planning-taskset
- **Type**: Planning Benchmark
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/LLM-narrative-planning-taskset

Narrative planning task set for evaluating LLM planning and reasoning over multi-step design and engineering scenarios.

### ADSKAILab/codeparrot_megatron
- **Type**: Code Pretraining
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/datasets/ADSKAILab/codeparrot_megatron

Megatron-formatted CodeParrot release used for large-scale code language-model pretraining experiments at Autodesk AI Lab.

## Models (9)

### Clinical Camel 70B
- **Type**: Clinical LLM
- **Tags**: Medicine, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/ClinicalCamel-70B

Llama-2 70B fine-tuned with QLoRA on physician-patient dialogues, clinical articles, and MedQA-style reasoning chains for medical conversation and decision support.

### Kimina-Prover Preview Distill 7B
- **Type**: Lean 4 Theorem Prover
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-7B

Distilled 7B preview of Kimina-Prover — a reinforcement-learning-trained model that generates Lean 4 proofs for olympiad-level mathematics problems.

### Kimina-Prover Distill 1.7B
- **Type**: Lean 4 Theorem Prover
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AI-MO/Kimina-Prover-Distill-1.7B

Compact 1.7B distilled Kimina-Prover variant for Lean 4 proof generation on olympiad-level theorems — runs on a single consumer GPU.
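
As a rough check on the single-GPU claim, the sketch below estimates the memory needed for the weights alone. It assumes the nominal parameter counts implied by the model names and 16-bit weights (2 bytes per parameter); the helper `weight_memory_gb` is hypothetical and ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts read off the model names (an assumption,
# not figures taken from the model cards).
NOMINAL_PARAMS = {
    "Kimina-Prover-Distill-1.7B": 1.7e9,
    "Kimina-Prover-Preview-Distill-7B": 7e9,
    "Kimina-Prover-Distill-8B": 8e9,
}

if __name__ == "__main__":
    for name, num_params in NOMINAL_PARAMS.items():
        gb = weight_memory_gb(num_params)
        print(f"{name}: ~{gb:.1f} GB of weights at 16-bit precision")
```

Under these assumptions the 1.7B variant needs about 3.4 GB for weights, while the 7B and 8B variants need about 14 GB and 16 GB respectively before any quantization.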

### Kimina-Prover Distill 8B
- **Type**: Lean 4 Theorem Prover
- **Tags**: Mathematics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AI-MO/Kimina-Prover-Distill-8B

8B distilled Kimina-Prover variant — Lean 4 theorem-proving model trained on olympiad-level mathematical problems with reinforcement learning over proof traces.

### BioReason-Pro SFT
- **Type**: Biological Reasoning Model
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/bioreason-pro-sft

Supervised fine-tuned variant of BioReason-Pro — multimodal biological reasoning over genomic variants and pathway data with chain-of-thought rationales.

### BioReason-Pro RL
- **Type**: Biological Reasoning Model
- **Tags**: Biology, Genomics, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/wanglab/bioreason-pro-rl

RL-tuned variant of BioReason-Pro, trained with reinforcement learning on top of the BioReason-Pro SFT base for sharper biological reasoning across KEGG pathways and variant data.

### NexaMass V3 Struct
- **Type**: Mass Spectrometry Model
- **Tags**: Chemistry, Biology, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/AethronPhantom/NexaMass-V3-Struct

Self-supervised representation model for MS/MS spectra in metabolomics — learns molecular fingerprints to support compound identification and structure inference.

### HuatuoGPT-Vision 7B
- **Type**: Medical Vision-Language Model
- **Tags**: Medicine, Biology, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/FreedomIntelligence/HuatuoGPT-Vision-7B

Medical multimodal LLM from the HuatuoGPT family — answers clinical questions over medical imagery (radiology, pathology, dermatology) using a 7B vision-language backbone.

### Zero-To-CAD Qwen3-VL 2B
- **Type**: CAD Vision-Language Model
- **Tags**: Engineering, Scientific Reasoning
- **HuggingFace**: https://huggingface.co/ADSKAILab/Zero-To-CAD-Qwen3-VL-2B

Qwen3-VL fine-tuned to generate parametric CAD models directly from images — bridges vision-language reasoning and engineering geometry synthesis.