Summary
CRC1709 aims to integrate multi-dimensional data on myeloid plasticity from compatible, shared disease models. To make data management efficient and sustainable, the CRC establishes a central data integration hub. All data relevant to the project are collected and integrated on this hub, allowing analysis and modelling across all data sources.
State of the Art
- Challenges in Biomedical Data Management: Biomedical research often struggles with integrating diverse, sensitive, and high-volume datasets across institutions.
- Sensitive Patient Data: Compliance with GDPR and ethical standards is critical for handling clinical data.
- FAIR Principles: Modern research demands that data be Findable, Accessible, Interoperable, and Reusable for long-term usability.
- Advancements in AI: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) enable structured information extraction from unstructured data sources.
- Unexplored Potential: Current research underutilizes rich data from electronic health records (EHR) and sequential clinical analyses.
- Need for Integration: Harmonizing clinical and experimental datasets is essential for advancing translational research and understanding cellular plasticity in cancer.
Preliminary Work
- Clinical Data Warehouse of MED V: Centralized repository for structured data
- Medical Data Integration Center (MeDIC): Mature infrastructure ready
- SMART-CARE-Linked Data Repository: Harmonizing diverse datasets
- Portal of Medical Data Models: FAIR-compliant portal for annotated data
- Data Analysis Pipelines in Myeloid Leukemia: Genomic / clinical data tools
- Standardized Data Workflows: Established SOPs ensuring interoperability
- Experience with FAIR Practices: Implementation of data sharing
- Multimodal Omics Integration: Expertise in diverse data types
- HPC Collaboration: Leveraging high-performance computing
- AI Tools for Data Extraction: Deployment of LLM-based pipelines
Specific Aims

Work in Progress
(1) Multimodal data integration with extended EHR annotation
- Integrating diverse omics data (scRNA-seq, scATAC-seq, bulk RNA-seq, Ribometh-seq, proteomics, metabolomics)
- Advanced tissue profiling: Spatial multiplex imaging and multi-color flow cytometry (MFC)
- Incorporating biobanked patient samples, standardized PDX mouse models, and genetically engineered disease models
- Leveraging high-dimensional patient data from clinical data warehouses and external sources
- Establishing a unified platform for cross-project data accessibility and analysis
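As a rough illustration of what cross-project integration in aim (1) involves at the data-structure level, the sketch below groups data files by sample and modality and then lists the samples for which a requested set of modalities is available. The function names, sample IDs, and file paths are hypothetical and do not describe the hub's actual data model.

```python
from collections import defaultdict

def build_manifest(records):
    """Group (sample_id, modality, path) records into {sample: {modality: [paths]}}."""
    manifest = defaultdict(dict)
    for sample_id, modality, path in records:
        manifest[sample_id].setdefault(modality, []).append(path)
    # Convert to a plain dict so that missing samples raise KeyError
    return {sample: dict(mods) for sample, mods in manifest.items()}

def samples_with_all(manifest, modalities):
    """Return the sorted sample IDs that have data for every requested modality."""
    wanted = set(modalities)
    return sorted(s for s, mods in manifest.items() if wanted.issubset(mods))

# Hypothetical records, for illustration only
records = [
    ("S1", "scRNA-seq", "s1_rna.h5"),
    ("S1", "proteomics", "s1_prot.tsv"),
    ("S2", "scRNA-seq", "s2_rna.h5"),
]
manifest = build_manifest(records)
print(samples_with_all(manifest, ["scRNA-seq", "proteomics"]))
```

A manifest of this kind makes it easy to see which samples support a given joint analysis before any data are loaded.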
(2) Cloud-enabled data analysis pipelines for reproducibility and scalability
- Developing containerized bioinformatics pipelines for standardized and reproducible analyses
- Utilizing High-Performance Computing (HPC) for large-scale omics and imaging data
- Ensuring scalability across local and cloud-based infrastructures
(3) Automatic extraction of structured information from unstructured reports
- Designing and deploying LLM-based tools for extracting structured clinical and research data
- Creating a local data warehouse populated with structured outputs from EHRs and free-text sources
- Enabling “talk-to-your-dataset” functionalities for seamless interaction with unstructured data
- Facilitating predictive analytics and research-ready data pipelines with privacy-compliant LLMs
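One part of aim (3) can be sketched independently of any particular model: if an LLM is asked to return JSON, its output should be validated against a fixed schema before it is written to the data warehouse. The example below is a minimal sketch under that assumption. The `ReportRecord` class, its field names, and the `parse_llm_output` helper are hypothetical, and the LLM call itself is deliberately left out.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical target schema; the field names are illustrative only.
@dataclass
class ReportRecord:
    patient_id: str
    diagnosis: str
    blast_percentage: Optional[float] = None

REQUIRED_FIELDS = ("patient_id", "diagnosis")

def parse_llm_output(raw: str) -> ReportRecord:
    """Validate a JSON string (as an LLM might return it) and build a record."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if not data.get(f)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    blasts = data.get("blast_percentage")
    if blasts is not None:
        blasts = float(blasts)
        if not 0.0 <= blasts <= 100.0:
            raise ValueError("blast_percentage out of range")
    return ReportRecord(str(data["patient_id"]), str(data["diagnosis"]), blasts)

# Pseudonymized, made-up example input
record = parse_llm_output('{"patient_id": "P001", "diagnosis": "AML", "blast_percentage": 35}')
print(record)
```

Rejecting incomplete or out-of-range outputs at this stage keeps free-text extraction errors from propagating into downstream analyses.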
Collaborations
The INF project will collaborate with all CRC projects. A typical example of data integration is project A06, which processes PDX and patient samples; its data modalities include whole-genome sequencing, RNA sequencing (scRNA-seq CITEseq, bulk RNA-seq), proteomics, methylation arrays, and imaging data, with an estimated data volume of 10 GB per sample. Project B02 specified similar data integration requirements. Other projects specified more particular needs: project A07 plans scTCR-seq with approximately 260 GB per sample, and project B04 will process scATAC-seq, imaging, and amplicon sequencing with an overall raw data volume of 1 TB.
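The volumes quoted above can be turned into a back-of-the-envelope storage estimate. The sketch below uses the per-sample figures for A06 and A07 and the overall figure for B04; the sample counts are placeholders that do not come from the project plan, and 1 TB is assumed to mean 1000 GB.

```python
# Per-sample raw data volumes quoted in the text, in GB
PER_SAMPLE_GB = {"A06": 10, "A07": 260}
# B04 is quoted as an overall volume: 1 TB, assumed here to be 1000 GB
FIXED_TOTAL_GB = {"B04": 1000}

def estimate_raw_storage_gb(sample_counts):
    """Estimate total raw storage in GB from per-project sample counts."""
    total = sum(FIXED_TOTAL_GB.values())
    for project, n_samples in sample_counts.items():
        if n_samples < 0:
            raise ValueError(f"negative sample count for {project}")
        total += PER_SAMPLE_GB[project] * n_samples
    return total

# Placeholder sample counts, NOT taken from the project plan
example_counts = {"A06": 100, "A07": 10}
print(estimate_raw_storage_gb(example_counts))  # 100*10 + 10*260 + 1000 = 4600
```

Even with modest placeholder counts, the scTCR-seq data dominate the estimate, which illustrates why per-modality volumes matter for capacity planning.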
Principal Investigator
Jakob Nikolas Kather, Prof. Dr. med., M.Sc.

Principal Investigator
Martin Dugas, Prof. Dr. med. Dipl.-Inform.
