Early embryonic development is driven by a complex interplay of diverse molecular species that regulate processes such as cell division, chromatin remodeling, differentiation, and tissue organization. Capturing this molecular complexity at subcellular resolution remains a major challenge, particularly in three-dimensional systems. To address these limitations, this thesis presents an integrated experimental and computational framework for high-throughput iterative immunofluorescence (multiplexing) and quantitative volumetric imaging of the whole animal cap in early zebrafish embryos.
Key innovations of the experimental protocol include a 96-well plate mounting strategy—compatible with automated microscopy—that enables volumetric imaging of over 200 embryos per experiment, as well as a refractive index-matched mounting buffer that supports high-fidelity 3D acquisition. The protocol accommodates at least four rounds of antibody staining, enabling simultaneous detection of 11 protein targets alongside a DNA dye.
These advances enabled the acquisition of a time-course dataset spanning 2 to 4.5 hours post-fertilization, encompassing key developmental processes such as the mid-blastula transition (MBT) and zygotic genome activation (ZGA). The selected markers provide information on cell cycle progression, transcriptional activity, and ZGA regulation, while simultaneously supporting dense segmentation of nuclei, cells, and embryos.
A complementary image analysis pipeline was developed including elastic registration, object segmentation, and single-cell feature extraction, with an emphasis on generalizability across volumetric model systems. To facilitate downstream analysis, additional tools for interactive object classification and single-cell visualization of microscopy data were introduced. Furthermore, a pixel‐level intensity bias correction method was implemented to improve quantification of fluorescent markers across the entire volume.
Expanding upon this framework, a continuous model of cell cycle phase (CCP) was constructed at single-cell resolution. This enabled quantification of emerging cell cycle asynchrony during MBT, detection of mitotic pseudowaves in fixed samples, estimation of feature dynamics across the cell cycle, and sorting of cells along a pseudotime axis spanning two hours of developmental progression, inferred from embryo staging and CCP.
Finally, CCP served as a predictor in non-linear Bayesian regression models of transcriptional activity in interphase cells, measured by RNA Polymerase II S2P phosphorylation. These models identified CCP as a major driver of transcriptional heterogeneity. After accounting for cell cycle effects, incorporating ZGA-related factors as linear predictors revealed Pou5 and Polymerase II S5P as positive regulators of transcription, while PCNA emerged as a repressor.
Collectively, these advances offer a robust and scalable framework for quantitative single-cell analysis in three-dimensional multicellular systems, with broader implications for biomedical research using volumetric tissue models.
High-throughput quantitative immunofluorescence microscopy is an indispensable tool for probing protein abundance in situ and is widely adopted in cell biology and biomedicine. More recently developed multiplexing technologies (Gut, Herrmann, and Pelkmans 2018; Black et al. 2021; Saka et al. 2019) allow for the extraction of even more information by extending the number of molecular species that can be measured within one sample in a single experiment.
While multiplexing protocols for three-dimensional structures—as they are encountered in developmental biology and increasingly in pharmaceutical screening applications that use 3D tissue models like organoids and synthetic embryos—have been developed (Ku et al. 2016; Kuett et al. 2022), not much emphasis has been put on (1) adapting them for high-content screening applications and (2) analyzing the resulting datasets quantitatively using the methods of image-based systems biology.
In my thesis I worked on the experimental methods required to generate such data, using the example of the zebrafish model. Furthermore, I worked on the computational tools needed in the analysis of high-throughput volumetric image data and applied them to the study of two fundamental processes in early embryonic development, the mid-blastula transition (MBT) and zygotic genome activation (ZGA).
I will begin with an introduction to the zebrafish in Section 1.1, followed by an overview of the experimental approaches used in image-based systems biology in Section 1.2, and conclude with an introduction to the computational methods essential for analyzing such data in Section 1.3.
The zebrafish (Danio rerio) is a model organism introduced to developmental biology by Streisinger et al. (1981). As a vertebrate, it shares a significant portion of its genetic makeup with humans, with approximately 70% of human genes having a zebrafish ortholog—only slightly fewer than the 80% that have a mouse ortholog (Choi et al. 2021). In contrast to mice, zebrafish are much easier to raise in large numbers, with a single female being able to produce hundreds of eggs per week (Streisinger et al. 1981). Paired with the genetic tools that have been developed since its introduction, and its relatively short generation time of 3–4 months (Streisinger et al. 1981), the zebrafish is ideally suited for high-content screening applications. Finally, due to its stereotypic development outside the mother's womb and the fact that the embryos are mostly transparent during early development, it is an ideal model for the development of volumetric microscopy applications.
Zebrafish development begins with the fertilization of the oocyte, resulting in a zygote enriched with maternally deposited mRNAs and proteins that govern the initial stages of development. During the cleavage period (~0.75–2 hours post-fertilization, hpf), the embryo undergoes six rapid and synchronous cell divisions, occurring at approximately 15-minute intervals, with each blastomere dividing along predetermined division planes.
The cleavage period is followed by the blastula period, which spans from the 128-cell stage to the beginning of epiboly (~2.25–5 hpf) and then leads into gastrulation and the formation of the germ layers (Kimmel et al. 1995). During the first 5 hours of development, important processes take place, including (1) specification of the enveloping layer (EVL) and deep cells at the end of the cleavage period (64-cell stage), (2) formation of the yolk syncytial layer (YSL), where the marginal blastomeres release their nuclei and cytoplasm into the adjacent yolk cytoplasm at division cycle 10 (512-cell stage), (3) the mid-blastula transition (MBT) (Kimmel et al. 1995), which is the lengthening and emerging asynchrony of the cell cycle, (4) the maternal-to-zygotic transition (MZT), which includes the partial degradation or deactivation of maternal transcripts and proteins followed by zygotic genome activation (ZGA), and (5) the molecular specification of the dorsoventral axis and establishment of the early signalling gradients that define the later body axis. Figure 1 shows an overview of early zebrafish development relevant for this thesis.
The MZT is a fundamental process in early embryonic development that is conserved across all animals and consists of the degradation or inactivation of a subset of the maternal transcripts and proteins in conjunction with the activation of zygotic genes (Vastenhouw, Cao, and Lipshitz 2019). Failure to undergo ZGA is—unsurprisingly—detrimental for embryonic development and leads to developmental arrest prior to gastrulation in zebrafish (Chan et al. 2019). The earliest detectable zygotic transcripts appear at the 64-cell stage (Chan et al. 2019), originating from the miR-430 locus, which produces a microRNA believed to facilitate the clearance of maternal transcripts during the MZT (Giraldez et al. 2006).
Another pivotal event in early zebrafish development is the mid-blastula transition (MBT). At division cycle 10, both a lengthening of the cell cycle and the onset of asynchrony become evident (Kimmel et al. 1995). Although MBT temporally coincides with the MZT, and the two processes are interconnected (Schulz and Harrison 2019), lengthening of the cell cycle is not sufficient to activate the zygotic genome (Chan et al. 2019). I will therefore follow the clear distinction of the two processes as discussed by Vastenhouw, Cao, and Lipshitz (2019) with MBT defined as the onset of cell cycle lengthening and asynchrony around division cycle 10, and MZT as the process involving partial degradation of maternal transcripts followed by zygotic genome activation.
ZGA in zebrafish has been extensively studied, and many of the molecular species involved have been identified. In preliminary data from 512-cell embryos injected with 5-ethynyl-uridine (5-EU)—a modified uridine base that incorporates into nascent (i.e., zygotic) transcripts and can subsequently be labelled with a fluorescent azide via click chemistry—most interphase nuclei show two spots of accumulating transcripts, likely the miR-430 loci (Figure 2). Interestingly, even within cells from a single embryo, there appears to be notable heterogeneity in transcriptional activity (Figure 2, white arrowheads).
A mental model that can incorporate many of the molecular players shown to be involved in zygotic genome activation (ZGA) and guide our understanding of the observed heterogeneity in transcriptional activity at the single-cell level is that of a balance between repressors and activators (Vastenhouw, Cao, and Lipshitz 2019). In this view, zygotic transcription is repressed by DNA-binding proteins that compete with the transcription machinery for access to DNA. Histones are an example of such general repressors of zygotic transcription and have been shown to delay ZGA when injected into the yolk of zebrafish zygotes. Conversely, inhibiting histone binding to DNA leads to the early onset of ZGA (Joseph et al. 2017). Another general repressor of zygotic transcription is the replication machinery, which competes for DNA access—particularly during the rapid cell divisions prior to the mid-blastula transition (MBT) (Schulz and Harrison 2019).
In contrast, activating histone modifications such as H3K27ac and H3K4me3—mediated by the histone acetyltransferase P300 and the reader protein Brd4—are essential for the initiation of ZGA. Both p300 and brd4 mRNAs are maternally deposited and translated prior to the onset of zygotic genome activation. Injection of their proteins at the 1-cell stage accelerates the accumulation of activating H3K27ac modifications, subsequently promoting earlier initiation of zygotic transcription. Inhibition of these proteins leads to developmental arrest before gastrulation, resulting in a phenotype similar to that observed with general transcriptional inhibition (Chan et al. 2019).
Another class of activating proteins comprises the transcription factors Pou5f3, Nanog, and Sox19b, which are thought to specifically bind zygotic gene loci and enable their transcription (Lee et al. 2013; Joseph et al. 2017). Figure 3 contains a summary of some of the most important factors known to be involved in ZGA.
Our current view of ZGA is compiled from years of experimental research performed in many different labs. Individual research papers highlight the components observed in their specific study, often stressing their primary importance in the process—a necessary rhetorical strategy for navigating the revision process. However, the onset of zygotic transcription—as is the case for most biological processes—is mediated by an interplay of many different molecular species and their physical interactions, with many of them potentially still unknown. Developing methods to measure many of these factors simultaneously can offer a novel perspective on their interplay and enhance our understanding of the process.
Microscopy has served as a fundamental tool in biological discovery since its inception in the 17th century. From the first description of cells attributed to Robert Hooke, via Santiago Ramón y Cajal's beautiful hand drawings of Purkinje cells at the turn of the 20th century, to the modern quantitative use of the technology (e.g., Gut, Herrmann, and Pelkmans 2018; Kramer, Castillo, and Pelkmans 2022), our ability to describe, and reason about, living systems co-developed with our technical ability to visualize and measure them. Since the early days of microscopy, several key innovations took place that enabled its use in the field of systems biology.
First, the development of fluorescence microscopy allowed the specific measurement of molecular targets. Whereas previous microscopy methods generated image contrast by amplifying differences in refractive index in the sample (e.g., phase contrast, differential interference contrast microscopy), in fluorescence microscopy the target of interest is visualized by coupling it to a fluorescent molecule that can be excited with a matching light emitting diode or laser. This allows for specific and semi-quantitative measurement of any cellular component as long as it can be tagged with a fluorescent dye. Approaches for visualizing proteins include tagging the target of interest with a fluorescent protein through genetic engineering, or employing target-specific small molecule dyes—such as DAPI for DNA or phalloidin conjugates for F-actin.
A very general way of targeting fluorescent dyes to cellular components is by using fluorescently labeled antibodies in a method called immunofluorescence (IF) microscopy. This can be achieved by either directly attaching fluorophores to an antibody (direct IF) or by using a primary antibody against a target of interest and then staining with a labelled secondary antibody that binds the primary (indirect IF). Antibodies can be raised against virtually any target of interest and allow not only the measurement of endogenous proteins but also post-translational modifications. For example, antibodies that specifically recognize phosphorylated sites on a protein—often corresponding to its activated form—can provide a more informative readout than measurements of total protein abundance. A limitation of IF is that it can only be used with fixed samples since antibodies are too large to penetrate intact cell membranes. Furthermore, the optimal staining conditions and specificity to the target need to be established empirically—especially when using antibodies raised against a homologous target in a different species or when using novel sample processing protocols.
Another innovation enabled by the invention of fluorescence microscopy was the development of microscopes that allow for optical sectioning of the sample. Confocal laser scanning microscopes (CLSM) utilize a pinhole in the emission light path to reject light emitted from fluorophores outside the focal plane. This results in reduced background signal when imaging thick samples and enables the acquisition of volumetric images. Spinning disk confocal microscopy (SDCM) operates on the same principle of pinhole rejection of out-of-focus emission, but instead of scanning a single spot across the field of view and detecting the signal with a photomultiplier tube, it scans the focal plane using an array of pinholes located on a rapidly spinning disk. The emitted light is then captured using a camera sensor, such as a Charge Coupled Device (CCD) or Complementary Metal-Oxide-Semiconductor (CMOS) sensor. This approach enables the parallel scanning of the entire field of view, reducing both the acquisition time for individual images and the overall photon load on the sample. However, it comes at the cost of reduced optical sectioning capability, especially when imaging thick samples, due to pinhole cross-talk. A third method for achieving optical sectioning is selective plane illumination microscopy (SPIM), also known as light sheet microscopy. In this modality, the sample is illuminated with a thin sheet of light emitted from an illumination objective placed orthogonally to the focal plane of the detection objective, which focuses the emitted light onto a camera sensor. SPIM offers the advantage of highly efficient use of excitation light, as only the focal plane is illuminated, minimizing photo-toxicity and making it ideal for live imaging applications. However, due to the geometry of the two orthogonally arranged objectives positioned close to the sample, these systems impose strong spatial constraints on the sample mounting chamber and are not compatible with standard multi-well plates.
Table 1 provides an overview of microscopy modalities capable of optical sectioning that were considered for the project. CLSM and related point-scanning methods such as two-photon microscopy—although superior in terms of optical sectioning performance—were deemed unsuitable due to prohibitively long acquisition times in a high-throughput context. This limitation is further exacerbated in cycling-based multiplexing, where the same sample must be imaged multiple times. SPIM was considered impractical due to its incompatibility with multi-well plates. Furthermore, most SPIM systems are optimized for live imaging in water as a mounting medium, making them less suitable for imaging fixed samples. As a result, SDCM remained the only viable option.
The animal cap of early zebrafish embryos is roughly 300 µm thick, while Thorn (2016) report a recommended maximum thickness of 30–50 µm for SDCM, and mention that thicker samples would require careful preparation. Imaging the full animal cap with SDCM therefore pushes the operational limits of those systems.
| Microscope Modality | Acq. Time [h]* | Multi-Well | Max. Sample Thickness | Photon Load |
|---|---|---|---|---|
| Confocal Laser Scanning (CLSM) | 72 | yes | 100–200 µm | maximal |
| Spinning Disk Confocal (SDCM) | 8 | yes | 30–50 µm | intermediate |
| Selective Plane Illumination (SPIM) | 8 | no | >1 mm | minimal |
The use of digital sensors—such as photomultiplier tubes for CLSM or camera sensors for SDCM, SPIM, and wide-field microscopy—was another important innovation. This advancement allows for fast and objective recording of observed samples, eliminating the need for artistic skill on the part of the user, which is essential for transitioning from microscopy images to microscopy data. Additionally, when paired with fluorescence microscopy, digital sensors enable semi-quantitative, spatially resolved measurements of fluorophore abundance, enabling the modern notion of quantitative microscopy. Another crucial advantage of digital imaging is that it allows the application of computer vision algorithms in image analysis, enabling the automated processing of large-scale image datasets.
The final technological innovation to mention is the introduction of automated screening microscopes. Systems like the Yokogawa CV8000 Spinning Disk Confocal Microscope utilize standardized multi-well plates and feature an automated stage with the capability to focus on the sample automatically, enabling the acquisition of full plates without human intervention. However, a limitation of these microscopes is that they were primarily developed for tissue culture systems. As a result, their acquisition software, at least in the case of Yokogawa microscopes, is not optimized for capturing arbitrary positions within wells.
The key innovations are summarized in Table 2. Progress across multiple scientific and engineering disciplines—including biology, chemistry, optical and electrical engineering, and computer science—has enabled the collection of large-scale image datasets of biological systems. These datasets can be leveraged in applied research, such as drug screening, as well as in basic research within the field of image-based systems biology.
| Key Innovation | Enables |
|---|---|
| Fluorescent probes | Specific labeling of molecular species, usage of light microscopy in molecular biology. |
| Optical sectioning | Acquisition of 3D volumes. |
| Digital sensors | Objective recording of the sample not limited to what can be drawn by hand. Use of computer vision in image analysis. |
| Fully automated microscopes | Usage of microscopes in high-content screening applications. |
Confocal immunofluorescence microscopy is a powerful technique for visualizing and quantifying endogenous proteins and their modifications in situ. When acquired within the linear range of the detector, each pixel value is proportional to the local concentration of the target molecule. With proper calibration it could even be converted into a quantitative concentration measurement.
To maintain the linear relationship between signal intensity and fluorophore concentration across pixels, various sources of intensity bias must be considered. Ideally, these biases should be minimized during data acquisition—for example, by calibrating the microscope, selecting a suitable objective, matching the refractive index of the mounting medium, or optimizing staining protocol conditions. The goal is to experimentally reduce each source of bias to the point where it becomes negligible for the intended analysis. Only after these experimental improvements have been exhausted should we estimate and correct for any remaining bias computationally.
When acquiring any experimental data, it is generally preferable to adjust sources of bias experimentally rather than rely on computational bias correction.
Table 3 summarizes the sources of intensity bias and how they were addressed in this thesis. While some corrections are standard in fluorescence microscopy—such as flat-field correction, dark-field and background subtraction, and bleaching correction (Peng et al. 2017)—others are more specific to 3D imaging, multiplexing, or high-throughput experiments. In volumetric imaging, novel challenges include staining bias, z-intensity decay from spherical aberration and scattering of light in the tissue, and background from pinhole cross-talk when using SDCM. Multiplexed acquisitions introduce issues like background accumulation from incomplete elution and misalignment between imaging cycles. In high-throughput settings, long acquisition times can lead to fluorophore quenching in the imaging buffer, resulting in noticeable signal decay over the course of the acquisition.
| Source of Intensity Bias | Experimental Adjustments | Computational Correction |
|---|---|---|
| Laser fluctuations | Microscope maintenance | – |
| Fluorophore bleaching | Selection of fluorophore and mounting medium | Bleaching correction* |
| Channel shift (within acquisition) | Microscope maintenance; use of objectives with chromatic aberration correction | Channel registration |
| Staining bias | Adjust incubation conditions; use direct IF for high-abundance targets | Staining bias estimation & correction |
| Channel shift (between acquisitions) | Tune protocol to minimize sample warping | (Elastic) registration between acquisitions |
| X/Y flat-field | Microscope maintenance; record flat-field images | Flat-field correction |
| X/Y dark-field | Record dark-field images | Dark-field subtraction |
| Temporal intensity decay during 8h acquisition | Longer incubation in imaging buffer before imaging | Temporal decay estimation & correction |
| Z intensity decay | Use refractive index-matched mounting medium | Z-decay estimation & correction |
| Background due to pinhole cross-talk | Use Nipkow disk optimized for volumetric imaging with smaller or fewer pinholes | Background estimation & subtraction |
| Background accumulation during multiplexing | Tune elution conditions | Background estimation & subtraction |
Peng et al. (2017) introduce a decomposition of intensity bias sources into a multiplicative (\(S(x)\) or flat-field) and an additive (\(D(x)\) or dark-field) term, allowing them to relate the measured intensity at location \(x\) (\(I^{meas}\)) to the true underlying intensity at that location (\(I^{true}\)) according to Equation 1.
\[ I^{meas}(x) = I^{true}(x) \times S(x) + D(x) \tag{1}\]
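To make the decomposition concrete, the following is a minimal NumPy sketch (not the pipeline implementation used in this thesis) that inverts Equation 1, assuming the flat-field \(S(x)\) and dark-field \(D(x)\) have already been estimated, e.g., from calibration images; the names and the toy vignetting profile are purely illustrative.

```python
import numpy as np

def correct_intensity_bias(i_meas, flat_field, dark_field):
    """Invert Equation 1: I_true(x) = (I_meas(x) - D(x)) / S(x)."""
    return (i_meas.astype(np.float64) - dark_field) / flat_field

# toy example: a constant true signal distorted by vignetting and a camera offset
true_img = np.full((64, 64), 100.0)
yy, xx = np.mgrid[-1:1:64j, -1:1:64j]
flat_field = 1.0 - 0.3 * (xx**2 + yy**2)   # multiplicative bias S(x), e.g., vignetting
dark_field = np.full((64, 64), 10.0)       # additive bias D(x), e.g., camera offset
measured = true_img * flat_field + dark_field

recovered = correct_intensity_bias(measured, flat_field, dark_field)
assert np.allclose(recovered, true_img)
```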
A major limitation of (immuno)fluorescence microscopy is the number of molecular species that can be measured simultaneously. Most microscopes are equipped with excitation and emission filters that support the simultaneous measurement of up to four channels. This limitation is driven by the available light bandwidth detectable by sensors (typically 400-750 nm) and the overlap of emission spectra from fluorophores, which can cause bleed-through between channels.
One approach to overcome this limitation is spectral unmixing, where fluorophores with overlapping emission spectra are measured on microscopes equipped with special spectral detectors. This allows for linear unmixing of the channels in a post-processing step (Sadashivaiah et al. 2023). However, this method requires custom microscopy hardware, and the practical upper limit for the number of molecular species that can be measured simultaneously with spectral unmixing is typically around seven (Sadashivaiah et al. 2023).
As an alternative way of overcoming the limitation of spectral bandwidth—in fixed samples—multiplexing protocols have been developed that involve iterative rounds of staining followed by imaging and elution of the antibodies (Gut, Herrmann, and Pelkmans 2018). Development of such protocols was a major advancement in cell biology using tissue culture models, as they enable the measurement of more than 40 proteins and post-translational protein modifications, such as phosphorylated signaling kinases.
These protocols enable us to study cellular processes involving many molecular species, while at the same time getting a comprehensive picture of cellular state, which can be measured by targeting organelles and proteins involved in various processes. Recent work from our lab applied this method to cells in the context of epidermal growth factor (EGF) signaling response, and could show that the signaling response of single cells is modulated by the cell-state context (Kramer, Castillo, and Pelkmans 2022). This is a testimony of the novel—systems level—questions that can be addressed with such methods.
Knowing about the power of multiplexing when applied to cell biology, an obvious next step is to establish the same methods to study multicellular systems. Tissue culture models are ideally suited for studying cell biology since (1) experiments done in those systems are accessible from a microscopy and image analysis perspective1 and (2) experiments with isogenic cells grown in culture have “intrinsic perturbations” built in, due to the different micro-environments that cells experience (Pelkmans 2012). But a flat tissue culture model of isogenic cells—even if originally derived from a multicellular organism—deviates strongly from the in-situ environment those cells would experience. And one might argue that immortalized cancer cell lines exhibit limited multicellularity, as intercellular coordination mechanisms are likely disrupted.
The inherent three-dimensional nature of multicellular systems demands experimental approaches capable of probing cells within their native environment. A zebrafish embryo, for instance, does not develop on a flat surface, and collapsing image data into two dimensions results in a significant loss of spatial information crucial for analysis. Although advancements have been made in multiplexing protocols for complex biological samples, methods suitable for volumetric systems face significant limitations. Many rely on physical sectioning via microtomy, thereby restricting the analysis to 2D planes (Wahle et al. 2023). Conversely, existing protocols capable of imaging intact 3D samples suffer from low throughput and are typically optimized for manual processing of only a few specimens (Ku et al. 2016; Kuett et al. 2022).
One of the most important concepts in light microscopy is that of refraction. Light is widely known to travel at a constant speed in a vacuum, but when entering an optical medium, interactions with the atoms in the medium lead to an apparent slow-down of the wavefront. The ratio of the speed of light in a vacuum (\(c\)) and this medium-dependent phase velocity (\(v\)) is referred to as the refractive index (\(n\)) of the medium (Equation 2). It can be used to describe the amount of refraction a light beam experiences at the interface between two media using the law of refraction (Equation 3; Figure 4 (a)) (Hecht 2017, 108–10).
\[ n = \frac{c}{v} \tag{2}\]
\[ \frac{\sin \theta_1}{\sin \theta_2} = \frac{n_2}{n_1} \tag{3}\]
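As a quick numeric illustration of Equation 3 (a sketch with assumed refractive indices, not values from this thesis), the refraction angle at an interface can be computed as follows:

```python
import numpy as np

def refraction_angle(theta_1_deg, n_1, n_2):
    """Angle of the refracted beam according to Equation 3 (the law of refraction)."""
    sin_theta_2 = n_1 / n_2 * np.sin(np.radians(theta_1_deg))
    return np.degrees(np.arcsin(sin_theta_2))

# a beam passing from water (n ≈ 1.33) into glass (n ≈ 1.52) at 30° incidence
print(refraction_angle(30.0, n_1=1.33, n_2=1.52))  # ≈ 25.9°
```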
In a volumetric imaging setting light not only has to travel through the internal optics of the microscope, objective, immersion medium and the cover glass, but also through the mounting medium and sample (Figure 4 (b)). The burden of correcting for refractive index (RI) mismatches up to and including the cover glass lies with the microscope manufacturers (Richardson and Lichtman 2015), and the person in charge of system maintenance. Users mainly need to ensure they use the appropriate immersion medium for the selected objective, and set the correction collar of the objective in accordance with the cover glass (or plate bottom) thickness and material. The light path through the mounting medium and the sample, on the other hand, is outside the control of manufacturers and is therefore solely the responsibility of the experimentalist. To fully utilize the optical sectioning capabilities of modern microscopes (see Section 1.2.1), careful tuning of this part of the optical path is essential. The aim is to render the sample “as transparent as possible”.
In live imaging applications, the need to preserve sample viability and health imposes significant constraints on the selection of mounting conditions. Nonetheless, attempts have been made to introduce refractive index matching solutions to mounting media (Boothe et al. 2017). However, their efficacy is limited because live tissues actively prevent the deep penetration of foreign agents. This typically results in RI matching predominantly at the sample-medium interface, rather than homogenization throughout the specimen. Despite this incomplete matching, such methods have been shown to yield significant improvements in image quality while adding minimal complexity to the experimental protocol. Given these demonstrated benefits, the relatively limited adoption of RI matching techniques in live microscopy is noteworthy.
Fixed samples offer greater experimental flexibility, benefiting from a wide array of tissue clearing protocols developed since their initial introduction over a century ago (Spalteholz 1914). The two main sources of signal degradation in volumetric imaging are (1) absorption of light by pigments in the tissue and (2) uneven light scattering due to refractive index heterogeneity, resulting in tissue opacity, i.e., a milky appearance of the tissue (Richardson and Lichtman 2015). Tissue clearing protocols address these issues by removing or bleaching light-absorbing molecules, notably hemoglobin, myoglobin, and melanin, and by homogenizing the refractive index to enhance transparency and reduce off-axis light scattering (Richardson and Lichtman 2015). In zebrafish, melanogenesis initiates around 24 hours post-fertilization (Kimmel et al. 1995), allowing for simplified clearing procedures at earlier developmental stages that focus solely on homogenization of the refractive index.
Microscopy datasets span length scales from the optical resolution limit of 10 nanometers with super-resolution techniques (Galbraith and Galbraith 2011), to whole mouse brains of around 1 cm³, facilitated by tissue clearing and microscopes like the mesoSPIM (Voigt et al. 2019). This covers six orders of magnitude, though not within a single experiment. The goal of bioimage analysis is to derive biological insight from these data. The sophistication of such analysis can range from visual inspection of the images and writing descriptive prose, to manual annotations and subsequent measurement (e.g., drawing a line and plotting an intensity profile), all the way to large automated analysis pipelines including image preprocessing, object segmentation, feature extraction, and modeling of the resulting high-dimensional data. The amount of effort that can go into the analysis of an image dataset depends on (1) the biological question addressed by the experiment for which the data were collected, (2) the quality of the data and the cost of its generation2, and (3) the amount of data that needs to be processed, including the prospect of generating similar data in the future.
In image-based systems biology, we spend a lot of time and resources generating high-quality datasets. This justifies a significant investment in the development of automated image analysis workflows. And since the amount of data is in the terabytes for a single experiment, it is well beyond what can be analyzed manually or processed on a local machine. These data therefore necessitate the development of robust image analysis pipelines that minimize the need for human intervention—especially in the early stages of processing. Furthermore, they have to be deployed to high-performance computing (HPC) infrastructure.
While the tools to build these kinds of systems are well established for 2D-projected tissue culture data (CellProfiler, TissueMAPS, …), the same does not apply to high-content volumetric data. The scientific Python ecosystem—consisting of well-established open-source computer vision libraries (ITK, scikit-image, OpenCV, …), machine learning libraries (scikit-learn, PyTorch, TensorFlow, …), domain-specific computer vision models and tools (Cellpose, StarDist, Ilastik, …), and libraries for scientific computation and visualization (polars, NumPy, SciPy, dask, napari, PyVista/VTK, matplotlib, …)—is ideally suited to tackle this problem using a single programming language.
Before introducing the unique challenges in analyzing high-content multiplexed volumetric microscopy data, I want to quickly reflect on the analysis of microscopy images in general, particularly in their relation to the broader field of computer vision, which drives a lot of innovation relevant to our science. Digital images are ubiquitous in the modern world, and their proliferation increased the demand for automation in their processing. The field of computer vision leverages classical image processing algorithms and machine learning methods to turn image data into useful numerical representations that allow their interpretation and enable downstream analysis and decision-making. Applications range from filters applied to images on social media platforms to real-time processing of video streams as they might be used in autonomous vehicles or robotics.
While there are commonalities across all domains of digital image analysis, microscopy data have some unique properties we need to be aware of when doing image analysis. Here I want to highlight three important differences between “natural images” (as resulting from regular photography or a camera feed) and fluorescence microscopy data:
Because of these differences, volumetric microscopy data is best thought of as a set of samples on a regular 3D grid, with the channels independently encoding the abundance of an underlying molecular species. This geometric interpretation of microscopy data has important implications for the appropriate ways of representing and storing them (see Section 1.3.3.1.1) and for the kinds of measurements that can be derived from them (see Section 3.2.6). It is also the reason we add scale bars to our images, something that would not make sense for an ordinary photograph.
Systems biology is a data-centric approach to biology, and we produce and process data in large amounts. But large amounts of data are just that, more of the same. In the documentation of the Vega-Altair visualization library, which provides a declarative grammar for data visualization, Stevens (1946) is mentioned as a source of inspiration for their data model. The paper was written in 1946, a time when most data processing was done with pen and paper, and the frontier of digital processing was systems like the ENIAC that performed 500 floating point operations per second (FLOPS). Just for context, today’s consumer-grade hardware is on the order of 10¹⁵ FLOPS, meaning that training a single cell cycle phase model (see Section 3.3.3) on that system, a task that runs in 5 minutes on a modern laptop, would take around 10 million years.3 Even though written in a time when data processing was severely limited when compared to today’s standards, their insights about measurement scales are just as relevant today, and the terminology introduced there will be used throughout the thesis.
Stevens (1946) take a very broad definition of measurement as the “assignment of numerals to objects or events according to rules”. Equipped with that definition, they identify four types of measurement scales that can be differentiated based on the empirical operations they allow. This property of available operations, which is captured by their mathematical group structure, determines the descriptive statistics that can be computed on them. The four types form a hierarchy, meaning that the allowed operations are cumulative: a scale higher up inherits all operations and statistics from the lower levels (Table 4).
On the lowest level are measurements on the nominal scale, within which entities have unique IDs assigned to them and the only meaningful comparison between two measurements is whether they are equal or not. An example of this is the assignment of label IDs when performing segmentation as part of image processing, where we assign pixels to be part of object instances. The assignment rule for this “measurement” is very simple: two objects that are different must be assigned different numbers. With equality being the only allowed comparison, the only statistic we can compute on them is the cardinality (number of elements) of the scale. Stevens (1946) introduce a second type of nominal assignment, where labels are assigned to classes of entities instead of instances. They call the former Type A and the latter Type B nominal scale, and point out that Type A nominal scales can be conceptualized as special cases of Type B nominal scales where each class has only a single member (i.e., a singleton set with cardinality 1). Once multiple instances per class are allowed, there are additional statistics that can be computed, for example, the mode, which identifies the most numerous class.
A big part of programming and data analysis is assigning instance IDs and types to identify components in the program. In bioimage analysis we have to keep track of a broad range of biological (e.g., cells, transcripts) and other processing relevant entities (e.g., line profiles, images, wells), each with instance counts ranging from tens to millions. This is why we will spend a significant part of the results section establishing a naming scheme for all these components (see Section 3.2.4). The strategy will ultimately boil down to defining nominal scales—which we call categorical coordinates; essentially ID columns in a data frame—that can be combined to identify those different components.
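As a hypothetical illustration of such categorical coordinates (the column names and values below are made up for this example and are not the naming scheme defined in Section 3.2.4), several nominal ID columns in a feature table can be combined to identify a single object instance:

```python
import polars as pl

# hypothetical single-nucleus feature table: the categorical coordinates
# (well, embryo, label) together identify one nucleus instance, while
# object_type is a Type B nominal assignment shared by many rows
features = pl.DataFrame({
    "well":        ["C03", "C03", "C03", "D04"],
    "embryo":      [1, 1, 2, 1],
    "label":       [17, 18, 5, 9],           # nominal: label IDs within one embryo
    "object_type": ["nucleus"] * 4,
    "mean_dapi":   [812.0, 790.5, 1033.2, 955.1],
})

# on a nominal scale only equality comparisons are meaningful
print(features.filter((pl.col("well") == "C03") & (pl.col("embryo") == 1)))
```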
The next level in the hierarchy are measurements on the ordinal scale. Here we add greater-than and less-than comparisons between measurements, so there is a way of sorting entities without a meaningful notion of differences between them. Stevens (1946) use the hardness of minerals as an example, which is constructed by comparing the hardness of a material to that of 10 reference materials without there being a well-defined difference in hardness between them. Since we have a notion of ordering, we can compute rank-order-based statistics like minimum, maximum, median and other quantiles.4
At this stage, we move from categorical features to quantitative measurements on the interval scale. This scale enables meaningful comparisons of differences between two measurements, but lacks a true zero point, i.e., a zero established on criteria beyond mere convention. A common example is the Celsius temperature scale or—more relevant in the context of spatial data analysis—the position along a spatial dimension. While there is meaning in the difference (interval) between two temperatures or positions, both rely on an arbitrary zero point, like the freezing point of water, or the origin of a reference coordinate system. There is no meaningful notion of “twice as hot” on a Celsius scale, or “position A is half of position B”, without reference to this arbitrary zero (Figure 5). We might try to divide a position by dividing its value on the coordinate axis, but this is in fact dividing the difference between two positions (i.e., the difference between that point and the origin), not the position itself. The same pair can be seen on the time axis, with time points and durations (differences in time) residing on the interval and ratio scales respectively.
These last examples of duration and difference in position (length) are measurements on the ratio scale since they refer to a true zero, which would be coinciding points or time points respectively. Other examples of measurements on the ratio scale include most physical quantities measured on absolute scales, such as temperature in Kelvin or protein abundance in moles. Note that the unit of the measurement is not relevant to determine the scale type. The categorization is invariant under multiplication by a scale factor, which is a change in measurement unit.
An overview adapted from Stevens (1946) is shown in Table 4. The fact that they form a hierarchy means that—if we have the choice and all else being equal—we should always strive for measurements on the highest possible scale. It is always possible to transition to a lower scale (e.g., ratio to interval by adding an offset, or ratio to ordinal by binning the data).
| Super Type | Type | Operations | Allowable Statistic | Physical Type |
|---|---|---|---|---|
| categorical | nominal | = | Cardinality, Mode | int, str, cat |
| categorical | ordinal | =, >, < | …, Median, Percentile | int, str, cat |
| quantitative | interval | =, >, <, + | …, Mean, Var, Corr | float* |
| quantitative | ratio | =, >, <, +, x | …, Coefficient of Variation | float* |
As an example, we can think of our microscopy data and the intensity bias sources we introduced in Section 1.2.2 in terms of their effect on the type of measurement scale. If we consider the additive sources of intensity bias in our acquisition (camera offset & background fluorescence), their combined effect is that they offset our measurements by a certain amount, which is equivalent to moving from a ratio to an interval measurement scale. If we analyze image intensities without applying proper background subtraction, we effectively move down in the hierarchy of measurement scales. This means we undermine our ability to reason about the data in certain ways—for example, making a statement like “cell A contains twice as many fluorophores as cell B” becomes invalid.
The effect of multiplicative bias sources is that they effectively mean we are measuring each pixel with a different unit. We can think of the whole process of intensity bias correction as (1) restoring ratio scales from interval scales in each pixel by subtracting the background (dark-field) so that a zero in pixel value corresponds to no fluorescent signal in that location, and (2) restoring a uniform unit5 of measurement between all samples from that ratio scale by ensuring that a flat signal distribution is represented by constant intensity values in our data (flat-field).
These measurement scales are present throughout the analysis of a high-content screening dataset, or really in any kind of data analysis. Placing measurements within this hierarchy is valuable as it enables us to reason about them in more general terms. The aggregations of voxels6 that we will discuss in the section on feature extraction (see Section 3.2.6) are all features derived from quantities on one of those scale types, which constrains the descriptive statistics (i.e., aggregations) that are allowed.
A broader example of the utility of this mental model in data science is the distinction between categorical and quantitative features in a machine learning pipeline. These feature types require different preprocessing methods—for instance, categorical features are often one-hot encoded, while quantitative features may be mean normalized or log transformed. These differences in processing arise directly from the valid operations inherent to each type of measurement scale.
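A minimal sketch of this distinction (illustrative values only, not the preprocessing used in this thesis): categorical features are one-hot encoded because only equality is meaningful, while quantitative ratio-scale features can be log-transformed and mean-normalized.

```python
import numpy as np

# categorical feature: one-hot encode (only equality comparisons are meaningful)
phases = np.array(["G1", "S", "G2", "S"])
classes = np.unique(phases)                                  # ['G1', 'G2', 'S']
one_hot = (phases[:, None] == classes[None, :]).astype(int)  # shape (4, 3)

# quantitative (ratio-scale) feature: log-transform, then mean-normalize
intensity = np.array([120.0, 850.0, 95.0, 430.0])
log_intensity = np.log10(intensity)
normalized = log_intensity - log_intensity.mean()
```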
Images are often represented as n-dimensional arrays in memory, which are data structures consisting of a collection of values of uniform type, accompanied by additional metadata that defines the shape of the array (e.g., height and width in the 2D case). In Python this means using a NumPy array or one of its many descendants that share a similar application programming interface (API). These add important new functionality like lazy evaluation (dask), automatic differentiation and GPU support (TensorFlow, PyTorch7) or symbolic differentiation and compilation (PyTensor). Since camera sensors produce pixels on a grid with even spacing, the x and y coordinates of the pixels can be encoded implicitly in the array’s shape, saving a significant amount of memory. However, in the context of spatial images a “naked” array is not enough. We also need at minimum the pixel spacing along each spatial dimension ([dz], dy, dx). This becomes particularly salient when we move to 3D data, which is often sampled at a lower rate along the axial dimension. An algorithm that is unaware of anisotropic pixel spacing will effectively analyze a distorted image, leading to incorrect measurements.
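The following sketch (with an assumed, illustrative voxel spacing) shows how ignoring anisotropic sampling distorts even a simple distance measurement:

```python
import numpy as np

# two voxel indices (z, y, x) in a stack sampled at 2.0 µm axially and 0.3 µm laterally
spacing = np.array([2.0, 0.3, 0.3])                     # (dz, dy, dx) in µm
a_idx = np.array([10, 100, 100])
b_idx = np.array([15, 110, 130])

naive_distance = np.linalg.norm(b_idx - a_idx)                  # index space: distorted
physical_distance = np.linalg.norm((b_idx - a_idx) * spacing)   # physical space, in µm

print(naive_distance, physical_distance)  # ≈ 32.0 "pixels" vs. ≈ 13.8 µm
```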
When seeking an appropriate raster data model, we can find valuable inspiration in related fields. The Insight Toolkit (ITK) library (McCormick et al. 2014), developed in the biomedical imaging community for processing volumetric image data (e.g., from magnetic resonance imaging, ultrasound, etc.), provides not only many powerful and efficient processing algorithms but also a geometric image representation, complete with all the necessary metadata for spatial image analysis. One might argue that the data model has been key in enabling the development of these algorithms. Given that biomedical imaging modalities, just like microscopy data, are regularly sampled grids, the ITK data model serves as an excellent reference for how to conceptualize bioimage data.
Although ITK is written in C++, it is largely accessible in Python through bindings. The ability to make performant code, written in compiled languages closer to the hardware, accessible in Python is one of the key reasons for the language’s popularity. However, a downside is that ITK was not originally designed with Python as the main target, and the documentation is written for the C++ library, making it less ergonomic or “Pythonic”. This can make its use challenging, particularly for those new to programming.
Within the Python ecosystem, the xarray library (S. Hoyer and Hamman 2017; Stephan Hoyer et al. 2025), originating in the geospatial science community, is an exciting project that provides a low-level generalization of nd-arrays with named dimensions and coordinate arrays as axes along those dimensions. Building on the xarray library, Matt McCormick, one of the core developers of the ITK library, built spatial-image, which imposes constraints on an xarray.DataArray (like uniform sampling and allowed dimension names) to arrive at an adequate in-memory model for spatial image analysis in Python (McCormick n.d.). It should come as no surprise that spatial-image closely aligns with the ITK image model.
A spatial image is an array of up to 5 labelled dimensions from the set {'c', 't', 'z', 'y', 'x'}, corresponding to channel, time and three spatial dimensions. Each spatial dimension has a coordinate axis containing the pixel center positions along that dimension, and they must be uniformly spaced. The time dimension has a coordinate that can optionally hold timestamps or just an integer index, and the channel coordinate (c_coord) can hold string identifiers for the channels. This last point is especially relevant when dealing with multiplexed data, where the number of channels can reach 40 or more (Gut, Herrmann, and Pelkmans 2018; Kuett et al. 2022). In such cases, channels are easily confused in code when they are represented as unlabeled arrays, increasing the likelihood of errors.
Figure 6 (a) shows an example of an image of shape (c: 3, y: 2, x: 1). Note that each dimension is annotated with a coordinate of the same cardinality. For the channel coordinate, the values are human-readable strings that describe the contents of each channel. In contrast, for the spatial dimensions, the coordinate arrays contain numerical values representing positions along the respective axes. Each dimension can either be accessed via the usual array integer index operations (.isel) or by indexing into the coordinates (.sel).
Another nice property is that upon selecting a single element from a coordinate (Figure 6 (b)), or when squeezing away a singleton dimension, xarray keeps the coordinate entry of the degenerate singleton axis around. Because of this, the channel name remains accessible when selecting a channel. The same applies when squeezing a singleton dimension, rendering the operation reversible (Figures 6 (c) and 6 (d)). Although spatial-image is not without its limitations (see Section 1.3.3.1.2), the use of labeled dimensions and the intelligent behavior of squeezable coordinates retaining their labels are particularly useful features.
The use of labeled dimensions eliminates the need to enforce a specific ordering of dimensions or, more problematically, to rely on code attempting to infer dimensions based on ambiguous rules. For example, an array with shape (3, 100, 100) is probably a 3-channel 2D image. Or a 3D stack with three z-slices. Or three frames of a time-lapse movie. Spatial-image solves this problem by allowing all common image types to be encoded via the dimension labels. Furthermore, the categorical coordinate attached to the channel dimension allows the specification of human-readable channel IDs. Finally, squeezable singleton dimensions that retain their label allow for metadata to be stored in those coordinates without having to maintain singleton dimensions. This sounds to me like having your cake and eating it too. Whenever we interface code that expects singleton dimensions, we can just expand the relevant dimensions.
Figure 6: xarray.DataArray / spatial-image coordinate selection behavior. (a) Printing the example image[c, y, x] of shape (c: 3, y: 2, x: 1) shows the channel coordinate ('DAPI', 'PCNA', 'pH3') and the spatial coordinates y and x. (b) Selecting a single channel drops the channel dimension (c) but retains the coordinate label ('PCNA'). (c, d) Squeezing the singleton (x) dimension also retains the label (0.0), rendering the operation reversible.
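The behavior shown in Figure 6 can be reproduced with plain xarray (a minimal sketch without the additional constraints imposed by spatial-image):

```python
import numpy as np
import xarray as xr

img = xr.DataArray(
    np.arange(6).reshape(3, 2, 1),
    dims=("c", "y", "x"),
    coords={"c": ["DAPI", "PCNA", "pH3"], "y": [0.0, 1.0], "x": [0.0]},
    name="image",
)

pcna = img.sel(c="PCNA")        # the 'c' dimension is dropped ...
print(pcna.coords["c"].item())  # ... but the scalar coordinate 'PCNA' is retained

squeezed = pcna.squeeze("x")          # drop the singleton x dimension
restored = squeezed.expand_dims("x")  # reversible: x returns with its coordinate (0.0)
```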
In Figure 7 (a) we have a minimal example of constructing a SpatialImage from a numpy.ndarray. The dimension names (dims) define the image type, the sampling scale is used to construct the spatial coordinates, and the channel coordinate (c_coord) contains a string label for each channel. After indexing into the channel axis by channel name (.sel(c="DAPI.0")), we can plot the resulting 2D image to see all its components: measured values (val) of a molecular species (c) sampled on a regular grid in physical space (y, x) (Figure 7 (b)).
import numpy as np
from spatial_image import to_spatial_image

# Create a multichannel spatial image
cyx_img = to_spatial_image(
    np.arange(36).reshape(2, 3, 6),
    dims=["c", "y", "x"],
    scale={'x': 0.5, 'y': 1.0},
    c_coord=["DAPI.0", "pH3.1"],
    axis_units={"x": "μm", "y": "μm"},
    name="val [AU]",
)
# select a channel by name
yx_img = cyx_img.sel(c="DAPI.0")

Figure 7: Constructing a SpatialImage from a numpy.ndarray. Channel names are given via c_coord. The sampling scale defines the spatial coordinates. All components, except the values, can have units in this representation.
People familiar with the ITK software guide will recognize the visualization as an adaptation of Figure 4.1 from Chapter 4 (Johnson, McCormick, and Ibáñez 2024). The red rectangle, called the pixel coverage, and the points at the pixel centers indicate the two perspectives on a pixel one should keep in mind. A pixel is neither one nor the other but both! The two perspectives are what is called dual in geometry. We usually tend to emphasize the “point view”, because it is simpler to grasp, but it is not the full picture.
Also, note how I slightly cheated by adding the unit for the pixel value to the name (name="val [AU]"). I did this to emphasize that (1) the value can be viewed like a coordinate as well and (2) it needs a unit, even though in fluorescence microscopy we often do not calibrate our measurements, hence the unit is arbitrary (AU). Calibrated or not, the fact that we could be compelled to put a unit there tells us something about the image data. Furthermore, this gives us a column name (val) that we can use when storing pixel values in a table.
When looking up the citation for this section I found this gem. Just replace “medical images” with “microscopy images” and cross out some medical terms.
Medical images with no spatial information should not be used for medical diagnosis, image analysis, feature extraction, assisted radiation therapy or image guided surgery. In other words, medical images lacking spatial information are not only useless but also hazardous.8 - ITK Software Guide
I believe defining the origin as the center of the bottom-left pixel is slightly off. Instead, the origin should be either (a) exactly at the bottom-left corner of the image coverage, or (b) at the center of the entire image.
People generally agree that the overall image coverage should not change when creating image pyramids by resampling. When defining the image origin relative to the bottom-left pixel center, the individual images of multiscale stacks have origins that shift relative to each other. This does not happen if the origin is defined relative to the entire image coverage.
Consider all possible samplings of a fixed image coverage with integer values from \([1, \infty)\). At \(1\)—which corresponds to the worst possible sampling, a single-pixel image—the pixel center coincides with the image center (b). As sampling improves and tends towards infinity, the pixel center converges to the bottom-left corner of the image coverage (a). Therefore, we should fix the origin in one of those two locations, not somewhere in between.
If we correctly account for origin shifts during resampling—and properly transform and store the origins for each pyramid level—measurements remain accurate. However, “defaulting” to shifted coordinate systems across multiscale levels seems like a foot gun. Downsampled images are often used for segmentation and measurement, so forgetting to apply the necessary transform even once can introduce subtle errors. Such errors might be small but can be critical—for example, when measuring intensity precisely at the center of a single-molecule FISH spot.
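A small numeric sketch of the argument above (a 1D toy example, not tied to any particular library): if the origin is defined at the first pixel center, it drifts between pyramid levels of the same image coverage, whereas the coverage edge and the image center stay fixed.

```python
import numpy as np

def pixel_centers(n_pixels, extent=1.0):
    """Pixel centers for n_pixels uniformly covering a fixed 1D image extent,
    with coordinates measured from the lower edge of the image coverage."""
    spacing = extent / n_pixels
    return spacing / 2 + spacing * np.arange(n_pixels)

# the same image coverage sampled at three pyramid levels
for n in (8, 4, 1):
    print(n, pixel_centers(n)[0])
# first pixel center: 0.0625, 0.125, 0.5 -> an origin defined at the first pixel
# center shifts between levels; one defined at the coverage edge (0.0) or the
# image center (0.5) does not.
```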
The representation of images using xarray.DataArray/spatial-image has a couple of drawbacks:
Just because I am pointing out limitations of spatial-image does not mean I am suggesting going back to a non-geometric view of microscopy data! The ITK image model is what we should orient ourselves by in bioimage analysis. A microscopy image is a regularly sampled grid of multiple (c) molecular measurements (val) in 3DT space (t, z, y, x). Each pixel has—in addition to the molecular information—two geometric representations in terms of its center and coverage.
We have already mentioned that the pixels in an image can have different meanings, depending on how the image was acquired (see Sections 1.3.1 and 1.2.1), even though on a lower level, all of them encode a physical quantity of light hitting the detector at a certain location. While we can acknowledge those differences, from an image processing point of view they are all very similar.
A different type of raster data often encountered during image analysis is the binary mask. Binary masks have a boolean9 value for each pixel, partitioning an image into a foreground and a background region. A binary mask is a way of representing a single region (or object) in raster data.
Related to the mask is the label image, which is usually of unsigned integer type and stores a label ID for each pixel, assigning it to an object, with 0 commonly used as the background label. The pixel values of a label image are only used to differentiate objects and are otherwise meaningless, which means they are nominal (see Section 1.3.2). This also means that the kinds of operations that are valid for a label image differ from the ones commonly applied to intensity images. As an example, we can look at linear interpolation between two pixels, which is done when resizing or spatially transforming an image. In a label image this is an invalid operation: if we interpolate between two label IDs, we effectively create new single-pixel objects, or pixels that are assigned to objects located elsewhere in the image.
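A brief sketch of this pitfall using scipy.ndimage (the toy label image is illustrative): interpolating label IDs during resizing fabricates label values, whereas nearest-neighbour interpolation preserves them.

```python
import numpy as np
from scipy import ndimage

labels = np.array([
    [1, 1, 0, 0, 2, 2],
    [1, 1, 0, 0, 2, 2],
], dtype=np.uint16)

# linear interpolation (order=1) invents intermediate "label IDs" between objects
bad = ndimage.zoom(labels.astype(float), 2, order=1)
# nearest-neighbour interpolation (order=0) keeps the nominal IDs intact
good = ndimage.zoom(labels, 2, order=0)

print(np.unique(bad))   # fractional values that belong to no object
print(np.unique(good))  # [0 1 2]
```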
Binary masks and label images are closely related, with the mask representing a single object and the label image representing a collection of objects. It is common in some domains to use label images for representing multiple different kinds of objects; for example, a segmentation from an MRI image might contain segmented kidneys, a liver and a heart. In biology, we usually have many objects of the same kind. For us, it makes things simpler to enforce homogeneous object types as part of the label image specification. If we want to share a specification with the biomedical imaging community10 we could try to convince them that a biomedical label image is of object type “organ”—or we define two label image subtypes (e.g., semantic and instance label image).
LabelImages contain objects of homogeneous object type, like cells or nuclear speckles. This means our IDs for label images are also IDs for an object type.
We can think of a label image as a set of disjoint binary masks, one for each label object. Whenever we want to extract measurements from the objects in a label image we need to convert it to these masks,11 which is why I think of it as the appropriate mental model to have. The reason we use label images is that for large numbers of objects it would be memory inefficient to represent them all using individual masks. Nevertheless, the label image representation is merely a memory optimization of the relevant unit of processing: the set of disjoint masks.
This will be a recurring pattern: our data structures are modeled based on computational efficiency, not conceptual clarity. I am certainly in favor of efficient computation and memory use, but it can sometimes obscure important insights about the data. As a brief teaser for Section 3.2.4, we can consider the implications of representing a label image as a stack of disjoint masks. To be able to losslessly convert between the two representations, we need to store the label ID for each mask. This is akin to adding a label coordinate along the stacking dimension that contains the integer label IDs of the label image. Furthermore, this is exactly the coordinate we need if we were to annotate the objects in a label image with measurements, where a label column12 serves as the ID column in the feature table.
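A brief sketch of this mental model, using xarray to carry the label coordinate (the array values are made up):

```python
import numpy as np
import xarray as xr

# A tiny label image with two objects (IDs 1 and 2) and background 0.
label_img = np.array([[0, 1, 1],
                      [2, 2, 0]], dtype=np.uint16)
ids = np.unique(label_img)
ids = ids[ids != 0]                                  # drop the background label

# Stack of disjoint masks with an explicit "label" coordinate along the stacking dimension.
masks = xr.DataArray(
    np.stack([label_img == i for i in ids]),
    dims=("label", "y", "x"),
    coords={"label": ids},
)

# Lossless conversion back: multiply each mask by its ID and collapse the stack.
roundtrip = (masks * masks["label"]).max("label").astype(label_img.dtype)
assert np.array_equal(roundtrip.values, label_img)
```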
The final raster image type I want to introduce is the distance transform. The most common one is the Euclidean distance transform, which—starting from a binary mask, i.e., a single object—assigns each pixel the Euclidean distance to the nearest non-zero pixel, meaning the distance to the object border. The signed Euclidean distance transform additionally differentiates between the inside and the outside of the object. We will use the convention of positive values for pixels inside the object and negative values outside. Other distance transforms we will encounter in this text measure the distance along a spatial axis; they will be used in Section 3.2.7.1 to apply z intensity bias correction to the image volume. Distance transforms are more similar to intensity images than to masks or label images, because their content is quantitative (ratio), and their pixels are therefore of floating point type.
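As a small sketch (using scipy, with the sign convention stated above), the Euclidean and signed Euclidean distance transforms of a toy mask can be computed as follows:

```python
import numpy as np
from scipy import ndimage

mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True                               # a single 3x3 object

# Distance of every background pixel to the nearest object pixel (0 inside the object).
dist_to_object = ndimage.distance_transform_edt(~mask)

# One common way to build the signed transform: positive inside, negative outside.
signed = ndimage.distance_transform_edt(mask) - ndimage.distance_transform_edt(~mask)
```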
While in the early stages of bioimage analysis we are often working with images, we need to keep in mind that every image is a regular sampling of an underlying continuous13 signal distribution. Measurements obtained from images will generally not align with the arbitrary sampling grid imposed by digital sensors. The same holds true when performing spatial operations on an image, which often necessitate the conversion between physical coordinates and an image’s discrete sampling grid.
Aside from raster data, there are data structures that are defined in physical coordinates, like points, lines and higher order simplices,14 which can be further combined to form simplicial complexes. Many computer graphics libraries use k-simplices up to order 2 (i.e., triangles), and sometimes other polytopes (e.g., quadrilaterals), and combine them into meshes.15
These continuous representations of geometric objects represent the second type of spatial data and are called vector data. The distinction between vector and raster data is familiar to anyone who has used graphical software tools like Affinity Designer or Inkscape16 and had to struggle with images becoming pixelated when resizing them. The same issues do not exist with vector data as they do not rely on arbitrary sampling grids.
The analysis of spatial image data requires flexibly moving back and forth between these continuous vector data types and raster data types. In this context, it is important to recognize images as special cases of regular sampling within the broader category of geometric (or spatial) data, which encompasses all vector data, with meshes representing the most general form. A good example of that can be found in the data model of PyVista, a Python library geared toward the visualization and analysis of meshes. Within that library, images are only distinguished from other data structures insofar as their regular sampling grid enables more efficient data storage. This justifies the use of specialized data structures that take advantage of these regularities. Beyond that, images are allowed to be their true geometric selves. In my view, this is the perspective we should ultimately adopt when considering microscopy images, although it may not be the most beginner-friendly.
PyVista, including its data model, is built on top of the Visualization Toolkit (VTK), another powerful open-source platform developed by Kitware, a company also known for creating ITK (Schroeder, Martin, and Lorensen 2006). Fortunately, PyVista excels at bridging the gap between the high-performance C++ code of the VTK library and a user-friendly higher level Python interface that is well-documented. The remaining challenges that do arise when using the library mostly stem from the inherent difficulty of understanding geometry and topology, which can be tricky to grasp at times.
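As a small sketch of this perspective (dimensions, spacing and values are hypothetical, and the API names are those of recent PyVista versions), an image volume can be created as a regular grid and, if needed, cast into a general mesh:

```python
import numpy as np
import pyvista as pv

# A microscopy volume as ImageData: a regularly sampled special case of a mesh.
grid = pv.ImageData(dimensions=(64, 64, 16), spacing=(0.65, 0.65, 2.0), origin=(0.0, 0.0, 0.0))
grid.point_data["intensity"] = np.random.default_rng(0).integers(0, 4096, grid.n_points)

print(grid.n_points, grid.n_cells)        # points and cells are implicit in the regular grid
mesh = grid.cast_to_unstructured_grid()   # drop the regularity, keep the geometry
```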
A large part of image analysis, mainly the later stages of processing once images have been reduced to a more manageable amount of data, uses a tabular format. While I started the section trying to highlight the difference between raster and vector data as two important forms of spatial data, tables are so flexible that they can in fact be used to represent both. However, storing an image as a table is not something we want to do very often, since it adds a lot of redundant information.17
There are cases where it is appropriate, though. Say we have single-pixel annotations for training a pixel classifier tool like Ilastik (see Section 1.3.6.1) and want to feed the data into a machine learning model that expects tabular data, while also storing these sparse annotations efficiently in case we want to reuse them in future models. In that case, storing a binary mask of the full volume, just for a couple of hand-drawn lines, is much less efficient than shifting to a tabular (sparse) representation of the image, where each annotated pixel is an individual point, allowing us to throw out all the empty pixels. Furthermore, if we wanted to reuse the annotations for training a model at a different image resolution, we would simply need to decide how to select pixels at the desired resolution from those physical points.
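A minimal sketch of such a conversion (spacing, class IDs and the stroke itself are made up), using polars for the tabular side:

```python
import numpy as np
import polars as pl

annotation = np.zeros((512, 512), dtype=np.uint8)   # 0 = unlabeled
annotation[100, 200:220] = 1                        # a short hand-drawn stroke, class 1
spacing = (0.65, 0.65)                              # um per pixel along (y, x)

# Keep only the annotated pixels and store them as physical points.
y, x = np.nonzero(annotation)
points = pl.DataFrame({
    "y_um": y * spacing[0],
    "x_um": x * spacing[1],
    "class": annotation[y, x],
})
print(points.height)   # 20 rows instead of 512 * 512 stored pixel values
```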
The example above highlights an important point: the two most important basic data structures used in bioimage analysis, nd-arrays (including ITK metadata) and tables, represent two extreme ends of a spectrum. Arrays are useful to represent densely sampled data with some regularity in the sampling that can be exploited, a regular grid in the case of images. Tables can be used to represent any data, but not necessarily very efficiently. They excel at representing sparse or unstructured data, which is the reason why in the later stages of analysis there is a shift from arrays to tables. Finally, the spatial-image representation using arrays with spatial coordinates sits in between those two extremes.
Since tables are so flexible in representing sparse data, which is in fact most data outside the natural sciences and imaging applications, there are many tools available for tabular data processing. These include spreadsheet programs, data frame libraries and relational databases. Even though spreadsheets can represent all kinds of tabular data, and are arguably the most important way of making this kind of processing accessible to people outside of programming, they are not strict in enforcing some important properties that are crucial for building robust, scalable analysis workflows. These are (1) homogeneous column types and (2) table schemas, which allow a table to be defined in terms of its column names and their associated types; (3) consistent handling of missing data or null values is noted here as a side point.
By leveraging the first two properties, we can design data processing pipelines even for data that is not yet available, simply by defining table schemas and the transformations between them. The last one can help in designing them in a way that does not immediately crash and burn as soon as the inevitable inconsistencies that come with real world data are encountered. But missing data is a notoriously difficult problem—often requiring domain knowledge and analyst choices to resolve—and can therefore not be offloaded to the data processing technology.
Spreadsheets like Excel and Google Sheets, data frames as implemented in libraries like pandas and polars in Python or the tidyverse in R, and tables as relations in the sense of the relational data model going back to Codd (1970) represent different levels of strictness with regard to the organization of tabular data. As such, they target different users and user needs, which I would summarize like this:
Spreadsheets are best used with data that is small and requires a lot of manual interaction. An appropriate amount of data is about as much as fits on a computer screen. In that context they shine by making it very easy, even for non-technical users, to interact with the data while providing some powerful automation capabilities. On the other hand they make it very difficult to enforce rules for data integrity which makes them better suited for being processed by humans than machines.
Data frames sit somewhere in the middle, mostly geared toward technical users with data on a scale that fits into computer memory, on the order of gigabytes. They are often the tool of choice for scientists that need to process moderate amounts of data reproducibly but mostly interact with large chunks of it in an interactive programming environment like Jupyter notebooks. They are ideal for exploratory data analysis (EDA) and data science, enabling automation in an environment where the data sources and requirements are changing very quickly. Enforcement of data integrity that goes beyond ensuring correct column data types is optional and happens at runtime, either by using third-party libraries like pandera or by writing custom validation code (Bantilan 2020). Data frames are the dominant way of interacting with tabular data in statistics and the natural sciences. Scientists value their flexibility and ease of use, while having little need and incentive to spend time concerning themselves with database schemas.
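As a sketch of such runtime validation with pandera (the column names and checks are hypothetical):

```python
import pandas as pd
import pandera as pa

# Schema for a hypothetical single-cell feature table.
schema = pa.DataFrameSchema({
    "label": pa.Column(int, pa.Check.gt(0)),          # object ID
    "volume_um3": pa.Column(float, pa.Check.ge(0)),
    "mean_dapi": pa.Column(float, nullable=False),
})

features = pd.DataFrame({
    "label": [1, 2],
    "volume_um3": [350.0, 410.5],
    "mean_dapi": [120.3, 98.7],
})
schema.validate(features)   # raises a SchemaError if a column or value violates the schema
```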
Relations, the term for tables in the context of relational database management systems (RDBMSs), require the highest degree of technical expertise from their users. On the other hand, they offer robust mechanisms to ensure data integrity, not just for a single table but also across multiple connected tables. Each table contains a primary key column, which acts as a unique identifier for each row (record). Furthermore, tables can be linked by adding foreign key constraints, which specify corresponding columns between different tables. RDBMSs strictly enforce table schemas and additionally allow constraints on single or multiple columns. Furthermore, they often provide functionality to manage the authorization of multiple users with differing degrees of access to the data, who may all be editing the system concurrently, while ensuring a consistent state of the database. The guarantees that transactions either complete fully or not at all, always leave the database in a valid state, do not interfere with each other, and persist once committed are commonly known by the acronym ACID (atomicity, consistency, isolation, and durability).
Databases come in a huge variety of specializations for different use cases, relational databases being only one flavor, albeit a very common one. Even within just relational databases, there are many different form factors ranging from in-process to large distributed systems, unified by the aforementioned data integrity principles and the shared query language, the Structured Query Language (SQL). As such, relational databases are the dominant choice for production systems that value correctness, scalability and multi-user access over flexibility. They are commonly used by entities that have the resources to employ people mostly or only concerned with managing those systems.
A particularly influential contribution to tabular data processing using data frames is the work by Wickham (2014) on tidy data, which underpins the tidyverse ecosystem in R. The main message is quite simple to grasp: a data frame can come in different form factors, many of them suboptimal for data processing. Wickham (2014) defines a table as tidy when (1) each row corresponds to an observation (or object), (2) each column to a variable (or feature) and (3) each type of observational unit forms a table. It should come as no surprise that this data model was, at least in part, inspired by the work on the relational data model mentioned before. As it turns out, the people who have been dealing with tabular data for years, and built large-scale real-world infrastructure around it, did arrive at some important insights. Wickham (2014) did important work in publicizing some of those insights among scientists and statisticians without formal training in computer science.
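A small example of the most common tidying operation, melting a wide table (one column per stain, with made-up column names) into the long form where each row is one observation:

```python
import pandas as pd

wide = pd.DataFrame({
    "label": [1, 2],
    "mean_DAPI": [120.3, 98.7],
    "mean_PCNA": [45.1, 60.2],
})

# The channel names are values, not variables, so they move from the header into a column.
tidy = wide.melt(id_vars="label", var_name="channel", value_name="mean_intensity")
print(tidy)
#    label    channel  mean_intensity
# 0      1  mean_DAPI           120.3
# ...
```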
Within Python the dominant data frame library is pandas, which was instrumental in enabling the use of the language in data science applications (McKinney 2010). Being an old library built on NumPy, it did inherit some peculiarities, like the fact that only floating point arrays can hold null values (np.nan). As a consequence of relying on NumPy as a backend, pandas also inherits the implicit type casting that occurs when null values are encountered in a non-float array. This is just one example of an issue caused by the circumstances of the early Python ecosystem. These problems are of course best known to the pandas developers, including the creator of the library, who mentions it among others in this blog post. The fact that the developers are worrying about these things is a sign of a healthy library, and they are in fact the people who pay most of the cost, by having to address them without risking the stability of the library. For the specific issue of missing values there are at this point not one but two solutions: one with nullable extension types internal to the library, and the other via integration of Apache Arrow data types that handle null values gracefully via bitmaps. NumPy still remains the default backend in pandas, and might remain so for quite a while, which shows the downside of being the most popular library: it is hard to change things because so many others rely on you. I would probably default to using Arrow data types when using pandas in a new project. Note that I have not used the library extensively in that way, so there could be peculiarities that make it impractical that I do not know about.
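The behaviour and the two remedies can be seen in a few lines (the Arrow-backed variant assumes pyarrow is installed):

```python
import pandas as pd

print(pd.Series([1, 2, None]).dtype)                    # float64: the integers are silently cast
print(pd.Series([1, 2, None], dtype="Int64").dtype)     # nullable extension type, keeps pd.NA
print(pd.Series([1, 2, None], dtype="int64[pyarrow]"))  # Arrow-backed dtype, nulls via bitmap
```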
A breath of fresh air for Python data frames was the emergence of the polars data frame library, in terms of (1) adapting important concepts from database research, like lazy query evaluation that allows for query optimization, (2) multithreaded execution and vectorized operations, (3) committing to cheap interoperability with modern specifications, namely the Apache Arrow columnar format, and (4) reimagining the Python data frame API, which up to this point was dominated by pandas.
This decision by the library developers to diverge from the pandas API was quite bold, as most people, myself included, tend to grow attached to familiar APIs. Generally, it is considered wise to stick closely to established standards when the goal is to persuade users to adopt a new library. Fortunately for polars, and largely due to pandas’ age and the gradual accumulation of functionality, there is not really one consistent pandas API. Instead, there are multiple ways of achieving the same result—with some enthusiasts vehemently insisting that “method chaining is idiomatic pandas,” while most others simply shrug and focus on getting their work done whichever way they please. For this reason—and enticed by the promise of better performance—I became an eager early adopter and learned a valuable lesson about using a library before its first stable release.18 Since the 1.0 release last year, stability is not an issue anymore, so maybe it is a good time to try something new? I am very pleased with the API, albeit with a bit of a learning curve coming from pandas, and you would have a hard time convincing me to switch back. Since method chaining is the only way to use the library, and it is so tempting to just keep typing, I probably tend to err on the side of overly long statements—which can quickly turn into spaghetti. Other than that, it’s consistent, it’s fast, and it’s fun to write.
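A short, chained polars query over hypothetical per-object feature files gives a flavour of the API; nothing is executed until collect(), which lets the engine optimize the whole plan:

```python
import polars as pl

summary = (
    pl.scan_parquet("features/*.parquet")      # lazy scan over many files
    .filter(pl.col("volume_um3") > 100)        # drop segmentation debris (arbitrary cutoff)
    .group_by("embryo_id")
    .agg(
        pl.len().alias("n_nuclei"),
        pl.col("mean_pol2_s2p").median(),
    )
    .collect()                                 # only now is the optimized query executed
)
```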
Most commonly used relational database management systems (RDBMSs like SQLite, PostgreSQL, MySQL, etc…) are optimized for online transaction processing (OLTP) workloads, which means they usually store tables as rows, called records, and it is easy to edit (i.e., add, remove, mutate) individual rows within tables (Hipp 2020). While this is great for building an online shop, our use case in biological research does not involve a lot of single-row edits, but rather large amounts of data being processed and written in batches, and would therefore be considered an online19 analytical processing (OLAP) workload. In recent years, there has been exciting development in this area, largely driven by the industry’s need for large-scale analytics while also integrating with existing business logic built around OLTP databases and maintaining the safety guarantees they rely on. An exciting project that demonstrates there is still room for innovation, even with a technology as well-established as an RDBMS, is DuckDB. It bridges the gap between the data frame world and relational databases, similar to polars, but from the other direction (Mühleisen and Raasveldt 2025).
DuckDB is an open-source, in-process analytical database that stores data in a single-file binary format. In this regard, it draws inspiration from SQLite, the most widely deployed database, which is found on every phone and most computers (Hipp 2020). SQLite’s widespread use is largely due to its self-contained, cross-platform, in-process design and single-file format, making it easy to deploy without the need for a server. The key distinction between SQLite and DuckDB is that the latter is optimized for analytical workloads. DuckDB offers client libraries for nearly all common programming languages, including Python, R, Julia, Java, and many others.
Installing DuckDB is very straightforward, as the entire system comes bundled in a single self-contained binary. For Python users, if the command line interface (CLI) is not needed, a simple pip install duckdb is all it takes. The only caveat is that, while multiple processes can be set up to read from the database, only a single process can write to a database at a time. This limitation follows from the in-process design and was a deliberate choice.
Using DuckDB as a feature store for our data, where processing typically occurs on a distributed computing cluster, would therefore (a) require some manual coordination and the potential introduction of a bottleneck when writing to a single database, or (b) involve intermediately writing to separate databases with a downstream consolidation step. While this is not a significant issue—similar to what we do when using data frames and writing to multiple Parquet files—it does mean that a major advantage of using an ACID-compliant DBMS is somewhat diminished.
Still, DuckDB is a very interesting project to keep an eye on. What I find most impressive is how flexible the system is when used with data sources outside its own binary file format.20 After running a hyperparameter grid search on the cluster with more than 15,000 jobs, each writing a three-row Parquet file containing hyperparameters and scores, the polars Parquet reader was incredibly slow, taking around 19 minutes,21 which was expected given the inefficiency of reading so many tiny files. In contrast, DuckDB handled this demanding query impressively quickly, finishing in around 1.5 minutes.
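The corresponding query is short; the paths and column names below are hypothetical stand-ins for the grid-search output described above:

```python
import duckdb

# Consolidate thousands of small Parquet files with a single query.
best = duckdb.sql("""
    SELECT *
    FROM read_parquet('grid_search/*.parquet')  -- glob over all per-job result files
    ORDER BY score DESC
    LIMIT 10
""").df()                                       # materialize the result as a pandas DataFrame
```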
Another example of impressive performance against atypical data sources is queries against JSON data, as demonstrated in one of the examples on the website, where queries against 2.3 GB of compressed (18 GB uncompressed) JSON data run in just seconds. Due to this performance, combined with easy deployment and the ability to retain most of the benefits of DBMSs, DuckDB is a valuable tool to keep in the back pocket for specific use cases, or to enable exploration of an adjacent field of tabular data processing with a low barrier to entry.
Having introduced the key data structures, we now turn to a typical data pipeline in image-based systems biology. The process starts with an image from which we generate annotations (e.g., segmenting nuclei, detecting FISH spots or manually drawing line profiles). Next, we either measure features of those objects directly (e.g., volume) or analyze the pixel intensities of the corresponding regions in the image (e.g., mean intensity). All measurements are then stored in a table, with rows corresponding to objects and columns representing the measured quantities or features of those objects. These measurements may be further processed using machine learning techniques, including clustering, classification, or regression models. Embedding algorithms can also be applied to create new representations of subsets of features. The final step involves generating insightful visualizations that highlight some of the discovered patterns (Figure 8). A summary of the key processing steps in bioimage analysis can be found in Table 6.
| Term | Description | Typical Output |
|---|---|---|
| Image Processing | Applying transformations (e.g., filters or spatial operations). | Image |
| Registration | Aligning two images (or other types of spatial data) into a shared coordinate system. | Image |
| Thresholding | Binarizing an image by separating foreground and background. | Binary Mask |
| Segmentation | Dividing an image into subregions by assigning each pixel a label ID. | Label Image |
| Instance Segmentation | Segmenting objects so that each instance (e.g., each cell) receives a unique label. | Label Image |
| Semantic Segmentation | Segmenting regions by object type (e.g., all people share one label, all cars another). | Label Image |
| Feature Extraction | Calculating descriptive numerical values (i.e., features) from annotations and images. | Table |
| Feature Embedding | Projecting high-dimensional features into a lower-dimensional space, often non-linearly. | Table |
| Image Analysis | A combination of the above steps to derive meaning from image data. | Plot, Report |
In the early days, computer vision primarily relied on hand-crafted algorithms for image processing. In recent years, deep learning–based methods have become indispensable tools in computer vision applications. This shift was enabled by decreasing compute costs for large-scale model training, made possible by hardware optimized for fast, parallel matrix multiplication—namely, graphics processing units (GPUs)—alongside the availability of large, curated datasets such as ImageNet (Russakovsky et al. 2014), and architectural innovations tailored to image tasks, including convolutional neural networks (CNNs) and, more recently, vision transformers (Lecun et al. 1998; Krizhevsky, Sutskever, and Hinton 2017; Dosovitskiy et al. 2020). Building on these general advances, deep learning methods have also been adopted in the field of bioimage analysis, where they are applied to tasks such as image restoration and denoising (Weigert et al. 2018; Krull, Buchholz, and Jug 2018), object detection (Mantes et al. 2024; Xu et al. 2024), and segmentation (Ronneberger, Fischer, and Brox 2015; Schmidt et al. 2018; Stringer et al. 2021).
Where classical computer vision algorithms required significant expert knowledge to develop rule-based pipelines, with learning-based methods applied only at the final stage, modern deep learning approaches have in some ways democratized the field. While using deep learning for computer vision still demands technical familiarity with model training and data preparation, these methods can be employed effectively—even without deep understanding of the underlying algorithms—so long as the target task is sufficiently similar to existing problems that can inform the choice of model architecture. The broader shift toward deep learning has also played a key role in establishing Python as the dominant programming language in this space. As a result, Python is now also the standard in bioimage analysis, thanks in part to the strength of its deep learning ecosystem, including libraries such as PyTorch and TensorFlow.
The shift within computer vision toward deep learning is only one part of a broader trend of learning-based methods being used in science and engineering. In this context, machine learning refers not only to deep learning but to any statistical algorithm that is not explicitly programmed, but rather can learn from data and generalize to new data. In our modern world, where much of our work and private lives involves digital technology, those methods are ubiquitous, whether it is our phone organizing our photo library according to the people recognized in the pictures, the Spotify algorithm suggesting a new song to listen to, or us using a large language model for writing an email. During the analysis of high-content microscopy data we employ machine learning in many of the processing steps, and I will give a brief introduction as relevant for the project, with no claim of covering the topic comprehensively.
In a supervised setting we have pairs of inputs and desired outputs ahead of time and aim to find a model and model parameters that can perform a mapping between them. Each training example holds some information about the relationship between input and output that can be used during model training to adjust the model parameters toward better performance. What we consider to be good performance needs to be encoded in a loss function. Many machine learning methods then use optimization algorithms to minimize that loss function (training), leading to the desired model parameters (fitted model) that allow the application of the model to new data (prediction).
The canonical example of supervised machine learning is linear regression, which in its simplest form, aptly named simple linear regression, has only two parameters: a slope and an intercept. This model can express any linear relationship between two variables by setting those two parameters, and “optimal parameters” are usually defined in terms of the mean squared error as a loss function.
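A minimal sketch of this, fitting the two parameters on simulated data by least squares (np.polyfit minimizes the squared error directly):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)   # simulated data: slope 2, intercept 1, plus noise

slope, intercept = np.polyfit(x, y, deg=1)    # least-squares fit of the two parameters
mse = np.mean((y - (slope * x + intercept)) ** 2)
```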
Whereas in linear regression a continuous dependent variable22 is modeled as a linear combination of independent variables,23 we can use the closely related logistic regression model for tasks where the target belongs to one of two classes. In this way we can use linear models not just in regression problems with a continuous output but also for regression problems with a discrete output, often called classification tasks in the literature. The logistic regression model is part of the family of generalized linear models, which consist of a random component, the probability distribution of the response variable (normal distribution for linear regression, binomial for logistic regression), a systematic component (the linear combination of independent variables \(\beta X\)), and a link function that connects the two (\(g(\mu)\), the identity for linear regression, the logit for logistic regression). With this generalization, linear models can be used to model a broad range of problems, and they are some of the most used and best understood statistical models.
Linear models are only a small subset of models used in supervised settings. In general the model can be whatever we like as long as there is a way of tuning its parameters to the observed data. All deep learning based methods mentioned in the section on computer vision are just complicated non-linear regression models (i.e. more parameters and different ways of combining those parameters into models or model architectures). More parameters mean they can learn more sophisticated mappings between data, such as from an image of nuclei to an object probability map and distances to a convex polygon around said object as done by Weigert et al. (2019).
On the other hand, more complicated models tend to need more training data, and care has to go into ensuring that a model is not overfitting on the training data so it generalizes well to new observations. Another disadvantage of complicated models is that it is relatively hard to track down how and why the model generates a specific output. Deep learning models, or other sufficiently complicated models, are sometimes called “black-box” models to highlight this property and to contrast them with simpler statistical machine learning methods, where the values of model parameters can have straightforward interpretations. In a simple linear regression model, the slope parameter can be readily interpreted as the amount of change induced in the target variable by a unit change in the independent variable.
An important family of models I want to mention here are random forests, and their descendants in the form of gradient-boosted trees and histogram gradient-boosted trees popularized by models like XGBoost and LightGBM (Chen and Guestrin 2016; Ke et al. 2017). Random forests are ensemble models, meaning they combine multiple simpler decision tree models that are individually trained on bootstrapped samples of the training data. The final model output of the random forest is derived from averaging the predictions of its component decision trees. The other two methods differ slightly in the way the ensembles are created, and in the case of histogram-based gradient boosting additionally transform the features from continuous values to histograms first, which can help with scalability to large datasets and reduce training time (“Scikit Learn User Guide 1.6.1” 2025).
The reason I mention the random forest family of models specifically is that they are some of the least demanding in terms of feature preprocessing. Whereas other machine learning models can be very sensitive to the distributions and scales of the features going into them, and in the case of categorical features require them to be one-hot encoded, tree ensembles do not have those requirements and usually perform well out of the box. This is the case even with features that are scaled differently by orders of magnitude, have highly skewed distributions or contain outliers. In the case of the histogram-based gradient boosting models implemented in scikit-learn and LightGBM, the models can even handle missing values. This makes them ideally suited to real-life scenarios where the data is usually messy. As such, they are a very useful tool, particularly in light of the challenges that come with heterogeneous features—as they are common when working with microscopy data (see Section 1.3.8.1).
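A small sketch with simulated messy features (skewed scales and missing values) shows why so little preprocessing is needed:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 500),             # well-behaved feature
    rng.lognormal(3, 2, 500),          # heavily skewed, very different scale
])
X[rng.random(500) < 0.1, 1] = np.nan   # 10% missing values in the second feature
y = (X[:, 0] > 0).astype(int)          # toy target

clf = HistGradientBoostingClassifier().fit(X, y)   # no scaling or imputation required
```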
I like to differentiate between different modes of, and reasons for, using supervised machine learning. In many cases we use machine learning in data analysis because we are interested in the predictions of a model, like when using learning-based image denoising (Krull, Buchholz, and Jug 2018), running an image segmentation network (Stringer et al. 2021), or training a pixel classifier (Berg et al. 2019). We do not really care how these models do their task as long as we can verify that they reach a certain performance on our data. For these applications we happily use complicated models if they promise even marginally better prediction results, because the predictions are the goal.
Such predictor models can be trained on a large and broad set of training examples, which enables them to generalize well, even in the case of complicated function mappings like those necessary for cell segmentation. A broadly trained segmentation model can just solve the task, to the point where it does not even need to be retrained anymore. An example of this is Cellpose, which can handle many 2D cell segmentation problems—and even some 3D ones, as we will see later—out of the box (Stringer et al. 2021). A model that generalizes well across a broad range of tasks is an extremely powerful tool, since it enables the engineering of robust analysis pipelines that are easy to adapt for the processing of new data. Therefore, the development of those methods is of immense value to image-based systems biology, just like the development of robust experimental protocols that do not need months of adjustment when applied in slightly different conditions.
Fryzlewicz (2024) highlights the distinction between the predictive emphasis of machine learning in engineering and data science, and its use in the natural sciences, where it is often developed within a statistical framework to support statistical inference, interpretation, and understanding of underlying mechanisms. In this complementary application of machine learning, sometimes referred to as statistical learning, the objective extends beyond prediction. The aim is gaining insights into the data by making assumptions about the data-generating process which are formulated in the language of statistical models. This approach helps to uncover relationships among the model’s components and assess what the available finite data can reveal about the real-world phenomena we are ultimately seeking to understand. In this process of statistical model inference24 the model predictions are of secondary importance to our ability to learn something about the data. We would usually like to know more than just “there is an association between x and y” which is the main statistical insight we gain from a successful prediction. Instead, we want to know how the independent variables interact in leading to a certain response. Each statistical model employed this way reflects a hypothesis of how the data was generated that we can then use to explore that hypothesis. This “statistical inference focused” machine learning needs to be distinguished from the “prediction focused” approach used to solve engineering problems (Fryzlewicz 2024, chaps. 12, 14).
With the latter objective in mind, model interpretability and the capacity to use a model for statistical reasoning take precedence. As a result, simpler models with fewer, yet interpretable, parameters are often preferred over more complex models that may offer higher predictive accuracy but lack interpretability. Since methods in statistical model inference are developed by the statistics community, much of the literature and examples use the R programming language, and generalized linear models (GLMs) or generalized additive models (GAMs) are the workhorse approaches because they are some of the best understood models around. Within Python, the ecosystem is dominated by prediction-focused modelling, as exemplified by the scikit-learn ecosystem and the deep learning libraries mentioned before. Libraries focusing more on statistical inference are still available and include statsmodels (Seabold and Perktold 2010), or the more recently developed frameworks for Bayesian statistical modelling PyMC (Abril-Pla et al. 2023a) and its descendant Bambi (Capretto et al. 2022).
So which one of the two approaches is better suited for our needs in bioimage analysis? The answer is that we need both, and knowing about the distinction between prediction and statistical inference can be useful when learning about these topics in order to pick an appropriate tool when encountering a new problem. Many of the challenges encountered in the early stages of image analysis clearly fall into the category of prediction problems. Examples we will encounter in this thesis include image segmentation, where we use Cellpose, a deep learning based segmentation model, and Ilastik, a pixel classifier that uses a random forest model under the hood, to segment nuclei and embryos respectively (see Section 3.2.3). Other examples in this category are the classifiers trained to find segmentation debris and classify cell types (see Section 3.3.2). All these models have one thing in common: once we are confident in their performance, we apply them to new data without concerning ourselves with their internal workings. Model inspection is only done with the objective of developing a better-performing predictor.
Statistical modelling and model inference tend to be employed in the later stages of the analysis, at a point where the raw image data have been transformed into more manageable sets of objects with associated features in tabular form. At that point we shift gears from solving engineering problems to asking scientific, biological questions. And the quality of our modelling approach cannot be reduced to a simple accuracy score or explained variance; it comes from the extent to which it allows us to test hypotheses that we had going into the analysis.
An example of an attempt at inferential statistical modelling can be found in the multiple linear regression used to model the transcriptional activity of single cells using cell cycle phase and the known ZGA factors as independent variables (see Section 3.3.7). By examining the effects of changes to the variables included in the model, we want to learn about their relative contributions in explaining the observed heterogeneity in transcriptional activity. Notice how the goal is not to predict the transcriptional activity of cells, or only to the extent that being able to do so with a specific model at hand tells us something about the association between transcription and the features that went into the model.
A proper treatment of statistics, probability theory and the associated study of probability distributions of random variables is well beyond the scope of this text (and my abilities), and I can only point to some resources that I found helpful in their study, namely the freely available online lecture series “Statistical Rethinking” taught by Richard McElreath, author of the book of the same title (McElreath 2015). It teaches Bayesian statistics and the associated tools and methods for statistical inference. Other resources that were extremely helpful were the documentation of the PyMC and Bambi libraries.
Unsupervised methods include clustering and embedding methods that aim at reducing the dimensionality of our data in a way that retains or even emphasizes aspects relevant to a specific analysis. We do not provide a target to those algorithms, so they work by modelling the observed data distribution. Examples of clustering algorithms are k-means, density-based spatial clustering of applications with noise (DBSCAN) and hierarchical DBSCAN, also known as HDBSCAN (Ester et al. 1996; McInnes, Healy, and Astels 2017). All of those algorithms cluster the data based on some notion of distance between the data points in a high-dimensional feature space and on assumptions about the distribution of the data in that space.
While clustering can be seen as embedding a feature space onto a categorical scale, dimensionality reduction involves projecting a feature space into a lower-dimensional space. A classic example of dimensionality reduction is Principal Component Analysis (PCA), which decomposes the data into orthogonal components that capture the maximum variance in the reduced dimension. Non-linear embedding algorithms that are often used in bioimage analysis include uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE) (McInnes, Healy, and Melville 2018; Maaten and Hinton 2008). Both of them can be used to reduce high-dimensional feature spaces to 2 or 3 dimensions, amenable for visualization, and are invaluable tools in exploratory data analysis.
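In practice both can be applied in a few lines (UMAP via the third-party umap-learn package; the feature matrix here is random and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import umap   # provided by the umap-learn package

X = np.random.default_rng(0).normal(size=(1000, 50))    # 1000 cells, 50 features
X_scaled = StandardScaler().fit_transform(X)

pca_2d = PCA(n_components=2).fit_transform(X_scaled)          # linear embedding
umap_2d = umap.UMAP(n_components=2).fit_transform(X_scaled)   # non-linear embedding
```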
In addition to the use of unsupervised machine learning for visualization, these methods are often used to reduce the complexity (dimensionality) of data as a pre-processing step before employing computationally expensive supervised models. This can reduce training time and increase prediction speed. When employing models in that way, the effect of this preprocessing should always be evaluated so it can be weighed against a potential loss in model performance. Pre-processing using unsupervised methods can also help in reducing multi-collinearity of features, which can pose problems for statistical inference. Multi-collinearity occurs when explanatory variables are highly correlated, which makes it difficult to estimate their individual effect on the response because of their redundancy. This is similar to how redundancy in biological systems, like multiple protein isoforms with overlapping functions, can make it harder to investigate their functions, as a phenotype might only appear in the multi-knockout condition.
The term hyperparameters refers to the parameters that are not learned from the data but are set to choose a specific model from its hyperparameter space, which defines a related family of models. While we often separate those hyperparameters from the model parameters that are tuned to the data during model training, the line between them starts to blur once we employ a process called hyperparameter tuning, where we try to find the best hyperparameters by training many slightly different versions of a model for a given task, effectively learning the hyperparameters from the data.
On a related note it is also common to combine individual models into model ensembles, as we have encountered them already with random forest models, which are ensembles of decision trees. Similar to hyperparameter tuning, the process begins by training a set of models. However, rather than selecting only the best-performing model, their predictions are then aggregated. This ensemble of models often yields improved performance, albeit at the expense of increased computational cost.
A machine learning pipeline, consisting of a model and preprocessing steps like feature selection, data transformations or dimensionality reduction, can be viewed as a special case of an ensemble where the data is processed sequentially by each component. Using a pipeline abstraction, like the one implemented in scikit-learn, is a powerful way of combining modular components into a larger logical processing unit, and will be used extensively to build cell cycle phase models (see Section 3.3.3). When combining models into ensembles/pipelines the result is a higher level model that has its own hyperparameter space built from the Cartesian product of the individual component hyperparameter spaces. It can be tuned using the same methods applicable to individual model components or simpler models.
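A minimal scikit-learn sketch of this idea, tuning preprocessing and model hyperparameters of a toy pipeline jointly (the grid values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "scale__with_mean": [True, False],   # preprocessing hyperparameters...
    "clf__C": [0.1, 1.0, 10.0],          # ...combined with model hyperparameters
}

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```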
The fact that we can combine feature preprocessing steps into a model using a pipeline is a good reminder that the preprocessing parameters are part of our model. For example, when using UMAP for dimensionality reduction the resulting embedding will change depending on how we scale the features. For a visualization in a publication one should try different hyperparameter settings, ideally including feature preprocessing, and confirm that observed patterns are robust to parameter perturbations.
The most straightforward approach to measuring dynamic processes with microscopy is time-lapse imaging of live specimens. A beautiful example is the movies of early zebrafish development by Keller et al. (2008), which use a custom light sheet setup to visualize and track nuclei until gastrulation. The downside of live imaging is the usually high amount of manual labor involved in preparing the sample and the decrease in image quality that comes with having to image in water-based buffers instead of mounting media optimized for volumetric imaging (see Section 1.2.4).
In the context of high-content microscopy data generated from fixed samples, or methods like single-cell RNA sequencing, there is an alternative route to arrive at (pseudo) dynamics: by computationally inferring them from large numbers of single cells that are sampled while progressing through a stereotypic process. We can leverage the statistical power coming with large sample numbers, and the fact that features tend to vary smoothly during dynamical processes, to reconstruct trajectories in this high dimensional feature space, corresponding to the cell averaged progression.
Most methods start by doing some sort of dimensionality reduction, whereby both linear (Schwabe et al. 2020) and non-linear embedding methods (Setty et al. 2019; McInnes, Healy, and Melville 2018) can be employed. Trajectory inference can then be achieved in a variety of ways, for example by fitting a suitable regression model, like a principal curve, principal tree or principal circle (Albergante et al. 2020), or by modelling dynamic progression along a nearest neighbor graph using a Markov process (Setty et al. 2019).
In the former case, the position of a cell along the regression model is interpreted as the progression of the process under study, while in the latter, it is the number of steps and transition probabilities in the Markov process, from a reference cell to the cell of interest. Modeling high-dimensional single-cell data in this manner can, for instance, be used to infer expression dynamics during a developmental process or identify branching points in differentiation (Deconinck et al. 2021).
The broad adoption of single-cell transcriptomics and its application in developmental biology to study cellular differentiation have been major drivers of innovations in trajectory inference methods. However, these methods and algorithms have found broader applications, including in high-content microscopy of fixed samples. In this context, feature spaces are not derived from transcript counts but rather from measurements of cellular morphology, molecule abundance, subcellular molecule distribution, and population-level properties like local cell density.
When selecting an algorithm, an important criterion is choosing a model with an appropriate topology, that aligns with our prior knowledge of the trajectory’s structure. For example, modeling branching trajectories—commonly observed during cell differentiation—requires a model capable of representing a tree structure (Deconinck et al. 2021).
In the context of the cell cycle, we have a strong prior assumption: cells are expected to return to their initial state, reflecting the cyclical nature of the process. This implies that the model should be topologically closed, forming a loop. However, discontinuities may arise due to insufficient sampling or discontinuities in the measured features. Furthermore, orthogonal processes may influence the features, causing the circular trajectory to drift in feature space, effectively turning it into a helix (Schwabe et al. 2020).
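As a purely illustrative sketch of the closed-topology idea (not the model developed later in this thesis), one can embed cells into the first two principal components and read a cyclic pseudotime off the angle around the center:

```python
import numpy as np
from sklearn.decomposition import PCA

def circular_pseudotime(features: np.ndarray) -> np.ndarray:
    """Map each cell onto [0, 1) along a circular trajectory in PC1/PC2 space."""
    pcs = PCA(n_components=2).fit_transform(features)
    pcs -= pcs.mean(axis=0)                      # center the embedding
    angle = np.arctan2(pcs[:, 1], pcs[:, 0])     # angle in (-pi, pi]
    return (angle % (2 * np.pi)) / (2 * np.pi)   # cyclic pseudotime in [0, 1)
```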
Methods for inference of dynamics have been used to estimate pseudo dynamics across 24h of zebrafish development from single-cell RNA sequencing data (Wagner et al. 2018), for investigating symmetry breaking in the development of intestinal organoids (Serra et al. 2019) or to infer cell cycle progression in tissue culture (HeLa) cells (Gut et al. 2015).
After outlining the key steps in bioimage analysis and reviewing the tools available for constructing such pipelines, I will conclude this chapter by highlighting the challenges inherent in processing bioimage data. While some of these challenges are shared across all domains of bioimage analysis, others are specific to high-content screening applications, where the scale of the datasets precludes local processing.
As we have discussed in Section 1.2.1, microscopy data is inherently multi-modal, as each pixel contains both a set of molecular readouts and its spatial coordinates and coverage. We can make this point more salient by contrasting high-content microscopy data with single-cell RNA sequencing data, another prevalent source of high-dimensional data in systems biology.
Whereas the tables resulting from image-based systems biology measurements might look similar at first glance, each feature in a standard sequencing experiment has the same meaning, namely the transcript counts of a gene. With our measurements, on the other hand, features can relate to cell morphology, protein abundance, subcellular distribution of a target, or, when including spatial measurements, the position and orientation of a cell, or a spatial gradient of a protein concentration.
This means that, while we can, and often do, aggregate all our measurements into a single feature table, we still need to apply custom processing to subsets of them, which varies based on the underlying distribution and the meaning of a feature. For example, applying a log transformation to mean intensity measurements is often beneficial due to their right-skewed distribution. However, the same transformation would be problematic for the kurtosis25 intensity feature, as it can be negative, rendering the log transform undefined. This heterogeneity in features, which is common to all image-based methods, introduces substantial complexity into downstream machine learning pipelines and is often underappreciated or overlooked.
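One way to keep such feature-specific preprocessing explicit is to encode it per column group, as in this sketch (the column names are hypothetical):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

intensity_cols = ["nuc_mean_DAPI", "nuc_mean_PCNA"]      # right-skewed, log-transform
other_cols = ["nuc_kurtosis_DAPI", "volume_um3"]         # can be negative, leave untransformed

preprocess = ColumnTransformer([
    ("log_intensity", FunctionTransformer(np.log1p), intensity_cols),
    ("scale_rest", StandardScaler(), other_cols),
])
# preprocess.fit_transform(features_df) applies the right transform to each column group.
```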
As discussed in Section 1.3, bioimage analysis includes a broad range of processing steps that are required to turn raw images into numerical features and ultimately biological insight. Each of these steps introduces potential sources of error that propagate through the analysis, reducing data quality. Additionally, they increase the amount of code that must be maintained to keep the analysis pipeline operational. All these processing steps create an interdependent graph of computations. While individual components can largely be developed in isolation, they may still require later adjustments due to oversights during the initial analysis.
Furthermore, there is often a chicken-and-egg problem, where the result of a computation could be used to refine that very same result. For example, intensity features extracted from segmented nuclei are used to estimate intensity bias in the imaged volume. Once bias correction is applied to the images, the conditions for accurate nucleus segmentation improve significantly. These kinds of situations are very common, and both difficult to keep track of mentally and hard to express computationally.
The pragmatic thing to do is to not think about these opportunities and move on with the analysis, especially because a second round of segmentation would entail managing a dataset with multiple nuclei segmentations, introducing the danger of using the wrong label image in downstream analysis. The issue is further amplified by the fact that it also applies to any component derived from that label image, such as feature tables used in statistical modeling.
Another complication stems from the fact that analysis pipelines rely on a variety of tools and algorithms originating from a fragmented ecosystem. They might use different programming languages, or, within one language, rely on widely different data specifications. Integrating such tools into an analysis pipeline often requires data duplication and considerable development effort just to transform the data into the different required specifications.
Furthermore, many developers of image analysis tools in academia come from a background in the natural sciences rather than computer science. The result is a lack of expertise in software development and architecture, which is a requirement for building and maintaining large and scalable codebases effectively.
A difficulty that follows from the need to process large amounts of data is that much of the image processing needs to run on dedicated high-performance computing (HPC) infrastructure. Furthermore, some processing steps, especially the ones involving deep learning, need to be executed on GPUs, adding another layer of infrastructure complexity.
This exacerbates the challenge of integrating the disparate tools into a cohesive pipeline, since they are often run locally during development and then need to be adapted for the integration into a pipeline that can be executed efficiently on the available infrastructure. And the fact that such infrastructure is usually hosted by university or lab IT makes the installation of software packages more cumbersome.
Furthermore, proprietary image analysis software is often not available in those environments due to licensing restrictions, sometimes even in cases where a university does have a license. For example, at the University of Zurich Huygens deconvolution is only available on virtual machines at the Center for Microscopy and Image Analysis (ZMB). As a result, integrating it into a high-content image processing pipeline requires time-consuming and manual transfers of the entire dataset back and forth. This renders the software effectively unusable for such datasets.
Another limitation with proprietary software is that, even without the licensing issues, it is even harder to use their algorithms as part of a larger pipeline. Their business model is usually built around providing an all-in-one software solution where the user does the whole analysis within their ecosystem. Therefore, they have no incentive to make data export accessible. I still have nightmares from attempting to use nuclei tracks generated in Imaris outside the software. It was possible to export tracklets26 and all kinds of—mostly useless—features. But the one thing I was interested in, the connectivity of tracklets for related nuclei, was mysteriously absent.
This thesis aims to develop a robust experimental protocol for high-throughput iterative immunofluorescence (multiplexing) combined with quantitative volumetric imaging of the entire animal cap in early zebrafish embryos. Using this protocol, I will generate a multiplexed time course capturing early development between 2 and 4.5 hours post fertilization, encompassing the mid-blastula transition (MBT) and zygotic genome activation (ZGA). The multiplexing strategy enables the inclusion of markers necessary for reconstructing multichannel volumes from individual imaging cycles, single-cell segmentation, cell cycle characterization, and known factors involved in ZGA. Chapter 3.1 focuses on these experimental advancements in high-content volumetric imaging and multiplexing of early-stage zebrafish embryos.
Additionally, I will develop an automated image analysis pipeline capable of elastic registration, segmentation, and single-cell feature extraction for downstream machine learning and statistical modeling. The pipeline is designed for generalizability beyond this dataset, aiming to support volumetric spatial image analysis broadly. Core functions like feature extraction from label images are modular and decoupled from higher-level workflows to facilitate large-scale HPC processing. While segmentation methods may need adaptation for different model systems (e.g., organoids, iETX embryos), the intensity bias correction and feature extraction techniques presented are broadly applicable. Chapter 3.2 details the development of these general tools and the underlying logical data model.
The final results Chapter 3.3 demonstrates the application of these tools to the pseudo time course data produced in the experimental phase. This chapter aims to:
Leveraging the power of multiplexing protocols (see Section 1.2.3) to address questions in developmental biology required several key extensions to the 4i protocol. Together with my Master’s student Marvin Wyss, we developed an experimental protocol for (1) efficient fixation and processing of large numbers of samples, (2) stable mounting in 96-well plates compatible with at least four cycles of immunofluorescence staining and subsequent elution, and (3) volumetric imaging on our Yokogawa CV8000 SDC microscope compatible with single-cell image analysis.
The main innovations, compared to the original 4i protocol (Gut, Herrmann, and Pelkmans 2018) and standard IF protocols for zebrafish (Schulte-Merker n.d.), are the novel approach for mounting embryos in 96-well plates and the combination of the 4i imaging buffer with a refractive index-matched mounting medium (the PROTOS buffer from Chung and Deisseroth 2013) enabling volumetric imaging. After establishing an antibody panel relevant to early embryonic development, we applied the protocol to record a multiplexed time course of early zebrafish development (2.5–4.5 hpf).
To efficiently generate large numbers of samples, zebrafish were set up for mating in up to 12 sloping breeding tanks, each containing 4 males and 4 females. Embryos were collected and transferred to plastic Petri dishes held at 28 °C in E3 medium until they reached the appropriate stage.
Fixation was performed in 5 ml Eppendorf tubes by transferring up to 1 ml of eggs, aspirating the medium down to 2.5 ml, and adding an equal volume of freshly prepared 8% paraformaldehyde (PFA) in E3 for a final concentration of 4% PFA. Fixation was carried out overnight at 4 °C on a carousel shaker. The following day, samples were thoroughly washed in PBS, transferred to methanol through a graded series (25%, 50%, 75%, 100% MeOH), and stored at −20 °C for long-term storage and permeabilization. This approach allowed the collection of thousands of samples within two days. Samples could be stored in methanol for several weeks without observable degradation.
To mount samples for an experiment, they were first rehydrated through a PBS-T series (75%, 50%, 25%, 0% MeOH). Chorions were then manually removed using tweezers under a dissecting microscope. Dechorionated embryos were handled only with Pasteur pipettes and glass containers due to their tendency to adhere to plastic surfaces. The yolk was removed by gently agitating the embryos in a glass centrifuge tube using a Pasteur pipette and washing out yolk particles in at least six rounds of PBS rinses.
Samples were transferred to fibronectin-coated 96-well plates by gently allowing them to settle in PBS-filled wells, resulting in orientation with the animal pole facing down. Up to nine embryos were mounted per well to maintain sufficient spacing for imaging (Figure 10). After low-speed centrifugation (200 RCF) to avoid morphological distortion, plates were illuminated for 30 minutes on a UV transilluminator (365 nm), then exposed to a 488 nm laser under a 20x objective to photo-crosslink the embryos to the plate.
This mounting approach ensured sample stability over at least two weeks, supporting four or more rounds of staining, refractive index matching, imaging, and elution. Despite the stability, samples remained accessible enough to keep antibody incubation times practical (overnight for primaries, 4 hours for secondaries).
A full description of the protocol can be found in Section 5.
The Yokogawa CV8000 acquisition software was designed with tissue culture models in mind. As a result, the focus is on whole-well acquisitions or imaging pre-defined patterns within a well. However, since we need to image across a z-range of approximately 300 μm and embryos are sparsely distributed within a single well (Figure 10), this acquisition approach would be extremely inefficient—both in terms of imaging time and the amount of “empty” data generated. While manual specification of positions is possible, it is very cumbersome using the software provided by Yokogawa. Ideally, we would like to perform a low-resolution acquisition of the full plate, which can then be used to automatically detect our objects of interest and generate an appropriate acquisition file for just those positions.
Wako Automation, a third-party vendor of Yokogawa systems, developed a solution to address this limitation: a piece of software called SearchFirst. This tool enables low-resolution acquisitions and allows for the injection of custom object detection algorithms within those sites for subsequent high-resolution imaging. Unfortunately—likely again due to the focus on tissue culture systems—objects can only be detected within a single field of view of the low-resolution acquisition. Given the large size of our samples, many overlap with the border of an acquisition site, even when acquired with a 4x objective. This complicates automated object detection, since it is (1) difficult to identify the center of an object located at the image border and (2) possible to detect the same object twice in adjacent fields of view.
To address these limitations, we built a custom tool tailored to our use case. A low-resolution acquisition (first pass) is stitched well-wise, enabling object detection algorithms to run on the entire well. The detected objects are then used to generate a .mes27 file for a high-resolution acquisition (second pass), based on a template .mes file containing the desired acquisition settings. The tool can be executed via a command-line interface (CLI).
The initial implementation used template matching for object detection. This was later extended by additional contributions from other lab members with different requirements. These include threshold-based object detection by Serguei Ovinnikov, manual object selection in the napari viewer by Adrian Tschan, and a semi-automatic approach by Ruth Hornbachner that combines threshold-based detection with manual curation.
The stitching of wells and reverse-engineering of undocumented Yokogawa acquisition files only needed to be performed once. New object detection algorithms can then be added relatively easily, provided they adhere to the function interface for object detection in stitched wells (Listing 1).
searchFirst/processing_methods.py
```python
import numpy as np

Array2D = np.ndarray  # type alias for 2D image arrays


def find_objects_pseudocode(
    stitched_well: Array2D, **kwargs
) -> tuple[Array2D, Array2D]:
    ...
    # Create a zero-filled array of the same shape as `stitched_well` with
    # ones at the locations of the objects to be imaged (`selected_objects`).
    # Optionally, create a second array of objects that were detected by the
    # function but should not be imaged (e.g., detections that do not meet a
    # threshold). These are used for plotting only. If the function does not
    # reject any objects, return a zero-filled array, i.e.:
    #
    #     rejected_objects = np.zeros_like(stitched_well)
    return selected_objects, rejected_objects
```
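For illustration, a minimal detection method conforming to this interface could look as follows. The Gaussian smoothing, Otsu threshold, and minimum-area cutoff are assumptions for this sketch, not the parameters used in searchFirst.

```python
import numpy as np
from skimage.filters import gaussian, threshold_otsu
from skimage.measure import label, regionprops


def find_objects_threshold(
    stitched_well: np.ndarray, sigma: float = 5.0, min_area: int = 500, **kwargs
) -> tuple[np.ndarray, np.ndarray]:
    """Toy threshold-based detector marking object centroids for imaging."""
    smoothed = gaussian(stitched_well, sigma=sigma)
    foreground = smoothed > threshold_otsu(smoothed)

    selected_objects = np.zeros_like(stitched_well, dtype=np.uint8)
    rejected_objects = np.zeros_like(stitched_well, dtype=np.uint8)
    for region in regionprops(label(foreground)):
        y, x = (int(round(coord)) for coord in region.centroid)
        if region.area >= min_area:
            selected_objects[y, x] = 1  # centroid marks a position to image
        else:
            rejected_objects[y, x] = 1  # too small: kept for QC plots only
    return selected_objects, rejected_objects
```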
A technical challenge when staining whole-mount 3D structures is ensuring that the stain distribution matches the underlying target distribution in the sample. Antibody vendors do provide recommendations for incubation conditions and reagent dilutions, but the appropriate staining conditions depend on the specificity for the target of interest (which may deviate from vendor expectations when using antibodies raised against orthologs from a different species), the sample processing conditions (fixation and permeabilization protocol), and the size of the sample.
Because of this, quantitative immunofluorescence experiments generally require testing a range of incubation conditions to minimize staining bias. When multiplexing, there is added complexity in that the optimal staining conditions for an antibody might change over the course of the experiment due to deterioration or retrieval of the target epitope upon repeated exposure to the protocol buffers. A lot of work, much of it part of Marvin Wyss' Master's thesis, went into finding specific antibodies for targets of interest, testing their cross-reactivity in zebrafish, and evaluating their compatibility with the multiplexing protocol. Furthermore, staining conditions had to be individually optimized to minimize staining bias.
We observed particularly strong staining bias for high-abundance targets. This is consistent with the findings of Susaki et al. (2020), who performed diligent empirical testing of a broad range of staining conditions for volumetric whole-mount samples. Their findings strongly influenced our experimental design, most importantly our use of direct IF (i.e., pre-incubation of primary antibodies with fluorescently labelled Fab fragments) for high-abundance, high-affinity targets.
Figure 11 illustrates the result of a preliminary experiment where we stained an endogenously tagged H2A-mCherry line with an anti-mCherry antibody. H2A is a high-abundance target and antibodies raised against fluorescent proteins are usually high-affinity, two conditions that are reported to lead to strong staining bias (Susaki et al. 2020). As expected, the indirect staining approach exhibited strong staining bias (R² = 0.04), which was drastically reduced using the direct staining approach (R² = 0.84).
To establish the utility of the protocol, we recorded a time course of early zebrafish development. Embryos were fixed between 2.5–4.5 hours post-fertilization (hpf) in 15-minute intervals (Figure 12). After dechorionation, methanol permeabilization, and yolk removal, the samples were mounted in fibronectin-coated 96-well plates and photo-cross-linked to the plate. This mounting approach kept them stable while still accessible for iterative immunofluorescence staining.
We mixed embryos from all time points within each well to avoid confounding technical bias within the ~8-hour acquisition window with biological variability. After 3D imaging in a refractive index-matched mounting solution containing a scavenger (adapted from PROTOS, 4i), we eluted the antibodies using the standard 4i elution buffer. The full protocol is summarized in Figure 9.
We acquired four imaging cycles, each consisting of a reference DAPI staining for subsequent image registration and ten additional target proteins or protein modifications. This enabled us to investigate two extensively described phenomena in early zebrafish development: cell cycle lengthening, including the emergence of asynchrony during the mid-blastula transition (MBT), and zygotic genome activation (ZGA).
We included markers for segmentation (DAPI for nuclei; β-catenin for cells and the embryo), cell cycle (PCNA, pH3), transcription initiation and elongation (Pol II S5P and S2P), and known factors involved in ZGA (Figure 13; Table 10). Pairing multiplexing with spinning disk confocal (SDC) microscopy allowed us to acquire a dataset of 212 embryos spanning six division cycles of development at subcellular resolution across the entire animal cap (Figure 14).
Since we pooled embryos from multiple fish tanks in 15-minute batches over a 2-hour period, the data roughly correspond to a time-lapse acquisition at 30-second intervals, assuming uniform temporal sampling. Unlike live microscopy, we could optimize imaging for quality rather than specimen health, enabling more precise, spatially resolved, semi-quantitative measurements of target abundance.
As a technical control, we re-stained the markers from the first three staining cycles in cycle four using three replicate wells each. Visual inspection indicated good correspondence between signal distributions for stains in acquisitions 0 and 3 (Figure 15 a). To assess the reproducibility of the protocol at the cell population level, we report Pearson correlation coefficients above 0.8 between mean nuclear (or cellular, for β-catenin) intensities for most stains (Figure 15 b). Signal correlation was also high at the single-pixel level, which is encouraging, as this metric could be degraded not only by sample deterioration but also by inaccurate image registration (Figure 15 c). We did observe an accumulation of background signal, which may explain the reduced pixel-level correlation for low signal-to-noise targets such as FLAG and H3K27Ac.
One significant limitation of the protocol is the challenge in identifying specific antibodies against targets of interest—particularly in the zebrafish model—and optimizing their staining conditions to achieve reliable results. Additionally, the protocol’s accessibility is constrained by the requirement for automated liquid handling systems to maintain compatibility with gentle mounting conditions.
Another drawback is the progressive accumulation of background signal with each successive staining cycle due to imperfect elution, which can compromise data quality. Furthermore, we observe the accumulation of fluorescent artifacts at the plate bottom and, to a lesser extent, at the margin/yolk interface of the embryos (Figure 16 (a)). These artifacts are potentially caused by nonspecific binding of secondary antibodies to fibronectin, or by contaminants accumulating in the plate during buffer exchange steps. They could interfere with measurements directly or by sequestering secondary antibodies, leading to lower effective antibody concentrations in later cycles. Due to the localization of these artifacts mostly at the plate bottom, their direct impact on fluorescence intensity measurements is confined to a few cells at the yolk interface, as is evident when considering the channels with the embryo mask applied (Figure 16 (b)).
The main dataset acquired in this thesis (see Section 3.1.4) comprises roughly 5 terabytes (TB) of raw image data—well beyond what can be analyzed manually or processed on a local machine. This volume necessitates a robust image analysis pipeline that minimizes the need for human intervention, especially during the early processing stages. Even the initial steps leading up to feature extraction require integrating a wide range of tools. These include converting the data into a chunked, pyramidal file format, estimating dark-field and flat-field corrections with the Fiji BaSiC plug-in, performing image registration with elastix, and carrying out segmentation using tools such as Ilastik, Cellpose, scikit-image, and ITK.
To leverage the overlap between common processing steps in volumetric image analysis, Joel Lüthi and I developed the abbott library. This collection of custom Python workflows is tailored for high-content, multiplexed 3D microscopy data. It provides an on-disk file specification along with tools for registration, segmentation, feature extraction, and visualization.
An important step in standardizing image analysis workflows is specifying the data format on disk. For internal use in our lab, Joel and I standardized the storage of image data using a custom container based on the HDF5 specification. This format supports storing image pyramids—multiple down-sampled resolutions of raster data—and label images.
We chose HDF5 containers because they allow: (1) storing arrays with modern compression algorithms that can be tailored to specific use cases; (2) storing arrays in chunks that can be read independently; and (3) storing an arbitrary number of arrays, including custom metadata, within a single file organized hierarchically like a file system using nested groups.
Each microscopy site is saved as an HDF5 file containing groups for each channel, with multiple resolution levels stored within. Additional metadata are stored as array-level28 attributes, as detailed in Table 7. An example summary of a single file is presented in Table 8.
| Attribute | Type | Description |
|---|---|---|
| `img_type` | `'intensity'` \| `'label'` | Type of the dataset (image or label image). |
| `level` | `int` | Multiscale level of the dataset. |
| `element_size_um` | `tuple[float, ...]` | Voxel size in μm for each dimension. |
| `stain` | `str` | Content of the image (stain or object type). |
| `cycle` | `int` | Acquisition cycle in a multiplexing context. |
| `wavelength` | `int` | Excitation wavelength. |
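To make the layout concrete, the following is a minimal sketch of writing one channel group of such a container with h5py. The group and dataset names, the voxel sizes, and the naive downsampling are illustrative assumptions, not the exact abbott-h5 implementation.

```python
import h5py
import numpy as np

volume = np.random.randint(0, 2**16, size=(100, 512, 512), dtype=np.uint16)

with h5py.File("site0001.h5", "w") as f:
    grp = f.create_group("ch_00")                  # one group per channel
    for level in range(3):                         # image pyramid levels
        data = volume[:, ::2**level, ::2**level]   # naive x/y downsampling
        dset = grp.create_dataset(
            f"lvl_{level}", data=data, chunks=True, compression="gzip"
        )
        # Array-level attributes as listed in Table 7
        dset.attrs["img_type"] = "intensity"
        dset.attrs["level"] = level
        dset.attrs["element_size_um"] = (1.0, 0.325 * 2**level, 0.325 * 2**level)
        dset.attrs["stain"] = "DAPI"
        dset.attrs["cycle"] = 0
        dset.attrs["wavelength"] = 405
```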
At the time we set up the format, there was no widely accepted specification within the community for storing high-content volumetric microscopy datasets. This has changed in recent years, as OME-Zarr has emerged as a standard across the open-source bioimage community (Moore et al. 2023). For this reason, the abbott-h5 format is now deprecated and should not be used in new projects.
The goal of image registration is to find a spatial transform that maps homologous points in one image onto another. Once this transform is determined, the moving image can be resampled onto the pixel grid of the fixed image, bringing the two into alignment.
This is typically done by minimizing an image similarity metric through an optimization procedure under a chosen transformation model. As with many optimization methods, challenges arise from the risk of getting stuck in local minima or using a poorly tuned optimizer (Johnson, McCormick, and Ibáñez 2024). Designing a registration pipeline that performs robustly across large datasets can therefore be challenging.
In our case, we stained the same sample multiple times over a period of 2–3 weeks using the multiplexing protocol described earlier. Since microscope stages cannot precisely reproduce the same positions, imaging the same site at a later time results in slight misalignments. To correct for this, we registered all acquisitions of a multiplexing experiment to a reference cycle, using the DAPI channel as a common landmark present in each round.
While rigid (translation and rotation) or affine (including scaling and shearing) transforms are often sufficient for 2D tissue culture systems, our volumetric data exhibited more complex deformations. These arose from slight shrinkage or swelling of the sample during buffer exchanges between staining rounds. To correct for such non-linear distortions, we employed an elastic (B-spline) transform.
The elastix library (Marstal et al., n.d.; Ntatsis et al. 2023), built on top of ITK (McCormick et al. 2014), offers a comprehensive toolkit for custom registration workflows, supporting a range of spatial transforms and similarity metrics. We implemented a three-stage registration pipeline, increasing the degrees of freedom at each step: starting with a rigid transform, followed by affine, and finally a B-spline transformation. The second acquisition cycle was chosen as the reference because it included key segmentation markers (DAPI and β-catenin), and any misalignment would compromise the quality of downstream measurements.
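A minimal sketch of such a three-stage registration with the ITKElastix Python bindings is shown below; the file names are placeholders and the default parameter maps stand in for the tuned settings of the actual pipeline.

```python
import itk

fixed = itk.imread("cycle1_DAPI.tif", itk.F)   # reference acquisition
moving = itk.imread("cycle2_DAPI.tif", itk.F)  # acquisition to be aligned

# Chain rigid -> affine -> B-spline parameter maps, i.e., increasing
# degrees of freedom at each stage.
params = itk.ParameterObject.New()
params.AddParameterMap(params.GetDefaultParameterMap("rigid"))
params.AddParameterMap(params.GetDefaultParameterMap("affine"))
params.AddParameterMap(params.GetDefaultParameterMap("bspline"))

registered, transform_params = itk.elastix_registration_method(
    fixed, moving, parameter_object=params
)
# `transform_params` can then be reused (via transformix) to resample all
# other channels of the moving acquisition onto the fixed pixel grid.
```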
Figure 17 illustrates the outcome of the registration process after each stage for an embryo imaged in two cycles. Notably, the sample exhibited shrinkage between acquisitions (Figure 17 (a)), and a simple affine transform was insufficient to fully align the nuclei due to local non-linear deformations (Figure 17 (b)). Acceptable alignment was only achieved after applying the elastic registration stage (Figure 17 (b)).
Even after B-spline registration, some nuclei—primarily at the borders of the imaged volume—remain misaligned. To quantify registration quality at the single-cell level, we use the Pearson correlation of nuclear DAPI between acquisitions and the reference cycle as a metric. Individual cells with low correlation can be excluded from downstream analysis. For most cells, the correlation exceeds 0.9 (Figure 18).
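For reference, a per-nucleus version of this metric can be computed directly from the nuclei label image and the two registered DAPI channels, for example along the following lines (function and variable names are illustrative).

```python
import numpy as np


def nuclear_registration_qc(
    nuclei: np.ndarray, dapi_ref: np.ndarray, dapi_reg: np.ndarray
) -> dict[int, float]:
    """Pearson correlation of DAPI voxels per nucleus between two acquisitions."""
    qc = {}
    for lbl in np.unique(nuclei):
        if lbl == 0:  # skip background
            continue
        mask = nuclei == lbl
        a = dapi_ref[mask].astype(float)
        b = dapi_reg[mask].astype(float)
        qc[int(lbl)] = float(np.corrcoef(a, b)[0, 1])
    return qc

# Nuclei with a correlation below, e.g., 0.9 can then be flagged for exclusion.
```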
Applying these elastic transformations is computationally intensive, and in the process requires rewriting—and duplicating—the entire dataset. However, the benefit is substantial: once all channels are aligned in a common coordinate system, downstream analysis can proceed as if on a single multichannel image. This alignment ensures that the segmentation masks remain valid across all channels. As a result, feature extraction is greatly simplified.
Another important early processing step is partitioning the image volume into regions corresponding to objects we want to analyze, a process called image segmentation. Dense 3D segmentation remains challenging, but recent advances in domain-specific deep learning models promise high-quality segmentation with minimal manual annotation (Stringer et al. 2021; Weigert et al. 2019).
For nuclei segmentation, we used the pre-trained Cellpose nuclei model and achieved satisfactory segmentation results without retraining, requiring only minor preprocessing for intensity normalization (Stringer et al. 2021). We then trained an Ilastik pixel classifier to segment the embryo (Berg et al. 2019). Finally, nuclei were used as seeds and the embryo mask as foreground in a solidity-constrained seeded watershed approach implemented in scikit-image (Walt et al. 2014). Membrane segmentation was generated by selecting a two-pixel-wide outline of the cell segmentation, and cytoplasm was defined by subtracting nuclear and membrane segmentations from the cell segmentation (Figure 19).
Segmentation was performed on downsampled images because the sampling at that scale was closer to isotropic than in the full-resolution images (Table 12, m=1). This improved the performance of the Cellpose algorithm, which applies the same model three times along 2D slices of each spatial dimension and merges the results. The XZ and YZ slices benefit from the increased isotropy. A sample embryo and the resulting segmentation are presented in Figure 20.
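A condensed sketch of this nuclei-seeded segmentation is shown below. It uses the pre-trained Cellpose nuclei model (v2 API) and a plain seeded watershed from scikit-image, and omits the Ilastik embryo classifier and the solidity constraint of the actual pipeline.

```python
import numpy as np
from cellpose import models
from skimage.segmentation import watershed


def segment_cells(dapi: np.ndarray, bcat: np.ndarray, embryo_mask: np.ndarray):
    """Nuclei via Cellpose, cells via seeded watershed, cytoplasm by subtraction."""
    # Nuclei: pre-trained Cellpose 'nuclei' model run in 3D on the DAPI channel.
    model = models.Cellpose(model_type="nuclei")
    nuclei = model.eval(dapi, do_3D=True, channels=[0, 0])[0]

    # Cells: nuclei as seeds, embryo mask as foreground, and the membrane
    # (beta-catenin) signal as the landscape flooded by the watershed.
    cells = watershed(bcat, markers=nuclei, mask=embryo_mask.astype(bool))

    # Cytoplasm: cell segmentation minus the nuclear mask.
    cytoplasm = np.where(nuclei > 0, 0, cells)
    return nuclei, cells, cytoplasm
```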
Bioimage analysis requires orchestrating a broad range of data structures (Table 5) and processing steps (Table 6). Developing a pipeline for high-throughput microscopy data often involves experimenting with and integrating a variety of tools. To manage this complexity, it helps to maintain a clear mental model and adopt consistent naming practices throughout the analysis. The following section introduces the logical data model underlying the abbott library, which is the product of many iterations and refinements born out of the challenges encountered while working with large-scale image data.
The ideas presented are encoded in a set of table schemas and have been incorporated into the most recent parts of the abbott library code. However, they do not yet constitute a full specification; rather, they reflect ongoing experimentation aimed at discovering the most effective mental model for working with complex spatial datasets. Our goals are to establish (1) a consistent and concise naming scheme that uniquely identifies all elements within the dataset, (2) mechanisms for accessing arbitrary subsets of the data to support browsing, query planning, and processing, and (3) unified access to raster, vector, and tabular dataset components—all of which are essential parts of bioimage analysis.
As mentioned in the introduction, one of the biggest sources of inspiration for the following ideas on modeling spatial data comes from xarray and the spatial-image data model (see Section 1.3.3.1.1). Following that logic, we conceptualize data in terms of dimensions and coordinates. In spatial-image, the focus is on representing a single multichannel image29 using the (c, t, z, y, x) dimensions to annotate a dense array of values (val). Similar to how spatial-image imposes constraints on the spatial coordinates of an xarray.DataArray, we introduce an additional constraint to the xarray data model: coordinate values must be unique. Dimensions can be thought of as the column names of a data frame, with coordinates representing the corresponding columns.
We aim to expand this concept to accommodate (1) multiple label images by introducing the object type (o) dimension, (2) a multiscale sampled pyramid grid by introducing the multiscale level (m) dimension, and (3) feature tables annotating the segmentation masks by introducing the label (lbl) dimension. Combining these, we can represent a 3D physical region of space (z, y, x) sampled at multiple scales (m), from multiple channels (c) of measurements (val), containing multiple types of objects (o) and object instances (lbl). Together, these form a single region of interest (roi:1). The entire dataset can then be represented as a collection of such regions arranged along a roi coordinate (Figure 21, left).
The dimensions come in two main flavors: quantitative and categorical dimensions. Quantitative dimensions measure physical quantities, whereas categorical dimensions are used as labels.
The quantitative dimensions (z, y, x, val) are the physical—spatial or molecular—space in which we perform the measurements and their coordinates represent samples from that space (i.e., points in spatial dimensions or a protein concentration in the channel value dimension).
The multiscale dimension (m) is not really a dimension in the classical sense, but it can be thought of as a continuous space \(\mathbb{R}_{+}^{3}\) with each point representing a possible sampling of a 3D regular grid. The coordinate corresponds to an ordinal set of points in that space defining the pyramid schedule.
The categorical dimensions (c, o, lbl, roi) represent types, and their coordinates contain unique identifiers for instances of that type. As an example, the channel dimension is the space of all possible measurable channels, with a specific coordinate containing the actual channels measured in an experiment.
Categorical coordinates are used to hold labels associated with instances of a given dimension (i.e., type). These coordinates can be nested, in which case their type becomes a named tuple of their components. For example, o: ['nucleus', 'embryo'] is an object type coordinate representing two object types, each associated with a separate label image. lbl: [1, 2, 4] could be a coordinate listing the label IDs of three nuclei in one image, while lbl: [1] might represent the single ID for an embryo in another. To reason about all four objects simultaneously, we can define a new coordinate label_object: [('nucleus', 1), ('nucleus', 2), ('nucleus', 4), ('embryo', 1)], which combines both nuclei and embryos in a single indexable space.
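In code, such a nested categorical coordinate maps naturally onto a named tuple; a small illustrative sketch:

```python
from typing import NamedTuple


class LabelObject(NamedTuple):
    o: str    # object type
    lbl: int  # label ID within that object type's label image


label_object = [
    LabelObject("nucleus", 1),
    LabelObject("nucleus", 2),
    LabelObject("nucleus", 4),
    LabelObject("embryo", 1),
]
```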
Spatial coordinates represent sets of points and are of type float. For example, z: [4.0] defines a one-dimensional point at location 4.0 (relative to the origin), while z: [0.0, 1.0, 2.0] represents three points sampled along the z-axis. Both are valid image grid coordinates because they maintain uniform sampling. In contrast, z: [0.0, 1.0, 5.0] is not a valid image grid coordinate, as it breaks the uniform spacing assumption. Although the z dimension is continuous, image coordinates are always discrete samples from that space. Absent any image data (i.e., for processing steps that only require vector components) we can conceive of the empty volume in terms of a bounded region on each of the three spatial axes.
An overview of the dimensions and their corresponding coordinate types is provided in Table 9.
Labels (lbl) of segmentation masks are usually generated independently for each ROI (roi) and object type (o) and are therefore not a valid identifier for label_objects across the whole dataset.
Each categorical dimension is associated with a coordinate table, indexed by its coordinate, such that each row corresponds to one entity defined along that dimension. In relational database terms, each of these tables has a primary key named after the dimension, with the coordinate serving as the primary key column.
These tables may include any number of additional columns to annotate the entities they represent. This design allows us to attach metadata (or features) to coordinates, enabling operations such as sorting or selecting subsets based on those annotations. In the following section, we will walk through each categorical dataset dimension and its corresponding coordinate table.
The Channel (c) Coordinate
The channels dimension table (Table 10) uses the dimension name c as its primary key column. The identifiers are chosen to be human-readable (e.g., 'DAPI.1') and must uniquely identify the channels in the dataset. In multiplexed experiments, the channel ID is typically formed by concatenating the stain and acquisition columns.
Looking at the acquisition column, we see that after the first acquisition cycle (cycle 0), the stain alone would suffice as a valid channel ID. However, once a second acquisition is registered, which depends on the correspondence between signals across acquisitions, we now have two DAPI channels. Importantly, we do not want to rename the second DAPI channel to something else, as that would incorrectly suggest a different stain was used. Nor can we discard the second DAPI channel after registration—it is critical for evaluating registration fidelity (Figure 18). Concatenating stain and acquisition yields an interpretable and unique channel name.
Notably, unlike in later examples where we might join two coordinates into a tuple (see Section 3.2.4.6), here we generated a new string-based ID (e.g., c='DAPI.1') by combining two underlying columns (stain='DAPI', acquisition=1). This is always an option, and whether it is a good idea depends on the expected access pattern—specifically, how often the individual components of the coordinate need to be accessed independently.
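In polars, such an ID column can be derived from its components in a single expression; a sketch assuming a minimal channels frame with stain and acquisition columns:

```python
import polars as pl

channels = pl.DataFrame(
    {"stain": ["DAPI", "PCNA", "DAPI"], "acquisition": [0, 0, 1]}
)
channels = channels.with_columns(
    pl.concat_str(
        [pl.col("stain"), pl.col("acquisition").cast(pl.Utf8)], separator="."
    ).alias("c")
)
print(channels["c"].to_list())  # ['DAPI.0', 'PCNA.0', 'DAPI.1']
```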
This leads to two general principles for defining coordinates and their role as primary keys:
1. Categorical coordinates must be unique within their dimension table. This allows them to serve as the table's primary key, with each row annotating one instance of the dimension.
2. When constructing human-readable IDs, it is up to the programmer to ensure they are consistent. In the case of abbott, this is enforced in the initial compression step.
The c column is the primary key and defines the entity being annotated in the table. The other columns contain channel-specific metadata, such as the approximate intensity range of the voxel values in that channel. The dataset contains an additional 9 channels from the 3 re-stain control wells that are not shown.
The Object Type (o) Coordinate
The object_type coordinate is a categorical coordinate consisting of string identifiers representing the types of segmented objects. Just as the channel coordinate holds labels for fluorescence channels, the object_type coordinate holds labels for multiple label images that occupy the same spatial region (Table 11). The corresponding object_types table may include additional columns—such as aggregated object counts and specifications for hierarchical relationships between object types—both of which are discussed later in Section 3.2.6.4.
The pyramid schedule downsamples up to m=2 only in x and y to balance anisotropy.
The Multiscale (m) Coordinate
When defining a pyramid schedule for a dataset, we effectively take samples from the multiscale sampling space and arrange them on a 1D ordinal coordinate axis (m). This way of sampling makes it behave similarly to other 1D coordinates.
It is labelled by integers starting at 0 for the original highest resolution and increasing numbers representing downsampled pyramid levels. By convention, each spatial coordinate is downsampled by a factor of 2 for each level. Since a lot of volumetric microscopy data is sampled anisotropically (usually with a lower sampling rate along the axial (z) direction) we downsampled only in x and y until we reached maximum isotropy and from that point on downsampled all three spatial dimensions the same way (Table 12).
As mentioned in the beginning, on its own the multiscale coordinate represents a set of regular sampling grids. To arrive at an image grid we can either specify (1) spatial bounds and a boundary behavior if the scales do not divide the image extent without remainder or (2) an origin and a set of \(n_{m} \times n_{dim}\) integers specifying the shape (cardinality) along each dimension for each multiscale level.
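Such a schedule can be derived programmatically from the voxel size. The sketch below assumes illustrative voxel dimensions and a simple anisotropy criterion, not the exact rule behind Table 12.

```python
def pyramid_scale_factors(element_size_um=(1.0, 0.325, 0.325), n_levels=5):
    """Per-level (z, y, x) downsampling factors: shrink only x and y while the
    sampling is still clearly anisotropic, then shrink all three axes."""
    z, y, x = element_size_um
    factors = []
    for _ in range(n_levels):
        f = (1, 2, 2) if z > 1.5 * y else (2, 2, 2)
        factors.append(f)
        z, y, x = z * f[0], y * f[1], x * f[2]
    return factors


print(pyramid_scale_factors())
# [(1, 2, 2), (1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]
```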
The Value (val) and Label (lbl) Coordinates
Two additional dimensions we have left out so far are val and lbl. The value dimension is the continuous30 dimension of intensity values in a single channel (c) of an image, or the space of possible voxel values (\([0, \infty)\) for fluorescence intensity). In that sense, the value coordinate of a channel can be thought of as the collection of all intensity values with their spatial components removed, which is the image histogram, and the contrast_limits column in the channels table (Table 10) as an interval on that value dimension. This val “coordinate” does not have an associated table.
The label (lbl) dimension is derived from the voxel values of a label image. These are assigned during segmentation, with 0 encoding background (i.e., no object) and the other labels identifying instances of a given object type. When looking at a single label image in isolation, the lbl coordinate contains a unique identifier for each label object. As soon as we add a second label image (be it a different object type segmented in the same volume or the same object type segmented in a different site), this uniqueness breaks down unless explicitly enforced by coordinating the labeling between the two label images. So by adding new components to our dataset we destroy the uniqueness of our labels, a problem we have encountered before when looking at the channel coordinate (see Section 3.2.4.3.1). We could do the same thing and join the columns (o, lbl) → (o.lbl) to recover uniqueness. Another option is to allow compound coordinates (i.e., primary keys), which corresponds to named tuples as object IDs (o='nucleus', lbl=5).
The Region of Interest (roi) Coordinate
The final base coordinate is the one identifying each microscopy site. As mentioned before, a single region of interest is a bounded region in 3D space (z, y, x) that contains images (c) and label images (o) sampled at multiple scales (m). Since we acquired individual z-stacks without overlap from each site, they are independent of each other for processing, and the roi is an obvious axis along which one can parallelize workflows in an embarrassingly parallel31 fashion.
The rois table contains the path to the raster data which is stored in a single file per site in the abbott-h5 format (see Section 3.2.1). Furthermore, it contains a foreign key to the parent well and the transformation to change into its coordinate system.
The rois table uses roi as its primary key. In our case, it contains only a file path to the abbott-h5 file holding the accompanying data and a reference to the parent well coordinate system, including the required translation. well is called a foreign key since it maps to the primary key of the wells table.
At this point we need to mention an important deviation from the “classical” view on coordinates as it is used for dense arrays. Let us start with a single roi populated with some data: if we naively assemble our coordinates like this (o:2, c:3, y:2, x:2; a 2D image with 3 channels and 2 segmented object types), we would expect this “array” to contain \(2\times3\times2\times2=24\) values. In reality, however, the c and o axes each form the product with the spatial axes, but not with each other: pixels do not have different values per channel for each object type. So rather than thinking of them as orthogonal coordinates creating a space via the Cartesian product, we need to take their (tagged32) union and then the product with the spatial axes, leading to \((2+3)\times2\times2=20\) pixel values: 8 from the label images and 12 from the channels, as one would expect. The mental image here is of two coordinates pointing along the same direction and stitched together end to end (Figure 21, o and c). A different view is that the two coordinates form a partition of a larger raster channel coordinate.
The second issue comes up when we consider the whole dataset. Even though we tend to process all sites the same way in general (i.e., segmenting the same object types, resampling the image pyramids using a uniform pyramid schedule and recording the same fluorescent channels), there are exceptions. In this dataset we are using elution control wells in each cycle and 9 wells in the last cycle that were used as re-stain controls, meaning they have different (albeit overlapping) channel coordinates. This means that our dataset level channel coordinates, which are the union of the channel coordinates of the individual sites, are not densely populated with data. Here is a toy example:
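(The site and channel names below are hypothetical and only illustrate the idea.)

```python
import polars as pl

# Per-site channel coordinates: a regular site and a re-stain control site
# measure overlapping but different channel sets.
site_channels = pl.DataFrame(
    {
        "roi": ["site0", "site0", "site0", "ctrl0", "ctrl0"],
        "c": ["DAPI.0", "PCNA.0", "pH3.0", "DAPI.0", "PCNA.3"],
    }
)

# The dataset-level channel coordinate is the union of the per-site coordinates,
dataset_c = site_channels["c"].unique().sort()

# but the Cartesian product (roi x c) is only partially populated with data.
n_dense = site_channels["roi"].n_unique() * dataset_c.len()
print(f"{site_channels.height} of {n_dense} (roi, c) combinations hold data")
```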
As we have alluded to in the last section, we can combine the categorical base dimensions to generate new unique identifiers (and therefore dimensions) for each processing relevant entity in the dataset, namely images, label_images and label_objects (Table 14; Figure 21, right). Those new dimensions have coordinates consisting of tuples of the coordinate entries of their component dimensions. For example a single-channel, single-scale image at a specific site can be identified across the whole dataset by the named tuple (roi='site14', m=0, c='DAPI.1'). Just like with base dimensions we can use the unique IDs in the compound coordinates as primary keys in a table allowing us to store additional metadata for objects of that compound type.
Compound coordinates are named tuples of their components' coordinate entries, e.g., (roi='site0', m=1, c='DAPI.0') for an image.
The Image (roi, m, c) and Label Image (roi, m, o) Coordinates
The images and label_images tables look almost identical. They contain their respective three compound dimensions as primary keys and an additional path column with the relative path in the abbott-h5 file or, more generally, any input-output (I/O) source (Tables 15 and 16). These paths correspond to arrays within the h5 file and can be connected to a reader function to access the raster data.
Note that the label_images table contains an additional column m_scale_factors, and that the paths for the same label image at two different scales (i.e., m=0 versus m=1) have the same value. This is because we resample label images on-the-fly from the one persisted to disk, using the scale factors in that column. This works well because label images are resampled with nearest-neighbour interpolation, which is very cheap. It saves some disk space and also encodes at which level a label image was originally segmented, in this case m=1, which is why there are no resampling scale factors at that level.
The images table uses the m, roi, and c columns as its primary key, resulting in the image dimension. Other than that, it contains only the (relative) path to the 3D array holding the data.
The label_images table uses the m, roi, and o columns as its primary key, resulting in the label_image dimension. Note that the file paths repeat between m=0 and m=1, because label images are resampled on-the-fly without persisting the pyramid to disk.
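On-the-fly resampling then amounts to a nearest-neighbour zoom with the stored scale factors; a minimal sketch (the factor values are illustrative):

```python
import numpy as np
from scipy.ndimage import zoom


def upsample_label_image(labels_m1: np.ndarray, m_scale_factors=(1, 2, 2)) -> np.ndarray:
    """Resample a label image persisted at m=1 to a finer level; order=0
    (nearest neighbour) preserves the integer label values."""
    return zoom(labels_m1, zoom=m_scale_factors, order=0)
```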
The Label Object (o, roi, lbl) Coordinate
Up to this point, we have focused on raster data that, within each site, occupy the same spatial region. This is ideal for many feature extraction functions, which operate directly on label images rather than on individual objects. However, as we move toward higher-level features, the individual objects within those sites—rather than the sites themselves—become the primary focus.
To enable access to these objects at the single-object level (for measurements or visualization in exploratory data analysis), we introduce a dedicated label_objects table. This table requires an initialization pass across all label images to identify the label objects (lbl) they contain. During this pass, we can also compute additional processing-relevant “features.”
This initialization step is conceptually distinct from feature extraction. It alters how the data is accessed by introducing the label object coordinate and alternative geometric representations. For instance, the bounding box is a feature used purely for data access. Other geometric descriptors—like the centroid or principal axes—may support both visualization and further measurements, or be treated as standard features in downstream analysis.
The primary key columns of the label_objects table are object type (o), ROI (roi), and label (lbl). Importantly, the multiscale level (m) is not part of this key. This is because label objects are intended to represent the object itself—independent of how it is sampled—rather than any specific mask tied to a particular resolution. A more appropriate view is that a label object has multiple representations—both raster and vector—none of which is inherently privileged.
Even the original mask—the one from which all other representations are derived—is not treated as the definitive version, though it may receive slightly different consideration. For example, it is typically the only mask suitable for computing morphology features on raster data. If we upsample a label image created at a lower multiscale level, the resulting mask can be useful for intensity measurements at higher resolutions, but it must not be used for morphology measurements such as roundness, as nearest neighbor resampling introduces jagged edges that would degrade the resulting measurement.
After IDs have been assigned, all these processing-relevant representations are stored in the label_objects table, where they are readily accessible (Table 17).
The label_objects table's primary key columns combine into the compound dimension label_object. Note how combining base coordinates gives rise to new entities, and how the components of the key contain the keys of other entities (o: object type, roi: site ID, (roi, o): label_image). The other columns contain different representations of the label object in physical coordinates (e.g., centroid, bounding box).
Let me quickly summarize what we have discussed so far in terms of the logical data model and coordinate tables:
- Following the spatial-image/xarray approach of defining dimensions and coordinates to annotate and access their underlying data, we aim to adapt that view for a broader use case than just dense arrays, namely high-content multiplexed volumetric datasets.
- We extended the image dimensions ([t], c, z, y, x) to include label images (o), multiscale sampled grids (m) and multiple sites (or coordinate systems; roi) and defined their respective coordinates each in a dedicated dimension table schema as the primary key of that table.
- Compound coordinates uniquely identify the processing-relevant entities (images, label_images and label_objects) across the whole dataset. They are defined in an identical way to base coordinates, in terms of dimension/coordinate tables, the only difference being that their primary keys consist of multiple columns.

While the base coordinates (Table 9) reflect decisions made during experimental design and early preprocessing—such as how many sites to image (roi), the microscope’s spatial sampling (z, y, x), stain selection (c), segmentation targets (o), and pyramid downsampling levels (m)—the compound coordinates (Table 14) represent the concrete resources available for downstream analysis. These include the actual image volumes, label images, and label objects used in feature extraction.
Base coordinate tables are derived from the unique values of the compound dimension coordinates (roi, c, o) and serve as a structure for attaching metadata or defining new access axes. By computing the Cartesian product of selected base coordinates, we generate a compound coordinate space that is only partially populated with data in the general case (see Section 3.2.4.5).
Compound coordinate tables for images and label images can be assembled by crawling a dataset using a suitable image and metadata reader. This is implemented for the legacy abbott-h5 format, which stores 3D single-channel arrays. To access an OME-Zarr dataset similarly, one would need a reader capable of slicing into the 5D array along the channel dimension. This could be accomplished by adding a channel index column to the coordinate table.
For raw Yokogawa data, where each 3D volume consists of 271 individual .tif files (one per z-slice), a parser could extract metadata from the filenames and group corresponding slices into lists. The reader would then be responsible for concatenating these into single-channel volumetric images.
Let’s now think about how we can access the components of the dataset. The simplest way is by just iterating over the coordinates one by one:
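As a sketch, such a loop over the images table could look like this; the toy table and the stub reader stand in for the resources multi-frame and the abbott-h5 loader.

```python
import polars as pl

images = pl.DataFrame(
    {
        "roi": ["site0", "site0"],
        "m": [1, 1],
        "c": ["DAPI.0", "PCNA.0"],
        "path": ["ch_00/lvl_1", "ch_01/lvl_1"],
    }
)


def load_image(path: str):
    print(f"would read the array stored at {path}")


for roi, m, c, path in images.select(["roi", "m", "c", "path"]).iter_rows():
    load_image(path)  # one single-channel, single-scale volume per iteration
```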
In a feature extraction context, where each batch is processed on a different machine, the batch ID is stored in the table metadata, while the remaining image IDs are stored as columns in a single table per worker (see Section 3.2.6.9).
The compound coordinate tables for representing images and label images were based on the most common access patterns useful for feature extraction (i.e., single-channel, single-scale volumetric images and single-channel, single-object type volumetric label images). By allowing tuple IDs the data model can be extended to express multichannel or multiscale images as well (Table 18). An entity, such as a multichannel image, would then correspond to a slice along the c coordinate in the images table.
The notation :n indicates a slice through the coordinate axis, as compared to selecting a single entry (indexing into the coordinate). Coordinate entries are named tuples of its components, which in this case can contain tuples to represent a coordinate slice, e.g., (roi='site0', m=0, c=('DAPI.0', 'PCNA.0')) for a multichannel image.
The resource tables have fully specified schemas (i.e., column names, data types, primary key columns) that can represent any multiplexed high-content microscopy dataset. These schemas are defined as Polars data frames in Python, with additional metadata—such as primary keys and other constraints—specified using the data frame validation library pandera (Bantilan 2020).
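As an illustration, a dimension-table schema along these lines could be declared as follows. This sketch uses the pandas-backed pandera API and a simplified set of columns; the actual schemas are defined for Polars frames.

```python
import pandera as pa

# Unique `c` values act as the primary key of the channels dimension table.
channels_schema = pa.DataFrameSchema(
    {
        "c": pa.Column(str, unique=True),
        "stain": pa.Column(str),
        "acquisition": pa.Column(int),
        "wavelength": pa.Column(int, nullable=True),
    }
)
# channels_schema.validate(df) raises if the constraints are violated.
```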
On disk, the data are stored as a set of Parquet files. In memory, a custom container called MFrame holds the data as polars.DataFrames or polars.LazyFrames. The primary purpose of MFrame is to serve as a convenient structure for organizing both the resource tables (Resources) and the feature tables (Features), which are discussed in the next section.
- A concise representation (__repr__) of the full dataset, derived from its coordinates. This representation should summarize the dataset’s components for use in an interactive programming environment and remain informative throughout processing steps and subset selection.
- Access to the underlying raster data stored in abbott-h5 (see Section 3.2.1).

It has not slipped my attention that a set of tables with constraints and relationships between them essentially constitutes a relational database. This is by design. I briefly explored whether existing technologies might serve as table stores, especially for the feature tables discussed in the next section. In the long term, features should likely be stored in a database optimized for OLAP workloads (see Section 1.3.3.3.3).
I ultimately decided against adopting these systems due to their added complexity and to avoid writing SQL during the prototyping phase. Still, exploring various RDBMSs and data frame libraries proved valuable. It informed the decision to use complex data types sparingly in the table schemas (e.g., storing points as three separate columns—point-z, point-y, point-x—instead of a single polars.Struct or polars.Array column). This choice helps maintain compatibility with a variety of potential backends.
MFrame
In memory, the tables are handled using a custom container called an MFrame (multi data frame). At its core, MFrame is an auto-generated, frozen34 dataclass, with fields corresponding to individual tables. It is built dynamically from the table schemas and represents a collection of tables, including their respective primary key columns.
MFrame provides a helpful __repr__ that summarizes the component tables, their primary keys, and the cardinality of compound keys (Listing 2 (b)). It serves as a compact container for managing multiple tables without cluttering the namespace.
Filter predicates applied to an MFrame are forwarded to the underlying DataFrames, allowing row filtering across the entire set of tables in one operation. If a filter references a column that does not exist in a table, the resulting exception is caught and ignored—meaning the filter is applied where possible and silently skipped otherwise.
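The forwarding logic can be illustrated in a few lines of polars; this simplified sketch assumes the tables are held in a plain dictionary rather than the actual MFrame fields.

```python
import polars as pl


def filter_all(
    tables: dict[str, pl.DataFrame], predicate: pl.Expr
) -> dict[str, pl.DataFrame]:
    """Apply a filter to every table that has the referenced column(s);
    tables without them are passed through unchanged."""
    out = {}
    for name, df in tables.items():
        try:
            out[name] = df.filter(predicate)
        except pl.exceptions.ColumnNotFoundError:
            out[name] = df  # predicate does not apply to this table
    return out


# Example: keep ROIs whose name starts with "B" wherever a `roi` column exists.
# filtered = filter_all(tables, pl.col("roi").str.starts_with("B"))
```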
Additionally, an _update35 method can be called on the MFrame to synchronize the state of the data index (i.e., primary key columns) across all tables. This method also enables filtering based on metadata columns from individual component tables.
Due to the dynamic creation of MFrame classes, MFrame itself acts as a class factory or metaclass, relating to the generated classes in a similar way that regular classes relate to their instances. Each generated class exists in a pair—LazyMFrame and MFrame—that hold polars.LazyFrames and polars.DataFrames, respectively. This design enables seamless transitions between lazy and eager evaluation via LazyMFrame.collect() and MFrame.lazy(). Additionally, it provides convenient attribute access to the component data frames (e.g., resources: MFrame; resources.images: pl.DataFrame).
While I personally enjoy exploring the metaprogramming capabilities Python affords, this added layer of abstraction can make the resulting code harder to follow—and some might reasonably argue it overcomplicates things. Coupled with my broader uncertainty about the ideal backend (see Section 3.2.5.3), this project remains experimental and is primarily intended for personal use. Its suitability as an open-source tool is limited for now and would require considerable refactoring and documentation. That said, I still view it as a success: it has enabled valuable prototyping and helped clarify what I want from my dataset API.
LazyResources and Resources
In the context of the resource tables, we rarely use the lazy version (LazyResources), since these tables typically do not hold large volumes of data. Moreover, the data they do contain—namely the data index and relevant metadata—is exactly what we want in memory to facilitate efficient access and summarization of the dataset.
However, implementing LazyMFrame alongside MFrame is still useful, particularly for feature tables. In those cases, the benefits of predicate pushdown and lazy loading—features inherited from the polars library—can significantly reduce I/O. This is useful because most analyses only require a small subset of features, making it wasteful to load all feature tables into memory.
Resources Tables
The following section presents an example of the Resources multi-table that contains all the tables discussed in Section 3.2.4 and summarized in Figure 21. After scanning the resource tables and immediately collecting them (Listing 2 (a)), we see the representation of the whole dataset based on the coordinate tables discussed so far (Listing 2 (b)).
The HTML table displayed there contains the same information provided by the string repr of the multi-table class, as displayed in an interactive programming environment like the IPython shell or a Jupyter notebook. In fact, it is only a thin wrapper around the tabular summary, using the Great Tables library to convert it to HTML.
The first column specifies the table names, which are the same as the primary key names in the case of simple dimensions but differ for compound dimensions. By convention, we use the plural form for table names and the singular form for (compound) dimensions (e.g., the images table with the image coordinate; Table 15).
We can immediately see that we imaged 212 sites in 30 wells, resampled images at 6 multiscale levels, and that there are 8 segmented object types and 25 channels in total. Furthermore, we observe that there are 10,116 label images, 20,352 images, and 1,376,201 label objects.
Looking at the primary key columns (pk), we see their names and cardinality. In the case of label images and label objects, only 4 out of the 8 object types were actually initialized and have entries in the label_objects table. This is because only 1 out of 4 segmentation runs for nuclei was used in downstream processing. Additionally, the membrane segmentation was not used for measurements and therefore was never initialized.
We can apply polars filter operations over the multi-frame, which will forward the filter to each DataFrame it applies to. For instance, we can filter on the roi column for regions starting with the letter “B”. The predicate applies to 31 ROIs in 4 wells, and the other resource tables are updated accordingly (Listing 2 (c)).
Selecting just the nuclei at multiscale level 1 reduces the number of label images to exactly 212, which matches the number of ROIs, as there exists one such label image per ROI. Even though we are working with tables, and the underlying columns in the data frames are retained, we can conceptually squeeze the now singleton m and o dimensions—meaning they are removed from the pk columns in the repr to reduce visual clutter—so that we only see what is relevant (Listing 2 (d)).
Finally, we can filter not only on the primary key columns but on any metadata column in the coordinate tables, and the selection will propagate to the rest of the tables. In the final panel, we filter against row and column from the wells table, m and o just like before, and finally against wavelength in the channels table (Listing 2 (e)). Even if we had not seen the query, just the final table, we would have no trouble understanding the state of the dataset: we have 9 ROIs from a single well remaining. They contain 21,270 nuclei and 4 channels sampled at multiscale level 1.
Resource tables can be (a) loaded into memory via scan_resources().collect() and (b) used to summarize all dataset components. (c, d) The multi data frame can be filtered (sliced) along the primary key (pk) columns or (e) along metadata columns of the individual tables using polars filter operations.
n is the number of rows, i.e., the number of instances of the type described by the table.
Filtering is forwarded to pl.DataFrame.filter, inheriting filter predicates such as string matching.
Here we filter on metadata columns of the wells and channels tables.
I will refer to both polars.LazyFrames and xarray.DataArrays with dask as an array backend as lazy data structures, representing tables and labelled arrays respectively. The method polars.LazyFrame.collect() -> polars.DataFrame is equivalent to xarray.DataArray.compute() -> xarray.DataArray, which yields a NumPy-backed xarray.
By aggregating all the dataset metadata into one data structure, we can generate useful summary views of the entire dataset while requiring only minimal amounts of data to be read. The Resources tables contain all processing-relevant information, including IDs and component metadata, which is why they are referred to as dataset coordinates. They can be thought of as a lazy representation of the dataset. This should not be confused with the lazy table representation implemented via LazyResources—those represent not a lazy dataset, but lazy dataset coordinates, or for lack of a better term, a very lazy dataset (see Caution 1).
The goal is for both to be used identically: to select the processing-relevant slices of the dataset that fit into memory. The ability to attach additional metadata columns to the coordinate tables is similar to using xarray with multiple coordinates on a dimension. Compound coordinates are comparable to an xarray.DataArray with a multi-level index (i.e., a pandas.MultiIndex) as its dimension coordinates. I personally have always found the pandas multi-index confusing to use,36 which is why I prefer a table-based approach that treats all columns equally.
In summary, the resource tables provide the information necessary to quickly gain an overview of a dataset and enable the user to choose which components they want to access.37 Furthermore, they contain the data required to validate feature queries, as will be discussed in the next section.
Being able to perform validation quickly and in a single process means we can catch errors before any data processing begins. This is preferable to sending queries to different workers, each performing its own validation, which might fail independently due to a typo in the configuration. In the best case, where queries are carefully designed to be idempotent, we can fix the configuration error and re-run the same queries with minimal effort. In the worst case, we may have to dig through log files, reconstruct a new query to cover the missed data, or rerun the entire job, after deleting all previous results. One approach is tedious, the other inefficient. In both scenarios, the ideal outcome would have been for the job to fail immediately, allowing us to correct the mistake on the spot.
Depending on the processing environment—especially in cloud setups that may introduce network issues and other failures beyond our control—it remains important to maintain idempotent queries and to consider automated rescheduling in case of frequent transient failures. However, in my personal experience working within the HPC environment at the University of Zurich, I did not experience a single job failure due to anything other than my own mistakes in parameter specification or bugs in the processing code. The former could have been prevented with upfront validation, while the latter can only be avoided by thoroughly testing code before running it on large datasets.
Specifying the component IDs using named tuples allows us to adapt their complexity in accordance with the subset of data being processed. I propose that infrastructure code should define the relevant dimensions and coordinates, and not allow deviations—except, perhaps, for the ability to specify default behavior for missing values. This is akin to writing image processing functions that only accept 5D arrays with fixed dimensions, with an added benefit: by using named dimensions and consistent function signatures that communicate what data they expect, we can ensure that people can still use the functions ergonomically (i.e., without manually adding singleton dimensions, which is both ugly and error-prone), while still getting the correct results.
When working on a subset of data during EDA—say, a single multiscale level, site, and object type (roi='site0', m=1, o='nucleus')—we can conceptually apply a squeeze operation, returning a minimal ID to the user within the context of the data being processed, while ensuring that the squeezed information is maintained in the background. In this case, a feature table with label as the only ID component would suffice. This allows a single data model, which might be overly complex for some use cases, to still be used ergonomically.
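To make the idea of dynamic IDs and the squeeze operation concrete, the following is a minimal sketch based on typing.NamedTuple; the class name, fields, and the squeeze helper are hypothetical and only illustrate how a full ID can be reduced to the components that actually vary in a given context.

from typing import NamedTuple

class FullID(NamedTuple):
    """Hypothetical full component ID used across the dataset."""
    roi: str
    m: int      # multiscale level
    o: str      # object type
    label: int  # label ID within the object type

def squeeze(ids: list[FullID]) -> tuple[dict, list[tuple]]:
    """Split off components that are constant across all IDs (the context)
    and return minimal IDs containing only the varying components."""
    constant = {
        field: getattr(ids[0], field)
        for field in FullID._fields
        if len({getattr(i, field) for i in ids}) == 1
    }
    varying = [f for f in FullID._fields if f not in constant]
    minimal = [tuple(getattr(i, f) for f in varying) for i in ids]
    return constant, minimal

ids = [FullID('site0', 1, 'nucleus', lbl) for lbl in (1, 2, 3)]
context, minimal_ids = squeeze(ids)
# context == {'roi': 'site0', 'm': 1, 'o': 'nucleus'}, minimal_ids == [(1,), (2,), (3,)]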
Finally, the named tuple IDs used in the tables as primary keys are essentially types—similar to how object-relational mappings (ORMs) translate between database tables and types in a programming language. The only difference is that, in this case, the types are typically lazy representations of large data structures like images or meshes, rather than small objects that can be stored in a record as a whole.
The flexibility of dynamic IDs allows dimensions to be added generously, since it is always possible to automatically reduce them to the ones minimally required in a given context. This includes the possibility of treating processing steps as dimensions, with the corresponding coordinates representing different parameter settings.
An example of this can be found in z-intensity bias correction models (see Section 3.2.7). During the initial phase of exploring different models, I considered them to be a coordinate (z_model: ['Exp-1D', 'Exp-2D', 'Linear-1D', 'Linear-2D']38). Once the best model was established, I started treating it as a squeezed singleton dimension (z_model='Exp-2D'). That is, I applied the correction, kept a record of it, but dropped the coordinate (or column) from my feature tables, since it was now redundant. Feature tables generated before the introduction of z-bias correction would default to (z_model=None), meaning no z-intensity bias correction was applied.
The final point is that this data model can be used to unify access to different data components. Data structures that can be efficiently stored in tables (e.g., points, lines, bounding boxes) are stored there directly. More complex data structures (e.g., images, meshes) are not stored in tables but are accessible via their path and type39, paired with associated data loaders.
Again, this can also include processing steps, such as lazy label image resampling (see Section 3.2.4.6.1) or on-the-fly application of intensity bias correction on images (see Section 3.2.7).
I want to begin with a high-level overview of the different kinds of features we will encounter before diving into the details of each. The goal of feature extraction is to annotate our objects with a—relatively—small number of numerical values that carry biological meaning or instrumental utility. In doing so, we reduce the data volume from terabytes to gigabytes—an amount suitable for exploratory data analysis and statistical inference.
An alternative perspective on feature extraction is to view it as a form of lossy data compression: we intentionally discard raw pixel information while retaining a distilled representation that—hopefully—captures the essence of what we care about biologically or technically.
When considering a single voxel, we know from the introduction (see Section 1.3.1) that it carries information about its location, extent (i.e., coverage), and the fluorescence abundance at that location. Once segmentation has been performed, there is additional information indicating whether the voxel is part of a hierarchical series of objects (e.g., nuclear compartment, nucleus, cell, embryo), spanning scales from the subcellular to the embryo or tissue level.
Measurements can be understood as aggregations of voxel properties into properties of labeled objects. For instance, we can measure the volume of a nucleus—represented as a label mask—by summing the volumes of its constituent voxels. Similarly, we can compute descriptive statistics (mean, sum, variance, etc.) on voxel intensities within a given channel. I will refer to this first layer of aggregation as raster features, since they are typically computed from raster data structures (Figure 23 a, (1)). Beyond this initial layer, data processing is primarily table-based.
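As a minimal sketch of this first aggregation layer—not the ITK-based implementation used in abbott—the following computes object volumes and mean intensities directly with numpy, assuming a 3D label image, a matching intensity image, and anisotropic voxel spacing:

import numpy as np

# hypothetical inputs: 3D label image, intensity image, voxel spacing in µm (z, y, x)
labels = np.random.randint(0, 4, size=(10, 64, 64))
image = np.random.rand(10, 64, 64)
spacing = (1.0, 0.2, 0.2)
voxel_volume = float(np.prod(spacing))

for lbl in np.unique(labels[labels > 0]):
    mask = labels == lbl
    volume = mask.sum() * voxel_volume   # sum of constituent voxel volumes
    mean_intensity = image[mask].mean()  # descriptive statistic over voxel intensities
    print(lbl, volume, mean_intensity)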
Once features have been assigned to spatial objects, we can analyze them within their spatial neighborhoods—that is, by aggregating object features across “sibling” objects of the same type (e.g., nuclei). This enables the computation of population-level features such as local cell density or cell cycle phase synchrony within a specified neighborhood (Figure 23 a, (2)).
A third axis of aggregation operates across different levels of the object hierarchy (Table 11)—for example, by counting the number of cells within an embryo or computing the embryo’s average cell cycle phase (Figure 23 a, (3)).
Finally, feature embeddings represent another form of aggregation—this time across feature columns. These embeddings produce new—categorical or continuous—higher-level features, derived from patterns in the existing feature space (Figure 23 a, (4)). They can include “classical” embeddings like UMAP, but also the cell type class resulting from applying a classifier model (see Section 3.3.2) or the cell cycle phase feature discussed later (see Section 3.3.3).
Raster features form the most fundamental class of features in the library and are the basis for all downstream measurements. For people familiar with scikit-image’s regionprops, or the MATLAB function with the same name, the first two function signatures should look familiar (Listings 3 (a) and 3 (b)).
Within the abbott library, we use the ITK implementation for “regionprops-style” measurements, mainly because scikit-image was not ready for 3D image data at the time we started (this has since changed with the addition of the scale and offset parameters). Furthermore, ITK offers very fast and battle-tested algorithms implemented in C++, specifically designed for volumetric imaging.
Label features are those that can be computed from the label image alone (Listing 3 (a)). They include label object IDs and vector representations (e.g., bounding box, centroid, principal axes) as mentioned in the previous section. In addition, they include scalar morphological features such as roundness, volume, and Feret diameter.
The second feature class is intensity features, which aggregate pixel intensity distributions within a label region by applying descriptive statistics to the intensity histograms, such as the mean, median, or kurtosis (Listing 3 (b)). In addition, they include intensity-weighted vector representations (e.g., weighted centroid, weighted principal axes). Texture features for quantifying subcellular intensity distributions are not yet implemented, but they would also fall into this category.
An additional family of features is colocalization features (Listing 3 (c)). These measure pairwise associations between two channels within the label region, using either correlation metrics or mutual information. New features can be added by providing additional ColocalizationFn functions that process two vectors containing the paired pixels from the two respective channels.
These features are useful for quantifying the co-localization of protein stains and have also proven valuable as quality metrics during the registration stage, where pairwise correlation between DAPI signals from different acquisitions is used to detect misaligned cells (Figure 18).
The last of the four functions computes distance features on the raster images, measuring distances from objects in a label image to a reference object specified by its object type and label ID (Listing 3 (d)). The reference label image is first converted into a binary mask by selecting the reference label, and a distance transform is applied (Figure 23 b, Distance). The first label image can then be used to perform “intensity-style” measurements on the resulting distance transform, aggregating the distance values within each label region using the same descriptive statistics such as minimum, median, or maximum. An alternative approach involves interpolating the distance transform at a representative point (e.g., the centroid).
The applied distance transforms can be either the standard Euclidean distance or distances computed along a specific spatial axis. While the latter may seem obscure, it becomes particularly useful in estimating intensity bias correction models (see Section 3.2.7).
This function is not intended for exhaustive pairwise distance measurements between large numbers of objects, but rather for measuring distances to either parent objects (e.g., parent embryo, cell population border) or reference objects (e.g., dorsal cell).
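The following sketch illustrates the distance feature computation with scipy rather than the ITK-based implementation; the label layouts, the reference label, and the voxel spacing are hypothetical.

import numpy as np
from scipy import ndimage

# hypothetical nuclei labels, a reference label image, the reference label ID, and voxel spacing
nuclei = np.zeros((5, 32, 32), dtype=int)
nuclei[2, 20:25, 20:25] = 1
reference = np.zeros((5, 32, 32), dtype=int)
reference[:, 2:8, 2:8] = 7
spacing = (1.0, 0.2, 0.2)
label_to = 7

# binary mask of everything outside the reference object, then a Euclidean
# distance transform in physical units (distance to the reference object)
outside_ref = reference != label_to
dist = ndimage.distance_transform_edt(outside_ref, sampling=spacing)

# "intensity-style" aggregation of the distance transform within each nucleus
for lbl in np.unique(nuclei[nuclei > 0]):
    d = dist[nuclei == lbl]
    print(lbl, d.min(), np.median(d), d.max())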
Each of the four raster feature functions (Listing 3 a–d) takes a LabelImage as the first argument and a set of function-specific [FType]Features as the last argument, and returns a polars.DataFrame. Depending on the feature type, they may take additional images or parameters, such as the label ID or a distance transform. This function signature defines the feature type.
(a) features/label.py

def get_label_features(
    label_image: LabelImage,
    features: tuple[LabelFeature, ...],
) -> pl.DataFrame:
    ...

(b) features/intensity.py

def get_intensity_features(
    label_image: LabelImage,
    image: SpatialImage,
    features: tuple[IntensityFeature, ...],
) -> pl.DataFrame:
    ...

(c) features/colocalization.py

def get_colocalization_features(
    label_image: LabelImage,
    image0: SpatialImage,
    image1: SpatialImage,
    features: tuple[ColocFeature, ...],
) -> pl.DataFrame:
    ...

(d) features/distance.py

def get_distance_features(
    label_image: LabelImage,
    label_image_to: LabelImage,
    label_to: int,
    dist_transform: tuple[DistFunc, ...],
    features: tuple[DistFeature, ...],
) -> pl.DataFrame:
    ...

Having established the framework for measurements on single objects, we now turn to population-level features. So far, all measurements—except for distance features—have been performed at the single-cell (object) level. Staying at this level essentially amounts to a slightly more sophisticated40 version of flow cytometry, albeit in a roundabout way. The true power of volumetric imaging lies in measuring protein abundance of cells in situ, so it is important to incorporate spatial relationships between objects into the analysis. This introduces three main challenges:
These relationships can be represented as graphs—sets of nodes connected by edges. In geometric contexts, nodes are often called vertices or points, terms I will use when discussing spatial graphs where nodes have physical positions. While it is clear that graphs are necessary for aggregating features based on neighborhoods and hierarchies, we still need to decide on the graph type (weighted vs. binary, directed vs. undirected) and how best to store them in memory and on disk.
Hierarchy features capture the child-parent relationships between label objects of different types. They reflect the fact that biological tissues can be described at multiple spatial scales, where larger structures are composed of smaller components or subdivided into substructures—such as an embryo containing multiple cells, each with a nucleus, or a cell consisting of nuclear, cytoplasmic, and membrane regions.
These relationships are modeled as a directed acyclic graph (DAG), with edges defined as binary to represent that two objects are either related or not—there is no “partial” relationship. Although weights could be used to encode uncertainty in these relationships, we omit them here for simplicity. In this model, each object is a node, and an edge points from the child to its parent object.
We establish these relationships between label objects by aggregating the pixels of the parent label image within the child label region using the mode. This means some objects may have the background label from their parent label image as a parent, rendering them orphans (e.g., debris outside the embryo detected as a nucleus). Objects at hierarchy level 0 are called roots; in our case, these are the embryos. Roots have no parents because they were not queried for any (Table 11, column parents), whereas orphans have the background label as their parent.
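A minimal sketch of this parent assignment, assuming small numpy label images for a child (nuclei) and a parent (embryos) level; objects whose modal parent pixel is background end up as orphans:

import numpy as np

# hypothetical child (nuclei) and parent (embryo) label images
nuclei = np.array([[1, 1, 0],
                   [2, 2, 2],
                   [0, 3, 3]])
embryos = np.array([[4, 4, 0],
                    [4, 4, 0],
                    [0, 0, 0]])

edges = []  # edge list of (child_label, parent_label) relationships
for lbl in np.unique(nuclei[nuclei > 0]):
    parent_pixels = embryos[nuclei == lbl]
    values, counts = np.unique(parent_pixels, return_counts=True)
    parent = int(values[np.argmax(counts)])  # mode of the covered parent pixels
    edges.append((int(lbl), parent))

print(edges)  # [(1, 4), (2, 4), (3, 0)] -> nucleus 3 is an orphan (background parent)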
Child-parent relationships can be classified as one-to-one (1:1) or many-to-one (m:1). For example, the relationship of a nucleus41, cytoplasm, or membrane to their parent cell is 1:1, whereas the relationship between cells or nuclei and the embryo is m:1 (Figure 24). These relationship types fully define how features can be aggregated between objects. Objects with a one-to-one relationship simply pass features along and add their object type as a label, allowing us to distinguish whether a feature refers to the nuclear or cellular volume. For many-to-one aggregations, an aggregation function must be applied. These can be any descriptive statistic returning a scalar value—the same ones used for raster feature aggregations. The operation is conceptually identical; instead of aggregating from voxels to label objects, we aggregate from label objects (i.e., super-voxels) to a higher-level label object.
The current implementation for establishing the object hierarchy uses individual parent queries (e.g., nucleus-to-cell, nucleus-to-embryo, and cell-to-embryo). This approach can lead to an edge case where a nucleus has a parent cell that, in turn, has a parent embryo, while the nucleus itself is considered an orphan with respect to the embryo. It may be preferable to enforce a consistency constraint that prevents such discrepancies.
The relationships are recorded in a hierarchy table that consists of the two label_object coordinates of the parent and child objects (Table 19). Each row in this table records one parent-child relationship, making the table an edge list representation of a graph.
A key advantage of this representation is its sparsity, since we only need to store an entry if two objects are related. Furthermore, this data structure fits perfectly with our goal of aggregating features between different objects because we can join a feature table on either side of the hierarchy table, allowing aggregation in both directions.
For aggregations between siblings (1:1) or for passing features from parent to child (1:m), we propagate features between related objects simply by using two join operations. For many-to-one relationships (m:1), we first join the feature table on the child columns, then apply a group-by operation on the parent ID, followed by any suitable aggregation function. This follows the well-known split-apply-combine pattern (Wickham 2011), resulting in an aggregated feature table for the parent objects.
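A sketch of this m:1 aggregation with polars, using a hypothetical hierarchy edge list (nuclei to embryos) and a nuclear feature table; the column names are illustrative, not the exact abbott schema:

import polars as pl

# hypothetical hierarchy table (edge list) and nuclear feature table
hierarchy = pl.DataFrame({
    'label': [1, 2, 3, 4],         # child (nucleus) label
    'label_parent': [1, 1, 2, 2],  # parent (embryo) label
})
features = pl.DataFrame({
    'label': [1, 2, 3, 4],
    'PhysicalSize': [10.0, 12.0, 9.0, 11.0],
})

# split-apply-combine: join child features, group by parent, aggregate
embryo_features = (
    hierarchy
    .join(features, on='label')
    .group_by('label_parent')
    .agg(pl.col('PhysicalSize').mean().alias('PhysicalSize_mean'))
)
print(embryo_features)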
Neighborhood features are computed by aggregating across objects of the same object type, starting with the specification of a neighborhood, followed by the application of standard scalar aggregation functions—similar to what we have seen in previous sections. Conceptually, this follows the same split-apply-combine pattern of grouping by neighborhood and then applying an aggregation function (Figure 25 a).
Unlike the hierarchy graph, the full neighborhood graph can contain cycles, and relationships between objects may be weighted—for instance, using a kernel function applied to pairwise distances. From the perspective of a single label object, the local neighborhood graph is a directed acyclic graph (DAG), with edges pointing inward from neighboring objects (Figure 25 b). Additionally, each neighborhood may include a self-loop, depending on whether the object is considered part of its own neighborhood.
One of the most intuitive ways to construct a neighborhood is to compute all pairwise distances between objects and define a threshold distance below which cells are considered neighbors. This results in a symmetric graph, since the neighbor relationship is mutual for any two points within the threshold. By adjusting the distance threshold, we control the spatial scale of the neighborhood. In abbott, we refer to this as a radius neighborhood. To compute these distances efficiently—especially when working with hundreds of thousands of objects—we do not use the raster-based distance functions described earlier. Instead, objects are represented as points at their label mask centroids, allowing the use of an efficient spatial indexing structure: the K-D Tree. This structure supports fast radius-based neighbor queries and can handle non-Euclidean metrics, although here we use it for Euclidean distances in physical space. The implementation is provided by sklearn.neighbors.KDTree from scikit-learn (Pedregosa et al. 2011).
A second commonly used neighborhood type is the k-nearest neighbors (KNN) graph. It also starts from pairwise distances, but instead of using a fixed radius, each object is connected to its k closest neighbors—regardless of their actual distances. The resulting graph has a fixed degree for each node but is generally asymmetric. KNN graphs are popular because they adapt well to regions with variable object density and are less prone to creating disconnected subgraphs when k is sufficiently large. This flexibility makes them a staple in unsupervised machine learning, especially when defining neighborhoods in high-dimensional feature spaces. However, when used in physical space, KNN graphs imply varying neighborhood sizes, which can make interpretation more challenging. Internally, the same K-D Tree implementation is used to support both radius and KNN neighborhood queries.
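A minimal sketch of constructing both neighborhood types from object centroids with scikit-learn's KDTree—this is the underlying query mechanism, not the NeighborhoodQueryObject API itself; the centroids and length scales are hypothetical.

import numpy as np
from sklearn.neighbors import KDTree

# hypothetical centroids in physical coordinates (z, y, x), in µm
centroids = np.random.rand(1000, 3) * 200
tree = KDTree(centroids)

# radius neighborhood: symmetric, variable number of neighbors per object
radius_neighbors = tree.query_radius(centroids, r=40.0)

# KNN neighborhood: fixed degree, generally asymmetric
# (k + 1 because each query point returns itself as its nearest neighbor)
distances, knn_neighbors = tree.query(centroids, k=5 + 1)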
The last method for defining neighborhoods from point clouds (i.e., object centroids) is via Delaunay triangulation. In 2D, Delaunay triangulation partitions the convex hull of the points into triangles such that the minimum angle of all triangle angles is maximized—a property that helps avoid thin, elongated triangles and makes Delaunay popular for generating meshes from point clouds. We use the scipy.Delaunay implementation to construct these simplices (triangles in 2D, tetrahedra in 3D), and then define edges between any pair of points that are connected in a simplex (Virtanen et al. 2020).
Delaunay triangulation provides an intuitive notion of neighborhood without requiring dense segmentation of objects. However, it has limitations in concave geometries. For instance, in early zebrafish embryos, the convex hull assumption leads to long-range spurious connections across concavities (Figure 26). One way to mitigate this is by introducing a radius threshold to prune these long connections. Another approach is to use an embryo mask to remove connections that exit the masked region. The latter avoids choosing an arbitrary radius threshold but comes with the cost of needing to load the mask during neighborhood construction, meaning it can no longer be computed from centroids alone.
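A sketch of deriving neighborhood edges from a Delaunay triangulation with scipy, including the optional radius pruning mentioned above; the point coordinates and the threshold are hypothetical.

import numpy as np
from scipy.spatial import Delaunay

# hypothetical 3D centroids in physical coordinates
points = np.random.rand(500, 3) * 200
tri = Delaunay(points)

# edges between any pair of points sharing a simplex (tetrahedra in 3D)
edges = set()
for simplex in tri.simplices:
    for i in range(len(simplex)):
        for j in range(i + 1, len(simplex)):
            edges.add(tuple(sorted((int(simplex[i]), int(simplex[j])))))

# optional pruning of spurious long-range connections across concavities
max_len = 40.0
edges = {(a, b) for a, b in edges
         if np.linalg.norm(points[a] - points[b]) <= max_len}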
With this, we arrive at the last implemented neighborhood type: the touch neighborhood. This approach uses the label mask to identify adjacent label regions and estimates their shared surface area by summing the areas of touching voxels, while accounting for anisotropic voxel dimensions. Unlike other neighborhoods that measure spatial proximity, the touch neighborhood quantifies physical contact—specifically, the interface area between neighboring cells. Its accuracy is limited by segmentation quality and the approximation of shared surfaces based on discrete voxel grids.
From a biological perspective, having access to multiple neighborhood definitions is highly valuable. For example, in studying cell-to-cell signaling, some pathways—such as BMP signaling—rely on diffusible ligands, where a radius neighborhood may be more predictive of signaling interactions. In contrast, pathways like Delta/Notch signaling depend on membrane-bound ligands and require direct contact between cells, making the touch neighborhood more relevant in such cases.
Aggregation of neighborhood features follows the familiar split-apply-combine strategy discussed earlier. A neighborhood feature is computed by first defining a neighborhood—for example, all objects within a 40 μm radius, the 3 nearest neighbors, or adjacent (i.e., touching) label objects. Then, a (potentially weighted) aggregation function is applied to the features of objects within each neighborhood group (Figure 25 a).
Many biologically relevant population features can be formulated this way. For instance, a common measure of local cell density is simply the count of neighboring cells within a fixed radius (e.g., 50 μm). This corresponds to grouping by a 50 μm radius neighborhood followed by a count aggregation (Figure 25 c). Alternatively, another widely used density proxy is the average distance to the k-nearest neighbors—computed by grouping via a KNN neighborhood and applying a mean aggregation on the pairwise distances.
To support efficient and reusable aggregation across different neighborhoods, the NeighborhoodQueryObject class provides a unified interface for constructing neighborhood adjacency matrices and applying aggregations. There are two ways to create a NeighborhoodQueryObject:
NeighborhoodQueryObject.from_dataframe takes a DataFrame containing object centroids and supports the neighborhoods derivable from point clouds (radius, knn, delaunay). NeighborhoodQueryObject.from_labelimage accepts a label image, enabling construction of the touch adjacency matrix required for the touch neighborhood, in addition to those based on centroids.

Once an instance is initialized, neighborhoods can be added via the .radius(), .knn(), .delaunay(), or .touch() methods, specifying length scales and a self_loops flag (to control whether an object includes itself in its own neighborhood).
Finally, the .aggregate() or .weighted_aggregate() methods take in a feature DataFrame along with aggregation functions to compute neighborhood-level features. This design enables concise expression of complex multi-scale aggregations. For example, Figure 27 depicts the median nuclear volume aggregated over KNN neighborhoods with k ranging from 0 to 100. The code required to generate that plot is presented in Listing 4.
Listing 4: Aggregating the median nuclear volume over KNN neighborhoods of increasing size using abbott.NeighborhoodQueryObject (imports omitted).
# load feature table (including centroid positions)
df_nuc = pl.read_parquet('features.parquet')
# select the features you want to aggregate
df_feat = df_nuc.select(pl.col('PhysicalSize'))
# create a NeighborhoodQueryObject
nqo = NeighborhoodQueryObject.from_dataframe(
    df_nuc, centroid_columns=('Centroid-z', 'Centroid-y', 'Centroid-x')
)
# specify neighborhood sizes and aggregate
ks = (0, 1, 2, 3, 5, 6, 8, 10, 13, 16, 20, 25, 30, 35, 40, 50, 60, 70, 80, 100)
df_feat_agg = nqo.knn(k=ks, self_loops=True).aggregate(
    agg_func.Median, df_feat
)

The geospatial analysis community has extensively explored how to define neighborhoods (Anselin n.d.), and developed libraries for computing spatial statistics using what they call spatial weights matrices (Rey and Anselin 2007). These matrices are functionally equivalent to the (weighted) adjacency matrices used to represent neighborhoods in abbott.
The abbott.features.neighborhoods module focuses on quick and flexible construction of multiple neighborhoods followed by feature aggregation. Computing spatial statistics is explicitly a non-goal of this module. For such analyses, users are encouraged to use the libpysal library.
Still, the NeighborhoodQueryObject can serve as a convenient frontend for constructing neighborhoods that are then passed to libpysal. Specifically, each adjacency matrix accessible via an iterator in NeighborhoodQueryObject.neighborhoods is a scipy.sparse.csr_array. These can be converted into libpysal.weights.W objects using the W.from_sparse() constructor, as demonstrated in Listing 5.
from libpysal.weights import W
for sparse_array in nqo.radius([30, 40, 100], self_loops=False).neighborhoods:
    weights = W.from_sparse(sparse_array)
    ...  # compute spatial statistics

The final class of features relevant to our analysis is what we refer to as embeddings. Embeddings are low-dimensional representations of a higher-dimensional feature space. Broadly, they can serve two goals: (1) retaining as much structure and variance from the original feature space as possible—compression—or (2) emphasizing specific aspects of the data—modeling. The former is typically referred to as dimensionality reduction and includes techniques like PCA, t-SNE, or UMAP. The latter may include supervised or unsupervised models that highlight variation relevant to a particular question or hypothesis.
As introduced in Section 1.3.6, machine learning techniques can be broadly grouped by task type:
For our purposes, we refer to all of these as embeddings—because in practical terms, they transform an input feature set into one or more output features. In other words, they share the same functional form: feature(s) in → feature(s) out. This abstraction lets us treat them uniformly in data processing pipelines.
From this perspective, a classifier or clustering algorithm that stratifies objects by cell type can be viewed as a one-dimensional categorical embedding. Similarly, the first three principal components obtained through PCA constitute a three-dimensional continuous embedding. The cell cycle phase estimate discussed in Section 3.3.3 serves as an example of a one-dimensional, continuous circular embedding.
Despite their differences in algorithm and objective, all embedding methods follow the same core workflow:
This consistency in structure allows us to treat embeddings as a distinct and uniform class of features—computed purely from tabular inputs and an associated machine learning model.
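To illustrate the uniform "features in, features out" view, the following sketch attaches both a continuous embedding (PCA) and a categorical one (classifier predictions) to the same feature table as new columns; the feature and column names are hypothetical.

import numpy as np
import polars as pl
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# hypothetical wide feature table: one row per object
df = pl.DataFrame({f'f{i}': np.random.rand(200) for i in range(10)})
X = df.to_numpy()

# continuous embedding: first three principal components
pcs = PCA(n_components=3).fit_transform(X)

# categorical embedding: predictions of a classifier trained on hypothetical annotations
y = np.random.randint(0, 2, size=len(df))
cell_type = RandomForestClassifier(n_estimators=50).fit(X, y).predict(X)

# both are simply new feature columns on the same objects
df = df.with_columns(
    pl.Series('pc1', pcs[:, 0]),
    pl.Series('pc2', pcs[:, 1]),
    pl.Series('pc3', pcs[:, 2]),
    pl.Series('cell_type', cell_type),
)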
In Section 3.2.4, we explored how the coordinates of the data index can be used to access and iterate over dataset components, particularly from the perspective of I/O operations. The same logic can be applied to processing steps when planning feature queries. Based on available resources and the function signatures, we can estimate how many features are computable.
As a simplification, consider a single ROI with the following resources: roi: (o:5, c:16, lbl.o.to:1, m:3, f:300), where o refers to objects, c to channels, lbl.o.to to target objects for distance measurements (e.g., one embryo), m to the different length scales used for colocalization features (as in Figure 15 c), and f to a selected set of 300 features to be further aggregated across neighborhoods or hierarchies.
Based on these parameters, we can estimate a typical number of features that could be computed (Table 20). The feature count rises sharply when computed over combinations of components—for example, colocalization features, which can be measured between all unordered pairs of channels, can easily reach into the thousands. Population-level features further multiply the feature space as each base feature can be propagated across the object hierarchy or aggregated within neighborhoods.
| Feature Type | Computable | Example (c:16, o:5, lbl.o.to:1, m:3) |
|---|---|---|
| Label | \(n_{o} \times n_{f:label}\) | \(5 \times 16 = 80\) |
| Intensity | \(n_{o} \times n_{c} \times n_{f:intensity}\) | \(5 \times 16 \times 18 = 1440\) |
| Colocalization | \(n_{o} \times \frac{n_{c} (n_{c}-1)}{2} \times n_{m} \times n_{f:coloc}\) | \(5 \times \frac{16 \times 15}{2} \times 3 \times 3 = 5400\) |
| Distance | \(n_{o} \times n_{lbl.o.to} \times n_{dist\_trans} \times n_{f:distance}\) | \(5 \times 1 \times 2 \times 5 = 50\) |
| Hierarchy | \(n_{f} \times n_{1:1} + n_{f} \times n_{m:1} \times n_{agg\_func}\) | \(300 \times 3 + 300 \times 2 \times 5 = 3900\) |
| Neighborhood | \(n_{f} \times n_{nhood} \times n_{agg\_func}\) | \(300 \times 50 \times 5 = 75000\) |
| Embeddings | \(\infty\) | \(\infty\) |
All of these functions together define a vast42 feature space of potential measurements. This means we do not aim to densely compute the entire feature space, but rather select specific regions of it that are relevant to our analysis. The full feature space—spanned by the product of available resources and the functions applicable to them—represents everything we could compute in principle. In practice, analysis involves selecting subsets of this space, each defined by a feature query representing a dense slice that can be stored as a table. During the early stages of pipeline development and exploratory data analysis (EDA), we might start with a broad set of default feature queries, then refine our selection based on which features prove informative. Later, in a production setting, queries may be narrowed to the minimal necessary subset—such as features used for cell cycle embeddings, intensity bias models, diagnostic reports, or figure reproduction for publications.
We want to access feature tables like any other resource. This would allow us to treat feature embeddings not as mysterious, hard-to-reproduce values, but as deterministic outputs of processing steps within a larger analysis pipeline.
I understand that the goal of fully reproducible queries across both image data and derived feature tables is ambitious. Everything I present in terms of implementation represents, at best, a step toward this vision. There are still many loose ends, and much work remains to be done.
The categorization of features we have seen so far is based purely on the resources required to compute them. We defined table schemas for the raster features according to the rule that feature names (f) are stored as table columns, while all other resources (roi, o, lbl; c; c.0, c.1; o.to, label.to, dtf) form components of the primary keys of their respective feature tables.43 These are also the IDs and columns over which filter predicates are typically applied. I arrived at these schemas because I consider them tidy in terms of feature queries—which differs from the tidy format used in much of the downstream analysis, where we mostly reason about label objects and therefore prefer a different structure.
Many of my thoughts on what constitutes an “appropriate” table format were shaped by some interesting44 discussions we had at the BioVision OME-NGFF hackathon. The main point of contention was how tall or wide the resulting feature tables should be. See Note 1 for examples of the available options. My main takeaway from those discussions was that people have very strong opinions on the topic—likely due to differences in where along the feature extraction pipeline they spend most of their time. And since everyone there could formulate compelling arguments for their preferred format, we can not expect there to be a simple solution that satisfies everyone.
The feature table schemas I present here are geared toward flexibility in scheduling feature queries applied to raster data, which I contrast with convenience for machine learning and statistical modeling. This does not mean I do not value the latter: having done a fair share of ML with tabular data, I know that such workflows require a wide table where every row corresponds to an object (e.g., a cell). The solution is to make it easy to go from one format to the other—from the query-friendly tall format to the analysis-friendly wide format.45
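A sketch of that conversion, assuming a recent polars version and a tall intensity feature table keyed by object label and channel; the exact column naming of the wide output depends on the polars version.

import polars as pl

# hypothetical tall intensity feature table: one row per (object, channel)
tall = pl.DataFrame({
    'label': [1, 1, 2, 2],
    'c': ['DAPI', 'PCNA', 'DAPI', 'PCNA'],
    'Mean': [100.0, 40.0, 90.0, 55.0],
    'Sum': [5000.0, 2000.0, 4500.0, 2750.0],
})

# pivot into the analysis-friendly wide format: one row per object,
# with the channel ("resource information") moving into the column names
wide = tall.pivot(on='c', index='label', values=['Mean', 'Sum'])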
Each feature extraction function has a signature that specifies the coordinate slice of resources required to compute it, along with additional parameters such as the set of features to compute or the distance transform to use. Together, these contain the necessary information to load all resources and apply the function. Furthermore, they specify an output table schema that uniquely identifies each feature across the whole dataset (Table 21).
By keeping the table schemas tall, we do not need to add “resource information”, such as the channel name, to the feature name. Since the measured channels change in every experiment, encoding them in feature names would prevent us from defining table schemas that are stable across experiments! The aggregation functions and features, on the other hand, change much more rarely—only when a new feature is implemented, at which point an existing table schema has to be updated or a new schema added.
Raster feature queries are constructed by specifying the resources they require, as defined by their function signature. Since the label feature function only requires a label image and a set of features, we can construct a query by providing a selection of label images (roi, m, o; Table 16) and a set of label features (f:label). To concisely construct queries, we use a list for each component. These are expanded into individual queries via the Cartesian product (excluding f:label):
label_queries = {
    # selection of label images
    'roi': ['site0'],
    'm': [0, 1, 2],
    'o': ['nucleiRaw3', 'cells'],
    # additional parameters
    'f:label': ['Centroid', 'Roundness', 'FeretDiameter'],
}

This would result in a total of 1 × 2 × 3 = 6 individual applications of the label feature function. Depending on how the work is scheduled, the 6 tables might be aggregated in a single process and written as one table:
table_name = 'roi=site0/f_label.parquet'
column_names = ['o', 'm', 'label', 'Centroid', 'Roundness', 'FeretDiameter']

Alternatively, they can be distributed to up to six workers, each writing a table:
table_name = 'roi=site0/o=nucleiRaw3/m=0/f_label.parquet'
column_names = ['label', 'Centroid', 'Roundness', 'FeretDiameter']

All other raster queries follow the same logic, consistent with the table schemas presented in Table 21.
Additional processing steps—such as intensity bias correction for features that depend on channel intensity—can be integrated similarly (not currently implemented). In that case, intensity feature tables would include extra primary key columns like z_model or t_model to indicate which correction model was applied.
Legacy tables that don’t have those columns can still be concatenated. Missing processing steps will simply appear as null, which makes the tables compatible without requiring migration.
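A sketch of this behavior with polars, where a diagonal concatenation fills columns missing from the legacy table with null; the table contents are hypothetical.

import polars as pl

# hypothetical legacy table (no correction column) and a newer, corrected table
legacy = pl.DataFrame({'label': [1, 2], 'Mean': [100.0, 90.0]})
corrected = pl.DataFrame({'label': [1, 2], 'Mean': [110.0, 95.0],
                          'z_model': ['Exp-2D', 'Exp-2D']})

# diagonal concatenation fills missing columns with null:
# legacy rows get z_model == null, i.e. no z-intensity bias correction applied
combined = pl.concat([legacy, corrected], how='diagonal')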
This same logic applies to broadcasting functions across new coordinates. For instance, running the same function on a time series can be parallelized using the same strategy—just by slicing over time and scheduling each slice independently, while including the time index in the table name.
Specifying these feature queries consistently could be highly effective—the same query could be reused to load a corresponding set of features in a downstream machine learning application.
I propose structuring feature extraction code according to the required resources and function signatures. User-defined feature functions must declare a function signature consisting of simple types or established dataset components and must return a table with a specified schema. How this function is applied to the dataset is the responsibility of a scheduler, which can adapt the execution strategy to the available computational resources.
In addition, developers should be encouraged to provide metadata describing the function’s broadcasting behavior with respect to dimensions not explicitly accessed. At a minimum, the function should be annotated with type hints that specify its input expectations. Developers may also indicate the degree to which their function is flexible in its application.
For example, consider a function designed to process 2D label images. If the function yields correct results when applied to a cropped label image46—because it does not depend on local coordinate computations (such as centroid positions) or because it properly handles metadata (e.g., when using an image data model that maintains global coordinates like spatial-image or ITK)—then this flexibility should be explicitly communicated. Doing so enables safe use of the function in out-of-memory processing of large image volumes.
Moreover, the specification should include which coordinate axes support safe broadcasting. For instance, broadcasting along z might produce invalid results, whereas broadcasting across t and roi should be permissible. This could look something like this:
user_defined_function.py
def my_feature_func(
    lbl_img: LabelImage[D2],
    param1: Literal['foo', 'bar'],
    param2: int,
    features: list[str],
) -> DataFrame:
    ...

my_feature_func_meta = {
    'apply_to_crops': True,
    'broadcast': ('t', 'roi'),
    # maybe: annotate functions that require image meta
    'required_image_meta': ('scale', ),
}

To maintain compatibility with code where source access is limited—and to keep type hints optional—all processing metadata can also be specified using plain dictionaries or JSON. In cases where both type hints and metadata dictionaries are present, the dictionaries take precedence.
outside_library.py
def their_feature_func(lbl_img, param1, param2):
    ...

integrating_outside_library.py

their_feature_func_meta = {
    # missing: a JSON specification for the Python type hints used here
    'input': (LabelImage[D2], Literal['foo', 'bar'], int),
    'output': (DataFrame, ),
    'apply_to_crops': True,
    'broadcast': ('t', 'roi'),
    'required_image_meta': ('scale', ),
}

The scheduler itself defaults to the most restrictive function application:
scheduler_defaults.py
DEFAULT_FUNC_META = {
    'apply_to_crops': False,
    'broadcast': (),
    'required_image_meta': ('scale', 'origin', 'shape'),
}

Another example: a user writes a function that processes a single mask. Instead of worrying about how to efficiently traverse a label image, they simply register a function with the correct signature that returns a single record of results. The scheduler handles the rest. Iterating over the label image is implemented once by someone who knows what they’re doing, rather than being re-implemented with every new function. The same applies to aggregating the resulting records and writing feature tables.
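A minimal sketch of such a user function and of the per-label iteration the scheduler would take over; the function name, feature names, and the registration mechanism are hypothetical.

import numpy as np

def my_mask_feature(mask: np.ndarray, spacing: tuple[float, ...]) -> dict:
    """Hypothetical user function: one binary mask in, one record of scalar features out."""
    voxel_volume = float(np.prod(spacing))
    coords = np.argwhere(mask)
    extent = coords.max(axis=0) - coords.min(axis=0) + 1  # bounding box size in voxels
    return {
        'Volume': float(mask.sum()) * voxel_volume,
        'BoundingBoxVolume': float(np.prod(extent * np.asarray(spacing))),
    }

def apply_per_mask(label_image: np.ndarray, spacing: tuple[float, ...]) -> list[dict]:
    """What the scheduler would do: iterate over labels, call the user function per mask,
    and collect the records for the feature table."""
    records = []
    for lbl in np.unique(label_image[label_image > 0]):
        records.append({'label': int(lbl), **my_mask_feature(label_image == lbl, spacing)})
    return records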
We introduced potential sources of intensity bias in Section 1.2.2 and summarized them in Table 3.
Flat-field (\(S^{\text{flat}}(c_{\text{wv}\_\text{len}}, x, y)\)) and dark-field (\(D^{\text{dark}}(x, y)\)) corrections were measured prospectively—that is, from additional reference images—using a plate with fluorescent dyes for flat-field and with the camera shutter closed for dark-field (Peng et al. 2017). These corrections were applied during conversion to the abbott-h5 format and are therefore irreversible.
There are three additional bias sources that can be addressed similarly: the intensity decay observed during 8h of imaging (\(S^{t}(c, t_{acq})\)), the intensity decay along the z-axis (\(S^{z}(c, z)\)) and the accumulated background during multiplexing and due to pinhole cross-talk (\(D^{bg}(c, x, y, z)\)). The first two were modeled as multiplicative terms and the background as an additive term. The bias decomposition proposed by Peng et al. (2017) (Equation 1) can be extended to include the new terms (Equation 4; coordinates of the bias terms omitted).
\[ I^{meas}(x) = I^{true}(x) \times (S^{flat} + S^t + S^z) + D^{dark} + D^{bg} \tag{4}\]
An interesting side note is that intensity bias manifests along all physical coordinate axes (Figure 28). Even in fixed samples, the time dimension becomes relevant, as we observed notable signal decay for several stains in the imaging buffer. To model this time-dependent decay, we had to recover acquisition timestamps from archived microscopy metadata files. We should therefore make it a priority to preserve as much available metadata as possible—even if it appears unimportant at first.
Measurement correction can be applied either to the extracted features or at the image level, analogous to flat-field and dark-field corrections (Peng et al. 2017). Correcting bias in the images is advantageous, as it allows parallel correction across all intensity-based features and the use of bias-corrected images for visualizations (e.g., Figure 31). Additionally, downstream tasks such as segmentation or deep learning-based feature extraction can benefit from improved input data quality, avoiding the need for heuristic intensity corrections like histogram matching.
Whenever possible, intensity bias should be estimated in physical coordinates. This enables a single bias model to be applied across all image pyramid levels. For time-dependent decay in the imaging buffer, using real timestamps is both more precise and robust. If acquisition errors occur at the microscope and wells need to be re-imaged47, timestamp-based models remain valid, unlike those relying on acquisition order and assuming a constant rate of acquisition. Moreover, this approach allows direct comparison of decay rates between different imaging buffer protocols.
The strongest source of intensity bias was the signal decay along the axial direction (Figure 29), possibly due to inhomogeneous scattering in the tissue and spherical aberration caused by the refractive index mismatch between the mounting medium and the water used as objective immersion medium.
We did not know the best approach for z-intensity bias correction a priori, so writing the results of each correction attempt to disk would have been prohibitively expensive due to data duplication for every model tested. Since the operations involved in bias correction are relatively inexpensive, we opted to apply them on the fly. Furthermore, by estimating the bias models in physical coordinates, we could apply them across all multiscale levels of an image pyramid.
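As a schematic sketch of such an on-the-fly correction—using a simple 1D exponential decay in physical z coordinates rather than the fitted 2D model described below; the decay rate and plane spacing are illustrative:

import numpy as np

# hypothetical image chunk and the physical z positions (µm) of its planes
image = np.random.rand(20, 256, 256) * 1000
z_um = np.arange(20) * 2.0
decay_rate = 0.01  # illustrative fitted decay rate per µm

# multiplicative bias model S^z(z) estimated in physical coordinates,
# so the same model can be applied to every pyramid level
s_z = np.exp(-decay_rate * z_um)

# on-the-fly correction: divide each plane by its bias factor
corrected = image / s_z[:, None, None]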
This enabled us to explore various models and parameters, including a correction scheme in which we partitioned the light path into medium and sample paths, estimating individual decay rates for each (Figure 30 a, b, 2D model). The resulting corrected intensity measurements not only flattened the mean intensity over the observed z-range but also maintained a more homogeneous variance around the mean (Figure 30 c; Figure 31) and were therefore used in downstream analysis.
To assess the quality of our intensity bias correction, we analyzed sum nuclear DAPI intensity, where we had clear prior expectations of the results based on DNA content. In developmentally time-matched samples, the distributions showed two peaks separated by a factor of two, corresponding to nuclei with one or two DNA copies (Figure 32). The 2D correction models more clearly resolved these peaks and exhibited reduced variability around the expected values compared to both the 1D model and uncorrected data, supporting the superiority of this correction approach.
A major challenge in working with volumetric image data is the difficulty of visualization. Because screens are 2D, only slices or projections can be displayed. Even 3D stereo renderings suffer from occlusion, preventing full visibility of the volume. Simple maximum projections, especially in the presence of fluorescent artifacts (Figure 16 (a)), often fail to provide a representative summary of the data.
Due to this limitation, obtaining a high-level overview of collected images is more challenging than in 2D tissue culture settings, where full 96-well plates can be easily visualized by arranging maximum projection image data in a viewer, as done in TissueMAPS (Herrmann and Hafen 2016). Having these image-based views without the need for extensive image processing is especially valuable during protocol development, where experiments can often be analyzed through visual inspection and by comparing experimental conditions side by side.
Later in the analysis, when working with tabular results from feature extraction, the challenge arises that the cells of interest, such as outliers in a scatter plot, are often scattered throughout the dataset. This makes it laborious to trace these objects back to their corresponding image volumes. Therefore, developing methods for easy access to the image data at each stage of the analysis is crucial, as it can accelerate experimental progress both in the wet-lab and during the development of analysis pipelines.
One of the most important visualization tools for bioimage analysis in Python is the napari viewer, which allows visualization of images, label images, and other annotations such as points or vectors, all in the same volume (Sofroniew et al. 2025). While multiple images can be arranged next to each other by loading them into separate layers and applying a translation to each layer individually, this approach does not scale well to tens or hundreds of sites. There are workarounds—for example, multiple layers can be linked so that their layer controls are coordinated—and there have been community efforts to introduce layer groups. But those efforts are still experimental, and it is unlikely that they will scale to the tens of thousands of single objects that we would like to visualize independently.
To remedy this, we wrote a small visualization tool to arrange raster volumes on a large canvas that can then be visualized in the viewer as a single layer per channel. This is done by loading the label objects of interest, optionally applying masks to each channel, optionally maximum projecting the objects, and finally arranging them on a grid sorted by a feature of interest.
Since we use a chunked file format (see Section 3.2.1), the label objects table containing the physical bounding boxes provides single-object access across the entire dataset (Table 17). However, the efficiency of this access depends on the chunking of the H5 files and the specific object selection—whether it is a sparse or dense selection of cells from a volume influences how many chunks need to be read. In many cases, especially when accessing multiple objects, it may be faster to load the full array and crop the required regions rather than loading individual array slices from the H5 file. This is particularly true for images with on-the-fly 2D z-intensity bias correction (see Section 3.2.7.1). Applying this correction requires computing the embryo and medium paths from the embryo mask, which necessitates loading the full label image even for small region requests.48
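A sketch of such single-object access, converting a physical bounding box from the label objects table into voxel indices and reading only that region from an HDF5 dataset; the file name, dataset path, and bounding-box layout are hypothetical.

import h5py
import numpy as np

# hypothetical physical bounding box (µm) and voxel spacing (z, y, x) of the requested level
bbox_min = np.array([10.0, 50.0, 80.0])
bbox_max = np.array([30.0, 120.0, 150.0])
scale = np.array([1.0, 0.2, 0.2])

lo = np.floor(bbox_min / scale).astype(int)
hi = np.ceil(bbox_max / scale).astype(int)

with h5py.File('site0.h5', 'r') as f:  # hypothetical file and dataset path
    dset = f['ch_DAPI/level0']
    crop = dset[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]  # only the touched chunks are read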
Virtually all visualizations presenting raster data of multiple objects in this thesis use a version of this tool. Examples include arranging embryos by cell count (Figure 14), displaying individual nuclei with different intensity bias corrections applied to them (Figure 31), visualizing the manual annotations used for scoring the cell cycle phase models (Figure 58), or arranging nuclei based on their inferred cell cycle phase (Figure 41). Beyond its role in generating scientific visualizations and facilitating the communication of results, the tool also proves valuable during exploratory data analysis (EDA), enabling rapid validation of analyses through direct inspection of the underlying image data.
Another advantage of single-object access is its applicability to deep learning methods that operate directly on image data (Doron et al. 2023). At present, training such models often requires developers to write extensive boilerplate code to extract individual cells from datasets and to apply the necessary padding and masking for model input. The same functionality used for single-cell visualizations could be repurposed to generate the cropped49 image inputs required for model training, streamlining the workflow significantly.
This may also serve as a valuable opportunity to encourage developers of such models—who often draw inspiration from, or have backgrounds in, computer vision techniques developed for natural images (see Section 1.3.1)—to adopt geometric image representations that are more appropriate for spatial image analysis (see Section 1.3.3.1.1). This might in turn allow them to build domain-specific models that generalize even better. For example, the diameter parameter of Cellpose is defined in terms of image pixels, and the model is ultimately oblivious to the spatial scale at which it is operating (Stringer et al. 2021). I would hypothesize that adding a metadata path50 with the pixel spacings to the model might allow it to make more efficient use of the training data by picking up on “scale-specific” patterns in cellular shape (i.e., bacteria and small organisms tend to have less complicated shapes). A model that makes use of such information might be able to handle the various sampling scales encountered in volumetric microscopy data more gracefully.
A tool Joel Lüthi and I built is a napari plug-in for annotation and interactive classification of label objects. The tool, called napari feature classifier, allows loading multiple channels of a site together with a label image containing the objects to be classified (Luethi and Hess, n.d.). After defining the classes, the user can interactively select samples for each class. The plug-in allows training a random forest classifier (sklearn.ensemble.RandomForestClassifier) and offers immediate visual feedback on the predicted classes for objects within the field of view (Figure 33). This feedback can be used to selectively add training examples in areas where classification performance is subpar.
The tool allows training on multiple sites and applying the trained classifier to a larger dataset. Alternatively, annotations can be exported and used for “manual” retraining of classifiers. In a recent refactoring, we split these two functions into an annotation component and a classifier component for a better separation of concerns.
In my personal use, I primarily treat the tool as a label image annotator that provides user feedback on annotation numbers and coarse feedback on classifier performance. I then prefer to retrain classifiers outside the tool allowing me to: (1) retrain on feature sets different from—and potentially computed after—those used in the original classifier, (2) swap the specific model used for classification or tune its hyperparameters, (3) aggregate annotations from more than one classifier (e.g., the debris classifier includes pooled debris annotations from cell cycle and cell type classifiers), and (4) use the annotations for tasks beyond classification, as with the cell cycle phase annotations generated with the cell cycle phase models in mind.
Future development efforts for the project could focus not only on maintaining compatibility with an evolving library environment, but also on integrating some of these features directly into the plug-in. Notably, adding support for retraining on different features or aggregating annotations from multiple classifiers could be valuable. Fully specifying the annotation table schema, while ensuring compatibility with the label objects table specification (Table 17), would enable easier computational interfacing and enhance the tool’s flexibility for use across multiple datasets.
One limitation I encountered while using the tool over an extended period of time is that while classifier models can be stored, the resulting binary files are tied to a specific version of scikit-learn. This often leads to warnings or exceptions when attempting to load them with an updated version in a new environment. As a result, storing the models effectively pins the scikit-learn version to the one used during the initial training. However, since the Random Forest models we use are computationally inexpensive to train—it can be done in an interactive tool without affecting responsiveness, even on a single thread—there is little benefit in storing the trained models. Instead, a version-independent storage format based solely on the annotations, classifier model, and its hyperparameters seems both practical and preferable.
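A sketch of such version-independent storage: only the annotations and the model’s hyperparameters are persisted, and the classifier is retrained on load; the file names and column names are hypothetical.

import json
import polars as pl
from sklearn.ensemble import RandomForestClassifier

# store: annotations live in a table, the model choice and hyperparameters in JSON
spec = {'model': 'RandomForestClassifier',
        'hyperparams': {'n_estimators': 200, 'class_weight': 'balanced'}}
with open('classifier_spec.json', 'w') as fh:
    json.dump(spec, fh)

# load: read annotations and features, then retrain on the fly (cheap for random forests)
spec = json.load(open('classifier_spec.json'))
annotations = pl.read_parquet('annotations.parquet')  # label ID components + 'class' column
features = pl.read_parquet('features.parquet')
train = annotations.join(features, on='label')

clf = RandomForestClassifier(**spec['hyperparams'])
clf.fit(train.drop('label', 'class').to_numpy(), train['class'].to_numpy())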
While chapter two focused on the early image processing steps and general tooling for high-content multiplexed volumetric microscopy data, the goal of chapter three is to demonstrate how these tools were used in the analysis of our multiplexed time course dataset of early zebrafish development. We will use the features generated with the abbott library to (1) stage embryos based on nuclei count, (2) apply the classifier tool for segmentation clean-up and cell type classification, (3) develop a cell cycle phase (CCP) model that yields a continuous estimate of CCP for single cells, (4) refine our embryo staging by incorporating CCP, (5) quantify the global and local reduction in CCP synchrony that occurs during the mid-blastula transition (MBT), (6) compute the gradient of CCP to reconstruct mitotic waves, (7) measure the onset of ZGA by quantifying polymerase II serine 2 phosphorylation (elongating polymerase), and finally (8) apply Bayesian regression modeling to see to what extent the observed transcriptional heterogeneity can be explained using both CCP and other ZGA factors measured in the same cells, enabled by our multiplexing protocol.
Recall from Section 3.1.4 that we acquired a multiplexed pseudo time course between ~2.5–4 hours post-fertilization (Figure 1). Up to the 1024-cell stage, cell divisions remain relatively synchronous, enabling precise embryonic staging based on cell count (Kimmel et al. 1995). By imaging the entire animal cap and segmenting all the nuclei, we could stage the embryos simply by counting the number of nuclei. Since early divisions occur at regular 15-minute intervals, taking the log2 of the nuclei count corresponds roughly to developmental time (Figure 34 a).
This ability to perform staging computationally is important—it allowed us to mix different time points within each well of a 96-well plate. This is the preferred experimental design, as it helps guard against confounding technical and biological variability. Up to the 1k-cell stage, nuclei counts exhibited clustering around powers of two, reflecting the synchronous nature of early embryonic cell cycles (Figure 34 a). This synchrony makes it unlikely to capture an embryo in which only a subset of cells has undergone division.
When analyzing aggregate features of these embryos (see Section 3.2.6.4), a linear decay is observed on a log-log scale for mean cellular volume, indicating that cellular volume halves with each doubling of nuclei count (Figure 34 b). This pattern aligns with expectations for a system with minimal cellular growth, such as a cleavage-stage or early blastula-stage zebrafish embryo. In this context, each cell division produces two daughter cells of approximately half the original volume, which then immediately proceed to divide again.
Mean nuclear volume remains roughly constant during the same time period but exhibits more variability—especially at earlier time points, up to \(2^{11}\) nuclei (Figure 34 b). This is consistent with higher synchrony in early stages, where embryos may have a large fraction of cells undergoing mitosis simultaneously, resulting in a lower average nuclear volume. The fact that these embryos tend to lie between powers of two further supports the idea that the observed variability is biological in origin and caused by cell cycle dynamics.
We used the classifier tool introduced in the previous chapter to annotate false positive segmentations caused by fluorescent debris in the DAPI channel (Figure 16 (a)). Additionally, we annotated the three cell types emerging during early zebrafish development: enveloping layer cells (EVL), deep cells, and nuclei of the yolk syncytial layer (YSL). Finally, we annotated different cell cycle phase classes (M anaphase, M telophase, early S phase, S phase, M prophase, and M metaphase) used in the CCP analysis in Section 3.3.3.
The annotations were stored in separate tables and used to train two classifiers—one to remove mis-segmented debris, and another to classify cell types. Confusion matrices for both classifiers on the test data are presented in Figure 35. We trained a single classifier for cell type classification across all time points by including all of them in the training set. This model was then applied to the entire dataset, allowing us to stratify the data based on the predicted cell type annotations. A random sample of embryos—two from each division cycle (7–12)—is illustrated in Figure 36 and appears to indicate reasonable classifier performance across all time points.
Predictions from the debris classifier were used for outlier removal, while the cell type predictions were used to stratify the cells during the hyperparameter search of the cell cycle phase model, specifically to exclude YSL cells.
The cell cycle is a fundamental biological process that drives cellular proliferation and is conserved across all domains of life. In multicellular organisms, it must be tightly regulated to maintain tissue homeostasis by ensuring the correct number of cells in each tissue. Many developmental and cellular processes rely on controlling the progression of specific cell types through the cell cycle to form complex structures. As a result, variation in cell cycle state represents a major source of heterogeneity in many experimental systems. Accurately measuring this variation is often valuable—even when studying unrelated biological processes.
A common approach to control for cell cycle effects in microscopy data is to train a classifier to identify discrete cell cycle stages and restrict downstream analyses to a subset of matched cells (typically interphase).
However, rather than framing the cell cycle as a classification problem, we can apply computational methods that infer dynamics from fixed data (see Section 1.3.7) to estimate cell cycle phase (CCP) as a continuous variable. While such approaches have been successfully applied to single-cell RNA sequencing data (Schwabe et al. 2020) and to tissue culture cells in image-based systems biology (Gut et al. 2015), to our knowledge, they have not yet been used on volumetric image data in the context of embryonic development. A continuous CCP measure allows for estimating feature dynamics across the full cycle and enables more nuanced analyses by supporting fine-grained partitioning of cells by phase.
In early zebrafish development, our interest in cell cycle phase is twofold. First, the cell cycle itself is a fascinating process—especially as we capture time points surrounding the mid-blastula transition (MBT), when synchrony across cells is lost, typically around the 10th division cycle. By quantifying CCP at the single-cell level, we gain a fresh perspective on this well-characterized event and contribute new quantitative insights into cell cycle synchrony.
Second, in the context of zygotic genome activation (ZGA), cell cycle phase is known to introduce variability in transcript abundance across individual cells (Stapel, Zechner, and Vastenhouw 2017). Incorporating CCP into models of transcriptional activity is therefore essential, as differences in progression through the cell cycle likely explain a substantial fraction of the observed transcriptional heterogeneity (Figure 2).
Inspired by earlier work in our laboratory to reconstruct a continuous cell cycle trajectory from image data of fixed samples (Gut et al. 2015), we developed a machine learning-based approach to estimate the cell cycle phase as a continuous, circular feature at the single-cell level. The method is based on the assumption that cells at similar stages of the cell cycle share comparable morphological and molecular characteristics, such as reduced DNA volume during mitosis or increased PCNA levels during S phase, indicating active DNA replication. Given that many of these features exhibit gradual changes throughout the cell cycle, we hypothesize that a random population of cells will inherently organize along a one-dimensional, circular manifold within the high-dimensional feature space defined by these cell cycle-related attributes. Identifying this trajectory would enable the quantification of a cell's position along the population-averaged cell cycle, thereby providing a robust framework for single-cell analysis of cell cycle dynamics.
The features utilized in constructing the cell cycle embedding can be influenced by biological processes orthogonal to cell cycle progression. For instance, nuclear area has been reported to vary in response to local cell crowding in tissue culture cells (Gut et al. 2015). Consequently, it is critical to distinguish between variations in the feature space caused by such confounding factors and those genuinely reflective of cell cycle phase. To mitigate this issue, Gut et al. (2015) used crowding-corrected nuclear area, rather than raw nuclear area, when defining the feature space.
In our dataset, we observed analogous confounding effects. Developmental progression led to a systematic reduction in nuclear volume. This phenomenon might also be attributed to increased cell crowding resulting from the exponential expansion in cell number during early embryonic division cycles (Figure 37 a). Additionally, we noted a decline in PCNA levels during S-phase in later divisions, potentially linked to the slowing of cell division associated with the MBT (Figure 37 b). To address these temporal shifts in feature distributions over developmental time, we partitioned the data by division cycle and trained distinct CCP embeddings for each time point. This strategy ensured that the inferred cell cycle trajectories are robust to developmental-stage-specific variations in cellular features.
Another challenge in unraveling CCP was that the speed at which cells progress through cell cycle feature space is variable. A cell undergoing mitosis may experience rapid changes in nuclear morphology, whereas an interphase cell typically exhibits slower changes relative to the feature space. An ideal embedding for inferring a trajectory corresponding to the cell cycle would be one in which the cells are uniformly distributed along an easily identifiable, low-dimensional manifold in the feature space, at which point one would only need to find a suitable regression model, or employ another method, to measure a cell's position along that structure.
To derive a one-dimensional, circular, and continuous cell cycle embedding from the original high-dimensional raw feature space we sought to establish an intermediate, lower-dimensional representation. This pre-embedding was intentionally designed to facilitate CCP inference. We can think of the CCP embedding as an “embedding inside this pre-embedding”. By explicitly considering this intermediate step, we gained a structured approach to selecting relevant features and determining appropriate preprocessing techniques—even when a fully established CCP model or an absolute ground truth was unavailable. This proved valuable when developing the method and for communication of the results. Furthermore, it could enable a more systematic way of troubleshooting when applying the model to new datasets.
The model was built from modular components of a scikit-learn pipeline (Table 24) and implemented in a separate repository, scikit-ccp. This approach enabled the integration of multiple preprocessing steps, such as feature scaling or dimensionality reduction, directly into a single machine learning model (see Section 5.6). Paired with a way of scoring against discrete manual annotations (Figure 58), this allowed us to perform a hyperparameter grid search over the full model, including its preprocessing steps (see Section 5.6.1).
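As a rough illustration of this modular design, the sketch below chains feature scaling and a 3D UMAP pre-embedding into a single scikit-learn Pipeline and wraps it in a grid search; the step names, parameter values, and the placeholder scorer are assumptions and do not reproduce the actual scikit-ccp components.

```python
import umap
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline bundling preprocessing and the pre-embedding into one estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pre_embed", umap.UMAP(n_components=3)),
    # In scikit-ccp, a final step would fit principal circles / infer CCP here.
])

param_grid = {
    "pre_embed__n_neighbors": [15, 50, 200],
    "pre_embed__min_dist": [0.0, 0.1],
}

# `ccp_annotation_score` stands in for the custom scoring against the discrete
# manual cell cycle annotations (Figure 58).
search = GridSearchCV(pipeline, param_grid, scoring=ccp_annotation_score, cv=3)
# search.fit(features)
```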
Having explored the hyperparameter space, we found a family of models that were indistinguishable in performance given the limited scoring approach available to us. This allowed us to pick one of the best-performing models based on other considerations. We chose a model with a 3D pre-embedding shared between all time points, mainly because a single pre-embedding is easier to visualize and a shared pre-embedding also allows a qualitative assessment of how the features change over developmental time. As hypothesized, in the 3-dimensional UMAP embedding of cell cycle features the cells organize along a low-dimensional manifold forming a closed loop (Figure 38). The manual annotations are ordered along that manifold, leading us to conclude that the structure corresponds to a circular trajectory of cell cycle progression.
The differences between cells that can be attributed to the cell cycle are mainly captured by the first two UMAP components, whereas the third component largely corresponds to the variability induced by features changing over developmental time. This becomes apparent when coloring the cells by the division cycles that were used to fit the individual principal circles: these order mostly along the third UMAP dimension (Figure 39).
The topology of the whole embedding matches observations from a similar analysis of single-cell RNA sequencing data from HeLa cells, in which Schwabe et al. (2020) show that cell cycle variability on the transcriptomic level can be mostly captured by a planar (i.e., two-dimensional) circular trajectory, which they construct by rotating three principal components they identify as containing the relevant cell cycle variability. They confirmed that two of those dynamic components indeed capture the cell cycle and argue that they are largely statistically independent of other axes of variability, including the third dynamic component in the data. This leads them to propose a view of circular motion along the cell cycle, with other epigenetic or environmental changes acting along an orthogonal direction, resulting in a helical motion of cells in transcriptome space.
Due to the limitation of needing to stratify the data into discrete classes, we cannot directly present a helical trajectory, but the aligned principal circles do approximate a cylinder, and it is easy to imagine how the true trajectories would progress on a helical path along that manifold. The motion around the cylinder (UMAP components 0 and 1) corresponds to cell cycle progression, and the orthogonal third component (UMAP component 2) corresponds to developmental progression.
Cell cycle phase is defined as the closest point of each cell on its respective principal circle, based on the embryo's assignment to a division cycle. As a remnant of originally using the angle around the pre-embedding in CCP models, and to highlight the circular nature of the feature—including the different statistics necessary when computing averages or differences—we use radians [0, 2π) to represent CCP.
Due to the non-uniform progression speed through feature space mentioned earlier, there are three distinct lobes corresponding to M phase post division (start of the cell cycle), S phase, and cells entering mitosis again (Figure 40 (a), density plot). To normalize the temporal progression, under the assumption that cells are sampled uniformly in time, CCP was quantile normalized using a sklearn.QuantileTransformer to arrive at a uniform distribution. The effect of this normalization on the dynamics we want to infer from our cell cycle model is illustrated in the side-by-side comparison (Figure 40).
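A minimal sketch of this normalization step is shown below, assuming the raw CCP values are angles in [0, 2π) stored in a NumPy array named `ccp_raw` (a hypothetical variable name).

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Map the non-uniform CCP distribution to a uniform one, then rescale back to [0, 2*pi).
qt = QuantileTransformer(output_distribution="uniform", n_quantiles=1000)
ccp_uniform = qt.fit_transform(ccp_raw.reshape(-1, 1)).ravel() * 2 * np.pi
```

Because the quantile transform is monotonic, the ordering of cells along the cycle is preserved; only the spacing changes, under the implicit assumption that the cycle start at 0 corresponds to a meaningful reference point (here, division).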
We can now look at all the features as they vary during cell cycle progression. We start with features that were used in the construction of CCP, and where we have a clear expectation regarding their dynamic behavior. Mean intensity of PCNA, an S-phase marker, should be proportional to nuclear PCNA concentration and is low during mitosis (start and end) and high during interphase. This is what we would expect in cycle 11 embryos close to the early cleavage cycles where interphase is dominated by S-phase (Figure 40 (b), first panel).
DNA content as measured by sum DAPI intensity is low right after chromosome segregation at the start of the cell cycle and gradually increases during S-phase up to roughly 2x the initial amount in cells entering mitosis again (Figure 40 (b), second panel, between the dotted lines indicating interphase). This serves as a reminder of the quantitative potential of our volumetric microscopy data given the appropriate care is put into experimental and computational bias correction.
Similar to DNA content, nuclear/DNA volume steadily increases during S-phase but starts to decrease during chromosome condensation as cells enter mitosis, where we see a relatively quick reduction in nuclear volume from slightly above 1500 μm³ down to around 1000 μm³, importantly while DNA content remains constant until the end of the 11th division cycle. Wrapping around the circle to the beginning of the cycle, we move from ~1000 μm³ to ~500 μm³ as the metaphase chromosomes are segregated into two sets of DNA, each with roughly half the original volume (Figure 40 (b), third panel).
As a last example, we show that the roundness of nuclei is low in mitotic cells, which quickly round up as cells enter S phase (Figure 40 (b), last panel). Taken together, these results increase our confidence both in our ability to use volumetric imaging in a quantitative setting and in our ability to infer dynamics from such data.
While I hope the dynamic behavior of features matching the expectations was a compelling argument in favor of the CCP model, one of the advantages of working with microscopy data is that there is always the opportunity to look at the actual images. Using the single-object visualization tools introduced in the last chapter (see Section 3.2.8), we can pick a random sample of nuclei from 10 embryos of division cycle 10 and sort them according to the inferred CCP (Figure 41).
Visualizations like this, provided they can be generated easily, are more than a frivolous endeavor. They go a long way toward building confidence in the results of the model. Furthermore, they are very helpful in the exploratory phase of the analysis when troubleshooting. For example, in an earlier iteration of the CCP model, segmentation outliers were not properly removed, causing them to cluster in one region of the CCP and producing abnormal feature dynamics. Although such issues are detectable using other visualization methods, such as plotting features against the CCP, confirming that the cells are genuine outliers and can be safely removed often requires multiple plots and additional checks to validate the assumptions. In contrast, a single glance at the image data provided a clear view, allowing us to quickly identify the cells as segmentation outliers.
Another concern was that, although the global structure of the CCP trajectory might appear reasonable, there could be local sorting of cells within the three density clusters in the original feature space that might not reflect cell cycle progression. Additionally, some parts of the embedding, only sparsely connected to the rest, might be flipped entirely. While the results do not reveal any striking errors upon visual inspection, some mitotic nuclei in the process of dividing were still segmented as a single object. Those were not placed appropriately between metaphase and anaphase but closer to prophase (Figure 41, white arrowheads), probably because they are rounder than metaphase nuclei, all else being equal (e.g., DNA content, volume, PCNA abundance, etc…). They were therefore simply more similar to prophase nuclei within the feature space used for CCP inference.
These types of issues are extremely difficult to detect from a scatter plot alone, yet they become immediately apparent when we apply our visual pattern recognition abilities. This is why I strongly advocate for building tooling that facilitates examining the actual images as much as possible when conducting image analysis. Far worse than encountering such issues is failing to detect them. Fortunately, since we do not intend on basing any arguments on mitotic cells being perfectly sorted according to their CCP, we can simply acknowledge the problem and proceed with the analysis.
The CCP feature is inherently circular: it spans values from 0 to 2π, with these endpoints representing the same cell cycle phase. This circularity must be accounted for in statistical analyses. For instance, computing the average of two cells at 2π - ε and ε (for ε < π/2) requires a circular mean, as the naive mean would always result in π, not accounting for the numerical discontinuity in the feature. Notably, the maximum possible difference between two CCP values is π, corresponding to cells that are half a cycle apart.
In Figure 42, we illustrate an example distribution of single-cell CCP measurements (orange). The circular mean (Θ, green) reflects the average CCP per embryo, while the mean resultant length—defined as the norm of the resultant vector (R = |ρ|, red)—quantifies the degree of synchrony in CCP across cells. Bootstrapping can be used to estimate confidence intervals for both Θ and R (Figure 42). These same circular-statistical principles also apply when computing quantities such as the cell cycle phase gradient later in the analysis.
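The circular statistics used here reduce to a few lines of NumPy; the sketch below computes the circular mean Θ and the mean resultant length R, plus a simple percentile bootstrap interval for R (the function and variable names are ours, not part of an existing library).

```python
import numpy as np

def circular_mean_and_R(ccp):
    """Circular mean (Theta) and mean resultant length (R) of angles in radians."""
    rho = np.exp(1j * np.asarray(ccp)).mean()        # resultant vector in the complex plane
    return np.angle(rho) % (2 * np.pi), np.abs(rho)  # Theta in [0, 2*pi), R in [0, 1]

def bootstrap_R_ci(ccp, n_boot=1000, ci=0.96, seed=0):
    """Percentile bootstrap confidence interval for the mean resultant length R."""
    rng = np.random.default_rng(seed)
    stats = [circular_mean_and_R(rng.choice(ccp, size=len(ccp)))[1] for _ in range(n_boot)]
    return np.quantile(stats, [(1 - ci) / 2, 1 - (1 - ci) / 2])
```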
Having gone through a more in-depth example using division cycle 11, we can move on to looking at the entire timespan covered in the dataset, ranging from the end of division cycle 7 to division cycle 12. Whereas quantile normalization enabled the transformation of a single cell cycle into a representation proportional to cell-averaged real-time progression, incorporating additional information allowed us to visualize the data across this extended temporal axis. While our method cannot measure true time progression, we can use measurements from live experiments—namely some of the first measurements of cell cycle length by Kimmel et al. (1995)—that conveniently cover the exact time points we need. We scaled each individual division cycle by the average cell cycle length in minutes to make our inferred dynamics comparable between time points.
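In practice this calibration amounts to offsetting each division cycle by the cumulative duration of the preceding cycles and scaling the normalized CCP by that cycle's length; the sketch below illustrates the arithmetic with placeholder cycle lengths rather than the values taken from Kimmel et al. (1995).

```python
import numpy as np

# Placeholder cell cycle lengths in minutes per division cycle (illustrative values only;
# the analysis used the measurements reported by Kimmel et al. 1995).
cycle_length_min = {7: 15, 8: 15, 9: 16, 10: 18, 11: 22, 12: 34}
cycle_offset_min = {c: sum(v for k, v in cycle_length_min.items() if k < c)
                    for c in cycle_length_min}

def pseudotime_minutes(division_cycle, ccp_uniform):
    """Map (division cycle, normalized CCP in [0, 2*pi)) to minutes since the start of cycle 7."""
    return (cycle_offset_min[division_cycle]
            + ccp_uniform / (2 * np.pi) * cycle_length_min[division_cycle])
```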
The reported single-cell pseudotime corresponds to the average developmental time in minutes, spanning two hours of development starting at division cycle 7 (Figure 43). An important caveat to mention is that, due to our categorical assignment of an embryo to a single division cycle, all the cells of an embryo are assigned to that same division cycle. This is incorrect, since an embryo can contain cells from two adjacent division cycles, which is not taken into account in this analysis. The more asynchronous the cell cycle becomes, the less likely it is that all of an embryo’s cells are in the same division cycle.
Another approach would have been to not fix cells to their embryo's division cycle but to use additional information to obtain a single-cell estimate of the division cycle. Unfortunately, the best way of arriving at such an estimate would require a cell cycle phase estimate, another example of a chicken-and-egg problem like the one mentioned in the introduction (see Section 1.3.8.2). Additional information still left on the table that could inform a more precise estimate of the single-cell division cycle includes nuclear and cellular morphology, which are known to affect cell cycle lengthening (Kane and Kimmel 1993) and could in turn be used to assign cells of embryos "between" division cycles. Furthermore, one might use other dynamic quantities we will infer later when discussing mitotic waves (see Section 3.3.6), where the positioning within such waves—even once divisions become more asynchronous—could be informative for assessing whether a cell is ahead of or behind its local neighborhood. To keep the analysis simple, we decided to use only the refined embryo staging (see Section 3.3.4) for the stratification of embryos into division cycles. The single-cell pseudotime, while a useful way of giving us a broad view of the data, therefore still has room for improvement.
That said, this work serves as a compelling example of how data science approaches, combined with domain-driven machine learning, can offer novel insights into biological systems using fixed data. The resulting best estimate of developmental time at the single-cell level—particularly in early stages where cell cycles remain largely synchronous and division cycle assignments are more reliable—is ultimately constrained by our confidence in resolving the fine-grained ordering of cells within a given division cycle. While only a qualitative estimate of this uncertainty can be provided—and acknowledging that the error magnitude is unlikely to be uniform even within a single division cycle—I would estimate it to be on the order of 1–2 minutes during early stages (up to the 1k-cell stage). At later time points, however, the uncertainty increases substantially (exceeding 15 minutes), as it becomes more likely that many cells are misassigned to the incorrect division cycle. While the normalization of a single cell cycle allows us to interpret it as proportional to real-time progression (Figure 40 (b)), scaling each cell cycle by real-time estimates is akin to calibrating this temporal axis to arrive at an estimate of temporal progression in physical time (Figure 43)—analogous to how intensity measurements might be calibrated to reflect actual concentrations.
I leave the evaluation of the individual division cycles, as done in the last section for cycle 11, to the reader and want to focus on the larger patterns that can be observed in the data. Starting with PCNA, we observe a drop in PCNA concentration beginning at cycle 10, coinciding with the lengthening of the cell cycle. An interesting observation is that our data seem to indicate continuous DNA replication throughout the whole interphase (Figure 43, first two rows), which would lead us to conclude that the initial lengthening of the cell cycle is a lengthening of S phase rather than the introduction of a gap phase. This is seemingly at odds with existing work that reports the introduction of a G2 phase in division cycles 11–13 (Nogare, Pauerstein, and Lane 2009). On closer inspection, the issue may be more semantic in nature. In my understanding, the gap phase refers to any period that is neither mitosis nor S-phase, with S-phase defined as lasting for the entire duration of DNA replication. Nogare, Pauerstein, and Lane (2009) state "[we report the] acquisition of a G2 phase that is essential for preventing entry into mitosis before S-phase completion in cycles 11–13". This is a perspective on G2 phase in terms of the cell cycle checkpoint that causes it and is consistent with our observation of continuous replication while G2 checkpoints are being established.
Having gained confidence in our new CCP feature, we can revisit the original question of embryo staging. When staging embryos based on nuclei count, we observed clustering around powers of two during early division cycles. This pattern reflects cell cycle synchrony. Therefore, in early division cycles, nuclei count is an imprecise measure of an embryo’s developmental age—it only indicates which division cycle an embryo belongs to, without resolving whether the embryo is in the early or late phase of that cycle. Moreover, this staging method is susceptible to error, as missing nuclei are interpreted as differences in developmental age.
Consequently, the ordering of embryos within a given cycle—such as cycle 8—may be driven by missing nuclei rather than true differences in developmental progression. This is evident when visualizing nuclei-based staging against the average CCP aggregated per embryo (Figure 44 (a)). The ordering within cycle 8 based on nuclei count alone does not match the order when incorporating CCP. We can use the CCP dimension to adjust the binning of embryos, not only by thresholding based on nuclei count (Figure 44 (a), top), but also by incorporating CCP values (Figure 44 (a), bottom). These bins, paired with the average CCP within each bin, provide our best estimate for the age of an embryo, at least up to and including cycle 11 (Figure 44 (b)).
This demonstrates how measurements at one level of the object hierarchy (see Section 3.2.6.4)—the cell cycle inferred at the level of individual cells or nuclei—can be aggregated to infer properties of a larger structure, namely the refined embryonic stage. This approach is similar to our initial estimate of embryo age, which simply counted (aggregated) the number of nuclei per embryo. Importantly, these embryo-level properties can then be used to stratify cells at the lower level, enabling estimation of new properties for them. For example, we used the initial embryo age estimate based on nuclei count to bin the data, allowing us to fit individual CCP models per division cycle. This in turn allowed us to arrive at a better estimate of embryo age.
The data flow that led to the final CCP model is as follows: nuclei counts provided an initial embryo staging, the resulting division cycle bins were used to fit one CCP model per cycle, and the per-embryo aggregates of the inferred CCP were then used to refine the staging.
In fact, this represents a simple implementation of an iterative algorithm performed over two cycles. Such a loop would consist of: (1) embryo age estimation for binning, (2) CCP estimation for sampling, and (3) CCP model fitting. This loop could be run repeatedly until the CCP estimates—and thus the embryo age estimates—stabilize.
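A compact sketch of this two-pass refinement is given below; `stage_embryos`, `fit_ccp_models`, and `estimate_ccp` are hypothetical helpers standing in for the nuclei-count/CCP-based staging, the per-cycle model fitting, and the CCP inference steps described above.

```python
# Initial staging from nuclei counts alone (step 1).
age_bins = stage_embryos(nuclei_counts)

for _ in range(2):                                            # two passes were used in practice
    ccp_models = fit_ccp_models(features, bins=age_bins)      # step 3: one CCP model per bin
    ccp = estimate_ccp(ccp_models, features, bins=age_bins)   # step 2: single-cell CCP
    age_bins = stage_embryos(nuclei_counts, ccp=ccp)          # refine staging with per-embryo CCP
```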
Having refined our embryo staging, we arrived at a point where we can visualize the individual nuclei of embryos from all time points in their spatial locations, without it being just a chaotic, colorful display, but rather an interpretable visualization of the data (Figure 45). To achieve this, we split CCP into 14 discrete bins which, together with the 7 division cycles, form a grid, with 65 out of 212 embryos displayed at their closest matching grid position. Grid positions without a matching embryo were left empty.
We can see that in early division cycles the nuclei have similar cell cycle phases, whereas with increasing developmental time the cell cycle becomes ever more asynchronous. A caveat to mention here again is that after cycle 10 or 11 the asynchrony covers a full cell cycle, and as such the average loses its meaning to the point of being mostly noise once the distribution becomes uniform somewhere around division cycle 12 or 13. This might be the reason why there is a stray cycle 13 embryo in the bottom right.
Another interesting point to notice is the color gradients evident in the early cycles, which illustrate that the emerging CCP asynchrony is spatially patterned. The gradients of the early division cycles are reminiscent of the mitotic pseudowaves described by Keller et al. (2008). Using live imaging on a custom light sheet microscope, they describe radial waves up to division cycle 9, peripheral waves in cycles 10 to 13, and asynchronous patches later on. While the peripheral waves will be hard to detect in this data, since we excluded YSL cells from the CCP models, the other phenomena do seem to match this description qualitatively.
Finally, the visualization indicates a sampling bias within the dataset, with a lack of embryos sampled in the later part of the 9th division cycle, also visible in the first panel of Figure 44 (a). While it is unfortunate to have such a sampling bias in the data, the refined staging enabled by CCP inference allows us to detect it, and it explains the volume distribution of the cycle 9 embryos observed in Figure 37, which shows a distinct lack of the large nuclei expected in the second half of the cell cycle.
We already addressed the pattern of emerging asynchrony in the last section. This is a well known result in early zebrafish development, where in the process of the mid-blastula transition cell cycles both lengthen and start to become asynchronous (Kane and Kimmel 1993; Vastenhouw, Cao, and Lipshitz 2019). Having the ability to quantify CCP across all time points, we can use the mean resultant length (R) as a measure of synchrony with R = 1 corresponding to perfect synchrony and R = 0 meaning total asynchrony (see Section 3.3.3.5).
Computing R globally per embryo, we can see a clear pattern of emerging asynchrony, presented in Figure 46, left. We fitted a linear regression model using the adjusted embryo age from the last section as the independent variable and the logit-transformed measure of synchrony R as the dependent variable. The Wilkinson formula of the model is shown in Equation 5.
\[ \operatorname{logit}(R_{CCP}) \sim \text{Age}_\text{adj. division cycle} \tag{5}\]
The model was fit using the high-level Bayesian modeling library Bambi (Capretto et al. 2022), which estimates the model parameters using Markov chain Monte Carlo (MCMC) sampling methods and thereby allows reporting of the uncertainty in the model parameters in terms of the 96% credible interval of the posterior mean and the 96% credible interval of the posterior predictive distribution of the model (Figure 46, right). These were transformed back to the original data scale using the expit function (Figure 46, left). We can see that an asynchrony of R = 0.5 is reached during division cycle 10, which fits with the existing literature reporting that the mid-blastula transition takes place in division cycle 10 (Kane and Kimmel 1993).
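A minimal sketch of how such a model could be specified with Bambi is shown below, assuming a per-embryo data frame `df` with hypothetical columns "R" (mean resultant length) and "age_adj" (adjusted embryo age); the clipping, number of draws, and column names are illustrative.

```python
import arviz as az
import bambi as bmb
from scipy.special import expit, logit

df["logit_R"] = logit(df["R"].clip(1e-3, 1 - 1e-3))   # guard against R exactly 0 or 1

model = bmb.Model("logit_R ~ age_adj", df)             # Wilkinson formula from Equation 5
idata = model.fit(draws=2000, chains=4)                # MCMC sampling via PyMC
print(az.summary(idata, hdi_prob=0.96))                # 96% credible intervals

# Posterior means and predictive draws can be mapped back to the R scale with expit().
```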
The model is one way of describing the data, using simple linear regression on the logit-transformed target variable, and seems like a reasonable choice, but there are other options. There appears to be a rather sharp drop in synchrony during cycle 10 (i.e., early cells of that cycle are all above the mean predicted by the model in Figure 46) that is not well captured by a model assuming a linear loss in synchrony on the logit scale. An alternative approach that might capture this qualitative observation is change-point linear regression, similar to the approach described by Cahil (n.d.). It would assume two linear models with independent intercept and slope parameters, plus an additional change-point parameter corresponding to the time point of switching from one linear model to the other.
We can use neighborhood aggregations introduced earlier (Figure 27) to compute the measure of synchrony R in radius-defined neighborhoods ranging from 0 to 500 μm. The resulting synchrony profiles all start at R = 1 for neighborhoods containing only a single cell and decay to the global synchrony at a length scale larger than the embryo, where each cell’s neighborhood includes all other cells (Figure 47 (a)). We observe that embryos, when pooled by division cycle, reach their global synchrony level at different length scales. This is particularly apparent when examining the plot normalized by the global synchrony score (i.e., a value of 0 indicates a neighborhood has reached the global synchrony level of its respective division cycle). Whereas cycle 7 embryos reach half of their global synchrony at a scale of around 100 μm, this radius steadily decreases to approximately 25 μm by cycles 12 and 13 (Figure 47 (b)).
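The neighborhood aggregation can be expressed as a k-d tree query over radius-defined neighborhoods, as in the sketch below; `coords` (cell positions in μm) and `ccp` are assumed inputs, and the function name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def neighborhood_R(coords, ccp, radius):
    """Mean resultant length of CCP within a spherical neighborhood around each cell."""
    tree = cKDTree(coords)                               # coords: (n_cells, 3) positions in um
    neighborhoods = tree.query_ball_point(coords, r=radius)
    z = np.exp(1j * np.asarray(ccp))
    return np.array([np.abs(z[idx].mean()) for idx in neighborhoods])

# Synchrony profile over length scales, e.g. radii from 0 to 500 um:
# profile = {r: neighborhood_R(coords, ccp, r).mean() for r in np.linspace(0, 500, 21)}
```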
To evaluate whether mitotic pseudowaves could be reconstructed from fixed data, we computed CCP gradients using a finite difference method for gradient estimation developed for irregularly spaced data (Meyer, Eriksson, and Maggio 2001). This method was (1) adapted to operate on a 3D point cloud, and (2) modified to use circular differences, enabling application to the circular CCP feature. The resulting gradient field was then integrated using PyVista to generate streamlines that visualize mitotic pseudowaves. In cycles 8 and 9, we observed aligned pseudowaves reminiscent of the radial waves described by Keller et al. (2008), which they reported to be visible up to cycle 9 (Figure 48, top row). Beginning with cycle 10, the wave patterns became more chaotic (Figure 48, bottom row). The peripheral waves described by Keller et al. (2008) could not be reproduced in this analysis—likely due to limitations of the gradient estimation method near tissue boundaries (Meyer, Eriksson, and Maggio 2001), and because YSL nuclei were excluded from the CCP models.
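The key modification for the circular feature is that all finite differences are wrapped to [−π, π); the sketch below illustrates this with a simple least-squares gradient estimate over nearest neighbors, a simplified stand-in for the scheme of Meyer, Eriksson, and Maggio (2001) rather than a reimplementation of it.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def circular_difference(a, b):
    """Signed difference a - b wrapped to the interval [-pi, pi)."""
    return (a - b + np.pi) % (2 * np.pi) - np.pi

def ccp_gradient(coords, ccp, k=10):
    """Local CCP gradient on an irregular 3D point cloud via least squares over k neighbors."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(coords).kneighbors(coords)
    grads = np.empty_like(coords, dtype=float)
    for i, neigh in enumerate(idx):
        dX = coords[neigh[1:]] - coords[i]                   # displacements to neighbors
        dphi = circular_difference(ccp[neigh[1:]], ccp[i])   # wrapped CCP differences
        grads[i] = np.linalg.lstsq(dX, dphi, rcond=None)[0]  # solve dX @ g ~ dphi
    return grads
```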
In the final part of the analysis, we address the question that motivated the collection of this dataset in the first place: to what extent can we explain the variability in transcriptional activity between cells using (1) the cell cycle phase feature established in the previous sections and (2) the ZGA factors included in the antibody panel.
Due to the limitation of having access only to proteins and protein modifications with our immunofluorescence-based approach, we could not measure transcripts directly, but instead inferred transcriptional activity via phosphorylated states of RNA polymerase II. Specifically, we measured RNA polymerase II phosphorylated at serine 5—a marker of transcription initiation mediated by CDK7 (Glover-Cutter et al. 2009)—and at serine 2, which is phosphorylated by CDK9 and marks transcriptional elongation (Peterlin and Price 2006).
Pol II S2P has been reported to co-localize with nascent early transcripts in zebrafish as early as the 256-cell stage (Hadzhiev et al. 2019). Moreover, preliminary experiments conducted by my Master’s student Marvin Wyss demonstrated good correlation between serine 2 phosphorylation and nascent transcript abundance, as measured by 5-EU injection and fluorescent labelling via click chemistry (Wyss 2022).
In the following analysis, we therefore focus on modeling the mean nuclear signal of Pol II S2P as a proxy for transcriptional activity.
Analysis of Pol II S5P and Pol II S2P during the observed developmental period revealed an increase in mean nuclear intensity, consistent with rising transcriptional activity. The same caveat mentioned previously—namely, the embryo sampling bias for cycles 7 and 9—applies here as well (Figure 49). Both phosphorylation markers exhibited a distinct bimodal distribution, reminiscent of the bimodality observed in PCNA, a marker of S phase (Figure 37 b; note the linear versus logarithmic scale). This suggests a potential link to cell cycle progression, indicating that cell cycle phase may serve as a predictor of transcriptional activity.
This hypothesis is supported by the relationship between the two transcriptional markers and the CCP feature established in earlier sections, as illustrated for cycle 11 embryos in Figure 50. Both phosphorylation states are low during mitosis and increase gradually throughout interphase, eventually reaching a plateau. Notably, Pol II S5P (initiation) appears to precede Pol II S2P (elongation), in accordance with the temporal sequence of transcriptional initiation followed by elongation.
To investigate the relationships between measured single-cell features and transcriptional activity, we employed statistical regression modeling. Following the approach outlined by McElreath (2015), we begin with a causal diagram that represents plausible interactions among the measured variables prior to formal model specification.
Figure 51 provides an overview of both the components measured in this dataset and additional molecular factors known from the literature that were not directly measured. It also illustrates hypothesized influences on Pol II S2P levels. Arrows indicate the direction and type of interaction: green arrows represent activating relationships, while red T-shaped arrows denote repression, as introduced earlier in Figure 3.
We employed PyMC and Bambi for Bayesian regression modeling (Abril-Pla et al. 2023b; Capretto et al. 2022). In contrast to frequentist approaches, Bayesian models express parameters as probability distributions. This formulation offers a key advantage: once a model is fit to data, its parameters are represented by posterior distributions that inherently capture parameter uncertainty. Moreover, it is possible to draw samples from the fitted model—a process known as posterior predictive sampling—to (1) assess model fit to the observed data (e.g., Figure 46), or (2) generate predictions for new, out-of-sample observations. Predictions derived from a Bayesian model are not point estimates but distributions, reflecting the uncertainty in the estimated parameters.
Libraries such as PyMC allow for the specification of prior distributions on model parameters, which are then linked to define a complete probabilistic model. In most real-world scenarios, such models cannot be solved analytically. Instead, posterior distributions are typically approximated using Markov Chain Monte Carlo (MCMC) sampling or variational inference—both of which are implemented efficiently in PyMC, enabling practical application without requiring deep familiarity with their computational underpinnings.
It is, however, crucial to run appropriate diagnostics to evaluate sampling performance and to compare different models. For these purposes, we used the ArviZ library (Kumar et al. 2019). Key functions included plot_trace() to verify well-behaved sampling traces, plot_forest() to visualize posterior distributions across parameters or models, plot_ppc() to compare posterior predictive distributions with the observed data, and plot_compare() to assess models based on their expected log pointwise predictive density (ELPD).
We additionally report the coefficient of determination (R2), a conventional regression metric, while noting that it did not influence model selection. For detailed guidance on Bayesian modeling and these diagnostics, we refer the reader to the excellent documentation of ArviZ, PyMC, and Bambi.
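For reference, a typical diagnostic sequence with ArviZ looks roughly like the snippet below, where `idata_a` and `idata_b` are InferenceData objects from two fitted models (the model names are taken from the comparison that follows, and the exact call arguments are illustrative).

```python
import arviz as az

az.plot_trace(idata_a)                       # check that sampling traces are well behaved
az.plot_forest([idata_a, idata_b])           # compare posterior distributions across models
az.plot_ppc(idata_a)                         # posterior predictive vs. observed data
                                             # (requires posterior predictive samples in idata_a)
comparison = az.compare({"ccp-asympt": idata_a, "ccp-power": idata_b})  # ELPD-based ranking
az.plot_compare(comparison)
```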
We modeled transcriptional activity during interphase, \(Y\) (Equation 6), with varying sets of predictor features using multilinear regression (Equation 7; Table 22, linear predictors). Since the relationship between CCP and Pol II phosphorylation is clearly non-linear in interphase cells (Figure 50), we additionally used (1) multilinear regression with B-spline encoded CCP, (2) asymptotic regression (Equation 8), and (3) power curve regression (Equation 9) on the CCP feature (Table 22, non-linear predictors). Finally, we combined the models using non-linear CCP predictors with linear predictors (Equations 10 and 11).
\[ Y \sim \mathrm{Normal}(\mu, \sigma) \tag{6}\]
\[ \mu = \beta_0 + \beta X \tag{7}\]
\[ \mu = a - (a - b)\exp(-c X_{CCP}) \tag{8}\]
\[ \mu = a X_{CCP}^b \tag{9}\]
\[ \mu = a - (a - b)\exp(-c X_{CCP}) + \beta X_{linear} \tag{10}\]
\[ \mu = a X_{CCP}^b + \beta X_{linear} \tag{11}\]
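A minimal PyMC sketch of the asymptotic model with linear covariates (Equations 6, 8 and 10) is shown below; `x_ccp`, `X_lin`, and `y_obs` are assumed to be prepared NumPy arrays, and the priors are illustrative rather than those used for the reported models.

```python
import pymc as pm

with pm.Model() as asymptotic_model:
    a = pm.Normal("a", mu=0, sigma=10)        # plateau reached late in interphase
    b = pm.Normal("b", mu=0, sigma=10)        # value at CCP = 0
    c = pm.HalfNormal("c", sigma=5)           # rate of approach to the plateau
    beta = pm.Normal("beta", mu=0, sigma=1, shape=X_lin.shape[1])
    sigma = pm.HalfNormal("sigma", sigma=1)

    mu = a - (a - b) * pm.math.exp(-c * x_ccp) + pm.math.dot(X_lin, beta)  # Equation 10
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)                     # Equation 6

    idata = pm.sample()                        # MCMC with the default NUTS sampler
```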
The resulting models, constructed from a set of linear predictors and potentially non-linear CCP predictors, were trained on up to 50,000 randomly sampled cells per division cycle (8–12) and compared using ELPD. A model named s5p_zga_ccp_ccp-power used 3 feature sets as linear predictors (s5p: [Pol2.S5P], zga: [Pou5, Nanog, H3K27Ac, H2B], ccp: [CCP]) and additionally non-linear power regression on the CCP feature. Note that in this example CCP would be used both as a linear predictor (ccp) and in the power regression (ccp-power).
zga_ccp-asympt uses 4 ZGA features as linear predictors and asymptotic regression with the CCP feature (3 parameters), for a total of 7 model parameters.
We started by modeling Pol II S2P in interphase cells using CCP as a linear (ccp) or non-linear (ccp-spline, ccp-asympt, ccp-power) predictor. We also compared it to a multilinear regression model including an approximate set of features that went into the CCP model (ccp-feats). For the asymptotic and power regression, we tested additional models (ccp-cens-asympt, ccp-cens-power) that account for the censoring of values at 100 seen in Figure 50.
Figure 52 illustrates the results of the regression modeling for interphase cells using CCP as a predictor for division cycles 8–12 (division cycles 7 and 13 are excluded because they are not covered as a whole). The asymptotic regression models (ccp-asympt, ccp-cens-asympt) using only CCP as a predictor feature are consistently the top-performing models for all time points. They outperform modeling the non-linearity using power regression (ccp-power, ccp-cens-power) and encoding the CCP feature using B-splines (ccp-spline). In fact, the B-spline models perform by far the worst. Even though their R2 values are high, they are penalized in the ELPD scoring, indicating potential overfitting.
Multilinear regression models using CCP features (ccp-feats), even when CCP is added as a linear predictor (ccp_ccp-feats), perform consistently worse than the asymptotic CCP models (Figure 52). This leads us to conclude that (1) the established CCP feature is a strong and parsimonious predictor of RNA polymerase II serine 2 phosphorylation in early zebrafish development, and (2) the relationship is non-linear and captured well by an asymptotic regression model, consistent with a saturation of phosphorylation during interphase. Furthermore, the result is in agreement with existing literature that identifies the cell cycle as a source of variability in transcript abundance at the single-gene level (Stapel, Zechner, and Vastenhouw 2017).
The CCP feature is a parsimonious and interpretable predictor of transcriptional activity. Modeling the non-linear relationship between CCP progression and transcriptional activity explicitly using asymptotic regression led to the top-performing models. The explained variances (R2) of the top-performing models, based on ELPD model selection, range between 0.66 and 0.83.
Having confirmed the utility of CCP as a predictor of transcriptional activity, we proceeded to incorporate additional linear predictors (feature sets ccp-feats and zga; Table 22, linear predictors) alongside the non-linear CCP component of the model (Equations 10 and 11). Additionally, we examined the impact of including Pol II S5P as a predictor (s5p).
Again, asymptotic regression models outperformed power regression models (Figure 53, left). Furthermore, adding the cell cycle feature set, ZGA feature set, and S5P all improved model performance. The best models without Pol II S5P had R2 between 0.69–0.88. Including S5P resulted in models with R2 between 0.74–0.88.
Standardizing the features enabled the magnitude and sign of the regression parameters (\(\beta\)) for the linear predictors to offer insights into the influence of each feature within the statistical model (Figure 53, right). Note that all models contain a non-linear CCP term, meaning the regression parameters represent the effect of a feature while accounting for CCP.
Incorporating S5P as a predictor into the model consistently improved its performance. This outcome is expected, as serine 5 phosphorylation is likely a causal upstream event of S2P. The regression coefficient is positive, indicating that more S5P in the nucleus of cell cycle-matched cells is predictive of higher S2P, consistent with the initial hypothesis (Figure 51).
From the cell cycle feature set, PCNA had a strong negative association with transcriptional activity from cycle 10 onwards. This is indicative of a repressive effect of PCNA, potentially due to competition between the replication machinery and the transcription machinery for DNA access. Aside from PCNA, the regression parameters of the linear predictors in this feature set are relatively small and do not follow a consistent pattern.
Pou5 consistently exhibited the strongest positive association with transcriptional activity across all time points. Nanog generally displayed a positive regression parameter of lower magnitude, except in the cycle 10 model. Meanwhile, H3K27Ac exhibited a slight positive regression parameter from cycle 10 onward. These results are again consistent with our initial hypothesis (Figure 51) of activating pioneering factors and histone modifications being associated with increased transcriptional activity (Lee et al. 2013; Chan et al. 2019).
An unexpected result of the modeling is the positive association between H2B and transcriptional activity up to division cycle 10. Our expectation would have been a negative association, with histones acting as general repressors competing for DNA access (Joseph et al. 2017). Within the endogenous variability of histones in our experiments, we could not observe this repressive effect.
Another interesting observation is the effect on regression parameters when adding S5P as a predictor. While the effect was usually a decrease in parameter magnitude, indicative of redundancy of the features in the model, not all predictors were affected equally. Nanog exhibited the most striking drop in magnitude of the regression parameter upon adding S5P.
Finally, we used the two best performing models without S5P (zga_ccp-feats_ccp-asympt) and with S5P (s5p_zga_ccp-feats_ccp-asympt) and evaluated them on an embryo excluded from training. The models had R2 of 0.8 and 0.91 respectively (Figure 54).
Visualization of the predictions for the nuclei within the context of the embryo displayed no apparent spatial bias in prediction quality (Figures 55 and 56).
I presented a protocol for high-throughput multiplexing of the animal cap in early-stage zebrafish embryos (see Section 3.1). The protocol includes a mounting method compatible with 96-well plates that kept the samples stable for at least four rounds of buffer exchanges without negatively affecting antibody incubation times. Furthermore, the imaging buffer of the original 4i protocol was adapted to achieve refractive index matching during imaging—a prerequisite for quantitative volumetric imaging. The protocol represents a significant experimental advancement in terms of (1) sample throughput (~200 embryos in a single experiment), (2) the ability to acquire high-quality volumetric data, and (3) the number of proteins and protein modifications that can be measured within the same sample.
While the protocol is a clear step forward, there remain several areas for improvement. In the following, I will reiterate some of its current limitations and suggest possible directions for future enhancement.
Finding specific antibodies for the zebrafish model can be a tedious process. Since most commercially available antibodies are raised against human targets, many do not work in zebrafish, and information regarding cross-reactivity is often not available. The problem is exacerbated by protocol specifics (fixation, permeabilization, and clearing conditions); antibodies that work in one IF protocol may not function under different conditions. Setting up a single experiment is cumbersome and requires numerous preliminary experiments to:
These problems cannot be addressed easily. The only things that can help are (1) keeping a list of antibodies that have been tested readily available and (2) having the early image processing steps automated to the extent that preliminary experiments can be performed efficiently and with minimal manual intervention.
I hope the extensive documentation of antibody testing done by Marvin Wyss (Wyss 2022), and the image analysis and visualization tools introduced in Section 3.2, are helpful in future efforts of multiplexing in zebrafish and other volumetric samples.
The fibronectin-based mounting in 96-well plates using photo crosslinking marked a significant milestone, enabling four imaging cycles of the same samples. The relatively loose tethering of samples to the plate is essential for maintaining short antibody incubation times. This is particularly important since longer incubation periods extend the experiment’s duration and raise the risk of sample degradation. Furthermore, prolonged experiments raise the stakes of data acquisition due to the substantial time investment required from the experimentalist.
A downside of the loose tethering in the established protocol is that it relies on specialized automated liquid handling hardware, limiting accessibility to the broader scientific community. Automated liquid handling offers other advantages, for example better reproducibility in buffer exchange and increased experimental throughput, but it restricts the protocol to labs equipped with the necessary infrastructure. The original 4i protocol on tissue culture cells is compatible with the use of washer-dispensers, which are relatively low-tech and cheap. Their use is not an option with our protocol. Recently developed open-source liquid handling platforms (Florian et al. 2020; Eggert et al. 2020) and new competition in the commercial space by companies like Opentrons make automated liquid handling accessible at a fraction of the cost and might aid in making the protocol more accessible.
Another solution might be the development of hydrogel-based embedding methods (Chung and Deisseroth 2013; Yang et al. 2014; Ku et al. 2016) compatible with multi-well plates, that increase sample stability while maintaining staining accessibility. They provide the additional benefit of enabling lipid removal which might lower incubation times and could further improve sample clearing when compared to the current approach that relies on refractive index matching only. The problem with hydrogel-embedded samples is that they usually expand, which may be declared a feature and called expansion microscopy outside a multi-well plate (Ku et al. 2016). But when constrained by a well, the expansion is anisotropic and detrimental to sample morphology. Preliminary experiments with the CLARITY protocol were very promising in terms of image quality. However, the embryos exhibited significant deformation, with their animal poles flattened against the plate bottom as a result of sample expansion.
Improvements to the mounting and staining protocol, and streamlining of preliminary experiments, would allow for more comprehensive and quantitative measurement of cellular components in volumetric samples.
The current protocol is limited to measurements on the protein level. Integration of the protocol with (single-molecule) fluorescence in-situ hybridization (FISH) methods would grant access to molecular readouts of another important class of biomolecules, the nucleic acids including both DNA and RNA.
Such protocol development is not simple, since it requires not only adapting a new protocol, but also maintaining compatibility with the existing one. Furthermore, single-molecule FISH data is often analyzed by counting individual molecules, which could be very difficult for all but very low abundance transcripts with the current imaging using a 20x objective.
Moving to higher-resolution imaging introduces novel challenges. One generates more than four times the amount of data when moving to a 40x objective, since sites need to be imaged with an overlap for later stitching of the image volumes. Furthermore, stitching can be challenging, since it involves image registration of the overlapping regions, albeit "only" using translation transforms. This increase in complexity in the early image processing should not be underestimated, as it is an additional step that might lead to some loss of data or errors that will propagate throughout the pipeline. An interesting alternative approach might be to maintain the 20x imaging while establishing intensity-based quantification of transcript abundance rather than counting individual transcripts.
In this project, we aimed to expand the coverage of molecular species measurable simultaneously by adapting the 4i multiplexing protocol for early-stage zebrafish embryos. A parallel goal was to normalize these measurements across the entire imaged volume so that pixel intensities are proportional to local target concentrations. A significant effort went into acquiring data, and developing processing workflows, that allow for semi-quantitative fluorescence intensity measurements (see Sections 1.2.2, 3.1.3 and 3.2.7).
Improving measurement quality is a key methodological objective in image-based systems biology. Yet, in my view, it receives disproportionately little attention in the community. In addition to striving for single-pixel, on-the-fly bias correction (discussed in Section 4.3.3) we experimented with alternative ways of estimating intensity bias caused by aberrations to the light path in the sample mounting chamber.
A key innovation in our approach to z-intensity bias correction was partitioning the light path through the mounting chamber into two segments—mounting medium and sample—and fitting a two-dimensional intensity decay model with independent decay rates for each region (Figure 30 a, b). This approach yielded superior bias correction, not only flattening the mean intensity across the z-range but also reducing variability around that mean (Figure 30 c).
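Conceptually, the correction can be sketched as a two-segment exponential decay in which each measurement is attenuated according to its path length through mounting medium and through the sample; the function, the variable names, and the use of scipy.optimize.curve_fit below are illustrative assumptions, not the implementation behind Figure 30.

```python
import numpy as np
from scipy.optimize import curve_fit

def two_segment_decay(path_lengths, i0, k_medium, k_sample):
    """Intensity after traversing d_medium of mounting medium and d_sample of sample."""
    d_medium, d_sample = path_lengths
    return i0 * np.exp(-k_medium * d_medium - k_sample * d_sample)

# d_medium, d_sample, intensities: per-measurement arrays derived from chamber geometry,
# the sample mask, and a measured reference signal (hypothetical variable names).
(i0, k_medium, k_sample), _ = curve_fit(
    two_segment_decay, (d_medium, d_sample), intensities, p0=(1.0, 1e-3, 1e-3)
)

# Divide out the modeled attenuation to flatten the z-dependent bias.
corrected = intensities / two_segment_decay((d_medium, d_sample), 1.0, k_medium, k_sample)
```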
These 2D models could be extended into three dimensions. For example, a third axial partition—such as one defined by a membrane mask—could help correct for residual refractive index inhomogeneities in cleared samples that have not undergone lipid removal. Alternatively, dimensionality could be increased by incorporating a new directional component derived from the embryo mask, such as the distance to the embryo border, which may help account for staining bias.
Extending model dimensionality, however, introduces additional complexity. As discussed in the relevant section, doing so requires either a justified assumption of uniform intensity along the new model dimensions or a principled method for estimating bias independently (e.g., from fluorescent background signal). Moreover, increasing the number of model dimensions results in a sparser distribution of data points across the model hypersurface, as already evident in the transition from 1D to 2D modeling (Figure 30 b). A three-dimensional model would further amplify this sparsity, necessitating a careful evaluation of whether reliable parameter estimation remains feasible.
Another approach to enhancing the quantitative quality of the data involves calibrating measurements to achieve true concentration measurements. This would enable absolute comparisons across experiments, eliminating the need for distribution matching when integrating data longitudinally from multiple experiments. Moreover, it would enable the development of stoichiometric quantitative models, thereby providing a more rigorous foundation for mechanistic insights and predictive analyses.
We have invested a considerable amount of time trying to increase the number of molecular species that can be measured simultaneously by multiplexing. Given the sheer complexity of biological systems, I think this is a worthwhile effort, because even with 12 targets at a time we are far from a comprehensive measurement of the molecular complexity within cells. Furthermore, we might be interested in measuring a broad range of biomolecules—ideally in the same sample (e.g., proteins, protein modifications, transcripts, small molecules, lipids, metabolites, etc…).
The more components we can measure, the more fine-grained a description and understanding of the system becomes accessible to us. And when reasoning at a coarse-grained level, for example about morphogenesis, there are causal connections down to the molecular species that build macromolecules and make up organelles, subcellular compartments, cells, cell populations, and ultimately the tissue and organism we are studying. We should expect many of them to modulate whatever process we are investigating at the higher level.
Given this complexity, the methods of systems biology aiming for more comprehensive measurements serve two main purposes. (1) Multiomics methods are widely used as screening tools to identify “hits”—molecular components involved in the biological process of interest—which are then studied in more detail through targeted experiments. (2) They also enable the investigation of components within the broader context of complex biological systems.
I think many systems biologists would consider (2) to be the "true" goal of systems biology. But it is a much harder problem in terms of formulating a useful hypothesis. I sometimes wonder to what extent our experimental methods are ahead of our scientific theory and our ability to formulate both conceptual and mathematical models at those higher levels, and whether we already have the appropriate coarse-grained concepts necessary to do so.
This does not, of course, mean we should stop developing such methods, since absent the data we would have no way of addressing such hypotheses experimentally. Furthermore, generating "systems level data" might be a prerequisite to stimulate the generation of such hypotheses in the first place. But I do want to highlight the importance of developing the conceptual and computational methods in parallel with the experimental advancements.
Additionally, both more comprehensive measurement by extending the existing protocol and integrating new modalities come at a cost. When thinking about my own data I do want to highlight three of the downsides that come with more “plex”:
This is why I would contemplate the following questions when thinking about an experimental design using any of the methods in systems biology, including the protocol developed in this work:
The protocol presented in this thesis enables the acquisition of large-scale volumetric image datasets of early zebrafish development. Roughly two weeks of experiments (excluding preliminary work to establish an antibody panel) result in approximately 5 terabytes of image data. In the meantime, others in the lab have conducted even larger experiments. Yet beyond the experimental challenges, analyzing such datasets presents another major hurdle.
As an example to illustrate this point: despite my efforts to streamline early image processing and keep the code reusable, analyzing the preliminary multiplexing dataset Marvin acquired—prior to the one presented in this thesis—would still take me months. And that’s for a dataset with a well-defined analysis path, with me arguably having all the necessary expertise.
The only way this kind of science can be sustainable is through major investments in computational infrastructure that supports both the development and deployment of large-scale image analysis workflows.
Fortunately, there is exciting progress in the bioimaging and spatial imaging communities that makes me hopeful it will soon be much easier to conduct this kind of analysis. Ideally, it would be accessible to biologists with minimal technical expertise—those who are focused on biological questions, not on how to store image data, manage feature tables, or distribute workflows across computing infrastructure.
At the local level, I am very excited to see that the University of Zurich has not only acknowledged the need for investment in bioimage analysis but also followed through by launching the BioVision Center. This marks an important step toward (1) making large-scale image processing workflows—such as the one developed for this project—both accessible and reproducible through infrastructure projects like Fractal; (2) fostering a strong image analysis community in Zurich and Switzerland; and (3) lowering the barrier of entry for biologists to learn about image analysis, join the community, and contribute their ideas and code.
It has been very exciting to witness how quickly the center progressed—from having no office space to organizing community events with international attendance—while simultaneously advancing the Fractal project to a state where it is now actively used in research. I am confident that the center will play a key role in shaping the image analysis infrastructure and community ecosystem of the future.
I also hope that the funding bodies at the University of Zurich recognize both the significant progress already made and the long-term nature of building computational infrastructure. Even with all the momentum, much work remains. As with all software projects, the challenge lies not only in developing new functionality and onboarding users, but also in maintaining a large and complex system so that it remains robust and reliable.
One of the most important developments in the broader imaging community is the push toward standardizing microscopy data using a modern, open-source format: OME-NGFF (also known as OME-Zarr).64 Standardization is fundamentally important for creating shareable and extensible image analysis tools. Once broadly adopted, it will be much easier to justify the investment of time and resources into developing such tools and infrastructure—both from the perspective of individual developers and institutions.
The specification is still relatively young, and there are many exciting proposals for its improvement. One recent development I would like to highlight is a spatial transform specification, proposed by John Bogovic, which has already received broad community support and contributions. The aim is to define spatial transforms as required for tasks like image registration, among other use cases.
The inclusion of displacement fields—one way to represent non-linear spatial transforms (see Section 3.2.2)—is particularly exciting for our use case. This approach could eliminate the need to duplicate image data after elastic registration and prevent us from being locked into decisions made early in the analysis during a complex processing step. Instead, we would gain the flexibility to experiment with different registration parameters and retain the ability to adjust image alignment later in the workflow.
Even though we might still duplicate data in the near future—due to the current computational cost of applying such transforms—having a clear specification opens the door to investing developer time into adapting efficient, GPU-based implementations (Ruijters, Haar Romeny, and Suetens 2011). If elastic transformations could be applied on-the-fly without significantly impacting I/O performance, it would represent a major step toward making our analysis both easier and more reproducible.
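To make this concrete, the following is a minimal sketch of what applying a stored displacement field on demand could look like for a single volume; the function name and the (3, z, y, x) field layout are illustrative assumptions rather than part of any specification.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_with_displacement_field(volume: np.ndarray, field: np.ndarray) -> np.ndarray:
    """Warp a (z, y, x) volume with a dense displacement field of shape (3, z, y, x).

    Each output voxel p is sampled at p + field[:, p] in the input volume
    (linear interpolation), which is how an elastic registration result could
    be applied lazily instead of writing a warped copy of the data to disk.
    """
    identity = np.indices(volume.shape, dtype=np.float32)  # identity sampling grid
    return map_coordinates(volume, identity + field, order=1, mode="nearest")
```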
One reason I’m particularly excited about the inclusion of spatial transforms in the OME-Zarr specification is how well this aligns with my own efforts in performing intensity bias correction on the single-pixel level on-the-fly (see Section 3.2.7). While perhaps not as high a priority for the broader imaging community, intensity correction is paramount in fluorescence microscopy to ensure accurate measurements. Of the five sources of intensity bias that needed correction in this dataset, all but elution background were corrected directly in the images (Figure 28). Elution background was treated differently only because it was identified too late to be implemented as an image-based correction.
I believe all bias correction terms should be applied on the single-pixel level. Not only does this allow us to use bias-free images for visualizations (Figures 14, 31 and 41), but it also drastically simplifies the extraction of intensity features. Aggregating raw pixel intensities across a large label region is difficult to justify without correction. For example, when measuring intensity variance within a labeled region, we aim to capture the true fluorophore variance—not variance inflated by imaging bias. A related example is discussed in Section 4.4.
One of my regrets is that I applied flat-field and dark-field corrections during the initial compression step, effectively burning them into the image data. One problem is that the written images are slightly incorrect: rounding to 16-bit precision after correction introduces minor inaccuracies, and clipping at zero after dark-field subtraction means noise fluctuations that would dip below zero are lost. Ironically, this is precisely why camera sensors include a gain in the first place.65 While irritating, the magnitude of the errors is small and should not have affected the analysis presented in this thesis.
More importantly, illumination correction files are sometimes used long after their creation, even though a single maintenance operation on the microscope can render them invalid. While skipping bias field acquisition is acceptable in preliminary experiments to save time66, it becomes problematic when this habit carries over to critical experiments, where such corrections are essential. On-the-fly bias correction would reduce the cost of mistakes by still allowing adjustments to the bias correction if issues are discovered later. Such issues typically come to light long after the raw image data has been archived. In practice, no one is going to retrieve raw data from archival storage just to reapply flat-field correction—the data volumes are simply too large. Moreover, reversing a previously applied correction would be error-prone, introduce additional rounding errors, and be nearly impossible to reproduce reliably.
Bias corrections are computationally inexpensive, making a lazy implementation that can be applied on-the-fly practical. Furthermore, if models are based on physical coordinates, they can be applied across all image pyramid levels and compared between experiments. Finally, since intensity bias correction can be tricky—especially when dealing with volumetric imaging—it would not be wise to bake the corrections directly into the image data. Therefore, lazy application of bias models is the only viable option for intensity bias correction at the single-pixel level when working with volumetric data.
Combined with spatial transforms—if both can be implemented efficiently—we would be able to store the raw microscopy data as OME-Zarr and apply all image transforms on-the-fly. From the user’s perspective, images would simply load with all relevant corrections applied transparently in the background. In my opinion, this is the best long-term approach.
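As a sketch of this long-term vision, such a corrected view can be expressed lazily, for example with dask; the array layout and paths below are hypothetical, and a real correction stack would chain several bias models rather than a single flat-/dark-field pair.

```python
import dask.array as da

def corrected_view(raw: da.Array, dark, flat) -> da.Array:
    """Return a lazily flat-/dark-field corrected view of a (c, z, y, x) stack.

    No pixels are computed until a downstream task requests them, so the raw
    data on disk stays immutable and the correction models remain swappable.
    """
    return (raw.astype("float32") - dark) / flat

# Hypothetical usage against an OME-Zarr pyramid level:
# raw = da.from_zarr("embryo.zarr/0")
# view = corrected_view(raw, dark, flat)            # still lazy
# crop = view[0, 40, 200:400, 200:400].compute()    # only this patch is corrected
```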
Building on the previous section, a central goal in early image processing is the ability to define lazy transformations between raster images—operations that are specified but only computed when needed. In addition to intensity bias correction, we implemented lazy resampling of label images. Another promising application is the flexible selection of arbitrary subsets of labeled objects.
Take, for instance, the nuclei segmentation stored under the name nucleiRaw3 (Table 11). The name reflects that several segmentation runs were required to achieve acceptable results. Furthermore, a segmentation algorithm is usually followed by a sequence of post-processing steps to remove mis-segmented objects. The file on disk contains the raw output of Cellpose, which was later refined by applying a debris classifier, flagging morphological outliers, detecting misaligned cells, and filtering based on intensity artifacts caused by fluorescent debris.
Crucially, whether a nucleus should be considered an outlier depends on context. Suppose we want to count the number of nuclei in an embryo to estimate its age: in that case, removing debris makes sense, but rejecting nuclei that are merged or cropped at the image border would worsen the result. Similarly, we would not want to discard intensity outliers or misaligned nuclei because that would lead to further undercounting. In a large-scale multiplexing experiment, filtering all potential outliers indiscriminately quickly results in an empty dataset. Instead, we should define contextual outlier sets and apply only those appropriate to a given analysis.
With tabular data, one can simply store an outlier mask for each outlier kind and filter out the appropriate ones depending on the analysis. Paired with a simple label selection operation when loading a label image, the same outlier masks can be used to generate corresponding outlier-free virtual label images. Those virtual label images are assigned a new name, either automatically or manually. In the case of segmentation post-processing, we might assign a new name to the virtual label image (e.g., nuclei after removing segmentation debris). Or we could have virtual label images for each cell type resulting from a classifier defined in the same way (e.g., nuclei-EVL, nuclei-Deep, nuclei-YSL). Or for a specific cellular neighborhood (nuclei-#hash-of-lbl-selection) to visualize a cell in its surrounding context (i.e., crop and mask a cellular neighborhood from the volume).
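A minimal sketch of such a label selection operation (function and column names are illustrative):

```python
import numpy as np

def select_labels(label_image: np.ndarray, keep_ids: np.ndarray) -> np.ndarray:
    """Return a virtual label image containing only the requested objects.

    Objects not in `keep_ids` are set to background (0); the label image on
    disk is never modified, so any number of contextual views can coexist.
    """
    return np.where(np.isin(label_image, keep_ids), label_image, 0)

# e.g., nuclei after removing segmentation debris flagged in an objects table:
# clean_ids = objects.loc[~objects.is_debris, "label"].to_numpy()
# nuclei_clean = select_labels(nuclei_raw, clean_ids)
```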
Again, in the long-term we want to store raw microscopy data and formulate the transforms for processing as a computational graph, with each intermediate result being uniquely identifiable in terms of its processing history. In the case of computationally expensive transformations, we want to be able to persist them under a pre-determined ID. When requesting such an image, the implementation would first try to load the data from a persistence location based on this ID, and if it does not exist trigger the upstream processing required to compute it. A relatively small number of defined types (Image, LabelImage, [Mask], [DistanceTransform]) paired with a few operations between them (Image → Image, LabelImage → LabelImage, LabelImage → Mask, Mask → DistanceTransform) would enable specifying many useful early image processing steps lazily.
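The persist-under-a-predetermined-ID idea could be prototyped with a small get-or-compute helper; the hashing scheme and storage layout below are assumptions chosen purely for illustration.

```python
import hashlib
import json
from pathlib import Path

import numpy as np

def get_or_compute(name: str, params: dict, compute_fn, cache_dir: str = "derived"):
    """Load a persisted intermediate if its ID exists, otherwise compute and store it.

    The ID is derived from the transform name and its parameters, so an
    identical processing history always resolves to the same on-disk location.
    """
    digest = hashlib.sha1(
        json.dumps({"name": name, **params}, sort_keys=True).encode()
    ).hexdigest()[:12]
    path = Path(cache_dir) / f"{name}-{digest}.npy"
    if path.exists():
        return np.load(path)
    result = compute_fn(**params)
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, result)
    return result
```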
A final, crucial point is the specification of table schemas. The main motivation is to allow features to be referred to just like any other resource in the dataset. This is important not only for the visualization of components, but more critically for computing new features that build on existing measurements. Examples from this work include neighborhood features, hierarchy features, and feature embeddings—all of which rely on tabular data (Figure 23 a, (2), (3) and (4)).
In practice, I often relied on intermediate tables without being certain of the specific processing steps that had been applied to them. The lack of specification made it difficult to trust the correctness of the analysis. Was a table generated before or after t-intensity correction was introduced? Or z-intensity correction? Which correction model was used? And if I come up with a better approach tomorrow, how difficult would it be to recompute downstream analyses like a CCP or regression model?
To fully leverage our image data, we need table schemas that are (1) easily extensible and (2) stable—meaning that at least parts of the schema are clearly defined, allowing downstream processing to be built around them. With such a specification, we could implement object hierarchy aggregation, neighborhood aggregation, and various embeddings once in a generalizable way, and then apply them to all new data conforming to the schema.
This is particularly important for building advanced image analysis workflows that go beyond simple mean intensity measurements in nuclei and cytoplasm. A solid schema specification would also facilitate the integration of data across multiple experiments over time, and support clinical applications—where analysis reports may need to be produced within hours or days.
Kevin Yamauchi proposed a table specification for OME-Zarr, built around AnnData tables commonly used in the (spatial) transcriptomics community (Virshup et al. 2024). The proposal met resistance in the community, which, from my understanding, was largely due to its reliance on a Python-only project for a language-agnostic data specification. While I share some of these concerns—not being an AnnData user myself—I still strongly believe that the absence of an accepted table specification poses a greater long-term problem. Without a common standard, we risk ending up with 15 competing ones that may prove even harder to unify later on.
The lack of consensus around a table specification, combined with the use of the legacy abbott-h5 format in my own project, was the primary reason I didn’t feel compelled to follow any particular specification. Instead, I opted for technologies that matched my personal preferences and workflow. Since the logical data model and table schemas introduced in Sections 3.2.4 and 3.2.6 are largely backend-agnostic, it would be relatively straightforward to map them to OME-NGFF using the AnnData table specification. Nonetheless, in the absence of community consensus, it’s difficult to justify investing the time to do so.
In my mind, there are two possible paths forward for the OME-NGFF table specification:
I personally have a slight preference for the latter—mainly because it maps more closely to what I already use in my everyday image analysis and personal code base. The main advantage I see in going with option (1) is that it includes a way of storing sparse adjacency matrices in obsp, which could be used to store the touch adjacency matrices (Figure 25 a). But since I usually compute neighborhoods on-the-fly (Listing 4), this has not been a major priority for me. The touch adjacency matrix is an exception, as it requires access to the label image and is therefore expensive to compute—it probably should be persisted.
Ultimately, the exact specification is not a hill I am willing to die on, because I find it much more important that we make progress toward an agreed-upon specification and begin considering how to unify access to large-scale datasets across the various levels of access required in analysis. These include the “single site level” (i.e., one ROI), where image processing workflows are typically developed, and the “dataset level” (i.e., a collection of ROIs67), which is used for exploratory data analysis or modelling based on tables aggregated from multiple sites across one or more experiments.
Independent of the choice regarding how to persist tables to disk, I would like to see the following table schemas included as part of a specification, listed in order of perceived urgency.
Label objects table specification, which includes simple vector data types (e.g., centroid, physical bounding box, and principal axes; Figure 21, label_object68; Table 17).
Feature table schemas with extensible specifications, including column names (features and categorical “resource coordinates”) and data types that support computation of embeddings and visualization.
Examples from this work include a CCP table (columns: roi, o, label, CCP, NormalizedCCP, DistanceToCCP), a raw intensity feature table (roi, o, label, c, Mean, Kurtosis, …), and a bias-corrected intensity feature table (roi, o, label, c, z_model, t_model, Mean, Kurtosis, …). Categorical “resource coordinates” can be reduced to (roi, o, label) to arrive at a minimal ID during EDA, and constant coordinates such as (c, z_model, t_model) can be dropped to avoid redundant clutter in the feature name. Pivoting a long table into a wide one folds the remaining coordinates into the column names (e.g., c: ['DAPI.1', 'PCNA.0'], Mean: [1.0, 2.0] → c=DAPI.1_Mean: [1.0], c=PCNA.0_Mean: [2.0])—ideally defining consistent naming conventions (e.g., <channel>_Mean).
An object hierarchy table for encoding parent-child relationships between spatial objects (Table 19). This supports hierarchical feature aggregation.
An outlier table, or boolean outlier columns in the label objects table, enabling definition of multiple outlier sets (e.g., segmentation, registration, fluorescence artifact in c=PCNA.0, …). These allow flexible filtering of label images and feature tables depending on the downstream analysis task (see Section 4.3.4). A similar structure could be used to define train-test splits or folds for cross-validation.
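A toy version of such a table and its contextual use could look as follows; the column names are illustrative, not a proposed schema.

```python
import pandas as pd

objects = pd.DataFrame({
    "roi": ["B03_e1"] * 4,
    "label": [1, 2, 3, 4],
    "is_debris": [False, False, True, False],
    "is_misaligned": [False, True, False, False],
    "is_border": [False, False, False, True],
})

# Staging an embryo by nuclei count: drop only debris, keep border and misaligned objects.
staging = objects[~objects.is_debris]

# Quantitative intensity analysis: drop every flagged object.
outlier_cols = ["is_debris", "is_misaligned", "is_border"]
clean = objects[~objects[outlier_cols].any(axis=1)]
```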
I have introduced the concept of single-object visualizations using the image data in Section 3.2.8 and pointed out the utility of such tools—especially in the development phase of large-scale image processing pipelines and during exploratory data analysis (e.g., Figure 41; see Section 3.3.3.2). Such streamlined access to the image data, enabled by a solid logical data model with single-object access (see Section 3.2.4) and paired with the broadly adopted community standards mentioned at the beginning of this section, will facilitate the development of interactive EDA tools that will in turn accelerate the development of large image analysis pipelines. An example of such a tool is presented in Figure 57.
All data analysis begins with exploration. During exploratory data analysis (EDA), interactive visualizations are invaluable for accelerating discovery. While such visualizations can be used to communicate results effectively, they should be employed mindfully.
Throughout this thesis, I have used interactive or animated visualizations extensively. In some cases, they served as effective communication tools (e.g., Figures 17, 27 and 38). In others, they were convenient enhancements (e.g., Figure 26; the interactive version of Figure 25 b). Occasionally, they reflect limited time spent refining EDA outputs into more polished visualizations with a clearer message (e.g., Figure 52).
Adopting modern document standards like HTML would streamline the transition from EDA to publication-ready content. Furthermore, HTML is arguably superior to PDF, as it enables figures to be referenced across multiple document sections and viewed on hover, allowing readers to access visual content without disrupting their reading flow. However, I recognize the risk of authors including overly complex “EDA-style” visualizations directly in manuscripts. Additionally, these tools are still under active development, and the prospect of “debugging a manuscript” may not be very appealing to many scientists.
An advantage of using a programmatic typesetting tool like Quarto is automated report generation (Allaire et al. 2022). In the long term, we could initiate microscope acquisitions that trigger downstream processing until human intervention is required. The next day, a quality control report could be consulted to guide experimental decisions (e.g., stop a multiplexing experiment early due to technical issues or invest time in improving object segmentation). Setting up such a report could be as simple as writing a Jupyter Notebook—a format that is already widely adopted—containing diagnostic plots and Markdown documentation. Once such a report is set up, it can be reused in all subsequent experiments and easily shared or published online.
Interactive visualizations in manuscripts should not serve as an excuse to offload exploratory data analysis to the reader. Effective interactive visualizations should instead leverage their capabilities to either display data that is challenging to present otherwise (e.g., 3D scatter plots), or provide additional information without distracting from the core message (e.g., details that might otherwise appear in a supplementary figure).
As a guiding principle, interactivity should be opt-in, with the default view functioning as a static plot (e.g., Figure 38), or be extremely simple, such as toggling between panels (e.g., Figure 17). In case of doubt a static visualization is most likely the better choice.
Scott Berry once asked in the lab, “What exactly is a maximum projected image from which we extract cellular features?”. I do not recall who was around that day, but we did some brainstorming without coming to a satisfying conclusion. Maximum projections are weird. They are widely used with tissue culture data, and they do not seem to interfere too much—it is still possible to extract useful measurements from maximum projected images. But might this approach cause us to overlook potential discoveries? I was a bit disturbed, but did not give it much more thought, since I used projections only as visualization endpoints with my volumetric data. Still, the question stuck with me, so here’s a take:
Measurements in maximum projected images are ill-defined. A maximum projection picks the maximum intensity along the axial direction. Tracing those values back to their origin in the image volume yields a strange, scattered point cloud. So it seems like they should only be used for visualization, or for processing not directly involved with physical measurement, for instance as a pre-processing step for segmentation.
The other options that are usually proposed are sum or mean projection. However, both approaches are problematic as well, as they aggregate not only the object pixels but also those above and below the object. With a maximum projection, the selected pixels usually originate from within the object. For flat, non-overlapping objects, as they are encountered in many tissue culture models, this might actually be quite reasonable, since the volumetric point cloud approximates a plane in that case. The “appropriate” projection depends on how the image was acquired, including sources of background signal and the amount of empty volume acquired around the cells.
In the absence of relevant background sources, a sum projection seems like the most logical choice to me, though of course only after estimating and subtracting additive sources of background signal, especially the dark-field including the camera gain! Mean projections are, in my opinion, hard to justify. We know that the mean is sensitive to outliers, and zero-valued background pixels around the object are strong outliers. But why not try a rank-based statistic other than the maximum—like the median or a chosen quantile—depending on how much dead volume surrounds the objects during acquisition?
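To make the comparison concrete, here is a small sketch of the candidate projections, assuming for simplicity that the additive background is a single dark-field constant:

```python
import numpy as np

def axial_projections(stack: np.ndarray, dark: float = 100.0, q: float = 0.9) -> dict:
    """Compare projections along z of a (z, y, x) stack after background subtraction."""
    signal = stack.astype("float32") - dark
    return {
        "max": signal.max(axis=0),                   # picked voxels have an ill-defined origin
        "sum": signal.sum(axis=0),                   # defensible once additive background is removed
        "mean": signal.mean(axis=0),                 # diluted by empty voxels above/below the object
        "quantile": np.quantile(signal, q, axis=0),  # rank-based alternative to the maximum
    }
```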
The best thing to do is to retain the image stack, and do volumetric object segmentation and measurements. This is often avoided in high-throughput screenings, typically justified by prohibitive data size and added complexity in the analysis. We often get away with it, but should at the very least consider how any lossy transformation affects downstream feature extraction and analysis.
I would like to see some solid justification and testing in different imaging conditions, and a conscious decision on the part of the imaging community, if we insist on analyzing projected data. If the primary goal is to reduce data size rather than avoiding the added complexity of volumetric image analysis, an alternative might be to explore lossy compression methods69 that preserve the full image volume while reducing storage demands.
Ironically, projections like these (Figure 41) could be used for justifiable intensity measurements. Since the volumetric segmentation was used to mask out-of-object pixels, sum projections of this kind could be used to recover some intensity features in the projected data. The sum intensity can be calculated simply by summing the pixel values from the sum projection in the projected label region. The mean intensity can then be computed by dividing by the nuclear volume, measured in the 3D segmentation that was used for masking. In practice however, this offers little benefit since the volumetric segmentation was needed to generate the masked projections in the first place. At that point we might as well do the whole analysis in 3D.
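A sketch of this recovery, assuming the sum projection was generated from a masked volume and per-object volumes come from the 3D segmentation used for masking:

```python
import numpy as np

def intensity_from_masked_sum_projection(sum_proj: np.ndarray,
                                         labels_2d: np.ndarray,
                                         voxel_counts: dict) -> dict:
    """Recover sum and mean intensity per object from a masked sum projection.

    `sum_proj` is the axial sum of a volume whose out-of-object voxels were set
    to zero; `voxel_counts` maps label -> nuclear volume in voxels.
    """
    features = {}
    for lbl, volume in voxel_counts.items():
        total = float(sum_proj[labels_2d == lbl].sum())
        features[lbl] = {"sum": total, "mean": total / volume}
    return features
```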
Any projection of an image volume is lossy! Saving the resulting projection in a lossless image format will not magically undo this—the damage has already occurred. We can either bite the bullet and work with volumetric images or we have to provide a good explanation why a certain way of projecting is valid.
I had my personal “maximum projection moment” when I noticed the following. We usually log-transform mean intensity features. This seems reasonable, as the distributions are right-skewed (e.g., Figure 37 b). Furthermore, log scales are commonly used with concentrations. For example, pH as a measure of acidity or alkalinity is defined as the -log10 of the concentration70 of H+ ions.
The problem arises here: why would it be reasonable to aggregate pixel intensity values in a label region on a linear scale and then log-transform the aggregated values, rather than the opposite? If log-transforming intensities is appropriate, why not transform the pixel intensities directly? The same argument applies before performing any intensity aggregation. The problem seems identical to the one encountered with projections and feature measurements in non-bias-corrected images: premature lossy aggregation. In contrast to the bias-correction models, the solution is even simpler, requiring only the application of a scalar function to each pixel before feature extraction. Still, one quickly realizes the need to step down from the high horse of “principled volumetric image analysis” in the absence of image value transforms (see Section 4.3.3).
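A quick numerical illustration of why the order of operations matters, using synthetic right-skewed intensities:

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.lognormal(mean=5.0, sigma=0.8, size=10_000)  # right-skewed pixel intensities

log_of_mean = np.log(pixels.mean())   # aggregate on a linear scale, then log-transform
mean_of_log = np.log(pixels).mean()   # log-transform each pixel, then aggregate

# Jensen's inequality guarantees log_of_mean >= mean_of_log, so the two
# conventions differ systematically, not just by noise.
print(log_of_mean, mean_of_log)
```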
I have mentioned the efforts that went into improving intensity measurements in volumetric microscopy—both experimentally and computationally (see Section 4.1.4). Another compelling result is how we used pseudotime inference to infer the cell cycle dynamics of protein targets (see Section 3.3.3) and used measurements from existing literature to calibrate the time axis (Figure 43).
Looking ahead, one compelling direction is using spatially resolved, semi-quantitative intensity data to inform biophysical geometric models derived from the same images. This would involve single-cell volumetric segmentation to generate mesh representations of cells and tissues (e.g., Figure 25 c). Rather than serving purely as visualization aids—as in this work—these meshes could form the basis for finite element or finite volume models.71 Such models could be parametrized using intensity-corrected volumetric data.
These geometric models could be constructed at varying levels of detail. For example, cells might be represented as watertight triangular meshes or as two concentric shells—one for the cell membrane and another for the nuclear envelope (Runser, Vetter, and Iber 2024). More complex models could include subcellular compartments, subdivide cells into finite volumes to capture heterogeneity, or incorporate cytoskeletal structures.
Crucially, with pixel-level corrected volumetric data, it becomes possible to measure concentrations of force-generating molecules (i.e., actomyosin and other cytoskeletal components) at mesh vertices—enabling local model parameterization. While some studies already estimate forces from image data in tissue culture (e.g., estimating forces measured by traction force microscopy from focal adhesion protein images (Schmitt et al. 2023)), most current volumetric tissue models use microscopy solely to inform mesh geometry—not to quantify local protein concentrations. This presents a major opportunity for interdisciplinary collaboration.
Live imaging could provide high-temporal-resolution geometries, while multiplexed immunofluorescence of fixed samples offers subcellular-resolution maps of endogenous protein levels. In highly stereotyped systems (e.g., ascidians or nematodes), live imaging may even be omitted in favor of pseudodynamic inference. At a minimum, such systems facilitate the integration of fixed and live data by enabling one-to-one mapping between individual cells across samples.
This approach—pairing quantitative microscopy of molecular targets with geometric biophysical models—could help shift developmental biology beyond descriptive and associative studies toward a more quantitative and mechanistic understanding of morphogenesis.
As academic scientists, our productivity is measured largely by our contributions to peer-reviewed academic journals. Having our work published—preferably in a journal with a high impact factor—is a requirement for success and strongly influences our ability to get hired and receive funding.
Having spent a lot of time on a single project, I have to wonder to what extent the resulting work—even with a publication that will still come from it—was a smart use of my time in enabling a scientific career. If I wanted to increase my chances, I probably should have spent far less time thinking about details and instead tried to publish as soon as something seemed good enough and justifiable within the vague norms of “current best practice.” I most certainly should not have spent any time questioning things that would be unlikely to be addressed by a reviewer.
The publication pressure and competitive nature of the publication ecosystem—especially surrounding those high-impact journals—might be seen by some as a source of motivation and a driver of progress. In my opinion, they can also have a detrimental effect on publication quality in biology, as the projects required to land one of those scarce spots incentivize overstating results, bloating projects, and chasing hypes over diligent, concise, and modest scientific work.
Interestingly, publishing—as in making writing available to other scientists—is now possible without involving journals at all. Anyone can upload a manuscript free of charge to a preprint server like arXiv or bioRxiv, or host it on a personal web page. While there is no formal peer review for those articles, they are more democratic because they do not hide the results of publicly funded research behind a paywall that excludes the public and entities that cannot afford subscription fees.
The rise of preprint publication servers is an important step toward more open and inclusive science, and a tool to empower us as researchers to take ownership of our work—independent of the game of submission and resubmission to journals with decreasing impact factors and often lengthy review processes.
I do not want to downplay the benefit of a well-organized peer review process in ensuring the quality of scientific work. I simply want to highlight the shift in the main function—and therefore the value proposition—of journals today. In the absence of a need for physical publishing, their role lies solely in organizing peer review and providing a marketing platform for scientific work. We should, therefore, question whether journals add an appropriate amount of value to this process, outside of being an established norm. Furthermore, we should at least consider whether there are alternative publishing models that align better with our goals as a community.
In my view, the publishing model of the EMBO Press journals takes a few steps in the right direction, with all their journals being open access and offering an option72 to publish reviewer comments alongside the manuscript. If we value peer review, why would we hide it and leave readers to speculate which experiments or analyses were added in response to reviewer feedback?
In addition, EMBO Press journals embrace Review Commons, a peer review model in which reviews are transferable between participating journals upon resubmission. This can help reduce redundant review work for practicing scientists and shorten the overall time to publication.
Scientific journals can voluntarily participate in Review Commons. Here is a random selection of journals: Cell, eLife, PLOS ONE, Nature, Science, EMBO Reports, Open Biology. Try to guess which ones are conservative in their publishing models and which ones actively innovate—by committing to open access without additional charges, participating in platforms like Review Commons, or integrating seamlessly with preprint servers.
Preliminary experiments and antibody testing used wild-type golden strains. Staining bias experiments (Figure 11) were done with a transgenic H2A-mCherry line provided by Max Brambach from the laboratory of Prof. Darren Gilmour at the University of Zurich. In the multiplexing experiment, a transgenic Pou5-FLAG line was used, allowing antibody staining of Pou5 with an anti-FLAG antibody. The fish line was created by Edlyn Wu from the laboratory of Prof. Nadine Vastenhouw at the University of Lausanne.
All zebrafish strains were maintained in the fish facility of the laboratories of Prof. Darren Gilmour and Prof. Francesca Peri. All animal work was carried out in accordance with the FELASA guidelines and standards of the University of Zurich, as instructed by their Fish Facility Manager Cornelia Henkel.
The recirculating system maintained the water at 28 °C with a pH of 7.5, under a 14-hour light and 10-hour dark cycle. The fish were fed twice daily. All experiments involved only early zebrafish embryos (up to 5 hours post-fertilization), while adult fish were used solely for husbandry.
Egg collection began by setting up the desired number of mating pairs the evening before spawning. Tanks were selected based on the fish line and background, considering their previous mating history. To minimize stress, a one-week rest period was maintained between setups. Male and female fish were placed in sloping breeding tanks, in groups of 4 males and 4 females, separated by a divider. For higher yield, up to 12 tanks were prepared simultaneously.
The following morning, fresh water was added to each tank, and the dividers were removed. Tanks were checked for mating and egg deposition at 15-minute intervals. Eggs laid within that time window were collected and transferred to plastic Petri dishes, marked with the time, and held at 28 °C in E3 medium until they reached the desired stage (2–4.5 hours post-fertilization).
Embryos were pooled in 30-minute intervals between 2 and 4.5 hours post-fertilization. 1 mL of embryos was transferred to a 5 mL Eppendorf tube, aspirated to 2.5 mL, and filled to 5 mL with fresh 8% paraformaldehyde (PFA; pre-cooled on ice; Lucerna Chem). Tubes were kept on ice during sample collection in the fish facility and agitated at irregular intervals during that time (the first ~2.5 hours of fixation). They were then transferred to a rotary shaker at 4 °C for overnight fixation (up to 18 hours). Fixation was halted by thoroughly washing the embryos five times with 1X PBS, followed by quenching with 100 mM ammonium chloride (Sigma-Aldrich) for 30 minutes. Finally, the embryos were stored at 4 °C in PBS containing 0.05% sodium azide (Sigma-Aldrich) until further use.
PFA-fixed embryos were screened for fertilized eggs and manually dechorionated in a glass Petri dish using watchmaker forceps under a stereo microscope. Permeabilization was achieved by gradually increasing the methanol concentration in 25% increments up to 100% (Sigma-Aldrich), with 5-minute intervals between each step. The embryos were then incubated in 100% methanol at -20 °C for at least 2 hours. For long-term storage, -20 °C was preferred over 4 °C, with successful storage tested for up to 3 months.
Rehydration was performed by gradually increasing the PBS-T (PBS + 0.1% Tween 20) concentration in 25% increments, with 5-minute intervals between additions. Following rehydration, the embryos were deyolked by gentle pipetting with a glass Pasteur pipette. Dissociated yolk particles were removed through repeated volume replacements—at least six times or until the solution appeared clear. All permeabilization steps were conducted in glass tubes using glass Pasteur pipettes to prevent embryo adhesion to plastic surfaces.
The wells of a polystyrene, flat-bottom 96-well plate (Greiner) were coated to facilitate zebrafish embryo adherence. The coating solution consisted of a 1:10 dilution of human fibronectin (Sigma-Aldrich) in PBS mixed with a 1:2 dilution of Poly-D-Lysine (Thermo) in PBS. Each well received 50 μl of the solution and was incubated for two hours at room temperature. After incubation, the liquid was aspirated, and the plate was left to dry for an additional two hours. Plate preparation was done on the day of mounting.
Samples were rehydrated from long-term storage (see Section 5.2.2) and yolk was removed. Wells were filled with PBS, then the embryos were carefully transferred into the wells using a glass Pasteur pipette with a Pipet Pump II (Sigma-Aldrich) to enable gentle handling. Allowing the samples to gently settle within the liquid column of the filled well ensured their orientation with the animal cap facing downward. Samples from different time points were pooled and mixed within wells to ensure potential sources of technical bias would not align with our developmental time. After mounting, the plate was incubated for two hours before centrifugation at 200 RCF for five minutes on a plate centrifuge.
To enhance adherence, the plate was placed on a UV transilluminator (Lab Gene Instruments) for one hour at 365 nm with high intensity for crosslinking. It was then stored at 4 °C overnight before proceeding with the next steps. For additional crosslinking, an imaging pass was performed on the Cell Voyager 8000, acquiring a 100 μm stack with a z-spacing of 4 μm using the 488 nm laser at 100% laser power and 100 ms exposure. It is not exactly clear to what extent each of the two steps contributes to the crosslinking; the imaging pass is likely the more relevant one.
The protocol for performing 4i on 3D samples was adapted from the original method (Gut, Herrmann, and Pelkmans 2018) and standard IF protocols for zebrafish (Schulte-Merker n.d.) and volumetric imaging (Chung and Deisseroth 2013). Embryos were prepared and mounted in a 96-well plate as described in Section 5.2. Before starting multiplexing cycles, a pre-elution step was performed to reduce the effect of sample shrinkage in the elution buffer (Figure 17 (a)), which was most prominent after the first elution.
Each cycle consisted of the following steps, all conducted at room temperature:
Extensive washing steps were performed between each stage to prevent buffer interactions. All pipetting steps were executed using automated liquid handling protocols, as described previously. A residual volume of 70 μl was maintained in each well at all times to avoid sample disturbance.
The 4i Elution Buffer (0.5 M L-Glycine, 1.2 M Urea, 3 M Guanidine hydrochloride) was prepared in advance, with Tris(2-carboxyethyl)phosphine hydrochloride (TCEP; 0.07 M) added just before use. The pH was adjusted to 3 using 32% HCl. The buffer was introduced via multiple volume replacements and incubated for 1 hour. Before proceeding with staining cycles, the wells were thoroughly washed 12 times (total duration: 30 min). After washing with PBS, the pH was confirmed to be approximately 7.
The 4i blocking buffer (PBS with 2% BSA and 0.2 M Ammonium chloride) was prepared at 2x concentration. Before use, Maleimide (0.3 M) was added. The blocking buffer was mixed with the residual PBS in the wells and incubated for at least 1 hour, followed by extensive washing.
Primary antibodies were prepared at 2x concentration in 2x 4i blocking buffer (without Maleimide) and mixed with an equal volume of residual PBS. The antibodies were incubated overnight at room temperature on a gentle see-saw rocking device. Dilution factors for each antibody are listed in Table 23.
Secondary antibodies were diluted 1:200 in 2x 4i blocking buffer (without Maleimide), supplemented with DAPI (1:40). The 2x antibody mixture was combined with an equal volume of the residual PBS in each well and incubated for 4 hours under gentle agitation in the dark.
Before clearing and imaging, N-acetyl-cysteine (NAC; 0.7 M) was added to the PROTOS buffer used for refractive index matching prior to imaging (“CLARITY Protocol and Tissue Clearing Guide Abcam” n.d.). Excess secondary antibodies and dyes were removed by washing, and the PBS was gradually replaced with PROTOS imaging buffer to clear the zebrafish embryos. DAPI was added to the imaging buffer in the same concentration used for staining. This was done because the affinity of DAPI seemed to be lower in the imaging buffer, leading to the fluorescent dye being washed out during long imaging. Embryos were kept in imaging buffer for at least 1h prior to imaging to ensure equilibration of the nuclear dye and homogenization of the refractive index.
The prepared samples were imaged as described in Section 5.4. Imaging was usually done overnight.
Imaging buffer was washed out thoroughly using PBS after each imaging round. Interruptions in the experiment were always scheduled after imaging buffer washout to prevent evaporation and precipitation of the buffer components.
Imaging was performed using a Yokogawa CV8000 prototype spinning disk confocal microscope equipped with a CSU-W1 spinning disk unit with a 50 μm pinhole disk and a Hamamatsu ORCA Flash 4.0 camera. A 20x/1.0 NA water immersion objective was utilized to capture high-resolution images, with each field of view (FOV) containing a single sample. Object detection was conducted using a custom search-first approach, as described in Section 3.1.2. A z-stack of 271 μm was acquired with 1 μm spacing between optical sections. Laser power and exposure time were adjusted individually for each stain to optimize signal quality. Imaging was conducted overnight.
Flat-fields were estimated prospectively by imaging fluorescent dyes diluted in a 96-well plate identical to the one used for the experiments. Multiple image stacks were acquired for each wavelength, following an identical acquisition protocol as described in Section 5.4. Flat-fields were then estimated for each channel using the BaSiC Fiji plug-in (Peng et al. 2017). Dark-fields were recorded by averaging 100 images with the laser turned off. Finally, dark-field and flat-field correction were applied during the conversion from raw Yokogawa images to abbott-h5. Corrected images were clipped at 0 to prevent integer overflow before casting to uint16.
Bias models were estimated on the mean intensity features extracted after nuclei segmentation. Both t-decay and z-decay models were estimated for each stain individually. The code for model estimation and application to image volumes can be found in the intensity normalization module of the library: models.py contains the model classes, while t_decay.py and z_decay.py are the scripts used for fitting the models. Bias correction was applied to the imaged volume prior to feature extraction.
Background signal accumulated by improper elution of antibodies and pinhole cross-talk for nuclear stains was estimated by measuring mean signal intensity in the cytoplasm. Mean cytoplasmic signal was subtracted from nuclear signal to arrive at the background subtracted values. Sum nuclear intensity was then computed by multiplying the background subtracted mean nuclear signal by nuclear volume.
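In code form, the computation amounts to the following (variable names are illustrative):

```python
def background_subtracted_nuclear_features(mean_nuclear: float,
                                           mean_cytoplasmic: float,
                                           nuclear_volume: float) -> dict:
    """Cytoplasm-based background subtraction for nuclear stains.

    The mean cytoplasmic signal approximates elution background and pinhole
    cross-talk; the corrected sum is the corrected mean scaled by nuclear volume.
    """
    corrected_mean = mean_nuclear - mean_cytoplasmic
    return {"mean": corrected_mean, "sum": corrected_mean * nuclear_volume}
```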
The complete model—excluding subsampling and outlier removal—was implemented following the scikit-learn Estimator/Transformer API, which serves as the de facto standard for prediction-oriented machine learning with tabular data in Python (Buitinck et al. 2013). The advantage of following such community standards is that it allows integration with a rich set of tools built around this API, such as Pipelines that combine processing components in a modular fashion, or meta-estimators that run the fairly complicated training schedules required for hyperparameter tuning with minimal effort on the side of the data scientist experimenting with workflows.
To achieve easy interoperability, components were implemented as classes that implement a .fit(X, y=None)73 method to train a model on a feature matrix X, followed by either .transform(X) for Transformers or .predict(X) for Estimators, the difference being that estimators can have an additional .score(X, y) method to evaluate their performance after training.
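For illustration, a minimal custom component following this convention (a hypothetical stateless transformer, not one of the components used in this work):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Apply log(1 + x) to every feature; stateless, so fit() only returns self."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)
```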
Within the core library there are some useful components for steps like feature pre-processing (e.g., scaling and transformations, or one-hot-encoding of categorical features), missing value imputation, dimension reduction and many supervised estimators for both regression and classification tasks. In addition to the core library there are a number of extension libraries including component collections (Brouns and Warmerdam n.d.; Galli n.d.), embedding algorithms (McInnes, Healy, and Melville 2018; Setty et al. 2019) or methods for automated machine learning (Le, Fu, and Moore 2020; Kanter and Veeramachaneni 2015). Furthermore, there are attempts at extending the API for sampling of imbalanced classes (Lemaître, Nogueira, and Aridas 2017) or more niche applications like topological data analysis (Saul and Tralie 2019).
The steps involved in generating the CCP embedding, and the most important search axes during hyperparameter tuning, were as follows ([] indicating optional steps):
StandardScaler: rescaling to 0 mean and 1 standard deviation.
RobustScaler: rescaling to 0 median and unit quantile range.
PowerTransformer: monotonic transformation toward a Gaussian with 0 mean and 1 standard deviation, using the 'yeo-johnson' method.
QuantileTransformer: transformation to a uniform distribution.
We wanted to benefit maximally from the library and therefore strived to include as much of the hyperparameter space we wanted to explore as possible within a single Pipeline. To achieve this, we wrote additional custom estimators and transformers to build a flexible pipeline in a repository called scikit-ccp (Table 24). This allowed us to integrate many of the processing steps into a single pipeline, which is preferable because it (1) allows hyperparameter tuning of not only the final estimator but also each of the pre-processing steps, including exchanging components like the ones used for feature scaling, and (2) makes it much easier to adhere to “machine learning best practices” like the separation of train and test data across all stateful transformers.
This last point is often overlooked when doing pre-processing in a data frame library, which makes it much more likely to miss a stateful preprocessing step, even one as innocuous as a StandardScaler (i.e., standardization or z-scoring of a feature). Z-scoring a feature requires computing the sample mean and sample standard deviation, parameters that should only be estimated from the training data, not the test data. This subtle error of giving a model access to data it should not have access to is called data leakage. It can lead to an overestimation of how the model will perform on unseen data.
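A compact sketch of both points, using synthetic data and a stand-in estimator in place of the actual embedding: the scaler is tuned as a hyperparameter of the pipeline, and cross-validation fits it on the training folds only, so its mean and standard deviation never see held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler, StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # placeholder, swapped during the search
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in for the actual embedding/estimator
])

search = GridSearchCV(
    pipe,
    param_grid={"scale": [StandardScaler(), RobustScaler(),
                          PowerTransformer(method="yeo-johnson")]},
    cv=5,
)
search.fit(X_train, y_train)                     # scalers are fit on training folds only
print(search.best_params_, search.score(X_test, y_test))
```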
Having a model at hand that is able to represent the problem we are trying to solve, we can set its hyperparameters and fit it to data. When doing exploratory analysis in that direction, we might train a handful of models and compare their results by manually inspecting the resulting embeddings. This is a perfectly valid approach in the initial exploration of a broad range of approaches and algorithms we might consider, and it gives us more flexibility since we do not have to invest time in building a shared interface for the models.
To effectively utilize the substantial computational resources available today for enhancing model performance, manual evaluation of each resulting model is not a feasible approach. What is needed is an automated way of assessing model performance. We addressed this by annotating 634 cells across each time point into six cell cycle classes, ranging from anaphase and telophase of mitosis, through early S and S phase, and back to prophase and metaphase of mitosis (Figure 58).
To use these annotations in the form of ordinal classes for evaluating the output of a continuous CCP embedding, we assessed the rank ordering of the annotated cells within the embedding. We then partitioned them into six classes and counted the number of pairwise permutations between adjacent classes required to achieve the correct order.
A complication came from the fact that the embedding is circular, meaning that the numerical values of 0 and 2π correspond to the same point in the embedding. This also means the metaphase and anaphase classes are adjacent to each other. So instead of the 5 cuts required to partition a one-dimensional feature into 6 regions, we need 6 cuts for a circular feature.
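A small illustration of this circular bookkeeping on synthetic values (this is not the actual scoring code):

```python
import numpy as np

rng = np.random.default_rng(0)
ccp = rng.uniform(0.0, 2.0 * np.pi, size=634)           # circular CCP values of annotated cells
cuts = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=6))   # 6 cuts partition the circle into 6 arcs

# searchsorted assigns each value to the arc between consecutive cuts; the modulo
# merges "before the first cut" and "after the last cut" into one wrap-around arc,
# which is exactly why a circle needs 6 cuts where a line would need only 5.
arc = np.searchsorted(cuts, ccp, side="right") % 6
```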
The annotated cells were used only for model scoring and did not go into the training of the embedding. The models were scored on each internal stratum (division cycles 7-9, 10, 11, 12) separately, and the worst score was selected to find a model that works reasonably well across all time points. We ran four separate hyperparameter searches while making minor adjustments to the model. Each run consisted of defining a parameter grid over which we computed the embedding and evaluated the resulting model based on the scoring procedure.
There is a connection between this hyperparameter search and the training of a single deep learning model, which, during a training run with multiple passes (epochs) over the data, optimizes its internal model parameters according to a loss function. In this analogy, finding the optimal parameters of the deep learning model is equivalent to finding the optimal hyperparameters of the classical machine learning pipeline. The important distinction is that whereas we use a hyperparameter grid (i.e., evaluating model performance at fixed locations specified in the grid), deep learning models are built so that the gradient of the loss function with respect to the model parameters can be approximated. This means that the optimal parameters can be found by optimization methods like stochastic gradient descent (SGD), which move efficiently toward the best parameter settings. This efficient way of finding optimal parameters is what enables the huge number of parameters those models can have while still fitting them within reasonable time. The important constraint for using gradient-based optimization is that the loss function needs to be differentiable75 to allow the computation of gradients. This is not the case for a general machine learning pipeline, where some of the hyperparameters are not even continuous (e.g., the type of feature scaler takes one of three categorical values), so there is no way of computing a gradient with respect to the full hyperparameter space, closing the door on optimization-based methods. On the plus side, we gain a lot of flexibility in what we consider a hyperparameter, including the inclusion of arbitrary pre-processing steps.
My initial, maybe naive, hope of finding an obvious small region of the parameter space containing the best models turned out to be wrong. While I am confident that the scoring function did its job in the sense of tracking model performance to some extent, allowing us to find reasonable CCP models, it relies on only a very small set of annotated data (634 of 300’000+ cells), and the score varied slightly upon repeated application to the same model, which was only noticed after running the hyperparameter search. The problem was caused by tiny inconsistencies in the UMAP embedding between subsequent runs, visible when embedding the annotated cells five times using the same trained model. A small subset of them (12/634) showed slight inconsistencies in their embedding coordinates, potentially caused by numerical imprecision or a random number generator in the UMAP algorithm (Figure 59). These were enough to cause slight variation in the model score. We can assume the same happens for all cells, but given the small amount of variation we do not expect a relevant effect on the cell cycle embedding, since the differences in CCP it causes are well below the resolution limit of the method.
While the small scoring inconsistencies were annoying, they forced us not to be too confident in the evaluation metric, which is advisable in general and particularly in a semi-supervised setting where the target is only a rough approximation in the form of categorical annotations. We were still able to see some patterns with regard to the best-performing models, but nothing quite as clear-cut as I would have liked for a simple statement of having found the single best set of hyperparameters for the approach, which would have facilitated the reporting of the results. The general trends that informed the choice of the final model were:
The choice of feature scaler mattered (PowerTransformer > RobustScaler, StandardScaler).
StratifiedEmbedderAligned only serves to align the embeddings very roughly; this is mainly useful for visualization purposes and certainly does not turn them into anything close to a consistent shared embedding.
The scoring performance, especially given the remaining uncertainty mentioned before, was not the only consideration that went into the choice of the final model used in the rest of the work. Furthermore, since the subsampling seemed to be such an important factor, we used an initial CCP model to inform our subsampling (attempting to sample uniformly based on the initial CCP embedding and then retraining a model on that new sample), which led to qualitatively77 better results.
Regarding the pre-embedding algorithms, we went with UMAP for two practical reasons. First, the Palantir algorithm requires access to the full dataset when computing the embedding. Since it is not possible to compute the embedding and then add new cells later on, it was impossible to use it with subsampling methods. This is quite unfortunate, as its embedding seemed to perform slightly better in the initial qualitative exploration, potentially because the algorithm was developed specifically with single-cell trajectory inference in mind. I think this is an important point to consider when developing (embedding) algorithms that are to be used flexibly in other work, and it was not on my radar as a valuable algorithmic property before encountering the issue first-hand. While there might be workarounds to adapt the method for this more flexible use, the people who could have developed them most easily are the original authors. For everyone else, adapting an existing algorithm in that way is not very appealing, as it will probably not be viewed as a substantial improvement by many reviewers and is therefore barely justifiable as an academic project. The second reason was that, at the time, there were some dependency conflicts; they might be resolved by now, but they made the integration harder.
ChatGPT-4o, Gemini 2.5 Flash and DeepSeek-V3 assisted in refining phrasings throughout the thesis; however, I take full responsibility for all the ideas expressed.78 ChatGPT-4o was also used to summarize parts of the materials and methods, based on the very thorough protocol descriptions by Marvin Wyss in his lab journal and thesis, which formed the basis for Sections 5.1, 5.2 and 5.3.
GitHub Copilot helped with auto-completing code in the abbott library. For anything more than a single function or boilerplate it was not very helpful. There is still value in learning how to code, kids—do not let NVIDIA executives tell you otherwise. I am quite confident that people talking about AGI79 in light of today’s transformer-based LLMs are delusional—or, more likely, trying to pump up shareholder value.
Yes, I do believe our brains are “only” processing information. Yes, I do acknowledge that information processing is substrate-independent. Therefore, a mind can run on our current silicon hardware. But simulating a biological brain on digital hardware is incredibly inefficient. My understanding of the current state of the art in this endeavor is that we manage to simulate parts of a rodent hippocampus—on an HPC cluster, not in real time (Romani et al. 2024). Maybe not quite a blue brain yet—more like a blue partial-brain.80 While there might be an alternative route to AGI on silicon hardware that is orders of magnitude more efficient than simulating biological brains—a requirement for its economic viability to replace human thought—I am willing to bet that it involves more than feeding the whole of the internet into a transformer model and fine-tuning it with human feedback. Then again, things move quickly in that space, and I might look like a fool five years down the line.
I would like to end by extending my sincere gratitude to all the people who supported me throughout my journey toward a PhD.
First, I would like to thank Marvin Wyss, my Master’s student, who joined the project around 2.5 years in. At that time, I was starting to lose patience with the repetitive nature of protocol development, and there were no other people in the lab working with the zebrafish model. I am very grateful to have had you as a student and collaborator, and I was very impressed by how quickly you took charge of your part of the project. Your diligence in planning, conducting, and reporting on experiments is something I aspire to. This project would not have progressed to where it is without your help!
I would also like to thank Joel Lüthi for the collaboration on the abbott library, related projects, and the many coding and analysis deep dives we’ve had over the years. I really appreciate the community work you’re doing now, and how you helped me out with personal invites to events—especially at times when I wasn’t exactly on top of my email inbox.
A huge thanks also to all the past and current members of the Pelkmans lab. Lucas Pelkmans for hiring me, always supporting my project and being open for discussions in both lab and personal meetings. And for giving me a tremendous amount of freedom to explore rabbit holes along the way. Ola Sabet for advocating for me before I had joined the lab, teaching me tissue culture early on,81 nerding out about microscopy, and most importantly, being a role model in collaborative work and how to respond to scientific criticism with both resolution and humility. Jacobo Sarabia Del Castillo for teaching me the original 4i protocol. Arpan Rai for always helping out with wet-lab and biology-related questions. Scott Berry for raising important issues related to quantitative imaging (see Section 4.4) and nudging me toward ZGA as a research question. Gabriele Gut for providing a 2D blueprint for large parts of my project (Gut, Herrmann, and Pelkmans 2018; Gut et al. 2015).82 Raffaela Gallo, Adrian Tschan and Bernhard Kramer for showing me the PhD ropes. Doris Steiger and Valentina Marcelli for keeping the lab running, and Noel Imboden for keeping the microscopes running.
Another huge thanks to Shayan Shamipour for all the feedback and in-depth discussions on the project, the collaboration on the manuscript (in preparation), and for contributing live data from early zebrafish development (data not shown). Ruth Hornbachner for battle-testing and contributing to the abbott library, and taking charge of porting parts of the code base to Fractal. Serguei Ovinnikov for engaging discussions on systems biology and data analysis in general, and for pointing out some of those aforementioned rabbit holes. And to all the other past and current members of the lab for inspiration, feedback, games, “coofi”, beers, and stimulating discussions during lab meetings and elsewhere.
Another big thanks to Nadine Vastenhouw for serving on my committee and for her feedback during our meetings. I am especially grateful for the invitation to Lausanne and for introducing me to her lab. A special thanks to Edlyn Wu, who engineered the Pou5-FLAG line used in the project.
I also want to thank Darren Gilmour for serving on my committee, always being open to chat, and, together with Francesca Peri, for generously sharing their zebrafish facility. Specifically, I would like to thank Max Brambach, Greta Ebnicher, and Cornelia Henkel, who taught me the ways of the zebrafish early in the project. I am also grateful to Robert Bill and Jérôme Julmi for their help with injections,83 and to everyone who answered my questions in the fish room or invited me for coffee and cake in the lab.
Finally, I want to thank all my friends and family for reminding me that there’s more to life than scientific research. For all the dinners, board game nights, parties, walks, holidays, cinema visits, discussions and laughs. And for being so patient with me when I was caught up in work.84 I especially want to thank my parents and sister for supporting me throughout my childhood and adult life. My mom, Claudia Lüthi, for encouraging me to consider others’ perspectives—especially in conflict. My dad, Alex Hess, for showing me the value of genuine curiosity and openness toward other people. And my sister, Livia Hess, for demonstrating how powerful a tool language can be—whether for arguing a point or just to have fun.
Cells are mostly flat, and analysis is usually done on 2D projected data.↩︎
It is often preferable not to waste time developing complicated analysis workflows accounting for mediocre image data if there is an easy way of solving the issue experimentally by adjusting the acquisition protocol.↩︎
Just for context, 10⁷ years is enough time to evolve from our shared common ancestor with chimpanzees and bonobos to modern humans. I don’t even want to think about how long it would take to train a CCP model with their hardware. Ok, I’ll stop now.↩︎
Stevens (1946) rightfully notes that we cannot interpolate between data points of measurements on a purely ordinal scale, which is sometimes done when computing quantile-based statistics like the median. This is only valid when the median is computed for a feature on a higher scale, one that has a meaningful notion of differences between measurements.↩︎
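A minimal sketch of the issue, assuming NumPy and hypothetical ordinal scores: with an even number of observations, the median is an interpolated value that has no defined meaning on the original ordinal scale.

```python
import numpy as np

# Hypothetical ordinal scores (e.g., ranked staging categories encoded as integers).
scores = np.array([1, 2, 3, 4])

# np.median interpolates between the two middle values, yielding 2.5 --
# a value that does not correspond to any category on the ordinal scale.
print(np.median(scores))  # 2.5
```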
The term “scale” is sometimes used to refer to the unit of a measurement. However, this should not be confused with the hierarchy of measurement scales discussed in this section.↩︎
Volumetric pixels.↩︎
Only partial API correspondence to NumPy.↩︎
Ok, the stakes are not quite as high. People do not get physically injured, but a bioimage analyst somewhere sheds a tear.↩︎
True or False↩︎
Which we certainly want.↩︎
This can be obscured by the fact that many high-level processing functions for label images (e.g., skimage.regionprops) perform the iteration over objects under the hood.↩︎
Or coordinate.↩︎
At least at the scales we are concerned with in biology. At the Planck scale the “real world” might be discrete.↩︎
A k-simplex is a k-dimensional generalization of a point (0-simplex). Higher-order simplices include the line (1-simplex), the triangle (2-simplex), and the tetrahedron (3-simplex).↩︎
A simplicial complex is a special kind of mesh that only uses simplices. In computer graphics, meshes that only use 2-simplices are referred to as triangular meshes.↩︎
Or Adobe Illustrator, relegated to a footnote due to its subscription model.↩︎
For an image with uint16 pixels, with coordinates represented as float32, this is 7x the amount of data (3 × 4 B for the x, y, and z coordinates plus 2 B for the value, i.e., 14 B instead of 2 B per pixel). This is assuming we are treating it as a point-grid, meaning each pixel has an x, y and z coordinate plus a pixel value (val). This representation has the same limitation as spatial-image (see Section 1.3.3.1.2).↩︎
Things break, a lot. This is not a big problem in a largely personal code base, but it is not advisable otherwise. Still, I much prefer that the authors invest their time in exploration and in refining a consistent API, rather than rushing a stable release and thereby committing to suboptimal choices.↩︎
“Online” might be a bit of a stretch: at this point we are doing analytical processing.↩︎
This, along with the straightforward installation process and interoperability with data frame libraries, is what I mean when I say it is “closing the gap” between data frame and database-style tabular processing.↩︎
Over VPN and a bad Wi-Fi connection, both with scan_parquet().collect() and with read_parquet() using the Rust-native reader.↩︎
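For context, a minimal sketch of the two access patterns mentioned above, assuming the polars API and a hypothetical file name:

```python
import polars as pl

path = "features.parquet"  # hypothetical file

# Lazy: build a query plan and only materialize the needed rows/columns at .collect().
subset = (
    pl.scan_parquet(path)
    .filter(pl.col("object") == "nucleus")
    .select(["label", "mean_intensity"])
    .collect()
)

# Eager: read the entire file into memory in a single call.
full = pl.read_parquet(path)
```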
target↩︎
features↩︎
The term inference is used in two different ways, reflecting the two approaches to machine learning. Within the data science world, it refers to using a trained model for prediction on new data. In the statistical literature, it refers to the process of relating the experimental data to the larger unobserved population (i.e., what we can infer about this population from our sample).↩︎
Another descriptive statistic on a distribution, relating to how “heavy-tailed” it is. A univariate normal distribution has an excess kurtosis of 0. Positive kurtosis corresponds to heavier tails (e.g., Student’s t), negative kurtosis to thinner tails (e.g., uniform).↩︎
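A minimal sketch, assuming SciPy (which reports Fisher, i.e., excess, kurtosis by default) and sampled toy distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

print(stats.kurtosis(rng.normal(size=100_000)))            # ~0: normal reference
print(stats.kurtosis(rng.standard_t(df=5, size=100_000)))  # >0: heavier tails
print(stats.kurtosis(rng.uniform(size=100_000)))           # <0 (about -1.2): thinner tails
```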
Part of a track. For example, of a nucleus between divisions.↩︎
XML format for Yokogawa acquisitions.↩︎
In HDF5 arrays are called datasets.↩︎
I prefer to view multichannel images as collections of images because of the independence of the channels discussed in Section 1.3.1.↩︎
Images, even when using 16-bit pixel encoding, rarely go above 5000 and are often in the range of around 1000. Still, 1000 grayscale values are enough for us to call the signal continuous for all intents and purposes.↩︎
Embarrassingly parallel workloads are ones that do not have any interdependencies and are therefore very easy to parallelize, because the developer does not have to consider race conditions.↩︎
Union of two sets with the elements labelled by their set of origin (i.e., a disjoint or tagged union). This avoids name collisions should we have a channel named the same as an object type (which would be a rather strange naming convention, but who knows?).↩︎
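A minimal sketch in Python, with hypothetical channel and object names, of how tagging each name by its set of origin avoids collisions:

```python
channels = {"dapi", "gfp", "nuclei"}   # hypothetical: a channel happens to share a name
objects = {"nuclei", "cells", "embryo"}

# Tag each name with its set of origin; the shared name "nuclei" no longer collides.
tagged = {("channel", c) for c in channels} | {("object", o) for o in objects}
assert ("channel", "nuclei") in tagged and ("object", "nuclei") in tagged
```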
The type of an image dataset might, for instance, be:
A Python dataclass can be declared as frozen, meaning its fields can only be set during initialization. Attempts to modify these fields later result in a FrozenInstanceError. This is Python’s closest approximation of immutability. Although mutation is still technically possible via object.__setattr__, the design clearly signals that such behavior is not intended. If a frozen dataclass contains a field with a mutable object, the object remains mutable.↩︎
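A minimal sketch of this behavior, using a hypothetical frozen dataclass:

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(frozen=True)
class ChannelMeta:  # hypothetical example class
    name: str
    scale_zyx: tuple[float, float, float]
    tags: list[str] = field(default_factory=list)  # a field holding a mutable object


meta = ChannelMeta("dapi", (1.0, 0.65, 0.65))

try:
    meta.name = "gfp"               # regular assignment is blocked
except FrozenInstanceError:
    pass

object.__setattr__(meta, "name", "gfp")  # mutation is still technically possible
meta.tags.append("raw")                  # and mutable field values remain mutable
```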
The underscore at the beginning of a function name follows a Python convention indicating that the function is intended for internal use and not part of the public API. In this case, it reflects that users should not call the _update function manually in a final implementation—instead it should be invoked automatically, for example, after calling MFrame.filter.↩︎
This might well be me not really getting it.↩︎
While implemented as an API for programmatic access, this is of course the same information that could be exposed to users via a graphical user interface (GUI).↩︎
This represents the primary key column in a z_models table containing the full parameter settings for each model.↩︎
In our case, the type is implicitly encoded in the table names.↩︎
I call it “more sophisticated” because it allows quantification of the subcellular intensity distribution through summary statistics and texture features (not yet implemented). However, volumetric imaging makes quantitative intensity measurements more susceptible to bias.↩︎
In the case of dividing cells, the DNA is segregated during anaphase, prior to cytokinesis, when the daughter cells’ cytoplasm divides. Technically, we could use the CCP model discussed later to detect mitotic pairs of DNA before cytokinesis, at which point we would need to account for cells with multiple nuclei. Another example where the 1:1 relationship between nuclei and cells does not hold is multinucleated cells, which can occur naturally in tissues (e.g., hepatocytes) or under certain experimental conditions.↩︎
If we consider that neighborhood features can be computed across continuous radii, or account for the full range of possible embeddings, the feature space becomes effectively infinite.↩︎
These primary keys are not object IDs (i.e., roi, o, lbl), but rather identifiers for a “feature query applied to an object and other resources.”↩︎
And heated😂.↩︎
Initially, I thought it would be necessary to flip back and forth, but after a couple of months of using the table schemas as specified, my usual access pattern turned out to be: tall for lazy IO and filtering (i.e., selecting a subset of features/objects) followed by a flip to wide in the end.↩︎
Provided that cropping does not interfere with assumptions made by the function.↩︎
This happened to us more than once. And the likelihood of this occurrence increases with the number of acquisitions in an experiment.↩︎
Technically, it would suffice to read only the z-planes between the objective and the cropped region (inclusive), restricted to the same XY extent, but that would complicate the implementation drastically.↩︎
And padded, and masked.↩︎
I do not have sufficient familiarity with the internals of Cellpose to assess how significant a change to the model architecture this would entail.↩︎
There is a sharp discontinuity in DNA volume and content during meta/anaphase in mitosis, or potentially very quick dilution of nuclear proteins when the nuclear envelope breaks down. These discontinuities pose a major problem for inferring a correct trajectory. Ideally, discontinuities in one of the features (i.e., DNA abundance during the metaphase to anaphase transition) can be “patched” by other features (i.e., DNA concentration should remain constant).↩︎
Note that this depends on the features included in the space and their preprocessing.↩︎
The term they use for the principal components after rotation.↩︎
From a quantitative microscopy perspective this is probably my favorite result of the whole project, since it suggests that our intensity quantification is reasonably stable in the face of relatively drastic morphological changes of the nuclei. When I think about all the things that could go wrong here, including the bias issues discussed and how those could be further amplified by segmentation issues, this result makes me very happy.↩︎
That is something we did not measure. Including such measurements in the same dataset would have been extremely challenging, particularly while preserving key properties like random temporal sampling and the mixing of time points—both of which were crucial in the experimental design for statistical robustness.↩︎
Only after the first iteration since it requires a fitted CCP model.↩︎
It is hard to say when exactly this breaks down. Even if asynchrony spans more than a cycle, the average CCP might remain meaningful as long as there is still a detectable density concentration, rather than just a spurious, noise-driven fluctuation in synchrony.↩︎
Shorthand notation for linear regression models (e.g., y ~ x1 + x2), popularized by libraries in the programming language R.↩︎
Inverse function of the logit.↩︎
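A minimal sketch, assuming SciPy, of the logit and its inverse (the expit, i.e., the logistic function):

```python
import numpy as np
from scipy.special import expit, logit

p = 0.8
x = logit(p)                    # log(p / (1 - p)) ≈ 1.386
assert np.isclose(expit(x), p)  # expit(x) = 1 / (1 + exp(-x)) maps back to p
```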
Fun fact: ArviZ uses xarray (see Section 1.3.3.1.1) in its arviz.InferenceData object, which holds the sampling results generated when running Bayesian inference. In my personal opinion, a sign of quality work from the library’s creators😉.↩︎
Early division cycles (<11) have fewer than 50000 cells.↩︎
Statistical censoring means that values beyond a threshold are only observed as the threshold itself (here: values cannot go below a certain value).↩︎
Agilent Bravo or a similar platform.↩︎
OME-NGFF is a backend-agnostic specification, with OME-Zarr being an implementation of the specification using zarr as a backend.↩︎
🤦‍♂️↩︎
It’s even preferable—not worth wasting time.↩︎
If the definition of ROI is kept general enough, these can refer to independent sites in an experiment, as well as spatial slices extracted from a volume. The latter is useful when processing large volumes that do not fit into memory, or when re-using feature extraction functions across different length scales (i.e., the same function could be used to measure the location of nuclei relative to their parent embryo, and transcripts relative to their parent cell).↩︎
It should probably be called a spatial objects table. “Label object” puts too much emphasis on label images, which are just one of multiple representations. Furthermore, each spatial object is associated with multiple “label objects”, one for each multiscale level.↩︎
Maybe something like what the company dotphoton sells. This is not an endorsement of their product, just an example.↩︎
More precisely, pH is defined in terms of the activity of H+ ions. For ideal solutions, the activity equals the concentration.↩︎
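As a short sketch of the standard definition referenced here (general chemistry, not specific to this protocol):

$$\mathrm{pH} = -\log_{10} a_{\mathrm{H^+}} \;\approx\; -\log_{10}\left[\mathrm{H^+}\right]$$

where the approximation by concentration holds for ideal (dilute) solutions, in which the activity equals the concentration.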
I should emphasize that I am not an expert in these methods—this would be best implemented by someone proficient in differential geometry.↩︎
At 95% adoption, one might almost call it a norm.↩︎
In scikit-learn, features (independent variables) and target (dependent variable) are conventionally stored in variables named X and y, respectively. Unsupervised methods do not require a target, but they still need to implement the same signature with a default of y=None. This enforcement of a shared interface between different processing steps enables them to be used in meta-estimators like GridSearchCV.↩︎
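A minimal sketch of this convention, using a hypothetical no-op transformer that follows the scikit-learn interface:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class IdentityTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical unsupervised step: accepts y for interface compatibility, ignores it."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X)


# Because it shares the fit/transform interface, it can be dropped into a Pipeline
# or GridSearchCV alongside supervised estimators.
X = np.arange(6).reshape(3, 2)
X_out = IdentityTransformer().fit_transform(X)
```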
With the caveat that some algorithms, such as Palantir (discussed below), cannot be applied to new data.↩︎
Meaning reasonably smooth.↩︎
In a continuous setting we do not have classes, but the issue is similar to the well-known problem of class imbalance in classification tasks.↩︎
Because we lack a clear quantitative measure of success in terms of scoring, we have to resort to a qualitative assessment.↩︎
This is an example of a refined sentence. The sentence, not the footnote. The footnote is all me—notice the difference? I do not think that I have ever used a ‘;’ in a regular sentence.↩︎
Artificial general intelligence.↩︎
Note that I am mocking the tendency of scientists to oversell their projects to receive grant money, not the work of the scientists on the project. To me, as a non-expert in the field, it actually sounds quite cool.↩︎
Never again! 😅↩︎
I like to think of my PhD as “doing whatever Gabri did—in a relevant model system 😜—but without the impressive publication record 🥲”. If you don’t know me and Gabri, feel free to ignore this footnote.↩︎
And generously sharing their table tennis equipment. 😉↩︎
❤️↩︎