Executive summary
The Principle of Maximum Entropy (MaxEnt) is a rule of statistical inference: given partial information in the form of constraints, choose the probability model with the largest entropy among all distributions satisfying those constraints. This yields the least-committal distribution consistent with what is known, and — as Jaynes formalised in 1957 — connects thermodynamic entropy and Shannon information entropy through a single underlying epistemological framework.12
The mathematical core of MaxEnt is a constrained optimisation problem solved via Lagrange multipliers. Its solution is always a member of an exponential family, and it is equivalent to minimising KL divergence to a reference measure.35 This unification extends from classical statistical mechanics to quantum density matrices, biological sequence models, and single-cell gene expression distributions.
This article surveys the mathematical foundations in depth — entropy functionals, variational derivation, duality, consistency axioms, and large deviation theory — then traces MaxEnt through classical and quantum physics, systems biology, and current research frontiers to 2026. Limitations including the non-equilibrium paradox, identifiability, and effective versus mechanistic couplings are treated explicitly in Section 7.
1. Introduction and historical context
The Principle of Maximum Entropy represents one of the most profound conceptual bridges connecting information theory, statistical mechanics, and predictive modelling in complex systems. Originally expounded by E. T. Jaynes in two seminal papers in 1957, the principle formalised a natural correspondence between the foundational mechanics of classical thermodynamics and the mathematical theory of communication pioneered by Claude Shannon. Jaynes posited that the Gibbsian method of statistical mechanics is inherently sound not merely as an empirical physical theory of interacting particles, but because thermodynamic entropy and information entropy share an identical underlying epistemological framework. Consequently, statistical mechanics can be understood as a specific application of a far more universal tool for logical inference and probabilistic reasoning.12
At its core, MaxEnt is a rule of statistical inference: given partial information in the form of constraints, choose the probability model with the largest entropy among all models satisfying those constraints. This yields the least-committal distribution consistent with what is known, connecting directly to minimum-information updating through relative entropy and the Kullback–Leibler (KL) divergence. By selecting the probability distribution that maximises entropy subject to known constraints, researchers ensure that no inadvertent biases or hidden assumptions are introduced into the probability estimation — the principle avoids injecting structure beyond what is empirically warranted.29
Entropy appears in two closely related but conceptually distinct roles. As an information measure, entropy quantifies uncertainty or surprisal in a distribution; in statistical mechanics it connects macroscopic thermodynamic observables to microscopic state uncertainty. As an inference criterion, the critical move of Jaynes was to treat the entropy functional as a tool for selecting among distributions given constraints, not merely as a physical state function. Boltzmann–Gibbs ensembles fall out as specific constraint choices, rather than as extra physical postulates.2
A key later development was the shift from "maximise entropy" to the more general "minimise information gain relative to a prior": maximise relative entropy (equivalently minimise KL divergence) subject to constraints. This generalisation — often referred to as minimum cross-entropy or maximum relative entropy — is important when a non-uniform prior or reference measure is available, and it subsumes Jaynes' original formulation as the special case of a uniform reference.345
The philosophical status of MaxEnt has been debated extensively. There are at least three defensible stances in the literature: (i) as a normative inference rule justified by consistency axioms (Shore–Johnson); (ii) as a minimum updating rule, selecting the posterior closest to a prior in KL divergence subject to new constraints, from which Bayes' rule emerges as a special case; and (iii) as an asymptotic typicality principle, in which entropy and relative entropy appear as rate functions in large deviation theory and conditional limits select MaxEnt-like distributions. Critical analyses by philosophers of physics emphasise that MaxEnt's force depends on how constraints encode knowledge, and that careless constraint choices can yield misleading claims of objectivity.10
Over the decades, this principle has evolved from its origins in physical chemistry and gas dynamics into a generalised, constraint-based paradigm applicable across disparate scientific domains. Today, the applicability of entropy maximisation extends far beyond the ergodic theory of classical physics, penetrating into quantum information theory, the astrophysics of dense stellar environments, and the high-dimensional data landscapes of modern systems biology and genomics.
2. Mathematical foundations
2.1 Entropy functionals and feasible sets
Shannon entropy (discrete). For a distribution \(p = (p_1, \ldots, p_n)\) on a finite set, with the convention \(0 \log 0 := 0\):1
\[ H(p) = -\sum_{i=1}^{n} p_i \log p_i. \]
Continuous formulation and the KL divergence. When transitioning to continuous variables, the standard differential entropy \(h(p) = -\int p(x) \log p(x)\,dx\) encounters foundational issues: it is not invariant under reparametrisation of coordinates. A change of variables can artificially alter the calculated entropy. The principled resolution maximises relative entropy, or equivalently minimises the Kullback–Leibler (KL) divergence with respect to a specified reference measure \(m(x)\):34
\[ D_{\mathrm{KL}}(p \,\|\, m) = \int p(x) \log \frac{p(x)}{m(x)}\, dx, \]
which is invariant under smooth changes of variables because \(p\) and \(m\) transform identically.
Relative entropy / KL divergence (discrete). For distributions \(p, q\) on the same finite support with \(p \ll q\):3
\[ D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} \,\geq\, 0, \]
with equality if and only if \(p = q\).
Von Neumann entropy (quantum). For a density operator \(\rho\) (positive semidefinite, trace 1), the quantum analogue of Shannon entropy is:11
\[ S(\rho) = -\mathrm{Tr}(\rho \log \rho). \]
Quantum relative entropy (Umegaki). For states \(\rho, \sigma\):11
\[ D(\rho \,\|\, \sigma) = \mathrm{Tr}\big(\rho (\log \rho - \log \sigma)\big). \]
Linear constraints. MaxEnt typically fixes expectation values alongside normalisation. In the discrete case the constraints are \(\mathbb{E}_p[f_k] = \sum_i p_i f_k(i) = c_k\) for \(k = 1, \ldots, m\), with \(\sum_i p_i = 1\). In quantum form: \(\mathrm{Tr}(\rho F_k) = c_k\) and \(\mathrm{Tr}(\rho) = 1\).
2.2 Variational derivation via Lagrange multipliers
Consider the finite discrete MaxEnt problem: maximise \(H(p)\) subject to \(\sum_i p_i f_k(i) = c_k\) for \(k = 1, \ldots, m\) and \(p \in \Delta_n\). Construct the Lagrangian, introducing multiplier \(\lambda_0\) for normalisation and multipliers \(\lambda_k\) for each moment constraint:
\[ \mathcal{L}(p, \lambda_0, \lambda) = -\sum_i p_i \log p_i - \lambda_0 \Big( \sum_i p_i - 1 \Big) - \sum_{k=1}^{m} \lambda_k \Big( \sum_i p_i f_k(i) - c_k \Big). \]
Assuming an interior optimum (\(p_i > 0\) for all \(i\)), stationarity gives for each \(i\):
\[ \frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 - \lambda_0 - \sum_{k=1}^{m} \lambda_k f_k(i) = 0. \]
Defining the partition function and absorbing constants into normalisation yields the exponential family form:
\[ p_i(\lambda) = \frac{1}{Z(\lambda)} \exp\Big( -\sum_{k=1}^{m} \lambda_k f_k(i) \Big), \qquad Z(\lambda) = \sum_i \exp\Big( -\sum_{k=1}^{m} \lambda_k f_k(i) \Big). \]
Derivation of the Boltzmann distribution. If the only constraint applied is the expected average energy \(\langle E \rangle\) alongside normalisation, the MaxEnt solution simplifies to:
\[ p_i = \frac{e^{-\beta E_i}}{Z(\beta)}, \qquad Z(\beta) = \sum_i e^{-\beta E_i}, \]
where \(\beta\) is the multiplier conjugate to energy, identified physically with \(1/k_B T\).
Quantum analogue. Maximising \(S(\rho)\) subject to \(\mathrm{Tr}(\rho) = 1\) and \(\mathrm{Tr}(\rho F_k) = c_k\), and using the variation identity \(\delta\,\mathrm{Tr}(\rho \log \rho) = \mathrm{Tr}(\delta\rho(\log \rho + I))\), stationarity under arbitrary Hermitian \(\delta\rho\) gives:
\[ \rho = \frac{1}{Z} \exp\Big( -\sum_k \lambda_k F_k \Big), \qquad Z = \mathrm{Tr}\, \exp\Big( -\sum_k \lambda_k F_k \Big). \]
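As a small numerical illustration of this form (a sketch of our own; the function name, bracketing interval, and qubit example are illustrative assumptions, not from any library), the single-constraint problem reduces to one-dimensional root finding for the multiplier:

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import brentq

def maxent_state(F, c, bracket=(-50.0, 50.0)):
    """Quantum MaxEnt with one constraint: find lam so that the Gibbs-form
    state rho(lam) = exp(-lam*F) / Tr exp(-lam*F) satisfies Tr(rho F) = c.
    F must be Hermitian and c must lie strictly between its extreme
    eigenvalues.  Illustrative sketch only."""
    def moment(lam):
        rho = expm(-lam * F)
        rho = rho / np.trace(rho).real
        return np.trace(rho @ F).real
    lam = brentq(lambda l: moment(l) - c, *bracket)
    rho = expm(-lam * F)
    return rho / np.trace(rho).real, lam

# Example: a qubit with F = sigma_z and target expectation <F> = 0.3.
sigma_z = np.diag([1.0, -1.0])
rho, lam = maxent_state(sigma_z, 0.3)
```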
2.3 Concavity, convexity, and uniqueness
Proposition (strict concavity of Shannon entropy). On the interior of \(\Delta_n\), \(H(p)\) is strictly concave.
Proof sketch. Write \(H(p) = -\sum_i \varphi(p_i)\) with \(\varphi(x) = x \log x\). Then \(\varphi''(x) = 1/x > 0\) for \(x > 0\), so \(\varphi\) is strictly convex and \(-\varphi\) strictly concave. The sum preserves strict concavity on the interior. \(\square\)
Corollary (uniqueness under interior feasibility). If the constraint set \(C = \{p \in \Delta_n : \mathbb{E}_p[f_k] = c_k\}\) is nonempty and contains an interior point, the MaxEnt optimiser is unique.
Proof. \(C\) is convex because constraints are linear. A strictly concave function over a convex set has at most one maximiser. \(\square\)
Relative entropy convexity and uniqueness (I-projection). The map \(p \mapsto D_{\mathrm{KL}}(p \| q)\) is strictly convex in \(p\) for fixed \(q\) with full support. Therefore, minimising over a convex set yields a unique minimiser when one exists. This "information projection" (I-projection) geometry, developed by Csiszár, is central to the duality theory of MaxEnt.5
2.4 MaxEnt as minimum KL divergence to a reference measure
Theorem (MaxEnt ↔ min-KL to uniform). On a finite set of size \(n\), maximising \(H(p)\) subject to constraints is equivalent to minimising \(D_{\mathrm{KL}}(p \| u)\) subject to the same constraints, where \(u_i = 1/n\). The equivalence follows from the identity \(D_{\mathrm{KL}}(p \,\|\, u) = \log n - H(p)\): the two objectives differ only by a constant.
More generally, with a non-uniform reference (a prior) \(q\), the natural problem becomes \(\min_{p \in C} D_{\mathrm{KL}}(p \| q)\), the minimum discrimination information view emphasised in Shore–Johnson's consistency derivation.45
2.5 Duality, log-partition functions, and moment matching
For the exponential-family solution \(p_i(\lambda)\), define \(A(\lambda) = \log Z(\lambda)\) (the log-partition function). Then:
\[ \frac{\partial A}{\partial \lambda_k} = -\mathbb{E}_{p(\lambda)}[f_k], \qquad \frac{\partial^2 A}{\partial \lambda_k\, \partial \lambda_l} = \mathrm{Cov}_{p(\lambda)}(f_k, f_l). \]
The constraint equations \(\mathbb{E}_{p(\lambda)}[f_k] = c_k\) are thus the stationarity conditions of the convex dual objective \(A(\lambda) + \sum_k \lambda_k c_k\).
The map \(\lambda \mapsto A(\lambda)\) is convex — typically strictly convex when the \(f_k\) are linearly independent under the base measure — ensuring well-behaved moment-to-parameter mapping and underpinning numerical algorithms such as iterative scaling and Newton methods.59
Numerical fitting (sketch for a quadratic count model). As a practical illustration, fitting \(p(n) \propto \exp(-\lambda_1 n - \lambda_2 n^2)\) to a target mean and variance on \(n = 0, 1, \ldots, N_{\max}\) reduces to minimising the convex dual above.
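A minimal Python sketch under these conventions (all names are ours; it minimises the dual \(A(\lambda) + \lambda \cdot c\) of Section 2.5 rather than iterating moment equations directly):

```python
import numpy as np
from scipy.optimize import minimize

def fit_quadratic_maxent(mean, var, n_max=200):
    """Fit p(n) ∝ exp(-lam1*n - lam2*n**2) on n = 0..n_max so that
    E[n] = mean and Var(n) = var, by minimising the convex dual
    A(lam) + lam . c (Section 2.5).  Illustrative sketch only."""
    n = np.arange(n_max + 1, dtype=float)
    F = np.stack([n, n ** 2])                 # features f1(n) = n, f2(n) = n^2
    c = np.array([mean, var + mean ** 2])     # targets E[n], E[n^2]

    def dual(lam):
        logits = -(lam @ F)
        m = logits.max()
        return m + np.log(np.exp(logits - m).sum()) + lam @ c  # A(lam) + lam.c

    lam = minimize(dual, x0=np.zeros(2), method="BFGS").x
    logits = -(lam @ F); logits -= logits.max()               # stabilised
    p = np.exp(logits); p /= p.sum()
    return lam, p

lam, p = fit_quadratic_maxent(mean=10.0, var=25.0)
```

At the minimiser the gradient of the dual vanishes, which is exactly the moment-matching condition \(\mathbb{E}_{p(\lambda)}[f_k] = c_k\).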
2.6 Consistency axioms and the Bayesian connection
Shore–Johnson consistency. Shore and Johnson give conditions — axioms of consistency, subset independence, and system independence — under which updating must be performed by minimising cross-entropy relative to a prior, thereby selecting MaxEnt/min-KL as uniquely consistent. This provides a normative, axiomatic justification for MaxEnt as a general inference framework.4
Bayesian inference as a special case of maximum relative entropy. Caticha and others have argued in a "design" framework that updating from prior to posterior is achieved by maximising relative entropy subject to constraints encoding new information; Bayes' rule arises when the constraints encode observed data through a likelihood.8
A minimal sketch of one common formulation: update a joint \(q(\theta, x)\) to \(p(\theta, x)\) subject to the data constraint \(p(x) = \delta(x - x_{\mathrm{obs}})\). Maximising \(-D_{\mathrm{KL}}(p(\theta, x) \,\|\, q(\theta, x))\) under this constraint leaves the conditional unchanged, \(p(\theta \mid x) = q(\theta \mid x)\), so the updated marginal is
\[ p(\theta) = q(\theta \mid x_{\mathrm{obs}}) \propto q(\theta)\, q(x_{\mathrm{obs}} \mid \theta), \]
which is exactly Bayes' rule.
2.7 Large deviations: why entropy maximisation is "typical"
Large deviation theory makes the MaxEnt selection appear as an asymptotic typicality statement, providing a probabilistic rationale independent of inference axioms.67
Sanov's theorem (informal statement). Let \(X_1, \ldots, X_n\) be i.i.d. from a true distribution \(q\) on a finite alphabet. The empirical measure \(L_n\) satisfies a large deviation principle (LDP) with rate function \(D_{\mathrm{KL}}(\cdot \| q)\): the probability that \(L_n\) is near \(p\) decays like \(\exp(-n\,D_{\mathrm{KL}}(p \| q))\).67
Gibbs conditioning principle. Condition on empirical constraints such as \(\mathbb{E}_{L_n}[f] = c\). As \(n \to \infty\), the conditional distribution concentrates near the distribution \(p^\star\) that minimises \(D_{\mathrm{KL}}(p \| q)\) subject to the constraint — exactly the minimum discrimination information solution. This provides a frequentist, combinatorial rationale for why MaxEnt distributions often describe the "most probable" macrostates when many microscopic degrees of freedom are involved, while emphasising that the choice of constraints is where modelling judgement enters.
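A small simulation illustrates the conditioning principle (a sketch of our own; the dice example, constraint value, and sample sizes are arbitrary choices): condition i.i.d. dice rolls on an elevated empirical mean and compare the conditional single-roll distribution with the exponentially tilted MaxEnt solution.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
faces = np.arange(1, 7)
n, trials, target = 50, 200_000, 4.0

# Empirical conditional distribution: keep only trials whose mean exceeds 4.0.
rolls = rng.integers(1, 7, size=(trials, n))
kept = rolls[rolls.mean(axis=1) >= target]
empirical = np.bincount(kept.ravel(), minlength=7)[1:] / kept.size

# MaxEnt / tilted solution: p_k ∝ exp(lam*k) with lam chosen so the mean is 4.0.
def tilted_mean(lam):
    w = np.exp(lam * faces)
    return (faces * w).sum() / w.sum()

lam = brentq(lambda l: tilted_mean(l) - target, 0.0, 5.0)
tilted = np.exp(lam * faces); tilted /= tilted.sum()
print(np.round(empirical, 3), np.round(tilted, 3))  # the two should be close
```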
2.8 Extension to biology: the negative binomial distribution
The mathematical flexibility of Lagrange multipliers extends naturally into biology.1623 In modern transcriptomics, mRNA counts within single cells frequently follow a Negative Binomial (NB) distribution, and this can be derived as a MaxEnt distribution under specific biological constraints. When a birth–death model of mRNA transcription and degradation is analysed at steady state, the relevant constraints map to the mean and a logarithmic expectation of the transcript counts. Applied with a pseudo-likelihood approximation over the multidimensional space of gene expression states, the resulting MaxEnt distribution exactly recovers the Negative Binomial. This derivation shows that the NB distribution is not merely an empirical fit for overdispersed biological data, but the maximally uninformative steady-state distribution for mRNA content in single cells given the inherent constraints of transcription kinetics.
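To make the exponential-family structure explicit, a convenient rewriting (a sketch, with the fixed dispersion parameter \(r\) absorbed into the reference measure \(m\)) is:
\[
p(n) = \frac{\Gamma(n+r)}{\Gamma(r)\, n!}\, q^n (1-q)^r = \frac{1}{Z(\lambda)}\, m(n)\, e^{-\lambda n}, \qquad m(n) = \frac{\Gamma(n+r)}{\Gamma(r)\, n!}, \quad e^{-\lambda} = q,
\]
so the NB is precisely the I-projection of the reference measure \(m\) onto a single mean constraint, in the sense of Section 2.4.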
3. Classical and quantum research on MaxEnt
3.1 Classical MaxEnt: key results and research directions
Equilibrium statistical mechanics as constrained inference. Jaynes' most influential methodological claim is that with constraints like normalisation and mean energy, MaxEnt yields familiar equilibrium ensembles.29 Under a mean-energy constraint, the canonical (Boltzmann–Gibbs) distribution emerges; other constraint sets yield other ensembles (microcanonical, grand canonical). This reframes statistical mechanics as inference from macroscopic data, rather than as purely dynamical postulates.
MaxEnt beyond equilibrium: Maximum Calibre (MaxCal). A major extension applies MaxEnt to path ensembles rather than static distributions: maximise entropy over trajectories subject to dynamical constraints, yielding Markov processes and fluctuation relations in certain limits. MaxCal has been applied to non-equilibrium biochemical dynamics, protein folding kinetics, and cellular signalling, and is increasingly used to derive non-equilibrium steady-state distributions from measured dynamical observables.9
Inverse problems and exponential families. MaxEnt is closely related to fitting exponential-family models from moments or marginals via iterative scaling and convex duality. In modern statistical language, "maximum entropy modelling" often refers to selecting an exponential family with features and fitting to satisfy empirical moment constraints, sometimes with regularisation. Shore–Johnson and Csiszár's I-projection theory provide foundational justification for this approach.45
3.2 Quantum MaxEnt: what changes in the non-commutative setting
Quantum MaxEnt generalises "choose \(p\)" to "choose \(\rho\)" under operator constraints.
Density matrices and Gibbs states. As shown in Section 2.2, maximising von Neumann entropy under linear constraints yields \(\rho \propto \exp(-\sum_k \lambda_k F_k)\). Jaynes' 1957 sequel explicitly formulates the "maximum-entropy density matrix" and relates it to quantum statistical mechanics.2
Quantum thermalization and the Eigenstate Thermalization Hypothesis (ETH). Recent theoretical research into quantum thermalization explores how isolated quantum many-body systems reach thermal equilibrium without external heat baths. The ETH posits that thermalization occurs intrinsically at the level of individual, highly excited eigenstates. The von Neumann entropy of a sufficiently small subsystem embedded within a larger, isolated, thermalized pure quantum system matches the thermal (Gibbs) entropy of that subsystem, bridging microscopic unitary quantum evolution with macroscopic thermodynamic irreversibility.
Quantum relative entropy and monotonicity. Quantum relative entropy generalises KL divergence in the operator-algebra setting. A foundational property is data processing / monotonicity under appropriate quantum channels, underpinning the principle that "coarse-graining reduces distinguishability." Early proofs (Lindblad) and later refinements (Petz) are central in quantum information theory.1213
Genuinely quantum pathology: discontinuity of MaxEnt inference. Unlike the classical case — where MaxEnt inference maps are continuous under mild conditions — quantum MaxEnt inference can be discontinuous when constraints involve non-commuting observables. Stephan Weis demonstrated explicit discontinuities and analysed them in terms of the geometry of the expectation map; later work connects such discontinuities to quantum phase transitions and other non-classical phenomena.14 This subtlety is critical for applied users: in quantum settings, "MaxEnt" is still a convex optimisation, but the inference map from measured expectations to inferred states can have geometric behaviour with no classical analogue.
Quantum Rényi entropies and astrophysical closures. Modern quantum research frequently utilises quantum Rényi entropies \(S_\alpha(\rho) = \frac{1}{1-\alpha} \log \mathrm{Tr}(\rho^\alpha)\), which recover the von Neumann entropy as \(\alpha \to 1\). These are additive for product states, vanish for pure states, and satisfy weak subadditivity. Beyond condensed matter, quantum MaxEnt principles are transforming astrophysics, specifically in the numerical simulation of post-neutron-star-merger disks and core-collapse supernovae. Simulating neutrino flavour mixing in ultra-dense environments requires solving infinite, coupled towers of quantum moment evolution equations. By rigorously applying MaxEnt, researchers infer unknown higher-order angular moments of the neutrino one-body reduced density matrix solely from known lower-order moments — a "quantum maximum entropy closure" — enabling tractable modelling of Fast Flavour Instabilities (FFIs) without prohibitive computational overhead.
Recent quantum MaxEnt stability results (2025–2026). A 2025 analysis by James Tian studies quantitative stability: under appropriate control of moment errors and entropy deviations, one can bound trace-norm distance to the MaxEnt inference state, with stability under certain quantum operations, providing concrete guarantees for applied quantum state tomography.15
| Regime | Scientific domain | Form of entropy | Primary constraint | Key application |
|---|---|---|---|---|
| Classical | Statistical mechanics | Shannon / Gibbs entropy | Macroscopic variables (energy, volume, particle number) | Derivation of thermodynamic ensembles; gas dynamics |
| Quantum | Quantum information theory | Von Neumann entropy | \(\mathrm{Tr}(\rho H)\) fixed | Quantum thermalization, coherent information, entanglement quantification |
| Quantum | Condensed matter physics | Quantum Rényi entropies | Subsystem traces and spin lattices | Criticality and entanglement in quantum spin chains |
| Quantum | High-energy astrophysics | Quantum MaxEnt closure | Lower-order neutrino angular moments | Fast Flavour Instabilities in core-collapse supernovae |
4. MaxEnt in biology and gene expression
MaxEnt enters biology primarily as a model-building and inference tool under data limitations, rather than as a direct mechanistic claim that biological systems "maximise entropy." Three recurring uses are: (i) distribution reconstruction from moments; (ii) effective interaction and network models from correlations; and (iii) sequence models from marginal constraints. Living organisms are open, non-equilibrium systems, and direct application of static MaxEnt must be interpreted accordingly — as capturing effective descriptions of snapshot distributions, not as claims about the system's thermodynamic trajectory.
4.1 Gene expression distributions, noise decomposition, and constrained inference
Single-cell and single-molecule expression measurements provide snapshot distributions of mRNA or protein counts. A frequent scientific question is to separate intrinsic (stochastic reaction) noise from extrinsic (cell-to-cell parameter variability) noise without mechanistic models for both simultaneously.
A representative MaxEnt approach treats extrinsic factors (e.g. effective rate parameters) as latent variables whose distribution is inferred by MaxEnt to match observed count distributions. Dixit (2013) applied such a framework to infer extrinsic variability contributions in gene expression in E. coli, with worked examples and empirical comparison.16 The Lagrange multipliers enforcing moment constraints are not mechanistic kinetic rates; they summarise constraints and require additional modelling to map onto biochemical parameters.
4.2 Gene interaction networks from expression patterns
MaxEnt can infer effective gene–gene interaction networks by fitting a distribution over expression states that matches observed summary statistics, typically means and pairwise correlations. Standard correlation measures group genes by profile similarity but fail to reveal direct regulatory topology: genes may appear correlated simply because they share a distant upstream regulator.
By constraining a MaxEnt model to match the empirically observed mean expression levels of every gene and the pairwise covariance matrix between all gene pairs, the procedure naturally generates an interacting, web-like network. Lezon and colleagues (2006) used entropy maximisation on microarray data from Saccharomyces cerevisiae during metabolic oscillations to identify a gene interaction network most consistent with observed transcript profiles, accurately reflecting how the organism dynamically adjusts cellular metabolic activity in response to limiting nutrient conditions.18 In the single-cell era, a notable 2025 study by Bialek and collaborators builds pairwise MaxEnt (Ising) models of presence/absence of transcripts across hundreds of genes in brain single-cell data, demonstrating prediction of higher-order statistics and exploring multi-peak energy landscapes for cell classification (see Section 6.1).19
4.3 Signalling networks and heterogeneous parameters
In biochemical signalling, heterogeneity reflects variability in kinetic parameters across cells. A MaxEnt approach infers a distribution over parameters — rather than a single best-fit parameter set — constrained by observed distributions of signalling readouts. The MERIDIAN framework (Dixit et al., 2020; Cell Systems) uses MaxEnt to infer parameter distributions consistent with single-cell signalling data, seeking predictive models of population responses without imposing arbitrary parametric forms on heterogeneity.17
4.4 Regulatory sequence motifs and multiple sequence alignments
MaxEnt provides a principled way to model biological sequences under finite data.
Short motifs. A MaxEnt distribution over \(k\)-mers consistent with low-order marginal constraints captures dependencies between positions missed by position-independent position weight matrices (PWMs). Yeo and Burge (2004) proposed a MaxEnt framework for short sequence motifs with applications to RNA splicing signals.20 The MaxEnt solution is:
\[ p(s_1, \ldots, s_k) = \frac{1}{Z} \exp\Big( \sum_{\ell} h_\ell(s_\ell) + \sum_{(\ell, m) \in \mathcal{P}} J_{\ell m}(s_\ell, s_m) \Big), \]
where \(\mathcal{P}\) is the set of position pairs whose marginals are constrained.
Protein families: Potts models and Direct Coupling Analysis (DCA). Pairwise MaxEnt models over amino-acid sequences inferred from multiple sequence alignments (MSAs) underpin DCA, used for structural contact prediction and generative sequence modelling. Given single-site frequencies \(f_i(a)\) and pairwise frequencies \(f_{ij}(a,b)\) estimated after phylogenetic reweighting, the MaxEnt solution is a Potts model:
\[ p(a_1, \ldots, a_L) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{L} h_i(a_i) + \sum_{i < j} J_{ij}(a_i, a_j) \Big). \]
4.5 Comparison of biological MaxEnt model families
| Model family | Variables | Inputs (constraints) | Core assumption | Typical limitations |
|---|---|---|---|---|
| Moment-constrained count MaxEnt | \(N \in \mathbb{Z}_{\geq 0}\) | \(\mathbb{E}[N]\), \(\mathrm{Var}(N)\) | Only specified moments are trusted | Moments may be noisy; solver instability; multipliers not mechanistic |
| Pairwise MaxEnt gene state model (Ising) | \(x_i \in \{0,1\}\) | \(\langle x_i \rangle\), \(\langle x_i x_j \rangle\) | Pairwise constraints capture essential dependence | Couplings are effective; latent confounders; needs large sample sizes |
| MaxEnt motifs with dependencies | \(s_\ell \in \{A,C,G,T/U\}\) | Single-site and pair marginals | Constraints encode motif statistics | Data sparsity; selecting relevant dependencies; interpretability vs. mechanism |
| Potts MaxEnt for MSAs (DCA) | Amino acids at \(L\) sites | \(f_i(a)\), \(f_{ij}(a,b)\) | Pairwise statistics sufficient | Needs deep MSAs; phylogenetic bias; higher-order effects |
5. Toy models and practical illustrations
5.1 The two-gene toggle switch and symmetry breaking
The synthetic bistable toggle switch is a foundational toy model in systems biology representing the minimum architecture required for cellular decision-making. It consists of a mutual repression network where two genes strongly inhibit each other's transcription, characterised mathematically by small dissociation constants.
Deterministically modelled by coupled ODEs, the toggle switch permanently favours the dominance of one gene over the other based strictly on initial conditions. When modelled stochastically under a MaxEnt framework, the system maps to an energy landscape with two distinct attractor basins. Intrinsic transcriptional noise — arising from low copy numbers of regulatory molecules — induces stochastic transitions across the potential energy barrier, a phenomenon known as noise-induced tipping. This causes the dominant gene to switch stochastically, exploring the entirety of the allowable phase space.
This toy model illustrates and predicts spatial pattern symmetry breaking in bacterial colony expansion, demonstrating how simple genetic circuits interface with microenvironmental heterogeneity to govern macroscopic behaviours such as sporulation patterns in Bacillus subtilis and surface colonisation decisions in Pseudomonas aeruginosa.
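A minimal stochastic simulation of the switch illustrates the noise-induced transitions described above (a sketch of our own; the Hill-repression form and parameter values are illustrative, not fitted to any organism):

```python
import numpy as np

def toggle_gillespie(t_max=5000.0, alpha=50.0, K=10.0, n_hill=2, gamma=1.0, seed=0):
    """Minimal Gillespie simulation of a two-gene toggle switch: each
    protein's production is repressed by the other (Hill function) and
    degradation is first-order.  Parameters are illustrative only."""
    rng = np.random.default_rng(seed)
    x = np.array([0.0, 0.0])                  # copy numbers of proteins A, B
    t, traj = 0.0, []
    while t < t_max:
        prod = alpha / (1.0 + (x[::-1] / K) ** n_hill)  # repression by the other gene
        rates = np.concatenate([prod, gamma * x])       # [make A, make B, lose A, lose B]
        total = rates.sum()
        t += rng.exponential(1.0 / total)
        r = rng.choice(4, p=rates / total)
        x[r % 2] += 1.0 if r < 2 else -1.0
        traj.append((t, x[0], x[1]))
    return np.array(traj)
```

Long runs of this simulation show the two basins (A-high/B-low and B-high/A-low) and the rare stochastic flips between them.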
5.2 The Hopfield model as the Waddington epigenetic landscape
The Hopfield model (1982), originally developed for associative memory in neural networks, serves as an abstract toy model for the Waddington landscape using MaxEnt principles. The energy landscape features distinct stable attractors corresponding to stored memories, which map directly to stable expression profiles of fully differentiated cell types. A multipotent stem cell is represented as existing in a higher-energy, higher-entropy state near the top of the landscape, characterised by promiscuous and highly variable gene expression.
As the cell differentiates, the geometric constraints of the regulatory landscape force the developmental trajectory into a specific attractor basin. The Hopfield toy model supports the hypothesis that as cellular trajectories are progressively constrained by the landscape's geometry, the intrinsic dimension and transcriptomic entropy of the cell decrease. This measurable drop in entropy maps directly to a loss of developmental potency, providing a quantifiable, geometry-based metric for cellular differentiation that is robust against technical noise and data sparsity in scRNA-seq datasets.
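A minimal numerical sketch of the attractor picture (our own illustration; random \(\pm 1\) patterns stand in for differentiated expression profiles):

```python
import numpy as np

rng = np.random.default_rng(0)

def hopfield_relax(patterns, state, n_steps=500):
    """Hebbian Hopfield network: stored patterns (rows of +/-1) become
    attractors of E(s) = -0.5 * s @ W @ s; asynchronous sign updates
    only lower the energy.  Illustrative sketch of the landscape picture."""
    W = patterns.T @ patterns / patterns.shape[1]
    np.fill_diagonal(W, 0.0)
    s = state.copy()
    for i in rng.integers(len(s), size=n_steps):
        s[i] = 1 if W[i] @ s >= 0 else -1     # greedy energy descent
    return s

# A noisy "multipotent" state relaxes into the nearest stored profile.
patterns = rng.choice([-1, 1], size=(3, 200))  # three differentiated profiles
noisy = np.where(rng.random(200) < 0.3, -patterns[0], patterns[0])
recovered = hopfield_relax(patterns, noisy)    # ≈ patterns[0]
```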
5.3 Pairwise MaxEnt (Ising) model for gene expression patterns
Set-up. Let \(x_i \in \{0,1\}\) encode whether gene \(i\) is detected in a cell. Estimate empirical means \(\langle x_i \rangle\) and pairwise correlations \(\langle x_i x_j \rangle\).
MaxEnt problem. Find \(p(x_1, \ldots, x_G)\) maximising Shannon entropy subject to \(\mathbb{E}[x_i] = m_i\) and \(\mathbb{E}[x_i x_j] = C_{ij}\) for \(i < j\). The resulting distribution is an Ising model under \(\{0,1\}\) encoding:
\[ p(x) = \frac{1}{Z} \exp\Big( \sum_i h_i x_i + \sum_{i<j} J_{ij} x_i x_j \Big). \]
Practical fitting via pseudolikelihood. When \(G\) is large, the partition function \(Z\) is intractable. Pseudolikelihood instead fits the conditional probability of each gene given the rest,
\[ p(x_i = 1 \mid x_{\setminus i}) = \sigma\Big( h_i + \sum_{j \neq i} J_{ij} x_j \Big), \qquad \sigma(u) = \frac{1}{1 + e^{-u}}, \]
maximising the sum of conditional log-likelihoods over cells and genes — a set of coupled logistic regressions that avoids computing \(Z\).
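A compact sketch of this fit (our own illustration, using scikit-learn's logistic regression; function and variable names are assumptions, and each gene must take both values somewhere in the data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ising_pseudolikelihood(X, l2=1.0):
    """Pseudolikelihood fit of the Ising gene-state model: logistically
    regress each gene on all others, avoiding the partition function Z.
    X: binary array (cells x genes).  Couplings are symmetrised at the
    end.  Illustrative sketch only."""
    n_genes = X.shape[1]
    h = np.zeros(n_genes)
    W = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        mask = np.arange(n_genes) != i
        clf = LogisticRegression(C=1.0 / l2, max_iter=2000)
        clf.fit(X[:, mask], X[:, i])
        h[i] = clf.intercept_[0]
        W[i, mask] = clf.coef_[0]
    return h, (W + W.T) / 2.0                 # estimate of J_ij
```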
5.4 MaxEnt models for regulatory sequence motifs
Set-up. Given an aligned set of nucleotide motif instances of length \(L\), estimate single-position marginals \(f_\ell(a) = \Pr(s_\ell = a)\) and optionally selected pairwise marginals \(f_{\ell m}(a,b) = \Pr(s_\ell = a, s_m = b)\). The resulting MaxEnt distribution is the motif model of Section 4.4. Using only single-site constraints recovers an independent PWM-like model; adding pairwise constraints yields the full MaxEnt motif model.
Fitting algorithm (high-level):
1. Estimate the empirical single-site marginals and the selected pairwise marginals from the aligned motif instances.
2. Initialise all fields and couplings at zero (the independent, uniform model).
3. Compute the model marginals — exactly, by enumerating all \(4^L\) sequences when \(L\) is small.
4. Update parameters toward the empirical marginals (iterative scaling or gradient ascent on the log-likelihood) and iterate until the constraints are matched to tolerance; a sketch follows this list.
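A sketch of the exact-enumeration fit for short motifs (all names are ours; gradient ascent on the log-likelihood, feasible only while \(4^L\) stays small):

```python
import itertools
import numpy as np

def fit_motif_maxent(f1, pairs, f2, n_iter=500, lr=0.5):
    """Fit p(s) ∝ exp(sum_l h_l(s_l) + sum_(l,m) J_lm(s_l, s_m)) by gradient
    ascent, enumerating all 4**L sequences.  f1: (L, 4) single-site
    marginals; f2[(l, m)]: (4, 4) pair marginals for each pair in `pairs`.
    Illustrative sketch only."""
    L = f1.shape[0]
    S = np.array(list(itertools.product(range(4), repeat=L)))  # all k-mers
    h = np.zeros((L, 4))
    J = {lm: np.zeros((4, 4)) for lm in pairs}
    for _ in range(n_iter):
        logit = sum(h[l, S[:, l]] for l in range(L))
        for (l, m) in pairs:
            logit = logit + J[(l, m)][S[:, l], S[:, m]]
        p = np.exp(logit - logit.max()); p /= p.sum()
        m1 = np.zeros_like(h)                     # model single-site marginals
        for l in range(L):
            np.add.at(m1[l], S[:, l], p)
        for (l, m) in pairs:                      # model pair marginals + update
            m2 = np.zeros((4, 4))
            np.add.at(m2, (S[:, l], S[:, m]), p)
            J[(l, m)] += lr * (f2[(l, m)] - m2)
        h += lr * (f1 - m1)                       # gradient of the log-likelihood
    return h, J
```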
5.5 Potts MaxEnt for protein sequence alignments and coevolution
Set-up. Given an MSA of a protein family, estimate single-site frequencies \(f_i(a)\) and pairwise frequencies \(f_{ij}(a,b)\), reweighted to mitigate phylogenetic biases. The Potts model solution is given in Section 4.4. Couplings \(J_{ij}\) identify sites that co-evolve and are informative of spatial contacts in the folded protein.
Practical inference (pseudolikelihood): as in the Ising case, each alignment column is regressed on all others, maximising the regularised sum of conditional log-likelihoods across reweighted sequences (plmDCA-style).22 Candidate contacts are then ranked by the Frobenius norm of each coupling block after an average product correction (APC); a sketch of that scoring step follows.
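A sketch of the contact-scoring step applied to fitted couplings (our own illustration; assumes the couplings are stored as an \(L \times L \times q \times q\) array in a zero-sum gauge):

```python
import numpy as np

def contact_scores(J):
    """Rank candidate contacts from fitted Potts couplings: Frobenius norm
    of each q x q coupling block, then average product correction (APC)
    to suppress background/phylogenetic signal.  J: (L, L, q, q) array.
    Illustrative sketch only."""
    F = np.linalg.norm(J, axis=(2, 3))        # (L, L) coupling strengths
    np.fill_diagonal(F, 0.0)
    row = F.mean(axis=1, keepdims=True)
    return F - row * row.T / F.mean()         # APC-corrected scores
```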
As of 2025–2026, Potts/DCA remains the key interpretable baseline while MSA transformers and protein language models are benchmarked against it for reproduction of higher-order statistics and disentanglement of phylogenetic effects.2425
5.6 Metabolic networks and species abundance distributions
A simple metabolic toy model represents the allocation of cellular resources. If regulatory pathways control internal metabolic fluxes, each pathway can be modelled as a Gaussian information channel contributing to a targeted biological objective, with metabolic cost bounded by the number of regulatory molecules. Higher molecular counts enable precise control and thus higher informational throughput, but impose a thermodynamic and metabolic cost, forcing an evolutionary trade-off quantifiable within the MaxEnt framework.
The Species Abundance Distribution (SAD) in evolutionary modelling illustrates a parallel application. By setting up constraints based on the expected number of distinct species observed across evolutionary trials, a precise configurational entropy model can be formulated. The expected entropy calculated from this toy model matches in-silico experimental results, verifying that entropy maximisation correctly distributes observed species over the theoretical space of potential evolutionary outcomes and providing a thermodynamic basis for ecological diversity.
| Toy model | Mathematical framework | Biological phenomenon illustrated |
|---|---|---|
| Toggle switch | Stochastic bistable repression | Cellular decision-making, noise-induced tipping, pattern symmetry breaking |
| Hopfield model | Energy landscape attractors | Waddington epigenetic landscape, stem cell differentiation, entropy decrease with commitment |
| Pairwise Ising model | MaxEnt over binary expression | Cell type emergence, gene state co-occurrence, energy landscape |
| Gaussian channels | Information theory | Metabolic flux regulation, energetic cost of regulatory precision |
| SAD model | Configurational entropy | Evolutionary species distribution across available niches |
6. Current research status and frontiers (2026)
6.1 Single-cell gene expression: high-dimensional generative models
The scale of scRNA-seq has enabled fitting MaxEnt models to hundreds of genes across large cell populations. The 2025 study by Sarra and colleagues (Physical Review E) demonstrates a concrete pipeline: binarise gene presence/absence, fit pairwise MaxEnt (Ising) models matching means and correlations, validate via higher-order statistics, and interpret landscape structure for cell classification.19 The probability distribution of cellular states derived from the Ising model possesses multiple distinct local maxima — mathematically equivalent to local energy minima of a spin-glass model. Grouping individual, unlabelled cells according to the basin of attraction of these local maxima results in a classification structure that aligns with empirically validated biological cell types, while the MaxEnt model also proves capable of identifying previously undiscovered subtypes with distinguishable expression patterns.
6.2 Aging, cancer, and network structural entropy
A significant advance in 2025–2026 computational biology is the application of network structural entropy to quantify the dynamic complexity of single-cell gene networks in tissue ageing and cancer progression. A landmark study published in January 2025 analysed ageing-related gene networks using scRNA-seq data from over 15,000 cells to decipher the molecular mechanisms behind skin ageing. A random-walk model was used to rank gene importance; gene interaction patterns were extracted and network structural entropy calculated by mapping the degree distributions of gene nodes into a normalised probability distribution.
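A minimal sketch of one common degree-based formulation of this quantity (our own illustration; the cited study's exact pipeline may differ):

```python
import numpy as np

def degree_structural_entropy(adj):
    """Degree-based network structural entropy: normalise node degrees
    into a probability distribution and take Shannon entropy, scaled by
    log(N) to lie in [0, 1].  A sketch of the idea described above, not
    the exact published pipeline."""
    deg = np.asarray(adj).sum(axis=1).astype(float)
    p = deg / deg.sum()
    p = p[p > 0]                               # convention 0 log 0 = 0
    return -(p * np.log(p)).sum() / np.log(len(deg))
```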
Findings revealed profound heterogeneity in how different cell subtypes age:
- Increased entropy (disordered state). Aged differentiated keratinocytes, pericytes, and secretory cells exhibited increasing structural entropy. Their gene regulatory networks became highly disordered, leading to a dilution of genomic information and measurable functional decline.
- Stable entropy (functional resilience). Epithelial stem cells, undifferentiated progenitors, and T cells showed no significant entropic changes, maintaining stable, rigid gene interaction patterns to preserve homeostatic functions.
- Decreased entropy (simplified patterns). Melanocytes and mesenchymal cells displayed a paradoxical decrease in network structural entropy. Their network degree distributions became highly concentrated and uniform, implying regulatory simplification — technically more ordered, but less adaptable to environmental stress and linked to carcinogenesis.
6.3 Thermodynamic uncertainty relations and non-equilibrium bounds
Because living systems cannot be perfectly modelled by static entropy maximisation, modern biophysicists increasingly rely on Stochastic Thermodynamics to bound biological efficiency. A primary limitation in all biological modelling is the inability to observe all internal hidden variables experimentally; typically, only a fraction of a cell's metabolic or transcriptomic state is visible.
The Thermodynamic Uncertainty Relation (TUR) proves mathematically that in any non-equilibrium biological process, it is physically impossible to simultaneously reduce both stochastic noise (fluctuation) and the entropy production rate to arbitrarily small levels:
\[ \frac{\operatorname{Var}(J_t)}{\langle J_t \rangle^{2}} \;\geq\; \frac{2 k_B}{\Sigma(t)}, \]
where \(J_t\) is any accumulated current (e.g. molecules produced, steps advanced) and \(\Sigma(t)\) is the total entropy production up to time \(t\).
By framing biological energy consumption within this physical optimisation framework, researchers can infer lower bounds on the rate of entropy production using only partial, noisy measurements of the biological system. This approach allows non-zero bounds to be established even when the observed trajectories appear time-symmetric and seemingly obey detailed balance. While classical MaxEnt is constrained by equilibrium assumptions, TUR-based frameworks bridge the theoretical gap, mapping the thermodynamic cost of the precision required to sustain biological life.
6.4 Sequence modelling: Potts/DCA vs. deep generative models
Potts/DCA (pairwise MaxEnt) continues to be treated as a mechanistically interpretable baseline for protein sequence constraints and structural contacts. Work in 2022–2025 directly compares Potts models with protein language models and MSA transformers, testing reproduction of higher-order statistics and the ability to disentangle phylogenetic bias. While transformers can sometimes outperform DCA on downstream tasks, Potts remains valuable for its explicit interpretability: couplings have a direct statistical meaning as effective pairwise constraints absent in neural network weights.
6.5 Quantum biology and information thermodynamics
By 2026, the intersection of quantum mechanics and biology has gained significant theoretical and experimental traction, with MaxEnt serving as a quantitative foundation. Recent work has established mathematically that a quantum channel attains maximal entropy under a fixed mean energy constraint if and only if it is an "absolutely thermalizing channel" whose fixed output is the thermal state with the same mean energy. This provides a rigorous framework for understanding how absolute thermalization processes emerge under physically realistic energy constraints, with implications for private randomness distillation from energy-constrained quantum biological processes.
Concurrently, studies in information thermodynamics have applied principles of Lattice Field Theory (LFT) — traditionally reserved for high-energy particle physics — to logic-based biological models, allowing simulation of complex processes like mammalian cortical development with unprecedented fidelity by leveraging the computational boundedness of observers to bypass exponential complexity growth.
7. Limitations and pitfalls
7.1 The paradox of life and non-equilibrium dynamics
In 1943, Erwin Schrödinger articulated the thermodynamic paradox of biology in What is Life?: according to the Second Law, reaching maximum entropy is functionally synonymous with thermodynamic equilibrium — death.26 To survive, a biological system must maintain an ordered state far from equilibrium, acting as an open system continuously exchanging energy and matter with its environment. Therefore, directly applying static MaxEnt to whole organisms fundamentally misrepresents their nature as continuously driven, open, non-equilibrium dissipative structures. The interpretive claim must always be: MaxEnt describes the modeller's least-structured inference given measured constraints, not the organism's thermodynamic objective.
7.2 Identifiability and underspecification
MaxEnt returns the least-structured distribution consistent with given constraints, but if constraints are too weak, many biologically distinct mechanisms collapse into the same MaxEnt fit. Conversely, if too many noisy constraints are imposed, the model may overfit the empirical moments. This is a central modelling trade-off: constraint choice is the modelling choice. MaxEnt is not parameter-free; the scientist chooses the constraint set \(\{f_k\}\), encoding the inductive biases of the analysis. Shore–Johnson provides consistency axioms conditional on the modelling set-up, but the axioms do not validate that a constraint set is biologically sufficient.4
7.3 Effective vs. mechanistic couplings
In network and sequence models, pairwise couplings frequently summarise indirect effects. In proteins, couplings are useful for contact prediction but remain confounded by phylogeny; in gene expression, couplings may reflect shared transcription factors, cell cycle state, or technical artefacts rather than direct regulatory interactions. Causal interpretation requires perturbational experiments, time-series data, or multi-omic integration — additional steps not provided by MaxEnt modelling alone.
7.4 Data requirements and computational costs
High-dimensional MaxEnt fits require large datasets and careful regularisation. Sequence Potts models need deep MSAs; gene expression Ising models need many cells and stable covariance estimation. In practice, the partition function \(Z\) is intractable for large systems, necessitating approximate inference methods (pseudolikelihood, MCMC) whose statistical guarantees depend on dataset depth and quality. Technical noise sources — dropouts in scRNA-seq, phylogenetic biases in MSAs, batch effects — can corrupt the empirical moments that MaxEnt uses as constraints.
7.5 The failure of minimal and maximum entropy production principles
Attempts to govern biological systems via entropy production rate have encountered fundamental difficulties. The Minimal Entropy Production Principle (MinEPP), formulated by Prigogine for systems near equilibrium, does not apply to biological systems operating far from equilibrium, where highly ordered phenomena (oscillations, convective structures) spontaneously emerge with paradoxically high entropy production. The Maximum Entropy Production Principle (MEPP) — proposed as an alternative, with successful applications in planetary energy balances and fluid dynamics — is heavily critiqued in systems biology: living systems self-regulate to ensure survival, not to unconditionally maximise energy dissipation, and many dissipative biological structures show a measurable decrease in entropy production rate past critical developmental points. The lack of universality of MEPP remains a contentious topic in non-equilibrium thermodynamics.
7.6 Misinterpretation: "biology maximises entropy"
A persistent misreading in biological inference papers interprets MaxEnt as a biological teleology — as though the organism is itself performing the maximisation. In the Jaynes framework, the maximisation is performed by the modeller to select a distribution under uncertainty, not by the system. This distinction is critical: MaxEnt inference is an epistemological operation, not a claim about the dynamical objective function of the biological system under study.
References
1. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal.
2. Jaynes, E. T. (1957). Information theory and statistical mechanics I & II. Physical Review.
3. Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics.
4. Shore, J. E. & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory.
5. Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. Annals of Probability.
6. Csiszár, I. (2006). A simple proof of Sanov's theorem.
7. Dembo, A. & Zeitouni, O. (2010). Large Deviations Techniques and Applications (2nd ed.).
8. Caticha, A. (2010/2021). Entropic inference and the foundations of physics.
9. Pressé, S. et al. (2013). Principles of maximum entropy and maximum caliber in statistical physics. Reviews of Modern Physics.
10. Uffink, J. (1995). Can the maximum entropy principle be explained as a consistency requirement? Studies in History and Philosophy of Modern Physics.
11. Umegaki, H. (1962). Conditional expectation in an operator algebra IV. Entropy and information. Kodai Mathematical Seminar Reports.
12. Lindblad, G. (1974). Expectations and entropy inequalities for finite quantum systems. Communications in Mathematical Physics.
13. Petz, D. (2003). Monotonicity of quantum relative entropy revisited.
14. Weis, S. (2012–2016). Continuity and discontinuity of quantum maximum-entropy inference.
15. Tian, J. (2025). Stability of maximum-entropy inference in finite dimensions.
16. Dixit, P. D. (2013). Quantifying extrinsic noise in gene expression using the maximum entropy framework. Biophysical Journal.
17. Dixit, P. D. et al. (2020). MERIDIAN: Maximum entropy inference of signalling heterogeneity. Cell Systems.
18. Lezon, T. R. et al. (2006). Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns. PNAS.
19. Sarra, C. et al. (2025). Maximum entropy models for single-cell gene expression patterns. Physical Review E.
20. Yeo, G. & Burge, C. B. (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology.
21. Morcos, F. et al. (2011). Direct-coupling analysis of residue coevolution captures native contacts across many protein families. PNAS.
22. Ekeberg, M. et al. (2013). Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Physical Review E.
23. De Martino, A. & De Martino, D. (2018). An introduction to the maximum entropy approach and its application to inference problems in biology. Heliyon.
24. Lupo, U. et al. (2022). Accurate protein structure prediction via direct-coupling analysis and transformer models. Nature Communications.
25. Khatri, K. et al. (2025). Comparing Potts models and MSA transformers: Higher-order statistics and phylogenetic effects.
26. Schrödinger, E. (1943). What is Life? Cambridge University Press.