Thesis

Author: Ayush Gautam

01 Preface

This document states the worldview that guides our work and the systems we build at Xeon Labs.

02 Data is undervalued

Data has proven to be among the most crucial and complex components of modern deep learning systems to get right and to scale. Yet engineers tend to set a low threshold for what passes as 'good' data. The long-term effect is that data labelling is regarded as a chore, dismissed for its 'underwhelming' nature, or treated as a task that should simply be automated away.

We believe that glossing over the data labelling process is counterproductive to the broader conversation about model performance. The same models that can draft an email or a college paper are also the foundation of systems that provide decision assistance in settings where Artificial Intelligence (AI) matters more for providing an edge than for providing a particular capability (e.g., in national security, policing and medical emergencies).

03 When AI stops being about shallow benchmarks

Most contemporary AI systems are evaluated on capability alone: what the model can do and how well it performs. With the world so fixated on the notion of Artificial General Intelligence (AGI), engineers have placed the applications of AI that truly matter on the back burner, especially efforts that focus on national security.

These so-called 'high-stakes' domains are far less forgiving. What matters here is not whether a system can act but whether it can: (i) justify its actions; (ii) adapt when its assumptions fail; and (iii) remain intelligible to human operators both during and after a decision has been made. These properties cannot be patched on after training - they are not mere add-ons or trivial engineering efforts.

04 Graphs in the world

The world is not composed of independent facts. It is an integrated network of entities, relationships, causes and effects unfolding through time. Human reasoning mirrors this structure almost identically: we do not reason in flat sequences but relationally, constantly updating beliefs based on observations from our environment that propagate downstream. Graphs are, therefore, the most faithful representation we possess of both reality and human cognition. They allow us to encode dependency, hierarchy and causality in a way that sequential representations cannot.

"What you want is the inner thought monologue of your brain as you're doing problem solving ... then AGI is here."

— Andrej Karpathy (2025)

While we do not expect this alone to lead us to AGI, we do expect the process of capturing structured reasoning to expedite the discovery of significantly more capable domain-specific intelligence.

Expert reasoning in operational environments rarely materialises in isolation. Multiple operators must converge on shared subgoals, their individual 'inner monologues', as mapped out on workflow graphs, intersecting and diverging as the situation evolves. Graph structures naturally represent this collaborative cognition: workflows are free to branch, merge and reference one another's nodes on the path to a common goal in a way that traditional approaches cannot capture.
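To make the branch-and-merge idea concrete, here is a minimal sketch of a workflow graph whose nodes are units of reasoning. The node fields, labels and scenario are illustrative assumptions, not Xeon Labs' actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A single unit of reasoning: a thought, decision or action.

    Field names and node labels here are illustrative, not a real schema.
    """
    label: str
    kind: str                        # "thought" | "decision" | "action"
    children: list = field(default_factory=list)

    def link(self, *nodes):
        self.children.extend(nodes)
        return self

# Two operators branch from a shared subgoal and merge on a common goal.
shared = Node("secure perimeter", "decision")
op_a = Node("sweep north sector", "action")
op_b = Node("hold south gate", "action")
goal = Node("area cleared", "thought")

shared.link(op_a, op_b)   # branch: one intent, two workflows
op_a.link(goal)
op_b.link(goal)           # merge: both paths reference the same node

def paths(node, prefix=()):
    """Enumerate every reasoning path through the graph."""
    prefix = prefix + (node.label,)
    if not node.children:
        yield prefix
    for child in node.children:
        yield from paths(child, prefix)

for p in paths(shared):
    print(" -> ".join(p))
```

Because `goal` is a single shared node rather than two copies, both operators' workflows literally intersect at it, which is exactly what a flat sequence cannot express.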

05 How humans actually reason

Expert reasoning, particularly in operational environments, exhibits several consistent properties: (i) it is causal, in that actions are chosen for their expected effects; (ii) it is hierarchical, with high-level intent decomposing into increasingly granular sub-workflows that mirror the varying complexity of real-world goals; (iii) it is selective, allocating attention unevenly based on novelty, risk and consequence; and (iv) it is temporal, evolving over time.

Across operators, chunks of reasoning are shared, so that reasoning itself becomes modular (i.e., built up from smaller 'bytes of intelligence', or nodes). This allows the foundations of multiple operators' intuition to be crafted from a series of simpler building blocks. Representing this form of reasoning as a graph is, of course, a choice, but we opine that it is a choice so obvious it becomes almost undeniable.

06 Intent behind action, thought behind intent

Thoughts give rise to actions and actions alter the physical world. The resulting changes feed back into subsequent beliefs. This closed loop is the core of intelligence, yet it is rarely represented explicitly in modern AI training data pipelines.

By modelling thoughts, decisions, actions and environmental change within a single world graph model, we can create explicit links between internal reasoning (the inner monologue) and external consequence (the physical world). This enables causal explanation at a level that is otherwise unattainable, particularly in nuanced, dynamic environments.
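The closed loop described above can be sketched in a few lines. The toy world, the action rule and the trace structure are all assumptions made for illustration; the point is only that thought, action and consequence become explicitly linked records rather than implicit context:

```python
# A minimal closed loop: belief -> action -> world change -> updated belief.
world = {"door": "closed"}    # toy external world state
belief = {"door": "closed"}   # the agent's internal picture of that world

def act(intent):
    """An action chosen for its expected effect on the world."""
    if intent == "open door":
        world["door"] = "open"
    return dict(world)

def observe():
    """Feed the changed world back into subsequent beliefs."""
    belief.update(world)
    return belief

trace = []  # explicit links between reasoning and consequence
thought = "the door blocks the route"
action = "open door"
consequence = act(action)
trace.append({"thought": thought, "action": action, "world_after": consequence})
observe()

assert belief["door"] == "open"   # the loop has closed
```

Each entry in `trace` ties an internal thought to the external change it produced, which is the link modern training pipelines rarely record.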

07 Generative World States

A "mission" in our framework at Xeon Labs is not simply a perfected dataset; it is a temporal sequence of aligned graphs capturing how expert cognition evolved in the physical world. Each workflow node (i.e., a thought, decision or action) is embedded directly into the environmental graph at the moment it occurred. This means we can trace, precisely, how a decision changed spatial relationships in the real world.

This, in turn, yields training data in which causality is explicit. The model does not learn correlations between text and outcomes; instead, it learns the structural transformation rules: which graph operations (expert actions) produce which graph evolutions (world state changes).

An operational example: we can now simulate counterfactuals. The question "what if I had taken this route instead?" becomes a graph-generation problem. The system produces not a text description but an evolved world state - a queryable, verifiable and causally grounded answer. This is fundamentally different from asking an LLM to imagine outcomes.
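A small sketch of what "counterfactual as graph operation" means in practice. The state keys, action names and transformation rules below are invented for illustration; what matters is that the answer to "what if?" is an evolved state you can query, not free text:

```python
# Counterfactual simulation: apply an alternative action to the same
# initial world state and inspect the evolved state directly.

def evolve(state, action):
    """Structural transformation rule: action -> world state change.

    Rules here are toy assumptions standing in for learned graph operations.
    """
    state = dict(state)  # never mutate the shared initial state
    if action == "route_north":
        state["position"] = "checkpoint_n"
        state["bridge"] = "crossed"
    elif action == "route_south":
        state["position"] = "checkpoint_s"
    return state

initial = {"position": "base", "bridge": "intact"}
actual = evolve(initial, "route_south")            # what happened
counterfactual = evolve(initial, "route_north")    # "what if?"

# The answer is a queryable state, not an imagined description.
print(counterfactual["position"], counterfactual["bridge"])
```

Because `evolve` is deterministic over the state, the counterfactual is verifiable: anyone re-running the same operation on the same initial graph gets the same evolved world.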

08 The Thought Index

Not all reasoning requires equal depth. A driver seeing a red light and stopping requires no elaborate "why" beyond the observed signal. By quantifying the novelty, criticality and salience of each decision point (node) through a 'thought index', only the key aspects are labelled, and systems can then allocate computational resources adaptively between faster and slower models. The graph architecture allows us to do exactly this, using natural properties that already live within the graph to refine the 'thought index'.
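One way such an index could work is a weighted score over the three named properties, with a threshold that routes each node to a cheap or an expensive model. The weights, threshold and model names below are illustrative assumptions, not the actual scoring scheme:

```python
def thought_index(novelty, criticality, salience, weights=(0.4, 0.4, 0.2)):
    """Score a decision node in [0, 1] from graph-native properties.

    Inputs are assumed normalised to [0, 1]; the weighting is illustrative.
    """
    w_n, w_c, w_s = weights
    return w_n * novelty + w_c * criticality + w_s * salience

def route(score, threshold=0.5):
    """Allocate compute adaptively: cheap model below threshold, deep above."""
    return "slow_deliberate_model" if score >= threshold else "fast_reflex_model"

# The red-light case needs no elaborate "why"; an anomalous one does.
red_light = thought_index(novelty=0.05, criticality=0.3, salience=0.2)
anomaly = thought_index(novelty=0.9, criticality=0.95, salience=0.8)

print(route(red_light))   # routine node -> fast path
print(route(anomaly))     # novel, critical node -> deep path
```

The natural graph properties mentioned above (e.g., how rarely a node's neighbourhood has been seen, or how many downstream nodes depend on it) are plausible sources for the `novelty` and `criticality` inputs.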

09 Determinism and Failure

We can consider contemporary LLMs as associative memory systems through a simple example: one learning the sequence 2, 4, 6, 8, 10, 12, 14... and so forth. A human understands the rule: take the given number and add two. An LLM instead memorises transitions: 2→4, 4→6, 6→8... and so on. It is this very distinction that determines what happens when the model encounters inputs outside its training distribution.[2]
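The distinction can be sketched directly. Both approaches agree on every input seen in training, but only the rule is defined beyond it:

```python
# Associative memory vs rule: both reproduce 2, 4, 6, ... on seen inputs,
# but only the rule remains defined outside the training distribution.

training_sequence = [2, 4, 6, 8, 10, 12, 14]

# LLM-style: memorise the transitions observed during training.
transitions = {a: b for a, b in zip(training_sequence, training_sequence[1:])}

def associative_next(n):
    return transitions.get(n)        # None for out-of-distribution input

def rule_next(n):
    return n + 2                     # the human-understood rule generalises

print(associative_next(8), rule_next(8))        # in-distribution: agree
print(associative_next(1000), rule_next(1000))  # OOD: memory has no answer
```

In this toy, the memoriser at least returns `None` out of distribution; a generative model forced to emit a token has no such honest failure mode, which is the point of the paragraph that follows.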

For associative memory systems, out-of-distribution inputs produce undefined outputs. The model has learned correlations, not causation. When forced to respond, it hallucinates, which in critical scenarios could be fatal. This is an inevitable consequence of the architecture.[1]

Graph-based models behave differently: they encode relationships rather than surface patterns, so their predictions are reproducible and auditable. This does not make them universally superior, but it does make their failures legible, and that is most certainly a step in the right direction.[2]

Generative models, nonetheless, excel where creativity and approximation suffice; they fail where precision, repeatability and accountability matter. In high-stakes systems, the question is not whether a model is powerful but whether its uncertainty can be controlled.

10 Why high-stakes domains are a priority

As noted above, high-stakes industries such as national security are often placed on the back burner by engineers. Yet these sectors deserve far greater attention: they drive the nation forward through defense innovation and tackle fundamental challenges that lead to breakthrough advances rather than incremental consumer product updates. High-stakes domains are not purely profit-focused; they prioritise solving the harder problems that would otherwise be ignored.[3]

11 Does data have to be scalable?

The current AI market promotes a utopian plug-and-play narrative, promising instant intelligence at the cost of depth, reliability and interpretability. Xeon Labs' products embrace a slower, security-first process that prioritises data integrity and robustness and captures the real reasoning behind an expert's decisions. Where lives and critical infrastructure are concerned, procedural rigour takes precedence over operational efficiency. Only once the foundations are concrete will the data be scalable.

12 Levels of Abstraction

Traditional command systems show entities, such as people, as dots on a map: position, status and capability. Decisions are made from this bird's-eye view based on such location metrics. This is, at its core, over-abstracted: it collapses an operator's reasoning into simple state variables, erasing the reasoning behind their actions (such as how Person A got from position X to position Y).

Our approach essentially 'fills in the blanks' by representing operators as super-nodes containing nested workflow graphs. The orchestrator doesn't just see that "Person A is at position Y" but can query the reasoning inside that node: the observations Person A made at position X, the assumptions Person A is operating under and the subgoals Person A is pursuing at that time.
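A minimal sketch of a super-node, assuming an invented schema: the map-level state sits alongside a nested workflow that the orchestrator can query instead of inferring intent from coordinates alone:

```python
# An operator as a super-node: map-level state plus a nested workflow
# graph. All field names and contents here are illustrative assumptions.

person_a = {
    "position": "Y",
    "workflow": {                      # nested reasoning, not just a dot
        "observations": ["road at X blocked"],
        "assumptions": ["bridge still holds"],
        "subgoal": "reach rally point via Y",
    },
}

def why(super_node):
    """Zoom in: recover the reasoning behind the map-level state."""
    w = super_node["workflow"]
    return f"at {super_node['position']} because subgoal '{w['subgoal']}'"

print(why(person_a))
```

Zooming out is just reading `position`; zooming in is querying `workflow`, and both views reference the same node, so the causal link between levels is preserved.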

This enables coordination grounded in actual cognition, not inferred from coordinates. The orchestrator is free to zoom out to strategic intent or zoom in to moment-by-moment decisions without losing the causal links between levels. Reducing the levels of abstraction that current systems impose when interacting with the real world is vital for high-stakes intelligence.

13 Closing Statement

High-stakes AI requires different foundations - ones that simply scaling, whether of language or vision, cannot achieve alone. There needs to be infrastructure to bridge the two. This thesis accompanies ongoing research, deployments and software demonstrations at Xeon Labs in pursuit of this worldview.

14 Appendix

  1. Faith and Fate: Limits of Transformers on Compositionality (Dziri et al., 2023)
  2. GNN & LLM (NVIDIA)
  3. The Technological Republic (Alex Karp, 2025)