Introduction
Even though it feels like you’re learning when you talk to LLMs, you might be selling yourself short: you’re liable to self-deception, thinking that you have a strong grasp of a topic when you do not. Nevertheless, I also believe that there are certain scenarios where AI can be used without compromising your ability to think. Exactly where the distinction occurs can vary from person to person, because everyone has different learning styles.
In my view, there are three regimes of learning: we begin by building familiarity and intuition, progress toward spotting patterns and making connections, and finally achieve robust knowledge retention. I’m going to use the idea of a phase transition from statistical mechanics as a high-level analogy for learning; these three regimes are really two phases, a subcritical phase (building intuition) and a supercritical phase (robust knowledge), separated by a phase transition. It’s in this phase transition — the critical point — where connectivity between learned ideas becomes possible.1
Throughout this post, I’m going to use two examples of learning: addition and the covariance matrix. The first one is based on me teaching my five-year-old daughter simple arithmetic, and the second reflects my own learning journey from an undergraduate physics major to a tenure-track researcher. In the example of learning about single-digit addition, you can imagine “summarization via an LLM” is like “asking dad for the answer” or “asking dad why”. In the second example, you can just imagine asking ChatGPT for intuitions about covariance or PCA.
Subcriticality: The Early Phase of Learning
At the earliest stage of learning, concepts don’t really leave a lasting impression. A number like 5 is just a symbol, and a word like covariance is simply some letters on a page. It is painfully difficult to connect these concepts to other ideas that you’re supposed to be learning at the same time, e.g., the number 4 is smaller and the number 6 is larger than 5. Or maybe that your sample covariance matrix can be diagonalized to find eigenvectors and eigenvalues. And you could maybe remember these facts and procedures. But if somebody described Principal Components Analysis using some other choice of words, then you’d have no idea that they were describing the same ideas!
The problem is that in this subcritical phase, concepts easily fizzle out. They’re totally disconnected from your intuition, because that intuition needs to be formed via related concepts. If you have no intuition for all of the properties of the number 5 — that it is odd, that it is half of ten, that it is three less than eight, etc., then it might as well just be any random symbol on a page. You see symbols, but not the underlying structure of these numbers, probably because you simply haven’t spent enough time staring at them and thinking about their relationships.
At this stage, it might be easiest to just learn via rote memorization. (This varies by person — I have horrible memory, so I hate this phase of learning.) Back in undergrad, I remember buying the PDF for Learn Python the Hard Way, a textbook for learning the Python programming language. I printed out each of the pages, so that I would have to manually type in code, rather than copy-paste it! This helped me build up muscle memory and think about the Python code as I typed it in.
Lots of folks have found that spaced repetition is the best way to improve learning at this stage (e.g., Gwern, Anki cards, HN discussions). At its core, spaced repetition is just testing how well you’ve remembered things over progressively longer time periods — the intervals get shorter if you forget something, and get longer if you are able to continue recalling it.
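To make the scheduling idea concrete, here is a toy sketch of that interval update in Python. The multipliers are made up for illustration; real systems like Anki track a per-card “ease” factor and use SM-2-style heuristics rather than these exact numbers.

```python
from dataclasses import dataclass

@dataclass
class Card:
    prompt: str
    interval_days: float = 1.0  # wait this long before the next review

def review(card: Card, recalled: bool) -> Card:
    """Toy spaced-repetition update: lengthen the interval on success, reset it on failure."""
    if recalled:
        card.interval_days *= 2.5   # remembered it: next review comes later
    else:
        card.interval_days = 1.0    # forgot it: go back to a short interval
    return card

card = Card("What is the sample covariance matrix?")
card = review(card, recalled=True)   # interval grows to 2.5 days
card = review(card, recalled=False)  # interval resets to 1 day
```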
While there’s certainly some benefit to assembling the spaced repetition system (i.e., constructing a deck of Anki cards), I think that the most valuable technique is to regularly use it. That is, it’s okay if somebody else procures the answer for you. It’s okay if the definition of sample covariance matrix comes from a textbook or Wikipedia or Claude. You’re still at the subcritical phase, and you’ll need more exposure to the concepts before things start to click.
However, this phase of learning is still meant to be somewhat difficult. You should expect friction! Recall that it’s easy to nod your head in agreement when you read something that makes sense, but it’s far more difficult — and valuable — to re-derive it yourself.2 Once you become familiar enough with a concept, it becomes much more rewarding to test that knowledge to see if you’ve reached the critical point.
I think that current LLMs are highly useful for testing your knowledge. For example, to assist my own learning, I frequently use a “Socratic” prompt with Gemini (although it’s out of date, and note that it was written with the assistance of Gemini 2.5 Pro):
Purpose and Goals:
- Help users refine their understanding of a chosen topic (Topic X).
- Facilitate learning through a Socratic method, prompting users to explain concepts and asking probing questions.
- Identify and address misunderstandings by testing the user’s conceptual knowledge.
- Focus on sharpening the user’s intuition and conceptual understanding rather than rote memorization. […]
The entire prompt can be found here.
This method of using LLMs should not make you reliant on AI. It does not outsource your thinking. Like the How To Solve It method devised by the great mathematician George Pólya, and the similarly named SolveIt platform run by the great educator Jeremy Howard, my aim here is to demonstrate how to use LLMs as a personalized tool for testing your understanding. LLMs are now powerful enough that, for most topics, they can spot holes in your thinking; however, given their tendencies toward sycophancy, LLMs must be prompted carefully.
Supercriticality: The Late Phase of Learning
At some point, the dots really start to connect. Beyond the critical point, all your knowledge is linked together. You have intuitions for concepts that you may not have heard of. You’re so comfortable with addition that you also intuitively grasp concepts like commutativity (1 + 3 = 3 + 1) or inverses (adding one to three makes four, so taking away one from four makes three), even though you may not have heard of (or recall) the jargon from algebra or group theory. In any event, you have a robust conceptual understanding, and all that remains is to give names to these well-understood concepts.
In this phase, learning should feel easy and fun. There are likely still gaps in your knowledge, but it’s quite straightforward to fill them in. Your knowledge is robust even when you’re missing certain pieces of information because you’ve trodden all around that terra incognita, so new knowledge doesn’t dramatically upend your understanding.
My late father had a PhD in chemistry. He loved to personify everything and attach feelings to them: oxygen wants to gain two electrons, the molybdenum likes to react when this catalyst is present, etc. We develop a similar feel for concepts when our understanding passes the critical point. And this intuition is vital for pursuing novel research ideas and making scientific discoveries.
Or, you can plausibly extend your knowledge to other domains because you have crystallized the relevant intuitions. For example, in the excellent pedagogical text Data analysis recipes: Fitting a model to data, David Hogg et al. write:
The inverse covariance matrix appears in the construction of the \(\chi^2\) objective function like a linear “metric” for the data space: It is used to turn an \(N\)-dimensional vector displacement into a scalar squared distance between the observed values and the values predicted by the model. This distance can then be minimized. This idea—that the covariance matrix is a metric—is sometimes useful for thinking about statistics as a physicist.
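To make the “covariance as a metric” idea concrete, here is a tiny numerical sketch (my own toy example, not from Hogg et al.): the inverse covariance turns a residual vector into a scalar squared distance.

```python
import numpy as np

def chi_squared(residual, covariance):
    """Scalar squared distance r^T C^{-1} r, using C^{-1} as a metric on data space."""
    # Solve C x = r rather than explicitly inverting C (more numerically stable).
    return float(residual @ np.linalg.solve(covariance, residual))

# Two correlated measurements minus their model predictions:
residual = np.array([0.3, -0.1])
covariance = np.array([[0.04, 0.01],
                       [0.01, 0.09]])
print(chi_squared(residual, covariance))  # the chi^2 contribution of this data point
```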
While physicists can be guilty of thinking that they can leap into other fields (relevant XKCD), they often do have a strong grasp of mathematics and physical intuition. This combination is invaluable at the supercritical stage: the language of mathematics often translates well to other disciplines, while the intuition from physics can be helpful for predicting dynamics or phenomena given those mathematical laws.
In the supercritical phase of learning, LLMs can be helpful. They are pretty good at identifying analogous concepts in alternate fields that you might not know about, acting as both the proverbial water cooler and the multidisciplinary scientists that congregate around it. LLMs can also be used to quickly refresh ideas that you have briefly forgotten, like going back to reference your old textbooks to check some relevant information. However, this can also be dangerous if you think you’re past the critical point — but in reality you aren’t (often, because your confidence is inflated by talking too much to LLMs).
The pitfalls of using LLMs to help you learn
I’m reminded of one salient point from Neel Nanda’s delightful essays on research taste. While not the main focus of those pieces, he explains (emphasis mine):
Junior researchers often get stuck in the early stages of a project and don’t know what to do next. In my opinion this is because they think they are in the understanding stage, but are actually in the exploration stage.
In other words, junior researchers can sometimes believe that they have crystallized their understanding of a topic (i.e. supercritical), when in reality they are still in an earlier stage of learning (subcritical)! This is particularly worrisome when LLMs can summarize topics in easily understandable ways, deceiving junior researchers into feeling like they confidently understand a topic because they’ve understood the simple LLM summarization.
LLMs are truly a double-edged sword for learning.
On one hand, they can be helpful by testing your knowledge in new ways (e.g. the Socratic approach I mentioned above; I also have another prompt that quizzes me with PhD qualifying exam-like questions). LLMs can help you get unstuck when your canonical text doesn’t explain something in a manner intelligible to you. They can get rid of irrelevant roadblocks (e.g., you’re learning about neural network optimization, but stuck in CUDA dependencies hell). LLMs can spin up highly individualized games that help you learn concepts in a way that’s much more fun than doing practice problems.3
On the other hand, LLMs can leave you with a completely shallow understanding of a topic — while you feel like you totally understand it all! This is compounded by the fact that LLMs will tend toward positivity. Do not let your confidence be bolstered by hollow AI validation. Be vigilant and skeptical, because uncritical use of AI tools will absolutely inhibit your learning.
The pitfalls of using LLMs for summarization
One can imagine a world where AI widens the gap between those who practice writing and those who do not. This is problematic because — as all experienced researchers know — writing is thinking. If we don’t practice writing, then we shut the door on an entire mode of thinking.
But what about reading? Sometimes it feels like a slog to read through long articles, especially technical, information-dense academic papers. Why not just get it all distilled into a single paragraph? Or turn it into a podcast-like audio overview? As I wrote on social media, using AI to summarize information is also a way to outsource your thinking.
When we read or write, we are constantly re-organizing our understanding of topics. This happened at least three times for the very blog post you’re reading; the title and content have evolved dramatically over the two weeks since I began writing it.
I contend that summarization is thinking. When I am reading about a new topic, I know that I’ve understood it only when I can accurately and concisely summarize it.4 Robust summarization is only possible when you can grasp the big picture intuitions and connect them to the minute details. That mental organization is a part of the learning process. When an LLM does this organization on your behalf, then your mental muscles atrophy.
-
But don’t worry too much about the details of this percolation theory analogy; like all analogies, it breaks down under scrutiny. I hope you’re not distracted by wondering about the order parameter or anything like that. ↩
-
I sometimes call this the generator–discriminator asymmetry. The term is commonly used to describe how it’s far easier for a GAN’s discriminator to tell real outputs from generated ones than it is for the generator to produce new outputs that fool the discriminator. It also applies to human learners: discriminating right from wrong information is easier than correctly deducing something from scratch! (Side bar to my footnote: this also gets at why multiple choice questions are bad for evaluating LLMs!) ↩
-
Gemini 3 is stunningly good at this. In about 60 seconds, it created a simple HTML game for my daughter to learn simple combinatorics, based on my prompt to center the game around making permutations and combinations of N ice cream flavors given P scoops of ice cream on a cone. She loved it! ↩
-
It is also incredibly easy to discern which junior researchers have truly understood a topic versus those who haven’t by asking them to summarize a topic. ↩
Galaxy Morphologies in Euclid
Euclid1 is now delivering crisp, space-based imaging for millions of galaxies. Among the many scientific results in their Q1 (“Quick Data Release 1”) is a citizen science analysis of galaxy morphologies led by Mike Walmsley et al. This paper presents not only GalaxyZoo (GZ) volunteer classifications according to a decision tree — i.e., Is this galaxy featured? Does it have spiral arms? How many? — but also a foundation model (Zoobot) fine-tuned for predicting these decision classes on the Euclid galaxy dataset. You can check out the Zoobot v2.0 blog post and download it via GitHub.
But Zoobot follows a supervised approach: we’ve delineated the taxonomy into which galaxies must fit. By definition, this CNN learns representations that enable it to accurately describe galaxies according to the GZ categories. Can we get a neural network model to represent galaxies outside of this taxonomy?
Yes! Our first result from this paper is to present a Masked Autoencoder (MAE) that learns galaxy imaging via self-supervised representations. Our MAE chops up images into 8×8 patches, and consists of a custom vision transformer (ViT) encoder with ~30M parameters, and a three-layer decoder. To get a sense of how it works, I highly recommend you checking out the interactive demo built by Mike: https://huggingface.co/spaces/mwalmsley/euclid_masked_autoencoder
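If you want a feel for the masking step, here is a minimal numpy sketch of randomly hiding image patches. It only illustrates the masking itself; the actual model encodes the visible patches with the ViT and trains the small decoder to reconstruct the hidden ones.

```python
import numpy as np

def random_patch_mask(image, patch_size=8, mask_fraction=0.9, seed=None):
    """Zero out a random fraction of non-overlapping patches, MAE-style."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    n_rows, n_cols = h // patch_size, w // patch_size
    n_patches = n_rows * n_cols

    masked_ids = rng.choice(n_patches, size=int(mask_fraction * n_patches), replace=False)
    masked = image.copy()
    for pid in masked_ids:
        r, c = divmod(pid, n_cols)
        masked[r * patch_size:(r + 1) * patch_size,
               c * patch_size:(c + 1) * patch_size] = 0.0
    return masked, masked_ids

cutout = np.random.rand(64, 64)                  # stand-in for a Euclid galaxy cutout
masked_cutout, hidden = random_patch_mask(cutout)
print(f"{len(hidden)} of {(64 // 8) ** 2} patches hidden")
```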
Even when you remove 90% of the pixels of a Euclid image, the MAE can learn to reconstruct the rest of the image extraordinarily well. Honestly, it does far better than any human can. And not only does it work for galaxy images, but the MAE also learns to reconstruct bright stars and other objects in the Euclid RR2 dataset.
Principal Components of Galaxy Image Embeddings
Okay, so we have trained models, which means that we can encode Euclid images into Zoobot (supervised; d=640) and/or MAE (self-supervised; d=384) embeddings. How do we interpret these learned embedding vectors?
A good starting point is to use PCA (principal components analysis); the top PCs should summarize most of the variation in each dataset. It’s worth emphasizing that the supervised (Zoobot) and self-supervised (MAE) models are trained on different datasets: the Zoobot dataset comprises ~380k well-resolved galaxy images from Euclid, whereas the MAE dataset comprises >3M Euclid images of galaxies, stars, artifacts, etc. Thus, it is not possible to make an apples-to-apples comparison between these two datasets or their embeddings.
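In code, this kind of analysis is only a few lines. Here is a hedged sketch with scikit-learn and scipy; the file names and the GZ vote-fraction array are placeholders for whatever embeddings and labels you have on hand.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import spearmanr

embeddings = np.load("zoobot_embeddings.npy")            # (n_galaxies, d) -- hypothetical file
gz_smooth_fraction = np.load("gz_smooth_fraction.npy")   # (n_galaxies,) -- hypothetical file

pca = PCA(n_components=5)
pcs = pca.fit_transform(embeddings)                      # PC scores for each galaxy
print(pca.explained_variance_ratio_)                     # variance captured by each PC

# How well does each PC track the first GZ decision-tree question (smooth vs. featured)?
for k in range(5):
    rho, _ = spearmanr(pcs[:, k], gz_smooth_fraction)
    print(f"PC{k + 1}: Spearman rho = {rho:+.2f}")
```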

The figure above, copied from Figure 2 in the paper, displays top image examples for the first five PCs for Zoobot (left) and MAE (right) model embeddings. Some interesting notes:
- For the supervised Zoobot embeddings, our first few PCs are well-aligned with the first few nodes of the GZ decision tree.
- For example, the first PC mirrors the first GZ question of whether the galaxy is smooth or featured with Spearman r≈0.85.
- The next questions align with whether a featured galaxy has a disk that is seen edge-on, or has spiral arms, or has a prominent bulge, etc.
- Note that PCs can have both positive and negative coefficients, so the first PC with a very positive coefficient would characterize a very smooth (e.g. spheroidal) galaxy, while a very negative coefficient would characterize a strongly featured (e.g. spiral arms) galaxy!
- For the self-supervised MAE embeddings, the representations are totally different than before.
- In several of the top PCs, we find cosmic ray hits or other imaging artifacts.
- We think these dominate much of the MAE latent space because it’s fundamentally challenging to reconstruct images with imaging artifacts!
- Galaxies also appear in here, although individual PCs do not align nearly as strongly to the GZ categories.
PCA is nice because it rank-orders features by how much they explain the variance in the embedding vectors. But what if the features you want require non-linear combinations of embeddings? Or what if your original embeddings are noisy, so that each PC depends on all of the inputs? Either situation can leave you with uninterpretable features.
Sparse Autoencoders for Interpretability and Discovery
For this reason, we chose to use a sparse coding method, Matryoshka Sparse Autoencoders (SAEs), to discover features! They’re extremely simple: embeddings get fed into a single-layer encoder (with ReLU activation), wherein only a few neurons are allowed to be active.2 From these sparse activations, a single-layer decoder (i.e. a projection matrix) learns to reconstruct the original embeddings. Because the latent activations are sparse, the SAE must use only a few neurons to reconstruct each given input embedding, which results in more interpretable features. Possibly even monosemantic features — that is, instead of a many-to-many mapping between neuron activations and semantic concepts, we can use SAEs to recover a one-to-one mapping between activations and concepts.
Or so the story goes. Famously, Anthropic found a Golden Gate Bridge feature in Claude that activates on both text and images! But… while SAEs are sure to learn sparse, non-linear combinations in an overcomplete space, we don’t actually have mathematical guarantees that SAEs will find monosemantic or disentangled features. What does monosemanticity even really mean? Should galaxies with Sersic indices of 2.1 activate a different feature than galaxies with Sersic indices of 2.2? Indeed, there is significant evidence that SAEs do not fare as well as linear probes for already known features, leading some research teams to focus on other topics in mechanistic interpretability.
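Setting the philosophy aside for a moment, the mechanics are easy to write down. Below is a minimal top-k SAE sketch in PyTorch; it is not the Matryoshka/BatchTopK variant we actually trained (see the footnote), and the latent size and k are made-up numbers.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal sparse autoencoder: overcomplete latent space, only k active units per input."""

    def __init__(self, d_embed=640, d_latent=4096, k=16):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_latent)  # d_latent >> d_embed (overcomplete)
        self.decoder = nn.Linear(d_latent, d_embed)  # projection back to embedding space
        self.k = k

    def forward(self, x):
        acts = torch.relu(self.encoder(x))
        # Keep only the k largest activations per example; zero out everything else.
        topk = torch.topk(acts, self.k, dim=-1)
        sparse_acts = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse_acts), sparse_acts

sae = TopKSAE()
x = torch.randn(32, 640)                       # stand-in batch of Zoobot embeddings
recon, acts = sae(x)
loss = torch.mean((recon - x) ** 2)            # plus sparsity / auxiliary terms in practice
```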
Anyway, let’s just see what happens. Take a look at the figure above again, and now focus on the bottom panels. These now show the first five SAE features, ranked in order of how frequently they are activated. For the supervised example (on the lower left), we can see reasonably coherent/interpretable features: two-armed spirals, ringed galaxies, spheroidal galaxies, elliptical galaxies, and objects with tidal features, clumps, or companions. (This last one is the least monosemantic, but it’s intriguing because each of those features can be indicative of galaxy–galaxy interactions or mergers!) For the self-supervised MAE (on the lower right), we also see some consistency in SAE-extracted features. Huh!
We then quantify how well the PCA and SAE features align with GZ features, using the Spearman rank correlation coefficient I discussed earlier. Again, we shouldn’t compare between the supervised and self-supervised models, but we can now compare PCA and SAE features! And we find a clear winner: SAE features are typically more aligned with the GZ taxonomy!
Qualitatively, we also find that the SAE can surface interesting features. This is most evident in the features extracted from Zoobot embeddings, where we know the supervised training objective. For example, we find examples of ring galaxies or dust lanes in edge-on disk galaxies — visually clear signatures of coherent features that aren’t in the training objective. The MAE model is probably full of interesting SAE-extracted features, too, but some of them are definitely challenging to interpret.
Anyway, there’s much more to say, but at this point the blog post might be comparable in length to our workshop paper! Just go read the paper, or try it out using our code — I’d love to hear what you think!
-
Why do we italicize Euclid? Well, this observatory is also technically a spaceship, and all names of ships (including spaceships) should be italicized according to the MLA. ↩
-
We actually use BatchTopK sparsity, and also nest the SAE activations in “groups” that progressively expand the sparsity bottleneck (i.e., Matryoshka SAEs). We also imposed L1 sparsity and revived dead neurons with an auxiliary loss term. Note that SAEs also typically demand an overcomplete latent space. Each of these hyperparameters affects training and subsequent feature extraction; Charlie O’Neill and Christine Ye et al. looked into some of these SAE hyperparameter interactions in an earlier paper. ↩
Note: this post is a continuation of a previous introduction to GNNs in astrophysics. Special thanks to Christian Kragh Jespersen,1 who opened my eyes to the incredible power of GNNs for astrophysics! He also has several papers showing that graphs provide strong representations for galaxy merger trees (see here and follow-up here).
The galaxy–halo connection
In the ΛCDM cosmology, galaxies live in dark matter subhalos2 (see, e.g., the review by Wechsler & Tinker). While dark matter dominates the mass content of the Universe, we can only directly observe the luminous signatures from the galaxies that reside within those halos. Our goal is to determine whether galaxy properties, such as the total stellar mass, can be predicted purely from dark matter simulations.
Roughly 20 years ago, a technique called “subhalo abundance matching” was proposed to solve this problem. The goal is to connect simulated dark matter subhalos to galaxy populations based on the latter’s stellar masses (or luminosities). By rank-ordering the subhalo masses and assigning them to rank-ordered galaxy stellar masses, abundance matching imposes a monotonic relationship between the two populations.
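As a sketch, the core of abundance matching is just a rank-order assignment. The snippet below assumes the subhalo and galaxy samples cover the same volume (so their number densities match one-to-one) and ignores scatter, which real implementations add on top.

```python
import numpy as np

def abundance_match(subhalo_mass, galaxy_stellar_mass):
    """Assign stellar masses to subhalos by matching their rank orders."""
    order = np.argsort(subhalo_mass)[::-1]                 # most massive subhalo first
    sorted_mstar = np.sort(galaxy_stellar_mass)[::-1]      # most massive galaxy first

    assigned = np.empty_like(sorted_mstar)
    assigned[order] = sorted_mstar                         # i-th ranked halo gets i-th ranked galaxy
    return assigned

subhalo_mass = np.array([1e12, 3e11, 8e12, 5e11])          # toy halo masses [Msun]
galaxy_mstar = np.array([6e9, 2e10, 1e11, 9e9])            # toy stellar masses [Msun]
print(abundance_match(subhalo_mass, galaxy_mstar))         # monotonic halo-to-galaxy assignment
```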
This simple technique is capable of connecting galaxies to their host halos. However, it also assumes that galaxy evolution is not dictated by anything but the dark matter halo properties. Therefore, abundance matching fails to account for each galaxy’s large-scale environment!
To the cosmic web and beyond
We’ve known for a long time that galaxy properties depend on their surroundings (see, e.g., Dressler’s famous 1980 paper). The exact nature of how this plays out is uncertain; does galaxy environment induce different mass accretion or merger rates? Do overdense environments superheat or exhaust the cool gas needed to fuel star formation? Or do large-scale tidal torques alter galaxy properties over cosmic timescales? We don’t really know the answer!3 But empirically, we do know that the galaxy–halo connection also varies with environment.

Overdensity
Some attempts have been made to account for galaxy environment. For example, “overdensity” is a common parameterization of the mass density on large scales (see, e.g., Blanton et al. 2006). Whereas a typical galaxy’s gravitational influence extends to a few hundred kpc, the overdensity can quantify the average density out to many Mpc. However, by taking a simple average over all mass in this spherical volume, the overdensity parameter is not sensitive to local variations in mass.
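Here is a hedged sketch of how you might compute such an overdensity with a k-d tree: a toy version that ignores periodic boundary conditions and any smoothing kernel, not the exact estimator used in the papers cited here.

```python
import numpy as np
from scipy.spatial import cKDTree

def spherical_overdensity(positions, masses, centers, radius):
    """delta = rho_local / rho_mean - 1, averaged in spheres of the given radius."""
    box_volume = np.ptp(positions, axis=0).prod()       # crude estimate of the box volume
    rho_mean = masses.sum() / box_volume

    tree = cKDTree(positions)
    neighbors = tree.query_ball_point(centers, r=radius)
    sphere_volume = 4.0 / 3.0 * np.pi * radius ** 3

    rho_local = np.array([masses[idx].sum() for idx in neighbors]) / sphere_volume
    return rho_local / rho_mean - 1.0

# Toy example: random "subhalos" in a 100 Mpc box, overdensity on 3 Mpc scales.
rng = np.random.default_rng(0)
pos = rng.uniform(0, 100, size=(5000, 3))
mass = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
delta = spherical_overdensity(pos, mass, centers=pos[:10], radius=3.0)
```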
DisPerSE
Another popular technique called DisPerSE aims to measure topological structures in the cosmic web, e.g., voids, filaments, sheets, and clusters. DisPerSE is short for Discrete Persistent Structure Extractor, and the general intuition for how it works is by: (1) computing a density field from the simulation particles, (2) identifying critical points of the field like minima, saddle points, and maxima, (3) tracing out the “skeleton” between critical points, and (4) filtering features by their topological persistence, ensuring only robust, noise-resistant structures are kept. We can thus describe galaxy environment by using the distances to these DisPerSE features.
Cosmic GNNs
Christian and I recognized that the entire simulated volume of galaxies could be represented as a single cosmic graph, and subsequently modeled via GNNs! You can see a visualization of this below (Figure 1 of Wu & Jespersen 2023).

We used matched runs of the IllustrisTNG 300 dark matter only (DMO) + hydrodynamic simulations, i.e., the DMO simulation can only form dark matter (sub)halos, whereas the hydrodynamic run begins with the same initial conditions and forms similar (sub)halos as its DMO counterpart, but also includes messy baryonic physics. This means that we can predict hydrodynamic galaxy properties from a cosmic graph constructed purely out of the DMO simulation!
We treat each subhalo as a node in this cosmic graph, and specify two DMO node features: the total subhalo mass (Msubhalo) and the maximum circular velocity (Vmax).
To determine the graph connectivity, we imposed a constant linking length. Pairs of galaxies “know” about each other if they have smaller separations than the linking length, so we connect those pairs of nodes with graph edges. We also compute six edge features using the nodes’ 3D positions and 3D velocities; these edge features record the geometry of the system in an E(3)-invariant way.
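Here is a rough sketch of that construction with scipy; the three invariant scalars below are illustrative stand-ins, not the exact six edge features from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_cosmic_graph(pos, vel, linking_length):
    """Connect subhalo pairs closer than the linking length; compute invariant edge features."""
    tree = cKDTree(pos)
    pairs = np.array(sorted(tree.query_pairs(r=linking_length)))   # (M, 2) undirected edges

    dx = pos[pairs[:, 1]] - pos[pairs[:, 0]]    # relative positions (translation-invariant)
    dv = vel[pairs[:, 1]] - vel[pairs[:, 0]]    # relative velocities

    edge_features = np.stack([
        np.linalg.norm(dx, axis=1),             # separation
        np.linalg.norm(dv, axis=1),             # relative speed
        np.einsum("ij,ij->i", dx, dv),          # approaching vs. receding pairs
    ], axis=1)                                  # norms and dot products are E(3)-invariant
    return pairs, edge_features
```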
As for the GNN model architecture, we use a graph network analogous to those described by Battaglia et al. 2018 that we had seen successfully applied in cosmology. If you really want to see the code, take a look here.
So… how do overdensity, DisPerSE, and GNNs compare?
To cut to the chase: GNNs dominate the competition when it comes to predicting galaxy stellar masses from DMO simulations.
The figure below shows how different environmental indicators, quantified over various distance scales, affect the prediction error on Mstar. Lower error is better, and you can clearly see how GNNs (purple) surpass all other methods once they’re given information on > 1 Mpc length scales. (Figure adapted from Wu, Jespersen, & Wechsler 2024.)

Specifically, we compare machine learning models where no environmental data is provided (yellow), the DisPerSE cosmic web features (green), simple overdensity averaged over a given length scale (blue), and GNNs with graph connectivity on the given length scale (purple). The non-GNN models employed here are explainable boosting machines (EBMs)—decision tree models that are both performant and interpretable. EBMs can receive environmental features on top of the Msubhalo and Vmax: think of them as additional columns in a tabular dataset. We can provide EBMs with the collection of DisPerSE features, specify the overdensity on scales ranging from hundreds of kpc to 10 Mpc, or leave out environmental summary statistics altogether.
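For reference, fitting an EBM with tabular environmental features really is as simple as adding columns. A hedged sketch with the interpret package (toy column names and random values, not our actual pipeline):

```python
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor

# Hypothetical tabular dataset: one row per subhalo.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "log_Msubhalo": rng.uniform(10, 13, size=1000),
    "log_Vmax": rng.uniform(1.5, 3.0, size=1000),
    "overdensity_3Mpc": rng.normal(0, 1, size=1000),   # optional environmental column
})
log_mstar = rng.uniform(8, 11, size=1000)              # stand-in target

ebm = ExplainableBoostingRegressor()
ebm.fit(df, log_mstar)            # drop or add columns to test different environment inputs
predictions = ebm.predict(df)
```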
I want to highlight two main takeaways:
- Overdensity on 3 Mpc scales is the best simple environmental parameter. Excluding the GNN model, we find that an EBM with spherically averaged overdensity achieves the lowest error for stellar mass predictions. It even outperforms the DisPerSE cosmic web features!
- GNNs are the undisputed champs. A GNN flexibly processes information on larger scales, and performance continues to improve to the largest distance scales that we test (10 Mpc).
Cosmic graphs are a natural fit for the data, so it’s no surprise that they perform so well. Critically, we construct the graph such that the subhalo position and velocity information is invariant under the E(3) group action; we convert these 6D phase space coordinates into edge features. We’ve also seen hints that this method works in spatial projection, i.e. using 2D spatial coordinates and radial velocities (e.g., see Wu & Jespersen 2023 and Garuda, Wu, Nelson, & Pillepich 2024).
Furthermore, the galaxy–halo connection has different characteristic length scales at different masses. Therefore, the optimality of 3 Mpc overdensity is somewhat specific to our simulation volume and subhalo mass selection. This is another reason to prefer GNNs, which can simultaneously learn the galaxy–halo–environment connection over a huge range of masses and distances.
Graphs adeptly model systems where individual objects are separated by relatively large scales—I mentioned this in the introduction. Meanwhile, much of my research has focused on extracting local information from galaxy systems at the pixel scale by using vision models. We can even combine these two representations by placing a convolutional neural network (CNN) encoder at each node, and letting the GNN process the pixel-level details in tandem with other galaxy parameters (see Larson, Wu, & Jones 2024)!
In summary, cosmic graphs offer a more natural and powerful way to represent the large-scale structure of the Universe than traditional methods. By using GNNs, we can effectively learn the complex relationship between a galaxy’s environment and its properties. In the future, I expect that GNNs will enable new ways to connect simulations to the observable, baryonic Universe.
-
Christian has also written a fantastic blog post on our papers together here. ↩
-
A subhalo is a dark matter halo that is gravitationally bound to a more massive halo. Sometimes the subhalos are called satellites and the most massive halo in the system is the central halo. The virial radius of the Milky Way’s halo is about 300 kpc, so nearby dwarf galaxies like the LMC and SMC are expected to reside in subhalos that orbit around the Milky Way halo. ↩
-
Christian and I are investigating the equivalence of information content in galaxy assembly history and large-scale environment. Stay tuned for an upcoming paper! ↩
Check out this blurb:
When you’re a showrunner, it is on you to define the tone, the story, and the characters. You are NOT a curator of other people’s ideas. You are their motivator, their inspiration, and the person responsible for their implementation.
Bottom line: the creativity of your staff isn’t for coming up with your core ideas for you, it’s for making your core ideas bigger and better once you’ve come up with them. To say “I’ll know it when I see it” is to abdicate the hard work of creation while hoarding the authority to declare what is or isn’t good.
This is the number one failure mode I see for people just starting to use LLMs. Inexperienced users usually give a short, poorly specified prompt, and then hope that the LLM will read their minds and magically respond by following their intent, rather than what they’ve literally written in the prompt. These users are giving vague directions for some answer, and then implying “I’ll know it when I see it.” Sorry folks, AI isn’t telepathy.
I’ve discussed this a bit before, but here’s another astronomy-focused example.
Imagine you’re excited by the new LSST Data Preview, and you want to try out some new research ideas. You ask ChatGPT “What are some research questions I can answer with the new LSST data?” It lists some generic high-level stuff that’s probably summarized from some old white papers. You think to yourself, Wait, I don’t care about all these topics, I just wanted topics in galaxy evolution. And maybe using machine learning. Oh and also I don’t care about predicting photo-zs better, everybody has already been trying to do that for decades. Oh yeah and only use data that can be crossmatched with these value-added catalogs. This is probably going to be a back-and-forth process, wasting your time, polluting the LLM context, and probably leaving you frustrated and without any good research ideas.
Let me propose a better alternative. Spend 5 minutes thinking about the essence, the specifics of what you’re looking for. You can jot down your prior research ideas, examples of research topics you don’t care about, extra information that you know is relevant but the LLM might not index on. Think of it as building a trellis, upon which the LLM can expand outward and fill inward. Here’s a more fruitful example of how I’d converse with ChatGPT.
When working with LLMs, it is on you to define the tone, the core ideas, the new insights. Carefully crafting and communicating this vision is a foundational skill, useful for personal brainstorming or managing an academic research group — it certainly goes beyond just LLM prompting or showrunning!
Thanks to Simon Willison’s blog — that’s where I first heard about this.
The way Johnstone characterized this interaction surprised me:
In the gentlest possible way, this teacher had been very violent. She was insisting on categorising, and on selecting. Actually it is crazy to insist that one flower is especially beautiful in a whole garden of flowers, but the teacher is allowed to do this, and is not perceived by sane people as violent. Grown-ups are expected to distort the perceptions of the child in this way. Since then I’ve noticed such behaviour constantly, but it took the mad girl to open my eyes to it.
Basically, to reject another’s world is violence. Even if done in a “gentle” way (like this teacher had done), it’s still an act of violence.
As a father of two, I often have to resist this urge to impose my world, my perspective, upon my kids. My daughter sees something she wants to share with me, but I instinctively want to respond by reshaping it into my perspective. Or convert it into some teaching moment, to insist on some fragment of my reality. But such a response to their bid for attention is what Johnstone calls “blocking”, and he discusses it at length throughout the book.
This has been on my mind because I practice it daily now. If you use large language models (LLMs), then you probably do as well.
In order to actually get any value out of your interactions with LLMs, you need to construct the model’s world, e.g. by providing context, constraints, and a specific objective. Prompting (or context engineering) is that “violent imposition” — pushing your reality onto the machine.
It’s not true to say that all such interactions are violent in this way. Parents tell their kids not to run into traffic. We teach them knowledge and skills that might broaden their world. The AI safety community seeks to align LLMs with human values. It’s not a bad thing to provide guidance. And again, skilled prompting is necessary to get any utility from LLMs.
However, I’m quite concerned about what this practice does to our own psyches. What happens when you spend hours each day reformatting the world context of an LLM, which can never resist? The way that AI generally interacts is to comply with whatever you say (or at least attempt to do so).1
Real life is never this frictionless! And it shouldn’t be… each person has their own perspectives, and most people aren’t thrilled about having someone else’s worldview forced upon them.
What happens when we get too good at making LLMs see things our way? I’m guessing that it’ll make us even more siloed or unwilling to change our perspectives (even more than what social media has already done).
The equivalent of touching grass in this case is to spend some conscious effort not imposing our worlds on others. Maybe even LLMs too! After all, improv2 is all about accepting what your partner gives you and building on it.
-
Also, gross sycophancy… and it looks like the latest version of Gemini 2.5 Pro is falling into this same trap. ↩
-
I should probably add the caveat that I’ve never done improv, but it’s on my bucket list! ↩
This is the first few sections of an invited review article that’s been sitting around for far too long…
Introduction
Machine learning algorithms have become increasingly popular for analyzing astronomical data sets. In recent years, astronomy’s wealth of data has engendered the development of new and specialized techniques. Many algorithms can learn relationships from catalogued (or tabular) data sets. Vision methods have been adopted across astronomy, e.g., through the use of convolutional neural networks (CNNs) for pixel-level data such as images or data cubes. Time series data sets can be represented using recurrent neural networks or attention-based models. Recently, simulation-based inference and generative models have also become commonplace for solving complex inverse problems and sampling from an implicit likelihood function. I don’t cover these topics here, as other reviews have surveyed the rise of ML applications throughout astronomy, deep learning for galaxy astrophysics, and deep learning for cosmology.
Inductive biases of physics problems
Because astronomical data can be structured in various ways, certain model representations are better suited for certain problems. This representational power is tied to the inductive bias of the problem. Multi-Layer Perceptrons (MLPs) or decision tree-based methods operate well on catalog-based data or unordered sets; that is, the permutation of rows or examples does not matter, and the features are treated independently. A CNN is well-suited for data on some kind of pixel or voxel grid; here the features are correlated with each other and have some notion of distance. Graphs are able to represent relationships between entities. See reviews on GNNs, e.g. by Battaglia et al. (2018), Hamilton (2020), Bronstein et al. (2021), and Corso et al. (2024), just to name a few.
What are GNNs?
Graphs are well-suited for representing entities and relationships between them; for example, a “ball and stick” model of a molecule represents atoms as nodes and bonds as edges on a mathematical graph. Another example is a social graph, where people, businesses, and events are different types of nodes, and interactions between these entities (i.e. mutual friends, event attendees, etc.) are edges on the social graph. In addition to the connective structure of the graph, nodes and edges can also be endowed with features. For the molecular graph, node features may comprise positions, atomic weight, electronegativity, and so on.
Because graphs are very general structures, they can offer tremendous flexibility for representing astronomical phenomena. Importantly, they also exhibit relational inductive biases (e.g., Battaglia et al. 2018). Objects that are well-separated from each other are most naturally suited to reside on graph nodes. For example, a galaxy cluster can readily conform to a graph structure: galaxies can be represented as nodes, while interactions between pairs of galaxies (such as gravity, tidal forces, ram pressure, to name a few) can be represented as edges. The circumgalactic medium may be more challenging to represent as a graph, as there exists a continuum of gas densities in multiple phases, each with potentially different lifetimes, making it difficult to draw the line between individual clouds.1
A graph neural network (GNN) is a machine learning model that can be optimized to learn representations and make predictions on graphs. In this post, I highlight current and future astrophysical applications of GNNs.
Constructing graphs from astronomical data
Before applying a GNN, we’ll need to first construct a graph from our data. The choice of how to define nodes and edges also determines how you might model the data via GNNs. In general, point clouds can be easily represented as nodes on a graph. Objects that are small relative to inter-object separations are natural candidates for nodes, like galaxies, subhalos, stars, or star clusters. The edges, which represent relationships or interactions, can be defined in several ways:
- k-Nearest Neighbors (k-NN): An edge is drawn from a node to its k closest neighbors in physical or feature space. This method ensures that every node has the same number of connections (degree), which can be useful for batching data on a GPU.
- Radius-based: An edge is drawn between all nodes separated by a distance less than a chosen radius r. This is a common choice for representing physical interactions that have a characteristic length scale. Unlike k-NN, this method results in a variable number of connections per node.
- Dynamically: Edges can also be learned dynamically by the model itself, for example, by using an attention mechanism to weight the importance of connections between nodes.
The choice of graph construction method imposes a strong prior on the model, and the best choice will depend on the problem.
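As a quick sketch, scikit-learn can build the first two kinds of connectivity in a couple of lines (toy positions here; the neighbor count and radius are arbitrary choices):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

rng = np.random.default_rng(42)
positions = rng.uniform(0, 100, size=(1000, 3))   # toy 3D positions

# k-NN: every node gets exactly k edges (convenient for fixed-size batching).
A_knn = kneighbors_graph(positions, n_neighbors=8, mode="connectivity")

# Radius-based: degree varies, set by a physical length scale.
A_radius = radius_neighbors_graph(positions, radius=10.0, mode="connectivity")

print(A_knn.sum(axis=1).mean(), A_radius.sum(axis=1).mean())   # mean node degree for each graph
```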
A primer on mathematical graphs
A graph with \(N\) nodes can be fully described by its adjacency matrix, \(\mathbf{A}\), a square \(N \times N\) matrix that describes how nodes are connected. If an edge connects node \(i\) to node \(j\), then element \(A_{ij}\) has a value of 1; otherwise it is 0. Physical systems are often approximately described by sparse graphs, where the number of edges \(M \ll N(N-1)/2\). This approximation holds if, for example, interactions or correlations between nodes fall off rapidly with distance. A sparse adjacency matrix can also be efficiently represented using a \(2 \times M\) matrix of edge indices. The graph \(\mathcal{G}\) may contain node features \(\mathbf{X}\) and edge features \(\mathbf{E}\), where
\[\mathbf{X} = \begin{pmatrix} x_1^\top \\ \cdots \\ x_N^\top \end{pmatrix} \quad {\rm and} \quad \mathbf{E} = \begin{pmatrix} e_1^\top \\ \cdots \\ e_M^\top \end{pmatrix}.\]
Graphs have several characteristics that make them attractive for representing astrophysical concepts. Graph nodes have no preferred ordering, so the operation of a permutation matrix \(\mathbf{P}\) should yield the same graph as before. Critically, models that act on graphs (or sets; Zaheer et al. 2017) can also be made invariant or equivariant to permutations. A permutation-invariant function \(f\) must obey
\[f(\mathbf{X}, \mathbf{A}) = f(\mathbf{PX}, \mathbf{PAP^\top}),\]
while a permutation-equivariant function \(F\) must obey
\[\mathbf{P} F(\mathbf{X}, \mathbf{A}) = F(\mathbf{PX}, \mathbf{PAP^\top}).\]
Note that the indices of the edge features are implicitly re-ordered if the permutation operation acts on the adjacency matrix.
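A quick numerical sanity check of permutation invariance, using a toy readout function (sum-pooling of degree-weighted node features):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 4
X = rng.normal(size=(N, d))                          # node features
A = np.triu((rng.random((N, N)) < 0.4).astype(float), k=1)
A = A + A.T                                          # symmetric adjacency, no self-loops

P = np.eye(N)[rng.permutation(N)]                    # a random permutation matrix

def f(X, A):
    """Permutation-invariant readout: sum of degree-weighted node features."""
    return (A.sum(axis=1, keepdims=True) * X).sum(axis=0)

print(np.allclose(f(X, A), f(P @ X, P @ A @ P.T)))   # True -- relabeling nodes changes nothing
```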
Invariant and equivariant models
As discussed above, GNNs are permutation-invariant to the re-ordering of nodes. This invariance reveals a symmetry in the system, as the permutation operator leaves the graph unchanged. Additional symmetries can be imposed on graphs and GNNs; for example, recent works have developed graph models that are invariant or equivariant to rotations and translations in \(3\) or \(N\) dimensions (e.g., Cohen & Welling 2016, Thomas et al. 2018, Fuchs et al. 2020, Satorras et al. 2021). The subfield of symmetries and representations in machine learning is sometimes called geometric deep learning, and there are far more detailed reviews offered by Bronstein et al. (2021) or Gerkin et al. (2021).
Notwithstanding the far superior review articles mentioned above, I still want to briefly discuss the benefits of leveraging symmetries in astrophysics. While modern ML has demonstrated that effective features and interactions can be learned directly from data, imposing physical symmetries as constraints can vastly reduce the “search space” for this learning task. Perhaps the simplest approach is to use only scalar representations: invariant models can be built by contracting all vector or tensor features into scalars (e.g., dot products) at the input layer, as discussed in Villar et al. (2021). Nonetheless, models that preserve higher-order internal representations can be more data-efficient, learning effectively from fewer examples (Geiger & Smidt 2022).
Other popular models in ML are already exploiting many of these symmetries. Indeed, CNNs, which are commonly used for image data, and transformers, commonly used for text data, can both be considered special cases of GNNs. For example, a convolution layer operates on a graph that is represented on a grid; node features are the pixel values for each color channel, while linear functions over a constant (square) neighborhood represent the convolution operator. CNNs can learn (locally) translation-invariant features, although this invariance is broken if the CNN unravels its feature maps and passes them to a final MLP.
A simple GNN that makes node-level predictions
Caption: Example of a simple GNN layer that makes node-level predictions. Node features \(x_i\), neighboring node features \(x_j\), and edge features \(e_{ij}\) are fed into a learnable function, \(\phi\), which outputs a hidden edge state \(\varepsilon_{ij}\). All edge states \(\varepsilon_{ij}\) that connect to node \(i\) are aggregated through \(\oplus_j\), a permutation-invariant aggregation function, and the concatenation of its output and the original node features are fed into another learnable function, \(\psi\), which finally outputs predictions at each node \(i\).
Here, we’ll briefly describe the simple GNN illustrated in the above figure. This general structure is often referred to as a message-passing framework. Let’s focus on predictions that will be made on node \(i\). For each neighboring index \(j\), we feed neighboring node features \(x_j\), edge features \(e_{ij}\), and the input node features \(x_i\) into a function \(\phi\) that produces a “message” or edge hidden state \(\varepsilon_{ij}\):
\[\varepsilon_{ij} = \phi(x_i, x_j, e_{ij}).\]
\(\phi\) is a function with shared weights across all \(ij\), and it is parameterized by learnable weights and biases. In practice, \(\phi\) usually takes the form of a MLP with non-linear activations and normalization layers.
An aggregation function \(\oplus_j\) operates on all edge hidden states \(\varepsilon_{ij}\) that connect to node \(i\), i.e., it pools over all neighbors \(j\). Common examples of the aggregation function include sum pooling, mean pooling, max pooling, or even a concatenated list of the above pooling functions. Crucially, the aggregation function must be permutation invariant in order for the GNN to remain permutation invariant.
The function \(\psi\) receives the aggregated messages back at node \(i\), as well as the node’s own features \(x_i\), in order to “update” the node’s state and make predictions: \(y_i = \psi \left (x_i, \oplus_j(\varepsilon_{ij}) \right).\) Similar to \(\phi\), \(\psi\) can be parameterized using a MLP or any other learnable function, so long as the parameters are shared across all training examples.
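Here is what that layer might look like in PyTorch. This is a generic sketch of the \(\phi\) / aggregate / \(\psi\) pattern, not a drop-in from any particular library, and the hidden sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer: phi builds edge messages, sum-pooling aggregates, psi updates nodes."""

    def __init__(self, d_node, d_edge, d_hidden, d_out):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2 * d_node + d_edge, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden))
        self.psi = nn.Sequential(nn.Linear(d_node + d_hidden, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_out))

    def forward(self, x, edge_index, e):
        # x: (N, d_node) node features; edge_index: (2, M) sender/receiver indices; e: (M, d_edge)
        j, i = edge_index                                              # edge from node j to node i
        messages = self.phi(torch.cat([x[i], x[j], e], dim=-1))        # epsilon_ij for every edge

        agg = torch.zeros(x.shape[0], messages.shape[1], dtype=x.dtype)
        agg.index_add_(0, i, messages)                                 # permutation-invariant sum over neighbors

        return self.psi(torch.cat([x, agg], dim=-1))                   # node-level outputs y_i
```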
Although we described just one example of a GNN layer, it serves to illustrate how different kinds of features may interact. Many other alternatives are possible; see, e.g., Battaglia et al. 2016, 2018. It is possible to have graph-level features or hidden states that simultaneously act on all node or edge hidden states. Additionally, predictions can be made for the entire graph or on edges rather than on nodes, and likewise, other aggregation patterns are possible.
Prediction tasks on graphs
GNNs are versatile and can be adapted for various prediction tasks depending on the scientific question:
- Node-level tasks: These tasks involve making a prediction for each node in the graph. For example, predicting the stellar mass of a galaxy (node) based on its properties and the properties of its neighbors. The model output is a vector of predictions, one for each node.
- Edge-level tasks: These tasks focus on the relationships between nodes. An example would be predicting whether two dark matter halos will merge, where the prediction is made for each edge connecting two halos.
- Graph-level tasks: These tasks involve making a single prediction for the entire graph. For instance, predicting the total mass (e.g., \(M_{200}\)) of a galaxy cluster (the graph) based on the properties and arrangement of its member galaxies. This usually involves an additional “readout” or “pooling” step that aggregates information from all nodes and edges into a single feature vector before making the final prediction.
Our one-layer GNN described in this section can be extended in two different ways: (i) multiple versions of the learnable functions with unshared weights can be learned in parallel, and (ii) multiple GNN layers can be stacked on top of each other in order to make a deeper network. We now consider \(u = {1, 2, \cdots, U}\) unshared layers, and \(\ell = {1, 2, \cdots, L}\) stacked layers. For convenience, we also rewrite \(x_i\) as \(\xi_i^{(0, \ell)}\), \(x_j\) as \(\xi_j^{(0, \ell)}\), and \(e_{ij}\) as \(\varepsilon_{ij}^{(0, \ell)}\), where the same input features are used for all \(\ell\). (Note that the node and edge input features may have different dimensions than the node and edge hidden states.) With this updated nomenclature, each unshared layer produces a different set of edge states:
\[\varepsilon^{(u,\ell)}_{ij} = \phi^{(u,\ell)}\left (\xi_i^{(u,\ell-1)},\xi_j^{(u-1,\ell-1)},\varepsilon_{ij}^{(u,\ell-1)}\right ),\]
which are aggregated and fed into \(\psi^{(u,\ell)}\) to produce multiple node-level outputs:
\[\xi_i^{(u,\ell)} = \psi^{(u,\ell)}\left (\xi_i^{(u, \ell-1)}, \oplus_j^{(u,\ell-1)}\left(\varepsilon^{(u,\ell-1)}_{ij}\right )\right ).\]
The extended GNN can have a final learnable function \(\rho\) that makes node-level predictions from the concatenated hidden states:
\[y_i = \rho\left (\xi_i^{(1,L)}, \xi_i^{(2,L)}, \cdots, \xi_i^{(U,L)}\right).\]
A connection to multi-headed attention
Another way to say this is by representing \(h_i^{(\ell)}\) as the feature vector of node \(i\) at layer \(\ell\). Assuming that we aggregate all of the unshared layers at each \(\ell\), then \( h_i^{(\ell)} = \oplus_u\left(\xi_i^{(u,\ell)}\right) \). In that case, the input is \(h_i^{(0)} = x_i\) and a stack of \(L\) layers is then:
\[\mathbf{h}_i^{(\ell+1)} = \text{GNN-Layer}^{(\ell)} \left(\mathbf{h}_i^{(\ell)}, \left\{ \mathbf{h}_j^{(\ell)}, \mathbf{e}_{ij} \mid j \in \mathcal{N}(i) \right\} \right).\]
Within any single GNN layer, we can learn \(U\) different message functions in parallel — this is just like multi-headed attention (see Veličković et al. 2017)! The outputs of these multiple heads \(\phi^{(1)}, \phi^{(2)}, \cdots, \phi^{(U)}\) can be concatenated (or aggregated) before the final node update: \(\text{final\_features}_i = \text{CONCAT}\left[ \bigoplus_j \phi^{(1)}(...), \bigoplus_j \phi^{(2)}(...), \dots \right].\)
Once we’ve extracted this final set of features, we can then pass it through a final learnable function \(\rho\) in order to make predictions.
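A sketch of that multi-head variant, extending the single-head layer above (again, illustrative rather than any specific published architecture):

```python
import torch
import torch.nn as nn

class MultiHeadGNNLayer(nn.Module):
    """U parallel message functions; their pooled outputs are concatenated before the node update."""

    def __init__(self, d_node, d_edge, d_hidden, d_out, n_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(2 * d_node + d_edge, d_hidden) for _ in range(n_heads)]
        )
        self.rho = nn.Linear(d_node + n_heads * d_hidden, d_out)   # readout over concatenated heads

    def forward(self, x, edge_index, e):
        j, i = edge_index
        edge_inputs = torch.cat([x[i], x[j], e], dim=-1)
        pooled = []
        for phi in self.heads:
            msg = torch.relu(phi(edge_inputs))                     # head-specific messages
            agg = torch.zeros(x.shape[0], msg.shape[1], dtype=x.dtype)
            agg.index_add_(0, i, msg)                              # sum-pool per receiving node
            pooled.append(agg)
        return self.rho(torch.cat([x] + pooled, dim=-1))           # node-level predictions
```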
Summary
Graph neural networks (GNNs) provide a powerful and remarkably intuitive way to model astrophysical systems. By treating objects like galaxies and subhalos as nodes on a graph, we can leverage their physical relationships as edges, making it easier to build models that respect the fundamental symmetries of the problem.
I’ve written this post as a rather general introduction, but real examples can probably paint a clearer picture of how GNNs work. In an upcoming blog post, I’ll highlight some of my own work using these methods to learn the physical connection between galaxies, their subhalos, and their cosmic surroundings. Stay tuned, but if you can’t wait, then you can check out those papers here and here!
-
Note, however, that even complex gas dynamics may still be modeled using GNNs. For example, Lam et al. 2023 have successfully represented meteorological data on a polygon mesh, a specific type of graph, which enables them to leverage GNNs for weather forecasting. ↩
So I decided to check some analytics. Since late April, I’ve been tracking where my blog readers actually come from when I share posts across different platforms. I share my results from this 30-day snapshot (late April through late May) below.
The data: where readers actually come from
Before diving into the numbers, a quick note on methodology. I use SimpleAnalytics because it respects visitor privacy (e.g., it respects blockers or “do not track” browser signals). This means some traffic sources might go untracked if users have strict privacy settings, but it gives us a decent view of the platforms that are actually driving traffic to my posts.
Over the past month, I’ve shared each new blog post consistently across three platforms: Twitter/X, Bluesky, and LinkedIn. Most of these posts are simply a sentence or a copy+paste of the front matter of the blog post, sometimes with a screenshot of the post from my laptop or my phone. I’ve also been posting at random times (basically whenever a post is completed).1 Okay, this isn’t really a rigorous scientific experiment… whatever.
When I examined the referral data after excluding direct links and other sources (a pretty large fraction of results), the distribution was a bit surprising:
- LinkedIn: 51.4%
- Bluesky: 26.0%
- Twitter/X: 22.7%
Twitter/X: When reach doesn’t translate to readership
Twitter’s poor performance in driving actual blog readership is particularly pathetic when you consider the platform’s apparent reach. I’ve had several tweets gain (fairly?) significant traction, e.g. my blog migration tweet was noticed by Simon Willison, and subsequently got 36,000 views. Yet despite his generous attention, the number of actual blog visits was comically low.
Another one of my posts got over 9000 views on Twitter — not bad, right? But in fact, only 100 people had actually clicked through to read the full blog post. This represents roughly a 1% conversion rate, which suggests that Twitter’s engagement metrics are totally disconnected from genuine reader interest. In any event, most of my posts get only a few hundred views (i.e. less than a quarter of my follower count), since I don’t pay for that blue check mark (or use Twitter as its own microblogging platform, now that I’ve chosen to “own” my content).
LinkedIn: Steady and reliable (for now)
In contrast to Twitter’s up-and-down metrics, LinkedIn has been way more steady over the month. My posts on the platform typically generate between 800–4000 views. LinkedIn consistently delivers sustained visibility (often spanning multiple weeks) for my blog posts. And this seems to work: over half of my blog visits originate from LinkedIn! I was kind of shocked to see this, since academics and researchers rarely use LinkedIn, and the platform is generally known as a pretty low-signal source of information…
If there’s any social media platform that I might be more inclined to post on regularly, it would be LinkedIn. However, for now I’m not planning to change my usage patterns significantly. After all, we’ve seen what can happen when unhinged billionaires acquire social media platforms (and I’m not eager to invest heavily in a platform that could become even more pay-to-play overnight).
Bluesky: The surprising dark horse
Bluesky has also been a surprisingly helpful platform despite its small apparent size. Although Bluesky still feels like a niche social media site2 compared to the other two, it’s driving 26% of my social media referral traffic, placing it solidly ahead of Twitter.
On one hand, I actually have the highest follower count on Bluesky (among the three platforms). On the other, Bluesky’s chronological timeline makes it much harder to go viral compared to its competitors. This design constraint probably favors consistent engagement from regular bloggers over the transient spikes that characterize viral posts on other platforms. Or maybe Bluesky simply has a more dedicated user base that actually spends more time connecting with others on the platform rather than scrolling past things without reading.
What this means for writers and researchers
I’ve found that regular blogging has made me a better writer, and helped me organize my thoughts and clarify my thinking. It’s also served as a public record for my own future reference. These benefits exist regardless of whether anyone reads my posts!
This brief foray into my blog post analytics has reminded me of a lesson that’s easy to forget in the social media age: writing should be pursued for its own sake, not simply as fuel for social media engagement. The data certainly provides useful tidbits about platform effectiveness, but the more important takeaway is that I have very little control over social media platforms, and that expanding social media reach is totally orthogonal to writing a half-decent post.
Moving forward, I’m not planning to spend any more effort crafting platform-specific social media posts. Instead, I’ll focus on what actually matters: writing blog posts that help me think more clearly and document my academic and intellectual journey. Hopefully, if your writing is genuinely useful to you — i.e., it helps you understand something better, or articulate ideas you needed to work through — then readers will likewise find it valuable, regardless of which platform brought them there.
- The time and day of the week that you post have a huge impact on social media engagement. I used to care about this a bit more, at least enough to recognize this factoid, but I’ve since become more constrained by having two small kids and giving less of a crap. ↩
- I mean this in a good way! Bluesky actually has a legitimate astronomy community. Check out the various astronomy feeds — especially the AstroSci feed for astronomy researchers! ↩
A lot of this post stems from my own experience, and I hope these lessons are useful for you too. (But one of the takeaways here is that sometimes you have to make your own mistakes in order to learn.) Here are a few other blog posts that have impacted my thinking on productivity:
- Impact, agency, and taste by Ben Kuhn
- How I think about my research process by Neel Nanda (note that there are three parts)
- Some reasons to work on productivity and velocity by Dan Luu
- Hacker School’s Secret Strategy for Being Super Productive (or: Help.) by Julia Evans
- Every productivity thought I’ve ever had, as concisely as possible by Alexey Guzey
- The top idea in your mind by Paul Graham
I’m sure there are more that I’ve internalized, but can’t quite remember right now; feel free to reach out to me if you know of other interesting ones.
The exploratory phase
It’s hard to quantify the value of trying out new research directions. Obviously you can’t wander aimlessly all the time, or spend all your free time listening to NotebookLM audio overviews of random papers that piqued your interest. But many of my best ideas emerged even though I didn’t begin with a tangible goal.
Back when I was in grad school,1 I noticed that a Kaggle Galaxy Zoo challenge was solved using deep convolutional neural networks (CNNs). I was very interested in applying deep learning to galaxies, so it was gratifying to see Sander Dieleman et al. accurately predict citizen scientist vote fractions of galaxy morphology purely using image cutouts.
Motivated by this successful application… I decided to proceed by bashing every project I could find with this newfound hammer. After all, I was a curious grad student wielding a powerful method. What else did you expect? Classifying galaxy morphology had already been done before, but I recognized that you could tackle all sorts of other problems, e.g., predicting whether galaxies were merging, separating compact galaxies from stars, estimating the bulge-to-disk ratio, etc.
Along the way, though, I noticed that nearly everyone was interested in classification problems, e.g., identifying galaxy morphological type or galaxy mergers, but these seemed like an incredibly limited class of problems. After all, the cosmos is weakly modal, and although astronomers love to classify things, these categories are honestly quite arbitrary.2 I was far more interested in regression problems, e.g., how does a galaxy’s star formation rate or chemical abundance scale with its appearance? Up until ~2017, very few people had addressed regression problems with deep learning in astronomy.
Anyway, after a few months of going down random rabbit holes, I realized that there were loads of interesting regression problems that hadn’t been addressed with deep learning. I chatted with Stephen Boada, and later on consulted with Eric Gawiser about these ideas; we quickly homed in on the task of predicting galaxy metallicity from images. You can read more about that here.
These exploratory phases are helpful for letting your mind make free-form connections; diving down rabbit holes is basically feeding that part of your brain. But watch out for the slippery slope: it’s tempting to put out theories without ever figuring out how to evaluate (or invalidate) them. In other words, it’s fine to follow random streams of consciousness, but eventually you’ll need to land on a well-posed research question. Otherwise, you’d never crystallize any kind of problem worth solving!
I think about this transition as going from the exploratory/meandering phase to the mapmaking phase. That transition happens once you have a falsifiable hypothesis, after which you can begin charting out a plan to test it. Let’s talk about the mapmaking phase.
From meandering to mapmaking: distilling down to a one-sentence hypothesis
One of the most important lessons I’ve learned is this: Whenever you are in an exploratory phase, look for every opportunity to distill your ideas into a testable, one-sentence hypothesis.
Side note: LLMs are extremely helpful here! As described in a previous post, under the heading 1. Exploring ideas and surveying prior art, I lean on LLMs to (i) critique my vague thoughts, (ii) decompose promising ideas into atomic concepts, and (iii) survey the literature to see whether these ideas have been implemented before. If you’re interested in critiquing your thoughts, then you must avoid LLM sycophancy at all costs! Try a prompt based on something like this:
Please critique my thoughts on Topic_X (appended below). Is my hypothesis vague or incomplete? How might it be tested? Has it been done before? Include diverse opinions or parallel ideas from the literature on Topic_X or related concepts.
If you can’t (eventually) articulate a testable hypothesis, then you should be slightly worried. Either you are still learning the ropes for a new topic (good!), or you are familiar with a topic/method but cannot figure out what you want to do with it (not good!). Give yourself a hard deadline (e.g. a week) to distill a one-sentence hypothesis from all the information you’ve gained while chasing down rabbit holes, and if you still can’t come up with anything concrete, then put those thoughts on the backburner.
As soon as you come across a new idea, rigorously consider the following:
- Do I understand the method well enough to sketch out the pseudo code?
- Do I understand the prior art and potential, e.g., is it sufficiently novel and/or impactful?
- Can I write down a testable hypothesis?
If you can’t address these questions, then put the idea on the backburner. In fact, more experienced researchers will want a much tighter feedback loop; each of these questions should be answerable within a few minutes. I come up with a dozen nebulous ideas every day, so it’s imperative that I set a five-minute deadline for constructing a hypothesis, and if an idea fails to meet that bar, then I let it sink back into my subconscious.
Alternatively, there are cases in which it’s better to have a strong rather than a quick feedback loop. I’ll touch on that in the next section.
But once you do find a testable hypothesis, try to write it down in a single sentence. This can be tricky, but the point of the exercise is to practice conveying the essence of the hypothesis and winnowing out extraneous details. Once you have something specific enough to actually disprove, and you’re satisfied that it captures the core research question you’d like to solve, then congratulations! You’re done exploring (for now) — it’s time for mapmaking.
Here’s a concrete example of when I failed. Around 2020, I got interested in generative models for galaxies. “I want to apply VAEs/GANs/diffusion models to astronomy” sounds great when you’re reading the DDPM paper, but it’s also a completely vague and unfalsifiable thought — there’s no scientific question in here. You could spend months on that without it amounting to anything. But instead, we could start thinking about more testable hypotheses:
- Can generative models construct galaxy images at high enough fidelity to forecast new survey observations? (This is still too vague, but we’re getting closer.)
- Can generative models predict JWST-like mid-infrared images from HST imaging to a reduced chi-squared value of about 1, thereby confirming a tight connection between galaxies’ optical and infrared light?
- If we generate mock survey images based on N-body simulations with realistic galaxy morphologies and brightnesses, and select galaxies via aperture photometry based on rest-frame optical colors, then does it result in a biased matter power spectrum relative to doing the same with halo occupation distribution models that don’t include morphology?
I’m not saying these are particularly good research ideas. But the second and third are definitely more testable than the first one, and each of those three are far more useful than “can we train VAEs/GANs/diffusion models over my favorite galaxy dataset?”
Specifying the hypothesis in this way makes it obvious that the hypothesis could be true or false. Better yet, it implies how the hypothesis might be validated or falsified. Still, we might have a vague but interesting idea, and we can think of multiple tests that could (in)validate parts of this unclear idea. In that case, we can hierarchically break down the target idea (e.g., the latent space of generative models trained on galaxy images has a bijective map to the space of astrophysical properties) into more specific hypotheses, like the two below (a rough code sketch for testing the first one follows the list):
- A generative model trained on SDSS galaxy image cutouts will have some linear combination of latent vectors that have high Pearson correlation coefficient with HI velocity width.
- A generative model’s latent vector that is positively correlated with inclination angle will also be anticorrelated with star formation rate from optical tracers.
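To make the first of these concrete, here’s a minimal sketch of how I might test it, assuming I already had latent vectors and matched HI velocity widths on hand (the file names and shapes below are purely hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical inputs, for illustration only:
#   latents: (N, D) latent vectors from a generative model trained on SDSS cutouts
#   v_width: (N,)   matched HI velocity widths for the same N galaxies
latents = np.load("galaxy_latents.npy")
v_width = np.load("hi_velocity_widths.npy")

# Correlation of each individual latent dimension with the velocity width...
per_dim_r = np.array([pearsonr(latents[:, j], v_width)[0] for j in range(latents.shape[1])])
print("strongest single-dimension |r|:", np.abs(per_dim_r).max())

# ...and the best *linear combination* of latent dimensions, which is what the
# hypothesis actually asks about. (Properly, fit these weights on a training split
# and report the correlation on held-out galaxies to avoid overfitting.)
weights, *_ = np.linalg.lstsq(latents, v_width, rcond=None)
r_combo, _ = pearsonr(latents @ weights, v_width)
print("best linear-combination r:", r_combo)
```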
I want to admit that I’ve often gotten stuck with a tantalizing idea, but couldn’t (or didn’t) find a way to test a concrete hypothesis. For example, I wanted to make “superresolution” work in astronomy, and even wrote some blog posts about it. Whereas smarter researchers like Francois Lanusse came up with genuinely useful applications for generative modeling, I was just messing around. I hadn’t ever left the exploratory phase, and although I believed I was mapping out a trajectory, I had no destination in mind!
The irreplaceable experience of doing a really, really bad job
Let’s actually zoom out for a moment, because there’s an important lesson to be learned here.
I mentioned that it’s useful to have tight or quick feedback loops: arrive at testable hypotheses so that you can begin the real work of validating or falsifying them. This is the actual learning process. Aimless exploration often has the appearance of learning, but carries little substance… except for one critical meta-lesson. Sometimes you simply need to experience the feeling of aimlessness in order to learn how to overcome it.
Perhaps you don’t know what kind of statistical methods are needed to confirm or falsify a hypothesis. Or maybe you need to collect a lot more data, and that’ll take a year. Or perhaps your PhD supervisor really thinks that the conjecture is true, but you can’t figure out how to confirm it. In all these cases, you’re left without any useful feedback, or an obvious way to proceed, so you just flounder about and watch weeks, months, maybe even years go by. You’re absolutely failing at your task. Keep it up!
I can give a personal anecdote: I got totally stuck during my first research project in graduate school. And by stuck, I mean I was absolutely going down random wrong rabbit holes and wasting my time. I was tasked with “stacking far-infrared, dust continuum, and gas spectral lines from high-redshift cluster galaxies.” Stacking just means that, since we know the locations (and recession velocities) of a sample of galaxies, we can average how bright those galaxies are using some other measurements. Simple enough, right? But at the time, I didn’t know what I didn’t know:
- At far-infrared wavelengths, there is considerable background emission from high-redshift galaxies, especially in galaxy cluster fields where gravitational lensing can amplify the background flux.
- Far-infrared emission detected via bolometers generally has very low spatial resolution, so that if you are stacking lots of galaxies within a small angular size, then you’ll accidentally double count sources (that are well-separated at optical wavelengths, but completely “blended” or overlapping at long wavelengths).
- I was relying on a sample of galaxy clusters spanning redshifts 0.3 < z < 1.1, meaning that my flux-limited samples would result in completely heterogeneous constraints spanning nearly two orders of magnitude in luminosity or mass.
- Ultimately, stacking would not have detected anything in the highest-redshift clusters unless the average cluster member had lots of dust-obscured star formation — a very bold assumption.
- There were a few galaxies that were actually detected in long-wavelength emission, but there were so few that it was impossible to make any kind of statistical statement about them.
- The whole project was extremely challenging, and probably could have been framed as “we expected to find nothing in these extremely massive galaxy clusters, and yet we found some rare dusty, gas-rich, star-forming galaxies!” (A member of my qualifying exam committee actually said this, but I didn’t take it to heart.)
To be clear, my PhD advisor was extremely supportive and helpful in providing high-level guidance. I just happened to be pushing in a challenging research direction, and I was too inexperienced in astronomy to have salient intuitions on how to interpret my (null) results. After navigating these unexpected twists and turns over three years,3 I finally had some results and a paper draft to circulate among my co-authors.
One of the co-authors (I won’t say who) remarked, “Whew. That was a slog to read through. It really reads like a student’s first paper… you gotta make it more interesting than that.”
I had dutifully reported the results from every misguided rabbit hole, mentioned our incorrect motivations, and carefully explained the statistical procedures that were far too weak to result in anything. But my co-author’s message cut through all the nonsense: Give your audience a reason to read this. Nobody cares about your hard work.
Don’t skip the struggle, don’t repeat the struggle
This was the lesson I needed to learn, and I’m not sure there were any shortcuts. Instead of just learning what to do, I had to suffer through and internalize what not to do. I was embarrassed at how inefficient I was and completely sick of this project.4 But I had to see it through.
Ultimately, it took over four years to get my first paper published. The end result wasn’t great; in hindsight, I could have written it much better, and I probably could have made it sound far more interesting.
And yet, I think this was one of the best things to happen to me as a graduate student. I truly believe that there’s no substitute for the experience of struggling to conclude a project, and then pushing through anyway. Each and every fruitless endeavor builds intuition for future projects. Critiques and criticisms hone your research taste. Slow learning by trial and error ensures that you have an airtight understanding of those topics.
The greatest part of this story is that I’ve already gone through this journey once, so I don’t need to repeat it! I’m sure that I’ll run into new obstacles and challenges, and that I’ll be frustrated with my lack of productivity, but — with the benefit of hindsight — I can appreciate the slow learning process for what it is.
- I’ve previously written about my journey, including my serendipitous grad school years, here. ↩
- There’s a whole blog post in here, but in essence: many astrophysical phenomena exist along a continuum. That continuum might be bimodal, like elliptical vs disk galaxies, or broad-line vs narrow-line active galactic nuclei, or Type O/B/A/etc stars, but there is rarely a firm separation of classes. Sure, there are obvious class distinctions like star–galaxy separation, but you know I’m not talking about that. If you want to hear more, stay tuned for a future post, or check out this MLClub debate back in 2021. ↩
- I pulled so many all-nighters during my first year of graduate school. Probably chasing down loads of random rabbit holes on that first project. And for what? None of it was useful for writing my first paper. And yet it was useful, because now I know that I don’t need to repeat the experience! ↩
- Josh Peek (and I’m sure there’s earlier attribution) often says, “hating your paper is a necessary but not sufficient condition to getting it published.” ↩
Foundation models are here to stay
Foundation models are the base pre-trained neural networks behind large language models (LLMs) like ChatGPT or Claude, vision models like DALL·E, and even automated speech recognition (ASR) models like the ones that automatically caption your YouTube videos.
These models learn representations of data that can distinguish between examples in the training dataset. However, they’re not really trained in the usual supervised fashion; instead, foundation models undergo self-supervised learning by optimizing a contrastive or generative objective.1
Foundation models seek to learn how your data can be represented or generated. By minimizing a contrastive loss, you task your model with creating similar representations for the same example “viewed” differently (or transformed differently under a data augmentation procedure), and different representations for different data examples. If, instead, you minimize a generative loss, then you task your model with figuring out whatever representations are useful for generating another patch of the image or the next word in a text corpus. I’d wager that contrastive losses lead to stronger discriminative power, and that generative losses lead to better generative power, but I don’t actually have any data to support this intuition. Oh well.
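To make the contrastive case concrete, here’s a minimal sketch of an InfoNCE-style (SimCLR-like) objective: pull two augmented views of the same example together, push different examples apart. The `encoder` and `augment` functions in the usage comment are placeholders, and this isn’t the exact loss used by any particular model discussed below.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmented views of the same N examples."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                       # (N, N) cosine similarities
    targets = torch.arange(z1.shape[0], device=z1.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage sketch: z1 = encoder(augment(x)); z2 = encoder(augment(x)); loss = info_nce_loss(z1, z2)
```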
The real power of foundation models is that (1) they can map your data into semantically meaningful embedding representations and (2) they can catalyze specific downstream tasks.
(1) The power of embedding representations
Why should you care about latent representations of your data (i.e. your embedding space)? By converting data into embedding vectors, you can use that embedding space to perform comparisons. Concretely, if your embedding space captures the semantic meanings of your dataset, then you’ll be able to measure the semantic similarity of two objects (e.g. by using a cosine similarity or some other distance measure). You can even learn a joint representation across multiple “modalities” such as text and audio.
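As a toy illustration (with random placeholder vectors standing in for real embeddings), that comparison boils down to something like this:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; in practice these would come from a pre-trained encoder.
query_embedding = np.random.randn(768)
abstract_embedding = np.random.randn(768)
print(cosine_similarity(query_embedding, abstract_embedding))  # near 0 for random vectors
```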
For example, we used text embeddings to compare astronomer queries against arXiv paper abstracts when we sought to evaluate LLMs for astronomy research. By mapping both the user query and the paper abstracts into this embedding space, and storing the latter into a vector database, we could retrieve (hopefully) relevant papers on the basis of the user query. Over the course of the JHU CLSP 2024 JSALT workshop, we dramatically improved the semantic similarity search pipeline and retrieval engine, which was published alongside many other cool results in the Pathfinder paper by Kartheik Iyer. Charlie O’Neill and Christine Ye were also able to extract, disentangle, and interpret semantic concepts in the astronomy and ML literature by training sparse autoencoders over these paper embeddings!
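Here’s a rough sketch of that kind of retrieval loop. To be clear, this is not the actual Pathfinder pipeline; the sentence-transformers model and the toy “abstracts” below are just stand-ins.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of text encoder

# Toy stand-ins for arXiv abstracts; in practice, their embeddings would be
# precomputed and stored in a vector database.
abstracts = [
    "We measure the stellar mass-metallicity relation using SDSS spectroscopy.",
    "A convolutional neural network for morphological classification of galaxies.",
    "Constraints on dark energy from Type Ia supernovae.",
]
query = "predicting galaxy metallicity from imaging with deep learning"

# Embed everything, L2-normalize, and rank abstracts by cosine similarity to the query.
emb = model.encode(abstracts + [query])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
scores = emb[:-1] @ emb[-1]
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:+.3f}  {abstracts[idx]}")
```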
(2) All-purpose base models for any task
Building up this semantically rich representation of your dataset also provides an excellent starting point for any other machine learning task. We can view this pre-trained foundation model as a base for some later downstream task. For example, if a foundation model has seen all kinds of real-world images, and learned to produce self-consistent representations of the semantic content within those images, then it should be able to classify bird species or segment cars and pedestrians and roads in a much more data-efficient way.
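As a sketch of what this looks like in practice: freeze the pre-trained encoder, extract embeddings once, and fit a cheap linear probe for the downstream task. The file names below are hypothetical, and the details will vary by model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical arrays, for illustration: embeddings from a frozen foundation model
# and a (much smaller) set of downstream labels, e.g. galaxy morphology classes.
embeddings = np.load("frozen_model_embeddings.npy")   # shape (N, D)
labels = np.load("downstream_labels.npy")             # shape (N,)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear probe accuracy:", probe.score(X_test, y_test))
```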
Foundation models in astronomy
Foundation models are also becoming common across astronomy! In the past few years, we’ve seen foundation models trained on galaxy image cutouts (e.g., by Hayat et al. 2020, Stein et al. 2021, and Smith et al. 2024), stellar spectra (Koblischke & Bovy 2024), and even multiple modalities like images and spectra (Parker & Lanusse et al. 2024) or photometric and spectroscopic time series (Zhang et al. 2024). And there are many more coming soon!
A critical question remains: Are people actually using foundation models to make new discoveries? In general, the answer is no. Most citations are simply from other papers that are also releasing their own ML models. A notable exception is from Galaxy Zoo,2 whose Zoobot model by Walmsley et al. 2021 has amassed ~200 citations leading to actual science! It remains to be seen whether current and next-generation foundation models will deliver real scientific value.
As I mentioned at the top, the workshop organizers will be writing up another blog post focusing on our discussions and how we might guide our community of astronomical ML practitioners. Stay on the lookout for that!
Edit (2025-05-19): I’m including a list of foundation models in astronomy that I currently know about. There are arguably more, e.g. autoencoder variants such as spender, but I’m trying to focus on large-scale foundation models that will (hopefully) be able to generalize well to many tasks. Feel free to reach out if you think I’ve made an egregious omission.3
| Foundation Model | Domain | Method |
|---|---|---|
| AstroCLIP (Parker et al. 2023; Github) | Multi-modal (images and spectra) | Contrastive |
| Maven (Zhang et al. 2024) | Multi-modal time series (photometry and spectra) | Contrastive |
| AstroM³ (Rizhko & Bloom 2024) | Multi-modal time series (photometry, spectra, and metadata) | Contrastive |
| *AstroPT-Euclid (Siudek et al. 2025) | Multi-modal (images and photometry) | Generative |
| FALCO (Zuo et al. 2025) | Kepler time-series | Generative |
| SpectraFM (Koblischke & Bovy 2024) | Stellar spectra (synthetic & real) | Generative |
| *Gaia spectra (Buck & Schwarz 2024) | Stellar spectra (Gaia XP and RVS) | Contrastive |
| SSL for DESI Legacy Survey (Stein et al. 2021; Github) | DESI Legacy Survey galaxy images | Contrastive |
| GZ-Evo (Walmsley et al. 2022; Github) | Galaxy images (multiple observatories) | Contrastive |
| AstroPT (Smith et al. 2024; Github) | DESI Legacy Survey galaxy images | Generative |
| Radio Galaxy Zoo (Slijepcevic et al. 2022) | Radio-wavelength galaxy images | Contrastive |
| SSL for Radio Interferometric Images (Cecconello et al. 2024; Github) | Radio interferometric images | Contrastive |
| SSL for LOFAR (Baron Perez et al. 2025) | Radio galaxy images (LoTSS-DR2) | Contrastive |
- You shouldn’t be surprised to find that Lilian Weng has incredibly comprehensive blog posts on self-supervised learning and specifically contrastive learning. ↩
- Arguably, this is expanding the definition of a foundation model because it is being trained via supervised learning. Zoobot learns to predict vote fractions of citizen scientists’ morphological classifications. ↩
- But if you send me your paper/method and I add it to this post, then I’ll add an asterisk so everybody will know ;) ↩
- Astronomy Colloquium at the MIT Kavli Institute on April 15, 2025
- Talk and Q&A for the NASA Galaxies Science Interest Group on May 7, 2025
These two talks are pretty different.
The first (MIT Colloquium) is a 50-minute overview of my research program and includes everything from computer vision to low-mass galaxies to graph neural networks.
The second (NASA Galaxies SIG) comprises a 30-minute presentation, followed by 30+ minutes of discussion on galaxy scaling laws, mechanistic interpretability, and AI for scientific discovery.