Presentation at Louisiana State University
Will AI surpass human intelligence in the near future?
World models + scalable objectives will drive AI improvement
— Danijar Hafner, Google DeepMind (YouTube interview)
Silicon Valley consensus: AGI within reach in 3–5 years.
Example: Relay Polymerization (Yu Wang et al., Chem. Commun. 2021, 57, 3331–3334)
Original article scheme
AI-generated concept illustration
DOI: 10.1039/D1CC00682G
Example: Relay Polymerization (continued)
AI-generated concept illustration ✓ Good
AI-generated reaction scheme ✗ Disastrous
Does AI always do well on common, fundamental knowledge?
Chain-Growth Polymers
| Name | Monomer SMILES | Repeating Unit SMILES |
|---|---|---|
| Styrene | c1ccccc1C=C | *CC(*)c1ccccc1 |
| Methyl Acrylate ✗ | CC=CC(=O)OC | *CC(*)C(=O)OC |
| Acrylonitrile | C=CC#N | *CC(*)C#N |
| Methyl Methacrylate | CC(=C)C(=O)OC | *CC(*)(C)C(=O)OC |
Step-Growth Polymers
| Polymer | Monomer A SMILES | Monomer B SMILES | Repeating Unit SMILES |
|---|---|---|---|
| PET | C1=CC(=CC=C1C(=O)O)C(=O)O | OCCO | *CCOC(=O)c1ccc(cc1)C(=O)* |
| Nylon 6,6 | C(CCC(=O)O)CC(=O)O | NCCCCCCN | *NCCCCCCNC(=O)CCCCC(=O)* |
Task: generate SMILES strings for common monomers and their polymer repeating units. The AI listed results for each monomer individually, then summarized them in a table. (Generated by Gemini 3.0 Pro)
Does AI always do well on common, fundamental knowledge?
Styrene: C1=CC=C(C=C)C=C1 or c1ccccc1C=C (correct when discussed individually).
Methyl acrylate: CC=CC(=O)OC is wrong even when discussed individually; this is methyl crotonate, not methyl acrylate. Correct: C=CC(=O)OC.
Example: Property trend from CH₄ → CH₃CH₃ → Polyethylene
Not bad overall, but the increasing degree of polymerization is depicted like a rising temperature on a thermometer.
Example: Natural Rubber Vulcanization
Almost correct — a relatively well-known process that AI handles well.
Example: Polymer Architecture Illustration
Good overall — except the star polymer depiction is misleading.
Example: Extended Chain vs. Random Coil Comparison
Does not resemble an extended polymer chain — the representation is incorrect.
Example: Conformation Energy Profile
Structures depicted in the energy profile are incorrect.
If AI cannot do this today, will it be solved in 6 months? 1 year? 10 years?
AI is improving rapidly — these are temporary problems.
There may be intrinsic, architectural limits that persist regardless of scale.
Knowing what AI cannot do is as important as knowing what it can do.
The Setup: Describe the rules of Go in plain language to a state-of-the-art general reasoning LLM. It plays against a human.

AlphaGo succeeded through game-specific training, not general reasoning.
AGI (Artificial General Intelligence): The long-term goal, not yet achieved.
2 parameters tuned to fit data
Billions of parameters, same idea
Input → $f(x_1, x_2, \dots, x_n)$ → Output
| Neural network training | Analytical calibration |
|---|---|
| Feed the model labeled examples. | Measure standard solutions. |
| Compare output to correct answer (compute error). | Compare instrument response to known concentration. |
| Nudge parameters to reduce error. Repeat. | Adjust calibration until error is minimized. |
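This feed, compare, nudge loop can be sketched in a few lines of Python. A two-parameter straight-line fit, the simplest case in these slides, makes a minimal example; the data, learning rate, and iteration count are invented for illustration:

```python
# Hypothetical 2-parameter model: y = a*x + b (a "calibration line").
# Training data: exact samples of y = 3x + 1 (assumed for illustration).
data = [(x, 3 * x + 1) for x in range(10)]

a, b = 0.0, 0.0   # start with arbitrary parameters
lr = 0.01         # learning rate: how hard we "nudge"

for epoch in range(2000):
    for x, y_true in data:
        y_pred = a * x + b       # feed the model an example
        err = y_pred - y_true    # compare output to correct answer
        a -= lr * err * x        # nudge parameters to reduce error
        b -= lr * err            # ... and repeat

print(round(a, 2), round(b, 2))  # converges near a=3, b=1
```

The same loop, scaled to billions of parameters and automatic differentiation, is all that "training" means for the large models discussed later.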
2 Parameters
Fits a straight line
~10 Parameters
Captures complex curves
Billions of Parameters
Captures language, images, proteins
A model with "only" a few billion parameters is now considered a small language model.
A model is a fixed function. Weights do not change during inference.
Input → [Frozen Model] → Output
Training minimizes error but never eliminates it.
Adversarial attacks: imperceptible noise fools the model.
Fixing one behavior can silently degrade others.
Like brain surgery: fix one region, inadvertently disrupt connected functions.
Billions of intertwined parameters give correct outputs with no interpretable reasoning path — the model cannot tell you why it gave an answer.
A black box: you can observe inputs and outputs, but not the decision process inside.
All four problems reappear in LLMs.
By adding carefully crafted, human-imperceptible noise to an image, an ML model can be made to confidently misclassify it.
The model does not "see" the way humans do. It exploits high-dimensional statistical patterns.
Training minimizes average error but cannot guarantee robustness at every point in input space.
Annual ML-related publications in polymer science (2015–2025 proj.)
Enabling Infrastructure: Polymer-specific representations (SMILES → BigSMILES → PSMILES) and curated databases (PolyInfo, PI1M).
Raw 3-D coordinates fail as direct model input
Graph Neural Network encodes protein & drug structure
The model was trained with only bind/not-bind labels; binding-site locations were never provided. Yet gradient/attention analysis suggests the model can identify the correct binding site without supervision.
Fig. 2 — Overview of TSR-based method & key generation workflow (Kondra et al., 2021)
Every triangle of Cα atoms → one integer “key”
Select all Cα atoms; form every possible triangle. Each triangle → integer key encoding edge lengths, Theta, and amino-acid identity. A 100-residue protein generates C(100,3) ≈ 160,000 keys; large proteins exceed 1 M. Similarity between two proteins = shared-key fraction (Generalized Jaccard). Inherently invariant to rotation & translation by construction.
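The key-counting arithmetic and the shared-key similarity measure can be sketched as follows. The key-generation step itself (encoding edge lengths, Theta, and residue identity) is omitted, and `jaccard` is an illustrative helper, not the paper's code:

```python
from math import comb
from collections import Counter

# Number of Cα triangles for an N-residue protein: C(N, 3).
n_residues = 100
n_triangles = comb(n_residues, 3)
print(n_triangles)  # 161700, i.e. ~1.6 × 10^5 keys

# Generalized Jaccard similarity between two key multisets:
# shared keys (with multiplicity) over total keys.
def jaccard(keys_a, keys_b):
    ca, cb = Counter(keys_a), Counter(keys_b)
    inter = sum(min(ca[k], cb[k]) for k in ca)
    union = sum(max(ca[k], cb[k]) for k in ca | cb)
    return inter / union if union else 0.0
```

Because the keys are computed from internal distances and angles only, the similarity score is unchanged by rotating or translating either structure.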
Kondra et al., Front. Chem. 8:602291 (2021). doi:10.3389/fchem.2020.602291
Competitive with Foldseek; sparse tensor enables memory-efficient large-scale analysis.
| Dataset | Size | Type | TSR Accuracy | SSE-TSR Accuracy | Δ |
|---|---|---|---|---|---|
| CATH-based | 9,200 | Structural | 96.00% | 98.33% | +2.33% |
| SCOP-based | 7,000 | Structural | 95.46% | 99.00% | +3.54% |
| Functional-1 (published) | 7,800 | Functional | 99.41% | 99.50% | +0.09% |
| Functional-2 (new) | 7,200 | Functional | 95.83% | 98.83% | +3.00% |
Model: 3-D CNN on sparse SSE-TSR tensor • Compared against baseline TSR (3-D CNN) and Foldseek • All datasets: balanced, non-redundant splits
Khajouie et al., IEEE Trans. Comput. Biol. Bioinform. (2026). doi:10.1109/TCBBIO.2026.3654047
“Garbage in, garbage out” —
but also: genius in, genius out.
The single highest-leverage decision in any ML project is how you represent your data.
Takeaway: Invest your creative energy in the representation. Once the input captures the right physics and chemistry, a standard architecture will do the rest.


Approach: Multiclass morphology classification from processing/material variables (spin speed, annealing history, composition, substrate energy) using SVM/NN/CNN workflows, then SHAP/ridge-regression interpretation.
Findings: SVM reaches 93.75% for column/hole/island prediction; CNN-based AFM feature classification reaches ~97%; SHAP identifies additive ratio and processing knobs as dominant drivers.
Importance: moves morphology control from trial-and-error to interpretable process optimization.
Citations: Tu et al., Advanced Materials (2020), doi:10.1002/adma.202005713; R. et al., Soft Matter (2025), doi:10.1039/d5sm00335k; Lamb et al., Macromolecules (2026), doi:10.1021/acs.macromol.5c03272.
Process variables: solvent ratio, annealing trajectory, additive type/ratio, substrate surface energy
SVM/XGBoost for tabular prediction; CNN for AFM image-based morphology classification
93.75% morphology class accuracy; ~97% AFM feature classification; SHAP identifies additive ratio as dominant driver

PET plastic waste accumulates. Wild-type PETase enzymes denature at the temperatures needed for efficient depolymerization.
Structure-based ML proposes stabilizing mutations → wet-lab validation → combinatorial variant screening
FAST-PETase (5 mutations): active 30–50°C, broad pH, degrades post-consumer PET in days–weeks
Lu et al. (2022) Nature 10.1038/s41586-022-04599-z; Jiang et al. (2023) Environ. Sci. Technol. Lett. 10.1021/acs.estlett.3c00293; Medina-Ortiz et al. (2025) bioRxiv 10.1101/2025.02.09.637306.
Robotic processing + in-line characterization + Bayesian optimization — no human between cycles
Importance-guided BO balances exploration & exploitation over a high-dimensional process space
Fast convergence to high-conductivity, low-defect films; blueprint for self-driving R&D pipelines
Wang et al. (2025) Nat. Commun. 10.1038/s41467-024-55655-3; Roy et al. (2026) arXiv 10.48550/arXiv.2602.00103.
GA designs ROP monomers → virtual forward synthesis → polymer fingerprints. ML screens 6 properties ($T_g$, $T_d$, $\sigma_b$, $E$, $C_p$, $\Delta H_p$)
Surveys ~0.9M candidates in silico; fitness target: durability + recyclability for polystyrene replacement
>7,500 candidates hit all 6 targets; ~99.96% search cost reduction vs. exhaustive enumeration
Atasi, C.; Kern, J.; Ramprasad, R. J. Chem. Inf. Model. 2024, 64 (24), 9249–9259. doi:10.1021/acs.jcim.4c01530.
Part A — Training
Question for the audience: A neural network needs pairwise data — input and correct output. What could those pairs be for language?
Answer — mask & predict: Take any sentence, remove some tokens, predict what was removed.
"Polystyrene ______ in toluene at room temperature." → dissolves
Every paragraph on the internet is automatically a labeled training example — no human annotation needed.
Critical caveat: The training objective is linguistic plausibility, not truth. The model minimizes loss equally well by writing "I do not know" — or by confidently hallucinating.
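A minimal sketch of the mask-and-predict idea, using naive whitespace tokenization (real models use subword tokenizers and predict the hidden token with a network rather than storing it):

```python
import random

# Sketch: turning raw text into (masked input, target) training pairs.
def make_example(sentence, mask_prob=0.3, seed=0):
    rng = random.Random(seed)
    tokens = sentence.split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok   # the "label" is the hidden token itself
        else:
            masked.append(tok)
    return " ".join(masked), targets

inp, tgt = make_example(
    "Polystyrene dissolves in toluene at room temperature."
)
```

The labels come for free from the text itself, which is why the internet can serve as a training set without human annotation.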
Part A — Training
Train on hundreds of billions of tokens. Objective: predict the next token. Outcome: learns grammar, facts, reasoning patterns — but no sense of helpfulness or safety. Would complete "How do I build a bomb?" with the same fluency as "How do I make bread?"
Human raters compare pairs of responses and mark which is better. Model is fine-tuned toward preferred responses. Outcome: helpful, clear, appropriately cautious.
Humans flag prompts the model should refuse. Model learns boundaries. Outcome: safety guardrails.
Part A — Training
In 2026, we do not rely on a single LLM in isolation. Instead, we orchestrate a combination of models, external resources, and tools, guided by carefully designed system prompts. This orchestration is what makes them appear so intelligent and capable. Yet even the most advanced 2026 models still commit surprisingly basic errors — a consequence rooted in the fundamental principles of how LLMs are trained.
Part B — Architecture
cat=1, dog=2, polymer=3, solvent=4…
Theoretically works — a large enough network can learn the mapping.
But: integers carry false structure. Math treats 1 and 2 as "close" — but cat and dog are not closer to each other than to polymer in any meaningful sense. The model wastes capacity fighting irrelevant numerical noise.
Theoretically works — universal function approximator in principle.
But: processes positions independently — no direct connection between distant words; parameters explode with sequence length; must re-learn the same patterns at every position. Extraordinarily wasteful.
Takeaway: Neither approach is wrong in principle. The motivation for embeddings and attention is efficiency: the naive approaches cannot scale to the sizes needed to be useful.
"A titration technically works whether you use a burette or a firehose — but only one is practical."
Part B — Architecture
"Polystyrene dissolves in toluene"
Poly │ sty │ rene │ dis │ solves │ in │ to │ lu │ ene
Each fragment → an integer ID. The model processes numbers, not letters.
Each token ID maps to a vector of hundreds or thousands of numbers. Similar meanings cluster in this space — "polymer" and "macromolecule" land near each other; "polymer" and "Tuesday" do not.
Famous example: vector("king") − vector("man") + vector("woman") ≈ vector("queen") — the model was never told this; it emerged from training statistics.
This geometric encoding of meaning — meaning as spatial proximity — is the foundation on which everything else in an LLM rests.
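A toy illustration of the analogy arithmetic, with three-dimensional vectors invented for this example (real embeddings have hundreds or thousands of learned dimensions):

```python
# Hand-picked toy embeddings; real models learn these values from data.
emb = {
    "king":  (1.0, 1.0, 0.9),
    "man":   (1.0, 0.0, 0.9),
    "woman": (0.0, 0.0, 0.9),
    "queen": (0.0, 1.0, 0.9),
}

def sub(u, v): return tuple(a - b for a, b in zip(u, v))
def add(u, v): return tuple(a + b for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# king − man + woman lands nearest to queen in this toy space
analogy = add(sub(emb["king"], emb["man"]), emb["woman"])
best = max(emb, key=lambda w: cosine(analogy, emb[w]))
```

Nearest-neighbor search by cosine similarity is the standard way to read such a space; "meaning" is literally measured as an angle between vectors.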
Part B — Architecture
How should a model know what "it" refers to?
"The polymer dissolved because it is hydrophilic."
→ "it" refers to the polymer
"The polymer dissolved in water because it is a good solvent."
→ "it" refers to water
Identical grammatical structure — opposite referent. A plain feedforward network processes each position independently with no mechanism to connect "it" to "polymer" or "water" across the sentence. This is exactly what attention was designed to solve.
Part B — Architecture
| | The | polymer | is | soluble |
|---|---|---|---|---|
| The | ◑ | ◌ | ◌ | ◌ |
| polymer | ◌ | ● | ◑ | ◑ |
| is | ◌ | ◑ | ● | ◑ |
| soluble | ◌ | ● | ◑ | ● |
"soluble" strongly attends to "polymer" (dark cell)
"Like a chemist reading a paper: when you see 'yield was low,' your attention jumps back to the reaction conditions. Attention — formalized as math."
In plain English: compute a relevance score between every pair of tokens; normalize to sum to 1; take a weighted average.
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Q = Query ("what am I looking for?")
K = Key ("what do I contain?")
V = Value ("what info do I provide?")
Q, K, V are computed from the same input using ordinary weight matrices — no new math beyond Topic 2.
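The formula translates directly into code. A minimal NumPy sketch with random stand-in matrices for Q, K, V (real models compute them from the input via learned weight matrices):

```python
import numpy as np

# Scaled dot-product attention, exactly as in the formula above.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # relevance score per token pair
    # softmax: normalize each row so the weights sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
```

Each output row is a mixture of all value vectors, weighted by relevance; this is the mechanism that lets "it" reach back to "polymer" or "water" across a sentence.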
Part B — Architecture
All "progress" since 2017 = more layers · more data · more compute · better training — not a new design.
⚠ Notice: As long as the Transformer architecture remains fundamentally unchanged, the intrinsic limits of LLMs will not be eliminated, only mitigated by scale and better training strategies.
Part B — Architecture
Key message: Each component is a smart engineering strategy to train a better model with far less computation and data. The underlying principle is identical to one-integer-per-word + plain neural network. The Transformer achieves the same thing orders of magnitude more efficiently.
"Like assembling a house from bricks, windows, and beams — nothing fundamentally new, just a more efficient structure."
Part C — Generation
Input: "What is the capital of France?"
→ model → "Paris" (82%) → append
→ model → "." → append
→ model → [stop] → halt
"Generation is a repeated lottery — the most probable ticket usually wins."
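A toy version of the loop, with a hand-written probability table standing in for the model and greedy decoding (always taking the most probable token) standing in for sampling:

```python
# Invented probabilities for illustration; a real LLM computes this
# distribution with billions of parameters at every step.
next_token_probs = {
    "What is the capital of France?":
        {"Paris": 0.82, "Lyon": 0.08, "France": 0.10},
    "What is the capital of France? Paris":
        {".": 0.95, "!": 0.05},
    "What is the capital of France? Paris .":
        {"[stop]": 1.0},
}

def generate(prompt):
    text = prompt
    while True:
        probs = next_token_probs[text]
        tok = max(probs, key=probs.get)  # greedy: most probable ticket wins
        if tok == "[stop]":
            return text
        text = text + " " + tok          # append and feed back in
```

Real systems usually sample from the distribution rather than always taking the maximum, which is why the same prompt can yield different answers on different runs.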
Part C — Generation
The token-by-token lottery has no understanding behind it. The model samples "Paris" because that token statistically follows "capital of France" billions of times in training. It has no concept of France, no knowledge that Paris is a city.
"An LLM is a probability machine over tokens."
Trained on essentially all human-written text, the model's distribution spans the full space of plausible human responses. For any prompt, the correct answers, the helpful answers — and also the wrong, harmful, absurd ones — are all somewhere in that distribution.
AI companies did not make the model smarter — they developed strategies to steer sampling toward the good part of the distribution.
Distribution zones: helpful/correct · plausible but wrong · hallucination · harmful
Post-training (RLHF, alignment, system prompts, chain-of-thought…) steers the sampling cursor toward the helpful/correct zone — but never eliminates the others.
"The model is not getting smarter — it is being steered."
Part D — Intrinsic Limits & Powers
| What it looks like | What is actually happening |
|---|---|
| ChatGPT remembers our previous conversations | Every session re-reads a text summary; no model weights change |
| AI reasons through hard problems step by step | Probability-weighted token sampling at every step |
| AI knows an enormous amount about everything | Frozen snapshot of training data; nothing learned since cutoff |
Understanding the gap is not pessimism — it is how you use AI effectively. Each of these will be unpacked in the slides that follow.
Part D — Intrinsic Limits & Powers
Chat interface recalls topics from last week; seems to know your past preferences and build on previous conversations.
A "Memory summary" text file outside the model is prepended to every new session. The frozen model reads it — it does not remember anything.
"Memory" is bounded by context window. Long sessions become unreliable as early facts compete with later text for attention.
Callback from Topic 2, Limit 1: Model weights are frozen. One input → one output. Always.
ChatGPT "memory" is a feature of the application layer, not the model. Continuity across sessions is only as reliable as those external summaries.
Part D — Intrinsic Limits & Powers
Complex organic synthesis planning → correct, structured multi-step answer.
Seen many times in training data → high-confidence token patterns.
"Count the letter 'r' in 'strawberry'" → wrong.
Counting requires a deterministic sequential scan — not a next-token-prediction task.
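The contrast is stark in code: the counting task is a trivial deterministic scan for a program (this snippet is illustrative, not something the LLM itself executes):

```python
# Deterministic sequential scan: visit each character once, tally matches.
word = "strawberry"
count = sum(1 for ch in word if ch == "r")
print(count)  # → 3
```

A three-line loop gets it right every time; a next-token predictor answering from statistical patterns over subword tokens has no such guarantee.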
Same math question, slightly different phrasing → two different numerical answers.
No internal state. No logical verification. Each response is freshly sampled.
"The model cannot distinguish between 'I know this' and 'I am generating plausible text about this.'"
In chemistry: fluent, plausible-sounding reaction mechanisms are not the same as correct reaction mechanisms.
Part D — Intrinsic Limits & Powers
Training corpus
All human-written text up to the cutoff — scientific papers, books, conversations, code…
Training cutoff →
The model has no idea.
Any query after the cutoff is answered with confabulation.
"What is tomorrow's weather in Baton Rouge?" → needs real-time data
"Explain a board game invented after the cutoff" → any response is fabricated
"What did the latest JACS issue publish?" → named papers may be hallucinated
Workaround: External tools (web search, databases) connect the frozen model to live information — but these are scaffolding around the model, not changes to the model itself.
Part D — Intrinsic Limits & Powers
The model's only job is to produce the statistically most plausible next token. It has no mechanism to fact-check its own outputs and cannot distinguish what it knows from what it is generating.
"The model does not know it is wrong."
Callback from Topic 2, Limit 2: the model learned statistical patterns, not ground truth. The fake citation is particularly insidious: it looks exactly like a real reference and passes casual review.
Part D — Intrinsic Limits & Powers
Context window (filled left to right):
As the gray expands, the green (key facts) competes for attention — and loses.

Context window ≠ working memory. Having read something and reliably using it are different things. Long contexts also amplify hallucinations. Sometimes carefully chunked RAG is better than blind context stuffing.
Part D — Intrinsic Limits & Powers
Condense large volumes of text rapidly and accurately. Identify key arguments, themes, and conclusions across hundreds of documents in seconds.
Identify recurring themes, terminology, and methodological patterns across thousands of papers — impossible to do manually at scale.
Given explicit instructions, extract structured information, fill templates, follow multi-step workflows with high consistency.
Translate clearly described logical tasks into working code — anyone can automate complex processes without formal programming training.
Common thread: LLMs excel when answers can be assembled from patterns in language. They struggle when a task requires a deterministic procedure outside the domain of token prediction.
Part D — Intrinsic Limits & Powers
| AI automates or augments | Human expertise remains essential |
|---|---|
| Literature search and summarization | Deciding which questions are worth asking |
| Extracting structured data from text | Designing the extraction schema and validating outputs |
| Drafting text, code, protocols | Verifying chemical correctness and logical soundness |
| Identifying patterns across reports | Interpreting whether patterns reflect reality or artifacts |
| Executing defined multi-step workflows | Designing those workflows and catching edge cases |
"Not the end of programming, but the beginning of programming in plain language. Human judgment, domain expertise, and critical evaluation remain the irreplaceable component."
Limits are real — but so is capability.
When these capabilities are pipelined and automated, they become powerful tools for scientific research.
Message: The right strategy is not blind trust or rejection — it is controlled deployment.
Convert large volumes of polymer-science literature into a structured, queryable knowledge database.
Structures, values, and citations are correct.
Same logic gives stable outputs across reports.
No key synthesis, property, or context is missed.
Unconstrained chat-style usage rarely meets all three criteria at once.
This talk focuses on Approach B: explicit instructions, schema contracts, and feedback loops.
CL-Bench (Dou et al., Feb 2026 · arXiv 2602.03587) — 500 expert-crafted contexts, 1,899 tasks, 31,607 rubrics
Given a self-contained document (manual, legal code, SDK, lab protocol), can the model learn the new knowledge it contains and apply it to solve tasks — without relying on pre-training?
This is exactly what happens when you give an LLM a set of Agent Skills instructions.
Implication for this work: context engineering with explicit Agent Skills directly targets the dominant failure mode — context neglect. Structured schemas, validation loops, and clear rule documents are not over-engineering; they are the evidence-based answer to what unconstrained LLMs cannot reliably do on their own.
Practical proof: this approach already drove automated generation of study guides, slides, and exam PDFs using structured rule sets.
This is not a concept-only proposal — it has already worked in real workflows.
Chemical structures and math expressions handled via explicit Agent Skills rules — not left to LLM guessing.
Schema-driven extraction ensures consistent field coverage across different papers and report styles.
Study guides, slide decks, and exam PDFs generated automatically from structured pipeline output.
Rule-based handling for chemical structures and mathematical expressions
Automated generation of study guides, slide decks, and exam PDFs
Ambiguity example: does “polymer brush” mean a side-chain-rich architecture or surface-grafted chains?
Synthesis, structures→graphs, properties, citation intent.
Combine keywords, concepts, and citation networks.
Structured, validated, cross-referenced format.
Traverse relations: "Find structurally similar polymers."
Decompose & Synthesize: "Trace evolution of synthesis."
Example: "Compare thermal stability characterization methods across all reported polyimides."
LLMs are most powerful when integrated as controlled components in a validated scientific workflow — not as unconstrained chat assistants.
| Approach | Pros & Cons |
|---|---|
| Manual search & curation | Thorough but weeks/months of effort; limited scope; human inconsistency. |
| Automated keyword search | Fast but shallow; massive unfiltered output; no conceptual understanding. |
| AI-augmented structured workflow | Rapid, concept-aware, structured output with built-in error-checking and cross-report analysis. |
Domain-agnostic: The same framework applies to any experimental science field by simply swapping the Agent Skills documents.
LLM-agnostic: The same set of Agent Skills works with any advanced LLM — GPT, Claude, Gemini — and produces consistent, reproducible results.

AI will not replace human intelligence, but it will fundamentally augment how we process information and design materials.
Funding
This work is supported by the National Science Foundation under Award NSF-2142043.
Project
OVESET — Open Virtual Experiment Simulator Education Tools for Polymer Science Education oveset.orz.how
Special Thanks
GitHub Copilot served as an indispensable AI collaborator throughout this project — contributing to code development, data analysis pipelines, and scientific problem-solving, as well as helping design and refine this presentation. A true AI pair programmer in every sense.
Though it makes a lot of stupid mistakes, often!
Questions & Discussion