When Notes Are No Longer Enough: Knowledge Governance in the Age of AI-Assisted Engineering
An alternative approach to note-taking
Current note-taking software can no longer solve the efficiency problem of AI-assisted programming.
Notes have always been one of the most important productivity tools for programmers. Almost every generation of engineers has tried—based on their own workflow—to build or modify a personal note system: from plain text files and wikis to Word, Evernote, Notion. At the core, they’ve all been trying to solve the same problem: how to preserve usable thinking outcomes in a high-intensity information-processing environment. But the reality is that note systems have never been good enough—especially for people who need to process large volumes of text, code, and decisions every day.
For the past few years, I used Notion as my primary note tool. A common pattern was to open a table: each row had a timestamp, links, and context. When debugging, I’d mark items with a 🐛 icon and keep stuffing clues, ideas, logs, and code into it—until one day I worked almost twelve hours straight, looked up, and realized I had produced close to two hundred records in a single day. Each Notion “record” is essentially an infinitely expandable page, not a cell like in Excel. Even if each page is only two to three thousand words, two hundred pages is already around 400,000 words—far beyond what any human can read, understand, and digest in a day. And this wasn’t a one-time extreme case. It gradually became the norm.
Against this backdrop, in the last few weeks I slipped into a strange state: lots of things to do, but progress was slow; I started subconsciously avoiding work; my sleep kept extending. Then the East Coast got hit by continuous heavy snow. I was stuck at home for a few days, then went out to ski for a few days. On the surface it looked like procrastination or escapism, but what I actually felt was deep cognitive fatigue. Meanwhile, large language models became absurdly powerful—code, documents, proposals could be generated almost instantly. But this capability didn’t make me more efficient or clearer. Instead it triggered intense frustration, because I realized it wasn’t just me. After talking with friends, we all seemed to be experiencing the same kind of overload: once personal capability is amplified by tools to a certain level, what follows isn’t liberation—it’s a more severe backlash.
The industrial language production machine
Looking back, the problem concentrates in three areas. First, the scale of generated information and text has already exceeded human physiological limits, regardless of individual skill. Second, generation is too fast—so fast that you can’t even remember what you just read. The LLM keeps answering, keeps expanding, keeps offering paths, but it never gives a “stop here” signal. Third—and most fatal—is implicit drift: your state depends entirely on maintaining the chat window; windows accumulate; a few days later you revisit the code and can no longer reconstruct the context. The same concept, script, or index keeps drifting across names, paths, and IDs—differences like vault_index, index_vault, inventory_index pile up. A few weeks later, when you look at the whole project, what remains is a feeling of total loss of control.
During the days I was trapped by the snow, I kept thinking about this and gradually realized it wasn’t a personal time-management problem—it was an inevitable outcome of the mechanism. An LLM is essentially a zero-shot universe: it can generate content at near-zero cost, it doesn’t get tired, it doesn’t hesitate, and it never makes the decision “is it worth continuing.” Prompts are only soft constraints; there is no hard shutdown condition. It’s an industrial machine continuously producing language and symbols, while the human brain doesn’t have the same capacity. Drift isn’t accidental—it’s the inevitable product of this mechanism. Extending context length doesn’t solve it, because the model is always just predicting the next token; if it can’t find the most correct answer, it guesses one. More importantly, the whole process has no accountable actor. Hundreds or thousands of prompts per day strongly depend on the local context at the moment; afterwards it’s nearly impossible to reproduce, and impossible to hold anyone responsible.
This is a governance problem
In this situation, I started to realize the problem isn’t “I didn’t remember enough.” In this era, continuing to rely on the traditional assumption of note-taking—“write more and it’ll be useful someday”—is itself unsustainable. Notion isn’t lacking search. But when I truly need to solve a concrete problem and trace back my past thinking, search often returns hundreds or thousands of results, many of which conflict with each other, lack context, or are temporally misaligned. They’re almost impossible to use directly. Search doesn’t reduce cognitive load—it amplifies it.
And in this context, after several days of continuous development in this repo, I once again tried to rebuild my recording system. I’ve tried many times before; most attempts ended in failure. But this time, something finally felt “on track”—at least enough to write down and share. The key isn’t that I found a “better tool.” The key is that I realized this isn’t fundamentally a tool problem. At this stage, humans are inevitably hitting biological and cognitive constraints. Yes, I do need a system that fits my personal workflow better. But more importantly, attention has to return to the real core: this isn’t about notes, or programming, or a one-off engineering fix with a single app or library. The core is one thing only: governance of knowledge.
This is a governance problem.
What does “governance of knowledge” mean? At a large scale, almost all truly difficult problems in human society are governance problems. Politics is governance: how a state is governed. Law is governance: how legislation, judiciary, and enforcement are divided and constrained. A constitution is also governance: not just writing it down, but how it is interpreted, how it is obeyed, and how you ensure it is truly obeyed. Scale down: team management, institutional design, process constraints—these are also governance problems, not tool problems. Governance is never about “do we have information” or “can we access it,” but about what is recognized as valid, what judgments carry authority, what states can enter long-term memory, and what must be restricted, frozen, or deprecated.
At the personal level, the real difficulty of “knowledge management” is exactly here: not lack of information, not weak search, but the absence of a mechanism that can adjudicate, constrain, and take responsibility. That’s why countless programmers have repeatedly attempted to solve this: from early personal knowledge bases and tagging systems, to backlinks and graphs, to vector search, to Obsidian, semantic search plugins, and AI assistants. They solve “how to find content,” “how to link content,” “how to generate content.” But they almost all assume one premise: once content is written down, it is inherently valuable. As long as it can be stored and searched, the system is “successful.” Solutions like this—and open-source projects—are everywhere. I’ve borrowed from them too. I also store everything as .md files; an Obsidian vault can read it. But this is not the point!
In an era when information generation was far lower than today, “if you can find it, it’s a win.” In an era where content can explode at near-zero cost, that premise has collapsed. What’s missing is not a smarter search algorithm or a more complex representation. What’s missing is a governance structure that can answer: what can be recognized as knowledge, what is merely process noise, what must be accountable, what must be reproducible, and what is not allowed to drift silently.
So I’ll talk about what I’m trying now. This repo is still new, but I strongly feel the direction is right. If you share the same pain, maybe this can inspire you.
A Sovereign Log: simple code, expensive writes, no LLM integration (yet) — to hold my core cognition steady
This system is not designed for high-frequency recording—quite the opposite. It deliberately makes writing expensive, slow, and thought-requiring. The implementation stays as simple as possible at the code level. It does not rely on complex systems, and it does not directly integrate an LLM, because once you integrate it, zero-cost generation immediately erodes its nature as an “adjudication record.” At this stage it’s all hard-coded. Even without a UI, it has already solved many real problems for me.
.
├── _system
├── docs
├── inbox
├── requirements
├── sovereign_log
├── templates
└── tools
There are two text libraries made of .md files: sovereign_log and docs.
Sovereign_Log
Nothing started from a grand system. I initially built an extremely small and disciplined vault. Its starting point was no different from an ordinary Obsidian vault. If you look at its graph, there’s almost no structural feature: no intentional backlink structure, no topic network, no pursuit of connection density. The only real difference is quantity—very small. This is a constraint I deliberately maintain. In this vault, every piece is not a draft, not “write first and see later,” but the result of repeated refinement. I iterate with AI in multiple rounds, compressing expression, correcting boundaries, clarifying meaning. Early on I wrote mainly in English, adding some Chinese translation as a semantic validation tool to make sure I truly understood what I wrote. As the system matures, it will likely become fully English. At this stage, Sovereign_Log is a highly refined personal knowledge set under full personal sovereignty. The principle is simple: don’t write casually, don’t treat it as drafts, and quantity is not the goal—precision is. Sometimes a full day of thinking, reasoning, and alignment results in only one or two notes. That is intentional entropy control. You must maintain this discipline, or it becomes the same as every other note system where you dump everything in.
Doc
Doc still looks like “notes,” but it is not a natural extension of Sovereign_Log. It is a clear promotion step. Content in Doc is not just “more formal writing”—it is promoted text in an IR state, somewhere between narrative and executable code. It is not code yet, but it no longer allows free-form prose. When designing this library, I introduced the counterintuitive principle of “expensiveness” to resist the fact that LLMs generate text and code too fast and too much. What enters the core must be expensive institutionally—not emotionally or stylistically. This expensiveness has three layers.
First: format-expensive. Doc is not free text. Anything that enters must conform to a fixed structure and pass machine-checkable format audits. The goal is not aesthetics; it is that the text is parseable, indexable, auditable, and constrainable. A format-invalid doc is simply not eligible for this layer.
Second: process-expensive. Content cannot enter Doc immediately. It must first exist in Sovereign_Log, accumulate time, be repeatedly searched and referenced in real use, be converted into code or decisions, and demonstrate actual behavioral impact. Only then does it become a candidate, and only after explicit authorization is it promoted into Doc. This matters even more once you introduce shared team knowledge: it blocks random ideas, models, or impulses from polluting the core layer. Doc is a scarce resource—not a place to store “cleaned up notes.”
Third (most important right now): semantic-expensive. My requirement for Doc is not just clarity or coherence. Semantics must meet the bar: rules cannot contradict each other; implicit conflicts are not allowed; responsibilities and boundaries cannot be vague; every declaration must have the potential to generate binding code. Many large document repositories fail not because they lack information, but because they accumulate semantic mess: drifting definitions, conflicting clauses, and ultimately no path to executable constraints. Doc aims at the opposite: every line is a potential system constraint; every semantic unit is a future source of code. That means semantics must also be auditable and machine-intervenable.
├── docs
│ ├── decisions
│ ├── governance
│ │ ├── CLAIM-AUTHORING-STYLE-V0-1.md
│ │ ├── EVIDENCE_SUGGEST_AUDIT.md
│ │ ├── RPT-2026-01-26-RUNTIME_ENV_GOVERNANCE.md
│ │ ├── SD-0007_RUNTIME_ENV_FREEZE.md
│ │ ├── SD-0008_INDEX_CONSOLIDATION.md
│ │ ├── SD-0009_DOC_IR_SPEC_v0.md
│ │ ├── SD-0010 — Normative Boundary & Semantic Consistency Rules
│ │ ├── SD-0011_ID_Taxonomy_and_Identity_Ledger.md
│ │ └── SD-0012 — Observability Causal Structure & Span Identity
│ ├── invariants
│ │ ├── INV_CATALOG_v0.md
│ │ ├── INV_DOC_IR_BOOTSTRAP_v0.md
│ │ └── INV_INDEX_ROOTS_0001.md
│ ├── Machine-Checkable Invariants.md
│ ├── runs
│ ├── SD-0001-System-Constitution.md
│ ├── SD-0002-Derived-Artifact-Path-Freeze.md
│ ├── Sovereign Log — Write Protocol v1.0.md
│ ├── stage_contracts
│ │ └── PRE_STAGE_5.md
│ └── taxonomy
│ ├── INDEX_TYPES.md
│ └── TDES.md
At this stage, my system is not integrated with any LLM, at least not at this layer. All constraints, screening, and promotion logic are implemented in code. If you’re a programmer reading this and you find the method useful, the technical implementation is not complex—it is straightforward engineering. I won’t expand on the code here; I’ll use one concrete example to explain what I mean by “semantic expensiveness.”
For semantic screening, every audit run produces two reports: a JSON report for machine consumption and a Markdown report for humans. Below is a summary of one semantic audit run.
The audit is AUDIT_DOC_SEMANTIC — 20260130T205405Z_doc_semantic_audit. The result is ok: False, meaning it fails institutionally. The system scanned the docs directory and extracted 77 claims, using embedding similarity and heuristic semantic pattern matching (model: all-MiniLM-L6-v2, threshold 0.92, top-k 8, max pairs 200). Modality/target/predicate inference is still heuristic in v0.1 and will later be promoted into explicit structured claim tags. Duplicates are warnings by default, but here I upgraded conflicts to errors—meaning they must be resolved.
The key violation is SEM-CNF-001, a semantic conflict. It is not vague; it is a modality conflict: across two documents, on the same (target, predicate) key (here ('GLOBAL','evidence')), the system sees mutually contradictory normative statements—one must, one must_not. The conflict is precisely localized to two concrete claims: SD-0011#C-0004 vs DOC-EVIDENCE-SUGGEST-AUDIT-V1#C-0003. That means at the Doc layer, the system is told “this must happen” and “this must not happen” with the same scope.
This is unacceptable. Once such semantics enters the core Doc layer, any attempt to generate binding code, policy, or runtime constraints immediately loses determinism. The audit therefore forces a choice: resolve by priority (e.g., INV > SD > taxonomy) or explicitly supersede one claim so it becomes institutionally inactive. This is about which rule continues to survive in the system.
To me, this is “semantic expensiveness”: not relying on human memory of “something seems inconsistent,” but making the system refuse non-closed, inconsistent, non-executable semantics at the Doc layer. That is what allows Doc to become more than a serious-looking document collection—it becomes eligible to constrain future code and system behavior.
AUDIT_DOC_SEMANTIC — 20260130T205405Z_doc_semantic_audit
ok: False
violations: 1
Meta
{
"docs_dir": "docs",
"claims_count": 77,
"dup_threshold": 0.92,
"topk": 8,
"max_pairs": 200,
"dup_severity": "warn",
"model": "sentence-transformers/all-MiniLM-L6-v2",
"notes": [
"v0.1 modality/target/predicate inference is heuristic; promote to structured claim tags later.",
"duplicates are warnings by default; set --dup_severity error to hard-enforce."
]
}
Violations
1. SEM-CNF-001 (error)
path: docs/governance/SD-0011_ID_Taxonomy_and_Identity_Ledger.md | docs/governance/EVIDENCE_SUGGEST_AUDIT.md
message: Modality conflict (must vs must_not) on (target,predicate)=('GLOBAL','evidence'): SD-0011#C-0004 vs DOC-EVIDENCE-SUGGEST-AUDIT-V1#C-0003
hint: Resolve via priority (INV > SD > TAXONOMY ...) or supersede one of the conflicting claims.
meta: 46502865b8712493e52dffd4a6b6871cd994c162f900ea8e6e7a8c3ebbcb5873
How it’s implemented
Technically, this mechanism does not require an LLM or “intelligent understanding.” The core idea is to compress document semantics into a set of structured elements that are computable, comparable, and fail-able, then audit them with deterministic rules. The system can be decomposed into independent layers; you can adopt or replace any layer without breaking the overall approach.
Step one is introducing declarative semantic units: documents are no longer treated as a single block of natural language, but must explicitly mark minimal governed units—stable numbered claims. You do not need complex parsing; a simple, robust textual pattern is enough. The key is not parsing power but forcing the author to choose: “this sentence is subject to governance.”
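The extraction itself stays trivial: the authoring pattern is a fixed, grep-able prefix such as - C-0001:, and the parser only has to match it.

import re
from dataclasses import dataclass

# Claims are authored with a fixed, grep-able prefix, e.g. "- C-0001: ...".
CLAIM_RE = re.compile(r"(?m)^\s*-\s*(C-\d{4})\s*:\s*(.+?)\s*$")

@dataclass
class ClaimRaw:
    claim_id: str
    text: str

def extract_claims(md: str) -> list[ClaimRaw]:
    out = []
    for cid, txt in CLAIM_RE.findall(md):
        out.append(ClaimRaw(claim_id=cid, text=txt.strip()))
    return out

Reproduction note: all you need is a stable, grep-able authoring format; parsing never has to get clever.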
Step two is discretizing modality: instead of understanding everything, the system only cares about normative posture—must, must not, allow, forbid, or unknown—implemented via a deterministic rule table. This turns conflict detection into a finite-state problem, not a language understanding problem.
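The modality inference is exactly that rule table, in code:

import re

# Ordered, deterministic rules: "must not" patterns are checked before "must".
MODALITY_RULES = [
    ("must_not", re.compile(r"\b(must not|shall not|forbidden|prohibited|not allowed)\b", re.I)),
    ("must", re.compile(r"\b(must|shall|required|mandatory)\b", re.I)),
    ("forbid", re.compile(r"\b(forbid|prohibit|disallow|deny)\b", re.I)),
    ("allow", re.compile(r"\b(allow|permit|allowed)\b", re.I)),
]

def infer_modality(text: str) -> str:
    for name, rx in MODALITY_RULES:
        if rx.search(text):
            return name
    return "unknown"

This is not NLP; it is an auditable rule table. Extend it with your organization’s normative vocabulary (MUST/SHALL/forbid/allow) as needed.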
Step three is anchoring the target: to avoid all rules interfering globally, each claim should map to a stable target key (path prefix, resource name, explicit scope marker). Claims without anchors fall back to GLOBAL and thus incur higher conflict risk—this is a deliberate design pressure. GLOBAL creates endless false positives. In practice most claims are domain-specific; you rarely mean “this applies to the whole system.”
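Anchoring can be equally simple: explicit backtick anchors in the text map to directory-ish prefixes, and everything else falls back to GLOBAL.

import re

BACKTICK_RE = re.compile(r"`([^`]+?)`")

def infer_target(text: str) -> str:
    # Prefer explicit anchors like `docs/`, `_system/`, `sovereign_log/`.
    for m in BACKTICK_RE.finditer(text):
        tok = m.group(1).strip()
        if tok.startswith(("docs/", "_system/", "sovereign_log/")):
            # Collapse a file path to a directory-ish prefix for stability.
            parts = tok.split("/")
            if len(parts) >= 2:
                return parts[0] + "/" + parts[1] + "/"
            return parts[0] + "/"
    return "GLOBAL"

What you want is not a precise path but a stable, comparable target key; if you use backticks as explicit anchors while writing, GLOBAL stays rare.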
Step four is a coarse predicate classification to avoid false conflicts: you do not need a full ontology; a small extensible predicate enum (placement, authority, evidence, admission, index, …) is enough to reduce unrelated collisions.
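The predicate classifier is the same idea: keyword buckets, nothing more.

def infer_predicate(text: str) -> str:
    t = text.lower()
    if any(w in t for w in ["authoritative", "canonical", "binding", "truth layer"]):
        return "authority"
    if any(w in t for w in ["faiss", "semantic index", "vault_index", "index root", "index output"]):
        return "index"
    if any(w in t for w in ["write to", "stored under", "must exist only at", "reside at", "live under"]):
        return "placement"
    if any(w in t for w in ["admitted", "accepted", "effective", "governance-valid"]):
        return "admission"
    if any(w in t for w in ["evidence", "ingest", "payload", "manifest"]):
        return "evidence"
    return "general"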
After these four steps, each claim compresses to (modality, target, predicate) and becomes auditable by deterministic rules. Conflict detection becomes simple: on the same (target, predicate), do mutually exclusive modalities co-exist (must vs must_not, allow vs forbid)? If yes, fail immediately with precise claim-level localization. You can optionally add near-duplicate detection via embeddings as “governance hygiene” rather than a hard constraint.
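A minimal sketch of that check, assuming each claim has already been compressed to a dict of claim_id, modality, target, and predicate (the dict shape here is illustrative, not the audit’s exact internals):

import json
import sys
from collections import defaultdict

EXCLUSIVE = [("must", "must_not"), ("allow", "forbid")]

def find_conflicts(claims: list[dict]) -> list[dict]:
    # Group claims by their (target, predicate) key, then look for exclusive modalities.
    groups = defaultdict(list)
    for c in claims:
        groups[(c["target"], c["predicate"])].append(c)
    violations = []
    for key, group in groups.items():
        present = {c["modality"] for c in group}
        for a, b in EXCLUSIVE:
            if a in present and b in present:
                ids_a = [c["claim_id"] for c in group if c["modality"] == a]
                ids_b = [c["claim_id"] for c in group if c["modality"] == b]
                violations.append({"rule": "SEM-CNF-001", "severity": "error", "key": list(key),
                                   "message": f"Modality conflict ({a} vs {b}) on {key}: {ids_a} vs {ids_b}"})
    return violations

if __name__ == "__main__":
    # Demo input mirroring the real violation from the audit report above.
    demo = [
        {"claim_id": "SD-0011#C-0004", "modality": "must", "target": "GLOBAL", "predicate": "evidence"},
        {"claim_id": "DOC-EVIDENCE-SUGGEST-AUDIT-V1#C-0003", "modality": "must_not", "target": "GLOBAL", "predicate": "evidence"},
    ]
    v = find_conflicts(demo)
    print(json.dumps(v, indent=2))
    sys.exit(1 if v else 0)  # the audit must be able to fail, and CI must see it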
Audit output should be machine-first: structured results for automation, human-readable reports for review. Crucially, audits must be able to fail and propagate that failure via exit codes/status signals—otherwise governance never truly enters engineering reality. If you want to reproduce this mechanism, you don’t need to copy any implementation. You only need to answer three questions: are you willing to introduce explicit declarative semantic units; are you willing to accept the constraints of compressing natural language into a few finite semantic dimensions; and are you willing to let “document semantic inconsistency” break the pipeline the way a compile error does. Once all three answers are “yes,” this pattern of semantic expensiveness can be reproduced in any language, any tool, any storage structure.
Why not integrate an LLM? I will integrate it later, but right now I need a foundation that is completely disconnected from the model. By this point the reason should be clear: history and governance must be borne by deterministic mechanisms; models can be advisors, not judges. Failure conditions must be stable; system boundaries must be clear. This already makes the whole process feel calmer.
Is hard-coding an almighty problem-solver? Of course not. It cannot cover complex semantics. Keywords, rule tables, and predicate enums are low-resolution cognitive compression: they don’t understand metaphor, context, or irony, and they don’t adapt automatically to new expressions. Hard-coding looks dumb, slow, conservative. Each new predicate and rule change requires human judgment. It feels inefficient early on. But I want that “slowness”—it is friction by design, a safety mechanism against runaway generation. In an era where you can generate a hundred rules in one second, slowness itself becomes a safety feature. It forces the author to pay real cognitive cost before something enters the core layer. This is also a mental trap for programmers: it’s easy to detach and think you’re not part of the system mechanism. But in real projects, you are. You must be constrained too.
The core is Vault Index and Query
Let’s pause the big picture here. The remaining “details” (derived-layer storage, isolation, tool/templates, report lifecycle/cleanup) are huge, but technically not difficult. They’re mostly engineering hygiene. Once you have clear boundaries (Truth / Decision / Evidence / Derived) and strict write-path constraints, implementation is straightforward. The harder part is institutional consistency and long-term maintainability.
What I really want to show is my vault_indexer: it’s not “better search,” it’s an indexing mechanism that turns a note system into an evidence substrate. Most note apps will end up with “semantic retrieval + chunking.” But what determines long-term value isn’t retrieval itself; it’s how you organize evidence chunks, how you make them citable/reproducible/auditable, and most importantly: what layer the index belongs to (Truth or Derived).
I strictly treat the index as Derived: it never has “truth authority.” Its job is to compile text assets in docs/ and sovereign_log/ into retrieval-oriented intermediate artifacts: stable chunks (usually paragraph-level), stable IDs (chunk_id), source path and position, and content hashes/digests. Then you embed chunks, build a vector index (FAISS or similar), and query in natural language to retrieve evidence chunks. “Evidence” here is not “related text,” but an object with citation structure: each chunk can be precisely referenced, copied into evidence packs, reused by toolchains, and even required by audit systems (“your output must cite these chunk_ids”).
That explains why I don’t use LLMs as “writing machines” here, but as the best possible query operator. The LLM’s strength isn’t generation—it’s decomposing a vague question into searchable subquestions, constructing effective queries, structuring summaries based on retrieved evidence, and mapping those summaries back to development actions (which file to modify, which invariant to add, which claim to patch, which diff to write). Once you have a stable information substrate—core principles, boundaries, constraints, historical responsibility frozen in Doc/Sovereign_Log—the best use of an LLM is evidence-driven collaboration: query first, retrieve evidence packs (JSON/MD), then drive decisions and code from those packs, rather than letting the model hallucinate from memory.
This is structurally different from Obsidian plugins focused on “search.” Most plugins optimize UI speed: match → open note → human judge. They rarely solve: do hits have stable citations? can results be reproduced in CI/audit? will retrieved text be mistaken as authoritative? can you form a loop of “question → evidence → decision → code → write back evidence”? My vault_indexer turns retrieval output into versionable, reusable intermediate artifacts (packs/reports/audits) and explicitly locates them in the Derived layer: rebuildable, replaceable, cleanable—but never allowed to contaminate Truth/Doc authority.
In short: many plugins treat search as a reading entry; I treat indexing as a development input. The former optimizes “finding notes,” the latter optimizes turning notes into engineering evidence so LLMs act as “evidence analyzers + query compilers.” This is why, once your system substrate stabilizes, retrieval stops being a nice-to-have and becomes a core engine: you’re not “writing more notes,” you’re using retrieval to feed past constraints and evidence back into present decisions and code.
No more talk—here is evidence.
We start from a real query_vault result. One premise: this system is still early; it doesn’t become magical overnight. It needs long-term use, real projects, and continuous maintenance. Early on, scores won’t be very high. In my vault, a 0.7 hit is already “good”—worth careful reading. Index quality depends on what you put in: real development, real rules, validated high-quality content, and continuous maintenance. Over time: the more you use it, the better it gets, and the more it reduces mental load. That’s what a real second brain looks like.
A query_vault run is a reproducible retrieval object: it doesn’t “answer,” it lists candidate evidence chunks. Key fields: run_id, query, model, and chunk-level results. Each result has stable identity (chunk_id), precise location, content hash, mtime, score (ranking signal only), and a preview snippet (not the full evidence).
The key is what you’re asking. For run session trace ordering, you’re not asking for definitions—you’re asking about ordering and legitimacy constraints: what inferences are forbidden, and what must be explicitly declared rather than guessed from time or structure. That’s why a 0.6–0.7 hit can be high-value: it anchors LLM behavior and prevents conceptual drift.
{
"run_id": "20260131T014112Z_query_vault_b52ba27830",
"query": "run session trace ordering",
"scope": {
"note_path_prefix": null
},
"topk": 12,
"per_note_cap": 2,
"exclude_temporal_note": false,
"model": "sentence-transformers/all-MiniLM-L6-v2",
"results": [
{
"chunk_id": "f91b794ab50cc20b12ee18d2a74a532ed5b0630d",
"note_path": "docs/governance/SD-0011_ID_Taxonomy_and_Identity_Ledger.md",
"heading_path": "",
"paragraph_index": 65,
"hash": "ed7cc7a84aa209ae63eb30ef531ab5cfd5cb0a649b8b98da78db3c9d3cf0f73c",
"mtime": 1769822404.9170113,
"score": 0.7674295067787171,
"snippet": "- Trace continuity MUST NOT be used to infer the existence of a session. - Session existence MUST NOT be inferred from trace structure. - Time ordering alone MUST NOT substitute for explicit run or session declaration.",
"boost": {
"prefix": "docs/",
"factor": 1.1
}
},
{
"chunk_id": "74abc9e9857e9ee0f805584d3b95dee8c0142f29",
"note_path": "sovereign_log/Session-Run-Trace-Formal Definitions.md",
"heading_path": "",
"paragraph_index": 29,
"hash": "f08025c0500946d913c04bbecbb5ee1ed106120b5d9ecc3f21d9a5254395578b",
"mtime": 1769737995.4226983,
"score": 0.6208187341690063,
"snippet": "- A Run **must exist** before any Session. - A Session **must exist** before Events are promoted. - A Trace **must exist** for any causal claim. - Time ordering **must never replace** trace structure."
},
{
"chunk_id": "8c355b0da6ce66924ec7fdb83adc71dc6a4e17d7",
"note_path": "docs/governance/SD-0011_ID_Taxonomy_and_Identity_Ledger.md",
"heading_path": "",
"paragraph_index": 60,
"hash": "c056ccfc21a6cd4e9352ed599b679a0018f95bbf828fd0f017fdb9ff6af51c90",
"mtime": 1769822404.9170113,
"score": 0.5683699011802674,
"snippet": "- A `trace_id` MUST be associated with a valid `session_id`. - Trace-level artifacts that lack an explicit session binding are non-legitimate.",
"boost": {
"prefix": "docs/",
"factor": 1.1
}
},
{
"chunk_id": "b8ecd51c38bb6cffceb35546af4f27176d67027d",
"note_path": "sovereign_log/Session-Run-Trace-Formal Definitions.md",
"heading_path": "",
"paragraph_index": 32,
"hash": "32a2b22ffaf1ffcabcb49924a30bdcbbf00c46b052403ca5ea32e9ad1739b0c6",
"mtime": 1769737995.4226983,
"score": 0.5365391969680786,
"snippet": "> **Run answers “which execution.” > Session answers “which lifecycle.” > Trace answers “why.”**"
},
We’ve become too impatient: whenever we have a problem, we immediately prompt for an answer. LLMs are great at weaving “perfect answers.” That’s dangerous. That’s the root of drift.
If you’re considering building your own version, start with two core scripts: index_vault.py and query_vault.py
Reproducing the code is not hard. The real difficulty is not the code itself, but whether you are willing to accept a knowledge governance regime like this.
At the current stage, I deliberately do not integrate an LLM into the execution pipeline. I only hand the LLM the evidence that has already been produced (evidence packs) and the audit reports, and ask it to interpret and suggest. Whether you use the LLM in an interactive chat window today, or later embed it inside your system, this does not change the essence:
The LLM’s strength is understanding your question, organizing an appropriate query, and making suggestions under the constraints of the evidence you provide—NOT deciding what is true.
What determines system quality is how you treat evidence, and how you make knowledge governable. This is my position right now, and it’s very explicit.
Therefore, this is not “a search script.” It is a minimal Evidence Retrieval Substrate, with a deliberately restrained goal:
Compile a vault (Markdown) into a chunk-level semantic index (Derived, rebuildable, and with no truth authority)
Return citable evidence units in query results (chunk_id + precise location + content hash + snippet)
Persist every query result as an evidence pack (JSON + MD) so both LLMs and humans can reuse it repeatedly during real development
If you want to reproduce this, what matters is not FAISS or the embedding model per se, but these three invariants:
chunk identity must be stable, citations must be traceable, and outputs must be replayable.
Script A: index_vault — Build the semantic index
Technical choices and underlying principle
Input: Markdown under docs/ and sovereign_log/
Output: _system/artifacts/vault_index/ (explicitly a Derived layer)
What it does is essentially “compilation”:
Scan Markdown files strictly inside the vault
Split each file into paragraph-level chunks
For each chunk:
Drop chunks that are too short (min_chars)
Generate a stable chunk_id
Record minimal but critical metadata (note_path, paragraph_index, hash, mtime)
Use sentence-transformers to encode chunk text into vectors (with normalization)
Build a FAISS vector index (IndexFlatIP, equivalent to cosine under normalized embeddings)
Persist artifacts:
meta.jsonl: one line per chunk, containing identity + provenance
index.faiss: the vector index file
config.json: index configuration for reproducibility
Why IndexFlatIP? Very simple: with normalize_embeddings=True, inner product equals cosine similarity. This is the most stable and interpretable baseline. Until you hit real performance bottlenecks, there is no need to jump to IVF/HNSW-style complexity.
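To make the shape concrete, here is a compressed sketch of the whole indexing pass under the conventions above (paragraph chunks split on blank lines; the min_chars threshold and output layout follow the article, everything else is illustrative):

import hashlib, json, os, pathlib
import faiss
from sentence_transformers import SentenceTransformer

VAULT_DIRS = ["docs", "sovereign_log"]
OUT = pathlib.Path("_system/artifacts/vault_index")
MIN_CHARS = 40  # illustrative threshold

def main():
    OUT.mkdir(parents=True, exist_ok=True)
    metas, texts = [], []
    for root in VAULT_DIRS:
        for path in sorted(pathlib.Path(root).rglob("*.md")):
            paras = [p.strip() for p in path.read_text(encoding="utf-8").split("\n\n")]
            for i, chunk in enumerate(paras):
                if len(chunk) < MIN_CHARS:
                    continue  # drop short chunks but keep their paragraph_index slot
                content_hash = hashlib.sha256(chunk.encode()).hexdigest()
                # chunk_id is bound to content: if the content changes, the ID changes.
                chunk_id = hashlib.sha1(f"{path}|{i}|{content_hash}".encode()).hexdigest()
                metas.append({"chunk_id": chunk_id, "note_path": str(path),
                              "paragraph_index": i, "hash": content_hash,
                              "mtime": os.path.getmtime(path)})
                texts.append(chunk)
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embs = model.encode(texts, normalize_embeddings=True)  # normalized: IP == cosine
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)  # vector order must match meta.jsonl line order exactly
    faiss.write_index(index, str(OUT / "index.faiss"))
    with open(OUT / "meta.jsonl", "w", encoding="utf-8") as f:
        for m in metas:
            f.write(json.dumps(m) + "\n")

if __name__ == "__main__":
    main()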
Invariants you must preserve
Chunking must remain stable
If split_to_paragraph_chunks() changes behavior, paragraph_index will drift and historical citations will break.
chunk_id must be bound to content
A recommended strategy is: hash(path + paragraph_index + content_hash). This guarantees: once content changes, the ID changes—preventing “citation hallucinations.”
Meta and index order must align exactly
The vector order in FAISS must match the line order in meta.jsonl one-to-one. If they diverge, the entire index becomes invalid.
Script B: query_vault — Query and generate evidence packs
Technical choices and underlying principle
Input: a natural-language query
Read: _system/artifacts/vault_index/{meta.jsonl, index.faiss}
Output:
_system/artifacts/packs/citations/<run_id>.json
_system/artifacts/runs/<run_id>.md
Core workflow:
Load metadata + FAISS index
Encode the query using the same model and the same normalization
Run a broad retrieval first (raw_k is typically 10× topk or larger) to preserve headroom for post-filtering
Post-filter results:
scope_prefix: constrain the retrieval space
exclude_temporal_note: remove content that should not participate in governance
per_note_cap: prevent a single note from dominating the results
Read snippets back from source text (re-chunk by paragraph_index)
Apply lightweight ranking adjustments (docs_boost / prefix boost)
The key value here is:
The retrieval output is not a UI search result—it is a citable, persisted, auditable evidence object.
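A matching sketch of the query side, under the same assumptions (snippet re-reading, boosts, and the Markdown run note are omitted for brevity):

import json, pathlib, time
import faiss
from sentence_transformers import SentenceTransformer

IDX = pathlib.Path("_system/artifacts/vault_index")
PACKS = pathlib.Path("_system/artifacts/packs/citations")

def query_vault(query: str, topk: int = 12, per_note_cap: int = 2) -> dict:
    metas = [json.loads(l) for l in (IDX / "meta.jsonl").read_text(encoding="utf-8").splitlines()]
    index = faiss.read_index(str(IDX / "index.faiss"))
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # must match the index model
    q = model.encode([query], normalize_embeddings=True)
    raw_k = min(max(topk * 10, 100), len(metas))  # broad retrieval: headroom for post-filtering
    scores, ids = index.search(q, raw_k)
    results, per_note = [], {}
    for score, i in zip(scores[0], ids[0]):
        m = metas[int(i)]
        n = per_note.get(m["note_path"], 0)
        if n >= per_note_cap:
            continue  # prevent a single note from dominating the results
        per_note[m["note_path"]] = n + 1
        results.append({**m, "score": float(score)})
        if len(results) == topk:
            break
    run_id = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime()) + "_query_vault"
    pack = {"run_id": run_id, "query": query,
            "model": "sentence-transformers/all-MiniLM-L6-v2", "results": results}
    PACKS.mkdir(parents=True, exist_ok=True)  # all outputs land in the Derived root
    (PACKS / f"{run_id}.json").write_text(json.dumps(pack, indent=2), encoding="utf-8")
    return pack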
Invariants you must preserve
Index model and query model must match
If embedding spaces differ, the scores become meaningless.
Snippets are display-only
The true identity of evidence always comes from metadata (chunk_id / hash). The snippet is only for human readability.
All outputs must land in the Derived root
Strictly constrain writes under _system/artifacts/ and bind them to a run_id. This is what makes the process replayable and prevents contamination of the Truth/Doc layers.
Minimal reproduction checklist (for engineers)
Define your vault input scope (1–2 directories are enough)
Implement a stable chunker (paragraph-level is the best starting point)
Define a clear chunk_id strategy
Build the index and metadata with consistent ordering
Make query output return evidence objects, not just file paths
Add governance-friendly filtering (per-note cap, scope)
Normal search: find notes → human judgment
This system: retrieve chunk-level evidence → generate evidence packs → enable audit / citation / LLM analysis → feed back into real development
Indexing is only the entry point. The real product is:
citable evidence units, and evidence packs that can be repeatedly reused and re-checked.
Write a wrapper: attach the Vault directly to a real project repo — develop while using it, and feed it back while developing
When you are writing code, you should be able to query—at any time—the system principles, governance rules, historical constraints, and failure boundaries that you have already written. At the same time, new problems you encounter during development, new decisions you make, and new evidence you generate should flow back and settle into the vault. Both sides grow together; both sides prevent conceptual drift. The project won’t “forget the constitution” just because you’re busy, and the constitution won’t become empty theory because it’s detached from real work.
Right now, I treat the Vault as an “external evidence substrate” for the project
I treat sovereign_knowledge as an independent, long-lived knowledge/governance repository, and then in the real project repo I add two extremely thin scripts:
skq: throw the development question directly into the vault’s query_vault.py
ski: after you update vault content, rebuild the semantic index (index_vault.py)
The real value of these two scripts is not “saving a few keystrokes.” It is that they turn retrieval into a development action.
While coding, you can continuously pull evidence chunks from your governance base, generate a persistable evidence pack, and then hand that pack to an LLM (or read it yourself) to prevent conceptual drift.
A reproduction guide for engineers reading this
Clone your knowledge repo to any path (for example ~/sovereign_knowledge)
In your project repo, add two scripts: scripts/skq (query) and scripts/ski (index)
Make them executable: chmod +x scripts/skq scripts/ski
Configure environment variables once (optional): export SK_VAULT=~/sovereign_knowledge and export PYTHON=python3 (or your venv python)
Day-to-day usage:
After updating the vault: scripts/ski
When you hit uncertainty during coding: scripts/skq "run session trace ordering" --scope_prefix docs/
That completes the “develop while using, develop while feeding back” loop. You don’t need to move the vault code into your project, and you don’t need to maintain an indexing pipeline inside your project repo. You only need to treat the vault as an external governance substrate, and query it whenever you need.
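For concreteness, here is what one of the wrappers can look like, kept deliberately thin. I am assuming query_vault.py lives under tools/ in the vault repo; adjust to your layout. ski is identical with index_vault.py swapped in.

#!/usr/bin/env python3
# scripts/skq -- forward a development question to the vault's query engine.
# Assumption: query_vault.py lives under tools/ in the vault repo.
import os, subprocess, sys

vault = os.path.expanduser(os.environ.get("SK_VAULT", "~/sovereign_knowledge"))
py = os.environ.get("PYTHON", "python3")
# Run from inside the vault so all Derived artifacts (packs, run notes) land there.
sys.exit(subprocess.call([py, os.path.join("tools", "query_vault.py"), *sys.argv[1:]], cwd=vault))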
One crucial point: why I think “external repo + wrapper” is closer to real engineering than plugins
Because it turns retrieval output into stable artifacts (packs / run notes), rather than transient UI results. And because it extracts “knowledge governance” from the note-taking tool layer and turns it into an engineering object that can enter CI, auditing, and code review. That is why a “second brain” built this way can grow over the long run: it doesn’t depend on what you can remember; it depends on a toolchain continuously producing traceable evidence.
A senior engineer with years of experience once told me: in the LLM era, you need as much text as code. But you can’t just dump every project into a single text repository.
In the end, it is the LLM’s strength at decomposing problems, designing queries, and reading sparse fragments out of evidence chunks to produce suggestions that makes this work. At this stage, that is what lets this approach reduce anxiety even while you are using LLMs intensively.