Architecture & Methodology

Language Lineage is generated by a five-phase data pipeline that transforms unstructured web data into a typed JSON dataset.

Phase 1: Source Discovery (Automated)

The process begins by identifying high-value programming languages, compilers, and runtimes across the web. Using Wikipedia and Wikidata APIs, we scan categories like "Programming languages created in 2020" to discover new entities.
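A category scan of this kind could be expressed as a MediaWiki API request. The sketch below only builds the request URL; the host, the `categoryMembersUrl` helper name, and the default limit are assumptions, since the text says only that Wikipedia and Wikidata APIs are used.

```typescript
// Hypothetical helper: builds a MediaWiki "categorymembers" query URL.
// The English Wikipedia host is an assumption; the text names no endpoint.
function categoryMembersUrl(category: string, limit = 500): string {
  const params = new URLSearchParams({
    action: "query",
    list: "categorymembers",
    cmtitle: `Category:${category}`,
    cmlimit: String(limit),
    format: "json",
  });
  return `https://en.wikipedia.org/w/api.php?${params.toString()}`;
}
```

Calling `categoryMembersUrl("Programming languages created in 2020")` yields a URL whose JSON response would list the pages in that category, each a candidate entity for the next step.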

Each entity is assigned a unique, canonical identifier, such as lang:rust or tool:v8, which acts as the primary key throughout the pipeline.
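To illustrate the identifier scheme, here is a minimal sketch of parsing and building such IDs. Only the lang: and tool: prefixes appear in the text; the slug rules and the `parseEntityId` and `makeEntityId` names are assumptions made for illustration.

```typescript
// Only "lang" and "tool" are attested in the text; other kinds may exist.
type EntityKind = "lang" | "tool";

interface EntityId {
  kind: EntityKind;
  slug: string;
}

// Assumed slug rule: lowercase alphanumerics plus a few punctuation characters.
const SLUG_PATTERN = /^[a-z0-9][a-z0-9+#._-]*$/;

// Split "<kind>:<slug>" and validate both halves; return null if malformed.
function parseEntityId(raw: string): EntityId | null {
  const colon = raw.indexOf(":");
  if (colon < 0) return null;
  const kind = raw.slice(0, colon);
  const slug = raw.slice(colon + 1);
  if (kind !== "lang" && kind !== "tool") return null;
  if (!SLUG_PATTERN.test(slug)) return null;
  return { kind, slug };
}

// Build an ID from a display name: lowercase it and turn whitespace runs into hyphens.
function makeEntityId(kind: EntityKind, name: string): string {
  const slug = name.trim().toLowerCase().replace(/\s+/g, "-");
  return `${kind}:${slug}`;
}
```

Keeping the kind inside the key means a language and a tool with the same name cannot collide.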

Phase 2: Data Harvesting (Automated)

Custom TypeScript crawlers (e.g., harvestWikipediaContent.ts) are deployed to scrape the identified sources. They download thousands of pages of unstructured text, historical release notes, and infobox tables.

This phase caches raw HTML and Markdown locally, which keeps results reproducible and avoids redundant network requests on subsequent pipeline runs.
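The caching behavior can be sketched as a cache-first fetch. This is an illustration under stated assumptions, not the code in harvestWikipediaContent.ts: the `fetchCached` and `cachePathFor` names, the SHA-256 file naming, and the injected `Downloader` function are all hypothetical.

```typescript
import * as crypto from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical signature: any function that resolves a URL to its raw body.
type Downloader = (url: string) => Promise<string>;

// Derive a stable file name from the URL so repeated runs hit the same entry.
function cachePathFor(cacheDir: string, url: string): string {
  const digest = crypto.createHash("sha256").update(url).digest("hex");
  return path.join(cacheDir, `${digest}.raw`);
}

// Return the cached body if present; otherwise download, store, and return it.
async function fetchCached(
  cacheDir: string,
  url: string,
  download: Downloader,
): Promise<string> {
  const file = cachePathFor(cacheDir, url);
  if (fs.existsSync(file)) {
    return fs.readFileSync(file, "utf8");
  }
  const body = await download(url);
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(file, body, "utf8");
  return body;
}
```

Because the cache key depends only on the URL, a second run over the same sources reads every page from disk and feeds identical input to the later phases.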

Phase 3: LLM Structuring (AI-Assisted)

The raw text is fed into large language models (LLMs) using prompts constrained by a strict JSON schema. The models are tasked with extracting specific, strongly typed fields such as first_release_year, paradigms, and influenced_by.
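A typed check on the model output might look like the sketch below. The three field names come from the text; their exact types (an integer year, arrays of strings) and the `parseExtraction` helper are assumptions.

```typescript
// Field names are from the text; the types assigned to them are assumed.
interface ExtractedRecord {
  first_release_year: number;
  paradigms: string[];
  influenced_by: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parse raw model output; return a typed record, or null if it does not conform.
function parseExtraction(rawJson: string): ExtractedRecord | null {
  let data: unknown;
  try {
    data = JSON.parse(rawJson);
  } catch {
    return null; // not valid JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;
  const year = obj["first_release_year"];
  const paradigms = obj["paradigms"];
  const influencedBy = obj["influenced_by"];
  if (typeof year !== "number" || !Number.isInteger(year)) return null;
  if (!isStringArray(paradigms) || !isStringArray(influencedBy)) return null;
  return { first_release_year: year, paradigms, influenced_by: influencedBy };
}
```

Rejecting malformed output at this boundary means a response with a string year or a missing field never reaches the review step as if it were valid data.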

Crucially, the LLMs parse complex technical paragraphs to identify implementation relationships, for example that the original Rust compiler was written in OCaml before being replaced by the self-hosted rustc.

Phase 4: Human Verification (Manual Curation)

Language models hallucinate confidently. Before any data enters the canonical dataset, it must pass a strict human review in which researchers manually inspect the diffs for accuracy.

Human curators resolve dependency cycles (such as self-hosting bootstrap chains), verify disputed historical dates against primary sources (such as mailing list archives), and manually add relationships that the automated scrapers missed.
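Curators need to know where the cycles are before they can resolve them. One way to surface them is a depth-first search over the dependency graph, sketched below; the `DependencyGraph` shape and the `findCycle` name are assumptions, and the resolution itself remains a manual decision.

```typescript
// Hypothetical shape: each ID maps to the IDs it depends on.
type DependencyGraph = Map<string, string[]>;

// Depth-first search that tracks the current path; returns one cycle
// as a list of IDs (first and last equal), or null if none is found.
function findCycle(graph: DependencyGraph): string[] | null {
  const finished = new Set<string>();
  const pathStack: string[] = [];
  const onPath = new Set<string>();

  function visit(node: string): string[] | null {
    if (onPath.has(node)) {
      // Close the loop: take the path from the first occurrence of node.
      return [...pathStack.slice(pathStack.indexOf(node)), node];
    }
    if (finished.has(node)) return null;
    pathStack.push(node);
    onPath.add(node);
    for (const next of graph.get(node) ?? []) {
      const cycle = visit(next);
      if (cycle !== null) return cycle;
    }
    pathStack.pop();
    onPath.delete(node);
    finished.add(node);
    return null;
  }

  for (const node of Array.from(graph.keys())) {
    const cycle = visit(node);
    if (cycle !== null) return cycle;
  }
  return null;
}
```

A self-hosted compiler shows up here as a one-node cycle, which a curator can then break by recording the earlier bootstrap implementation as a separate edge.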

Phase 5: Static Generation (Automated)

Once the dataset (lineage_v5.json) is finalized and locked, a suite of TypeScript build scripts takes over. They parse the graph to compute PageRank metrics, transitive implementation chains, and topological layers.
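Of the three computations, topological layering is the simplest to show. The sketch below is one possible formulation, assuming a `LineageGraph` map from each ID to its dependencies and assuming cycles were already removed during review; the names and the layering rule are illustrative, not taken from the build scripts.

```typescript
// Hypothetical shape: each ID maps to the IDs it depends on.
type LineageGraph = Map<string, string[]>;

// Layer 0 holds nodes with no dependencies; every other node sits one layer
// above its deepest dependency. Throws if a cycle remains in the graph.
function topologicalLayers(graph: LineageGraph): Map<string, number> {
  const layers = new Map<string, number>();
  const inProgress = new Set<string>();

  function layerOf(node: string): number {
    const known = layers.get(node);
    if (known !== undefined) return known;
    if (inProgress.has(node)) throw new Error(`cycle detected at ${node}`);
    inProgress.add(node);
    let layer = 0;
    for (const dep of graph.get(node) ?? []) {
      layer = Math.max(layer, layerOf(dep) + 1);
    }
    inProgress.delete(node);
    layers.set(node, layer); // memoize so shared dependencies are computed once
    return layer;
  }

  Array.from(graph.keys()).forEach((node) => layerOf(node));
  return layers;
}
```

The layer of a node equals the length of its longest dependency chain, so the same traversal also bounds the transitive implementation chains mentioned above.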

The build system then generates over 300 statically rendered HTML pages, SVG timeline cards, and the JSON-LD schema markup used for search engine discovery.
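As an illustration of the JSON-LD step, the sketch below renders one entry as an embeddable script tag. The `LanguageEntry` shape and the `jsonLdScript` name are hypothetical, and the use of schema.org's ComputerLanguage type is an assumption; the text does not say which vocabulary the pages use.

```typescript
// Hypothetical input shape; the real dataset fields are not shown in the text.
interface LanguageEntry {
  name: string;
  url: string;
  description: string;
}

// Build a JSON-LD <script> tag. The ComputerLanguage type is an assumed choice.
function jsonLdScript(entry: LanguageEntry): string {
  const doc = {
    "@context": "https://schema.org",
    "@type": "ComputerLanguage",
    name: entry.name,
    url: entry.url,
    description: entry.description,
  };
  // Escape "<" so a value containing "</script>" cannot end the tag early.
  const json = JSON.stringify(doc).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```

Escaping at serialization time matters because descriptions originate from scraped text and may contain arbitrary markup.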