How Early Google PageRank Worked (quick refresher)
PageRank was built on the idea that links = votes. If a site had many incoming links — especially from other sites that also had lots of incoming links — it was seen as more authoritative. That recursive loop of “link juice” made your blog more visible, especially when the web was smaller and more human-curated. Then came:
- Ads (AdWords, AdSense)
- SEO tricks and keyword farming
- Google favoring its own services (YouTube, Maps, Shopping, etc.)
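The "links = votes" loop above can be sketched in a few lines. This is a toy power-iteration PageRank, not Google's production algorithm: the damping factor 0.85 comes from the original paper, and the three-page graph is made up for illustration.

```python
# Toy PageRank via power iteration: a sketch of the "links = votes" idea.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal rank
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Each page passes its rank ("link juice") to pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page with no outgoing links: spread its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# "hub" gets links from both other pages, so it ends up ranked highest.
graph = {"blog": ["hub"], "news": ["hub", "blog"], "hub": ["news"]}
print(pagerank(graph))
```

Notice the recursion in action: a link from "news" is worth more once "news" itself accumulates links, which is exactly the loop that made early link-building so powerful.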
LLMs (like ChatGPT & Perplexity): “Just Give Me the Answer”
These tools work very differently from search engines. Instead of linking out to the web, they synthesize an answer right in the chat: a distilled response generated by a model trained on massive text datasets.
So, where do LLM sources come from?
There are two major pathways, depending on the tool:
1. Trained Sources (Pre-2023/2024 data). For models like ChatGPT (GPT-4) when it’s not browsing:
- It’s trained on a mixture of licensed, publicly available, and scraped text.
- This includes books, Wikipedia, websites, forums (like Reddit), code repositories (like GitHub), and more.
- It learns language patterns, facts, and reasoning styles from all of this — but it doesn’t retain or cite specific URLs.
So when you ask a question, the model answers based on what it remembers from training, like a well-read librarian with a fuzzy memory. It can’t point to a source unless it’s operating in a web-access mode.
2. Live Sourcing (Perplexity, Bing, ChatGPT with browsing). When web access is enabled:
- The model issues a real-time search query in the background (like Google or Bing would).
- It quickly scans the top-ranked results (often the first page or two).
- It uses NLP to summarize or extract relevant content.
- Those links that appear as citations? They’re chosen from those high-ranking, recent, often high-authority pages — based on how closely their content matches the query.
- There may be some filtering for recency, source reputation, and coherence with the rest of the retrieved content.
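The steps above can be sketched as a tiny pipeline. This is a minimal sketch under stated assumptions: `search_web()` is a hypothetical stand-in for a real search API (it returns canned results here), and `relevance()` uses crude word overlap plus a recency bonus in place of the real semantic and reputation filters a production system would apply.

```python
from dataclasses import dataclass

@dataclass
class Result:
    url: str
    text: str
    year: int

def search_web(query):
    # Hypothetical stand-in: a real system would issue a live Bing/Google query.
    return [
        Result("https://example.com/pagerank", "pagerank counts links as votes", 2019),
        Result("https://example.com/llms", "llms synthesize answers from retrieved pages", 2024),
        Result("https://example.com/pasta", "cooking pasta in ten minutes", 2023),
    ]

def relevance(query, result):
    q, r = set(query.lower().split()), set(result.text.lower().split())
    overlap = len(q & r) / len(q)                   # how closely the page matches the query
    recency = 0.1 if result.year >= 2023 else 0.0   # mild preference for recent pages
    return overlap + recency

def answer_with_citations(query, top_k=2, min_score=0.2):
    results = search_web(query)
    scored = sorted(((relevance(query, r), r) for r in results),
                    key=lambda sr: sr[0], reverse=True)
    cited = [r for score, r in scored[:top_k] if score >= min_score]
    summary = " / ".join(r.text for r in cited)     # stand-in for NLP summarization
    return summary, [r.url for r in cited]

print(answer_with_citations("how do llms synthesize answers"))
```

The `min_score` cutoff is why some retrieved pages never show up as citations: they were scanned, scored, and quietly dropped.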
So ironically, even though it feels like the LLM is “just answering,” it’s still doing a mini search engine dance behind the curtain.
Do people actually look at the sources?
Most users don’t. They just take the synthesized answer and move on. But when people do click them, it’s usually to:
- double-check the claim
- chase the rabbit deeper
- cite it themselves
So what does this mean going forward?
- Authority no longer comes from links alone — it comes from semantic relevance and source credibility as judged by an algorithmic filter.
- Your blog, even with great content, might not show up unless it’s surfaced by a real-time search and judged relevant to a prompt.
- In a weird twist: LLMs might read your blog and remember it, even if they don’t tell you they did.
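That "semantic relevance as judged by an algorithmic filter" usually comes down to scoring a prompt against candidate pages with cosine similarity. Production systems use learned embeddings; the plain bag-of-words vectors below are an assumption made to keep the example self-contained, but the ranking logic is the same shape.

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words stand-in for a learned embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

prompt = vectorize("how did early google pagerank rank pages")
on_topic = vectorize("pagerank treats incoming links as votes to rank pages")
off_topic = vectorize("my favorite hiking trails this summer")

# The on-topic post scores higher, so it is the one that gets surfaced.
print(cosine(prompt, on_topic), cosine(prompt, off_topic))
```

Note what's absent: no link counting anywhere. A well-linked but off-topic page scores zero, which is the shift from PageRank-style authority to prompt-relative relevance.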
Bonus: Why LLMs feel so much better than search
Because instead of giving you 10 partial answers, they try to give you one synthesized, thoughtful reply. But that also means we’re putting a lot of trust in the model’s judgment about which sources to consult, which makes transparency, source citation, and user control even more important.