New Google Patent Gives Insights into Ranking Algorithm

Yesterday (March 31st 2005), Google was awarded a patent for a system of identifying and scoring documents in relation to historical data. The patent is quite complicated, but here are my initial and personal thoughts on what it entails.

The patent illustrates that Google no longer relies simply on its PageRank method (also patented) of scoring pages. Now, and increasingly in the future, it will assign scores to websites and individual web pages by analysing historical data, accumulated since creation, that is associated with the page and site and with the pages and sites that link to them.

Google's specific definition of document is "any machine-readable and machine-storable work product". It gives examples such as newsgroup postings, web advertisements -- as well as, interestingly, emails and files.

In the context of its web search engine, however, the term document usually refers simply to a web page or website.

Within the framework of this patent, a document's score can be affected (positively or negatively -- Google does not always specify which) by any or all of the following:

Factors Affecting Document (Website/Web Page) Score

  • Frequency of document change, e.g. how often a web page or site is updated. Google specifically states that "updated" documents (how regularly, it doesn't say) are given a boost in score.
  • The magnitude ("amount") of change to a document over time. This applies to both changes within individual pages (e.g. updating of content on a Homepage), and changes to the overall document (e.g. pages added to a site) over time.
    • Note: The amount-of-change score is further affected by the perceived importance of the sections that change. For example (and this is speculative), changes to a Homepage may be regarded by Google as more -- or less -- significant than changes to a Contact Details page.
  • The manner in which the content of a document changes over time. To give an example that, again, is purely speculative: the content of a website may change if the domain name has been bought out (and the domain registration details have changed accordingly). In this case, Google might deem the manner of change more significant (with negative consequences to its score) than a simple content update.
  • How often the document is selected when the document is included in a set of search results. For example, if a web page is number 10 in the results for query x, but is usually selected more often than the first nine results, the web page is likely to gain a higher score (and subsequently move to a higher position in the results).
    • Note: Google in this instance states clearly that frequent document selection results in a higher score, as we would expect.
  • Whether the document history is associated with frequent search queries. By "is associated with", I assume that Google mainly means that a page includes the words used in a popular search query, and/or that they appear in the anchor text of inbound links to the page or site.
    • Note: "frequent search queries" refers to what are known in the SEO world as "money keyphrases". These typically (but not by definition) return a higher number of results (several million or more).
  • Whether documents are outdated or "stale". Staleness is determined, "at least in part," by users not selecting documents as much as other results alongside which they appear, for a given query. Google specifically states that stale documents are penalized.
  • Age of links and associated documents, i.e. the dates on which the links were first created, and the age of the documents on which they were created. Google specifies that the system aims to penalize a document's ranking if the links and their associated documents are short-lived, and vice-versa.
  • Freshness of links. Link scores are weighted according to their "freshness". Freshness is determined by the dates of any changes to the links themselves (particularly the anchor text), and of the documents that contain them.
  • "Authority" and trustworthiness of the document containing the link. For some time now, Google-watchers have known that it is better to get links from reputable, quality sites, particularly those known in the SEO world as "hubs, authorities and expert" sites (e.g. university sites).
  • Differences in documents and anchor text associated with links. The anchor text of a link is still important, but so is the relevancy of the anchor text to the linked site; the relevancy of the anchor text to the linking site; and the difference in anchor text among inbound links. (SEO practitioners have known since Florida that these patterns should look "natural".)
  • Historical information about the "behaviour" of (inbound) links to documents. Google explains that its system tries to determine "whether there is a trend toward appearance of new links ... versus disappearance of existing links". Presumably, the former is rewarded while the latter is not. Google specifically mentions penalizing a document's ranking if the "link churn" (rate of change of inbound links) is above a certain threshold.
  • Characteristics and changes to visitor traffic patterns in relation to the documents. By "characteristics" of traffic, Google may mean the web source and/or the geographical source of the traffic, as well as more granular features, or demographic features. Note: How Google obtains site traffic information is open to debate, but here are some suggestions:
    • By analysing traffic that passes through its search engine results;
    • By analysing Google Toolbar data;
    • In the future, by providing visitor analytics -- see Google's recent acquisition of Urchin statistics.
  • Changes in the visitor traffic patterns over time. Was the site once popular but not now? Or vice-versa? These factors could help determine the relevancy of a site.
  • User behaviour relating to the document, such as (Google's example) how much time they spend looking at a website or page.
  • Historical domain-related information, including:
    • The "legitimacy" of the domain
    • The expiration date of the domain
    • The domain server/name server records (presumably to check for servers known to be associated with spam)
  • Prior ranking history of the document, i.e. past performance in Google.
  • The rate at which the document moves in the ranking history, presumably to flag a document that rises unusually quickly -- 'spikes' -- in search results.
  • The rate at which the document is selected as a search result over time. Again, this has to do with determining a page or site's current popularity with searchers.
  • User maintained or generated data. Google says such data includes favourites lists, bookmarks, temporary (internet) files and cache files. However, it is unclear as yet how Google would access this information. Google specifically mentions the identification of trends where users attempt to add or remove the document from their own generated or maintained data.
  • Information relating to document topic(s), as extracted from the document, and changes to document topic over time. It is well known that Google's programs attempt to surmise the theme of any given document -- this is how it matches relevant ads to third-party web pages.
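
To make two of these factors more concrete, here is a minimal sketch in Python of how selection rate and "link churn" might be measured. It is purely my own illustration and not Google's method: the patent gives no formulas, so every name here (Result, selection_rate, rerank_by_selection, link_churn, churn_penalty) and every number (the 0.5 threshold, the 0.8 penalty factor) is a hypothetical assumption.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One search result for a single query (hypothetical structure)."""
    url: str
    position: int      # current rank for the query (1 = top)
    impressions: int   # times the result was shown for the query
    selections: int    # times searchers picked it


def selection_rate(r: Result) -> float:
    """Fraction of impressions that led to a selection."""
    return r.selections / r.impressions if r.impressions else 0.0


def rerank_by_selection(results: list[Result]) -> list[Result]:
    """Order by selection rate (highest first), ties broken by current position."""
    return sorted(results, key=lambda r: (-selection_rate(r), r.position))


def link_churn(links_before: set[str], links_after: set[str]) -> float:
    """Share of inbound links that appeared or disappeared between two snapshots."""
    union = links_before | links_after
    if not union:
        return 0.0
    changed = links_before ^ links_after  # links present in only one snapshot
    return len(changed) / len(union)


def churn_penalty(churn: float, threshold: float = 0.5, factor: float = 0.8) -> float:
    """Score multiplier: 1.0 at or below the assumed threshold, `factor` above it."""
    return factor if churn > threshold else 1.0


if __name__ == "__main__":
    results = [
        Result("a.example", 1, 1000, 200),
        Result("b.example", 2, 1000, 100),
        Result("j.example", 10, 1000, 300),
    ]
    # The page at position 10 is selected most often, so it moves to the top.
    print([r.url for r in rerank_by_selection(results)])

    churn = link_churn({"x", "y", "z"}, {"z", "w"})
    print(churn, churn_penalty(churn))
```

In this toy data, the page sitting at position 10 is selected more often than the pages above it, so it is sorted to the top, and a document that lost two of its three inbound links and gained one new link lands above the assumed churn threshold. A real system would presumably blend signals like these with many others rather than sorting on one alone.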

I have not yet fully dissected Google's own examples and interpretations that follow the patent description, and my thoughts are likely to change on so doing.

However, it is clear from reading this patent that Google has (as we suspected) moved far away from its once cutting-edge PageRank algorithm, and is evolving to an algorithm (itself a set of algorithms) that is much more sophisticated.

This explains recent phenomena such as the Google Sandbox effect and the notorious Florida update, when much of the above, I suspect, was first implemented. (Note that the patent was originally filed on Dec 31, 2003 -- a month after the Florida update, when the Sandbox effect also first took hold.)

I believe that these changes are geared much more towards countering the effect of search engine spammers, by deterring "instant gratification" on the SERPs. The convenient side benefit for Google, of course, is that new sites that find it difficult to get listed are likely to spend money on Google Adwords.

However, the evolution of search engines is inevitable, and the pace of such evolution is impressive. The symbiotic relationship between Google and search engine optimisers remains. While the changes here are clearly designed to benefit aged websites that are regularly updated and have gained their links "naturally", the complexity of the algorithm is such that businesses are more likely to seek the services of skilled optimisers, rather than less.

The game continues.

Comments

4 comments

Jay Devine / April 1, 2005 6:27 PM

Michael, thanks for making the contents of the patent so simple that they make sense to someone like me who is not really an IT guru but would love to see a higher ranking for my blog. I still wonder if I should simply keep creating new content without worrying about all these changes or should I really try to get into the technical details.

Michael Heraghty / April 4, 2005 9:44 AM

Hi Jay,

I'm glad it was useful for you.

I think the main message to take from Google's patent is to build good sites with relevant content, to maintain and update them regularly, and to try to acquire relevant links steadily and genuinely.

Seun Osewa / May 7, 2005 3:04 PM

... and to watch out for cases where your innocent actions while updating and maintaining your website could cause Google to (wrongly) penalize you.

Nancy McCord / June 1, 2005 10:24 PM

Great article. One thing that I wondered when I read the patent information was "How are they getting some of this information? Isn't some of it really personal? Stored on your hard disk--like cookies and cached pages?"

Then I thought about the Google Toolbar: if you enable the PageRank display, you are sending info to Google on your surfing habits. Then I thought, "oh yeah, I have a Gmail account", so Google is scanning my email for terms and content. I also remembered that they just introduced (and then pulled temporarily) the web accelerator for broadband users. It cached pages that you visited on their servers. Hmm, did they promise not to peek at them? :o)

We wonder where some of the information that Google has, and is planning to use big-time in the patent, is coming from. Well, the reality is that we are giving it to them, either without really thinking or without really caring.

Don't forget Google is a money-making machine. They use the information on a massive scale to create the ad-generating money machine that targets AdWords ads to our own search queries.

Hmm, I keep forgetting that I am helping Google to know what to serve me and how to make money from me by using their products or being careless about my privacy.

Definitely worth thinking about.

About

Mediajunk is Michael Heraghty's blog, with articles on web design, usability, online marketing, digital innovation, etc.