Archives for "Search Engines: 2005"

Wikipedia to Challenge Google?

A post on the Socialtext blog claims that Google's dominance as the web's number one jumping-off point may be threatened not by another search engine, but by a wiki -- Wikipedia, no less.

Over time, Wikipedia has been slowly eating the entire Web's knowledge base until it becomes itself a faster, better, and--most critically--unspammed reference matter of what are the relevant and valuable resources. And unlike mere link directories, it doesn't simply list links, it tells a story about them.

Incidentally, a lot of search-engine watchers have noted that Wikipedia is dominating the SERPs. One says:

Personally I love it when I enter a query and a wikipedia entry appears in the results - I know, or at least believe, I have at least one good result. Which is precisely the opposite to how I feel when an About result appears, and both are often there together. In fact I often append the word 'wikipedia' to the end of a search if the original results don't look too promising.

Is this the sound of rumbling in the undergrowth? Probably not, but I continually remind myself that the Internet is still growing, and its change often comes from an unanticipated source.

Horoscopes Show That Google's Losing Focus

I can't believe what I see today -- you can sign in for a "personalised" version of the Google homepage, that has weather forecasts (yes weather forecasts!) and news headlines.

Google's Our Philosophy page used to say, "Google does not do horoscopes, financial advice or chat". This statement was recently removed.

I have hitherto said that Google "gets" the web; now I believe that it may be losing focus.

Google Blog Search

Google now has a blog search engine, devoted entirely to weblogs.

As with Google News, entries that are more recent are more likely to come up top.

Does this mean search results relating to blog entries will eventually be removed from the main index? That idea has certainly been touted in the past.

Google PageRank Goes Missing

The PageRank indicator on the Google toolbar has been grayed out, for all sites, since late on May 27th. Is this the end of a significant phase in Google's -- and the web's -- evolution? Or are they just working on it? Rumours abound...

Why Google Gets It

My own belief in the internet as a revolutionary technology has been strengthened in the years since the dotcom crash. I believe that those early internet startups didn’t “get it”. The internet is not simply a new channel for squeezing money out of consumers; it is not another medium in which to advertise; and it is certainly not a virtual shopping mall.

I can relate to Google’s founders. They clearly believed, right from the beginning, that the internet offered the potential for radical social and cultural change. Thus their corporate mission soon became “to organise and make available the world’s information”. To change the world. Sound daft? Well, Google is changing the world, albeit in quiet, barely perceptible ways – from how drivers get directions to how kids do their homework.

The Google way of understanding the internet is reflected in its corporate culture. Before undertaking any new project, Google employees ask “how will it benefit the user?” From high-storage email to web accelerator utilities, users are always at the centre of Google’s thinking.

And so they should. A concern for users drives innovation. Had the primary concern of those who built the first steam locomotives been how many advertising messages they could emblazon on the side, the industrial revolution would never have happened.

Google's strong focus on users is helping it to accelerate past all competitors, as its product suite grows in both breadth and innovation. Recent Google innovations include a satellite image version of Google Maps, a new service whereby users will be able to upload, store and sell large video clips, and a web accelerator for speeding up your download times.

Recent acquisitions include the hugely impressive image management program, Picasa (now offered as a free download), and the best web analytics package I’ve personally ever used, Urchin.

Of course, it may not seem good business sense to offer innovations for free – but these are new technologies, that involve changing user behaviours. And new user behaviours inevitably open up new business opportunities. Google’s Adwords wouldn’t be possible, for example, if Google’s search engine hadn’t been so useful, and become so popular. Adwords wasn’t something the founders envisaged when they started the company; now it is Google’s main source of revenue.

The revolution is happening. Google gets it. I try to.

New Google Patent Gives Insights into Ranking Algorithm

Yesterday (March 31st 2005), Google was awarded a patent for a system of identifying and scoring documents in relation to historical data. The patent is quite complicated, but here are my initial and personal thoughts on what it entails.

The patent illustrates that Google no longer relies simply on its PageRank method (also patented) of scoring pages but now, and increasingly in the future, will assign scores to websites and individual web pages by analysing various historical data associated with the page and site, and with the pages and sites that link to it, since its creation.

Google's specific definition of document is "any machine-readable and machine-storable work product". It gives examples such as newsgroup postings, web advertisements -- as well as, interestingly, emails and files.

In the context of its web search engine, however, the term document usually refers simply to a web page or website.

Within the framework of this patent, a document's score can be affected (positively or negatively -- Google does not always specify which) by any or all of the following:

Factors Affecting Document (Website/ Web Page) Score

  • Frequency of document change, e.g. how often a web page or site is updated. Google specifically states that "updated" documents (how regularly, it doesn't say) are given a boost in score.
  • The magnitude ("amount") of change to a document over time. This applies to both changes within individual pages (e.g. updating of content on a Homepage), and changes to the overall document (e.g. pages added to a site) over time.
    • Note: The amount of change over time score is further affected by the perceived importance of the sections that change. For (a speculative) example, changes to a Homepage may be regarded by Google as more -- or less -- significant than changes to a Contact Details page.
  • The manner in which the content of a document changes over time. To give an example that, again, is purely speculative: the content of a website may change if the domain name has been bought out (and the domain registration details have changed accordingly). In this case, Google might deem the manner of change more significant (with negative consequences to its score) than a simple content update.
  • How often the document is selected when the document is included in a set of search results. E.g. If a web page is number 10 in the results for query x, but is usually selected more than the first nine results, the web page is likely to gain a higher score (and subsequently move to a higher position in the results).
    • Note: Google in this instance states clearly that frequent document selection results in a higher score, as we would expect.
  • Whether the document history is associated with frequent search queries. By "is associated with", I assume that Google mainly means that a page includes the words used in a popular search query, and/or they appear in the anchor text of inbound links to the page or site.
    • Note: "frequent search queries" refers to what are known in the SEO world as "money keyphrases". These typically (but not by definition) return a higher number of results (several million or more).
  • Whether documents are outdated or "stale". Staleness is determined, "at least in part," by users not selecting documents as much as other results alongside which they appear, for a given query. Google specifically states that stale documents are penalized.
  • Age of links and associated documents, i.e. the dates on which the links were first created, and the age of the documents on which they were created. Google specifies that the system aims to penalize a document's ranking if the links and their associated documents are short-lived, and vice-versa.
  • Freshness of links. Link scores are weighted according to their "freshness". Freshness is determined by the dates of any changes to the links themselves (particularly the anchor text), and of the documents that contain them.
  • "Authority" and trustworthiness of the document containing the link. For some time now, Google-watchers have known that it is better to get links from reputable, quality sites, particularly those known in the SEO world as "hubs, authorities and expert" sites (e.g. university sites).
  • Differences in documents and anchor text associated with links. The anchor text of a link is still important, but so is the relevancy of the anchor text to the linked site; the relevancy of the anchor text to the linking site; and the difference in anchor text among inbound links. (SEO practitioners have known since Florida that these patterns should look "natural".)
  • Historical information about the "behaviour" of (inbound) links to documents. Google explains that its system tries to determine "whether there is a trend toward appearance of new links ... versus disappearance of existing links". Presumably, the former is rewarded while the latter is not. Google specifically mentions penalizing a document's ranking if the "link churn" (rate of change of inbound links) is above a certain threshold.
  • Characteristics and changes to visitor traffic patterns in relation to the documents. By "characteristics" of traffic, Google may mean the web source and/or the geographical source of the traffic, as well as more granular features, or demographic features. Note: How Google obtains site traffic information is open to debate, but here are some suggestions:
    • By analysing traffic that passes through its search engine results;
    • By analysing Google Toolbar data;
    • In the future, by providing visitor analytics -- see Google's recent acquisition of Urchin statistics.
  • Changes in the visitor traffic patterns over time. Was the site once popular but not now? Or vice-versa? These factors could help determine the relevancy of a site.
  • User behaviour relating to the document, such as (Google's example) how much time they spend looking at a website or page.
  • Historical domain-related information, including:
    • The "legitimacy" of the domain
    • The expiration date of the domain
    • The domain server/name server records (presumably to check for servers known to be associated with spam)
  • Prior ranking history of the document, i.e. past performance in Google.
  • The rate at which the document moves through the rankings over time, presumably to flag if a document rises unusually quickly -- 'spikes' -- in search results.
  • The rate at which the document is selected as a search result over time. Again, this has to do with determining a page or site's current popularity with searchers.
  • User-maintained or user-generated data. Google says such data includes favourites lists, bookmarks, temporary (internet) files and cache files. However, it is unclear as yet how Google would access this information. Google specifically mentions the identification of trends where users attempt to add or remove the document from their own generated or maintained data.
  • Information relating to document topic(s), as extracted from the document, and changes to document topic over time. It is well known that Google's programs attempt to surmise the theme of any given document -- this is how it matches relevant ads to third-party web pages.
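
To make a couple of these factors concrete, here is a purely speculative Python sketch of how selection rate, staleness and "link churn" might adjust a document's score. Nothing here comes from the patent itself: the DocumentHistory fields, the multipliers and the churn threshold are all my own hypothetical choices, and Google does not disclose how (or whether) it combines the signals this way.

```python
from dataclasses import dataclass


@dataclass
class DocumentHistory:
    """Hypothetical per-document history. Every field name is an assumption."""
    base_score: float     # score from all other ranking signals
    impressions: int      # times the document appeared in results
    selections: int       # times users selected (clicked) it
    links_at_start: int   # inbound links at the start of the period
    links_gained: int     # inbound links that appeared during the period
    links_lost: int       # inbound links that disappeared during the period


def selection_rate(doc: DocumentHistory) -> float:
    """Share of impressions that led to a selection."""
    return doc.selections / doc.impressions if doc.impressions else 0.0


def link_churn(doc: DocumentHistory) -> float:
    """Fraction of inbound links that changed during the period."""
    if doc.links_at_start == 0:
        return 0.0
    return (doc.links_gained + doc.links_lost) / doc.links_at_start


def history_adjusted_score(
    doc: DocumentHistory,
    peer_selection_rate: float,    # average rate of results shown alongside it
    selection_boost: float = 1.2,  # assumed multiplier, not from the patent
    stale_penalty: float = 0.8,    # assumed multiplier, not from the patent
    churn_threshold: float = 0.5,  # assumed threshold, not from the patent
    churn_penalty: float = 0.7,    # assumed multiplier, not from the patent
) -> float:
    score = doc.base_score
    rate = selection_rate(doc)
    if rate > peer_selection_rate:
        # Selected more often than neighbouring results: boost.
        score *= selection_boost
    elif rate < peer_selection_rate:
        # Selected less often than neighbouring results: treat as "stale".
        score *= stale_penalty
    if link_churn(doc) > churn_threshold:
        # Inbound links appear and vanish too quickly: penalize.
        score *= churn_penalty
    return score


popular = DocumentHistory(10.0, impressions=1000, selections=300,
                          links_at_start=100, links_gained=5, links_lost=2)
churny = DocumentHistory(10.0, impressions=1000, selections=50,
                         links_at_start=100, links_gained=80, links_lost=60)

print(history_adjusted_score(popular, peer_selection_rate=0.1))
print(history_adjusted_score(churny, peer_selection_rate=0.1))
```

In this toy version the well-selected page with stable links ends up scoring higher than an otherwise identical page that is rarely selected and whose inbound links churn heavily, which is the general direction the patent describes.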

I have not yet fully dissected Google's own examples and interpretations that follow the patent description, and my thoughts are likely to change on so doing.

However, it is clear from reading this patent that Google has (as we suspected) moved far away from its once cutting-edge PageRank algorithm, and is evolving to an algorithm (itself a set of algorithms) that is much more sophisticated.

This explains recent phenomena such as the Google Sandbox effect and the notorious Florida update, when much of the above, I suspect, was first implemented. (Note that the patent was originally filed on Dec 31, 2003 -- a month after the Florida update, when the Sandbox effect also first took hold.)

I believe that these changes are geared much more towards countering the effect of search engine spammers, by deterring "instant gratification" on the SERPs. The convenient side benefit for Google, of course, is that new sites that find it difficult to get listed are likely to spend money on Google Adwords.

However, the evolution of search engines is inevitable, and the pace of such evolution is impressive. The symbiotic relationship between Google and search engine optimisers remains. While the changes here are clearly designed to benefit aged websites that are regularly updated and have gained their links "naturally", the complexity of the algorithm is such that businesses are more likely to seek the services of skilled optimisers, rather than less.

The game continues.

Google News Gets Personal

Google have added an interesting new feature to Google News, allowing users to customise their News Pages.

Until now, I'd always considered this kind of personalisation a red herring, but in this case I'm impressed by what those boffins at the Plex have achieved.

Notably, the customise feature's interface seems (I'm guessing here) to use the same XMLHTTP technology that has proved so popular on Google Maps. This technology lets a page update in real time without a full reload -- hence you can slide the icons across the page.

If other sites follow Google's lead on XMLHTTP, the nature of the web could well change toward slicker, livelier interfaces -- what Flash has promised but rarely delivers.

I suggest Google should make the following enhancements to its customised news feature:

- Make it easier to delete sections. Currently, you have to click into a section to delete it (it took me a while to figure this out). There should be a way of deleting from what I'll call the "sliding cards view".

- Don't force the user to have to look at Top Stories. And/or don't force those stories to appear at the top of the page.

- Allow me to send my Customised News template to a friend. I'd like to be able to share the "page I made" with headlines relating to my interests.

Google TV Search

Last year Sergey Brin said that Google had no plans to introduce video search, and listed off the various reasons why.

So how to explain the launch earlier today of Google Video Search?!

Was Brin lying? So much for Google's "don't be evil" motto! (Unless lying isn't evil any more. Hmmmm. It's so hard to keep up.)

In truth, the search is limited to closed-caption TV segments from US networks. This is another of Google's technology seeds -- it's anyone's guess as to which will bear fruit.

Google Trivia

Alan Williamson recaps a presentation on user experience given by Google's Marissa Mayer and gives us some notable trivia. Here are my favourites:

The prime reason the Google home page is so bare is due to the fact that the founders didn't know HTML and just wanted a quick interface. In fact it was noted that the submit button was a long time coming and hitting the RETURN key was the only way to burst Google into life.

Due to the sparseness of the homepage, in early user tests they noted people just sitting looking at the screen. After a minute of nothingness, the tester intervened and asked 'What's up?' to which they replied "We are waiting for the rest of it". To solve that particular problem the Google Copyright message was inserted to act as a crude end of page marker.

The infamous "I feel lucky" is nearly never used. However, in trials it was found that removing it would somehow reduce the Google experience. Users wanted it kept. It was a comfort button.