Wednesday, August 08, 2007

How to Get More Pages into Google's Index

Many people obsess about every word Matt Cutts says, but there are plenty of other Googlers who can teach us a thing or two about Google's inner workings. At Search Engine Strategies Chicago 2006 I was on a panel with one of them: Dan Crow, a member of Google's search quality group and the Product Manager for the crawl infrastructure group.

When Jill Whalen and Pauline Kerbici of High Rankings started a local organization called Search Engine Marketing New England (SEMNE), I suggested that they invite Dan to speak. Besides his charming British accent, Dan's a great speaker because he knows everything about Googlebots and indexing.

Last month in Providence, nearly 100 SEMNE members and guests showed up to meet Dan. To learn about the official presentation, you can read Jill's summary, "Getting into Google," and Rand Fishkin's post, "Dan Crow of Google on Crawling, Indexing & Ranking." Instead of yet another summary, here I will cover the unofficial story, the conversations I had with Dan before and after the main event.

Dan Crow's Advice to Webmasters

Dan started our conversation by saying that the World Wide Web is very large, and Google is not even sure how large. They can only index a fraction of it. Google has plenty of capital to buy more computers, but there just isn't enough bandwidth and electricity available in the world to index the entire Internet. Google's crawling and indexing programs are believed to be the largest computations ever.

Googlebots fetch pages, and then an indexing program analyzes each page and stores a representation of it in Google's index. The index is an incomplete model of the Web. From there, PageRank is calculated and secret algorithms generate the search results. The only pages that can show up in Google's search results are pages included in the index; if your page isn't indexed, it will never rank for any keyword.
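To make that last point concrete, here's a toy sketch of my own in Python (nothing like Google's actual systems): a tiny inverted index in which a page that was fetched but never indexed simply cannot appear in any search result.

    # Toy inverted index: maps each word to the set of URLs containing it.
    index = {}

    def add_to_index(url, text):
        """The 'indexing' step: record which words appear on the page."""
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)

    def search(word):
        """Results can only come from pages that made it into the index."""
        return index.get(word.lower(), set())

    # Page A is fetched and indexed; page B is fetched but never indexed.
    add_to_index("http://example.com/a", "blue widgets for sale")

    print(search("widgets"))  # prints {'http://example.com/a'}; page B can never rank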

Because the Web is so much larger than the index, Google has to make decisions about what to spider and what to index. Dan told me that Google doesn't spider every page they know about, nor do they add every spidered page to the index. Two thoughts flashed through my mind at that moment: (1) I need to buy Dan a drink, (2) What can I do to make sure my pages get indexed?

Bandwidth and electricity are the constraining resources at Google. On some level they have to allocate those resources among all the different Web sites: Google isn't going to index sites A through G and then ignore H through Z. Dan suggested that each day Google has a large but limited number of URLs it can spider, so for large sites it's in the owners' interest to help the indexing process run more efficiently, because that may lead to more pages being indexed.

How much effort Google decides to put into spidering a site is a secret, but it's influenced by PageRank. If your site has relatively few pages with high PageRank, they'll all get into the index no problem, but if you have a large number of pages with low PageRank, you may find that some of them don't make it into Google's index.
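As a rough illustration of what that could mean in practice (again, my own sketch, not anything Dan described), imagine a crawler with a fixed daily URL budget that prioritizes pages by a PageRank-like score. On a large site, the low-scoring pages are the ones that get left out.

    import heapq

    def pick_urls_to_crawl(candidates, daily_budget):
        """Toy model: given (url, score) pairs, keep only the highest-scoring
        URLs that fit within a fixed daily crawl budget. The scores stand in
        for PageRank-like importance; the budget stands in for limited
        bandwidth and electricity."""
        top = heapq.nlargest(daily_budget,
                             ((score, url) for url, score in candidates))
        return [url for score, url in top]

    # Hypothetical site: a few strong pages and a thousand weak archive pages.
    candidates = [("/", 0.9), ("/products", 0.6), ("/about", 0.5)]
    candidates += [("/archive/page-%d" % i, 0.01) for i in range(1000)]

    to_crawl = pick_urls_to_crawl(candidates, daily_budget=100)
    print(len(to_crawl), to_crawl[:3])
    # 100 ['/', '/products', '/about']; most archive pages never get spidered.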

Clean Code Matters

What can we do to get more pages indexed? I've always suspected that streamlining HTML code is a good way to facilitate indexing. Reducing code bloat helps pages load faster and use less bandwidth. I asked if it would help to move JavaScript and CSS definitions to external files, and clean up tag soup. Dan's answer was refreshingly clear. "Those would be very good ideas," he said.
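If you want a quick sense of how much inline code is padding a page, here is a small sketch of my own (Python standard library only; example.html stands in for a saved copy of one of your pages) that measures what fraction of the document's characters sit inside <style> and <script> blocks, exactly the material that can usually be moved to external, cacheable files.

    from html.parser import HTMLParser

    class BloatMeter(HTMLParser):
        """Counts the characters that sit inside <style> and <script> blocks."""
        def __init__(self):
            super().__init__()
            self.in_inline = False
            self.inline_chars = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("style", "script"):
                self.in_inline = True  # entering an inline CSS or JS block

        def handle_endtag(self, tag):
            if tag in ("style", "script"):
                self.in_inline = False

        def handle_data(self, data):
            if self.in_inline:
                self.inline_chars += len(data)

    # example.html is a hypothetical local copy of a page you want to check.
    html = open("example.html", encoding="utf-8").read()
    meter = BloatMeter()
    meter.feed(html)
    print(f"{meter.inline_chars / len(html):.0%} of this page is inline CSS/JS")

Running the same pass over your templates before and after cleanup gives a crude but useful measure of how much leaner the served pages have become.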

SEOs pay a lot of attention to issues like duplicate content, link building to increase PageRank, and link structure to move PageRank throughout the site. However, I haven't seen many SEO articles about the importance of proper Web development methodology. All too often when I look at a new site, I am appalled at the sloppy coding. The typical site could be streamlined significantly.

Yes, you should try to increase the PageRank of your pages, and you should design your link structure so that PageRank is distributed throughout your site in a way that makes sense. You should provide unique and valuable content. Those tactics will help your indexing, but you also need to pay attention to the dirty details of how your pages are put together. If everybody served clean code, Google would be able to index significantly more pages.

Why doesn't Google do more to educate webmasters about the efficient use of bandwidth and computing power? Perhaps it would look bad for Google to ask webmasters to recode their sites to make Google's job easier. Nonetheless, if Google can tell me how to get more of my pages into the index, I'm ready to listen and cooperate.

Clean HTML is good not just for getting indexed, but also because it means more people can read your site. The cleaner and more standards-compliant your code, the wider the range of browsers it will work in. That is especially important for users with screen readers and for those on mobile devices such as cell phones.

Source: searchenginewatch.com