How can I ensure that Google indexes every URL submitted in my sitemap?

I have a rather large site, around 5.3 million unique URLs, and it's been a hassle to get Google to index them. Google Webmaster Tools shows how many URLs have been indexed, and the figure is currently around 250,000. It was closer to 400,000 last week.

I have changed nothing on the site, other than adding more content, yet something seems to be working against me with the Sitemap and SEO.

Any ideas what may be causing this?

Really? Are these really 5.3 million different pages, each with its own content? And, if so, does each page have enough useful content for Google to index? I would really like to know how you achieve that.

That said, did you know that Google recently changed the way they report indexed URLs within GWT? They now specifically exclude different URLs that point to duplicate or non-canonical content, or to pages on a different “version” of your site (specifically http: vs. https:). For these reasons, they say that “The number of indexed URLs is almost always significantly smaller than the number of crawled URLs”. (Source.) That might explain what you are seeing.

Note that what I said above about duplicate and non-canonical content applies only to the way URLs are reported within GWT. It has nothing to do with search engine ranking.

Mike

Thanks for the reply, Mikl.

My 5.3 million URLs are split across 200+ smaller sitemaps, which you can see here: http://www.sportscardslist.com/sitemap.xml
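
For readers unfamiliar with this layout: a sitemap index is an XML file that lists child sitemap files, and each child file lists page URLs. The sitemaps.org protocol caps each child file at 50,000 URLs, so 5.3 million URLs need at least 106 files, which is consistent with the 200+ mentioned here. Below is a minimal Python sketch of reading such files and doing that arithmetic. The sample XML, the function names, and the example.com URLs are made up for illustration and are not taken from the site in question.

```python
import math
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Per-file URL limit from the sitemaps.org protocol.
MAX_URLS_PER_SITEMAP = 50_000

# Hypothetical sample data, standing in for files fetched from a real site.
SAMPLE_INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>
"""

SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/cards/1</loc></url>
  <url><loc>http://example.com/cards/2</loc></url>
</urlset>
"""


def child_sitemaps(index_xml):
    """Return the child sitemap URLs listed in a sitemap index document."""
    root = ET.fromstring(index_xml.strip())
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]


def count_urls(sitemap_xml):
    """Return the number of <url> entries in a single sitemap document."""
    root = ET.fromstring(sitemap_xml.strip())
    return len(root.findall("sm:url/sm:loc", NS))


def min_sitemap_files(total_urls):
    """Smallest number of sitemap files needed to hold total_urls entries."""
    return math.ceil(total_urls / MAX_URLS_PER_SITEMAP)


if __name__ == "__main__":
    print(child_sitemaps(SAMPLE_INDEX))
    print(count_urls(SAMPLE_SITEMAP))
    print(min_sitemap_files(5_300_000))
```

Summing count_urls over every child file gives the number of URLs actually submitted, which is the figure to compare against the indexed count reported in GWT.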

Each page contains unique content about items indexed and cataloged on my website. My website focuses on sports trading cards, and creates a unique page for every card that is added to our master database.

http://www.sportscardslist.com/baseball/1952/topps/1123601/311-mickey-mantle-dp
http://www.sportscardslist.com/basketball/2003-04/upper-deck-exquisite-collection/1454902/78-lebron-james

And so on…

Well, that pretty well confirms what I said. Based on the examples you quoted, there is virtually no information for Google to index - just a few words in each case, without any obvious theme or meaning.

Put another way, Google has decided that there is no user query which any of these pages is likely to answer. I don’t know whether that explains the low figure you are seeing in GWT, but it’s unlikely that these pages will ever show up in search results. Maybe that’s the problem you need to focus on.

Mike

Thanks for the feedback.

My first reaction after reading this post was: WHAT? 5.3 million unique URLs? How old is your site?

The site is about 16 months old. It’s a catalog of every sports card ever produced. The site creates a unique URL for each card from each set.

In GWT you can check your indexing status, as well as any indexing errors.

If I were you, I would do the following:

Add functionality to navigate by team, not just by year, here:
http://www.sportscardslist.com/baseball
Add around 1000 words of content (plus images etc.) to these sport pages.

Add around 1000 words of content to the team pages (write about each team's history etc.).

Add canonical tags on all the card pages pointing back to the team pages, to shift PageRank to the team pages.
With content added, you should be able to rank the team pages in Google for things like “team x baseball cards”.
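
As a rough illustration of the canonical-tag suggestion above, the Python sketch below builds the link element that a card page would carry in its head section. It shows only the mechanics of the tag. The function name and the example.com team URL are hypothetical, since the thread does not say how team pages would be addressed on the real site.

```python
from html import escape


def canonical_link(target_url):
    """Build a rel="canonical" link element pointing at target_url.

    The URL is HTML-escaped so characters such as & are safe inside the attribute.
    """
    return '<link rel="canonical" href="{}">'.format(escape(target_url, quote=True))


if __name__ == "__main__":
    # Hypothetical team page URL, used only for illustration.
    team_page = "http://example.com/baseball/teams/example-team"
    print(canonical_link(team_page))
```

The element would go inside the head of each card page's HTML, with target_url set to the team page that card belongs to.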