Usman Farooq is an emerging SEO/SEM expert. With his diverse skills and search engine knowledge, he aims to reach for the sky and become known around the globe for his search engine optimization and search engine marketing techniques, while delivering the best possible customer satisfaction.
Thursday, June 24, 2010
Bing It? “Bring It,” Says Google
SearchCap: The Day In Search, June 23, 2010
Viacom Loses Google/YouTube Lawsuit
Four Search Agencies Merge To Form BlueGlass Interactive
Twitter’s First Head Of State Visit: Russian President Dmitry Medvedev
Two Advanced Tactics For PPC Copywriting
Tuesday, June 22, 2010
SearchCap: The Day In Search, June 22, 2010
From Search Engine Land:
Google, Twitter Argue Against Throttling Speed Of News
Google and Twitter have teamed up to support a web site that’s facing legal trouble for reporting news too quickly.
Reuters reports that Google and Twitter filed [...]
Bing Entertainment Unwrapped: Music, Movies, Games & TV
The official Bing blog has all the details about the new features in all aspects of entertainment: Music, Movies, TV listings and gaming information.
At [...]
Whiteboard Friday - What's Working for You? with Richard Baxter
Posted by Scott Willoughby
The avalanche-like flow of special guest Whiteboard Fridays continues this week with another installment featuring our beloved London SEO expert, Richard Baxter (anchor text, y'all). Last week Richard helped us all learn how to get our fresh content indexed licketty-split, and this week he's back to help us learn how to identify which areas of our sites are working hardest for us.
Whether you have multiple types of content on your site (maybe a blog, tools, articles, etc.), or you have limited content types across different topics (blog posts about cats, kittens, evil cats, ninja kittens, evil ninja kitten cats, etc.), wouldn't it be nice to know which content types or topics bring you the most and best traffic? Never fear, Richard's here to explain his handy-dandy system to do just that! By the end of this video you'll know exactly which stats to pull from your analytics to create a so-shiny-it's-practically-chromed spreadsheet that will let you peer deep into the inky black heart of your site and know the stars, the slackers, and the shiftless hobos among your content.
Wow! It's like the future is now! And, since thinking of the future always makes me think of 'Flash', and thinking of 'Flash' reminds me that those of you without Adobe Flash can't watch the video, I'll try to summarize Richard's bard-like musings on content segmentation and performance analysis.
In order to track and analyze the performance of your individual content, you'll want to segment out your analytics data by content type. This is really, really easy to do if you have good, clean site structure (which you have, right? RIGHT?!). You can just pull Richard's data points (below) for the different sections or subfolders of your site. If you were lazy and thought the best way to organize your site was to throw all of the pages into a virtual bucket, dump them out, name them by throwing your keyboard at a stump, and call it a day, you'll have to get a little more involved with how you filter your segments. No matter what though, you might consider segments like all blog posts (perhaps a 'CONTAINS /blog' filter), all tools, all content written by Belverd Needles, III (/authors/belverd), etc.
Once you have your segment filters in place, you just need to pull the data that Richard suggests and you'll be able to see exactly how Belverd's content compares to that of his bloggitty arch-nemesis, Marmaduke Huffsworth, Esq. (/authors/marmaduke). What data you say? This data:
1. Number of Pages per Segment: Richard advocates crawling your site using something like Link Sleuth to get this number; you'll use it for all sorts of fun calculations. Yes, calculations can be fun. If you don't believe me, just ask these racially diverse, embroidered youths.
2. Number of Keywords Sending Traffic: You can pull this from your analytics. Don't worry so much about the words themselves here; you just want to know how many different keyword terms are delivering one or more visits to each segment.
3. Number of Pages Getting Entries from Search Engines: How many pages within the segment received one or more visits from a search engine (pick an engine, any engine, or all of them, whatever matters to you...so Google, basically).
4. Total Visits from Search Engines: Like it says on the tin, this is just the total number of visits to the segment from search traffic.
5. Percentage of Total Visits that Performed a Conversion Action: This will require that you have some conversion actions set up in your analytics, but it's a key data point if you want to figure out your strongest content (see the quick arithmetic sketch after this list).
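Once you've pulled those five raw numbers for each segment, the derived metrics are just simple arithmetic. Here's a minimal PHP sketch of the kind of per-segment summary you could build from them; the segment names and figures below are made up purely for illustration:
<?php
// Hypothetical per-segment numbers pulled from a crawl plus an analytics export.
$segments = array(
    'blog'  => array('pages' => 120, 'keywords' => 950,  'entry_pages' => 80, 'visits' => 4200, 'conversions' => 84),
    'tools' => array('pages' => 15,  'keywords' => 1300, 'entry_pages' => 14, 'visits' => 9800, 'conversions' => 490),
);

foreach ($segments as $name => $s) {
    // Derived metrics: how hard is each page in this segment working?
    $visitsPerPage    = $s['visits'] / $s['pages'];
    $keywordsPerPage  = $s['keywords'] / $s['pages'];
    $pctPagesEntering = 100 * $s['entry_pages'] / $s['pages'];
    $conversionRate   = 100 * $s['conversions'] / $s['visits'];

    printf("%s: %.1f visits/page, %.1f keywords/page, %.1f%% of pages get search entries, %.1f%% conversion rate\n",
        $name, $visitsPerPage, $keywordsPerPage, $pctPagesEntering, $conversionRate);
}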
So what can all of this stuff tell you? LOTS! By tracking these numbers, you'll be able to quickly identify which content is working hardest for you. You'll be able to know whether Marmaduke or Belverd is better at drawing high-converting traffic. You'll know which subjects and content types are most deserving of your precious time and the investment of your hard-bilked pennies. You'll know who put the bop in the bop shoo bop, who moved your cheese, and why birds suddenly appear every time I'm near (it's because my pockets are full of birdseed). You'll be 12.7-29.4% awesomier than you were before, and you'll smell delightful ALL THE TIME!
Now aren't you glad Richard stopped by and shared his magic secrets with you? Thanks, Richard!
P.S. Richard has posted more on his blog about getting things indexed quickly with PubSubHubbub, among other things - well worth a read.
Amazon Web Services: Clouded by Duplicate Content
Posted by Stephen Tallamy
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author's views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
At the end of last year the website I work on, LocateTV, moved into the cloud with Amazon Web Services (AWS) to take advantage of increased flexibility and reduced running costs. A while after we switched I found that Googlebot was crawling the site almost twice as much as it used to. Looking into it some more, I found that Google had been crawling the site from a subdomain of amazonaws.com.
The problem is that when you start up a server on AWS it automatically gets a public DNS entry which looks a bit like ec2-123-456-789-012.compute-1.amazonaws.com. This means that the server will be available through this domain as well as through the main domain that you will have registered to the same IP address. For us, this problem doubled itself: we have two web servers for our main domain, and hence the whole of the site was being crawled through two different amazonaws.com subdomains as well as www.locatetv.com.
Now, there were no external links to these AWS subdomains, but because Google is a domain registrar it was notified of the new DNS entries, and it went ahead and indexed loads of pages. All this was creating extra load on our servers and a huge duplicate content problem (which I cleaned up, after quite a bit of trouble - more below).
A pretty big mess.
I thought I'd do some analysis into how many other sites were being affected by this problem. A quick search on Google for site:compute-1.amazonaws.com and site:compute.amazonaws.com reveals almost half a million web pages indexed (stats from the site: command are often dodgy, but they give some sense of the scale of the issue).
My guess is that most of these pages are duplicate content with the site owners having separate DNS entries for their site. Certainly this is the case for the first few sites I checked:
- http://ec2-67-202-8-9.compute-1.amazonaws.com is the same as http://www.broadjam.com
- http://ec2-174-129-207-154.compute-1.amazonaws.com is the same as http://www.elephantdrive.com
- http://ec2-174-129-253-143.compute-1.amazonaws.com is the same as http://boxofficemojo.com
- http://ec2-174-129-197-200.compute-1.amazonaws.com is the same as http://www.promotofan.com
- http://ec2-184-73-226-122.compute-1.amazonaws.com is the same as http://www.adbase.com
For Box Office Mojo, Google is reporting 76,500 pages indexed for the amazonaws.com address. That's a lot of duplicate content in the index. A quick search for something specific like "Fastest Movies to Hit $500 Million at the Box Office" shows duplicates from both domains (plus a secure subdomain and the IP address of one of their servers - oops!).
Whilst I imagine Google would be doing a reasonable job of filtering out the duplicates when it comes to most keywords, it's still pretty bad to have all this duplicate content in the index and all that wasted crawl time.
This is pretty dumb for Google (and other search engines) to be doing. It's pretty easy to work out that both the real domain and the AWS subdomain resolve to the same IP address and that the pages are the same. They could be saving themselves a whole lot of time by not crawling URLs that only exist because of a duplicate DNS entry.
Fixing the source of the problem.
As good SEOs, we know that we should do whatever we can to make sure that there is only one domain name resolving to a site. There is no way, at the moment, to stop AWS from adding the public DNS entries, so one way to solve this is to redirect to the main domain whenever the web server is accessed through the AWS subdomain. Here is an example of how to do this using Apache mod_rewrite:
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^(.*)$ http://www.mydomain.com/$1 [R=301,L]
This can be put either in the httpd.conf file or the .htaccess file and basically says that if the requested host is ec2-123-456-789-012.compute-1.amazonaws.com then 301 redirect all URLs to the equivalent URL on www.mydomain.com.
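If for some reason you can't touch httpd.conf or .htaccess, roughly the same canonical-host check can be done at the application level instead. Here's a minimal PHP sketch (the hostname is a placeholder, just like in the mod_rewrite example above):
<?php
// Redirect any request that arrives on a non-canonical hostname (such as the
// auto-generated amazonaws.com public DNS name) to the real domain.
$canonicalHost = 'www.mydomain.com';

if (strcasecmp($_SERVER['HTTP_HOST'], $canonicalHost) !== 0) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://' . $canonicalHost . $_SERVER['REQUEST_URI']);
    exit;
}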
This fix quickly stopped Googlebot from crawling our amazonaws.com subdomain addresses, which took considerable load off our servers, but by the time I'd spotted the problem there were thousands of pages indexed. As these pages were probably not doing any harm I thought I'd just let Google find all the 301 redirects and remove the pages from the index. So I waited, and waited, and waited. After a month the number of pages indexed (according to the site: command) was exactly the same. No pages had dropped out of the index.
Cleaning it up.
To help Google along I decided to submit a removal request using Webmaster Tools. I temporarily removed the 301 redirects to allow Google to see my site verification file (obviously it was being redirected to the verification file on my main domain) and then put the 301 redirect back in. I submitted a full site removal request but it was rejected because the domain was not being blocked by robots.txt. Again, this is pretty dumb in my opinion, because the whole of the subdomain was being redirected to the correct domain.
As I was a bit annoyed that the removal request would not work in the way I wanted it to, I thought I'd leave Google another month to see if it found the 301 redirects. After at least another month, no pages had dropped out of the index. This backs up my suspicion that Google does a pretty poor job of finding 301 redirects for stuff that isn't in the web's link graph. I have found this before, where I have changed URLs, updated all internal links to point at the new URLs and redirected the old URLs. Google doesn't seem to go back through its index and re-crawl pages that it hasn't found in its standard web crawl to see if they have been removed or redirected (or if it does, it does it very, very slowly).
Having had no luck with the 301 approach, I decided to switch to using a robots.txt file to block Google. The issue here is that, clearly, I didn't want to edit my main robots.txt to block bots, as that would stop crawling of my main domain. Instead, I created a file called robots-block.txt that contained the usual blocking instructions:
User-agent: *
Disallow: /
I then replaced the redirect entries in my .htaccess file with something like this:
RewriteCond %{HTTP_HOST} ^ec2-123-456-789-012\.compute-1\.amazonaws\.com$ [NC]
RewriteRule ^robots\.txt$ robots-block.txt [L]
This basically says that if the requested host is ec2-123-456-789-012.compute-1.amazonaws.com and the requested path is robots.txt, then serve the robots-block.txt file instead. This means I effectively have a different robots.txt file served from this subdomain. Having done this I went back to Webmaster Tools, submitted the site removal request and this time it was accepted. "Hey presto", my duplicate content was gone! For good measure I replaced the robots.txt mod_rewrite with the original redirect commands to make sure any real users are redirected properly.
Reduce, reuse, recycle.
This was all a bit of a fiddle to sort out and I doubt many webmasters hosting on AWS will have even realised that this is an issue. This is not purely limited to AWS, as a number of other hosting providers also create alternative DNS entries. It is worth finding out what DNS entries are configured for the web server(s) serving a site (this isn't always that easy but you can use your access logs/analytics to get an idea) and then making sure that redirects are in place to the canonical domain. If you need to remove any indexed pages then hopefully you can do something similar to the solution I proposed above.
There are some things that Google could do to help solve this problem:
- Be a bit more intelligent in detecting duplicate domain entries for the same IP address.
- Put some alerts into Webmaster Tools so webmasters know there is a potential issue.
- Get better at re-crawling pages in the index that aren't found in the standard crawl, to detect redirects.
- Add support for site removal when a site-wide redirect is in place.
In the meantime, hopefully I've given some actionable advice if this is a problem for you.
Sunday, June 20, 2010
URL Rewrite Smack-Down: .htaccess vs. 404 Handler
Posted by MichaelC
First, a quick refresher: URL prettying and 301 redirection can both be done in .htaccess files, or in your 404 handler. If you're not completely up to speed on how URL rewrites and 301s work in general, this post will definitely help. And if you didn't read last week's post on RewriteRule's split personality, it's probably helpful background material for understanding today's post.
"URL prettying" is the process of showing readable, keyword-rich URLs to the end user (and Googlebot) while actually using uglier, often parameterized URLs behind the scenes to generate the content for the page. Here, you do NOT do a 301 redirection. (Unclear on redirection, 301s vs. 302s, etc.? There's help waiting for you here in the SEOmoz Knowledge Center.) |
301s are done when you really have moved the page, and you really do want Googlebot to know where the new page is. You're admitting to Googlebot that it no longer exists in the old location. You're also asking Googlebot to give the new page credit for all the link juice the old page had earned in the past.
If you're trigger-happy, you might leap to the conclusion that RewriteRule is the weapon of choice for both URL prettying and 301 redirects. Certainly you CAN use RewriteRule for these tasks, and certainly the regex syntax is a powerful way to accomplish some pretty complex URL transformations. And really, if you're going to use RewriteRule, you should probably be using it in your httpd.conf file instead.
The Apache docs have a great summary of when not to use .htaccess.
Fear Not the 404 Handler
First, all y'all who tremble at the thought of creating your very own custom 404 handler, take a Valium. It's not that challenging. If you've gotten RewriteRule working and lived to tell the tale, you're not going to have any difficulty making a custom 404 error handler. It's just a web page that displays some sort of "not found" message, but it gives you an opportunity to have a look at the page that was requested, and if you can "save it", you redirect the user to the page they're looking for with just a line or two of code.
If not, the 404 HTTP status gets returned, along with however you'd like the page to look when you tell them you couldn't find what they were looking for.
By the way, having your own 404 handler gives you the opportunity to entertain your user, instead of just making them feel sorry for themselves. Check out this post from Smashing Magazine on creative 404 pages.
Having a good sense of humor could inspire love & loyalty from a customer who otherwise might just be miffed at the 404.
Here's an example of a 404 handler in ASP. Important note: don't use Response.Redirect -- it does a 302, not a 301!
For PHP, you need to add a line to your .htaccess pointing to wherever you've put your 404 handler:
- ErrorDocument 404 /my-fabulous-404-handler.php
Then, in that PHP file, you can get the URL that wasn't found via:
- $request = $_SERVER['REDIRECT_URL'];
Then, use any PHP logic you'd like to analyze the URL and figure out where to send the user.
If you can successfully redirect it, set:
- header("HTTP/1.1 301 Moved Permanently");
- header ("Location: http://www.acmewidgets.com/purple-gadgets.php");
And here's where it gets a bit hairy in PHP. There's no real way to transfer control to another webpage behind the scenes without telling the browser or Googlebot, via a 301, that you're handing it off to the other page. But you can call require() on the fly to pull in the code from the target page. Just make sure to set the HTTP code to 200 first:
- header('HTTP/1.1 200 OK');
And you've got to be careful throughout your site to use include_once() instead of include() to make sure you don't pull a common file in twice. Another option is to use curl to grab the content of the target page as if it were on a remote server, then regurgitate the HTML back in-stream by echoing what you get back. A bit hazardous if you're trying to drop cookies, though...
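For what it's worth, that curl approach would look roughly like the sketch below; the target URL is hypothetical, and in practice you'd build it from the requested URL rather than hard-code it:
<?php
// Fetch the real (parameterized) page server-side and echo its HTML back,
// so the browser and Googlebot still see the URL they originally asked for.
header('HTTP/1.1 200 OK');

$ch = curl_init('http://www.acmewidgets.com/gadgets.php?id=42');  // hypothetical target page
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // hand the HTML back instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow any internal redirects along the way
$html = curl_exec($ch);
curl_close($ch);

echo $html;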
And, if you really need to send a 404:
- header('HTTP/1.0 404 Not Found');
Very Important: be careful to make sure you're returning the right HTTP code from your 404 handler. If you've found a good content page you'd like to show, return a 200. If you found a good match, and want Googlebot to know about that pagename instead of what was requested, do a 301. If you really don't have a good match, be sure you send a 404. And, be sure to test the actual response codes received--I'm a huge fan of the HttpFox Firefox plug-in.
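To tie the PHP pieces together, here's a minimal skeleton of a 404 handler covering all three outcomes. It's a sketch, not a drop-in solution: the map_requested_url() function and its example URLs are entirely hypothetical stand-ins for whatever string handling and lookups your site actually needs.
<?php
// my-fabulous-404-handler.php
// Wired up in .htaccess with: ErrorDocument 404 /my-fabulous-404-handler.php

// Hypothetical mapping: decide what to do with the URL that wasn't found.
function map_requested_url($url) {
    if ($url === '/purple-gadgets') {
        // The content lives at a different public URL: ask for a 301.
        return array('type' => 'redirect', 'url' => 'http://www.acmewidgets.com/purple-gadgets.php');
    }
    if (preg_match('#^/pretty/[a-z0-9-]+/$#', $url)) {
        // We can render this ourselves under the requested URL.
        return array('type' => 'render', 'script' => 'page.php');  // hypothetical script
    }
    return null;  // genuinely nothing to show
}

$requested = isset($_SERVER['REDIRECT_URL']) ? $_SERVER['REDIRECT_URL'] : '/';
$match     = map_requested_url($requested);

if ($match !== null && $match['type'] === 'redirect') {
    // Found a better home for this content: 301 so the link juice follows.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $match['url']);
    exit;
}

if ($match !== null && $match['type'] === 'render') {
    // Serving good content under the requested URL: make sure the status is 200.
    header('HTTP/1.1 200 OK');
    require $match['script'];
    exit;
}

// A real 404: send the status plus whatever friendly page you like.
header('HTTP/1.0 404 Not Found');
echo 'Sorry, we could not find what you were looking for.';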
Ease of Debugging
This is where the 404 handler really wins my affection. Because it's just another web page, you can output partial results of your string manipulation to see what's going on. Don't actually code the redirection until you're sure you've got everything else working. Instead, just spit out the URL that came in, the URL you're trying to fabricate and redirect to, and any intermediate strings that help you figure it all out. With RewriteRule, debugging pretty much consists of coding your regex expression, putting in the flags, then seeing if it worked. Is the URL coming in in mixed case? The slashes...forward? Reverse? Did I need to escape that character...or is it not That Special?
You're flying blind. It works, or it doesn't work.
If you're struggling with RewriteRule regular expressions, Rubular has a nice regex editor/tester.
Programming Flexibility
With RewriteRule, you've got to get all the work done in the single line of regex. And while regex is elegant, powerful, and should be worshipped by all, sometimes you'll want to do more complex URL rewriting logic than just clever substitution. In your 404 handler, you can call functions to do things like convert numeric parameters in your source URL to words and vice versa.
Access to Your Database
If you're working with a big, database-driven site, you may want to look up elements in your database to convert from parameters to words.
And since the 404 handler is just another webpage, you can do anything with your database that you'd do in any other webpage.
For example, I had a travel website where destinations, islands, and hotels all were identified in the database by numeric IDs. The raw page that displayed content for a hotel also needed to show the country and island that the hotel was on.
The raw URL for a specific hotel page might have been something like:
/hotel.asp?dest=41&island=3&hotel=572
Whereas the "pretty URL" for this hotel might have been something like:
/hotels/Hawaii/Maui/Grand-Wailea/
When the "pretty URL" above was requested by the client, my 404 handler would break the URL down into sections:
- looking up the 2nd section in the destinations table (Hawaii = 41)
- looking up the 3rd section in the island table (Maui = 3)
- looking up the 4th section in the hotel table (Grand Wailea = 572)
Then, I'd call the ASP function Server.Transfer to transfer execution to /hotel.asp?dest=41&island=3&hotel=572 to generate the content.
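Since PHP has no Server.Transfer, the closest equivalent of that hand-off is the require() trick described earlier. Here's a rough PHP sketch of the same decode step; the lookup helper and its data are invented for illustration (a real version would query the database, ideally with prepared statements):
<?php
// Hypothetical lookup helper: a real version would query the database.
function lookup_id($table, $name) {
    $fake = array(
        'destinations' => array('Hawaii' => 41),
        'islands'      => array('Maui' => 3),
        'hotels'       => array('Grand-Wailea' => 572),
    );
    return isset($fake[$table][$name]) ? $fake[$table][$name] : 0;
}

// e.g. REDIRECT_URL = /hotels/Hawaii/Maui/Grand-Wailea/
$parts = array_values(array_filter(explode('/', $_SERVER['REDIRECT_URL'])));
// $parts => array('hotels', 'Hawaii', 'Maui', 'Grand-Wailea')

if (count($parts) === 4 && $parts[0] === 'hotels') {
    $dest   = lookup_id('destinations', $parts[1]);  // Hawaii       -> 41
    $island = lookup_id('islands',      $parts[2]);  // Maui         -> 3
    $hotel  = lookup_id('hotels',       $parts[3]);  // Grand-Wailea -> 572

    if ($dest && $island && $hotel) {
        // Hand off to the raw page behind the scenes, keeping the pretty URL and a 200 status.
        header('HTTP/1.1 200 OK');
        $_GET['dest']   = $dest;
        $_GET['island'] = $island;
        $_GET['hotel']  = $hotel;
        require 'hotel.php';  // the PHP twin of hotel.asp in this example
        exit;
    }
}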
Now, keep in mind that you'll probably want to generate the links to your pretty URLs from the database identifiers, rather than hard-code them. For instance, if you have a page that lists all of the hotels on Maui, you'll get all of the hotel IDs from the database for hotels where the destination = 41 and island = 3, and want to write out the links like /hotels/Hawaii/Maui/Grand-Wailea/. The functions you write to do this are going to be very, very similar to the ones you need to decode these URLs in your 404 handler.
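As a rough illustration of how similar the two halves are, the link-writing side might be nothing more than a tiny helper like this (again, purely a hypothetical sketch):
<?php
// Hypothetical inverse helper: build the pretty URL from the names you pulled
// out of the database for a given destination, island, and hotel.
function pretty_hotel_url($destinationName, $islandName, $hotelName) {
    $parts = array($destinationName, $islandName, $hotelName);
    foreach ($parts as $i => $part) {
        $parts[$i] = str_replace(' ', '-', trim($part));
    }
    return '/hotels/' . implode('/', $parts) . '/';
}

echo pretty_hotel_url('Hawaii', 'Maui', 'Grand Wailea');  // /hotels/Hawaii/Maui/Grand-Wailea/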
Last but not least: you can keep track of 404s that surprise you (i.e. real 404s) by having the page either email you or log the 404'ed URLs to a table in your database.
Performance
For most people, the performance hit of doing the work in .htaccess is not going to be significant. But if you're doing URL prettying for a massive site, or have renamed an enormous list of pages on your site, there are a few things you might want to be aware of--especially with Google now using page load speed as one of its ranking factors.
All requests get evaluated in .htaccess, whether the URLs need manipulation/redirection or not.
That includes your CSS files, your images, etc.
By moving your rewriting/redirecting to your 404 handler, you avoid having your URL pattern-matching code check against every single file requested from your webserver--only URLs that can't be found as-is will hit the 404 handler.
Having said that, note that you can pattern-match in .htaccess for pages you do NOT want manipulated, and use the L flag to stop processing early in .htaccess for URLs that don't need special treatment.
Even if you expect nearly every page requested to need URL de-prettying (conversion to parameterized page), don't forget about the image files, Javascript files, CSS, etc. The 404 handler approach will avoid having the URLs for those page components checked against your conversion patterns every single time they're fetched.
A Special Case
OK, maybe this case isn't all that special--it's pretty common, in fact. Let's say we've moved to a structure of new pretty URLs from old parameterized URLs.
Not only do we have to be able to go from pretty URL --> parameterized URL to generate the page content for the user, we also want to redirect link juice from any old parameterized URL links to the new pretty URLs.
In the actual parameterized web page (e.g. hotel.asp in the above example), we want to do a 301 redirect to the pretty URL. We'll take each of the numeric parameters, look up the destination, island, and hotel name, fabricate our pretty URL, and 301 to that. There, link juice all saved...
But we've got to be careful not to get into an infinite loop, converting back and forth and back and forth:
When this happens, Firefox offers a message to the effect that you've done something so dumb it's not even going to bother trying to get the page. They say it so politely though: "Firefox has detected that the server is redirecting the request for [URL] in a way that will never complete."
By the way, it's entirely possible to cause this same problem to happen through RewriteRule statements--I know this from personal experience :-(
It's actually not that tough to solve this. In ASP, when the 404 handler passes control to the hotel.asp page, the query string now starts with "404;http". So in hotel.asp, we see if the query string starts with 404, and if it does, we just continue displaying the page. If it doesn't start with 404;http then we 301 to the pretty URL.
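That check is specific to ASP on IIS, where the 404 handler's query string gets the "404;http..." prefix. If you're doing the same dance in PHP, an analogous guard is a flag that only exists when the raw page was pulled in by the 404 handler; here's a rough sketch (not the author's code), with a hypothetical URL-building helper:
<?php
// In the 404 handler, just before pulling in the raw page:
define('CAME_VIA_404_HANDLER', true);
require 'hotel.php';

// --- and at the top of hotel.php (the raw, parameterized page): ---
if (!defined('CAME_VIA_404_HANDLER')) {
    // Someone hit the ugly URL directly: 301 them to the pretty version.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . pretty_url_for($_GET['dest'], $_GET['island'], $_GET['hotel']));  // hypothetical helper
    exit;
}
// Otherwise we arrived via the 404 handler: just carry on rendering the page.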
Other References
Information on setting up your 404 handler in Apache:
- http://www.plinko.net/404/custom.asp
- http://www.webreference.com/new/011004.html
- http://www.phpriot.com/articles/search-engine-urls/4
Apache documentation on RewriteRule:
ASP.net custom error pages: