Sooner or later it will happen to you, if it hasn’t already. For one reason or another, you need to remove a page from your site. Maybe it’s a page that no longer works with a new site layout. Maybe it’s a piece of outdated content you want removed rather than updated. No matter the reason, you’ll quickly discover that deleting the page from your site is only the beginning. Google still shows it in search results, and you want it gone from there too. After all, traffic landing on a broken page is no good.
Why Google Keeps Content
Think about the situation from Google’s perspective. If a user performs a search, they want results. Having nothing to give them is a serious failure on the part of the search engine. On the other hand, finding a page that no longer exists is still useful. It shows that the search engine can find that content, and it’s not the search engine’s fault that the content no longer exists. Additionally, users can use cached versions of the page or look the URL up in the Internet Archive. There’s also the issue of temporary downtime. If you don’t take specific steps to tell Google one way or the other, Google will assume that the first crawl of a missing page found it missing because of a temporary site or host issue. Imagine the lost influence if your pages were removed from search every time a crawler landed on the page when your host blipped out!
Of course, Google doesn’t want to assist in something illegal. They will happily and quickly assist in the removal of pages that contain information that should not be broadcast. This typically includes credit card numbers, signatures, social security numbers, and other confidential personal information. What it doesn’t include, though, is that blog post you made that was removed when you redesigned your site.
Removing Indexed Content
You can take several steps to assist in the removal of content from your site, but in the majority of cases, the process will be a long one. Very rarely will your content be removed from the active search results quickly, and then only in cases where the content remaining could cause legal issues. What can you do?
Step 1: Entirely Delete the Content
When it comes to hiding and removing content on the web, there are a number of things you can do. If you want the page entirely hidden from Google, however, what you need to do is present the search engine with a 404 page. That means the content needs to be entirely removed, not redirected. It can be your custom 404 page, as long as it’s a 404 and not simply replaced content.
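To confirm that a removed page really answers with a 404 rather than a redirect or replacement content, a quick check along these lines may help. This is a minimal sketch using only Python's standard library; the function names `fetch_status` and `describe_removal` are made up for illustration, and the 10-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for 3xx responses,
        # so the caller sees the redirect status itself.
        return None


def fetch_status(url: str) -> int:
    """Return the raw HTTP status code for url without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def describe_removal(status: int) -> str:
    """Explain whether a status code matches what a removed page should send."""
    if status in (404, 410):
        return f"ok: page reports it is missing ({status})"
    if 300 <= status < 400:
        return "problem: page redirects instead of returning a 404"
    if status == 200:
        return "problem: page still returns content"
    return f"check manually: unexpected status {status}"
```

Calling `describe_removal(fetch_status(url))` on the deleted page's address gives a one-line verdict. A custom 404 page passes this check only if the server sends the 404 status along with it.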
Step 2: Remove Internal Links
Next, scan your site for any pages that link to the removed page. This includes your site map, with one exception mentioned later. You should be able to generate a list of internal links pointing to the page and remove them throughout your site.
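If you don't have a crawler on hand, the scan can be approximated with a short script. The sketch below assumes you already have the HTML of each page available as a string (how you collect it is up to you); `LinkCollector` and `pages_linking_to` are illustrative names, not part of any tool mentioned here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def pages_linking_to(removed_url, pages):
    """Return the URLs of pages whose HTML links to removed_url.

    pages maps each page URL to its HTML source.
    """
    hits = []
    for page_url, html in pages.items():
        collector = LinkCollector(page_url)
        collector.feed(html)
        if removed_url in collector.links:
            hits.append(page_url)
    return hits
```

The comparison is an exact string match, so variants such as a trailing slash or a different scheme would need normalizing first.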
Step 3: Remove External Links
This one is much harder. You need to examine the link profile of the page you’re removing and find third-party sites that link to it. Normally, those backlinks are valuable. When you’re trying to remove the content, however, you want to remove as many of those incoming links as possible. Contact webmasters and blog owners, explain that the content is being removed, and point out that they will want to remove the broken link as soon as possible.
Step 4: Robots.txt
It may be tempting to block the page with your robots.txt file, to keep Google from crawling it. In fact, this is the opposite of what you want to do. If the page is blocked, remove that block. When Google crawls your page and sees the 404 where content used to be, they’ll flag it to watch. If it remains gone, they will eventually remove it from the search results. If Google can’t crawl the page, it will never know the page is gone, and thus it will never be removed from the search results.
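One way to check whether your robots.txt is blocking the removed page is Python's built-in `urllib.robotparser`. This is a rough sketch: the helper name `is_blocked` is made up, and the standard-library parser follows the basic robots.txt rules, so it may not match Google's own matching in every edge case.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text forbids agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)
```

If this returns True for the deleted page, remove the matching Disallow rule so the crawler can reach the 404.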
Step 5: Submit a Removal Request
Google Webmaster Tools includes a URL removal tool. You can use this to submit a page for removal from live search and from cached results. Again, this may or may not be an effective way to remove the page, depending on the reason for removal.
Once you have taken these steps, all you can do is wait. Google will eventually learn that the page no longer exists and will stop offering it in the live search results. If you’re searching for it specifically, you may still find it, but it won’t have the SEO power it once did.
A Faster Removal for Site Search
All of the above is for the main live search through Google itself. What if you want to remove the page from your custom site search, powered by Google? Thankfully, you can customize this much more easily. Google understands that you don’t want to serve pages on your site that are no longer part of your site, and has made it easy to remove a page from your custom site search.
The tool you’re looking for is Google’s On-Demand Indexing. It sounds counter-intuitive to use an indexing tool to remove content, but it’s really the same system operating in reverse. You have two options.
Option 1: URL Submission
This option is easy. Simply submit the URL with a “-” in front of it. For example, -http://www.website.com/deletedcontent.html. Once processed, this will remove the content from your custom site search.
Option 2: Sitemap Removal
The other option is to submit an edited sitemap. Beneath the <loc> tag for your page, add an <expires> tag. You can set a date for the content to expire. Setting a date that has already passed will flag the content as expired and remove it from the site search. You can also use the <expires> tag for timed content you want removed after a certain amount of time, such as contests.
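A sitemap entry like the one described above could be generated as follows. Treat this as a sketch under assumptions: the ISO-style date format (YYYY-MM-DD) and the standard sitemaps.org namespace are my guesses at what the tool accepts, so check them against Google's documentation before submitting. The function name `expired_sitemap_entry` is illustrative.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Standard sitemap namespace (assumed to apply here as well).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def expired_sitemap_entry(page_url: str, expires_on: date) -> str:
    """Build a one-entry sitemap with an <expires> tag beneath <loc>."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page_url
    # Assumption: the date is written as YYYY-MM-DD.
    ET.SubElement(url, "expires").text = expires_on.isoformat()
    return ET.tostring(urlset, encoding="unicode")
```

Passing a date in the past, such as `date(2000, 1, 1)`, produces an entry that is already expired.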
Sitemap submission has some restrictions, as seen here: https://developers.google.com/custom-search/docs/indexing
Both options are restricted to your custom site search and do not affect the core live search on Google’s website.
410 GONE and NOINDEX
Another option is to configure your server to serve a 410 GONE page rather than the 404 NOT FOUND. A simple 404 means the page is missing, but there’s always the chance that it was an error that caused the page to break. With a 410 GONE, Google knows that the page is unlikely to return and will take the appropriate steps. It won’t help your page get crawled any faster than normal, but when it is crawled, it will tell Google what you want it to know.
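How you configure this depends on your server, so the following is only a toy illustration of the idea using Python's built-in `http.server`. It hosts no real pages: paths listed in the hypothetical `GONE_PATHS` set answer 410, and everything else answers 404. On a real site you would set up the equivalent rule in your web server or CMS.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical list of deliberately removed pages.
GONE_PATHS = {"/deletedcontent.html"}


def status_for(path: str) -> int:
    """Return 410 for deliberately removed pages, 404 for anything else."""
    return 410 if path in GONE_PATHS else 404


class GoneHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = status_for(self.path)
        body = b"Gone" if status == 410 else b"Not Found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the toy server locally until interrupted."""
    HTTPServer(("127.0.0.1", port), GoneHandler).serve_forever()
```

Running `serve()` and requesting /deletedcontent.html on port 8000 should show the 410 status in your browser's developer tools.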
The NOINDEX meta tag is another option. It’s a bit of a brute-force method, and it involves the content staying live, so it doesn’t entirely work for deleted pages. The NOINDEX tag tells Google that you don’t want the page in live search at all, even if other sites link to it. As with the 404, Google has to be able to crawl the page to see the tag, so don’t block the page in robots.txt. There can be a few valid uses for it, and removing content you don’t want seen is one of them.
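The tag itself is a single line in the page's head, typically written as `<meta name="robots" content="noindex">`. To verify that a page actually carries it, a small check like this one can help. It is a sketch that only looks at meta tags, with illustrative names `RobotsMetaParser` and `has_noindex`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a robots or googlebot meta tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            # The content attribute is a comma-separated list of directives.
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a NOINDEX robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```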