A website is, at bottom, the usual set of files and folders sitting on a server. Among these files there is almost always one called robots.txt, placed at the site root. It exists to instruct “spiders”: search robots read it to learn which parts of the site may be crawled and which may not. Webmasters often use these instructions to block duplicate content (tag pages, categories, and so on) from indexing to improve SEO metrics, and also to keep robots away from data that, for whatever reason, should not be exposed on the network.
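To make the mechanism concrete, here is a minimal sketch using Python's standard-library urllib.robotparser to evaluate a hypothetical robots.txt. The rules, paths, and the example.com domain are illustrative, not taken from any real site:

```python
# Sketch: parsing hypothetical robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /tags/
Disallow: /categories/

User-agent: ia_archiver
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# An ordinary search bot may fetch an article, but not a duplicate tag page.
print(parser.can_fetch("*", "https://example.com/articles/1"))  # True
print(parser.can_fetch("*", "https://example.com/tags/seo"))    # False

# The Internet Archive's crawler is turned away from the /private/ directory.
print(parser.can_fetch("ia_archiver", "https://example.com/private/x"))  # False
```

The first group applies to all crawlers; the second addresses the Internet Archive's bot by its user-agent name, which is how a webmaster can target individual robots.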
The idea behind robots.txt appeared more than 20 years ago, and although the specific settings for different search bots have changed since then, the mechanism works just as it did many years ago. The instructions saved in this file are honored by almost all search engines, as well as by the Internet Archive bot, which roams the web gathering information for archiving. Now the service's developers believe it is time to stop paying attention to what is in robots.txt.
The problem is that in many cases the domains of abandoned sites “drop”, that is, are not renewed, or the site's content is simply destroyed. Such domains are then “parked” (for a variety of purposes, including collecting money from advertisements placed on the parked domain). The robots.txt file served by the new owner usually closes off the entire contents of the parked domain. Worst of all, when the Internet Archive robot sees an instruction in that file to exclude a directory from indexing, it deletes the already saved content of the site that used to live on the domain.
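The blanket rule a parking service typically serves looks like the two-line file in this sketch, which closes everything to every crawler, the Archive's bot included (the parked.example.com domain is hypothetical):

```python
# Sketch: the blanket disallow rule common on parked domains.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# Every URL on the parked domain is now off limits to every bot.
print(parser.can_fetch("ia_archiver", "https://parked.example.com/anything"))  # False
```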
In other words, a site was in the Internet Archive's database, and then it is gone, even though the domain's owner is now someone else and the site's content, once saved by the service, has long since sunk into oblivion. As a result, unique data that could well be of great value to a certain category of people is deleted.
The Internet Archive creates “snapshots” of sites. If a site exists for long enough, there can be many such snapshots, so the history of a site's development can be traced from its very beginning to the newest version; habrahabr.ru is one example. But if access to a site is blocked via robots.txt, neither its history nor any of its content can be retrieved.
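Those snapshots can be looked up through the Wayback Machine's public availability API, which returns the saved capture closest to a requested date. A minimal sketch, mirroring the habrahabr.ru example from the text (the timestamp is illustrative):

```python
# Sketch: asking the Wayback Machine for the snapshot nearest to a date.
import json
from urllib.request import urlopen

url = "https://archive.org/wayback/available?url=habrahabr.ru&timestamp=20080101"
with urlopen(url) as response:
    data = json.load(response)

closest = data.get("archived_snapshots", {}).get("closest")
if closest and closest.get("available"):
    print(closest["timestamp"], closest["url"])
else:
    print("No snapshot found")
```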
A few months ago, the Internet Archive staff stopped honoring the instructions in this file on US government websites. That experiment was judged a success, and now the Internet Archive bot will stop paying attention to robots.txt on any site. A webmaster who wants his resource's content removed from the archive can contact the Internet Archive administration by email.
For now, the developers will monitor the robot's behavior and the operation of the service itself as the changes take effect. If everything goes well, the changes will stay.