A web crawler is a software agent that browses the World Wide Web methodically, following a defined set of rules and criteria. Crawlers are a fundamental component of search engines, whose crawling architectures and algorithms are, as a rule, kept secret. A crawler usually identifies itself to the web servers it visits, since administrators are entitled to know when, and by which search engine, their pages are being indexed.
In essence, search engines use crawlers to keep their data about websites current. A crawler does this by downloading copies of the pages it visits; the search engine then indexes those copies so that searches can be answered quickly. Because the World Wide Web is not only immensely large but also dynamic and continuously changing, a crawler can download only a small fraction of its pages in a given period, so it needs strategies and criteria to make that fraction count.
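As a minimal sketch of that fetch-and-copy step, the snippet below downloads one page, identifies itself to the server, and stores a local snapshot. It assumes only Python's standard library; the crawler name, the info URL in it, and the snapshot naming scheme are hypothetical, and a real crawler would add error handling and politeness delays on top.

    import hashlib
    import urllib.request

    # Hypothetical name; a crawler reveals its identity in the User-Agent
    # header so that administrators know who is fetching their pages.
    USER_AGENT = "ExampleCrawler/1.0 (+https://example.org/crawler-info)"

    def fetch_copy(url):
        """Download one page and store a local copy; return the file name."""
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read()
        # Name the snapshot after a hash of the URL for quick lookup later.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        with open(name, "wb") as snapshot:
            snapshot.write(body)
        return name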
A crawler starts from a list of seed URLs, detects the hyperlinks in each page it visits, and adds them to that list. Later, it revisits those pages to look for changes made in the meantime. To increase both its download throughput and its usefulness to the search engine, however, the crawler must prioritize its targets so that the most relevant pages are visited first, using criteria such as a page's intrinsic quality, its popularity, or even its URL; a sketch of such a prioritized frontier follows.
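The sketch below shows one way this could work, again in standard-library Python: hyperlinks are extracted from fetched pages and pushed onto a priority queue. The scoring heuristic shown (preferring shorter URLs) is only a stand-in for the quality and popularity signals mentioned above, not how any particular search engine ranks its frontier.

    import heapq
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect the hyperlink targets found in a fetched page."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links against the page's own URL.
                        self.links.append(urljoin(self.base_url, value))

    def priority(url):
        # Stand-in heuristic: shorter URLs tend to sit closer to a site's
        # root and are often more important; a real crawler would combine
        # quality and popularity signals (e.g. incoming-link counts) here.
        return 1.0 / (1 + len(url))

    frontier = []   # prioritized queue of URLs still to be visited
    seen = set()    # everything ever queued, so no URL is added twice

    def enqueue(url):
        if url not in seen:
            seen.add(url)
            # heapq is a min-heap, so negate the score to pop best-first.
            heapq.heappush(frontier, (-priority(url), url))

    # Feed a fetched page through the extractor and grow the frontier.
    page = LinkExtractor("http://example.com/")
    page.feed('<a href="/about">About</a> <a href="news/today">News</a>')
    for link in page.links:
        enqueue(link)
    score, next_url = heapq.heappop(frontier)   # best candidate first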
Once the pages to be crawled are identified, the crawler must set a revisiting schedule that satisfies two requirements: detecting changes, and not crawling the same page twice, in other words not retrieving duplicate content. To make this feasible at the scale involved, crawlers normalize, that is, standardize, the URLs of visited pages so that duplicates can be recognized promptly. They must also update quickly, given the countless pages awaiting them and the constant stream of newly created or modified content; a crawler is of little use to a search engine unless it keeps the average age of its downloaded pages as low as possible.
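A minimal sketch of that normalization step follows, assuming a few common rules: lower-casing the scheme and host, dropping default ports and fragments, and trimming trailing slashes. Real crawlers apply longer, carefully tuned rule sets.

    from urllib.parse import urlsplit, urlunsplit

    DEFAULT_PORTS = {"http": 80, "https": 443}

    def normalize(url):
        """Standardize a URL so that duplicates are recognized promptly."""
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        host = (parts.hostname or "").lower()
        # Keep the port only when it differs from the scheme's default.
        if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
            host = f"{host}:{parts.port}"
        # Treat an empty path as "/" and strip trailing slashes elsewhere.
        path = parts.path or "/"
        if len(path) > 1:
            path = path.rstrip("/")
        # The fragment never reaches the server, so drop it outright.
        return urlunsplit((scheme, host, path, parts.query, ""))

    # Both spellings now map to the same canonical form:
    assert normalize("HTTP://Example.COM:80/a/") == normalize("http://example.com/a#top")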
Crawlers can nevertheless cause serious problems for servers, overloading them with the volume of their requests or the size of the documents they download, especially since their access intervals are short, typically from 20 seconds to 3-4 minutes. It is true that administrators may forbid crawlers access to specific parts of a server, or set fixed intervals to be strictly observed, rules commonly published in a robots.txt file. It is not that crawlers are impolite; they were simply built to move fast.
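A polite crawler therefore checks those rules before every request. The sketch below does so with Python's standard robots.txt parser; the crawler name is hypothetical, and the 20-second fallback is simply a conservative choice matching the lower end of the intervals mentioned above.

    import time
    import urllib.robotparser

    USER_AGENT = "ExampleCrawler/1.0"   # hypothetical crawler name

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("http://example.com/robots.txt")
    robots.read()   # fetch and parse the server's access rules

    def polite_fetch_allowed(url, last_visit):
        """Check the rules an administrator set before touching a URL."""
        if not robots.can_fetch(USER_AGENT, url):
            return False    # this part of the server is off limits
        # Honour a fixed interval if one is declared; otherwise be
        # conservative and wait 20 seconds between requests.
        delay = robots.crawl_delay(USER_AGENT) or 20
        return time.time() - last_visit >= delay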