A web crawler is a relatively simple automated program, or script, that methodically scans or “crawls” through Internet pages to create an index of the data it’s looking for. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer.
A web crawler can be used for many purposes. Probably the most common use associated with the term is related to search engines. Search engines use web crawlers to collect information about what is out there on public web pages. The crawlers’ primary purpose is to collect data so that when Internet surfers enter a search term, the engine can quickly provide them with relevant websites.
When a search engine’s web crawler visits a web page, it “reads” the visible text, the hyperlinks, and the content of the various tags used in the site, such as keyword-rich meta tags.
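To make that concrete, here is a minimal sketch in Python of what the “read” step might look like, assuming the third-party requests and beautifulsoup4 packages are installed. The URL and the choice of tags to inspect are illustrative assumptions, not the workings of any particular search engine’s crawler.

```python
# A minimal sketch of the "read" step of a crawler.
# Assumes: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def read_page(url: str):
    """Fetch a page and extract the pieces a crawler typically indexes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The visible text of the page, with markup stripped.
    visible_text = soup.get_text(separator=" ", strip=True)

    # The hyperlinks, which a crawler would queue up for later visits.
    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Keyword-rich meta tags, if the page provides them.
    keywords_tag = soup.find("meta", attrs={"name": "keywords"})
    keywords = keywords_tag["content"] if keywords_tag else ""

    return visible_text, links, keywords

if __name__ == "__main__":
    # Placeholder URL for illustration only.
    text, links, keywords = read_page("https://example.com")
    print(f"Found {len(links)} links; keywords: {keywords!r}")
```

A real crawler would add the extracted links to a queue of pages to visit next, which is how it spreads out across the web from a handful of starting points.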
Web crawlers may operate only once, say for a particular one-time project, or, if their purpose is long term, as is the case with search engines, they may be programmed to comb through the Internet periodically to determine whether there have been any significant changes. If a site is experiencing heavy traffic or technical difficulties, the spider may be programmed to note that and revisit the site later, hopefully after the technical issues have subsided.
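That revisit behavior can be sketched as a simple scheduling loop: healthy pages are recrawled on a regular interval, while troubled pages are pushed back and retried later. The delays and queue layout below are illustrative assumptions, not a production crawler design.

```python
# A minimal revisit-scheduling sketch; delays are assumed values.
import time
import heapq

REVISIT_DELAY = 24 * 3600   # recrawl healthy pages daily (assumed)
RETRY_DELAY = 3600          # retry troubled pages after an hour (assumed)

def crawl_forever(seed_urls, fetch):
    """Repeatedly visit pages, backing off when a site has trouble."""
    # Priority queue of (next_visit_time, url), soonest first.
    queue = [(0.0, url) for url in seed_urls]
    heapq.heapify(queue)
    while queue:
        due, url = heapq.heappop(queue)
        time.sleep(max(0.0, due - time.time()))
        try:
            fetch(url)  # e.g. the read_page() sketch above
            delay = REVISIT_DELAY
        except Exception:
            # Heavy traffic or technical difficulties: note it and
            # come back later, hopefully after the issue has subsided.
            delay = RETRY_DELAY
        heapq.heappush(queue, (time.time() + delay, url))
```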
Web crawling is an important method for collecting data on, and keeping up with, the rapidly expanding Internet. A vast number of web pages are added every day, and information is constantly changing. A web crawler is a way for search engines and other users to regularly ensure that their databases are up to date.
With this,
Your friend,
Sudarshan Singh