If you use Google, DuckDuckGo, Bing or any other search engine, you have definitely leveraged the work done by a web crawler. A web crawler, sometimes called a spider 🕷 or spiderbot, is a computer program responsible for browsing the world wide web in a methodical and automated fashion.
Note: Data crawling and data scraping are two related but distinct activities. In this post we will be focusing on data crawling, not data scraping. Read more here.
The information collected by a crawler is generally turned into a giant index - a BIG list of words and the web pages where these words occur. Web crawling is a continuous process that maintains an up-to-date index of the world wide web. So, next time you search the web looking for how to solve that nasty bug in your code, remember that a web crawler did all the hard work on your behalf to make that information available to you.
In this post, we will dig into the world of web crawlers and understand the basic architecture and principles that they follow.
A URL frontier is like a to-do list for a web crawler. This is where the list of all the URLs that are going to be crawled is stored. Generally, it can be something as simple as a priority queue where URLs are prioritised by freshness - the latest URLs are given priority over older ones. The URL frontier is initially populated using the seed URLs, which are stored in the form of a set. Depending on how the crawler is designed, there can be more than one URL frontier - one for each domain in the seed set.
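A freshness-prioritised frontier like the one described above can be sketched as a min-heap keyed on a negated discovery timestamp. This is a hypothetical minimal implementation, not any particular crawler's code; the class and method names are invented for illustration:

```python
import heapq
import time


class URLFrontier:
    """A minimal URL frontier: a priority queue where fresher
    (more recently discovered) URLs are dequeued first."""

    def __init__(self, seed_urls):
        self._heap = []
        self._seen = set()  # guards against enqueueing a URL twice
        for url in seed_urls:
            self.add(url)

    def add(self, url, discovered_at=None):
        if url in self._seen:
            return
        self._seen.add(url)
        ts = discovered_at if discovered_at is not None else time.time()
        # Negate the timestamp so the newest URL sits at the top of the min-heap.
        heapq.heappush(self._heap, (-ts, url))

    def next_url(self):
        """Pop the freshest URL, or None when the frontier is empty."""
        return heapq.heappop(self._heap)[1] if self._heap else None


frontier = URLFrontier(["https://example.com"])
frontier.add("https://example.com/new", discovered_at=2.0)
frontier.add("https://example.com/old", discovered_at=1.0)
```

Here the seed URL is stamped with the current wall-clock time, so it is dequeued first, followed by the fresher of the two added URLs.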
This is the most important component of a web crawler, much like the engine in a car. The primary responsibility of this component is to access the web and download the resource located at a given address. Usually, the web resources to be downloaded are obtained from the URL frontier. The most common web resources downloaded by a fetcher are HTML documents.
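A fetcher can be as small as a wrapper around the standard library's `urllib`. The sketch below is an assumption about how one might be written, not a reference implementation; real crawlers also handle robots.txt, retries and politeness delays:

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Download the resource at `url` and return its body as bytes.
    Returns None on failure instead of raising, so one bad URL
    does not stop the crawl."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, ValueError):
        return None


# A data: URL stands in for a real HTTP address in this self-contained demo.
body = fetch("data:text/plain,hello")
```

Swallowing errors and returning `None` is a deliberate choice here: the fetcher runs continuously over millions of URLs, and a single dead link should be skipped, not crash the loop.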
To make sure that the crawler works continuously, it is important to keep the URL frontiers populated. A link extractor parses the HTML document, extracts all the links within that document, and passes them on to another process, which deduplicates them and filters out the links that do not belong to one of the domains in the seed URLs set.
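The parse-extract-filter step above can be sketched with the standard library's `html.parser`. The class and function names here are invented for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url, allowed_domains):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    # Deduplicate, and keep only links whose host is one of the seed domains.
    return sorted({u for u in parser.links
                   if urlparse(u).netloc in allowed_domains})


html = '<a href="/about">About</a> <a href="https://other.org/x">X</a>'
links = extract_links(html, "https://example.com", {"example.com"})
```

The off-domain link to `other.org` is dropped, and the relative `/about` link is resolved to an absolute URL ready to be pushed into the frontier.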
Another important part of a web crawler is the component responsible for extracting relevant information and persisting it in some kind of database. For example, a spider crawls a tech blog, extracts all the headlines and content bodies from it, and persists them in MongoDB. This information can then be used by data analysts to understand the trends in the tech industry for a given year.
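The persistence step can be sketched with any store; SQLite is used below instead of MongoDB only to keep the example self-contained, and the `articles` schema is a hypothetical one:

```python
import sqlite3

# Hypothetical schema: one row per crawled article, keyed by URL.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE articles (
    url TEXT PRIMARY KEY,
    headline TEXT,
    body TEXT)""")


def persist(url, headline, body):
    # INSERT OR REPLACE keeps the stored copy fresh when a page is re-crawled.
    conn.execute("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
                 (url, headline, body))
    conn.commit()


persist("https://example.com/post-1", "Hello Crawlers", "Body text ...")
rows = conn.execute("SELECT headline FROM articles").fetchall()
```

Keying on the URL means a re-crawl of the same page updates the existing row rather than duplicating it, which matches the continuous, index-refreshing nature of crawling described earlier.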
In mathematics, a directed graph or digraph is a set of vertices connected by edges, where the edges have a direction associated with them.
We can easily model the web as a directed graph, where each page is modelled as a graph node and the hyperlinks are taken as the edges. Modelling the web as a graph can yield useful insights into algorithms for crawling it.
Fig. In the above web graph there are 6 web pages labelled A, B, C, D, E and F. Page B has in-degree 3 and out-degree 1. This example graph is not strongly connected: there is no path from any of the pages B-F to page A.
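One common representation of such a web graph is an adjacency list. The edge set below is a hypothetical one, chosen only to be consistent with the figure's description (page B with in-degree 3 and out-degree 1, and no path from pages B-F back to A); the actual figure's edges may differ:

```python
from collections import defaultdict

# Hypothetical edges consistent with the figure's description.
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "C"),
         ("C", "D"), ("D", "E"), ("E", "F"), ("F", "E")]

out_links = defaultdict(list)   # adjacency list: page -> pages it links to
in_degree = defaultdict(int)    # page -> number of pages linking to it
for src, dst in edges:
    out_links[src].append(dst)
    in_degree[dst] += 1
```

The adjacency list (`out_links`) is what a crawler walks when following links, while in-degree counts of this kind feed link-based ranking signals.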
Note: Research suggests that links are not randomly distributed; the number of links that point into a web page does not follow the Poisson distribution. Rather, this distribution largely follows a power law.
Following are some graph data structures widely used in crawling the web:
The widely used types of web crawlers are as follows:
The keyword for this type of crawler is "relation". This type of web crawler downloads web resources that are related to each other in some way. The primary job of this kind of web crawler is to determine how relevant a particular web page is to a particular theme/topic. This is a widely used type of crawling, as it is economically more feasible and has a temporal advantage over the other types. An example would be limiting a crawler to crawl all the pages of a given news publication by filtering the extracted links on the basis of the domain. The set of seed URLs plays an important role in such a case.
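A focused-crawl policy of this kind can be sketched as a predicate combining a domain check with a crude topical check. The domain name and keyword set below are made-up examples, and keyword overlap is a deliberately naive stand-in for real relevance scoring:

```python
from urllib.parse import urlparse

SEED_DOMAINS = {"news.example.com"}             # hypothetical publication
TOPIC_KEYWORDS = {"python", "crawler", "search"}  # hypothetical topic


def is_relevant(url, page_text):
    """A naive focused-crawl policy: follow a link only if it stays
    on a seed domain and the page mentions the crawl's topic."""
    on_domain = urlparse(url).netloc in SEED_DOMAINS
    words = set(page_text.lower().split())
    return on_domain and bool(words & TOPIC_KEYWORDS)


ok = is_relevant("https://news.example.com/post", "a python web crawler")
bad = is_relevant("https://elsewhere.org/post", "a python web crawler")
```

Only links that pass this predicate would be pushed back into the frontier, which is how the crawl stays confined to the chosen publication and topic.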
This type of web crawler downloads pages periodically in order to refresh old documents. The frequency at which a page needs to be revisited is estimated as a function of how often the page changes. This addresses the problem of page freshness. Such crawling is best suited for online news publications, dynamic sources of data like stock prices, etc.
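The "function of how often a page changes" could take many forms; the toy heuristic below simply makes the revisit interval inversely proportional to the observed change rate, clamped between hourly and weekly bounds. The function name and constants are invented for illustration, not a production scheduling policy:

```python
def revisit_interval(change_history, min_hours=1.0, max_hours=168.0):
    """Estimate how many hours to wait before re-crawling a page.

    `change_history` is a list of 1/0 flags: 1 means the page had
    changed when it was last visited. Pages that change on most visits
    get short intervals; static pages drift toward the weekly maximum.
    """
    if not change_history:
        return max_hours            # never seen it change: check rarely
    change_rate = sum(change_history) / len(change_history)
    if change_rate == 0:
        return max_hours
    # More frequent changes -> proportionally shorter revisit interval.
    return max(min_hours, min(max_hours, 24.0 / change_rate))


fast = revisit_interval([1, 1, 1, 1])   # changed on every visit
slow = revisit_interval([0, 0, 0, 1])   # changed once in four visits
```

A stock-price page would accumulate a history of all 1s and be revisited near the hourly floor, while a static archive page would settle at the weekly ceiling.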