

Question

1. What is involved in creating a web crawler? What are the differences between static and dynamic web content?

One of the topics covered in Analysis of Algorithms is algorithms for traversing graphs. The structure of the World Wide Web is an example of a directed graph, with each web page forming a vertex and each URL or web link forming an edge. In Analysis of Algorithms we learned about different algorithms used to traverse a graph. One traversal approach is referred to as depth-first search (DFS). The basic idea of DFS is that each path, which is composed of vertices and edges, is traversed using a stack data structure to the required depth.
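As a quick illustration (the graph below is made up for the example and is not part of the assignment text), here is a minimal sketch of a stack-based depth-first traversal over a small directed graph stored as an adjacency list:

# Minimal sketch: stack-based depth-first traversal of a directed graph.
# The adjacency list below is illustrative only; the vertices stand in for
# web pages and the edges for the links between them.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def dfs(start):
    visited = set()
    stack = [start]            # an explicit stack drives the traversal
    order = []
    while stack:
        node = stack.pop()     # take the most recently discovered vertex
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        stack.extend(graph[node])   # neighbours are explored before older entries
    return order

print(dfs("A"))                # e.g. ['A', 'C', 'D', 'B']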

Your indexer must have the following characteristics:

The web crawler must prompt the user to enter a starting website or domain. This must be entered in the form http://mywebpage.com, and the URL must be placed in a queue of URLs to crawl.

The web crawler must extract from each visited web page all of the ‘out links’, or links to other web pages and websites. Each link must be placed into the queue of URLs to crawl. This queue is called the URL frontier.

The code that places URLs on the URL frontier should keep track of the number of URLs in the frontier and stop adding URLs when there are 500 in the queue.

The crawler must extract the text from each web page by using some mechanism to remove all of the HTML tags and formatting. In the example above, the BeautifulSoup module was used to accomplish this. You can use any technique you want to remove the HTML tags and formatting; however, if you would like to use the BeautifulSoup module, download links and installation instructions are available in the unit resources.

Your web crawler must produce statistics similar to those listed below to indicate how long it took to index your selected website, along with key metrics such as the number of documents (in this case, the number of web pages), the number of tokens extracted and processed, and the number of unique terms added to the term dictionary. A rough sketch of how these pieces might fit together appears after this list of requirements.

Your web crawler must use the exact same format for the inverted index as was used in the indexer part 2 assignment so that the search engine developed in unit 5 can be used to search your web index.
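To make the requirements above concrete, here is a minimal, hedged sketch (not the official solution, and it does not reproduce the Indexer Part 2 inverted-index format, which is not shown here) of how the prompt, the URL frontier, the BeautifulSoup text extraction, and the summary statistics might fit together. It assumes Python 3 with BeautifulSoup installed, fetches pages with the standard urllib module, and treats the 500-URL limit as a cap on the number of distinct URLs ever enqueued; every name in it is illustrative.

import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup     # any other tag-stripping technique would also work

MAX_URLS = 500                    # one reading of the limit: stop once 500 distinct URLs are known

start_url = input("Enter URL to crawl (must be in the form http://www.domain.com): ")
print("Start Time:", time.strftime("%H:%M"))

frontier = deque([start_url])     # the URL frontier
seen = {start_url}
documents = 0                     # number of web pages successfully indexed
tokens = 0                        # total tokens extracted and processed
dictionary = set()                # unique terms in the term dictionary

while frontier:
    url = frontier.popleft()
    try:
        html = urlopen(url, timeout=10).read()
    except Exception:
        continue                  # skip pages that cannot be fetched
    soup = BeautifulSoup(html, "html.parser")

    # extract the 'out links' and add them to the frontier until the limit is reached
    for anchor in soup.find_all("a", href=True):
        link = urljoin(url, anchor["href"])
        if link.startswith("http") and link not in seen and len(seen) < MAX_URLS:
            seen.add(link)
            frontier.append(link)

    # strip the HTML tags and formatting, then update the simple statistics
    words = soup.get_text().split()
    documents += 1
    tokens += len(words)
    dictionary.update(word.lower() for word in words)

print("Documents", documents, "Terms", len(dictionary), "Tokens", tokens)
print("End Time:", time.strftime("%H:%M"))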

Output Produced Against Example Website

>>>
Enter URL to crawl (must be in the form http://www.domain.com): http://www.hometownlife.com
Start Time: 15:44
Indexing Complete, write to disk: 16:53
Documents 473 Terms 16056 Tokens 2751668
End Time: 21:2

Explanation / Answer

1) Answer:

Web crawler:

A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner.

This process is called Web crawling or spidering.

Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data.

Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches.

Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.

Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A web crawler is a simple program that scans, or “crawls”, through web pages to create an index of the data it is looking for. There are several uses for such a program, perhaps the most popular being search engines using it to provide web surfers with relevant websites. Google has perfected the art of crawling over the years! A web crawler can be used by just about anyone who is trying to search for information on the Internet in an organized manner. It is referred to by different names such as web spider, bot, indexer, and so on. All of this got me thinking about building a web crawler myself; I just wanted to fiddle with the idea and see how much time it would take to get something working on my machine. It turned out to be quite easy!

Where is it used?

Web crawlers can be used in many different ways, but they are usually used by someone seeking to collect information on the Internet. Search engines frequently use web crawlers to collect information about what is available on public web pages. The speed and accuracy of a search engine are heavily dependent on the design of its crawler. Their primary purpose is to collect data so that when Internet surfers enter a search term on their site, they can quickly provide the surfer with relevant websites. Linguists may use a web crawler to perform text and language analysis. Market researchers may use a web crawler to determine and assess trends in a given market. The possibilities are endless!

How does it actually work?

When a web crawler visits a web page, it reads all the visible content: the text, the hyperlinks, the content of the various tags used on the site, and so on. It then follows the links present on that page by treating crawling as a graph-search problem. Each page is treated as a node and each link as an edge, and the traversal is performed using depth-first search or breadth-first search, depending on the application. Using the information gathered by the crawler, a search engine will then determine what the site is about and index the information. The website is then included in the search engine’s database and its page-ranking process.

Most of the time, web crawlers are designed to do one specific thing. If we want a crawler to be general purpose (like a search engine’s), it should be programmed to crawl through the Internet periodically to check for any significant changes. This is very useful for keeping the results up to date.

How do we build a web crawler?

Now that we know how it works, we are ready to build a web crawler. I will show you how to get a basic Python web crawler working on your machine. Given a link, you will be able to crawl through the page and get all the links. You can then crawl through those pages and get more links. This process will continue until the desired depth is achieved. You will get a list of all the links in the end. To do anything further (like natural language processing, content parsing, etc), you will have to write your own code.
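As a rough illustration of that process (this is a sketch written for this answer, not the code from the repository mentioned below), a depth-limited, breadth-first link collector might look something like this, assuming Python 3 with BeautifulSoup installed; the starting URL and depth are just example values:

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def collect_links(start_url, max_depth=2):
    """Breadth-first crawl: gather out-links level by level, down to max_depth."""
    seen = {start_url}
    current_level = [start_url]
    for _ in range(max_depth):
        next_level = []
        for url in current_level:
            try:
                html = urlopen(url, timeout=10).read()
            except Exception:
                continue                      # ignore pages that fail to load
            soup = BeautifulSoup(html, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    next_level.append(link)
        current_level = next_level            # descend one level deeper
    return sorted(seen)

links = collect_links("http://www.domain.com", max_depth=2)
print(len(links), "links found")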

Prerequisites: The first thing you need to do is make sure you have Python installed. If you don’t have Python, install it before proceeding further. To run this code, you will also need BeautifulSoup, a Python library for parsing HTML documents. The steps to install BeautifulSoup are given below:
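The original installation steps are not reproduced here; on a typical setup where pip is available, installing BeautifulSoup usually comes down to a single command along these lines:

# assumes pip is on your PATH; the package name for BeautifulSoup 4 is 'beautifulsoup4'
pip install beautifulsoup4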

Running the crawler: Go to a convenient location in your terminal and download the web crawler code from my GitHub account using the following command:
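The exact clone command is not reproduced here; it would take the general form below, where the GitHub username is a placeholder because the original repository URL is not given in this answer:

# placeholder username; substitute the actual GitHub account hosting Python-WebCrawler
git clone https://github.com/<github-username>/Python-WebCrawler.git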

The above command will create a folder called “Python-WebCrawler” on your machine and fetch all the required files in one go. It is a modified version of James Mills’ original recipe. If you don’t have git, then just download “crawler.py” from this link. The good thing about Python is that almost any function we could ever need is either built in or has already been written by someone. You just have to arrange the pieces properly so that they can do the dance for you!

Now we are ready to do some crawling! You can use “crawler.py” to crawl various websites. I have listed a few use cases below:

2) Differences between static and dynamic web content:

Static web content:

Static content is any content that can be delivered to an end user without having to be generated, modified, or processed. The server delivers the same file to each user, making static content one of the simplest and most efficient content types to transmit over the Internet.

The content of a website that remains the same across pages is referred to as static content. This could be served from a database as well, but it would be the same across all pages. For instance, the navigation menu, logo of the website, or any other information on the header or footer would not depend on inputs from the visitor.

Static content consists of files that don’t change based on user input; these include things like JavaScript, Cascading Style Sheets, images, and HTML files. They make up the structural components of SAP BI web applications such as BI Launch Pad and the Central Management Console (CMC).

Static content never requires dynamic data, such as bttoken, and can be served equally well by an application server such as Tomcat or by a web server such as Apache. Because there is a significant amount of static content in BI Launch Pad, it can clutter the JMeter interface.

Static web pages display the exact same information whenever anyone visits them. Static web pages do not have to be simple plain text; they can feature detailed multimedia design and even videos. However, every visitor to such a page will be greeted by the exact same text, multimedia design, or video every time they visit, until you alter the page’s source code.

Dynamic content:

The content of a website that does not remain constant and changes according to user input is referred to as dynamic content. For instance, in the case of a product page, all product details such as product name, price, quantity, and description are stored in a database and are fetched when a user views the web page for a particular product. This content is therefore generated dynamically by the CMS, and it changes from product to product.

Dynamic content requires processing by an application server such as Tomcat, and typically invokes backend services via the SDK. For example, logging on calls the logon.do method, which invokes the Central Management Server (CMS) for the user authorization process. Similarly, viewWebiReport.do engages a Web Intelligence Report Server in order to render a report.

Dynamic content almost always requires dynamic data such as bttoken and is what performs the actual business usage within BI4.

Dynamic web pages are capable of producing different content for different visitors from the same source code file. The website can display different content based on what operating system or browser the visitor is using, whether they are on a PC or a mobile device, or even the source that referred the visitor. A dynamic web page is not necessarily better than a static web page; the two simply serve different purposes.
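To make the contrast concrete, here is a small sketch using Python’s standard http.server module (purely illustrative; real sites would use a proper web or application server, and all names here are made up for the example): the static handler returns the same bytes to every visitor, while the dynamic handler generates a fresh response for each request.

from http.server import BaseHTTPRequestHandler, HTTPServer
from datetime import datetime

STATIC_PAGE = b"<html><body><h1>Welcome!</h1></body></html>"   # same bytes for every visitor

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/static":
            body = STATIC_PAGE                  # static: identical content on every visit
        else:
            # dynamic: generated per request, so each visitor may see different content
            body = ("<html><body><p>Hello %s, it is now %s</p></body></html>"
                    % (self.client_address[0], datetime.now())).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), DemoHandler).serve_forever()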