Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.


Question

Java algorithm: Use BFS to implement a crawler to crawl and discover the wikipedia graph. Called WikiCrawler.java

1 BFS and Web Graph

We can model the web as a directed graph: every web page is a vertex of the graph, and we put a directed edge from a page p to a page q if page p contains a link to page q. Recall the BFS algorithm from the lecture:

1. Input: a directed graph G = (V, E) and a root r in V.
2. Initialize a queue Q and a list visited.
3. Place root in Q and in visited.
4. While Q is not empty, do:
   (a) Let v be the first element of Q; remove it from Q.
   (b) For every edge (v, u) in E: if u is not in visited, add u to the end of Q and add u to visited.

If you output the vertices in visited, that is a BFS traversal of the input graph. We can use the BFS algorithm on the web graph as well. Here, the input to the algorithm is a seed URL (instead of the entire web graph); view that page as the root of the BFS traversal. Here is the BFS algorithm on the web graph.
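Before moving to the web-graph version, the plain BFS steps above can be sketched directly in Java. This is a minimal illustration on an in-memory adjacency-list graph; the class name BfsSketch, the String vertex labels, and the Map-based graph representation are my own choices for the example, not part of the assignment.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BfsSketch {

    // Returns the vertices of graph in BFS order, starting from root.
    // graph maps each vertex to the list of vertices it points to.
    public static List<String> bfs(Map<String, List<String>> graph, String root) {
        List<String> visited = new ArrayList<>();  // the "visited" list from the steps above
        Set<String> seen = new HashSet<>();        // fast membership test for visited
        Deque<String> queue = new ArrayDeque<>();  // the queue Q

        // Place root in Q and in visited.
        queue.add(root);
        seen.add(root);
        visited.add(root);

        while (!queue.isEmpty()) {
            String v = queue.removeFirst();        // first element of Q
            for (String u : graph.getOrDefault(v, List.of())) {
                if (!seen.contains(u)) {           // if u is not in visited
                    queue.addLast(u);              // add u to the end of Q
                    seen.add(u);
                    visited.add(u);
                }
            }
        }
        return visited;                            // visited is the BFS traversal
    }
}
```

Outputting the accumulated visited list reproduces the traversal order the pseudocode describes; the crawler below follows the same scheme, with pages as vertices and extracted links as edges.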

Explanation / Answer

package proj2;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;

/**
 * Crawls and discovers the Wikipedia web graph with BFS, starting
 * from a seed URL.
 *
 * @author michael sila
 */
public class WikiCrawler {

    static final String BASE_URL = "https://en.wikipedia.org";

    private String seedUrl, fileName;
    private int maxPages;

    /**
     * See project description.
     *
     * @param seedUrl  relative address of the seed page, e.g. "/wiki/Physics"
     * @param max      maximum number of pages to include in the graph
     * @param fileName name of the output file
     */
    public WikiCrawler(String seedUrl, int max, String fileName) {
        this.seedUrl = seedUrl;
        this.maxPages = max;
        this.fileName = fileName;
    }

    /**
     * The scraping logic is in the LinkScrapper class (a separate class
     * that is not included in this answer).
     *
     * @param doc the contents of an HTML Wikipedia page
     * @return a list of HTML links (relative) found in doc
     */
    public ArrayList<String> extractLinks(String doc) {
        LinkScrapper extractor = new LinkScrapper(doc);
        return extractor.scrapeLink();
    }

    public void crawl() {
        HashSet<String> visited = new HashSet<>();
        LinkedList<String> queue = new LinkedList<>();
        HashSet<String> inGraph = new HashSet<>();
        queue.add(seedUrl);

        // Lines to write to the output file
        ArrayList<String> outputLines = new ArrayList<>();
        int currentEdges = 0;
        int timesPolled = 0;
        inGraph.add(queue.peekFirst());

        // Start BFS
        while (!queue.isEmpty()) {
            String currLink = queue.removeFirst();
            timesPolled++;
            if (timesPolled % 100 == 0) {
                // Sleep as required (politeness delay between request batches)
                try {
                    Thread.sleep(3000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            visited.add(currLink);
            ArrayList<String> neighbors = this.extractLinks(this.readHTML(currLink));
            for (String n : neighbors) {
                if (!visited.contains(n) && !inGraph.contains(n) && inGraph.size() < maxPages) {
                    // NOTE: the original answer is cut off at this condition;
                    // everything from here on is a reconstruction of the
                    // standard BFS bookkeeping, not the author's exact code.
                    queue.add(n);
                    inGraph.add(n);
                }
                if (inGraph.contains(n) && !n.equals(currLink)) {
                    outputLines.add(currLink + " " + n);
                    currentEdges++;
                }
            }
        }

        // Write the discovered graph to the output file: first the vertex
        // count, then one "source target" edge per line (reconstructed detail).
        try (PrintWriter writer = new PrintWriter(fileName)) {
            writer.println(inGraph.size());
            for (String line : outputLines) {
                writer.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Fetches the raw HTML of a page given its relative link.
     * (Reconstructed helper: the original answer is truncated before it,
     * but crawl() calls it.)
     */
    private String readHTML(String relativeLink) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(BASE_URL + relativeLink).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html.toString();
    }
}