Problem Description: Implement a web crawler that visits http://www.cnn.com/serv


Question

Problem Description: Implement a web crawler that visits http://www.cnn.com/services/rss/, and visits all the sub-rss sites. From each sub-rss site retrieve the following for every news article:

title,

publication date,

and url link within the guid tag

Save the data in a text file. The contents of the text file should be tab separated for all the information related to a single news article, and new line separated for the different articles.
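As a rough illustration of the output format described above, the following sketch writes one article per line with tab-separated fields. The class name ArticleFileWriter, its methods, the file name articles.txt, and the example.com link are all made up for this example; none of them come from the assignment or the book.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for the tab-separated output file; not part of Listing 12.18. */
final class ArticleFileWriter {
    /** Joins the three fields with tabs so that one article occupies exactly one line. */
    static String toLine(String title, String pubDate, String guid) {
        return clean(title) + "\t" + clean(pubDate) + "\t" + clean(guid);
    }

    /** Replaces tabs and line breaks inside a field so they cannot break the layout. */
    private static String clean(String field) {
        return field.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** Writes one entry per line; Files.write adds a line separator after each entry. */
    static void save(Path file, List<String> lines) throws IOException {
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        // The link below is a placeholder, not a real article URL.
        lines.add(toLine("Netanyahu meets Putin",
                "Mon, 21 Sep 2015 12:41:35 GMT",
                "http://www.example.com/story/index.html"));
        save(Paths.get("articles.txt"), lines);
    }
}
```

Cleaning each field before joining is a design choice made here, not a stated requirement: a stray tab or line break inside a title would otherwise shift the columns for that article.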

You can find web crawler code in your book in Listing 12.18 to give you a head start. However, to accomplish the task at hand, you will need to add to the code.

If you go to an RSS sub-site and view the page source, you will find that the information for each article is embedded in the following format.

<item>
<title>Netanyahu meets Putin</title>
<link>http://rss.cnn.com/c/35492/f/676961/s/4a07d739/sc/24/l/0L0Scnn0N0C20A150C0A90C210Cworld0Crussia0Eisrael0Enetanyahu0Eputin0Emeeting0Cindex0Bhtml0Deref0Frss0Itopstories/story01.htm</link>
<description>With Russia apparently beefing up its military presence in Syria, some countries are getting nervous about what could happen -- including, perhaps, Israel. <a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/1/rc.htm" rel="nofollow"><img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/1/rc.img" border="0"/></a> <a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/2/rc.htm" rel="nofollow"><img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/2/rc.img" border="0"/></a> <a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/2/rc/3/rc.htm" rel="nofollow"><img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/3/rc.img" border="0"/></a> <a href="http://da.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2.htm"><img src="http://da.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2.img" border="0"/></a> <a href="http://adchoice.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/ach.htm"><img src="http://adchoice.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/ach.img" border="0"/></a><img width="1" height="1" src="http://pi.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2t.img" border="0"/><img width="1" height="1" src="http://pi2.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2t2.img" border="0"/><img width='1' height='1' src="http://rss.cnn.com/c/35492/f/676961/s/4a07d739/sc/24/mf.gif" border='0'/></description>
<pubDate>Mon, 21 Sep 2015 12:41:35 GMT</pubDate>
<guid isPermaLink="false">http://www.cnn.com/2015/09/21/world/russia-israel-netanyahu-putin-meeting/index.html</guid>
<media:thumbnail url="http://i2.cdn.turner.com/cnn/dam/assets/150915114929-russia-satellite-syria-base-top-tease.jpg" width="90" height="51" />
<media:content height="51" lang="" type="image/jpeg" width="90" url="http://i2.cdn.turner.com/cnn/dam/assets/150915114929-russia-satellite-syria-base-top-tease.jpg" />
</item>

This is the code from Listing 12.18:

import java.util.Scanner;
import java.util.ArrayList;

public class WebCrawler {
   public static void main(String[] args) {
       java.util.Scanner input = new java.util.Scanner(System.in);
       System.out.print("Enter a URL: ");
       String url = input.nextLine();
       crawler(url); // Traverse the Web from the starting URL
   }

   public static void crawler(String startingURL) {
       ArrayList listOfPendingURLs = new ArrayList<>();
       ArrayList listOfTraversedURLs = new ArrayList<>();

       listOfPendingURLs.add(startingURL);
       while (!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() <= 100) {
           String urlString = listOfPendingURLs.remove(0);
           if (!listOfTraversedURLs.contains(urlString)) {
               listOfTraversedURLs.add(urlString);
               System.out.println("Craw " + urlString);

               for (String s : getSubURLs(urlString)) {
                   if (!listOfTraversedURLs.contains(s))
                       listOfPendingURLs.add(s);
               }
           }
       }
   }

   public static ArrayList getSubURLs(String urlString) {
       ArrayList list = new ArrayList<>();

       try {
           java.net.URL url = new java.net.URL(urlString);
           Scanner input = new Scanner(url.openStream());
           int current = 0;
           while (input.hasNext()) {
               String line = input.nextLine();
               current = line.indexOf("http:", current);
               while (current > 0) {
                   int endIndex = line.indexOf("\"", current);
                   if (endIndex > 0) { // Ensure that a correct URL is Found
                       list.add(line.substring(current, endIndex));
                       current = line.indexOf("http:", endIndex);
                   } else
                       current = -1;
               }
           }
       } catch (Exception ex) {
           System.out.println("Error: " + ex.getMessage());
       }
       return list;
   }

}

Explanation / Answer

The logic of the code is fine, but the lists below are declared with the raw type ArrayList. Declare them as ArrayList<String> instead:
ArrayList listOfPendingURLs = new ArrayList<>();
ArrayList listOfTraversedURLs = new ArrayList<>();
and
ArrayList list = new ArrayList<>();

If you don't specify the element type, the compiler reports the following error and warnings:

WebCrawler.java:16: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfPendingURLs.add(startingURL);
^
where E is a type-variable:
E extends Object declared in class ArrayList
WebCrawler.java:19: error: incompatible types: Object cannot be converted to String
String urlString = listOfPendingURLs.remove(0);
^
WebCrawler.java:21: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfTraversedURLs.add(urlString);
^
where E is a type-variable:
E extends Object declared in class ArrayList
WebCrawler.java:26: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfPendingURLs.add(s);
^
where E is a type-variable:
E extends Object declared in class ArrayList
1 error
3 warnings

-------------------------------

With the element type specified, the declarations become:

ArrayList<String> listOfPendingURLs = new ArrayList<>();
ArrayList<String> listOfTraversedURLs = new ArrayList<>();
ArrayList<String> list = new ArrayList<>();
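To see the typed declarations at work without touching the network, here is a minimal sketch of the same crawl loop. The class TypedCrawlDemo and the links map are made up for this example; the map simply stands in for getSubURLs so the code can run offline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical offline variant of the crawler loop, for illustration only. */
final class TypedCrawlDemo {
    /** Visits pages in first-in, first-out order; links maps a page to the URLs found on it. */
    static List<String> crawl(String startingURL, Map<String, List<String>> links) {
        ArrayList<String> listOfPendingURLs = new ArrayList<>();
        ArrayList<String> listOfTraversedURLs = new ArrayList<>();

        listOfPendingURLs.add(startingURL);
        while (!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() <= 100) {
            // remove(0) returns String here; on a raw ArrayList it would return Object.
            String urlString = listOfPendingURLs.remove(0);
            if (!listOfTraversedURLs.contains(urlString)) {
                listOfTraversedURLs.add(urlString);
                List<String> found =
                        links.getOrDefault(urlString, Collections.<String>emptyList());
                for (String s : found) {
                    if (!listOfTraversedURLs.contains(s))
                        listOfPendingURLs.add(s);
                }
            }
        }
        return listOfTraversedURLs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c", "a"));
        System.out.println(crawl("a", links)); // prints [a, b, c]
    }
}
```

Because both lists are ArrayList<String>, the assignment from remove(0) compiles without a cast and the unchecked warnings on add go away.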
