
Problem Description: Implement a web crawler that visits http://www.cnn.com/serv


Question

Problem Description: Implement a web crawler that visits http://www.cnn.com/services/rss/, and visits all the sub-rss sites. From each sub-rss site retrieve the following for every news article:

title,

publication date,

and url link within the guid tag

Save the data in a text file. The contents of the text file should be tab separated for all the information related to a single news article, and new line separated for the different articles.
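A minimal sketch of the output format described above, assuming each article has already been reduced to three strings. The class name ArticleWriter, the helper names toLine and writeAll, and the example.com URL are hypothetical and not part of the assignment or of Listing 12.18. Whitespace runs inside a field are collapsed to one space so that a stray tab or line break cannot split a record.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class ArticleWriter {
    // Joins the three fields of one article with tab characters.
    static String toLine(String title, String pubDate, String guid) {
        return String.join("\t", clean(title), clean(pubDate), clean(guid));
    }

    // Collapses any run of whitespace (including tabs and line breaks) to a
    // single space so a field cannot break the tab/newline layout.
    private static String clean(String field) {
        return field.replaceAll("\\s+", " ").trim();
    }

    // Writes one article per line, replacing any existing file content.
    static void writeAll(Path file, List<String> lines) throws IOException {
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values for illustration only.
        String line = toLine("Netanyahu meets\n Putin",
                "Mon, 21 Sep 2015 12:41:35 GMT",
                "http://www.example.com/article/index.html");
        Path out = Files.createTempFile("articles", ".txt");
        writeAll(out, List.of(line));
        System.out.println("Wrote " + out);
    }
}
```

Sanitizing the fields before joining is a design choice rather than a stated requirement; titles in the feed can span several lines, which would otherwise produce extra rows in the file.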

You can find web crawler code in your books in Listing 12.18 to give you a head start. However, to accomplish the task at hand, you will need to add to the code.

If you go to an RSS sub-site and view the page source, you will find that the information for each article is embedded in the following format:

<item>
<title>Netanyahu meets Putin</title>
<link>http://rss.cnn.com/c/35492/f/676961/s/4a07d739/sc/24/l/0L0Scnn0N0C20A150C0A90C210Cworld0Crussia0Eisrael0Enetanyahu0Eputin0Emeeting0Cindex0Bhtml0Deref0Frss0Itopstories/story01.htm</link>
<description>With Russia apparently beefing up its military presence in Syria, some countries are getting nervous about what could happen -- including, perhaps, Israel.&lt;br clear='all'/&gt;&lt;br/&gt;&lt;br/&gt;
&lt;a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/1/rc.htm" rel="nofollow"&gt;&lt;img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/1/rc.img" border="0"/&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;
&lt;a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/2/rc.htm" rel="nofollow"&gt;&lt;img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/2/rc.img" border="0"/&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;
&lt;a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/2/rc/3/rc.htm" rel="nofollow"&gt;&lt;img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/3/rc.img" border="0"/&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;
&lt;a href="http://da.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2.htm"&gt;&lt;img src="http://da.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2.img" border="0"/&gt;&lt;/a&gt;&lt;br/&gt;
&lt;a href="http://adchoice.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/ach.htm"&gt;&lt;img src="http://adchoice.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/ach.img" border="0"/&gt;&lt;/a&gt;
&lt;img width="1" height="1" src="http://pi.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2t.img" border="0"/&gt;
&lt;img width="1" height="1" src="http://pi2.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2t2.img" border="0"/&gt;
&lt;img width='1' height='1' src="http://rss.cnn.com/c/35492/f/676961/s/4a07d739/sc/24/mf.gif" border='0'/&gt;</description>
<pubDate>Mon, 21 Sep 2015 12:41:35 GMT</pubDate>
<guid isPermaLink="false">http://www.cnn.com/2015/0/21/world/russia-israel-netanyahu-putin-meeting/index.html</guid>
<media:thumbnail url="http://i2.cdn.turner.com/cnn/dam/assets/150915114929-russia-satellite-syria-base-top-tease.jpg" width="90" height="51" />
<media:content height="51" lang="" type="image/jpeg" width="90" url="http://i2.cdn.turner.com/cnn/dam/assets/150915114929-russia-satellite-syria-base-top-tease.jpg" />
</item>
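One possible way to pull the three required fields out of items in this format is the JDK's built-in DOM parser, sketched below. This is only an illustration under assumptions: the class name RssItemParser and its helpers are hypothetical, the feed text is assumed to be already downloaded into a String, and the sample feed in main is a shortened stand-in with a placeholder URL, not real CNN data. The textbook listing itself uses plain string searching instead, which would also work.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssItemParser {
    // Returns one String[]{title, pubDate, guid} for every <item> in the feed.
    static List<String[]> parseItems(String feedXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(feedXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(new String[] {
                    text(item, "title"), text(item, "pubDate"), text(item, "guid")
                });
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse feed", ex);
        }
    }

    // Text of the first descendant element with the given tag name, with
    // whitespace runs collapsed; returns "" when the element is missing.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return "";
        }
        return nodes.item(0).getTextContent().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Shortened stand-in for a feed; the URL is a placeholder.
        String feed = "<rss><channel><item>"
                + "<title>Netanyahu meets\n Putin</title>"
                + "<pubDate>Mon, 21 Sep 2015 12:41:35 GMT</pubDate>"
                + "<guid isPermaLink=\"false\">http://www.example.com/index.html</guid>"
                + "</item></channel></rss>";
        for (String[] fields : parseItems(feed)) {
            System.out.println(String.join("\t", fields));
        }
    }
}
```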

This is the code from Listing 12.18:

import java.util.Scanner;
import java.util.ArrayList;

public class WebCrawler {
   public static void main(String[] args) {
       java.util.Scanner input = new java.util.Scanner(System.in);
       System.out.print("Enter a URL: ");
       String url = input.nextLine();
       crawler(url); // Traverse the Web from the starting URL
   }

   public static void crawler(String startingURL) {
       ArrayList listOfPendingURLs = new ArrayList<>();
       ArrayList listOfTraversedURLs = new ArrayList<>();

       listOfPendingURLs.add(startingURL);
       while (!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() <= 100) {
           String urlString = listOfPendingURLs.remove(0);
           if (!listOfTraversedURLs.contains(urlString)) {
               listOfTraversedURLs.add(urlString);
               System.out.println("Craw " + urlString);

               for (String s : getSubURLs(urlString)) {
                   if (!listOfTraversedURLs.contains(s))
                       listOfPendingURLs.add(s);
               }
           }
       }
   }

   public static ArrayList getSubURLs(String urlString) {
       ArrayList list = new ArrayList<>();

       try {
           java.net.URL url = new java.net.URL(urlString);
           Scanner input = new Scanner(url.openStream());
           int current = 0;
           while (input.hasNext()) {
               String line = input.nextLine();
               current = line.indexOf("http:", current);
               while (current > 0) {
                   int endIndex = line.indexOf("\"", current);
                   if (endIndex > 0) { // Ensure that a correct URL is Found
                       list.add(line.substring(current, endIndex));
                       current = line.indexOf("http:", endIndex);
                   } else
                       current = -1;
               }
           }
       } catch (Exception ex) {
           System.out.println("Error: " + ex.getMessage());
       }
       return list;
   }

}

                                                                                     

Explanation / Answer

The crawler logic is fine, but the lists are declared as raw types. Specify the element type as ArrayList<String> in the following declarations:
ArrayList listOfPendingURLs = new ArrayList<>();
ArrayList listOfTraversedURLs = new ArrayList<>();
and
ArrayList list = new ArrayList<>();

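If you don't specify the ArrayList element type, you'll get the following errors. With a raw ArrayList, remove(0) returns Object, which cannot be assigned to a String variable; the add calls only produce unchecked warnings.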

WebCrawler.java:16: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfPendingURLs.add(startingURL);
^
where E is a type-variable:
E extends Object declared in class ArrayList
WebCrawler.java:19: error: incompatible types: Object cannot be converted to String
String urlString = listOfPendingURLs.remove(0);
^
WebCrawler.java:21: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfTraversedURLs.add(urlString);
^
where E is a type-variable:
E extends Object declared in class ArrayList
WebCrawler.java:26: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfPendingURLs.add(s);
^
where E is a type-variable:
E extends Object declared in class ArrayList
1 error
3 warnings

-------------------------------

ArrayList<String> listOfPendingURLs = new ArrayList<>();
ArrayList<String> listOfTraversedURLs = new ArrayList<>();
ArrayList<String> list = new ArrayList<>();
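To see the effect of the typed declarations in isolation, here is a small sketch that runs without network access. The class TypedListDemo and the method extractUrls are hypothetical names; extractUrls mirrors the inner loop of getSubURLs but works on a single String, and it tests current >= 0 rather than > 0 so that a URL at the very start of a line is not skipped. The same element type is needed on the return type of getSubURLs, because the for-each loop in crawler iterates over its result with a String variable.

```java
import java.util.ArrayList;

class TypedListDemo {
    // Collects every substring that starts at "http:" and ends just before
    // the next double quote, as the inner loop of getSubURLs does.
    static ArrayList<String> extractUrls(String line) {
        ArrayList<String> list = new ArrayList<>();
        int current = line.indexOf("http:");
        while (current >= 0) {
            int endIndex = line.indexOf("\"", current);
            if (endIndex > 0) { // A closing quote was found, so the URL is complete
                list.add(line.substring(current, endIndex));
                current = line.indexOf("http:", endIndex);
            } else {
                current = -1;
            }
        }
        return list;
    }

    public static void main(String[] args) {
        // Placeholder markup for illustration only.
        ArrayList<String> pending = extractUrls(
                "<a href=\"http://a.example/x\">x</a> <a href=\"http://b.example/y\">y</a>");
        // Compiles because remove(0) on ArrayList<String> returns String.
        String first = pending.remove(0);
        System.out.println(first);   // prints http://a.example/x
        System.out.println(pending); // prints [http://b.example/y]
    }
}
```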
