Problem Description: Implement a web crawler that visits http://www.cnn.com/serv
Question
Problem Description: Implement a web crawler that visits http://www.cnn.com/services/rss/, and visits all the sub-rss sites. From each sub-rss site retrieve the following for every news article:
title,
publication date,
and url link within the guid tag
Save the data in a text file. The contents of the text file should be tab separated for all the information related to a single news article, and new line separated for the different articles.
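The output format described above could be produced along these lines. This is a minimal sketch, assuming each article has already been reduced to three strings; the class name ArticleWriter, its method names, the file name articles.txt and the example.com link are all hypothetical and not part of the assignment.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

class ArticleWriter {
  // Joins one article's fields with tabs. Tabs and line breaks inside a
  // field are replaced by a space so they cannot break the layout.
  static String toLine(String title, String pubDate, String guid) {
    return clean(title) + "\t" + clean(pubDate) + "\t" + clean(guid);
  }

  static String clean(String field) {
    return field.replaceAll("[\\t\\r\\n]+", " ").trim();
  }

  // Writes one line per article; each array holds {title, pubDate, guid}.
  static void write(String fileName, List<String[]> articles) throws IOException {
    try (PrintWriter out = new PrintWriter(fileName, "UTF-8")) {
      for (String[] a : articles) {
        out.println(toLine(a[0], a[1], a[2]));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    List<String[]> articles = new ArrayList<>();
    // Placeholder link, used here only for illustration.
    articles.add(new String[] {"Netanyahu meets Putin",
        "Mon, 21 Sep 2015 12:41:35 GMT", "http://example.com/story.html"});
    write("articles.txt", articles);
  }
}
```

Cleaning the fields before joining them is a design choice: a title that happens to contain a tab or a line break would otherwise split one article across extra columns or lines.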
You can find web crawler code in your books in Listing 12.18 to give you a head start. However, to accomplish the task at hand, you will need to add to the code.
If you go to an RSS sub-site and view the page source, you will find that, for each article, the information is embedded in the following format:
<item>
<title>Netanyahu meets Putin</title>
<link>http://rss.cnn.com/c/35492/f/676961/s/4a07d739/sc/24/l/0L0Scnn0N0C20A150C0A90C210Cworld0Crussia0Eisrael0Enetanyahu0Eputin0Emeeting0Cindex0Bhtml0Deref0Frss0Itopstories/story01.htm</link>
<description>With Russia apparently beefing up its military presence in Syria, some countries are getting nervous about what could happen -- including, perhaps, Israel.<br clear='all'/><br/><br/><a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/1/rc.htm" rel="nofollow"><img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/1/rc.img" border="0"/></a><br/><br/><a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/2/rc.htm" rel="nofollow"><img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/2/rc.img" border="0"/></a><br/><br/><a href="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/2/rc/3/rc.htm" rel="nofollow"><img src="http://rc.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/rc/3/rc.img" border="0"/></a><br/><br/><a href="http://da.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2.htm"><img src="http://da.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2.img" border="0"/></a><br/><a href="http://adchoice.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/ach.htm"><img src="http://adchoice.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/ach.img" border="0"/></a><img width="1" height="1" src="http://pi.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2t.img" border="0"/><img width="1" height="1" src="http://pi2.feedsportal.com/r/238386053268/u/192/f/676961/c/35492/s/4a07d739/sc/24/a2t2.img" border="0"/><img width='1' height='1' src="http://rss.cnn.com/c/35492/f/676961/s/4a07d739/sc/24/mf.gif" border='0'/></description>
<pubDate>Mon, 21 Sep 2015 12:41:35 GMT</pubDate>
<guid isPermaLink="false">http://www.cnn.com/2015/0/21/world/russia-israel-netanyahu-putin-meeting/index.html</guid>
<media:thumbnail url="http://i2.cdn.turner.com/cnn/dam/assets/150915114929-russia-satellite-syria-base-top-tease.jpg" width="90" height="51" />
<media:content height="51" lang="" type="image/jpeg" width="90" url="http://i2.cdn.turner.com/cnn/dam/assets/150915114929-russia-satellite-syria-base-top-tease.jpg" />
</item>
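Given the item format shown above, the three required fields could be pulled out with regular expressions. This is only a sketch under stated assumptions: the class RssItemExtractor and its methods are hypothetical names, the feed text is assumed to be already downloaded into a String, and the example.com link in main is a placeholder. It does not handle CDATA sections or nested tags of the same name, so a real feed may be better served by an XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RssItemExtractor {
  // DOTALL lets '.' match line breaks, since an item may span several lines.
  private static final Pattern ITEM =
      Pattern.compile("<item>(.*?)</item>", Pattern.DOTALL);

  // Returns the text between <tag ...> and </tag>, or "" if the tag is absent.
  // The optional group skips attributes such as isPermaLink="false".
  static String tagText(String xml, String tag) {
    Pattern p = Pattern.compile(
        "<" + tag + "(?:\\s[^>]*)?>(.*?)</" + tag + ">", Pattern.DOTALL);
    Matcher m = p.matcher(xml);
    return m.find() ? m.group(1).trim() : "";
  }

  // Returns {title, pubDate, guid} for every <item> found in the feed text.
  static List<String[]> extract(String feed) {
    List<String[]> articles = new ArrayList<>();
    Matcher m = ITEM.matcher(feed);
    while (m.find()) {
      String item = m.group(1);
      articles.add(new String[] {
          tagText(item, "title"), tagText(item, "pubDate"), tagText(item, "guid")});
    }
    return articles;
  }

  public static void main(String[] args) {
    // Shortened item with a placeholder guid, for illustration only.
    String feed = "<item><title>Netanyahu meets Putin</title>"
        + "<pubDate>Mon, 21 Sep 2015 12:41:35 GMT</pubDate>"
        + "<guid isPermaLink=\"false\">http://example.com/story.html</guid></item>";
    for (String[] a : extract(feed)) {
      System.out.println(a[0] + "\t" + a[1] + "\t" + a[2]);
    }
  }
}
```

Reading the whole feed into one string before matching, rather than scanning line by line as Listing 12.18 does, avoids missing items whose tags are split across lines.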
This is the code from Listing 12.18:
import java.util.Scanner;
import java.util.ArrayList;

public class WebCrawler {
  public static void main(String[] args) {
    java.util.Scanner input = new java.util.Scanner(System.in);
    System.out.print("Enter a URL: ");
    String url = input.nextLine();
    crawler(url); // Traverse the Web from the starting URL
  }

  public static void crawler(String startingURL) {
    ArrayList listOfPendingURLs = new ArrayList<>();
    ArrayList listOfTraversedURLs = new ArrayList<>();

    listOfPendingURLs.add(startingURL);
    while (!listOfPendingURLs.isEmpty() &&
        listOfTraversedURLs.size() <= 100) {
      String urlString = listOfPendingURLs.remove(0);
      if (!listOfTraversedURLs.contains(urlString)) {
        listOfTraversedURLs.add(urlString);
        System.out.println("Craw " + urlString);

        for (String s : getSubURLs(urlString)) {
          if (!listOfTraversedURLs.contains(s))
            listOfPendingURLs.add(s);
        }
      }
    }
  }

  public static ArrayList getSubURLs(String urlString) {
    ArrayList list = new ArrayList<>();

    try {
      java.net.URL url = new java.net.URL(urlString);
      Scanner input = new Scanner(url.openStream());
      int current = 0;
      while (input.hasNext()) {
        String line = input.nextLine();
        // Find the start of the next URL on this line
        current = line.indexOf("http:", current);
        while (current > 0) {
          // A URL inside an attribute ends at the closing double quote
          int endIndex = line.indexOf("\"", current);
          if (endIndex > 0) { // Ensure that a correct URL is found
            list.add(line.substring(current, endIndex));
            current = line.indexOf("http:", endIndex);
          } else
            current = -1;
        }
      }
    } catch (Exception ex) {
      System.out.println("Error: " + ex.getMessage());
    }

    return list;
  }
}
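To visit only the sub-RSS pages instead of every link on the starting page, the URLs returned by getSubURLs could be filtered before they are queued. The sketch below is hypothetical: the class FeedFilter and its methods are made-up names, and the rule that a feed link ends in ".rss" is an assumption that should be checked against the actual page source.

```java
import java.util.ArrayList;
import java.util.List;

class FeedFilter {
  // Assumption: feed links end in ".rss". Adjust to what the page source shows.
  static boolean looksLikeFeed(String url) {
    return url.toLowerCase().endsWith(".rss");
  }

  // Keeps only feed-like URLs, preserving order and dropping duplicates.
  static List<String> feedsOnly(List<String> urls) {
    List<String> feeds = new ArrayList<>();
    for (String u : urls) {
      if (looksLikeFeed(u) && !feeds.contains(u)) {
        feeds.add(u);
      }
    }
    return feeds;
  }

  public static void main(String[] args) {
    List<String> found = new ArrayList<>();
    // Placeholder URLs, for illustration only.
    found.add("http://example.com/feeds/top.rss");
    found.add("http://example.com/about.html");
    found.add("http://example.com/feeds/top.rss");
    System.out.println(feedsOnly(found));
  }
}
```

In the crawler loop, such a filter would be applied to the result of getSubURLs for the starting page only, since the feeds themselves are the pages to parse rather than to crawl further.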
Explanation / Answer
The crawling logic is right; the problem is in the declarations below, which use the raw type ArrayList. Each should be declared as ArrayList<String>:
ArrayList listOfPendingURLs = new ArrayList<>();
ArrayList listOfTraversedURLs = new ArrayList<>();
and
ArrayList list = new ArrayList<>();
If you don't specify the element type, remove(0) returns Object rather than String, and the compiler reports the following error and warnings:
WebCrawler.java:16: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfPendingURLs.add(startingURL);
^
where E is a type-variable:
E extends Object declared in class ArrayList
WebCrawler.java:19: error: incompatible types: Object cannot be converted to String
String urlString = listOfPendingURLs.remove(0);
^
WebCrawler.java:21: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfTraversedURLs.add(urlString);
^
where E is a type-variable:
E extends Object declared in class ArrayList
WebCrawler.java:26: warning: [unchecked] unchecked call to add(E) as a member of the raw type ArrayList
listOfPendingURLs.add(s);
^
where E is a type-variable:
E extends Object declared in class ArrayList
1 error
3 warnings
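-------------------------------
With the element type specified, the declarations become:
ArrayList<String> listOfPendingURLs = new ArrayList<>();
ArrayList<String> listOfTraversedURLs = new ArrayList<>();
ArrayList<String> list = new ArrayList<>();
The return type of getSubURLs should likewise be declared as ArrayList<String>, so that the for-each loop in crawler can iterate over its result as String values.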
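The effect of the type parameter can be seen in a small standalone example. This is an illustrative sketch only; the class TypedListDemo and the method firstPending are hypothetical names, not part of Listing 12.18.

```java
import java.util.ArrayList;

class TypedListDemo {
  // With ArrayList<String>, remove(0) returns a String, so no cast is needed.
  static String firstPending(ArrayList<String> pending) {
    return pending.remove(0);
  }

  public static void main(String[] args) {
    ArrayList<String> listOfPendingURLs = new ArrayList<>();
    listOfPendingURLs.add("http://www.cnn.com/services/rss/");
    String urlString = firstPending(listOfPendingURLs);
    System.out.println(urlString);

    // With a raw list, the same assignment compiles only with an explicit cast:
    // ArrayList raw = new ArrayList();
    // String s = (String) raw.remove(0);
  }
}
```

Declaring the element type is generally preferable to casting, because the compiler can then reject a non-String element at the point where it is added.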