PYTHON 1. We will perform the same task using the HTMLParser tool. When properly
ID: 3827540 • Letter: P
Question
PYTHON
1. We will perform the same task using the HTMLParser tool. When properly working, the program should be given a URL, and print (to the console) all of the headlines on the page.
In order to do this, we will need to write all 3 handler methods for HTMLParser. Here is a brief description of what each should do:
handle_startag: Checks the tag to see if it is a headline tag (<h1>, <h2>, or <h3>). We will not concern ourselves with a threshold for this assignment. If the tag is a headline tag, then a flag is set to True in order to indicate that a headline element has been found. This is necessary for the handle_data method to extract the headline.
handle_data: Check the flag set by handle_starttag to see if we are in a headline element. If so, the the headline should be printed.
handle_endtag: Sets the flag back to False, indicating that we are no longer in a headline element.
from urllib.request import *
from urllib.error import *
from html.parser import *
from urllib.parse import urljoin
class Headlines(HTMLParser):
def __init__(self, url):
HTMLParser.__init__(self)
self.url = url
self.tag = None
self.f = open('headlines.html','w')
def handle_starttag(self, tag, attrs):
if tag in ['h1', 'h2', 'h3']:
pass # REPLACE THIS
def handle_data(self, data):
if self.tag != None:
pass # REPLACE THIS
def handle_endtag(self, tag):
if tag in ['h1', 'h2', 'h3']:
pass # REPLACE THIS
def headlines(self):
contents = urlopen(self.url).read().decode()
self.feed(contents)
self.f.close()
Explanation / Answer
from urllib.request import *
from urllib.error import *
from html.parser import *
from urllib.parse import urljoin
class Headlines(HTMLParser):
def __init__(self, url):
HTMLParser.__init__(self)
self.url = url
self.tag = None
self.f = open('headlines.html','w')
def handle_starttag(self, tag, attrs):
if tag in ['h1', 'h2', 'h3']:
flag=True
return flag
def handle_data(self, data):
if self.tag != None:
if flag==True:
print(data)
def handle_endtag(self, tag):
if tag in ['h1', 'h2', 'h3']:
flag=False
return flag
def headlines(self):
contents = urlopen(self.url).read().decode()
self.feed(contents)
self.f.close()
Related Questions
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.