


Question

PYTHON

1. We will perform the same task using the HTMLParser tool. When properly working, the program should be given a URL, and print (to the console) all of the headlines on the page.

In order to do this, we will need to write all 3 handler methods for HTMLParser. Here is a brief description of what each should do:

handle_starttag: Checks the tag to see if it is a headline tag (<h1>, <h2>, or <h3>). We will not concern ourselves with a threshold for this assignment. If the tag is a headline tag, then a flag is set to True in order to indicate that a headline element has been found. This is necessary for the handle_data method to extract the headline.

handle_data: Checks the flag set by handle_starttag to see if we are in a headline element. If so, the headline should be printed.

handle_endtag: Sets the flag back to False, indicating that we are no longer in a headline element.
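To see why a flag is needed, it can help to watch the order in which HTMLParser calls the three handlers. The sketch below is only an illustration: EventTracer and its events list are hypothetical names, not part of the assignment, and it parses a literal string instead of a downloaded page.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Hypothetical helper: records every callback HTMLParser makes."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, value) tuples in the order they arrive

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


tracer = EventTracer()
tracer.feed("<h1>Top story</h1><p>Body text</p>")
tracer.close()
for event in tracer.events:
    print(event)
```

handle_data receives only the text, with no information about the enclosing tag. That is why handle_starttag has to record that a headline tag was opened and handle_endtag has to clear that record.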

from urllib.request import *
from urllib.error import *
from html.parser import *
from urllib.parse import urljoin

class Headlines(HTMLParser):

    def __init__(self, url):
        HTMLParser.__init__(self)
        self.url = url
        self.tag = None
        self.f = open('headlines.html','w')

    def handle_starttag(self, tag, attrs):
        if tag in ['h1', 'h2', 'h3']:
            pass   # REPLACE THIS

    def handle_data(self, data):
        if self.tag != None:
            pass   # REPLACE THIS

    def handle_endtag(self, tag):
        if tag in ['h1', 'h2', 'h3']:
            pass # REPLACE THIS

    def headlines(self):
        contents = urlopen(self.url).read().decode()
        self.feed(contents)
        self.f.close()

Explanation / Answer

from urllib.request import *
from urllib.error import *
from html.parser import *
from urllib.parse import urljoin


class Headlines(HTMLParser):

    def __init__(self, url):
        HTMLParser.__init__(self)
        self.url = url
        # self.tag is the flag: it holds the name of the open headline tag,
        # or None when we are not inside a headline element.
        self.tag = None
        self.f = open('headlines.html','w')

    def handle_starttag(self, tag, attrs):
        if tag in ['h1', 'h2', 'h3']:
            # Store the flag on the instance so handle_data can see it.
            self.tag = tag

    def handle_data(self, data):
        if self.tag != None:
            # We are inside a headline element, so print its text.
            print(data)

    def handle_endtag(self, tag):
        if tag in ['h1', 'h2', 'h3']:
            # The headline element has closed, so reset the flag.
            self.tag = None

    def headlines(self):
        contents = urlopen(self.url).read().decode()
        self.feed(contents)
        self.f.close()
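The flag logic can be checked without a network connection by feeding the parser a literal HTML string. The following is a minimal sketch under some assumptions: HeadlineCollector, HEADLINE_TAGS, and found are hypothetical names, and the class stores the headline text in a list instead of printing it so the result is easy to inspect.

```python
from html.parser import HTMLParser


class HeadlineCollector(HTMLParser):
    """Hypothetical variant of Headlines that stores text instead of printing."""

    HEADLINE_TAGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.tag = None   # name of the open headline tag, or None
        self.found = []   # headline text collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADLINE_TAGS:
            self.tag = tag

    def handle_data(self, data):
        # Keep the text only while a headline tag is open; skip blank runs.
        if self.tag is not None and data.strip():
            self.found.append(data.strip())

    def handle_endtag(self, tag):
        if tag in self.HEADLINE_TAGS:
            self.tag = None


sample = "<h1>Main</h1><p>skip me</p><h2> Second </h2><h4>too small</h4>"
collector = HeadlineCollector()
collector.feed(sample)
collector.close()
print(collector.found)
```

The paragraph and the h4 text are skipped because the flag is None when their data arrives. To run the real class against a page, create Headlines with a URL of your choice and call its headlines() method.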