Question
PYTHON CHECK SIMPLE CODE
Write a recursive function findTitles() that takes a string representing the name of a directory as a parameter and returns a list of all the titles found inside the HTML files in that directory, in subdirectories of that directory, and so on. Files that appear in more than one directory will only appear once in the title list returned. The titles in the list do not have to be ordered in any particular way. Please note that your function must be recursive and that it must work as described on any directory and file structure, not just the ones provided as examples. You may not make any assumption about the nesting of subdirectories, either with respect to how many subdirectories exist at each level or how deep the nesting of subdirectories is. You may not change the number of parameters of the function, define any helper functions, or use any global variables, and the function must return, not print, the list of unique titles. It is expected that a correct recursive solution to the problem will contain at least one loop. Your code must work on any operating system.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class Collector(HTMLParser):
    'collects the hyperlink URLs found in a web page'
    def __init__(self, url):
        HTMLParser.__init__(self)
        self.url = url
        self.links = []

    def handle_starttag(self, tag, attrs):
        'collect absolute URLs from anchor tags'
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    # resolve relative links against the page URL
                    absolute = urljoin(self.url, attr[1])
                    if absolute[:4] == 'http':
                        self.links.append(absolute)

    def getLinks(self):
        'return the list of collected hyperlink URLs'
        return self.links

class Crawler(object):
    def __init__(self):
        self.visited = set()

    def getFiles(self):
        'display the URLs visited so far'
        print(self.visited)

    def reset(self):
        'forget which URLs have been visited'
        self.visited = set()

    def crawl(self, url):
        'recursive web crawler that calls analyze() on each web page'
        self.visited.add(url)
        links = self.analyze(url)
        for link in links:
            if link not in self.visited:
                try:
                    self.crawl(link)
                except Exception:
                    # skip pages that cannot be fetched or parsed
                    pass

    def analyze(self, url):
        'return the hyperlink URLs found in the page at url'
        content = urlopen(url).read().decode()
        collector = Collector(url)
        collector.feed(content)
        return collector.getLinks()
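For reference, the crawler above is started on a single page and follows every link it can reach; the starting URL below is a hypothetical placeholder:

crawler = Crawler()
crawler.crawl('http://example.com/index.html')   # hypothetical starting URL
print(crawler.visited)                           # every URL reached from the start page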
Explanation / Answer
The glob module lists the files matching a pattern in a single directory, and os.path.join keeps the path portable across operating systems:

import os
import glob

path = 'pathToYourDirectory/'   # replace with the directory you want to scan
for infile in glob.glob(os.path.join(path, '*.html')):
    print("current file is: " + infile)