Write a python program that will scan a web page and harvest as many email addre
Question
Write a Python program that will scan a web page and harvest as many email addresses as possible. Many of these email addresses will be obfuscated in some way. It’s up to you to get the computer to figure out how to recognize the obfuscation and return a good result! Your grade will be based on how many (and what types) of email addresses you can find in the page. We will provide examples of most types of obfuscation, but not necessarily all. Try to find really tricky ones.
Here are some examples to get you started (in the form “obfuscated email” => “what your program should interpret the email as”):
mst3k@Virginia.EDU => mst3k@Virginia.EDU
thomas.jefferson@cs.virginia.edu => thomas.jefferson@cs.virginia.edu
mst3k at virginia.edu => mst3k@virginia.edu
mst3k at virginia dot edu => mst3k@virginia.edu
Tips
You can read the entire web page line by line to make it easier to search.
import urllib.request
stream = urllib.request.urlopen("https://cs1110.cs.virginia.edu/emails.html")
for line in stream:
    decoded = line.decode("UTF-8")
    print(decoded.strip())
Once you have a line from the web site, you have a couple different options:
You can manually look for particular symbols by using the in keyword. For example, you could try if "@" in line: to see if there is an @ sign in the line you are looking at. If so, you might want to take a closer look.
You can come up with regular expressions that will look for particular patterns in a line that could be an email address. You can test regular expressions against test data you provide here: http://www.regexr.com/
Do NOT use BeautifulSoup for this as most of the email addresses are not within HTML tags that you can identify. So, we’re going to save you some time here and just say don’t try it. Further, the server will just reject your assignment.
No one method or one regular expression will get every email address. As mentioned above, we’ve intentionally put some extremely difficult addresses in the page just to see what you can do!
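The tips above can be combined into a small multi-pass sketch: a cheap "@" check with the in keyword, a regular expression for plain addresses, and a rewrite pass for the " at " / " dot " spellings shown in the examples. The pattern, the rewrite rules, and the names EMAIL_RE, REWRITES, deobfuscate, and emails_in_line are all assumptions made for illustration; they cover only the simplest obfuscations and are not a complete solution.

```python
import re

# Hypothetical pattern: local part, "@", dot-separated labels, alphabetic TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")

# Each rule rewrites one spelled-out symbol. The spellings handled here
# ("at", "(at)", "[at]", and the same for "dot") are assumptions.
REWRITES = [
    (re.compile(r"\s*[\(\[]\s*at\s*[\)\]]\s*|\s+at\s+", re.IGNORECASE), "@"),
    (re.compile(r"\s*[\(\[]\s*dot\s*[\)\]]\s*|\s+dot\s+", re.IGNORECASE), "."),
]

def deobfuscate(line):
    """Replace spelled-out 'at' and 'dot' with the symbols they stand for."""
    for pattern, symbol in REWRITES:
        line = pattern.sub(symbol, line)
    return line

def emails_in_line(line):
    """Return addresses found in the raw line and in its rewritten form."""
    candidates = [deobfuscate(line)]
    if "@" in line:                      # cheap pre-check with the `in` keyword
        candidates.insert(0, line)       # search the untouched text first
    found = []
    for text in candidates:
        for email in EMAIL_RE.findall(text):
            if email not in found:       # keep each address once, in order
                found.append(email)
    return found

print(emails_in_line("mst3k at virginia dot edu"))
print(emails_in_line("Write to may.end@with-a-period.com."))
```

Because the pattern requires the address to end in letters, a sentence-ending period is left out of the match. The rewrite pass is deliberately naive: ordinary prose such as "look at virginia.edu" would also be turned into an address, so a real solution would likely need extra checks on what surrounds the word "at".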
Your program must implement the following function:
find_emails_in_website(url): This function takes as input a string representation of the URL of a website that you want to search. It should return a list of all of the valid email addresses that it finds.
We have a page https://cs1110.cs.virginia.edu/emails.html that has a set of example emails you should be able to find (and some that you can look for but that we are not requiring).
You can create as many other functions as you like, but this is the function that we will call with various different sites to see how well your program works.
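One possible way to lay out the required function is sketched below: find_emails_in_website(url) only fetches and decodes the page, and a separate helper does the searching so it can be tried on plain strings without a network connection. The helper name find_emails_in_text and the pattern EMAIL_RE are hypothetical, and the pattern handles plain addresses only.

```python
import re
import urllib.request

# Hypothetical pattern for plain addresses; obfuscated forms need extra passes.
EMAIL_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")

def find_emails_in_text(text):
    """Hypothetical helper: collect unique addresses from already-decoded text."""
    emails = []
    for line in text.splitlines():
        for email in EMAIL_RE.findall(line):
            if email not in emails:      # keep first occurrence, preserve order
                emails.append(email)
    return emails

def find_emails_in_website(url):
    """Required entry point: fetch the page at `url` and return a list of addresses."""
    with urllib.request.urlopen(url) as stream:
        text = stream.read().decode("UTF-8", errors="replace")
    return find_emails_in_text(text)
```

Keeping the network call in one short function means the search logic can be exercised with calls such as find_emails_in_text("a@b.ca") while you experiment.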
For the example page, you should hopefully find:
basic@virginia.edu
link-only@virginia.edu
multi-domain@cs.virginia.edu
Mr.N0body@cand3lwick-burnERS.rentals
a@b.ca
no-at-sign@virginia.edu
no-at-or-dot@virginia.edu
first.last.name@cs.virginia.edu
with-parenthesis@Virginia.EDU
added-words1@virginia.edu
added-words2@virginia.edu
may.end@with-a-period.com
Do not “hardcode” your solutions! In other words, do not write code that looks for these exact emails, because we will call your function with other sites as well.
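While developing, it can help to compare what your function returns against a list of addresses you expect, using the same comparison for the example page and for any test text you write yourself. The sketch below is one way to do that; the function name score is hypothetical, and it assumes an exact, case-sensitive comparison.

```python
def score(found, expected):
    """Split results into addresses that were matched, missed, or unexpected."""
    matched = [email for email in expected if email in found]
    missed = [email for email in expected if email not in found]
    extra = [email for email in found if email not in expected]
    return matched, missed, extra

# Example with made-up results; replace `found` with your function's output.
found = ["basic@virginia.edu", "a@b.ca", "oops@example.org"]
expected = ["basic@virginia.edu", "a@b.ca", "no-at-sign@virginia.edu"]
matched, missed, extra = score(found, expected)
print("matched:", matched)
print("missed:", missed)
print("unexpected:", extra)
```

The missed list points at obfuscation types still to handle, and the unexpected list points at false positives.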
Explanation / Answer
import urllib.request
import codecs

def find_emails_in_website(url):
    stream = urllib.request.urlopen(url)
    emails = []
    recognized = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890.-"
    for linenum, line in enumerate(stream):
        decoded = transform(line, linenum)
        print(decoded)
        # group[0] collects the part before "@", group[1] the part after it
        group = ["", ""]
        switch = False
        for index, char in enumerate(decoded):
            if char in recognized:
                if switch is False:
                    group[0] += char
                else:
                    group[1] += char
            elif char == "@":
                switch = True
            else:
                # Any other character ends the current candidate
                if dcheck(group[1]) is True:
                    concat = group[0] + "@" + group[1]
                    if concat not in emails:
                        emails.append(concat)
                group = ["", ""]
                switch = False
            # Flush a candidate that runs to the end of the line
            if index == len(decoded)-1 and group[1] != "":
                if dcheck(group[1]) is True:
                    concat = group[0] + "@" + group[1]
                    if concat not in emails:
                        emails.append(concat)
    return emails

# Parses string after @ sign. Must match TLD greater than 1 char,
# have only letters, and at least one "."
def dcheck(endgroup):
    if "." not in endgroup:
        return False
    tld = 0
    for char in endgroup[::-1]:
        if char.isalpha():
            tld += 1
        elif char == ".":
            break
        else:
            return False
    if tld < 2:
        return False
    else:
        return True

# Decodes line from stream. Replaces word with symbol in proper order,
# strips white space and right ".", special cases for _, reverse,
# and name (specific to positions on test pages).
def transform(line, index):
    # Third entry of the first list: period, space, newline
    replacements = [[" at ", "(at)", ". \n", " dot ", " (dot)", "(dot)", "NOSPAM", ""],
                    ["@", "@", " ", ".", ".", ".", "", ""]]
    decoded = line.decode("UTF-8")
    for i in range(len(replacements[0])):
        decoded = decoded.replace(replacements[0][i], replacements[1][i])
    decoded = decoded.strip().rstrip(".")
    if index == 31: # Underscore
        decoded = decoded.replace("_", decoded[62])
    if index == 33: # Reverse
        decoded = decoded[::-1]
    if index == 35: # First, Last initial
        first = ""
        last = ""
        pos = 11 # Beginning index of first name (kept separate from the line number in index)
        for char in decoded[11:]:
            if char == " ":
                break
            else:
                first += char
                pos += 1
        last = decoded[pos+1]
        decoded = decoded.replace("first name plus my last initial", first+last)
    if index == 37: # Markdown
        decoded = decoded.replace("","")
        startindex = len(decoded)-decoded[::-1].index(">")
        code = decoded[startindex:len(decoded)]
        decrypt = markdown(code)
        delindex = decoded.index("