The goal of this assignment is to write a program in PYTHON that will scan a web
Question
The goal of this assignment is to write a program in Python that will scan a web page and harvest as many email addresses as possible. Many of these email addresses will be obfuscated in some way. It’s up to you to get the computer to figure out how to recognize the obfuscation and return a good result!
Your grade will be based on how many (and what types) of email addresses you can find in the page. We will provide examples of most types of obfuscation, but not necessarily all. Some bonus points may be earned for some really tricky ones.
Here are some examples to get you started (in the form “obfuscated email” => “what your program should interpret the email as”):
mst3k@Virginia.EDU => mst3k@Virginia.EDU
thomas.jefferson@cs.virginia.edu => thomas.jefferson@cs.virginia.edu
mst3k at virginia.edu => mst3k@virginia.edu
mst3k at virginia dot edu => mst3k@virginia.edu
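The four examples above can be handled by rewriting the spelled-out separators before matching. The following is a minimal sketch, not the course's reference solution: the helper name deobfuscate and the three patterns are assumptions made for illustration, and the word-based rewriting can produce false positives on ordinary sentences that happen to contain " at " followed by a dotted word.

```python
import re

# Hypothetical patterns: " at " and " dot " written as separate words.
AT_WORD = re.compile(r"\s+at\s+", re.IGNORECASE)
DOT_WORD = re.compile(r"\s+dot\s+", re.IGNORECASE)
# Assumed shape of a plain address: name, "@", then a domain with a dot.
EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def deobfuscate(text):
    """Return addresses found in text after rewriting ' at ' and ' dot '."""
    text = DOT_WORD.sub(".", text)   # "virginia dot edu" -> "virginia.edu"
    text = AT_WORD.sub("@", text)    # "mst3k at virginia.edu" -> "mst3k@virginia.edu"
    return EMAIL.findall(text)
```

Rewriting " dot " before " at " keeps the two steps independent; the original capitalization of the address is preserved, as in the first example.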
Tips
You can read the entire web page line by line to make it easier to search.
Once you have a line from the web site, you have a couple of different options:
You can manually look for particular symbols by using the in keyword. For example, you could try if "@" in line: to see if there is an @ sign in the line you are looking at. If so, you might want to take a closer look.
You can come up with regular expressions that will look for particular patterns in a line that could be an email address. You can test regular expressions against test data you provide here: http://www.regexr.com/
You cannot use BeautifulSoup for this as most of the email addresses are not within HTML tags that you can identify. So, we’re going to save you some time here and just say don’t try it. Further, the server will just reject your assignment.
No one method or one regular expression will get every email address. As mentioned above, we’ve intentionally put some extremely difficult addresses in the page just to see what you can do!
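The two options from the tips can be combined: a cheap in check first, then a regular expression on lines that pass. This is only a sketch under assumptions; the name plain_emails and the pattern itself are illustrative, and the pattern covers only plainly written addresses, not the obfuscated ones.

```python
import re

# Assumed pattern: letters, digits, ".", "_" and "-" before a single "@",
# then one or more dot-terminated labels and a final label of 2+ letters.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}"
)


def plain_emails(line):
    """Return the plainly written addresses in one line of text."""
    if "@" not in line:  # quick pre-check using the in keyword
        return []
    return EMAIL_PATTERN.findall(line)
```

Because the final label must consist of letters, a sentence-ending period after an address is left out of the match.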
Your program must implement the following function:
find_emails_in_website(url): This function takes as input a string representation of the URL of a website that you want to search.
We have a page https://cs1110.cs.virginia.edu/emails.html that has a set of example emails you should be able to find (and some that you can look for but we are not requiring). This function should return a list of all of the valid email addresses that you find.
You can create as many other functions as you like, but this is the function that we will call with various different sites to see how well your program works.
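One possible shape for the required function is sketched below. Only find_emails_in_website(url) comes from the assignment; the helper extract_from_lines and the pattern are assumptions added for illustration, and the pattern finds only unobfuscated addresses. Keeping the network call separate from the line processing makes the logic testable without fetching a page.

```python
import re
import urllib.request

# Assumed pattern for plainly written addresses (illustrative only).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def extract_from_lines(lines):
    """Collect unique addresses from an iterable of byte lines, in page order."""
    found = []
    for raw in lines:
        text = raw.decode("utf-8", errors="replace")
        for email in EMAIL_PATTERN.findall(text):
            if email not in found:  # skip duplicates
                found.append(email)
    return found


def find_emails_in_website(url):
    """Required entry point: read the page line by line and return a list."""
    with urllib.request.urlopen(url) as stream:
        return extract_from_lines(stream)
```

The helper can be exercised with in-memory data, for example extract_from_lines([b"basic@virginia.edu\n"]).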
For the example page, you should hopefully find:
basic@virginia.edu
link-only@virginia.edu
multi-domain@cs.virginia.edu
Mr.N0body@cand3lwick-burnERS.rentals
a@b.ca
no-at-sign@virginia.edu
no-at-or-dot@virginia.edu
first.last.name@cs.virginia.edu
with-parenthesis@Virginia.EDU
added-words1@virginia.edu
added-words2@virginia.edu
may.end@with-a-period.com
Do not “hardcode” your solutions! In other words, do not write your program as if it were looking for these exact emails, because it is not. These are examples. To aid in your testing, here’s another page you can look at: https://cs1110.cs.virginia.edu/emails2.html
Here are the emails it should find:
abasicemail@wfu.edu
a-link-only@unc.edu
so-many-domains@ece.berkeley.edu
SomE.CRAZY343@ea.info
w@x.yz
an-at-sign@ncsu.edu
some-other-email@gt.com
so.many.periods@why.do.this
parensarecool@duke.edu
morewords@place.net
extrawords@coolrunnings.ja
period.at@at.the.end
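When testing against either example page, it can help to compare the returned list with the expected list automatically. The helper below is a hypothetical testing aid, not part of the assignment; the name compare_results is an assumption, and it compares addresses exactly, including letter case. It returns its findings instead of printing them, so it cannot leave stray print() calls behind.

```python
def compare_results(found, expected):
    """Return (missing, unexpected) as sorted lists of addresses."""
    found_set = set(found)
    expected_set = set(expected)
    missing = sorted(expected_set - found_set)      # expected but not found
    unexpected = sorted(found_set - expected_set)   # found but not expected
    return missing, unexpected
```

An empty pair of lists means the program found exactly the expected addresses for that page.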
Submission: Submit your file email_finder.py on the project submission page.
NOTE: Make sure to remove all print() statements from your code before submitting. We will not run any tests on any file that still has print() statements in the code!
Explanation / Answer
import urllib.request
import codecs


def find_emails_in_website(url):
    stream = urllib.request.urlopen(url)
    emails = []
    recognized = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890.-"
    for linenum, line in enumerate(stream):
        decoded = transform(line, linenum)
        # group[0] collects the part before "@", group[1] the part after it.
        group = ["", ""]
        switch = False
        for index, char in enumerate(decoded):
            if char in recognized:
                if switch is False:
                    group[0] += char
                else:
                    group[1] += char
            elif char == "@":
                switch = True
            else:
                # Any other character ends the current candidate.
                if dcheck(group[1]) is True:
                    concat = group[0] + "@" + group[1]
                    if concat not in emails:
                        emails.append(concat)
                group = ["", ""]
                switch = False
            # A candidate that runs to the end of the line is checked here.
            if index == len(decoded) - 1 and group[1] != "":
                if dcheck(group[1]) is True:
                    concat = group[0] + "@" + group[1]
                    if concat not in emails:
                        emails.append(concat)
    return emails


# Parses string after @ sign. Must match TLD greater than 1 char,
# have only letters, and at least one "."
def dcheck(endgroup):
    if "." not in endgroup:
        return False
    tld = 0
    for char in endgroup[::-1]:
        if char.isalpha():
            tld += 1
        elif char == ".":
            break
        else:
            return False
    if tld < 2:
        return False
    else:
        return True


# Decodes line from stream. Replaces word with symbol in proper order,
# strips white space and right ".", special cases for _, reverse,
# and name (specific to positions on test pages).
def transform(line, index):
    # Some entries below (the "" strings and ". \n") are kept as they appear
    # in the posted answer; they may have lost markup when it was posted.
    replacements = [[" at ", "(at)", ". \n", " dot ", " (dot)", "(dot)", "NOSPAM", ""],
                    ["@", "@", " ", ".", ".", ".", "", ""]]
    decoded = line.decode("UTF-8")
    for i in range(len(replacements[0])):
        decoded = decoded.replace(replacements[0][i], replacements[1][i])
    decoded = decoded.strip().rstrip(".")
    if index == 31:  # Underscore
        decoded = decoded.replace("_", decoded[62])
    if index == 33:  # Reverse
        decoded = decoded[::-1]
    if index == 35:  # First, Last initial
        first = ""
        last = ""
        pos = 11  # Beginning index of first name (kept separate from the line index)
        for char in decoded[11:]:
            if char == " ":
                break
            else:
                first += char
                pos += 1
        last = decoded[pos + 1]
        decoded = decoded.replace("first name plus my last initial", first + last)
    if index == 37:  # Markdown
        decoded = decoded.replace("", "")
        startindex = len(decoded) - decoded[::-1].index(">")
        code = decoded[startindex:len(decoded)]
        decrypt = markdown(code)
        # The posted answer breaks off partway through the next line.
        delindex = decoded.index("