
Question

Algorithm Tokenize(string s):
    list(string) tokens := []
    while not isEmpty(s):
        if s begins with a token, remove the longest possible token from the beginning of s and push that token onto the back of tokens
        if head(s) is a whitespace character, pop(s)
    return tokens

For example, below is a trace of the call Tokenize("hello, world 3+1"):

# it   tokens                                     s
0      []                                         "hello, world 3+1"
1      ["hello"]                                  ", world 3+1"
2      ["hello", ","]                             " world 3+1"
3      ["hello", ","]                             "world 3+1"
4      ["hello", ",", "world"]                    " 3+1"
5      ["hello", ",", "world"]                    "3+1"
6      ["hello", ",", "world", "3"]               "+1"
7      ["hello", ",", "world", "3", "+"]          "1"
8      ["hello", ",", "world", "3", "+", "1"]     ""

Homework:

Implement Tokenize in Python. That is, write a Python function called tokenize such that if s is a string for which Tokenize(s) is defined, then Tokenize(s) == tokenize(s).

Test cases:

Note the following expected results:

    tokenize("hello world")        =>  ["hello", "world"]
    tokenize("hello'world'foo")    =>  ["hello", "'world'", "foo"]
    tokenize("'hello\'world'")     =>  ["'hello\'world'"]
    tokenize("'hello world'")      =>  ["'hello world'"]
    tokenize("3.33(33..")          =>  not in test suite
    tokenize("\")                  =>  ["\"]
    tokenize(" ")                  =>  []
    tokenize("'a'e'c'")            =>  ["'a'", "e", "'c'"]
    tokenize("3.3+1")              =>  ["3.3", "+", "1"]
    tokenize("''")                 =>  ["''"]

For example, you could enter into repl.it or IDLE:

   >>> tokenize("'hello\'world'") == ["'hello\'world'"]
   True


Send your tokenizer as an attached .py file

EXAMPLE CODE

The Python program below implements Tokenize. In this exercise, we assume the input is correct (that is, it consists of tokens, whitespace, and comments). This is an oversimplification of the problem of tokenizing real-world text, but it is still not easy.

# whiteChar(c) iff c is a whitespace character.
def whiteChar(c):
    return c in " \t\r\n"

# tokenize(s) = Tokenize(s), whenever Tokenize(s) is defined.
def tokenize(s):
    i = 0
    tokens = []
    # invariants: 0 <= i <= len(s), tokens = Tokenize(s[:i])
    while i < len(s):
        if whiteChar(s[i]):
            i = i + 1
        elif i < len(s) - 1 and s[i:i+2] == "//":
            # skip the comment until the next return or newline character
            i = i + 2
            while i < len(s) and s[i] not in "\r\n":
                i = i + 1
        else:
            # process the longest possible token
            tok = munch(s, i)
            tokens.append(tok)
            i = i + len(tok)
    return tokens

# If 0 <= i < len(s) and s[i:] begins with a token, then munch(s, i)
# is the longest token that is a prefix of s[i:].
def munch(s, i):
    A, j = 'em', i
    # invariants: i <= j <= len(s), A is the state that describes s[i:j]
    while True:
        if j == len(s):
            break  # end of string
        A = newState(A, s[j])
        # A is now the state that *would* result if we processed
        # one more character.
        if A == 'err':
            break
        # A is not 'err', so we are good with one more character
        j = j + 1
    return s[i:j]

'''
A *state* is a string. States *describe* strings as given below:

1. 'em'  describes str iff str is empty
2. 'id'  describes str iff str is an identifier
3. 'num' describes str iff str is a numeral
4. 'err' describes str iff str is not a prefix of any token
'''

# If state A describes a string str and c is a character, then newState(A, c)
# describes the string str + c.
def newState(A, c):
    if A == 'em':
        if c.isalpha():
            return 'id'
        elif c.isdigit():
            return 'num'
    elif A == 'id':
        if c.isalpha() or c.isdigit():
            return 'id'
    elif A == 'num':
        if c.isdigit():
            return 'num'
    return 'err'

# END PROGRAM
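
As a quick sanity check of the state machine (a sketch; run it after the definitions above, and remember that only identifiers, numerals, whitespace, and // comments are supported):

print(munch("hello, world 3+1", 0))   # "hello"  ('em' -> 'id', stops at ',')
print(munch("3+1", 0))                # "3"      ('em' -> 'num', stops at '+')
print(tokenize("duh 8 hi3"))          # ['duh', '8', 'hi3']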

Below is a transcript of a test session in repl.it with the program above:

Python 3.5.2 (default, Dec 2015, 13:05:11)
[GCC 4.8.2] on linux
  
  tokenize("hello world")
=> ['hello', 'world']   

tokenize("hello //comment world//")
=> ['hello', 'world']   

tokenize("hello you little 8// com")
=> ['hello', 'you', 'little', '8']   

tokenize("duh 8 // comment hi3")
=> ['duh', '8', 'hi3']


Explanation / Answer

This is the thing you are looking for. You could add input-to-output conditions as per your preference. Note that it uses Python's standard tokenize module, so its token set is not the same as the Tokenize algorithm above, and untokenize returns a re-assembled string rather than a list of tokens.

from io import StringIO
from tokenize import generate_tokens, untokenize, NUMBER, STRING

def tokenize(s):
    tokenizedStr = []
    # generate_tokens expects a readline callable, not the raw string
    tokens = generate_tokens(StringIO(s).readline)
    for tokennum, tokenvalue, _, _, _ in tokens:
        if tokennum == NUMBER and '.' in tokenvalue:
            # example condition: re-emit decimal numerals as string literals
            tokenizedStr.append((STRING, repr(tokenvalue)))
        else:
            tokenizedStr.append((tokennum, tokenvalue))
    return untokenize(tokenizedStr)
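
If you want to see what generate_tokens itself yields, a small sketch like the one below prints each token's type name and text; the exact bookkeeping tokens (NEWLINE, ENDMARKER, and so on) depend on your Python version.

from io import StringIO
from tokenize import generate_tokens
from token import tok_name

# Print each token's type name and text for a small input.
for tok in generate_tokens(StringIO("3.3+1").readline):
    print(tok_name[tok.type], repr(tok.string))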
