Data compression is a way of finding economical ways to encode information, for
ID: 3913189 • Letter: D
Question
Data compression is a way of finding economical ways to encode information, for storing files using limited disk space, or for transferring files over a network. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumplestiltskin occurs often, then there will be a very short code for that word. But a simpler kind of data compression works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression, using an algorithm called Huffman coding.
A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.
'a'
01100001
'b'
01100010
'c'
01100011
In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.
'a'
0000000001100001
'b'
0000000001100010
'c'
0000000001100011
Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.
'a'
10
'b'
0
'c'
11
This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.
Requirements
For this assignment you will write two programs. The first program will be given two file names, which I will refer to as A and B. It should do the following.
Count how many occurrences of each character are in file A.
Show the character frequencies on the standard output.
Construct a Huffman code for the characters that occur in file A (ignoring characters that do not occur at all).
Write the code on the standard output.
Reread file A and, using the constructed Huffman code, write an encoded version of file A into file B.
The second program will also be given two file names, A and B. It reads file A, which must be a file that was written by your first program, and it writes file B, which should be the decoded version of A. The decoded file should look identical to the file that you encoded.
The encoder will be called huffman and the decoder will be called unhuffman.
The standard output
If the input String being encoded contains only characters a, b, c, d and e, the program might show the following information on the standard output, in addition to encoding and decoding files.
The character frequencies are:
a 500
b 230
c 300
d 250
e 700
A Huffman code is as follows.
a 10
b 010
c 00
d 011
e 11
Algorithmic Issues
There is an algorithm that is due to Huffman for finding efficient prefix codes; they are called Huffman codes. The input to the algorithm is a String of characters with their frequencies.
Trees and codes
A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each nonleaf has two children, one labeled the '0' child, the other the '1' child. Each leaf contains a character. An example is the following tree.
.
/
0/
/
/
. .
/ /
0/ 0/
/ /
c b a d
A tree defines a code. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see. For example, in the tree above, the code for 'b' is 01, and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.
.
/
0/
/
b .
/
0/
/
a c
where 'b' has code 0 and 'a' has code 10.
A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.
Huffman's algorithm
The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the character frequency. For example, if the frequency input is
a 500
b 230
c 300
d 250
e 700
then the algorithm creates five leaves, one holding a, with frequency of 500, one holding b with a frequency of 230, etc.
(Note. A frequency is associated with an entire tree, not with each node of a tree. Initially, each tree has just one node, but as the algorithm runs the trees will grow. You still have just one frequency per tree.)
Now the algorithm repeatedly performs the following, as long as there are still at least two trees left in the collection of trees.
Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T. Suppose that tree S has frequency FS and tree T has frequency FT.
Build a tree that looks as follows.
.
/
0/
/
S T
Say that this tree has frequency FS + FT, the sum of the frequencies of S and T.
Add the new tree, with its new frequency, to the collection of trees.
The number of trees decreases each time because you remove two trees and put back just one tree. When there is only one tree left, the algorithm is done. That tree is the result.
A notation for trees
It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. The notation groups the two subrees of a node in parentheses. So a pair of parentheses stands for a nonleaf node in the tree. For example, tree
.
/
0/
/
e c
is written (e c). Tree
.
/
0/
/
b .
/
0/
/
a c
is written (b (a c)). Tree
.
/
0/
/
/
. .
/ /
0/ 0/
/ /
c b a d
is written ((c b) (a d)). We use this notation to demonstrate the algorithm.
'a'
01100001
'b'
01100010
'c'
01100011
Explanation / Answer
package huffmancompression;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.PriorityQueue;
import java.util.Scanner;
import java.util.TreeMap;
// Defines a class for encoding and decoding
public class HuffmanCompression
{
// To store file reading status
static final boolean readFromFile = true;
// To store new text status
static final boolean newTextBasedOnOldOne = false;
// Creates a priority quee node class object
static PriorityQueue<Node> nodes = new PriorityQueue<>((o1, o2) -> (o1.value < o2.value) ? -1 : 1);
// Creates a Tree map class object
static TreeMap<Character, String> codes = new TreeMap<>();
// To store original text
static String data = "";
// To store encoded text
static String encodedData = "";
// To store decoded text
static String decodedData = "";
// To store ASCII values
static int ASCII[] = new int[128];
// main method definition
public static void main(String[] args) throws FileNotFoundException
{
// Opens the file for reading
Scanner scanner = new Scanner(new File("Source.txt"));
// Sets the decision to 1
int decision = 1;
// Loops till decision is not -1
while (decision != -1)
{
// Calls the method to execute the decision
if (executeDecision(scanner, decision))
continue;
// Calls the method to read file contents
decision = readFile(scanner);
}// End of while loop
}// End of method
// Method to read file contents
private static int readFile(Scanner scanner)
{
int data;
// Reads the data from file and converts it into integer
data = Integer.parseInt(scanner.nextLine());
// Checks the reading file status
if (readFromFile)
System.out.println("Decision: " + data);
return data;
}// End of method
// Method to executes the decision type
private static boolean executeDecision(Scanner scanner, int decision)
{
// Checks if the decision is one read the new data from file
if (decision == 1)
{
// Calls the method to deal with new data
if (readNewData(scanner))
return true;
}// End of if condition
// Otherwise checks if decision is two encode the data
else if (decision == 2)
{
// Calls he method to encode data
if (encodingData(scanner))
return true;
}// End of else if condition
// Otherwise checks if decision is three decode the data
else if (decision == 3)
{
// Calls the method to decode the data
decodingData(scanner);
}// End of else if condition
return false;
}// End of method
// Method to read data from file
private static boolean readNewData(Scanner scanner)
{
// Stores the length
int oldTextLength = data.length();
// Reads the data
data = scanner.nextLine();
// Checks for the text validity
if (newTextBasedOnOldOne && (oldTextLength != 0 && !checkSameCharacter()))
{
System.out.println("Not Valid input");
data = "";
return true;
}// End of if condition
// Reset the variables and objects to null
ASCII = new int[128];
nodes.clear();
codes.clear();
encodedData = "";
decodedData = "";
// Displays the data
System.out.println("Data = " + data);
// Calls the method to calculate character Interval
calculateCharInterval(nodes, true);
// Calls the method create tree
createTree(nodes);
// Calls the method to create code
createCodes(nodes.peek(), "");
// Calls the method to display codes
displayCodes();
System.out.println("-- Encoding/Decoding --");
// Calls the method to encode data
encodeData();
// Calls the method to decode data
decodeData();
return false;
}// End of method
// Method to encoding data
private static boolean encodingData(Scanner scanner)
{
// Reads the dta from file
data = scanner.nextLine();
// Displays the data
System.out.println("Data to Encode: " + data);
// Checks for same character
if (!checkSameCharacter())
{
System.out.println("ERROR: Not Valid input");
data = "";
return true;
}// End of if condition
// Calls the method encode data
encodeData();
return false;
}// End of method
// Method to decoding data
private static void decodingData(Scanner scanner)
{
// Reads the data from file
encodedData = scanner.nextLine();
System.out.println("Data to Decode: " + encodedData);
// Calls the method to decode data
decodeData();
}// End of method
// Method to return same character status
private static boolean checkSameCharacter()
{
boolean foundStatus = true;
// Loops till length of the data
for (int i = 0; i < data.length(); i++)
// Checks the current character availability of the character in the ASCII array
if (ASCII[data.charAt(i)] == 0)
{
foundStatus = false;
break;
}// End of if condition
// Returns the flag status
return foundStatus;
}// End of method
// Method to decode the data
private static void decodeData()
{
decodedData = "";
// Peeps a node
Node node = nodes.peek();
// Loops till end of the encoded data
for (int counter = 0; counter < encodedData.length(); )
{
// Creates a temporary node
Node tmpNode = node;
// Checks for leaf node
while (tmpNode.leftNode != null && tmpNode.rightNode != null && counter < encodedData.length())
{
// Checks if the character is '1'
if (encodedData.charAt(counter) == '1')
// Extracts the right node
tmpNode = tmpNode.rightNode;
// Otherwise extracts right node
else
tmpNode = tmpNode.leftNode;
// Increase the loop counter by one
counter++;
}// End of while loop
// Checks if the node is NULL
if (tmpNode != null)
// Checks if the length is one
if (tmpNode.character.length() == 1)
// Concatenates the cahracter with decode data
decodedData += tmpNode.character;
// Otherwise display error message for invalid data
else
System.out.println("ERROR: Not Valid Data");
}// End of for loop
// Displays the decoded data
System.out.println("Decoded Data: " + decodedData);
}// End of method
// Method to encode data
private static void encodeData()
{
encodedData = "";
// Loops till length of the data
for (int counter = 0; counter < data.length(); counter++)
// Concatenates the data with encoded data
encodedData += codes.get(data.charAt(counter));
System.out.println("Encoded Data: " + encodedData);
}// End of method
// Method to create a tree
private static void createTree(PriorityQueue<Node> vector)
{
// Loops till size is greater than one
while (vector.size() > 1)
// Adds the node
vector.add(new Node(vector.poll(), vector.poll()));
}// End of method
// Method to display the code
private static void displayCodes()
{
System.out.println("--- Displaying Codes ---");
codes.forEach((k, v) -> System.out.println("'" + k + "' : " + v));
}// End of method
// Method to claculate character Interval
private static void calculateCharInterval(PriorityQueue<Node> vector, boolean intervalStatus)
{
// Checks the interval status
if (intervalStatus)
System.out.println("*********** Character Intervals ***********");
// Loops till end of the data
for (int counter = 0; counter < data.length(); counter++)
// Stores the character frequency
ASCII[data.charAt(counter)]++;
// Loops till length of the ASCII array
for (int counter = 0; counter < ASCII.length; counter++)
// Checks if the current counter value of the ASCII array is grater than zero
if (ASCII[counter] > 0)
{
vector.add(new Node(ASCII[counter] / (data.length() * 1.0), ((char) counter) + ""));
// Checks the interval status
if (intervalStatus)
// Displays the character and it Interval
System.out.println("'" + ((char) counter) + "' : " + ASCII[counter] / (data.length() * 1.0));
}// End of if condition
}// End of method
// Recursive method to create the code
private static void createCodes(Node node, String s)
{
// Checks if node is not null
if (node != null)
{
// Checks if right node is not null
if (node.rightNode != null)
// Recursively call the method for right child
createCodes(node.rightNode, s + "1");
// Checks if left node is not null
if (node.leftNode != null)
// Recursively call the method for left child
createCodes(node.leftNode, s + "0");
// Checks for leaf node
if (node.leftNode == null && node.rightNode == null)
codes.put(node.character.charAt(0), s);
}// End of if condition
}// End of method
}// End of class
// Class to generate node
class Node
{
// Instance variable to store left and right child node
Node leftNode, rightNode;
// To store number
double value;
// To store the character
String character;
// Parameterized constructor definition
public Node(double value, String character)
{
this.value = value;
this.character = character;
leftNode = null;
rightNode = null;
}// End of parameterized constructor
// Overrides parameterized constructor with left and right child node
public Node(Node left, Node right)
{
this.value = left.value + right.value;
character = left.character + right.character;
// Checks if left chaild value is less than right child value
if (left.value < right.value)
{
this.rightNode = right;
this.leftNode = left;
}// End of if condition
// Otherwise
else
{
this.rightNode = left;
this.leftNode = right;
}// End of else
}// End of method
}// End of class Node
Sample Output:
run:
Data = Ahmed Kamel Taha
*********** Character Intervals ***********
' ' : 0.125
'A' : 0.0625
'K' : 0.0625
'T' : 0.0625
'a' : 0.1875
'd' : 0.0625
'e' : 0.125
'h' : 0.125
'l' : 0.0625
'm' : 0.125
--- Displaying Codes ---
' ' : 101
'A' : 0101
'K' : 0100
'T' : 1000
'a' : 111
'd' : 1001
'e' : 001
'h' : 110
'l' : 000
'm' : 011
-- Encoding/Decoding --
Encoded Data: 0101110011001100110101001110110010001011000111110111
Decoded Data: Ahmed Kamel Taha
Decision: 2
Data to Encode: Rahim
ERROR: Not Valid input
Data to Encode: 3
ERROR: Not Valid input
Data to Encode: 0101110011001100110101001110110010001011000111110111
ERROR: Not Valid input
Data to Encode: 3
ERROR: Not Valid input
Data to Encode: 0100111011001000
ERROR: Not Valid input
Data to Encode: 1
ERROR: Not Valid input
Data to Encode: Taha
Encoded Data: 1000111110111
Decision: -1
Related Questions
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.