-
What is a data structure?
-
An organization of a data set
-
Several operations to be performed on the data set
-
The general problem of data structure design
-
Input: a data set and specifications of the operations to be supported
-
Design: Come up with an organization (structure) of the data elements,
and algorithms for the operations, so that the operations are as fast as
possible.
-
When do we need a data structure?
-
Review of Stacks and Queues
-
A stack is a data structure S that supports three operations:
-
The pushing (insertion) and popping (detetion) are done at one and the
same end of the stack, namely, its top.
-
One implementation is an array S[1:N] and a global index I pointing to
the top of the stack, assuming that S[i] is empty but S[i-1] is full
| Procedure |
Push (S,a) |
| begin |
|
S[I] := a; |
|
I := I+1; |
| end |
| function |
Pop(S) |
| begin |
|
I := I-1; |
|
return(S[I]); |
| end |
| function |
Top(S) |
| begin |
|
return(S[I-1]); |
| end |
-
A queue is a data structure Q that supports two operations:
-
enqueue(Q,a): it inserts a new element into the rear end (called tail)
of the Queue Q
-
deque(Q): deletes the front element from the head of the Queue.
-
One implementation is a circular array Q[0:N-1] and two global indexes
H and T, pointing to the head and tail elements of Q.
Q[H] is considered full, Q[T] is considered empty.
By circular we mean: the element before Q[0] is Q[N-1],
and the element after Q[N-1] is Q[0].
| Procedure |
enqueue(Q,a) |
| begin |
|
Q[T] := a; |
|
T = T - 1 modulo N; |
| end |
function dequeue(Q)
| begin |
|
H := H-1 modulo N; |
|
return (Q[H+1 modulo N]); |
| end |
-
Records: A record is a built-in structure data type. It is an aggregate
of several elements called fields, where each field is a variable of a
standard type or of a record type.
-
Syntax of a record:
| record |
name |
| begin |
|
field declarations; |
|
field declarations; |
|
... |
|
field declarations; |
| end |
Example:
| record |
employee |
| begin |
|
char name[1:30] |
|
integer SSN; |
|
char address[1:100] |
|
real salaray; |
| end |
Once a record has been defined, it becomes a new data type that can
be used in declaring record variables:
employee X;
employee Y[1:100];
| record |
company |
| begin |
|
char name[1:50] |
|
char address[1:100] |
|
employee Z[1:1000] |
| end |
to access an individual field F of a record R, use
R.F
| Examples: |
X.SSN := 123456789; |
|
X.salary := X.salary + 1000; |
-
Pointers data types
A pointer is a data type whose contents are addresses of other variables
of a specified type.
| Examples: |
charpointer p; |
intpointer q[1:n]; |
employeepointer r; |
|
charptr p; |
intptr q[1:n]; |
employeeptr r; |
to access the field F of a record pointed to by a record pointer r,
use
r ---> F
-
A self-referential record is a record that has a pointer field that points
to a record of the same type.
-
Example:
| record |
employee |
| begin |
|
char name[1:30] |
|
integer SSN; |
|
employeeptr next; |
| end |
-
Singly Linked Lists:
-
a singly linked list is a sequence of records, where every record has a
field that points to the next record
-
p pointer P that points to the first record is part of the linked list
structure
-
to access the i-th record, we have to "walk" from the first record to the
i-th record, one record at a time, using the next pointers.
-
to insert a record at a specified location, say after the i-th record,
we walk to the i-th record, then change the
values of the pointers so the i-th record points to the new record and
the new record points to the (i+1)st record.
-
deleting the i-the record is done in a similar fashion.
-
Doubly linked lists: these are like single linked lists except that each
record has a field that points to the previous record, and another field
that points to the next record.
-
Introduction to graphs
-
Definition: A graph G=(V,E) consists of a finite set V, whose elements
are called nodes, and a set E, which is a subset of V x V. The elements
of E are called edges.
If the directions of the edges are of significance, that is, (x,y) is
different from (y,x), the graph is called directed.
Otherwise, the graph is called undirected.
-
if (x,y) is an edge, then x is
said to be adjacent to y, and y is adjacent
from x.
-
in the case of undirected graphs, if (x,y) is an edge,
we just say that x and y are adjacent (or x is adjacent to y, or y is adjacent
to x). Also, we say that x is the neighbor of y.
-
The indegree of a node x is the number of
nodes adjacent to x
-
The outdegree of a node x is the number of
nodes adjacent from x
-
the degree of a node x in an undirected graph
is the number of neighbors of x
-
A path from a node x to a node y in a graph
is a sequence of node x, x1,x2,...,xn,y,
such that x is adjacent to x1, x1 is adjacent to
x2, ..., and xn is adjacent to y.
-
The length of a path is the number of its
edges.
-
a cycle is a path that begins and ends at
the same node
-
The distance from node x to node y is the
length of the shortest path from x to y.
-
an undirected graph is connected if there
is at least a path between every pair of nodes
-
Representations: Adjacency lists and adjacency matrices
Given a graph G=(V,E) of n nodes, G can be represented by a matrix
A[1:n,1:n], where
A[i,j]=1 if (i,j) is an edge;
A[i,j]=0 if (i,j) is not an edge.
The nodes are assumed to be conveniently labeled 1,2,...,n.
A is called an adjacency matrix.
G can also be represented by what's called adjacency lists: an array
P[1:n] of pointers, where P[i] points to the linked list of all the nodes
ajdacent from node i. Each one of those nodes is represented by a simple
record of two fields: the node label and the next pointer.
-
Trees
-
Definition: a tree is a connected acyclic graph (i.e., it has no cycles).
-
Rooted trees: A rooted tree is a tree where one node is designated as root.
By holding the tree at its root, and letting the other nodes descend from
it, we get a hierarchical structure. In that structure, every node has
a parent, and some nodes have one or more children.
-
Definitions:
-
a leaf is a node that has no children
-
the ancestors of a node x is the set of nodes on the path from x to the
root, including x and the root.
-
the proper ancestors of x are the ll the ancestors of x except x itself.
-
the descendents of x are the children of x, their children, and so on all
the way down.
-
the proper descendents of x are all the descendents of x except x itself.
-
the subtree rooted at x is the tree consisting of x and all its descendents.
-
the depth of x is the distance from the root to x.
-
the height of x is the distance from x to the farthest descendent of x.
-
the height (or depth) of the tree is the height of the root.
-
the nodes clearly partition into levels. The top level contains just the
root, and it is labeled level 0. The next level, labeled level 1, contains
the children of the root. The level after that, labeled level 2, conatins
the grandchildren of the root, and so on.
-
the depth of a node is the label of its level.
-
the height of the tree is the label of the lowest level.
-
Binary trees: rooted trees where every node has at most two children
-
Representation of binary trees: every node s represented by a record that
has at least three fields: the node label (or contents), a pointer to the
left child, and a pointer to the right child. The tree is pointed to by
the pointer to the root node.
-
Binary Search Trees: a BST is a binary tree T where
-
every node contains a value, called key (real, integer, character, etc.)
-
for every node x, all the nodes in the left subtree of x have keys that
are <= the key of x, and all the nodes in the right subtree of x have
keys that are > the key of x.
-
BST supports three operations: insert, delete, and search
In the three algorithms below, T is a pointer to the tree root, and a
a is a key.
The search function returns a pointer to the record
containing a, if any.
Otherwise, it returns nil.
| function |
search(T,a) |
| begin |
|
nodeptr p; |
|
p=T; |
|
while (p != nil |
and |
p --> key != a) |
do |
|
|
if |
a < p --> key |
then |
|
|
|
p := p --> left; |
|
|
else |
|
|
|
p := p --> right; |
|
|
endif |
|
endwhile |
|
return (p); |
| end search |
The insert procedure inserts the key a into T.
begin
nodeptr p;
p=T;
Bool done = false;
while not done
do
if key
then
if p --> left != nil
then
p = p --> left;
else
p --> left = new(node);
p --> left --> key := a;
done =true;
endif
else
if p ---> right != nil
then
p = p --> right;
else
p --> right = new(node);
p --> right -- > := a;
done = true;
endif
endif
endwhile
end insert
-
The procedure delete deletes the node containing key
a in the tree
-
p points to the node that will be found to contain a
-
q points to the parent of p
-
r points to a child of p
-
direction = 0 if p is a left child of q, 1 if p is
the right child of q
procedure delete(T,a)
begin
nodeptr p,q,r,s;
integer direction;
p = T;
while (p != nil and p-->key != a) do
if a < p-->key then
q := p;
p := p-->left;
direction := 0;
else
q := p;
p := p-->right;
direction := 1;
endif
endwhile
if p == nil then
return;
elseif p-->left == nil and p-->right == nil then
/* p has no children */
/* delete node pointed to by p and exit */
if direction == 0
q-->left = nil;
else
q-->right = nil
endif
free (p);
elseif p-->left == nil then
/* p has only one child, the right one */
if direction == 0 then
q-->left := p-->right;
/* shortcut from parent to grandchild */
else
q-->right := p-->right;
endif
elseif p-->right == nil then
/* p has only one child, the left one */
if direction == 0 then
q-->left := p-->left;
else
q-->right := p-->left;
endif
else /* p has two children*/
/* find the maximum node in the right */
/* subtree of p */
s := p-->left;
q := p;
/* now q will be the parent of s , and direction*/
/* will indicate the type of child s is to q*/
direction = 0;
while s-->right != nil do
q := s;
s := s-->right;
direction := 1;
endwhile
/*Now s points to the maximum node if the */
/*left subtree of p*/
p-->key := s-->key;
/* now node s must be deleted. But since s has */
/* no right child, the deletion is done by */
/* deletion or shortcutting */
if s-->left == nil then /* s is a leaf*/
if direction == 0 then
q-->left := nil;
else
q-->right := nil;
endif
free (s);
return;
else /* s has a left child */
if direction == 0 then
q-->left := s-->left;
else
q-->right := s-->left;
endif
free(s)
return;
endif
endif
end delete
-
the time complexity of insert, delete, and search is O(h) each, where h
is the height of the tree.
-
Full trees and almost complete trees:
-
A full binary tree is a binary tree where every non-leaf has two children
and all the leaves are at the same level.
-
Exercises: show that the number of nodes in level i of a full binary tree
is 2i. Show also that a full binary tree of height h has 2h+1
- 1 nodes.
-
the canonical labeling of a full binary tree is a labeling of the nodes
from top to bottom, left to right, starting with the root being labeled
1.
-
observe that the labels of the children of node i are 2i and 2i+1. The
label of the parent of i is [i/2] (integer division).
-
an almost complete binary tree of n nodes is the binary tree consisting
of the first n nodes of a full binary tree.
-
observe that if the bottom level of an almost complete binary tree is removed,
we are left with a full tree. Also, the nodes in the bottom level are packed
to the left end of the tree without any "holes"
-
an almost complete binary tree of n nodes can be implemented with an array
A[1:n] of n elements, where A[i] contains the contents of node of label
i.
-
Heaps (min-heaps and max-heaps)
-
Definition: a min-heap is an almost complete binary tree where every node
has a value smaller than the values of its children
-
Operations on a min-heap: delete-min and insert
function delete-min(H) /* H is the heap*/
begin
| |
x= root of H; |
|
a=key of x; /* to be returned at the end*/ |
|
remove a from node x; |
|
take the last node (node n), remove its key (call it b), and store
b in the root; |
|
remove node n; |
|
/* now we have to restor the heap*/ |
| |
while |
(x has a key bigger than one of its children) do |
|
|
swap x with the smaller child; |
|
|
make x point to that child; |
|
endwhile |
| /* note the while loop
will stop when x becomes smaller */ |
| /* than both its children, or when x becomes a leaf*/ |
| end delete-min |
Time of delete-min: O(h)=O(log n)
procedure insert(H,a)
/* inserts the key value a into the heap*/ begin
| |
create a node of label n+1, and insert a into that node; |
|
let x point to that node; |
|
while (x is smaller than its parent) do |
| |
swap the key of x with key of the parent of x; |
|
let x point to its parent; |
| |
/* this will loop until either x becomes larger than its parent, or
it becomes the root*/ |
end insert
Time of insert: O(h)=O(log n)
-
Note that with insert(), we can create a heap from scratch by n insertions
into an intially empty heap. This takes time =
O(log 1 + log 2 + ... + log n) = O(log(n!))= O(nlog n).
There is a procedure for constructing a heap in O(n)
time.
-
By starting with a heap, and calling delete-min n times, we sort the heap
in time =
O(logn + log(n-1) + ... + log 1) = O(nlog n).
-
Sets and the Union-Find Data Structure
-
General implementations of sets
-
Union-Find problem: Given n sets, where set i has a single element {i},
we need to perform O(n) operations of union (U) and find (F):
-
U(A,B) removes the sets A and B, and replaces them with a new set equal
to
A U B
-
F(x) is a function that returns the name of the set
containing x at the time F is called.
-
note that the sets are mutually disjoint at all times.
-
the problem is to design a data structure so that O(n) calls to U and F
take as little time as possible.
-
we will carry out the design by having two different representations of
the sets: one conceptual and one physical.
-
the conceptual representation of each set is a rooted tree (not necessarily
binary). The nodes are labeld with the elements of the corresponding set.
The label of the root serves the extra purpose of being the name of the
set.
-
the physical representation of all the trees (at once) is a single array
called PARENT[1:n].
-
if i is not a root, PARENT[i] is the parent of node i in whatever tree
i happens to be.
-
if i is a root, PARENT[i]= -(number of nodes in the
tree rooted at i).
-
three different implementations of the Union-Find
-
first implementation
U(i,j) is carried out by simply making i the parent of j, thus combining
the two trees into a single tree, and at once eliminating the old trees
as distinct sets.
F(x) returns the root of the tree containing x
by walking up that tree from x, using the PARENT array, until a node is
reached which has no parent. That is the root.
procedure U(i,j)
begin
| |
PARENT[i]=PARENTiI]+PARENT[j]; /* this computes the total number of
nodes in the new tree */ |
|
PARENT[j]=i; |
| end |
Time: O(1), that is, constant.
function F(x)
begin
| |
integer |
r=x; |
|
while |
(PARENT[r] > 0) do |
|
|
r = PARENT[r]; |
|
endwhile |
| |
/*now r is a root*/ |
|
return(r); |
end
Time: O(h) where h is the height of the tree containing x.
time for O(n) Us and Fs: up to O(n2). The proof is an exercise.
-
Second Implementation
in U(i,j), make the root of the heavier tree become the parent of the
other root. By heavier we mean "has more nodes".
F(x) is not modified here.
procedure U(i,j)
begin
| |
if |PARENT[i]| >= |PARENT[j]|
then |
| |
PARENT[i]=PARENTiI]+PARENT[j]; |
|
PARENT[j]=i; |
| |
PARENT[j]=PARENTiI]+PARENT[j]; |
|
PARENT[i]=j; |
end
time: O(1)
Time of an O(n) sequence of Us and Fs: O(nlog n).
This is because every tree is of height <= log(its number of nodes).
The proof for that is by induction on the number of Us performed.
Therefore, every tree is of height <= O(log n).
Hence, every F(x) takes O(log n) time.
As a result, O(n) Us takes O(n), and O(n) Fs take O(nlog n).
The total time for O(n) Us and Fs is then O(nlog
n).
-
Third implementation
U will remain the same as in the second implementation.
F will be changed using a technique called path compression. In particular,
F(x) first finds the root. Afterwards, in a second walk from x to the root,
every node on that path is made a direct child of the root.
function F(x)
begin
| |
integer r,s,t; |
|
r=x; |
|
while PARENT[r] > 0 |
| |
endwhile |
|
/*now r is a root*/ |
|
|
s=x; |
|
while s != r do |
| |
t := s; |
|
s := PARENT[s]; |
|
PARENT[t] := r; |
end
Time of F: O(h) where h is the height of the tree containing x.
Time of O(n) sequence of Us and Fs: O(nG(n)), where
G(n) <= 5 for all n <= 265000. That is, for all practical
values of n, the time is O(n).